google/gemma-3n-E4B-it · Non-text output?

Hi @sidhusmart ,

Welcome to the Gemma family of open source models, and thanks for reaching out to us. The Gemma 3n models are multimodal on the input side: they can take text, image, and audio inputs, and they produce their results as text for the given input, based on the user's query or prompt (for example, analyzing an image or answering a question about it).

The input for the 3n models:
Text string, such as a question, a prompt, or a document to be summarized
Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
Audio data encoded to 6.25 tokens per second from a single channel
Total input context of 32K tokens

The output from the 3n models:
Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
Total output length of up to 32K tokens, minus the tokens used by the request input
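
To make the token figures above concrete, here is a rough sketch of how the input budget and the remaining output budget could be estimated. It is only back-of-the-envelope arithmetic based on the numbers listed above, not part of any Gemma API: the function names are hypothetical, "32K" is assumed to mean 32 * 1024 tokens, audio tokens are assumed to round up, and the text token count is something you would get from the tokenizer yourself.

```python
import math

# Figures taken from the lists above.
IMAGE_TOKENS = 256               # each image is encoded to 256 tokens
AUDIO_TOKENS_PER_SECOND = 6.25   # single-channel audio
CONTEXT_TOKENS = 32 * 1024       # assumption: "32K" means 32768 tokens


def estimate_input_tokens(text_tokens, num_images=0, audio_seconds=0.0):
    """Estimate the input tokens used by a request (hypothetical helper)."""
    # Assumption: partial audio tokens are rounded up.
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return text_tokens + num_images * IMAGE_TOKENS + audio_tokens


def remaining_output_tokens(input_tokens):
    """Tokens left for generated text once the input is subtracted."""
    return max(0, CONTEXT_TOKENS - input_tokens)


# Example: a 500-token prompt, two images, and 30 seconds of audio.
used = estimate_input_tokens(text_tokens=500, num_images=2, audio_seconds=30)
print(used, remaining_output_tokens(used))
```

The real counts depend on the tokenizer and the processor, so treat these numbers as an estimate only.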

Please feel free to reach out to me for any additional assistance.

Thanks.