Non-text output?
Hi team, I noticed in the model description that the output tokens are text only. However, in the Gemma3n video here there is a live demo where Gemma3n behaves like a voice assistant with voice responses.
Is that an additional module that does text to speech (separate from the model) or does the model natively support audio output tokens?
Hi @sidhusmart,
Welcome to the Gemma family of open models, and thanks for reaching out. The Gemma 3n models are multimodal: they accept text, image, and audio as input
and produce their result as text, based on the user's query or prompt (for example, analyzing an image or answering a question about it). As the model description states, the output is text only.
The input for the 3n models:
Text string, such as a question, a prompt, or a document to be summarized
Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
Audio data encoded to 6.25 tokens per second from a single channel
Total input context of 32K tokens
The output from the 3n models:
Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
Total output length up to 32K tokens, subtracting the request input tokens
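To make the limits above concrete, here is a rough Python sketch of how the input token budget adds up and how much room is left for text output. It is an illustration only, not part of any Gemma library: the function names are hypothetical, "32K" is assumed to mean 32 * 1024 tokens, partial audio tokens are assumed to round up, and the text token count is assumed to come from your tokenizer.

```python
import math

# Figures taken from the input/output description above.
CONTEXT_TOKENS = 32 * 1024        # assumption: "32K" read as 32,768 tokens
TOKENS_PER_IMAGE = 256            # each image is encoded to 256 tokens
AUDIO_TOKENS_PER_SECOND = 6.25    # single-channel audio


def input_tokens(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate the total input tokens for one request (hypothetical helper)."""
    # Assumption: a fractional audio token is rounded up to a whole token.
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return text_tokens + num_images * TOKENS_PER_IMAGE + audio_tokens


def remaining_output_budget(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate how many text output tokens remain after the input is counted."""
    used = input_tokens(text_tokens, num_images, audio_seconds)
    return max(0, CONTEXT_TOKENS - used)


if __name__ == "__main__":
    # 100 prompt tokens, 2 images, and 30 seconds of audio.
    print(input_tokens(100, num_images=2, audio_seconds=30.0))
    print(remaining_output_budget(100, num_images=2, audio_seconds=30.0))
```

Note that everything in this budget is counted in tokens going in, and the only thing coming out is generated text.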
Please feel free to reach out to me for any additional assistance.
Thanks.