Time to first token for GPU is wrong

#4
by yuimo - opened

Is the Time to first token data wrong for GPU inference?
I think it should be 1024 / 620 = 1.65 s.

Hello

Hi @yuimo ,

Apologies for the late reply. Your calculation of 1024 / 620 = 1.65 s divides the number of pre-fill tokens (1024) by the GPU pre-fill throughput (620 tokens/sec). That gives the time to pre-fill the 1024 tokens, but it does not represent the Time to first token as reported in the benchmarks. Time to first token often includes various overheads beyond just the pre-fill processing.
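As a rough illustration of the arithmetic, the sketch below separates the pure pre-fill estimate from whatever overhead a measured Time to first token would imply. The function names and the `reported_ttft_s` value are hypothetical placeholders for illustration, not figures from the benchmark; only 1024 and 620 come from this thread.

```python
def prefill_time_s(prefill_tokens: int, prefill_tokens_per_s: float) -> float:
    """Lower-bound estimate: time spent only on pre-filling the prompt."""
    return prefill_tokens / prefill_tokens_per_s


def implied_overhead_s(reported_ttft_s: float,
                       prefill_tokens: int,
                       prefill_tokens_per_s: float) -> float:
    """Portion of a measured TTFT that is not explained by pre-fill alone."""
    return reported_ttft_s - prefill_time_s(prefill_tokens, prefill_tokens_per_s)


# Numbers quoted in this thread: 1024 pre-fill tokens at 620 tokens/sec.
estimate = prefill_time_s(1024, 620)
print(f"pre-fill only: {estimate:.2f} s")  # pre-fill only: 1.65 s

# Hypothetical measured TTFT, used only to show how the gap would be computed.
reported_ttft_s = 2.0  # assumption, NOT the benchmark's value
overhead = implied_overhead_s(reported_ttft_s, 1024, 620)
print(f"implied overhead: {overhead:.2f} s")
```

Under these assumptions, the 1.65 s figure is best read as the pre-fill component of Time to first token rather than the full measurement.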

Thanks.
