Zenithwang committed · Commit 7376455 · verified · 1 Parent(s): b85635b

Update README.md

Files changed (1): README.md (+419 -3)
README.md CHANGED
---
license: apache-2.0
library_name: transformers
---
<div align="center">
  <picture>
    <img src="stepfun-logo.png" width="30%" alt="StepFun: Cost-Effective Multimodal Intelligence">
  </picture>
</div>

<hr>

<div align="center" style="line-height: 1;">
  <a href="https://stepfun.com/" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/Chat-StepFun-ff6b6b?color=1783ff&logoColor=white"/></a>
  <a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://github.com/stepfun-ai/Step3" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/🤖Github-StepFun-ffc107?color=ffc107&logoColor=white"/></a>
  <a href="https://www.modelscope.cn/models/stepfun-ai/step3" target="_blank"><img alt="ModelScope" src="https://img.shields.io/badge/🤖ModelScope-StepFun-ffc107?color=7963eb&logoColor=white"/></a>
  <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
  <a href="LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue?&color=blue"/></a>
</div>

<div align="center">
  <b>📰&nbsp;&nbsp;<a href="https://stepfun.ai/research/step3">Step3 Model Blog</a></b> &nbsp;&nbsp;&nbsp; | &nbsp;&nbsp;&nbsp; <b>📄&nbsp;&nbsp;<a href="https://arxiv.org/abs/2507.19427">Step3 System Blog</a></b>
</div>

## Introduction

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B activated per token.
It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision-language reasoning.
Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD),
Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

### Step3 Model Card

| Config | Value |
|------------------------|---------|
| **Number of Layers (Dense layers included)** | 61 |
| **Number of Dense Layers** | 5 |
| **Hidden Dimension** | 7168 |
| **Attention Mechanism** | MFA |
| **Low-rank Query Dimension** | 2048 |
| **Number of Query Heads** | 64 |
| **Head Dimension** | 256 |
| **Number of Experts** | 48 |
| **Selected Experts per Token** | 3 |
| **Number of Shared Experts** | 1 |
| **Max Context Length** | 65536 |
| **Tokenizer** | DeepSeek V3 |
| **Total Parameters (LLM)** | 316B |
| **Activated Params per Token** | 38B |
| **Total Parameters (VLM)** | 321B |

## Evaluation Results

| Category | Model | Total Params. | MMMU | MathVision | ZeroBench (sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLM | Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| | ERNIE4.5-thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| | GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| | MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| | QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| | LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| Open-Source LLM | MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| | Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| | DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| | Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| Proprietary VLM | O3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| | Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| | Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| | Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| | Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| | Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |

Note: Parts of the evaluation results are reproduced using the same settings.
†: Evaluation results of Gemini 2.5 Flash (thinking) may be lower than the model's real performance, especially on MathVision, due to insufficient instruction-following ability.

## Deployment

> You can access Step3's API at https://platform.stepfun.com/, where we provide an OpenAI/Anthropic-compatible API.

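If you prefer to call the hosted endpoint programmatically, the snippet below is a minimal sketch using the official `openai` Python client against an OpenAI-compatible chat endpoint. The base URL and model name are placeholders for illustration; check the platform documentation for the exact values.

```python
# Minimal sketch of calling an OpenAI-compatible chat endpoint.
# NOTE: base_url and model are illustrative assumptions; confirm both on
# https://platform.stepfun.com/ before use.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",           # key issued by the StepFun platform
    base_url="https://api.stepfun.com/v1",    # assumed OpenAI-compatible base URL
)

response = client.chat.completions.create(
    model="step3",  # placeholder model id; check the platform's model list
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}},
                {"type": "text", "text": "What's in this picture?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```
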
### Inference with Hugging Face Transformers

This section shows how to run inference with the Hugging Face Transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.54.0 as the development environment. We currently support bf16 inference only, and multi-patch image processing is enabled by default; this behavior is aligned with vLLM and SGLang.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint keys to the module layout expected by Transformers.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}

model_path = "stepfun-ai/step3"

# Load the processor and the bf16 model, sharding it across available devices.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

# A single-turn multimodal message: one image plus a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; only the newly generated tokens are decoded.
generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(decoded)
```

382
+
383
+
384
+ ### Inference with vLLM and SGLang
385
+
386
+
387
+ Our model checkpoints are stored in bf16 and block-fp8 format, you can find it on [Huggingface](https://huggingface.co/stepfun-ai/step3).
388
+
389
+ Currently, it is recommended to run Step3 on the following inference engines:
390
+
391
+ * vLLM
392
+ * SGLang
393
+
394
+ Deployment and Request examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
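Once a server is launched following the guide, both vLLM and SGLang expose an OpenAI-compatible HTTP endpoint, so a local request looks much like the hosted-API example above. The port and served model name below are assumptions; adjust them to match your launch command.

```python
# Minimal sketch for querying a locally deployed Step3 server (vLLM or SGLang).
# The port and model name are assumptions; they must match the server launch flags.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="stepfun-ai/step3",  # served model name assumed here
    messages=[{"role": "user", "content": "Briefly introduce Step3."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
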
## Contact Us
If you have any questions, please reach out at [contact@stepfun.com](mailto:contact@stepfun.com).

## License
Both the code repository and the model weights are released under the [Apache License (Version 2.0)](./LICENSE).

## Citation
```
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427},
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence},
      author={StepFun Team},
      url={https://stepfun.ai/research/step3},
}
```