---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
<div align="center">
  <picture>
      <img src="https://github.com/stepfun-ai/Step3/blob/main/figures/stepfun-logo.png?raw=true" width="30%" alt="StepFun: Cost-Effective Multimodal Intelligence">
  </picture>
</div>

<hr>

<div align="center" style="line-height:1">
  <a href="https://stepfun.com/" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/Chat-StepFun-ff6b6b?color=1783ff&logoColor=white"/></a>
  <a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://github.com/stepfun-ai/Step3" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a>
  <a href="https://www.modelscope.cn/models/stepfun-ai/step3" target="_blank"><img alt="ModelScope" src="https://img.shields.io/badge/🤖ModelScope-StepFun-ffc107?color=7963eb&logoColor=white"/></a>
  <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a>
</div>

<div align="center" style="line-height: 1;">
<a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
  <a href="https://huggingface.co/stepfun-ai/step3/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue?&color=blue"/></a>
</div>

<div align="center">
<b>📰&nbsp;&nbsp;<a href="https://stepfun.ai/research/step3">Step3 Model Blog</a></b> &nbsp;&nbsp;&nbsp; | &nbsp;&nbsp;&nbsp; <b>📄&nbsp;&nbsp;<a href="https://arxiv.org/abs/2507.19427">Step3 System Blog</a></b>
</div>

## Introduction

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B activated per token.
It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning.
Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD),
Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

### Step3 Model Card

|          Config        |  Value  |
|------------------------|---------|
| **Number of Layers (Dense layer included)**|61|
|**Number of Dense Layers**| 5|
| **Hidden Dimension**       | 7168    |
| **Attention Mechanism**    | MFA     |
| **Low-rank Query Dimension** | 2048  |
| **Number of Query Heads**          | 64      |
| **Head Dimension**        | 256     |
|**Number of Experts** |48|
|**Selected Experts per Token**|3|
|**Number of Shared Experts**| 1|
| **Max Context Length** | 65536 |
| **Tokenizer** | DeepSeek-V3 |
| **Total Parameters (LLM)** | 316B |
| **Activated Params per Token** | 38B |
| **Total Parameters (VLM)** | 321B |
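
Most of these values can also be inspected programmatically from the checkpoint's configuration. Below is a minimal sketch; note that the exact field names in this custom architecture's `config.json` are not guaranteed to follow standard Transformers conventions, so printing the whole config is the safest way to inspect it:

```python
from transformers import AutoConfig

# Downloads and parses the checkpoint's config.json; trust_remote_code is
# required because Step3 ships a custom architecture definition.
config = AutoConfig.from_pretrained("stepfun-ai/step3", trust_remote_code=True)

# Dump all fields and compare against the model card table above.
print(config)
```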


## Evaluation Results
![Step3 benchmark results](figures/step3_bmk.jpeg)

## Deployment

> [!NOTE]
> Step3's API is available at https://platform.stepfun.com/, where we offer an OpenAI-compatible API.
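
As a quick illustration, an OpenAI-compatible endpoint can be called with the official `openai` Python client. This is a minimal sketch; the base URL and model name below are assumptions to verify against the platform documentation:

```python
from openai import OpenAI

# Base URL and model name are assumptions; check https://platform.stepfun.com/
# for the exact values and for how to obtain an API key.
client = OpenAI(api_key="YOUR_STEPFUN_API_KEY", base_url="https://api.stepfun.com/v1")

response = client.chat.completions.create(
    model="step3",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)
```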

### Inference with Hugging Face Transformers

This section shows how to run inference with the Hugging Face Transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.54.0 as the development environment. We currently support bf16 inference only, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang.


```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint weight names onto the Transformers module layout
# (vision tower, language model, and the ViT-to-LLM projector layers).
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}

model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",      # shard the model across all visible accelerators
    torch_dtype="auto",     # load in the checkpoint's native bf16
    trust_remote_code=True,
    key_mapping=key_mapping,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(decoded)
```


### Inference with vLLM and SGLang


Our model checkpoints are available in both bf16 and block-fp8 formats; you can find them on [Hugging Face](https://huggingface.co/collections/stepfun-ai/step3-688a3d652dbb45d868f9d42d).

Currently, it is recommended to run Step3 on the following inference engines:

* vLLM
* SGLang

Deployment and request examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
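
Once a server is running, requests follow the same OpenAI-compatible schema as the hosted API. Below is a minimal sketch of a multimodal request against a locally served endpoint; the port and served model name are assumptions that depend on how the server was launched:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (vLLM or SGLang) listening locally;
# adjust the port and model name to match your launch command.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="stepfun-ai/step3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}},
            {"type": "text", "text": "What's in this picture?"},
        ],
    }],
)
print(response.choices[0].message.content)
```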

## Contact Us
If you have any questions, please reach out to us at [contact@stepfun.com](mailto:contact@stepfun.com).

## License
Both the code repository and the model weights are released under the [Apache License (Version 2.0)](./LICENSE).

## Citation
```
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding}, 
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427}, 
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence}, 
      author={StepFun Team},
      url={https://stepfun.ai/research/step3}, 
}
```