Update README.md

README.md

Training took ~10 hours on a TPUv4-32.

The current version of this model was trained for 20k steps with 32*2048 bytes per batch (≈ 1.3B bytes ≈ 328M subword tokens in total). Surprisingly, it performs as well as it does despite this very short training run. We plan to train a new version for more steps (you can also do so yourself using [`tokenkit`](https://github.com/bminixhofer/tokenkit)).
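
As a rough sanity check on those numbers, here is the arithmetic spelled out. The ~4 bytes-per-subword-token ratio is not stated explicitly above; it is only an assumption implied by the 1.3B bytes ≈ 328M tokens figure:

```python
# Back-of-the-envelope check of the training budget quoted above.
steps = 20_000       # optimizer steps
batch_size = 32      # sequences per batch
seq_len = 2_048      # bytes per sequence

total_bytes = steps * batch_size * seq_len
print(f"total bytes: {total_bytes / 1e9:.2f}B")  # ~1.31B bytes

# Assumed average of ~4 bytes per subword token
# (implied by 1.3B bytes ≈ 328M tokens).
bytes_per_token = 4.0
print(f"≈ {total_bytes / bytes_per_token / 1e6:.0f}M subword tokens")  # ~328M
```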

To preserve efficiency, we would have to add some combination of [BLT-style hierarchical processing](https://arxiv.org/abs/2412.09871), [attention approximations](https://hkunlp.github.io/blog/2025/evabyte/), and [self-speculative decoding](https://arxiv.org/abs/2309.08168).
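
As one illustration of the last idea, here is a minimal, self-contained sketch of a greedy speculative-decoding loop (draft a few bytes cheaply, verify with the full model, keep the longest agreeing prefix). The two toy predictor functions only emulate a shallow "draft" pass and a full pass of the same byte-level model; they are placeholders for illustration, not part of this repository or of the papers linked above:

```python
import random

random.seed(0)
VOCAB = 256  # byte-level vocabulary

# Toy stand-ins for one model run at two depths: the "full" pass uses every
# layer, the "draft" pass would skip the upper layers (the self-speculative
# setting). Here both are hash-based next-byte predictors; the draft one is a
# noisier copy so that it agrees with the full pass most of the time.
def full_next_byte(context):
    return hash(tuple(context)) % VOCAB

def draft_next_byte(context):
    if random.random() < 0.8:  # agree with the full pass ~80% of the time
        return full_next_byte(context)
    return random.randrange(VOCAB)

def self_speculative_decode(prompt, n_new, k=4):
    """Greedy speculative loop: draft k bytes cheaply, then verify them."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1) Draft k bytes autoregressively with the cheap (shallow) pass.
        ctx, draft = list(out), []
        for _ in range(k):
            b = draft_next_byte(ctx)
            draft.append(b)
            ctx.append(b)

        # 2) Verify with the full pass (in a real model this is a single
        #    parallel forward over all k positions) and keep the longest
        #    agreeing prefix, plus one corrected byte so that every
        #    iteration makes progress.
        ctx, accepted = list(out), []
        for b in draft:
            target = full_next_byte(ctx)
            if target != b:
                accepted.append(target)  # replace the first mismatching byte
                break
            accepted.append(b)
            ctx.append(b)
        else:
            accepted.append(full_next_byte(ctx))  # bonus byte: all drafts accepted

        out.extend(accepted)
    return out[: len(prompt) + n_new]

print(self_speculative_decode(prompt=[72, 105], n_new=16))
```
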
## Acknowledgments
Training was enabled by Cloud TPUs from Google’s TPU Research Cloud (TRC).