https://huggingface.co/xai-org/grok-2

#1316
by sasa2000 - opened

Can grok2 be quantized like grok1?

It probably can, but some minor changes to llama.cpp will be required first, according to https://github.com/ggml-org/llama.cpp/issues/15534. I'm currently downloading the model. I will follow the llama.cpp discussions about it and try to fulfill your request as soon as official grok-2 support has landed.

The download is now complete, and on the latest llama.cpp the conversion currently fails with:

root@AI:/apool/llama.cpp# venv/bin/python convert_hf_to_gguf.py /cpool/grok-2 --outtype=source --outfile=/transfer/grok-2.gguf
INFO:hf-to-gguf:Loading model: grok-2
INFO:hf-to-gguf:Model architecture: Grok1ForCausalLM
ERROR:hf-to-gguf:Model Grok1ForCausalLM is not supported

Interesting: it turns out grok-1 used GrokForCausalLM while grok-2 uses Grok1ForCausalLM.
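
For reference, one way to make the converter accept the new architecture name is to register it alongside the old one in convert_hf_to_gguf.py. This is only a sketch of that local hack, not the proper fix the upcoming PR brings, and the decorator text may differ between llama.cpp versions, so verify the match first:

# check where the grok-1 architecture name is registered
grep -n 'register("GrokForCausalLM")' convert_hf_to_gguf.py
# register the grok-2 name next to it so the existing Grok converter class handles it
sed -i 's/register("GrokForCausalLM")/register("GrokForCausalLM", "Grok1ForCausalLM")/' convert_hf_to_gguf.py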

Even if I patch llama.cpp to treat it like grok-1, it will not convert, as the model is missing tokenizer.model:

root@AI:/apool/llama.cpp# venv/bin/python convert_hf_to_gguf.py /cpool/grok-2 --outtype=source --outfile=/transfer/grok-2.gguf
INFO:hf-to-gguf:Loading model: grok-2
INFO:hf-to-gguf:Model architecture: Grok1ForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 131072
INFO:hf-to-gguf:gguf: embedding length = 8192
INFO:hf-to-gguf:gguf: feed forward length = 32768
INFO:hf-to-gguf:gguf: head count = 64
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 208533496
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: layer norm epsilon = 1e-12
INFO:hf-to-gguf:gguf: expert count = 8
INFO:hf-to-gguf:gguf: experts used count = 2
INFO:hf-to-gguf:gguf: file type = 1025
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
Traceback (most recent call last):
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 8833, in <module>
    main()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 8827, in main
    model_instance.write()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 442, in write
    self.prepare_metadata(vocab_only=False)
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 563, in prepare_metadata
    self.set_vocab()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 2653, in set_vocab
    self._set_vocab_sentencepiece()
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 990, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/apool/llama.cpp/convert_hf_to_gguf.py", line 1007, in _create_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: /cpool/grok-2/tokenizer.model

I see. Thank you

It's almost ready... ;)

https://github.com/ggml-org/llama.cpp/pull/15539

any in-progress ggufs somewhere? :)

any in-progress ggufs somewhere? :)

I have them locally. The model works great. Maybe I should redo them using the latest commit and upload the most important ones to my account, but is it really worth the effort to do so before the PR gets merged? Realistically, anyone who knows how to compile llama.cpp from source also knows how to create their own quants. For mradermacher quants we always wait for the PR to get merged.

In my case it's disk space and time to download... ;)
I use the Q3 of Qwen3-235B-A22B, so I'm wondering whether Q3 will work for me with Grok, or whether it will be too slow?

I realized that grok-2 will be really difficult to convert to GGUF for basically anyone, given the required modifications that have to be made to the original model. This probably justifies me uploading a llama.cpp-fixed version of grok-2 and pre-release GGUFs. I'm currently converting the model to the source GGUF using the latest commit. I'm at a company event for almost the entire day, so I have no idea when I will get the chance to generate and upload them. Unlike for mradermacher GGUFs, where most of the process is automated, a lot of manual work is involved here. I guess I will at least write a script to generate all the quants more or less automatically. I assume I will do BF16, Q8_0, Q5_K_M, Q4_K_M, IQ4_XS, Q3_K_M, Q2_K.
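
Regarding the Q3 speed question: once a quant is downloaded, llama-bench from the same (patched) llama.cpp build gives a quick answer. Just a sketch; the path and split count below are placeholders for whatever the actual download looks like:

# prints prompt-processing (pp) and token-generation (tg) speed for the model
./llama-bench -m /path/to/grok-2-Q3_K_M/grok-2.Q3_K_M-00001-of-00003.gguf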

I will test Q3 asap :)

Executing this never-before-tested quantisation script while on my way to a company event. Let's see how this works out. Usually everything that can go wrong will go wrong when I do things like this:

./llama-quantize /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2.Q2_K.gguf Q2_K
./llama-gguf-split --split-max-size 48G /bpool/grok-2-GGUF/grok-2.Q2_K.gguf /bpool/grok-2-GGUF/grok-2-Q2_K/grok-2.Q2_K
rm /bpool/grok-2-GGUF/grok-2.Q2_K.gguf
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-Q2_K
./llama-quantize /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2.Q3_K_M.gguf Q3_K_M
./llama-gguf-split --split-max-size 48G /bpool/grok-2-GGUF/grok-2.Q3_K_M.gguf /bpool/grok-2-GGUF/grok-2-Q3_K_M/grok-2.Q3_K_M
rm /bpool/grok-2-GGUF/grok-2.Q3_K_M.gguf
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-Q3_K_M
./llama-quantize /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2.IQ4_XS.gguf IQ4_XS
./llama-gguf-split --split-max-size 48G /bpool/grok-2-GGUF/grok-2.IQ4_XS.gguf /bpool/grok-2-GGUF/grok-2-IQ4_XS/grok-2.IQ4_XS
rm /bpool/grok-2-GGUF/grok-2.IQ4_XS.gguf
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-IQ4_XS
./llama-quantize /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2.Q4_K_M.gguf Q4_K_M
./llama-gguf-split --split-max-size 48G /bpool/grok-2-GGUF/grok-2.Q4_K_M.gguf /bpool/grok-2-GGUF/grok-2-Q4_K_M/grok-2.Q4_K_M
rm /bpool/grok-2-GGUF/grok-2.Q4_K_M.gguf
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-Q4_K_M
./llama-quantize /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2.Q5_K_M.gguf Q5_K_M
./llama-gguf-split --split-max-size 48G /bpool/grok-2-GGUF/grok-2.Q5_K_M.gguf /bpool/grok-2-GGUF/grok-2-Q5_K_M/grok-2.Q5_K_M
rm /bpool/grok-2-GGUF/grok-2.Q5_K_M.gguf
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-Q5_K_M
./llama-quantize /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2.Q8_0.gguf Q8_0
./llama-gguf-split --split-max-size 48G /bpool/grok-2-GGUF/grok-2.Q8_0.gguf /bpool/grok-2-GGUF/grok-2-Q8_0/grok-2.Q8_0
rm /bpool/grok-2-GGUF/grok-2.Q8_0.gguf
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-Q8_0
./llama-gguf-split --split-max-size 48G /transfer/grok-2.gguf /bpool/grok-2-GGUF/grok-2-BF16/grok-2.BF16
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF /bpool/grok-2-GGUF
rm -rf /bpool/grok-2-GGUF/grok-2-BF16
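
For reference, the same quantize, split, upload and clean-up cycle can also be written as a loop. This is just a sketch that reuses the exact paths and quant list from the commands above:

#!/bin/bash
set -e
SRC=/transfer/grok-2.gguf
OUT=/bpool/grok-2-GGUF
for Q in Q2_K Q3_K_M IQ4_XS Q4_K_M Q5_K_M Q8_0; do
    # quantize, split into <=48G shards, upload, then free the local disk space
    ./llama-quantize "$SRC" "$OUT/grok-2.$Q.gguf" "$Q"
    ./llama-gguf-split --split-max-size 48G "$OUT/grok-2.$Q.gguf" "$OUT/grok-2-$Q/grok-2.$Q"
    rm "$OUT/grok-2.$Q.gguf"
    /apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF "$OUT"
    rm -rf "$OUT/grok-2-$Q"
done
# the BF16 source GGUF is only split and uploaded, not quantized again
./llama-gguf-split --split-max-size 48G "$SRC" "$OUT/grok-2-BF16/grok-2.BF16"
/apool/Download/venv/bin/huggingface-cli upload nicoboss/grok-2-GGUF "$OUT"
rm -rf "$OUT/grok-2-BF16"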

I just came back from the company event and was finally able to look at how the grok-2 quantization is doing. I'm impressed by how well this worked. It has already uploaded all quants (Q8_0, Q5_K_M, Q4_K_M, IQ4_XS, Q3_K_M, Q2_K) and is currently splitting BF16 to upload the source GGUF. I never would have expected the above script to work on the first try.

You can download my grok-2 pre-release quants from https://huggingface.co/nicoboss/grok-2-GGUF
Please keep in mind that to run them you need to manually build https://github.com/ggml-org/llama.cpp/tree/cisc/grok-2
Official mradermacher quants will be provided as soon as https://github.com/ggml-org/llama.cpp/pull/15539 is merged.
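
For anyone who has not built a llama.cpp branch from source before, here is a rough sketch of the steps (the CUDA flag is optional, and the model path and split count are placeholders for your download):

# build the grok-2 branch
git clone --branch cisc/grok-2 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j
# point llama-cli at the first split file; the remaining splits are loaded automatically
./build/bin/llama-cli -m /path/to/grok-2-Q4_K_M/grok-2.Q4_K_M-00001-of-00003.gguf -p "Hello"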

Thank you very much!
