📢 NVIDIA Releases Nemotron-CC-Math Pre-Training Dataset: A High-Quality, Web-Scale Math Corpus for Pretraining Large Language Models

Community Article Published August 18, 2025

Rabeeh Karimi Mahabadi

rkarimimahab

nvidia

Sanjeev Satheesh

sanjeevnv

nvidia

➡️ Dataset page: Nemotron-CC-Math

📜 License: NVIDIA Open Data License Agreement

🧠 Paper: Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Highlights

We’re excited to release Nemotron-CC-Math, a large-scale, high-quality math corpus extracted from Common Crawl. This dataset significantly raises the bar for open-source math pretraining corpora, outperforming prior datasets like FineMath, MegaMath, and OpenWebMath on math benchmarks, while achieving comparable or better results on code and general reasoning.

We included Nemotron-CC-Math in the pretraining mixture for the NVIDIA Nano V2 12B/9B models. To illustrate its impact, we pretrained Nano-sized models using this dataset and compared them to other open models across various math benchmarks: GSM8K CoT, MATH, MATH Level 5, and AIME 2024. Including Nemotron-CC-Math in the pretraining mixture clearly boosts math reasoning performance across all tasks, especially on more challenging datasets like MATH Level 5 and AIME.

✨ Why Build a New Math Corpus?

High-quality math datasets are critical for improving reasoning, symbolic understanding, and general intelligence in large language models (LLMs). However, most existing open math corpora suffer from:

Brittle extraction pipelines
Lossy HTML-to-text conversions
Missing or corrupted equations
Inconsistent formatting and low data fidelity

Many of the best-performing proprietary LLMs (e.g., Minerva, DeepSeekMath, Qwen-Math) rely on large, unreleased math corpora. To support the open research community, we built Nemotron-CC-Math from scratch using a new domain-agnostic extraction pipeline designed for scientific content.

🔍 What’s Inside the Dataset?

Nemotron-CC-Math comes in two variants, nemotron-cc-math-3plus and nemotron-cc-math-4plus, created by scoring data with the FineMath classifier. In this scheme, 3plus corresponds to samples scoring 3, 4, or 5, while 4plus includes only samples scoring 4 or 5. The dataset is constructed from 98 Common Crawl snapshots (2014–2024). In total, we process content from over 980,000 unique domains, making it one of the most diverse math corpora available. We also regenerated the Nemotron-MIND dataset using nemotron-cc-math-4plus, our high-quality subset, which yielded consistent gains over the previous Nemotron-MIND.

Dataset	# Tokens	# Documents
`nemotron-cc-math-3plus`	133B	101.15M
`nemotron-cc-math-4plus`	52B	45.10M
`nemotron-mind-v1`	73B	88.73M
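The 3plus/4plus scheme described above can be sketched as a simple threshold on the classifier score. This is a minimal illustration, not the released pipeline code: the function name `subsets_for_score` is hypothetical, and it assumes the score has already been rounded to an integer from 1 to 5.

```python
def subsets_for_score(score: int) -> list[str]:
    """Return the dataset variants that would include a document with this score.

    Assumes an integer quality score in 1..5 (hypothetical helper, for illustration).
    """
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    variants = []
    if score >= 3:  # 3plus keeps scores 3, 4, and 5
        variants.append("nemotron-cc-math-3plus")
    if score >= 4:  # 4plus keeps only scores 4 and 5
        variants.append("nemotron-cc-math-4plus")
    return variants
```

Under this reading, every 4plus document is also a 3plus document, which is consistent with 4plus being the smaller of the two variants in the table.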

🔨 How We Built It

We built a scalable and robust pipeline tailored to mathematical and scientific content with five key steps:

Lynx Rendering – Instead of relying on brittle DOM parsing, we use lynx to convert HTML into structured text while preserving equations and layout.
LLM Cleaning – We use a lightweight LLM (Phi-4, 14B) to remove boilerplate, standardize mathematical expressions into consistent LaTeX, and improve formatting.
Quality Filtering – We use the FineMath classifier to assign a quality score from 1–5 to each page.
Deduplication – We use MinHash-based Locality Sensitive Hashing (NeMo-Curator) to remove near-duplicates.
Decontamination – We apply LLM-based contamination detection against test benchmarks (MATH, GSM8K, MMLU, MMLU-Pro) to prevent benchmark leakage.
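To make the deduplication step more concrete, here is a minimal pure-Python sketch of MinHash with LSH banding. It is not the NeMo-Curator API and not the configuration used for the dataset. The function names, the 5-word shingles, the 64 hash functions, and the 16 bands are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64, n: int = 5) -> list[int]:
    """Keep the minimum hash value of the shingle set under each seeded hash."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Documents whose signatures agree on all rows of any band become candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In a sketch like this, candidate pairs would typically be confirmed with `estimated_jaccard` against a similarity threshold before one document of each pair is dropped. More bands with fewer rows each make the candidate search more permissive, at the cost of more false candidates.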

🏆 Performance

We ran mid-training ablations on 8B-sized models using this corpus and compared against prior math pretraining datasets, including OpenWebMath, MegaMath, and FineMath. Our dataset improves results across math, code, and general reasoning, with the largest margin on MATH (44.20 versus 34.6 for the strongest baseline, FineMath-3+).

Dataset	MATH (EM)	GSM8K (EM)	HumanEval+ (average@20)	MBPP+ (average@20)	MMLU (EM)	MMLU-STEM (EM)
OpenWebMath	34.2	76.42	33.54	37.59	65.20	59.20
FineMath-3+	34.6	79.45	34.18	29.19	67.92	62.29
MegaMath-Web	31.6	78.24	32.29	38.89	65.44	59.88
Nemotron-CC-Math-3+	44.20	80.06	37.16	43.51	68.20	64.26
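As a quick worked check of the table above, the following sketch computes the margin of Nemotron-CC-Math-3+ over the strongest baseline on each benchmark. The numbers are copied from the table. The function name `margins_over_best_baseline` is hypothetical.

```python
# Scores copied from the ablation table (higher is better).
BENCHMARKS = ["MATH", "GSM8K", "HumanEval+", "MBPP+", "MMLU", "MMLU-STEM"]
BASELINES = {
    "OpenWebMath": [34.2, 76.42, 33.54, 37.59, 65.20, 59.20],
    "FineMath-3+": [34.6, 79.45, 34.18, 29.19, 67.92, 62.29],
    "MegaMath-Web": [31.6, 78.24, 32.29, 38.89, 65.44, 59.88],
}
NEMOTRON_CC_MATH_3PLUS = [44.20, 80.06, 37.16, 43.51, 68.20, 64.26]


def margins_over_best_baseline() -> dict[str, tuple[str, float]]:
    """Map each benchmark to (strongest baseline, margin in points)."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        best_name, best_scores = max(BASELINES.items(), key=lambda kv: kv[1][i])
        margin = round(NEMOTRON_CC_MATH_3PLUS[i] - best_scores[i], 2)
        result[bench] = (best_name, margin)
    return result


if __name__ == "__main__":
    for bench, (name, margin) in margins_over_best_baseline().items():
        print(f"{bench}: +{margin} over {name}")
```

The margins are largest on MATH and MBPP+ and smallest on MMLU and GSM8K, where the baselines are already close to each other.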

Using Nemotron-CC-Math-4plus, we regenerated Nemotron-MIND, which leads to substantial improvements across math, code, and general reasoning tasks (for example, Math-500 rises from 33.4 to 47.8) despite fewer unique tokens (73B versus 126B):

Dataset	#Unique Tokens (B)	MMLU Pro	MMLU	MMLU STEM	Code	Math-500	GSM8K
Nemotron-MIND	126	36.1	66.1	60	43.4	33.4	80.7
Nemotron-MIND-V1	73	39.7	67.5	63.7	44.2	47.8	84.5

🔍 Qualitative Examples

We present a side-by-side comparison between our dataset and prior work (MegaMath). The illustrative samples highlight how our pipeline preserves mathematical equations, in contrast to existing approaches where such structures are often lost or distorted.

📦 Get Started

The datasets are uploaded as three Hugging Face dataset subsets: 3 (documents with quality label 3), 4plus (documents with quality labels 4 and 5), and 4plus_MIND (the MIND method applied to the 4plus subset). To build the 3plus subset, load both the 3 and 4plus subsets.

You can download the dataset directly from the Hugging Face Hub:

pip install datasets

from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
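Since 3plus is not uploaded as its own subset, one way to iterate over it is to chain the 3 and 4plus subsets. The sketch below is a minimal example under stated assumptions: the helper names `load_subset_stream` and `iter_3plus` are hypothetical, and the split name "train" is an assumption that should be checked against the dataset page.

```python
from itertools import chain

REPO_ID = "nvidia/Nemotron-CC-Math-v1"


def load_subset_stream(name, split="train"):
    """Stream one subset from the Hub. The split name "train" is an assumption."""
    # Imported lazily so iter_3plus can be exercised with a custom loader.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split, streaming=True)


def iter_3plus(load_subset=load_subset_stream):
    """Yield all documents with quality label 3, then all 4plus documents."""
    return chain.from_iterable(load_subset(name) for name in ("3", "4plus"))


# Example usage (downloads data lazily as you iterate):
# for doc in iter_3plus():
#     ...
```

Chaining yields the two subsets one after the other. If a mixed order is preferred for training, `datasets.interleave_datasets` can be applied to the two streams instead.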

🔓 Open-Source Everything

We believe high-quality pretraining data should be open. That's why we will release our full processing pipeline (HTML parsing, cleaning, deduplication, filtering) and our dataset.

📝 Citation

Please cite the following if you use our dataset in your work:

@article{karimi2025nemotroncc,
  title = {Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset},
  author = {Rabeeh Karimi Mahabadi and Sanjeev Satheesh and Shrimai Prabhumoye and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro},
  url = {https://arxiv.org/abs/2508.15096},
  year = {2025}
}

@misc{nvidia2025nvidianemotronnano2,
      title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model}, 
      author={NVIDIA and : and Aarti Basant and Abhijit Khairnar and Abhijit Paithankar and Abhinav Khattar and Adithya Renduchintala and Aditya Malte and Akhiad Bercovich and Akshay Hazare and Alejandra Rico and Aleksander Ficek and Alex Kondratenko and Alex Shaposhnikov and Alexander Bukharin and Ali Taghibakhshi and Amelia Barton and Ameya Sunil Mahabaleshwarkar and Amy Shen and Andrew Tao and Ann Guan and Anna Shors and Anubhav Mandarwal and Arham Mehta and Arun Venkatesan and Ashton Sharabiani and Ashwath Aithal and Ashwin Poojary and Ayush Dattagupta and Balaram Buddharaju and Banghua Zhu and Barnaby Simkin and Bilal Kartal and Bita Darvish Rouhani and Bobby Chen and Boris Ginsburg and Brandon Norick and Brian Yu and Bryan Catanzaro and Charles Wang and Charlie Truong and Chetan Mungekar and Chintan Patel and Chris Alexiuk and Christian Munley and Christopher Parisien and Dan Su and Daniel Afrimi and Daniel Korzekwa and Daniel Rohrer and Daria Gitman and David Mosallanezhad and Deepak Narayanan and Dima Rekesh and Dina Yared and Dmytro Pykhtar and Dong Ahn and Duncan Riach and Eileen Long and Elliott Ning and Eric Chung and Erick Galinkin and Evelina Bakhturina and Gargi Prasad and Gerald Shen and Haifeng Qian and Haim Elisha and Harsh Sharma and Hayley Ross and Helen Ngo and Herman Sahota and Hexin Wang and Hoo Chang Shin and Hua Huang and Iain Cunningham and Igor Gitman and Ivan Moshkov and Jaehun Jung and Jan Kautz and Jane Polak Scowcroft and Jared Casper and Jian Zhang and Jiaqi Zeng and Jimmy Zhang and Jinze Xue and Jocelyn Huang and Joey Conway and John Kamalu and Jonathan Cohen and Joseph Jennings and Julien Veron Vialard and Junkeun Yi and Jupinder Parmar and Kari Briski and Katherine Cheung and Katherine Luna and Keith Wyss and Keshav Santhanam and Kezhi Kong and Krzysztof Pawelec and Kumar Anik and Kunlun Li and Kushan Ahmadian and Lawrence McAfee and Laya Sleiman and Leon Derczynski and Luis Vega and Maer Rodrigues de Melo and Makesh Narsimhan 
Sreedhar and Marcin Chochowski and Mark Cai and Markus Kliegl and Marta Stepniewska-Dziubinska and Matvei Novikov and Mehrzad Samadi and Meredith Price and Meriem Boubdir and Michael Boone and Michael Evans and Michal Bien and Michal Zawalski and Miguel Martinez and Mike Chrzanowski and Mohammad Shoeybi and Mostofa Patwary and Namit Dhameja and Nave Assaf and Negar Habibi and Nidhi Bhatia and Nikki Pope and Nima Tajbakhsh and Nirmal Kumar Juluru and Oleg Rybakov and Oleksii Hrinchuk and Oleksii Kuchaiev and Oluwatobi Olabiyi and Pablo Ribalta and Padmavathy Subramanian and Parth Chadha and Pavlo Molchanov and Peter Dykas and Peter Jin and Piotr Bialecki and Piotr Januszewski and Pradeep Thalasta and Prashant Gaikwad and Prasoon Varshney and Pritam Gundecha and Przemek Tredak and Rabeeh Karimi Mahabadi and Rajen Patel and Ran El-Yaniv and Ranjit Rajan and Ria Cheruvu and Rima Shahbazyan and Ritika Borkar and Ritu Gala and Roger Waleffe and Ruoxi Zhang and Russell J. Hewett and Ryan Prenger and Sahil Jain and Samuel Kriman and Sanjeev Satheesh and Saori Kaji and Sarah Yurick and Saurav Muralidharan and Sean Narenthiran and Seonmyeong Bak and Sepehr Sameni and Seungju Han and Shanmugam Ramasamy and Shaona Ghosh and Sharath Turuvekere Sreenivas and Shelby Thomas and Shizhe Diao and Shreya Gopal and Shrimai Prabhumoye and Shubham Toshniwal and Shuoyang Ding and Siddharth Singh and Siddhartha Jain and Somshubra Majumdar and Soumye Singhal and Stefania Alborghetti and Syeda Nahida Akter and Terry Kong and Tim Moon and Tomasz Hliwiak and Tomer Asida and Tony Wang and Tugrul Konuk and Twinkle Vashishth and Tyler Poon and Udi Karpas and Vahid Noroozi and Venkat Srinivasan and Vijay Korthikanti and Vikram Fugro and Vineeth Kalluru and Vitaly Kurin and Vitaly Lavrukhin and Wasi Uddin Ahmad and Wei Du and Wonmin Byeon and Ximing Lu and Xin Dong and Yashaswi Karnati and Yejin Choi and Yian Zhang and Ying Lin and Yonggan Fu and Yoshi Suhara and Zhen Dong and Zhiyu Li and Zhongbo 
Zhu and Zijia Chen},
      year={2025},
      eprint={2508.14444},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14444}, 
}
