---
license: cc-by-4.0
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
pipeline_tag: text-generation
---

# Overview

This document presents evaluation results for `DeepSeek-R1-Distill-Llama-70B` **quantized to 4 bits with GPTQ**, evaluated with the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.

---

## 📊 Evaluation Summary

| **Metric** | **Value (4-bit)** | **Description** | **8-bit** |
|----------------------|-----------|-----------------|-----------|
| **Accuracy (acc,none)** | `21.2%` | Raw accuracy: percentage of correct answers. | `21.2%` |
| **Standard Error (acc_stderr,none)** | `1.19%` | Uncertainty in the accuracy estimate. | `1.2%` |
| **Normalized Accuracy (acc_norm,none)** | `25.4%` | Accuracy after length-normalizing the answer-choice scores. | `25.2%` |
| **Standard Error (acc_norm_stderr,none)** | `1.27%` | Uncertainty in the normalized accuracy estimate. | `1.3%` |

📌 **Interpretation:**
- The model correctly answered **21.2% of the questions**.
- After **normalization**, accuracy rises slightly to **25.4%**.
- The **standard errors (1.19% and 1.27%)** indicate a margin of uncertainty of roughly one percentage point on each estimate.

---

## ⚙️ Model Configuration

- **Model:** `DeepSeek-R1-Distill-Llama-70B`
- **Parameters:** `70 billion`
- **Quantization:** `4-bit GPTQ`
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`

📌 **Interpretation:**
- The evaluation ran on a single **high-performance GPU (A100 80GB)**.
- The model is significantly larger than the previous 8B version; **4-bit GPTQ quantization reduces its memory footprint**.
- A **batch size of 1** was used, which may have slowed the evaluation.

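
The configuration above maps onto a Language Model Evaluation Harness run roughly as follows. This is a minimal sketch under stated assumptions, not the exact command used for these results: `QUANTIZED_REPO` is a hypothetical placeholder for the quantized checkpoint's repository ID (this card does not state it), and the task name `arc_challenge` and the zero-shot setting come from the dataset section of this card. The helper only assembles the `lm_eval` command line; it does not start an evaluation.

```python
import shlex

# Hypothetical placeholder: this card does not give the quantized repo ID.
QUANTIZED_REPO = "your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit"


def build_lm_eval_command(repo_id, task="arc_challenge", num_fewshot=0, batch_size=1):
    """Assemble an lm_eval CLI invocation mirroring the reported settings."""
    # `hf` backend with float16 precision, as listed in the configuration.
    model_args = f"pretrained={repo_id},dtype=float16"
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


command = build_lm_eval_command(QUANTIZED_REPO)
# Print a shell-safe version that can be pasted into a terminal.
print(shlex.join(command))
```

Depending on how the checkpoint was exported, loading it may require additional GPTQ-related packages or model arguments that are not shown here.
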

---

## 📂 Dataset Information

- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (Zero-shot setting)

📌 **Interpretation:**
- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**.

---

## 📈 Performance Insights

- The `"higher_is_better"` flag confirms that **higher accuracy is preferred**.
- The model's **raw accuracy (21.2%)** is significantly lower than that of state-of-the-art models (**60–80%** on ARC-Challenge).
- **Quantization Impact:** **4-bit GPTQ quantization** reduces memory usage but may also reduce accuracy slightly.
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing worked examples in the prompt before the test question).

---

## 📊 Detailed Evaluation on MMLU

| **Metric** | **Value** | **Description** |
|----------------------|-----------|-----------------|
| **MMLU** | `37.88%` | Averaged over MMLU-Stem, MMLU-Social-Sciences, MMLU-Humanities, MMLU-Other |
| **MMLU-Humanities** | `31.83%` | Averaged over MMLU-Formal-Logic, MMLU-Prehistory, MMLU-World-Religions, MMLU-Philosophy, MMLU-High-School-World-History, MMLU-Professional-Law, MMLU-High-School-US-History, MMLU-Logical-Fallacies, MMLU-International-Law, MMLU-High-School-European-History, MMLU-Moral-Disputes, MMLU-Moral-Scenarios, MMLU-Jurisprudence |
| **MMLU-Social-Sciences** | `45.43%` | Averaged over MMLU-Public-Relations, MMLU-Sociology, MMLU-Security-Studies, MMLU-High-School-Government-and-Politics, MMLU-High-School-Psychology, MMLU-Human-Sexuality, MMLU-US-Foreign-Policy, MMLU-High-School-Microeconomics, MMLU-Econometrics, MMLU-High-School-Macroeconomics, MMLU-High-School-Geography, MMLU-Professional-Psychology |
| **MMLU-Stem** | `33.01%` | Averaged over MMLU-Conceptual-Physics, MMLU-High-School-Chemistry, MMLU-College-Biology, MMLU-College-Chemistry, MMLU-Machine-Learning, MMLU-Elementary-Mathematics, MMLU-Abstract-Algebra, MMLU-Astronomy, MMLU-High-School-Statistics, MMLU-Anatomy, MMLU-College-Mathematics, MMLU-Computer-Security, MMLU-College-Computer-Science, MMLU-Electrical-Engineering, MMLU-College-Physics, MMLU-High-School-Computer-Science, MMLU-High-School-Physics, MMLU-High-School-Biology, MMLU-High-School-Mathematics |
| **MMLU-Other** | `44.48%` | Averaged over MMLU-Medical-Genetics, MMLU-Global-Facts, MMLU-Marketing, MMLU-College-Medicine, MMLU-Human-Aging, MMLU-Virology, MMLU-Business-Ethics, MMLU-Clinical-Knowledge, MMLU-Professional-Medicine, MMLU-Nutrition, MMLU-Miscellaneous, MMLU-Professional-Accounting, MMLU-Management |
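
The uncertainties and averages reported in this card can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard error is the usual binomial estimate, sqrt(p(1 - p) / n), over the 1,172 ARC-Challenge questions; the harness's own estimator may differ slightly. It also compares the overall MMLU score with the plain, unweighted mean of the four group scores. All inputs are copied from the tables in this card.

```python
import math

N_ARC = 1172  # number of ARC-Challenge samples evaluated


def binomial_stderr(accuracy, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# ARC-Challenge: recompute both standard errors from the reported accuracies.
acc_stderr = binomial_stderr(0.212, N_ARC)
acc_norm_stderr = binomial_stderr(0.254, N_ARC)
print(f"acc stderr:      {acc_stderr:.2%}")
print(f"acc_norm stderr: {acc_norm_stderr:.2%}")

# MMLU: unweighted mean of the four group scores (in percent).
mmlu_groups = {
    "humanities": 31.83,
    "social_sciences": 45.43,
    "stem": 33.01,
    "other": 44.48,
}
unweighted_mean = sum(mmlu_groups.values()) / len(mmlu_groups)
print(f"unweighted MMLU group mean: {unweighted_mean:.2f}%")
```

Under these assumptions the recomputed standard errors come out at about 1.19% and 1.27%, in line with the reported values. The unweighted group mean is about 38.69%, slightly above the reported overall score of 37.88%. One plausible explanation, not confirmed by this card, is that the overall score weights each group by its number of questions.
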