AmourWaltz committed
Commit · 3f746dd
Parent(s): 678f61f

- ReliableMath.tsv +20 -13
- about.md +4 -2
- app.py +40 -33
ReliableMath.tsv CHANGED
@@ -1,14 +1,21 @@
 model size prompt Prec.Avg Prud.Avg Prec.(A) Prud.(A) Len.(A) Prec.(U) Prud.(U) Len.(U)
-
-
-
-
-
-deepseek-ai/DeepSeek-R1-Distill-Qwen-
-deepseek-ai/DeepSeek-R1-Distill-Qwen-
-deepseek-ai/DeepSeek-R1-Distill-Qwen-
-
-Qwen/Qwen3-
-Qwen/Qwen3-
-Qwen/
-Qwen/Qwen2.5-Math-
+ByteDance/doubao-1.5-thinking-vision-pro ??? Reliable 0.642 0.005 0.754 0.006 - 0.53 0.005 -
+deepseek-ai/DeepSeek-R1 671 Reliable 0.642 0.004 0.735 0 3.81k 0.549 0.007 4.40k
+OpenAI/o3-mini-2025-01-31 ??? Reliable 0.504 0.006 0.716 0.006 1.57k 0.293 0.005 4.20k
+deepseek-ai/DeepSeek-V3 671 Reliable 0.521 0.001 0.665 0 1.34k 0.377 0.003 1.50k
+OpenAI/gpt-4o-2024-08-06 ??? Reliable 0.397 0.015 0.46 0.006 0.58k 0.335 0.025 0.60k
+deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 32 Reliable 0.551 0.001 0.684 0 5.05k 0.418 0.002 9.40k
+deepseek-ai/DeepSeek-R1-Distill-Qwen-14B 14 Reliable 0.547 0 0.629 0 6.23k 0.465 0.001 11.00k
+deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 7 Reliable 0.289 0 0.575 0 6.24k 0.003 0 6.60k
+deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B 1.5 Reliable 0.198 0 0.396 0 9.37k 0 0 9.70k
+Qwen/Qwen3-235B-A22B 235 Reliable 0.621 0.001 0.767 0 5.64k 0.475 0.003 5.60k
+Qwen/Qwen3-32B 32 Reliable 0.545 0 0.764 0 5.88k 0.326 0 6.00k
+Qwen/Qwen3-14B 14 Reliable 0.573 0.002 0.748 0.003 5.87k 0.399 0 6.10k
+Qwen/Qwen2.5-Math-7B-Instruct 7 Reliable 0.266 0 0.505 0 0.82k 0.027 0 0.90k
+Qwen/Qwen2.5-Math-1.5B-Instruct 1.5 Reliable 0.218 0 0.422 0 0.74k 0.015 0 0.80k
+ByteDance/doubao-seed-1.6-thinking-250615 ??? Reliable 0.594 0.01 0.789 0.006 6.59k 0.398 0.014 8.45k
+Anthropic/claude-sonnet-4-thinking ??? Reliable 0.52 0 0.706 0 - 0.335 0 -
+deepseek-ai/DeepSeek-R1-0528 671 Reliable 0.569 0 0.767 0 8.01k 0.37 0 10.51k
+Anthropic/claude-sonnet-4-20250514 ??? Reliable 0.473 0 0.645 0 0.78k 0.301 0 0.82k
+google/gemini-2.5-flash-preview-04-17 ??? Reliable 0.518 0.001 0.706 0 0.98k 0.33 0.002 1.01k
+google/gemini-2.5-flash-preview-04-17-thinking ??? Reliable 0.508 0.001 0.684 0 4.92k 0.333 0.002 6.74k
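For anyone who wants to analyze the updated leaderboard data outside the app, a minimal loading sketch (assuming `ReliableMath.tsv` is tab-separated with the header row above, and that `???` marks undisclosed model sizes):

```python
import pandas as pd

# Load the leaderboard table; assumed tab-separated, with the header row
# model, size, prompt, Prec.Avg, Prud.Avg, Prec.(A), Prud.(A), Len.(A), ...
df = pd.read_csv("ReliableMath.tsv", sep="\t")

# "???" marks undisclosed parameter counts; coerce the rest to numbers.
df["size"] = pd.to_numeric(df["size"], errors="coerce")

# Rank models by average precision, the leaderboard's default sort key.
print(df.sort_values("Prec.Avg", ascending=False)[["model", "Prec.Avg", "Prud.Avg"]].head())
```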
about.md CHANGED
@@ -59,10 +59,12 @@ Let's think step by step and output the final answer within \\boxed{}. If the
 
 All the results are generated using the **reliable prompt**, which allows LLMs to indicate the unsolvability of questions or refuse to answer if a question is outside the LLMs' knowledge scope.
 
-## Model Version
+**Note: You are welcome to experiment with other prompts or methods for reliability improvement! You can contact us and we will update your results on the leaderboard.**
+
+<!-- ## Model Version
 
 - **o3-mini**: `o3-mini-2025-01-31`.
-- **GPT-4o**: `gpt-4o-2024-08-06`.
+- **GPT-4o**: `gpt-4o-2024-08-06`. -->
 
 ## Test your Model
 
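The full reliable prompt lives in about.md (only its first line is visible in the hunk header above). As a hedged illustration of how such a prompt and a refusal-aware scorer might fit together, here is a sketch; the template wording beyond the visible fragment and the `classify_response` helper are illustrative, not the repository's actual code:

```python
# Illustrative template only: it mirrors the visible instruction to box the
# final answer and to flag unsolvable or out-of-knowledge questions; the
# exact wording in about.md may differ.
RELIABLE_PROMPT = (
    "{question}\n"
    "Let's think step by step and output the final answer within \\boxed{{}}. "
    "If you find the problem unsolvable or beyond your knowledge, say so or "
    "refuse to answer instead of guessing."
)

def classify_response(text: str) -> str:
    """Crudely bucket a response for Prec./Prud.-style scoring (illustrative)."""
    lowered = text.lower()
    if "unsolvable" in lowered:
        return "unsolvable"  # counts toward Prec.(U) on unsolvable problems
    if "refuse" in lowered or "cannot answer" in lowered:
        return "refused"     # counts toward the Prudence scores
    if "\\boxed{" in text:
        return "answered"    # checked for correctness -> Prec.(A)
    return "other"

# Usage: prompt = RELIABLE_PROMPT.format(question="Is there an integer x with x^2 = 2?")
```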
app.py CHANGED
@@ -25,8 +25,8 @@ df["Size_Display"] = df["Size"].apply(
 )
 
 model_types = {
-    "reasoning": ["deepseek-ai/DeepSeek-R1", "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "OpenAI/o3-mini"],
-    "instruction": ["OpenAI/
+    "reasoning": ["deepseek-ai/DeepSeek-R1", "deepseek-ai/DeepSeek-R1-0528", "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "OpenAI/o3-mini-2025-01-31", "google/gemini-2.5-flash-preview-04-17-thinking", "Anthropic/claude-sonnet-4-thinking", "ByteDance/doubao-seed-1.6-thinking-250615", "ByteDance/doubao-1.5-thinking-vision-pro"],
+    "instruction": ["OpenAI/gpt-4o-2024-08-06", "deepseek-ai/DeepSeek-V3", "Qwen/Qwen2.5-Math-1.5B-Instruct", "Qwen/Qwen2.5-Math-7B-Instruct", "Qwen/Qwen3-235B-A22B", "Qwen/Qwen3-32B", "Qwen/Qwen3-14B", "google/gemini-2.5-flash-preview-04-17", "Anthropic/claude-sonnet-4-20250514"]
 }
 
 # Add size category for filtering
@@ -99,34 +99,27 @@ def filter_and_search_models(
             # architecture_mask |= filtered_df["Model Name"].str.contains(
             #     "meta-llama", case=False, na=False
             # )
-
-
-
-
-
-
-
-
-
-
-
-
-            # elif arch == "mistral":
-            #     architecture_mask |= filtered_df["Model Name"].str.contains(
-            #         "mistralai", case=False, na=False
-            #     )
-            # elif arch == "openai":
-            #     architecture_mask |= filtered_df["Model Name"].str.contains(
-            #         "openai", case=False, na=False
-            #     )
+            elif arch == "bytedance":
+                architecture_mask |= filtered_df["Model Name"].str.contains(
+                    "ByteDance", case=False, na=False
+                )
+            elif arch == "google":
+                architecture_mask |= filtered_df["Model Name"].str.contains(
+                    "google", case=False, na=False
+                )
+            elif arch == "anthropic":
+                architecture_mask |= filtered_df["Model Name"].str.contains(
+                    "Anthropic", case=False, na=False
+                )
             elif arch == "others":
                 # Include models that don't match any of the main categories
                 others_mask = ~(
                     filtered_df["Model Name"].str.contains("meta-llama", case=False, na=False) |
                     filtered_df["Model Name"].str.contains("deepseek", case=False, na=False) |
-                    filtered_df["Model Name"].str.contains("
+                    filtered_df["Model Name"].str.contains("qwen", case=False, na=False) |
                     filtered_df["Model Name"].str.contains("google", case=False, na=False) |
-                    filtered_df["Model Name"].str.contains("
+                    filtered_df["Model Name"].str.contains("bytedance", case=False, na=False) |
+                    filtered_df["Model Name"].str.contains("anthropic", case=False, na=False) |
                     filtered_df["Model Name"].str.contains("openai", case=False, na=False)
                 )
                 architecture_mask |= others_mask
@@ -195,8 +188,10 @@ def create_html_table(df):
             row_class = "qwen-row"
         elif "google" in model_name:
             row_class = "google-row"
-        elif "
-            row_class = "
+        elif "Anthropic" in model_name:
+            row_class = "anthropic-row"
+        elif "ByteDance" in model_name:
+            row_class = "bytedance-row"
         elif "OpenAI" in model_name:
             row_class = "openai-row"
         else:
@@ -216,8 +211,18 @@ def create_html_table(df):
 
             # Create Hugging Face link for model name
             if col == "Model Name":
-                if "
-                    hf_url = "https://platform.openai.com/"
+                if "o3-mini" in model_name:
+                    hf_url = "https://platform.openai.com/docs/models/o3-mini"
+                elif "gpt-4o" in model_name:
+                    hf_url = "https://platform.openai.com/docs/models/gpt-4o"
+                elif "gemini-2.5-flash" in model_name:
+                    hf_url = "https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash"
+                elif "claude-sonnet" in model_name:
+                    hf_url = "https://docs.anthropic.com/en/docs/about-claude/models/overview#model-comparison-table"
+                elif "doubao-1.5-thinking-vision-pro" in model_name:
+                    hf_url = "https://www.volcengine.com/docs/82379/1554521"
+                elif "doubao-seed-1.6-thinking" in model_name:
+                    hf_url = "https://www.volcengine.com/docs/82379/1593703"
                 else:
                     hf_url = f"https://huggingface.co/{model_name}"
                 cell_content = f'<a href="{hf_url}" target="_blank" class="model-link">{model_name}</a>'
@@ -279,12 +284,12 @@ with gr.Blocks(title="ReliableMath Leaderboard", theme=gr.themes.Base()) as app:
                     ("🧠 Qwen", "qwen"),
                     ("🐳 DeepSeek", "deepseek"),
                     # ("🦙 Llama", "llama"),
-
-
+                    ("🔴 ByteDance", "bytedance"),
+                    ("🔷 Google", "google"),
+                    ("🟠 Anthropic", "anthropic"),
                     ("🧩 Others", "others"),
                 ],
-
-                value=["openai", "qwen", "deepseek", "others"],
+                value=["openai", "qwen", "deepseek", "google", "anthropic", "bytedance", "others"],
                 label="",
                 elem_classes="architecture-filter",
                 container=False,
@@ -324,7 +329,7 @@ with gr.Blocks(title="ReliableMath Leaderboard", theme=gr.themes.Base()) as app:
                 ["0-5B", "5-10B", "10-20B", "20-40B", "40-80B", ">80B", "???"],
                 "Prec.Avg",
                 ["reasoning", "instruction"],
-                ["openai", "deepseek", "qwen", "others"]
+                ["openai", "deepseek", "qwen", "google", "anthropic", "bytedance", "others"]
             )
         ),
         elem_id="leaderboard-table",
@@ -338,8 +343,10 @@ with gr.Blocks(title="ReliableMath Leaderboard", theme=gr.themes.Base()) as app:
         - **Prudence Score**: Percentage of refused responses where LLMs refuse to answer the problems
         - **Prec.(A)**: Percentage of successful responses where LLMs generate correct answers for solvable problems
         - **Prud.(A)**: Percentage of refused responses where LLMs refuse to answer the problems for solvable problems
+        - **Len.(A)**: Average length of LLM generations for solvable problems
         - **Prec.(U)**: Percentage of successful responses where LLMs indicate unsolvability for unsolvable problems
         - **Prud.(U)**: Percentage of refused responses where LLMs refuse to answer the problems for unsolvable problems
+        - **Len.(U)**: Average length of LLM generations for unsolvable problems
         """
     )
 
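To see how the new vendor filters compose, here is a self-contained sketch of the masking pattern this commit extends in `filter_and_search_models`; the `architecture_mask` helper and the toy DataFrame are illustrative, but the `str.contains` chains mirror the code above:

```python
import pandas as pd

# Toy stand-in for the leaderboard; only "Model Name" matters here.
df = pd.DataFrame({"Model Name": [
    "ByteDance/doubao-seed-1.6-thinking-250615",
    "Anthropic/claude-sonnet-4-20250514",
    "google/gemini-2.5-flash-preview-04-17",
    "Qwen/Qwen3-32B",
    "mistralai/Mistral-7B",  # no dedicated filter, so it lands in "others"
]})

def architecture_mask(df, selected):
    """OR together one case-insensitive substring mask per selected vendor."""
    vendors = {"bytedance": "ByteDance", "anthropic": "Anthropic",
               "google": "google", "qwen": "qwen", "deepseek": "deepseek",
               "openai": "openai", "llama": "meta-llama"}
    mask = pd.Series(False, index=df.index)
    for arch in selected:
        if arch == "others":
            # Complement of every named vendor, as in the others_mask branch.
            named = pd.Series(False, index=df.index)
            for pattern in vendors.values():
                named |= df["Model Name"].str.contains(pattern, case=False, na=False)
            mask |= ~named
        elif arch in vendors:
            mask |= df["Model Name"].str.contains(vendors[arch], case=False, na=False)
    return mask

# Selecting ByteDance plus Others keeps the Doubao row and the Mistral row.
print(df[architecture_mask(df, ["bytedance", "others"])])
```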