pradeepd committed
Commit 52f1fe9 · verified · Parent: 210609f

Update src/md.py

Files changed (1): src/md.py +7 -7
src/md.py CHANGED
@@ -3,11 +3,11 @@ import pytz
 
 ABOUT_TEXT = """
 ## Overview
- HREF is evaluation benchmark that evaluates language models' capacity of following human instructions. It is consisted of 4,258 instructions covering 11 distinct categories, including Brainstorm ,Open QA ,Closed QA ,Extract ,Generation ,Rewrite ,Summarize ,Coding ,Classify ,Fact Checking or Attributed QA ,Multi-Document Synthesis , and Reasoning Over Numerical Data.
+ HREF is an evaluation benchmark that evaluates language models' ability to follow human instructions. It consists of 4,258 instructions covering 11 distinct categories, including various general chat capabilities like brainstorming, question answering, and summarization, and those focused on scientific text understanding like reasoning over numerical data.
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64dff1ddb5cc372803af964d/0TK6xku0gdJPDs_nfwzns.png)
 
 ## Generation Configuration
- For reproductability, we use greedy decoding for all model generation as default. We apply chat templates to the instructions if they are implemented in model's tokenizer or explicity recommanded by the model's creators. Please contact us if you would like to change this default configuration.
+ For reproducibility, we use greedy decoding for all models by default. We apply chat templates to the instructions if they are implemented in the model's tokenizer or explicitly recommended by the model's creators. Please contact us if you would like to change this default configuration.
 
 ## Why HREF
 | Benchmark | Size | Evaluation Method | Baseline Model | Judge Model | Task Oriented | Contamination Resistant | Contains Human Reference|
@@ -17,19 +17,19 @@ For reproductability, we use greedy decoding for all model generation as default
 | Chatbot Arena | --- | PWC | --- | Human | ✗ | ✓ | ✗ |
 | Arena-Hard | 500 | PWC | gpt4-0314 | gpt4-turbo | ✗ | ✗ | ✗ |
 | WildBench | 1,024 | Score/PWC | gpt4-turbo | three models | ✗ | ✗ | ✗ |
- | **HREF** | 4,258 | PWC | Llama-3.1-405B-Instruct | Llama-3.1-70B-Instruct | ✓ | ✓ | ✓ |
+ | **HREF** | 4,258 | PWC | Llama-3.1-405B-Instruct | Llama-3.3-70B-Instruct | ✓ | ✓ | ✓ |
 
- - **Human Reference**: HREF leverages human-written answer as reference to provide more reliable evaluation than previous method.
+ - **Human Reference**: HREF leverages human-written responses to provide more reliable evaluation than previous methods.
 - **Large**: HREF has the largest evaluation size among similar benchmarks, making its evaluation more reliable.
- - **Contamination-resistant**: HREF's evaluation set is hidden and uses public models for both the baseline model and judge model, which makes it completely free of contamination.
+ - **Contamination-resistant**: HREF's evaluation set is hidden and uses public models as both the baseline model and the judge model, which makes it completely free of contamination.
- - **Task Oriented**: Instead of naturally collected instructions from the user, HREF contains instructions that are written specifically targetting 8 distinct categories that are used in instruction tuning, which allows it to provide more insights about how to improve language models.
+ - **Task Oriented**: Instead of prompts from users, HREF contains instructions that are written specifically targeting 8 distinct categories that are commonly used for instruction tuning, which allows it to provide more insights about how to improve language models.
 """
 
 # Get Pacific time zone (handles PST/PDT automatically)
 pacific_tz = pytz.timezone('America/Los_Angeles')
 current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")
 
- TOP_TEXT = f"""# HREF: Human Reference Guided Evaluation for Instructiong Following
+ TOP_TEXT = f"""# HREF: Human Response-Guided Evaluation of Instruction Following in Language Models
 [Code](https://github.com/allenai/href) | [Validation Set](https://huggingface.co/datasets/allenai/href) | [Human Agreement Set](https://huggingface.co/datasets/allenai/href_preference) | [Results](https://huggingface.co/datasets/allenai/href_results) | [Paper](https://arxiv.org/abs/2412.15524) | Total models: {{}} | Last restart (PST): {current_time}
 """
 