# Arena-Lite (formerly VARCO Arena)
Arena-Lite runs a tournament among the models being compared for every test set prompt, ranking models accurately at an affordable cost. This is more accurate and more cost-effective than estimating win rates by comparing each model against reference outputs.

For more information, the following resources may help you understand how it works.
* [Paper](https://arxiv.org/abs/2411.01281)
* [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
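
As a rough illustration of how this works (a toy sketch, not the Arena-Lite implementation): for each test prompt, the compared models' outputs meet in a single-elimination bracket, a judge model picks each pairwise winner, and the wins aggregated over the whole test set produce the ranking.

```python
# Toy sketch of the tournament idea -- not Arena-Lite's actual code.
import random
from collections import Counter


def judge(prompt: str, output_a: str, output_b: str) -> str:
    """Placeholder judge. A real judge queries an LLM; returns 'a' or 'b'."""
    return random.choice(["a", "b"])


def run_bracket(prompt: str, outputs: dict[str, str]) -> Counter:
    """Single-elimination bracket for one prompt; returns per-model win counts."""
    contenders = list(outputs)
    random.shuffle(contenders)
    wins: Counter = Counter()
    while len(contenders) > 1:
        next_round = []
        for a, b in zip(contenders[::2], contenders[1::2]):
            winner = a if judge(prompt, outputs[a], outputs[b]) == "a" else b
            wins[winner] += 1
            next_round.append(winner)
        if len(contenders) % 2:  # odd model count: the unpaired model gets a bye
            next_round.append(contenders[-1])
        contenders = next_round
    return wins


def rank(test_set: list[str], all_outputs: dict[str, dict[str, str]]) -> list[str]:
    """Aggregate bracket wins over all prompts and rank models by total wins."""
    total: Counter = Counter()
    for prompt in test_set:
        total += run_bracket(prompt, {m: outs[prompt] for m, outs in all_outputs.items()})
    return [model for model, _ in total.most_common()]
```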


## Quickstart
### Running the web demo locally (Streamlit, recommended!)
```bash
git clone [THIS_REPO]
# install the requirements listed below; we recommend miniforge for managing the environment
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`

### CLI use
* The CLI lives in `varco_arena/`
* Debug configurations for VS Code are in `varco_arena/.vscode`
```bash
## gpt-4o-mini as the judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## an LLM served via a vLLM OpenAI-compatible server as the judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# debug lines
## OpenAI API judge debugging
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge debugging (check for errors without making API requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```

## Requirements
```
pip install -r requirements.txt # python 3.11

# Linux: also install uvloop
pip install uvloop
# Windows: also install winloop
pip install winloop
```


#### Arguments
- `-i, --input` : directory containing the input JSONL files (LLM outputs)
- `-o, --output_dir` : directory where the results will be written
- `-e, --evaluation` : judge model specification (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", \[vllm-served-model-name\])
- `-k, --openai_api_key` : OpenAI API key
- `-u, --openai_url` : URL of an OpenAI-compatible LLM server (accessed via the OpenAI SDK)

#### Advanced
- `-j, --n_jobs` : concurrency limit for judge requests, used as the value of `asyncio.Semaphore` (see the sketch after this list)
- `-p, --evalprompt` : evaluation prompt to use ([see the prompts directory](./varco_arena/prompts/))
- `-lr, --limit_requests` : vLLM OpenAI server request limit (default: 7,680)
- `-lt, --limit_tokens` : vLLM OpenAI server token limit (default: 15,728,640)
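
For illustration only (this is not Arena-Lite's internal code), a minimal sketch of what a concurrency limit like `--n_jobs` means in terms of `asyncio.Semaphore`:

```python
# Minimal sketch: how an asyncio.Semaphore bounds concurrent judge requests.
# Illustration only -- not Arena-Lite's implementation.
import asyncio


async def call_judge(i: int, sem: asyncio.Semaphore) -> str:
    async with sem:               # at most n_jobs coroutines pass this point at once
        await asyncio.sleep(0.1)  # stand-in for a single judge API request
        return f"match {i} judged"


async def main(n_jobs: int = 8) -> None:
    sem = asyncio.Semaphore(n_jobs)
    results = await asyncio.gather(*(call_judge(i, sem) for i in range(32)))
    print(len(results), "matches judged")


if __name__ == "__main__":
    asyncio.run(main())
```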

#### Input Data Format
[input jsonl guides](./streamlit_app_local/guide_mds/input_jsonls_en.md)


## Contributing & Customizing
#### Do this after git clone and installation
```bash
pip install pre-commit
pre-commit install
```
#### before commit
```bash
bash precommit.sh # the black formatter will reformat the code
```

### 📝 Adding a Custom Prompt

Here’s how to add a new evaluation prompt. The process has been simplified recently, as the Judge logic now only relies on the `parsed_output` method.

The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` to create your own prompt.

#### 1. Create Prompt `.py` and `.yaml` Files

-   Create files like `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
-   **`my_prompt.py`**:
    -   Define a class that inherits from `ComparisonPromptBase`.
    -   You **must** implement the `parsed_output(self, response)` method. This function should take the LLM Judge's `response` and return a decision token (e.g., `'a'`, `'b'`) indicating the winner (see the sketch after this list).
-   **`my_prompt.yaml`**:
    -   Define necessary elements for your prompt, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
    -   The strings in `prompt_template` are processed with `string.Template` and finalized in `eval_utils.py` via `BasePrompt.complete_prompt()`.
    -   Do not use `${task}` in `prompt_template`; it is a reserved keyword because of the llmbar prompt.
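
A minimal sketch of what `my_prompt.py` could look like (the base-class import path and the exact shape of `response` are assumptions here; use `llmbar_brief.py` as the authoritative reference):

```python
# my_prompt.py -- minimal sketch only; mirror llmbar_brief.py for the real contract.
# Assumptions: ComparisonPromptBase is importable from within the prompts package,
# and `response` carries the judge LLM's reply text.
from .comparison_prompt_base import ComparisonPromptBase  # assumed module name


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        """Map the judge's reply to a decision token ('a' or 'b')."""
        text = response if isinstance(response, str) else str(response)
        verdict = text.strip().lower()
        # Assumption: the prompt asks the judge to end with "(a)" or "(b)";
        # pick whichever token appears last in the reply.
        return "a" if verdict.rfind("(a)") > verdict.rfind("(b)") else "b"
```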

#### 2. Register the Prompt in `prompts/__init__.py`

-   Import your new prompt class:
    ```python
    from .my_prompt import MyPrompt
    ```
-   Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
    ```python
    NAME2PROMPT_CLS = dict(
        # ... other prompts
        my_prompt=MyPrompt(),
    )
    ```
-   Add the new prompt name to the `Literal` type hint for the `promptname` argument in the `load_prompt` function:
    ```python
    def load_prompt(
        promptname: Literal[
            # ... other prompt names
            "my_prompt",
        ],
        # ...
    ):
    ```

#### 3. Add the Prompt to `eval_prompt_list.txt`

-   Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.

#### 4. (Recommended) Test and Debug

-   It is highly recommended to debug your prompt to ensure it works as expected.
-   In the `.vscode/launch.json` file, modify the `"VA"` configuration's `args`:
    -   Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
    -   If necessary, update the `"-i", "..."` argument to the path of your test data suitable for the new prompt.
-   Go to the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
-   Find `result.json` inside the output directory you specified after `-o`. It will show every judge prompt used for each match.


## FAQ
* I want to apply my own judge prompt when running Arena-Lite
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines each prompt with a `yaml` file and a corresponding class object. Edit these as needed.
* I want tailored judge prompts for different rows of the test set (e.g. rows up to 100 use `prompt1`, rows from 101 onward use `prompt2`)
  * As the link above shows, `load_prompt` receives `promptname` and `task` as parameters to load a prompt. The function is called in [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).

## Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
  - query wrapper
  - rag prompt
- Jumin Oh (@Generation Model Team, NCSOFT)
  - overall prototyping of the system in haste


## Citation
If you found our work helpful, consider citing our paper!
```
@misc{son2024varcoarenatournamentapproach,
      title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}
```

## Paper Experimental Results (raw data)
https://huggingface.co/datasets/fgenie777/Arena-Lite-Experiments-Result-Data/tree/main