sonsus committed on
Commit
45f8fc7
·
1 Parent(s): 1eadcf1

rebrand: varco-arena -> arena-lite

.vscode/launch.json CHANGED
@@ -13,13 +13,15 @@
13
  "console": "integratedTerminal",
14
  "args": [
15
  "-i",
16
- "rsc/inputs_for_dbg/dbg_llmbar_inputs/", // "rsc/inputs_for_dbg/dbg_trans_inputs/",
 
17
  "-o",
18
  "DBGOUT",
19
  "-e",
20
  "gpt-4.1-mini",
21
  "-p",
22
- "llmbar", // "translation_fortunecookie",
 
23
 
24
  ]
25
  }
 
13
  "console": "integratedTerminal",
14
  "args": [
15
  "-i",
16
+ // "rsc/inputs_for_dbg/dbg_llmbar_inputs/",
17
+ "rsc/inputs_for_dbg/dbg_trans_inputs/",
18
  "-o",
19
  "DBGOUT",
20
  "-e",
21
  "gpt-4.1-mini",
22
  "-p",
23
+ // "llmbar",
24
+ "translation_pair",
25
 
26
  ]
27
  }
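
For reference, the updated debug configuration above is equivalent to invoking the CLI directly with the same arguments (the command simply mirrors the `args` list; see also the debug lines in the README further down in this commit):

```
python main.py -i "rsc/inputs_for_dbg/dbg_trans_inputs/" -o DBGOUT -e gpt-4.1-mini -p translation_pair
```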
README.md CHANGED
@@ -1,21 +1,8 @@
1
- ---
2
- title: VARCO Arena
3
- emoji: 🔥
4
- colorFrom: pink
5
- colorTo: yellow
6
- sdk: streamlit
7
- sdk_version: 1.40.2
8
- app_file: app.py
9
- pinned: false
10
- license: cc-by-4.0
11
- short_description: VARCO Arena is a reference-free LLM benchmarking approach
12
- ---
13
-
14
- # Varco Arena
15
- Varco Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs.
16
 
17
  For more information, the following may help you understand how it works.
18
- * [Paper](https://huggingface.co/papers/2411.01281)
19
  * [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
20
 
21
 
@@ -42,7 +29,7 @@ python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e
42
 
43
  # dbg lines
44
  ## openai api judge dbg
45
- python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e o4-mini
46
  ## other testing lines
47
  python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
48
  ## dummy judge dbg (checking errors without api requests)
@@ -102,15 +89,66 @@ pre-commit install
102
  bash precommit.sh # black formatter will reformat the codes
103
  ```
104

105
  ## FAQ
106
- * I want to apply my custom judge prompt to run Varco Arena
107
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
108
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
109
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
110
- * I want more fields for my llm outputs jsonl files for tailored use, i.e. want more fields beyond `instruction`, `source`, `generated`.
111
- * It's going to get tricky but let me briefly guide you about this.
112
- * You might have to edit `varco_arena/eval_utils.py`:`async_eval_w_prompt` (this part calls `PROMPT_OBJ.complete_prompt()`)
113
- * And all the related codes will require revision.
114
 
115
  ## Special Thanks to (contributors)
116
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
@@ -122,10 +160,9 @@ bash precommit.sh # black formatter will reformat the codes
122
 
123
  ## Citation
124
  If you found our work helpful, consider citing our paper!
125
- [arxiv](https://arxiv.org/abs/2411.19103v1)
126
  ```
127
  @misc{son2024varcoarenatournamentapproach,
128
- title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
129
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
130
  year={2024},
131
  eprint={2411.01281},
 
1
+ # Arena-Lite
2
+ Arena-Lite runs a tournament among the models being compared for each test set instruction and ranks the models accurately at an affordable price. This is more accurate and cost-effective than computing win rates against reference outputs.
3
 
4
  For more information, the following may help you understand how it works.
5
+ * [Paper](https://arxiv.org/abs/2411.01281)
6
  * [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
7
 
8
 
 
29
 
30
  # dbg lines
31
  ## openai api judge dbg
32
+ python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
33
  ## other testing lines
34
  python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
35
  ## dummy judge dbg (checking errors without api requests)
 
89
  bash precommit.sh # black formatter will reformat the codes
90
  ```
91
 
92
+ ### 📝 Adding a Custom Prompt
93
+
94
+ Here's how to add a new evaluation prompt. The process has been simplified recently, as the Judge logic now only relies on the `parsed_output` method.
95
+
96
+ The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` to create your own prompt.
97
+
98
+ #### 1. Create Prompt `.py` and `.yaml` Files
99
+
100
+ - Create files like `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
101
+ - **`my_prompt.py`**:
102
+ - Define a class that inherits from `ComparisonPromptBase`.
103
+ - You **must** implement the `parsed_output(self, response)` method. This function should take the LLM Judge's `response` and return a decision token (e.g., `'a'`, `'b'`) indicating the winner.
104
+ - **`my_prompt.yaml`**:
105
+ - Define necessary elements for your prompt, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
106
+ - The strings in `prompt_template` are processed by `string.Template` and finalized in `eval_utils.py` via the `BasePrompt.complete_prompt()` function.
107
+ - Do not use `${task}`, `${generated}`, or `${model_id}` in `prompt_template`; they are reserved for Arena-Lite.
108
+
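To make step 1 concrete, here is a minimal sketch of what `my_prompt.py` could look like. Only the `ComparisonPromptBase` base class and the `parsed_output(self, response)` contract come from the steps above; the import path, the assumption that `response` is the judge's plain-text output, and the parsing heuristic are illustrative placeholders, not the repository's implementation.

```python
# my_prompt.py -- illustrative sketch only.
# Assumptions: ComparisonPromptBase is importable from the prompts package
# (adjust the import to wherever the base class actually lives), and
# `response` is the judge's plain-text output.
from .base import ComparisonPromptBase  # hypothetical import path


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        """Map the LLM Judge's raw response to a decision token: 'a' or 'b'."""
        lines = [ln.strip().lower() for ln in str(response).splitlines() if ln.strip()]
        verdict = lines[-1] if lines else ""
        if "a" in verdict and "b" not in verdict:
            return "a"
        if "b" in verdict and "a" not in verdict:
            return "b"
        # Fail loudly so malformed judge outputs are easy to spot while debugging.
        raise ValueError(f"Could not parse a decision token from: {verdict!r}")
```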
109
+ #### 2. Register the Prompt in `prompts/__init__.py`
110
+
111
+ - Import your new prompt class:
112
+ ```python
113
+ from .my_prompt import MyPrompt
114
+ ```
115
+ - Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
116
+ ```python
117
+ NAME2PROMPT_CLS = dict(
118
+ # ... other prompts
119
+ my_prompt=MyPrompt(),
120
+ )
121
+ ```
122
+ - Add the new prompt name to the `Literal` type hint for the `promptname` argument in the `load_prompt` function:
123
+ ```python
124
+ def load_prompt(
125
+ promptname: Literal[
126
+ # ... other prompt names
127
+ "my_prompt",
128
+ ],
129
+ # ...
130
+ ):
131
+ ```
132
+
133
+ #### 3. Add the Prompt to `eval_prompt_list.txt`
134
+
135
+ - Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
136
+
137
+ #### 4. (Recommended) Test and Debug
138
+
139
+ - It is highly recommended to debug your prompt to ensure it works as expected.
140
+ - In the `.vscode/launch.json` file, modify the `"VA"` configuration's `args`:
141
+ - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
142
+ - If necessary, update the `"-i", "..."` argument to the path of your test data suitable for the new prompt.
143
+ - Go to the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
144
+ - Find `result.json` inside the output directory you specified after `-o`. It will show every judge prompt used for each match.
145
+
146
+
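For orientation, after the edits described in step 4 the "VA" configuration's `args` would look roughly like the excerpt below (based on the `launch.json` hunk at the top of this commit, with the prompt name swapped for `my_prompt`; point `-i` at whatever test data suits your prompt):

```
"args": [
    "-i",
    "rsc/inputs_for_dbg/dbg_trans_inputs/",
    "-o",
    "DBGOUT",
    "-e",
    "gpt-4.1-mini",
    "-p",
    "my_prompt",
]
```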
147
  ## FAQ
148
+ * I want to apply my custom judge prompt to run Arena-Lite
149
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
150
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
151
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
 
 
 
 
152
 
153
  ## Special Thanks to (contributors)
154
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
 
160
 
161
  ## Citation
162
  If you found our work helpful, consider citing our paper!
 
163
  ```
164
  @misc{son2024varcoarenatournamentapproach,
165
+ title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
166
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
167
  year={2024},
168
  eprint={2411.01281},
README_en.md CHANGED
@@ -1,5 +1,5 @@
1
- # Varco Arena
2
- Varco Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs.
3
 
4
  For more information, the following may help you understand how it works.
5
  * [Paper](https://arxiv.org/abs/2411.01281)
@@ -89,15 +89,66 @@ pre-commit install
89
  bash precommit.sh # black formatter will reformat the codes
90
  ```
91

92
  ## FAQ
93
- * I want to apply my custom judge prompt to run Varco Arena
94
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
95
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
96
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
97
- * I want more fields for my llm outputs jsonl files for tailored use, i.e. want more fields beyond `instruction`, `source`, `generated`.
98
- * It's going to get tricky but let me briefly guide you about this.
99
- * You might have to edit `varco_arena/eval_utils.py`:`async_eval_w_prompt` (this part calls `PROMPT_OBJ.complete_prompt()`)
100
- * And all the related codes will require revision.
101
 
102
  ## Special Thanks to (contributors)
103
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
@@ -111,7 +162,7 @@ bash precommit.sh # black formatter will reformat the codes
111
  If you found our work helpful, consider citing our paper!
112
  ```
113
  @misc{son2024varcoarenatournamentapproach,
114
- title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
115
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
116
  year={2024},
117
  eprint={2411.01281},
 
1
+ # Arena-Lite (former VARCO Arena)
2
+ Arena-Lite runs a tournament among the models being compared for each test set instruction and ranks the models accurately at an affordable price. This is more accurate and cost-effective than computing win rates against reference outputs.
3
 
4
  For more information, the following may help you understand how it works.
5
  * [Paper](https://arxiv.org/abs/2411.01281)
 
89
  bash precommit.sh # black formatter will reformat the codes
90
  ```
91
 
92
+ ### 📝 Adding a Custom Prompt
93
+
94
+ Here's how to add a new evaluation prompt. The process has been simplified recently, as the Judge logic now only relies on the `parsed_output` method.
95
+
96
+ The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` to create your own prompt.
97
+
98
+ #### 1. Create Prompt `.py` and `.yaml` Files
99
+
100
+ - Create files like `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
101
+ - **`my_prompt.py`**:
102
+ - Define a class that inherits from `ComparisonPromptBase`.
103
+ - You **must** implement the `parsed_output(self, response)` method. This function should take the LLM Judge's `response` and return a decision token (e.g., `'a'`, `'b'`) indicating the winner.
104
+ - **`my_prompt.yaml`**:
105
+ - Define necessary elements for your prompt, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
106
+ - The strings in `prompt_template` are processed by `string.Template` and finalized in `eval_utils.py` via the `BasePrompt.complete_prompt()` function.
107
+ - Do not use `${task}` in `prompt_template`. It is a reserved keyword due to the llmbar prompt.
108
+
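To make the `prompt_template` mechanics concrete, the snippet below shows how a template string like the one you would put in `my_prompt.yaml` gets filled in with `string.Template`. The template text and the substitution fields (`instruction`, `generated_a`, `generated_b`) are illustrative assumptions; only the `${...}` syntax and the reserved `${task}` placeholder come from the guide above, and the repository's `complete_prompt()` may differ in detail.

```python
# Sketch of how a prompt_template string is resolved with string.Template.
# Field names other than the reserved ${task} are illustrative placeholders.
from string import Template

prompt_template = Template(
    "You are a strict judge for the task below.\n"
    "Instruction: ${instruction}\n"
    "Response A: ${generated_a}\n"
    "Response B: ${generated_b}\n"
    "Answer with a single token: a or b."
)

filled = prompt_template.substitute(
    instruction="Summarize the article in one sentence.",
    generated_a="(model A output)",
    generated_b="(model B output)",
)
print(filled)
```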
109
+ #### 2. Register the Prompt in `prompts/__init__.py`
110
+
111
+ - Import your new prompt class:
112
+ ```python
113
+ from .my_prompt import MyPrompt
114
+ ```
115
+ - Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
116
+ ```python
117
+ NAME2PROMPT_CLS = dict(
118
+ # ... other prompts
119
+ my_prompt=MyPrompt(),
120
+ )
121
+ ```
122
+ - Add the new prompt name to the `Literal` type hint for the `promptname` argument in the `load_prompt` function:
123
+ ```python
124
+ def load_prompt(
125
+ promptname: Literal[
126
+ # ... other prompt names
127
+ "my_prompt",
128
+ ],
129
+ # ...
130
+ ):
131
+ ```
132
+
133
+ #### 3. Add the Prompt to `eval_prompt_list.txt`
134
+
135
+ - Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
136
+
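After this step the file simply gains one more line; with the prompts shipped in this commit it would read:

```
llmbar
llmbar_brief
translation_pair
rag_pair_kr
translation_fortunecookie
my_prompt
```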
137
+ #### 4. (Recommended) Test and Debug
138
+
139
+ - It is highly recommended to debug your prompt to ensure it works as expected.
140
+ - In the `.vscode/launch.json` file, modify the `"VA"` configuration's `args`:
141
+ - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
142
+ - If necessary, update the `"-i", "..."` argument to the path of your test data suitable for the new prompt.
143
+ - Go to the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
144
+ - Find `result.json` inside the output directory you specified after `-o`. It will show every judge prompt used for each match.
145
+
146
+
147
  ## FAQ
148
+ * I want to apply my custom judge prompt to run Arena-Lite
149
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
150
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
151
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
 
 
 
 
152
 
153
  ## Special Thanks to (contributors)
154
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
 
162
  If you found our work helpful, consider citing our paper!
163
  ```
164
  @misc{son2024varcoarenatournamentapproach,
165
+ title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
166
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
167
  year={2024},
168
  eprint={2411.01281},
README_kr.md CHANGED
@@ -1,5 +1,5 @@
1
- # Varco Arena
2
- ๋ฐ”๋ฅด์ฝ” ์•„๋ ˆ๋‚˜๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ๋“ค์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ •ํ™•ํ•˜๊ฒŒ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊น๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ์กฐ๊ธˆ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.
3
 
4
  ๋” ์ž์„ธํ•œ ๋‚ด์šฉ์— ๋Œ€ํ•ด์„œ๋Š” ์•„๋ž˜์˜ ๋งํฌ๋ฅผ ์ฐธ์กฐํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
5
  * [๋…ผ๋ฌธ](https://arxiv.org/abs/2411.01281)
@@ -91,16 +91,66 @@ pre-commit install
91
  bash precommit.sh # ์ด๊ฒŒ ์ฝ”๋“œ๋“ค์„ ๋‹ค ๋ฆฌํฌ๋งทํ•ด์ค„๊ฑฐ์ž„
92
  ```
93

94
 
95
  ๋ฌธ์˜: ์†์„ ์ผ
96
  * ๋‚ด๊ฐ€ ๋งŒ๋“  ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”
97
  * [`./varco_arena/prompts/`](./varco_arena_core/prompts/__init__.py) ์—์„  ๊ฐ์ข… ํ”„๋กฌํ”„ํŠธ ํด๋ž˜์Šค ๋ฐ `yaml` ํŒŒ์ผ ํ˜•ํƒœ๋กœ ์ •์˜๋œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ํ”„๋ฆฌ์…‹์„ ์ฐธ์กฐํ•˜์—ฌ ์ž‘์„ฑํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
98
  * ํ…Œ์ŠคํŠธ์…‹ ๋ณ„๋กœ ๋‹ค๋ฅธ ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š” (e.g. ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”)
99
  * ์œ„ ๊ฑธ์–ด๋“œ๋ฆฐ ๋งํฌ์˜ `load_prompt` ๋ฅผ ํ†ตํ•ด์„œ `promptname` + `task` ํ˜•ํƒœ๋กœ [`./varco_arena_core/manager.py:async_run`](./varco_arena_core/manager.py) ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋กœ๋“œ๋˜๋„๋ก ํ•ด๋†“์•˜์Šต๋‹ˆ๋‹ค.
100
- * ์ œ๊ฐ€ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์€ ์ž…๋ ฅํŒŒ์ผ์— `instruction`, `source`, `generated` ์ด์™ธ์— ๋‹ค๋ฅธ ํ•„๋“œ๋ฅผ ์ถ”๊ฐ€ํ•ด์„œ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”.
101
- * ์กฐ๊ธˆ ๋ณต์žกํ•ด์ง€๋Š”๋ฐ ๋‹ค์Œ ๋ถ€๋ถ„์„ ๊ณ ์ณ์ฃผ์„ธ์š”
102
- * `varco_arena/eval_utils.py` ์—์„œ `async_eval_w_prompt` ๋ถ€๋ถ„์„ ์†๋ด์•ผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (์—ฌ๊ธฐ์—์„œ PROMPT_OBJ.complete_prompt()์„ ํ˜ธ์ถœํ•จ)
103
- * ๊ทธ ์™ธ ์—ฐ๊ด€๋œ ๋ถ€๋ถ„์€ ํƒ€๊ณ ํƒ€๊ณ  ๊ณ ์ณ์ฃผ์…”์•ผ...
104
 
105
  ## Special Thanks to (contributors)
106
  - ์ด๋ฏผํ˜ธ (@๋Œ€ํ™”๋ชจ๋ธํŒ€, NCSOFT) [github](https://github.com/minolee/)
@@ -113,7 +163,7 @@ bash precommit.sh # ์ด๊ฒŒ ์ฝ”๋“œ๋“ค์„ ๋‹ค ๋ฆฌํฌ๋งทํ•ด์ค„๊ฑฐ์ž„
113
  ์ €ํฌ ์ž‘์—…๋ฌผ์ด ๋„์›€์ด ๋˜์—ˆ๋‹ค๋ฉด ์ €ํฌ๋„ ๋„์›€์„ ๋ฐ›์•„๋ณผ ์ˆ˜ ์žˆ์„๊นŒ์š”?๐Ÿ˜‰
114
  ```
115
  @misc{son2024varcoarenatournamentapproach,
116
- title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
117
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
118
  year={2024},
119
  eprint={2411.01281},
 
1
+ # Arena-Lite (구 VARCO Arena)
2
+ ์•„๋ ˆ๋‚˜-๋ผ์ดํŠธ๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ๋“ค์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ •ํ™•ํ•˜๊ฒŒ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊น๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ์กฐ๊ธˆ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.
3
 
4
  ๋” ์ž์„ธํ•œ ๋‚ด์šฉ์— ๋Œ€ํ•ด์„œ๋Š” ์•„๋ž˜์˜ ๋งํฌ๋ฅผ ์ฐธ์กฐํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
5
  * [๋…ผ๋ฌธ](https://arxiv.org/abs/2411.01281)
 
91
  bash precommit.sh # ์ด๊ฒŒ ์ฝ”๋“œ๋“ค์„ ๋‹ค ๋ฆฌํฌ๋งทํ•ด์ค„๊ฑฐ์ž„
92
  ```
93
 
94
+ ### ๐Ÿ“ ์ปค์Šคํ…€ ํ”„๋กฌํ”„ํŠธ ์ถ”๊ฐ€ํ•˜๊ธฐ
95
+
96
+ ์ƒˆ๋กœ์šด ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์ตœ๊ทผ Judge ๋กœ์ง์ด `parsed_output` ๋ฉ”์†Œ๋“œ๋งŒ ์‚ฌ์šฉํ•˜๋„๋ก ๊ฐ„์†Œํ™”๋˜์–ด ์ด์ „๋ณด๋‹ค ์‰ฝ๊ฒŒ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
97
+
98
+ ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ `llmbar_brief.py`์™€ `llmbar_brief.yaml` ํŒŒ์ผ์„ ๋ณต์‚ฌํ•˜์—ฌ ์ž์‹ ๋งŒ์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
99
+
100
+ #### 1. ํ”„๋กฌํ”„ํŠธ `.py` ๋ฐ `.yaml` ํŒŒ์ผ ์ƒ์„ฑ
101
+
102
+ - `varco_arena/varco_arena_core/prompts/` ๊ฒฝ๋กœ์— `my_prompt.py`์™€ `my_prompt.yaml`์ฒ˜๋Ÿผ ํŒŒ์ผ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
103
+ - **`my_prompt.py`**:
104
+ - `ComparisonPromptBase`๋ฅผ ์ƒ์†๋ฐ›๋Š” ํด๋ž˜์Šค๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
105
+ - `parsed_output(self, response)` ๋ฉ”์†Œ๋“œ๋ฅผ ๋ฐ˜๋“œ์‹œ ๊ตฌํ˜„ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” LLM Judge์˜ ์‘๋‹ต(`response`)์„ ๋ฐ›์•„, ์Šน์ž๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒฐ์ • ํ† ํฐ(์˜ˆ: `'a'`, `'b'`)์„ ๋ฐ˜ํ™˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
106
+ - **`my_prompt.yaml`**:
107
+ - `sampling_parameters`, `decision_tokens`, `prompt_template` ๋“ฑ ํ”„๋กฌํ”„ํŠธ์— ํ•„์š”ํ•œ ์š”์†Œ๋“ค์„ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
108
+ - `prompt_template` ์— ๋“ค์–ด๊ฐ€๋Š” ๋ฌธ์ž์—ด์€ `string.Template`์œผ๋กœ ์ฒ˜๋ฆฌ๋˜๋ฉฐ `BasePrompt.complete_prompt()` ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด `eval_utils.py`์—์„œ ์ตœ์ข… ์™„์„ฑ๋ฉ๋‹ˆ๋‹ค.
109
+ - `${task}, ${generated}, ${model_id}`๋ฅผ `prompt_template`์— ์‚ฌ์šฉํ•˜์ง€ ๋งˆ์„ธ์š”. ์˜ˆ์•ฝ๋œ ํ‚ค์›Œ๋“œ๋“ค์ž…๋‹ˆ๋‹ค.
110
+
111
+ #### 2. `prompts/__init__.py`์— ํ”„๋กฌํ”„ํŠธ ๋“ฑ๋ก
112
+
113
+ - ์ƒ์„ฑํ•œ ํ”„๋กฌํ”„ํŠธ ํด๋ž˜์Šค๋ฅผ `import` ํ•ฉ๋‹ˆ๋‹ค.
114
+ ```python
115
+ from .my_prompt import MyPrompt
116
+ ```
117
+ - `NAME2PROMPT_CLS` ๋”•์…”๋„ˆ๋ฆฌ์— ์ƒˆ ํ”„๋กฌํ”„ํŠธ ์ด๋ฆ„๊ณผ ํด๋ž˜์Šค ๊ฐ์ฒด๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
118
+ ```python
119
+ NAME2PROMPT_CLS = dict(
120
+ # ... ๊ธฐ์กด ํ”„๋กฌํ”„ํŠธ๋“ค
121
+ my_prompt=MyPrompt(),
122
+ )
123
+ ```
124
+ - `load_prompt` ํ•จ์ˆ˜์˜ `promptname` ์ธ์ž์˜ `Literal` ํƒ€์ž… ํžŒํŠธ์— ์ƒˆ ํ”„๋กฌํ”„ํŠธ ์ด๋ฆ„์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
125
+ ```python
126
+ def load_prompt(
127
+ promptname: Literal[
128
+ # ... ๊ธฐ์กด ํ”„๋กฌํ”„ํŠธ ์ด๋ฆ„๋“ค
129
+ "my_prompt",
130
+ ],
131
+ # ...
132
+ ):
133
+ ```
134
+
135
+ #### 3. `eval_prompt_list.txt`์— ํ”„๋กฌํ”„ํŠธ ์ถ”๊ฐ€
136
+
137
+ - ํ”„๋กœ์ ํŠธ ๋ฃจํŠธ์˜ `eval_prompt_list.txt` ํŒŒ์ผ์„ ์—ด๊ณ , ์ƒˆ ํ”„๋กฌํ”„ํŠธ์˜ ์ด๋ฆ„(`my_prompt`)์„ ์ƒˆ ์ค„์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
138
+
139
+ #### 4. (๊ถŒ์žฅ) ํ…Œ์ŠคํŠธ ๋ฐ ๋””๋ฒ„๊น…
140
+
141
+ - ํ”„๋กฌํ”„ํŠธ๊ฐ€ ์˜๋„๋Œ€๋กœ ์ž‘๋™ํ•˜๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด ๋””๋ฒ„๊น…์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.
142
+ - `.vscode/launch.json` ํŒŒ์ผ์˜ `"VA"` ์„ค์ •์—์„œ `args`๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค.
143
+ - `"-p", "translation_fortunecookie"` ๋ถ€๋ถ„์„ `"-p", "my_prompt"`๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.
144
+ - ํ•„์š”์‹œ `"-i", "..."` ๋ถ€๋ถ„์— ์ƒˆ ํ”„๋กฌํ”„ํŠธ์— ์ ํ•ฉํ•œ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๊ฒฝ๋กœ๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.
145
+ - VS Code์˜ `Run and Debug` ํƒญ(Ctrl+Shift+D)์œผ๋กœ ์ด๋™ํ•˜์—ฌ "VA" ์„ค์ •์„ ์„ ํƒํ•˜๊ณ  F5 ํ‚ค๋ฅผ ๋ˆŒ๋Ÿฌ ๋””๋ฒ„๊ฑฐ๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.
146
+ - `-o` ๋’ค์— ๋ช…์‹œํ•œ output ๋””๋ ‰ํ† ๋ฆฌ ์•ˆ์—์„œ `result.json` ๋ฅผ ์ฐพ์•„์„œ ์›ํ•˜๋Š”๋Œ€๋กœ ๋™์ž‘ํ–ˆ๋Š”์ง€ ํ™•์ธํ•ด๋ณด์„ธ์š”. ๋ชจ๋“  judge์™€ ๋งค์น˜์— ํ™œ์šฉ๋œ ํ”„๋กฌํ”„ํŠธ ์ •๋ณด๊ฐ€ ๋‹ด๊ฒจ์žˆ์Šต๋‹ˆ๋‹ค.
147
 
148
  ๋ฌธ์˜: ์†์„ ์ผ
149
  * ๋‚ด๊ฐ€ ๋งŒ๋“  ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”
150
  * [`./varco_arena/prompts/`](./varco_arena_core/prompts/__init__.py) ์—์„  ๊ฐ์ข… ํ”„๋กฌํ”„ํŠธ ํด๋ž˜์Šค ๋ฐ `yaml` ํŒŒ์ผ ํ˜•ํƒœ๋กœ ์ •์˜๋œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ํ”„๋ฆฌ์…‹์„ ์ฐธ์กฐํ•˜์—ฌ ์ž‘์„ฑํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
151
  * ํ…Œ์ŠคํŠธ์…‹ ๋ณ„๋กœ ๋‹ค๋ฅธ ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š” (e.g. ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”)
152
  * ์œ„ ๊ฑธ์–ด๋“œ๋ฆฐ ๋งํฌ์˜ `load_prompt` ๋ฅผ ํ†ตํ•ด์„œ `promptname` + `task` ํ˜•ํƒœ๋กœ [`./varco_arena_core/manager.py:async_run`](./varco_arena_core/manager.py) ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋กœ๋“œ๋˜๋„๋ก ํ•ด๋†“์•˜์Šต๋‹ˆ๋‹ค.
153
+
 
 
 
154
 
155
  ## Special Thanks to (contributors)
156
  - ์ด๋ฏผํ˜ธ (@๋Œ€ํ™”๋ชจ๋ธํŒ€, NCSOFT) [github](https://github.com/minolee/)
 
163
  ์ €ํฌ ์ž‘์—…๋ฌผ์ด ๋„์›€์ด ๋˜์—ˆ๋‹ค๋ฉด ์ €ํฌ๋„ ๋„์›€์„ ๋ฐ›์•„๋ณผ ์ˆ˜ ์žˆ์„๊นŒ์š”?๐Ÿ˜‰
164
  ```
165
  @misc{son2024varcoarenatournamentapproach,
166
+ title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
167
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
168
  year={2024},
169
  eprint={2411.01281},
app.py CHANGED
@@ -253,18 +253,18 @@ def main():
253
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
254
  )
255
 
256
- st.title("⚔️ VARCO ARENA ⚔️")
257
  if st.session_state.korean:
258
  st.write(
259
- """**๋ฐ”๋ฅด์ฝ” ์•„๋ ˆ๋‚˜๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
260
 
261
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
262
  )
263
  else:
264
  st.write(
265
- """**VARCO Arena is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
266
 
267
- VARCO Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
268
  )
269
 
270
  st.divider()
@@ -389,9 +389,9 @@ def main():
389
  # Form for actual run
390
  with st.form("run_arena_form"):
391
  if st.session_state.korean:
392
- st.write("### 3. Varco Arena 구동하기")
393
  else:
394
- st.write("### 3. Run Varco Arena")
395
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
396
 
397
  # demo exp name fixated
@@ -434,12 +434,12 @@ def main():
434
  )
435
  if return_code:
436
  st.error(
437
- "❌ RuntimeError: An error occurred during Varco Arena run. Check the file and **restart from file upload!**"
438
  )
439
  purge_user_sub_data(data_path_to_purge=VA_ROOT)
440
 
441
  else:
442
- st.success("✅ Varco Arena run completed successfully")
443
  st.session_state.result_file_path = list(
444
  result_file_path.glob("**/result.json")
445
  )[-1]
 
253
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
254
  )
255
 
256
+ st.title("⚔️ Arena-Lite ⚔️")
257
  if st.session_state.korean:
258
  st.write(
259
+ """**Arena-Lite๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
260
 
261
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
262
  )
263
  else:
264
  st.write(
265
+ """**Arena-Lite is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
266
 
267
+ Arena-Lite conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
268
  )
269
 
270
  st.divider()
 
389
  # Form for actual run
390
  with st.form("run_arena_form"):
391
  if st.session_state.korean:
392
+ st.write("### 3. Arena-Lite 구동하기")
393
  else:
394
+ st.write("### 3. Run Arena-Lite")
395
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
396
 
397
  # demo exp name fixated
 
434
  )
435
  if return_code:
436
  st.error(
437
+ "❌ RuntimeError: An error occurred during Arena-Lite run. Check the file and **restart from file upload!**"
438
  )
439
  purge_user_sub_data(data_path_to_purge=VA_ROOT)
440
 
441
  else:
442
+ st.success("✅ Arena-Lite run completed successfully")
443
  st.session_state.result_file_path = list(
444
  result_file_path.glob("**/result.json")
445
  )[-1]
eval_prompt_list.txt CHANGED
@@ -1,4 +1,5 @@
1
  llmbar
 
2
  translation_pair
3
  rag_pair_kr
4
  translation_fortunecookie
 
1
  llmbar
2
+ llmbar_brief
3
  translation_pair
4
  rag_pair_kr
5
  translation_fortunecookie
guide_mds/input_jsonls_en.md CHANGED
@@ -1,37 +1,38 @@
1
- #### \[EN\] Upload guide (`jsonl`)
2
- **Basic Requirements**
3
- * Upload one `jsonl` file per model (e.g., five files to compare five LLMs)
4
- * ⚠️ Important: All `jsonl` files must have the same number of rows
5
- * ⚠️ Important: The `model_id` field must be unique within and across all files
6
-
7
- **Required Fields**
8
- * Per Model Fields
9
- * `model_id`: Unique identifier for the model (recommendation: keep it short)
10
- * `generated`: The LLM's response to the test instruction
11
-
12
- * Required only for Translation (`translation_pair` prompt need those. See `streamlit_app_local/user_submit/mt/llama5.jsonl`)
13
- * `source_lang`: input language (e.g. Korean, KR, kor, ...)
14
- * `target_lang`: output language (e.g. English, EN, ...)
15
-
16
- * Common Fields (Must be identical across all files)
17
- * `instruction`: The input prompt or test instruction given to the model
18
- * `task`: Category label used to group results (useful when using different evaluation prompts per task)
19
-
20
- **Example Format**
21
  ```python
22
  # model1.jsonl
23
- {"model_id": "model1", "task": "directions", "instruction": "Where should I go?", "generated": "Over there"}
24
- {"model_id": "model1", "task": "arithmetic", "instruction": "1+1", "generated": "2"}
25
 
26
- # model2.jsonl
27
- {"model_id": "model2", "task": "directions", "instruction": "Where should I go?", "generated": "Head north"}
28
- {"model_id": "model2", "task": "arithmetic", "instruction": "1+1", "generated": "3"}
29
  ...
30
  ..
31
- .
32
  ```
33
- **Use Case Example**
34
- If you want to compare different prompting strategies for the same model:
35
- * Use the same `instruction` across files (using unified test scenarios).
36
- * `generated` responses of each prompting strategy will vary across the files.
37
- * Use descriptive `model_id` values like "prompt1", "prompt2", etc.
 
 
 
1
+ #### \[EN\] Guide for Input .jsonl Files
2
+ If you have five models to compare, upload five .jsonl files.
3
+ * 💥 All `.jsonl` files must have the same number of rows.
4
+ * 💥 The `model_id` field must be different for each file and unique within each file.
5
+ * 💥 Each `.jsonl` file should have its own `generated` and `model_id`, while `instruction` and `task` must be identical across files.
6
+
7
+ **Required `.jsonl` Fields**
8
+ * Reserved Fields (Mandatory)
9
+ * `model_id`: The name of the model being evaluated. (Recommended to be short)
10
+ * `instruction`: The instruction given to the model. This corresponds to the test set prompt (not the evaluation prompt).
11
+ * `generated`: Enter the response generated by the model for the test set instruction.
12
+ * `task`: Used to group and display overall results as a subset. Can be utilized when you want to use different evaluation prompts per row.
13
+ * Additional
14
+ * Depending on the evaluation prompt you use, you can utilize other additional fields. You can freely add them to your `.jsonl` files, avoiding the reserved keywords mentioned above.
+ * Example: for the `translation_pair.yaml` and `translation_fortunecookie.yaml` prompts, the `source_lang` and `target_lang` fields are read from the `.jsonl` and utilized.
18
+
19
+ For example, when evaluating with the `translation_pair` prompt, each .jsonl file looks like this:
 
20
  ```python
21
  # model1.jsonl
22
+ {"model_id": "모델1", "task": "영한", "instruction": "어디로 가야하오", "generated": "Where should I go", "source_lang": "Korean", "target_lang": "English"}
23
+ {"model_id": "모델1", "task": "한영", "instruction": "1+1?", "generated": "1+1?", "source_lang": "English", "target_lang": "Korean"}
24

25
+ # model2.jsonl - same `instruction` as model1.jsonl; `generated` and `model_id` differ!
26
+ {"model_id": "모델2", "task": "영한", "instruction": "어디로 가야하오", "generated": "글쎄다", "source_lang": "Korean", "target_lang": "English"}
27
+ {"model_id": "모델2", "task": "한영", "instruction": "1+1?", "generated": "2", "source_lang": "English", "target_lang": "Korean"}
28
  ...
29
  ..
30
+
31
  ```
32
+ On the other hand, when evaluating with the `llmbar` prompt, translation-specific fields such as `source_lang` and `target_lang` are not used, so you do not need to add them to your `.jsonl`.
33
+
34
+
35
+
36
+
37
+
38
+
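The requirements above (same number of rows, one `model_id` per file that differs across files, identical `instruction`/`task` sequences) are easy to check before uploading. Below is a small, self-contained sanity-check sketch; the file paths are placeholders and the script is not part of this repository.

```python
# Minimal pre-upload sanity check for Arena-Lite input .jsonl files.
import json

files = ["model1.jsonl", "model2.jsonl"]  # placeholder paths, one file per model

def load_rows(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

all_rows = {p: load_rows(p) for p in files}

# 1) same number of rows in every file
lengths = {p: len(rows) for p, rows in all_rows.items()}
assert len(set(lengths.values())) == 1, f"row counts differ: {lengths}"

# 2) exactly one model_id per file (one file per model), distinct across files
model_ids = set()
for p, rows in all_rows.items():
    ids = {r["model_id"] for r in rows}
    assert len(ids) == 1, f"{p}: expected a single model_id per file, got {ids}"
    model_ids |= ids
assert len(model_ids) == len(files), "model_id must differ across files"

# 3) instruction/task must line up row-by-row across files
keys = [[(r["instruction"], r["task"]) for r in rows] for rows in all_rows.values()]
assert all(k == keys[0] for k in keys), "instruction/task rows must match across files"

print("input files look consistent")
```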
guide_mds/input_jsonls_kr.md CHANGED
@@ -2,33 +2,30 @@
2
  ๋น„๊ตํ•  ๋ชจ๋ธ์ด ๋‹ค์„ฏ ๊ฐœ๋ผ๋ฉด ๋‹ค์„ฏ ๊ฐœ์˜ .jsonl ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•˜์„ธ์š”.
3
  * ๐Ÿ’ฅ๋ชจ๋“  jsonl ์€ ๊ฐ™์€ ์ˆ˜์˜ ํ–‰์„ ๊ฐ€์ ธ์•ผํ•ฉ๋‹ˆ๋‹ค.
4
  * ๐Ÿ’ฅ`model_id` ํ•„๋“œ๋Š” ํŒŒ์ผ๋งˆ๋‹ค ๋‹ฌ๋ผ์•ผํ•˜๋ฉฐ ํŒŒ์ผ ๋‚ด์—์„œ๋Š” ์œ ์ผํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.
 
 
5
 
6
  **jsonl ํ•„์ˆ˜ ํ•„๋“œ**
7
- * ๊ฐœ๋ณ„
8
  * `model_id`: ํ‰๊ฐ€๋ฐ›๋Š” ๋ชจ๋ธ์˜ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค. (์งง๊ฒŒ ์“ฐ๋Š” ๊ฒƒ ์ถ”์ฒœ)
 
9
  * `generated`: ๋ชจ๋ธ์ด testset instruction ์— ์ƒ์„ฑํ•œ ์‘๋‹ต์„ ๋„ฃ์œผ์„ธ์š”.
10
-
11
- * ๋ฒˆ์—ญํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ ์‚ฌ์šฉ์‹œ (`translation_pair`. `streamlit_app_local/user_submit/mt/llama5.jsonl` ์—์„œ ์˜ˆ์‹œ ๋ณผ ์ˆ˜ ์žˆ์Œ)
12
- * `source_lang`: input language (e.g. Korean, KR, kor, ...)
13
- * `target_lang`: output language (e.g. English, EN, ...)
14
-
15
- * ๊ณตํ†ต ๋ถ€๋ถ„ (**๋ชจ๋“  ํŒŒ์ผ์— ๋Œ€ํ•ด ๊ฐ™์•„์•ผ ํ•จ**)
16
- * `instruction`: ๋ชจ๋ธ์— ์ง‘์–ด๋„ฃ๋Š” `testset instruction` ํ˜น์€ `input`์— ํ•ด๋‹นํ•˜๋Š” ๋ฌด์–ธ๊ฐ€์ž…๋‹ˆ๋‹ค.
17
  * `task`: ์ „์ฒด ๊ฒฐ๊ณผ๋ฅผ subset์œผ๋กœ ๊ทธ๋ฃน์ง€์–ด์„œ ๋ณด์—ฌ์ค„ ๋•Œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. `evaluation prompt`๋ฅผ ํ–‰๋ณ„๋กœ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ๋•Œ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
 
 
 
18
 
19
-
20
- ๊ฐ jsonl ํŒŒ์ผ์€ ์•„๋ž˜์ฒ˜๋Ÿผ ์ƒ๊ฒผ์Šต๋‹ˆ๋‹ค.
21
  ```python
22
  # model1.jsonl
23
- {"model_id": "๋ชจ๋ธ1", "task": "๊ธธ ๋ฌป๊ธฐ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "์ €๊ธฐ๋กœ์š”"}
24
- {"model_id": "๋ชจ๋ธ1", "task": "์‚ฐ์ˆ˜", "instruction": "1+1", "generated": "2"} # ๊ธธ ๋ฌป๊ธฐ์™€ ์‚ฐ์ˆ˜์˜ ๊ฒฝ์šฐ ๋‹ค๋ฅธ ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ์ˆ˜ ์žˆ๊ฒ ์ฃ ?
25
 
26
  # model2.jsonl -* model1.jsonl๊ณผ `instruction`์€ ๊ฐ™๊ณ  `generated`, `model_id` ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค!
27
- {"model_id": "๋ชจ๋ธ2", "task": "๊ธธ ๋ฌป๊ธฐ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "ํ•˜์ด"}
28
- {"model_id": "๋ชจ๋ธ2", "task": "์‚ฐ์ˆ˜", "instruction": "1+1", "generated": "3"}
29
-
30
  ...
31
  ..
32
- ```
33
 
34
- ์˜ˆ๋ฅผ ๋“ค์–ด, ํ•œ๊ฐ€์ง€ ๋ชจ๋ธ์— ๋Œ€ํ•ด ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŒ…์„ ์‹œ๋„ํ•˜์—ฌ ๋‹ค๋ฅธ ์ƒ์„ฑ๋ฌธ์„ ์–ป์—ˆ๊ณ  ์ด๋ฅผ ๋น„๊ตํ•˜๊ณ  ์‹ถ์€ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด๋ด…์‹œ๋‹ค. ์ด ๋•Œ ํ‰๊ฐ€๋ฐ›์„ testset์€ ๊ฐ™์œผ๋ฏ€๋กœ `instruction`์€ ๋ชจ๋‘ ๊ฐ™๊ณ  ํ”„๋กฌํ”„ํŒ…์— ๋”ฐ๋ผ `generated`๋Š” ๋‹ฌ๋ผ์ง€๊ฒ ์ฃ ? `model_id` ๋Š” `"prompt1"`, `"prompt2"` ๋“ฑ ์ทจํ–ฅ์— ๋งž๊ฒŒ ์ ์–ด์ฃผ์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
 
 
2
  ๋น„๊ตํ•  ๋ชจ๋ธ์ด ๋‹ค์„ฏ ๊ฐœ๋ผ๋ฉด ๋‹ค์„ฏ ๊ฐœ์˜ .jsonl ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•˜์„ธ์š”.
3
  * ๐Ÿ’ฅ๋ชจ๋“  jsonl ์€ ๊ฐ™์€ ์ˆ˜์˜ ํ–‰์„ ๊ฐ€์ ธ์•ผํ•ฉ๋‹ˆ๋‹ค.
4
  * ๐Ÿ’ฅ`model_id` ํ•„๋“œ๋Š” ํŒŒ์ผ๋งˆ๋‹ค ๋‹ฌ๋ผ์•ผํ•˜๋ฉฐ ํŒŒ์ผ ๋‚ด์—์„œ๋Š” ์œ ์ผํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.
5
+ * ๐Ÿ’ฅ๊ฐ jsonl ํŒŒ์ผ์ด ์„œ๋กœ ๋‹ค๋ฅธ generated ๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค. `instruction`, `model_id`, `task` ๋Š” ๊ฐ™์•„์•ผํ•ฉ๋‹ˆ๋‹ค.
6
+
7
 
8
  **jsonl ํ•„์ˆ˜ ํ•„๋“œ**
9
+ * ์˜ˆ์•ฝ๋œ ํ•„๋“œ (ํ•„์ˆ˜)
10
  * `model_id`: ํ‰๊ฐ€๋ฐ›๋Š” ๋ชจ๋ธ์˜ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค. (์งง๊ฒŒ ์“ฐ๋Š” ๊ฒƒ ์ถ”์ฒœ)
11
+ * `instruction`: ๋ชจ๋ธ์ด ๋ฐ›์€ ์ง€์‹œ๋ฌธ์ž…๋‹ˆ๋‹ค. ํ…Œ์ŠคํŠธ์…‹ ํ”„๋กฌํ”„ํŠธ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค (ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ ์•„๋‹˜)
12
  * `generated`: ๋ชจ๋ธ์ด testset instruction ์— ์ƒ์„ฑํ•œ ์‘๋‹ต์„ ๋„ฃ์œผ์„ธ์š”.
 
 
 
 
 
 
 
13
  * `task`: ์ „์ฒด ๊ฒฐ๊ณผ๋ฅผ subset์œผ๋กœ ๊ทธ๋ฃน์ง€์–ด์„œ ๋ณด์—ฌ์ค„ ๋•Œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. `evaluation prompt`๋ฅผ ํ–‰๋ณ„๋กœ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ๋•Œ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
14
+ * ์ถ”๊ฐ€
15
+ * ๋‹น์‹ ์ด ์‚ฌ์šฉํ•˜๋Š” ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ์— ๋”ฐ๋ผ์„œ ์ถ”๊ฐ€๋กœ ๋‹ค๋ฅธ ํ•„๋“œ๋“ค์„ ๋” ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์œ„์˜ ํ‚ค์›Œ๋“œ๋“ค์„ ํ”ผํ•ด์„œ ์ž์œ ๋กญ๊ฒŒ jsonl์— ์ถ”๊ฐ€ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
16
+ * ์˜ˆ์‹œ: translation_pair.yaml, translation_fortunecookie.yaml ํ”„๋กฌํ”„ํŠธ์˜ ๊ฒฝ์šฐ๋Š” `source_lang`, `target_lang` ํ•„๋“œ๋ฅผ jsonl ์—์„œ ์ฝ์–ด์„œ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.
17
 
18
+ ์˜ˆ๋ฅผ๋“ค์–ด translation_pair ํ”„๋กฌํ”„ํŠธ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ ๊ฐ jsonl ํŒŒ์ผ์€ ์•„๋ž˜์ฒ˜๋Ÿผ ์ƒ๊ฒผ์Šต๋‹ˆ๋‹ค.
 
19
  ```python
20
  # model1.jsonl
21
+ {"model_id": "๋ชจ๋ธ1", "task": "์˜ํ•œ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "Where should I go", "source_lang": "Korean", "target_lang": "English"}
22
+ {"model_id": "๋ชจ๋ธ1", "task": "ํ•œ์˜", "instruction": "1+1?", "generated": "1+1?", "source_lang": "English", "target_lang": "Korean"}
23
 
24
  # model2.jsonl -* model1.jsonl๊ณผ `instruction`์€ ๊ฐ™๊ณ  `generated`, `model_id` ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค!
25
+ {"model_id": "๋ชจ๋ธ2", "task": "์˜ํ•œ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "๊ธ€์Ž„๋‹ค", "source_lang": "Korean", "target_lang": "English"}
26
+ {"model_id": "๋ชจ๋ธ2", "task": "ํ•œ์˜", "instruction": "1+1?", "generated": "2", "source_lang": "English", "target_lang": "Korean"}
 
27
  ...
28
  ..
 
29
 
30
+ ```
31
+ ๋ฐ˜๋ฉด `llmbar` ํ”„๋กฌํ”„ํŠธ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ, ๋ฒˆ์—ญํ‰๊ฐ€์ฒ˜๋Ÿผ `source_lang`, `target_lang` ํ•„๋“œ๊ฐ€ ์‚ฌ์šฉ๋˜์ง€ ์•Š์œผ๋ฉฐ ๋‹น์—ฐํžˆ jsonl์—๋„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š์œผ์…”๋„ ๋ฉ๋‹ˆ๋‹ค.
modules/nav.py CHANGED
@@ -24,7 +24,7 @@ def Navbar(sidebar_placeholder, toggle_hashstr: str = ""):
24
 
25
  st.page_link(
26
  "app.py",
27
- label="Varco Arena 구동" if st.session_state.korean else "Run VARCO Arena",
28
  icon="๐Ÿ”ฅ",
29
  )
30
  st.page_link(
 
24
 
25
  st.page_link(
26
  "app.py",
27
+ label="Arena-Lite 구동" if st.session_state.korean else "Run Arena-Lite",
28
  icon="๐Ÿ”ฅ",
29
  )
30
  st.page_link(
pages/brief_intro.py CHANGED
@@ -23,7 +23,7 @@ else:
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
- | |Current Practice|Varco Arena|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
@@ -32,9 +32,9 @@ st.markdown(
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
- "Varco Arena๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
- "Varco Arena takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
 
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
+ | |Current Practice|Arena-Lite|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
 
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
+ "Arena-Lite๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
+ "Arena-Lite takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
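
As a quick sanity check of the comparison table in this hunk, take $$n_{\text{model}} = 5$$ models and $$|X| = 100$$ test prompts, and read the logarithm as base 2 (the table does not state the base):

$$
n_{\text{model}}\cdot|X| = 500 \ \text{matches (current practice)}, \qquad (n_{\text{model}}-1)\cdot|X| = 400 \ \text{matches (Arena-Lite)},
$$

$$
\text{matches per LLM (Arena-Lite)} \in \left[\,|X|,\ |X|\log_2 n_{\text{model}}\,\right] = \left[100,\ \approx 232\right].
$$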
pages/see_results.py CHANGED
@@ -60,9 +60,9 @@ def main():
60
 
61
  if result_select is None:
62
  if st.session_state.korean:
63
- st.markdown("결과를 확인하려면 먼저 **🔥VARCO Arena를 구동**하셔야 합니다")
64
  else:
65
- st.markdown("You should **🔥Run VARCO Arena** first to see results")
66
  st.image("streamlit_app_local/page_result_1.png")
67
  st.image("streamlit_app_local/page_result_2.png")
68
  st.image("streamlit_app_local/page_result_3.png")
@@ -334,18 +334,18 @@ def main():
334
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
335
  st.info(
336
  """
337
- Varco Arena์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
338
  * position bias (์™ผ์ชฝ)
339
  * length bias (์˜ค๋ฅธ์ชฝ)
340
 
341
- ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Varco Arena๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
342
  if st.session_state.korean
343
  else """
344
- In Varco Arena, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
345
  * position bias (left)
346
  * length bias (right)
347
 
348
- To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Varco Arena for this purpose!).
349
  """.strip()
350
  )
351
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
 
60
 
61
  if result_select is None:
62
  if st.session_state.korean:
63
+ st.markdown("결과를 확인하려면 먼저 **🔥Arena-Lite를 구동**하셔야 합니다")
64
  else:
65
+ st.markdown("You should **🔥Run Arena-Lite** first to see results")
66
  st.image("streamlit_app_local/page_result_1.png")
67
  st.image("streamlit_app_local/page_result_2.png")
68
  st.image("streamlit_app_local/page_result_3.png")
 
334
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
335
  st.info(
336
  """
337
+ Arena-Lite์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
338
  * position bias (์™ผ์ชฝ)
339
  * length bias (์˜ค๋ฅธ์ชฝ)
340
 
341
+ ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Arena-Lite๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
342
  if st.session_state.korean
343
  else """
344
+ In Arena-Lite, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
345
  * position bias (left)
346
  * length bias (right)
347
 
348
+ To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Arena-Lite for this purpose!).
349
  """.strip()
350
  )
351
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
streamlit_app_local/README.md CHANGED
@@ -1,4 +1,4 @@
1
- # Varco Arena web app
2
  ```bash
3
  cd ./streamlit_app_local/
4
  bash run.sh
 
1
+ # Arena-Lite web app
2
  ```bash
3
  cd ./streamlit_app_local/
4
  bash run.sh
streamlit_app_local/app.py CHANGED
@@ -51,7 +51,7 @@ def upload_files(uploaded_files) -> Path:
51
  if not uploaded_files:
52
  st.warning("❌ No files to upload. Please drag/drop or browse files to upload.")
53
  elif len(uploaded_files) < 2:
54
- st.error("❌ You need at least 2 jsonlines files to properly run VA.")
55
  else: # properly uploaded
56
  for file in uploaded_files:
57
  # Create a path for the file in the server directory
@@ -154,18 +154,18 @@ def main():
154
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
155
  )
156
 
157
- st.title("⚔️ VARCO ARENA ⚔️")
158
  if st.session_state.korean:
159
  st.write(
160
- """**๋ฐ”๋ฅด์ฝ” ์•„๋ ˆ๋‚˜๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
161
 
162
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
163
  )
164
  else:
165
  st.write(
166
- """**VARCO Arena is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
167
 
168
- VARCO Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
169
  )
170
 
171
  st.divider()
@@ -261,9 +261,9 @@ def main():
261
  # Form for actual run
262
  with st.form("run_arena_form"):
263
  if st.session_state.korean:
264
- st.write("### 3. Varco Arena 구동하기")
265
  else:
266
- st.write("### 3. Run Varco Arena")
267
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
268
  exp_name = st.text_input("(Optional) Enter Exp. name")
269
  exp_name = exp_name.replace(
@@ -298,7 +298,7 @@ def main():
298
  "❌ Requirements: You have to upload jsonlines files first to proceed"
299
  )
300
  elif not api_key:
301
- st.error("❌ Requirements: OpenAI key required to run VA.")
302
  else:
303
  result_file_path, return_code = run_varco_arena(
304
  # upload_dir=st.session_state.upfiles_dir,
@@ -309,9 +309,9 @@ def main():
309
  evaluation_model=eval_model,
310
  )
311
  if return_code:
312
- st.error("❌ RuntimeError: An error occurred during Varco Arena run")
313
  else:
314
- st.success("✅ Varco Arena run completed successfully")
315
  st.session_state.result_file_path = result_file_path
316
  set_nav_bar(
317
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_run_done"
 
51
  if not uploaded_files:
52
  st.warning("❌ No files to upload. Please drag/drop or browse files to upload.")
53
  elif len(uploaded_files) < 2:
54
+ st.error("❌ You need at least 2 jsonlines files to properly run.")
55
  else: # properly uploaded
56
  for file in uploaded_files:
57
  # Create a path for the file in the server directory
 
154
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
155
  )
156
 
157
+ st.title("⚔️ Arena-Lite (former VARCO ARENA) ⚔️")
158
  if st.session_state.korean:
159
  st.write(
160
+ """**Arena-Lite๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
161
 
162
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
163
  )
164
  else:
165
  st.write(
166
+ """**Arena-Lite is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
167
 
168
+ Arena-Lite conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
169
  )
170
 
171
  st.divider()
 
261
  # Form for actual run
262
  with st.form("run_arena_form"):
263
  if st.session_state.korean:
264
+ st.write("### 3. Arena-Lite 구동하기")
265
  else:
266
+ st.write("### 3. Run Arena-Lite")
267
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
268
  exp_name = st.text_input("(Optional) Enter Exp. name")
269
  exp_name = exp_name.replace(
 
298
  "❌ Requirements: You have to upload jsonlines files first to proceed"
299
  )
300
  elif not api_key:
301
+ st.error("❌ Requirements: OpenAI key required to run.")
302
  else:
303
  result_file_path, return_code = run_varco_arena(
304
  # upload_dir=st.session_state.upfiles_dir,
 
309
  evaluation_model=eval_model,
310
  )
311
  if return_code:
312
+ st.error("❌ RuntimeError: An error occurred during Arena-Lite run")
313
  else:
314
+ st.success("✅ Arena-Lite run completed successfully")
315
  st.session_state.result_file_path = result_file_path
316
  set_nav_bar(
317
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_run_done"
streamlit_app_local/modules/nav.py CHANGED
@@ -16,7 +16,7 @@ def Navbar(sidebar_placeholder, toggle_hashstr: str = ""):
16
 
17
  st.page_link(
18
  "app.py",
19
- label="Varco Arena 구동" if st.session_state.korean else "Run VARCO Arena",
20
  icon="๐Ÿ”ฅ",
21
  )
22
  st.page_link(
 
16
 
17
  st.page_link(
18
  "app.py",
19
+ label="Arena-Lite 구동" if st.session_state.korean else "Run Arena-Lite",
20
  icon="๐Ÿ”ฅ",
21
  )
22
  st.page_link(
streamlit_app_local/pages/brief_intro.py CHANGED
@@ -23,7 +23,7 @@ else:
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
- | |Current Practice|Varco Arena|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
@@ -32,9 +32,9 @@ st.markdown(
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
- "Varco Arena๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
- "Varco Arena takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
 
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
+ | |Current Practice|Arena-Lite|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
 
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
+ "Arena-Lite๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
+ "Arena-Lite takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
streamlit_app_local/pages/see_results.py CHANGED
@@ -354,18 +354,18 @@ def main():
354
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
355
  st.info(
356
  """
357
- Varco Arena์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
358
  * position bias (์™ผ์ชฝ)
359
  * length bias (์˜ค๋ฅธ์ชฝ)
360
 
361
- ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Varco Arena๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
362
  if st.session_state.korean
363
  else """
364
- In Varco Arena, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
365
  * position bias (left)
366
  * length bias (right)
367
 
368
- To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Varco Arena for this purpose!).
369
  """.strip()
370
  )
371
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
 
354
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
355
  st.info(
356
  """
357
+ Arena-Lite์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
358
  * position bias (์™ผ์ชฝ)
359
  * length bias (์˜ค๋ฅธ์ชฝ)
360
 
361
+ ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Arena-Lite๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
362
  if st.session_state.korean
363
  else """
364
+ In Arena-Lite, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
365
  * position bias (left)
366
  * length bias (right)
367
 
368
+ To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Arena-Lite for this purpose!).
369
  """.strip()
370
  )
371
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
streamlit_app_local/view_utils.py CHANGED
@@ -16,7 +16,7 @@ from modules.nav import Navbar
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
- st.set_page_config(page_title="VARCO Arena", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
@@ -126,7 +126,7 @@ def compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000):
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
- WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Varco Arena or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
 
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
+ st.set_page_config(page_title="Arena-Lite", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
 
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
+ WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Arena-Lite or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
view_utils.py CHANGED
@@ -16,7 +16,7 @@ from modules.nav import Navbar
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
- st.set_page_config(page_title="VARCO Arena", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
@@ -126,7 +126,7 @@ def compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000):
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
- WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Varco Arena or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
 
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
+ st.set_page_config(page_title="Arena-Lite", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
 
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
+ WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Arena-Lite or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
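
The `compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000)` signature and the `Y[df["winner"] == "A"]` line in this hunk suggest the usual Bradley-Terry MLE rating fitted by logistic regression (as popularized by Chatbot Arena). Below is a hedged, self-contained sketch of that technique for orientation; it is not the repository's implementation, and the `model_a` / `model_b` / `winner` column names are assumptions. A run would look like `mle_elo_sketch(matches_df)` where `matches_df` has one row per judged match.

```python
# Sketch of MLE Elo via logistic regression (Bradley-Terry), the technique the
# compute_mle_elo() hunk above appears to use. NOT the repository's code; the
# model_a / model_b / winner column names are assumptions, and both outcomes
# must appear in Y (the hunk shows the repo patching the all-one-class case).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression  # penalty=None needs scikit-learn >= 1.2


def mle_elo_sketch(df: pd.DataFrame, SCALE=400, BASE=10, INIT_RATING=1000) -> pd.Series:
    models = pd.concat([df["model_a"], df["model_b"]]).unique()
    idx = {m: i for i, m in enumerate(models)}

    # One row per match: +log(BASE) for the A-side model, -log(BASE) for the B-side.
    X = np.zeros((len(df), len(models)))
    for row, (a, b) in enumerate(zip(df["model_a"], df["model_b"])):
        X[row, idx[a]] = np.log(BASE)
        X[row, idx[b]] = -np.log(BASE)
    Y = (df["winner"] == "A").astype(float).to_numpy()

    lr = LogisticRegression(fit_intercept=False, penalty=None)
    lr.fit(X, Y)

    ratings = SCALE * lr.coef_[0] + INIT_RATING
    return pd.Series(ratings, index=models).sort_values(ascending=False)
```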