sonsus committed on
Commit
45f8fc7
·
1 Parent(s): 1eadcf1

rebrand: varco-arena -> arena-lite

.vscode/launch.json CHANGED
@@ -13,13 +13,15 @@
13
  "console": "integratedTerminal",
14
  "args": [
15
  "-i",
16
- "rsc/inputs_for_dbg/dbg_llmbar_inputs/", // "rsc/inputs_for_dbg/dbg_trans_inputs/",
 
17
  "-o",
18
  "DBGOUT",
19
  "-e",
20
  "gpt-4.1-mini",
21
  "-p",
22
- "llmbar", // "translation_fortunecookie",
 
23
 
24
  ]
25
  }
 
13
  "console": "integratedTerminal",
14
  "args": [
15
  "-i",
16
+ // "rsc/inputs_for_dbg/dbg_llmbar_inputs/",
17
+ "rsc/inputs_for_dbg/dbg_trans_inputs/",
18
  "-o",
19
  "DBGOUT",
20
  "-e",
21
  "gpt-4.1-mini",
22
  "-p",
23
+ // "llmbar",
24
+ "translation_pair",
25
 
26
  ]
27
  }
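
For reference, the updated debug configuration above is equivalent to invoking the CLI directly with the same arguments (the command simply mirrors the `args` list; see also the debug lines in the README further down in this commit):

```
python main.py -i "rsc/inputs_for_dbg/dbg_trans_inputs/" -o DBGOUT -e gpt-4.1-mini -p translation_pair
```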
README.md CHANGED
@@ -1,21 +1,8 @@
1
- ---
2
- title: VARCO Arena
3
- emoji: 🔥
4
- colorFrom: pink
5
- colorTo: yellow
6
- sdk: streamlit
7
- sdk_version: 1.40.2
8
- app_file: app.py
9
- pinned: false
10
- license: cc-by-4.0
11
- short_description: VARCO Arena is a reference-free LLM benchmarking approach
12
- ---
13
-
14
- # Varco Arena
15
- Varco Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs.
16
 
17
  For more information, the following may help you understand how it works.
18
- * [Paper](https://huggingface.co/papers/2411.01281)
19
  * [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
20
 
21
 
@@ -42,7 +29,7 @@ python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e
42
 
43
  # dbg lines
44
  ## openai api judge dbg
45
- python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e o4-mini
46
  ## other testing lines
47
  python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
48
  ## dummy judge dbg (checking errors without api requests)
@@ -102,15 +89,66 @@ pre-commit install
102
  bash precommit.sh # black formatter will reformat the codes
103
  ```
104

105
  ## FAQ
106
- * I want to apply my custom judge prompt to run Varco Arena
107
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
108
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
109
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
110
- * I want more fields for my llm outputs jsonl files for tailored use, i.e. want more fields beyond `instruction`, `source`, `generated`.
111
- * It's going to get tricky but let me briefly guide you about this.
112
- * You might have to edit `varco_arena/eval_utils.py`:`async_eval_w_prompt` (this part calls `PROMPT_OBJ.complete_prompt()`)
113
- * And all the related codes will require revision.
114
 
115
  ## Special Thanks to (contributors)
116
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
@@ -122,10 +160,9 @@ bash precommit.sh # black formatter will reformat the codes
122
 
123
  ## Citation
124
  If you found our work helpful, consider citing our paper!
125
- [arxiv](https://arxiv.org/abs/2411.19103v1)
126
  ```
127
  @misc{son2024varcoarenatournamentapproach,
128
- title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
129
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
130
  year={2024},
131
  eprint={2411.01281},
 
1
+ # Arena-Lite
2
+ Arena-Lite runs a tournament among the models being compared for each test set instruction and ranks the models accurately at an affordable price. This is more accurate and cost-effective than computing win rates against reference outputs.
3
 
4
  For more information, the following may help you understand how it works.
5
+ * [Paper](https://arxiv.org/abs/2411.01281)
6
  * [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
7
 
8
 
 
29
 
30
  # dbg lines
31
  ## openai api judge dbg
32
+ python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
33
  ## other testing lines
34
  python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
35
  ## dummy judge dbg (checking errors without api requests)
 
89
  bash precommit.sh # black formatter will reformat the codes
90
  ```
91
 
92
+ ### 📝 Adding a Custom Prompt
93
+
94
+ Here's how to add a new evaluation prompt. The process has been simplified recently, as the Judge logic now only relies on the `parsed_output` method.
95
+
96
+ The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` to create your own prompt.
97
+
98
+ #### 1. Create Prompt `.py` and `.yaml` Files
99
+
100
+ - Create files like `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
101
+ - **`my_prompt.py`**:
102
+ - Define a class that inherits from `ComparisonPromptBase`.
103
+ - You **must** implement the `parsed_output(self, response)` method. This function should take the LLM Judge's `response` and return a decision token (e.g., `'a'`, `'b'`) indicating the winner.
104
+ - **`my_prompt.yaml`**:
105
+ - Define necessary elements for your prompt, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
106
+ - The strings in `prompt_template` are processed by `string.Template` and finalized in `eval_utils.py` via the `BasePrompt.complete_prompt()` function.
107
+ - Do not use `${task}`, `${generated}`, or `${model_id}` in `prompt_template`; they are reserved for Arena-Lite.
108
+
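To make step 1 concrete, here is a minimal sketch of what `my_prompt.py` could look like. Only the `ComparisonPromptBase` base class and the `parsed_output(self, response)` contract come from the steps above; the import path, the assumption that `response` is the judge's plain-text output, and the parsing heuristic are illustrative placeholders, not the repository's implementation.

```python
# my_prompt.py -- illustrative sketch only.
# Assumptions: ComparisonPromptBase is importable from the prompts package
# (adjust the import to wherever the base class actually lives), and
# `response` is the judge's plain-text output.
from .base import ComparisonPromptBase  # hypothetical import path


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        """Map the LLM Judge's raw response to a decision token: 'a' or 'b'."""
        lines = [ln.strip().lower() for ln in str(response).splitlines() if ln.strip()]
        verdict = lines[-1] if lines else ""
        if "a" in verdict and "b" not in verdict:
            return "a"
        if "b" in verdict and "a" not in verdict:
            return "b"
        # Fail loudly so malformed judge outputs are easy to spot while debugging.
        raise ValueError(f"Could not parse a decision token from: {verdict!r}")
```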
109
+ #### 2. Register the Prompt in `prompts/__init__.py`
110
+
111
+ - Import your new prompt class:
112
+ ```python
113
+ from .my_prompt import MyPrompt
114
+ ```
115
+ - Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
116
+ ```python
117
+ NAME2PROMPT_CLS = dict(
118
+ # ... other prompts
119
+ my_prompt=MyPrompt(),
120
+ )
121
+ ```
122
+ - Add the new prompt name to the `Literal` type hint for the `promptname` argument in the `load_prompt` function:
123
+ ```python
124
+ def load_prompt(
125
+ promptname: Literal[
126
+ # ... other prompt names
127
+ "my_prompt",
128
+ ],
129
+ # ...
130
+ ):
131
+ ```
132
+
133
+ #### 3. Add the Prompt to `eval_prompt_list.txt`
134
+
135
+ - Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
136
+
137
+ #### 4. (Recommended) Test and Debug
138
+
139
+ - It is highly recommended to debug your prompt to ensure it works as expected.
140
+ - In the `.vscode/launch.json` file, modify the `"VA"` configuration's `args`:
141
+ - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
142
+ - If necessary, update the `"-i", "..."` argument to the path of your test data suitable for the new prompt.
143
+ - Go to the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
144
+ - Find `result.json` inside the output directory you specified after `-o`. It will show every judge prompt used for each match.
145
+
146
+
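For orientation, after the edits described in step 4 the "VA" configuration's `args` would look roughly like the excerpt below (based on the `launch.json` hunk at the top of this commit, with the prompt name swapped for `my_prompt`; point `-i` at whatever test data suits your prompt):

```
"args": [
    "-i",
    "rsc/inputs_for_dbg/dbg_trans_inputs/",
    "-o",
    "DBGOUT",
    "-e",
    "gpt-4.1-mini",
    "-p",
    "my_prompt",
]
```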
147
  ## FAQ
148
+ * I want to apply my custom judge prompt to run Arena-Lite
149
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
150
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
151
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
 
 
 
 
152
 
153
  ## Special Thanks to (contributors)
154
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
 
160
 
161
  ## Citation
162
  If you found our work helpful, consider citing our paper!
 
163
  ```
164
  @misc{son2024varcoarenatournamentapproach,
165
+ title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
166
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
167
  year={2024},
168
  eprint={2411.01281},
README_en.md CHANGED
@@ -1,5 +1,5 @@
1
- # Varco Arena
2
- Varco Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs.
3
 
4
  For more information, the following may help you understand how it works.
5
  * [Paper](https://arxiv.org/abs/2411.01281)
@@ -89,15 +89,66 @@ pre-commit install
89
  bash precommit.sh # black formatter will reformat the codes
90
  ```
91

92
  ## FAQ
93
- * I want to apply my custom judge prompt to run Varco Arena
94
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
95
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
96
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
97
- * I want more fields for my llm outputs jsonl files for tailored use, i.e. want more fields beyond `instruction`, `source`, `generated`.
98
- * It's going to get tricky but let me briefly guide you about this.
99
- * You might have to edit `varco_arena/eval_utils.py`:`async_eval_w_prompt` (this part calls `PROMPT_OBJ.complete_prompt()`)
100
- * And all the related codes will require revision.
101
 
102
  ## Special Thanks to (contributors)
103
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
@@ -111,7 +162,7 @@ bash precommit.sh # black formatter will reformat the codes
111
  If you found our work helpful, consider citing our paper!
112
  ```
113
  @misc{son2024varcoarenatournamentapproach,
114
- title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
115
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
116
  year={2024},
117
  eprint={2411.01281},
 
1
+ # Arena-Lite (former VARCO Arena)
2
+ Arena-Lite runs a tournament among the models being compared for each test set instruction and ranks the models accurately at an affordable price. This is more accurate and cost-effective than computing win rates against reference outputs.
3
 
4
  For more information, the following may help you understand how it works.
5
  * [Paper](https://arxiv.org/abs/2411.01281)
 
89
  bash precommit.sh # black formatter will reformat the codes
90
  ```
91
 
92
+ ### 📝 Adding a Custom Prompt
93
+
94
+ Here's how to add a new evaluation prompt. The process has been simplified recently, as the Judge logic now only relies on the `parsed_output` method.
95
+
96
+ The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` to create your own prompt.
97
+
98
+ #### 1. Create Prompt `.py` and `.yaml` Files
99
+
100
+ - Create files like `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
101
+ - **`my_prompt.py`**:
102
+ - Define a class that inherits from `ComparisonPromptBase`.
103
+ - You **must** implement the `parsed_output(self, response)` method. This function should take the LLM Judge's `response` and return a decision token (e.g., `'a'`, `'b'`) indicating the winner.
104
+ - **`my_prompt.yaml`**:
105
+ - Define necessary elements for your prompt, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
106
+ - The strings in `prompt_template` are processed by `string.Template` and finalized in `eval_utils.py` via the `BasePrompt.complete_prompt()` function.
107
+ - Do not use `${task}` in `prompt_template`. It is a reserved keyword due to the llmbar prompt.
108
+
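To make the `prompt_template` mechanics concrete, the snippet below shows how a template string like the one you would put in `my_prompt.yaml` gets filled in with `string.Template`. The template text and the substitution fields (`instruction`, `generated_a`, `generated_b`) are illustrative assumptions; only the `${...}` syntax and the reserved `${task}` placeholder come from the guide above, and the repository's `complete_prompt()` may differ in detail.

```python
# Sketch of how a prompt_template string is resolved with string.Template.
# Field names other than the reserved ${task} are illustrative placeholders.
from string import Template

prompt_template = Template(
    "You are a strict judge for the task below.\n"
    "Instruction: ${instruction}\n"
    "Response A: ${generated_a}\n"
    "Response B: ${generated_b}\n"
    "Answer with a single token: a or b."
)

filled = prompt_template.substitute(
    instruction="Summarize the article in one sentence.",
    generated_a="(model A output)",
    generated_b="(model B output)",
)
print(filled)
```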
109
+ #### 2. Register the Prompt in `prompts/__init__.py`
110
+
111
+ - Import your new prompt class:
112
+ ```python
113
+ from .my_prompt import MyPrompt
114
+ ```
115
+ - Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
116
+ ```python
117
+ NAME2PROMPT_CLS = dict(
118
+ # ... other prompts
119
+ my_prompt=MyPrompt(),
120
+ )
121
+ ```
122
+ - Add the new prompt name to the `Literal` type hint for the `promptname` argument in the `load_prompt` function:
123
+ ```python
124
+ def load_prompt(
125
+ promptname: Literal[
126
+ # ... other prompt names
127
+ "my_prompt",
128
+ ],
129
+ # ...
130
+ ):
131
+ ```
132
+
133
+ #### 3. Add the Prompt to `eval_prompt_list.txt`
134
+
135
+ - Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
136
+
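After this step the file simply gains one more line; with the prompts shipped in this commit it would read:

```
llmbar
llmbar_brief
translation_pair
rag_pair_kr
translation_fortunecookie
my_prompt
```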
137
+ #### 4. (Recommended) Test and Debug
138
+
139
+ - It is highly recommended to debug your prompt to ensure it works as expected.
140
+ - In the `.vscode/launch.json` file, modify the `"VA"` configuration's `args`:
141
+ - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
142
+ - If necessary, update the `"-i", "..."` argument to the path of your test data suitable for the new prompt.
143
+ - Go to the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
144
+ - Find `result.json` inside the output directory you specified after `-o`. It will show every judge prompt used for each match.
145
+
146
+
147
  ## FAQ
148
+ * I want to apply my custom judge prompt to run Arena-Lite
149
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the class objects for those. Edit them as needed.
150
  * I want tailored judge prompts for each row of the test set (e.g., rows 1-100 use `prompt1`, rows 101+ use `prompt2`)
151
  * You can see that `load_prompt` at the above link receives `promptname` + `task` as parameters to load the prompt. The function is called at [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
 
 
 
 
152
 
153
  ## Special Thanks to (contributors)
154
  - Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
 
162
  If you found our work helpful, consider citing our paper!
163
  ```
164
  @misc{son2024varcoarenatournamentapproach,
165
+ title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
166
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
167
  year={2024},
168
  eprint={2411.01281},
README_kr.md CHANGED
@@ -1,5 +1,5 @@
1
- # Varco Arena
2
- ๋ฐ”๋ฅด์ฝ” ์•„๋ ˆ๋‚˜๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ๋“ค์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ •ํ™•ํ•˜๊ฒŒ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊น๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ์กฐ๊ธˆ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.
3
 
4
  ๋” ์ž์„ธํ•œ ๋‚ด์šฉ์— ๋Œ€ํ•ด์„œ๋Š” ์•„๋ž˜์˜ ๋งํฌ๋ฅผ ์ฐธ์กฐํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
5
  * [๋…ผ๋ฌธ](https://arxiv.org/abs/2411.01281)
@@ -91,16 +91,66 @@ pre-commit install
91
  bash precommit.sh # ์ด๊ฒŒ ์ฝ”๋“œ๋“ค์„ ๋‹ค ๋ฆฌํฌ๋งทํ•ด์ค„๊ฑฐ์ž„
92
  ```
93

94
 
95
  ๋ฌธ์˜: ์†์„ ์ผ
96
  * ๋‚ด๊ฐ€ ๋งŒ๋“  ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”
97
  * [`./varco_arena/prompts/`](./varco_arena_core/prompts/__init__.py) ์—์„  ๊ฐ์ข… ํ”„๋กฌํ”„ํŠธ ํด๋ž˜์Šค ๋ฐ `yaml` ํŒŒ์ผ ํ˜•ํƒœ๋กœ ์ •์˜๋œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ํ”„๋ฆฌ์…‹์„ ์ฐธ์กฐํ•˜์—ฌ ์ž‘์„ฑํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
98
  * ํ…Œ์ŠคํŠธ์…‹ ๋ณ„๋กœ ๋‹ค๋ฅธ ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š” (e.g. ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”)
99
  * ์œ„ ๊ฑธ์–ด๋“œ๋ฆฐ ๋งํฌ์˜ `load_prompt` ๋ฅผ ํ†ตํ•ด์„œ `promptname` + `task` ํ˜•ํƒœ๋กœ [`./varco_arena_core/manager.py:async_run`](./varco_arena_core/manager.py) ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋กœ๋“œ๋˜๋„๋ก ํ•ด๋†“์•˜์Šต๋‹ˆ๋‹ค.
100
- * ์ œ๊ฐ€ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์€ ์ž…๋ ฅํŒŒ์ผ์— `instruction`, `source`, `generated` ์ด์™ธ์— ๋‹ค๋ฅธ ํ•„๋“œ๋ฅผ ์ถ”๊ฐ€ํ•ด์„œ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”.
101
- * ์กฐ๊ธˆ ๋ณต์žกํ•ด์ง€๋Š”๋ฐ ๋‹ค์Œ ๋ถ€๋ถ„์„ ๊ณ ์ณ์ฃผ์„ธ์š”
102
- * `varco_arena/eval_utils.py` ์—์„œ `async_eval_w_prompt` ๋ถ€๋ถ„์„ ์†๋ด์•ผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (์—ฌ๊ธฐ์—์„œ PROMPT_OBJ.complete_prompt()์„ ํ˜ธ์ถœํ•จ)
103
- * ๊ทธ ์™ธ ์—ฐ๊ด€๋œ ๋ถ€๋ถ„์€ ํƒ€๊ณ ํƒ€๊ณ  ๊ณ ์ณ์ฃผ์…”์•ผ...
104
 
105
  ## Special Thanks to (contributors)
106
  - ์ด๋ฏผํ˜ธ (@๋Œ€ํ™”๋ชจ๋ธํŒ€, NCSOFT) [github](https://github.com/minolee/)
@@ -113,7 +163,7 @@ bash precommit.sh # ์ด๊ฒŒ ์ฝ”๋“œ๋“ค์„ ๋‹ค ๋ฆฌํฌ๋งทํ•ด์ค„๊ฑฐ์ž„
113
  ์ €ํฌ ์ž‘์—…๋ฌผ์ด ๋„์›€์ด ๋˜์—ˆ๋‹ค๋ฉด ์ €ํฌ๋„ ๋„์›€์„ ๋ฐ›์•„๋ณผ ์ˆ˜ ์žˆ์„๊นŒ์š”?๐Ÿ˜‰
114
  ```
115
  @misc{son2024varcoarenatournamentapproach,
116
- title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
117
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
118
  year={2024},
119
  eprint={2411.01281},
 
1
+ # Arena-Lite (구 VARCO Arena)
2
+ ์•„๋ ˆ๋‚˜-๋ผ์ดํŠธ๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ๋“ค์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ •ํ™•ํ•˜๊ฒŒ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊น๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ์กฐ๊ธˆ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.
3
 
4
  ๋” ์ž์„ธํ•œ ๋‚ด์šฉ์— ๋Œ€ํ•ด์„œ๋Š” ์•„๋ž˜์˜ ๋งํฌ๋ฅผ ์ฐธ์กฐํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
5
  * [๋…ผ๋ฌธ](https://arxiv.org/abs/2411.01281)
 
91
  bash precommit.sh # ์ด๊ฒŒ ์ฝ”๋“œ๋“ค์„ ๋‹ค ๋ฆฌํฌ๋งทํ•ด์ค„๊ฑฐ์ž„
92
  ```
93
 
94
+ ### ๐Ÿ“ ์ปค์Šคํ…€ ํ”„๋กฌํ”„ํŠธ ์ถ”๊ฐ€ํ•˜๊ธฐ
95
+
96
+ ์ƒˆ๋กœ์šด ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์ตœ๊ทผ Judge ๋กœ์ง์ด `parsed_output` ๋ฉ”์†Œ๋“œ๋งŒ ์‚ฌ์šฉํ•˜๋„๋ก ๊ฐ„์†Œํ™”๋˜์–ด ์ด์ „๋ณด๋‹ค ์‰ฝ๊ฒŒ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
97
+
98
+ ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ `llmbar_brief.py`์™€ `llmbar_brief.yaml` ํŒŒ์ผ์„ ๋ณต์‚ฌํ•˜์—ฌ ์ž์‹ ๋งŒ์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
99
+
100
+ #### 1. ํ”„๋กฌํ”„ํŠธ `.py` ๋ฐ `.yaml` ํŒŒ์ผ ์ƒ์„ฑ
101
+
102
+ - `varco_arena/varco_arena_core/prompts/` ๊ฒฝ๋กœ์— `my_prompt.py`์™€ `my_prompt.yaml`์ฒ˜๋Ÿผ ํŒŒ์ผ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
103
+ - **`my_prompt.py`**:
104
+ - `ComparisonPromptBase`๋ฅผ ์ƒ์†๋ฐ›๋Š” ํด๋ž˜์Šค๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
105
+ - `parsed_output(self, response)` ๋ฉ”์†Œ๋“œ๋ฅผ ๋ฐ˜๋“œ์‹œ ๊ตฌํ˜„ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” LLM Judge์˜ ์‘๋‹ต(`response`)์„ ๋ฐ›์•„, ์Šน์ž๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒฐ์ • ํ† ํฐ(์˜ˆ: `'a'`, `'b'`)์„ ๋ฐ˜ํ™˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
106
+ - **`my_prompt.yaml`**:
107
+ - `sampling_parameters`, `decision_tokens`, `prompt_template` ๋“ฑ ํ”„๋กฌํ”„ํŠธ์— ํ•„์š”ํ•œ ์š”์†Œ๋“ค์„ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
108
+ - `prompt_template` ์— ๋“ค์–ด๊ฐ€๋Š” ๋ฌธ์ž์—ด์€ `string.Template`์œผ๋กœ ์ฒ˜๋ฆฌ๋˜๋ฉฐ `BasePrompt.complete_prompt()` ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด `eval_utils.py`์—์„œ ์ตœ์ข… ์™„์„ฑ๋ฉ๋‹ˆ๋‹ค.
109
+ - `${task}, ${generated}, ${model_id}`๋ฅผ `prompt_template`์— ์‚ฌ์šฉํ•˜์ง€ ๋งˆ์„ธ์š”. ์˜ˆ์•ฝ๋œ ํ‚ค์›Œ๋“œ๋“ค์ž…๋‹ˆ๋‹ค.
110
+
111
+ #### 2. `prompts/__init__.py`์— ํ”„๋กฌํ”„ํŠธ ๋“ฑ๋ก
112
+
113
+ - ์ƒ์„ฑํ•œ ํ”„๋กฌํ”„ํŠธ ํด๋ž˜์Šค๋ฅผ `import` ํ•ฉ๋‹ˆ๋‹ค.
114
+ ```python
115
+ from .my_prompt import MyPrompt
116
+ ```
117
+ - `NAME2PROMPT_CLS` ๋”•์…”๋„ˆ๋ฆฌ์— ์ƒˆ ํ”„๋กฌํ”„ํŠธ ์ด๋ฆ„๊ณผ ํด๋ž˜์Šค ๊ฐ์ฒด๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
118
+ ```python
119
+ NAME2PROMPT_CLS = dict(
120
+ # ... ๊ธฐ์กด ํ”„๋กฌํ”„ํŠธ๋“ค
121
+ my_prompt=MyPrompt(),
122
+ )
123
+ ```
124
+ - `load_prompt` ํ•จ์ˆ˜์˜ `promptname` ์ธ์ž์˜ `Literal` ํƒ€์ž… ํžŒํŠธ์— ์ƒˆ ํ”„๋กฌํ”„ํŠธ ์ด๋ฆ„์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
125
+ ```python
126
+ def load_prompt(
127
+ promptname: Literal[
128
+ # ... ๊ธฐ์กด ํ”„๋กฌํ”„ํŠธ ์ด๋ฆ„๋“ค
129
+ "my_prompt",
130
+ ],
131
+ # ...
132
+ ):
133
+ ```
134
+
135
+ #### 3. `eval_prompt_list.txt`์— ํ”„๋กฌํ”„ํŠธ ์ถ”๊ฐ€
136
+
137
+ - ํ”„๋กœ์ ํŠธ ๋ฃจํŠธ์˜ `eval_prompt_list.txt` ํŒŒ์ผ์„ ์—ด๊ณ , ์ƒˆ ํ”„๋กฌํ”„ํŠธ์˜ ์ด๋ฆ„(`my_prompt`)์„ ์ƒˆ ์ค„์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
138
+
139
+ #### 4. (๊ถŒ์žฅ) ํ…Œ์ŠคํŠธ ๋ฐ ๋””๋ฒ„๊น…
140
+
141
+ - ํ”„๋กฌํ”„ํŠธ๊ฐ€ ์˜๋„๋Œ€๋กœ ์ž‘๋™ํ•˜๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด ๋””๋ฒ„๊น…์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.
142
+ - `.vscode/launch.json` ํŒŒ์ผ์˜ `"VA"` ์„ค์ •์—์„œ `args`๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค.
143
+ - `"-p", "translation_fortunecookie"` ๋ถ€๋ถ„์„ `"-p", "my_prompt"`๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.
144
+ - ํ•„์š”์‹œ `"-i", "..."` ๋ถ€๋ถ„์— ์ƒˆ ํ”„๋กฌํ”„ํŠธ์— ์ ํ•ฉํ•œ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๊ฒฝ๋กœ๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.
145
+ - VS Code์˜ `Run and Debug` ํƒญ(Ctrl+Shift+D)์œผ๋กœ ์ด๋™ํ•˜์—ฌ "VA" ์„ค์ •์„ ์„ ํƒํ•˜๊ณ  F5 ํ‚ค๋ฅผ ๋ˆŒ๋Ÿฌ ๋””๋ฒ„๊ฑฐ๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.
146
+ - `-o` ๋’ค์— ๋ช…์‹œํ•œ output ๋””๋ ‰ํ† ๋ฆฌ ์•ˆ์—์„œ `result.json` ๋ฅผ ์ฐพ์•„์„œ ์›ํ•˜๋Š”๋Œ€๋กœ ๋™์ž‘ํ–ˆ๋Š”์ง€ ํ™•์ธํ•ด๋ณด์„ธ์š”. ๋ชจ๋“  judge์™€ ๋งค์น˜์— ํ™œ์šฉ๋œ ํ”„๋กฌํ”„ํŠธ ์ •๋ณด๊ฐ€ ๋‹ด๊ฒจ์žˆ์Šต๋‹ˆ๋‹ค.
147
 
148
  ๋ฌธ์˜: ์†์„ ์ผ
149
  * ๋‚ด๊ฐ€ ๋งŒ๋“  ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”
150
  * [`./varco_arena/prompts/`](./varco_arena_core/prompts/__init__.py) ์—์„  ๊ฐ์ข… ํ”„๋กฌํ”„ํŠธ ํด๋ž˜์Šค ๋ฐ `yaml` ํŒŒ์ผ ํ˜•ํƒœ๋กœ ์ •์˜๋œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ํ”„๋ฆฌ์…‹์„ ์ฐธ์กฐํ•˜์—ฌ ์ž‘์„ฑํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
151
  * ํ…Œ์ŠคํŠธ์…‹ ๋ณ„๋กœ ๋‹ค๋ฅธ ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š” (e.g. ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์–ด์š”)
152
  * ์œ„ ๊ฑธ์–ด๋“œ๋ฆฐ ๋งํฌ์˜ `load_prompt` ๋ฅผ ํ†ตํ•ด์„œ `promptname` + `task` ํ˜•ํƒœ๋กœ [`./varco_arena_core/manager.py:async_run`](./varco_arena_core/manager.py) ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋กœ๋“œ๋˜๋„๋ก ํ•ด๋†“์•˜์Šต๋‹ˆ๋‹ค.
153
+
 
 
 
154
 
155
  ## Special Thanks to (contributors)
156
  - ์ด๋ฏผํ˜ธ (@๋Œ€ํ™”๋ชจ๋ธํŒ€, NCSOFT) [github](https://github.com/minolee/)
 
163
  ์ €ํฌ ์ž‘์—…๋ฌผ์ด ๋„์›€์ด ๋˜์—ˆ๋‹ค๋ฉด ์ €ํฌ๋„ ๋„์›€์„ ๋ฐ›์•„๋ณผ ์ˆ˜ ์žˆ์„๊นŒ์š”?๐Ÿ˜‰
164
  ```
165
  @misc{son2024varcoarenatournamentapproach,
166
+ title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
167
  author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
168
  year={2024},
169
  eprint={2411.01281},
app.py CHANGED
@@ -253,18 +253,18 @@ def main():
253
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
254
  )
255
 
256
- st.title("⚔️ VARCO ARENA ⚔️")
257
  if st.session_state.korean:
258
  st.write(
259
- """**๋ฐ”๋ฅด์ฝ” ์•„๋ ˆ๋‚˜๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
260
 
261
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
262
  )
263
  else:
264
  st.write(
265
- """**VARCO Arena is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
266
 
267
- VARCO Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
268
  )
269
 
270
  st.divider()
@@ -389,9 +389,9 @@ def main():
389
  # Form for actual run
390
  with st.form("run_arena_form"):
391
  if st.session_state.korean:
392
- st.write("### 3. Varco Arena 구동하기")
393
  else:
394
- st.write("### 3. Run Varco Arena")
395
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
396
 
397
  # demo exp name fixated
@@ -434,12 +434,12 @@ def main():
434
  )
435
  if return_code:
436
  st.error(
437
- "❌ RuntimeError: An error occurred during Varco Arena run. Check the file and **restart from file upload!**"
438
  )
439
  purge_user_sub_data(data_path_to_purge=VA_ROOT)
440
 
441
  else:
442
- st.success("✅ Varco Arena run completed successfully")
443
  st.session_state.result_file_path = list(
444
  result_file_path.glob("**/result.json")
445
  )[-1]
 
253
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
254
  )
255
 
256
+ st.title("⚔️ Arena-Lite ⚔️")
257
  if st.session_state.korean:
258
  st.write(
259
+ """**Arena-Lite๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
260
 
261
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
262
  )
263
  else:
264
  st.write(
265
+ """**Arena-Lite is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
266
 
267
+ Arena-Lite conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
268
  )
269
 
270
  st.divider()
 
389
  # Form for actual run
390
  with st.form("run_arena_form"):
391
  if st.session_state.korean:
392
+ st.write("### 3. Arena-Lite 구동하기")
393
  else:
394
+ st.write("### 3. Run Arena-Lite")
395
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
396
 
397
  # demo exp name fixated
 
434
  )
435
  if return_code:
436
  st.error(
437
+ "❌ RuntimeError: An error occurred during Arena-Lite run. Check the file and **restart from file upload!**"
438
  )
439
  purge_user_sub_data(data_path_to_purge=VA_ROOT)
440
 
441
  else:
442
+ st.success("✅ Arena-Lite run completed successfully")
443
  st.session_state.result_file_path = list(
444
  result_file_path.glob("**/result.json")
445
  )[-1]
eval_prompt_list.txt CHANGED
@@ -1,4 +1,5 @@
1
  llmbar
 
2
  translation_pair
3
  rag_pair_kr
4
  translation_fortunecookie
 
1
  llmbar
2
+ llmbar_brief
3
  translation_pair
4
  rag_pair_kr
5
  translation_fortunecookie
guide_mds/input_jsonls_en.md CHANGED
@@ -1,37 +1,38 @@
1
- #### \[EN\] Upload guide (`jsonl`)
2
- **Basic Requirements**
3
- * Upload one `jsonl` file per model (e.g., five files to compare five LLMs)
4
- * ⚠️ Important: All `jsonl` files must have the same number of rows
5
- * ⚠️ Important: The `model_id` field must be unique within and across all files
6
-
7
- **Required Fields**
8
- * Per Model Fields
9
- * `model_id`: Unique identifier for the model (recommendation: keep it short)
10
- * `generated`: The LLM's response to the test instruction
11
-
12
- * Required only for Translation (`translation_pair` prompt need those. See `streamlit_app_local/user_submit/mt/llama5.jsonl`)
13
- * `source_lang`: input language (e.g. Korean, KR, kor, ...)
14
- * `target_lang`: output language (e.g. English, EN, ...)
15
-
16
- * Common Fields (Must be identical across all files)
17
- * `instruction`: The input prompt or test instruction given to the model
18
- * `task`: Category label used to group results (useful when using different evaluation prompts per task)
19
-
20
- **Example Format**
21
  ```python
22
  # model1.jsonl
23
- {"model_id": "model1", "task": "directions", "instruction": "Where should I go?", "generated": "Over there"}
24
- {"model_id": "model1", "task": "arithmetic", "instruction": "1+1", "generated": "2"}
25
 
26
- # model2.jsonl
27
- {"model_id": "model2", "task": "directions", "instruction": "Where should I go?", "generated": "Head north"}
28
- {"model_id": "model2", "task": "arithmetic", "instruction": "1+1", "generated": "3"}
29
  ...
30
  ..
31
- .
32
  ```
33
- **Use Case Example**
34
- If you want to compare different prompting strategies for the same model:
35
- * Use the same `instruction` across files (using unified test scenarios).
36
- * `generated` responses of each prompting strategy will vary across the files.
37
- * Use descriptive `model_id` values like "prompt1", "prompt2", etc.
 
 
 
1
+ #### \[EN\] Guide for Input .jsonl Files
2
+ If you have five models to compare, upload five .jsonl files.
3
+ * 💥 All `.jsonl` files must have the same number of rows.
4
+ * 💥 The `model_id` field must be different for each file and unique within each file.
5
+ * 💥 Each `.jsonl` file should have its own `generated` and `model_id`, while `instruction` and `task` must be identical across files.
6
+
7
+ **Required `.jsonl` Fields**
8
+ * Reserved Fields (Mandatory)
9
+ * `model_id`: The name of the model being evaluated. (Recommended to be short)
10
+ * `instruction`: The instruction given to the model. This corresponds to the test set prompt (not the evaluation prompt).
11
+ * `generated`: Enter the response generated by the model for the test set instruction.
12
+ * `task`: Used to group and display overall results as a subset. Can be utilized when you want to use different evaluation prompts per row.
13
+ * Additional
14
+ * Depending on the evaluation prompt you use, you can utilize other additional fields. You can freely add them to your `.jsonl` files, avoiding the reserved keywords mentioned above.
+ * Example: for the `translation_pair.yaml` and `translation_fortunecookie.yaml` prompts, the `source_lang` and `target_lang` fields are read from the `.jsonl` and utilized.
18
+
19
+ For example, when evaluating with the `translation_pair` prompt, each .jsonl file looks like this:
 
20
  ```python
21
  # model1.jsonl
22
+ {"model_id": "모델1", "task": "영한", "instruction": "어디로 가야하오", "generated": "Where should I go", "source_lang": "Korean", "target_lang": "English"}
23
+ {"model_id": "모델1", "task": "한영", "instruction": "1+1?", "generated": "1+1?", "source_lang": "English", "target_lang": "Korean"}
24

25
+ # model2.jsonl - same `instruction` as model1.jsonl; `generated` and `model_id` differ!
26
+ {"model_id": "모델2", "task": "영한", "instruction": "어디로 가야하오", "generated": "글쎄다", "source_lang": "Korean", "target_lang": "English"}
27
+ {"model_id": "모델2", "task": "한영", "instruction": "1+1?", "generated": "2", "source_lang": "English", "target_lang": "Korean"}
28
  ...
29
  ..
30
+
31
  ```
32
+ On the other hand, when evaluating with the `llmbar` prompt, translation-specific fields such as `source_lang` and `target_lang` are not used, so you do not need to add them to your `.jsonl`.
33
+
34
+
35
+
36
+
37
+
38
+
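The requirements above (same number of rows, one `model_id` per file that differs across files, identical `instruction`/`task` sequences) are easy to check before uploading. Below is a small, self-contained sanity-check sketch; the file paths are placeholders and the script is not part of this repository.

```python
# Minimal pre-upload sanity check for Arena-Lite input .jsonl files.
import json

files = ["model1.jsonl", "model2.jsonl"]  # placeholder paths, one file per model

def load_rows(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

all_rows = {p: load_rows(p) for p in files}

# 1) same number of rows in every file
lengths = {p: len(rows) for p, rows in all_rows.items()}
assert len(set(lengths.values())) == 1, f"row counts differ: {lengths}"

# 2) exactly one model_id per file (one file per model), distinct across files
model_ids = set()
for p, rows in all_rows.items():
    ids = {r["model_id"] for r in rows}
    assert len(ids) == 1, f"{p}: expected a single model_id per file, got {ids}"
    model_ids |= ids
assert len(model_ids) == len(files), "model_id must differ across files"

# 3) instruction/task must line up row-by-row across files
keys = [[(r["instruction"], r["task"]) for r in rows] for rows in all_rows.values()]
assert all(k == keys[0] for k in keys), "instruction/task rows must match across files"

print("input files look consistent")
```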
guide_mds/input_jsonls_kr.md CHANGED
@@ -2,33 +2,30 @@
2
  ๋น„๊ตํ•  ๋ชจ๋ธ์ด ๋‹ค์„ฏ ๊ฐœ๋ผ๋ฉด ๋‹ค์„ฏ ๊ฐœ์˜ .jsonl ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•˜์„ธ์š”.
3
  * ๐Ÿ’ฅ๋ชจ๋“  jsonl ์€ ๊ฐ™์€ ์ˆ˜์˜ ํ–‰์„ ๊ฐ€์ ธ์•ผํ•ฉ๋‹ˆ๋‹ค.
4
  * ๐Ÿ’ฅ`model_id` ํ•„๋“œ๋Š” ํŒŒ์ผ๋งˆ๋‹ค ๋‹ฌ๋ผ์•ผํ•˜๋ฉฐ ํŒŒ์ผ ๋‚ด์—์„œ๋Š” ์œ ์ผํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.
 
 
5
 
6
  **jsonl ํ•„์ˆ˜ ํ•„๋“œ**
7
- * ๊ฐœ๋ณ„
8
  * `model_id`: ํ‰๊ฐ€๋ฐ›๋Š” ๋ชจ๋ธ์˜ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค. (์งง๊ฒŒ ์“ฐ๋Š” ๊ฒƒ ์ถ”์ฒœ)
 
9
  * `generated`: ๋ชจ๋ธ์ด testset instruction ์— ์ƒ์„ฑํ•œ ์‘๋‹ต์„ ๋„ฃ์œผ์„ธ์š”.
10
-
11
- * ๋ฒˆ์—ญํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ ์‚ฌ์šฉ์‹œ (`translation_pair`. `streamlit_app_local/user_submit/mt/llama5.jsonl` ์—์„œ ์˜ˆ์‹œ ๋ณผ ์ˆ˜ ์žˆ์Œ)
12
- * `source_lang`: input language (e.g. Korean, KR, kor, ...)
13
- * `target_lang`: output language (e.g. English, EN, ...)
14
-
15
- * ๊ณตํ†ต ๋ถ€๋ถ„ (**๋ชจ๋“  ํŒŒ์ผ์— ๋Œ€ํ•ด ๊ฐ™์•„์•ผ ํ•จ**)
16
- * `instruction`: ๋ชจ๋ธ์— ์ง‘์–ด๋„ฃ๋Š” `testset instruction` ํ˜น์€ `input`์— ํ•ด๋‹นํ•˜๋Š” ๋ฌด์–ธ๊ฐ€์ž…๋‹ˆ๋‹ค.
17
  * `task`: ์ „์ฒด ๊ฒฐ๊ณผ๋ฅผ subset์œผ๋กœ ๊ทธ๋ฃน์ง€์–ด์„œ ๋ณด์—ฌ์ค„ ๋•Œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. `evaluation prompt`๋ฅผ ํ–‰๋ณ„๋กœ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ๋•Œ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
 
 
 
18
 
19
-
20
- ๊ฐ jsonl ํŒŒ์ผ์€ ์•„๋ž˜์ฒ˜๋Ÿผ ์ƒ๊ฒผ์Šต๋‹ˆ๋‹ค.
21
  ```python
22
  # model1.jsonl
23
- {"model_id": "๋ชจ๋ธ1", "task": "๊ธธ ๋ฌป๊ธฐ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "์ €๊ธฐ๋กœ์š”"}
24
- {"model_id": "๋ชจ๋ธ1", "task": "์‚ฐ์ˆ˜", "instruction": "1+1", "generated": "2"} # ๊ธธ ๋ฌป๊ธฐ์™€ ์‚ฐ์ˆ˜์˜ ๊ฒฝ์šฐ ๋‹ค๋ฅธ ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ์ˆ˜ ์žˆ๊ฒ ์ฃ ?
25
 
26
  # model2.jsonl -* model1.jsonl๊ณผ `instruction`์€ ๊ฐ™๊ณ  `generated`, `model_id` ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค!
27
- {"model_id": "๋ชจ๋ธ2", "task": "๊ธธ ๋ฌป๊ธฐ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "ํ•˜์ด"}
28
- {"model_id": "๋ชจ๋ธ2", "task": "์‚ฐ์ˆ˜", "instruction": "1+1", "generated": "3"}
29
-
30
  ...
31
  ..
32
- ```
33
 
34
- ์˜ˆ๋ฅผ ๋“ค์–ด, ํ•œ๊ฐ€์ง€ ๋ชจ๋ธ์— ๋Œ€ํ•ด ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŒ…์„ ์‹œ๋„ํ•˜์—ฌ ๋‹ค๋ฅธ ์ƒ์„ฑ๋ฌธ์„ ์–ป์—ˆ๊ณ  ์ด๋ฅผ ๋น„๊ตํ•˜๊ณ  ์‹ถ์€ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด๋ด…์‹œ๋‹ค. ์ด ๋•Œ ํ‰๊ฐ€๋ฐ›์„ testset์€ ๊ฐ™์œผ๋ฏ€๋กœ `instruction`์€ ๋ชจ๋‘ ๊ฐ™๊ณ  ํ”„๋กฌํ”„ํŒ…์— ๋”ฐ๋ผ `generated`๋Š” ๋‹ฌ๋ผ์ง€๊ฒ ์ฃ ? `model_id` ๋Š” `"prompt1"`, `"prompt2"` ๋“ฑ ์ทจํ–ฅ์— ๋งž๊ฒŒ ์ ์–ด์ฃผ์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
 
 
2
  ๋น„๊ตํ•  ๋ชจ๋ธ์ด ๋‹ค์„ฏ ๊ฐœ๋ผ๋ฉด ๋‹ค์„ฏ ๊ฐœ์˜ .jsonl ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•˜์„ธ์š”.
3
  * ๐Ÿ’ฅ๋ชจ๋“  jsonl ์€ ๊ฐ™์€ ์ˆ˜์˜ ํ–‰์„ ๊ฐ€์ ธ์•ผํ•ฉ๋‹ˆ๋‹ค.
4
  * ๐Ÿ’ฅ`model_id` ํ•„๋“œ๋Š” ํŒŒ์ผ๋งˆ๋‹ค ๋‹ฌ๋ผ์•ผํ•˜๋ฉฐ ํŒŒ์ผ ๋‚ด์—์„œ๋Š” ์œ ์ผํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.
5
+ * ๐Ÿ’ฅ๊ฐ jsonl ํŒŒ์ผ์ด ์„œ๋กœ ๋‹ค๋ฅธ generated ๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค. `instruction`, `model_id`, `task` ๋Š” ๊ฐ™์•„์•ผํ•ฉ๋‹ˆ๋‹ค.
6
+
7
 
8
  **jsonl ํ•„์ˆ˜ ํ•„๋“œ**
9
+ * ์˜ˆ์•ฝ๋œ ํ•„๋“œ (ํ•„์ˆ˜)
10
  * `model_id`: ํ‰๊ฐ€๋ฐ›๋Š” ๋ชจ๋ธ์˜ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค. (์งง๊ฒŒ ์“ฐ๋Š” ๊ฒƒ ์ถ”์ฒœ)
11
+ * `instruction`: ๋ชจ๋ธ์ด ๋ฐ›์€ ์ง€์‹œ๋ฌธ์ž…๋‹ˆ๋‹ค. ํ…Œ์ŠคํŠธ์…‹ ํ”„๋กฌํ”„ํŠธ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค (ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ ์•„๋‹˜)
12
  * `generated`: ๋ชจ๋ธ์ด testset instruction ์— ์ƒ์„ฑํ•œ ์‘๋‹ต์„ ๋„ฃ์œผ์„ธ์š”.
 
 
 
 
 
 
 
13
  * `task`: ์ „์ฒด ๊ฒฐ๊ณผ๋ฅผ subset์œผ๋กœ ๊ทธ๋ฃน์ง€์–ด์„œ ๋ณด์—ฌ์ค„ ๋•Œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. `evaluation prompt`๋ฅผ ํ–‰๋ณ„๋กœ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ๋•Œ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
14
+ * ์ถ”๊ฐ€
15
+ * ๋‹น์‹ ์ด ์‚ฌ์šฉํ•˜๋Š” ํ‰๊ฐ€ ํ”„๋กฌํ”„ํŠธ์— ๋”ฐ๋ผ์„œ ์ถ”๊ฐ€๋กœ ๋‹ค๋ฅธ ํ•„๋“œ๋“ค์„ ๋” ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์œ„์˜ ํ‚ค์›Œ๋“œ๋“ค์„ ํ”ผํ•ด์„œ ์ž์œ ๋กญ๊ฒŒ jsonl์— ์ถ”๊ฐ€ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
16
+ * ์˜ˆ์‹œ: translation_pair.yaml, translation_fortunecookie.yaml ํ”„๋กฌํ”„ํŠธ์˜ ๊ฒฝ์šฐ๋Š” `source_lang`, `target_lang` ํ•„๋“œ๋ฅผ jsonl ์—์„œ ์ฝ์–ด์„œ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.
17
 
18
+ ์˜ˆ๋ฅผ๋“ค์–ด translation_pair ํ”„๋กฌํ”„ํŠธ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ ๊ฐ jsonl ํŒŒ์ผ์€ ์•„๋ž˜์ฒ˜๋Ÿผ ์ƒ๊ฒผ์Šต๋‹ˆ๋‹ค.
 
19
  ```python
20
  # model1.jsonl
21
+ {"model_id": "๋ชจ๋ธ1", "task": "์˜ํ•œ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "Where should I go", "source_lang": "Korean", "target_lang": "English"}
22
+ {"model_id": "๋ชจ๋ธ1", "task": "ํ•œ์˜", "instruction": "1+1?", "generated": "1+1?", "source_lang": "English", "target_lang": "Korean"}
23
 
24
  # model2.jsonl -* model1.jsonl๊ณผ `instruction`์€ ๊ฐ™๊ณ  `generated`, `model_id` ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค!
25
+ {"model_id": "๋ชจ๋ธ2", "task": "์˜ํ•œ", "instruction": "์–ด๋””๋กœ ๊ฐ€์•ผํ•˜์˜ค", "generated": "๊ธ€์Ž„๋‹ค", "source_lang": "Korean", "target_lang": "English"}
26
+ {"model_id": "๋ชจ๋ธ2", "task": "ํ•œ์˜", "instruction": "1+1?", "generated": "2", "source_lang": "English", "target_lang": "Korean"}
 
27
  ...
28
  ..
 
29
 
30
+ ```
31
+ ๋ฐ˜๋ฉด `llmbar` ํ”„๋กฌํ”„ํŠธ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ, ๋ฒˆ์—ญํ‰๊ฐ€์ฒ˜๋Ÿผ `source_lang`, `target_lang` ํ•„๋“œ๊ฐ€ ์‚ฌ์šฉ๋˜์ง€ ์•Š์œผ๋ฉฐ ๋‹น์—ฐํžˆ jsonl์—๋„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š์œผ์…”๋„ ๋ฉ๋‹ˆ๋‹ค.
modules/nav.py CHANGED
@@ -24,7 +24,7 @@ def Navbar(sidebar_placeholder, toggle_hashstr: str = ""):
24
 
25
  st.page_link(
26
  "app.py",
27
- label="Varco Arena 구동" if st.session_state.korean else "Run VARCO Arena",
28
  icon="๐Ÿ”ฅ",
29
  )
30
  st.page_link(
 
24
 
25
  st.page_link(
26
  "app.py",
27
+ label="Arena-Lite 구동" if st.session_state.korean else "Run Arena-Lite",
28
  icon="๐Ÿ”ฅ",
29
  )
30
  st.page_link(
pages/brief_intro.py CHANGED
@@ -23,7 +23,7 @@ else:
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
- | |Current Practice|Varco Arena|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
@@ -32,9 +32,9 @@ st.markdown(
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
- "Varco Arena๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
- "Varco Arena takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
 
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
+ | |Current Practice|Arena-Lite|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
 
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
+ "Arena-Lite๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
+ "Arena-Lite takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
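
As a quick sanity check of the comparison table in this hunk, take $$n_{\text{model}} = 5$$ models and $$|X| = 100$$ test prompts, and read the logarithm as base 2 (the table does not state the base):

$$
n_{\text{model}}\cdot|X| = 500 \ \text{matches (current practice)}, \qquad (n_{\text{model}}-1)\cdot|X| = 400 \ \text{matches (Arena-Lite)},
$$

$$
\text{matches per LLM (Arena-Lite)} \in \left[\,|X|,\ |X|\log_2 n_{\text{model}}\,\right] = \left[100,\ \approx 232\right].
$$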
pages/see_results.py CHANGED
@@ -60,9 +60,9 @@ def main():
60
 
61
  if result_select is None:
62
  if st.session_state.korean:
63
- st.markdown("결과를 확인하려면 먼저 **🔥VARCO Arena를 구동**하셔야 합니다")
64
  else:
65
- st.markdown("You should **🔥Run VARCO Arena** first to see results")
66
  st.image("streamlit_app_local/page_result_1.png")
67
  st.image("streamlit_app_local/page_result_2.png")
68
  st.image("streamlit_app_local/page_result_3.png")
@@ -334,18 +334,18 @@ def main():
334
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
335
  st.info(
336
  """
337
- Varco Arena์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
338
  * position bias (์™ผ์ชฝ)
339
  * length bias (์˜ค๋ฅธ์ชฝ)
340
 
341
- ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Varco Arena๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
342
  if st.session_state.korean
343
  else """
344
- In Varco Arena, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
345
  * position bias (left)
346
  * length bias (right)
347
 
348
- To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Varco Arena for this purpose!).
349
  """.strip()
350
  )
351
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
 
60
 
61
  if result_select is None:
62
  if st.session_state.korean:
63
+ st.markdown("결과를 확인하려면 먼저 **🔥Arena-Lite를 구동**하셔야 합니다")
64
  else:
65
+ st.markdown("You should **🔥Run Arena-Lite** first to see results")
66
  st.image("streamlit_app_local/page_result_1.png")
67
  st.image("streamlit_app_local/page_result_2.png")
68
  st.image("streamlit_app_local/page_result_3.png")
 
334
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
335
  st.info(
336
  """
337
+ Arena-Lite์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
338
  * position bias (์™ผ์ชฝ)
339
  * length bias (์˜ค๋ฅธ์ชฝ)
340
 
341
+ ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Arena-Lite๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
342
  if st.session_state.korean
343
  else """
344
+ In Arena-Lite, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
345
  * position bias (left)
346
  * length bias (right)
347
 
348
+ To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Arena-Lite for this purpose!).
349
  """.strip()
350
  )
351
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
streamlit_app_local/README.md CHANGED
@@ -1,4 +1,4 @@
1
- # Varco Arena web app
2
  ```bash
3
  cd ./streamlit_app_local/
4
  bash run.sh
 
1
+ # Arena-Lite web app
2
  ```bash
3
  cd ./streamlit_app_local/
4
  bash run.sh
streamlit_app_local/app.py CHANGED
@@ -51,7 +51,7 @@ def upload_files(uploaded_files) -> Path:
51
  if not uploaded_files:
52
  st.warning("❌ No files to upload. Please drag/drop or browse files to upload.")
53
  elif len(uploaded_files) < 2:
54
- st.error("❌ You need at least 2 jsonlines files to properly run VA.")
55
  else: # properly uploaded
56
  for file in uploaded_files:
57
  # Create a path for the file in the server directory
@@ -154,18 +154,18 @@ def main():
154
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
155
  )
156
 
157
- st.title("⚔️ VARCO ARENA ⚔️")
158
  if st.session_state.korean:
159
  st.write(
160
- """**๋ฐ”๋ฅด์ฝ” ์•„๋ ˆ๋‚˜๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
161
 
162
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
163
  )
164
  else:
165
  st.write(
166
- """**VARCO Arena is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
167
 
168
- VARCO Arena conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
169
  )
170
 
171
  st.divider()
@@ -261,9 +261,9 @@ def main():
261
  # Form for actual run
262
  with st.form("run_arena_form"):
263
  if st.session_state.korean:
264
- st.write("### 3. Varco Arena 구동하기")
265
  else:
266
- st.write("### 3. Run Varco Arena")
267
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
268
  exp_name = st.text_input("(Optional) Enter Exp. name")
269
  exp_name = exp_name.replace(
@@ -298,7 +298,7 @@ def main():
298
  "❌ Requirements: You have to upload jsonlines files first to proceed"
299
  )
300
  elif not api_key:
301
- st.error("❌ Requirements: OpenAI key required to run VA.")
302
  else:
303
  result_file_path, return_code = run_varco_arena(
304
  # upload_dir=st.session_state.upfiles_dir,
@@ -309,9 +309,9 @@ def main():
309
  evaluation_model=eval_model,
310
  )
311
  if return_code:
312
- st.error("❌ RuntimeError: An error occurred during Varco Arena run")
313
  else:
314
- st.success("✅ Varco Arena run completed successfully")
315
  st.session_state.result_file_path = result_file_path
316
  set_nav_bar(
317
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_run_done"
 
51
  if not uploaded_files:
52
  st.warning("❌ No files to upload. Please drag/drop or browse files to upload.")
53
  elif len(uploaded_files) < 2:
54
+ st.error("❌ You need at least 2 jsonlines files to properly run.")
55
  else: # properly uploaded
56
  for file in uploaded_files:
57
  # Create a path for the file in the server directory
 
154
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_init"
155
  )
156
 
157
+ st.title("⚔️ Arena-Lite (former VARCO ARENA) ⚔️")
158
  if st.session_state.korean:
159
  st.write(
160
+ """**Arena-Lite๋Š” ํ…Œ์ŠคํŠธ์…‹ ๋ช…๋ น์–ด๋ณ„๋กœ ๋น„๊ตํ•  ๋ชจ๋ธ(์ƒ์„ฑ๋ฌธ)์˜ ํ† ๋„ˆ๋จผํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋“ค์„ ์ข…ํ•ฉํ•˜์—ฌ ๋ชจ๋ธ๋“ค์˜ ์ˆœ์œ„๋ฅผ ๋งค๊ธฐ๋Š” ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ reference ์•„์›ƒํ’‹๊ณผ ๋น„๊ตํ•˜์—ฌ ์Šน๋ฅ ์„ ๋งค๊ธฐ๋Š” ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ •ํ™•ํ•˜๋ฉฐ ๋” ์ €๋ ดํ•ฉ๋‹ˆ๋‹ค.**
161
 
162
  ๋ชจ๋ฒ”๋‹ต์•ˆ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ์…‹ (50+ ํ–‰) ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํŽธ๋ฆฌํ•œ ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค."""
163
  )
164
  else:
165
  st.write(
166
+ """**Arena-Lite is an LLM benchmarking system that compares model responses across customized test scenarios (recommend >50 prompts) without requiring reference answers.**
167
 
168
+ Arena-Lite conducts tournaments between models to be compared for each test set command, ranking models accurately at an affordable price. This is more accurate and cost-effective than rating win rates by comparing against reference outputs."""
169
  )
170
 
171
  st.divider()
 
261
  # Form for actual run
262
  with st.form("run_arena_form"):
263
  if st.session_state.korean:
264
+ st.write("### 3. Arena-Lite 구동하기")
265
  else:
266
+ st.write("### 3. Run Arena-Lite")
267
  api_key = st.text_input("Enter your OpenAI API Key", type="password")
268
  exp_name = st.text_input("(Optional) Enter Exp. name")
269
  exp_name = exp_name.replace(
 
298
  "❌ Requirements: You have to upload jsonlines files first to proceed"
299
  )
300
  elif not api_key:
301
+ st.error("❌ Requirements: OpenAI key required to run.")
302
  else:
303
  result_file_path, return_code = run_varco_arena(
304
  # upload_dir=st.session_state.upfiles_dir,
 
309
  evaluation_model=eval_model,
310
  )
311
  if return_code:
312
+ st.error("❌ RuntimeError: An error occurred during Arena-Lite run")
313
  else:
314
+ st.success("✅ Arena-Lite run completed successfully")
315
  st.session_state.result_file_path = result_file_path
316
  set_nav_bar(
317
  False, sidebar_placeholder=sidebar_placeholder, toggle_hashstr="app_run_done"
streamlit_app_local/modules/nav.py CHANGED
@@ -16,7 +16,7 @@ def Navbar(sidebar_placeholder, toggle_hashstr: str = ""):
16
 
17
  st.page_link(
18
  "app.py",
19
- label="Varco Arena 구동" if st.session_state.korean else "Run VARCO Arena",
20
  icon="๐Ÿ”ฅ",
21
  )
22
  st.page_link(
 
16
 
17
  st.page_link(
18
  "app.py",
19
+ label="Arena-Lite 구동" if st.session_state.korean else "Run Arena-Lite",
20
  icon="๐Ÿ”ฅ",
21
  )
22
  st.page_link(
streamlit_app_local/pages/brief_intro.py CHANGED
@@ -23,7 +23,7 @@ else:
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
- | |Current Practice|Varco Arena|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
@@ -32,9 +32,9 @@ st.markdown(
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
- "Varco Arena๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
- "Varco Arena takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
 
23
  st.image("va_concept_new.png")
24
  st.markdown(
25
  """
26
+ | |Current Practice|Arena-Lite|
27
  |-|-|-|
28
  |Total no. matches|$$n_{\\text{model}}*\\|X\\|$$|$$(n_{\\text{model}}-1)*\\|X\\|$$|
29
  |No. matches per LLM|$$\\|X\\|$$|$$\\left[\\|X\\|,\\|X\\|\\text{log}n_{\\text{model}}\\right]$$|
 
32
  )
33
  if st.session_state.korean:
34
  st.info(
35
+ "Arena-Lite๋Š” ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ˆœ์œ„๋ฅผ ๋” ์ ์€ ํšŸ์ˆ˜์˜ ๋น„๊ต ๋‚ด์— ์–ป์–ด๋‚ด๋ฉฐ, ์ด๋Ÿฌํ•œ ํŠน์ง•์€ LLM ์ง์ ‘ ๋น„๊ต์˜ ์ด์ ์œผ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ธํ•ฉ๋‹ˆ๋‹ค."
36
  )
37
  else:
38
  st.info(
39
+ "Arena-Lite takes advantage of direct comparison between LLM responses to guarantee better reliability in fewer number of total matches."
40
  )
streamlit_app_local/pages/see_results.py CHANGED
@@ -354,18 +354,18 @@ def main():
354
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
355
  st.info(
356
  """
357
- Varco Arena์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
358
  * position bias (์™ผ์ชฝ)
359
  * length bias (์˜ค๋ฅธ์ชฝ)
360
 
361
- ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Varco Arena๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
362
  if st.session_state.korean
363
  else """
364
- In Varco Arena, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
365
  * position bias (left)
366
  * length bias (right)
367
 
368
- To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Varco Arena for this purpose!).
369
  """.strip()
370
  )
371
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
 
354
  with st.expander("펼쳐서 보기" if st.session_state.korean else "Expand to show"):
355
  st.info(
356
  """
357
+ Arena-Lite์—์„œ๋Š” position bias์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ๋ชจ๋ธ์ด A๋‚˜ B์œ„์น˜์— ๋ฒˆ๊ฐˆ์•„ ์œ„์น˜ํ•˜๋„๋ก ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ LLM Judge ํ˜น์€ Prompt์˜ ์„ฑ๋Šฅ์ด ๋ถ€์กฑํ•˜๋‹ค๊ณ  ๋А๊ปด์ง„๋‹ค๋ฉด, ์•„๋ž˜ ์•Œ๋ ค์ง„ LLM Judge bias๊ฐ€ ์ฐธ๊ณ ๊ฐ€ ๋ ๊ฒ๋‹ˆ๋‹ค.
358
  * position bias (์™ผ์ชฝ)
359
  * length bias (์˜ค๋ฅธ์ชฝ)
360
 
361
+ ๊ฒฐ๊ณผ์˜ ์™œ๊ณก์ด LLM Judge์˜ ๋ถ€์กฑํ•จ ๋–„๋ฌธ์ด์—ˆ๋‹ค๋Š” ์ ์„ ๊ทœ๋ช…ํ•˜๋ ค๋ฉด ์‚ฌ์šฉํ•˜์‹  LLM Judge์™€ Prompt์˜ binary classification ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•ด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค (Arena-Lite๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ด๋ฅผ ์ˆ˜ํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!).""".strip()
362
  if st.session_state.korean
363
  else """
364
+ In Arena-Lite, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference:
365
  * position bias (left)
366
  * length bias (right)
367
 
368
+ To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Arena-Lite for this purpose!).
369
  """.strip()
370
  )
371
  st.markdown(f"#### {judgename} + prompt = {eval_prompt_name}")
streamlit_app_local/view_utils.py CHANGED
@@ -16,7 +16,7 @@ from modules.nav import Navbar
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
- st.set_page_config(page_title="VARCO Arena", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
@@ -126,7 +126,7 @@ def compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000):
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
- WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Varco Arena or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
 
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
+ st.set_page_config(page_title="Arena-Lite", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
 
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
+ WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Arena-Lite or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
view_utils.py CHANGED
@@ -16,7 +16,7 @@ from modules.nav import Navbar
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
- st.set_page_config(page_title="VARCO Arena", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
@@ -126,7 +126,7 @@ def compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000):
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
- WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Varco Arena or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
 
16
  def default_page_setting(
17
  layout: Literal["wide", "centered"] = "centered",
18
  ):
19
+ st.set_page_config(page_title="Arena-Lite", layout=layout)
20
  sidebar_placeholder = st.sidebar.empty()
21
 
22
  css = f"""
 
126
  Y = np.zeros(n)
127
  Y[df["winner"] == "A"] = 1.0
128
 
129
+ WARNING = "elo.py:L{L} compute_mle_elo() // Warning: Seeing this message indicates the regression result for elo is unreliable. You should be test-running the Arena-Lite or something odd (perfect one-sided wins) is happening\n\nto avoid logistic regressor error, manually putting other class"
130
  if (Y == 0).all():
131
  print(WARNING.format(L=32))
132
  Y[-1] = 1.0
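
The `compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000)` signature and the `Y[df["winner"] == "A"]` line in this hunk suggest the usual Bradley-Terry MLE rating fitted by logistic regression (as popularized by Chatbot Arena). Below is a hedged, self-contained sketch of that technique for orientation; it is not the repository's implementation, and the `model_a` / `model_b` / `winner` column names are assumptions. A run would look like `mle_elo_sketch(matches_df)` where `matches_df` has one row per judged match.

```python
# Sketch of MLE Elo via logistic regression (Bradley-Terry), the technique the
# compute_mle_elo() hunk above appears to use. NOT the repository's code; the
# model_a / model_b / winner column names are assumptions, and both outcomes
# must appear in Y (the hunk shows the repo patching the all-one-class case).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression  # penalty=None needs scikit-learn >= 1.2


def mle_elo_sketch(df: pd.DataFrame, SCALE=400, BASE=10, INIT_RATING=1000) -> pd.Series:
    models = pd.concat([df["model_a"], df["model_b"]]).unique()
    idx = {m: i for i, m in enumerate(models)}

    # One row per match: +log(BASE) for the A-side model, -log(BASE) for the B-side.
    X = np.zeros((len(df), len(models)))
    for row, (a, b) in enumerate(zip(df["model_a"], df["model_b"])):
        X[row, idx[a]] = np.log(BASE)
        X[row, idx[b]] = -np.log(BASE)
    Y = (df["winner"] == "A").astype(float).to_numpy()

    lr = LogisticRegression(fit_intercept=False, penalty=None)
    lr.fit(X, Y)

    ratings = SCALE * lr.coef_[0] + INIT_RATING
    return pd.Series(ratings, index=models).sort_values(ascending=False)
```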