# Arena-Lite (formerly Arena-Lite)

Arena-Lite ranks models accurately by running a tournament among the models under comparison for each instruction in the test set. This is more accurate, and somewhat cheaper, than computing win rates against reference outputs.

For more details, see the links below.

* [Paper](https://arxiv.org/abs/2411.01281)
* [NCSOFT Tech Blog (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
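
To make the tournament idea concrete, here is a minimal Python sketch. It is not the repository's implementation. It runs one single-elimination bracket per test prompt and ranks models by their total number of wins. The names `length_judge`, `run_tournament`, and `rank_models` are hypothetical, the judge is a stand-in that simply prefers the longer output where an LLM judge would be called, and the plain win count is a simplification of whatever rating Arena-Lite derives from the match results.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

# (prompt, output_a, output_b) -> "a" or "b"
Judge = Callable[[str, str, str], str]


def length_judge(prompt: str, output_a: str, output_b: str) -> str:
    """Stand-in judge (assumption): prefers the longer output."""
    return "a" if len(output_a) >= len(output_b) else "b"


def run_tournament(
    prompt: str, outputs: Dict[str, str], judge: Judge, rng: random.Random
) -> Counter:
    """Run one single-elimination bracket and return the wins per model."""
    wins = Counter({model: 0 for model in outputs})
    contenders = list(outputs)
    rng.shuffle(contenders)
    while len(contenders) > 1:
        next_round: List[str] = []
        # With an odd number of contenders, one of them advances without a match.
        if len(contenders) % 2 == 1:
            next_round.append(contenders.pop())
        for left, right in zip(contenders[0::2], contenders[1::2]):
            verdict = judge(prompt, outputs[left], outputs[right])
            winner = left if verdict == "a" else right
            wins[winner] += 1
            next_round.append(winner)
        contenders = next_round
    return wins


def rank_models(
    testset: Dict[str, Dict[str, str]], judge: Judge, seed: int = 0
) -> List[Tuple[str, int]]:
    """Sum wins over every prompt's bracket and sort models by total wins."""
    rng = random.Random(seed)
    totals: Counter = Counter()
    for prompt, outputs in testset.items():
        totals.update(run_tournament(prompt, outputs, judge, rng))
    return totals.most_common()


if __name__ == "__main__":
    demo = {"Say hi": {"m1": "hi", "m2": "hello", "m3": "hello there", "m4": "h"}}
    print(rank_models(demo, length_judge))
```

With `n` models, a bracket needs only `n - 1` judge calls per prompt, which is where the cost advantage of a tournament over exhaustive pairwise comparison comes from.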
## Quickstart
### Start with the local Streamlit app (recommended!)
```bash
git clone [THIS_REPO]
# Install the requirements listed below. We recommend miniforge for managing the environment.
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`.
### CLI usage
* The CLI and the web app share the same code, which lives in the directory below.
  * `varco_arena/`
* Test commands for each preset prompt, intended for debugging in VS Code, are written in the following file.
  * `varco_arena/.vscode/launch.json`
```bash
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"
# dbg lines
## openai api judge dbg
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge dbg (checking errors without api requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```
## Requirements
```
pip install -r requirements.txt # python 3.11
# On Linux
pip install uvloop
# On Windows
pip install winloop
```
#### Argument
- -i, --input : input file, directory, or a regular expression over file names
- -o, --output_dir : directory where the output files are saved
- -e, --evaluation : judge model (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", or the name of a model served with vLLM)
- -m, --matching_method : matching method (default "tournament"; "league" is available but not recommended)
- -k, --openai_api_key : OpenAI API key
- -u, --openai_url : URL (IP address + port) of the local vLLM OpenAI server, when one is used
#### advanced
- -j, --n_jobs : argument passed to `asyncio.Semaphore()`. If the arena does not make progress, try lowering it below the default of 32.
- -p, --evalprompt : [see this directory](./varco_arena/prompts/*.yaml)
- -lr, --limit_requests : request limit for the vLLM OpenAI server (default: 7,680)
- -lt, --limit_tokens : token limit for the vLLM OpenAI server (default: 15,728,640)
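
As an illustration of what `--n_jobs` controls, the following sketch shows how an `asyncio.Semaphore` caps the number of judge requests in flight. It is a minimal example under stated assumptions, not the code in this repository: `judge_match` and `run_matches` are hypothetical names, and the `asyncio.sleep` call stands in for a real API request.

```python
import asyncio
from typing import Dict


async def judge_match(
    match_id: int, semaphore: asyncio.Semaphore, tracker: Dict[str, int]
) -> int:
    """Pretend to judge one match while holding a semaphore slot."""
    async with semaphore:
        tracker["running"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["running"])
        await asyncio.sleep(0.01)  # placeholder for the judge API request
        tracker["running"] -= 1
    return match_id


async def run_matches(n_matches: int, n_jobs: int) -> Dict[str, int]:
    """Run all matches concurrently, with at most n_jobs in flight at once."""
    semaphore = asyncio.Semaphore(n_jobs)
    tracker = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(judge_match(i, semaphore, tracker) for i in range(n_matches))
    )
    tracker["completed"] = len(results)
    return tracker


if __name__ == "__main__":
    print(asyncio.run(run_matches(n_matches=20, n_jobs=4)))
```

A lower `n_jobs` means fewer simultaneous requests, which can help when the judge endpoint stalls under load or rate-limits the client.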
#### Input Data Format
[Input jsonl guide](./streamlit_app_local/guide_mds/input_jsonls_kr.md)
## Contributing & Customizing
#### After git clone and installing dependencies
```bash
pip install pre-commit
pre-commit install
```
#### Before committing
```bash
bash precommit.sh # this reformats all of the code
```
### Adding a custom prompt
The process for adding a new evaluation prompt is as follows. The judge logic was recently simplified to use only the `parsed_output` method, so adding a prompt is easier than before.

The simplest approach is to copy `llmbar_brief.py` and `llmbar_brief.yaml` and turn the copies into your own prompt.
#### 1. Create the prompt `.py` and `.yaml` files
- Create files such as `my_prompt.py` and `my_prompt.yaml` under `varco_arena/varco_arena_core/prompts/`.
- **`my_prompt.py`**:
  - Define a class that inherits from `ComparisonPromptBase`.
  - You must implement the `parsed_output(self, response)` method. It receives the LLM judge's response (`response`) and must return the decision token that indicates the winner (e.g. `'a'`, `'b'`).
- **`my_prompt.yaml`**:
  - Define the elements the prompt needs, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
  - The string in `prompt_template` is processed as a `string.Template` and is finalized in `eval_utils.py` through the `BasePrompt.complete_prompt()` function.
  - Do not use `${task}`, `${generated}`, or `${model_id}` in `prompt_template`; they are reserved keywords.
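
The two files described above can be sketched roughly as follows. This is an illustrative outline under several assumptions, not the repository's code: `ComparisonPromptBase` is replaced by an empty placeholder so the snippet runs on its own, `response` is assumed to be plain text (in the repository it may be a richer object), the judge is assumed to answer with `(a)` or `(b)`, and the placeholder names `${instruction}`, `${response_a}`, and `${response_b}` are hypothetical, chosen only to avoid the reserved keywords.

```python
import re
from string import Template


class ComparisonPromptBase:
    """Placeholder (assumption) for the repository's base class."""


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response: str) -> str:
        """Return the decision token, 'a' or 'b', found in the judge's reply."""
        # Assumes the judge was told to answer with "(a)" or "(b)".
        matches = re.findall(r"\(([ab])\)", response.lower())
        if not matches:
            raise ValueError(f"no decision token in response: {response!r}")
        # If the judge mentions both options, take the last one as the verdict.
        return matches[-1]


# Hypothetical template text, as it might appear under `prompt_template`
# in my_prompt.yaml. string.Template fills in the ${...} placeholders.
PROMPT_TEMPLATE = Template(
    "Instruction: ${instruction}\n"
    "(a) ${response_a}\n"
    "(b) ${response_b}\n"
    "Which response is better? Answer with (a) or (b)."
)

filled = PROMPT_TEMPLATE.substitute(
    instruction="Translate 'hello' into French.",
    response_a="bonjour",
    response_b="hola",
)

if __name__ == "__main__":
    print(filled)
    print(MyPrompt().parsed_output("The better answer is (a)."))
```

`Template.substitute` raises `KeyError` when a placeholder has no value, which surfaces a misspelled placeholder name early instead of sending a half-filled prompt to the judge.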
#### 2. Register the prompt in `prompts/__init__.py`
- `import` the prompt class you created.
```python
from .my_prompt import MyPrompt
```
- Add the new prompt's name and an instance of its class to the `NAME2PROMPT_CLS` dictionary.
```python
NAME2PROMPT_CLS = dict(
    # ... existing prompts
    my_prompt=MyPrompt(),
)
```
- Add the new prompt's name to the `Literal` type hint of the `promptname` argument of the `load_prompt` function.
```python
def load_prompt(
    promptname: Literal[
        # ... existing prompt names
        "my_prompt",
    ],
    # ...
):
    ...  # existing function body stays unchanged
```
#### 3. Add the prompt to `eval_prompt_list.txt`
- Open `eval_prompt_list.txt` in the project root and add the new prompt's name (`my_prompt`) on a new line.
#### 4. (Recommended) Test and debug
- Debugging is recommended to confirm that the prompt works as intended.
- In `.vscode/launch.json`, edit the `args` of the `"VA"` configuration as follows.
  - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
  - If needed, set the `"-i", "..."` entry to a test-data path that suits the new prompt.
- Open the `Run and Debug` tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to start the debugger.
- Find `result.json` in the output directory given after `-o` and check that the run behaved as intended. It contains the prompt information used for every judge match.
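
For a quick first look at `result.json`, a small helper like the one below can report the top-level shape of the file before you open it in an editor. This is only a convenience sketch: `describe_json` is a hypothetical helper, it assumes nothing about the actual schema of `result.json`, and the sample content written to the temporary file is made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def describe_json(path: Path, max_keys: int = 5) -> str:
    """Summarize the top-level structure of a JSON file without assuming a schema."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return f"dict with {len(data)} keys; first keys: {list(data)[:max_keys]}"
    if isinstance(data, list):
        first = type(data[0]).__name__ if data else "n/a"
        return f"list with {len(data)} items; first item type: {first}"
    return f"{type(data).__name__} value"


# Demonstration on a throwaway file with made-up content (not the real schema).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "result.json"
    sample.write_text(json.dumps([{"placeholder_field": 1}]), encoding="utf-8")
    summary = describe_json(sample)

print(summary)
```

To inspect a real run, call `describe_json(Path("SOME_WANTED_TARGET_DIR") / "result.json")` with the directory you passed to `-o`.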
Contact: Seonil Son
* I want to use a prompt I wrote myself
  * Prompts are loaded from [`./varco_arena/prompts/`](./varco_arena_core/prompts/__init__.py), where they are defined as prompt classes and `yaml` files. Write yours by following the presets.
* I want to use a different evaluation prompt per test set (e.g. a different prompt depending on the task)
  * Prompts are loaded through the `load_prompt` function in the link above, using `promptname` + `task`, in [`./varco_arena_core/manager.py:async_run`](./varco_arena_core/manager.py).
## Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
  - query wrapper
  - rag prompt
- Ju-Min Oh (@Generation Model Team, NCSOFT)
  - overall prototyping of the system in haste
## Citation
If our work helped you, could we ask for your help in return?
```
@misc{son2024varcoarenatournamentapproach,
      title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}
```
## Experiment data from the paper (for reproduction)
https://huggingface.co/datasets/fgenie777/Arena-Lite-Experiments-Result-Data/tree/main