Arena-Lite (formerly VARCO Arena)
Arena-Lite runs a tournament among the models under comparison for every prompt in the test set, ranking models accurately at an affordable price. This is more accurate and cost-effective than estimating win rates by comparing each model's outputs against reference outputs.
For more information, the following may help you understand how it works.
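As a piece of intuition, here is a conceptual sketch of the per-prompt tournament idea (illustrative only, not Arena-Lite's actual implementation; the `judge` callable and bracket details are simplified assumptions):

import random
from collections import Counter

def run_tournament(prompt, model_outputs, judge):
    """Single-elimination bracket over models for one test prompt (conceptual sketch)."""
    contenders = list(model_outputs)          # model names (dict: name -> output text)
    random.shuffle(contenders)                # simplified bracket seeding
    wins = Counter()
    while len(contenders) > 1:
        next_round = []
        for a, b in zip(contenders[::2], contenders[1::2]):
            verdict = judge(prompt, model_outputs[a], model_outputs[b])  # returns 'a' or 'b'
            winner = a if verdict == "a" else b
            wins[winner] += 1
            next_round.append(winner)
        if len(contenders) % 2:               # odd model out gets a bye this round
            next_round.append(contenders[-1])
        contenders = next_round
    return contenders[0], wins

Aggregating match results like these across all test-set prompts then yields the overall ranking.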
Quickstart
Running the Web Demo locally (Streamlit, recommended!)
git clone [THIS_REPO]
# install the requirements listed below first; we recommend miniforge to manage the environment
cd streamlit_app_local
bash run.sh
For more details, see [THIS_REPO]/streamlit_app_local/README.md
CLI use
- Located at `varco_arena/`
- Debug configurations for VS Code at `varco_arena/.vscode`
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"
# dbg lines
## openai api judge dbg
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge dbg (checking errors without api requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
Requirements
pip install -r requirements.txt # python 3.11
# Linux
uvloop
# Windows
winloop
Arguments
- `-i, --input` : directory path containing the input JSONL files (LLM outputs)
- `-o, --output_dir` : directory where the results will be written
- `-e, --evaluation` : judge model specification (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", [vllm-served-model-name])
- `-k, --openai_api_key` : OpenAI API key
- `-u, --openai_url` : URL of an OpenAI-compatible LLM server (requests are made via the OpenAI SDK)

advanced
- `-j, --n_jobs` : number of concurrent jobs, passed to `asyncio.Semaphore(n=...)` (see the sketch after this list)
- `-p, --evalprompt` : evaluation prompt to use; see the `varco_arena/varco_arena_core/prompts/` directory
- `-lr, --limit_requests` : vLLM OpenAI server request limit (default: 7,680)
- `-lt, --limit_tokens` : vLLM OpenAI server token limit (default: 15,728,640)
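For intuition on `-j / --n_jobs`, concurrency is capped with an `asyncio.Semaphore` roughly like the sketch below (a simplified illustration, not the repository's actual code; `call_judge` is a dummy placeholder):

import asyncio

async def call_judge(match):
    # dummy stand-in for a real judge API request
    await asyncio.sleep(0.1)
    return f"winner of {match}"

async def judge_all(matches, n_jobs=8):
    sem = asyncio.Semaphore(n_jobs)       # what -j / --n_jobs controls

    async def judge_one(match):
        async with sem:                   # at most n_jobs requests in flight at once
            return await call_judge(match)

    return await asyncio.gather(*(judge_one(m) for m in matches))

if __name__ == "__main__":
    print(asyncio.run(judge_all([f"match-{i}" for i in range(20)], n_jobs=4)))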
Input Data Format
Contributing & Customizing
Run this once after cloning and installing the requirements:
pip install pre-commit
pre-commit install
Before committing:
bash precommit.sh # the black formatter will reformat the code
📝 Adding a Custom Prompt
Here's how to add a new evaluation prompt. The process has been simplified recently: the judge logic now relies only on the `parsed_output` method.
The easiest way is to copy `llmbar_brief.py` and `llmbar_brief.yaml` to create your own prompt.
1. Create Prompt `.py` and `.yaml` Files
- Create files like `my_prompt.py` and `my_prompt.yaml` in the `varco_arena/varco_arena_core/prompts/` directory.
- `my_prompt.py` (see the sketch after this list):
  - Define a class that inherits from `ComparisonPromptBase`.
  - You must implement the `parsed_output(self, response)` method. It should take the LLM judge's `response` and return a decision token (e.g., `'a'`, `'b'`) indicating the winner.
- `my_prompt.yaml`:
  - Define the elements your prompt needs, such as `sampling_parameters`, `decision_tokens`, and `prompt_template`.
  - The strings in `prompt_template` are processed by `string.Template` and finalized in `eval_utils.py` via the `BasePrompt.complete_prompt()` function.
  - Do not use `${task}` in `prompt_template`; it is a reserved keyword due to the llmbar prompt.
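A minimal sketch of `my_prompt.py`, assuming a text-like judge `response` and guessing the base-class import path (mirror `llmbar_brief.py` for the actual conventions):

# my_prompt.py -- minimal sketch; the import path and the shape of `response` are
# assumptions. Copy llmbar_brief.py in the same directory for the real conventions.
from .base_prompt import ComparisonPromptBase  # hypothetical module name


class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        """Return the decision token ('a' or 'b') indicating the winning side."""
        verdict = str(response).strip().lower()  # assumes `response` renders as the judge's reply text
        if verdict.endswith("a"):
            return "a"
        if verdict.endswith("b"):
            return "b"
        raise ValueError(f"could not parse a decision from judge response: {verdict!r}")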
2. Register the Prompt in prompts/__init__.py
- Import your new prompt class:
  from .my_prompt import MyPrompt
- Add your new prompt's name and class instance to the `NAME2PROMPT_CLS` dictionary:
  NAME2PROMPT_CLS = dict(
      # ... other prompts
      my_prompt=MyPrompt(),
  )
- Add the new prompt name to the `Literal` type hint of the `promptname` argument in the `load_prompt` function:
  def load_prompt(
      promptname: Literal[
          # ... other prompt names
          "my_prompt",
      ],
      # ...
  ):
3. Add the Prompt to eval_prompt_list.txt
- Open the `eval_prompt_list.txt` file in the project root and add the name of your new prompt (`my_prompt`) on a new line.
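For illustration, `eval_prompt_list.txt` might then read as follows (the pre-existing entries shown are just prompt names mentioned elsewhere in this README; your actual file may differ):

llmbar_brief
translation_fortunecookie
my_prompt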
4. (Recommended) Test and Debug
- It is highly recommended to debug your prompt to ensure it works as expected.
- In the `.vscode/launch.json` file, modify the `"VA"` configuration's `args`:
  - Change `"-p", "translation_fortunecookie"` to `"-p", "my_prompt"`.
  - If necessary, update the `"-i", "..."` argument to the path of test data suitable for the new prompt.
- Go to the Run and Debug tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
- Find `result.json` inside the output directory you specified after `-o`. It shows every judge prompt used for each match.
FAQ
- I want to apply my custom judge prompt to run Arena-Lite.
  - `./varco_arena/prompts/` defines each prompt with a `yaml` file and a corresponding class object. Edit those as you need.
- I want a tailored judge prompt for each row of the test set (e.g. the 100th row uses `prompt1`, the 101st uses `prompt2`).
  - As noted above, `load_prompt` receives `promptname` + `task` as parameters to load the prompt. The function is called in `./varco_arena/manager.py:async_run`. A rough sketch follows below.
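Purely as a hypothetical sketch of per-row prompt selection (the import path, keyword usage, and row fields are assumptions; the real call happens in `manager.py:async_run`):

# Hypothetical: select a judge prompt per test-set row.
from varco_arena_core.prompts import load_prompt  # assumed import path

def prompt_for_row(row: dict):
    promptname = row.get("prompt_name", "llmbar_brief")   # "prompt_name" is a made-up per-row field
    return load_prompt(promptname, task=row.get("task"))  # promptname + task, as described above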
Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) github
- query wrapper
- rag prompt
- Jumin Oh (@Generation Model Team, NCSOFT)
- overall prototyping of the system in haste
Citation
If you found our work helpful, consider citing our paper!
@misc{son2024varcoarenatournamentapproach,
title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
year={2024},
eprint={2411.01281},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.01281},
}
Paper Experimental Results (raw data)
https://huggingface.co/datasets/fgenie777/Arena-Lite-Experiments-Result-Data/tree/main