
Arena-Lite (formerly VARCO Arena)

Arena-Lite runs a tournament among the models being compared for every test-set prompt, producing an accurate ranking at an affordable cost. This is more accurate and more cost-effective than estimating win rates by comparing each model's output against a reference output.
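
Conceptually, each test-set prompt hosts a knockout bracket among the compared models: an LLM judge picks the winner of every pairwise match, and match results across all prompts are aggregated into an overall ranking. The sketch below only illustrates that idea, assuming a generic judge callable and naive win counting; it is not Arena-Lite's actual bracket construction or aggregation.

from collections import Counter
import random

def run_bracket(prompt, outputs, judge):
    """Single-elimination bracket over one prompt's model outputs.

    outputs: dict mapping model name -> generated text
    judge:   callable(prompt, text_a, text_b) returning 'a' or 'b'
    Returns a Counter of match wins per model for this prompt.
    """
    contenders = list(outputs)
    random.shuffle(contenders)  # naive seeding; the real system may arrange brackets differently
    wins = Counter()
    while len(contenders) > 1:
        next_round = []
        for i in range(0, len(contenders) - 1, 2):
            a, b = contenders[i], contenders[i + 1]
            winner = a if judge(prompt, outputs[a], outputs[b]) == "a" else b
            wins[winner] += 1
            next_round.append(winner)
        if len(contenders) % 2 == 1:  # the odd model out gets a bye this round
            next_round.append(contenders[-1])
        contenders = next_round
    return wins

# Summing the per-prompt win counters over the whole test set gives a simple
# overall ranking; the paper describes the aggregation Arena-Lite actually uses.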

For more information, the following sections may help you understand how it works.

Quickstart

Running the Web Demo Locally (Streamlit, recommended!)

git clone [THIS_REPO]
# install the requirements listed below; we recommend miniforge to manage the environment
cd streamlit_app_local
bash run.sh

For more details, see [THIS_REPO]/streamlit_app_local/README.md

CLI use

  • CLI code is located at varco_arena/
  • Debug configurations for VS Code are at varco_arena/.vscode
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# debugging lines
## OpenAI API judge debugging
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other test inputs
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge debugging (checks for errors without sending API requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug

Requirements

pip install -r requirements.txt # python 3.11

# platform-specific async event-loop dependency
# Linux
uvloop
# Windows
winloop

Arguments

  • -i, --input : directory containing the input jsonlines files (LLM outputs)
  • -o, --output_dir : directory where results will be written
  • -e, --evaluation : judge model specification (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", [vllm-served-model-name])
  • -k, --openai_api_key : OpenAI API key
  • -u, --openai_url : URL of an OpenAI-compatible LLM server (requests are made through the openai SDK)

Advanced

  • -j, --n_jobs : number of concurrent jobs allowed through asyncio.Semaphore
  • -p, --evalprompt : evaluation prompt name; see the prompts directory (varco_arena/varco_arena_core/prompts/)
  • -lr, --limit_requests : vLLM OpenAI server request limit (default: 7,680)
  • -lt, --limit_tokens : vLLM OpenAI server token limit (default: 15,728,640)

Input Data Format

See the input jsonl guides for how to format your files.

Contributing & Customizing

Run the following after cloning the repository and installing the requirements:

pip install pre-commit
pre-commit install

Before committing:

bash precommit.sh # the black formatter will reformat the code

📝 Adding a Custom Prompt

Here’s how to add a new evaluation prompt. The process has been simplified recently, as the Judge logic now only relies on the parsed_output method.

The easiest way is to copy llmbar_brief.py and llmbar_brief.yaml to create your own prompt.

1. Create Prompt .py and .yaml Files

  • Create files like my_prompt.py and my_prompt.yaml in the varco_arena/varco_arena_core/prompts/ directory.
  • my_prompt.py:
    • Define a class that inherits from ComparisonPromptBase.
    • You must implement the parsed_output(self, response) method. It should take the LLM judge's response and return a decision token (e.g., 'a', 'b') indicating the winner; a hedged sketch follows this list.
  • my_prompt.yaml:
    • Define necessary elements for your prompt, such as sampling_parameters, decision_tokens, and prompt_template.
    • The strings in prompt_template are processed by string.Template and finalized in eval_utils.py via the BasePrompt.complete_prompt() function.
    • Do not use ${task} in prompt_template. It is a keyword reserved by the llmbar prompt.
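
A minimal sketch of what my_prompt.py could look like. The import path of ComparisonPromptBase and the response-parsing logic are assumptions for illustration; only the requirement to implement parsed_output comes from the steps above, and the real decision tokens depend on what you define in my_prompt.yaml.

# my_prompt.py -- illustrative sketch only
# NOTE: the module that exports ComparisonPromptBase is an assumption; check the prompts/ package.
from .base_prompt import ComparisonPromptBase

class MyPrompt(ComparisonPromptBase):
    def parsed_output(self, response):
        # Map the judge LLM's raw response to a decision token:
        # 'a' -> the first output wins, 'b' -> the second output wins.
        verdict = str(response).strip().lower()
        return "a" if "(a)" in verdict else "b"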

2. Register the Prompt in prompts/__init__.py

  • Import your new prompt class:
    from .my_prompt import MyPrompt
    
  • Add your new prompt's name and class instance to the NAME2PROMPT_CLS dictionary:
    NAME2PROMPT_CLS = dict(
        # ... other prompts
        my_prompt=MyPrompt(),
    )
    
  • Add the new prompt name to the Literal type hint for the promptname argument in the load_prompt function:
    def load_prompt(
        promptname: Literal[
            # ... other prompt names
            "my_prompt",
        ],
        # ...
    ):
    

3. Add the Prompt to eval_prompt_list.txt

  • Open the eval_prompt_list.txt file in the project root and add the name of your new prompt (my_prompt) on a new line.

4. (Recommended) Test and Debug

  • It is highly recommended to debug your prompt to ensure it works as expected.
  • In the .vscode/launch.json file, modify the "VA" configuration's args:
    • Change "-p", "translation_fortunecookie" to "-p", "my_prompt".
    • If necessary, update the "-i", "..." argument to the path of your test data suitable for the new prompt.
  • Go to the Run and Debug tab in VS Code (Ctrl+Shift+D), select the "VA" configuration, and press F5 to run the debugger.
  • Find result.json inside the output directory you specified after -o. It will show every judge prompt used for each match.

FAQ

  • I want to apply my custom judge prompt to run Arena-Lite
    • ./varco_arena/prompts/ defines each prompt with a yaml file plus a corresponding prompt class. Edit those as needed.
  • I want tailored judge prompts for each row of the test set (e.g. 100th row - prompt1, 101st row - prompt2)
    • The load_prompt function receives promptname and task as parameters to load a prompt; it is called in ./varco_arena/manager.py:async_run. A hedged sketch of per-row prompt dispatch follows this list.
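
A hedged sketch of per-row prompt dispatch. Only the fact that load_prompt takes promptname and task (and is called from manager.py:async_run) comes from the answer above; the row's "task" field, the mapping, and the fallback logic are illustrative assumptions.

# Illustrative only: choose a judge prompt per test-set row via its task value.
TASK2PROMPT = {
    "translation": "translation_fortunecookie",  # prompt name mentioned elsewhere in this README
    "default": "llmbar_brief",                   # prompt name mentioned elsewhere in this README
}

def pick_promptname(task: str) -> str:
    # fall back to a default judge prompt for unmapped tasks
    return TASK2PROMPT.get(task, TASK2PROMPT["default"])

# Conceptually, inside the evaluation loop (the exact call signature may differ):
# prompt = load_prompt(pick_promptname(row["task"]), task=row["task"])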

Special Thanks to (contributors)

  • Minho Lee (@Dialogue Model Team, NCSOFT) github
    • query wrapper
    • rag prompt
  • Jumin Oh (@Generation Model Team, NCSOFT)
    • overall prototyping of the system in haste

Citation

If you found our work helpful, consider citing our paper!

@misc{son2024varcoarenatournamentapproach,
      title={VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}

Paper Experimental Results (raw data)

https://huggingface.co/datasets/fgenie777/Arena-Lite-Experiments-Result-Data/tree/main