How are evaluation results generated for existing multilingual benchmarks that consist of queries only?
#2
by haidequanbu · opened
Thank you to the authors for your contribution to the open-source community.
While reading the paper, I noticed that PolyGuard is trained on prompt-response pairs. Could you clarify how evaluation is conducted on test sets that consist of prompts only, such as MultiJ and XSafety?