Welcome to try the Chatbot Arena voting demo. Keep in mind that each benchmark has its limitations. Please consider the results as guiding references. See our discussion below for more technical details. Evaluating Chatbots with MT-bench and Arena Motivation While several benchmarks exist for evaluating Large Language Model's (LLM) performance, such as MMLU, HellaSwag, and HumanEval, we noticed