Quantitative Capability Evaluation Table for Large Language Models - LLMs Leaderboard - 250824
This table summarizes how widely used large language models perform on widely used benchmark leaderboards and combines those results into an overall ranking. The leaderboards cover several areas, including human preference, knowledge and reasoning, mathematics, and coding.
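The text does not say how the individual leaderboard results are combined into an overall ranking. As one possible illustration only, the sketch below ranks models within each benchmark and then averages those ranks. The function name `overall_ranking`, the benchmark keys, and the placeholder models and scores are all hypothetical; none of them are data or methods taken from this leaderboard.

```python
from statistics import mean


def overall_ranking(scores):
    """Combine per-benchmark scores into one ranking by mean rank.

    scores: {benchmark_name: {model_name: score}}, higher score is better.
    Only models that appear in every benchmark are ranked (an assumption
    of this sketch, not a rule stated by the leaderboard).
    Returns a list of (model_name, mean_rank), best first.
    """
    if not scores:
        return []
    # Keep only models that have a score on every benchmark.
    common = set.intersection(*(set(s) for s in scores.values()))
    ranks = {model: [] for model in common}
    for bench_scores in scores.values():
        # Rank 1 is the highest score; ties are broken by model name.
        ordered = sorted(common, key=lambda m: (-bench_scores[m], m))
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    # A lower mean rank is better; the name breaks remaining ties.
    return sorted(
        ((model, mean(r)) for model, r in ranks.items()),
        key=lambda item: (item[1], item[0]),
    )


# Hypothetical placeholder data, not real leaderboard results.
example_scores = {
    "human_preference": {"model_a": 1400, "model_b": 1450, "model_c": 1380},
    "math": {"model_a": 90.0, "model_b": 85.0, "model_c": 70.0},
    "coding": {"model_a": 60.0, "model_b": 75.0, "model_c": 80.0, "model_d": 50.0},
}

ranking = overall_ranking(example_scores)
for model, avg_rank in ranking:
    print(f"{model}: mean rank {avg_rank:.2f}")
```

Averaging ranks rather than raw scores avoids mixing scales such as Elo-style ratings and percentage accuracy. A model missing from any benchmark (here `model_d`) is left out, which mirrors the note in the update that some models may only enter the list once the other leaderboards publish their data.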
Update on August 24, 2025:
GPT-5 remains far ahead of the competition. Among free models, Gemini-2.5-Pro is the strongest. Among models from mainland China, Qwen3-Thinking-235B is the strongest.
New entries on the leaderboard are DeepSeek V3.1 (DeepSeek), Exaone 4.0 (LG AI Research), and Mistral-Medium (Mistral AI). DeepSeek V3.1 Thinking is especially strong in coding and mathematical reasoning.
Claude Opus 4.1 Thinking scores 1451 in Text Arena, and the non-thinking variant scores 1439. Once the other leaderboards update their data, both models are expected to make the list and may even reach the top 10.