LLM排行榜：25/08/24 - LLM Leaderboard: 25/08/24

发表于 - Posted on 2025/08/24 编辑于 - Edited on 2025/08/30 系列 - Series LLM排行榜 - LLM Leaderboard

本表格汇总了常用大语言模型在常用评测榜单上的表现，整合评测结果，得到综合排名。榜单涵盖人类偏好、知识与推理能力、数学能力、代码能力等多个方面。

This table summarizes how popular large language models perform on widely used benchmark leaderboards and combines those results into an overall ranking. The leaderboards cover human preference, knowledge and reasoning, mathematics, and coding.
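The post does not state how the individual leaderboards are combined into the overall ranking. As an illustration only, the sketch below assumes one simple option: rank the models within each benchmark, then order them by mean rank. The model names, benchmark keys, and scores are hypothetical placeholders, not values from the actual table.

```python
from statistics import mean

# Hypothetical scores (higher is better); NOT taken from the real leaderboard.
scores = {
    "human_preference": {"model_a": 1450, "model_b": 1440, "model_c": 1400},
    "knowledge_reasoning": {"model_a": 85.0, "model_b": 88.0, "model_c": 80.0},
    "math": {"model_a": 92.0, "model_b": 90.0, "model_c": 70.0},
    "coding": {"model_a": 70.0, "model_b": 65.0, "model_c": 68.0},
}


def rank_within(benchmark_scores):
    """Map each model to its 1-based rank on one benchmark (1 = best)."""
    ordered = sorted(benchmark_scores, key=benchmark_scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def overall_ranking(all_scores):
    """Order models by mean rank across benchmarks (lower is better).

    Assumption: only models present on every benchmark are ranked.
    """
    per_benchmark = [rank_within(s) for s in all_scores.values()]
    models = set.intersection(*(set(r) for r in per_benchmark))
    mean_rank = {m: mean(r[m] for r in per_benchmark) for m in models}
    # Sort by mean rank, breaking ties alphabetically for a stable order.
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


ranking = overall_ranking(scores)
for model, avg in ranking:
    print(f"{model}: mean rank {avg:.2f}")
```

Averaging ranks rather than raw scores sidesteps the fact that the benchmarks use different scales (Elo-style ratings versus percentages), at the cost of discarding how large the gaps between models are.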

25-08-24更新：

GPT-5依旧断档领先，免费模型中最强的是Gemini-2.5-Pro，内地模型最强的是Qwen3-Thinking-235B。

榜单中新增DeepSeek V3.1（深度求索DeepSeek），Exaone 4.0（LG AI Research），和Mistral-Medium（Mistral AI）。DeepSeek V3.1 Thinking在代码和数学推理方面尤其强大。

Claude Opus 4.1 Thinking在Text Arena中的分数是1451，非Thinking模型是1439。在其他榜单更新数据之后，这两个模型估计可以双双上榜，甚至能进前10。

Update on August 24, 2025:

GPT-5 remains far ahead of the competition. Among free models, Gemini-2.5-Pro is the strongest, and among models from mainland China, Qwen3-Thinking-235B is the strongest.

New entries on the leaderboard are DeepSeek V3.1 (DeepSeek), Exaone 4.0 (LG AI Research), and Mistral-Medium (Mistral AI). DeepSeek V3.1 Thinking is especially strong at coding and mathematical reasoning.

Claude Opus 4.1 Thinking scored 1451 in Text Arena, while the non-thinking variant scored 1439. Once the other leaderboards update their data, both models will probably make the list and may even break into the top 10.