大语言模型综合排行榜 - LLM Composite Rankings – 250907
简介:
本表格汇总了常用大语言模型在主流评测排行榜上的表现。评测范围涵盖:
人类偏好(文字和视觉),知识与推理,数学能力,代码能力,和长文本推理。
在整合各项评测结果的基础上,计算出综合排名。
Overview:
This table compiles the performance of commonly used large language models across major benchmark leaderboards. Evaluation categories include:
human preference (text and vision), knowledge and reasoning, mathematical ability, coding capability, and long-context reasoning.
An overall ranking is then computed from the combined results of these evaluations.
更新:
本周的排行榜新增了2个模型:xAI推出的grok-codefast-1和智谱AI推出的GLM 4.5v。
此外,claude-opus-4.1在许多榜单上的得分也得到了补全。
Updates:
This week's update introduces two new models: grok-codefast-1 from xAI and GLM 4.5v from Zhipu AI.
Additionally, claude-opus-4.1 now has more complete scores across several leaderboards.
评价:
最强的模型仍然是gpt-5,远超对手,在各项指标上几乎没有短板(至少它自己是这么说的)。
最强的“免费”模型仍然是gemini-2.5-pro,最强的内地能用的模型仍然是阿里的qwen3-235b。开源模型的能力也毫不逊色。
grok-codefast-1虽然是为代码任务优化的,但在综合排名上也表现不错。谁说厨子不能读兵法?
claude并不是不可替代的。Claude Code可以用Gemini CLI或Codex CLI替代。本人使用起来手感都差不多。
Assessment:
The top-performing model remains GPT-5, far ahead of the competition with virtually no weaknesses on any metric (at least by its own account).
The best "free" model is still Gemini 2.5 Pro, and the best model usable in mainland China is still Alibaba's qwen3-235b. Open-source models are by no means inferior.
Though grok-codefast-1 is optimized for coding tasks, it also holds its own in the overall rankings. Who says a cook can't read up on military strategy?