MBPP Benchmark Evaluation Report

Mostly Basic Python Programming: EvalPlus Augmented Tests

DeepSeek V4 Pro

📊 Executive Summary

MBPP Base pass@1: 75.7% (286 / 378 passed)
MBPP Plus pass@1: 66.1% (250 / 378 passed)
Blank answers: 16.4% (62 / 378 produced no code)
Base→Plus drop: 9.6 pp (37 tasks pass Base but fail Plus)

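This report systematically evaluates the code-generation ability of DeepSeek V4 Pro on the MBPP (Mostly Basic Python Programming) benchmark. The evaluation uses the EvalPlus framework and covers 378 Python programming tasks, reporting pass rates at two levels: the base tests (Base) and the augmented tests (Plus). DeepSeek V4 Pro passes 75.7% of tasks on Base; on the stricter Plus tests the rate falls 9.6 percentage points to 66.1%. Notably, the model produced no usable code for 62 tasks (16.4%).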

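🔬 Methodology

About the MBPP Benchmark

MBPP (Mostly Basic Python Programming) is a code-generation benchmark introduced by Google Research in 2021. It contains 974 entry-level Python programming problems written by crowd workers. Each problem consists of a natural-language description, one reference solution, and three test cases.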
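To make the task format concrete, the sketch below models one MBPP-style problem as a plain dictionary and runs its three assert-based tests. The task itself and the field names (`text`, `code`, `test_list`) are illustrative assumptions, not a record copied from the dataset.

```python
# Illustrative MBPP-style record; the task and field names are assumptions.
task = {
    "text": "Write a function to return the largest of three numbers.",
    "code": "def max_of_three(a, b, c):\n    return max(a, b, c)\n",
    "test_list": [
        "assert max_of_three(1, 2, 3) == 3",
        "assert max_of_three(10, 5, 7) == 10",
        "assert max_of_three(-1, -2, -3) == -1",
    ],
}


def run_task(record):
    """Execute the solution, then each test; True only if every assert holds."""
    namespace = {}
    exec(record["code"], namespace)
    try:
        for test in record["test_list"]:
            exec(test, namespace)
    except AssertionError:
        return False
    return True


print(run_task(task))
```

A task counts as passed only when every assertion holds, which is why a single unhandled edge case is enough to fail it.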

This evaluation uses the cleaned MBPP-sanitized version (378 tasks).

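EvalPlus Framework

EvalPlus is an augmented code-evaluation framework developed by a research team at the University of Illinois Urbana-Champaign (UIUC). It automatically generates a large number of additional test cases to expose boundary-condition bugs and hidden defects in model-generated code. The augmented test set (MBPP+) contains on average about 35 times as many tests per task as the original MBPP, so it measures code correctness more precisely.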
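The idea behind augmented testing can be sketched as differential testing: run the candidate and a trusted reference on many generated inputs and compare the results. The example below is a toy illustration under that assumption; the functions, the input generator, and the edge cases are invented here and are not EvalPlus's actual test generator.

```python
import random


def reference_sum_evens(nums):
    """Ground-truth solution: sum of the even values in nums."""
    return sum(n for n in nums if n % 2 == 0)


def candidate_sum_evens(nums):
    """Buggy candidate: wrongly assumes all inputs are non-negative."""
    return sum(n for n in nums if n > 0 and n % 2 == 0)


def base_tests(func):
    """A handful of hand-written tests, as in an original benchmark task."""
    return func([1, 2, 3, 4]) == 6 and func([2, 2]) == 4 and func([1, 3]) == 0


def augmented_tests(func, reference, trials=200, seed=0):
    """Compare func with the reference on edge cases plus random inputs."""
    rng = random.Random(seed)
    cases = [[], [0], [-2], [-4, 3]]  # hand-picked edge cases
    for _ in range(trials):
        size = rng.randint(0, 8)
        cases.append([rng.randint(-50, 50) for _ in range(size)])
    return all(func(case) == reference(case) for case in cases)


print(base_tests(candidate_sum_evens))
print(augmented_tests(candidate_sum_evens, reference_sum_evens))
```

The candidate passes all three base tests yet fails the augmented run, because the input `[-2]` exposes the non-negative assumption. This is the "Base only" outcome described in this report.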

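Test Configuration

Model: DeepSeek V4 Pro
API endpoint: yhapiplatform.com (OpenAI-compatible proxy)
Temperature: 0.1 (near-deterministic decoding)
Evaluation mode: pass@1 (single generation)
Test framework: EvalPlus (evalplus v0.2.0)
Number of tasks: 378 MBPP-sanitized tasks
Test time: 2026-06-07 02:44
Result hash: ee43ecabebf20deef4bb776a405ac5b1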
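With a single generation per task, pass@1 reduces to the fraction of tasks whose one sample passes. The sketch below assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples with c correct, and checks that it reproduces the headline Base score from the reported counts. The function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """Average pass@1 over tasks when each task has a single sample."""
    return sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)


# Reproduce the headline Base score from the reported counts (286 of 378).
base_results = [True] * 286 + [False] * 92
print(f"{benchmark_pass_at_1(base_results):.1%}")  # 75.7%
```

Because only one sample is drawn at temperature 0.1, a blank response counts as a plain failure for that task; there is no second sample to fall back on.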

📈 Test Results

Base test result distribution (chart)

Plus test result distribution (chart)

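Result Breakdown

✅ Passed both Base and Plus: 249 (65.9%)
⚠️ Passed Base only: 37 (9.8%)
⚠️ Passed Plus only: 1 (0.3%)
❌ Failed both: 91 (24.1%)
🈳 Blank answers: 62 (16.4%), a subset of the 91 tasks that failed both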

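The pass rate drops 9.6 percentage points from Base to Plus. That figure is the difference of the rounded rates; from the raw counts the net loss is 36 tasks, about 9.5 points. 37 tasks produced code that passed the base tests but exposed defects under the more comprehensive augmented tests, while one task (Mbpp/635) is listed as passing Plus only. This highlights the value of EvalPlus's augmented evaluation for surfacing hidden code defects.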
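The headline rates can be rebuilt from the per-status counts as a consistency check. In the sketch below, the 29 non-empty failures and the single Plus-only task are tallied from the report's task list; the dictionary keys are illustrative names.

```python
TOTAL = 378
# Per-status task counts tallied from the report's task list.
counts = {
    "both": 249,       # passed Base and Plus
    "base_only": 37,   # passed Base, failed Plus
    "plus_only": 1,    # listed as Plus Only (Mbpp/635)
    "failed": 29,      # produced code, failed both
    "empty": 62,       # produced no code
}
assert sum(counts.values()) == TOTAL

base_pass = counts["both"] + counts["base_only"]
plus_pass = counts["both"] + counts["plus_only"]
net_drop = base_pass - plus_pass

print(f"Base: {base_pass}/{TOTAL} = {base_pass / TOTAL:.1%}")
print(f"Plus: {plus_pass}/{TOTAL} = {plus_pass / TOTAL:.1%}")
print(f"Net drop: {net_drop} tasks = {100 * net_drop / TOTAL:.1f} pp")
```

The counts give 286 and 250 passing tasks, matching the reported 75.7% and 66.1%. The net drop is 36 tasks (about 9.5 pp), slightly below the 9.6 pp obtained by subtracting the rounded percentages.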

🏆 Model Comparison

The following compares DeepSeek V4 Pro with other mainstream models on the same EvalPlus MBPP v0.2.0 benchmark. Source: EvalPlus Leaderboard (retrieved 2026-06-07).

MBPP Base pass@1 Comparison

O1 Preview (Sept 2024): 95.5%
O1 Mini (Sept 2024): 93.1%
Qwen2.5-Coder-32B-Instruct: 90.5%
Gemini 1.5 Pro 002: 89.7%
Claude 3.5 Sonnet (June 2024): 89.4%
Claude 3 Opus (Mar 2024): 89.4%
DeepSeek-Coder-V2-Instruct: 89.4%
GPT-4o (Aug 2024): 87.6%
DeepSeek-V3 (Nov 2024): 87.6%
DeepSeek-V2.5 (Nov 2024): 87.6%
Grok Beta: 86.0%
GPT-4-Turbo (Nov 2023): 85.7%
GPT-4o Mini (July 2024): 85.4%
Gemini 1.5 Flash 002: 84.7%
Claude 3 Sonnet (Mar 2024): 83.6%
GPT-3.5-Turbo (Nov 2023): 82.5%
Llama 3 70B Instruct: 82.3%
DeepSeek-Coder-33B-Instruct: 80.4%
Claude 3 Haiku (Mar 2024): 80.2%
CodeQwen1.5-7B-Chat: 79.4%
→ DeepSeek V4 Pro: 75.7%

MBPP Plus pass@1 Comparison

O1 Preview (Sept 2024): 80.2%
O1 Mini (Sept 2024): 78.8%
Qwen2.5-Coder-32B-Instruct: 77.0%
Gemini 1.5 Pro 002: 74.6%
Claude 3.5 Sonnet (June 2024): 74.3%
Claude 3 Opus (Mar 2024): 73.3%
DeepSeek-Coder-V2-Instruct: 75.1%
GPT-4o (Aug 2024): 72.2%
DeepSeek-V3 (Nov 2024): 73.0%
DeepSeek-V2.5 (Nov 2024): 74.1%
Grok Beta: 65.6%
GPT-4-Turbo (Nov 2023): 73.3%
GPT-4o Mini (July 2024): 72.2%
Gemini 1.5 Flash 002: 67.5%
Claude 3 Sonnet (Mar 2024): 69.3%
GPT-3.5-Turbo (Nov 2023): 69.7%
Llama 3 70B Instruct: 69.0%
DeepSeek-Coder-33B-Instruct: 70.1%
Claude 3 Haiku (Mar 2024): 68.8%
CodeQwen1.5-7B-Chat: 69.0%
→ DeepSeek V4 Pro: 66.1%

Full Comparison Table

# Model Vendor MBPP Base MBPP Plus Drop (Base minus Plus, in percentage points)
1 O1 Preview (Sept 2024) OpenAI 95.5% 80.2% 15.3%
2 O1 Mini (Sept 2024) OpenAI 93.1% 78.8% 14.3%
3 Qwen2.5-Coder-32B-Instruct Alibaba (Qwen) 90.5% 77.0% 13.5%
4 Gemini 1.5 Pro 002 Google 89.7% 74.6% 15.1%
5 Claude 3.5 Sonnet (June 2024) Anthropic 89.4% 74.3% 15.1%
6 Claude 3 Opus (Mar 2024) Anthropic 89.4% 73.3% 16.1%
7 DeepSeek-Coder-V2-Instruct DeepSeek 89.4% 75.1% 14.3%
8 GPT-4o (Aug 2024) OpenAI 87.6% 72.2% 15.4%
9 DeepSeek-V3 (Nov 2024) DeepSeek 87.6% 73.0% 14.6%
10 DeepSeek-V2.5 (Nov 2024) DeepSeek 87.6% 74.1% 13.5%
11 Grok Beta xAI 86.0% 65.6% 20.4%
12 GPT-4-Turbo (Nov 2023) OpenAI 85.7% 73.3% 12.4%
13 GPT-4o Mini (July 2024) OpenAI 85.4% 72.2% 13.2%
14 Gemini 1.5 Flash 002 Google 84.7% 67.5% 17.2%
15 Claude 3 Sonnet (Mar 2024) Anthropic 83.6% 69.3% 14.3%
16 GPT-3.5-Turbo (Nov 2023) OpenAI 82.5% 69.7% 12.8%
17 Llama 3 70B Instruct Meta 82.3% 69.0% 13.3%
18 DeepSeek-Coder-33B-Instruct DeepSeek 80.4% 70.1% 10.3%
19 Claude 3 Haiku (Mar 2024) Anthropic 80.2% 68.8% 11.4%
20 CodeQwen1.5-7B-Chat Alibaba (Qwen) 79.4% 69.0% 10.4%
21 Gemini Pro 1.0 Google 75.4% 61.4% 14.0%
22 DeepSeek-Coder-6.7B-Instruct DeepSeek 74.9% 65.6% 9.3%
23 Command-R+ Cohere 74.3% 63.5% 10.8%
24 Mixtral-8x22B-Instruct-v0.1 Mistral AI 73.8% 64.3% 9.5%
25 Mistral Large (Mar 2024) Mistral AI 72.8% 59.5% 13.3%
26 Codestral-22B-v0.1 Mistral AI 72.5% 61.9% 10.6%
27 Qwen1.5-72B-Chat Alibaba (Qwen) 72.5% 61.6% 10.9%
28 Code Llama 34B Meta 69.3% 56.3% 13.0%
29 Llama 3.1 8B Instruct Meta 68.3% 55.6% 12.7%
30 Phi-3-mini-4k-instruct Microsoft 65.9% 54.2% 11.7%
31 Llama 3 8B Instruct Meta 64.6% 54.8% 9.8%
32 Code Llama 13B Meta 63.5% 52.6% 10.9%
33 Code Llama 7B Meta 59.5% 46.8% 12.7%
- ◆ DeepSeek V4 Pro DeepSeek 75.7% 66.1% 9.6%
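To locate DeepSeek V4 Pro within this table, the sketch below ranks its scores against the 33 leaderboard rows. The rows are copied from the table above; the `rank` helper and its tie rule (ties share the better rank) are assumptions made for this illustration.

```python
# (model, Base %, Plus %) rows copied from the comparison table above.
LEADERBOARD = [
    ("O1 Preview (Sept 2024)", 95.5, 80.2),
    ("O1 Mini (Sept 2024)", 93.1, 78.8),
    ("Qwen2.5-Coder-32B-Instruct", 90.5, 77.0),
    ("Gemini 1.5 Pro 002", 89.7, 74.6),
    ("Claude 3.5 Sonnet (June 2024)", 89.4, 74.3),
    ("Claude 3 Opus (Mar 2024)", 89.4, 73.3),
    ("DeepSeek-Coder-V2-Instruct", 89.4, 75.1),
    ("GPT-4o (Aug 2024)", 87.6, 72.2),
    ("DeepSeek-V3 (Nov 2024)", 87.6, 73.0),
    ("DeepSeek-V2.5 (Nov 2024)", 87.6, 74.1),
    ("Grok Beta", 86.0, 65.6),
    ("GPT-4-Turbo (Nov 2023)", 85.7, 73.3),
    ("GPT-4o Mini (July 2024)", 85.4, 72.2),
    ("Gemini 1.5 Flash 002", 84.7, 67.5),
    ("Claude 3 Sonnet (Mar 2024)", 83.6, 69.3),
    ("GPT-3.5-Turbo (Nov 2023)", 82.5, 69.7),
    ("Llama 3 70B Instruct", 82.3, 69.0),
    ("DeepSeek-Coder-33B-Instruct", 80.4, 70.1),
    ("Claude 3 Haiku (Mar 2024)", 80.2, 68.8),
    ("CodeQwen1.5-7B-Chat", 79.4, 69.0),
    ("Gemini Pro 1.0", 75.4, 61.4),
    ("DeepSeek-Coder-6.7B-Instruct", 74.9, 65.6),
    ("Command-R+", 74.3, 63.5),
    ("Mixtral-8x22B-Instruct-v0.1", 73.8, 64.3),
    ("Mistral Large (Mar 2024)", 72.8, 59.5),
    ("Codestral-22B-v0.1", 72.5, 61.9),
    ("Qwen1.5-72B-Chat", 72.5, 61.6),
    ("Code Llama 34B", 69.3, 56.3),
    ("Llama 3.1 8B Instruct", 68.3, 55.6),
    ("Phi-3-mini-4k-instruct", 65.9, 54.2),
    ("Llama 3 8B Instruct", 64.6, 54.8),
    ("Code Llama 13B", 63.5, 52.6),
    ("Code Llama 7B", 59.5, 46.8),
]
CANDIDATE = ("DeepSeek V4 Pro", 75.7, 66.1)


def rank(score, others):
    """1-based rank of score among others; ties share the better rank."""
    return 1 + sum(1 for other in others if other > score)


base_rank = rank(CANDIDATE[1], [row[1] for row in LEADERBOARD])
plus_rank = rank(CANDIDATE[2], [row[2] for row in LEADERBOARD])
field = len(LEADERBOARD) + 1
print(f"Base rank: {base_rank} of {field}")
print(f"Plus rank: {plus_rank} of {field}")
```

Under these assumptions the model ranks 21st of 34 on Base and 20th of 34 on Plus.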

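🔍 Error Analysis

Critical issue: 62 blank answers (16.4%)
The model produced no code at all for 62 tasks, the single largest factor depressing the pass rate. Excluding blank answers, the remaining 316 answers pass Base at 90.5% (286 / 316) and Plus at 79.1% (250 / 316).
Base→Plus drop analysis
37 tasks pass the Base tests but fail the Plus tests (9.8% of all tasks). For these tasks the model's code is correct on the base tests but breaks on boundary conditions, unusual inputs, and similar cases. Common failure modes include mishandled empty inputs, boundary-value errors, and type-conversion problems.
Code quality characteristics
The model tends to produce concise, idiomatic Python. Code for passing tasks is of moderate average length, and functions carry clear docstrings and type annotations. Failures cluster in math- and algorithm-heavy tasks, string boundary handling, and operations on complex data structures.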
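The blank-excluded rates quoted in this section follow from dividing the pass counts by the 316 tasks that received code. A minimal sketch of that arithmetic, with an illustrative helper name:

```python
def rate_excluding_blanks(passed, total, blanks):
    """Pass rate over only the tasks that actually received code."""
    answered = total - blanks
    if answered <= 0:
        raise ValueError("no answered tasks")
    return passed / answered


TOTAL, BLANKS = 378, 62
print(f"Base: {rate_excluding_blanks(286, TOTAL, BLANKS):.1%}")  # 90.5%
print(f"Plus: {rate_excluding_blanks(250, TOTAL, BLANKS):.1%}")  # 79.1%
```

These figures are best read as an optimistic estimate: they implicitly assume the blank tasks are no harder than the answered ones, which the data in this report cannot confirm.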

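Typical Plus Failure Patterns

Category | Typical tasks | Description
Empty input / zero values | Mbpp/7, Mbpp/119, Mbpp/559 | Empty lists, empty strings, and similar boundary inputs are not handled correctly
Negative numbers | Mbpp/99, Mbpp/244, Mbpp/589 | The function misbehaves on negative inputs
Large-value overflow | Mbpp/267, Mbpp/287 | Performance or correctness problems on inputs on the order of 1,000,000
String boundaries | Mbpp/16, Mbpp/748, Mbpp/806 | Extreme string inputs such as single characters and empty strings
Nested structures | Mbpp/161, Mbpp/792 | Traversal or processing of deeply nested data structures
Format parsing | Mbpp/427, Mbpp/593 | Parsing of non-standard or borderline input formats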
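Two of these categories, negative numbers and empty strings, can be illustrated with small invented functions. The code below is a sketch for illustration only; it is not the model's output for any of the listed tasks, and all function names are hypothetical.

```python
import math


def is_perfect_square_naive(n):
    """Fine on typical tests, but math.sqrt raises ValueError for n < 0."""
    root = int(math.sqrt(n))
    return root * root == n


def is_perfect_square(n):
    """Rejects negatives explicitly and uses exact integer arithmetic."""
    if n < 0:
        return False
    root = math.isqrt(n)
    return root * root == n


def first_word_naive(text):
    """Raises IndexError on an empty or whitespace-only string."""
    return text.split()[0]


def first_word(text):
    """Returns an empty string when there is no word to return."""
    words = text.split()
    return words[0] if words else ""


def survives(func, arg):
    """True if func(arg) returns without raising an exception."""
    try:
        func(arg)
        return True
    except Exception:
        return False


print(survives(is_perfect_square_naive, -4), survives(is_perfect_square, -4))
print(survives(first_word_naive, ""), survives(first_word, ""))
```

Both naive versions behave correctly on ordinary inputs such as `16` or `"hello world"`, which is why a small base test suite would not catch them.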

📋 Full Task List

378 tasks in total. ✅ Perfect: 249 ⚠️ Base Only: 37 ⚠️ Plus Only: 1 ❌ Failed: 91 (of which 🈳 Empty: 62)

Task ID Base Plus Status Code length Details
Mbpp/100 ❌ Empty 0 View →
Mbpp/101 ✅ Perfect 446 View →
Mbpp/102 ⚠️ Base Only 263 View →
Mbpp/103 ❌ Empty 0 View →
Mbpp/104 ✅ Perfect 179 View →
Mbpp/105 ✅ Perfect 109 View →
Mbpp/106 ✅ Perfect 312 View →
Mbpp/108 ✅ Perfect 476 View →
Mbpp/109 ❌ Empty 0 View →
Mbpp/11 ✅ Perfect 535 View →
Mbpp/111 ✅ Perfect 364 View →
Mbpp/113 ⚠️ Base Only 326 View →
Mbpp/116 ✅ Perfect 438 View →
Mbpp/118 ✅ Perfect 281 View →
Mbpp/119 ⚠️ Base Only 731 View →
Mbpp/12 ✅ Perfect 317 View →
Mbpp/120 ❌ Failed 337 View →
Mbpp/123 ❌ Failed 556 View →
Mbpp/124 ❌ Empty 0 View →
Mbpp/125 ❌ Empty 0 View →
Mbpp/126 ❌ Empty 0 View →
Mbpp/127 ✅ Perfect 103 View →
Mbpp/128 ✅ Perfect 412 View →
Mbpp/129 ❌ Failed 569 View →
Mbpp/130 ❌ Failed 9 View →
Mbpp/131 ❌ Failed 484 View →
Mbpp/132 ✅ Perfect 286 View →
Mbpp/133 ✅ Perfect 302 View →
Mbpp/135 ⚠️ Base Only 394 View →
Mbpp/137 ❌ Empty 0 View →
Mbpp/138 ❌ Empty 0 View →
Mbpp/139 ✅ Perfect 300 View →
Mbpp/14 ❌ Failed 9 View →
Mbpp/140 ✅ Perfect 487 View →
Mbpp/141 ❌ Failed 953 View →
Mbpp/142 ✅ Perfect 429 View →
Mbpp/145 ✅ Perfect 316 View →
Mbpp/16 ⚠️ Base Only 491 View →
Mbpp/160 ❌ Failed 1090 View →
Mbpp/161 ⚠️ Base Only 343 View →
Mbpp/162 ❌ Failed 9 View →
Mbpp/165 ✅ Perfect 454 View →
Mbpp/166 ✅ Perfect 517 View →
Mbpp/167 ❌ Empty 0 View →
Mbpp/168 ✅ Perfect 282 View →
Mbpp/17 ✅ Perfect 283 View →
Mbpp/170 ✅ Perfect 362 View →
Mbpp/171 ✅ Perfect 331 View →
Mbpp/172 ✅ Perfect 153 View →
Mbpp/18 ✅ Perfect 262 View →
Mbpp/19 ✅ Perfect 339 View →
Mbpp/2 ✅ Perfect 323 View →
Mbpp/20 ✅ Perfect 400 View →
Mbpp/222 ✅ Perfect 403 View →
Mbpp/223 ❌ Empty 0 View →
Mbpp/224 ✅ Perfect 290 View →
Mbpp/226 ✅ Perfect 188 View →
Mbpp/227 ✅ Perfect 198 View →
Mbpp/230 ✅ Perfect 319 View →
Mbpp/232 ❌ Failed 9 View →
Mbpp/233 ✅ Perfect 369 View →
Mbpp/234 ✅ Perfect 291 View →
Mbpp/235 ❌ Empty 0 View →
Mbpp/237 ❌ Empty 0 View →
Mbpp/238 ✅ Perfect 242 View →
Mbpp/239 ❌ Empty 0 View →
Mbpp/240 ✅ Perfect 288 View →
Mbpp/242 ✅ Perfect 142 View →
Mbpp/244 ⚠️ Base Only 312 View →
Mbpp/245 ❌ Empty 0 View →
Mbpp/247 ❌ Empty 0 View →
Mbpp/250 ✅ Perfect 290 View →
Mbpp/251 ✅ Perfect 337 View →
Mbpp/252 ✅ Perfect 389 View →
Mbpp/253 ✅ Perfect 387 View →
Mbpp/255 ⚠️ Base Only 400 View →
Mbpp/256 ✅ Perfect 406 View →
Mbpp/257 ✅ Perfect 159 View →
Mbpp/259 ❌ Failed 194 View →
Mbpp/260 ❌ Empty 0 View →
Mbpp/261 ⚠️ Base Only 483 View →
Mbpp/262 ✅ Perfect 330 View →
Mbpp/264 ✅ Perfect 529 View →
Mbpp/265 ✅ Perfect 395 View →
Mbpp/266 ✅ Perfect 229 View →
Mbpp/267 ⚠️ Base Only 209 View →
Mbpp/268 ✅ Perfect 281 View →
Mbpp/269 ✅ Perfect 216 View →
Mbpp/270 ✅ Perfect 342 View →
Mbpp/271 ✅ Perfect 374 View →
Mbpp/272 ✅ Perfect 303 View →
Mbpp/273 ✅ Perfect 400 View →
Mbpp/274 ✅ Perfect 471 View →
Mbpp/276 ✅ Perfect 318 View →
Mbpp/277 ✅ Perfect 180 View →
Mbpp/278 ⚠️ Base Only 623 View →
Mbpp/279 ✅ Perfect 432 View →
Mbpp/280 ✅ Perfect 291 View →
Mbpp/281 ✅ Perfect 331 View →
Mbpp/282 ✅ Perfect 355 View →
Mbpp/283 ✅ Perfect 422 View →
Mbpp/284 ✅ Perfect 360 View →
Mbpp/285 ❌ Empty 0 View →
Mbpp/286 ❌ Empty 0 View →
Mbpp/287 ⚠️ Base Only 360 View →
Mbpp/290 ✅ Perfect 419 View →
Mbpp/292 ✅ Perfect 178 View →
Mbpp/293 ✅ Perfect 163 View →
Mbpp/294 ⚠️ Base Only 432 View →
Mbpp/296 ❌ Failed 911 View →
Mbpp/297 ✅ Perfect 257 View →
Mbpp/299 ❌ Failed 35 View →
Mbpp/3 ✅ Perfect 583 View →
Mbpp/300 ⚠️ Base Only 395 View →
Mbpp/301 ❌ Empty 0 View →
Mbpp/305 ❌ Failed 400 View →
Mbpp/306 ❌ Empty 0 View →
Mbpp/308 ❌ Empty 0 View →
Mbpp/309 ✅ Perfect 140 View →
Mbpp/310 ⚠️ Base Only 343 View →
Mbpp/311 ❌ Empty 0 View →
Mbpp/312 ✅ Perfect 321 View →
Mbpp/388 ✅ Perfect 324 View →
Mbpp/389 ✅ Perfect 383 View →
Mbpp/390 ✅ Perfect 330 View →
Mbpp/391 ✅ Perfect 466 View →
Mbpp/392 ❌ Empty 0 View →
Mbpp/394 ✅ Perfect 190 View →
Mbpp/395 ✅ Perfect 476 View →
Mbpp/397 ✅ Perfect 193 View →
Mbpp/398 ❌ Failed 568 View →
Mbpp/4 ✅ Perfect 455 View →
Mbpp/404 ✅ Perfect 219 View →
Mbpp/405 ✅ Perfect 223 View →
Mbpp/406 ❌ Empty 0 View →
Mbpp/409 ✅ Perfect 540 View →
Mbpp/410 ⚠️ Base Only 364 View →
Mbpp/412 ✅ Perfect 294 View →
Mbpp/413 ✅ Perfect 382 View →
Mbpp/414 ✅ Perfect 233 View →
Mbpp/415 ❌ Failed 996 View →
Mbpp/418 ✅ Perfect 250 View →
Mbpp/419 ✅ Perfect 381 View →
Mbpp/420 ✅ Perfect 406 View →
Mbpp/421 ✅ Perfect 136 View →
Mbpp/422 ❌ Empty 0 View →
Mbpp/424 ✅ Perfect 263 View →
Mbpp/425 ✅ Perfect 434 View →
Mbpp/426 ✅ Perfect 264 View →
Mbpp/427 ⚠️ Base Only 310 View →
Mbpp/428 ✅ Perfect 777 View →
Mbpp/429 ✅ Perfect 162 View →
Mbpp/430 ❌ Empty 0 View →
Mbpp/432 ❌ Failed 9 View →
Mbpp/433 ✅ Perfect 172 View →
Mbpp/435 ✅ Perfect 245 View →
Mbpp/436 ✅ Perfect 146 View →
Mbpp/437 ✅ Perfect 357 View →
Mbpp/439 ✅ Perfect 400 View →
Mbpp/440 ❌ Empty 0 View →
Mbpp/441 ✅ Perfect 306 View →
Mbpp/445 ❌ Failed 9 View →
Mbpp/446 ⚠️ Base Only 251 View →
Mbpp/447 ✅ Perfect 310 View →
Mbpp/448 ❌ Empty 0 View →
Mbpp/450 ✅ Perfect 372 View →
Mbpp/451 ⚠️ Base Only 287 View →
Mbpp/453 ❌ Empty 0 View →
Mbpp/454 ❌ Failed 9 View →
Mbpp/455 ✅ Perfect 467 View →
Mbpp/456 ✅ Perfect 299 View →
Mbpp/457 ✅ Perfect 442 View →
Mbpp/458 ✅ Perfect 294 View →
Mbpp/459 ❌ Failed 148 View →
Mbpp/460 ✅ Perfect 370 View →
Mbpp/462 ❌ Empty 0 View →
Mbpp/463 ❌ Empty 0 View →
Mbpp/465 ✅ Perfect 254 View →
Mbpp/468 ❌ Empty 0 View →
Mbpp/470 ✅ Perfect 315 View →
Mbpp/471 ❌ Empty 0 View →
Mbpp/472 ❌ Empty 0 View →
Mbpp/473 ❌ Empty 0 View →
Mbpp/474 ✅ Perfect 408 View →
Mbpp/475 ✅ Perfect 396 View →
Mbpp/476 ✅ Perfect 230 View →
Mbpp/477 ✅ Perfect 223 View →
Mbpp/478 ❌ Empty 0 View →
Mbpp/479 ✅ Perfect 257 View →
Mbpp/554 ✅ Perfect 165 View →
Mbpp/555 ✅ Perfect 324 View →
Mbpp/556 ⚠️ Base Only 418 View →
Mbpp/557 ✅ Perfect 292 View →
Mbpp/558 ✅ Perfect 665 View →
Mbpp/559 ⚠️ Base Only 522 View →
Mbpp/56 ✅ Perfect 264 View →
Mbpp/560 ✅ Perfect 177 View →
Mbpp/562 ✅ Perfect 367 View →
Mbpp/563 ✅ Perfect 467 View →
Mbpp/564 ✅ Perfect 323 View →
Mbpp/565 ✅ Perfect 287 View →
Mbpp/566 ✅ Perfect 313 View →
Mbpp/567 ✅ Perfect 352 View →
Mbpp/568 ✅ Perfect 259 View →
Mbpp/569 ✅ Perfect 340 View →
Mbpp/57 ✅ Perfect 524 View →
Mbpp/572 ❌ Empty 0 View →
Mbpp/573 ✅ Perfect 224 View →
Mbpp/576 ⚠️ Base Only 363 View →
Mbpp/577 ⚠️ Base Only 401 View →
Mbpp/578 ✅ Perfect 498 View →
Mbpp/579 ❌ Empty 0 View →
Mbpp/58 ✅ Perfect 278 View →
Mbpp/580 ✅ Perfect 572 View →
Mbpp/581 ❌ Empty 0 View →
Mbpp/583 ✅ Perfect 328 View →
Mbpp/585 ✅ Perfect 412 View →
Mbpp/586 ❌ Failed 9 View →
Mbpp/587 ✅ Perfect 254 View →
Mbpp/588 ✅ Perfect 369 View →
Mbpp/589 ⚠️ Base Only 533 View →
Mbpp/59 ✅ Perfect 167 View →
Mbpp/590 ❌ Empty 0 View →
Mbpp/591 ❌ Failed 340 View →
Mbpp/592 ✅ Perfect 358 View →
Mbpp/593 ⚠️ Base Only 418 View →
Mbpp/594 ✅ Perfect 940 View →
Mbpp/596 ✅ Perfect 122 View →
Mbpp/597 ❌ Empty 0 View →
Mbpp/598 ✅ Perfect 389 View →
Mbpp/599 ✅ Perfect 217 View →
Mbpp/6 ✅ Perfect 427 View →
Mbpp/600 ✅ Perfect 243 View →
Mbpp/602 ✅ Perfect 283 View →
Mbpp/603 ❌ Empty 0 View →
Mbpp/604 ✅ Perfect 292 View →
Mbpp/605 ✅ Perfect 601 View →
Mbpp/606 ✅ Perfect 114 View →
Mbpp/607 ✅ Perfect 531 View →
Mbpp/608 ❌ Empty 0 View →
Mbpp/61 ❌ Empty 0 View →
Mbpp/610 ✅ Perfect 246 View →
Mbpp/611 ✅ Perfect 410 View →
Mbpp/612 ✅ Perfect 567 View →
Mbpp/614 ✅ Perfect 380 View →
Mbpp/615 ❌ Empty 0 View →
Mbpp/616 ✅ Perfect 376 View →
Mbpp/618 ✅ Perfect 584 View →
Mbpp/619 ✅ Perfect 323 View →
Mbpp/62 ✅ Perfect 108 View →
Mbpp/620 ❌ Failed 510 View →
Mbpp/622 ❌ Empty 0 View →
Mbpp/623 ✅ Perfect 368 View →
Mbpp/624 ✅ Perfect 106 View →
Mbpp/626 ❌ Empty 0 View →
Mbpp/628 ✅ Perfect 264 View →
Mbpp/629 ✅ Perfect 282 View →
Mbpp/63 ✅ Perfect 585 View →
Mbpp/630 ⚠️ Base Only 422 View →
Mbpp/631 ✅ Perfect 451 View →
Mbpp/632 ✅ Perfect 579 View →
Mbpp/633 ❌ Empty 0 View →
Mbpp/635 ⚠️ Plus Only 475 View →
Mbpp/637 ✅ Perfect 346 View →
Mbpp/638 ❌ Empty 0 View →
Mbpp/639 ⚠️ Base Only 246 View →
Mbpp/64 ✅ Perfect 328 View →
Mbpp/641 ✅ Perfect 152 View →
Mbpp/643 ❌ Empty 0 View →
Mbpp/644 ✅ Perfect 245 View →
Mbpp/65 ✅ Perfect 463 View →
Mbpp/66 ✅ Perfect 235 View →
Mbpp/67 ❌ Failed 1320 View →
Mbpp/68 ✅ Perfect 443 View →
Mbpp/69 ✅ Perfect 522 View →
Mbpp/7 ⚠️ Base Only 291 View →
Mbpp/70 ✅ Perfect 277 View →
Mbpp/71 ✅ Perfect 820 View →
Mbpp/72 ❌ Failed 476 View →
Mbpp/720 ✅ Perfect 339 View →
Mbpp/721 ❌ Failed 532 View →
Mbpp/722 ✅ Perfect 627 View →
Mbpp/723 ✅ Perfect 456 View →
Mbpp/724 ✅ Perfect 197 View →
Mbpp/725 ✅ Perfect 337 View →
Mbpp/726 ✅ Perfect 231 View →
Mbpp/728 ✅ Perfect 316 View →
Mbpp/730 ✅ Perfect 363 View →
Mbpp/731 ✅ Perfect 434 View →
Mbpp/732 ❌ Empty 0 View →
Mbpp/733 ✅ Perfect 578 View →
Mbpp/734 ❌ Empty 0 View →
Mbpp/735 ❌ Empty 0 View →
Mbpp/736 ✅ Perfect 406 View →
Mbpp/737 ✅ Perfect 481 View →
Mbpp/739 ❌ Empty 0 View →
Mbpp/74 ❌ Empty 0 View →
Mbpp/740 ✅ Perfect 428 View →
Mbpp/741 ✅ Perfect 224 View →
Mbpp/742 ❌ Empty 0 View →
Mbpp/743 ✅ Perfect 319 View →
Mbpp/744 ✅ Perfect 173 View →
Mbpp/745 ✅ Perfect 499 View →
Mbpp/748 ⚠️ Base Only 397 View →
Mbpp/749 ✅ Perfect 292 View →
Mbpp/75 ✅ Perfect 479 View →
Mbpp/750 ✅ Perfect 189 View →
Mbpp/751 ✅ Perfect 881 View →
Mbpp/752 ✅ Perfect 380 View →
Mbpp/753 ✅ Perfect 479 View →
Mbpp/754 ✅ Perfect 529 View →
Mbpp/755 ❌ Empty 0 View →
Mbpp/757 ✅ Perfect 335 View →
Mbpp/758 ✅ Perfect 440 View →
Mbpp/759 ❌ Failed 9 View →
Mbpp/760 ✅ Perfect 377 View →
Mbpp/762 ✅ Perfect 365 View →
Mbpp/763 ✅ Perfect 676 View →
Mbpp/764 ✅ Perfect 252 View →
Mbpp/765 ❌ Empty 0 View →
Mbpp/766 ✅ Perfect 398 View →
Mbpp/767 ✅ Perfect 530 View →
Mbpp/769 ❌ Empty 0 View →
Mbpp/77 ✅ Perfect 124 View →
Mbpp/770 ✅ Perfect 369 View →
Mbpp/771 ⚠️ Base Only 806 View →
Mbpp/772 ✅ Perfect 421 View →
Mbpp/773 ✅ Perfect 473 View →
Mbpp/775 ✅ Perfect 440 View →
Mbpp/777 ✅ Perfect 131 View →
Mbpp/778 ✅ Perfect 497 View →
Mbpp/780 ✅ Perfect 403 View →
Mbpp/781 ✅ Perfect 419 View →
Mbpp/782 ❌ Empty 0 View →
Mbpp/784 ✅ Perfect 607 View →
Mbpp/785 ⚠️ Base Only 640 View →
Mbpp/786 ✅ Perfect 528 View →
Mbpp/787 ✅ Perfect 280 View →
Mbpp/788 ✅ Perfect 228 View →
Mbpp/79 ✅ Perfect 280 View →
Mbpp/790 ⚠️ Base Only 412 View →
Mbpp/791 ✅ Perfect 167 View →
Mbpp/792 ⚠️ Base Only 230 View →
Mbpp/793 ✅ Perfect 463 View →
Mbpp/794 ⚠️ Base Only 170 View →
Mbpp/796 ✅ Perfect 267 View →
Mbpp/797 ✅ Perfect 485 View →
Mbpp/798 ✅ Perfect 161 View →
Mbpp/799 ❌ Failed 349 View →
Mbpp/8 ✅ Perfect 281 View →
Mbpp/80 ✅ Perfect 397 View →
Mbpp/800 ⚠️ Base Only 355 View →
Mbpp/801 ✅ Perfect 304 View →
Mbpp/803 ✅ Perfect 390 View →
Mbpp/804 ✅ Perfect 338 View →
Mbpp/805 ✅ Perfect 336 View →
Mbpp/806 ⚠️ Base Only 398 View →
Mbpp/807 ✅ Perfect 224 View →
Mbpp/808 ✅ Perfect 285 View →
Mbpp/809 ✅ Perfect 266 View →
Mbpp/82 ✅ Perfect 272 View →
Mbpp/84 ❌ Failed 287 View →
Mbpp/85 ✅ Perfect 282 View →
Mbpp/86 ✅ Perfect 229 View →
Mbpp/87 ✅ Perfect 317 View →
Mbpp/88 ✅ Perfect 394 View →
Mbpp/89 ✅ Perfect 156 View →
Mbpp/9 ❌ Empty 0 View →
Mbpp/90 ✅ Perfect 327 View →
Mbpp/91 ✅ Perfect 414 View →
Mbpp/92 ❌ Empty 0 View →
Mbpp/93 ✅ Perfect 99 View →
Mbpp/94 ✅ Perfect 444 View →
Mbpp/95 ✅ Perfect 318 View →
Mbpp/96 ✅ Perfect 394 View →
Mbpp/97 ✅ Perfect 414 View →
Mbpp/98 ✅ Perfect 420 View →
Mbpp/99 ⚠️ Base Only 271 View →
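The list above is sorted lexicographically by task ID, so a short parser helps when tallying statuses or reordering rows numerically. The sketch below assumes the row layout shown in the list (ID, icon, status, code length); the regular expression and function name are illustrative, and only a few sample rows are embedded.

```python
import re
from collections import Counter

# A few rows in the same layout as the task list above.
SAMPLE_ROWS = """\
Mbpp/100 ❌ Empty 0 View →
Mbpp/101 ✅ Perfect 446 View →
Mbpp/102 ⚠️ Base Only 263 View →
Mbpp/120 ❌ Failed 337 View →
Mbpp/635 ⚠️ Plus Only 475 View →
Mbpp/11 ✅ Perfect 535 View →
"""

ROW = re.compile(
    r"^Mbpp/(?P<id>\d+)\s+\S+\s+"
    r"(?P<status>Perfect|Base Only|Plus Only|Failed|Empty)\s+"
    r"(?P<length>\d+)\b"
)


def parse_rows(text):
    """Return (task_id, status, code_length) tuples sorted by numeric ID."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match["id"]), match["status"], int(match["length"])))
    return sorted(rows)


rows = parse_rows(SAMPLE_ROWS)
print(Counter(status for _, status, _ in rows))
print([task_id for task_id, _, _ in rows])
```

Run over the full list, the same tally should give 249 Perfect, 37 Base Only, 1 Plus Only, 29 Failed, and 62 Empty rows.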

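💡 Summary and Recommendations

Overall assessment
DeepSeek V4 Pro scores Base 75.7% / Plus 66.1% on MBPP. This places the model in the lower-middle of the field, in the same range as Gemini Pro 1.0 (75.4% / 61.4%) and DeepSeek-Coder-6.7B (74.9% / 65.6%). The main drag is the 16.4% blank-answer rate.
Significant gap to other DeepSeek models
Within the DeepSeek family, V4 Pro (75.7% / 66.1%) trails clearly behind:
  • DeepSeek-Coder-V2: 89.4% / 75.1% (gap 13.7 / 9.0 pp)
  • DeepSeek-V3: 87.6% / 73.0% (gap 11.9 / 6.9 pp)
  • DeepSeek-V2.5: 87.6% / 74.1% (gap 11.9 / 8.0 pp)
This may stem from the proxy API's configuration, model version differences, or the temperature setting.
Potential
Across the 316 tasks where code was generated, the effective pass rate is Base 90.5% / Plus 79.1%. This suggests that the model's underlying code-generation ability is respectable and that the main bottleneck is response reliability rather than code quality.
Recommendations
  1. Reduce the blank rate: adjust the API call strategy (for example the system prompt and the max_tokens limit) to cut down on blank answers
  2. Tune the temperature: try pure greedy decoding with temperature=0 for more stable output
  3. Strengthen boundary handling: make the generated code more robust to boundary conditions such as empty inputs and negative numbers
  4. Cross-check against official results: reproduce the test on the official DeepSeek API to rule out differences introduced by the proxy
  5. Add HumanEval: run the HumanEval benchmark for a fuller picture of programming ability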
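One way to act on the first recommendation is to detect blank responses and retry them. The sketch below assumes a hypothetical `generate` callable that wraps the model API and returns the response text; the retry policy, the fence-stripping rule, and the stub generator are all illustrative and do not describe how this evaluation was actually run.

```python
import re

FENCE_MARK = "`" * 3  # built at runtime so the listing holds no literal fence
FENCE = re.compile(FENCE_MARK + r"(?:python)?\s*\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code(response_text):
    """Return the first fenced code block, or the raw text if unfenced."""
    match = FENCE.search(response_text)
    return (match.group(1) if match else response_text).strip()


def generate_with_retry(generate, prompt, max_attempts=3):
    """Call generate(prompt) until it yields non-empty code.

    `generate` is a hypothetical callable wrapping the model API.
    Returns (code, attempts_used); code is "" if every attempt was blank.
    """
    for attempt in range(1, max_attempts + 1):
        code = extract_code(generate(prompt) or "")
        if code:
            return code, attempt
    return "", max_attempts


def make_flaky_generator(blank_calls):
    """Stub that returns blank text for the first `blank_calls` calls."""
    state = {"calls": 0}

    def generate(prompt):
        state["calls"] += 1
        if state["calls"] <= blank_calls:
            return ""
        body = "def add(a, b):\n    return a + b\n"
        return FENCE_MARK + "python\n" + body + FENCE_MARK

    return generate


code, attempts = generate_with_retry(make_flaky_generator(2), "Write add(a, b).")
print(attempts)
print(code)
```

If retries are used, the number of retried tasks should be reported alongside the score, since re-sampling after a blank response is no longer a strict single-generation pass@1 measurement.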