Source: en_mcqa_2025-10-04.json · Updated: 2025-10-04 14:44:07
Scoring: score columns are ordered Overall, Val, Test. In each column, the best score is bold and the second-best is underlined.
Name | Size | Modality | Overall | Val | Test | License | Date |
---|---|---|---|---|---|---|---|
Gemini 2.5 Pro | - | multimodal | 85.8% | 85.3% | 86.4% | proprietary | 2025-09-02 |
GPT-5 | - | multimodal | 79.8% | 76.2% | 83.3% | proprietary | 2025-08-26 |
Claude Sonnet 4 | - | multimodal | 75.7% | 75.0% | 76.4% | proprietary | 2025-08-31 |
Claude Opus 4.1 | - | multimodal | 76.6% | 74.5% | 78.8% | proprietary | 2025-08-29 |
GPT-4.1 | - | multimodal | 71.1% | 69.5% | 72.7% | proprietary | 2025-08-24 |
Gemini 2.5 Flash | - | multimodal | 68.0% | 68.4% | 67.7% | proprietary | 2025-09-01 |
gpt-oss-20b | 21B | text-only | 63.5% | 60.7% | 66.2% | open-source | 2025-08-27 |
gpt-oss-120b | 117B | text-only | 64.8% | 60.3% | 69.2% | open-source | 2025-08-26 |
llama4:scout | 109B | multimodal | 51.2% | 47.0% | 55.4% | open-source | 2025-08-28 |
llava-v1.6:13b | 13B | multimodal | 46.5% | 44.4% | 48.6% | open-source | 2025-08-28 |
llava-v1.6:7b | 7B | multimodal | 45.9% | 42.6% | 49.1% | open-source | 2025-08-28 |
Frequent Choice | - | None | 34.2% | 39.1% | 29.2% | - | 2025-09-28 |
Stratified Random Choice | - | None | 31.7% | 37.5% | 25.9% | - | 2025-09-28 |
deepseek-vl2 | 27.5B | multimodal | 34.9% | 31.4% | 38.3% | open-source | 2025-09-08 |
Random Choice | - | None | 25.1% | 26.3% | 23.8% | - | 2025-09-28 |
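
The three baseline rows (Frequent Choice, Stratified Random Choice, Random Choice) are not defined on this page. The sketch below shows one common reading of those names, under stated assumptions: Random Choice picks uniformly among the answer options, Frequent Choice always answers with the most common label in some reference set, and Stratified Random Choice samples labels in proportion to their frequency in that reference set. The function names, the choice of reference set, and the toy labels are all hypothetical and are not taken from the leaderboard's code or data.

```python
import random
from collections import Counter


def random_choice(options, rng):
    """Assumed 'Random Choice': pick one option uniformly at random."""
    return rng.choice(options)


def frequent_choice(reference_labels):
    """Assumed 'Frequent Choice': always answer with the most common reference label."""
    return Counter(reference_labels).most_common(1)[0][0]


def stratified_random_choice(reference_labels, rng):
    """Assumed 'Stratified Random Choice': sample a label with probability
    proportional to how often it appears in the reference labels."""
    counts = Counter(reference_labels)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable

    # Hypothetical toy labels; NOT the benchmark's data.
    reference = ["A", "B", "B", "C", "B", "D"]
    gold = ["B", "A", "B", "C"]
    options = ["A", "B", "C", "D"]

    majority = frequent_choice(reference)
    baselines = {
        "Frequent Choice": [majority for _ in gold],
        "Stratified Random Choice": [stratified_random_choice(reference, rng) for _ in gold],
        "Random Choice": [random_choice(options, rng) for _ in gold],
    }
    for name, preds in baselines.items():
        print(f"{name}: {accuracy(preds, gold):.1%}")
```

Under this reading, a Random Choice score near 25% is what uniform guessing over four answer options would give. The gap between Val and Test for Frequent Choice would follow from the two splits having different label distributions. Both points are interpretations of the table, not statements from the source.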