A Multimodal Knowledge and Reasoning Benchmark for
Korean Construction Engineering & Management
Overview of the KoCEM dataset. KoCEM presents four challenges: 1) comprehensiveness: 1000+ questions across 11 construction subjects; 2) highly heterogeneous image types; 3) interleaved text and images; 4) expert-level perception and reasoning rooted in deep domain knowledge.
We introduce KoCEM: a new benchmark designed to evaluate multimodal models on Korean Construction Engineering & Management tasks demanding expert-level domain knowledge and deliberate reasoning. KoCEM includes meticulously collected multimodal questions from construction engineering exams, textbooks, and professional materials, covering 11 core subjects, including Architectural Planning, Building System, Construction Management, Drawing Interpretation, Structural Engineering, and Safety Management.
Unlike existing benchmarks focused on general knowledge, KoCEM focuses on advanced perception and reasoning with domain-specific Korean construction knowledge. Our evaluation of various state-of-the-art multimodal models highlights the substantial challenges posed by KoCEM. Even the most advanced models show significant room for improvement, indicating the complexity of expert-level construction engineering tasks.
We introduce the Korean Construction Engineering & Management (KoCEM) benchmark, a novel benchmark meticulously curated to assess the expert-level multimodal understanding capability of foundation models in the construction domain. It covers 11 subjects across the construction engineering discipline, including Architectural Planning, Building System, Construction Management, and more.
KoCEM is designed to measure three essential skills in LMMs: perception, knowledge, and reasoning. Our aim is to evaluate how well these models can not only perceive and understand information across different modalities but also apply reasoning with subject-specific knowledge to derive the solution.
Our KoCEM benchmark introduces key challenges to multimodal foundation models. Among these, we particularly highlight the challenge stemming from the requirement for both expert-level visual perception and deliberate reasoning with subject-specific knowledge in the Korean construction domain.
To further distinguish KoCEM from existing benchmarks, we elaborate on its design. In terms of breadth, prior benchmarks focus heavily on general knowledge and common sense, and cover a limited range of image formats. Our benchmark instead targets expert-level construction knowledge presented in diverse image formats, including technical drawings, blueprints, diagrams, charts, and photographs.
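For illustration, each KoCEM item pairs question text with one or more images and a set of answer options. The sketch below shows how such a record might be laid out; the field names, ID format, and file names are hypothetical and not the released schema.

```python
# Hypothetical KoCEM record layout (field names and values are illustrative only).
example_item = {
    "id": "const_mgmt_0042",               # assumed identifier format
    "subject": "Construction Management",  # one of the 11 subjects
    "question": "Identify the critical path in the network schedule shown in <image 1>.",
    "options": ["A-B-D-G", "A-C-E-G", "A-C-F-G", "A-B-E-G"],
    "images": ["const_mgmt_0042_1.png"],   # interleaved via <image k> placeholders in the text
    "answer": "B",
}
```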
Sampled KoCEM examples from each subject. The questions and images need expert-level knowledge to understand and reason.
We evaluate various models including LLMs and LMMs. In each type, we consider both closed- and open-source models. Our evaluation is conducted under a zero-shot setting to assess the capability of models to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark.
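As a sketch of this zero-shot protocol, the loop below formats each question once, with no few-shot demonstrations, and scores the extracted option letter. The dataset path, split name, and `model.generate` client API are assumptions for illustration, not the exact harness used.

```python
import re
from datasets import load_dataset  # assumed: benchmark distributed via Hugging Face datasets

def build_prompt(item):
    """Format a multiple-choice question with no few-shot examples (zero-shot)."""
    letters = "ABCDE"
    options = "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(item["options"]))
    return f"{item['question']}\n{options}\nAnswer with the letter of the correct option only."

def extract_choice(text):
    """Pull the first option letter out of a free-form model response."""
    match = re.search(r"\b([A-E])\b", text.upper())
    return match.group(1) if match else None

def evaluate(model, split):
    """Zero-shot accuracy over one split; model.generate is a hypothetical multimodal client."""
    correct = 0
    for item in split:
        response = model.generate(prompt=build_prompt(item), images=item["images"])
        correct += int(extract_choice(response) == item["answer"])
    return correct / len(split)

# Placeholder dataset path and split name:
# split = load_dataset("KoCEM/KoCEM", split="test")
# print(evaluate(my_model, split))
```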
Last updated: 12/10/2024
| Model | Size | Date | Modality | Score | License |
|---|---|---|---|---|---|
We compare the performance of various models across different construction engineering subjects. Across all subjects, Gemini 2.5 Pro consistently outperforms the other models by a significant margin. Open-source models demonstrate relatively strong performance in categories such as Materials and Safety Management, whose content appears more frequently in training data. However, for specialized subjects like Drawing Interpretation and Domain Reasoning, all models obtain lower scores.
| Model | Arch. Planning | Building Sys. | Const. Mgmt. | Drawing Int. | Struct. Eng. | Overall |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 84.2 | 91.0 | 83.7 | 70.6 | 87.8 | 86.5 |
| GPT-5 | 85.2 | 89.8 | 82.0 | 67.9 | 86.0 | 82.5 |
| Claude Opus 4.1 | 79.8 | 83.6 | 80.7 | 52.2 | 73.0 | 78.4 |
| gpt-oss-120b | 72.1 | 72.5 | 70.8 | 37.3 | 54.8 | 68.3 |
| llama4-scout | 59.0 | 61.3 | 59.8 | 32.8 | 52.7 | 54.4 |
Selected models' performance on different construction engineering subjects.
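The subject-level and overall numbers above are simple accuracies. A minimal way to aggregate them from per-item predictions is sketched below; the record layout is assumed, and the overall score here is micro-averaged over items, which may differ from the aggregation actually used for the Overall column.

```python
from collections import defaultdict

def per_subject_accuracy(records):
    """records: iterable of dicts with 'subject', 'prediction', and 'answer' keys (assumed layout)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["prediction"] == r["answer"])
    scores = {s: hits[s] / totals[s] for s in totals}
    # Micro-average over all items for the overall score.
    scores["Overall"] = sum(hits.values()) / sum(totals.values())
    return scores
```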
We delve into the analysis of errors by Gemini 2.5 Pro, a pivotal aspect for understanding its operational capabilities and limitations. This analysis serves not only to identify the model's current shortcomings but also to guide future enhancements in its design and training. We meticulously examine 50 randomly sampled error instances from Gemini 2.5 Pro's predictions.
Error distribution over 50 annotated Gemini 2.5 Pro errors.
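Drawing the 50 error cases for manual annotation could look like the following sketch; the record fields, sample size, and seed are illustrative rather than the exact procedure used.

```python
import random

def sample_errors(records, k=50, seed=0):
    """Draw k incorrect predictions for manual annotation (assumed record layout)."""
    errors = [r for r in records if r["prediction"] != r["answer"]]
    rng = random.Random(seed)  # fixed seed so the annotation sample is reproducible
    return rng.sample(errors, min(k, len(errors)))
```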
@mastersthesis{dcollection,
title={KoCEM: A Multimodal Knowledge and Reasoning Benchmark for Korean Construction Engineering & Management},
author={Yoo, Byunghee},
school={Seoul National University},
year={2024}
}