KoCEM

A Multimodal Knowledge and Reasoning Benchmark for
Korean Construction Engineering & Management

KoCEM Overview

Overview of the KoCEM dataset. KoCEM presents four challenges: 1) comprehensiveness: 1000+ questions across 11 construction subjects; 2) highly heterogeneous image types; 3) interleaved text and images; 4) expert-level perception and reasoning rooted in deep domain knowledge.


Introduction

We introduce KoCEM: a new benchmark designed to evaluate multimodal models on Korean Construction Engineering & Management tasks demanding expert-level domain knowledge and deliberate reasoning. KoCEM includes meticulously collected multimodal questions from construction engineering exams, textbooks, and professional materials, covering 11 core subjects including Architectural Planning, Building System, Construction Management, Drawing Interpretation, Structural Engineering, and Safety Management.

Unlike existing benchmarks focused on general knowledge, KoCEM focuses on advanced perception and reasoning with domain-specific Korean construction knowledge. Our evaluation of various state-of-the-art multimodal models highlights the substantial challenges posed by KoCEM. Even the most advanced models show significant room for improvement, indicating the complexity of expert-level construction engineering tasks.
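As a rough illustration of how KoCEM-style questions can be consumed programmatically, the sketch below loads a subject split with the Hugging Face datasets library. The repository ID, configuration names, and field names (question, image, answer) are hypothetical placeholders, not the published schema.

from datasets import load_dataset

# Hypothetical subject configurations; the real configuration names may differ.
subjects = ["Architectural_Planning", "Building_System", "Construction_Management"]

for subject in subjects:
    # "KoCEM" is a placeholder repository ID, not a confirmed dataset path.
    ds = load_dataset("KoCEM", subject, split="test")
    sample = ds[0]
    print(subject, sample["question"][:80], sample["answer"])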

Overview

We introduce the Korean Construction Engineering & Management (KoCEM) benchmark, meticulously curated to assess the expert-level multimodal understanding of foundation models in the construction domain. It covers 11 subjects across the construction engineering discipline, including Architectural Planning, Building System, Construction Management, and more.

KoCEM is designed to measure three essential skills in LMMs: perception, knowledge, and reasoning. Our aim is to evaluate how well these models can not only perceive and understand information across different modalities but also apply reasoning with subject-specific knowledge to derive the solution.

Our KoCEM benchmark introduces key challenges for multimodal foundation models. Among these, we particularly highlight the requirement for both expert-level visual perception and deliberate reasoning with subject-specific knowledge in the Korean construction domain.

Comparisons with Existing Benchmarks

To further distinguish KoCEM from existing benchmarks, we elaborate on the benchmark details. From a breadth perspective, prior benchmarks are heavily focused on general knowledge and common sense, and the range of image formats they cover is limited. Our benchmark aims to cover expert-level construction knowledge with varied image formats, including technical drawings, blueprints, diagrams, charts, and photographs.

Comparison

Sampled KoCEM examples from each subject. The questions and images need expert-level knowledge to understand and reason.

Statistics

Subject Distribution

Distribution of questions across the 11 KoCEM subjects.

Experiment Results

Leaderboard

We evaluate various models including LLMs and LMMs. In each type, we consider both closed- and open-source models. Our evaluation is conducted under a zero-shot setting to assess the capability of models to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark.
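To make the protocol concrete, the sketch below shows a minimal zero-shot evaluation loop: each question (and its image, if any) is sent to the model exactly once, with no demonstrations, and the reply is parsed into an option letter before scoring. The query_model callable and the answer-extraction regex are illustrative assumptions, not the exact harness used for KoCEM.

import re

def extract_choice(response):
    # Pull the first standalone option letter (A-E) out of a free-form reply.
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def evaluate_zero_shot(examples, query_model):
    # examples: iterable of dicts with 'question', 'options', 'image', and gold 'answer'.
    correct, total = 0, 0
    for ex in examples:
        prompt = ex["question"] + "\n" + "\n".join(ex["options"])
        response = query_model(prompt=prompt, image=ex.get("image"))  # single call, no few-shot demos
        total += 1
        correct += int(extract_choice(response) == ex["answer"])
    return 100.0 * correct / total if total else 0.0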

Last updated: 12/10/2024

Model: The name of the language model being evaluated
Size: Number of parameters in the model (e.g., 7B, 13B, 70B)
Date: Release date of the model
Modality: Type of input the model can process (text-only or multimodal)
Score: Accuracy score on the KoCEM benchmark (0-100%)
License: Model availability (open-source or proprietary)

Different Subjects

We compare the performance of various models across different construction engineering subjects. Gemini 2.5 Pro achieves the highest overall score and leads on most subjects. Open-source models demonstrate relatively strong performance in categories such as Materials and Safety Management, whose content is better represented in general training data. However, on specialized subjects such as Drawing Interpretation and Domain Reasoning, all models obtain markedly lower scores.

Model             Arch. Planning   Building Sys.   Const. Mgmt.   Drawing Int.   Struct. Eng.   Overall
Gemini 2.5 Pro    84.2             91.0            83.7           70.6           87.8           86.5
GPT-5             85.2             89.8            82.0           67.9           86.0           82.5
Claude Opus 4.1   79.8             83.6            80.7           52.2           73.0           78.4
gpt-oss-120b      72.1             72.5            70.8           37.3           54.8           68.3
llama4-scout      59.0             61.3            59.8           32.8           52.7           54.4

Selected models' performance on different construction engineering subjects.
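The per-subject numbers above can be reproduced from a flat prediction log with a small aggregation step, sketched below. The record fields (subject, correct) are assumptions about the log format rather than a fixed KoCEM schema.

from collections import defaultdict

def accuracy_by_subject(records):
    # records: iterable of dicts like {"subject": "Drawing Interpretation", "correct": True}
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    # Return percentage accuracy per subject, e.g. {"Drawing Interpretation": 70.6, ...}
    return {s: 100.0 * hits[s] / totals[s] for s in totals}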

Error Analysis

We analyze the errors made by Gemini 2.5 Pro, the strongest model on our leaderboard, to understand its capabilities and limitations. This analysis identifies the model's current shortcomings and also points to directions for improving model design and training. We manually examine 50 randomly sampled error instances from Gemini 2.5 Pro's predictions.
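A minimal sketch of how such a sample can be drawn reproducibly is shown below: filter the incorrect predictions and take a fixed-seed random sample of 50. The field name 'correct' is an assumption about the prediction log, not part of the released data.

import random

def sample_errors(predictions, k=50, seed=0):
    # Keep only the incorrectly answered items.
    errors = [p for p in predictions if not p["correct"]]
    rng = random.Random(seed)  # fixed seed so the annotated sample is reproducible
    return rng.sample(errors, min(k, len(errors)))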

Error Distribution

Error distribution over 50 annotated Gemini 2.5 Pro errors.

BibTeX

@mastersthesis{dcollection,
  title={KoCEM: A Multimodal Knowledge and Reasoning Benchmark for Korean Construction Engineering \& Management},
  author={Yoo, Byunghee},
  school={Seoul National University},
  year={2024}
}