Overview of the MMCE dataset. MMCE presents four challenges: 1) comprehensiveness: 1000+ questions across 11 construction subjects; 2) highly heterogeneous image types; 3) interleaved text and images; 4) expert-level perception and reasoning rooted in deep domain knowledge.

Introduction

We introduce MMCE: a new benchmark designed to evaluate multimodal models on Korean Construction Engineering & Management tasks demanding expert-level domain knowledge and deliberate reasoning. MMCE includes meticulously collected multimodal questions from construction engineering exams, textbooks, and professional materials, covering 11 core subjects including Architectural Planning, Building System, Construction Management, Drawing Interpretation, Structural Engineering, and Safety Management.

Unlike existing benchmarks focused on general knowledge, MMCE focuses on advanced perception and reasoning with domain-specific Korean construction knowledge. Our evaluation of various state-of-the-art multimodal models highlights the substantial challenges posed by MMCE. Even the most advanced models show significant room for improvement, indicating the complexity of expert-level construction engineering tasks.

Overview

We introduce the Korean Construction Engineering & Management (MMCE) benchmark, curated to assess the expert-level multimodal understanding of foundation models in the construction domain. It covers 11 subjects across the construction engineering discipline, including Architectural Planning, Building System, and Construction Management.

MMCE is designed to measure three essential skills in large multimodal models (LMMs): perception, knowledge, and reasoning. Our aim is to evaluate how well these models perceive and understand information across modalities, and how well they apply reasoning with subject-specific knowledge to derive a solution.

The MMCE benchmark poses several key challenges for multimodal foundation models. We particularly highlight the need for both expert-level visual perception and deliberate reasoning with subject-specific knowledge in the Korean construction domain.

Comparisons with Existing Benchmarks

To distinguish MMCE from existing benchmarks, we describe how it differs in breadth. Prior benchmarks focus heavily on general knowledge and common sense, and the image formats they cover are limited. MMCE aims to cover expert-level construction knowledge with a variety of image formats, including technical drawings, blueprints, diagrams, charts, and photographs.

Sampled MMCE examples from each subject. Understanding and reasoning about the questions and images requires expert-level knowledge.

Statistics


Experiment Results

Leaderboard

We evaluate various models including LLMs and LMMs. In each type, we consider both closed- and open-source models. Our evaluation is conducted under a zero-shot setting to assess the capability of models to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark.
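The zero-shot scoring described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes multiple-choice questions (the page does not state the question format), and the record keys `subject`, `answer`, and `prediction`, as well as both function names, are hypothetical.

```python
from collections import defaultdict


def build_zero_shot_prompt(question, options):
    """Format one question with no demonstrations (zero-shot).

    Assumes a multiple-choice format; this is an illustrative assumption.
    """
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy_by_subject(records):
    """Return (per-subject accuracy, overall accuracy), both in percent.

    `records` is an iterable of dicts with the hypothetical keys
    'subject', 'answer', and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["subject"]] += 1
    per_subject = {s: 100.0 * correct[s] / total[s] for s in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_subject, overall
```

Note that the overall score here is weighted by the number of questions per subject rather than averaged over subjects; which convention the leaderboard uses is not stated on this page.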

Last updated: 12/10/2024

Leaderboard columns: Model (name of the model being evaluated), Size (number of parameters, e.g., 7B, 13B, 70B), Date (release date of the model), Modality (text-only or multimodal input), Score (accuracy on the MMCE benchmark, 0-100%), License (open-source or proprietary).

Different Subjects

We compare the performance of selected models across construction engineering subjects. Gemini 2.5 Pro achieves the highest overall score (86.5) and leads in four of the five subjects shown, while GPT-5 is slightly ahead in Architectural Planning (85.2 vs. 84.2). Open-source models perform relatively well in categories such as Materials and Safety Management, which are likely seen more frequently during training. In specialized subjects such as Drawing Interpretation and Domain Reasoning, however, all models obtain lower scores.

Model	Arch. Planning	Building Sys.	Const. Mgmt.	Drawing Int.	Struct. Eng.	Overall
Gemini 2.5 Pro	84.2	91.0	83.7	70.6	87.8	86.5
GPT-5	85.2	89.8	82.0	67.9	86.0	82.5
Claude Opus 4.1	79.8	83.6	80.7	52.2	73.0	78.4
gpt-oss-120b	72.1	72.5	70.8	37.3	54.8	68.3
llama4-scout	59.0	61.3	59.8	32.8	52.7	54.4

Selected models' performance on different construction engineering subjects.
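The per-subject comparison can be checked directly against the table. The sketch below copies the five subject columns shown above into a dictionary and finds the best model per subject and the weakest subject per model; the function names are illustrative, not part of any released MMCE code.

```python
SUBJECTS = ["Arch. Planning", "Building Sys.", "Const. Mgmt.", "Drawing Int.", "Struct. Eng."]

# Scores copied from the table above (the Overall column is omitted).
SCORES = {
    "Gemini 2.5 Pro": [84.2, 91.0, 83.7, 70.6, 87.8],
    "GPT-5": [85.2, 89.8, 82.0, 67.9, 86.0],
    "Claude Opus 4.1": [79.8, 83.6, 80.7, 52.2, 73.0],
    "gpt-oss-120b": [72.1, 72.5, 70.8, 37.3, 54.8],
    "llama4-scout": [59.0, 61.3, 59.8, 32.8, 52.7],
}


def best_model_per_subject(scores, subjects):
    """Map each subject to the model with the highest score in it."""
    best = {}
    for i, subject in enumerate(subjects):
        best[subject] = max(scores, key=lambda model: scores[model][i])
    return best


def weakest_subject_per_model(scores, subjects):
    """Map each model to the subject in which it scores lowest."""
    weakest = {}
    for model, values in scores.items():
        lowest_index = min(range(len(subjects)), key=lambda i: values[i])
        weakest[model] = subjects[lowest_index]
    return weakest


if __name__ == "__main__":
    print(best_model_per_subject(SCORES, SUBJECTS))
    print(weakest_subject_per_model(SCORES, SUBJECTS))
```

On these numbers, GPT-5 leads Arch. Planning, Gemini 2.5 Pro leads the other four subjects, and Drawing Int. is the weakest subject for every model listed.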

Error Analysis

We analyze the errors made by Gemini 2.5 Pro to understand its capabilities and limitations. The analysis identifies the model's current shortcomings and can guide future improvements in its design and training. We examine 50 randomly sampled error instances from Gemini 2.5 Pro's predictions.

Error distribution over 50 annotated Gemini 2.5 Pro errors.
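The sampling and tallying step behind such a distribution might look like the sketch below. It is an assumption-laden illustration: the record keys `answer`, `prediction`, and `error_type` are hypothetical, the fixed seed is only for reproducibility, and the error categories themselves come from manual annotation that this page does not list.

```python
import random
from collections import Counter


def sample_errors(predictions, k=50, seed=0):
    """Randomly sample up to k incorrect predictions for manual annotation.

    `predictions` is a list of dicts with hypothetical keys
    'answer' and 'prediction'.
    """
    errors = [p for p in predictions if p["prediction"] != p["answer"]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(errors, min(k, len(errors)))


def error_distribution(annotated):
    """Return the percentage of each annotated error type.

    `annotated` is a list of dicts carrying a hypothetical 'error_type'
    label assigned by a human annotator.
    """
    counts = Counter(e["error_type"] for e in annotated)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.most_common()}
```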

BibTeX

@mastersthesis{dcollection,
  title={MMCE: A Multimodal Multitask Understanding and Reasoning Benchmark for Expert Artificial Intelligence in Construction Engineering},
  author={Yoo, Byunghee},
  school={Seoul National University},
  year={2024}
}