Abstract

Construction sites are complex environments where traditional safety monitoring methods often suffer from low detection accuracy and limited interpretability. To address these challenges, this study proposes a modular multimodal agent framework that integrates computer vision, knowledge representation, and large language model (LLM)-based reasoning. First, a CLIP model fine-tuned with Low-Rank Adaptation (LoRA) is combined with YOLOv10 to achieve precise recognition of construction activities and personal protective equipment (PPE). Second, a construction safety knowledge graph is built and coupled with Retrieval-Augmented Generation (RAG) to provide structured domain knowledge and enhance contextual understanding. Third, a FusedChain prompting strategy is designed to guide LLMs through step-by-step safety risk reasoning. Experimental results show that the proposed approach achieves 97.35% accuracy in activity recognition and an average F1-score of 0.84 in PPE detection, and that it significantly outperforms existing methods in hazard reasoning. The modular design also facilitates scalable integration with more advanced foundation models, indicating strong potential for real-world deployment in intelligent construction safety management.
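
The following is a minimal sketch, not the authors' released code, of the LoRA fine-tuning step described in the abstract's first component, using the Hugging Face transformers and peft libraries. The checkpoint name, rank, scaling factor, and target modules are assumptions chosen for illustration; the paper's actual hyperparameters are not given here.

    # Illustrative sketch: wrap a pretrained CLIP model with LoRA adapters so
    # that only the low-rank update matrices are trained during fine-tuning.
    # Checkpoint and hyperparameters below are assumed, not taken from the paper.
    from transformers import CLIPModel, CLIPProcessor
    from peft import LoraConfig, get_peft_model

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    lora_config = LoraConfig(
        r=8,                                  # low-rank dimension (assumed)
        lora_alpha=16,                        # scaling factor (assumed)
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],  # attention projections in CLIP
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # only the LoRA adapters are trainable

Fine-tuning on construction-activity labels would then proceed with a standard contrastive or classification objective; the YOLOv10 detector and the RAG-backed knowledge graph would remain separate modules in the framework's pipeline.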

Publication Info

Year: 2025
Type: Article
Volume: 15
Issue: 24
Pages: 4439-4439
Citations: 0
Access: Closed

Cite This

Sheng Cheng, Yifan Qi, Rui Wu et al. (2025). A Multimodal Agent Framework for Construction Scenarios: Accurate Perception, Dynamic Retrieval, and Explainable Hazard Reasoning. Buildings, 15(24), 4439-4439. https://doi.org/10.3390/buildings15244439

Identifiers

DOI: 10.3390/buildings15244439