Abstract
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To address this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy for performing equally well over unbalanced target properties is to build a training set in which the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly release our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
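The threshold-based strategy the abstract identifies as best can be sketched as oversampling rare properties until each one meets a minimum occurrence count. The function below is a minimal illustration, not the authors' implementation; the example dictionary keys (`text`, `property`) and the DBpedia-style property names are hypothetical placeholders.

```python
import random
from collections import defaultdict

def build_balanced_training_set(examples, min_count, seed=0):
    """Oversample rare properties so every target property appears at
    least `min_count` times (a sketch of the threshold strategy; the
    paper's actual procedure may differ, e.g. using synthetic templates
    instead of duplication)."""
    rng = random.Random(seed)
    by_prop = defaultdict(list)
    for ex in examples:
        by_prop[ex["property"]].append(ex)
    balanced = list(examples)
    for prop, exs in by_prop.items():
        deficit = min_count - len(exs)
        # Duplicate randomly chosen examples of under-represented properties.
        balanced.extend(rng.choice(exs) for _ in range(max(0, deficit)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical long-tailed corpus: one property dominates.
corpus = (
    [{"text": f"s{i}", "property": "dbo:birthDate"} for i in range(10)]
    + [{"text": "s_rare", "property": "dbo:spouse"}]
)
balanced = build_balanced_training_set(corpus, min_count=5)
```

In practice, duplication could be replaced by the template-based synthetic augmentation the paper evaluates, generating new sentences for each under-represented property rather than repeating existing ones.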
Publication Info
- Year: 2025
- Type: preprint
- Pages: 9-17
Identifiers
- DOI: 10.1145/3731443.3771342