Abstract
The paper examines tokenization as a key step in textual data processing, with particular attention to the financial domain. Current tokenization techniques are analyzed using examples from recent research, along with their impact on the performance of NLP models. The study shows that subword tokenization algorithms (BPE, WordPiece, Unigram) have become the standard for language models because of their flexibility and text-compression efficiency. We discuss the limitations these methods impose on input sequence length (BPE and WordPiece tend to over-segment text, Unigram requires complex training, and character-level tokenization produces excessively long sequences) and ways to overcome them: text partitioning, hierarchical processing, and extrapolation of pre-trained Transformer models to longer inputs. For financial data, the use of domain-specific tokenizers or additional training on specialized corpora is recommended, as confirmed by the successful experience of BloombergGPT. Special attention is paid to the problem of processing long texts, for which the three approaches above are examined in detail. In conclusion, the importance of tokenization for financial analytics is emphasized, where the quality of text processing directly affects decision-making. Tokenization methods continue to evolve alongside NLP models, making this stage of text processing a critical component of modern analytical systems.
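To make the "text partitioning" approach concrete, the following is a minimal, self-contained Python sketch of splitting a long token sequence into overlapping windows that fit a model's maximum input length. The function name, the 512-token limit, and the 128-token overlap are illustrative assumptions, not values taken from the paper; real limits depend on the specific model and tokenizer.

```python
# Illustrative sketch of text partitioning: a long token sequence is cut into
# overlapping windows so each piece fits the model's maximum input length.
# max_len and stride are hypothetical defaults, not values from the paper.
from typing import List


def chunk_tokens(token_ids: List[int], max_len: int = 512, stride: int = 128) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens,
    overlapping by `stride` tokens so context is not lost at the cut."""
    if max_len <= stride:
        raise ValueError("max_len must exceed stride")
    chunks = []
    step = max_len - stride
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks


# Example: a document of 1300 tokens becomes four overlapping windows.
doc = list(range(1300))
windows = chunk_tokens(doc)
print([len(w) for w in windows])  # -> [512, 512, 512, 148]
```

Each window can then be encoded independently, with the per-window outputs aggregated downstream, which is essentially what the hierarchical-processing approach builds on.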
Publication Info
- Year: 2025
- Type: article
- Volume: 1
- Issue: 3
- Pages: 19-29
Identifiers
- DOI: 10.26794/3033-7097-2025-1-3-19-29