Abstract

This paper examines tokenization as a key step in text processing, with a focus on the financial domain. It reviews current tokenization techniques, using examples from recent research, and assesses their impact on the performance of NLP models. Subword tokenization algorithms (BPE, WordPiece, Unigram) have become the standard for language models because of their flexibility and efficient text compression. Each approach has drawbacks: BPE and WordPiece tend to over-segment words, Unigram requires complex training, and character-level tokenization produces excessively long sequences. Special attention is paid to long texts, which run up against the input sequence length limits of language models. Three approaches to this problem are discussed: text partitioning, hierarchical processing, and extrapolation of pre-trained Transformer models to longer inputs. For financial data, the paper recommends domain-specific tokenizers or additional training on specialized data, a recommendation supported by the experience of BloombergGPT. The conclusion emphasizes the importance of tokenization for financial analytics, where the quality of text processing directly affects decision-making. Tokenization methods continue to develop alongside NLP models, and this stage of text processing remains a critical component of modern analytical systems.
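As an illustration of the text partitioning approach mentioned in the abstract, the following minimal sketch splits a token sequence into overlapping windows that each fit within a fixed length limit. The function name `split_into_windows`, the parameters `max_len` and `overlap`, and the sample sentence are assumptions made for this example and are not taken from the paper; whitespace splitting stands in for a real subword tokenizer.

```python
from typing import List


def split_into_windows(tokens: List[str], max_len: int, overlap: int = 0) -> List[List[str]]:
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that context at the
    window boundaries is not lost entirely. (Hypothetical helper.)
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows


# Assumption: whitespace tokens approximate the output of a subword tokenizer.
text = "The central bank raised its key rate by 50 basis points on Friday"
tokens = text.split()
chunks = split_into_windows(tokens, max_len=6, overlap=2)

for chunk in chunks:
    print(chunk)
```

Each window can then be passed to a model separately, and the per-window outputs combined afterwards. The overlap trades some redundant computation for continuity across window boundaries.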

Publication Info

Year: 2025
Type: article
Volume: 1
Issue: 3
Pages: 19-29
Citations: 0
Access: Closed

Cite This

Eldar Boltachev, Mais Farkhadov, A. I. Tyulyakov (2025). Modern Tokenization Methods for Text Processing in the Financial Domain. Digital Solutions and Artificial Intelligence Technologies, 1(3), 19-29. https://doi.org/10.26794/3033-7097-2025-1-3-19-29

Identifiers

DOI: 10.26794/3033-7097-2025-1-3-19-29