Abstract
The paper examines tokenization as a key step in textual data processing, with particular attention to the financial domain. Current tokenization techniques are analyzed using examples from recent research, along with their impact on the performance of NLP models. The study shows that subword tokenization algorithms (BPE, WordPiece, Unigram) have become the standard for language models because of their flexibility and text-compression efficiency. We discuss the limitations these methods impose on input sequence length (BPE and WordPiece tend to over-segment text, Unigram requires complex training, and character-level tokenization produces excessively long sequences) and ways to overcome them: text partitioning, hierarchical processing, and extrapolation of pre-trained Transformer models to longer inputs. For financial data, the use of domain-specific tokenizers or additional training on specialized corpora is recommended, as confirmed by the successful experience of BloombergGPT. Special attention is paid to the problem of processing long texts, for which the three approaches above are examined in detail. In conclusion, the importance of tokenization for financial analytics is emphasized, where the quality of text processing directly affects decision-making. Tokenization methods continue to evolve alongside NLP models, making this stage of text processing a critical component of modern analytical systems.
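To make the "text partitioning" approach concrete, the following is a minimal, self-contained Python sketch of splitting a long token sequence into overlapping windows that fit a model's maximum input length. The function name, the 512-token limit, and the 128-token overlap are illustrative assumptions, not values taken from the paper; real limits depend on the specific model and tokenizer.

```python
# Illustrative sketch of text partitioning: a long token sequence is cut into
# overlapping windows so each piece fits the model's maximum input length.
# max_len and stride are hypothetical defaults, not values from the paper.
from typing import List


def chunk_tokens(token_ids: List[int], max_len: int = 512, stride: int = 128) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens,
    overlapping by `stride` tokens so context is not lost at the cut."""
    if max_len <= stride:
        raise ValueError("max_len must exceed stride")
    chunks = []
    step = max_len - stride
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks


# Example: a document of 1300 tokens becomes four overlapping windows.
doc = list(range(1300))
windows = chunk_tokens(doc)
print([len(w) for w in windows])  # -> [512, 512, 512, 148]
```

Each window can then be encoded independently, with the per-window outputs aggregated downstream, which is essentially what the hierarchical-processing approach builds on.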
Publication Info
- Year: 2025
- Type: article
- Volume: 1
- Issue: 3
- Pages: 19-29
Identifiers
- DOI: 10.26794/3033-7097-2025-1-3-19-29