# Finance-Instruct-Tokenizer
## Model Description
This is a custom tokenizer built from the GPT-2 tokenizer and retrained on the Josephgflowers/Finance-Instruct-500k dataset.
The tokenizer is optimized for financial and economic instruction-based language tasks, including question answering, summarization, and conversational agents in the finance domain.
**Key Features:**
- Vocabulary size: 25,000 tokens
- Domain-specific token coverage for finance, banking, and investment terminology
- Compatible with GPT-2 and other models that support BPE tokenization (see the usage sketch below)
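A minimal usage sketch, assuming the tokenizer is published on the Hugging Face Hub as `yakul259/finance-chat-tokenizer-25k`; the sample sentence is illustrative only:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub (repository id assumed; adjust as needed).
tokenizer = AutoTokenizer.from_pretrained("yakul259/finance-chat-tokenizer-25k")

text = "The Federal Reserve raised interest rates by 25 basis points."
tokens = tokenizer.tokenize(text)   # subword tokens from the 25k vocabulary
ids = tokenizer(text)["input_ids"]  # integer ids for model input

print(tokens)
print(len(ids), "token ids")
```

On in-domain text like this, the finance vocabulary should typically produce fewer tokens than the base GPT-2 tokenizer.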
## Disclaimer
This tokenizer was created purely for experimental and personal use.
It is not reliable or production-ready, and its use is discouraged in critical systems or commercial applications.
Performance, safety, and bias have not been fully evaluated.
## Intended Uses & Limitations
**Intended uses:**
- Tokenization for finance-related chatbots
- Preprocessing for financial text classification, QA, or summarization models
- Training or fine-tuning language models on domain-specific data
**Limitations:**
- Domain-specific; may underperform on unrelated general-domain tasks
- Based on a pre-trained GPT-2 tokenizer; not an entirely novel tokenization scheme
- Dataset is primarily English; performance on other languages may be limited
## Training Data
- Dataset: Josephgflowers/Finance-Instruct-500k
- Data type: Finance-related instruction-response pairs
- Data source: Open-source dataset on Hugging Face Datasets
## Training Procedure
The tokenizer was trained with `tokenizer.train_new_from_iterator()` on text extracted from the dataset; a sketch of the procedure follows the list below.
- Target vocabulary size: 25,000 tokens
- Special tokens inherited from the GPT-2 tokenizer
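A minimal sketch of this procedure; the column name `text` is an assumption, so inspect `dataset.column_names` and concatenate whichever instruction/response fields Finance-Instruct-500k actually uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Start from the pre-trained GPT-2 (BPE) tokenizer.
base = AutoTokenizer.from_pretrained("gpt2")

# Load the finance instruction dataset from the Hugging Face Hub.
dataset = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train")

def text_iterator(batch_size=1000):
    # Yield batches of raw text for tokenizer training.
    # NOTE: "text" is an assumed column name.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Retrain the BPE vocabulary on domain text; special tokens
# are carried over from the base GPT-2 tokenizer.
tokenizer = base.train_new_from_iterator(text_iterator(), vocab_size=25_000)
tokenizer.save_pretrained("finance-chat-tokenizer-25k")
```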
## Evaluation
Evaluation was performed qualitatively by checking token coverage and vocabulary composition on finance-related texts.
Quantitative evaluation can be performed by measuring:
- Token coverage rate on unseen financial documents
- Average tokens per sentence compared to the baseline GPT-2 tokenizer (sketched below)
- Perplexity when integrated with a fine-tuned language model
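A minimal sketch of the second metric, comparing average tokens per sentence against the baseline GPT-2 tokenizer; the sample sentences are illustrative stand-ins for unseen financial documents:

```python
from transformers import AutoTokenizer

finance_tok = AutoTokenizer.from_pretrained("yakul259/finance-chat-tokenizer-25k")
base_tok = AutoTokenizer.from_pretrained("gpt2")

# Illustrative held-out sentences; substitute real unseen financial text.
sentences = [
    "Quarterly EBITDA rose 12% on stronger net interest margins.",
    "The bond's yield to maturity exceeded its coupon rate.",
]

def avg_tokens(tokenizer, texts):
    # Average token count per sentence for a given tokenizer.
    return sum(len(tokenizer(t)["input_ids"]) for t in texts) / len(texts)

print("finance tokenizer:", avg_tokens(finance_tok, sentences))
print("gpt2 baseline:    ", avg_tokens(base_tok, sentences))
```

A lower average for the finance tokenizer on such text suggests better domain coverage.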
## Developed by
- Yakùl (Tokenizer developer)
## Acknowledgements
- Base tokenizer: GPT-2 by OpenAI and Hugging Face
- Dataset: Josephgflowers/Finance-Instruct-500k
- Thanks to the Hugging Face community for open-source tools and resources
## License
- This tokenizer inherits the licenses of the base GPT-2 tokenizer and the training dataset; please review both before any commercial use.