Finance-Instruct-Tokenizer

Model Description

This is a custom tokenizer built from the GPT-2 tokenizer and retrained on the Josephgflowers/Finance-Instruct-500k dataset.
The tokenizer is optimized for financial and economic instruction-based language tasks, including question answering, summarization, and conversational agents in the finance domain.

Key Features:

  • Vocabulary size: 25,000 tokens
  • Domain-specific token coverage for finance, banking, and investment terminology
  • Compatible with GPT-2 and other models supporting BPE tokenization (see the loading sketch below)
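
A minimal loading sketch, assuming the tokenizer is published under the Hub id yakul259/finance-chat-tokenizer-25k listed on this page:

```python
from transformers import AutoTokenizer

# Load the custom tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("yakul259/finance-chat-tokenizer-25k")

text = "The Federal Reserve raised interest rates by 25 basis points."
print(tokenizer.tokenize(text))       # subword tokens
print(tokenizer(text)["input_ids"])   # integer ids for a GPT-2-style model
```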

Disclaimer

This tokenizer was created purely for experimental and personal use.
It is not a reliable or production-ready artifact, and it should not be used in critical systems or commercial applications.
Performance, safety, and bias have not been fully evaluated.


Intended Uses & Limitations

Intended uses:

  • Tokenization for finance-related chatbots
  • Preprocessing for financial text classification, QA, or summarization models (a preprocessing sketch follows this list)
  • Training or fine-tuning language models on domain-specific data
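
As an illustration of the preprocessing use case, here is a hedged sketch that batch-encodes texts for a downstream model; the sentences are invented examples, not dataset content:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yakul259/finance-chat-tokenizer-25k")

# GPT-2-style tokenizers ship without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# Illustrative sentences standing in for a real financial corpus.
texts = [
    "Gross margin improved to 42% on lower input costs.",
    "The bond's yield to maturity rose after the credit downgrade.",
]

# Batch-encode with truncation and padding so the outputs are rectangular
# and ready for a classification, QA, or summarization model.
batch = tokenizer(texts, truncation=True, padding=True, max_length=128)
print(batch["input_ids"])
```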

Limitations:

  • Domain-specific — may underperform on unrelated general-domain tasks
  • Based on a pre-trained GPT-2 tokenizer; not an entirely novel tokenization scheme
  • Dataset is primarily English; performance on other languages may be limited

Training Data

The tokenizer's vocabulary was learned from text extracted from the Josephgflowers/Finance-Instruct-500k dataset, an instruction-style corpus covering finance, banking, and investment topics. The data is primarily English.

Training Procedure

The tokenizer was trained using:

  • tokenizer.train_new_from_iterator() on text extracted from the dataset (sketched below)
  • Target vocabulary size: 25,000 tokens
  • Special tokens inherited from GPT-2 tokenizer
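
A minimal sketch of this procedure. The dataset column names "user" and "assistant" are assumptions; check the dataset card for the actual field names:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Base tokenizer and dataset named in this card.
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train")

def text_iterator(batch_size=1000):
    # The fields "user" and "assistant" are assumed, not confirmed.
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i : i + batch_size]
        yield [f"{u} {a}" for u, a in zip(batch["user"], batch["assistant"])]

# Learn a new 25,000-token BPE vocabulary; special tokens carry over
# from the base GPT-2 tokenizer.
new_tokenizer = base_tokenizer.train_new_from_iterator(
    text_iterator(), vocab_size=25_000
)
new_tokenizer.save_pretrained("finance-chat-tokenizer-25k")
```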

Evaluation

Evaluation was performed qualitatively by checking token coverage and vocabulary composition on finance-related texts.
Quantitative evaluation can be performed by measuring:

  • Token coverage rate on unseen financial documents
  • Average tokens per sentence compared to the baseline GPT-2 tokenizer (see the comparison sketch after this list)
  • Perplexity when integrated with a fine-tuned language model
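
A sketch of the second metric, using a couple of invented sentences where a real evaluation would use a held-out corpus of unseen financial documents:

```python
from transformers import AutoTokenizer

finance_tok = AutoTokenizer.from_pretrained("yakul259/finance-chat-tokenizer-25k")
baseline_tok = AutoTokenizer.from_pretrained("gpt2")

# Illustrative sentences only; substitute a real held-out corpus.
sentences = [
    "Quantitative easing expanded the central bank's balance sheet.",
    "The ETF tracks an index of investment-grade corporate bonds.",
]

for name, tok in [("finance-25k", finance_tok), ("gpt2", baseline_tok)]:
    avg = sum(len(tok.tokenize(s)) for s in sentences) / len(sentences)
    print(f"{name}: {avg:.1f} tokens/sentence")
```

A lower average for the finance tokenizer suggests its vocabulary captures domain terms in fewer pieces.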

Developed by

  • Yakùl (Tokenizer developer)

Acknowledgements


License

  • This tokenizer inherits the licensing terms of the base GPT-2 tokenizer and of the Josephgflowers/Finance-Instruct-500k dataset.
  • Review both licenses before any commercial use.