Model Details
Model Description
This model is a biomedical Named Entity Recognition (NER) model fine-tuned on three entity types: CellLine, CellType, and Tissue (collectively called CeLLaTE). It was developed to extract and classify named entities from biomedical literature, supporting knowledge curation and drug discovery pipelines. The model was fine-tuned from the bioformers/bioformer-16L base model on a filtered version of the CellFinder dataset and is intended for token-classification tasks in the biomedical domain.
- Developed by: The Europe PMC and ChEMBL teams, as part of the OTAR3088 project initiative
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: Token classification (NER)
- Language(s) (NLP): English
- License: Apache-2.0
- Finetuned from model [optional]: bioformers/bioformer-16L
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
This model is intended for extracting CeLLaTE entities from biomedical literature, supporting knowledge curation, research, and downstream biomedical text mining applications.
Direct Use
The model can be used directly for token-level NER on biomedical text, producing entity labels and spans for CellLine, CellType, and Tissue.
Downstream Use
The extracted entities can be further processed for tasks such as the following (see the sketch after this list):
- Building structured biomedical knowledge bases
- Enhancing drug discovery pipelines
- Supporting automated literature annotation
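As one illustration of the first bullet, the sketch below flattens the NER pipeline output into simple records that could be loaded into a knowledge base. The entities_to_records helper and its field names are hypothetical choices for this example, not part of the model's API.

# Illustrative post-processing: flatten NER pipeline output into rows that
# could feed a knowledge base. Field names are assumptions for this example,
# not a schema defined by the model.
from typing import Dict, List

def entities_to_records(doc_id: str, entities: List[Dict]) -> List[Dict]:
    return [
        {
            "doc_id": doc_id,                        # source document identifier
            "entity_type": ent["entity_group"],      # CellLine, CellType, or Tissue
            "mention": ent["word"],                  # surface form in the text
            "start": ent["start"],                   # character offsets in the input
            "end": ent["end"],
            "score": round(float(ent["score"]), 4),  # model confidence
        }
        for ent in entities
    ]

# Example, using the pipeline from "How to Get Started" below:
# records = entities_to_records("PMC1234567", nlp_pipeline(text))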
Out-of-Scope Use
- Non-biomedical text
- General-purpose NER outside the trained entity types
- Use cases requiring languages other than English
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "OTAR3088/bioformer-cellfinder_V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# "simple" aggregation merges subword tokens into whole-entity spans
nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)
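With aggregation_strategy="simple", the pipeline returns one dictionary per detected entity, with entity_group, score, word, start, and end fields. For the sample sentence above, the span "HeLa" would be expected to come back tagged as a CellLine mention, assuming the model's label set maps directly onto the three entity types listed in the Model Description.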
Training Details
Training Data
- Dataset: OTAR3088/CellFinder_ner_split-V1
- Entity types: CellLine, CellType, Tissue
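As a hedged illustration, the dataset can be pulled from the Hub with the datasets library; the split and column names hinted at in the comments are assumptions based on common token-classification layouts, not confirmed from the dataset card.

# Sketch: load the training data from the Hugging Face Hub.
# Split and column names are assumed, not confirmed from the dataset card.
from datasets import load_dataset

dataset = load_dataset("OTAR3088/CellFinder_ner_split-V1")
print(dataset)                    # shows the available splits and columns
# example = dataset["train"][0]   # e.g. {"tokens": [...], "ner_tags": [...]}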
Training Procedure
Training Hyperparameters
Training regime:
- Base model: bioformers/bioformer-16L
- Epochs: 20
- Batch size: 16
- Optimizer: AdamW
- Learning rate: 2e-3 (initial)
- Precision: BF16
- Hardware: 1× NVIDIA A100 GPU
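For illustration only, the sketch below expresses these hyperparameters with the transformers Trainer API; the label set, dataset handling, and output directory are assumptions and do not come from the original training script.

# Sketch only: mirrors the listed hyperparameters with the Trainer API.
# Label set and dataset handling are assumed, not the original script.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-CellLine", "I-CellLine", "B-CellType", "I-CellType",
          "B-Tissue", "I-Tissue"]  # assumed IOB2 label set

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-16L")
model = AutoModelForTokenClassification.from_pretrained(
    "bioformers/bioformer-16L", num_labels=len(labels)
)

training_args = TrainingArguments(
    output_dir="bioformer-cellfinder",   # assumed output path
    num_train_epochs=20,                 # Epochs: 20
    per_device_train_batch_size=16,      # Batch size: 16
    learning_rate=2e-3,                  # initial learning rate
    bf16=True,                           # BF16 precision
    # AdamW is the Trainer's default optimizer
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...,
#                   tokenizer=tokenizer)
# trainer.train()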
Evaluation
Metrics
- Metric: seqeval (entity-level precision, recall, and F1)
- Dataset: CellFinder validation set (OTAR3088/CellFinder_ner_split-V1)
Entity-wise Evaluation:
| Entity | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| CellLine | 0.91 | 0.86 | 0.89 | 81 |
| CellType | 0.85 | 0.86 | 0.86 | 423 |
| Tissue | 0.80 | 0.75 | 0.77 | 123 |
| Micro avg | 0.85 | 0.84 | 0.84 | 627 |
| Macro avg | 0.85 | 0.83 | 0.84 | 627 |
| Weighted avg | 0.85 | 0.84 | 0.84 | 627 |
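For reference, scores in this format can be reproduced with the seqeval package from IOB2 tag sequences; the toy gold and predicted sequences below are invented purely to show the call and are not taken from the validation set.

# Sketch: entity-level precision/recall/F1 with seqeval (pip install seqeval).
# The toy gold/predicted tag sequences are illustrative only.
from seqeval.metrics import classification_report

y_true = [["O", "B-CellLine", "I-CellLine", "O", "B-Tissue"]]
y_pred = [["O", "B-CellLine", "I-CellLine", "O", "O"]]

# Prints per-entity and averaged precision, recall, F1, and support,
# in the same layout as the table above.
print(classification_report(y_true, y_pred))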
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]