Model Details
Model Description
This model is a biomedical Named Entity Recognition (NER) model fine-tuned on three entity types: CellLine, CellType, and Tissue (collectively called CeLLaTE). It was developed to extract and classify named entities from biomedical literature, supporting knowledge curation and drug discovery pipelines. The model was fine-tuned from the bioformers/bioformer-16L base model on a filtered version of the CellFinder dataset and is intended for token-classification tasks in the biomedical domain.
- Developed by: The Europe PMC and ChEMBL teams, as part of the OTAR3088 project initiative
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: Token classification (NER)
- Language(s) (NLP): English
- License: Apache-2.0
- Finetuned from model [optional]: bioformers/bioformer-16L
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
This model is intended for extracting CeLLaTE entities from biomedical literature, supporting knowledge curation, research, and downstream biomedical text mining applications.
Direct Use
The model can be used directly for token-level NER on biomedical text, producing entity labels and spans for CellLine, CellType, and Tissue.
Downstream Use
The extracted entities can be further processed for tasks such as the following (see the sketch after this list):
- Building structured biomedical knowledge bases
- Enhancing drug discovery pipelines
- Supporting automated literature annotation
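As one illustration of the first bullet, the sketch below flattens the NER pipeline output into simple records that could be loaded into a knowledge base. The entities_to_records helper and its field names are hypothetical choices for this example, not part of the model's API.

# Illustrative post-processing: flatten NER pipeline output into rows that
# could feed a knowledge base. Field names are assumptions for this example,
# not a schema defined by the model.
from typing import Dict, List

def entities_to_records(doc_id: str, entities: List[Dict]) -> List[Dict]:
    return [
        {
            "doc_id": doc_id,                        # source document identifier
            "entity_type": ent["entity_group"],      # CellLine, CellType, or Tissue
            "mention": ent["word"],                  # surface form in the text
            "start": ent["start"],                   # character offsets in the input
            "end": ent["end"],
            "score": round(float(ent["score"]), 4),  # model confidence
        }
        for ent in entities
    ]

# Example, using the pipeline from "How to Get Started" below:
# records = entities_to_records("PMC1234567", nlp_pipeline(text))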
Out-of-Scope Use
- Non-biomedical text
- General-purpose NER outside the trained entity types
- Use cases requiring languages other than English
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "OTAR3088/bioformer-cellfinder_V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# "simple" aggregation merges subword tokens into whole-entity spans
nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)
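With aggregation_strategy="simple", the pipeline returns one dictionary per detected entity, with entity_group, score, word, start, and end fields. For the sample sentence above, the span "HeLa" would be expected to come back tagged as a CellLine mention, assuming the model's label set maps directly onto the three entity types listed in the Model Description.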
Training Details
Training Data
- Dataset: OTAR3088/CellFinder_ner_split-V1
- Entity types: CellLine, CellType, Tissue
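As a hedged illustration, the dataset can be pulled from the Hub with the datasets library; the split and column names hinted at in the comments are assumptions based on common token-classification layouts, not confirmed from the dataset card.

# Sketch: load the training data from the Hugging Face Hub.
# Split and column names are assumed, not confirmed from the dataset card.
from datasets import load_dataset

dataset = load_dataset("OTAR3088/CellFinder_ner_split-V1")
print(dataset)                    # shows the available splits and columns
# example = dataset["train"][0]   # e.g. {"tokens": [...], "ner_tags": [...]}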
Training Procedure
Training Hyperparameters
Training regime:
- Base model: bioformers/bioformer-16L
- Epochs: 20
- Batch size: 16
- Optimizer: AdamW
- Learning rate: 2e-3 (initial)
- Precision: BF16
- Hardware: 1× NVIDIA A100 GPU
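For illustration only, the sketch below expresses these hyperparameters with the transformers Trainer API; the label set, dataset handling, and output directory are assumptions and do not come from the original training script.

# Sketch only: mirrors the listed hyperparameters with the Trainer API.
# Label set and dataset handling are assumed, not the original script.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-CellLine", "I-CellLine", "B-CellType", "I-CellType",
          "B-Tissue", "I-Tissue"]  # assumed IOB2 label set

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-16L")
model = AutoModelForTokenClassification.from_pretrained(
    "bioformers/bioformer-16L", num_labels=len(labels)
)

training_args = TrainingArguments(
    output_dir="bioformer-cellfinder",   # assumed output path
    num_train_epochs=20,                 # Epochs: 20
    per_device_train_batch_size=16,      # Batch size: 16
    learning_rate=2e-3,                  # initial learning rate
    bf16=True,                           # BF16 precision
    # AdamW is the Trainer's default optimizer
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...,
#                   tokenizer=tokenizer)
# trainer.train()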
Evaluation
Metrics
- Metric: seqeval (entity-level precision, recall, and F1)
- Dataset: CellFinder validation set (OTAR3088/CellFinder_ner_split-V1)
Entity-wise Evaluation:
| Entity | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| CellLine | 0.91 | 0.86 | 0.89 | 81 |
| CellType | 0.85 | 0.86 | 0.86 | 423 |
| Tissue | 0.80 | 0.75 | 0.77 | 123 |
| Micro avg | 0.85 | 0.84 | 0.84 | 627 |
| Macro avg | 0.85 | 0.83 | 0.84 | 627 |
| Weighted avg | 0.85 | 0.84 | 0.84 | 627 |
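For reference, scores in this format can be reproduced with the seqeval package from IOB2 tag sequences; the toy gold and predicted sequences below are invented purely to show the call and are not taken from the validation set.

# Sketch: entity-level precision/recall/F1 with seqeval (pip install seqeval).
# The toy gold/predicted tag sequences are illustrative only.
from seqeval.metrics import classification_report

y_true = [["O", "B-CellLine", "I-CellLine", "O", "B-Tissue"]]
y_pred = [["O", "B-CellLine", "I-CellLine", "O", "O"]]

# Prints per-entity and averaged precision, recall, F1, and support,
# in the same layout as the table above.
print(classification_report(y_true, y_pred))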
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]