Model Details

Model Description

This model is a biomedical Named Entity Recognition (NER) model fine-tuned on three entity types: CellLine, CellType, and Tissue (collectively called CeLLaTE). It was developed to extract and classify named entities from biomedical literature, supporting knowledge curation and drug discovery pipelines. The model was fine-tuned from the bioformers/bioformer-16L base model on a filtered version of the CellFinder dataset and is intended for token-classification tasks in the biomedical domain.

  • Developed by: The Europe PMC and ChEMBL teams, for the OTAR3088 project
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: [More Information Needed]
  • Model type: Token classification (NER)
  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model [optional]: bioformers/bioformer-16L

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

This model is intended for extracting CeLLaTE entity groups from biomedical literature, for knowledge curation, research, and downstream biomedical text mining applications.

Direct Use

The model can be used directly for token-level NER on biomedical text, producing entity labels and spans for CellLine, CellType, and Tissue.

Downstream Use

The extracted entities can be further processed for tasks such as:

  • Building structured biomedical knowledge bases
  • Enhancing drug discovery pipelines
  • Supporting automated literature annotation
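
As a minimal illustration of the knowledge-base use case, the pipeline's output (a list of entity dicts) can be bucketed by entity type. The sample output below is hypothetical; the field names follow the transformers token-classification pipeline schema:

```python
# Hypothetical output from the token-classification pipeline for the
# sentence "The HeLa cell line is widely used in cancer research."
entities = [
    {"entity_group": "CellLine", "word": "HeLa", "score": 0.98, "start": 4, "end": 8},
]

def group_by_type(entities):
    """Bucket extracted entities by type, e.g. to seed a knowledge-base record."""
    record = {}
    for ent in entities:
        record.setdefault(ent["entity_group"], []).append(ent["word"])
    return record

print(group_by_type(entities))  # {'CellLine': ['HeLa']}
```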

Out-of-Scope Use

  • Non-biomedical text
  • General-purpose NER outside the trained entity types
  • Use cases requiring languages other than English

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "OTAR3088/bioformer-cellfinder_V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges word-piece tokens into whole-entity spans
nlp_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "The HeLa cell line is widely used in cancer research."
entities = nlp_pipeline(text)
print(entities)  # list of dicts with entity_group, word, score, start, end

Training Details

Training Data

  • Dataset: OTAR3088/CellFinder_ner_split-V1

  • Entity Types: CellLine, CellType, Tissue
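
The three entity types imply a seven-label BIO tagging scheme (O plus B-/I- tags for each type). The sketch below is an assumption about the label names; the authoritative mapping lives in the model config (model.config.id2label) and should be checked there:

```python
# Assumed BIO label scheme for the three CeLLaTE entity types; verify
# against model.config.id2label before relying on these names.
ENTITY_TYPES = ["CellLine", "CellType", "Tissue"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
print(LABELS)  # ['O', 'B-CellLine', 'I-CellLine', 'B-CellType', 'I-CellType', 'B-Tissue', 'I-Tissue']
```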

Training Procedure

Training Hyperparameters

  • Base model: bioformers/bioformer-16L
  • Epochs: 20
  • Batch size: 16
  • Optimizer: AdamW
  • Learning rate: 2e-3 (initial)
  • Precision: BF16
  • GPU: 1× NVIDIA A100
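
The hyperparameters above map naturally onto transformers.TrainingArguments fields. The dict below is a sketch of that mapping, not the project's published training script; the field names are TrainingArguments' and the optimizer string is an assumption:

```python
# Sketch: the reported hyperparameters expressed as TrainingArguments-style
# fields (the actual training script for this model is not published).
training_config = {
    "num_train_epochs": 20,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-3,      # reported starting learning rate
    "optim": "adamw_torch",     # AdamW optimizer
    "bf16": True,               # BF16 mixed precision (1x NVIDIA A100)
}
print(training_config)
```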

Evaluation

Metrics

  • Metric: seqeval (entity-level precision, recall, and F1)

  • Dataset: CellFinder validation set (OTAR3088/CellFinder_ner_split-V1)

  • Entity-wise Evaluation:

Entity        Precision  Recall  F1-score  Support
CellLine      0.91       0.86    0.89      81
CellType      0.85       0.86    0.86      423
Tissue        0.80       0.75    0.77      123
Micro avg     0.85       0.84    0.84      627
Macro avg     0.85       0.83    0.84      627
Weighted avg  0.85       0.84    0.84      627
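
seqeval scores at the entity level: a prediction counts as correct only when both the entity type and the span match exactly. The pure-Python sketch below mimics that behaviour on toy BIO sequences; it is an illustration of the scoring idea, not the seqeval implementation:

```python
def extract_entities(labels):
    """Collect (type, start, end) spans from one BIO-tagged sequence.
    Simplified: a stray I- tag with no preceding B- is ignored."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # trailing "O" flushes the last span
        continues = lab.startswith("I-") and lab[2:] == etype
        if not continues:
            if etype is not None:
                spans.append((etype, start, i))
                etype = None
            if lab.startswith("B-"):
                etype, start = lab[2:], i
    return set(spans)

def entity_f1(true_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over parallel lists of BIO sequences."""
    tp = fp = fn = 0
    for t, p in zip(true_seqs, pred_seqs):
        ts, ps = extract_entities(t), extract_entities(p)
        tp += len(ts & ps)
        fp += len(ps - ts)
        fn += len(ts - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One correct CellLine span, one Tissue span mislabelled as CellType:
true = [["B-CellLine", "I-CellLine", "O", "B-Tissue"]]
pred = [["B-CellLine", "I-CellLine", "O", "B-CellType"]]
print(entity_f1(true, pred))  # 0.5
```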

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Citation

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]
