|
--- |
|
license: bigscience-openrail-m |
|
datasets: |
|
- wikimedia/structured-wikipedia |
|
base_model: |
|
- FacebookAI/roberta-large |
|
pipeline_tag: sentence-similarity |
|
widget: |
|
- text: "Jensen Huang, [ENT] president [ENT] of Nvidia, is a guy who lives in California." |
|
example_title: "Nvidia President Example" |
|
sentences: |
|
- "A president is a leader of an organization, company, community, club, trade union, university or other group." |
|
- "The president of the United States (POTUS) is the head of state and head of government of the United States." |
|
- "A class president, also known as a class representative, is usually the leader of a student body class, and presides over its class cabinet or organization within a student council." |
|
--- |
|
|
|
# RoBERTa Large Entity Linking |
|
|
|
## Model Description |
|
|
|
**roberta-large-entity-linking** is a [RoBERTa large model](https://huggingface.co/FacebookAI/roberta-large) fine-tuned as a [bi-encoder](https://arxiv.org/pdf/1811.08008) for [entity linking](https://en.wikipedia.org/wiki/Entity_linking) tasks. The model separately embeds mentions-in-context and entity descriptions to enable matching between text mentions and knowledge base entities. |
|
|
|
### Primary Use Cases |
|
- **Entity Linking:** Link Wikipedia concepts mentioned in text to their corresponding Wikipedia pages. [Wikimedia](https://huggingface.co/wikimedia)'s [structured-wikipedia dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia) makes this straightforward: embed the entries in the "abstract" column (some cleanup may be needed to filter out irrelevant entries).
|
- **Zero-shot Entity Linking:** Link entities to knowledge bases without task-specific training |
|
- **Knowledge Base Construction:** Build and reference new knowledge bases using the model's strong generalization capabilities |
|
- **Notes:** You may use the model as a top-k retriever and perform the final disambiguation with a more powerful classification model (see the retrieval sketch below)
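
As a rough illustration of the retrieve-then-rerank idea above, the sketch below assumes you have already embedded a collection of entity abstracts with this model (as in the code example further down); `entity_embeddings` and `mention_embedding` are placeholder names, not part of the model's API.

```python
import torch
import torch.nn.functional as F

def top_k_candidates(mention_embedding, entity_embeddings, k=64):
    """Return the indices and cosine similarities of the k most similar entities.

    mention_embedding: (1, hidden) first-token embedding of one [ENT]-marked mention.
    entity_embeddings: (N, hidden) first-token embeddings of N entity abstracts.
    Both are assumed to be produced as in the code example below
    (i.e., last_hidden_state[:, 0, :]).
    """
    mention = F.normalize(mention_embedding, p=2, dim=1)
    entities = F.normalize(entity_embeddings, p=2, dim=1)
    scores = mention @ entities.t()                      # (1, N) cosine similarities
    top = torch.topk(scores, k=min(k, entities.size(0)), dim=1)
    return top.indices[0], top.values[0]
```

The returned candidates can then be passed to a heavier cross-encoder or classifier for the final disambiguation step.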
|
|
|
### Recommended Preprocessing |
|
- Use `[ENT]` tokens to mark an entity mention: `left context [ENT] mention [ENT] right context` |
|
- Consider using an NER model to identify candidate mentions |
|
- For non-standard entities (e.g., "daytime"), you can extract noun phrases with NLTK or spaCy to locate candidate mentions (a minimal marking helper is sketched below)
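
A minimal sketch of the marking step, assuming you already have character offsets for a candidate mention (for instance from an NER model or a noun-phrase chunker); the `mark_mention` helper is ours, not part of any library.

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap the mention at text[start:end] with [ENT] markers:
    left context [ENT] mention [ENT] right context."""
    return f"{text[:start]}[ENT] {text[start:end]} [ENT]{text[end:]}"

sentence = "Jensen Huang, president of Nvidia, is a guy who lives in California."
start = sentence.index("president")
end = start + len("president")

print(mark_mention(sentence, start, end))
# Jensen Huang, [ENT] president [ENT] of Nvidia, is a guy who lives in California.
```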
|
|
|
## Code Example |
|
|
|
```python |
|
import torch |
|
import torch.nn.functional as F |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
if torch.cuda.is_available(): |
|
device = torch.device("cuda") |
|
print(f"Using CUDA: {torch.cuda.get_device_name()}") |
|
elif torch.backends.mps.is_available(): |
|
device = torch.device("mps") |
|
print("Using MPS (Apple Silicon)") |
|
else: |
|
device = torch.device("cpu") |
|
print("Using CPU") |
|
|
|
model_name = "GlassLewis/roberta-large-entity-linking" |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModel.from_pretrained(model_name) |
|
|
|
model.to(device) |
|
|
|
# Verify the special token is there |
|
print('[ENT]' in tokenizer.get_added_vocab()) |
|
|
|
context = "Jensen Huang, [ENT] president [ENT] of Nvidia, is a guy who lives in California." |
|
|
|
definitions = [ |
|
"A president is a leader of an organization, company, community, club, trade union, university or other group.", |
|
"The president of the United States (POTUS) is the head of state and head of government of the United States.", |
|
"A class president, also known as a class representative, is usually the leader of a student body class, and presides over its class cabinet or organization within a student council." |
|
] |
|
|
|
tokenized_definition = tokenizer( |
|
definitions, |
|
truncation=True, |
|
max_length=256, |
|
padding='max_length', |
|
return_tensors='pt' |
|
) |
|
|
|
tokenized_context = tokenizer( |
|
context, |
|
truncation=True, |
|
max_length=256, |
|
padding='max_length', |
|
return_tensors='pt' |
|
) |
|
|
|
# Get embeddings (no gradients needed at inference time)
with torch.no_grad():
    embedded_context = model(
        input_ids=tokenized_context["input_ids"].to(device),
        attention_mask=tokenized_context["attention_mask"].to(device)
    )
    embedded_definition = model(
        input_ids=tokenized_definition["input_ids"].to(device),
        attention_mask=tokenized_definition["attention_mask"].to(device)
    )
|
|
|
# Normalize embeddings for proper cosine similarity |
|
context_norm = F.normalize(embedded_context.last_hidden_state[:, 0, :], p=2, dim=1) |
|
definition_norm = F.normalize(embedded_definition.last_hidden_state[:, 0, :], p=2, dim=1) |
|
|
|
# Calculate cosine similarities |
|
similarities = torch.matmul(context_norm, definition_norm.t()) |
|
|
|
print("Cosine similarities:") |
|
print(similarities) |
|
|
|
print("\nClassification results:") |
|
for i, definition in enumerate(definitions): |
|
sim_value = similarities[0, i].item() |
|
print(f"Definition {i+1}: {definition}") |
|
print(f"Similarity: {sim_value:.4f}\n") |
|
``` |
|
|
|
### Training Data |
|
- **Dataset:** 3 million pairs of Wikipedia anchor-text mentions in context (marked with the special `[ENT]` tokens) and the corresponding Wikipedia page abstracts, derived from [this dataset](https://huggingface.co/datasets/wikimedia/structured-wikipedia)
|
- **Special Token:** `[ENT]` token added to the vocabulary to mark entity mentions
|
- To illustrate the training data format, consider the following example: |
|
|
|
* **Input (Context with Special Token):** |
|
``` |
|
is a commune in the Hérault department in the Occitanie [ENT] region [ENT] in |
|
``` |
|
* **Target (Abstract):** |
|
``` |
|
France is divided into eighteen administrative regions, of which thirteen are located in metropolitan France, while the other five are overseas regions... |
|
``` |
|
|
|
|
|
### Training Details |
|
- **Hardware:** Single 80GB H100 GPU |
|
- **Batch Size:** 80 |
|
- **Learning Rate:** 1e-5 with cosine scheduler |
|
- **Loss Function:** Batch hard triplet loss (margin=0.4); see the sketch after this list
|
- **Max Sequence Length:** 256 tokens (both mentions and descriptions) |
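
For reference, here is a minimal sketch of what a batch hard triplet loss over normalized bi-encoder embeddings can look like; it illustrates the general technique with the stated margin and is not the exact training code.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(mention_emb, entity_emb, margin=0.4):
    """Batch hard triplet loss over cosine similarities.

    mention_emb, entity_emb: (batch, hidden); row i of entity_emb is assumed to be
    the positive entity for row i of mention_emb, every other row an in-batch negative.
    """
    mention_emb = F.normalize(mention_emb, p=2, dim=1)
    entity_emb = F.normalize(entity_emb, p=2, dim=1)

    sim = mention_emb @ entity_emb.t()        # (batch, batch) cosine similarities
    pos_sim = sim.diag()                      # similarity to each mention's paired entity

    # Hardest negative: the most similar non-matching entity in the batch
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg_sim = sim.masked_fill(eye, float("-inf")).max(dim=1).values

    # Hinge: the positive should beat the hardest negative by at least the margin
    return F.relu(margin - pos_sim + hardest_neg_sim).mean()
```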
|
|
|
### Benchmark Results |
|
- **Dataset:** Zero-Shot Entity Linking [(Logeswaran et al., 2019)](https://arxiv.org/abs/1906.07348) test set. |
|
- **Metric:** Recall@64, macro-averaged across the set of test worlds, following the evaluation protocol of [Meta AI's BLINK paper](https://arxiv.org/pdf/1911.03814)
|
- **Score:** 80.29% |
|
- **Comparison:** Meta AI's BLINK achieves 82.06% on the same test set, slightly higher than ours; however, their model was trained on the corresponding training set, whereas ours was not.
|
- **Conclusion:** Our model has strong zero-shot performance |
|
|
|
### Usage Recommendations |
|
- **Similarity Threshold:** If using the model as a classifier, a cosine similarity of 0.7 appears to be a reasonable threshold for positive matches
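
Continuing from the code example above, a simple way to apply the threshold (0.7 is a suggested starting point, not a guaranteed optimum):

```python
THRESHOLD = 0.7  # suggested starting point; tune on your own data

best_idx = similarities.argmax(dim=1).item()
best_sim = similarities[0, best_idx].item()

if best_sim >= THRESHOLD:
    print(f"Linked to definition {best_idx + 1} (similarity {best_sim:.4f})")
else:
    print("No confident match")
```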
|
|
|
### License |
|
This model is licensed under the BigScience OpenRAIL-M License, which promotes responsible and ethical use of AI. |
|
This model is based on Facebook AI's RoBERTa Large model, which is licensed under the MIT License. The original RoBERTa model copyright notice: Copyright (c) Facebook, Inc. and its affiliates. |
|
The training dataset (Wikimedia Structured Wikipedia) is licensed under CC-BY-SA-4.0. |
|
|
|
### MIT License for RoBERTa Model |
|
MIT License |
|
|
|
Copyright (c) Facebook, Inc. and its affiliates. |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
|
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{roberta-large-entity-linking, |
|
  author = {{Glass, Lewis \& Co.}},
|
title = {RoBERTa Large Entity Linking}, |
|
year = {2025}, |
|
publisher = {Hugging Face}, |
|
url = {https://huggingface.co/GlassLewis/roberta-large-entity-linking} |
|
} |
|
``` |