|
--- |
|
library_name: sentence-transformers |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- autotrain |
|
base_model: sentence-transformers/all-MiniLM-L6-v2 |
|
pipeline_tag: sentence-similarity |
|
datasets: |
|
- trentmkelly/authorship-attribution-data |
|
language: |
|
- en |
|
--- |
|
|
|
# Authorship attribution embedding model |
|
|
|
This model is trained for authorship attribution, primarily on short-form messages from Discord.
|
|
|
To use this model, fetch 10 messages from one user and join them into a single string, separated by instances of `\n<sep>\n`. Do the same for 10 messages from another user. Embed both strings and compare the cosine similarity: on this dataset, same-author pairs score roughly 0.37 and different-author pairs roughly 0.03.
|
|
|
For samples showing how to format texts, please see [trentmkelly/authorship-attribution-data](https://huggingface.co/datasets/trentmkelly/authorship-attribution-data). |
|
|
|
## Accuracy |
|
|
|
This model achieved 95% accuracy on the validation set during training. In practice, however, its performance can be unpredictable, especially on texts that differ strongly from the training data. Use it with caution, treat its output as a signal rather than ground truth, and validate its conclusions yourself for best results.
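One way to validate the model on your own data is to compute similarity scores for a set of labeled same-author and different-author pairs, then pick the decision threshold that best separates them rather than relying on a fixed cutoff. The sketch below assumes you have already collected such scores; `pick_threshold` is a hypothetical helper, not part of this model or library, and the synthetic scores are only stand-ins centered on the 0.37 / 0.03 values reported above.

```python
import numpy as np

def pick_threshold(same_scores, diff_scores):
    """Return the cutoff that maximizes accuracy on labeled similarity scores.

    same_scores: cosine similarities for same-author pairs
    diff_scores: cosine similarities for different-author pairs
    """
    same_scores = np.asarray(same_scores)
    diff_scores = np.asarray(diff_scores)
    n = len(same_scores) + len(diff_scores)
    best_t, best_acc = 0.0, 0.0
    # Try every observed score as a candidate cutoff
    for t in np.sort(np.concatenate([same_scores, diff_scores])):
        correct = np.sum(same_scores >= t) + np.sum(diff_scores < t)
        acc = correct / n
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Synthetic scores standing in for similarities you measured yourself,
# centered on the values reported in this model card
rng = np.random.default_rng(0)
same = rng.normal(0.37, 0.05, 200)
diff = rng.normal(0.03, 0.05, 200)
threshold, accuracy = pick_threshold(same, diff)
print(f"threshold={threshold:.3f}  accuracy={accuracy:.1%}")
```

On real data the two score distributions may overlap far more than in this synthetic example, so report the accuracy you measure alongside the threshold you choose.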
|
|
|
## Usage and sample code |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
# Load the model |
|
model = SentenceTransformer('trentmkelly/autotrain-authorship') |
|
|
|
# Example: Format messages from two users |
|
user1_messages = [ |
|
"Hey everyone!", |
|
"How's it going?", |
|
"I'm working on a new project", |
|
"It's pretty exciting stuff", |
|
"Anyone want to collaborate?", |
|
"Let me know what you think", |
|
"I'll be around later", |
|
"Thanks for the help earlier", |
|
"See you all tomorrow", |
|
"Have a great day!" |
|
] |
|
|
|
user2_messages = [ |
|
"Good morning", |
|
"What's up guys", |
|
"Been busy with work lately", |
|
"Finally got some free time", |
|
"Looking forward to the weekend", |
|
"Anyone have plans?", |
|
"I might go hiking", |
|
"Weather looks nice", |
|
"Hope everyone is well", |
|
"Talk soon" |
|
] |
|
|
|
# Combine messages with separator as specified |
|
user1_text = "\n<sep>\n".join(user1_messages) |
|
user2_text = "\n<sep>\n".join(user2_messages) |
|
|
|
# Generate embeddings |
|
embeddings = model.encode([user1_text, user2_text]) |
|
|
|
# Calculate cosine similarity |
|
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0] |
|
|
|
print(f"Cosine similarity: {similarity:.4f}") |
|
|
|
# Interpretation based on model description: |
|
# ~0.37: Same author |
|
# ~0.03: Different authors |
|
if similarity > 0.2: |
|
print("Likely same author") |
|
else: |
|
print("Likely different authors") |
|
``` |