Authorship attribution embedding model
This model is trained for authorship attribution, primarily on short-form messages from Discord.
To use this model, fetch 10 messages from one user and join them into a single string, using \n<sep>\n as the separator. Do the same with 10 messages from another user. Embed both strings and compare the cosine similarity of the two embeddings. On this dataset, cosine similarity is roughly 0.37 when the two message sets share an author, and roughly 0.03 when the authors differ.
For samples showing how to format texts, please see trentmkelly/authorship-attribution-data.
Accuracy
This model achieved 95% accuracy on the validation set during training. In practice, however, its performance can be unpredictable, especially on texts that are strongly dissimilar to the training data. Use it with caution and don't treat the results as entirely reliable; for best results, validate the model's conclusions yourself.
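Because the ~0.37 / ~0.03 figures above are dataset averages, a fixed midpoint cutoff may not transfer well to a new domain. A minimal sketch of how one might calibrate a decision threshold on a small labeled set of similarity scores (the scores and labels below are invented for illustration, not outputs of this model):

```python
# Hypothetical calibration sketch: pick the similarity threshold that best
# separates same-author pairs (label 1) from different-author pairs (label 0).
# These scores and labels are made up for illustration.
scores = [0.41, 0.35, 0.30, 0.22, 0.10, 0.05, 0.02, -0.01]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def accuracy_at(threshold, scores, labels):
    """Accuracy of the rule 'same author if similarity > threshold'."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Sweep candidate thresholds across the observed score range
candidates = [i / 100 for i in range(-5, 45)]
best = max(candidates, key=lambda t: accuracy_at(t, scores, labels))
print(f"Best threshold: {best:.2f}, accuracy: {accuracy_at(best, scores, labels):.2f}")
```

The same sweep can be run on real labeled pairs from your own domain to replace the generic 0.2 cutoff used in the sample code below.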
Usage and sample code
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer('trentmkelly/autotrain-authorship')
# Example: Format messages from two users
user1_messages = [
    "Hey everyone!",
    "How's it going?",
    "I'm working on a new project",
    "It's pretty exciting stuff",
    "Anyone want to collaborate?",
    "Let me know what you think",
    "I'll be around later",
    "Thanks for the help earlier",
    "See you all tomorrow",
    "Have a great day!"
]
user2_messages = [
    "Good morning",
    "What's up guys",
    "Been busy with work lately",
    "Finally got some free time",
    "Looking forward to the weekend",
    "Anyone have plans?",
    "I might go hiking",
    "Weather looks nice",
    "Hope everyone is well",
    "Talk soon"
]
# Combine messages with separator as specified
user1_text = "\n<sep>\n".join(user1_messages)
user2_text = "\n<sep>\n".join(user2_messages)
# Generate embeddings
embeddings = model.encode([user1_text, user2_text])
# Calculate cosine similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Cosine similarity: {similarity:.4f}")
# Interpretation based on model description:
# ~0.37: Same author
# ~0.03: Different authors
if similarity > 0.2:
    print("Likely same author")
else:
    print("Likely different authors")
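To compare more than two users, cosine_similarity can be applied to the whole embedding matrix at once, yielding every pairwise score in one call. The vectors below are made-up 4-dimensional stand-ins for the model's output (the real model produces 384-dimensional MiniLM embeddings), just to show the matrix pattern:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in embeddings for three users; rows 0 and 1 are deliberately
# similar, row 2 points in a different direction.
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.9, 0.2],
])

sim_matrix = cosine_similarity(embeddings)  # shape (3, 3), diagonal is 1.0
names = ["alice", "bob", "carol"]

# Print each unique pair once (upper triangle of the matrix)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {sim_matrix[i, j]:.4f}")
```

With real data, replace the stand-in rows with model.encode() applied to one \n<sep>\n-joined string per user, exactly as in the two-user example above.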
Model tree for trentmkelly/autotrain-authorship
Base model: sentence-transformers/all-MiniLM-L6-v2