|
--- |
|
library_name: sentence-transformers |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- autotrain |
|
base_model: sentence-transformers/all-MiniLM-L6-v2 |
|
pipeline_tag: sentence-similarity |
|
datasets: |
|
- trentmkelly/authorship-attribution-data |
|
language: |
|
- en |
|
--- |
|
|
|
# Authorship attribution embedding model |
|
|
|
This model is trained for authorship attribution, primarily on short-form messages from Discord.
|
|
|
To use this model, fetch 10 messages from one user and join them into a single string, separated by instances of `\n<sep>\n`. Do the same for 10 messages from another user. Embed both strings and compare the cosine similarity: on this dataset, same-author pairs score roughly 0.37 and different-author pairs roughly 0.03.
|
|
|
For samples showing how to format texts, please see [trentmkelly/authorship-attribution-data](https://huggingface.co/datasets/trentmkelly/authorship-attribution-data). |
|
|
|
## Accuracy |
|
|
|
This model achieved 95% accuracy on the validation set during training. In practice, however, its performance can be unpredictable, especially on texts that differ strongly from the training data. Use it with caution, treat its output as a signal rather than ground truth, and validate its conclusions yourself for best results.
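One way to validate the model on your own data is to compute similarity scores for a set of labeled same-author and different-author pairs, then pick the decision threshold that best separates them rather than relying on a fixed cutoff. The sketch below assumes you have already collected such scores; `pick_threshold` is a hypothetical helper, not part of this model or library, and the synthetic scores are only stand-ins centered on the 0.37 / 0.03 values reported above.

```python
import numpy as np

def pick_threshold(same_scores, diff_scores):
    """Return the cutoff that maximizes accuracy on labeled similarity scores.

    same_scores: cosine similarities for same-author pairs
    diff_scores: cosine similarities for different-author pairs
    """
    same_scores = np.asarray(same_scores)
    diff_scores = np.asarray(diff_scores)
    n = len(same_scores) + len(diff_scores)
    best_t, best_acc = 0.0, 0.0
    # Try every observed score as a candidate cutoff
    for t in np.sort(np.concatenate([same_scores, diff_scores])):
        correct = np.sum(same_scores >= t) + np.sum(diff_scores < t)
        acc = correct / n
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Synthetic scores standing in for similarities you measured yourself,
# centered on the values reported in this model card
rng = np.random.default_rng(0)
same = rng.normal(0.37, 0.05, 200)
diff = rng.normal(0.03, 0.05, 200)
threshold, accuracy = pick_threshold(same, diff)
print(f"threshold={threshold:.3f}  accuracy={accuracy:.1%}")
```

On real data the two score distributions may overlap far more than in this synthetic example, so report the accuracy you measure alongside the threshold you choose.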
|
|
|
## Usage and sample code |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
# Load the model |
|
model = SentenceTransformer('trentmkelly/autotrain-authorship') |
|
|
|
# Example: Format messages from two users |
|
user1_messages = [ |
|
"Hey everyone!", |
|
"How's it going?", |
|
"I'm working on a new project", |
|
"It's pretty exciting stuff", |
|
"Anyone want to collaborate?", |
|
"Let me know what you think", |
|
"I'll be around later", |
|
"Thanks for the help earlier", |
|
"See you all tomorrow", |
|
"Have a great day!" |
|
] |
|
|
|
user2_messages = [ |
|
"Good morning", |
|
"What's up guys", |
|
"Been busy with work lately", |
|
"Finally got some free time", |
|
"Looking forward to the weekend", |
|
"Anyone have plans?", |
|
"I might go hiking", |
|
"Weather looks nice", |
|
"Hope everyone is well", |
|
"Talk soon" |
|
] |
|
|
|
# Combine messages with separator as specified |
|
user1_text = "\n<sep>\n".join(user1_messages) |
|
user2_text = "\n<sep>\n".join(user2_messages) |
|
|
|
# Generate embeddings |
|
embeddings = model.encode([user1_text, user2_text]) |
|
|
|
# Calculate cosine similarity |
|
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0] |
|
|
|
print(f"Cosine similarity: {similarity:.4f}") |
|
|
|
# Interpretation based on model description: |
|
# ~0.37: Same author |
|
# ~0.03: Different authors |
|
if similarity > 0.2: |
|
print("Likely same author") |
|
else: |
|
print("Likely different authors") |
|
``` |