# Token Classification
Token classification is the task of classifying each token in a sequence. This can be used
for Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and more. Get your data ready in
the proper format, and then with just a few clicks, your state-of-the-art model will be ready to
be used in production.
## Data Format
The data should be in the following CSV format:
```csv
tokens,tags
"['I', 'love', 'Paris']","['O', 'O', 'B-LOC']"
"['I', 'live', 'in', 'New', 'York']","['O', 'O', 'O', 'B-LOC', 'I-LOC']"
.
.
.
```
or you can also use a JSONL format:
```json
{"tokens": ["I", "love", "Paris"],"tags": ["O", "O", "B-LOC"]}
{"tokens": ["I", "live", "in", "New", "York"],"tags": ["O", "O", "O", "B-LOC", "I-LOC"]}
.
.
.
```
As you can see, the CSV file has two columns: `tokens` and `tags`. Both columns are
stringified lists! The `tokens` column contains the tokens of the sentence, and the `tags`
column contains the corresponding tag for each token.
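If you are preparing the data in Python, here is a minimal sketch of writing both formats with pandas. The sentences and file names are hypothetical placeholders; the key point is that the CSV cells hold stringified lists, while the JSONL file keeps them as real JSON arrays:

```python
import pandas as pd

# Hypothetical example data: each entry is a list of tokens and a matching list of tags
data = {
    "tokens": [
        ["I", "love", "Paris"],
        ["I", "live", "in", "New", "York"],
    ],
    "tags": [
        ["O", "O", "B-LOC"],
        ["O", "O", "O", "B-LOC", "I-LOC"],
    ],
}

df = pd.DataFrame(data)

# CSV: cast each list to its string representation so the file matches the format shown above
df.astype(str).to_csv("train.csv", index=False)

# JSONL: one JSON object per line, with tokens and tags kept as JSON arrays
df.to_json("train.jsonl", orient="records", lines=True)
```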
If your CSV is huge, you can divide it into multiple CSV files and upload them separately.
Please make sure that the column names are the same in all CSV files.
One way to divide the CSV file using pandas is as follows:
```python
import pandas as pd

# Set the chunk size (number of rows per output file)
chunk_size = 1000

# Read the CSV file in chunks and save each chunk to its own file
for i, chunk in enumerate(pd.read_csv('example.csv', chunksize=chunk_size), start=1):
    chunk.to_csv(f'chunk_{i}.csv', index=False)
```
Sample dataset from the Hugging Face Hub: [conll2003](https://huggingface.co/datasets/eriktks/conll2003)
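If you want to start from this sample dataset, the following is a rough sketch of converting it to the CSV format above using the `datasets` library. It assumes the dataset exposes `tokens` and `ner_tags` columns, with the tags stored as integer class ids that map back to label strings:

```python
from datasets import load_dataset
import pandas as pd

# Load the CoNLL-2003 sample dataset from the Hub
ds = load_dataset("eriktks/conll2003", split="train")

# The NER tags are stored as integer class ids; recover the label strings
label_names = ds.features["ner_tags"].feature.names

rows = {
    "tokens": [example["tokens"] for example in ds],
    "tags": [[label_names[t] for t in example["ner_tags"]] for example in ds],
}

# Stringify the lists and save in the CSV format expected by AutoTrain
pd.DataFrame(rows).astype(str).to_csv("conll2003_train.csv", index=False)
```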
## Columns
Your CSV/JSONL dataset must have two columns: `tokens` and `tags`.
## Parameters
[[autodoc]] trainers.token_classification.params.TokenClassificationParams