davanstrien HF Staff committed on
Commit 261046a · 1 Parent(s): a2c1456

Update README with new repositories (synthetic-data, deduplication, openai-oss)


- Add 3 new repositories to the script collections table
- Update featured scripts section with more diverse examples
- Include CPU-friendly deduplication example
- Add synthetic data generation with CoT reasoning example

Files changed (1)
  1. README.md +20 -6
README.md CHANGED
@@ -40,6 +40,9 @@ hf jobs uv run --flavor l4x1 \
  | [classification](https://huggingface.co/datasets/uv-scripts/classification) | Text classification with guaranteed valid outputs | ✅ |
  | [dataset-creation](https://huggingface.co/datasets/uv-scripts/dataset-creation) | Create datasets from PDFs and files | ❌ |
  | [vllm](https://huggingface.co/datasets/uv-scripts/vllm) | High-performance inference with vLLM | ✅ |

  ## 🎯 Why UV Scripts?

@@ -64,16 +67,27 @@ hf jobs uv run --flavor l4x1 \
  your-images extracted-text
  ```

- ### Classify with Guaranteed Valid Outputs

- Text classification that always returns valid labels:

  ```bash
- # Uses vLLM's structured generation - no invalid outputs!
  hf jobs uv run --flavor l4x1 \
- https://huggingface.co/datasets/uv-scripts/classification/raw/main/classify-dataset.py \
- --input-dataset imdb --column text \
- --labels "positive,negative" --output-dataset imdb-classified
  ```

  ## 🚀 Getting Started with HF Jobs
 
  | [classification](https://huggingface.co/datasets/uv-scripts/classification) | Text classification with guaranteed valid outputs | ✅ |
  | [dataset-creation](https://huggingface.co/datasets/uv-scripts/dataset-creation) | Create datasets from PDFs and files | ❌ |
  | [vllm](https://huggingface.co/datasets/uv-scripts/vllm) | High-performance inference with vLLM | ✅ |
+ | [synthetic-data](https://huggingface.co/datasets/uv-scripts/synthetic-data) | Generate high-quality synthetic data with CoT reasoning | ✅ |
+ | [deduplication](https://huggingface.co/datasets/uv-scripts/deduplication) | Remove duplicates using semantic similarity | ❌ |
+ | [openai-oss](https://huggingface.co/datasets/uv-scripts/openai-oss) | Generate responses with visible reasoning traces | ✅ |

  ## 🎯 Why UV Scripts?

 
  your-images extracted-text
  ```

+ ### Deduplicate Datasets (CPU-Friendly!)

+ Remove duplicates using semantic similarity - no GPU needed:

  ```bash
+ # Fast semantic deduplication on CPU
+ uv run https://huggingface.co/datasets/uv-scripts/deduplication/raw/main/semantic-dedupe.py \
+ your-dataset text your-dataset-clean \
+ --method duplicates --threshold 0.9
+ ```
+
+ ### Generate Synthetic Training Data
+
+ Create high-quality synthetic data with chain-of-thought reasoning:
+
+ ```bash
+ # Generate synthetic math problems with reasoning
  hf jobs uv run --flavor l4x1 \
+ https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \
+ --seed-dataset math-examples --output-dataset synthetic-math \
+ --task-type reasoning --num-samples 1000
  ```

  ## 🚀 Getting Started with HF Jobs
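
Note on the deduplication example added in this commit: the command passes `--threshold 0.9`, but the diff does not show how `semantic-dedupe.py` applies it. The following is only a minimal sketch of one common reading, assuming the threshold is a cosine-similarity cutoff over text embeddings and that a row is dropped when it is at least that similar to a row already kept. The helper names (`cosine_similarity`, `dedupe_by_similarity`) and the toy vectors are hypothetical and are not taken from the script.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe_by_similarity(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Greedy pass (hypothetical helper): keep a row only if its similarity
    to every previously kept row is below the threshold."""
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine_similarity(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hand-written toy vectors standing in for real text embeddings (assumption).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of row 2
    [0.0, 0.0, 1.0],
]

kept_rows = dedupe_by_similarity(toy_embeddings, threshold=0.9)
print(kept_rows)  # [0, 2, 4]
```

Under this reading, raising the threshold keeps more rows (only very close matches are dropped) and lowering it removes more. The greedy pass compares each row against every kept row, so it is quadratic in the worst case; a real script may well use an approximate or batched method instead.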