# YourMT3+ Instrument Conditioning Implementation

## Overview

This implementation adds instrument-specific transcription to YourMT3+ to address inconsistent instrument classification during transcription. It targets three problems:

1. **Instrument switching mid-track**: the model switches between instruments (e.g., vocals → violin → guitar) on single-instrument audio
2. **Poor instrument-specific transcription**: incomplete transcription of a specific instrument (e.g., a saxophone solo or flute part)
3. **Lack of user control**: no way to specify which instrument should be transcribed

## Implementation Details

### 1. Core Architecture Changes

#### **model_helper.py** - Enhanced transcription function
- Added `instrument_hint` parameter to `transcribe()` function
- New `create_instrument_task_tokens()` function that leverages YourMT3's existing task conditioning system
- New `filter_instrument_consistency()` function for post-processing filtering

#### **app.py** - Enhanced Gradio Interface 
- Added instrument selection dropdown with options:
  - Auto (detect all instruments) 
  - Vocals/Singing
  - Guitar, Piano, Violin, Bass
  - Drums, Saxophone, Flute
- Updated both "Upload audio" and "From YouTube" tabs
- Maintains backward compatibility with existing functionality

#### **transcribe_cli.py** - New Command Line Interface
- Standalone CLI tool with full instrument conditioning support
- Support for confidence thresholds and filtering options
- Verbose output and error handling
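
The flag names below match the CLI usage examples later in this document; the parser itself is a hedged sketch of how `transcribe_cli.py`'s argument handling could be structured:

```python
import argparse

def build_parser():
    # Sketch of the CLI surface; flags mirror the documented usage examples
    p = argparse.ArgumentParser(
        description="YourMT3+ transcription with instrument conditioning")
    p.add_argument("audio", help="Path to the input audio file")
    p.add_argument("--instrument", default=None,
                   help="Instrument hint, e.g. vocals, drums, guitar, piano")
    p.add_argument("--single-instrument", action="store_true",
                   help="Force all notes onto the dominant instrument")
    p.add_argument("--confidence-threshold", type=float, default=0.7,
                   help="Dominance ratio required before filtering kicks in")
    p.add_argument("--output", default=None, help="Output MIDI path")
    p.add_argument("--verbose", action="store_true",
                   help="Print per-stage progress and filtering statistics")
    return p
```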

### 2. How It Works

#### **Task Token Conditioning**
The implementation leverages YourMT3's existing task conditioning system:

```python
# Maps instrument hints to task events
instrument_mapping = {
    'vocals': 'transcribe_singing',
    'drums': 'transcribe_drum', 
    'guitar': 'transcribe_all'  # falls back to general transcription
}
```
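
Based on the mapping above and the alias list in "Supported Instruments", `create_instrument_task_tokens()` might look like the following. This is a hedged sketch: the real function must also encode the chosen task event into token IDs via the model's task vocabulary.

```python
def create_instrument_task_tokens(instrument_hint):
    # Map a user-supplied hint to a YourMT3 task event name (sketch only;
    # token-ID encoding via the task vocabulary is omitted here)
    instrument_mapping = {
        'vocals': 'transcribe_singing',
        'singing': 'transcribe_singing',
        'voice': 'transcribe_singing',
        'drums': 'transcribe_drum',
        'drum': 'transcribe_drum',
        'percussion': 'transcribe_drum',
    }
    # Instruments without a dedicated task token fall back to general transcription
    return instrument_mapping.get((instrument_hint or '').lower(), 'transcribe_all')
```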

#### **Post-Processing Consistency Filtering**
When an instrument hint is provided, the system:

1. Analyzes the transcribed notes to identify the dominant instrument
2. Filters out notes from other instruments if confidence is above threshold
3. Converts remaining notes to the target instrument program

A runnable version of this filter (assuming each predicted note exposes a `program` attribute):

```python
from collections import Counter

def filter_instrument_consistency(pred_notes, confidence_threshold=0.7):
    if not pred_notes:
        return pred_notes
    # Count instrument (program) occurrences across all predicted notes
    primary, count = Counter(n.program for n in pred_notes).most_common(1)[0]
    # If one instrument dominates strongly enough, keep only its notes
    # (the surviving notes already carry the primary program)
    if count / len(pred_notes) >= confidence_threshold:
        return [n for n in pred_notes if n.program == primary]
    return pred_notes
```

## Usage Examples

### 1. Gradio Web Interface

1. **Upload audio tab**: 
   - Upload your audio file
   - Select target instrument from dropdown
   - Click "Transcribe"

2. **YouTube tab**:
   - Paste YouTube URL
   - Select target instrument  
   - Click "Get Audio from YouTube" then "Transcribe"

### 2. Command Line Interface

```bash
# Basic transcription (all instruments)
python transcribe_cli.py audio.wav

# Transcribe vocals only
python transcribe_cli.py audio.wav --instrument vocals

# Force single instrument with high confidence threshold
python transcribe_cli.py audio.wav --single-instrument --confidence-threshold 0.9

# Transcribe guitar with verbose output
python transcribe_cli.py guitar_solo.wav --instrument guitar --verbose

# Custom output path
python transcribe_cli.py audio.wav --instrument piano --output my_piano.mid
```

### 3. Python API Usage

```python
from model_helper import load_model_checkpoint, transcribe

# Load model
model = load_model_checkpoint(args=model_args, device="cuda")

# Prepare audio info
audio_info = {
    "filepath": "audio.wav",
    "track_name": "my_audio"
}

# Transcribe with instrument hint
midi_file = transcribe(model, audio_info, instrument_hint="vocals")
```

## Supported Instruments

- **vocals**, **singing**, **voice** → Uses existing 'transcribe_singing' task
- **drums**, **drum**, **percussion** → Uses existing 'transcribe_drum' task  
- **guitar**, **piano**, **violin**, **bass**, **saxophone**, **flute** → Uses enhanced filtering with 'transcribe_all' task

## Technical Benefits

### 1. **Leverages Existing Architecture**
- Uses YourMT3's built-in task conditioning system
- No model retraining required
- Backward compatible with existing code

### 2. **Two-Stage Approach**
- **Stage 1**: Task token conditioning biases the model toward specific instruments
- **Stage 2**: Post-processing filtering ensures consistency
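
The two stages compose roughly as follows. This is a hedged sketch: `decode_fn` and `notes_filter` stand in for the real model decoding step and the consistency filter.

```python
def transcribe_with_hint(decode_fn, notes_filter, audio, instrument_hint=None):
    # Stage 1: pick a task event that biases decoding toward the hinted instrument
    task = {'vocals': 'transcribe_singing',
            'drums': 'transcribe_drum'}.get(instrument_hint, 'transcribe_all')
    notes = decode_fn(audio, task)
    # Stage 2: enforce instrument consistency only when a hint was given
    return notes_filter(notes) if instrument_hint else notes
```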

### 3. **Configurable Confidence**
- Adjustable confidence thresholds for filtering
- Balances between accuracy and completeness

## Limitations & Future Improvements

### Current Limitations
1. **Limited task tokens**: Only vocals and drums have dedicated task tokens
2. **Post-processing dependency**: Other instruments rely on filtering 
3. **No instrument-specific training**: Uses general model weights

### Future Improvements
1. **Extended task vocabulary**: Add dedicated task tokens for more instruments
2. **Instrument-specific models**: Train specialized decoders for each instrument
3. **Confidence scoring**: Add per-note confidence scores for better filtering
4. **Pitch-based filtering**: Use pitch ranges typical for each instrument
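
As an illustration of the pitch-based idea, a range check might look like this (the MIDI ranges are rough approximations of typical registers, not authoritative values):

```python
# Approximate playable MIDI note ranges (illustrative only)
PITCH_RANGES = {
    'flute': (60, 96),    # roughly C4 to C7
    'violin': (55, 103),  # roughly G3 to G7
    'bass': (28, 55),     # roughly E1 to G3
}

def filter_by_pitch_range(notes, instrument):
    lo, hi = PITCH_RANGES[instrument]
    # Discard notes that fall outside the instrument's plausible register
    return [n for n in notes if lo <= n['pitch'] <= hi]
```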

## Installation & Setup

1. **Install dependencies** (from existing YourMT3 requirements):
   ```bash
   pip install torch torchaudio transformers gradio
   ```

2. **Model weights**: Ensure YourMT3 model weights are in `amt/logs/`

3. **Run web interface**:
   ```bash
   python app.py
   ```

4. **Run CLI**:
   ```bash
   python transcribe_cli.py --help
   ```

## Testing

Run the test suite:
```bash
python test_instrument_conditioning.py
```

This will verify:
- Code syntax and imports
- Function availability  
- Basic functionality (when dependencies are available)

## Conclusion

This implementation provides a practical solution to YourMT3+'s instrument confusion problem by:

1. **Adding user control** over instrument selection
2. **Leveraging existing architecture** for minimal changes
3. **Providing multiple interfaces** (web, CLI, API)
4. **Maintaining backward compatibility**

The approach addresses the originally reported issue ("*so many times i upload vocals and it transcribes half right, as vocals, then switches to violin although the whole track is just vocals*") by giving users direct control over the transcription focus.