Upload 9 files
- .gitattributes +1 -35
- .gitignore +10 -0
- README.md +157 -13
- app.py +53 -7
- cog.yaml +37 -0
- predict.py +42 -0
- requirements.lock +114 -0
- requirements.txt +9 -0
- runme.sh +42 -0
.gitattributes
CHANGED
@@ -1,35 +1 @@
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+soundfont/MuseScore_General.sf3 filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,10 @@
+*
+!*/
+!*.*
+*.hdf5
+*.pyc
+__pycache__
+results/
+video_frames/
+model.pth
+
README.md
CHANGED
@@ -1,13 +1,157 @@
+
+# Piano transcription
+
+Piano transcription is the task of transcribing piano recordings into MIDI files. This repo is the PyTorch implementation of our proposed high-resolution piano transcription system [1].
+
+<a href="https://replicate.com/replicate/piano-transcription"><img src="https://replicate.com/replicate/piano-transcription/badge"></a>
+
+## Demos
+Here is a demo of our piano transcription system: https://www.youtube.com/watch?v=5U-WL0QvKCg
+
+[Demo and Docker image on Replicate](https://replicate.ai/bytedance/piano-transcription)
+
+## Environments
+This codebase is developed with Python 3.7 and PyTorch 1.4.0 (it should work with other versions, but has not been fully tested).
+
+Install dependencies:
+```
+pip install -r requirements.txt
+```
+
+## Piano transcription using pretrained model
+The easiest way to transcribe a new piano recording is to install the piano_transcription_inference package (https://github.com/qiuqiangkong/piano_transcription_inference) with pip as follows:
+
+```
+pip install piano_transcription_inference
+```
+
+Then, execute the following commands to transcribe this [audio](resources/cut_liszt.mp3).
+
+```
+from piano_transcription_inference import PianoTranscription, sample_rate, load_audio
+
+# Load audio
+(audio, _) = load_audio('resources/cut_liszt.mp3', sr=sample_rate, mono=True)
+
+# Transcriptor
+transcriptor = PianoTranscription(device='cuda')  # 'cuda' | 'cpu'
+
+# Transcribe and write out to MIDI file
+transcribed_dict = transcriptor.transcribe(audio, 'cut_liszt.mid')
+```
+
+## Train a piano transcription system from scratch
+
+This section provides instructions for users who would like to train a piano transcription system from scratch.
+
+### 0. Prepare data
+We use the MAESTRO dataset V2.0.0 [1] to train the piano transcription system. MAESTRO consists of over 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The MAESTRO dataset can be downloaded from https://magenta.tensorflow.org/datasets/maestro.
+
+Statistics of MAESTRO V2.0.0 [[ref]](https://magenta.tensorflow.org/datasets/maestro#v200):
+
+| Split      | Performances | Duration (hours) | Size (GB) | Notes (millions) |
+|------------|--------------|------------------|-----------|------------------|
+| Train      | 967          | 161.3            | 97.7      | 5.73             |
+| Validation | 137          | 19.4             | 11.8      | 0.64             |
+| Test       | 178          | 20.5             | 12.4      | 0.76             |
+| **Total**  | **1282**     | **201.2**        | **121.8** | **7.13**         |
+
+After downloading, the dataset looks like:
+
+<pre>
+dataset_root
+├── 2004
+│    └── (264 files)
+├── 2006
+│    └── (230 files)
+├── 2008
+│    └── (294 files)
+├── 2009
+│    └── (250 files)
+├── 2011
+│    └── (326 files)
+├── 2013
+│    └── (254 files)
+├── 2014
+│    └── (210 files)
+├── 2015
+│    └── (258 files)
+├── 2017
+│    └── (280 files)
+├── 2018
+│    └── (198 files)
+├── LICENSE
+├── maestro-v2.0.0.csv
+├── maestro-v2.0.0.json
+└── README
+</pre>
+
+### 1. Train
+
+Execute the commands line by line in runme.sh, including:
+
+1) Configure the dataset path and your workspace.
+2) Pack audio recordings to hdf5 files.
+3) Train the piano note transcription system.
+4) Train the piano pedal transcription system.
+5) Combine the piano note and piano pedal transcription systems.
+6) Evaluate.
+
+All training steps are described in runme.sh. It is worth looking into runme.sh to see how the piano transcription system is trained. In total, 29 GB of GPU memory is required with a batch size of 12. Users may consider reducing the batch size or using multiple GPU cards to train this system.
+
+## Results
+The training uses a single Tesla-V100-PCIE-32GB card. The system is trained for 300k iterations, which takes about one week. The training log looks like:
+
+<pre>
+Namespace(augmentation='none', batch_size=12, cuda=True, early_stop=300000, filename='main', learning_rate=0.0005, loss_type='regress_onset_offset_frame_velocity_bce', max_note_shift=0, mini_data=False, mode='train', model_type='Regress_onset_offset_frame_velocity_CRNN', reduce_iteration=10000, resume_iteration=0, workspace='.../workspaces/piano_transcription')
+Using GPU.
+train segments: 571589
+Evaluate train segments: 571589
+Evaluate validation segments: 68646
+Evaluate test segments: 71959
+------------------------------------
+Iteration: 0
+Train statistics: {'frame_ap': 0.0613, 'reg_onset_mae': 0.514, 'reg_offset_mae': 0.482, 'velocity_mae': 0.1362}
+Validation statistics: {'frame_ap': 0.0605, 'reg_onset_mae': 0.5143, 'reg_offset_mae': 0.4819, 'velocity_mae': 0.133}
+Test statistics: {'frame_ap': 0.0601, 'reg_onset_mae': 0.5139, 'reg_offset_mae': 0.4821, 'velocity_mae': 0.1283}
+Dump statistics to .../workspaces/piano_transcription/statistics/main/Regress_onset_offset_frame_velocity_CRNN/loss_type=regress_onset_offset_frame_velocity_bce/augmentation=none/batch_size=12/statistics.pkl
+Dump statistics to .../workspaces/piano_transcription/statistics/main/Regress_onset_offset_frame_velocity_CRNN/loss_type=regress_onset_offset_frame_velocity_bce/augmentation=none/batch_size=12/statistics_2020-04-28_00-22-33.pickle
+Train time: 5.498 s, validate time: 92.863 s
+Model saved to .../workspaces/piano_transcription/checkpoints/main/Regress_onset_offset_frame_velocity_CRNN/loss_type=regress_onset_offset_frame_velocity_bce/augmentation=none/batch_size=12/0_iterations.pth
+------------------------------------
+...
+------------------------------------
+Iteration: 300000
+Train statistics: {'frame_ap': 0.9439, 'reg_onset_mae': 0.091, 'reg_offset_mae': 0.127, 'velocity_mae': 0.0241}
+Validation statistics: {'frame_ap': 0.9245, 'reg_onset_mae': 0.0985, 'reg_offset_mae': 0.1327, 'velocity_mae': 0.0265}
+Test statistics: {'frame_ap': 0.9285, 'reg_onset_mae': 0.097, 'reg_offset_mae': 0.1353, 'velocity_mae': 0.027}
+Dump statistics to .../workspaces/piano_transcription/statistics/main/Regress_onset_offset_frame_velocity_CRNN/loss_type=regress_onset_offset_frame_velocity_bce/augmentation=none/batch_size=12/statistics.pkl
+Dump statistics to .../workspaces/piano_transcription/statistics/main/Regress_onset_offset_frame_velocity_CRNN/loss_type=regress_onset_offset_frame_velocity_bce/augmentation=none/batch_size=12/statistics_2020-04-28_00-22-33.pickle
+Train time: 8953.815 s, validate time: 93.683 s
+Model saved to .../workspaces/piano_transcription/checkpoints/main/Regress_onset_offset_frame_velocity_CRNN/loss_type=regress_onset_offset_frame_velocity_bce/augmentation=none/batch_size=12/300000_iterations.pth
+</pre>
+
+## Visualization of piano transcription
+
+**Demo 1.** Lang Lang: Franz Liszt - Love Dream (Liebestraum) [[audio]](resources/cut_liszt.mp3) [[transcribed_midi]](resources/cut_liszt.mid)
+
+<img src="resources/cut_liszt.png">
+
+**Demo 2.** Andras Schiff: J.S.Bach - French Suites [[audio]](resources/cut_bach.mp3) [[transcribed_midi]](resources/cut_bach.mid)
+
+<img src="resources/cut_bach.png">
+
+## FAQs
+If you run into an out-of-GPU-memory error, try reducing the batch size.
+
+## LICENSE
+Apache 2.0
+
+## Applications
+We have built a large-scale classical piano MIDI dataset using our piano transcription system. See https://github.com/bytedance/GiantMIDI-Piano for details.
+
+## Contact
+Qiuqiang Kong, kongqiuqiang@bytedance.com
+
+## Cite
+[1] Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, and Yuxuan Wang. "High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times." arXiv preprint arXiv:2010.01815 (2020). [[pdf]](https://arxiv.org/pdf/2010.01815.pdf)
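For context, here is a minimal sketch (not part of this commit) of how the pretrained-model usage above combines with the MIDI-to-audio rendering that the new app.py below performs. The output filenames are illustrative, and the soundfont path assumes the LFS file tracked in .gitattributes has been pulled:

```python
from piano_transcription_inference import PianoTranscription, sample_rate, load_audio
from midi2audio import FluidSynth

# Load the example recording bundled with the repo
(audio, _) = load_audio('resources/cut_liszt.mp3', sr=sample_rate, mono=True)

# Transcribe on CPU ('cuda' is faster if a GPU is available)
transcriptor = PianoTranscription(device='cpu')
transcriptor.transcribe(audio, 'cut_liszt.mid')

# Render the transcribed MIDI to audio with the bundled soundfont
fs = FluidSynth('soundfont/MuseScore_General.sf3')
fs.midi_to_audio('cut_liszt.mid', 'cut_liszt.flac')
```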
app.py
CHANGED
@@ -1,7 +1,53 @@
-import gradio as gr
+import gradio as gr
+import librosa
+import os
+from pathlib import Path
+from pytorch.inference import PianoTranscription
+from utils import config
+# from synthviz import create_video  # TODO enable video rendering
+from midi2audio import FluidSynth
+
+RESULTS_DIR = 'results'
+
+# Initialize the transcriptor
+transcriptor = PianoTranscription("Note_pedal")
+
+# Soundfont
+soundfont_path = "soundfont/MuseScore_General.sf3"
+fs = FluidSynth(soundfont_path)
+
+def transcribe_and_visualize(audio_file):
+    # Generate a unique filename for the MIDI and video outputs
+    # base_name = os.path.splitext(os.path.basename(audio_file.name))[0]
+    base_name = os.path.splitext(os.path.basename(audio_file))[0]
+    midi_filename = f"{RESULTS_DIR}/{base_name}_transcription.mid"
+    video_filename = f"{RESULTS_DIR}/{base_name}_output.mp4"
+    flac_filename = f"{RESULTS_DIR}/{base_name}_transcription.flac"
+
+    # Load and transcribe audio
+    audio, _ = librosa.core.load(audio_file, sr=config.sample_rate)
+    transcriptor.transcribe(audio, midi_filename)
+
+    # Create visualization video  # TODO enable video rendering
+    # create_video(input_midi=midi_filename, video_filename=video_filename)
+
+    # return video_filename
+
+    # Convert MIDI to FLAC
+    fs.midi_to_audio(midi_filename, flac_filename)
+
+    # Return midi
+    return flac_filename, midi_filename
+
+# Create Gradio interface
+iface = gr.Interface(
+    fn=transcribe_and_visualize,
+    inputs=gr.Audio(type="filepath", label="Upload Piano Audio"),
+    # outputs=gr.Video(label="Transcription Visualization"),
+    outputs=[gr.Audio(label="MIDI transcription"), gr.File(label="MIDI file")],
+    title="MOZART - AI Piano Transcriber",
+    description="Upload a piano audio file to transcribe it and visualize the result.",
+)
+
+# Launch the interface
+iface.launch()
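As an aside, a minimal sketch (not part of this commit) of calling the interface above from another process with gradio_client, which is pinned in requirements.lock; the URL and api_name are assumed Gradio defaults and may differ for a deployed Space:

```python
from gradio_client import Client, handle_file

# Assumes app.py is already running locally on the default Gradio port
client = Client("http://127.0.0.1:7860/")
flac_path, midi_path = client.predict(
    handle_file("resources/cut_liszt.mp3"),  # the gr.Audio filepath input
    api_name="/predict",
)
print(flac_path, midi_path)
```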
cog.yaml
ADDED
@@ -0,0 +1,37 @@
+# Configuration for Cog ⚙️
+# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
+
+build:
+  gpu: true
+
+  system_packages:
+    - "libgl1-mesa-glx"
+    - "libglib2.0-0"
+    - "libsndfile1-dev"
+    - "ffmpeg"
+    - "timidity"
+
+  python_version: "3.8"
+
+  python_packages:
+    - "torch==1.8.0"
+    - "torchvision==0.9.0"
+    - "piano_transcription_inference==0.0.5"
+    - "librosa==0.6.0"
+    - "h5py==2.10.0"
+    - "pandas==1.1.2"
+    - "librosa==0.6.0"
+    - "numba==0.48"
+    - "mido==1.2.9"
+    - "mir_eval==0.5"
+    - "matplotlib==3.0.3"
+    - "torchlibrosa==0.0.4"
+    - "sox==1.4.0"
+    - "tqdm==4.62.3"
+    - "pretty_midi==0.2.9"
+    - "synthviz==0.0.2"
+
+  run:
+    - "ffmpeg -version"
+
+predict: "predict.py:Predictor"
predict.py
ADDED
@@ -0,0 +1,42 @@
+# Prediction interface for Cog ⚙️
+# Reference: https://github.com/replicate/cog/blob/main/docs/python.md
+
+import os
+from pathlib import Path
+
+import cog
+import librosa
+
+# model repo: https://github.com/bytedance/piano_transcription
+# package repo: https://github.com/qiuqiangkong/piano_transcription_inference
+from piano_transcription_inference import PianoTranscription, sample_rate
+from synthviz import create_video
+
+# adapted from example: https://github.com/minzwon/sota-music-tagging-models/blob/master/predict.py
+
+
+class Predictor(cog.Predictor):
+    transcriptor: PianoTranscription
+
+    def setup(self):
+        self.transcriptor = PianoTranscription(
+            device="cuda", checkpoint_path="./model.pth"
+        )
+
+    @cog.input("audio_input", type=Path, help="Input audio file")
+    def predict(self, audio_input):
+        midi_intermediate_filename = "transcription.mid"
+        video_filename = os.path.join(Path.cwd(), "output.mp4")
+        audio, _ = librosa.core.load(str(audio_input), sr=sample_rate)
+        # Transcribe audio
+        self.transcriptor.transcribe(audio, midi_intermediate_filename)
+
+        # 'Visualization' output option
+        create_video(
+            input_midi=midi_intermediate_filename, video_filename=video_filename
+        )
+        print(
+            f"Created video of size {os.path.getsize(video_filename)} bytes at path {video_filename}"
+        )
+        # Return path to video
+        return Path(video_filename)
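For reference, a minimal sketch (not part of this commit) of the same pipeline that predict.py wraps, runnable without Cog; it assumes the pretrained checkpoint has been saved as ./model.pth (runme.sh downloads it under a different name) and that the ffmpeg/timidity packages listed in cog.yaml are installed for synthviz:

```python
import librosa
from piano_transcription_inference import PianoTranscription, sample_rate
from synthviz import create_video

# Transcribe the example recording with the combined note+pedal checkpoint
audio, _ = librosa.core.load("resources/cut_liszt.mp3", sr=sample_rate)
transcriptor = PianoTranscription(device="cuda", checkpoint_path="./model.pth")
transcriptor.transcribe(audio, "transcription.mid")

# Render a piano-roll visualization video of the transcription
create_video(input_midi="transcription.mid", video_filename="output.mp4")
```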
requirements.lock
ADDED
@@ -0,0 +1,114 @@
+h5py==3.11.0
+pandas==2.2.2
+librosa==0.10.2.post1
+numba==0.60.0
+mido==1.3.2
+mir-eval==0.7
+matplotlib==3.9.2
+torchlibrosa==0.1.0
+sox==1.5.0
+## The following requirements were added by pip freeze:
+aiofiles==23.2.1
+annotated-types==0.7.0
+anyio==4.4.0
+audioread==3.0.1
+certifi==2024.8.30
+cffi==1.17.1
+chardet==5.2.0
+charset-normalizer==3.3.2
+click==8.1.7
+contourpy==1.3.0
+cycler==0.12.1
+decorator==5.1.1
+exceptiongroup==1.2.2
+fastapi==0.114.2
+ffmpy==0.4.0
+filelock==3.16.0
+fluidsynth==0.2
+fonttools==4.53.1
+fsspec==2024.9.0
+future==1.0.0
+gradio==4.44.0
+gradio_client==1.3.0
+h11==0.14.0
+httpcore==1.0.5
+httpx==0.27.2
+huggingface-hub==0.24.7
+idna==3.9
+importlib_resources==6.4.5
+Jinja2==3.1.4
+joblib==1.4.2
+jsonpickle==3.3.0
+kiwisolver==1.4.7
+lazy_loader==0.4
+llvmlite==0.43.0
+markdown-it-py==3.0.0
+MarkupSafe==2.1.5
+mdurl==0.1.2
+midi2audio==0.1.1
+more-itertools==10.5.0
+mpmath==1.3.0
+msgpack==1.1.0
+music21==9.1.0
+networkx==3.3
+numpy==2.0.2
+nvidia-cublas-cu11==11.11.3.6
+nvidia-cuda-cupti-cu11==11.8.87
+nvidia-cuda-nvrtc-cu11==11.8.89
+nvidia-cuda-runtime-cu11==11.8.89
+nvidia-cudnn-cu11==9.1.0.70
+nvidia-cufft-cu11==10.9.0.58
+nvidia-curand-cu11==10.3.0.86
+nvidia-cusolver-cu11==11.4.1.48
+nvidia-cusparse-cu11==11.7.5.86
+nvidia-nccl-cu11==2.20.5
+nvidia-nvtx-cu11==11.8.86
+orjson==3.10.7
+packaging==23.2
+pillow==10.4.0
+platformdirs==4.3.3
+pooch==1.8.2
+pretty-midi==0.2.10
+pycparser==2.22
+pydantic==2.9.1
+pydantic_core==2.23.3
+pydub==0.25.1
+Pygments==2.18.0
+pyparsing==3.1.4
+python-dateutil==2.9.0.post0
+python-multipart==0.0.9
+pytz==2024.2
+PyYAML==6.0.2
+regex==2024.9.11
+requests==2.32.3
+resampy==0.4.3
+rich==13.8.1
+ruff==0.6.5
+safetensors==0.4.5
+scikit-learn==1.5.2
+scipy==1.14.1
+semantic-version==2.10.0
+shellingham==1.5.4
+six==1.16.0
+sniffio==1.3.1
+soundfile==0.12.1
+soxr==0.5.0.post1
+starlette==0.38.5
+sympy==1.13.2
+synthviz==0.0.2
+threadpoolctl==3.5.0
+tokenizers==0.19.1
+tomlkit==0.12.0
+torch==2.4.1+cu118
+torchaudio==2.4.1+cu118
+torchvision==0.19.1+cu118
+tqdm==4.66.5
+transformers==4.44.2
+triton==3.0.0
+typer==0.12.5
+typing_extensions==4.12.2
+tzdata==2024.1
+urllib3==2.2.3
+uvicorn==0.30.6
+webcolors==24.8.0
+websockets==12.0
requirements.txt
ADDED
@@ -0,0 +1,9 @@
+h5py==2.10.0
+pandas==1.1.2
+librosa==0.6.0
+numba==0.48
+mido==1.2.9
+mir_eval==0.5
+matplotlib==3.0.3
+torchlibrosa==0.0.4
+sox==1.4.0
runme.sh
ADDED
@@ -0,0 +1,42 @@
+#!/bin/bash
+
+# ============ Inference using pretrained model ============
+# Download checkpoint and inference
+CHECKPOINT_PATH="CRNN_note_F1=0.9677_pedal_F1=0.9186.pth"
+# wget -O $CHECKPOINT_PATH "https://zenodo.org/record/4034264/files/CRNN_note_F1%3D0.9677_pedal_F1%3D0.9186.pth?download=1"
+MODEL_TYPE="Note_pedal"
+# ORIGINAL
+# python3 pytorch/inference.py --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path='resources/cut_liszt.mp3' --cuda
+
+python3 pytorch/inference.py --audio_path='resources/cut_liszt.mp3' --cuda
+
+# # ============ Train piano transcription system from scratch ============
+# # MAESTRO dataset directory. Users need to download MAESTRO dataset into this folder.
+# DATASET_DIR="./datasets/maestro/dataset_root"
+
+# # Modify to your workspace
+# WORKSPACE="./workspaces/piano_transcription"
+
+# # Pack audio files to hdf5 format for training
+# python3 utils/features.py pack_maestro_dataset_to_hdf5 --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE
+
+# # --- 1. Train note transcription system ---
+# python3 pytorch/main.py train --workspace=$WORKSPACE --model_type='Regress_onset_offset_frame_velocity_CRNN' --loss_type='regress_onset_offset_frame_velocity_bce' --augmentation='none' --max_note_shift=0 --batch_size=12 --learning_rate=5e-4 --reduce_iteration=10000 --resume_iteration=0 --early_stop=300000 --cuda
+
+# # --- 2. Train pedal transcription system ---
+# python3 pytorch/main.py train --workspace=$WORKSPACE --model_type='Regress_pedal_CRNN' --loss_type='regress_pedal_bce' --augmentation='none' --max_note_shift=0 --batch_size=12 --learning_rate=5e-4 --reduce_iteration=10000 --resume_iteration=0 --early_stop=300000 --cuda
+
+# # --- 3. Combine the note and pedal models ---
+# # Users should copy and rename the following paths to their trained model paths
+# NOTE_CHECKPOINT_PATH="Regress_onset_offset_frame_velocity_CRNN_onset_F1=0.9677.pth"
+# PEDAL_CHECKPOINT_PATH="Regress_pedal_CRNN_onset_F1=0.9186.pth"
+# NOTE_PEDAL_CHECKPOINT_PATH="CRNN_note_F1=0.9677_pedal_F1=0.9186.pth"
+# python3 pytorch/combine_note_and_pedal_models.py --note_checkpoint_path=$NOTE_CHECKPOINT_PATH --pedal_checkpoint_path=$PEDAL_CHECKPOINT_PATH --output_checkpoint_path=$NOTE_PEDAL_CHECKPOINT_PATH
+
+# # ============ Evaluate (optional) ============
+# # Inference probability for evaluation
+# python3 pytorch/calculate_score_for_paper.py infer_prob --workspace=$WORKSPACE --model_type='Note_pedal' --checkpoint_path=$NOTE_PEDAL_CHECKPOINT_PATH --augmentation='none' --dataset='maestro' --split='test' --cuda
+
+# # Calculate metrics
+# python3 pytorch/calculate_score_for_paper.py calculate_metrics --workspace=$WORKSPACE --model_type='Note_pedal' --augmentation='aug' --dataset='maestro' --split='test'
+# python3 pytorch/calculate_score_for_paper.py calculate_metrics --workspace=$WORKSPACE --model_type='Note_pedal' --augmentation='aug' --dataset='maps' --split='test'