Upload benchmark_report.md
Training With Mixed Dataset/benchmark_report.md
# Pitch Detection Algorithm Benchmark Report

## Benchmark Methodology

### Evaluation Setup
This benchmark evaluates pitch detection algorithms on multiple datasets that span synthetic and real audio from the speech and music domains. Each algorithm is tested on noisy audio generated by mixing the clean recordings with CHiME background noise at signal-to-noise ratios of 10 to 30 dB and with voice gain variations of -6 to +6 dB.
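
The mixing step described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function names, the uniform sampling of SNR and gain, the choice to measure SNR against the gain-adjusted voice, and the white-noise stand-in for CHiME noise are all assumptions.

```python
import math
import random

def rms(x):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(clean, noise, snr_db, gain_db=0.0):
    """Apply a voice gain, then scale the noise to reach the target SNR."""
    gain = 10.0 ** (gain_db / 20.0)
    voice = [gain * v for v in clean]
    # Choose the noise scale so that 20*log10(rms(voice)/rms(noise)) == snr_db.
    scale = rms(voice) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    scaled_noise = [scale * n for n in noise]
    mixed = [v + n for v, n in zip(voice, scaled_noise)]
    return mixed, voice, scaled_noise

rng = random.Random(0)
sr = 22050
# Hypothetical inputs: a 220 Hz tone as "clean" audio, white noise as background.
clean = [0.5 * math.sin(2 * math.pi * 220.0 * t / sr) for t in range(sr)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(sr)]

snr_db = rng.uniform(10.0, 30.0)   # assumed: SNR drawn uniformly from 10-30 dB
gain_db = rng.uniform(-6.0, 6.0)   # assumed: gain drawn uniformly from -6 to +6 dB
mixed, voice, scaled_noise = mix_at_snr(clean, noise, snr_db, gain_db)
achieved = 20.0 * math.log10(rms(voice) / rms(scaled_noise))
```

Here `achieved` should match `snr_db` up to floating-point error, which is a quick sanity check on the scaling.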

### Performance Metric Definition
The **Overall Performance Rankings** report the **Harmonic Mean (HM)** score as a percentage. HM combines six complementary components, each in the range 0 to 1:

**HM = 6 / (1/RPA + 1/CA + 1/P + 1/R + 1/OA + 1/GEA)**

Because HM is a harmonic mean, a low value in any single component pulls the overall score down sharply.

Where:
- **RPA** (Raw Pitch Accuracy): Fraction of voiced frames whose estimated pitch is within 50 cents of the ground truth
- **CA** (Cents Accuracy): exp(-mean_cents_error/500), so the score decays exponentially as the mean error grows
- **P** (Voicing Precision): TP/(TP+FP), fraction of predicted voiced frames that are truly voiced
- **R** (Voicing Recall): TP/(TP+FN), fraction of truly voiced frames that are detected
- **OA** (Octave Accuracy): exp(-10×octave_error_rate), rewarding robustness against octave errors
- **GEA** (Gross Error Accuracy): exp(-5×gross_error_rate), penalizing deviations >200 cents
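
As a minimal sketch of how the six components combine, the function below applies the exponential transforms and the harmonic mean exactly as written in the definitions above. The function name, argument names, and the zero-handling rule are assumptions made for illustration.

```python
import math

def harmonic_mean_score(rpa, mean_cents_error, precision, recall,
                        octave_error_rate, gross_error_rate):
    """Combine six components into HM as defined in this report."""
    ca = math.exp(-mean_cents_error / 500.0)       # Cents Accuracy
    oa = math.exp(-10.0 * octave_error_rate)       # Octave Accuracy
    gea = math.exp(-5.0 * gross_error_rate)        # Gross Error Accuracy
    components = [rpa, ca, precision, recall, oa, gea]
    if min(components) <= 0.0:
        return 0.0  # assumed convention: HM is 0 if any component is 0
    return len(components) / sum(1.0 / c for c in components)

# Illustrative inputs only: the aggregate SwiftF0 values reported later in
# this document. The report's own scores are presumably computed per dataset.
hm = harmonic_mean_score(rpa=0.912, mean_cents_error=32.4,
                         precision=0.891, recall=0.889,
                         octave_error_rate=0.011, gross_error_rate=0.014)
perfect = harmonic_mean_score(1.0, 0.0, 1.0, 1.0, 0.0, 0.0)
```

With these inputs `hm` comes out at roughly 0.91, and a perfect estimator scores exactly 1.0.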

### Speed Benchmark Details
CPU timing is measured on 1-second audio signals sampled at 22.05 kHz, using a 256-sample hop length. The reported **CPU Time (ms)** is the average processing time per 1-second audio segment across multiple runs. **Relative Speed** expresses each algorithm's speed relative to CREPE, which serves as the baseline.
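
A timing loop of this kind might look like the sketch below. It is not the benchmark's harness: `dummy_estimator` is a hypothetical stand-in for a real pitch tracker, the run count and warm-up call are assumptions, `time.perf_counter` measures wall-clock time, and the direction of the ratio in `relative_speed` (baseline time divided by algorithm time, so higher means faster) is an assumed reading of "Relative Speed".

```python
import time

SAMPLE_RATE = 22050   # 22.05 kHz, as stated in the report
HOP_LENGTH = 256      # hop length in samples, as stated in the report

def dummy_estimator(signal):
    """Hypothetical stand-in for a pitch tracker: one energy value per hop."""
    return [sum(v * v for v in signal[i:i + HOP_LENGTH])
            for i in range(0, len(signal), HOP_LENGTH)]

def mean_time_ms(fn, signal, runs=5):
    """Average time per call in milliseconds, after one warm-up call."""
    fn(signal)  # warm-up, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        fn(signal)
    return (time.perf_counter() - start) * 1000.0 / runs

def relative_speed(baseline_ms, algorithm_ms):
    """Assumed definition: how many times faster than the baseline."""
    return baseline_ms / algorithm_ms

signal = [0.0] * SAMPLE_RATE          # one second of audio
frames = dummy_estimator(signal)      # number of frames depends on padding
elapsed_ms = mean_time_ms(dummy_estimator, signal)
```

For CPU time rather than wall-clock time, `time.process_time` could be substituted for `time.perf_counter`.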

### Optimal Threshold Analysis
The **Optimal Threshold** is the voicing confidence threshold that maximizes the Harmonic Mean score. Each algorithm is evaluated at thresholds from 0.0 to 1.0 in steps of 0.1, and the threshold yielding the highest HM is selected. **CV** stands for Coefficient of Variation (std/mean) and measures consistency across datasets.
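
The threshold sweep can be sketched as a simple grid search. The helper name `best_threshold` and the toy scoring curve are hypothetical; in practice `score_at` would run the evaluation and return the HM score at the given threshold.

```python
def best_threshold(score_at):
    """Evaluate thresholds 0.0, 0.1, ..., 1.0 and return the best one."""
    thresholds = [round(0.1 * i, 1) for i in range(11)]
    scores = {t: score_at(t) for t in thresholds}
    # max() keeps the first maximum, so ties resolve to the lowest threshold.
    best = max(thresholds, key=lambda t: scores[t])
    return best, scores[best]

# Toy curve that peaks at 0.9, purely for illustration.
threshold, score = best_threshold(lambda t: 1.0 - (t - 0.9) ** 2)
```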

## Dataset Descriptions

The benchmark evaluates algorithms across diverse datasets covering speech, music, synthetic, and real-world conditions:

| **Dataset** | **Domain** | **Type** | **Description** |
|---|---|---|---|
| **NSynth** | Music | Synthetic | Single-note synthetic audio from musical instruments with accurate pitch labels. Lacks the temporal and spectral complexity of real-world environments. |
| **PTDB** | Speech | Real | Speech recordings with laryngograph signals capturing vocal fold vibrations. Ground truth is derived from high-pass filtered laryngograph signals processed with the RAPT algorithm. |
| **PTDBNoisy** | Speech | Real | Subset of 347 PTDB files (7.4%) with noticeable noise that were excluded from the main evaluation. |
| **MIR1K** | Music | Real | Vocal excerpts with pitch contours initially extracted algorithmically (e.g., YIN) and then manually corrected. Labels still reflect some algorithmic biases. |
| **MDBStemSynth** | Music | Synthetic | Musically structured synthetic audio with accurate pitch annotations. Valuable for controlled evaluation but lacks real-world acoustic variability. |
| **Vocadito** | Music | Real | Solo vocal recordings with pitch annotations derived from the pYIN algorithm and refined through manual verification. |
| **Bach10Synth** | Music | Synthetic | High-quality pitch labels for synthesized musical performances. Similar to MDBStemSynth but focused on Bach compositions. |
| **SpeechSynth** | Speech | Synthetic | Synthetic Mandarin speech generated using the LightSpeech TTS model, trained on 97.48 hours from the AISHELL-3 and Biaobei datasets. Provides exact pitch ground truth. |

**Key Characteristics:**
- **Synthetic datasets** provide perfect ground truth but may lack real-world complexity
- **Real datasets** capture natural acoustic variations but have imperfect ground truth annotations
- **Speech datasets** focus on vocal pitch tracking challenges
- **Music datasets** encompass instrumental and vocal music scenarios
- **SpeechSynth** fills the gap left by the lack of synthetic speech data with accurate pitch labels

## Overall Performance Rankings

| **Algorithm** | **Bach10Synth** | **MIR1K** | **PTDB** | **PTDBNoisy** | **SpeechSynth** | **Vocadito** | **Average** |
|---|---|---|---|---|---|---|---|
| **SwiftF0** | 98.0% | 94.9% | **91.2%** | **75.7%** | **90.7%** | 95.0% | **90.9%** |
| RMVPE | **98.4%** | **96.0%** | 86.1% | 66.2% | 90.5% | **97.1%** | 89.0% |
| DJCM | 95.8% | 94.4% | 86.3% | 73.3% | 89.0% | 94.9% | 89.0% |

No speed benchmark results found.

## Detailed Performance Analysis

### Voicing Detection Performance
Measures how well algorithms distinguish between voiced (pitched) and unvoiced (unpitched) audio segments.

| **Algorithm** | **Precision ↑** | **Recall ↑** | **F1-Score ↑** |
|---|---|---|---|
| DJCM | **0.911** | 0.857 | 0.882 |
| RMVPE | 0.886 | 0.826 | 0.854 |
| **SwiftF0** | 0.891 | **0.889** | **0.888** |
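
For reference, frame-level voicing scores of this kind can be computed as in the sketch below. It uses the TP/FP/FN definitions given under Performance Metric Definition plus the standard F1 formula; the function name and the toy frame labels are illustrative, and the report's tabulated values may be averaged per dataset rather than computed this way.

```python
def voicing_scores(ref_voiced, est_voiced):
    """Precision, recall and F1 from per-frame voiced/unvoiced flags."""
    pairs = list(zip(ref_voiced, est_voiced))
    tp = sum(1 for r, e in pairs if r and e)        # voiced, predicted voiced
    fp = sum(1 for r, e in pairs if not r and e)    # unvoiced, predicted voiced
    fn = sum(1 for r, e in pairs if r and not e)    # voiced, predicted unvoiced
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example with six frames (1 = voiced, 0 = unvoiced).
ref = [1, 1, 1, 0, 0, 1]
est = [1, 1, 0, 1, 0, 1]
precision, recall, f1 = voicing_scores(ref, est)
```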

### Pitch Accuracy Metrics
Detailed pitch estimation accuracy across different error types and magnitudes.

| **Algorithm** | **RPA ↑** | **RCA ↑** | **Cents Error ↓** | **RMSE (Hz) ↓** | **Octave Error ↓** | **Gross Error ↓** |
|---|---|---|---|---|---|---|
| DJCM | 0.891 | 0.897 | 40.4 | 25.3 | 0.015 | 0.022 |
| RMVPE | 0.888 | 0.892 | 35.8 | **11.6** | 0.012 | 0.015 |
| SwiftF0 | **0.912** | **0.916** | **32.4** | 11.9 | **0.011** | **0.014** |

**Additional Metric Definitions:**
- **RCA** (Raw Chroma Accuracy): Fraction of voiced frames with the correct pitch class (note name), ignoring octave
- **Cents Error**: Mean absolute pitch deviation in cents (the raw error, before the exponential transform used in CA)
- **RMSE**: Root Mean Square Error in Hz
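
The pitch accuracy metrics above can be illustrated with a short sketch. It assumes every frame passed in is voiced in both the reference and the estimate, reuses the 50-cent tolerance from the RPA definition for RCA, and uses hypothetical function names; the benchmark's own implementation may differ in these details.

```python
import math

def cents(f_est, f_ref):
    """Signed pitch deviation in cents between two frequencies in Hz."""
    return 1200.0 * math.log2(f_est / f_ref)

def pitch_metrics(est_hz, ref_hz, tol=50.0):
    """Return (RPA, RCA, mean absolute cents error, RMSE in Hz)."""
    errs = [cents(e, r) for e, r in zip(est_hz, ref_hz)]
    n = len(errs)
    rpa = sum(abs(c) <= tol for c in errs) / n
    # Fold each error into [-600, 600) cents so whole-octave jumps vanish.
    rca = sum(abs((c + 600.0) % 1200.0 - 600.0) <= tol for c in errs) / n
    mean_abs_cents = sum(abs(c) for c in errs) / n
    rmse_hz = math.sqrt(sum((e - r) ** 2 for e, r in zip(est_hz, ref_hz)) / n)
    return rpa, rca, mean_abs_cents, rmse_hz

# Toy frames: exact, slightly sharp, one octave up, one semitone up.
ref = [220.0, 220.0, 220.0, 220.0]
est = [220.0, 222.0, 440.0, 233.08]
rpa, rca, mean_abs_cents, rmse_hz = pitch_metrics(est, ref)
```

The octave-up frame counts as an error for RPA but as correct for RCA, which is why RCA is never lower than RPA.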

### Pitch Contour Smoothness
Measures the temporal stability and continuity of pitch tracks.

| **Algorithm** | **Relative Smoothness ↓** | **Continuity Breaks ↓** | **Overall Smoothness Rank ↓** |
|---|---|---|---|
| **SwiftF0** | 1.425 | **0.720** | **1.5** |
| RMVPE | **1.253** | 0.868 | 2.0 |
| DJCM | 3.976 | 0.771 | 2.5 |

**Metric Definitions:**
- **Relative Smoothness**: Coefficient of variation of consecutive pitch changes (std/mean of relative frame-to-frame changes)
- **Continuity Breaks**: Fraction of ground-truth voiced segments where predicted voicing has gaps
- **Overall Smoothness Rank**: Average rank across both smoothness metrics (1=best, lower is better)
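
The Overall Smoothness Rank can be reproduced from the two metric columns of the table above. The sketch below is one straightforward reading of "average rank across both smoothness metrics"; the helper name is hypothetical and ties are not handled.

```python
def ranks(values):
    """Rank entries ascending by value (1 = smallest, i.e. best here)."""
    order = sorted(values, key=values.get)
    return {name: i + 1 for i, name in enumerate(order)}

# Values copied from the Pitch Contour Smoothness table.
relative_smoothness = {"SwiftF0": 1.425, "RMVPE": 1.253, "DJCM": 3.976}
continuity_breaks = {"SwiftF0": 0.720, "RMVPE": 0.868, "DJCM": 0.771}

r_smooth = ranks(relative_smoothness)
r_breaks = ranks(continuity_breaks)
overall_rank = {name: (r_smooth[name] + r_breaks[name]) / 2 for name in r_smooth}
```

This yields 1.5 for SwiftF0, 2.0 for RMVPE, and 2.5 for DJCM, matching the table.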

### Optimal Threshold Analysis
Voicing confidence thresholds that maximize overall performance scores.

| **Algorithm** | **Mean Threshold** | **Std Dev ↓** | **Range** |
|---|---|---|---|
| DJCM | 0.517 | 0.069 | 0.40-0.60 |
| RMVPE | 0.683 | 0.069 | 0.60-0.80 |
| SwiftF0 | 0.900 | **0.000** | 0.90-0.90 |

### Algorithm Consistency
Measures performance stability across different datasets using Coefficient of Variation (CV = std/mean).

| **Algorithm** | **Performance CV ↓** | **Threshold CV ↓** |
|---|---|---|
| DJCM | 0.088 | 0.133 |
| RMVPE | 0.124 | 0.101 |
| SwiftF0 | **0.080** | **0.000** |
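
As a rough check, the Performance CV column can be recomputed from the rounded per-dataset scores in the Overall Performance Rankings table. The sketch assumes the population standard deviation (`statistics.pstdev`); that choice is inferred because it agrees with the reported values, whereas the sample standard deviation would give larger numbers.

```python
import statistics

# Per-dataset HM scores (%) copied from the Overall Performance Rankings table,
# in the order Bach10Synth, MIR1K, PTDB, PTDBNoisy, SpeechSynth, Vocadito.
scores = {
    "SwiftF0": [98.0, 94.9, 91.2, 75.7, 90.7, 95.0],
    "RMVPE": [98.4, 96.0, 86.1, 66.2, 90.5, 97.1],
    "DJCM": [95.8, 94.4, 86.3, 73.3, 89.0, 94.9],
}

def coefficient_of_variation(values):
    """CV = std/mean, using the population standard deviation (assumed)."""
    return statistics.pstdev(values) / statistics.mean(values)

performance_cv = {name: coefficient_of_variation(v) for name, v in scores.items()}
```

Rounded to three decimals, the results agree with the table; small differences in the last digit are possible because the inputs here are already rounded.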

## Performance by Dataset Subsets

### By Origin
- **Synthetic**: Bach10Synth, MDBStemSynth, SpeechSynth, NSynth
- **Real**: MIR1K, PTDB, PTDBNoisy, Vocadito

| **Algorithm** | **Synthetic** | **Real** |
|---|---|---|
| DJCM | 92.4% | 87.2% |
| **RMVPE** | **94.4%** | 86.3% |
| **SwiftF0** | **94.4%** | **89.2%** |

### By Domain
- **Speech**: PTDB, PTDBNoisy, SpeechSynth
- **Music**: Bach10Synth, MDBStemSynth, NSynth, Vocadito, MIR1K

| **Algorithm** | **Speech** | **Music** |
|---|---|---|
| DJCM | 82.9% | 95.0% |
| RMVPE | 80.9% | **97.1%** |
| **SwiftF0** | **85.9%** | 96.0% |

### By Cross-Dimension
- **Synthetic + Speech**: SpeechSynth
- **Synthetic + Music**: Bach10Synth, MDBStemSynth, NSynth
- **Real + Speech**: PTDB, PTDBNoisy
- **Real + Music**: Vocadito, MIR1K

| **Algorithm** | **Synthetic + Speech** | **Synthetic + Music** | **Real + Speech** | **Real + Music** |
|---|---|---|---|---|
| DJCM | 89.0% | 95.8% | 79.8% | 94.7% |
| RMVPE | 90.5% | **98.4%** | 76.1% | **96.5%** |
| **SwiftF0** | **90.7%** | 98.0% | **83.5%** | 95.0% |
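
The subset scores appear consistent with an unweighted mean of the per-dataset scores in the Overall Performance Rankings table, taken over the datasets that have a reported score. The sketch below works under that assumption; it skips MDBStemSynth and NSynth because that table lists no scores for them, and it can differ from the report by about 0.1 because the inputs are already rounded.

```python
from statistics import mean

# Per-dataset HM scores (%) copied from the Overall Performance Rankings table.
scores = {
    "SwiftF0": {"Bach10Synth": 98.0, "MIR1K": 94.9, "PTDB": 91.2,
                "PTDBNoisy": 75.7, "SpeechSynth": 90.7, "Vocadito": 95.0},
    "RMVPE": {"Bach10Synth": 98.4, "MIR1K": 96.0, "PTDB": 86.1,
              "PTDBNoisy": 66.2, "SpeechSynth": 90.5, "Vocadito": 97.1},
    "DJCM": {"Bach10Synth": 95.8, "MIR1K": 94.4, "PTDB": 86.3,
             "PTDBNoisy": 73.3, "SpeechSynth": 89.0, "Vocadito": 94.9},
}

subsets = {
    "Speech": ["PTDB", "PTDBNoisy", "SpeechSynth"],
    "Music": ["Bach10Synth", "MDBStemSynth", "NSynth", "Vocadito", "MIR1K"],
}

def subset_mean(algorithm, datasets):
    """Unweighted mean over datasets that have a score (assumed aggregation)."""
    available = [scores[algorithm][d] for d in datasets if d in scores[algorithm]]
    return mean(available)

by_domain = {alg: {name: subset_mean(alg, ds) for name, ds in subsets.items()}
             for alg in scores}
```

The same helper applies to the By Origin and By Cross-Dimension groupings by swapping in the corresponding dataset lists.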