# Breeze-ASR-25 CoreML ANE Optimized
Taiwanese/Mandarin mixed speech recognition model optimized for Apple Neural Engine (ANE).
## Model Overview
Based on MediaTek-Research/Breeze-ASR-25, converted to CoreML format with ANE optimization for macOS/iOS.
### Model Components
| Component | File | Precision | Hardware | Size |
|---|---|---|---|---|
| Encoder | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| Decoder | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |
### Decoder Attribution
The GGML decoder comes from alan314159/Breeze-ASR-25-whispercpp. Thank you for sharing!
SHA256: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`
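After downloading, the decoder file can be checked against this checksum. Below is a minimal Python sketch equivalent to `shasum -a 256`; the `sha256_of` helper name is illustrative, not part of this repo:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large model files never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6"
# assert sha256_of("decoder/ggml-breeze-asr-25-q5k.bin") == EXPECTED
```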
## Quick Start
### 1. Download Models
```bash
# Install the Hugging Face CLI
pip install huggingface_hub

# Download all models
huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \
  --local-dir ./models
```
### 2. Swift Integration (macOS/iOS)
```swift
import CoreML
import whisper

// Load the CoreML encoder bundled with the app
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load the GGML decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize the whisper.cpp context
var cparams = whisper_context_default_params()
cparams.use_gpu = true // Enable GPU/ANE acceleration

let ctx = whisper_init_from_file_with_params(decoderPath, cparams)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe (loadAudio is your own 16 kHz mono PCM loader)
let audioData: [Float] = loadAudio("audio.wav")
let fparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
whisper_full(ctx, fparams, audioData, Int32(audioData.count))

let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```
## Performance Benchmarks
### macOS (Apple Silicon)
Test Environment: MacBook Pro M1/M2, 16GB RAM
Configuration:
- Window size: 30s (audio_ctx=3000)
- Overlap: 5s
- Processing: Serial (parallelism=1)
- State management: Shared state reuse
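The 30-second window with 5-second overlap described above amounts to a simple serial chunking loop. Here is a minimal sketch; `chunk_bounds` and the constants are illustrative names, not part of this repo:

```python
SAMPLE_RATE = 16_000         # Breeze-ASR-25 expects 16 kHz mono audio
WINDOW_S, OVERLAP_S = 30, 5  # values from the configuration above
STRIDE_S = WINDOW_S - OVERLAP_S

def chunk_bounds(n_samples: int):
    """Yield (start, end) sample indices for serial 30 s windows with 5 s overlap."""
    window = WINDOW_S * SAMPLE_RATE
    stride = STRIDE_S * SAMPLE_RATE
    start = 0
    while start < n_samples:
        yield start, min(start + window, n_samples)
        if start + window >= n_samples:
            break
        start += stride

# e.g. 70 s of audio -> windows starting at 0 s, 25 s, 50 s
```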
Actual Performance (Verified 2025-10-05):
| Audio Length | Processing Time | RTR | Status |
|---|---|---|---|
| 30s | ~10s | 0.33x | Stable |
| 60s | ~19s | 0.32x | Stable |
| 70s | ~22s | 0.31x | Verified |
| 120s | ~37s | 0.31x | Final |
RTR (Real-Time Ratio): lower is better. An RTR of 0.31 means roughly 3.2x faster than real time.
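As a quick sanity check on the table, the RTR arithmetic is just a ratio. These helper functions are illustrative, not part of this repo:

```python
def rtr(processing_s: float, audio_s: float) -> float:
    """Real-Time Ratio: processing time divided by audio duration (lower is better)."""
    return processing_s / audio_s

def speedup(r: float) -> float:
    """How many times faster than real time a given RTR is."""
    return 1.0 / r

# From the table above: 120 s of audio processed in ~37 s
print(round(rtr(37, 120), 2))   # 0.31
print(round(speedup(0.31), 1))  # 3.2
```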
### Comparison
| Configuration | 120s Audio | RTR | Note |
|---|---|---|---|
| This Project (FP16 ANE + Q5_K) | ~37s | 0.31x | Verified |
| Full GGML (Estimated) | ~72s | 0.60x | Theoretical |
Note: the "Full GGML" figure is a theoretical estimate based on the ANE acceleration ratio. Performance may vary with:
- Audio content (speech density)
- System resources
- Background tasks
## Technical Modifications for Breeze-ASR-25 Support
This project implements a hybrid inference architecture combining CoreML-accelerated Encoder with GGML-quantized Decoder to support Breeze-ASR-25 on Apple Silicon.
### Why Official whisper.cpp Doesn't Work
Breeze-ASR-25 is a fine-tuned Whisper model with key differences:
- Vocabulary Size: 51,865 tokens (vs 51,864 in standard Whisper)
- Sequence Length: `max_source_positions=1500` (encoder output length)
- Audio Window: supports 30-second audio (3000 mel frames)
Official whisper.cpp assumptions:
- Hardcodes `input_shape = (1, 80, 3000)` in the CoreML conversion
- Expects `vocab_size=51,864`
- Lacks a dynamic audio_ctx configuration API
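One way to see whether a fine-tuned checkpoint diverges from these assumptions is to inspect its Hugging Face `config.json`. A minimal sketch follows; `summarize_config` is a hypothetical helper, not part of this repo:

```python
import json

def summarize_config(path: str) -> dict:
    """Extract the fields that matter for whisper.cpp compatibility from a HF config.json."""
    with open(path) as f:
        cfg = json.load(f)
    return {
        "vocab_size": cfg.get("vocab_size"),                      # Breeze-ASR-25: 51,865 (stock Whisper: 51,864)
        "max_source_positions": cfg.get("max_source_positions"),  # encoder output length: 1500
        "num_mel_bins": cfg.get("num_mel_bins"),                  # 80 for large-v2-derived models
    }
```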
### Our Key Modifications
#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```
#### 2. whisper.cpp API Extension

```c
// Added whisper_set_audio_ctx() for runtime configuration.
// Allows models with a smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames).
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
```
Note: This is our custom modification to support Breeze-ASR-25. Not yet in official whisper.cpp.
#### 3. Modified whisper.cpp Fork
We maintain a fork with all necessary modifications:
Repository: sheep52031/whisper.cpp (branch: `breeze-asr-25-support`)
Key modifications:
- whisper_set_audio_ctx() API for dynamic audio context
- CoreML conversion enhancements for fine-tuned models
- Metal bfloat16 optimizations for M2+ GPUs
- Based on Splend1d/whisper-patch-breeze for vocab support
To use:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```
#### 4. Hybrid Inference Architecture

```
Audio Input (16 kHz)
  ↓ Log-Mel Features (80 × 3000)
  ↓ CoreML Encoder (FP16, ANE-accelerated)
  ↓ Hidden States [1, 1500, 1280]
  ↓ GGML Decoder (Q5_K quantized, Metal GPU)
  ↓ Text Output
```
### Technical Insights
#### Understanding `max_source_positions=1500`

- This is the encoder *output* sequence length
- Actual input length = 1500 × 2 (conv stride) = 3000 mel frames
- Equivalent to 30 seconds of audio (100 mel frames per second)
- Common misconception: "1500 = 15 seconds" (incorrect)
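The arithmetic above can be checked directly. The constants below are illustrative and match the numbers in the list:

```python
MEL_FPS = 100    # log-mel frames per second (10 ms hop)
CONV_STRIDE = 2  # Whisper's encoder front-end downsamples mel frames by 2x

def encoder_positions_to_seconds(max_source_positions: int) -> float:
    """Audio window length implied by the encoder's output sequence length."""
    mel_frames = max_source_positions * CONV_STRIDE  # 1500 * 2 = 3000
    return mel_frames / MEL_FPS                      # 3000 / 100 = 30.0

print(encoder_positions_to_seconds(1500))  # 30.0 seconds, not 15
```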
#### Why GGML Conversion Works But CoreML Fails
- GGML: directly reads `config.json`, preserves tensor shapes, dynamic at runtime
- CoreML: requires a TorchScript trace with fixed shapes and hardcoded assumptions
- Our fix: make the CoreML conversion respect the model configuration
### Contributions to Open Source
We've identified and fixed critical issues in whisper.cpp's CoreML conversion:
- Dynamic sequence length support (not just 3000 frames)
- Runtime audio_ctx configuration API
- Correct feature naming for hybrid inference
These modifications are aimed at fine-tuned Whisper variants in general, not just Breeze-ASR-25.
Source Code: all modifications are open-sourced at sheep52031/whisper.cpp (branch: `breeze-asr-25-support`).
## Convert From Scratch
### Requirements
```bash
# Requires macOS 13+, Xcode 14+, Python 3.9+
pip install -r conversion_tools/requirements.txt
```
### Convert Encoder

```bash
cd conversion_tools
python convert_encoder.py --output ../encoder
```
### Convert Decoder

```bash
cd conversion_tools
python convert_decoder.py --output ./output --quantize q5_k
```
## Verification
```bash
# Encoder precision check
cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
# Should show: "dataType" : "Float16"

# Decoder SHA256 check
shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
# Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
```
## License
Based on MediaTek-Research/Breeze-ASR-25 (Apache 2.0).
Attribution:
- CoreML ANE Optimization: sheep52031 (MIT License)
- GGML Conversion: alan314159
## Acknowledgments
- MediaTek Research: Breeze-ASR-25 model
- alan314159: GGML conversion & pretrained model
- ggerganov: whisper.cpp framework
- Apple: CoreML Tools & ANE
- OpenAI: Whisper base model
Last Updated: 2025-10-06
Version: 1.0.0