# Breeze-ASR-25 CoreML ANE Optimized
Taiwanese/Mandarin mixed speech recognition model optimized for Apple Neural Engine (ANE).
## Model Overview
Based on MediaTek-Research/Breeze-ASR-25, converted to CoreML format with ANE optimization for macOS/iOS.
### Model Components
| Component | File | Precision | Hardware | Size |
|---|---|---|---|---|
| Encoder | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| Decoder | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |
### Decoder Attribution
The GGML decoder comes from alan314159/Breeze-ASR-25-whispercpp. Thank you for sharing!
SHA256: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`
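After downloading, the decoder file can be checked against this checksum. Below is a minimal Python sketch equivalent to `shasum -a 256`; the `sha256_of` helper name is illustrative, not part of this repo:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large model files never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6"
# assert sha256_of("decoder/ggml-breeze-asr-25-q5k.bin") == EXPECTED
```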
## Quick Start
### 1. Download Models
```bash
# Install the Hugging Face CLI
pip install huggingface_hub

# Download all models
huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \
  --local-dir ./models
```
### 2. Swift Integration (macOS/iOS)
```swift
import CoreML
import whisper

// Load the CoreML encoder bundled with the app
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load the GGML decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize the whisper.cpp context
var cparams = whisper_context_default_params()
cparams.use_gpu = true // Enable GPU/ANE acceleration

let ctx = whisper_init_from_file_with_params(decoderPath, cparams)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe (loadAudio is your own 16 kHz mono PCM loader)
let audioData: [Float] = loadAudio("audio.wav")
let fparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
whisper_full(ctx, fparams, audioData, Int32(audioData.count))

let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```
## Performance Benchmarks
### macOS (Apple Silicon)
Test Environment: MacBook Pro M1/M2, 16GB RAM
Configuration:
- Window size: 30s (audio_ctx=3000)
- Overlap: 5s
- Processing: Serial (parallelism=1)
- State management: Shared state reuse
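The 30-second window with 5-second overlap described above amounts to a simple serial chunking loop. Here is a minimal sketch; `chunk_bounds` and the constants are illustrative names, not part of this repo:

```python
SAMPLE_RATE = 16_000         # Breeze-ASR-25 expects 16 kHz mono audio
WINDOW_S, OVERLAP_S = 30, 5  # values from the configuration above
STRIDE_S = WINDOW_S - OVERLAP_S

def chunk_bounds(n_samples: int):
    """Yield (start, end) sample indices for serial 30 s windows with 5 s overlap."""
    window = WINDOW_S * SAMPLE_RATE
    stride = STRIDE_S * SAMPLE_RATE
    start = 0
    while start < n_samples:
        yield start, min(start + window, n_samples)
        if start + window >= n_samples:
            break
        start += stride

# e.g. 70 s of audio -> windows starting at 0 s, 25 s, 50 s
```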
Actual Performance (Verified 2025-10-05):
| Audio Length | Processing Time | RTR | Status |
|---|---|---|---|
| 30s | ~10s | 0.33x | Stable |
| 60s | ~19s | 0.32x | Stable |
| 70s | ~22s | 0.31x | Verified |
| 120s | ~37s | 0.31x | Final |
RTR (Real-Time Ratio): lower is better. An RTR of 0.31 means roughly 3.2x faster than real time.
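As a quick sanity check on the table, the RTR arithmetic is just a ratio. These helper functions are illustrative, not part of this repo:

```python
def rtr(processing_s: float, audio_s: float) -> float:
    """Real-Time Ratio: processing time divided by audio duration (lower is better)."""
    return processing_s / audio_s

def speedup(r: float) -> float:
    """How many times faster than real time a given RTR is."""
    return 1.0 / r

# From the table above: 120 s of audio processed in ~37 s
print(round(rtr(37, 120), 2))   # 0.31
print(round(speedup(0.31), 1))  # 3.2
```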
### Comparison
| Configuration | 120s Audio | RTR | Note |
|---|---|---|---|
| This Project (FP16 ANE + Q5_K) | ~37s | 0.31x | Verified |
| Full GGML (Estimated) | ~72s | 0.60x | Theoretical |
Note: the "Full GGML" figure is a theoretical estimate based on the ANE acceleration ratio. Performance may vary with:
- Audio content (speech density)
- System resources
- Background tasks
## Technical Modifications for Breeze-ASR-25 Support
This project implements a hybrid inference architecture combining CoreML-accelerated Encoder with GGML-quantized Decoder to support Breeze-ASR-25 on Apple Silicon.
### Why Official whisper.cpp Doesn't Work
Breeze-ASR-25 is a fine-tuned Whisper model with key differences:
- Vocabulary Size: 51,865 tokens (vs 51,864 in standard Whisper)
- Sequence Length: `max_source_positions=1500` (encoder output length)
- Audio Window: supports 30-second audio (3000 mel frames)
Official whisper.cpp assumptions:
- Hardcodes `input_shape = (1, 80, 3000)` in the CoreML conversion
- Expects `vocab_size=51,864`
- Lacks a dynamic audio_ctx configuration API
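One way to see whether a fine-tuned checkpoint diverges from these assumptions is to inspect its Hugging Face `config.json`. A minimal sketch follows; `summarize_config` is a hypothetical helper, not part of this repo:

```python
import json

def summarize_config(path: str) -> dict:
    """Extract the fields that matter for whisper.cpp compatibility from a HF config.json."""
    with open(path) as f:
        cfg = json.load(f)
    return {
        "vocab_size": cfg.get("vocab_size"),                      # Breeze-ASR-25: 51,865 (stock Whisper: 51,864)
        "max_source_positions": cfg.get("max_source_positions"),  # encoder output length: 1500
        "num_mel_bins": cfg.get("num_mel_bins"),                  # 80 for large-v2-derived models
    }
```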
### Our Key Modifications
#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```
#### 2. whisper.cpp API Extension

```c
// Added whisper_set_audio_ctx() for runtime configuration.
// Allows models with a smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames).
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
```
Note: This is our custom modification to support Breeze-ASR-25. Not yet in official whisper.cpp.
#### 3. Modified whisper.cpp Fork
We maintain a fork with all necessary modifications:
Repository: sheep52031/whisper.cpp (branch: `breeze-asr-25-support`)
Key modifications:
- whisper_set_audio_ctx() API for dynamic audio context
- CoreML conversion enhancements for fine-tuned models
- Metal bfloat16 optimizations for M2+ GPUs
- Based on Splend1d/whisper-patch-breeze for vocab support
To use:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```
#### 4. Hybrid Inference Architecture

```
Audio Input (16 kHz)
  ↓ Log-Mel Features (80 × 3000)
  ↓ CoreML Encoder (FP16, ANE-accelerated)
  ↓ Hidden States [1, 1500, 1280]
  ↓ GGML Decoder (Q5_K quantized, Metal GPU)
  ↓ Text Output
```
### Technical Insights
#### Understanding `max_source_positions=1500`

- This is the encoder *output* sequence length
- Actual input length = 1500 × 2 (conv stride) = 3000 mel frames
- Equivalent to 30 seconds of audio (100 mel frames per second)
- Common misconception: "1500 = 15 seconds" (incorrect)
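The arithmetic above can be checked directly. The constants below are illustrative and match the numbers in the list:

```python
MEL_FPS = 100    # log-mel frames per second (10 ms hop)
CONV_STRIDE = 2  # Whisper's encoder front-end downsamples mel frames by 2x

def encoder_positions_to_seconds(max_source_positions: int) -> float:
    """Audio window length implied by the encoder's output sequence length."""
    mel_frames = max_source_positions * CONV_STRIDE  # 1500 * 2 = 3000
    return mel_frames / MEL_FPS                      # 3000 / 100 = 30.0

print(encoder_positions_to_seconds(1500))  # 30.0 seconds, not 15
```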
#### Why GGML Conversion Works But CoreML Fails
- GGML: directly reads `config.json`, preserves tensor shapes, dynamic at runtime
- CoreML: requires a TorchScript trace with fixed shapes and hardcoded assumptions
- Our fix: make the CoreML conversion respect the model configuration
### Contributions to Open Source
We've identified and fixed critical issues in whisper.cpp's CoreML conversion:
- Dynamic sequence length support (not just 3000 frames)
- Runtime audio_ctx configuration API
- Correct feature naming for hybrid inference
These modifications are aimed at fine-tuned Whisper variants in general, not just Breeze-ASR-25.
Source Code: all modifications are open-sourced at sheep52031/whisper.cpp (branch: `breeze-asr-25-support`).
## Convert From Scratch
### Requirements
```bash
# Requires macOS 13+, Xcode 14+, Python 3.9+
pip install -r conversion_tools/requirements.txt
```
### Convert Encoder

```bash
cd conversion_tools
python convert_encoder.py --output ../encoder
```
### Convert Decoder

```bash
cd conversion_tools
python convert_decoder.py --output ./output --quantize q5_k
```
## Verification
```bash
# Encoder precision check
cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
# Should show: "dataType" : "Float16"

# Decoder SHA256 check
shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
# Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
```
## License
Based on MediaTek-Research/Breeze-ASR-25 (Apache 2.0).
Attribution:
- CoreML ANE Optimization: sheep52031 (MIT License)
- GGML Conversion: alan314159
## Acknowledgments
- MediaTek Research: Breeze-ASR-25 model
- alan314159: GGML conversion & pretrained model
- ggerganov: whisper.cpp framework
- Apple: CoreML Tools & ANE
- OpenAI: Whisper base model
Last Updated: 2025-10-06
Version: 1.0.0