# Breeze-ASR-25 CoreML ANE Optimized

Taiwanese/Mandarin mixed speech recognition model optimized for Apple Neural Engine (ANE).

## 🎯 Model Overview

Based on MediaTek-Research/Breeze-ASR-25, converted to CoreML format with ANE optimization for macOS/iOS.

### Model Components

| Component | File | Precision | Hardware | Size |
|---|---|---|---|---|
| Encoder | `encoder/ggml-breeze-asr-25-encoder.mlmodelc/` | FP16 | ANE | ~1.2 GB |
| Decoder | `decoder/ggml-breeze-asr-25-q5k.bin` | Q5_K_M | CPU/GPU | ~1.0 GB |

### Decoder Attribution

GGML Decoder from alan314159/Breeze-ASR-25-whispercpp. Thank you for sharing!

SHA256: `8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6`


πŸš€ Quick Start

1. Download Models

```bash

Install HuggingFace CLI

pip install huggingface_hub

Download all models

huggingface-cli download sheep52031/breeze-asr-25-coreml-ane \ --local-dir ./models ```

### 2. Swift Integration (macOS/iOS)

```swift
import CoreML
import whisper

// Load the CoreML encoder
let encoderURL = Bundle.main.url(
    forResource: "ggml-breeze-asr-25-encoder",
    withExtension: "mlmodelc"
)!

// Load the GGML decoder
let decoderPath = Bundle.main.path(
    forResource: "ggml-breeze-asr-25-q5k",
    ofType: "bin"
)!

// Initialize the whisper.cpp context
var params = whisper_context_default_params()
params.use_gpu = true // Enable ANE

let ctx = whisper_init_from_file_with_params(decoderPath, params)
whisper_context_set_coreml_encoder(ctx, encoderURL.path)

// Transcribe
let audioData: [Float] = loadAudio("audio.wav")
whisper_full(ctx, params, audioData, Int32(audioData.count))

let text = whisper_full_get_segment_text(ctx, 0)
print("Result: \(String(cString: text!))")
```


## πŸ“Š Performance Benchmarks

### macOS (Apple Silicon)

Test Environment: MacBook Pro M1/M2, 16GB RAM

Configuration:

  • Window size: 30s (audio_ctx=3000)
  • Overlap: 5s
  • Processing: Serial (parallelism=1)
  • State management: Shared state reuse
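The windowing scheme above (30 s windows with a 5 s overlap, processed serially) can be sketched as follows. This is an illustrative reimplementation of the chunking logic, not code from this repository; only the window and overlap values come from the table above:

```python
def chunk_audio(n_samples: int, sr: int = 16000,
                window_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) sample ranges for overlapping windows."""
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)  # advance 25 s per window
    start = 0
    while start < n_samples:
        yield (start, min(start + window, n_samples))
        if start + window >= n_samples:
            break
        start += step

# 70 s of 16 kHz audio -> windows starting at 0 s, 25 s, 50 s
print(list(chunk_audio(70 * 16000)))
# [(0, 480000), (400000, 880000), (800000, 1120000)]
```

Each consecutive pair of windows shares 5 s of audio, which gives the transcription stitcher context across window boundaries.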

Actual Performance (Verified 2025-10-05):

| Audio Length | Processing Time | RTR | Status |
|---|---|---|---|
| 30s | ~10s | 0.33x | βœ… Stable |
| 60s | ~19s | 0.32x | βœ… Stable |
| 70s | ~22s | 0.31x | βœ… Verified |
| 120s | ~37s | 0.31x | βœ… Final |

RTR (Real-Time Ratio): Lower is better. 0.31 means 3.2x faster than real-time.
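As a sanity check, RTR is simply processing time divided by audio length; a small helper (illustrative, not part of the toolchain) reproduces the figures in the table:

```python
def real_time_ratio(processing_s: float, audio_s: float) -> float:
    """RTR = processing time / audio length; < 1.0 means faster than real time."""
    return processing_s / audio_s

# 120 s of audio processed in ~37 s
rtr = real_time_ratio(37, 120)
print(f"RTR = {rtr:.2f}, speedup = {1 / rtr:.1f}x")  # RTR = 0.31, speedup = 3.2x
```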

### Comparison

| Configuration | 120s Audio | RTR | Note |
|---|---|---|---|
| This Project (FP16 ANE + Q5_K) | ~37s | 0.31x | βœ… Verified |
| Full GGML (Estimated) | ~72s | 0.60x | πŸ“Š Theoretical |

Note: "Full GGML" is a theoretical estimate based on the ANE acceleration ratio. Actual performance may vary with:

  • Audio content (speech density)
  • System resources
  • Background tasks

## πŸ”§ Technical Modifications for Breeze-ASR-25 Support

This project implements a hybrid inference architecture combining CoreML-accelerated Encoder with GGML-quantized Decoder to support Breeze-ASR-25 on Apple Silicon.

### Why Official whisper.cpp Doesn't Work

Breeze-ASR-25 is a fine-tuned Whisper model with key differences:

  • Vocabulary Size: 51,865 tokens (vs 51,864 in standard Whisper)
  • Sequence Length: max_source_positions=1500 (encoder output length)
  • Audio Window: Supports 30-second audio (3000 mel frames)

Official whisper.cpp assumptions:

  1. ❌ Hardcodes input_shape = (1, 80, 3000) in CoreML conversion
  2. ❌ Expects vocab_size=51,864
  3. ❌ Lacks dynamic audio_ctx configuration API

### Our Key Modifications

#### 1. CoreML Conversion Script Enhancement

```python
# whisper.cpp/models/convert-whisper-to-coreml.py

# Dynamic sequence length (not hardcoded 3000)
input_shape = (1, hparams.n_mels, hparams.n_audio_ctx)

# Correct feature names for whisper.cpp compatibility
inputs=[ct.TensorType(name="mel", shape=input_shape)]
outputs=[ct.TensorType(name="encoder_output")]
```

#### 2. whisper.cpp API Extension

```c
// Added whisper_set_audio_ctx() for runtime configuration
// Allows models with smaller n_audio_ctx (like Breeze-ASR-25 with 1500)
// to work correctly instead of being padded to 30 seconds (3000 frames)
int whisper_set_audio_ctx(struct whisper_context * ctx, int n_audio_ctx);
```

Note: This is our custom modification to support Breeze-ASR-25. Not yet in official whisper.cpp.

#### 3. Modified whisper.cpp Fork

We maintain a fork with all necessary modifications:

Repository: sheep52031/whisper.cpp (branch: breeze-asr-25-support)

Key modifications:

  • whisper_set_audio_ctx() API for dynamic audio context
  • CoreML conversion enhancements for fine-tuned models
  • Metal bfloat16 optimizations for M2+ GPUs
  • Based on Splend1d/whisper-patch-breeze for vocab support

To use:

```bash
git clone -b breeze-asr-25-support https://github.com/sheep52031/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
```

#### 4. Hybrid Inference Architecture

```
Audio Input (16kHz)
    β†’ Log-Mel Features (80 Γ— 3000)
    β†’ CoreML Encoder (FP16, ANE-accelerated)
    β†’ Hidden States [1, 1500, 1280]
    β†’ GGML Decoder (Q5_K quantized, Metal GPU)
    β†’ Text Output
```
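The tensor shapes in the pipeline above follow from a little arithmetic. The constants below (80 mel bins, 10 ms hop, 2Γ— conv downsampling, 1280 hidden dims) match the Whisper large-v2 configuration implied by the `[1, 1500, 1280]` hidden states; this is an illustrative shape calculator, not repository code:

```python
SAMPLE_RATE = 16000
HOP_LENGTH = 160     # 10 ms hop -> 100 mel frames per second of audio
N_MELS = 80
CONV_STRIDE = 2      # the encoder's conv front-end halves the time axis
D_MODEL = 1280       # hidden size of the large-v2 encoder

def encoder_shapes(audio_seconds: float):
    """Return (mel input shape, encoder output shape) for a given audio length."""
    mel_frames = int(audio_seconds * SAMPLE_RATE / HOP_LENGTH)
    mel_shape = (1, N_MELS, mel_frames)
    hidden_shape = (1, mel_frames // CONV_STRIDE, D_MODEL)
    return mel_shape, hidden_shape

print(encoder_shapes(30.0))  # ((1, 80, 3000), (1, 1500, 1280))
```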

### Technical Insights

#### Understanding `max_source_positions=1500`

  • This is the Encoder output sequence length
  • Actual input length = 1500 Γ— 2 (conv_stride) = 3000 mel frames
  • Equivalent to 30 seconds of audio (100 fps)
  • Common misconception: "1500 = 15 seconds" ❌

#### Why GGML Conversion Works But CoreML Fails

  • GGML: Directly reads config.json, preserves tensor shapes, dynamic runtime
  • CoreML: Requires TorchScript trace with fixed shapes, hardcoded assumptions
  • Our fix: Make CoreML conversion respect model configuration
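Deriving the shape from the model's own `config.json` instead of hardcoding it looks roughly like this. The field names match Hugging Face Whisper configs, but the surrounding conversion code is omitted and the helper itself is a sketch:

```python
import json

def coreml_input_shape(config_path: str):
    """Derive the encoder input shape from the model config
    instead of assuming (1, 80, 3000)."""
    with open(config_path) as f:
        cfg = json.load(f)
    n_mels = cfg.get("num_mel_bins", 80)
    # max_source_positions is the *encoder output* length; the conv
    # front-end downsamples by 2, so the mel input is twice as long.
    n_audio_ctx = cfg["max_source_positions"]
    return (1, n_mels, 2 * n_audio_ctx)

# For Breeze-ASR-25: max_source_positions=1500 -> (1, 80, 3000)
```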

### Contributions to Open Source

We've identified and fixed critical issues in whisper.cpp's CoreML conversion:

  1. βœ… Dynamic sequence length support (not just 3000 frames)
  2. βœ… Runtime audio_ctx configuration API
  3. βœ… Correct feature naming for hybrid inference

These modifications enable support for all fine-tuned Whisper variants, not just Breeze-ASR-25.

Source Code: All modifications are open-sourced at sheep52031/whisper.cpp (branch: breeze-asr-25-support)


πŸ› οΈ Convert From Scratch

Requirements

```bash

macOS 13+, Xcode 14+, Python 3.9+

pip install -r conversion_tools/requirements.txt ```

Convert Encoder

```bash cd conversion_tools python convert_encoder.py --output ../encoder ```

Convert Decoder

```bash cd conversion_tools python convert_decoder.py --output ./output --quantize q5_k ```


## βœ… Verification

```bash
# Encoder precision check
cat encoder/ggml-breeze-asr-25-encoder.mlmodelc/metadata.json | grep dataType
# Should show: "dataType" : "Float16"

# Decoder SHA256 check
shasum -a 256 decoder/ggml-breeze-asr-25-q5k.bin
# Expected: 8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6
```
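The same decoder integrity check can be done from Python; the expected digest is the one published above, and `sha256_of` is a small illustrative helper:

```python
import hashlib

EXPECTED = "8efbf0ce8a3f50fe332b7617da787fb81354b358c288b008d3bdef8359df64c6"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a ~1 GB decoder doesn't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# ok = sha256_of("decoder/ggml-breeze-asr-25-q5k.bin") == EXPECTED
```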


## πŸ“„ License

Based on MediaTek-Research/Breeze-ASR-25 (Apache 2.0).

Attribution:

  • CoreML ANE Optimization: sheep52031 (MIT License)
  • GGML Conversion: alan314159

πŸ™ Acknowledgments

  • MediaTek Research: Breeze-ASR-25 model
  • alan314159: GGML conversion & pretrained model
  • ggerganov: whisper.cpp framework
  • Apple: CoreML Tools & ANE
  • OpenAI: Whisper base model

Last Updated: 2025-10-06
Version: 1.0.0
