---
library_name: onnxruntime
tags:
- snac
- onnx
- 24khz
- decoder
- browser
license: other
language:
- en
---

# SNAC 24 kHz: Decoder as ONNX (browser-ready)

This repo provides **ONNX decoders** for the SNAC 24 kHz codec so you can decode SNAC tokens **on-device**, including **in the browser** with `onnxruntime-web`.

**Why?** If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens quickly and cheaply, you can keep synthesis private and responsive by decoding the audio **in the user's browser on CPU** (or on WebGPU when available).

> In a Colab CPU test, we saw ~**2.1× real-time** decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

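
For reference, a real-time factor like the one quoted above can be computed as audio duration divided by inference time. The sketch below assumes that definition (the text does not spell it out). `realTimeFactor` is a hypothetical helper name, and the timing in the usage line is an illustrative number, not a measurement.

```javascript
// Real-time factor: seconds of audio produced per second of inference.
// Values above 1 mean decoding runs faster than playback.
function realTimeFactor(numSamples, sampleRate, inferMs) {
  const audioSeconds = numSamples / sampleRate;
  const inferSeconds = inferMs / 1000;
  return audioSeconds / inferSeconds;
}

// Illustrative numbers only: one 24576-sample window decoded in 512 ms.
console.log(realTimeFactor(24576, 24000, 512).toFixed(2)); // "2.00"
```
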
---

## Files

- **`snac24_int2wav_static.onnx`**: *int → wav* decoder

  Inputs (int64):

  - `codes0`: `[1, 12]`
  - `codes1`: `[1, 24]`
  - `codes2`: `[1, 48]`

  Output:

  - `audio`: `float32 [1, 1, 24576]` (24 kHz)

  Shapes correspond to a **48-frame window**. Each frame is **512 samples**, so one window = **24576 samples** = **1.024 s** at 24 kHz.

  Token alignment, with `Lk` the number of tokens in `codesk`: `L0*4 = L1*2 = L2*1 = shared_frames` (here 12·4 = 24·2 = 48·1 = 48).

- **`snac24_latent2wav_static.onnx`**: *latent → wav* decoder

  Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`

  Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).

- **`snac24_codes.json`**: sample codes (for testing)

- **`snac24_quantizers.json`**: RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.

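
If you take the latent route, the reconstruction described above (embedding lookup, 1×1 projection, repetition by stride, sum over levels) can be sketched as follows. The object layout (`stride`, `embeddings`, `projWeight`, `projBias`) is an assumption made for illustration and is not taken from `snac24_quantizers.json`; inspect the file and adapt the field names. The toy dimensions in the usage comment are invented so the arithmetic is easy to check by hand.

```javascript
// Sketch: rebuild the latent z (row-major [channels, frames]) from per-level codes.
// ASSUMED layout (hypothetical, check the JSON before relying on it):
//   stride      frames covered by one token at this level (e.g. 4, 2, 1)
//   embeddings  codebook as an array of [codeDim] vectors
//   projWeight  1x1 conv weight as [channels][codeDim]
//   projBias    optional [channels] bias
function reconstructLatent(levels, codes, frames) {
  const channels = levels[0].projWeight.length;
  const z = new Float32Array(channels * frames);
  levels.forEach((level, i) => {
    if (codes[i].length * level.stride !== frames) {
      throw new RangeError(`level ${i}: ${codes[i].length} tokens do not cover ${frames} frames`);
    }
    codes[i].forEach((code, t) => {
      const emb = level.embeddings[code];
      for (let c = 0; c < channels; c++) {
        // A 1x1 convolution is a matrix-vector product at each time step.
        let v = level.projBias ? level.projBias[c] : 0;
        for (let d = 0; d < emb.length; d++) v += level.projWeight[c][d] * emb[d];
        // Repeat the projected value over the `stride` frames this token covers.
        for (let s = 0; s < level.stride; s++) {
          z[c * frames + t * level.stride + s] += v;
        }
      }
    });
  });
  return z;
}

// Toy usage (1 channel, 2 frames, 2 levels with strides 2 and 1):
//   reconstructLatent(toyLevels, [[1], [0, 1]], 2)
// For the real model, z would be wrapped as
//   new ort.Tensor('float32', z, [1, 768, 48])
```
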
---

## Browser (WASM/WebGPU) quickstart

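Serve these files from a local server. Multithreaded WASM needs cross-origin isolation, i.e. the server must send the COOP/COEP headers (`Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`). Without isolation, WASM will typically run **single-threaded**.
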
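
As one way to get cross-origin isolation during local development, the sketch below serves a directory with Node's built-in `http` module and adds the two isolation headers. The helper names (`isolationHeaders`, `createIsolatedServer`), the port, and the MIME table are illustrative choices, not part of this repo; any static server that sets the same headers should work. Note that under `require-corp`, cross-origin resources such as CDN scripts must opt in via CORS or a `Cross-Origin-Resource-Policy` header.

```javascript
const http = require('node:http');
const fs = require('node:fs');
const path = require('node:path');

// Headers that make a page cross-origin isolated, which enables
// SharedArrayBuffer and therefore multithreaded WASM.
function isolationHeaders() {
  return {
    'Cross-Origin-Opener-Policy': 'same-origin',
    'Cross-Origin-Embedder-Policy': 'require-corp',
  };
}

// Minimal content-type table for the files in this repo (illustrative).
const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.json': 'application/json',
  '.wasm': 'application/wasm',
  '.onnx': 'application/octet-stream',
};

// Build (but do not start) a static file server rooted at rootDir.
function createIsolatedServer(rootDir) {
  const root = path.resolve(rootDir);
  return http.createServer((req, res) => {
    const urlPath = new URL(req.url, 'http://localhost').pathname;
    const filePath = path.join(root, urlPath === '/' ? 'index.html' : urlPath);
    // Refuse paths that escape the served directory.
    if (!filePath.startsWith(root + path.sep)) {
      res.writeHead(403, isolationHeaders());
      res.end('Forbidden');
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404, isolationHeaders());
        res.end('Not found');
        return;
      }
      const type = MIME[path.extname(filePath)] || 'application/octet-stream';
      res.writeHead(200, { ...isolationHeaders(), 'Content-Type': type });
      res.end(data);
    });
  });
}

// Usage (not started automatically):
// createIsolatedServer('.').listen(8080, () => console.log('http://localhost:8080'));
```
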
```html
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script>
(async () => {
  // Prefer WebGPU if available; else WASM
  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu', 'wasm'] : ['wasm'];

  // Enable SIMD; use threads only if the page is cross-origin isolated
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = self.crossOriginIsolated ? (navigator.hardwareConcurrency || 4) : 1;

  const session = await ort.InferenceSession.create('snac24_int2wav_static.onnx', {
    executionProviders: providers,
    graphOptimizationLevel: 'all',
  });

  // Example: one 48-frame window (12/24/48 tokens). Replace the zeros with real codes.
  const T0 = 12, T1 = 24, T2 = 48;
  const feed = {
    // A new BigInt64Array(n) is zero-filled, which is enough for a smoke test
    codes0: new ort.Tensor('int64', new BigInt64Array(T0), [1, T0]),
    codes1: new ort.Tensor('int64', new BigInt64Array(T1), [1, T1]),
    codes2: new ort.Tensor('int64', new BigInt64Array(T2), [1, T2]),
  };

  const t0 = performance.now();
  const out = await session.run(feed);
  const t1 = performance.now();
  const audio = out.audio.data; // Float32Array with 24576 samples (tensor shape [1, 1, 24576])

  // Play it (24 kHz). Browsers may keep the AudioContext suspended until a user gesture.
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
  const buf = ctx.createBuffer(1, audio.length, 24000);
  buf.copyToChannel(audio, 0);
  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  src.start();

  // providers[0] is the preferred provider, not necessarily the one that ran
  console.log({ requestedEP: providers[0], infer_ms: (t1 - t0).toFixed(2), samples: audio.length });
})();
</script>
```

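
To replace the zero placeholders in the quickstart with real tokens, the three code streams have to be converted to `BigInt64Array` and must satisfy the 12/24/48 token alignment of one 48-frame window. A minimal sketch, assuming your codes arrive as plain arrays of non-negative integers; `packCodes` is a hypothetical helper, and `ort` is passed in so the function depends only on the `ort.Tensor` constructor used in the quickstart.

```javascript
const FRAMES = 48;          // frames per static window (48 * 512 = 24576 samples)
const STRIDES = [4, 2, 1];  // frames covered by one token at levels 0, 1, 2

// Validate token counts and build the feed object for the int -> wav model.
function packCodes(ort, codes) {
  const feed = {};
  STRIDES.forEach((stride, level) => {
    const expected = FRAMES / stride; // 12, 24, 48
    const tokens = codes[level];
    if (!tokens || tokens.length !== expected) {
      throw new RangeError(`codes${level}: expected ${expected} tokens`);
    }
    // int64 tensors in onnxruntime-web are backed by BigInt64Array
    const data = BigInt64Array.from(tokens, (x) => BigInt(x));
    feed[`codes${level}`] = new ort.Tensor('int64', data, [1, expected]);
  });
  return feed;
}

// Usage inside the quickstart, with c0/c1/c2 being your token arrays:
//   const out = await session.run(packCodes(ort, [c0, c1, c2]));
```
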
## Streaming note

SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms, start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams.

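
The cross-fade mentioned above can be sketched as a linear blend over a short overlap between consecutive decoded chunks. This assumes the last `overlap` samples of one chunk cover the same stretch of audio as the first `overlap` samples of the next; how you obtain that overlap (for example by decoding overlapping token windows) is up to your pipeline. `crossfadeAppend` is a hypothetical helper. At 24 kHz, a 5 ms overlap is 120 samples.

```javascript
// Append `next` to `prev`, blending the last `overlap` samples of `prev`
// with the first `overlap` samples of `next` using a linear ramp.
function crossfadeAppend(prev, next, overlap) {
  if (overlap < 0 || overlap > prev.length || overlap > next.length) {
    throw new RangeError('overlap must fit inside both chunks');
  }
  const start = prev.length - overlap;
  const out = new Float32Array(prev.length + next.length - overlap);
  out.set(prev.subarray(0, start), 0);
  for (let i = 0; i < overlap; i++) {
    const w = (i + 1) / (overlap + 1); // weight of `next`, rising across the overlap
    out[start + i] = (1 - w) * prev[start + i] + w * next[i];
  }
  out.set(next.subarray(overlap), prev.length);
  return out;
}

// Usage: keep one running buffer and append each newly decoded chunk.
//   playback = crossfadeAppend(playback, decodedChunk, 120);
```
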
## Threads / GPU

Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded.

WebGPU can accelerate decoding on desktop and mobile when the required kernels are supported; when they are not, this model usually falls back to WASM.
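
Because a requested execution provider may not be usable at runtime, one option is to try providers one at a time and record which one actually produced a session. A minimal sketch, assuming `InferenceSession.create` rejects when a provider cannot be initialised (behaviour may differ across `onnxruntime-web` versions); `createSessionWithFallback` is a hypothetical helper, not a library function.

```javascript
// Try each execution provider in order; return the first session that loads
// together with the name of the provider that produced it.
async function createSessionWithFallback(ort, modelUrl, providers = ['webgpu', 'wasm']) {
  let lastError;
  for (const ep of providers) {
    try {
      const session = await ort.InferenceSession.create(modelUrl, {
        executionProviders: [ep],
      });
      return { session, ep };
    } catch (err) {
      lastError = err; // remember the failure and try the next provider
    }
  }
  throw lastError ?? new Error('no execution providers given');
}

// Usage:
//   const { session, ep } = await createSessionWithFallback(ort, 'snac24_int2wav_static.onnx');
//   console.log('running on', ep);
```
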