---
library_name: onnxruntime
tags:
- snac
- onnx
- 24khz
- decoder
- browser
license: other
language:
- en
---

# SNAC 24 kHz — Decoder as ONNX (browser-ready)

This repo provides **ONNX decoders** for the SNAC 24 kHz codec so you can decode SNAC tokens **on-device**, including **in the browser** with `onnxruntime-web`.

**Why?** If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio **in the user’s browser/CPU** (or WebGPU when available).

> In a Colab CPU test, we saw ~**2.1× real-time** decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

---

## Files

- **`snac24_int2wav_static.onnx`** — *int → wav* decoder  
  Inputs (int64):
  - `codes0`: `[1, 12]`  
  - `codes1`: `[1, 24]`  
  - `codes2`: `[1, 48]`  
  Output:
  - `audio`: `float32 [1, 1, 24576]` (24 kHz)

  Shapes correspond to a **48-frame window**. Each frame is **512 samples**, so one window = **24576 samples** ≈ **1.024 s** at 24 kHz.  
  Token alignment: `L0*4 = L1*2 = L2*1 = shared_frames` (a small windowing sketch follows this list).

- **`snac24_latent2wav_static.onnx`** — *latent → wav* decoder  
  Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`  
  Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).

- **`snac24_codes.json`** — sample codes (for testing)

- **`snac24_quantizers.json`** — RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed (see the sketch below).
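
If you drive `snac24_latent2wav_static.onnx` yourself, the usual RVQ pattern is: look up each code's embedding, apply the 1×1 projection, repeat-upsample by the level's stride to 48 frames, and sum the three levels into `z`. The sketch below illustrates that pattern only; the JSON field names (`strides`, `embeddings`, `projections`), the weight layout, and repeat-upsampling are assumptions, so check `snac24_quantizers.json` for the actual schema.

```js
// Sketch only: rebuild z [1, 768, 48] from one window of codes.
// Field names and layouts below are assumptions about snac24_quantizers.json.
function reconstructZ(q, codeLevels, frames = 48, zDim = 768) {
  const z = new Float32Array(zDim * frames); // channel-major: z[c * frames + t]

  codeLevels.forEach((codes, level) => {
    const stride = q.strides[level];   // assumed: frames covered per token (4, 2, 1)
    const emb = q.embeddings[level];   // assumed: [vocab][codebookDim]
    const proj = q.projections[level]; // assumed: [zDim][codebookDim] 1×1 conv weight
    const codebookDim = emb[0].length;

    codes.forEach((code, i) => {
      const e = emb[code];
      for (let c = 0; c < zDim; c++) {
        let v = 0; // 1×1 projection = per-channel dot product
        for (let d = 0; d < codebookDim; d++) v += proj[c][d] * e[d];
        // Repeat-upsample this token over its frames and sum across levels
        for (let s = 0; s < stride; s++) z[c * frames + i * stride + s] += v;
      }
    });
  });

  // Wrap as new ort.Tensor('float32', z, [1, zDim, frames]) for the quickstart below.
  return z;
}
```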

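The windowing sketch referenced above: a small helper (hypothetical, not part of this repo) that enforces the alignment rule and slices full 48-frame windows matching the static decoder shapes.

```js
// Sketch only: slice aligned SNAC token streams into 48-frame windows
// matching the static shapes codes0 [1,12], codes1 [1,24], codes2 [1,48].
function toWindows(l0, l1, l2, frames = 48) {
  if (l0.length * 4 !== l2.length || l1.length * 2 !== l2.length) {
    throw new Error('misaligned levels: expected L0*4 = L1*2 = L2*1');
  }
  const windows = [];
  for (let f = 0; f + frames <= l2.length; f += frames) { // drops a trailing partial window
    windows.push({
      codes0: BigInt64Array.from(l0.slice(f / 4, (f + frames) / 4), BigInt),
      codes1: BigInt64Array.from(l1.slice(f / 2, (f + frames) / 2), BigInt),
      codes2: BigInt64Array.from(l2.slice(f, f + frames), BigInt),
    });
  }
  return windows; // feed each as int64 tensors in the browser example below
}
```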
---

## Browser (WASM/WebGPU) quickstart

Serve these files from a local server with cross-origin isolation for multithreaded WASM (e.g., COOP/COEP headers). If not isolated, WASM will typically run **single-threaded**.
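
For local testing, any static server that adds the two isolation headers works. A minimal Node.js sketch (hypothetical filename and port, not part of this repo):

```js
// serve.js — static server with COOP/COEP so crossOriginIsolated is true
// and onnxruntime-web can use multithreaded WASM.
const http = require('http');
const fs = require('fs');
const path = require('path');

http.createServer((req, res) => {
  const file = path.join(__dirname, req.url === '/' ? 'index.html' : req.url);
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  if (file.endsWith('.html')) res.setHeader('Content-Type', 'text/html');
  fs.readFile(file, (err, data) => {
    if (err) { res.statusCode = 404; res.end('not found'); return; }
    res.end(data);
  });
}).listen(8080, () => console.log('http://localhost:8080'));
```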

```html
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script>
(async () => {
  // Prefer WebGPU if available; else WASM
  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu','wasm'] : ['wasm'];
  // Enable SIMD; threads only if crossOriginIsolated
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency||4) : 1;

  const session = await ort.InferenceSession.create('snac24_int2wav_static.onnx', {
    executionProviders: providers,
    graphOptimizationLevel: 'all',
  });

  // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
  const T0=12, T1=24, T2=48;
  const feed = {
    codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x=>BigInt(x)), [1,T0]),
    codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x=>BigInt(x)), [1,T1]),
    codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x=>BigInt(x)), [1,T2]),
  };

  const t0 = performance.now();
  const out = await session.run(feed);
  const t1 = performance.now();
  const audio = out.audio.data; // Float32Array [1,1,24576]

  // Play it (24 kHz)
  const ctx = new (window.AudioContext||window.webkitAudioContext)({sampleRate:24000});
  const buf = ctx.createBuffer(1, audio.length, 24000);
  buf.copyToChannel(audio, 0);
  const src = ctx.createBufferSource(); src.buffer = buf; src.connect(ctx.destination); src.start();

  console.log({ usedEP: providers[0], infer_ms: (t1-t0).toFixed(2), samples: audio.length });
})();
</script>
```

## Streaming note

SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms,
start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams.
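
A sketch of the seam handling (hypothetical helper, not part of this repo): overlap the last few milliseconds of the current decoded window with the start of the next and mix linearly.

```js
// Sketch only: join two decoded windows with a short linear cross-fade
// (~5 ms ≈ 120 samples at 24 kHz) to hide chunk seams.
function appendWithCrossfade(prev, next, fadeSamples = 120) {
  const out = new Float32Array(prev.length + next.length - fadeSamples);
  out.set(prev.subarray(0, prev.length - fadeSamples)); // untouched head
  for (let i = 0; i < fadeSamples; i++) {
    const t = i / fadeSamples; // 0 → 1 ramp
    out[prev.length - fadeSamples + i] =
      prev[prev.length - fadeSamples + i] * (1 - t) + next[i] * t;
  }
  out.set(next.subarray(fadeSamples), prev.length); // remaining tail
  return out;
}
```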

## Threads / GPU

Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded.

WebGPU can accelerate decoding on desktop and mobile when the required kernels are supported; otherwise onnxruntime-web typically falls back to WASM.