	---
	library_name: onnxruntime
	tags:
	- snac
	- onnx
	- 24khz
	- decoder
	- browser
	license: other
	language:
	- en
	---

	# SNAC 24 kHz — Decoder as ONNX (browser-ready)

	This repo provides ONNX decoders for the SNAC 24 kHz codec so you can decode SNAC tokens on-device, including in the browser with `onnxruntime-web`.

	Why? If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio in the user’s browser/CPU (or WebGPU when available).

	> In a Colab CPU test, we saw ~2.1× real-time decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

	---

	## Files

	- `snac24_int2wav_static.onnx` — int → wav decoder
	  Inputs (int64):
	  - `codes0`: `[1, 12]`
	  - `codes1`: `[1, 24]`
	  - `codes2`: `[1, 48]`

	  Output:
	  - `audio`: `float32 [1, 1, 24576]` (24 kHz)

	Shapes correspond to a 48-frame window. Each frame is 512 samples, so one window = 24576 samples ≈ 1.024 s at 24 kHz.
	Token alignment: `L0*4 = L1*2 = L2*1 = shared_frames`, where `Li` is the length of `codes{i}` (here 12×4 = 24×2 = 48×1 = 48).

	- `snac24_latent2wav_static.onnx` — latent → wav decoder
	  Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`
	  Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).

	- `snac24_codes.json` — sample codes (for testing)

	- `snac24_quantizers.json` — RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.
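
If you go the latent route, the reconstruction described above (embedding lookup, 1×1 projection, repetition by stride, sum over levels) can be sketched roughly as follows. The per-level field names (`stride`, `embedding`, `weight`, `bias`) and the function name `reconstructLatent` are hypothetical: inspect `snac24_quantizers.json` for its actual schema before adapting this. The sketch also assumes the projection weights are stored as plain matrices.

```javascript
// Hypothetical per-level layout (NOT the confirmed schema of snac24_quantizers.json):
//   { stride: number,
//     embedding: number[][],  // [codebookSize][codeDim]
//     weight: number[][],     // [latentDim][codeDim], the 1x1 conv as a matrix
//     bias: number[] }        // [latentDim]
function reconstructLatent(levels, codes, frames) {
  const latentDim = levels[0].weight.length;
  // Row-major [latentDim, frames], zero-initialised so levels can be summed in place.
  const z = new Float32Array(latentDim * frames);
  levels.forEach((level, i) => {
    const ids = codes[i];
    if (ids.length * level.stride !== frames) {
      throw new RangeError(`level ${i}: ${ids.length} tokens x stride ${level.stride} != ${frames} frames`);
    }
    ids.forEach((id, t) => {
      const e = level.embedding[id];
      for (let d = 0; d < latentDim; d++) {
        // 1x1 convolution == per-timestep matrix-vector product plus bias.
        let v = level.bias[d];
        for (let c = 0; c < e.length; c++) v += level.weight[d][c] * e[c];
        // Repeat each coarse token `stride` times to reach the shared frame rate.
        for (let r = 0; r < level.stride; r++) {
          z[d * frames + t * level.stride + r] += v;
        }
      }
    });
  });
  return z;
}
```

For the real model the result would be wrapped as `new ort.Tensor('float32', z, [1, 768, 48])` and fed to `snac24_latent2wav_static.onnx` as `z`.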

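The shape arithmetic above (512 samples per frame, 4/2/1 frames per token) is easy to get wrong when slicing a token stream into windows, so a small checker can help. This is a sketch: the helper names `windowLayout` and `sharedFrames` are made up for illustration, and the constants are taken from the shapes listed in this section.

```javascript
const SAMPLE_RATE = 24000;      // Hz
const SAMPLES_PER_FRAME = 512;  // from the Files section
const STRIDES = [4, 2, 1];      // frames covered by one token of codes0/1/2

// Token counts, sample count and duration for a window of `frames` frames.
function windowLayout(frames) {
  if (!Number.isInteger(frames) || frames <= 0 || frames % STRIDES[0] !== 0) {
    throw new RangeError('frames must be a positive multiple of 4');
  }
  const samples = frames * SAMPLES_PER_FRAME;
  return {
    tokens: STRIDES.map((s) => frames / s),
    samples,
    seconds: samples / SAMPLE_RATE,
  };
}

// Verifies L0*4 = L1*2 = L2*1 and returns the shared frame count.
function sharedFrames(codes0, codes1, codes2) {
  const frames = [codes0, codes1, codes2].map((c, i) => c.length * STRIDES[i]);
  if (frames[0] !== frames[1] || frames[1] !== frames[2]) {
    throw new RangeError(`misaligned code lengths (frames per level: ${frames.join(', ')})`);
  }
  return frames[0];
}
```

`windowLayout(48)` yields 12/24/48 tokens and 24576 samples, which matches the static input and output shapes of both ONNX files.
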
	---

	## Browser (WASM/WebGPU) quickstart

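	Serve these files from a local server. Multithreaded WASM requires cross-origin isolation, which means sending the `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp` headers (COOP/COEP). Without isolation, WASM will typically run single-threaded.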
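
One way to get those headers locally is a tiny Node static server. This is a minimal sketch, not a production server: the names `startServer` and `resolveFile`, the port, and the MIME table are illustrative choices. Note that under `require-corp`, cross-origin resources (such as a CDN script) must be served with CORS or `Cross-Origin-Resource-Policy` headers, or be hosted locally.

```javascript
const http = require('node:http');
const fs = require('node:fs');
const path = require('node:path');

// The two headers that make a page cross-origin isolated.
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

const MIME = {
  '.html': 'text/html; charset=utf-8',
  '.js': 'text/javascript',
  '.json': 'application/json',
  '.wasm': 'application/wasm',
  '.onnx': 'application/octet-stream',
};

// Maps a request path to a file under `root`; returns null for anything outside it.
function resolveFile(root, urlPath) {
  let clean;
  try {
    clean = decodeURIComponent(urlPath.split('?')[0]);
  } catch {
    return null; // malformed percent-encoding
  }
  const rel = clean === '/' ? 'index.html' : clean.replace(/^\/+/, '');
  const rootAbs = path.resolve(root);
  const full = path.resolve(rootAbs, rel);
  return full.startsWith(rootAbs + path.sep) ? full : null;
}

function startServer(root = '.', port = 8080) {
  const server = http.createServer((req, res) => {
    const file = resolveFile(root, req.url);
    if (!file || !fs.existsSync(file) || !fs.statSync(file).isFile()) {
      res.writeHead(404, ISOLATION_HEADERS);
      res.end('not found');
      return;
    }
    res.writeHead(200, {
      ...ISOLATION_HEADERS,
      'Content-Type': MIME[path.extname(file)] || 'application/octet-stream',
    });
    fs.createReadStream(file).pipe(res);
  });
  return server.listen(port);
}

// To run: startServer('.', 8080); then open http://localhost:8080/
```

Once the page is loaded from this server, `crossOriginIsolated` should report `true` in the browser console.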

	```html
	<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
	<script>
	(async () => {
	  // Prefer WebGPU if available; else WASM
	  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu', 'wasm'] : ['wasm'];
	  // Enable SIMD; threads only if crossOriginIsolated
	  ort.env.wasm.simd = true;
	  ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency || 4) : 1;

	  const session = await ort.InferenceSession.create('snac24_int2wav_static.onnx', {
	    executionProviders: providers,
	    graphOptimizationLevel: 'all',
	  });

	  // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
	  const T0 = 12, T1 = 24, T2 = 48;
	  const feed = {
	    codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x => BigInt(x)), [1, T0]),
	    codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x => BigInt(x)), [1, T1]),
	    codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x => BigInt(x)), [1, T2]),
	  };

	  // Time inference only (model load is excluded)
	  const t0 = performance.now();
	  const out = await session.run(feed);
	  const t1 = performance.now();
	  const audio = out.audio.data; // Float32Array, 24576 samples (tensor shape [1,1,24576])

	  // Play it (24 kHz). Browsers may keep the context suspended until a user gesture.
	  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
	  const buf = ctx.createBuffer(1, audio.length, 24000);
	  buf.copyToChannel(audio, 0);
	  const src = ctx.createBufferSource();
	  src.buffer = buf;
	  src.connect(ctx.destination);
	  src.start();

	  // providers[0] is the preferred provider, not necessarily the one that ended up running
	  console.log({ preferredEP: providers[0], infer_ms: (t1 - t0).toFixed(2), samples: audio.length });
	})();
	</script>
	```

	## Streaming note

	SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms,
	start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams.
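
The cross-fade step can be sketched as a pure function over `Float32Array` chunks. It is a linear fade under one stated assumption: consecutive chunks were decoded so that the last `overlap` samples of one chunk cover the same stretch of audio as the first `overlap` samples of the next (otherwise overlapping them shortens the signal slightly). The name `crossfadeAppend` and the 120-sample figure (5 ms at 24 kHz) are illustrative.

```javascript
// Appends `next` to `prev`, blending the shared `overlap` samples linearly.
function crossfadeAppend(prev, next, overlap) {
  const n = Math.max(0, Math.min(overlap, prev.length, next.length));
  const out = new Float32Array(prev.length + next.length - n);
  const fadeStart = prev.length - n;

  // Untouched head of the previous chunk.
  out.set(prev.subarray(0, fadeStart), 0);

  // Overlap region: weight moves from mostly `prev` to mostly `next`.
  for (let i = 0; i < n; i++) {
    const w = (i + 1) / (n + 1);
    out[fadeStart + i] = (1 - w) * prev[fadeStart + i] + w * next[i];
  }

  // Remainder of the next chunk.
  out.set(next.subarray(n), prev.length);
  return out;
}

// Example: blend two decoded windows over 5 ms (120 samples at 24 kHz).
// const joined = crossfadeAppend(windowA, windowB, 120);
```

In a live player you would more likely schedule each chunk on the `AudioContext` timeline and fade with gain nodes; the function above only illustrates the blending arithmetic.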

	## Threads / GPU

	Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded.

	WebGPU can accelerate decoding on desktop and mobile when the required kernels are supported; when they are not, this model usually falls back to WASM.
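
To see which provider actually initialised, rather than which one was merely preferred, you can try them one at a time. The sketch below uses only `ort.InferenceSession.create` with the `executionProviders` option from the quickstart; the helper names `pickRuntimeConfig` and `createSessionWithFallback` are hypothetical, and `ort` is passed in as a parameter so the logic can be exercised outside a browser. Also check that the onnxruntime-web bundle you load includes the WebGPU provider.

```javascript
// Derives provider order and thread count from environment facts.
function pickRuntimeConfig({ hasWebGPU, isolated, cores }) {
  return {
    providers: hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'],
    // Threads only help when the page is cross-origin isolated.
    numThreads: isolated ? Math.max(1, cores || 4) : 1,
  };
}

// Tries each provider on its own and reports the first one that loads the model.
async function createSessionWithFallback(ort, modelUrl, providers) {
  let lastError;
  for (const ep of providers) {
    try {
      const session = await ort.InferenceSession.create(modelUrl, {
        executionProviders: [ep],
      });
      return { session, ep };
    } catch (err) {
      lastError = err; // remember the failure and move on to the next provider
    }
  }
  throw lastError ?? new Error('no execution provider could be initialised');
}

// Browser usage (sketch):
// const cfg = pickRuntimeConfig({
//   hasWebGPU: typeof navigator.gpu !== 'undefined',
//   isolated: crossOriginIsolated,
//   cores: navigator.hardwareConcurrency,
// });
// ort.env.wasm.numThreads = cfg.numThreads;
// const { session, ep } = await createSessionWithFallback(ort, 'snac24_int2wav_static.onnx', cfg.providers);
```

Loading the model once per provider costs extra startup time on failure, so this trades a slower worst case for an unambiguous answer about what is running.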