Taking 50 sec for me on Colab

#15

by banank1989 - opened Jul 18

banank1989

Jul 18

I must definitely be missing something.

import time
text_hindi = "आज मैंने एक नई तकनीक के बारे में सीखा जो कृत्रिम बुद्धिमत्ता का उपयोग करके मानव जैसी आवाज़ उत्पन्न कर सकती है।"
s = time.time()
audio = generate_speech(text_hindi, speaker="kavya")
e = time.time()
d = e-s
print(d)
sf.write("output_hindi_kavya.wav", audio, 24000)

This is taking 50 seconds.
What am I missing?
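One thing worth ruling out first, offered as a rough sketch rather than a diagnosis: a single timed call often includes one-off costs (CUDA kernel setup, cache allocation), so it can overstate steady-state latency. The helper below, `time_call` (a hypothetical name, not part of any library), discards a warm-up call and averages a few timed runs with `time.perf_counter`. On a GPU you would also want `torch.cuda.synchronize()` before reading the clock; that is left as a comment so the snippet runs without torch.

```python
import time

def time_call(fn, *args, warmup=1, runs=3, **kwargs):
    """Return the mean wall-clock seconds of fn(*args, **kwargs) over `runs` calls."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # discarded: may include one-off setup costs
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        # On GPU, call torch.cuda.synchronize() here before reading the clock.
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Stand-in workload so the snippet runs anywhere. For the real measurement, pass
# lambda: generate_speech(text_hindi, speaker="kavya") instead.
calls = []
mean_seconds = time_call(lambda: calls.append(1), warmup=1, runs=3)
print(f"mean: {mean_seconds:.6f}s over 3 runs ({len(calls)} calls incl. warm-up)")
```

If the warm-up call is much slower than the timed runs, the 50 seconds is mostly start-up cost; if not, the time is going into generation itself.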

banank1989

Jul 19

Even on an A100, it still takes a long time for me:

Complete code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from snac import SNAC
import soundfile as sf

# Model configuration for 4-bit inference
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
device = torch.device("cuda:0")

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/veena-tts",
    quantization_config=quantization_config,
    device_map={"": device},  # Force all model weights to cuda:0
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/veena-tts", trust_remote_code=True)

# Initialize SNAC decoder
#snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to(device)

# Control token IDs (fixed for Veena)

START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266

# Available speakers

speakers = ["kavya", "agastya", "maitri", "vinaya"]

def generate_speech(text, speaker="kavya", temperature=0.4, top_p=0.9):
    """Generate speech from text using specified speaker voice"""

    # Prepare input with speaker token
    prompt = f"<spk_{speaker}> {text}"
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)

    # Construct full sequence: [HUMAN] <spk_speaker> text [/HUMAN] [AI] [SPEECH]
    input_tokens = [
        START_OF_HUMAN_TOKEN,
        *prompt_tokens,
        END_OF_HUMAN_TOKEN,
        START_OF_AI_TOKEN,
        START_OF_SPEECH_TOKEN
    ]

    # Build the input tensor once, on the same device as the model
    input_ids = torch.tensor([input_tokens], device=model.device)

    # Calculate max tokens based on text length (capped at 700)
    max_tokens = min(int(len(text) * 1.3) * 7 + 21, 700)

    # Generate audio tokens
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=[END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
        )

    # Extract SNAC tokens
    generated_ids = output[0][len(input_tokens):].tolist()
    snac_tokens = [
        token_id for token_id in generated_ids
        if AUDIO_CODE_BASE_OFFSET <= token_id < (AUDIO_CODE_BASE_OFFSET + 7 * 4096)
    ]

    if not snac_tokens:
        raise ValueError("No audio tokens generated")

    # Decode audio
    audio = decode_snac_tokens(snac_tokens, snac_model)
    return audio

def decode_snac_tokens(snac_tokens, snac_model):
    """De-interleave and decode SNAC tokens to audio"""
    if not snac_tokens or len(snac_tokens) % 7 != 0:
        return None

    # De-interleave tokens into 3 hierarchical levels
    codes_lvl = [[] for _ in range(3)]
    llm_codebook_offsets = [AUDIO_CODE_BASE_OFFSET + i * 4096 for i in range(7)]

    for i in range(0, len(snac_tokens), 7):
        # Level 0: Coarse (1 token)
        codes_lvl[0].append(snac_tokens[i] - llm_codebook_offsets[0])
        # Level 1: Medium (2 tokens)
        codes_lvl[1].append(snac_tokens[i+1] - llm_codebook_offsets[1])
        codes_lvl[1].append(snac_tokens[i+4] - llm_codebook_offsets[4])
        # Level 2: Fine (4 tokens)
        codes_lvl[2].append(snac_tokens[i+2] - llm_codebook_offsets[2])
        codes_lvl[2].append(snac_tokens[i+3] - llm_codebook_offsets[3])
        codes_lvl[2].append(snac_tokens[i+5] - llm_codebook_offsets[5])
        codes_lvl[2].append(snac_tokens[i+6] - llm_codebook_offsets[6])

    # Convert to tensors for SNAC decoder
    hierarchical_codes = []
    for lvl_codes in codes_lvl:
        tensor = torch.tensor(lvl_codes, dtype=torch.int32, device=device).unsqueeze(0)
        if torch.any((tensor < 0) | (tensor > 4095)):
            raise ValueError("Invalid SNAC token values")
        hierarchical_codes.append(tensor)

    # Decode with SNAC
    with torch.no_grad():
        audio_hat = snac_model.decode(hierarchical_codes)

    return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()

# --- Example Usage ---

# Hindi

import time
s = time.time()

text_hindi = "आज मैंने एक नई तकनीक के बारे में सीखा जो कृत्रिम बुद्धिमत्ता का उपयोग करके मानव जैसी आवाज़ उत्पन्न कर सकती है।"

audio = generate_speech(text_hindi, speaker="kavya")

sf.write("output_hindi_kavya.wav", audio, 24000)

# Code-mixed

text_mixed = "मैं तो पूरा presentation prepare कर चुका हूं! कल रात को ही मैंने पूरा code base चेक किया।"
audio = generate_speech(text_mixed, speaker="maitri")
sf.write("output_mixed_maitri.wav", audio, 24000)

e = time.time()
d = e-s
print(d)

On an A100, this takes 20 seconds.
All of the time is spent in this step:

# Generate audio tokens

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=[END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
    )
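For what it's worth, time concentrated in `model.generate` is what plain autoregressive decoding would produce: one forward pass per new token. A back-of-the-envelope sketch follows, using the `max_tokens` formula from the code above. The throughput figures are assumed values for illustration, not measurements, and the 2048-samples-per-frame figure is an assumption about the 24 kHz SNAC layout; the helper names are hypothetical.

```python
FRAME_TOKENS = 7           # tokens per interleaved SNAC frame, as in the code above
SAMPLES_PER_FRAME = 2048   # ASSUMPTION: coarse-level hop of the 24 kHz SNAC model
SAMPLE_RATE = 24000

def token_budget(text, cap=700):
    """Same formula as max_tokens in generate_speech."""
    return min(int(len(text) * 1.3) * FRAME_TOKENS + 21, cap)

def decode_seconds(num_tokens, tokens_per_second):
    """Wall-clock estimate if tokens are generated one at a time at a fixed rate."""
    return num_tokens / tokens_per_second

def audio_seconds(num_tokens):
    """Audio duration represented by complete frames, under the assumption above."""
    frames = num_tokens // FRAME_TOKENS
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE

text_hindi = "आज मैंने एक नई तकनीक के बारे में सीखा जो कृत्रिम बुद्धिमत्ता का उपयोग करके मानव जैसी आवाज़ उत्पन्न कर सकती है।"
budget = token_budget(text_hindi)
print(f"token budget: {budget}, audio if fully used: {audio_seconds(budget):.1f}s")

for tps in (15, 35, 100):  # ASSUMED decode throughputs, not measured values
    print(f"{tps:>3} tokens/s -> {decode_seconds(budget, tps):.1f}s to generate")
```

Under these assumptions, any text of 75 characters or more hits the 700-token cap, so total latency is governed by tokens per second rather than by the GPU model alone. Measuring actual tokens per second for the 4-bit model would show where this setup falls.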

sanjitpradhan2009

Jul 23

I am running on an H100, and even the short sentence "is it working perfect" takes around 6 seconds.

arunneel

Sep 11

I am running on an H100, and the given example takes 13 seconds. I am losing trust in the metrics they publish for marketing.

bharathkumarK

Maya Research org Sep 11

edited Sep 11

https://huggingface.co/maya-research/Veena/discussions/11#68642fe48c9c3f4bb93e7af4

This is a text-to-text model: it generates tokens, which are then converted to audible speech bytes using SNAC. Implement sequential streaming on top of vLLM's async engine so that tokens are flushed and converted to audio as soon as they are generated; then you don't have to wait 30 s to hear a sentence. In our observation, inference is much faster on the latest generation of NVIDIA GPUs. A100 and H100 should also be fine, with <200 ms runtime. See https://docs.vllm.ai/en/v0.6.5/dev/engine/async_llm_engine.html
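To illustrate the chunking half of that idea, here is a minimal sketch that assumes tokens arrive one at a time from some streaming source. It is not vLLM code: `stream_frames` is a hypothetical helper, and the fake token list stands in for whatever the async engine yields. It buffers audio tokens and emits them in whole 7-token frames, so each chunk could be handed to `decode_snac_tokens` as soon as it is full.

```python
AUDIO_CODE_BASE_OFFSET = 128266
FRAME_TOKENS = 7
AUDIO_CODE_END = AUDIO_CODE_BASE_OFFSET + FRAME_TOKENS * 4096

def stream_frames(token_iter, frames_per_chunk=4):
    """Yield lists of audio tokens, each a whole number of 7-token frames."""
    chunk_size = frames_per_chunk * FRAME_TOKENS
    buffer = []
    for token_id in token_iter:
        if not (AUDIO_CODE_BASE_OFFSET <= token_id < AUDIO_CODE_END):
            continue  # skip control tokens such as END_OF_SPEECH
        buffer.append(token_id)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    # Flush leftover complete frames; a trailing partial frame is dropped.
    usable = len(buffer) - len(buffer) % FRAME_TOKENS
    if usable:
        yield buffer[:usable]

# Stand-in for a live token stream: 65 audio tokens followed by END_OF_SPEECH.
fake_tokens = [AUDIO_CODE_BASE_OFFSET + (i % 7) * 4096 for i in range(65)] + [128258]
chunks = list(stream_frames(fake_tokens, frames_per_chunk=4))
print([len(c) for c in chunks])  # prints [28, 28, 7]
```

With this shape, the first audio can be decoded after `frames_per_chunk * 7` tokens instead of after the full sequence. Decoding chunks independently may introduce audible seams at chunk boundaries, so a real implementation would likely need overlapping windows or cross-fading.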

And @arunneel, before losing trust in these metrics, maybe you should read all the community replies.
