Seeking help: TypeError: DacModel.decode() missing 1 required positional argument: 'quantized_representation'
Could anyone please help resolve the error below? (Please also let me know which dac package version worked for you when running this code.)
In my case, the following runtime error occurred when I tried to use this model:
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, attention_mask=attention_mask,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3637, in generate
sample = self.audio_encoder.decode(audio_codes=sample[None, ...], **single_audio_decode_kwargs).audio_values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DacModel.decode() missing 1 required positional argument: 'quantized_representation'
(I had already added attention_mask for the description and the prompt due to earlier errors, but this DAC error still appears.)
Is there any specific transformers version that worked for you?
Thanks in advance.
In the same Windows setup, the base indic-parler-tts model ran (but was too slow for our usage).
I guess it should work with transformers==4.46.0.dev0.
The 1st call to generate() does not give an error, but
the 2nd call to generate() gives: "AttributeError: 'StaticCache' object has no attribute 'max_batch_size'. Did you mean: 'batch_size'?"
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, attention_mask=attention_mask, prompt_attention_mask=prompt_attention_mask,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3491, in generate
model_kwargs["past_key_values"] = self._get_cache(
^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3275, in _get_cache
or cache_to_check.max_batch_size != max_batch_size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\torch\nn\modules\module.py", line 1928, in __getattr__
raise AttributeError(
AttributeError: 'StaticCache' object has no attribute 'max_batch_size'. Did you mean: 'batch_size'?
Could you let me know a specific fix, if one can be done locally?
- Even with the same description string, within a single batch the voice differs!
Is there any setting to keep a single voice as long as the description string is identical?
It looks like the issue is that the StaticCache object no longer has a max_batch_size attribute. This may be due to a change in the library where max_batch_size was deprecated or renamed to batch_size.
Here are a few things you can try to fix this locally:
- Check the library version: if you're using an older version of transformers, upgrading to the latest version might resolve the issue. Try running: pip install --upgrade transformers
- Modify the code: if max_batch_size has been replaced with batch_size, try changing occurrences of max_batch_size to batch_size in your code, particularly in the _get_cache() function.
- Check GitHub issues: similar issues have been reported and discussed on GitHub; you might find a fix or workaround there.
- Verify the StaticCache implementation: if you're constructing StaticCache manually, ensure it supports max_batch_size; if not, you may need to initialize it differently.
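If you would rather not hand-edit the installed parler_tts package, the rename can also be papered over with a small compatibility helper that tries the old attribute name first and falls back to the new one. This is only a sketch; cache_batch_size is a hypothetical helper name, and the real object in the traceback is transformers' StaticCache:

```python
# Hedged sketch: read the batch-size attribute under whichever name the
# installed transformers version exposes. The fallback logic itself is
# library-agnostic, so it works for both old and new StaticCache objects.

def cache_batch_size(cache):
    """Return the cache's batch size, trying the old attribute name first."""
    for name in ("max_batch_size", "batch_size"):
        value = getattr(cache, name, None)
        if value is not None:
            return value
    raise AttributeError("cache exposes neither max_batch_size nor batch_size")

# The failing comparison in parler_tts's _get_cache() could then read:
#   cache_batch_size(cache_to_check) != max_batch_size
```

This keeps the local patch in one place instead of scattering attribute renames through the vendored code.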
----HelpingAI----
This was generated by HelpingAI, so make sure to double-check it.
Thank you, I had made a change in the code at that time to suppress it.
Could you please let me know whether there is a way to make the speaker/voice remain consistent?
That is, is there some input parameter that fixes the voice so it does not change (until the parameter value is changed)?
Yes, Parler-TTS supports consistent speaker voice generation by using a fixed speaker embedding or conditioning input. To maintain the same voice across generations, you need to explicitly pass the same speaker-related parameters (such as speaker_embedding, speaker_id, or a consistent prompt_input_ids), depending on the model variant.
I may be wrong; please let me know if this fixes the issue.
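To rule out a batching mistake, it can help to make the shared conditioning explicit: repeat the one description for every row of the batch, so each generated sentence is conditioned on identical description text. The sketch below assumes the generate() signature seen in this thread (input_ids for the description, prompt_input_ids for the text to speak); build_generate_kwargs, desc_tokenizer and prompt_tokenizer are hypothetical names standing in for the real tokenizers from the model card:

```python
# Sketch, assuming the parler-tts convention that the description is passed
# as `input_ids` and the text to speak as `prompt_input_ids`. The tokenizers
# are treated as callables returning dicts with "input_ids"/"attention_mask".

def build_generate_kwargs(desc_tokenizer, prompt_tokenizer, description, prompts):
    # One shared description, repeated for every prompt in the batch,
    # so every row is conditioned on exactly the same voice text.
    descriptions = [description] * len(prompts)
    desc_enc = desc_tokenizer(descriptions)
    prompt_enc = prompt_tokenizer(prompts)
    return {
        "input_ids": desc_enc["input_ids"],
        "attention_mask": desc_enc["attention_mask"],
        "prompt_input_ids": prompt_enc["input_ids"],
        "prompt_attention_mask": prompt_enc["attention_mask"],
    }

# Usage (sketch): generation = model.generate(**build_generate_kwargs(...))
# If sampling randomness contributes to voice drift, fixing the RNG seed
# before each call (e.g. torch.manual_seed(0)) may also be worth trying;
# that is an assumption, not documented model behavior.
```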
On the model page it is mentioned:
prompt = "Hey, what's up? How’s it going?"
description = "A friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."
I used description = "Lalitha is A friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."
I ensured the description was constant across generate() calls, and only changed the input text in the prompts (parts of a sentence, about 50 tokens).
Despite the description being the same, even within a single generate() call on a batch of 4 input sentences, the generated voice differs in accent, tone, and other aspects; it was as if each came from a different speaker!
Since there is no speaker ID on the model page, it was not clear whether any speaker ID is supported (for the Telugu language), so I used the speaker name 'Lalitha'.
It generated a bunch of audios with different accents.
Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech
The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt
Some Description Examples
Aditi - Slightly High-Pitched, Expressive Tone:
"Aditi speaks with a slightly higher pitch in a close-sounding environment. Her voice is clear, with subtle emotional depth and a normal pace, all captured in high-quality recording."
Sita - Rapid, Slightly Monotone:
"Sita speaks at a fast pace with a slightly low-pitched voice, captured clearly in a close-sounding environment with excellent recording quality."
Tapan - Male, Moderate Pace, Slightly Monotone:
"Tapan speaks at a moderate pace with a slightly monotone tone. The recording is clear, with a close sound and only minimal ambient noise."
Sunita - High-Pitched, Happy Tone:
"Sunita speaks with a high pitch in a close environment. Her voice is clear, with slight dynamic changes, and the recording is of excellent quality."
Karan - High-Pitched, Positive Tone:
"Karan’s high-pitched, engaging voice is captured in a clear, close-sounding recording. His slightly slower delivery conveys a positive tone."
Amrita - High-Pitched, Flat Tone:
"Amrita speaks with a high pitch at a slow pace. Her voice is clear, with excellent recording quality and only moderate background noise."
Aditi - Slow, Slightly Expressive:
"Aditi speaks slowly with a high pitch and expressive tone. The recording is clear, showcasing her energetic and emotive voice."
Young Male Speaker, American Accent:
"A young male speaker with a high-pitched American accent delivers speech at a slightly fast pace in a clear, close-sounding recording."
Bikram - High-Pitched, Urgent Tone:
"Bikram speaks with a higher pitch and fast pace, conveying urgency. The recording is clear and intimate, with great emotional depth."
Anjali - High-Pitched, Neutral Tone:
"Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."
Here you go, my friend. This is a fine-tuned version of indic-parler-tts, so these should be supported.
The problem is that once we set a description like the above, generation did not honor it; it kept changing voices!
For example, we set description = "Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."
When generating with a batch of 4 input texts, each generated audio has a different voice: one has an English accent, another a Telugu accent; one is fast-paced, another sounds sad, and so on.
The given description seems to be ignored by the model.
After generating audio for 10 sentences, the voice is different for each of the 10 sentences!
Once a description (for the voice parameters) is set, it is fixed and not modified in code, so the description is the same for all sentences. Yet the voice characteristics of each generated sentence differ; it is like obtaining the sentences from randomly chosen speakers.
What are your requirements for TTS? You could switch to a better TTS.
The requirement is to generate Telugu female audio from Telugu text.
We had tried these models:
- indic-parler-tts: on an RTX 3080 it is slow (and there is not much control over aspects like timbre or emotion when tested)
- IndicF5: produced hallucinated audio sounds but no Telugu voice (community comments show some other devs also faced this)
- facebook-mms-tel and other VITS models: these generate too low-quality audio, e.g. parts of words are missing (many VITS models showed the same issue)
We are looking for a model that can generate a Telugu voice from Telugu text with emotion support.
It looks like the issue is related to the `StaticCache` object not having the `max_batch_size` attribute. This might be due to a change in the library where `max_batch_size` has been deprecated or replaced with `batch_size`. Here are a few things you can try to fix this locally:
- Check the library version: if you're using an older version of `transformers`, upgrading to the latest version might resolve the issue. Try running: `pip install --upgrade transformers`
- Modify the code: if `max_batch_size` has been replaced with `batch_size`, you can try changing occurrences of `max_batch_size` to `batch_size` in your code, particularly in the `_get_cache()` function.
- Check GitHub issues: there are discussions on GitHub where similar issues have been reported. You might find a fix or workaround there.
- Verify the StaticCache implementation: if you're manually using `StaticCache`, ensure that it supports `max_batch_size`. If not, you may need to initialize it differently.
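A minimal sketch of the attribute-rename workaround, assuming the installed `StaticCache` only exposes `batch_size` (as the error message suggests) while the calling code still reads `max_batch_size`. The stand-in class below is hypothetical; with the real library you would `from transformers import StaticCache` and apply the same patch:

```python
# Hypothetical stand-in for transformers' StaticCache; in real code,
# replace this class with `from transformers import StaticCache`.
class StaticCache:
    def __init__(self, batch_size):
        self.batch_size = batch_size

# If this version only exposes `batch_size`, alias the old name so
# callers that still read `max_batch_size` keep working.
if not hasattr(StaticCache, "max_batch_size"):
    StaticCache.max_batch_size = property(lambda self: self.batch_size)

cache = StaticCache(batch_size=4)
print(cache.max_batch_size)  # prints 4
```

This avoids editing the installed package; run the patch once at startup, before the first `generate()` call.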
----HelpingAI----
This is generated by HelpingAI, so make sure to double-check it.

Thank you, I had made a change in the code at that time to suppress it.
Could you please let me know if there is a way to make the speaker or voice remain consistent?
Like, is there some parameter to be provided as input that ensures the voice does not change (until the parameter value is changed)?

Yes, Parler-TTS supports consistent speaker voice generation by using a fixed speaker embedding or conditioning input. To maintain the same voice across generations, you need to explicitly pass the same speaker-related parameters, such as `speaker_embedding`, `speaker_id`, or a consistent `prompt_input_ids`, depending on the model variant.
I may be wrong; please let me know if it fixes the issue.
On the model page it is mentioned:
prompt = "Hey, what's up? How’s it going?"
description = "A friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."

I used description = "Lalitha is a friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."
I ensured the description was constant across generate() calls, and only changed the input text in the prompts (parts of a sentence, about 50 tokens each).
Despite the description being the same, even within a single generate() call on a batch of 4 input text sentences, the generated voice differed in accent, tone, and other aspects; it was as if each came from a different speaker!
Since there is no speaker ID on the model page, it was not clear whether any supported speaker ID exists (for the Telugu language), so I used the speaker name 'lalitha'.
So it generated a bunch of audios with different accents.

Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech.
The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt.
Some Description Examples:
Aditi - Slightly High-Pitched, Expressive Tone:
"Aditi speaks with a slightly higher pitch in a close-sounding environment. Her voice is clear, with subtle emotional depth and a normal pace, all captured in high-quality recording."
Sita - Rapid, Slightly Monotone:
"Sita speaks at a fast pace with a slightly low-pitched voice, captured clearly in a close-sounding environment with excellent recording quality."
Tapan - Male, Moderate Pace, Slightly Monotone:
"Tapan speaks at a moderate pace with a slightly monotone tone. The recording is clear, with a close sound and only minimal ambient noise."
Sunita - High-Pitched, Happy Tone:
"Sunita speaks with a high pitch in a close environment. Her voice is clear, with slight dynamic changes, and the recording is of excellent quality."
Karan - High-Pitched, Positive Tone:
"Karan’s high-pitched, engaging voice is captured in a clear, close-sounding recording. His slightly slower delivery conveys a positive tone."
Amrita - High-Pitched, Flat Tone:
"Amrita speaks with a high pitch at a slow pace. Her voice is clear, with excellent recording quality and only moderate background noise."
Aditi - Slow, Slightly Expressive:
"Aditi speaks slowly with a high pitch and expressive tone. The recording is clear, showcasing her energetic and emotive voice."
Young Male Speaker, American Accent:
"A young male speaker with a high-pitched American accent delivers speech at a slightly fast pace in a clear, close-sounding recording."
Bikram - High-Pitched, Urgent Tone:
"Bikram speaks with a higher pitch and fast pace, conveying urgency. The recording is clear and intimate, with great emotional depth."
Anjali - High-Pitched, Neutral Tone:
"Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."

Here you go, my friend. It's a fine-tuned version of indic-parler-tts, so these should be supported.
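The examples above follow a regular attribute pattern (speaker name, pitch, pace, environment, recording quality), so such descriptions can be assembled programmatically. A hypothetical helper, not part of any Parler-TTS API, just to illustrate the pattern:

```python
def make_description(name: str, pitch: str, pace: str,
                     environment: str = "close-sounding",
                     quality: str = "excellent audio quality") -> str:
    """Compose a Parler-style speaker description from the
    controllable attributes shown in the examples above."""
    return (f"{name} speaks with a {pitch} pitch at a {pace} pace in a "
            f"clear, {environment} environment, captured with {quality}.")

desc = make_description("Anjali", "high", "normal")
print(desc)
```

Keeping the description text byte-for-byte identical between calls (e.g. by generating it from one place like this) rules out accidental wording drift as a cause of voice changes.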
The problem is that once we set a description like the one above, the generation did not honor it; it kept changing the voices!
For example, we set description = "Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."
When generating voices for a batch of 4 input texts, each generated audio had a different voice: one had an English accent, another a Telugu one; one was fast-paced, another sounded sad. The given description seems to be ignored by the model.
After generating audio for 10 sentences, the voice was different for each of the 10 sentences!
Once a description (for the voice parameters) is set, it is fixed and not modified in the code, so the same description applies to all sentences. But the audio characteristics of each generated sentence differ; it is like obtaining the sentences from a bunch of randomly chosen speakers.

What are your requirements for TTS? You could switch to a better TTS.
The requirement is to generate Telugu female audio from Telugu text.
I had tried these models:
- indic-parler-tts: on an RTX 3080 it is slow (and, when tested, there was not much control over aspects like timbre or emotion)
- IndicF5: produced hallucinated audio sounds but no Telugu voice (community comments show some other devs also faced this)
- facebook-mms-tel and other VITS models: these generate audio of too low quality, e.g. parts of words are missing (many VITS models showed the same issue)

I am looking for a model that can generate a Telugu voice from Telugu text, with emotion support.
So this model is also the same indic-parler-tts. I guess the problem is with how you are using it, as the HelpingAI TTS is the exact same model, just fine-tuned with some Kashmiri data.
Mentioning "fine-tuned with Kashmiri data on top of indic-parler-tts" on this model's ModelCard page would have helped, and will help programmers who are comparing indic-parler-tts with this model.
Since indic-parler-tts did not work, we would not have invested the effort of generating with this model (given indic-parler-tts's output in the same setup, it would not have been prudent).
Better to have data before assuming it is a usage problem!
Inconsistent speech characteristics in the output, despite a fixed input description across generate() calls, is a known problem linked to parler-tts.
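One commonly suggested mitigation (not a fix) for sampling-driven voice drift is pinning the random seed before each generate() call, e.g. with transformers' `set_seed`. The sketch below illustrates the idea with Python's stdlib `random` as a stand-in for the model's sampling; `generate_stub` is hypothetical and only demonstrates that a pinned seed makes the random draws repeatable:

```python
import random

def generate_stub(prompt: str, seed: int = 42):
    """Hypothetical stand-in for model.generate(): with a pinned seed,
    every call draws the same random numbers, so the sampled output
    is repeatable. Real code would call transformers.set_seed(seed)
    immediately before each model.generate() call instead."""
    random.seed(seed)
    return [random.random() for _ in range(3)]

a = generate_stub("first sentence")
b = generate_stub("second sentence")
print(a == b)  # True: same seed, same draws on every call
```

Note this makes the sampling deterministic per call; whether it keeps the *voice* stable across different prompts still depends on how strongly the model conditions on the description.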