Seeking help: TypeError: DacModel.decode() missing 1 required positional argument: 'quantized_representation'
Could anyone please help resolve the error below? (Please also let me know which dac package version worked for you when running this code.)
In my case, the following runtime error occurred when I tried to use this model:
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, attention_mask=attention_mask,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3637, in generate
sample = self.audio_encoder.decode(audio_codes=sample[None, ...], **single_audio_decode_kwargs).audio_values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DacModel.decode() missing 1 required positional argument: 'quantized_representation'
(I had already added attention_mask for the description and the prompt due to earlier errors, but this DAC error still appears.)
Is there any specific transformers version that worked for you?
Thanks in advance.
In the same Windows setup, the base indic-parler-tts model ran (but was too slow for our usage).
I guess it should work with transformers==4.46.0.dev0.
The 1st call to generate() does not give an error, but
the 2nd call to generate() gives: "AttributeError: 'StaticCache' object has no attribute 'max_batch_size'. Did you mean: 'batch_size'?"
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, attention_mask=attention_mask, prompt_attention_mask=prompt_attention_mask,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3491, in generate
model_kwargs["past_key_values"] = self._get_cache(
^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3275, in _get_cache
or cache_to_check.max_batch_size != max_batch_size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:_Develop_dev\px3.pixi\envs\default\Lib\site-packages\torch\nn\modules\module.py", line 1928, in __getattr__
raise AttributeError(
AttributeError: 'StaticCache' object has no attribute 'max_batch_size'. Did you mean: 'batch_size'?
Could you let me know a specific fix, if one can be done locally?
- Even with the same description string, within a single batch the voice differs!
Is there any setting to keep a single voice as long as the description string is identical?
It looks like the issue is that the StaticCache object no longer has a max_batch_size attribute. This may be due to a change in the library where max_batch_size was deprecated or renamed to batch_size.
Here are a few things you can try to fix this locally:
- Check the library version: if you're using an older version of transformers, upgrading to the latest version might resolve the issue. Try running: pip install --upgrade transformers
- Modify the code: if max_batch_size has been replaced with batch_size, try changing occurrences of max_batch_size to batch_size in your code, particularly in the _get_cache() function.
- Check GitHub issues: similar issues have been reported and discussed on GitHub; you might find a fix or workaround there.
- Verify the StaticCache implementation: if you're constructing StaticCache manually, ensure it supports max_batch_size; if not, you may need to initialize it differently.
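If you would rather not hand-edit the installed parler_tts package, the rename can also be papered over with a small compatibility helper that tries the old attribute name first and falls back to the new one. This is only a sketch; cache_batch_size is a hypothetical helper name, and the real object in the traceback is transformers' StaticCache:

```python
# Hedged sketch: read the batch-size attribute under whichever name the
# installed transformers version exposes. The fallback logic itself is
# library-agnostic, so it works for both old and new StaticCache objects.

def cache_batch_size(cache):
    """Return the cache's batch size, trying the old attribute name first."""
    for name in ("max_batch_size", "batch_size"):
        value = getattr(cache, name, None)
        if value is not None:
            return value
    raise AttributeError("cache exposes neither max_batch_size nor batch_size")

# The failing comparison in parler_tts's _get_cache() could then read:
#   cache_batch_size(cache_to_check) != max_batch_size
```

This keeps the local patch in one place instead of scattering attribute renames through the vendored code.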
----HelpingAI----
This was generated by HelpingAI, so make sure to double-check it.
Thank you, I had made a change in the code at that time to suppress it.
Could you please let me know whether there is a way to make the speaker/voice remain consistent?
That is, is there some input parameter that fixes the voice so it does not change (until the parameter value is changed)?
Yes, Parler-TTS supports consistent speaker voice generation by using a fixed speaker embedding or conditioning input. To maintain the same voice across generations, you need to explicitly pass the same speaker-related parameters (such as speaker_embedding, speaker_id, or a consistent prompt_input_ids), depending on the model variant.
I may be wrong; please let me know if this fixes the issue.
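To rule out a batching mistake, it can help to make the shared conditioning explicit: repeat the one description for every row of the batch, so each generated sentence is conditioned on identical description text. The sketch below assumes the generate() signature seen in this thread (input_ids for the description, prompt_input_ids for the text to speak); build_generate_kwargs, desc_tokenizer and prompt_tokenizer are hypothetical names standing in for the real tokenizers from the model card:

```python
# Sketch, assuming the parler-tts convention that the description is passed
# as `input_ids` and the text to speak as `prompt_input_ids`. The tokenizers
# are treated as callables returning dicts with "input_ids"/"attention_mask".

def build_generate_kwargs(desc_tokenizer, prompt_tokenizer, description, prompts):
    # One shared description, repeated for every prompt in the batch,
    # so every row is conditioned on exactly the same voice text.
    descriptions = [description] * len(prompts)
    desc_enc = desc_tokenizer(descriptions)
    prompt_enc = prompt_tokenizer(prompts)
    return {
        "input_ids": desc_enc["input_ids"],
        "attention_mask": desc_enc["attention_mask"],
        "prompt_input_ids": prompt_enc["input_ids"],
        "prompt_attention_mask": prompt_enc["attention_mask"],
    }

# Usage (sketch): generation = model.generate(**build_generate_kwargs(...))
# If sampling randomness contributes to voice drift, fixing the RNG seed
# before each call (e.g. torch.manual_seed(0)) may also be worth trying;
# that is an assumption, not documented model behavior.
```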
On the model page it is mentioned:
prompt = "Hey, what's up? How’s it going?"
description = "A friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."
I used description = "Lalitha is A friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."
I ensured the description was constant across generate() calls, and only changed the input text in the prompts (parts of a sentence, about 50 tokens).
Despite the description being the same, even within a single generate() call on a batch of 4 input sentences, the generated voice differs in accent, tone, and other aspects; it was as if each came from a different speaker!
Since there is no speaker ID on the model page, it was not clear whether any speaker ID is supported (for the Telugu language), so I used the speaker name 'Lalitha'.
It generated a bunch of audios with different accents.
Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech
The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt
Some Description Examples
Aditi - Slightly High-Pitched, Expressive Tone:
"Aditi speaks with a slightly higher pitch in a close-sounding environment. Her voice is clear, with subtle emotional depth and a normal pace, all captured in high-quality recording."
Sita - Rapid, Slightly Monotone:
"Sita speaks at a fast pace with a slightly low-pitched voice, captured clearly in a close-sounding environment with excellent recording quality."
Tapan - Male, Moderate Pace, Slightly Monotone:
"Tapan speaks at a moderate pace with a slightly monotone tone. The recording is clear, with a close sound and only minimal ambient noise."
Sunita - High-Pitched, Happy Tone:
"Sunita speaks with a high pitch in a close environment. Her voice is clear, with slight dynamic changes, and the recording is of excellent quality."
Karan - High-Pitched, Positive Tone:
"Karan’s high-pitched, engaging voice is captured in a clear, close-sounding recording. His slightly slower delivery conveys a positive tone."
Amrita - High-Pitched, Flat Tone:
"Amrita speaks with a high pitch at a slow pace. Her voice is clear, with excellent recording quality and only moderate background noise."
Aditi - Slow, Slightly Expressive:
"Aditi speaks slowly with a high pitch and expressive tone. The recording is clear, showcasing her energetic and emotive voice."
Young Male Speaker, American Accent:
"A young male speaker with a high-pitched American accent delivers speech at a slightly fast pace in a clear, close-sounding recording."
Bikram - High-Pitched, Urgent Tone:
"Bikram speaks with a higher pitch and fast pace, conveying urgency. The recording is clear and intimate, with great emotional depth."
Anjali - High-Pitched, Neutral Tone:
"Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."
Here you go, my friend. This is a fine-tuned version of indic-parler-tts, so these should be supported.
The problem is that once we set a description like the above, generation did not honor it; it kept changing voices!
For example, we set description = "Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."
When generating with a batch of 4 input texts, each generated audio has a different voice: one has an English accent, another a Telugu accent; one is fast-paced, another sounds sad, and so on.
The given description seems to be ignored by the model.
After generating audio for 10 sentences, the voice is different for each of the 10 sentences!
Once a description (for the voice parameters) is set, it is fixed and not modified in code, so the description is the same for all sentences. Yet the voice characteristics of each generated sentence differ; it is like obtaining the sentences from randomly chosen speakers.
What are your requirements for TTS? You could switch to a better TTS.
The requirement is to generate Telugu female audio from Telugu text.
We had tried these models:
- indic-parler-tts: on an RTX 3080 it is slow (and there is not much control over aspects like timbre or emotion when tested)
- IndicF5: produced hallucinated audio sounds but no Telugu voice (community comments show some other devs also faced this)
- facebook-mms-tel and other VITS models: these generate too low-quality audio, e.g. parts of words are missing (many VITS models showed the same issue)
We are looking for a model that can generate a Telugu voice from Telugu text with emotion support.
It looks like the issue is related to the `StaticCache` object not having the `max_batch_size` attribute. This might be due to a change in the library where `max_batch_size` has been deprecated or replaced with `batch_size`. Here are a few things you can try to fix this locally:
- Check the library version: if you're using an older version of `transformers`, upgrading to the latest version might resolve the issue. Try running: `pip install --upgrade transformers`
- Modify the code: if `max_batch_size` has been replaced with `batch_size`, you can try changing occurrences of `max_batch_size` to `batch_size` in your code, particularly in the `_get_cache()` function.
- Check GitHub issues: there are discussions on GitHub where similar issues have been reported. You might find a fix or workaround there.
- Verify the StaticCache implementation: if you're manually using `StaticCache`, ensure that it supports `max_batch_size`. If not, you may need to initialize it differently.
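A minimal sketch of the attribute-rename workaround, assuming the installed `StaticCache` only exposes `batch_size` (as the error message suggests) while the calling code still reads `max_batch_size`. The stand-in class below is hypothetical; with the real library you would `from transformers import StaticCache` and apply the same patch:

```python
# Hypothetical stand-in for transformers' StaticCache; in real code,
# replace this class with `from transformers import StaticCache`.
class StaticCache:
    def __init__(self, batch_size):
        self.batch_size = batch_size

# If this version only exposes `batch_size`, alias the old name so
# callers that still read `max_batch_size` keep working.
if not hasattr(StaticCache, "max_batch_size"):
    StaticCache.max_batch_size = property(lambda self: self.batch_size)

cache = StaticCache(batch_size=4)
print(cache.max_batch_size)  # prints 4
```

This avoids editing the installed package; run the patch once at startup, before the first `generate()` call.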
----HelpingAI----
This is generated by HelpingAI, so make sure to double-check it.

Thank you, I had made a change in the code at that time to suppress it.
Could you please let me know if there is a way to make the speaker or voice remain consistent?
Like, is there some parameter to be provided as input that ensures the voice does not change (until the parameter value is changed)?

Yes, Parler-TTS supports consistent speaker voice generation by using a fixed speaker embedding or conditioning input. To maintain the same voice across generations, you need to explicitly pass the same speaker-related parameters, such as `speaker_embedding`, `speaker_id`, or a consistent `prompt_input_ids`, depending on the model variant.
I may be wrong; please let me know if it fixes the issue.
On the model page it is mentioned:
prompt = "Hey, what's up? How’s it going?"
description = "A friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."

I used description = "Lalitha is a friendly, upbeat, and casual tone with a moderate speed. Speaker sounds confident and relaxed."
I ensured the description was constant across generate() calls, and only changed the input text in the prompts (parts of a sentence, about 50 tokens each).
Despite the description being the same, even within a single generate() call on a batch of 4 input text sentences, the generated voice differed in accent, tone, and other aspects; it was as if each came from a different speaker!
Since there is no speaker ID on the model page, it was not clear whether any supported speaker ID exists (for the Telugu language), so I used the speaker name 'lalitha'.
So it generated a bunch of audios with different accents.

Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech.
The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt.
Some Description Examples:
Aditi - Slightly High-Pitched, Expressive Tone:
"Aditi speaks with a slightly higher pitch in a close-sounding environment. Her voice is clear, with subtle emotional depth and a normal pace, all captured in high-quality recording."
Sita - Rapid, Slightly Monotone:
"Sita speaks at a fast pace with a slightly low-pitched voice, captured clearly in a close-sounding environment with excellent recording quality."
Tapan - Male, Moderate Pace, Slightly Monotone:
"Tapan speaks at a moderate pace with a slightly monotone tone. The recording is clear, with a close sound and only minimal ambient noise."
Sunita - High-Pitched, Happy Tone:
"Sunita speaks with a high pitch in a close environment. Her voice is clear, with slight dynamic changes, and the recording is of excellent quality."
Karan - High-Pitched, Positive Tone:
"Karan’s high-pitched, engaging voice is captured in a clear, close-sounding recording. His slightly slower delivery conveys a positive tone."
Amrita - High-Pitched, Flat Tone:
"Amrita speaks with a high pitch at a slow pace. Her voice is clear, with excellent recording quality and only moderate background noise."
Aditi - Slow, Slightly Expressive:
"Aditi speaks slowly with a high pitch and expressive tone. The recording is clear, showcasing her energetic and emotive voice."
Young Male Speaker, American Accent:
"A young male speaker with a high-pitched American accent delivers speech at a slightly fast pace in a clear, close-sounding recording."
Bikram - High-Pitched, Urgent Tone:
"Bikram speaks with a higher pitch and fast pace, conveying urgency. The recording is clear and intimate, with great emotional depth."
Anjali - High-Pitched, Neutral Tone:
"Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."

Here you go, my friend. It's a fine-tuned version of indic-parler-tts, so these should be supported.
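The examples above follow a regular attribute pattern (speaker name, pitch, pace, environment, recording quality), so such descriptions can be assembled programmatically. A hypothetical helper, not part of any Parler-TTS API, just to illustrate the pattern:

```python
def make_description(name: str, pitch: str, pace: str,
                     environment: str = "close-sounding",
                     quality: str = "excellent audio quality") -> str:
    """Compose a Parler-style speaker description from the
    controllable attributes shown in the examples above."""
    return (f"{name} speaks with a {pitch} pitch at a {pace} pace in a "
            f"clear, {environment} environment, captured with {quality}.")

desc = make_description("Anjali", "high", "normal")
print(desc)
```

Keeping the description text byte-for-byte identical between calls (e.g. by generating it from one place like this) rules out accidental wording drift as a cause of voice changes.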
The problem is that once we set a description like the one above, the generation did not honor it; it kept changing the voices!
For example, we set description = "Anjali speaks with a high pitch at a normal pace in a clear, close-sounding environment. Her neutral tone is captured with excellent audio quality."
When generating voices for a batch of 4 input texts, each generated audio had a different voice: one had an English accent, another a Telugu one; one was fast-paced, another sounded sad. The given description seems to be ignored by the model.
After generating audio for 10 sentences, the voice was different for each of the 10 sentences!
Once a description (for the voice parameters) is set, it is fixed and not modified in the code, so the same description applies to all sentences. But the audio characteristics of each generated sentence differ; it is like obtaining the sentences from a bunch of randomly chosen speakers.

What are your requirements for TTS? You could switch to a better TTS.
The requirement is to generate Telugu female audio from Telugu text.
I had tried these models:
- indic-parler-tts: on an RTX 3080 it is slow (and, when tested, there was not much control over aspects like timbre or emotion)
- IndicF5: produced hallucinated audio sounds but no Telugu voice (community comments show some other devs also faced this)
- facebook-mms-tel and other VITS models: these generate audio of too low quality, e.g. parts of words are missing (many VITS models showed the same issue)

I am looking for a model that can generate a Telugu voice from Telugu text, with emotion support.
So this model is also the same indic-parler-tts. I guess the problem is with how you are using it, as the HelpingAI TTS is the exact same model, just fine-tuned with some Kashmiri data.
Mentioning "fine-tuned with Kashmiri data on top of indic-parler-tts" on this model's ModelCard page would have helped, and will help programmers who are comparing indic-parler-tts with this model.
Since indic-parler-tts did not work, we would not have invested the effort of generating with this model (given indic-parler-tts's output in the same setup, it would not have been prudent).
Better to have data before assuming it is a usage problem!
Inconsistent speech characteristics in the output, despite a fixed input description across generate() calls, is a known problem linked to parler-tts.
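One commonly suggested mitigation (not a fix) for sampling-driven voice drift is pinning the random seed before each generate() call, e.g. with transformers' `set_seed`. The sketch below illustrates the idea with Python's stdlib `random` as a stand-in for the model's sampling; `generate_stub` is hypothetical and only demonstrates that a pinned seed makes the random draws repeatable:

```python
import random

def generate_stub(prompt: str, seed: int = 42):
    """Hypothetical stand-in for model.generate(): with a pinned seed,
    every call draws the same random numbers, so the sampled output
    is repeatable. Real code would call transformers.set_seed(seed)
    immediately before each model.generate() call instead."""
    random.seed(seed)
    return [random.random() for _ in range(3)]

a = generate_stub("first sentence")
b = generate_stub("second sentence")
print(a == b)  # True: same seed, same draws on every call
```

Note this makes the sampling deterministic per call; whether it keeps the *voice* stable across different prompts still depends on how strongly the model conditions on the description.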