Voice cloning discrepancy
I am finding a discrepancy between the ONNX Python implementation provided in this repo and the official chatterbox-tts mtl library. Notably, the output speech file does not accurately clone the provided voice.
What is the origin of the speech_encoder.onnx file? Specifically, how was this ONNX model exported, and does it encapsulate all of the custom neural modules (such as VoiceEncoder, S3Gen, and T3Cond) used in the official PyTorch implementation? The ONNX implementation appears to rely on this single model for all speaker feature extraction, whereas the official library uses a multi-stage, tightly integrated conditioning pipeline. That difference may be a key reason the ONNX pipeline fails to clone the voice properly. Any details on the export process and model architecture would be appreciated.
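For what it's worth, inspecting the exported graph directly might already reveal part of the answer (which modules were folded in, input/output shapes, opset). A minimal sketch, assuming the file sits at `speech_encoder.onnx` (adjust the path to wherever this repo stores it):

```python
# Minimal sketch: inspect what speech_encoder.onnx actually contains.
import onnx
from collections import Counter

ONNX_PATH = "speech_encoder.onnx"  # assumption: adjust to the repo's actual path

model = onnx.load(ONNX_PATH)
graph = model.graph

print("Producer:", model.producer_name, model.producer_version)
print("Opsets:", [f"{o.domain or 'ai.onnx'}:{o.version}" for o in model.opset_import])

print("\nInputs:")
for inp in graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(f"  {inp.name}: {dims}")

print("\nOutputs:")
for out in graph.output:
    dims = [d.dim_param or d.dim_value for d in out.type.tensor_type.shape.dim]
    print(f"  {out.name}: {dims}")

# Rough sense of the exported architecture: node count and op-type histogram.
ops = Counter(node.op_type for node in graph.node)
print("\nNode count:", len(graph.node))
print("Most common ops:", ops.most_common(15))
```

If the graph only exposes a single speaker-embedding output, that would support the suspicion that the other conditioning stages were not exported.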
Also, any additional thoughts on resolving the voice cloning discrepancy would be appreciated!
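In case it helps with debugging, one way to make the discrepancy concrete is to score both pipelines' outputs against the same reference clip with an independent speaker encoder. A rough sketch using resemblyzer (an independent third-party tool; its VoiceEncoder is unrelated to chatterbox's internal VoiceEncoder module, and all file paths below are placeholders):

```python
# Rough sketch: quantify the cloning gap with an independent speaker encoder.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

REF_WAV = "reference_voice.wav"         # placeholder: the prompt given to both pipelines
ONNX_OUT = "onnx_output.wav"            # placeholder: output of this repo's ONNX pipeline
OFFICIAL_OUT = "chatterbox_output.wav"  # placeholder: output of the official library

encoder = VoiceEncoder()

def embed(path: str) -> np.ndarray:
    # preprocess_wav loads/resamples/trims; embed_utterance returns a speaker embedding
    return encoder.embed_utterance(preprocess_wav(path))

ref = embed(REF_WAV)
for name, path in [("onnx", ONNX_OUT), ("official", OFFICIAL_OUT)]:
    emb = embed(path)
    cos = float(np.dot(ref, emb) / (np.linalg.norm(ref) * np.linalg.norm(emb)))
    print(f"speaker similarity, reference vs {name}: {cos:.3f}")
```

A consistently lower similarity for the ONNX output would at least confirm the gap is in speaker conditioning rather than, say, resampling or audio I/O.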