Finetuned variant of this model
The readme and the blog post on the website both state that the demo on the website is a fine-tuned model. I'd like to know whether it's a fine-tuned version of this 1B, or a fine-tuned 3B or 8B model.
The results on the website were significantly better than what I'm getting after running it locally.
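For context, this is roughly how I'm generating audio locally (a sketch based on the `load_csm_1b` helper from the csm repo's README; the exact helper names and arguments may differ between repo versions):

```python
# Sketch of local CSM-1B generation, following the csm repo's example usage.
import torch
import torchaudio
from generator import load_csm_1b  # helper from the csm repo

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],                # no conversational context
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```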
The fine-tuned model on the web is perhaps a bigger model, fine-tuned on hours of audio+text data. The 1B seems more like a proof of technology. The key to its quality is a specific finetune which Sesame might not open-source. They either have to give the community a fine-tuned model or tell us how to do it ourselves; otherwise it's useless, like many others that have come and gone.
This is completely different from what's on the website. I suspect this was released so they can claim they open-sourced something. It doesn't appear they are trying to help the community embrace it. I hope I'm wrong.
Well, Llama 4 will likely have token-to-token speech capabilities and will garner wider community support. Sesame's uniqueness will then be its voice (thanks to the voice actors behind it) and its text finetune, which they might sell as an AI companion. I have personally fine-tuned many models, both voice and text, using XTTS v2 (+ DeepSpeed for realtime inference) as a base, and even with a traditional voice pipeline over WebRTC I can pretty much match Sesame in terms of latency and speech quality, while being more flexible about which text model to plug in.
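For anyone curious, this is roughly the kind of XTTS v2 + DeepSpeed setup I mean (a sketch using Coqui TTS's low-level XTTS API; the checkpoint paths, reference clip, and the `play_chunk()` sink are placeholders, and call signatures may vary across TTS versions):

```python
# Sketch: XTTS v2 with DeepSpeed-accelerated inference, streaming chunks
# so the TTS stage can sit behind a WebRTC voice pipeline.
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("xtts_v2/config.json")  # path to a downloaded XTTS v2 checkpoint dir
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_v2/", use_deepspeed=True)
model.cuda()

# Clone a voice from a short reference clip.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

def play_chunk(chunk: torch.Tensor) -> None:
    """Placeholder: push this audio chunk into your transport (e.g. a WebRTC track)."""
    pass

# Stream audio chunks as they are generated; this is what keeps end-to-end latency low.
for chunk in model.inference_stream(
    "Text coming out of whatever LLM you plug in.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
):
    play_chunk(chunk)
```

The text side is just whatever LLM you want driving the conversation; the TTS stage above is model-agnostic, which is the flexibility I was referring to.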