Smol TTS models are here! OuteTTS-0.1-350M - Zero-shot voice cloning, built on LLaMa architecture, CC-BY license! 🔥
> Pure language modeling approach to TTS
> Zero-shot voice cloning
> LLaMa architecture w/ audio tokens (WavTokenizer)
> BONUS: Works on-device w/ llama.cpp ⚡
Three-step approach to TTS:
> Audio tokenization using WavTokenizer (75 tok per second)
> CTC forced alignment for word-to-audio token mapping
> Structured prompt creation w/ transcription, duration, audio tokens
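The three steps above can be sketched roughly as follows. This is a minimal illustration, not OuteTTS's actual code: the `AlignedWord` type, the dummy token ids, and the prompt layout are all hypothetical, and the alignment timestamps are assumed to come from a CTC forced aligner that is not shown. Only the 75 tokens per second rate comes from the post.

```python
from dataclasses import dataclass

TOKENS_PER_SECOND = 75  # WavTokenizer rate quoted in the post


@dataclass
class AlignedWord:
    """One word with its time span (assumed output of forced alignment)."""
    word: str
    start_s: float
    end_s: float


def slice_tokens(audio_tokens, words):
    """Give each word the audio tokens that fall inside its time span."""
    sliced = []
    for w in words:
        lo = round(w.start_s * TOKENS_PER_SECOND)
        hi = round(w.end_s * TOKENS_PER_SECOND)
        sliced.append((w.word, w.end_s - w.start_s, audio_tokens[lo:hi]))
    return sliced


def build_prompt(transcript, sliced):
    """Hypothetical prompt layout: transcript, then word | duration | tokens."""
    lines = [f"text: {transcript}"]
    for word, duration, tokens in sliced:
        token_str = " ".join(str(t) for t in tokens)
        lines.append(f"{word} | {duration:.2f}s | {token_str}")
    return "\n".join(lines)


audio_tokens = list(range(150))  # 2 seconds of dummy token ids
words = [AlignedWord("hello", 0.0, 0.8), AlignedWord("world", 0.8, 2.0)]
sliced = slice_tokens(audio_tokens, words)
prompt = build_prompt("hello world", sliced)
print(prompt.splitlines()[0])
```

At 75 tokens per second, the 0.8 s word gets 60 tokens and the 1.2 s word gets 90, so the prompt length grows linearly with audio duration.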
The model is extremely impressive for 350M parameters! Kudos to the OuteAI team on such a brilliant feat - I'd love to see this applied to larger data and smarter backbones like SmolLM 🤗
> Trained on 1.3 trillion tokens (Dolma 1.7) across 16 nodes, each with 4 MI250 GPUs
> Three checkpoints:
- AMD OLMo 1B: Pre-trained model
- AMD OLMo 1B SFT: Supervised fine-tuned on Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets
- AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on the UltraFeedback dataset
Key Insights:
> Pre-trained with less than half the tokens of OLMo-1B
> Post-training steps include two-phase SFT and DPO alignment
> Data for SFT:
- Phase 1: Tulu V2
- Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback
> Model checkpoints on the Hub & integrated with Transformers ⚡️
Congratulations & kudos to AMD on a brilliant smol model release! 🤗
Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?
What a great day for Open Science! @AIatMeta released models, datasets, and code for many of its research artefacts! 🔥
1. Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. A new developer suite will be added to make it easier for developers to build with SAM 2.
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️
A new player has entered the game! Rhymes AI has just been announced and has unveiled Aria, a multimodal powerhouse that's punching above its weight.
Key insights:
- Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.
- Multimodal: text/image/video → text.
- Novel training approach: "multimodal-native", where multimodal training starts directly during pre-training, not just tacked on later
- Long 64K token context window
- Apache 2.0 license, with weights, code, and demos all open
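A quick back-of-the-envelope on those MoE numbers. The parameter counts come from the post; the bf16 storage assumption (2 bytes per parameter) is mine, not something Rhymes AI states here.

```python
total_params = 25.3e9   # all experts, from the post
active_params = 3.9e9   # parameters actually used, from the post

# Share of the network that does work on a given forward pass.
active_fraction = active_params / total_params

# Assumption: weights stored in bf16 (2 bytes each). All experts still
# have to be resident in memory, even though only a few run at a time.
weights_gb = total_params * 2 / 1e9

print(f"active share: {active_fraction:.1%}")
print(f"weights in bf16: {weights_gb:.1f} GB")
```

So compute per step looks like that of a ~4B model, while the memory footprint is still set by the full 25.3B parameters.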
⚡️ On the benchmark side, Aria leaves some big names in the dust.
- It beats Pixtral 12B and Llama-3.2-11B on several vision benchmarks like MMMU or MathVista.
- It even outperforms the much bigger GPT-4o on long video tasks and outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.
But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called "Beago". It's handling even recent events with great accuracy!
And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.
A 'small' MobileNet-V4 update: I just pushed weights for the smallest model I've trained in the series, a 0.5 width multiplier version of the MobileNet-V4 Conv Small.
Now you may look at this and say hey, why is this impressive? 64.8% top-1 and 2.2M params? MobileNetV3-Small 0.75 and MobileNet-V2 0.5 both have fewer params (at ~2M) and over 65% top-1, what gives? Well, this is where MobileNet-V4 differs from the previous versions of the model family: it trades off (gives up) a little parameter efficiency for some computational efficiency.
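To see why parameter count and compute can be traded against each other at all, here is a generic sketch. The layer shapes are made up for illustration and are not taken from MobileNet-V4; the point is only that a convolution's parameter count ignores spatial resolution while its multiply-accumulate count scales with it.

```python
def conv_cost(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Return (params, macs) for a bias-free 2D convolution."""
    params = (c_in // groups) * c_out * kernel * kernel
    macs = params * out_h * out_w  # one kernel application per output pixel
    return params, macs


# Same hypothetical 3x3, 64 -> 64 layer at two feature-map resolutions.
early = conv_cost(64, 64, 3, 56, 56)  # high-resolution stage
late = conv_cost(64, 64, 3, 7, 7)     # low-resolution stage

print("params:", early[0], late[0])
print("MACs:  ", early[1], late[1])
```

Both layers hold the same number of weights, but the early one costs 64x the compute, so where a design spends its parameters matters as much as how many it has.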
Less than two days ago, Kyutai Labs open sourced Moshi, a ~7.6B on-device speech-to-speech foundation model, and Mimi, a SoTA streaming speech codec! 🔥
The release includes:
1. Moshiko & Moshika - Moshi finetuned on synthetic data (CC-BY license) (kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd)
2. Mimi - Streaming audio codec, processes 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps (CC-BY license) (kyutai/mimi)
3. Model checkpoints & inference codebase written in Rust (Candle), PyTorch & MLX (Apache license) (https://github.com/kyutai-labs/moshi)
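The Mimi numbers are easier to appreciate with a little arithmetic. The sample rate, frame rate, and bitrate below are from the release notes; the comparison against 16-bit mono PCM is my own assumption for scale, not a figure from Kyutai.

```python
SAMPLE_RATE_HZ = 24_000  # input audio
FRAME_RATE_HZ = 12.5     # latent frames per second
BITRATE_BPS = 1_100      # 1.1 kbps

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # audio samples folded into one frame
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ        # bit budget for each frame

# Assumption: uncompressed baseline is 16-bit mono PCM at the same sample rate.
pcm_bps = SAMPLE_RATE_HZ * 16
compression_ratio = pcm_bps / BITRATE_BPS

print(f"{samples_per_frame:.0f} samples per frame")
print(f"{bits_per_frame:.0f} bits per frame")
print(f"~{compression_ratio:.0f}x smaller than 16-bit PCM")
```

Each frame summarizes 1920 samples in 88 bits, which is what makes it practical for a language model to predict audio frame by frame.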
How does Moshi work?
1. Moshi processes two audio streams: one for itself and one for the user, with the user's stream coming from audio input and Moshi's stream generated by the model.
2. Along with these audio streams, Moshi predicts text tokens for its speech, enhancing its generation quality.
3. The model uses a small Depth Transformer for codebook dependencies and a large 7B parameter Temporal Transformer for temporal dependencies.
4. The theoretical latency is 160ms, with a practical latency of around 200ms on an L4 GPU.
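One way to read the latency figures is in units of codec frames. The sketch below assumes Moshi steps at Mimi's 12.5 Hz frame rate; the post does not break down where the 160 ms comes from, so this only shows how the quoted numbers relate to the frame length.

```python
FRAME_RATE_HZ = 12.5                 # Mimi frame rate from the release notes
frame_ms = 1000 / FRAME_RATE_HZ      # duration of one frame in milliseconds


def latency_in_frames(latency_ms):
    """Express a latency as a number of codec frames."""
    return latency_ms / frame_ms


theoretical = latency_in_frames(160)
practical = latency_in_frames(200)
print(frame_ms, theoretical, practical)  # 80.0 2.0 2.5
```

Under that assumption the theoretical latency is exactly two frames, and the L4 measurement adds about half a frame on top.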
Model size & inference:
Moshiko/ka are 7.69B param models
bf16 ~16GB VRAM
8-bit ~8GB VRAM
4-bit ~4GB VRAM
You can run inference via Candle 🦀, PyTorch and MLX - based on your hardware.
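The VRAM figures above line up with a simple weights-only estimate. This sketch counts only the parameters at a given precision; activations, caches, and framework overhead are ignored, which is presumably why the quoted numbers are rounded up a little.

```python
PARAMS = 7.69e9  # Moshiko/Moshika parameter count from the post


def weight_memory_gb(n_params, bits_per_param):
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


estimates = {
    name: weight_memory_gb(PARAMS, bits)
    for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]
}
for name, gb in estimates.items():
    print(f"{name}: {gb:.1f} GB")
```

That gives roughly 15.4, 7.7, and 3.8 GB, matching the ~16/8/4 GB guidance.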
The Kyutai team, @adefossez, @lmz and team are cracked AF, they're bringing some serious firepower to the open source/science AI scene, looking forward to what's next!
🤯 Ghost 8B Beta emerges as a clear leader, surpassing even proprietary models like xAI Grok 1, OpenAI GPT 3.5, and Mistral Mixtral 8x7B. It also reaches parity with Mistral Medium, further solidifying its position as a top-tier language model. Furthermore, Ghost 8B Beta stands out as one of only three models evaluated with the zero-shot method, alongside Claude 2 and Claude 3, showcasing its unique capabilities and potential for groundbreaking applications.
---
💬 Chat with the model here:
- Playground with Ghost 8B Beta (β, 8k): lamhieu/ghost-8b-beta-8k
- Playground with Ghost 8B Beta (β, 128k): lamhieu/ghost-8b-beta-128k
- Official website: https://ghost-x.org/docs/models/ghost-8b-beta/
@seyonec It's a great help to experiment with ChemBERTa.
BTW, there are several models that handle SMILES in the model repository. Could you kindly recommend the one with the best performance on the hERG dataset? https://paperswithcode.com/dataset/herg
Auto Evol-Instruct is an automated, iterative process that optimizes an initial Evol-Instruct V1 evolving method into an optimal one. The pipeline consists of two critical stages: Evol Trajectory Analysis, where the optimizer LLM analyzes the issues and failures exposed in instruction evolution performed by the evol LLM, and Evolving Method Optimization, where the optimizer LLM addresses these issues to progressively develop an effective evolving method. The optimal evolving method is then used to convert the entire instruction dataset into more diverse and complex forms, facilitating improved instruction tuning.
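The two-stage loop described above can be sketched as plain control flow. Everything here is a stand-in: `auto_evol_instruct` and the `fake_*` functions are hypothetical names, and in the real pipeline each callable would be an LLM call rather than string formatting.

```python
from typing import Callable, List


def auto_evol_instruct(
    seed_instructions: List[str],
    initial_method: str,
    evol_llm: Callable[[str, str], str],
    analyze: Callable[[List[str], List[str]], str],
    optimize: Callable[[str, str], str],
    steps: int = 3,
) -> str:
    """Iteratively refine an evolving method (hypothetical sketch)."""
    method = initial_method
    for _ in range(steps):
        # The evol LLM rewrites each instruction using the current method.
        evolved = [evol_llm(method, inst) for inst in seed_instructions]
        # Stage 1, Evol Trajectory Analysis: find issues in the evolved output.
        feedback = analyze(seed_instructions, evolved)
        # Stage 2, Evolving Method Optimization: revise the method.
        method = optimize(method, feedback)
    return method


# Toy stand-ins so the sketch runs without any model.
def fake_evol(method, instruction):
    return f"{instruction} ({method})"


def fake_analyze(originals, evolved):
    return "add a constraint"


def fake_optimize(method, feedback):
    return f"{method}; {feedback}"


final_method = auto_evol_instruct(
    ["Sort a list."], "rewrite", fake_evol, fake_analyze, fake_optimize, steps=2
)
# The optimized method is then applied to the whole dataset.
evolved_dataset = [fake_evol(final_method, i) for i in ["Sort a list.", "Add two numbers."]]
print(final_method)
```

Keeping the three roles as separate callables mirrors the split in the text between the evol LLM and the optimizer LLM.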
2. Scaling Evol-Instruct with Arena Learning
With Auto Evol-Instruct, the evolutionary synthesis data of WizardLM-2 has scaled up from WizardLM-1 to dozens of domains, covering tasks in all aspects of large language models. This allows Arena Learning to train and learn from an almost infinite pool of high-difficulty instruction data, fully unlocking all the potential of Arena Learning.