Vaibhav Srivastav

reach-vb

AI & ML interests

TTS + LM performance prediction

Articles

Organizations

reach-vb's activity

posted an update 13 days ago
view post
Post
1192
Smol TTS models are here! OuteTTS-0.1-350M - Zero shot voice cloning, built on LLaMa architecture, CC-BY license! πŸ”₯

> Pure language modeling approach to TTS
> Zero-shot voice cloning
> LLaMa architecture w/ Audio tokens (WavTokenizer)
> BONUS: Works on-device w/ llama.cpp ⚑

Three-step approach to TTS:

> Audio tokenization using WavTokenizer (75 tok per second)
> CTC forced alignment for word-to-audio token mapping
> Structured prompt creation w/ transcription, duration, audio tokens

The model is extremely impressive for 350M parameters! Kudos to the
OuteAI team on such a brilliant feat - I'd love to see this be applied on larger data and smarter backbones like SmolLM πŸ€—

Check out the models here: OuteAI/outetts-6728aa71a53a076e4ba4817c
posted an update 15 days ago
view post
Post
2940
Smol models ftw! AMD released AMD OLMo 1B - beats OpenELM, tiny llama on MT Bench, Alpaca Eval - Apache 2.0 licensed πŸ”₯

> Trained with 1.3 trillion (dolma 1.7) tokens on 16 nodes, each with 4 MI250 GPUs

> Three checkpoints:

- AMD OLMo 1B: Pre-trained model
- AMD OLMo 1B SFT: Supervised fine-tuned on Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets
- AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on UltraFeedback dataset

Key Insights:
> Pre-trained with less than half the tokens of OLMo-1B
> Post-training steps include two-phase SFT and DPO alignment
> Data for SFT:
- Phase 1: Tulu V2
- Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback

> Model checkpoints on the Hub & Integrated with Transformers ⚑️

Congratulations & kudos to AMD on a brilliant smol model release! πŸ€—

amd/amd-olmo-6723e7d04a49116d8ec95070
reacted to albertvillanova's post with πŸ”₯❀️ 19 days ago
view post
Post
3055
πŸš€ Exciting update! You can now compare multiple models side-by-side with the Hugging Face Open LLM Comparator! πŸ“Š

open-llm-leaderboard/comparator

Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?
posted an update 30 days ago
view post
Post
2396
What a great day for Open Science! @AIatMeta released models, datasets, and code for many of its research artefacts! πŸ”₯

1. Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. A new developer suite will be added to make it easier for developers to build with SAM 2.

Model checkpoints: reach-vb/sam-21-6702d40defe7611a8bafa881

2. Layer Skip: Inference code and fine-tuned checkpoints demonstrating a new method for enhancing LLM performance.

Model checkpoints: facebook/layerskip-666b25c50c8ae90e1965727a

3. SALSA: New code enables researchers to benchmark AI-based attacks to validate security for post-quantum cryptography.

Repo: https://github.com/facebookresearch/LWE-benchmarking

4. Meta Lingua: A lightweight and self-contained codebase designed to train language models at scale.

Repo: https://github.com/facebookresearch/lingua

5. Meta Open Materials: New open source models and the largest dataset to accelerate AI-driven discovery of new inorganic materials.

Model checkpoints: fairchem/OMAT24

6. MEXMA: A new research paper and code for our novel pre-trained cross-lingual sentence encoder covering 80 languages.

Model checkpoint: facebook/MEXMA

7. Self-Taught Evaluator: a new method for generating synthetic preference data to train reward models without relying on human annotations.

Model checkpoint: facebook/Self-taught-evaluator-llama3.1-70B

8. Meta Spirit LM: An open-source language model for seamless speech and text integration.

Repo: https://github.com/facebookresearch/spiritlm
  • 3 replies
Β·
posted an update about 1 month ago
view post
Post
5339
Multimodal Ichigo Llama 3.1 - Real Time Voice AI πŸ”₯

> WhisperSpeech X Llama 3.1 8B
> Trained on 50K hours of speech (7 languages)
> Continually trained on 45hrs 10x A1000s
> MLS -> WhisperVQ tokens -> Llama 3.1
> Instruction tuned on 1.89M samples
> 70% speech, 20% transcription, 10% text
> Apache 2.0 licensed ⚑

Architecture:
> WhisperSpeech/ VQ for Semantic Tokens
> Llama 3.1 8B Instruct for Text backbone
> Early fusion (Chameleon)

I'm super bullish on HomeBrew/ Jan and early fusion, audio and text, multimodal models!

(P.S. Play with the demo on Hugging Face: jan-hq/Ichigo-llama3.1-s-instruct)
reacted to m-ric's post with πŸ‘€πŸ”₯ about 1 month ago
view post
Post
2901
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚑️

New player entered the game! Rhymes AI has just been announced, and unveiled Aria – a multimodal powerhouse that's punching above its weight.

Key insights:

🧠 Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.

🌈 Multimodal: text/image/video β†’ text.

πŸ“š Novel training approach: β€œmultimodal-native” where multimodal training starts directly during pre-training, not just tacked on later

πŸ“ Long 64K token context window

πŸ”“ Apache 2.0 license, with weights, code, and demos all open

⚑️ On the benchmark side, Aria leaves some big names in the dust.

- It beats Pixtral 12B or Llama-3.2-12B on several vision benchmarks like MMMU or MathVista.
- It even overcomes the much bigger GPT-4o on long video tasks and even outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.

But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called β€œBeago”. It’s handling even recent events with great accuracy!

And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.

Read their paper for Aria πŸ‘‰Β  Aria: An Open Multimodal Native Mixture-of-Experts Model (2410.05993)

Try BeaGo 🐢 πŸ‘‰Β https://rhymes.ai/blog-details/introducing-beago-your-smarter-faster-ai-search
  • 1 reply
Β·
posted an update about 1 month ago
view post
Post
3052
NEW: Open Source Text/ Image to video model is out - MIT licensed - Rivals Gen-3, Pika & Kling πŸ”₯

> Pyramid Flow: Training-efficient Autoregressive Video Generation method
> Utilizes Flow Matching
> Trains on open-source datasets
> Generates high-quality 10-second videos
> Video resolution: 768p
> Frame rate: 24 FPS
> Supports image-to-video generation

> Model checkpoints available on the hub πŸ€—: rain1011/pyramid-flow-sd3
posted an update about 1 month ago
view post
Post
2038
On-device AI framework ecosystem is blooming these days:

1. llama.cpp - All things Whisper, LLMs & VLMs - run across Metal, CUDA and other backends (AMD/ NPU etc)
https://github.com/ggerganov/llama.cpp

2. MLC - Deploy LLMs across platforms especially WebGPU (fastest WebGPU LLM implementation out there)
https://github.com/mlc-ai/web-llm

3. MLX - Arguably the fastest general purpose framework (Mac only) - Supports all major Image Generation (Flux, SDXL, etc), Transcription (Whisper), LLMs
https://github.com/ml-explore/mlx-examples

4. Candle - Cross-platform general purpose framework written in Rust - wide coverage across model categories
https://github.com/huggingface/candle

Honorable mentions:

1. Transformers.js - Javascript (WebGPU) implementation built on top of ONNXruntimeweb
https://github.com/xenova/transformers.js

2. Mistral rs - Rust implementation for LLMs & VLMs, built on top of Candle
https://github.com/EricLBuehler/mistral.rs

3. Ratchet - Cross platform, rust based WebGPU framework built for battle-tested deployments
https://github.com/huggingface/ratchet

4. Zml - Cross platform, Zig based ML framework
https://github.com/zml/zml

Looking forward to how the ecosystem would look 1 year from now - Quite bullish on the top 4 atm - but open source ecosystem changes quite a bit! πŸ€—

Also, which frameworks did I miss?
  • 1 reply
Β·
reacted to rwightman's post with ❀️ about 2 months ago
view post
Post
2463
A 'small' MobileNet-V4 update, I just pushed weights for the smallest model I've trained in the series, a 0.5 width multiplier version of the MobileNet-V4 Conv Small.

Now you may look at this and say hey, why is this impressive? 64.8% top-1 and 2.2M params? MobileNetV3-Small 0.75, and MobileNet-V2 0.5 are both fewer params (at ~2M) and over 65% top-1, what gives? Well this is where MobileNet-V4 differs from the previous versions of the model family, it trades off (gives up) a little parameter efficiency for some computational efficiency.

So, let's look at the speed. On a 4090 w/ torchcompile
* 98K img/sec - timm/mobilenetv4_conv_small_050.e3000_r224_in1k
* 58K img/sec - timm/mobilenetv3_small_075.lamb_in1k
* 37K img/sec - timm/mobilenetv2_050.lamb_in1k

And there you go, if you have a need for speed, MNV4 is the better option.
posted an update about 2 months ago
view post
Post
2792
Less than two days ago Kyutai Labs open sourced Moshi - an ~7.6B on-device Speech to Speech foundation model and Mimi - SoTA streaming speech codec! πŸ”₯

The release includes:

1. Moshiko & Moshika - Moshi finetuned on synthetic data (CC-BY license) ( kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd)
2. Mimi - Streaiming Audio Codec, processes 24 kHz audio, down to a 12.5 Hz representation with a bandwidth of 1.1 kbps (CC-BY license) ( kyutai/mimi)
3. Model checkpoints & Inference codebase written in Rust (Candle), PyTorch & MLX (Apache license) (https://github.com/kyutai-labs/moshi)

How does Moshi work?

1. Moshi processes two audio streams: one for itself and one for the user, with the user's stream coming from audio input and Moshi's stream generated by the model.

2. Along with these audio streams, Moshi predicts text tokens for its speech, enhancing its generation quality.

3. The model uses a small Depth Transformer for codebook dependencies and a large 7B parameter Temporal Transformer for temporal dependencies.

4. The theoretical latency is 160ms, with a practical latency of around 200ms on an L4 GPU.

Model size & inference:

Moshiko/ka are 7.69B param models

bf16 ~16GB VRAM
8-bit ~8GB VRAM
4-bit ~4GB VRAM

You can run inference via Candle πŸ¦€, PyTorch and MLX - based on your hardware.

The Kyutai team, @adefossez @lmz and team are cracked AF, they're bringing some serious firepower to the open source/ science AI scene, looking forward to what's next! 🐐
  • 1 reply
Β·
reacted to vikhyatk's post with πŸ”₯ 3 months ago
view post
Post
4163
Pushed a new update to vikhyatk/moondream2 today. TextVQA up from 60.2 to 65.2, DocVQA up from 61.9 to 70.5.

Space has been updated to the new model if you want to try it out! vikhyatk/moondream2
reacted to lamhieu's post with πŸ‘ 4 months ago
view post
Post
2114
🀯 Ghost 8B Beta emerges as a clear leader, surpassing even proprietary models like xAI Grok 1, OpenAI GPT 3.5, and Mistral Mixtral 8x7B. This dominance extends to its parity with Mistral Medium, further solidifying its position as a top-tier language model. Furthermore, Ghost 8B Beta stands out as one of only three models employing the zero-shot method for evaluation, alongside Claude 2 and Claude 3, showcasing its unique capabilities and potential for groundbreaking applications.
---
πŸ’¬ Chat with the model here:
- Playground with Ghost 8B Beta (Ξ², 8k): lamhieu/ghost-8b-beta-8k
- Playground with Ghost 8B Beta (Ξ², 128k): lamhieu/ghost-8b-beta-128k
- Official website: https://ghost-x.org/docs/models/ghost-8b-beta/
  • 2 replies
Β·
reacted to KnutJaegersberg's post with πŸ‘€ 4 months ago
reacted to joohaeng's post with πŸ”₯ 4 months ago
reacted to Ihor's post with πŸ”₯ 4 months ago
view post
Post
887
πŸš€ Meet Our New Line of Efficient and Accurate Zero-Shot Classifiers! πŸš€

The new architecture brings better inter-label understanding and can solve complex classification tasks at a single forward pass.

Key Applications:
βœ… Multi-class classification (up to 100 classes in a single run)
βœ… Topic classification
βœ… Sentiment analysis
βœ… Event classification
βœ… Prompt-based constrained classification
βœ… Natural Language Inference
βœ… Multi- and single-label classification

knowledgator/gliclass-6661838823756265f2ac3848
knowledgator/GLiClass_SandBox
knowledgator/gliclass-base-v1.0-lw
reacted to WizardLM's post with β€οΈπŸ‘πŸš€ 4 months ago
view post
Post
9246
πŸ”₯ πŸ”₯πŸ”₯
Excited to announce WizardLM new Paper: Auto Evol-Instruct!

🐦 Twitter: https://x.com/WizardLM_AI/status/1812857977122202087

πŸ“ƒ Paper: https://arxiv.org/pdf/2406.00770

πŸ€– 1. Fully AI-Powered Pipeline

Auto Evol-Instruct automatically involves an iterative process of optimizing an Evol-Instruct V1 into an optimal one. The pipeline consists of two critical stages: Evol Trajectory Analysis, where the optimizer LLM analyzes the issues and failures exposed in instruction evolution performed by the evol LLM, and Evolving Method Optimization, where the optimizer LLM addresses these issues to progressively develop an effective evolving method. The optimal evolving method is then used to convert the entire instruction dataset into more diverse and complex forms, facilitating improved instruction tuning.

πŸ“ˆ2. Scaling Evol-Instruct with Arena Learning

With Auto Evol-Instruct, the evolutionary synthesis data of WizardLM-2 has scaled up from WizardLM-1 to dozens of domains, covering tasks in all aspects of large language models. This allows Arena Learning to train and learn from an almost infinite pool of high-difficulty instruction data, fully unlocking all the potential of Arena Learning.
  • 1 reply
Β·