
Donutanti

AI & ML interests

None yet

Recent Activity

upvoted a collection about 1 month ago
🎭 Avatars
liked a Space 2 months ago
yuntian-deng/o1mini

Organizations

None yet

Donutanti's activity

reacted to m-ric's post with 🔥 4 months ago
Emu3: Next-token prediction conquers multimodal tasks 🔥

This is the most important research in months: we're now very close to having a single architecture to handle all modalities. The folks at Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.

What's the big deal?
🌟 Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And it's only 8B, but really strong:
🖼️ For image generation, it's matching the best specialized models out there, like SDXL.
👁️ In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
🎬 It's the first to nail video generation without using complicated diffusion techniques.

How does it work?
🧩 Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
🔗 Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
🔮 During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame.
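
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not Emu3's actual code: the vocabulary sizes, the `BOI`/`EOI` marker ids, and all function names are hypothetical, and the 8x spatial / 4x temporal downsampling factors are assumptions chosen so that a 512x512 image (or a 4-frame 512x512 clip) comes out to the 4,096 tokens mentioned in the post.

```python
from typing import List, Tuple

# Hypothetical toy sizes, far smaller than any real model's vocabulary.
TEXT_VOCAB_SIZE = 1000
VISION_CODEBOOK_SIZE = 32

# Hypothetical marker ids placed after the text and vision id ranges.
BOI = TEXT_VOCAB_SIZE + VISION_CODEBOOK_SIZE  # begin-of-image marker
EOI = BOI + 1                                 # end-of-image marker


def vision_token_count(height: int, width: int, frames: int = 1,
                       spatial_factor: int = 8,
                       temporal_factor: int = 1) -> int:
    """Count discrete tokens, assuming fixed downsampling along each axis."""
    time_steps = max(1, frames // temporal_factor)
    return (height // spatial_factor) * (width // spatial_factor) * time_steps


def build_sequence(text_ids: List[int], vision_codes: List[int]) -> List[int]:
    """Merge text ids and vision codes into one flat token sequence.

    Vision codes are shifted past the text vocabulary so both modalities
    share a single id space.
    """
    shifted = [TEXT_VOCAB_SIZE + code for code in vision_codes]
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_pairs(sequence: List[int]) -> List[Tuple[int, int]]:
    """Pair each token with the token that follows it (the training target)."""
    return list(zip(sequence[:-1], sequence[1:]))


if __name__ == "__main__":
    print(vision_token_count(512, 512))
    print(vision_token_count(512, 512, frames=4, temporal_factor=4))
    sequence = build_sequence([5, 7], [0, 31])
    print(sequence)
    print(next_token_pairs(sequence))
```

The point of the sketch is that once images are discrete codes in the same id space as words, the training objective does not need to know which modality a token came from.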

Caveats on the results:
👉 In image generation, Emu3 beats SDXL, but it's also much bigger (8B vs 3.5B). It would be more difficult to beat the real diffusion GOAT FLUX-dev.
👉 In vision, the authors also don't show a comparison against all the current SOTA models like Qwen-VL or Pixtral.

This approach is exciting because it's simple (next-token prediction) and scalable (handles all sorts of data)!

Read the paper 👉 Emu3: Next-Token Prediction is All You Need (https://huggingface.co/papers/2409.18869)
upvoted 2 articles 7 months ago
Introduction to 3D Gaussian Splatting
reacted to KingNish's post with 🔥 8 months ago
Microsoft Just Launched 3 Powerful Models

1. Phi 3 Medium (4k and 128k): A 14B instruction-tuned model that outperforms much larger models like Command R+ (104B), GPT-3.5 Turbo, and Gemini Pro, and is highly competitive with top models such as Mixtral 8x22B, Llama 3 70B, and GPT-4.
microsoft/Phi-3-medium-4k-instruct
DEMO: https://huggingface.co/spaces/Walmart-the-bag/Phi-3-Medium

2. Phi 3 Mini Vision 128k: A 4.2-billion-parameter, instruction-tuned vision model that outperforms models such as Llava3 and Claude 3, and gives stiff competition to Gemini 1.0 Pro Vision.
microsoft/Phi-3-vision-128k-instruct

3. Phi 3 Small (8k and 128k): Better than Llama 3 8B, Mixtral 8x7B, and GPT-3.5 Turbo.
microsoft/Phi-3-small-128k-instruct