MoritzLaurer
posted an update Apr 3, 2024
🆕 Releasing a new series of 8 zeroshot classifiers: better performance, fully commercially usable thanks to synthetic data, up to 8192 tokens, run on any hardware.

Summary:
🤖 The zeroshot-v2.0-c series replaces commercially restrictive training data with synthetic data generated with mistralai/Mixtral-8x7B-Instruct-v0.1 (Apache 2.0). All models are released under the MIT license.
🦾 The best model performs 17 percentage points better across 28 tasks than facebook/bart-large-mnli (the most downloaded commercially friendly baseline).
🌍 The series includes a multilingual variant fine-tuned from BAAI/bge-m3 for zeroshot classification in 100+ languages and with a context window of 8192 tokens.
🪶 The models are small at 0.2 - 0.6 B parameters, so they run on any hardware. The base-size models are more than 2x faster than bart-large-mnli while performing significantly better.
🤝 The models are not generative LLMs; they are efficient encoder-only models specialized in zeroshot classification through the universal NLI task.
🤑 For users where commercially restrictive training data is not an issue, I've also trained variants with even more human data for improved performance.
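
As a rough illustration of the NLI framing mentioned above, here is a minimal sketch of how zeroshot classification can work in the single-label case: each candidate label is turned into a hypothesis sentence, the model scores how strongly the text entails each hypothesis, and a softmax over those entailment scores gives label probabilities. The helper names, the hypothesis template, and the example logits are assumptions made for illustration. The `classify` function uses the Transformers `zero-shot-classification` pipeline with the bge-m3 model ID that appears in this thread; any other model from the collection should be usable the same way.

```python
import math


def build_hypotheses(labels, template="This text is about {}"):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def entailment_to_probs(entailment_logits):
    """Softmax over per-label entailment logits (single-label case)."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text, labels, model_name="MoritzLaurer/bge-m3-zeroshot-v2.0"):
    """Run the Transformers zero-shot pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    return classifier(
        text,
        labels,
        hypothesis_template="This text is about {}",  # illustrative template
        multi_label=False,
    )


if __name__ == "__main__":
    labels = ["politics", "economy", "sports"]
    print(build_hypotheses(labels))
    # Made-up entailment logits, only to show the normalization step.
    print(entailment_to_probs([2.0, 0.5, -1.0]))
```

Calling `classify("Angela Merkel is a politician in Germany", ["politics", "economy", "sports"])` should return a dict with `sequence`, `labels`, and `scores` keys, with the labels sorted by score.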

Next steps:
✍️ I'll publish a blog post with more details soon.
🔮 There are several improvements I'm planning for v2.1. The multilingual model in particular has room for improvement.

All models are available for download in this Hugging Face collection: MoritzLaurer/zeroshot-classifiers-6548b4ff407bb19ff5c3ad6f

These models are an extension of the approach explained in this paper, but with additional synthetic data: https://arxiv.org/abs/2312.17543

Looking forward to your blog post! It's always exciting to see solid non-generative models.

Hello. The model is very interesting.
The model "MoritzLaurer/bge-m3-zeroshot-v2.0" is said to support up to 8192 tokens and to run on all hardware, but in Python I get the message "Token indices sequence length is longer than the specified maximum sequence length for this model (1615 > 512). Running this sequence through the model will result in indexing errors". Is there a solution to this error?
Thank you for your answer.

@HAMRONI Can you share the full inference code that caused this error? You can open a discussion in the model repo.