Ame Vi
Ameeeee
AI & ML interests
None yet
Recent Activity
reacted to tomaarsen's post with 🔥 1 day ago
An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder covering 15 widely spoken European and global languages, designed to be finetuned for retrieval, classification, etc.
🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
⚡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, and XLM-RoBERTa on Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper covering the full recipe, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.
Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* https://huggingface.co/EuroBERT/EuroBERT-210m
* https://huggingface.co/EuroBERT/EuroBERT-610m
* https://huggingface.co/EuroBERT/EuroBERT-2.1B
The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
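A minimal usage sketch for the base checkpoints, assuming the standard transformers masked-LM API: the `<|mask|>` token string and the `trust_remote_code=True` requirement follow the EuroBERT model cards and may change between releases.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "EuroBERT/EuroBERT-210m"  # also available: EuroBERT-610m, EuroBERT-2.1B
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The Llama-based bi-directional encoder is custom code shipped in the repo,
# hence trust_remote_code=True.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# <|mask|> is the mask token string per the model cards (assumption; verify
# with tokenizer.mask_token on your installed version).
text = "The capital of France is <|mask|>."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # expected: "Paris"
```

For the retrieval and classification use cases above, the same checkpoints would be finetuned with a task head (e.g. a sequence-classification head) rather than the masked-LM head; the release blogpost describes the evaluated setups.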
Ameeeee's activity
Synthetic data: save money, time and carbon with open source

upvoted an article about 2 months ago
Fine-tune ModernBERT for RAG with Synthetic Data

Let's make a generation of amazing image generation models