Open-Source AI Meetup

community

AI & ML interests

Open science and open source

SFEvent's activity

tomaarsen 
posted an update 3 days ago
view post
Post
4092
🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📕 my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
📏 No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
📐 Linear instead of exponential complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: allow you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1
  • 1 reply
·
tomaarsen 
posted an update 18 days ago
view post
Post
2819
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.
freddyaboulton 
posted an update about 1 month ago
AtAndDev 
posted an update about 1 month ago
view post
Post
419
@s3nh Hey man check your discord! Got some news.
  • 4 replies
·
freddyaboulton 
posted an update about 1 month ago
freddyaboulton 
posted an update about 1 month ago
view post
Post
1997
Version 0.0.21 of gradio-pdf now properly loads chinese characters!
freddyaboulton 
posted an update about 1 month ago
view post
Post
1557
Hello Llama 3.2! 🗣️🦙

Build a Siri-like coding assistant that responds to "Hello Llama" in 100 lines of python! All with Gradio, webRTC 😎

freddyaboulton/hey-llama-code-editor
freddyaboulton 
posted an update about 1 month ago
julien-c 
posted an update about 1 month ago
view post
Post
8426
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
·
julien-c 
posted an update about 2 months ago
view post
Post
2624
wow 😮

INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.

PrimeIntellect/INTELLECT-1-Instruct
tomaarsen 
posted an update 2 months ago
view post
Post
5669
I just released Sentence Transformers v3.3.0 & it's huge! 4.5x speedup for CPU with OpenVINO int8 static quantization, training with prompts for a free perf. boost, PEFT integration, evaluation on NanoBEIR, and more! Details:

1. We integrate Post-Training Static Quantization using OpenVINO, a very efficient solution for CPUs that processes 4.78x as many texts per second on average, while only hurting performance by 0.36% on average. There's a new export_static_quantized_openvino_model method to quantize a model.

2. We add the option to train with prompts, e.g. strings like "query: ", "search_document: " or "Represent this sentence for searching relevant passages: ". It's as simple as using the prompts argument in SentenceTransformerTrainingArguments. Our experiments show that you can easily reach 0.66% to 0.90% relative performance improvement on NDCG@10 at no extra cost by adding "query: " before each training query and "document: " before each training answer.

3. Sentence Transformers now supports training PEFT adapters via 7 new methods for adding new adapters or loading pre-trained ones. You can also directly load a trained adapter with SentenceTransformer as if it's a normal model. Very useful for e.g. 1) training multiple adapters on 1 base model, 2) training bigger models than otherwise possible, or 3) cheaply hosting multiple models by switching multiple adapters on 1 base model.

4. We added easy evaluation on NanoBEIR, a subset of BEIR a.k.a. the MTEB Retrieval benchmark. It contains 13 datasets with 50 queries and up to 10k documents each. Evaluation is fast, and can easily be done during training to track your model's performance on general-purpose information retrieval tasks.

Additionally, we also deprecate Python 3.8, add better compatibility with Transformers v4.46.0, and more. Read the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.3.0
tomaarsen 
posted an update 3 months ago
view post
Post
6973
📣 Sentence Transformers v3.2.0 is out, marking the biggest release for inference in 2 years! 2 new backends for embedding models: ONNX (+ optimization & quantization) and OpenVINO, allowing for speedups up to 2x-3x AND Static Embeddings for 500x speedups at 10-20% accuracy cost.

1️⃣ ONNX Backend: This backend uses the ONNX Runtime to accelerate model inference on both CPU and GPU, reaching up to 1.4x-3x speedup depending on the precision. We also introduce 2 helper methods for optimizing and quantizing models for (much) faster inference.
2️⃣ OpenVINO Backend: This backend uses Intel their OpenVINO instead, outperforming ONNX in some situations on CPU.

Usage is as simple as SentenceTransformer("all-MiniLM-L6-v2", backend="onnx"). Does your model not have an ONNX or OpenVINO file yet? No worries - it'll be autoexported for you. Thank me later 😉

🔒 Another major new feature is Static Embeddings: think word embeddings like GLoVe and word2vec, but modernized. Static Embeddings are bags of token embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings that don't require any neural networks. They're initialized in one of 2 ways:

1️⃣ via Model2Vec, a new technique for distilling any Sentence Transformer models into static embeddings. Either via a pre-distilled model with from_model2vec or with from_distillation where you do the distillation yourself. It'll only take 5 seconds on GPU & 2 minutes on CPU, no dataset needed.
2️⃣ Random initialization. This requires finetuning, but finetuning is extremely quick (e.g. I trained with 3 million pairs in 7 minutes). My final model was 6.6% worse than bge-base-en-v1.5, but 500x faster on CPU.

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.2.0
Documentation on Speeding up Inference: https://sbert.net/docs/sentence_transformer/usage/efficiency.html
  • 1 reply
·
tomaarsen 
posted an update 4 months ago
view post
Post
2016
I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data.

⛏ Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives

🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows as not all third-party libraries were updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1

I'm looking forward to releasing v3.2, I have some exciting things planned 🚀
tomaarsen 
posted an update 4 months ago
view post
Post
2057
🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:

Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes

🔌The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.

➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.

✨ There's some more minor changes too, like max_steps and eval_max_steps now being a hard limit instead of an approximate one, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.

Check out the full release notes here: https://github.com/huggingface/setfit/releases/tag/v1.1.0
Or read the documentation: https://huggingface.co/docs/setfit
Or check out the public SetFit models for inspiration: https://huggingface.co/models?library=setfit&sort=created

P.s. the model in the code snippet trained in 1 minute and it can classify ~6000 sentences per second on my GPU.
tomaarsen 
posted an update 4 months ago
view post
Post
3771
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions and docs changes. Here's the details:

⛏ Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
🧩 Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
✨ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
🐛 Bug fixes: Too many to name here, check out the release notes!
📝 Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.

Check out the full release notes here ⭐: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
·
multimodalart 
posted an update 6 months ago
tomaarsen 
posted an update 7 months ago
view post
Post
3939
@Omartificial-Intelligence-Space has trained and released 6 Arabic embedding models for semantic similarity. 4 of them outperform all previous models on the STS17 Arabic-Arabic task!

📚 Trained on a large dataset of 558k Arabic triplets translated from the AllNLI triplet dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
6️⃣ 6 different base models: AraBERT, MarBERT, LaBSE, MiniLM, paraphrase-multilingual-mpnet-base, mpnet-base, ranging from 109M to 471M parameters.
🪆 Trained with a Matryoshka loss, allowing you to truncate embeddings with minimal performance loss: smaller embeddings are faster to compare.
📈 Outperforms all commonly used multilingual models like intfloat/multilingual-e5-large, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, and sentence-transformers/LaBSE.

Check them out here:
- Omartificial-Intelligence-Space/Arabic-mpnet-base-all-nli-triplet
- Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-labse-Matryoshka
- Omartificial-Intelligence-Space/Marbert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet
Or the collection with all: Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e

My personal favourite is likely Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka: a very efficient 135M parameters & scores #1 on mteb/leaderboard.
  • 1 reply
·