Quentin Lhoest PRO

lhoestq

AI & ML interests

Maintainer of 🤗 Datasets: NLP, Multimodal data processing and sharing

Recent Activity

published a dataset about 15 hours ago
lhoestq/tmp
liked a dataset about 19 hours ago
BIOMEDICA/biomedica_webdataset_24M
liked a dataset about 19 hours ago
BIOMEDICA/biomedica_webdataset

Articles

Organizations

Hugging Face, WMT: Workshop on Statistical Machine Translation, BigScience Workshop, Neuropark, Hugging Face Internal Testing Organization, Training Transformers Together, BigScience Catalogue Data, OpenSLR, BigScience Data, Evaluation on the Hub, 2023 Jan Offsite hackathon, Datasets Maintainers, Whisper Distillation, Open LLM Leaderboard, huggingPartyParis, CommonCanvas, ZeroGPU Explorers, Datasets examples, Pixel Parsing, HuggingFaceFW-Dev, Infinite Dataset Hub, Hugging Face FineVideo, Dataset ReWriter, Dataset Tools, Rainforest Connection

lhoestq's activity

reacted to merve's post with 🤗🔥 1 day ago
reacted to ariG23498's post with 🚀 1 day ago
reacted to singhsidhukuldeep's post with 🚀 1 day ago
Breaking News: LinkedIn's Content Search Engine Gets a Powerful Semantic Upgrade!

Excited to share insights about LinkedIn's innovative approach to content search, recently detailed in a groundbreaking paper by their Mountain View team. This advancement represents a significant shift from traditional keyword-based search to semantic understanding.

>> Technical Architecture

The new search engine employs a sophisticated two-layer architecture:

Retrieval Layer
- Token Based Retriever (TBR) for exact keyword matching
- Embedding Based Retriever (EBR) using a two-tower model with multilingual-e5 embeddings
- Pre-computed post embeddings stored in a dedicated embedding store for efficient retrieval
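
To make the retrieval layer concrete, here is a minimal sketch of how exact keyword matching and embedding similarity could be combined into one candidate set. Everything in it is an assumption for illustration: the toy posts, the hand-written 3-dimensional vectors standing in for multilingual-e5 embeddings, and the function names are hypothetical, and a brute-force cosine scan replaces the approximate nearest neighbor search a production system would use.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: post id -> text (not taken from the paper).
POSTS = {
    1: "how to negotiate a salary raise",
    2: "tips for remote team meetings",
    3: "asking your manager for a promotion",
}

# Stand-in for the embedding store: post vectors computed ahead of time.
# These 3-d vectors are made up; a real system would use a learned encoder.
EMBEDDING_STORE = {
    1: [0.9, 0.1, 0.0],
    2: [0.0, 0.2, 0.9],
    3: [0.8, 0.3, 0.1],
}


def build_inverted_index(posts):
    """Map each lowercase token to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for token in text.lower().split():
            index[token].add(post_id)
    return index


def token_based_retrieve(index, query):
    """Return ids of posts that contain every query token (exact match)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_based_retrieve(store, query_vec, k):
    """Return the k post ids whose stored vectors are closest to query_vec."""
    ranked = sorted(store, key=lambda pid: cosine(store[pid], query_vec), reverse=True)
    return ranked[:k]


def hybrid_retrieve(index, store, query, query_vec, k=2):
    """Union of keyword matches and the top-k embedding matches."""
    candidates = token_based_retrieve(index, query)
    candidates |= set(embedding_based_retrieve(store, query_vec, k))
    return candidates
```

In this sketch the query vector is passed in directly; in a two-tower setup it would come from the query-side encoder at request time, while the post vectors are looked up from the store.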

Multi-Stage Ranking
- L1 Stage: Initial filtering using a lightweight model
- L2 Stage: Advanced ranking with complex features including:
  - Query-post semantic matching
  - Author reputation analysis
  - User engagement metrics
  - Content freshness evaluation
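
The two ranking stages can be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not say which models or weights LinkedIn uses, so the `Candidate` fields, the `L2_WEIGHTS` values, and the choice of semantic match as the cheap L1 signal are all hypothetical, and a simple weighted sum stands in for the learned L2 model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: int
    semantic_match: float      # query-post semantic similarity, 0..1
    author_reputation: float   # 0..1
    engagement: float          # 0..1
    freshness: float           # 0..1, higher means newer


# Hypothetical weights; a real L2 ranker would be a trained model.
L2_WEIGHTS = {
    "semantic_match": 0.5,
    "author_reputation": 0.2,
    "engagement": 0.2,
    "freshness": 0.1,
}


def l1_filter(candidates, keep):
    """L1: cheap pass that keeps the `keep` best candidates by one signal."""
    return sorted(candidates, key=lambda c: c.semantic_match, reverse=True)[:keep]


def l2_score(candidate):
    """L2: weighted sum over all features (stand-in for a heavier model)."""
    return sum(weight * getattr(candidate, name) for name, weight in L2_WEIGHTS.items())


def rank(candidates, keep=2):
    """Run L1 filtering, then order the survivors by their L2 score."""
    survivors = l1_filter(candidates, keep)
    return sorted(survivors, key=l2_score, reverse=True)
```

The point of the split is cost: the expensive scoring only runs on the candidates that survive the lightweight filter, at the price of occasionally dropping a post that the L2 stage would have scored well.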

>> Performance Improvements

The system has achieved remarkable results:
- 10%+ improvement in both on-topic rate and long-dwell metrics
- Enhanced ability to handle complex natural language queries
- Significant boost in sitewide engagement

This advancement enables LinkedIn to better serve complex queries like "how to ask for a raise?" while maintaining high performance at scale. The system intelligently balances between exact keyword matching and semantic understanding, ensuring optimal results for both navigational and conceptual searches.

What impresses me most is how the team solved the scale challenge - processing billions of posts efficiently using pre-computed embeddings and approximate nearest neighbor search. This is enterprise-scale AI at its finest.
New activity in ai4bharat/FERMAT 2 days ago