Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Abstract
Language models are trained mainly on massive text data from the Internet, making it increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora -- counting string appearances and retrieving the enclosing documents -- yet their high storage overhead hinders application to Internet-scale data. We present Infini-gram mini, an efficient and scalable system that makes petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes whose size is only 44% of the corpus. Infini-gram mini greatly improves upon the best existing FM-index implementation in indexing speed (18× faster) and in memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models trained on such data. We host a benchmark contamination bulletin to share the contamination rates of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
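The counting primitive the abstract refers to is the textbook FM-index backward search over the Burrows-Wheeler transform (BWT). The Python sketch below is a minimal illustration of that technique, not Infini-gram mini's code: the naive suffix-array construction and uncompressed rank tables are simplifications that a petabyte-scale system replaces with compressed, disk-friendly structures.

```python
# Illustrative FM-index "count" query via backward search over the BWT.
# Sketch only: real indexes build the suffix array with linear-time algorithms
# and store the BWT and rank (Occ) tables in compressed form.

def build_fm_index(text: str):
    text += "\0"  # unique sentinel, lexicographically smallest character
    # Suffix array via naive sorting (fine for a small example).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in the text strictly smaller than c.
    freq = {}
    for ch in bwt:
        freq[ch] = freq.get(ch, 0) + 1
    C, running = {}, 0
    for ch in sorted(freq):
        C[ch] = running
        running += freq[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed rank table).
    occ = {ch: [0] * (len(bwt) + 1) for ch in freq}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)

def count_occurrences(pattern: str, C, occ, n: int) -> int:
    """Return how many times `pattern` occurs, via backward search."""
    lo, hi = 0, n  # current suffix-array range [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

C, occ, n = build_fm_index("abracadabra")
print(count_occurrences("abra", C, occ, n))  # -> 2
```

Retrieving the enclosing documents corresponds to the FM-index locate operation, which additionally consults sampled suffix-array positions to map matches back to their offsets in the corpus.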
Community
A “mini” version of infini-gram.
A highly compressed index (about 12× lower storage requirements), optimized for massive-scale indexing and efficient serving. Free to use via the Web Interface and API. It has already helped uncover evaluation contamination at scale.
Web Interface: https://infini-gram-mini.io/demo
API Endpoint: https://infini-gram-mini.io/docs
Source code: https://infini-gram-mini.io/code
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Long LEM Query in BWT-Runs Space (2025)
- Dynamic r-index: An Updatable Self-Index for Highly Repetitive Strings (2025)
- ELITE: Embedding-Less retrieval with Iterative Text Exploration (2025)
- SSCard: Substring Cardinality Estimation using Suffix Tree-Guided Learned FM-Index (2025)
- Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models (2025)
- Engineering Fast and Space-Efficient Recompression from SLP-Compressed Text (2025)
- Fast and memory-efficient BWT construction of repetitive texts using Lyndon grammars (2025)