AI & ML interests
OpenDataLab provides high-quality open datasets and tools for large models. It is the designated open-source data service platform of the China Large Model Corpus Data Alliance.
Papers
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
The Trinity of Consistency as a Defining Principle for General World Models
🔬OpenDataLab: Building the AI-Ready Data Foundry — From Foundational Corpora to Scientific Intelligence
The OpenDataLab team has long worked on both the research frontier and the engineering practice of AI data. To meet the data lifecycle requirements of large model pre-training, fine-tuning, and evaluation, we have built end-to-end expertise spanning unstructured data parsing, multimodal alignment, knowledge system construction, and large-scale data engineering. On this foundation, we have developed and open-sourced a suite of core tools, including the MinerU high-fidelity document parsing engine, the LabelU/LabelLLM intelligent annotation system, and the OmniDocBench evaluation framework, and have distilled our data construction work into high-quality public datasets such as the "WanJuan" corpus. These outputs reflect our data methodology and scientific rigor.
🚀 As the AI for Science (AI4S) paradigm reshapes the boundaries of scientific discovery, we are systematically extending our established capabilities into scientific intelligence. **Sciverse** is our strategic vision and AI-ready data foundry paradigm, purpose-built for scientific AI. It addresses the core bottlenecks that hold scientific models back in complex research scenarios: the inability to parse complex structures, disentangle logical relationships, and execute rigorous reasoning. Sciverse delivers a systematic solution through a progressive, three-tiered architecture:
- 🧱 SciBase (Scientific Knowledge Substrate): We forge a pristine, structured, and trustworthy foundation of general scientific knowledge.
- 🔗 SciAlign (Scientific Cross-Modal Alignment Layer): We bridge the semantic gap, aligning cross-modal scientific entities into coherent data representations.
- 🧠 Sci-Evo (Scientific Evolution Layer): We infuse the data with the dynamic logic of reasoning required for genuine scientific discovery.
⚙️Centered around this paradigm, we are continuously crystallizing corresponding data products, processing tools, and engineering solutions. Sciverse represents the systematic extension of OpenDataLab’s data intelligence into the scientific domain.
🎯 From pioneering general-purpose corpora to forging the substrate for scientific AI, we remain steadfast in our commitment to defining the data paradigms that will power the next generation of intelligence. We are more than tool providers; we are cartographers mapping the ever-expanding frontier of AI data.
If you have any questions or run into any problems, please feel free to contact us at OpenDataLab@pjlab.org.cn.
Spaces (7)
MinerU OCR
A data extraction tool to convert PDF to Markdown and JSON
MinerU Diffusion V1 0320 2.5B
Demo of MinerU-Diffusion
TRivia-3B
Convert table images into HTML tags with TRivia-3B
CDM
Evaluate formula recognition accuracy
DocLayout YOLO
Demo for DocLayout-YOLO
