Taalmodel WIM Pipeline: From Text to Knowledge Graphs
Overview
The WIM (Leesplank) pipeline transforms unstructured Dutch text into structured knowledge graphs using Schema.org vocabulary. This diagram shows the complete data flow from raw sources through distillation to trained models.
Pipeline Architecture
Key Components
1. Source Data
- Dutch Wikipedia: the wikimedia/wikipedia 20231101.nl dataset, matched with Schema.org classes to create a 12.6K-article dataset covering most Schema.org classes (UWV/wim-schema-org-wiki-articles)
- Schema.org: the complete vocabulary (schemaorg-all.ttl, Schema.org "all" in Turtle format), used for class definitions and embeddings
2. Pipeline Nodes
- N1: Extracts ~20 entities per document
- N2: Maps each entity to Schema.org type (highest volume)
- N3: Generates JSON-LD with ~40K token contexts
- N4: Validates output (not stored in datasets)
- N5: Adds Dutch taxonomy labels (optional)
The source code for this pipeline is available on GitHub: WIM Signaalberichten
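The five nodes above can be pictured as a chain of small functions. The following is a minimal sketch only, not the repository's implementation: every function name, the `TYPE_HINTS` and `DUTCH_LABELS` tables, and the `taxonomyLabel` key are hypothetical, and crude rules stand in for the LLM calls that the real nodes make.

```python
import json
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    schema_type: str = ""  # filled in by N2


# Hypothetical lookup tables standing in for model output.
TYPE_HINTS = {"Amsterdam": "City", "UWV": "GovernmentOrganization"}
DUTCH_LABELS = {"City": "Stad", "GovernmentOrganization": "Overheidsorganisatie"}


def n1_extract_entities(text):
    # Stand-in for N1: treat capitalised words as entities, in first-seen order.
    seen = {}
    for token in text.replace(".", " ").replace(",", " ").split():
        if token[0].isupper() and token not in seen:
            seen[token] = Entity(name=token)
    return list(seen.values())


def n2_map_type(entity):
    # Stand-in for N2: one lookup per entity, falling back to the root type.
    entity.schema_type = TYPE_HINTS.get(entity.name, "Thing")
    return entity


def n3_generate_jsonld(entities):
    # Stand-in for N3: assemble a JSON-LD document with a Schema.org context.
    return {
        "@context": "https://schema.org",
        "@graph": [{"@type": e.schema_type, "name": e.name} for e in entities],
    }


def n4_validate(doc):
    # Stand-in for N4: structural check only; nothing is stored.
    return doc.get("@context") == "https://schema.org" and all(
        "@type" in node and "name" in node for node in doc.get("@graph", [])
    )


def n5_add_taxonomy(doc):
    # Stand-in for N5 (optional): attach a Dutch label where one is known.
    for node in doc["@graph"]:
        label = DUTCH_LABELS.get(node["@type"])
        if label:
            node["taxonomyLabel"] = label
    return doc


def run_pipeline(text, with_taxonomy=False):
    entities = [n2_map_type(e) for e in n1_extract_entities(text)]
    doc = n3_generate_jsonld(entities)
    if not n4_validate(doc):
        raise ValueError("generated JSON-LD failed validation")
    return n5_add_taxonomy(doc) if with_taxonomy else doc


if __name__ == "__main__":
    result = run_pipeline("Het UWV heeft een kantoor in Amsterdam.", with_taxonomy=True)
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

Because N2 runs once per entity while the other nodes run once per document, it is the stage with the highest call volume, which the list comprehension in `run_pipeline` makes visible.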
3. Distillation Process
- Captures all LLM calls (GPT-4.1, GPT-4o, o4-mini) made during pipeline execution
- Stores the captured calls in SQLite, with parallel processing
- Exports them as instruction-following datasets
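The capture-and-export idea can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions rather than the project's actual code: the `llm_calls` table, its columns, the helper names, and the `instruction`/`output` record shape are all hypothetical, and the sketch is single-threaded, so it does not show the parallel processing mentioned above.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    # Hypothetical schema: one row per captured LLM call.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS llm_calls (
               id INTEGER PRIMARY KEY,
               node TEXT NOT NULL,
               model TEXT NOT NULL,
               prompt TEXT NOT NULL,
               response TEXT NOT NULL
           )"""
    )
    return conn


def record_call(conn, node, model, prompt, response):
    # The connection context manager commits on success, rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO llm_calls (node, model, prompt, response) VALUES (?, ?, ?, ?)",
            (node, model, prompt, response),
        )


def export_instructions(conn, node):
    # Return one JSON Lines record per captured call for the given node.
    rows = conn.execute(
        "SELECT prompt, response FROM llm_calls WHERE node = ? ORDER BY id",
        (node,),
    )
    return [
        json.dumps({"instruction": prompt, "output": response}, ensure_ascii=False)
        for prompt, response in rows
    ]


if __name__ == "__main__":
    store = open_store()
    record_call(store, "N1", "gpt-4.1", "Extract entities: ...", '["UWV"]')
    for line in export_instructions(store, "N1"):
        print(line)
```

Keeping the node name on every row lets a separate training set be exported for each pipeline stage from the same database.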
4. Model Training
- Uses Unsloth for efficient long-context training
- Different LoRA configurations per node complexity
- Phi-4-mini base model for all nodes
5. Production Models
- Specialized adapters for each pipeline stage
- Optimized context lengths based on task requirements
- Both adapter and merged versions available
Usage
The pipeline enables:
- Knowledge Extraction: Convert any Dutch text to structured data
- Model Distillation: Create training data from production LLMs
- Specialized Models: Train efficient task-specific models
- Flexible Deployment: Mix and match models per node based on requirements
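Mixing and matching models per node can be expressed as a small lookup with overrides. The sketch below is hypothetical throughout: `ModelChoice`, `plan_deployment`, and the `example-org/...` identifiers are placeholders, not the names of the published models, and only three nodes are listed for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model_id: str
    merged: bool  # True: merged weights; False: adapter loaded on the base model


# Placeholder defaults; real model identifiers are not given here.
DEFAULTS = {
    "N1": ModelChoice("example-org/n1-adapter", merged=False),
    "N2": ModelChoice("example-org/n2-adapter", merged=False),
    "N3": ModelChoice("example-org/n3-adapter", merged=False),
}


def plan_deployment(overrides=None):
    # Start from the defaults and swap in any per-node overrides.
    plan = dict(DEFAULTS)
    for node, choice in (overrides or {}).items():
        if node not in DEFAULTS:
            raise KeyError(f"unknown pipeline node: {node}")
        plan[node] = choice
    return plan


if __name__ == "__main__":
    plan = plan_deployment({"N2": ModelChoice("example-org/n2-merged", merged=True)})
    for node, choice in sorted(plan.items()):
        print(node, choice.model_id, "merged" if choice.merged else "adapter")
```

A plan like this keeps the choice between adapter and merged weights a per-node decision, so a high-volume node can use a different variant than the long-context one.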
Acknowledgments
- Schema.org community for the ontology.
- Wikimedia Foundation for the Dutch Wikipedia content.
- OpenAI for the foundational models used to generate the instructions.
Citation Information
@misc{UWV/WIM-Leesplank-overview,
title={Taalmodel WIM Pipeline: From Text to Knowledge Graphs},
author={UWV InnovatieHub},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/UWV/WIM-Leesplank-overview}
}