Taalmodel WIM Pipeline: From Text to Knowledge Graphs

Overview

The WIM (Leesplank) pipeline transforms unstructured Dutch text into structured knowledge graphs using the Schema.org vocabulary. This overview traces the complete data flow from raw sources through distillation to trained models.

Pipeline Architecture

Key Components

1. Source Data

Dutch Wikipedia: the wikimedia/wikipedia dataset (20231101.nl snapshot), matched against Schema.org classes to create a 12.6K-article dataset covering most Schema.org classes (UWV/wim-schema-org-wiki-articles).
Schema.org: the complete vocabulary (schemaorg-all.ttl, in Turtle format), used for class definitions and embeddings.

2. Pipeline Nodes

N1: Extracts ~20 entities per document
N2: Maps each entity to Schema.org type (highest volume)
N3: Generates JSON-LD with ~40K token contexts
N4: Validates output (not stored in datasets)
N5: Adds Dutch taxonomy labels (optional)

The source code for this pipeline is available on GitHub: WIM Signaalberichten
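
To make the node sequence concrete, here is a minimal sketch of how N1 through N4 might chain together. It is not the project's actual code: the function names, the `KNOWN_ENTITIES` lookup table, and the validation rule are all hypothetical stand-ins for the LLM-backed nodes, and the optional N5 step is left out.

```python
import json

SCHEMA_CONTEXT = "https://schema.org"

# Hypothetical lookup table standing in for the LLM-backed N1/N2 nodes.
KNOWN_ENTITIES = {"Amsterdam": "City", "UWV": "GovernmentOrganization"}


def n1_extract_entities(text):
    """Toy N1: return the known names that occur in the text."""
    return [name for name in KNOWN_ENTITIES if name in text]


def n2_map_type(entity):
    """Toy N2: map an entity to a Schema.org type, defaulting to Thing."""
    return KNOWN_ENTITIES.get(entity, "Thing")


def n3_generate_jsonld(typed_entities):
    """Toy N3: wrap (name, type) pairs in a JSON-LD document."""
    return {
        "@context": SCHEMA_CONTEXT,
        "@graph": [{"@type": t, "name": n} for n, t in typed_entities],
    }


def n4_validate(doc):
    """Toy N4: check the context and that every node has a type and a name."""
    if doc.get("@context") != SCHEMA_CONTEXT:
        return False
    return all("@type" in node and "name" in node for node in doc.get("@graph", []))


def run_pipeline(text):
    """Run N1 -> N2 -> N3 -> N4 and return the validated JSON-LD document."""
    entities = n1_extract_entities(text)
    typed = [(entity, n2_map_type(entity)) for entity in entities]
    doc = n3_generate_jsonld(typed)
    if not n4_validate(doc):
        raise ValueError("generated JSON-LD failed validation")
    return doc


if __name__ == "__main__":
    result = run_pipeline("Het UWV heeft een kantoor in Amsterdam.")
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

Keeping each node as a separate function mirrors the pipeline's design, in which every stage can be backed by a different model.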

3. Distillation Process

Captures all LLM calls (GPT-4.1, GPT-4o, o4-mini) made during pipeline execution
Stores the captured calls in SQLite, with parallel processing
Exports them as instruction-following datasets
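
The capture-and-export idea can be sketched with Python's standard `sqlite3` module. The table layout (`llm_calls`), the column names, and the `instruction`/`output` record shape are assumptions for illustration. The actual schema and export format are not described here, and parallel processing is omitted.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite store for captured LLM calls."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS llm_calls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            node TEXT NOT NULL,
            model TEXT NOT NULL,
            prompt TEXT NOT NULL,
            response TEXT NOT NULL
        )
        """
    )
    return conn


def record_call(conn, node, model, prompt, response):
    """Store one prompt/response pair; the with-block commits the insert."""
    with conn:
        conn.execute(
            "INSERT INTO llm_calls (node, model, prompt, response) VALUES (?, ?, ?, ?)",
            (node, model, prompt, response),
        )


def export_instructions(conn, node):
    """Return the captured calls for one node as JSON Lines strings."""
    rows = conn.execute(
        "SELECT prompt, response FROM llm_calls WHERE node = ? ORDER BY id",
        (node,),
    )
    return [
        json.dumps({"instruction": prompt, "output": response}, ensure_ascii=False)
        for prompt, response in rows
    ]
```

Filtering the export by node keeps one training set per pipeline stage, which matches the per-node adapters described later.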

4. Model Training

Uses Unsloth for efficient long-context training
Different LoRA configurations per node complexity
Phi-4-mini base model for all nodes
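
One way to organize per-node training settings is a small lookup table, sketched below. Every number here is a placeholder chosen for illustration, not the project's real hyperparameters, and `LoraSettings`, `NODE_CONFIGS`, and `settings_for` are hypothetical names. The sketch deliberately avoids calling Unsloth itself.

```python
from dataclasses import dataclass

# Base model as named in the text; the exact checkpoint id is not given there.
BASE_MODEL = "Phi-4-mini"


@dataclass(frozen=True)
class LoraSettings:
    rank: int            # LoRA rank (adapter capacity)
    alpha: int           # LoRA scaling factor
    max_seq_length: int  # training context length in tokens


# Placeholder values only: simpler nodes get smaller adapters and shorter contexts.
NODE_CONFIGS = {
    "n1": LoraSettings(rank=16, alpha=32, max_seq_length=8192),
    "n2": LoraSettings(rank=8, alpha=16, max_seq_length=4096),
    "n3": LoraSettings(rank=32, alpha=64, max_seq_length=40960),
}


def settings_for(node):
    """Look up the training settings for a pipeline node."""
    try:
        return NODE_CONFIGS[node]
    except KeyError:
        raise ValueError(f"no LoRA settings defined for node {node!r}") from None
```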

5. Production Models

Specialized adapters for each pipeline stage
Optimized context lengths based on task requirements
Both adapter and merged versions available

Usage

The pipeline enables:

Knowledge Extraction: Convert any Dutch text to structured data
Model Distillation: Create training data from production LLMs
Specialized Models: Train efficient task-specific models
Flexible Deployment: Mix and match models per node based on requirements
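
Mixing models per node can be pictured as a small router that sends each node's prompt to its own backend. The sketch below is an assumption about how such routing might look: `make_router` and the lambda backends are hypothetical, and real backends would call a fine-tuned adapter or a hosted LLM.

```python
from typing import Callable, Dict, Optional

# A backend takes a prompt and returns the model's text response.
Backend = Callable[[str], str]


def make_router(default: Backend, overrides: Optional[Dict[str, Backend]] = None):
    """Build a function that dispatches prompts to a per-node backend."""
    table = dict(overrides or {})

    def route(node: str, prompt: str) -> str:
        # Use the node-specific backend if one is registered, else the default.
        return table.get(node, default)(prompt)

    return route


if __name__ == "__main__":
    # Stand-in backends that only tag the prompt with the model that handled it.
    route = make_router(
        default=lambda p: "large-model: " + p,
        overrides={"n2": lambda p: "small-adapter: " + p},
    )
    print(route("n2", "Type of UWV?"))
    print(route("n3", "Generate JSON-LD"))
```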

Acknowledgments

Schema.org community for the ontology.
Wikimedia Foundation for the Dutch Wikipedia content.
OpenAI for the foundational models used to generate the instructions.

Citation Information

@model{UWV/WIM-Leesplank-overview,
  title={Taalmodel WIM Pipeline: From Text to Knowledge Graphs},
  author={UWV InnovatieHub},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/UWV/WIM-Leesplank-overview}
}
