Taalmodel WIM Pipeline: From Text to Knowledge Graphs

Overview

The WIM (Leesplank) pipeline transforms unstructured Dutch text into structured knowledge graphs using Schema.org vocabulary. This diagram shows the complete data flow from raw sources through distillation to trained models.

Pipeline Architecture

[Figure: WIM Pipeline Architecture diagram]

Key Components

1. Source Data

  • Dutch Wikipedia: the wikimedia/wikipedia 20231101.nl dataset, matched against Schema.org classes to create a 12.6K-article dataset covering most Schema.org classes (UWV/wim-schema-org-wiki-articles)
  • Schema.org: the complete vocabulary (schemaorg-all.ttl), used for class definitions and embeddings (Schema.org All in Turtle format)
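The article-to-class matching can be pictured as an embedding nearest-neighbor search. A minimal sketch of that idea, assuming embedding-based matching; the vectors and class list below are toy values, not the embeddings actually used:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings for a few Schema.org classes (hypothetical values).
class_embeddings = {
    "Person": [0.9, 0.1, 0.0],
    "Place": [0.1, 0.9, 0.1],
    "Organization": [0.5, 0.2, 0.8],
}

def best_class(article_embedding):
    """Return the Schema.org class whose embedding is closest to the article's."""
    return max(class_embeddings, key=lambda c: cosine(article_embedding, class_embeddings[c]))

print(best_class([0.85, 0.15, 0.05]))  # → Person
```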

2. Pipeline Nodes

  • N1: Extracts ~20 entities per document
  • N2: Maps each entity to Schema.org type (highest volume)
  • N3: Generates JSON-LD with ~40K token contexts
  • N4: Validates output (not stored in datasets)
  • N5: Adds Dutch taxonomy labels (optional)

The source code for this pipeline is available on GitHub: WIM Signaalberichten
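The five-node flow above can be sketched as a chain of functions. This is a hypothetical illustration only; the function names, toy extraction rules, and signatures are assumptions, not the actual WIM Signaalberichten code:

```python
def n1_extract_entities(text):
    # N1: extract entity mentions (the real node targets ~20 per document;
    # this toy rule just takes capitalized words).
    return [w for w in text.split() if w[0].isupper()]

def n2_map_to_schema(entities):
    # N2: map each entity to a Schema.org type (toy rule in this sketch).
    return {e: "Person" if e.istitle() else "Thing" for e in entities}

def n3_generate_jsonld(typed_entities):
    # N3: build a JSON-LD graph from the typed entities.
    return {
        "@context": "https://schema.org",
        "@graph": [{"@type": t, "name": e} for e, t in typed_entities.items()],
    }

def n4_validate(doc):
    # N4: validate the output (validation results are not stored in the datasets).
    return "@context" in doc and all("@type" in n for n in doc["@graph"])

def run_pipeline(text):
    entities = n1_extract_entities(text)
    typed = n2_map_to_schema(entities)
    doc = n3_generate_jsonld(typed)
    assert n4_validate(doc)
    return doc

print(run_pipeline("Willem woont in Amsterdam"))
```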

3. Distillation Process

  • Captures every LLM call (GPT-4.1, GPT-4o, o4-mini) made during pipeline execution
  • Stores the captured calls in SQLite, with parallel processing
  • Exports them as instruction-following datasets
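The capture-and-export loop can be sketched with the standard-library sqlite3 module. The table and column names here are assumptions for illustration, not the actual WIM schema:

```python
import json
import sqlite3

# In-memory database standing in for the on-disk capture store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls (node TEXT, model TEXT, prompt TEXT, completion TEXT)"
)

def record_call(node, model, prompt, completion):
    """Store one LLM call made during pipeline execution."""
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?, ?)", (node, model, prompt, completion)
    )

def export_instructions():
    """Export the captured calls as an instruction-following dataset."""
    rows = conn.execute("SELECT prompt, completion FROM llm_calls")
    return [{"instruction": p, "output": c} for p, c in rows]

record_call("N2", "gpt-4.1", "Map 'Amsterdam' to a Schema.org type.", "Place")
print(json.dumps(export_instructions(), indent=2))
```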

4. Model Training

  • Uses Unsloth for efficient long-context training
  • Different LoRA configurations per node complexity
  • Phi-4-mini base model for all nodes
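"Different LoRA configurations per node complexity" could be expressed as plain per-node settings layered on a shared base model. The ranks, context lengths, and base-model id below are illustrative placeholders, not the values actually used in training:

```python
# Hypothetical per-node training settings; rank and context-length values
# are placeholders, not the actual WIM configurations.
lora_configs = {
    "N1": {"lora_rank": 16, "max_seq_length": 8192},   # entity extraction
    "N2": {"lora_rank": 32, "max_seq_length": 4096},   # highest-volume node
    "N3": {"lora_rank": 64, "max_seq_length": 40960},  # ~40K-token contexts
    "N5": {"lora_rank": 16, "max_seq_length": 8192},   # taxonomy labels
}

# Assumed Hub id for the Phi-4-mini base model shared by all nodes.
BASE_MODEL = "microsoft/Phi-4-mini-instruct"

def training_args(node):
    """Combine the shared base model with node-specific LoRA settings."""
    return {"base_model": BASE_MODEL, **lora_configs[node]}

print(training_args("N3"))
```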

5. Production Models

  • Specialized adapters for each pipeline stage
  • Optimized context lengths based on task requirements
  • Both adapter and merged versions available
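Since each node ships as both an adapter and a merged model, a deployment can resolve one repo id per node. The naming pattern below is a hypothetical convention, not the actual repository ids on the Hub:

```python
# Hypothetical naming pattern for the published models; the actual
# repository ids on the Hugging Face Hub may differ.
def model_id(node, merged=False):
    """Return a model repo id for a pipeline node (pattern is assumed)."""
    suffix = "merged" if merged else "adapter"
    return f"UWV/wim-{node.lower()}-{suffix}"

print(model_id("N2"))               # → UWV/wim-n2-adapter
print(model_id("N3", merged=True))  # → UWV/wim-n3-merged
```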

Usage

The pipeline enables:

  1. Knowledge Extraction: Convert any Dutch text to structured data
  2. Model Distillation: Create training data from production LLMs
  3. Specialized Models: Train efficient task-specific models
  4. Flexible Deployment: Mix and match models per node based on requirements
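The structured data produced by knowledge extraction is JSON-LD against the Schema.org vocabulary. A minimal hand-written example of that output shape; the entity values are invented for illustration:

```python
import json

# A hand-written JSON-LD document of the kind the pipeline produces;
# the entity itself is invented for illustration.
doc = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "UWV",
    "location": {"@type": "Place", "name": "Amsterdam"},
}
print(json.dumps(doc, ensure_ascii=False, indent=2))
```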

Acknowledgments

  • Schema.org community for the ontology.
  • Wikimedia Foundation for the Dutch Wikipedia content.
  • OpenAI for the foundational models used to generate the instructions.

Citation Information

@misc{UWV/WIM-Leesplank-overview,
  title={Taalmodel WIM Pipeline: From Text to Knowledge Graphs},
  author={UWV InnovatieHub},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/UWV/WIM-Leesplank-overview}
}