simplewiki dataset

#1
by nyuuzyou - opened

Hello,

The dataset linked in the Space returns a 404: https://huggingface.co/datasets/HuggingFaceTB/simplewiki-pruned-350k.

@HuggingFaceTB doesn't have any public simplewiki datasets. Is that intentional?

Thanks

Hugging Face Smol Models Research org

It's public now

parallel_eval/README.md says to run wget https://huggingface.co/datasets/HuggingFaceTB/simplewiki-pruned-text-350k/blob/main/wikihop.db -o wikihop.db (note the typo: -o should be -O), but that file still doesn't exist.
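Also note that /blob/main/ URLs return the HTML page rather than the raw file; the raw-download path is /resolve/main/. Once the file actually exists, something like this should fetch it too; the repo id and filename are copied from the README, so treat them as assumptions:

from huggingface_hub import hf_hub_download

# Repo id and filename taken from the README; both may change
db_path = hf_hub_download(
    repo_id="HuggingFaceTB/simplewiki-pruned-text-350k",
    filename="wikihop.db",
    repo_type="dataset",
)
print(db_path)  # local path of the cached download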

Vibed this to make wikihop.db:

import sqlite3
import json
from datasets import load_dataset

# Load dataset
dataset = load_dataset("HuggingFaceTB/simplewiki-pruned-350k")

# Connect to the SQLite database (creates wikihop.db if it doesn't exist)
conn = sqlite3.connect('wikihop.db')
cursor = conn.cursor()

# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS core_articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    links_json TEXT NOT NULL
)
''')

# Insert data into table
for example in dataset['train']:
    title = example['article']
    links = example['links']
    links_json = json.dumps(links)  # Convert list to JSON string
    
    cursor.execute('''
    INSERT INTO core_articles (title, links_json)
    VALUES (?, ?)
    ''', (title, links_json))

# Commit changes and close connection
conn.commit()
conn.close()
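As a quick sanity check on the resulting file, something like this works against the schema above (the table and column names are just the ones defined in the script, nothing official):

import json
import sqlite3

conn = sqlite3.connect('wikihop.db')
cursor = conn.cursor()

# Count rows and peek at one article's outgoing links
cursor.execute('SELECT COUNT(*) FROM core_articles')
print('articles:', cursor.fetchone()[0])

cursor.execute('SELECT title, links_json FROM core_articles LIMIT 1')
title, links_json = cursor.fetchone()
print(title, json.loads(links_json)[:5])  # first few links

conn.close()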

I haven't started on what I actually want to do with this data yet, so I'd appreciate it if you could post an update about things like this.

I'm trying some GRPO reinforcement learning on wiki racing. So far without much result, but I'm launching a bigger training run to see what happens: https://github.com/phhusson/llm-rl/blob/main/grpo-wikiracing.py
