Spaces:
Running
simplewiki dataset
Hello,
The dataset linked to in Space returns a 404: https://huggingface.co/datasets/HuggingFaceTB/simplewiki-pruned-350k.
The @HuggingFaceTB org doesn't have any public dataset with "simplewiki" in the name — is that intentional?
Thanks
It's public now
parallel_eval/README.md says wget https://huggingface.co/datasets/HuggingFaceTB/simplewiki-pruned-text-350k/blob/main/wikihop.db -o wikihop.db
(there are typos: the flag should be -O, not -o, and /blob/ should be /resolve/ to fetch the raw file), but that file still doesn't exist
I vibe-coded this script to build wikihop.db:
import sqlite3
import json
from datasets import load_dataset
# Load dataset
dataset = load_dataset("HuggingFaceTB/simplewiki-pruned-350k")
# Connect to SQLite database (or create it)
conn = sqlite3.connect('wikihop.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS core_articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
links_json TEXT NOT NULL
)
''')
# Insert data into table
for example in dataset['train']:
    title = example['article']
    links = example['links']
    links_json = json.dumps(links)  # Convert list of links to a JSON string
    cursor.execute('''
        INSERT INTO core_articles (title, links_json)
        VALUES (?, ?)
    ''', (title, links_json))
# Commit changes and close connection
conn.commit()
conn.close()
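To sanity-check the schema, here's a minimal sketch of reading a row back and decoding the JSON link list. It uses an in-memory database with a dummy row (the titles and links are made up); point sqlite3.connect at wikihop.db to query the real file:

```python
import sqlite3
import json

# In-memory database with one dummy row, just to illustrate the round-trip;
# replace ':memory:' with 'wikihop.db' to query the generated file.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS core_articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        links_json TEXT NOT NULL
    )
''')
cursor.execute(
    'INSERT INTO core_articles (title, links_json) VALUES (?, ?)',
    ('Paris', json.dumps(['France', 'Seine', 'Eiffel Tower'])),
)

# Fetch a row and decode links_json back into a Python list.
title, links_json = cursor.execute(
    'SELECT title, links_json FROM core_articles LIMIT 1'
).fetchone()
links = json.loads(links_json)
print(title, links)
conn.close()
```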
I haven't started on what I want to do with this data yet, so I'd appreciate it if you'd post about changes like this.
I'm trying some GRPO reinforcement learning over wiki racing. No real results so far, but I'm launching a bigger training run to see what happens: https://github.com/phhusson/llm-rl/blob/main/grpo-wikiracing.py