metadata

title: PDF Q&A Dataset Generator
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false

PDF Q&A Dataset Generator

A Gradio application that generates Q&A datasets from PDF documents using instruction-tuned language models.

Features

PDF Processing: Automatically extract and chunk text from uploaded PDFs
Q&A Generation: Create questions, answers, tags, and difficulty levels
Multiple Models: Choose from various instruction-tuned models
Customization: Configure number of questions, tags, and difficulty settings
Multiple Output Formats: Export datasets as JSON, CSV, or Excel
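
As a rough illustration of the JSON and CSV exports, the sketch below serializes a list of Q&A records with the Python standard library. The record layout (`question`, `answer`, `tags`, `difficulty`) and the helper names `to_json` and `to_csv` are assumptions made for this example, not the app's actual schema or code. Excel export is left out because it needs a third-party library.

```python
import csv
import io
import json

# Hypothetical record layout; the app's real field names may differ.
records = [
    {
        "question": "What is a PDF?",
        "answer": "A portable document format.",
        "tags": ["basics", "formats"],
        "difficulty": "easy",
    },
]


def to_json(rows):
    """Serialize the records as a pretty-printed JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows):
    """Serialize the records as CSV, joining each tag list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["question", "answer", "tags", "difficulty"]
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "tags": ";".join(row["tags"])})
    return buffer.getvalue()
```

Joining tags with `;` keeps each record on one CSV row; a different separator works equally well as long as it cannot appear inside a tag.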

How It Works

This application:

Extracts text from uploaded PDFs
Splits the content into manageable chunks to maintain context
Uses instruction-tuned language models to generate Q&A pairs with tags
Combines the generated pairs into a single dataset ready for export
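
The chunking step above could look roughly like the following. This is a minimal sketch that assumes word-based chunks with a small overlap between neighbours; the app itself may split by characters, tokens, or pages, and the name `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which is one way to preserve context for the model.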

Use Cases

Creating educational resources and assessment materials
Generating training data for Q&A systems
Building flashcard datasets for studying
Developing content for educational applications
Preparing comprehension testing materials

Getting Started

Local Installation

git clone https://github.com/your-username/pdf-qa-generator.git
cd pdf-qa-generator
pip install -r requirements.txt
python app.py

Using on Hugging Face Spaces

Duplicate this Space to your account
Upload your PDFs
Configure your settings
Generate your Q&A dataset

Enabling GPU on Hugging Face Spaces

To enable GPU acceleration on Hugging Face Spaces:

Uncomment the # import spaces line at the top of app.py
Uncomment the # @spaces.GPU decorator above the process_pdf_generate_qa function
Save and redeploy your Space with GPU hardware selected
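
For reference, the sketch below shows one way the `spaces` import and the `@spaces.GPU` decorator can coexist with local runs where the `spaces` package is not installed. It is a variation on the steps above, not the contents of `app.py`: the `gpu` alias, the function signature, and the placeholder body are all assumptions.

```python
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Local fallback when `spaces` is not installed: a no-op decorator.
    def gpu(func):
        return func


@gpu
def process_pdf_generate_qa(pdf_path, num_questions=3):
    """Placeholder body; the real function extracts text and generates Q&A."""
    return {"pdf_path": pdf_path, "num_questions": num_questions}
```

With this pattern the same file runs unchanged on CPU-only machines, while a Space with GPU hardware selected picks up the real decorator.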

Models

The app includes a selection of language models. The Dolly and Falcon entries are instruction-tuned; the GPT-Neo entries are base models without instruction tuning:

databricks/dolly-v2-3b (default)
databricks/dolly-v2-7b
EleutherAI/gpt-neo-1.3B
EleutherAI/gpt-neo-2.7B
tiiuae/falcon-7b-instruct

License

MIT