metadata
title: PDF Q&A Dataset Generator
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
PDF Q&A Dataset Generator
A Gradio application that generates Q&A datasets from PDF documents using instruction-tuned language models.
Features
- PDF Processing: Automatically extract and chunk text from uploaded PDFs
- Q&A Generation: Create questions, answers, tags, and difficulty levels
- Multiple Models: Choose from various instruction-tuned models
- Customization: Configure number of questions, tags, and difficulty settings
- Multiple Output Formats: Export datasets as JSON, CSV, or Excel
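As a rough illustration of the export feature, the sketch below serializes a list of Q&A records to JSON and CSV using only the standard library. The record layout (question, answer, tags, difficulty) is inferred from the feature list, and the function names to_json and to_csv are hypothetical; the app's actual export code may differ. Excel output is left out because it would need a third-party library.

```python
import csv
import io
import json

# Assumed record layout, based on the feature list (not taken from app.py).
FIELDS = ["question", "answer", "tags", "difficulty"]


def to_json(records):
    """Serialize Q&A records as a pretty-printed JSON array."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Serialize Q&A records as CSV, joining each tag list with semicolons."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = dict(record)
        row["tags"] = ";".join(record["tags"])  # flatten the list into one cell
        writer.writerow(row)
    return buffer.getvalue()


records = [
    {
        "question": "What is a chunk?",
        "answer": "A slice of the PDF text.",
        "tags": ["pdf", "chunking"],
        "difficulty": "easy",
    },
]
print(to_json(records))
print(to_csv(records))
```

Joining tags with a semicolon keeps each record on one CSV row; any other separator that does not appear in the tags would work as well.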
How It Works
This application:
- Extracts text from uploaded PDFs
- Splits the content into manageable chunks to maintain context
- Uses instruction-tuned language models to generate Q&A pairs with tags
- Combines the generated pairs into a single dataset ready for export
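The chunking step can be sketched as follows. This is a minimal example that assumes word-based chunks with a small overlap so that context carries across chunk boundaries; the function name chunk_text and the max_words and overlap parameters are hypothetical, and the strategy actually used in app.py is not described here. PDF text extraction is omitted because it depends on a third-party library.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears with some surrounding context.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, max_words=4, overlap=1):
    print(chunk)
```

Each chunk would then be passed to the selected model to generate Q&A pairs.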
Use Cases
- Creating educational resources and assessment materials
- Generating training data for Q&A systems
- Building flashcard datasets for studying
- Developing content for educational applications
- Preparing comprehension testing materials
Getting Started
Local Installation
git clone https://github.com/your-username/pdf-qa-generator.git
cd pdf-qa-generator
pip install -r requirements.txt
python app.py
Using on Hugging Face Spaces
- Duplicate this Space to your account
- Upload your PDFs
- Configure your settings
- Generate your Q&A dataset
Enabling GPU on Hugging Face Spaces
To enable GPU acceleration on Hugging Face Spaces:
- Uncomment the # import spaces line at the top of app.py
- Uncomment the # @spaces.GPU decorator above the process_pdf_generate_qa function
- Save and redeploy your Space with GPU hardware selected
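After both lines are uncommented, the relevant parts of app.py would look roughly like the sketch below. The function body and its pdf_path parameter are placeholders, not the real implementation, and the try/except fallback is an addition (not part of the steps above) so that the same file also runs on machines where the spaces package is not installed.

```python
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Assumption for local runs: fall back to a decorator that does nothing.
    def gpu(func):
        return func


@gpu
def process_pdf_generate_qa(pdf_path):
    # Placeholder body; the real function extracts text and generates Q&A pairs.
    return f"processed {pdf_path}"


print(process_pdf_generate_qa("example.pdf"))
```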
Models
The app includes a selection of instruction-tuned language models:
- databricks/dolly-v2-3b (default)
- databricks/dolly-v2-7b
- EleutherAI/gpt-neo-1.3B
- EleutherAI/gpt-neo-2.7B
- tiiuae/falcon-7b-instruct
License
MIT