---

title: PDF Q&A Dataset Generator
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---


# PDF Q&A Dataset Generator

A Gradio application that generates Q&A datasets from PDF documents using instruction-tuned language models.

## Features

- **PDF Processing**: Automatically extract and chunk text from uploaded PDFs
- **Q&A Generation**: Create questions, answers, tags, and difficulty levels
- **Multiple Models**: Choose from various instruction-tuned models
- **Customization**: Configure number of questions, tags, and difficulty settings
- **Multiple Output Formats**: Export datasets as JSON, CSV, or Excel
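As a rough illustration of the export feature, the sketch below writes a list of Q&A records to JSON and CSV using only the standard library. The field names (`question`, `answer`, `tags`, `difficulty`), the `export_dataset` helper, and the file stem are assumptions made for this example, not the schema or functions used in `app.py`. Excel export is left out because it needs a third-party package.

```python
import csv
import json
from pathlib import Path

# Assumed record fields; the real app may name or order them differently.
FIELDS = ["question", "answer", "tags", "difficulty"]


def export_dataset(rows, out_dir, stem="qa_dataset"):
    """Write rows to <stem>.json and <stem>.csv inside out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # JSON keeps tags as a list.
    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # CSV is flat, so tags are joined into a single "; "-separated cell.
    csv_path = out / f"{stem}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            flat = {name: row[name] for name in FIELDS}
            flat["tags"] = "; ".join(row["tags"])
            writer.writerow(flat)

    return json_path, csv_path
```

Joining tags into one cell keeps the CSV to one row per question; a consumer that needs the list back can split on `"; "`.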

## How It Works

This application:

1. Extracts text from uploaded PDFs
2. Splits the content into manageable chunks to maintain context
3. Uses instruction-tuned language models to generate Q&A pairs with tags
4. Combines the generated pairs into a single dataset that can be exported as JSON, CSV, or Excel
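The chunking step (step 2) could look something like the sketch below, which splits already-extracted text into word-based chunks that overlap so that context carries across chunk boundaries. The `chunk_text` name, the word-based splitting, and the default sizes are assumptions for illustration; the README does not say how `app.py` actually chunks text.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so a sentence cut at a
    boundary still appears with some surrounding context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be passed to the model in its own prompt. A larger overlap preserves more context at the cost of more chunks and more generation calls.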

## Use Cases

- Creating educational resources and assessment materials
- Generating training data for Q&A systems
- Building flashcard datasets for studying
- Developing content for educational applications
- Preparing comprehension testing materials

## Getting Started

### Local Installation

```bash
git clone https://github.com/your-username/pdf-qa-generator.git
cd pdf-qa-generator
pip install -r requirements.txt
python app.py
```

### Using on Hugging Face Spaces

1. Duplicate this Space to your account
2. Upload your PDFs
3. Configure your settings
4. Generate your Q&A dataset

### Enabling GPU on Hugging Face Spaces

To enable GPU acceleration on Hugging Face Spaces:

1. Uncomment the `# import spaces` line at the top of `app.py`
2. Uncomment the `# @spaces.GPU` decorator above the `process_pdf_generate_qa` function
3. Save and redeploy your Space with GPU hardware selected
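As a variation on the steps above, the sketch below shows one way to keep a single `app.py` that runs both locally and on a GPU Space: it uses `spaces.GPU` when the `spaces` package is importable and falls back to a no-op decorator otherwise. The `gpu_decorator` name and the function signature and body are placeholders for illustration; the real `process_pdf_generate_qa` in `app.py` is not shown in this README and will differ.

```python
try:
    import spaces  # available on Hugging Face Spaces

    gpu_decorator = spaces.GPU
except ImportError:
    # Local fallback (assumption): run the function unchanged.
    def gpu_decorator(fn):
        return fn


@gpu_decorator
def process_pdf_generate_qa(pdf_path: str, num_questions: int = 3) -> list:
    """Placeholder body; the real function extracts text and calls the model."""
    return [
        {"question": f"Placeholder question {i + 1} for {pdf_path}", "answer": ""}
        for i in range(num_questions)
    ]
```

With this pattern, nothing needs to be commented or uncommented when moving between a local checkout and a Space.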

## Models

The app includes a selection of instruction-tuned language models:

- `databricks/dolly-v2-3b` (default)
- `databricks/dolly-v2-7b`
- `EleutherAI/gpt-neo-1.3B`
- `EleutherAI/gpt-neo-2.7B`
- `tiiuae/falcon-7b-instruct`
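For readers who want to try these checkpoints outside the app, a minimal sketch using the `transformers` text-generation pipeline might look like this. The `MODEL_CHOICES`, `resolve_model`, and `load_generator` names are hypothetical, and `app.py` may load models differently (for example with custom dtype or device settings). The `transformers` import is deferred so the model list can be used without the library installed.

```python
from typing import Optional

# Model IDs as listed in this README; the first one is the default.
MODEL_CHOICES = [
    "databricks/dolly-v2-3b",
    "databricks/dolly-v2-7b",
    "EleutherAI/gpt-neo-1.3B",
    "EleutherAI/gpt-neo-2.7B",
    "tiiuae/falcon-7b-instruct",
]
DEFAULT_MODEL = MODEL_CHOICES[0]


def resolve_model(name: Optional[str] = None) -> str:
    """Return a supported model ID, falling back to the default when name is None."""
    if name is None:
        return DEFAULT_MODEL
    if name not in MODEL_CHOICES:
        raise ValueError(f"Unsupported model: {name!r}")
    return name


def load_generator(name: Optional[str] = None):
    """Build a text-generation pipeline for the chosen model.

    Downloads the model weights on first use, which can take several GB.
    """
    from transformers import pipeline  # imported lazily; requires transformers

    return pipeline("text-generation", model=resolve_model(name))
```

The 7B models need considerably more memory than the 3B default, so the smaller checkpoints are the safer starting point on CPU-only hardware.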

## License

MIT