---

title: PDF Q&A Dataset Generator
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---


# PDF Q&A Dataset Generator

A Gradio application that generates Q&A datasets from PDF documents using instruction-tuned language models.

## Features

- **PDF Processing**: Automatically extract and chunk text from uploaded PDFs
- **Q&A Generation**: Create questions, answers, tags, and difficulty levels
- **Multiple Models**: Choose from various instruction-tuned models
- **Customization**: Configure number of questions, tags, and difficulty settings
- **Multiple Output Formats**: Export datasets as JSON, CSV, or Excel
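
As an illustration of the output formats listed above, here is a minimal sketch of how one generated record might be serialized to JSON and CSV with the Python standard library. The field names (`question`, `answer`, `tags`, `difficulty`), the sample record, and the helper names `to_json` and `to_csv` are assumptions for the example, not the app's confirmed schema or API.

```python
import csv
import io
import json

# Hypothetical record layout; the app's real field names may differ.
FIELDS = ["question", "answer", "tags", "difficulty"]

records = [
    {
        "question": "What is the basic unit of life?",
        "answer": "The cell.",
        "tags": ["biology", "cells"],
        "difficulty": "easy",
    }
]


def to_json(rows: list[dict]) -> str:
    """Serialize records as a pretty-printed JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows: list[dict]) -> str:
    """Serialize records as CSV, joining the tag list with semicolons."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "tags": ";".join(row["tags"])})
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
```

Tags are flattened into a single semicolon-separated cell because CSV has no native list type; JSON keeps them as an array.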

## How It Works

This application:

1. Extracts text from uploaded PDFs
2. Splits the content into manageable chunks to maintain context
3. Uses instruction-tuned language models to generate Q&A pairs with tags
4. Combines the generated pairs into a single dataset that can be exported as JSON, CSV, or Excel
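
The chunking step above could look roughly like the following sketch. It assumes the PDF text has already been extracted into a plain string and splits it into word-based chunks that overlap slightly, so sentences cut at a boundary still appear with some context in the next chunk. The function name `chunk_words` and its parameters `max_words` and `overlap` are hypothetical; the app's actual chunking strategy may differ.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words to preserve context.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
sample_chunks = chunk_words(sample, max_words=4, overlap=1)
```

Each chunk would then be passed to the selected model as part of a prompt asking for question and answer pairs.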

## Use Cases

- Creating educational resources and assessment materials
- Generating training data for Q&A systems
- Building flashcard datasets for studying
- Developing content for educational applications
- Preparing comprehension testing materials

## Getting Started

### Local Installation

```bash
git clone https://github.com/your-username/pdf-qa-generator.git
cd pdf-qa-generator
pip install -r requirements.txt
python app.py
```

### Running on Hugging Face Spaces

1. Duplicate this Space to your account
2. Upload your PDFs
3. Configure your settings
4. Generate your Q&A dataset

### Enabling GPU on Hugging Face Spaces

To enable GPU acceleration on Hugging Face Spaces:

1. Uncomment the `# import spaces` line at the top of `app.py`
2. Uncomment the `# @spaces.GPU` decorator above the `process_pdf_generate_qa` function
3. Save and redeploy your Space with GPU hardware selected
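
For orientation, the sketch below shows roughly what the relevant part of `app.py` might look like once GPU support is enabled. It assumes the `spaces` package is importable on the Space and adds a no-op fallback so the same file still runs locally without it; that fallback, the name `gpu_task`, the function signature, and the placeholder body are all assumptions for illustration and are not taken from the actual `app.py`.

```python
try:
    import spaces  # assumed to be available on Hugging Face Spaces

    gpu_task = spaces.GPU
except ImportError:
    # Hypothetical local fallback: leave the function unchanged.
    def gpu_task(fn):
        return fn


@gpu_task
def process_pdf_generate_qa(pdf_path: str, num_questions: int = 3) -> list[dict]:
    """Placeholder body; the real function runs the model on the PDF text."""
    return [
        {"question": f"Placeholder question {i + 1}", "source": pdf_path}
        for i in range(num_questions)
    ]


results = process_pdf_generate_qa("example.pdf", num_questions=2)
```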

## Models

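The app includes a selection of language models. The Dolly and Falcon entries are instruction-tuned; the GPT-Neo entries are base models without instruction tuning, so their output may follow the Q&A prompt less reliably: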

- `databricks/dolly-v2-3b` (default)
- `databricks/dolly-v2-7b`
- `EleutherAI/gpt-neo-1.3B`
- `EleutherAI/gpt-neo-2.7B`
- `tiiuae/falcon-7b-instruct`

## License

MIT