---
title: PDF Q&A Dataset Generator
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
# PDF Q&A Dataset Generator
A Gradio application that generates Q&A datasets from PDF documents using instruction-tuned language models.
## Features
- **PDF Processing**: Automatically extract and chunk text from uploaded PDFs
- **Q&A Generation**: Create questions, answers, tags, and difficulty levels
- **Multiple Models**: Choose from various instruction-tuned models
- **Customization**: Configure number of questions, tags, and difficulty settings
- **Multiple Output Formats**: Export datasets as JSON, CSV, or Excel
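
To illustrate the export feature, here is a minimal sketch of serializing generated records to JSON and CSV using only the standard library. The record schema (`question`, `answer`, `tags`, `difficulty`) and the helper names `to_json` and `to_csv` are assumptions made for this example; they are not taken from `app.py`, which may structure its output differently. Excel export is left out because it needs a third-party library.

```python
import csv
import io
import json

# Hypothetical column order for the example; the real app may differ.
FIELDS = ("question", "answer", "tags", "difficulty")


def to_json(records):
    """Serialize a list of Q&A records to a pretty-printed JSON string."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records, fieldnames=FIELDS):
    """Serialize Q&A records to CSV text, joining list values with ';'."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for name in fieldnames:
            value = record.get(name, "")
            # CSV cells are flat, so list values such as tags are joined.
            row[name] = ";".join(value) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {
            "question": "What does the app generate?",
            "answer": "Q&A datasets",
            "tags": ["overview", "basics"],
            "difficulty": "easy",
        }
    ]
    print(to_json(sample))
    print(to_csv(sample))
```

Joining tags with `;` keeps each record on a single CSV row; a different separator works equally well as long as the consumer of the file agrees on it.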
## How It Works
This application:
1. Extracts text from uploaded PDFs
2. Splits the content into manageable chunks to maintain context
3. Uses instruction-tuned language models to generate Q&A pairs with tags
4. Combines these into a comprehensive dataset ready for use
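
The chunking step (step 2) can be sketched as a simple sliding window over the extracted text. This is only an illustration under stated assumptions: the function name `chunk_text`, the character-based splitting, and the default `chunk_size` and `overlap` values are hypothetical, and the app itself may split on sentences or tokens instead.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears with some surrounding context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

The overlap trades a little redundancy for continuity: each chunk is passed to the model on its own, so repeating the tail of the previous chunk reduces the chance that a question depends on text the model never saw.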
## Use Cases
- Creating educational resources and assessment materials
- Generating training data for Q&A systems
- Building flashcard datasets for studying
- Developing content for educational applications
- Preparing comprehension testing materials
## Getting Started
### Local Installation
```bash
git clone https://github.com/your-username/pdf-qa-generator.git
cd pdf-qa-generator
pip install -r requirements.txt
python app.py
```
### Using on Hugging Face Spaces
1. Duplicate this Space to your account
2. Upload your PDFs
3. Configure your settings
4. Generate your Q&A dataset
### Enabling GPU on Hugging Face Spaces
To enable GPU acceleration on Hugging Face Spaces:
1. Uncomment the `# import spaces` line at the top of `app.py`
2. Uncomment the `# @spaces.GPU` decorator above the `process_pdf_generate_qa` function
3. Save and redeploy your Space with GPU hardware selected
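
After steps 1 and 2, the relevant part of `app.py` would look roughly like the sketch below. The function signature and body are placeholders, since the actual parameters of `process_pdf_generate_qa` are not shown here. The `try`/`except` fallback is an addition for this example only, so that the snippet also runs on a machine where the `spaces` package is not installed.

```python
try:
    import spaces  # step 1: previously "# import spaces"
except ImportError:
    # Assumption for local runs: stand in a no-op decorator when the
    # `spaces` package is unavailable. Not part of the original app.
    from types import SimpleNamespace

    spaces = SimpleNamespace(GPU=lambda func: func)


@spaces.GPU  # step 2: previously "# @spaces.GPU"
def process_pdf_generate_qa(pdf_file, num_questions=3):
    """Placeholder body; the real function runs the model on the PDF."""
    return []


if __name__ == "__main__":
    print(process_pdf_generate_qa("example.pdf"))
```

The decorator only marks the function as needing a GPU; the hardware itself still has to be selected in the Space settings, as described in step 3.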
## Models
The app includes a selection of instruction-tuned language models:
- `databricks/dolly-v2-3b` (default)
- `databricks/dolly-v2-7b`
- `EleutherAI/gpt-neo-1.3B`
- `EleutherAI/gpt-neo-2.7B`
- `tiiuae/falcon-7b-instruct`
## License
MIT |