	---
	title: PDF Q&A Dataset Generator
	emoji: 📚
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	sdk_version: 5.29.0
	app_file: app.py
	pinned: false
	---

	# PDF Q&A Dataset Generator

	A Gradio application that generates Q&A datasets from PDF documents using instruction-tuned language models.

	## Features

	- PDF Processing: Automatically extract and chunk text from uploaded PDFs
	- Q&A Generation: Create questions, answers, tags, and difficulty levels
	- Multiple Models: Choose from several open language models, most of them instruction-tuned
	- Customization: Configure number of questions, tags, and difficulty settings
	- Multiple Output Formats: Export datasets as JSON, CSV, or Excel
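
As a rough illustration of the output described above, the sketch below shows one possible record layout and how it could be serialized to JSON and CSV with the standard library. The field names (`question`, `answer`, `tags`, `difficulty`) and the helpers `to_json` and `to_csv` are assumptions made for this example, not the app's actual schema or functions. Excel export is left out because it needs a third-party package.

```python
import csv
import io
import json

# Hypothetical record layout; the app's real field names may differ.
records = [
    {
        "question": "What does the app extract from a PDF?",
        "answer": "Its text, which is then split into chunks.",
        "tags": ["pdf", "extraction"],
        "difficulty": "easy",
    },
]


def to_json(rows):
    """Serialize the records as a pretty-printed JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows):
    """Serialize the records as CSV, joining the tag list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["question", "answer", "tags", "difficulty"]
    )
    writer.writeheader()
    for row in rows:
        # CSV cells are flat strings, so the tag list is joined with ";".
        writer.writerow({**row, "tags": ";".join(row["tags"])})
    return buffer.getvalue()
```

Joining tags with `;` keeps one row per question in the CSV; a different separator or one column per tag would work equally well.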

	## How It Works

	This application:

	1. Extracts text from uploaded PDFs
	2. Splits the content into manageable chunks, each small enough to process while keeping its local context
	3. Uses instruction-tuned language models to generate Q&A pairs with tags
	4. Combines these into a comprehensive dataset ready for use
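
Steps 2 and 4 above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in `app.py`: the word-based chunking with overlap, the parameter names `max_words` and `overlap`, and the helpers `chunk_text` and `build_dataset` are all hypothetical. PDF text extraction and the model call are omitted because this README does not name the libraries used; `generate_qa` stands in for any callable that returns Q&A pairs for a chunk.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def build_dataset(chunks, generate_qa):
    """Run a Q&A generator over every chunk and flatten the results."""
    dataset = []
    for index, chunk in enumerate(chunks):
        for pair in generate_qa(chunk):
            # Record which chunk each pair came from.
            dataset.append({"chunk": index, **pair})
    return dataset
```

The overlap repeats a few words at each chunk boundary, so a sentence that straddles two chunks is still seen whole by at least one of them.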

	## Use Cases

	- Creating educational resources and assessment materials
	- Generating training data for Q&A systems
	- Building flashcard datasets for studying
	- Developing content for educational applications
	- Preparing comprehension testing materials

	## Getting Started

	### Local Installation

	```bash
	git clone https://github.com/your-username/pdf-qa-generator.git
	cd pdf-qa-generator
	pip install -r requirements.txt
	python app.py
	```

	### Using on Hugging Face Spaces

	1. Duplicate this Space to your account
	2. Upload your PDFs
	3. Configure your settings
	4. Generate your Q&A dataset

	### Enabling GPU on Hugging Face Spaces

	To enable GPU acceleration on Hugging Face Spaces:

	1. Uncomment the `# import spaces` line at the top of `app.py`
	2. Uncomment the `# @spaces.GPU` decorator above the `process_pdf_generate_qa` function
	3. Save and redeploy your Space with GPU hardware selected
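
For orientation, the sketch below shows roughly what the decorated function could look like once GPU support is on. It deliberately uses a different pattern from the uncommenting steps above: a `try`/`except` import, so the same file also runs locally where the `spaces` package is not installed. The function signature and body are placeholders assumed for this example; the real `process_pdf_generate_qa` lives in `app.py`.

```python
try:
    import spaces  # available on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Local fallback (an assumption of this sketch): a no-op decorator.
    def gpu(func):
        return func


@gpu
def process_pdf_generate_qa(pdf_path, num_questions=3):
    # Placeholder body; parameters here are hypothetical.
    return {"pdf": pdf_path, "num_questions": num_questions}
```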

	## Models

	The app includes a selection of language models. The Dolly and Falcon entries are instruction-tuned; the GPT-Neo entries are base models without instruction tuning:

	- `databricks/dolly-v2-3b` (default)
	- `databricks/dolly-v2-7b`
	- `EleutherAI/gpt-neo-1.3B`
	- `EleutherAI/gpt-neo-2.7B`
	- `tiiuae/falcon-7b-instruct`
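
One way to wire this list into code is sketched below. The names `MODEL_CHOICES`, `resolve_model`, and `load_generator` are hypothetical and not taken from `app.py`. The sketch assumes the `transformers` package and its `pipeline("text-generation", ...)` entry point; some models may need extra arguments, and loading any of them downloads several gigabytes of weights.

```python
MODEL_CHOICES = [
    "databricks/dolly-v2-3b",  # default
    "databricks/dolly-v2-7b",
    "EleutherAI/gpt-neo-1.3B",
    "EleutherAI/gpt-neo-2.7B",
    "tiiuae/falcon-7b-instruct",
]
DEFAULT_MODEL = MODEL_CHOICES[0]


def resolve_model(name=None):
    """Return a supported model id, falling back to the default."""
    if name is None:
        return DEFAULT_MODEL
    if name not in MODEL_CHOICES:
        raise ValueError(f"unsupported model: {name}")
    return name


def load_generator(name=None):
    """Build a text-generation pipeline (downloads weights on first use)."""
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline("text-generation", model=resolve_model(name))
```

Validating the name before loading avoids starting a large download for a typo.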

	## License

	MIT