# BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

## Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. The model excels at generating high-quality abstracts and summaries from scientific papers and research content.

## File Structure

```
bsg_cyllama/
├── scientific_model_production_v2/          # Trained model files
│   ├── config.json                          # Model configuration
│   ├── prompt_generator.pt                  # Prompt generation utilities
│   └── model/                               # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv   # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py                # Training script
├── scientific_model_inference2.py           # Inference utilities
├── bsg_training_data_gen.py                 # Data generation pipeline
├── compile_complete_training_data.py        # Data compilation script
├── upload_to_huggingface.py                 # HF upload utilities
└── run_upload.py                            # Simple upload runner
```
## Prerequisites

1. **Python Environment**:

   ```
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   ```

2. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (recommended)
   - 16GB+ system RAM
   - CUDA support for optimal performance

## Installation

1. **Clone/download the repository**:

   ```bash
   git clone <your-repo-url>
   cd bsg_cyllama
   ```

2. **Activate your environment** (if using a virtual environment), so the dependencies land inside it:

   ```bash
   source ~/myenv/bin/activate
   ```

3. **Install dependencies**:

   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```
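Before going further, it helps to confirm that the core dependencies import cleanly and the GPU is visible:

```python
import torch, transformers, peft

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("peft", peft.__version__)
print("CUDA available:", torch.cuda.is_available())
```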
## Usage

### 1. Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in half precision
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach the LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"
    # Move inputs to the model's device (device_map="auto" may place it on GPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # generation budget, independent of prompt length
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
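A quick smoke test of the helper above (the abstract text is an invented placeholder, not taken from the dataset):

```python
abstract = (
    "We study the thermal stability of engineered protein scaffolds under "
    "varying pH conditions and report a twofold increase in unfolding "
    "temperature after directed mutagenesis."  # placeholder abstract
)
print(generate_summary(abstract, max_new_tokens=150))
```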
### 2. Using the Inference Script

```bash
python scientific_model_inference2.py
```

### 3. Training from Scratch

```bash
python bsg_cyllama_trainer_v2.py
```

## Dataset Information

The complete training dataset contains **19,174 records** with the following structure:

- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization

### Loading the Dataset

```python
import pandas as pd

# Load the complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")
print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
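Each record pairs `OriginalText` (the input) with `AbstractSummary` (the target). A minimal sketch of how a supervised pair could be assembled, reusing the prompt template from the Basic Inference example (bsg_cyllama_trainer_v2.py is the authoritative source for the real template):

```python
def to_training_example(row):
    # Hypothetical helper: mirrors the inference-time prompt so the model
    # sees the same structure during training and generation.
    prompt = f"Summarize the following scientific text:\n\n{row['OriginalText']}\n\nSummary:"
    return {"prompt": prompt, "completion": " " + row["AbstractSummary"]}

example = to_training_example(df.iloc[0])
print(example["prompt"][:300])
print(example["completion"][:300])
```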
## Model Configuration

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Training Samples**: 19,174
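Expressed as a PEFT `LoraConfig`, the settings above look roughly like this. This is a sketch for orientation, not the exact trainer code; `lora_dropout` is an assumed value, and `model/adapter_config.json` holds the authoritative configuration:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,              # LoRA rank
    lora_alpha=256,     # scaling factor (alpha / rank = 2)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.05,  # assumption; not stated in this guide
    task_type="CAUSAL_LM",
)

# base_model as loaded in the Basic Inference example
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```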
## Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

1. **Set up your token**:

   ```bash
   # Your token is already configured in the script; to use a different account, run:
   huggingface-cli login
   ```

2. **Run the upload**:

   ```bash
   python run_upload.py
   ```

3. **Enter your HF username** when prompted.

This will create two repositories:

- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)
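If you prefer not to use `run_upload.py`, the equivalent can be done directly with `huggingface_hub` (a minimal sketch; `your-username` is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
username = "your-username"  # placeholder

# Model repository: upload the LoRA adapter files
api.create_repo(repo_id=f"{username}/bsg-cyllama", exist_ok=True)
api.upload_folder(
    folder_path="./scientific_model_production_v2/model",
    repo_id=f"{username}/bsg-cyllama",
)

# Dataset repository: upload the training TSV
api.create_repo(repo_id=f"{username}/bsg-cyllama-training-data",
                repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```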
## Performance Tips

1. **For better performance**:
   - Use GPU inference
   - Adjust the temperature (0.5-0.8 for more focused summaries)
   - Experiment with `max_new_tokens` based on the summary length you need

2. **Memory optimization** (see the sketch after this list):
   - Use torch.float16 for inference
   - Enable gradient checkpointing for training
   - Use smaller batch sizes if needed
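Both memory levers in one place (the checkpointing call only matters during training):

```python
import torch
from transformers import AutoModelForCausalLM

# Half precision roughly halves weight memory compared with float32
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Training only: recompute activations during backward instead of caching them
model.gradient_checkpointing_enable()
```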
## Troubleshooting

1. **CUDA out of memory**:
   - Reduce the batch size
   - Fall back to CPU inference (see the sketch after this list)
   - Enable gradient checkpointing

2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"` (quote the spec so the shell does not treat `>=` as a redirect)
   - Install missing dependencies: `pip install peft sentence-transformers`

3. **Model loading issues**:
   - Verify file paths
   - Check model file integrity
   - Ensure proper permissions
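For the CPU fallback mentioned under CUDA out of memory, load the model without `device_map` so the weights stay on the CPU (a sketch mirroring the Basic Inference example; expect much slower generation):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# No device_map: everything stays on the CPU, avoiding CUDA memory limits entirely
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float32,  # CPUs generally run float32 faster than float16
)
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
```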
## Example Applications

1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## Support

For questions, issues, or collaboration:

1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team

---

**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)