# BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

## Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. The model excels at generating high-quality abstracts and summaries from scientific papers and research content.

## File Structure

```
bsg_cyllama/
├── scientific_model_production_v2/          # Trained model files
│   ├── config.json                          # Model configuration
│   ├── prompt_generator.pt                  # Prompt generation utilities
│   └── model/                               # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv   # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py                # Training script
├── scientific_model_inference2.py           # Inference utilities
├── bsg_training_data_gen.py                 # Data generation pipeline
├── compile_complete_training_data.py        # Data compilation script
├── upload_to_huggingface.py                 # HF upload utilities
└── run_upload.py                            # Simple upload runner
```
## Prerequisites

1. **Python Environment**:

   ```
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   ```

2. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (recommended)
   - 16GB+ system RAM
   - CUDA support for optimal performance

## Installation

1. **Clone/download the repository**:

   ```bash
   git clone <your-repo-url>
   cd bsg_cyllama
   ```

2. **Activate your environment** (if using a virtual environment), so the dependencies land inside it:

   ```bash
   source ~/myenv/bin/activate
   ```

3. **Install dependencies**:

   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```
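Before going further, it helps to confirm that the core dependencies import cleanly and the GPU is visible:

```python
import torch, transformers, peft

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("peft", peft.__version__)
print("CUDA available:", torch.cuda.is_available())
```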
## Usage

### 1. Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in half precision
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach the LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"
    # Move inputs to the model's device (device_map="auto" may place it on GPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # generation budget, independent of prompt length
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
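A quick smoke test of the helper above (the abstract text is an invented placeholder, not taken from the dataset):

```python
abstract = (
    "We study the thermal stability of engineered protein scaffolds under "
    "varying pH conditions and report a twofold increase in unfolding "
    "temperature after directed mutagenesis."  # placeholder abstract
)
print(generate_summary(abstract, max_new_tokens=150))
```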
### 2. Using the Inference Script

```bash
python scientific_model_inference2.py
```

### 3. Training from Scratch

```bash
python bsg_cyllama_trainer_v2.py
```

## Dataset Information

The complete training dataset contains **19,174 records** with the following structure:

- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization

### Loading the Dataset

```python
import pandas as pd

# Load the complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")
print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
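Each record pairs `OriginalText` (the input) with `AbstractSummary` (the target). A minimal sketch of how a supervised pair could be assembled, reusing the prompt template from the Basic Inference example (bsg_cyllama_trainer_v2.py is the authoritative source for the real template):

```python
def to_training_example(row):
    # Hypothetical helper: mirrors the inference-time prompt so the model
    # sees the same structure during training and generation.
    prompt = f"Summarize the following scientific text:\n\n{row['OriginalText']}\n\nSummary:"
    return {"prompt": prompt, "completion": " " + row["AbstractSummary"]}

example = to_training_example(df.iloc[0])
print(example["prompt"][:300])
print(example["completion"][:300])
```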
## Model Configuration

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Training Samples**: 19,174
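Expressed as a PEFT `LoraConfig`, the settings above look roughly like this. This is a sketch for orientation, not the exact trainer code; `lora_dropout` is an assumed value, and `model/adapter_config.json` holds the authoritative configuration:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,              # LoRA rank
    lora_alpha=256,     # scaling factor (alpha / rank = 2)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.05,  # assumption; not stated in this guide
    task_type="CAUSAL_LM",
)

# base_model as loaded in the Basic Inference example
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```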
## Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

1. **Set up your token**:

   ```bash
   # Your token is already configured in the script; to use a different account, run:
   huggingface-cli login
   ```

2. **Run the upload**:

   ```bash
   python run_upload.py
   ```

3. **Enter your HF username** when prompted.

This will create two repositories:

- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)
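If you prefer not to use `run_upload.py`, the equivalent can be done directly with `huggingface_hub` (a minimal sketch; `your-username` is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
username = "your-username"  # placeholder

# Model repository: upload the LoRA adapter files
api.create_repo(repo_id=f"{username}/bsg-cyllama", exist_ok=True)
api.upload_folder(
    folder_path="./scientific_model_production_v2/model",
    repo_id=f"{username}/bsg-cyllama",
)

# Dataset repository: upload the training TSV
api.create_repo(repo_id=f"{username}/bsg-cyllama-training-data",
                repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```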
## Performance Tips

1. **For better performance**:
   - Use GPU inference
   - Adjust the temperature (0.5-0.8 for more focused summaries)
   - Experiment with `max_new_tokens` based on the summary length you need

2. **Memory optimization** (see the sketch after this list):
   - Use torch.float16 for inference
   - Enable gradient checkpointing for training
   - Use smaller batch sizes if needed
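Both memory levers in one place (the checkpointing call only matters during training):

```python
import torch
from transformers import AutoModelForCausalLM

# Half precision roughly halves weight memory compared with float32
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Training only: recompute activations during backward instead of caching them
model.gradient_checkpointing_enable()
```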
## Troubleshooting

1. **CUDA out of memory**:
   - Reduce the batch size
   - Fall back to CPU inference (see the sketch after this list)
   - Enable gradient checkpointing

2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"` (quote the spec so the shell does not treat `>=` as a redirect)
   - Install missing dependencies: `pip install peft sentence-transformers`

3. **Model loading issues**:
   - Verify file paths
   - Check model file integrity
   - Ensure proper permissions
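For the CPU fallback mentioned under CUDA out of memory, load the model without `device_map` so the weights stay on the CPU (a sketch mirroring the Basic Inference example; expect much slower generation):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# No device_map: everything stays on the CPU, avoiding CUDA memory limits entirely
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float32,  # CPUs generally run float32 faster than float16
)
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
```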
## Example Applications

1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## Support

For questions, issues, or collaboration:

1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team

---

**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)