|
|
--- |
|
|
license: llama3.2 |
|
|
base_model: meta-llama/Llama-3.2-1B-Instruct |
|
|
model_type: peft |
|
|
library_name: peft |
|
|
tags: |
|
|
- biomedical-summary-generation |
|
|
- cyclical-embeddings |
|
|
- named-entity-extraction |
|
|
- corpus-level-summarization |
|
|
- scientific-summarization |
|
|
- biomedical |
|
|
- research |
|
|
- llama |
|
|
- lora |
|
|
- text-generation |
|
|
- sentence-transformers |
|
|
datasets: |
|
|
- jimnoneill/BSG_CyLlama-training |
|
|
pipeline_tag: text-generation |
|
|
widget: |
|
|
- text: "Generate a biomedical summary from this corpus: [Document 1: Deep learning in medical imaging...] [Document 2: Neural networks for drug discovery...] [Named Entities: CNN, pharmaceutical compounds, medical imaging]" |
|
|
example_title: "BSG CyLlama Corpus Summarization" |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<img src="bsg_cyllama_logo.png" alt="BSG CyLlama Logo" width="200"/> |
|
|
|
|
|
# BSG CyLlama: Biomedical Summary Generation through Cyclical Llama |
|
|
|
|
|
**Revolutionary corpus-level summarization using cyclical embedding averaging with named entity integration** |
|
|
|
|
|
[Model](https://huggingface.co/jimnoneill/BSG_CyLlama) | [Training Dataset](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training) | [License: Llama 3.2](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)
|
|
|
|
|
</div> |
|
|
|
|
|
## What is BSG CyLlama? |
|
|
|
|
|
**BSG CyLlama** stands for **Biomedical Summary Generation through Cyclical Llama**, a novel approach to corpus-level summarization that produces a single summary from a set of related scientific documents rather than from one paper at a time.
|
|
|
|
|
### **The Cyclical Innovation**
|
|
|
|
|
Unlike traditional single-document summarization or RAG systems, BSG CyLlama introduces a **cyclical embedding averaging methodology**: |
|
|
|
|
|
1. **Corpus Input**: Takes a series/corpus of related scientific documents
2. **Cyclical Averaging**: Averages embeddings across all documents in the corpus cyclically
3. **Named Entity Integration**: Concatenates the averaged embedding with an embedding of the key named entities
4. **Summary Generation**: Uses this combined representation to generate a comprehensive summary
|
|
|
|
|
This creates an **approximation embedding document** that captures the collective knowledge of the entire corpus, not just individual papers. |
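
To make the "approximation embedding document" concrete, here is a minimal sketch with placeholder vectors (random stand-ins for real gte-large embeddings, which are 1024-dimensional). A plain mean is used here for brevity; the cyclical weighting itself is defined in the next section. Averaging preserves the embedding dimensionality, and concatenating the entity embedding doubles it:

```python
import numpy as np

# Placeholder 1024-dim vectors standing in for gte-large document embeddings
doc_embeddings = [np.random.rand(1024) for _ in range(3)]

# Plain mean for illustration; BSG CyLlama applies cyclical weights (see below)
averaged = np.mean(doc_embeddings, axis=0)   # shape (1024,): one vector for the whole corpus

# Stand-in for the embedding of the concatenated named-entity string
entity_embedding = np.random.rand(1024)

approx_document = np.concatenate([averaged, entity_embedding])
print(approx_document.shape)                 # (2048,)
```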
|
|
|
|
|
## **Core Methodology: Cyclical Embedding Averaging**
|
|
|
|
|
### The BSG CyLlama Process |
|
|
|
|
|
```python
import numpy as np

# Assumes `gte_large_model` (a SentenceTransformer such as thenlper/gte-large) and
# `bsg_cyllama_model` are already loaded; see "Model Architecture & Integration" below.

def bsg_cyclical_summarization(corpus_documents, named_entities):
    """
    BSG CyLlama's core cyclical averaging methodology

    Args:
        corpus_documents: List of related scientific documents
        named_entities: Key entities extracted from the corpus

    Returns:
        Comprehensive corpus-level summary
    """
    # Step 1: Generate embeddings for each document
    document_embeddings = []
    for doc in corpus_documents:
        embedding = gte_large_model.encode(doc)
        document_embeddings.append(embedding)

    # Step 2: Cyclical averaging of embeddings
    averaged_embedding = cyclical_average(document_embeddings)

    # Step 3: Concatenate with the named-entity embedding
    entity_embedding = gte_large_model.encode(" ".join(named_entities))
    combined_embedding = np.concatenate([averaged_embedding, entity_embedding])

    # Step 4: Generate the corpus-level summary
    summary = bsg_cyllama_model.generate(combined_embedding)

    return summary


def cyclical_average(embeddings_list):
    """
    Cyclically average embeddings to create the approximation document
    """
    n_docs = len(embeddings_list)
    weighted_sum = np.zeros_like(embeddings_list[0])

    for i, embedding in enumerate(embeddings_list):
        # Cyclical weighting ensures balanced representation
        cycle_weight = np.cos(2 * np.pi * i / n_docs) + 1
        weighted_sum += embedding * cycle_weight

    return weighted_sum / n_docs
```
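
To see how the cyclical weighting behaves, the short sketch below (an illustration, not part of the model code) prints the weights `cos(2*pi*i/n) + 1` for a four-document corpus. Because the cosine terms cancel over a full cycle, the weights sum to `n_docs` whenever the corpus has more than one document, so the final division by `n_docs` yields a true weighted average:

```python
import numpy as np

n_docs = 4
weights = [np.cos(2 * np.pi * i / n_docs) + 1 for i in range(n_docs)]

print([round(w, 3) for w in weights])  # [2.0, 1.0, 0.0, 1.0] -- weight depends on position in the series
print(round(sum(weights), 3))          # 4.0 -- equals n_docs, so dividing by n_docs normalizes the sum
```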
|
|
|
|
|
## **Why Cyclical Averaging Works**
|
|
|
|
|
### Traditional Approaches vs. BSG CyLlama |
|
|
|
|
|
**Traditional Single-Document Summarization:**
|
|
- Limited to individual paper insights |
|
|
- Misses cross-document patterns |
|
|
- Cannot synthesize collective knowledge |
|
|
|
|
|
**Standard RAG Systems:**
|
|
- Retrieval-dependent (query-time bottleneck) |
|
|
- Linear combination of retrieved chunks |
|
|
- High computational costs per query |
|
|
|
|
|
**BSG CyLlama Cyclical Approach:**
|
|
- **Corpus-level understanding**: Captures collective document knowledge |
|
|
- **Cyclical weighting**: Ensures balanced representation across documents |
|
|
- **Named entity integration**: Preserves domain-specific terminology |
|
|
- **One-time processing**: No per-query retrieval costs |
|
|
- **Approximation document**: Creates a virtual "meta-document" representing the corpus |
|
|
|
|
|
## **Model Architecture & Integration**
|
|
|
|
|
### Required Components |
|
|
|
|
|
BSG CyLlama requires **both** an embedding model and a generation model working in tandem:
|
|
|
|
|
```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
from sentence_transformers import SentenceTransformer

# 1. Embedding model (REQUIRED for cyclical averaging)
gte_model = SentenceTransformer("thenlper/gte-large")  # 1024-dim embeddings

# 2. BSG CyLlama generation model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")

# 3. Named entity extraction (optional enhancement)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
```
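
As an optional sanity check after loading (a small sketch, assuming the three components above loaded without errors), you can confirm the embedding dimensionality that the cyclical averaging relies on and that the LoRA adapter is attached:

```python
# Optional sanity checks on the loaded components
print(gte_model.get_sentence_embedding_dimension())  # expected: 1024
print(type(bsg_model).__name__)                      # PeftModel (LoRA adapter around the Llama base)
```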
|
|
|
|
|
### Complete BSG CyLlama Implementation |
|
|
|
|
|
```python
# Uses the imports from the previous code block (numpy, torch, transformers, peft, sentence-transformers)

class BSGCyLlamaProcessor:
    """Complete implementation of Biomedical Summary Generation through Cyclical Llama"""

    def __init__(self):
        self.gte_model = SentenceTransformer("thenlper/gte-large")
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
        base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
        self.bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")

    def extract_named_entities(self, corpus_documents):
        """Extract key biomedical entities from the corpus"""
        # Combine all corpus text
        combined_text = " ".join(corpus_documents)

        # Basic implementation (capitalized tokens); can be replaced with a
        # specialized biomedical NER model such as BioBERT or SciBERT
        words = combined_text.split()
        entities = [word for word in words if word.isupper() or word.istitle()]

        return list(set(entities))  # Remove duplicates

    def cyclical_embedding_average(self, corpus_documents):
        """
        Core BSG CyLlama innovation: cyclical averaging of document embeddings
        """
        # Generate embeddings for each document
        embeddings = [self.gte_model.encode(doc) for doc in corpus_documents]

        # Cyclical averaging with phase weighting
        n_docs = len(embeddings)
        averaged_embedding = np.zeros_like(embeddings[0])

        for i, embedding in enumerate(embeddings):
            # Cyclical phase: ensures balanced representation
            phase = 2 * np.pi * i / n_docs
            cycle_weight = (np.cos(phase) + 1) / 2  # Normalize to [0, 1]
            averaged_embedding += embedding * cycle_weight

        return averaged_embedding / n_docs

    def generate_corpus_summary(self, corpus_documents, max_length=400):
        """
        Generate a summary from a corpus using the BSG CyLlama methodology
        """
        # Step 1: Extract named entities from the corpus
        named_entities = self.extract_named_entities(corpus_documents)

        # Step 2: Create the cyclically averaged corpus embedding
        # (in this simplified demo the generator is conditioned on the entity
        # context in the prompt; the embedding serves as the corpus representation)
        corpus_embedding = self.cyclical_embedding_average(corpus_documents)

        # Step 3: Build a prompt with entity context
        entity_context = ", ".join(named_entities[:20])  # Top entities

        prompt = f"""Based on the corpus analysis with key entities: {entity_context}

Generate a comprehensive biomedical summary that synthesizes the collective findings:

Summary:"""

        # Step 4: Generate the summary using BSG CyLlama
        inputs = self.tokenizer.encode(prompt, return_tensors="pt")

        with torch.no_grad():
            outputs = self.bsg_model.generate(
                inputs,
                max_length=len(inputs[0]) + max_length,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )

        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        summary = generated_text[len(prompt):].strip()

        return {
            'corpus_summary': summary,
            'key_entities': named_entities[:20],
            'num_documents': len(corpus_documents),
            'methodology': 'BSG CyLlama Cyclical Averaging'
        }


# Example usage
processor = BSGCyLlamaProcessor()

# Input: multiple related biomedical documents
corpus = [
    "Deep learning approaches in medical imaging have shown remarkable success...",
    "Convolutional neural networks for radiological analysis provide...",
    "Machine learning applications in diagnostic imaging demonstrate..."
]

# BSG CyLlama processing
result = processor.generate_corpus_summary(corpus)

print(f"Corpus Summary: {result['corpus_summary']}")
print(f"Key Entities: {result['key_entities']}")
print(f"Documents Processed: {result['num_documents']}")
```
|
|
|
|
|
## **Training Data & Methodology**
|
|
|
|
|
BSG CyLlama was trained on [19,174 scientific abstracts](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training) specifically formatted for cyclical corpus summarization: |
|
|
|
|
|
- **Corpus Groups**: Documents clustered by research themes |
|
|
- **Cyclical Training**: Model learned to process document series, not just individual papers |
|
|
- **Entity Integration**: Training included named entity concatenation patterns |
|
|
- **Approximation Learning**: Taught to create virtual "meta-documents" from corpus averaging |
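
The training corpus itself is hosted on the Hugging Face Hub and can be inspected with the `datasets` library. A minimal loading sketch is shown below; the split and column names should be checked against the dataset's actual schema rather than assumed:

```python
from datasets import load_dataset

# Load the BSG CyLlama training set from the Hub
ds = load_dataset("jimnoneill/BSG_CyLlama-training")

print(ds)  # available splits and their sizes

# Inspect the schema of the first available split before using it
split = next(iter(ds.values()))
print(split.column_names)
```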
|
|
|
|
|
### Training Configuration |
|
|
- **Base Model**: Llama-3.2-1B-Instruct |
|
|
- **Fine-tuning**: LoRA (rank 128, alpha 256); an example configuration follows this list
|
|
- **Embedding Model**: thenlper/gte-large (1024d) |
|
|
- **Specialization**: Cyclical corpus summarization |
|
|
- **Domain**: Biomedical and scientific literature |
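
The card documents only the LoRA rank and alpha; the remaining fields in the sketch below (dropout, target modules) are illustrative assumptions, not confirmed training settings. With PEFT, a configuration matching the stated rank and alpha might look roughly like this:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# r and lora_alpha come from the card above; dropout and target modules are assumptions
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,                    # assumed value
    target_modules=["q_proj", "v_proj"],  # assumed; typical Llama attention projections
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```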
|
|
|
|
|
## **Revolutionary Applications**
|
|
|
|
|
### Perfect for Corpus-Level Analysis: |
|
|
- **Literature Reviews**: Synthesize findings across multiple papers
- **Research Clustering**: Generate summaries for document clusters
- **Knowledge Synthesis**: Create meta-analyses from paper collections
- **Clinical Research**: Summarize multiple clinical studies
- **Drug Discovery**: Synthesize compound research across publications
|
|
|
|
|
### Advantages over Traditional Methods: |
|
|
- **Corpus Understanding**: Goes beyond single-document limitations
- **Balanced Representation**: Cyclical averaging ensures fair document weighting
- **Entity Preservation**: Named entity integration maintains domain terminology
- **Cost Effective**: No per-query retrieval costs
- **Fast Processing**: Single forward pass for the entire corpus
|
|
|
|
|
## **Innovation Summary**
|
|
|
|
|
BSG CyLlama introduces the **Cyclical Llama** approach to biomedical summarization: |
|
|
|
|
|
1. **Cyclical Averaging**: Averages embeddings across the entire document corpus
2. **Entity Integration**: Concatenates named entities with the averaged embeddings
3. **Approximation Documents**: Creates virtual meta-documents representing corpus knowledge
4. **Biomedical Focus**: Specialized for scientific and biomedical literature
5. **Economic Efficiency**: Eliminates expensive per-query retrieval operations
|
|
|
|
|
## **Getting Started with BSG CyLlama**
|
|
|
|
|
```bash
# Install dependencies
pip install torch transformers peft sentence-transformers

# Run the complete BSG CyLlama demo
python bsg_cyllama_demo.py
```
|
|
|
|
|
## **Citation**
|
|
|
|
|
```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLlama: Biomedical Summary Generation through Cyclical Llama with Named Entity Integration},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/jimnoneill/BSG_CyLlama},
  note={Novel cyclical embedding averaging methodology for corpus-level summarization}
}
```
|
|
|
|
|
## **Resources**
|
|
|
|
|
- **Model Repository**: [jimnoneill/BSG_CyLlama](https://huggingface.co/jimnoneill/BSG_CyLlama)
- **Training Dataset**: [jimnoneill/BSG_CyLlama-training](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training)
- **Demo Script**: `bsg_cyllama_demo.py` (included in the model repo)
- **Setup Guide**: `SETUP_GUIDE.md`
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Revolutionizing corpus-level summarization through cyclical embedding innovation!**
|
|
|
|
|
[Try BSG CyLlama](https://huggingface.co/jimnoneill/BSG_CyLlama) | [Explore the Dataset](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training) | [Read the Methodology](https://huggingface.co/jimnoneill/BSG_CyLlama/blob/main/SETUP_GUIDE.md) |
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
|