license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: peft
library_name: peft
tags:
- biomedical-summary-generation
- cyclical-embeddings
- named-entity-extraction
- corpus-level-summarization
- scientific-summarization
- biomedical
- research
- llama
- lora
- text-generation
- sentence-transformers
datasets:
- jimnoneill/BSG_CyLlama-training
pipeline_tag: text-generation
widget:
- text: >-
Generate a biomedical summary from this corpus: [Document 1: Deep learning
in medical imaging...] [Document 2: Neural networks for drug discovery...]
[Named Entities: CNN, pharmaceutical compounds, medical imaging]
example_title: BSG CyLlama Corpus Summarization
BSG CyLlama: Biomedical Summary Generation through Cyclical Llama
Revolutionary corpus-level summarization using cyclical embedding averaging with named entity integration
What is BSG CyLlama?
BSG CyLlama stands for Biomedical Summary Generation through Cyclical Llama - a novel approach to corpus-level summarization that revolutionizes how we generate summaries from multiple scientific documents.
π The Cyclical Innovation
Unlike traditional single-document summarization or RAG systems, BSG CyLlama introduces a cyclical embedding averaging methodology:
- π Corpus Input: Takes a series/corpus of related scientific documents
- π Cyclical Averaging: Averages embeddings across all documents in the corpus cyclically
- π·οΈ Named Entity Integration: Concatenates the averaged embeddings with key named entities
- π Summary Generation: Uses this combined representation to generate comprehensive summaries
This creates an approximation embedding document that captures the collective knowledge of the entire corpus, not just individual papers.
𧬠Core Methodology: Cyclical Embedding Averaging
The BSG CyLlama Process
def bsg_cyclical_summarization(corpus_documents, named_entities):
"""
BSG CyLlama's core cyclical averaging methodology
Args:
corpus_documents: List of related scientific documents
named_entities: Key entities extracted from the corpus
Returns:
Comprehensive corpus-level summary
"""
# Step 1: Generate embeddings for each document
document_embeddings = []
for doc in corpus_documents:
embedding = gte_large_model.encode(doc)
document_embeddings.append(embedding)
# Step 2: Cyclical averaging of embeddings
averaged_embedding = cyclical_average(document_embeddings)
# Step 3: Concatenate with named entities
entity_embedding = gte_large_model.encode(" ".join(named_entities))
combined_embedding = concatenate([averaged_embedding, entity_embedding])
# Step 4: Generate corpus-level summary
summary = bsg_cyllama_model.generate(combined_embedding)
return summary
def cyclical_average(embeddings_list):
"""
Cyclically average embeddings to create approximation document
"""
n_docs = len(embeddings_list)
weighted_sum = np.zeros_like(embeddings_list[0])
for i, embedding in enumerate(embeddings_list):
# Cyclical weighting ensures balanced representation
cycle_weight = np.cos(2 * np.pi * i / n_docs) + 1
weighted_sum += embedding * cycle_weight
return weighted_sum / n_docs
π― Why Cyclical Averaging Works
Traditional Approaches vs. BSG CyLlama
β Traditional Single-Doc Summarization:
- Limited to individual paper insights
- Misses cross-document patterns
- Cannot synthesize collective knowledge
β Standard RAG Systems:
- Retrieval-dependent (query-time bottleneck)
- Linear combination of retrieved chunks
- High computational costs per query
β BSG CyLlama Cyclical Approach:
- Corpus-level understanding: Captures collective document knowledge
- Cyclical weighting: Ensures balanced representation across documents
- Named entity integration: Preserves domain-specific terminology
- One-time processing: No per-query retrieval costs
- Approximation document: Creates a virtual "meta-document" representing the corpus
π¬ Model Architecture & Integration
Required Components
BSG CyLlama requires both embedding and generation models working in tandem:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from sentence_transformers import SentenceTransformer
import numpy as np
# 1. Embedding Model (REQUIRED for cyclical averaging)
gte_model = SentenceTransformer("thenlper/gte-large") # 1024-dim embeddings
# 2. BSG CyLlama Generation Model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")
# 3. Named Entity Extraction (optional enhancement)
from transformers import pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
Complete BSG CyLlama Implementation
class BSGCyLlamaProcessor:
"""Complete implementation of Biomedical Summary Generation through Cyclical Llama"""
def __init__(self):
self.gte_model = SentenceTransformer("thenlper/gte-large")
self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
self.bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")
def extract_named_entities(self, corpus_text):
"""Extract key biomedical entities from corpus"""
# Combine all corpus text
combined_text = " ".join(corpus_text)
# Extract entities (simplified - can be enhanced with BioBERT/SciBERT)
entities = []
# Basic implementation - can be replaced with specialized NER
words = combined_text.split()
entities = [word for word in words if word.isupper() or word.istitle()]
return list(set(entities)) # Remove duplicates
def cyclical_embedding_average(self, corpus_documents):
"""
Core BSG CyLlama innovation: cyclical averaging of document embeddings
"""
# Generate embeddings for each document
embeddings = []
for doc in corpus_documents:
emb = self.gte_model.encode(doc)
embeddings.append(emb)
# Cyclical averaging with phase weighting
n_docs = len(embeddings)
averaged_embedding = np.zeros_like(embeddings[0])
for i, embedding in enumerate(embeddings):
# Cyclical phase: ensures balanced representation
phase = 2 * np.pi * i / n_docs
cycle_weight = (np.cos(phase) + 1) / 2 # Normalize to [0,1]
averaged_embedding += embedding * cycle_weight
return averaged_embedding / n_docs
def generate_corpus_summary(self, corpus_documents, max_length=400):
"""
Generate summary from corpus using BSG CyLlama methodology
"""
# Step 1: Extract named entities from corpus
named_entities = self.extract_named_entities(corpus_documents)
# Step 2: Create cyclically averaged embedding
corpus_embedding = self.cyclical_embedding_average(corpus_documents)
# Step 3: Create prompt with entity context
entity_context = ", ".join(named_entities[:20]) # Top entities
prompt = f"""Based on the corpus analysis with key entities: {entity_context}
Generate a comprehensive biomedical summary that synthesizes the collective findings:
Summary:"""
# Step 4: Generate summary using BSG CyLlama
inputs = self.tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
outputs = self.bsg_model.generate(
inputs,
max_length=len(inputs[0]) + max_length,
temperature=0.7,
do_sample=True,
top_p=0.9,
pad_token_id=self.tokenizer.eos_token_id
)
generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
summary = generated_text[len(prompt):].strip()
return {
'corpus_summary': summary,
'key_entities': named_entities[:20],
'num_documents': len(corpus_documents),
'methodology': 'BSG CyLlama Cyclical Averaging'
}
# Example Usage
processor = BSGCyLlamaProcessor()
# Input: Multiple related biomedical documents
corpus = [
"Deep learning approaches in medical imaging have shown remarkable success...",
"Convolutional neural networks for radiological analysis provide...",
"Machine learning applications in diagnostic imaging demonstrate..."
]
# BSG CyLlama Processing
result = processor.generate_corpus_summary(corpus)
print(f"Corpus Summary: {result['corpus_summary']}")
print(f"Key Entities: {result['key_entities']}")
print(f"Documents Processed: {result['num_documents']}")
π Training Data & Methodology
BSG CyLlama was trained on 19,174 scientific abstracts specifically formatted for cyclical corpus summarization:
- Corpus Groups: Documents clustered by research themes
- Cyclical Training: Model learned to process document series, not just individual papers
- Entity Integration: Training included named entity concatenation patterns
- Approximation Learning: Taught to create virtual "meta-documents" from corpus averaging
Training Configuration
- Base Model: Llama-3.2-1B-Instruct
- Fine-tuning: LoRA (rank 128, alpha 256)
- Embedding Model: thenlper/gte-large (1024d)
- Specialization: Cyclical corpus summarization
- Domain: Biomedical and scientific literature
π Revolutionary Applications
Perfect for Corpus-Level Analysis:
- π¬ Literature Reviews: Synthesize findings across multiple papers
- 𧬠Research Clustering: Generate summaries for document clusters
- π Knowledge Synthesis: Create meta-analyses from paper collections
- π₯ Clinical Research: Summarize multiple clinical studies
- π Drug Discovery: Synthesize compound research across publications
Advantages over Traditional Methods:
- π Corpus Understanding: Goes beyond single-document limitations
- π Balanced Representation: Cyclical averaging ensures fair document weighting
- π·οΈ Entity Preservation: Named entity integration maintains domain terminology
- π° Cost Effective: No per-query retrieval costs
- β‘ Fast Processing: Single forward pass for entire corpus
π‘ Innovation Summary
BSG CyLlama introduces the Cyclical Llama approach to biomedical summarization:
- π Cyclical Averaging: Revolutionary embedding averaging across document corpus
- π·οΈ Entity Integration: Concatenates named entities with averaged embeddings
- π Approximation Documents: Creates virtual meta-documents representing corpus knowledge
- 𧬠Biomedical Focus: Specialized for scientific and biomedical literature
- π° Economic Efficiency: Eliminates expensive per-query retrieval operations
π― Getting Started with BSG CyLlama
# Install dependencies
pip install torch transformers peft sentence-transformers
# Run the complete BSG CyLlama demo
python bsg_cyllama_demo.py
π Citation
@misc{bsg-cyllama-2025,
title={BSG CyLlama: Biomedical Summary Generation through Cyclical Llama with Named Entity Integration},
author={BSG Research Team},
year={2025},
url={https://huggingface.co/jimnoneill/BSG_CyLlama},
note={Novel cyclical embedding averaging methodology for corpus-level summarization}
}
π Resources
- π€ Model Repository: jimnoneill/BSG_CyLlama
- π Training Dataset: jimnoneill/BSG_CyLlama-training
- π Demo Script:
bsg_cyllama_demo.py(included in model repo) - π Setup Guide:
SETUP_GUIDE.md
π Revolutionizing corpus-level summarization through cyclical embedding innovation! π
Try BSG CyLlama | Explore the Dataset | Read the Methodology