|
--- |
|
title: Transformers from Scratch - Complete Implementation |
|
emoji: 🔮
|
colorFrom: blue |
|
colorTo: green |
|
sdk: pytorch |
|
app_file: Transformers.ipynb |
|
pinned: false |
|
license: mit |
|
tags: |
|
- deep-learning |
|
- transformers |
|
- attention |
|
- pytorch |
|
- nlp |
|
- text-classification |
|
- sentiment-analysis |
|
- educational |
|
- from-scratch |
|
datasets: |
|
- synthetic-movie-reviews |
|
--- |
|
|
|
# Transformers from Scratch: Complete Implementation |
|
|
|
A comprehensive PyTorch implementation of the Transformer architecture from "Attention Is All You Need", featuring detailed mathematical foundations, educational content, and practical text classification applications. |
|
|
|
## Model Description |
|
|
|
This repository contains a complete, from-scratch implementation of the Transformer architecture. The model demonstrates the core concepts behind modern NLP systems like BERT, GPT, and ChatGPT through a practical sentiment analysis task. This implementation serves as both a working model and an educational resource for understanding the revolutionary attention mechanism. |
|
|
|
### Architecture Details |
|
|
|
- **Model Type**: Transformer Encoder for Text Classification |
|
- **Framework**: PyTorch |
|
- **Task**: Binary sentiment classification (positive/negative movie reviews) |
|
- **Model Dimension**: 128 |
|
- **Attention Heads**: 8 |
|
- **Layers**: 4 Transformer blocks |
|
- **Feed-Forward Dimension**: 256 |
|
- **Total Parameters**: ~200K |
|
- **Vocabulary Size**: Dynamic (built from training data) |
|
|
|
### Key Components |
|
|
|
1. **Multi-Head Attention**: Core mechanism allowing parallel processing of sequences |
|
2. **Positional Encoding**: Sine/cosine embeddings to inject position information |
|
3. **Transformer Blocks**: Attention + feed-forward with residual connections |
|
4. **Layer Normalization**: Stabilizes training and improves convergence |
|
5. **Classification Head**: Global average pooling + linear layer for predictions |
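
The attention, feed-forward, residual, and normalization pieces listed above combine into a single encoder block. As a condensed illustration (it uses PyTorch's built-in `nn.MultiheadAttention` and post-norm residuals for brevity; the notebook builds the attention mechanism from scratch, as derived in the next section), one block might look like this:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: self-attention + feed-forward, each with a
    residual connection followed by layer normalization."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer with residual connection
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer with residual connection
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x
```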
|
|
|
## Mathematical Foundation |
|
|
|
### Scaled Dot-Product Attention |
|
``` |
|
Attention(Q, K, V) = softmax(QK^T / √d_k)V
|
``` |
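
A minimal PyTorch implementation of this formula (the `mask` argument is an optional illustration of how padding positions would be excluded):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V for Q, K, V of shape
    (batch, heads, seq_len, d_k); mask marks positions to ignore."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention distribution
    return torch.matmul(weights, V), weights  # weighted values + weights
```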
|
|
|
### Multi-Head Attention |
|
``` |
|
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O |
|
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) |
|
``` |
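
In code, all `h` heads can be computed in parallel by projecting to `d_model` once per Q/K/V and reshaping into `h` slices of size `d_k = d_model / h`. A sketch (the notebook's class may differ in details such as dropout or masking):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q/K/V once for all heads, attend per head, then merge with W^O."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)  # stacks all W_i^Q
        self.W_k = nn.Linear(d_model, d_model)  # stacks all W_i^K
        self.W_v = nn.Linear(d_model, d_model)  # stacks all W_i^V
        self.W_o = nn.Linear(d_model, d_model)  # output projection W^O

    def _split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value):
        batch_size = query.size(0)
        Q = self._split_heads(self.W_q(query), batch_size)
        K = self._split_heads(self.W_k(key), batch_size)
        V = self._split_heads(self.W_v(value), batch_size)

        # Scaled dot-product attention per head (same formula as above)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        weights = torch.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)

        # Concatenate heads back to (batch, seq, d_model), then apply W^O
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out), weights
```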
|
|
|
### Positional Encoding |
|
``` |
|
PE(pos, 2i) = sin(pos/10000^(2i/d_model)) |
|
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)) |
|
``` |
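
The `PositionalEncoding` module referenced in the Quick Start below follows directly from these formulas. A minimal sketch (dropout omitted for brevity):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i/d_model), computed in log space for numerical stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[:, :x.size(1)]
```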
|
|
|
## Training Details |
|
|
|
- **Dataset**: Synthetic movie reviews (positive/negative sentiment) |
|
- **Optimizer**: AdamW with weight decay (0.01) |
|
- **Learning Rate**: 0.0001 with cosine annealing |
|
- **Batch Size**: 16 |
|
- **Max Sequence Length**: 24 tokens |
|
- **Training Epochs**: 30 |
|
- **Hardware**: Optimized for Apple M4 and CUDA GPUs |
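
A condensed sketch of this training setup (the model, data loaders, and device selection come from the notebook; `train_loader` here is a placeholder name):

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=30, lr=1e-4, device="cpu"):
    # device could be "cuda" or "mps" (Apple Silicon) when available
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for tokens, labels in train_loader:
            tokens, labels = tokens.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(tokens), labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        scheduler.step()
        print(f"epoch {epoch + 1}: loss {total_loss / len(train_loader):.4f}")
```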
|
|
|
## Model Performance |
|
|
|
### Metrics |
|
- **Test Accuracy**: 85%+ |
|
- **Training Time**: ~10 minutes on Apple M4 |
|
- **Model Size**: 200K parameters |
|
- **Convergence**: Stable training without overfitting |
|
|
|
### Capabilities |
|
- ✅ Binary sentiment classification

- ✅ Attention weight visualization

- ✅ Fast inference on modern hardware

- ✅ Educational transparency

- ✅ Easily extensible architecture
|
|
|
## Usage |
|
|
|
### Quick Start |
|
|
|
```python |
|
import torch |
|
import torch.nn as nn |
|
import math |
|
|
|
# Model definition (PositionalEncoding and TransformerBlock come from the notebook)
|
class TransformerClassifier(nn.Module): |
|
def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_len, num_classes): |
|
super().__init__() |
|
self.d_model = d_model |
|
self.embedding = nn.Embedding(vocab_size, d_model) |
|
self.pos_encoding = PositionalEncoding(d_model, max_len) |
|
|
|
self.transformer_blocks = nn.ModuleList([ |
|
TransformerBlock(d_model, num_heads, d_ff) |
|
for _ in range(num_layers) |
|
]) |
|
|
|
self.norm = nn.LayerNorm(d_model) |
|
self.classifier = nn.Linear(d_model, num_classes) |
|
|
|
def forward(self, x): |
|
# Embedding + positional encoding |
|
x = self.embedding(x) * math.sqrt(self.d_model) |
|
x = self.pos_encoding(x) |
|
|
|
# Transformer blocks |
|
for transformer in self.transformer_blocks: |
|
x = transformer(x) |
|
|
|
# Classification |
|
x = self.norm(x) |
|
x = x.mean(dim=1) # Global average pooling |
|
return self.classifier(x) |
|
|
|
# Load trained model |
|
model = TransformerClassifier( |
|
vocab_size=vocab_size, |
|
d_model=128, |
|
num_heads=8, |
|
num_layers=4, |
|
d_ff=256, |
|
max_len=24, |
|
num_classes=2 |
|
) |
|
model.load_state_dict(torch.load('best_transformer_model.pth')) |
|
model.eval() |
|
|
|
# Example inference |
|
def predict_sentiment(text, model, vocab_to_idx, max_length=24): |
|
tokens = tokenize_text(text, vocab_to_idx, max_length) |
|
with torch.no_grad(): |
|
output = model(tokens.unsqueeze(0)) |
|
prediction = torch.softmax(output, dim=1) |
|
return "Positive" if prediction[0][1] > 0.5 else "Negative" |
|
|
|
# Test the model |
|
result = predict_sentiment("This movie was absolutely fantastic!", model, vocab_to_idx) |
|
print(f"Sentiment: {result}") |
|
``` |
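
The `tokenize_text` helper and `vocab_to_idx` mapping are defined in the notebook. For a standalone test, a hypothetical whitespace tokenizer compatible with the call above could look like this (the padding/unknown index convention is an assumption, not necessarily what the notebook uses):

```python
import torch

def tokenize_text(text, vocab_to_idx, max_length=24, pad_idx=0, unk_idx=1):
    # Lowercase, split on whitespace, map words to ids, then pad/truncate.
    # NOTE: pad_idx/unk_idx are illustrative assumptions.
    ids = [vocab_to_idx.get(token, unk_idx) for token in text.lower().split()]
    ids = ids[:max_length] + [pad_idx] * max(0, max_length - len(ids))
    return torch.tensor(ids, dtype=torch.long)
```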
|
|
|
### Advanced Usage |
|
|
|
```python
# Visualize attention weights (requires the attention modules to also
# return their softmax weights, as shown in the notebook)
def visualize_attention(model, text, vocab_to_idx):
    # Extract attention weights from each layer and plot heatmaps
    # showing which tokens the model focuses on
    pass

# Fine-tune on new, domain-specific data
def fine_tune_model(model, new_data_loader, epochs=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for tokens, labels in new_data_loader:
            optimizer.zero_grad()
            loss = criterion(model(tokens), labels)
            loss.backward()
            optimizer.step()
    return model
```
|
|
|
## Visualizations and Analysis |
|
|
|
1. **Training Curves**: Loss and accuracy evolution over epochs |
|
2. **Attention Heatmaps**: Visualize what the model pays attention to |
|
3. **Performance Metrics**: Precision, recall, F1-score breakdowns |
|
4. **Architecture Diagrams**: Component-wise model visualization |
|
5. **Error Analysis**: Common failure cases and model limitations |
|
|
|
## Files and Outputs |
|
|
|
- `Transformers.ipynb`: Complete implementation with educational content |
|
- `best_transformer_model.pth`: Trained model weights |
|
- `m4_transformer_results.png`: Training curves and performance metrics |
|
- Architecture visualization and attention weight examples |
|
|
|
## Educational Value |
|
|
|
This implementation is designed as a comprehensive learning resource featuring: |
|
|
|
### Mathematical Understanding |
|
- **Complete Derivations**: From attention theory to implementation |
|
- **Step-by-Step Breakdown**: Each component explained individually |
|
- **Visual Mathematics**: Attention visualizations and formula explanations |
|
- **Practical Examples**: Concrete numerical calculations |
|
|
|
### Implementation Insights |
|
- **Clean Code Architecture**: Modular, readable, and well-documented |
|
- **Best Practices**: Modern PyTorch patterns and techniques |
|
- **Performance Optimization**: Efficient training and inference |
|
- **Debugging Techniques**: How to monitor and improve training |
|
|
|
### Real-World Applications |
|
- **End-to-End Pipeline**: From raw text to predictions |
|
- **Production Considerations**: Model deployment and optimization |
|
- **Extension Examples**: How to adapt for different tasks |
|
- **Transfer Learning**: Building on pre-trained representations |
|
|
|
## Applications |
|
|
|
This Transformer implementation can be adapted for: |
|
|
|
### Text Classification Tasks |
|
- **Sentiment Analysis**: Movie reviews, product feedback, social media |
|
- **Topic Classification**: News categorization, document organization |
|
- **Spam Detection**: Email filtering, content moderation |
|
- **Intent Recognition**: Chatbot understanding, voice assistants |
|
|
|
### Sequence Processing |
|
- **Named Entity Recognition**: Extract people, places, organizations |
|
- **Part-of-Speech Tagging**: Grammatical analysis |
|
- **Text Similarity**: Document matching, plagiarism detection |
|
- **Feature Extraction**: Dense representations for downstream tasks |
|
|
|
### Research and Development |
|
- **Architecture Experiments**: Test new attention mechanisms |
|
- **Ablation Studies**: Understand component contributions |
|
- **Scaling Experiments**: Larger models and datasets |
|
- **Novel Applications**: Domain-specific adaptations |
|
|
|
## Comparison with Other Architectures |
|
|
|
### Advantages over RNNs |
|
- ✅ **Parallel Processing**: Much faster training and inference

- ✅ **Long-Range Dependencies**: Better handling of distant relationships

- ✅ **Scalability**: Efficient on modern hardware

- ✅ **Interpretability**: Attention weights provide insights
|
|
|
### Advantages over CNNs |
|
- ✅ **Sequence Modeling**: Natural fit for text and time series

- ✅ **Variable Length**: Handle sequences of any length

- ✅ **Global Context**: Attend to entire sequence simultaneously

- ✅ **Position Awareness**: Explicit positional information
|
|
|
### Educational Benefits |
|
- **Foundation Understanding**: Core concepts behind modern NLP

- **Mathematical Clarity**: Clean mathematical formulations

- **Implementation Practice**: Hands-on coding experience

- **Research Preparation**: Basis for advanced architectures
|
|
|
## Citation |
|
|
|
If you use this implementation in your research or projects, please cite: |
|
|
|
```bibtex |
|
@misc{transformers_from_scratch_2024, |
|
title={Transformers from Scratch: Complete Implementation}, |
|
author={Gruhesh Kurra}, |
|
year={2024}, |
|
url={https://huggingface.co/karthik-2905/TransformersFromScratch} |
|
} |
|
``` |
|
|
|
## Future Extensions |
|
|
|
Planned improvements and research directions: |
|
|
|
- **Encoder-Decoder Architecture**: Full sequence-to-sequence implementation

- **Pre-training Pipeline**: Large-scale language model training

- **Alternative Attention**: Sparse, local, and linear attention variants

- **Vision Transformers**: Adapt architecture for image tasks

- **Multimodal Transformers**: Text, image, and audio processing

- **Scientific Applications**: Protein sequences, molecular modeling
|
|
|
## License |
|
|
|
This project is licensed under the MIT License - see the LICENSE file for details. |
|
|
|
## Additional Resources |
|
|
|
- **GitHub Repository**: [TransformersFromScratch](https://github.com/GruheshKurra/TransformersFromScratch) |
|
- **Original Paper**: "Attention Is All You Need" by Vaswani et al. |
|
- **Educational Content**: Complete mathematical derivations and examples |
|
- **Performance Benchmarks**: Detailed analysis and comparisons |
|
|
|
## Model Card Authors |
|
|
|
**Gruhesh Kurra** - Implementation, documentation, and educational content |
|
|
|
--- |
|
|
|
**Tags**: transformers, attention, pytorch, nlp, text-classification, educational |
|
|
|
**Model Card Last Updated**: December 2024 |