Commit bf7364f (1 parent: 2d7348d): cross referencing other transformer-related implementations

README.md CHANGED

@@ -11,10 +11,12 @@ language: en
 license: mit
 ---
 
-# DeepSeek Multi-Latent Attention
+# DeepSeek Multi-Head Latent Attention
 
 This repository provides a PyTorch implementation of the Multi-Head Latent Attention (MLA) mechanism introduced in the DeepSeek-V2 paper. **This is not a trained model, but rather a modular attention implementation** that significantly reduces KV cache for efficient inference while maintaining model performance through its innovative architecture. It can be used as a drop-in attention module in transformer architectures.
 
+This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the **Related Implementations** section for the complete series.
+
 ## Key Features
 
 - **Low-Rank Key-Value Joint Compression**: Reduces memory footprint during inference
@@ -114,6 +116,18 @@ Key aspects:
 - Position encoding through decoupled RoPE pathway
 - Efficient cache management for both pathways
 
+## Related Implementations
+
+This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:
+
+1. **[DeepSeek Multi-head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)** (This Repository): Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.
+
+2. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)**: Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.
+
+3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing the transformer architecture, with explanations of key components.
+
+Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-Head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.
+
 ## Contributing
 
 Contributions are welcome! Feel free to:
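
As context for the "Low-Rank Key-Value Joint Compression" feature the README advertises, here is a minimal PyTorch sketch of the idea. It is illustrative only and not the repository's actual module or API: the class name `LowRankKVCompression`, the projection names, and the default dimensions are hypothetical choices for clarity.

```python
# Illustrative sketch only; not the repository's actual API. Names and sizes
# are assumptions. Keys and values are reconstructed from one small per-token
# latent, so only the latent (d_latent floats per token) needs to be cached
# instead of full per-head K and V.
import torch
import torch.nn as nn


class LowRankKVCompression(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # joint compression
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # latent -> keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # latent -> values

    def forward(self, hidden_states, latent_cache=None):
        # hidden_states: (batch, new_tokens, d_model)
        c_kv = self.down_kv(hidden_states)                 # (batch, new_tokens, d_latent)
        if latent_cache is not None:                       # append to previously cached latents
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        b, s, _ = c_kv.shape
        k = self.up_k(c_kv).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(c_kv).view(b, s, self.n_heads, self.d_head)
        return k, v, c_kv                                  # cache c_kv, not k and v


# Toy usage: the cache grows by d_latent floats per token, not 2 * n_heads * d_head.
attn_kv = LowRankKVCompression()
k, v, cache = attn_kv(torch.randn(1, 4, 1024))             # prefill 4 tokens
k, v, cache = attn_kv(torch.randn(1, 1, 1024), cache)      # decode 1 more token
```

Caching the shared latent rather than per-head keys and values is what gives the reduced inference memory footprint described in the Key Features list.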
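The "Key aspects" context in the second hunk mentions a decoupled RoPE pathway and cache management for both pathways. The sketch below shows one way the two caches can sit side by side; `apply_rope`, `TwoPathwayCache`, and all dimensions are hypothetical, the rotary helper is deliberately simplified, and the repository's actual implementation may differ.

```python
# Illustrative sketch only; hypothetical names, simplified rotary embedding.
# Position information travels through a small, separate RoPE key per token,
# cached next to the compressed KV latent from the sketch above.
import torch


def apply_rope(x, positions):
    # Minimal rotary embedding: rotate channel pairs by position-dependent angles.
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TwoPathwayCache:
    """Holds both pathways: compressed KV latents and decoupled RoPE keys."""

    def __init__(self):
        self.c_kv = None    # (batch, cached_len, d_latent)  content pathway
        self.k_rope = None  # (batch, cached_len, d_rope)    position pathway

    def update(self, c_kv_new, k_rope_new):
        if self.c_kv is None:
            self.c_kv, self.k_rope = c_kv_new, k_rope_new
        else:
            self.c_kv = torch.cat([self.c_kv, c_kv_new], dim=1)
            self.k_rope = torch.cat([self.k_rope, k_rope_new], dim=1)
        return self.c_kv, self.k_rope


# Toy usage: per decoded token the cache grows by d_latent + d_rope floats.
cache = TwoPathwayCache()
positions = torch.arange(4)
c_kv, k_rope = cache.update(torch.randn(1, 4, 128),
                            apply_rope(torch.randn(1, 4, 32), positions))
```

In the DeepSeek-V2 paper the decoupled RoPE key is a small per-token vector, so caching it alongside the latent keeps the per-token cache cost low while still carrying position information through its own pathway.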