bird-of-paradise committed on
Commit bf7364f · 1 Parent(s): 2d7348d

cross referencing other transformer-related implementations

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -11,10 +11,12 @@ language: en
license: mit
---

- # DeepSeek Multi-Latent Attention
+ # DeepSeek Multi-Head Latent Attention

This repository provides a PyTorch implementation of the Multi-Head Latent Attention (MLA) mechanism introduced in the DeepSeek-V2 paper. **This is not a trained model, but rather a modular attention implementation** that significantly reduces KV cache for efficient inference while maintaining model performance through its innovative architecture. It can be used as a drop-in attention module in transformer architectures.

+ This repository is part of a series implementing the key architectural innovations from the DeepSeek papers. See the **Related Implementations** section for the complete series.
+
## Key Features

- **Low-Rank Key-Value Joint Compression**: Reduces memory footprint during inference
@@ -114,6 +116,18 @@ Key aspects:
- Position encoding through decoupled RoPE pathway
- Efficient cache management for both pathways

+ ## Related Implementations
+
+ This repository is part of a series implementing the key architectural innovations from the DeepSeek papers:
+
+ 1. **[DeepSeek Multi-Head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)** (this repository): Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.
+
+ 2. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)**: Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.
+
+ 3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing the transformer architecture with explanations of key components.
+
+ Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-Head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.
+
## Contributing

Contributions are welcome! Feel free to:
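
For context on the README text above: its central claim is that MLA shrinks the KV cache through low-rank key-value joint compression, i.e. only a small shared latent is cached instead of full per-head keys and values. The sketch below is a rough, hypothetical illustration of that caching idea only; the class name `SimplifiedMLA`, the `kv_latent_dim` parameter, and the forward signature are assumptions, not this repository's actual API, and it omits the decoupled RoPE pathway and causal masking that the real implementation handles.

```python
# Minimal sketch of low-rank key-value joint compression (hypothetical names,
# not the API of this repository): project hidden states to a small latent
# c_kv and cache that instead of full per-head keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedMLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, kv_latent_dim=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        self.w_q = nn.Linear(d_model, d_model, bias=False)
        # Joint down-projection: one small latent shared by keys and values
        self.w_dkv = nn.Linear(d_model, kv_latent_dim, bias=False)
        # Up-projections reconstruct per-head keys/values from the latent
        self.w_uk = nn.Linear(kv_latent_dim, d_model, bias=False)
        self.w_uv = nn.Linear(kv_latent_dim, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        # Only the compressed latent is cached across decoding steps,
        # which is what shrinks the KV cache relative to standard MHA.
        c_kv = self.w_dkv(x)                                  # (b, t, kv_latent_dim)
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)

        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_uk(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

        out = F.scaled_dot_product_attention(q, k, v)         # (b, h, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), c_kv                            # latent is the new cache
```

Under these assumptions, an autoregressive loop would call it roughly as `out, cache = mla(step_input, kv_cache=cache)`, so the cache grows by `kv_latent_dim` values per token instead of the `2 * d_model` values a standard multi-head attention cache would store.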