Since Transformers are everywhere, it's worth understanding RoPE (Rotary Position Embedding). Token order matters, and RoPE encodes it by rotating token embeddings according to their position, so the model can tell which token comes first, second, and so on.
Here are 8 RoPE variants, each suited to different settings:
4. Multimodal RoPE (MRoPE) -> Qwen2.5-VL Technical Report (2502.13923). Decomposes the positional embedding into three components (temporal, height, and width) so that positional features stay aligned across modalities: text, images, and video.
8. XPos (Extrapolatable Position Embedding) -> https://huggingface.co/papers/2212.10 Introduces an exponential decay factor into the rotation matrix, improving stability on long sequences.
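To make the rotation idea concrete, here is a minimal NumPy sketch of the basic RoPE operation on a single token vector. The function name `rope_rotate`, the default `base=10000`, and the optional `decay` argument (a rough nod to the exponential scaling XPos adds; not the exact XPos formulation) are all illustrative, not from any particular implementation.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0, decay=None):
    """Rotary position embedding for one token vector.

    Consecutive pairs (x[2i], x[2i+1]) are rotated by an angle
    position * theta_i, with theta_i = base**(-2i/d).
    decay: optional scalar in (0, 1]; when given, each pair is also
    scaled down with the angle -- a loose, XPos-flavored sketch only.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "embedding dimension must be even"
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)      # per-pair rotation frequency
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    if decay is not None:
        scale = decay ** angle          # illustrative exponential damping
        out[0::2] *= scale
        out[1::2] *= scale
    return out
```

The key property this gives you: the dot product between a rotated query and a rotated key depends only on the *relative* distance between their positions, which is exactly what attention needs.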
2) Infographic
Features: Visually appealing infographics that communicate data or statistics
Use Cases: Global energy charts, startup growth metrics, health tips, and more
Benefits: Eye-catching icons and layouts, perfect for storytelling at a glance

3) Mockup
Features: Sketch-style wireframes or UX mockups for apps and websites
Use Cases: Mobile login flows, dashboards, e-commerce site layouts
Benefits: Rapid prototyping of early design ideas, perfect for storyboarding

5) Design
Features: Product/industrial design concepts (coffee machines, smartphones, etc.)
Use Cases: Prototyping, concept car interiors, high-tech product sketches
Benefits: From 3D render-like visuals to simple sketches, unleash your creativity!
We are reproducing the full DeepSeek-R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!
🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.
🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
🔥 Step 3: show we can go from base model -> SFT -> RL via multi-stage training.
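The RL stage in Step 2 hinges on verifiable rewards: for math problems you can score a completion automatically by checking its final answer. Below is a deliberately simple sketch of such a reward function; the helper names `boxed_answer` and `math_reward` are hypothetical, and real pipelines normalize LaTeX answers far more carefully.

```python
def boxed_answer(text):
    """Extract the content of the last \\boxed{...} in a completion.
    Tracks brace depth so one level of nesting works; returns None
    if no box is found."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

def math_reward(completion, gold):
    """Binary reward: 1.0 if the boxed answer matches the gold
    answer after stripping whitespace, else 0.0."""
    pred = boxed_answer(completion)
    if pred is None:
        return 0.0
    return 1.0 if pred.strip() == str(gold).strip() else 0.0
```

A reward like this needs no learned judge, which is what makes large-scale RL on math and code tractable.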
This paper introduces the MinGRU model, a simplified version of the traditional Gated Recurrent Unit (GRU) designed to enhance efficiency by removing hidden state dependencies from its gates. This allows for parallel training, making it significantly faster than conventional GRUs. Additionally, MinGRU eliminates non-linear activations like tanh, streamlining computations.
So I read the paper and tried training this model, and it seems to be doing quite well. You can check out the pre-trained model on Hugging Face Spaces.
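To see why removing the hidden-state dependence matters, here is a minimal NumPy sketch of the minGRU recurrence as the paper describes it. The class name, weight shapes, and initialization are illustrative; a real implementation would train with a parallel scan rather than the sequential loop shown here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinGRU:
    """Sketch of the minGRU recurrence:
        z_t  = sigmoid(W_z x_t)   # gate depends only on the input
        h~_t = W_h x_t            # candidate, no tanh
        h_t  = (1 - z_t) * h_{t-1} + z_t * h~_t
    Since z_t and h~_t never look at h_{t-1}, all of them can be
    computed in parallel, and only the final mix is a (scannable)
    linear recurrence."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_z = rng.normal(0.0, 0.1, (d_in, d_hidden))
        self.W_h = rng.normal(0.0, 0.1, (d_in, d_hidden))

    def forward(self, xs):
        """xs: (seq_len, d_in) -> hidden states (seq_len, d_hidden)."""
        h = np.zeros(self.W_h.shape[1])
        hs = []
        for x in xs:  # sequential form, for clarity only
            z = sigmoid(x @ self.W_z)
            h = (1.0 - z) * h + z * (x @ self.W_h)
            hs.append(h)
        return np.stack(hs)
```

Contrast this with a standard GRU, where the gates take `h_{t-1}` as input, forcing step-by-step computation during training.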