SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention • Paper • arXiv:2312.07987 • Published Dec 13, 2023
mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ • Text Generation model • Updated Jan 8
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models • Paper • arXiv:2401.06066 • Published Jan 11
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory • Paper • arXiv:2405.08707 • Published May 14