Estimating Knowledge in Large Language Models Without Generating a Single Token Paper • 2406.12673 • Published Jun 18, 2024 • 9
ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations Paper • 2505.02819 • Published May 5 • 26
Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning Paper • 2508.04581 • Published Aug 6 • 5
view article Article Sparse Mixture of Experts Language Model from Scratch: Extending makeMoE with Expert Capacity Mar 18, 2024 • 13