Representation & Optimization
Understanding about representation sheds light on optimization
Paper • 2405.14544 • Published • 1
Note The Cauchy-Schwarz inequality for matrices lets an element-wise Frobenius-norm penalty encourage low-rank representations.
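For reference, a minimal sketch of the bound I take this note to mean (my reading, not a statement from the paper): for a rank-$r$ matrix $A$ with singular values $\sigma_1, \dots, \sigma_r$, Cauchy-Schwarz gives

$$\|A\|_* = \sum_{i=1}^{r} \sigma_i \;\le\; \sqrt{r}\,\Big(\sum_{i=1}^{r} \sigma_i^2\Big)^{1/2} = \sqrt{\operatorname{rank}(A)}\;\|A\|_F,$$

and for any factorization $W = UV$ the related bound $\|W\|_* \le \tfrac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)$ (tight for balanced factors) shows that an element-wise Frobenius penalty on the factors upper-bounds the nuclear norm of the product, a standard convex surrogate for rank.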
Token embeddings violate the manifold hypothesis
Paper • 2504.01002 • Published • 1
Note Some tokens have more synonyms than others.
Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
Paper • 2403.10476 • Published • 1
ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
Paper • 2504.00254 • Published • 1
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
Paper • 2412.05496 • Published • 1
Note Customize the attention mask while keeping optimized performance comparable to FlashAttention.
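A minimal usage sketch, assuming PyTorch's torch.nn.attention.flex_attention API (PyTorch 2.5+) and a CUDA device; the causal mask_mod below is just an example of a custom mask, not anything from the paper.

# Sketch of a custom mask with FlexAttention; assumes PyTorch >= 2.5 and a CUDA device.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

# Any boolean predicate over (batch, head, query index, key index) defines the mask;
# here a plain causal mask, purely as an illustration.
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# The predicate is compiled into a block-sparse mask, so fully masked blocks are skipped.
block_mask = create_block_mask(causal_mask, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)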
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
Paper • 2503.21934 • Published
Value Residual Learning For Alleviating Attention Concentration In Transformers
Paper • 2410.17897 • Published • 9
Note Halves the KV cache by sharing value embeddings across attention blocks.
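A rough sketch of how I read the cache-halving idea (hypothetical module, not the paper's exact formulation): later blocks mix their own values with the first block's values, so a cache only needs per-layer keys plus one shared V.

# Hypothetical attention block with a value residual; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueResidualAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.n_heads = n_heads
        self.lam = nn.Parameter(torch.tensor(0.5))  # learned mixing weight

    def forward(self, x, v_first=None):
        B, S, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if v_first is None:
            v_first = v                                    # first block defines the shared values
        else:
            v = self.lam * v + (1.0 - self.lam) * v_first  # value residual in later blocks
        def heads(t):
            return t.view(B, S, self.n_heads, D // self.n_heads).transpose(1, 2)
        o = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        o = o.transpose(1, 2).reshape(B, S, D)
        return self.out(o), v_first

Threading v_first through a stack of these blocks is what would let a KV cache store keys per layer but values only once.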
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Paper • 2504.06261 • Published • 110
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Paper • 2503.01840 • Published • 5
Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
Paper • 2504.01928 • Published • 1
Gradient Surgery for Multi-Task Learning
Paper • 2001.06782 • Published • 1
SelfCP: Compressing Long Prompt to 1/12 Using the Frozen Large Language Model Itself
Paper • 2405.17052 • Published • 2
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Paper • 2403.19647 • Published • 4
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Paper • 2504.13837 • Published • 130
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Paper • 2504.13173 • Published • 19
Representation Learning with Contrastive Predictive Coding
Paper • 1807.03748 • Published • 1
Training Large Language Models to Reason in a Continuous Latent Space
Paper • 2412.06769 • Published • 87
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper • 2502.18137 • Published • 57
Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Paper • 2504.16922 • Published • 1
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Paper • 2504.01871 • Published • 12
Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Paper • 2504.03206 • Published • 1
Note PBRS (potential-based reward shaping) can be used for gated regularization.
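For reference, the standard PBRS construction (Ng et al., 1999), which leaves the optimal policy unchanged; how the paper gates it as a regularizer is not reproduced here:

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s), \qquad \tilde{r}(s, a, s') = r(s, a, s') + F(s, a, s'),$$

for any potential function $\Phi : \mathcal{S} \to \mathbb{R}$.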
Overtrained Language Models Are Harder to Fine-Tune
Paper • 2503.19206 • Published • 2
Long Context In-Context Compression by Getting to the Gist of Gisting
Paper • 2504.08934 • Published • 1
Model Diffusion for Certifiable Few-shot Transfer Learning
Paper • 2502.06970 • Published • 1
Memorization-Compression Cycles Improve Generalization
Paper • 2505.08727 • Published • 4
Chain-of-Model Learning for Language Model
Paper • 2505.11820 • Published • 119
Shannon information and integrated information: message and meaning
Paper • 2412.10626 • Published • 1
Let's Predict Sentence by Sentence
Paper • 2505.22202 • Published • 17
Learning to Reason without External Rewards
Paper • 2505.19590 • Published • 29
Pre-trained Large Language Models Learn Hidden Markov Models In-context
Paper • 2506.07298 • Published • 25
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Paper • 2506.06941 • Published • 13
A projection-based framework for gradient-free and parallel learning
Paper • 2506.05878 • Published • 1
Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking
Paper • 2505.18495 • Published • 1
In-Context Learning Strategies Emerge Rationally
Paper • 2506.17859 • Published • 9
Global and Local Entailment Learning for Natural World Imagery
Paper • 2506.21476 • Published • 1
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
Paper • 2506.19852 • Published • 35
Data Efficacy for Language Model Training
Paper • 2506.21545 • Published • 10
Note The 'learnability' metric requires training a small LM beforehand rather than being computed online; in that sense, selecting 'easy-to-learn' samples is an old idea.
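A toy sketch of the offline pattern the note alludes to (the scoring and keep-ratio are illustrative, not the paper's metric): score samples with a small reference LM trained beforehand and keep the ones it finds easiest.

# Toy sketch of offline, reference-model-based data selection. Assumes a
# Hugging Face-style causal LM whose forward returns .loss when labels are given.
import torch

def select_easy_samples(texts, ref_model, tokenizer, keep_ratio=0.5):
    ref_model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            losses.append(ref_model(input_ids=ids, labels=ids).loss.item())
    order = sorted(range(len(texts)), key=lambda i: losses[i])
    keep = order[: int(len(texts) * keep_ratio)]   # lowest reference loss = 'easy to learn'
    return [texts[i] for i in keep]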
Energy-Based Transformers are Scalable Learners and Thinkers
Paper • 2507.02092 • Published • 24
Note Using a neural network to directly predict outputs makes inference fast but makes search-based reasoning at inference time feel unnatural. In contrast, training a network to predict a loss function naturally supports gradient-based search at inference time, which is more aligned with tasks like image generation in continuous domains. However, this approach is roughly 3× heavier at both training and inference.
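A minimal sketch of the "predict a loss, then search" pattern the note describes (assumed architecture, not the paper's method): a network scores (input, candidate output) pairs with a scalar energy, and inference runs gradient descent on the candidate.

# Sketch of energy-based inference; hypothetical model, illustrative only.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, d_in, d_out, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in + d_out, d_hidden), nn.SiLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)  # energy per sample

def infer(model, x, d_out, steps=50, lr=0.1):
    for p in model.parameters():
        p.requires_grad_(False)                      # only the candidate output is optimized
    y = torch.zeros(x.shape[0], d_out, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):                           # each step is a full forward + backward,
        opt.zero_grad()                              # which is where the extra inference cost
        model(x, y).sum().backward()                 # relative to direct prediction comes from
        opt.step()
    return y.detach()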
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 89