I played around with the new RXTX paper (XX^T) and was able to train nanoGPT with 4x4 RXTX matmuls in both the attention layer and the optimizer 🤕 It just works (well, I had to add some guardrails) and still saves about 5% of memory usage.

The Patch:
- Computes attention scores with 4x4 blockwise RXTX matmuls (no PyTorch dot product) - rough sketch below.
- Handles arbitrary sequence lengths by padding to the nearest multiple of 4.
- An RXTX variant of Shampoo, with params reshaped into 4x4 blocks during each optimizer step.
- Uses ~5% fewer ops.

Code: https://github.com/Jaykef/ai-algorithms/blob/main/nanogpt-rxtx.ipynb
Paper: https://arxiv.org/pdf/2505.09814
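
For anyone curious what the blockwise part looks like, here's a minimal sketch of the padding + 4x4 block bookkeeping. It is NOT the exact RXTX product schedule from the paper (the block products are left as plain matmuls here, only the symmetry of XX^T is exploited), and the names pad_to_multiple_of_4 / blockwise_xxt are just illustrative, not the ones in the notebook:

```python
import torch
import torch.nn.functional as F


def pad_to_multiple_of_4(x: torch.Tensor) -> torch.Tensor:
    """Zero-pad the row (sequence) dimension up to the next multiple of 4."""
    n = x.shape[-2]
    pad = (-n) % 4
    if pad:
        # F.pad pads dims from the last backwards: (last_l, last_r, 2nd_last_l, 2nd_last_r)
        x = F.pad(x, (0, 0, 0, pad))
    return x


def blockwise_xxt(x: torch.Tensor) -> torch.Tensor:
    """Compute S = X @ X.T via a 4x4 partition of the rows.

    Only the upper-triangular block products are computed explicitly; the
    lower triangle is filled by symmetry (S[j, i] = S[i, j].T). The actual
    RXTX kernel replaces these block products with its own multiplication
    schedule, which is where the op savings come from.
    """
    n_orig = x.shape[-2]
    x = pad_to_multiple_of_4(x)
    n = x.shape[-2]
    b = n // 4  # block height
    rows = [x[..., i * b:(i + 1) * b, :] for i in range(4)]

    S = x.new_zeros(*x.shape[:-2], n, n)
    for i in range(4):
        for j in range(i, 4):
            blk = rows[i] @ rows[j].transpose(-1, -2)
            S[..., i * b:(i + 1) * b, j * b:(j + 1) * b] = blk
            if i != j:
                S[..., j * b:(j + 1) * b, i * b:(i + 1) * b] = blk.transpose(-1, -2)

    # Drop the padded rows/cols before softmax / further use.
    return S[..., :n_orig, :n_orig]


# Quick self-check with a sequence length that is not a multiple of 4.
x = torch.randn(2, 10, 16)
assert torch.allclose(blockwise_xxt(x), x @ x.transpose(-1, -2), atol=1e-5)

# The same XX^T routine is what the Shampoo-style preconditioner statistics
# need (L ~ G @ G.T, R ~ G.T @ G), which is where the optimizer-side call
# slots in. G here is just a toy gradient for illustration.
G = torch.randn(64, 48)
L_stat = blockwise_xxt(G)
R_stat = blockwise_xxt(G.transpose(-1, -2))
```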