sometimesanotion
104 followers · 125 following
https://ko-fi.com/sometimesanotion
AI & ML interests
Agentic LLM services, model merging, finetunes, distillation
Recent Activity
Liked a model 8 days ago: sequelbox/gpt-oss-20b-DES-Reasoning
Liked a model 13 days ago: anon-researcher-ua/codegpt-oss-20b-GGUF
Reacted to codelion's post with 🔥 14 days ago:
I wanted to share a technique that's been working really well for recovering performance after INT4 quantization. Quantizing an LLM to INT4 (unlike, say, INT8) for inference typically incurs some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.

Last year, Apple's foundation models paper (https://arxiv.org/pdf/2407.21075) proposed a similar technique and found that "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GSM8K accuracy is boosted by 5-10%." (page 47). We saw similar results on Qwen3-0.6B:
- Perplexity: 2.40 → 2.09 (only 5.7% degradation from the FP16 baseline)
- Memory: only 0.28 GB vs 1.0 GB for FP16 (75% reduction)
- Speed: 3.0x faster inference than FP16
- Quality: generates correct, optimized code solutions

- Pre-trained adapter: https://huggingface.co/codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora

Happy to answer questions about the implementation, or to help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable: a small adapter can bridge the gap without negating the benefits of quantization. Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!
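For anyone who wants a feel for the moving parts, here is a minimal sketch of the kind of training loop the post describes, assuming bitsandbytes NF4 as the 4-bit quantization, a PEFT rank-16 LoRA on the attention projections, and a Qwen-style ChatML pre-query prefix for Magpie-style self-generation. All names, hyperparameters, and the single-sequence loop are illustrative assumptions, not the ellora implementation (see the GitHub repo above for the real code).

```python
# Sketch of accuracy-recovery LoRA training via self-distillation (assumptions noted above).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Frozen FP16 teacher.
teacher = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

# INT4 student (NF4 here as an assumption) with a trainable rank-16 LoRA adapter.
student = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
student = prepare_model_for_kbit_training(student)
student = get_peft_model(student, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1e-4
)

# Magpie pre-query template (assumed ChatML for Qwen): the model invents the instruction itself.
MAGPIE_PREFIX = "<|im_start|>user\n"

def self_generate_batch(max_new_tokens=256):
    """Prompt the teacher with only the user-turn prefix so it writes its own
    instruction and answer, keeping training data in the model's natural distribution."""
    inputs = tokenizer(MAGPIE_PREFIX, return_tensors="pt").to(teacher.device)
    return teacher.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.8
    )

for step in range(1_000):
    input_ids = self_generate_batch()
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids.to(student.device)).logits

    # Distillation loss: match the student's token distribution to the teacher's.
    loss = F.kl_div(
        F.log_softmax(student_logits.float(), dim=-1),
        F.softmax(teacher_logits.float().to(student_logits.device), dim=-1),
        reduction="batchmean",
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Only the LoRA adapter weights are written out.
student.save_pretrained("Qwen3-0.6B-accuracy-recovery-lora")
```

The design point worth noting is that teacher and student start from the same checkpoint, so the adapter only has to learn the systematic quantization error rather than any new behavior, which is why such a small rank can recover most of the gap.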
sometimesanotion's models (16), most recently updated first
sometimesanotion/Qwenvergence-14B-v13-Prose-DS · Text Generation · 15B · Updated May 13 · 7 downloads · 10 likes
sometimesanotion/Lamarck-14B-v0.7 · Text Generation · 15B · Updated Mar 17 · 2.3k downloads · 42 likes
sometimesanotion/Lamarck-14B-v0.7-Fusion · Text Generation · 15B · Updated Feb 28 · 2.19k downloads · 8 likes
sometimesanotion/Qwenvergence-14B-v12-Prose-DS · Text Generation · 15B · Updated Feb 4 · 11 downloads · 7 likes
sometimesanotion/Qwenvergence-14B-v11 · Text Generation · 15B · Updated Feb 3 · 6 downloads · 4 likes
sometimesanotion/Qwenvergence-14B-v12-Prose · Text Generation · 15B · Updated Jan 30 · 6 downloads · 2 likes
sometimesanotion/LoRA-64-Chocolatine-2-14B-Instruct-v2.0b3 · Updated Jan 29
sometimesanotion/LoRA-32-Chocolatine-2-14B-Instruct-v2.0b3 · Updated Jan 29
sometimesanotion/Lamarck-14B-v0.6 · Text Generation · 15B · Updated Jan 22 · 18 downloads · 14 likes
sometimesanotion/Lamarck-14B-v0.3 · Text Generation · 15B · Updated Jan 13 · 5 downloads · 3 likes
sometimesanotion/Qwen2.5-14B-Vimarckoso-v3 · Text Generation · 15B · Updated Jan 6 · 20 downloads · 11 likes
sometimesanotion/Qwenvergence-14B-v6-Prose · Text Generation · 15B · Updated Dec 31, 2024 · 5
sometimesanotion/Abliterate-Qwenvergence · Text Generation · 15B · Updated Dec 24, 2024 · 4
sometimesanotion/LoRA-Abliterate-256 · Updated Dec 24, 2024
sometimesanotion/LoRA-Abliterate-128 · Updated Dec 24, 2024
sometimesanotion/LoRA-Abliterate-64 · Updated Dec 24, 2024