Asankhaya Sharma (codelion)
387 followers · 21 following
http://asankhaya.github.io/
AI & ML interests
Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.
Recent Activity
reacted to their post with ➕ about 7 hours ago:
Introducing Dhara-70M: a diffusion language model that achieves 3.8x higher throughput than autoregressive models! Key findings from our research on optimal architectures for small language models:
→ Depth beats width: 32 layers outperform 12 layers at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: https://huggingface.co/codelion/dhara-70m
codelion's models (29, sorted by recently updated)
| Model | Task | Size | Updated | Downloads | Likes |
|---|---|---|---|---|---|
| codelion/Qwen3-4B-Instruct-2507-self-verify-lora | — | — | about 8 hours ago | 21 | — |
| codelion/dhara-70m | Text Generation | 71.3M | 1 day ago | 880 | 5 |
| codelion/gpt-2-70m | Text Generation | 64.1M | Nov 2 | 581 | 18 |
| codelion/Qwen3-4B-execution-world-model-lora | Text Generation | — | Oct 20 | 34 | 3 |
| codelion/Qwen2.5-Coder-0.5B-Instruct-security-grpo-lora | Text Generation | — | Aug 2 | 5 | — |
| codelion/qwen2-5-coder-0-5b-instruct-progressive-2000k-lora | Text Generation | — | Jul 20 | 4 | — |
| codelion/Llama-3.2-1B-Instruct-tool-calling-lora | Text Generation | — | Jul 18 | 72 | 4 |
| codelion/gemma-3-1b-it-reasoning-grpo-lora | Text Generation | — | Jul 18 | 15 | 5 |
| codelion/Qwen3-0.6B-ICM-DPO | Text Generation | 0.6B | Jul 18 | 11 | — |
| codelion/gemma-3-1b-it-ICM-DPO | Text Generation | 1.0B | Jul 18 | 13 | — |
| codelion/gemma-3-1b-it-ICM-DPO-mlx-fp16 | Text Generation | 1B | Jul 17 | 21 | — |
| codelion/Qwen3-0.6B-ICM-DPO-mlx-fp16 | Text Generation | 0.6B | Jul 17 | 23 | 2 |
| codelion/Qwen3-0.6B-accuracy-recovery-lora | Text Generation | — | Jul 13 | 66 | 4 |
| codelion/Qwen3-0.6B-GRPO-mlx-fp16 | Text Generation | 0.6B | Jul 11 | 7 | — |
| codelion/Qwen3-0.6B-GRPO | Text Generation | 0.6B | Jul 11 | 5 | — |
| codelion/DeepSeek-R1-Distill-Qwen-1.5B-PTS-DPO | Text Generation | 2B | May 13 | 11 | 2 |
| codelion/Qwen3-0.6B-PTS-DPO | Text Generation | 0.6B | May 12 | 17 | 1 |
| codelion/Qwen3-0.6B-PTS-DPO-LoRA | — | — | May 7 | 1 | — |
| codelion/optillm-bert-uncased | — | — | Feb 16 | 56 | 5 |
| codelion/optillm-modernbert-large | — | — | Feb 16 | 30 | 9 |
| codelion/Llama-3.3-70B-o1 | Text Generation | 71B | Jan 21 | 91 | 2 |
| codelion/Llama-3.3-70B-o1-gguf | — | 71B | Jan 20 | 103 | 1 |
| codelion/Llama-3.3-70B-o1-lora | — | — | Jan 20 | 2 | — |
| codelion/Llama-3.2-3B-o1 | — | 3B | Jan 12 | 72 | 5 |
| codelion/Llama-3.2-3B-o1-lora | — | — | Jan 12 | 4 | — |
| codelion/MathCoT | — | 8B | Nov 26, 2024 | 33 | 2 |
| codelion/scorelora | — | — | Oct 15, 2024 | 6 | 3 |
| codelion/public-domain-mickey-mouse | Text-to-Image | — | Jan 5, 2024 | 13 | 2 |
| codelion/whisper-age-estimator | Automatic Speech Recognition | 72.6M | Sep 10, 2023 | 64 | 3 |