Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

liked a model 8 days ago

deepseek-ai/DeepSeek-V4-Pro

upvoted an article 11 days ago

DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models

updated a dataset 11 days ago

orbit-ai/orbit-seeds

Posts 2

Post

1900

Last year, I curated and generated several multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages with the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope they help the community with pretraining and instruction-tuning multilingual LLMs! I added a small diagram that briefly describes which datasets are included and their sources.

Happy to collaborate, either on using these datasets for instruction fine-tuning or on extending the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

Post

3822

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7
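
For convenience, the three subsets listed above can be mapped to their Hub dataset pages. This is a minimal sketch: the repo IDs are copied from the post, while the dictionary keys and the `dataset_url` helper are hypothetical names introduced only for illustration. It does not touch the network.

```python
# Repo IDs are taken from the post above; the keys and the helper name
# are illustrative assumptions, not part of any official API.
SWIM_IR_SUBSETS = {
    "cross-lingual": "nthakur/swim-ir-cross-lingual",
    "monolingual": "nthakur/swim-ir-monolingual",
    "indic-cross-lingual": "nthakur/indic-swim-ir-cross-lingual",
}


def dataset_url(subset: str) -> str:
    """Return the Hugging Face Hub page URL for a SWIM-IR subset."""
    repo_id = SWIM_IR_SUBSETS[subset]  # raises KeyError for unknown subsets
    return f"https://huggingface.co/datasets/{repo_id}"


if __name__ == "__main__":
    for name in SWIM_IR_SUBSETS:
        print(f"{name}: {dataset_url(name)}")
```

To actually load a subset, the same repo ID would typically be passed to `datasets.load_dataset`; the available configuration and split names are not stated in the post, so they should be checked on each dataset card first.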

Collections 5

Papers 18

arxiv:2604.01195

arxiv:2508.06600

arxiv:2505.16967

arxiv:2504.20006

Models 36

nthakur/orbit-4b-asearcher-en-no-math-14K-step-75

4B • Updated 12 days ago • 42

nthakur/qwen3-4b-grpo-modified-5-docs-only-odyssey-step-135

4B • Updated 12 days ago • 94

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft-teacher-mixtral

Updated Mar 31, 2025 • 13 • 1

nthakur/Meta-Llama-3-8B-Instruct-mirage-bench-sft

Updated Mar 31, 2025

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft

Updated Mar 31, 2025 • 17

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2

Updated Aug 23, 2024 • 8

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-final

Updated Aug 13, 2024

nthakur/Meta-Llama-3-8B-Instruct-mirage-all-teacher-instruct-llama-3-sft

Updated Aug 13, 2024 • 2

nthakur/Mistral-7B-Instruct-v0.2-mirage-all-teacher-instruct-mistral-sft

Updated Aug 13, 2024 • 9

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0

Updated Aug 12, 2024

Datasets 58

nthakur/mirage-bench-pairwise-judgments

Viewer • Updated Mar 19 • 299k • 86 • 1

nthakur/search-arena-v1-nuggets-with-urls-5k-qwen

Viewer • Updated Jul 29, 2025 • 5.1k • 3

nthakur/cornstack-6-langs-v1-tevatron-6M

Viewer • Updated Jun 3, 2025 • 5.92M • 16

nthakur/cornstack-php-v1-tevatron-1M

Viewer • Updated Jun 2, 2025 • 993k • 32

nthakur/cornstack-go-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 995k • 48

nthakur/cornstack-javascript-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 952k • 43

nthakur/cornstack-ruby-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 989k • 84

nthakur/cornstack-java-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 995k • 15

nthakur/cornstack-python-v1-tevatron-1M

Viewer • Updated May 29, 2025 • 994k • 69

nthakur/default-100K-test

Viewer • Updated May 26, 2025 • 19k • 5
