LLM Adaptation to Czech Language
This collection accompanies the master's thesis on compute-constrained LLM adaptation to the Czech language. Available from: TBA.
Note: Czech version of the WildBench benchmark. More information in the thesis.
ctu-aic/cs_instruction_tuning_collection
Note: Collection of publicly available Czech instruction-tuning datasets. More information in the thesis.
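For quick inspection, the collection can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the split name and the column names are assumptions, so check the dataset viewer for the actual schema.

```python
# Minimal sketch: load the Czech instruction-tuning collection with `datasets`.
# The split name ("train") and column names ("instruction", "output") are assumptions;
# consult the dataset card/viewer for the actual configuration.
from datasets import load_dataset

ds = load_dataset("ctu-aic/cs_instruction_tuning_collection", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # one instruction-tuning example
```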
ctu-aic/en_instruction_tuning_collection
Note: Collection of English instruction-tuning datasets intended to complement cs_instruction_tuning_collection (potentially enabling better cross-lingual transfer via parallel corpora). More information in the thesis.
ctu-aic/nli_it_collection
Note: Czech and English NLI (and fact-checking) datasets transformed into instruction-output format for instruction tuning of LLMs, using templates inspired by Google's FLAN dataset. More information in the thesis.
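To illustrate the FLAN-style transformation mentioned above, the sketch below renders an NLI triple (premise, hypothesis, label) as an instruction-output pair. The template wording and label names are hypothetical placeholders, not the exact templates used for this collection.

```python
# Hypothetical FLAN-style template: converts an NLI example (premise, hypothesis, label)
# into an instruction-output pair suitable for instruction tuning.
# The wording and label set below are illustrative, not the collection's actual templates.
LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? "
    "Answer with entailment, neutral, or contradiction."
)

def to_instruction_example(premise: str, hypothesis: str, label: int) -> dict:
    """Render one NLI triple as an instruction-output pair."""
    return {
        "instruction": TEMPLATE.format(premise=premise, hypothesis=hypothesis),
        "output": LABELS[label],
    }

example = to_instruction_example(
    "Pes běží po zahradě.",    # "A dog is running in the garden."
    "Na zahradě je zvíře.",    # "There is an animal in the garden."
    0,
)
print(example["instruction"])
print(example["output"])
```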
ctu-aic/ask_library_cs
Note: Dataset scraped and processed from https://www.ptejteseknihovny.cz/. A filtered version is part of cs_instruction_tuning_collection. More information in the thesis.
ctu-aic/questions_ujc_cas_cs
Note: Dataset scraped, with direct permission, from https://dotazy.ujc.cas.cz/. A filtered version is part of cs_instruction_tuning_collection. More information in the thesis.
ctu-aic/Llama-3.1-8B-Instruct_it-mix
Note: Llama 3.1 8B Instruct instruction-tuned on a mixture of cs_instruction_tuning_collection and en_instruction_tuning_collection. More information in the thesis.
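A minimal sketch of running this model with `transformers`. It assumes the repository inherits the Llama 3.1 chat template from the base Instruct model, a recent `transformers` version, and enough GPU memory for an 8B model (or quantization).

```python
# Minimal sketch: chat-style generation with the instruction-tuned checkpoint.
# Assumes a recent `transformers` version and standard Llama 3.1 chat-template metadata.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ctu-aic/Llama-3.1-8B-Instruct_it-mix",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "Briefly explain what instruction tuning is."
    {"role": "user", "content": "Vysvětli stručně, co je instrukční ladění."},
]
out = generator(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])
```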
ctu-aic/Llama-3.1-8B_cp-mix_it-alpaca_dolly
Note: Llama 3.1 8B continuously pretrained on a mixture of the FineWeb2 and FineWeb-Edu datasets and instruction-tuned on a mixture of the English and Czech Alpaca and Dolly datasets. More information in the thesis.
ctu-aic/Llama-3.1-8B-Instruct_nli-mix
Note: Llama 3.1 8B Instruct fine-tuned on nli_it_collection (task-specific tuning). More information in the thesis.
ctu-aic/Llama-3.1-8B_cp-cs
Note: Llama 3.1 8B continuously pretrained on the Czech subset of FineWeb2. More information in the thesis.
ctu-aic/Llama-3.1-8B_cp-mix
Note: Llama 3.1 8B continuously pretrained on a mixture of the FineWeb2 and FineWeb-Edu datasets. More information in the thesis.
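The continually pretrained checkpoints (cp-cs, cp-mix) are base models without instruction tuning, so they are best queried with plain text completion rather than a chat template. A minimal sketch, with illustrative generation settings:

```python
# Minimal sketch: plain-text completion with a continually pretrained base checkpoint.
# Generation settings are illustrative; these are base (non-instruct) models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ctu-aic/Llama-3.1-8B_cp-mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Praha je hlavní město"   # "Prague is the capital city of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```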