LLM Adaptation to Czech Language
This collection accompanies the master's thesis on compute-constrained LLM adaptation to the Czech language. Available from: TBA.
Note: Czech version of the WildBench benchmark. More information in the thesis.
ctu-aic/cs_instruction_tuning_collection
Note: Collection of publicly available Czech instruction-tuning datasets. More information in the thesis.
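For quick inspection, the collection can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the split name and the column names are assumptions, so check the dataset viewer for the actual schema.

```python
# Minimal sketch: load the Czech instruction-tuning collection with `datasets`.
# The split name ("train") and column names ("instruction", "output") are assumptions;
# consult the dataset card/viewer for the actual configuration.
from datasets import load_dataset

ds = load_dataset("ctu-aic/cs_instruction_tuning_collection", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # one instruction-tuning example
```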
ctu-aic/en_instruction_tuning_collection
Note: Collection of English instruction-tuning datasets intended to complement cs_instruction_tuning_collection (potentially enabling better cross-lingual transfer via parallel corpora). More information in the thesis.
ctu-aic/nli_it_collection
Note: Czech and English NLI (and fact-checking) datasets transformed into instruction-output format for instruction tuning of LLMs, using templates inspired by Google's FLAN dataset. More information in the thesis.
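To illustrate the FLAN-style transformation mentioned above, the sketch below renders an NLI triple (premise, hypothesis, label) as an instruction-output pair. The template wording and label names are hypothetical placeholders, not the exact templates used for this collection.

```python
# Hypothetical FLAN-style template: converts an NLI example (premise, hypothesis, label)
# into an instruction-output pair suitable for instruction tuning.
# The wording and label set below are illustrative, not the collection's actual templates.
LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? "
    "Answer with entailment, neutral, or contradiction."
)

def to_instruction_example(premise: str, hypothesis: str, label: int) -> dict:
    """Render one NLI triple as an instruction-output pair."""
    return {
        "instruction": TEMPLATE.format(premise=premise, hypothesis=hypothesis),
        "output": LABELS[label],
    }

example = to_instruction_example(
    "Pes běží po zahradě.",    # "A dog is running in the garden."
    "Na zahradě je zvíře.",    # "There is an animal in the garden."
    0,
)
print(example["instruction"])
print(example["output"])
```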
ctu-aic/ask_library_cs
Note: Dataset scraped and processed from https://www.ptejteseknihovny.cz/. A filtered version is part of cs_instruction_tuning_collection. More information in the thesis.
ctu-aic/questions_ujc_cas_cs
Note: Dataset scraped, with direct permission, from https://dotazy.ujc.cas.cz/. A filtered version is part of cs_instruction_tuning_collection. More information in the thesis.
ctu-aic/Llama-3.1-8B-Instruct_it-mix
Note: Llama 3.1 8B Instruct instruction-tuned on a mixture of cs_instruction_tuning_collection and en_instruction_tuning_collection. More information in the thesis.
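A minimal sketch of running this model with `transformers`. It assumes the repository inherits the Llama 3.1 chat template from the base Instruct model, a recent `transformers` version, and enough GPU memory for an 8B model (or quantization).

```python
# Minimal sketch: chat-style generation with the instruction-tuned checkpoint.
# Assumes a recent `transformers` version and standard Llama 3.1 chat-template metadata.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ctu-aic/Llama-3.1-8B-Instruct_it-mix",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "Briefly explain what instruction tuning is."
    {"role": "user", "content": "Vysvětli stručně, co je instrukční ladění."},
]
out = generator(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])
```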
ctu-aic/Llama-3.1-8B_cp-mix_it-alpaca_dolly
Note: Llama 3.1 8B continuously pretrained on a mixture of the FineWeb2 and FineWeb-Edu datasets and instruction-tuned on a mixture of the English and Czech Alpaca and Dolly datasets. More information in the thesis.
ctu-aic/Llama-3.1-8B-Instruct_nli-mix
Note: Llama 3.1 8B Instruct fine-tuned on nli_it_collection (task-specific tuning). More information in the thesis.
ctu-aic/Llama-3.1-8B_cp-cs
Note: Llama 3.1 8B continuously pretrained on the Czech subset of FineWeb2. More information in the thesis.
ctu-aic/Llama-3.1-8B_cp-mix
Note: Llama 3.1 8B continuously pretrained on a mixture of the FineWeb2 and FineWeb-Edu datasets. More information in the thesis.
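The continually pretrained checkpoints (cp-cs, cp-mix) are base models without instruction tuning, so they are best queried with plain text completion rather than a chat template. A minimal sketch, with illustrative generation settings:

```python
# Minimal sketch: plain-text completion with a continually pretrained base checkpoint.
# Generation settings are illustrative; these are base (non-instruct) models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ctu-aic/Llama-3.1-8B_cp-mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Praha je hlavní město"   # "Prague is the capital city of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```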