Youngjoon Jang

yjoonjang

https://yjoonjang.github.io/

AI & ML interests

Information Retrieval (IR), Retrieval-Augmented Generation (RAG)

Recent Activity

updated a model 5 days ago

yjoonjang/splade-ko-v1

new activity 22 days ago

yjoonjang/colbert-ko-v1.0:Produce ColBERT-KO Evaluation Results

reacted to Norod78's post with 👍 24 days ago

Multilingual Tokenization Showdown: Analyzing 12 LLM Tokenizers Across 204 Languages

First, I've created a dataset with the text of Wikipedia's "Cat" article in 272 languages: https://huggingface.co/datasets/Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "characters per token" and "words per token" ratios. The higher these ratios are, the more information each token represents on average for that language (perhaps allowing the LLM to learn more per parameter if trained on a dataset in that language).

You can see a slideshow summary of the results here: https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly. I've made the code available on GitHub, so you can re-create the raw results JSONL with this repo: https://github.com/Norod/wikicat-tokenizer-eval

Post on X: https://x.com/Norod78/status/1984366900550266999
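The two ratios the post describes can be illustrated with a minimal sketch. This is not the code from the linked repository: `token_ratios` and `toy_tokenize` are hypothetical names, the sample sentence is made up, and the toy tokenizer (splitting each word into chunks of at most four characters) only stands in for a real LLM tokenizer. It also assumes that "words" means whitespace-separated words, which the post does not specify.

```python
from typing import Callable, Dict, List


def token_ratios(text: str, tokenize: Callable[[str], List[str]]) -> Dict[str, float]:
    """Compute characters-per-token and words-per-token for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    words = text.split()  # assumption: words are whitespace-separated
    return {
        "chars_per_token": len(text) / len(tokens),
        "words_per_token": len(words) / len(tokens),
    }


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer:
    # each word is cut into chunks of at most 4 characters.
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


# Made-up sample text, not taken from the dataset.
sample = "The cat is a small domesticated carnivorous mammal"
ratios = token_ratios(sample, toy_tokenize)
print(ratios)
```

To compare real tokenizers, any callable that maps a string to a list of tokens could be passed as `tokenize`. The post's 100-word minimum would be applied as a filter on each text before calling the function.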

yjoonjang's datasets (32)

yjoonjang/markers_bm

Viewer • Updated Nov 5, 2024 • 948 • 608 • 1

yjoonjang/frames_benchmark_formatted

Viewer • Updated Sep 30, 2024 • 824 • 6