ByteSpan Tokenisers

non-profit

Recent Activity

suchirsalhan

authored 3 papers 9 days ago

BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models

Paper • 2510.19419 • Published 11 days ago • 1

What is the Best Sequence Length for BABYLM?

Paper • 2510.19493 • Published 11 days ago • 1

Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction

Paper • 2510.20411 • Published 10 days ago • 2

suchirsalhan

authored a paper 16 days ago

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Paper • 2510.10159 • Published 22 days ago • 2

suchirsalhan

authored a paper 22 days ago

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Paper • 2510.08470 • Published 24 days ago • 1

suchirsalhan

authored 2 papers 28 days ago

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

Paper • 2509.16413 • Published Sep 19 • 1

Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages

Paper • 2509.02160 • Published Sep 2 • 1

codebyzeb

updated 2 models 4 months ago

ByteSpanTokenisers/fineweb-models

Updated Jun 29

ByteSpanTokenisers/fw57M-tied_finewebedu-20B_ByteSpanSurprisalGlobalIncrement_64000

Updated Jun 29 • 151

codebyzeb

published a model 4 months ago

ByteSpanTokenisers/fw57M-tied_finewebedu-20B_ByteSpanSurprisalGlobalIncrement_64000

Updated Jun 29 • 151

suchirsalhan

authored a paper 4 months ago

ByteSpan: Information-Driven Subword Tokenisation

Paper • 2506.18639 • Published Jun 23 • 3

codebyzeb

updated a dataset 4 months ago

ByteSpanTokenisers/common-corpus

Viewer • Updated Jun 24 • 820k • 22

codebyzeb

updated a model 4 months ago

ByteSpanTokenisers/fw57M-tied_finewebedu-20B_BPEWP_64000

Updated Jun 23

pietrolesci

authored a paper 7 months ago

PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs

Paper • 2503.09543 • Published Mar 12

suchirsalhan

authored a paper 7 months ago

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies

Paper • 2410.22886 • Published Oct 30, 2024 • 1

pietrolesci

authored a paper 8 months ago

Self-Training Large Language Models for Tool-Use Without Demonstrations

Paper • 2502.05867 • Published Feb 9

juliuscheng

authored a paper 8 months ago

Early-Exit and Instant Confidence Translation Quality Estimation

Paper • 2502.14429 • Published Feb 20 • 4

pietrolesci

authored a paper 12 months ago

Tending Towards Stability: Convergence Challenges in Small Language Models

Paper • 2410.11451 • Published Oct 15, 2024

pietrolesci

authored 2 papers over 1 year ago

Causal Estimation of Memorisation Profiles

Paper • 2406.04327 • Published Jun 6, 2024 • 1

AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets

Paper • 2404.05623 • Published Apr 8, 2024 • 3

Team members 4
