Speech Recognition Community Event Version 2

non-profit

AI & ML interests

Multi-Lingual Speech Recognition

Recent Activity

w11wo

authored a paper 23 days ago

Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Paper • 2507.20136 • Published 25 days ago

PereLluis13

authored 2 papers 30 days ago

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Paper • 2410.05183 • Published Oct 7, 2024 • 1

BOOKCOREF: Coreference Resolution at Book Scale

Paper • 2507.12075 • Published Jul 16 • 5

sanchit-gandhi

authored 2 papers about 1 month ago

Magistral

Paper • 2506.10910 • Published Jun 12 • 63

Voxtral

Paper • 2507.13264 • Published Jul 17 • 25

gagan3012

authored a paper 2 months ago

Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

Paper • 2506.13458 • Published Jun 16

gagan3012

authored a paper 3 months ago

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Paper • 2505.16088 • Published May 22 • 3

DrishtiSharma

authored a paper 3 months ago

Behind Maya: Building a Multilingual Vision Language Model

Paper • 2505.08910 • Published May 13 • 2

w11wo

authored a paper 3 months ago

Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks

Paper • 2505.11239 • Published May 16

phantomcoder1996

authored a paper 4 months ago

ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach

Paper • 2503.17460 • Published Mar 21 • 1

PereLluis13

authored a paper 4 months ago

Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

Paper • 2504.17025 • Published Apr 23 • 17

DrishtiSharma

authored a paper 4 months ago

Robust and Fine-Grained Detection of AI Generated Texts

Paper • 2504.11952 • Published Apr 16 • 12

anton-l

authored a paper 4 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 197

w11wo

authored a paper 5 months ago

COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

Paper • 2503.07259 • Published Mar 10

anton-l

authored a paper 7 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 241

g8a9

authored a paper 7 months ago

MSTS: A Multimodal Safety Test Suite for Vision-Language Models

Paper • 2501.10057 • Published Jan 17 • 10

gagan3012

authored a paper 8 months ago

DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Paper • 2412.13377 • Published Dec 17, 2024 • 2

anton-l

posted an update 8 months ago

Introducing 📐FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over both the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2

versae

authored a paper 8 months ago

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Paper • 2412.09460 • Published Dec 12, 2024 • 9

DrishtiSharma

authored a paper 8 months ago

1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering

Paper • 2412.06009 • Published Dec 8, 2024

Team members 199