BigScience Workshop

non-profit

https://bigscience.huggingface.co

bigscienceW

bigscience-workshop

AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

afaji authored a paper 10 days ago

Predicting the Order of Upcoming Tokens Improves Language Modeling

christopher new activity 21 days ago

bigscience/bloom: Let's talk about the model

w11wo authored a paper about 1 month ago

Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

davanstrien

posted an update 4 days ago

I fine-tuned a smol VLM to generate specialized art history metadata!

davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

christopher

in bigscience/bloom 21 days ago

Let's talk about the model

#284 opened 21 days ago by kalashshah19

BramVanroy

posted an update 25 days ago

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, samples with non-commercial licenses, and Wikipedia samples. The Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
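The C5r criteria listed above can be pictured as a simple predicate over records. The following is only a minimal sketch: the field names (`license_disagreement`, `license`, `source_domain`) and the sample records are hypothetical and are not taken from the dataset's actual schema, and the real filtering pipeline may work quite differently.

```python
from typing import Iterable

# NOTE: all field names and sample records here are hypothetical
# placeholders, not the actual schema or contents of the C5 datasets.


def is_non_commercial(license_abbr: str) -> bool:
    """Return True if a Creative Commons abbreviation has an 'nc' component."""
    return "nc" in license_abbr.lower().split("-")


def keep_recommended(record: dict) -> bool:
    """Apply the three C5r-style exclusions described in the post."""
    if record["license_disagreement"]:
        return False  # conflicting license signals on the page
    if is_non_commercial(record["license"]):
        return False  # non-commercial license
    if record["source_domain"].endswith("wikipedia.org"):
        return False  # Wikipedia content is better obtained elsewhere
    return True


def filter_recommended(records: Iterable[dict]) -> list:
    """Keep only the records that pass every exclusion."""
    return [r for r in records if keep_recommended(r)]


samples = [
    {"id": "a", "license": "by-sa", "license_disagreement": False, "source_domain": "example.org"},
    {"id": "b", "license": "by-nc-sa", "license_disagreement": False, "source_domain": "example.org"},
    {"id": "c", "license": "by", "license_disagreement": True, "source_domain": "example.org"},
    {"id": "d", "license": "by-sa", "license_disagreement": False, "source_domain": "en.wikipedia.org"},
]

kept = filter_recommended(samples)
print([r["id"] for r in kept])
```

In practice there is no need to reimplement any of this, since the already-filtered subsets are published at the locations given above.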

w11wo

authored a paper about 1 month ago

Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Paper • 2507.20136 • Published Jul 27

pminervini

authored a paper about 2 months ago

Inverse Scaling in Test-Time Compute

Paper • 2507.14417 • Published Jul 19 • 27

NohTow

authored 2 papers about 2 months ago

Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Paper • 2507.11412 • Published Jul 15 • 25

BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Paper • 2506.10896 • Published Jun 12 • 3

breakend

authored 13 papers 2 months ago

On the Opportunities and Risks of Foundation Models

Paper • 2108.07258 • Published Aug 16, 2021 • 1

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Paper • 2402.05162 • Published Feb 7, 2024 • 1

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Paper • 2308.11462 • Published Aug 20, 2023 • 3

FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning

Paper • 2404.02127 • Published Apr 2, 2024

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Paper • 2406.16746 • Published Jun 24, 2024

Fantastic Copyrighted Beasts and How (Not) to Generate Them

Paper • 2406.14526 • Published Jun 20, 2024 • 1

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Paper • 2406.14598 • Published Jun 20, 2024

Evaluating Copyright Takedown Methods for Language Models

Paper • 2406.18664 • Published Jun 26, 2024 • 1

In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

Paper • 2503.16861 • Published Mar 21 • 1

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Paper • 2503.06378 • Published Mar 9 • 1

Team members 328
