MD Asaduzzaman

asaduzzaman319
AI & ML interests

None yet

Organizations

None yet

asaduzzaman319's activity

reacted to TuringsSolutions's post with 👀 6 months ago
I have been seeing a specific type of AI hype more and more. I call it "releasing research expecting that no one will ever reproduce your methods, then overhyping your results." I test the methodology of maybe 4-5 research papers per day. That is how I find a lot of my research. Usually, 3-4 of those experiments end up not being reproducible for some reason. I am starting to think it is not accidental.

So, I am launching a new series where I showcase a research paper by reproducing its methodology and highlighting the blatant flaws that show up when you actually do this. Here is Episode 1!

https://www.youtube.com/watch?v=JLa0cFWm1A4
reacted to Muhammadreza's post with ❤️ 6 months ago
Hey guys.
This is my first post here on huggingface. I'm glad to be a part of this amazing community!
reacted to loubnabnl's post with 🔥 6 months ago
view post
Post
5885
🍷 The FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
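The filtering step described above can be pictured roughly as follows. This is only a minimal sketch of classifier-based filtering, not the FineWeb-Edu pipeline itself: `filter_by_quality`, `toy_score`, the small vocabulary, and the threshold value are all hypothetical stand-ins, and the real classifier and its cutoff are documented in the technical report.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only the documents whose quality score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Returns the fraction of words that appear in a tiny hand-picked
    "educational" vocabulary. A real scorer would be a learned model.
    """
    vocab = {"theorem", "proof", "photosynthesis", "equation"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(word.strip(".,!?") in vocab for word in words)
    return hits / len(words)


docs = [
    "The proof of the theorem uses one equation.",
    "Buy cheap watches now!!!",
    "",
]

# The threshold of 0.2 is an arbitrary value chosen for this illustration.
kept = list(filter_by_quality(docs, toy_score, threshold=0.2))
print(kept)
```

Keeping the scorer as a plain function argument means the same loop would work unchanged if the toy scorer were swapped for a real model's prediction function.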

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!