AI & ML interests

Hackathon de PLN en español open-source y enfocado a los Objetivos de Desarrollo Sostenible de la ONU. Organizado por Somos NLP y patrocinado por Platzi, Paperspace y Hugging Face.

Recent Activity

not working

1
#1 opened about 14 hours ago by
andresadmim

update sdk

#3 opened about 7 hours ago by
pcuenq

update gradio

#2 opened about 7 hours ago by
pcuenq
haritzpuerto 
posted an update about 1 month ago
view post
Post
310
📜 Accepted at ACL 2025! Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs
We propose to fine-tune LLMs to generate diverse chains of thought (DCoT) in a single inference step. This enables within-inference refinement of the cots, no external feedback needed!
🔗 https://arxiv.org/abs/2407.03181
haritzpuerto 
posted an update 6 months ago
view post
Post
642
I just got my first ChatGPT review on ARR! 😅 Any advice on how to prove it's AI-generated? Thanks!
  • 3 replies
·
haritzpuerto 
posted an update 6 months ago
view post
Post
1482
I'm excited to announce that my internship paper at Parameter Lab was accepted to Findings of #NAACL2025 🎉
TLDR: Stating an LLM was trained on a sentence might not be possible 😥 , but it is possible for large enough amounts of tokens, such as long documents or multiple documents! 🤯
Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models (2411.00154)
🔗 https://github.com/parameterlab/mia-scaling
mrm8488 
posted an update about 1 year ago
view post
Post
6595
🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)


🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
·
mrm8488 
posted an update about 1 year ago
view post
Post
7457
Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.
·
osanseviero 
posted an update about 1 year ago
view post
Post
14145
Diaries of Open Source. Part 15 🤗

🕵️‍♀️Idefics 2 is out, a multimodal open-source model with very nice capabilities
Models, demo, and datasets: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blog: https://hf.co/blog/idefics2

💾Snowflake released snowflake-arctic-embed, a family of powerful small embedding models
Model: Snowflake/snowflake-arctic-embed-m
Blog: https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/

✨Pile-T5, EleutherAI's T5 model trained on 2T tokens
Blog: https://blog.eleuther.ai/pile-t5/
Models: EleutherAI/pile-t5-65a76a0d0022dd270b385a66
GitHub: https://github.com/EleutherAI/improved-t5

🤖CodeQwen1.5-7B base and chat models. Models trained on 3T tokens strong benchmark results for code generation, editing and SQL
Blog post: https://qwenlm.github.io/blog/codeqwen1.5/
Demo: https://hf.co/spaces/Qwen/CodeQwen1.5-7b-Chat-demo
Models: Qwen/CodeQwen1.5-7B and Qwen/CodeQwen1.5-7B-Chat

Misc
🦉 DocOwl1.5: Unified Stucture Learning for OCR-free Document Understanding mPLUG/DocOwl
👀Cerule - a tiny Vision LM model Tensoic/Cerule-v0.1
ChemLLM - a LLM for chemistry and molecule science ⚗️https://hf.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO
Distil Whisper Large
📝New pdf/OCR datasets with 19 samples pixparse/pdf-document-ocr-datasets-660701430b0346f97c4bc628
🔥Gretel AI high quality text-to-sql synthetic dataset gretelai/synthetic_text_to_sql
·
osanseviero 
posted an update about 1 year ago
view post
Post
11321
Diaries of Open Source. Part 14 🤗

🔥CohereForAI releases Command R+, an open 104B model with:
- Tool usage capabilities
- Specialized in RAGs
- Multilingual
It's one of the first models to surpass GPT-4 in the lmsys arena, check it out!
Model: https://hf.co/CohereForAI/c4ai-command-r-plus
Official demo: https://hf.co/spaces/CohereForAI/c4ai-command-r-plus
Quantized: https://hf.co/CohereForAI/c4ai-command-r-plus-4bit

🎉Google releases a new version of their Gemma instruct models, with improved quality, nicer to converse, and a fancier RL algorithm. The model is similar to Llama 2 70B in the Chat Arena!
Models: google/gemma-release-65d5efbccdbb8c4202ec078b
Try it out in HuggingChat https://hf.co/chat/models/google/gemma-1.1-7b-it

🪄VoiceCraft, a speech editing and TTS SOTA open model
Paper: VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (2403.16973)
Model: pyp1/VoiceCraft

💻Google released CodeGemma, a family of code generation, completion, and chat models
Blog post: https://hf.co/blog/codegemma
Models: google/codegemma-release-66152ac7b683e2667abdee11
Report: https://storage.googleapis.com/deepmind-media/gemma/codegemma_report.pdf

Misc models:
🦖T-Rex2, a very powerful object detection model for many applications https://github.com/IDEA-Research/T-Rex
👀 CT-RATE : A 3D dataset paired with text reports ibrahimhamamci/CT-RATE
🐙Octopus v2: a Gemma-based model trained for Android API - extremely fast, better than Llama+RAG, great results NexaAIDev/Octopus-v2
  • 2 replies
·
osanseviero 
posted an update over 1 year ago
view post
Post
2353
Diaries of Open Source. Part 13 🤗

🤏Two different bitnet 1.5 open-source replications
Original paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)
1bitllm experiment: https://hf.co/blog/joey00072/experiments-with-bitnet-1-5
NousResearch experiment NousResearch/OLMo-Bitnet-1B

🥳Tiny and large multimodal models great for embeddings
GitHub: https://github.com/unum-cloud/uform
Encoders: https://hf.co/collections/unum-cloud/multimodal-encoders-660553903617c5297eb16838
ONNX weights: https://hf.co/collections/unum-cloud/uform-vl-english-large-onnx-66055a57c182d846f3bc1949

📜 SMPLer-X: Expressive Human Pose and Shape Estimation
Project website: https://caizhongang.com/projects/SMPLer-X/
Demo: caizhongang/SMPLer-X
Paper: SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation (2309.17448)

🧙GeoWizard: 3D Geometry Estimation
Project website: https://fuxiao0719.github.io/projects/geowizard/
Demo: lemonaddie/geowizard

Misc models and datasets
- Dolphin-2.8-mistral-7b-v0.2 cognitivecomputations/dolphin-2.8-mistral-7b-v02
- Hermes-2-Pro-11B, a self-frankenmerge 11B variant mattshumer/Hermes-2-Pro-11B
- Large conversational dataset based on Usenet data in the Italian language mii-community/UsenetArchiveIT-conversations
  • 3 replies
·