I Hackathon Somos NLP: PLN en Español

non-profit

AI & ML interests

Hackathon de PLN en español open-source y enfocado a los Objetivos de Desarrollo Sostenible de la ONU. Organizado por Somos NLP y patrocinado por Platzi, Paperspace y Hugging Face.

Recent Activity

pcuenq new activity about 7 hours ago

somosnlp-hackathon-2022/spanish-to-quechua-translation:not working

pcuenq new activity about 7 hours ago

somosnlp-hackathon-2022/spanish-to-quechua-translation:update sdk

pcuenq new activity about 7 hours ago

somosnlp-hackathon-2022/spanish-to-quechua-translation:update gradio

View all activity

pcuenq

in somosnlp-hackathon-2022/spanish-to-quechua-translation about 7 hours ago

not working

#1 opened about 14 hours ago by

andresadmim

update sdk

#3 opened about 7 hours ago by

pcuenq

update gradio

#2 opened about 7 hours ago by

pcuenq

haritzpuerto

posted an update about 1 month ago

Post

310

📜 Accepted at ACL 2025! Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs
We propose to fine-tune LLMs to generate diverse chains of thought (DCoT) in a single inference step. This enables within-inference refinement of the cots, no external feedback needed!
🔗 https://arxiv.org/abs/2407.03181

casvsv

updated a Space 2 months ago

Clasificador Comentarios Suicidas

🔪

Classify suicide risk in comments

osanseviero

authored a paper 3 months ago

Gemma 3 Technical Report

Paper • 2503.19786 • Published Mar 25 • 52

haritzpuerto

posted an update 6 months ago

Post

642

I just got my first ChatGPT review on ARR! 😅 Any advice on how to prove it's AI-generated? Thanks!

3 replies

haritzpuerto

posted an update 6 months ago

Post

1482

I'm excited to announce that my internship paper at Parameter Lab was accepted to Findings of #NAACL2025 🎉
TLDR: Stating an LLM was trained on a sentence might not be possible 😥 , but it is possible for large enough amounts of tokens, such as long documents or multiple documents! 🤯
Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models (2411.00154)
🔗 https://github.com/parameterlab/mia-scaling

haritzpuerto

authored a paper 8 months ago

Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models

Paper • 2411.00154 • Published Oct 31, 2024

rockdrigoma

updated a Space 12 months ago

Spanish to Nahuatl Translation

🌽

haritzpuerto

authored a paper about 1 year ago

Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

Paper • 2407.03181 • Published Jul 3, 2024 • 1

mrm8488

posted an update about 1 year ago

Post

6595

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

osanseviero

updated 4 Spaces about 1 year ago

mrm8488

posted an update about 1 year ago

Post

7457

Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.

6 replies

osanseviero

posted an update about 1 year ago

Post

14145

Diaries of Open Source. Part 15 🤗

🕵️‍♀️Idefics 2 is out, a multimodal open-source model with very nice capabilities
Models, demo, and datasets: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blog: https://hf.co/blog/idefics2

💾Snowflake released snowflake-arctic-embed, a family of powerful small embedding models
Model: Snowflake/snowflake-arctic-embed-m
Blog: https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/

✨Pile-T5, EleutherAI's T5 model trained on 2T tokens
Blog: https://blog.eleuther.ai/pile-t5/
Models: EleutherAI/pile-t5-65a76a0d0022dd270b385a66
GitHub: https://github.com/EleutherAI/improved-t5

🤖CodeQwen1.5-7B base and chat models. Models trained on 3T tokens strong benchmark results for code generation, editing and SQL
Blog post: https://qwenlm.github.io/blog/codeqwen1.5/
Demo: https://hf.co/spaces/Qwen/CodeQwen1.5-7b-Chat-demo
Models: Qwen/CodeQwen1.5-7B and Qwen/CodeQwen1.5-7B-Chat

Misc
🦉 DocOwl1.5: Unified Stucture Learning for OCR-free Document Understanding mPLUG/DocOwl
👀Cerule - a tiny Vision LM model Tensoic/Cerule-v0.1
ChemLLM - a LLM for chemistry and molecule science ⚗️https://hf.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO
Distil Whisper Large
📝New pdf/OCR datasets with 19 samples pixparse/pdf-document-ocr-datasets-660701430b0346f97c4bc628
🔥Gretel AI high quality text-to-sql synthetic dataset gretelai/synthetic_text_to_sql

5 replies

osanseviero

posted an update about 1 year ago

Post

11321

Diaries of Open Source. Part 14 🤗

🔥CohereForAI releases Command R+, an open 104B model with:
- Tool usage capabilities
- Specialized in RAGs
- Multilingual
It's one of the first models to surpass GPT-4 in the lmsys arena, check it out!
Model: https://hf.co/CohereForAI/c4ai-command-r-plus
Official demo: https://hf.co/spaces/CohereForAI/c4ai-command-r-plus
Quantized: https://hf.co/CohereForAI/c4ai-command-r-plus-4bit

🎉Google releases a new version of their Gemma instruct models, with improved quality, nicer to converse, and a fancier RL algorithm. The model is similar to Llama 2 70B in the Chat Arena!
Models: google/gemma-release-65d5efbccdbb8c4202ec078b
Try it out in HuggingChat https://hf.co/chat/models/google/gemma-1.1-7b-it

🪄VoiceCraft, a speech editing and TTS SOTA open model
Paper: VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (2403.16973)
Model: pyp1/VoiceCraft

💻Google released CodeGemma, a family of code generation, completion, and chat models
Blog post: https://hf.co/blog/codegemma
Models: google/codegemma-release-66152ac7b683e2667abdee11
Report: https://storage.googleapis.com/deepmind-media/gemma/codegemma_report.pdf

Misc models:
🦖T-Rex2, a very powerful object detection model for many applications https://github.com/IDEA-Research/T-Rex
👀 CT-RATE : A 3D dataset paired with text reports ibrahimhamamci/CT-RATE
🐙Octopus v2: a Gemma-based model trained for Android API - extremely fast, better than Llama+RAG, great results NexaAIDev/Octopus-v2

2 replies

osanseviero

posted an update over 1 year ago

Post

2353

Diaries of Open Source. Part 13 🤗

🤏Two different bitnet 1.5 open-source replications
Original paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)
1bitllm experiment: https://hf.co/blog/joey00072/experiments-with-bitnet-1-5
NousResearch experiment NousResearch/OLMo-Bitnet-1B

🥳Tiny and large multimodal models great for embeddings
GitHub: https://github.com/unum-cloud/uform
Encoders: https://hf.co/collections/unum-cloud/multimodal-encoders-660553903617c5297eb16838
ONNX weights: https://hf.co/collections/unum-cloud/uform-vl-english-large-onnx-66055a57c182d846f3bc1949

📜 SMPLer-X: Expressive Human Pose and Shape Estimation
Project website: https://caizhongang.com/projects/SMPLer-X/
Demo: caizhongang/SMPLer-X
Paper: SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation (2309.17448)

🧙GeoWizard: 3D Geometry Estimation
Project website: https://fuxiao0719.github.io/projects/geowizard/
Demo: lemonaddie/geowizard

Misc models and datasets
- Dolphin-2.8-mistral-7b-v0.2 cognitivecomputations/dolphin-2.8-mistral-7b-v02
- Hermes-2-Pro-11B, a self-frankenmerge 11B variant mattshumer/Hermes-2-Pro-11B
- Large conversational dataset based on Usenet data in the Italian language mii-community/UsenetArchiveIT-conversations

3 replies

AI & ML interests

Recent Activity

Team members 171

somosnlp-hackathon-2022's activity

not working

update sdk

update gradio

Clasificador Comentarios Suicidas

Spanish to Nahuatl Translation

Audio Sentiment Classifier

Poem Generation Es

Sonnet Poetry Generator Spanish

Clasificador De Tesis