Activity Feed

AI & ML interests

None defined yet.

Recent Activity

MikeDoes 
posted an update about 20 hours ago
view post
Post
133
When anonymizing data for LLMs, is replacing a name with XXXXX enough?

A great post by Franklin Cardenoso Fernandez argues that we can do better. While simple masking hides data, it often destroys the context that models need to perform well.

A more robust method is contextual anonymization, where PII is replaced with meaningful labels like [NAME] or [ADDRESS]. This protects privacy while preserving the data's structural integrity.

We were pleased to see our Ai4Privacy pii-masking-200k dataset featured in the article as a prime example of this best practice. Our dataset is designed to help developers implement this superior form of anonymization by providing tens of thousands of clear, labeled examples.

By enabling models to be trained on data that is both private and context-rich, we can build AI that is both smarter and safer. This is a core part of our mission.

What's your team's preferred method for data anonymization? Let's discuss best practices.

🔗 Read Franklin's full analysis here: https://www.holisticai.com/blog/managing-personal-data-in-large-language-models

#DataPrivacy #Anonymization #ResponsibleAI #LLM #MachineLearning #AIEthics #Ai4Privacy #World's largest open privacy masking dataset
MikeDoes 
posted an update 5 days ago
view post
Post
1952
🛡️ At Ai4Privacy, our goal is to empower researchers to build a safer AI ecosystem. Today, we're highlighting crucial research that does just that by exposing a new vulnerability.

The paper "Forget to Flourish" details a new model poisoning technique. It's a reminder that as we fine-tune LLMs, our anonymization and privacy strategies must evolve to counter increasingly sophisticated threats.

We're proud that the Ai4Privacy dataset was instrumental in this study. It served two key purposes:

Provided a Realistic Testbed: It gave the researchers access to a diverse set of synthetic and realistic PII samples in a safe, controlled environment.

Enabled Impactful Benchmarking: It allowed them to measure the actual effectiveness of their data extraction attack, proving it could compromise specific, high-value information.

This work reinforces our belief that progress in AI security is a community effort. By providing robust tools for benchmarking, we can collectively identify weaknesses and build stronger, more resilient systems. A huge congratulations to the authors on this important contribution.

🔗 Read the full paper: https://arxiv.org/html/2408.17354v1

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #World's largest open privacy masking dataset
Tonic 
posted an update 12 days ago
view post
Post
656
👋 Hey there folks,

just submitted my plugin idea to the G-Assist Plugin Hackathon by @nvidia . Check it out, it's a great way to use a local SLA model on a windows machine to easily and locally get things done ! https://github.com/NVIDIA/G-Assist
Tonic 
posted an update 14 days ago
view post
Post
514
🙋🏻‍♂️ Hey there folks ,

Yesterday , Nvidia released a reasoning model that beats o3 on science, math and coding !

Today you can try it out here : Tonic/Nvidia-OpenReasoning

hope you like it !
MikeDoes 
posted an update 15 days ago
view post
Post
1128
In data privacy, 92% accuracy is not an A-grade. Privacy AI needs to be better.

That's the stark takeaway from a recent benchmark by Diego Mouriño

(Making Science), who put today's top PII detection methods to the test on call center transcripts using the Ai4Privacy dataset.

They pitted cutting-edge LLMs (like GPT-4 & Gemini) against traditional systems (like Cloud DLPs). The results show that our trust in these tools might be misplaced.



📊 The Hard Numbers:



Even top-tier LLMs peaked at a reported 92% accuracy, leaving a potential dangerous 8% gap where your customer's data can leak. They particularly struggled with basics like 'last names' and 'street addresses'.



The old guard? Traditional rule-based systems reportedly achieved a shocking 50% accuracy. A coin toss with your customers' privacy.


This tells us that for privacy tasks, off-the-shelf accuracy is a vanity metric. The real metric is the cost of a single failure—one leaked name, one exposed address.



While no tool is perfect, some are better than others. Diego’s full analysis breaks down which models offer the best cost-to-accuracy balance in this flawed landscape. It's a must-read for anyone serious about building trustworthy AI.

#DataPrivacy #AI #LLM #RiskManagement #MetricsThatMatter #InfoSec

Find the full post here:
https://www.makingscience.com/blog/protecting-customer-privacy-how-to-remove-pii-from-call-center-transcripts/

Dataset:
ai4privacy/pii-masking-400k
Tonic 
posted an update 20 days ago
view post
Post
3261
🙋🏻‍♂️ Normalize adding compute & runtime traces to your model cards
  • 2 replies
·
Tonic 
posted an update 26 days ago
view post
Post
484
Who's going to Raise Summit in Paris Tomorrow ?

If you're around , I would love to meet you :-)
MikeDoes 
posted an update about 2 months ago
view post
Post
2707
Started aistatuscodes as a new project to create codes to understand AI performance better.

Going to be posting daily here and on instagram until we get to 100m downloads :)
https://www.instagram.com/MikeDoesDo/

Follow along the journey!
Tonic 
posted an update about 2 months ago
view post
Post
681
🙋🏻‍♂️ hey there folks ,

So every bio/med/chem meeting i go to i always the same questions "why are you sharing a gdrive link with me for this?" and "Do you have any plans to publish your model weights and datasets on huggingface?" and finally i got a good answer today which explains everything :

basically there is some kind of government censorship on this (usa, but i'm sure others too) and they are told they are not allowed as it is considered a "dataleak" which is illegal !!!!

this is terrible ! but the good news is that we can do something about it !

so there is this "call for opinions and comments" here from the NIH (usa) , and here we can make our opinion on this topic known : https://osp.od.nih.gov/comment-form-responsibly-developing-and-sharing-generative-artificial-intelligence-tools-using-nih-controlled-access-data/

kindly consider dropping your opinion and thoughts about this censorship of science , and share this post , link or thoughts widely .

Together maybe we can start to share data and model weights appropriately and openly in a good way 🙏🏻🚀

cc. @cyrilzakka

Tonic 
posted an update 2 months ago
view post
Post
2534
🙋🏻‍♂️ Hey there folks ,

Yesterday the world's first "Learn to Vibe Code" application was released .

As vibe coding is the mainstream paradigm , so now the first educational app is there to support it .

You can try it out already :

https://vibe.takara.ai

and of course it's entirely open source, so i already made my issue and feature branch :-) 🚀
MikeDoes 
posted an update 3 months ago
view post
Post
1545
PII-Masking-1M Final Day (7/7)! 🚀 Today, we unveil 5 NEW Enterprise PII (E-PII) Dataset PREVIEWS!

Standard PII tools often miss sensitive *business* data. That's why we built E-PII previews for the data that powers your operations and compliance needs.

Get a first look (representing 100,000 samples each!) into datasets designed for real-world enterprise security across these categories:

🏥 **PHI Preview**: For Healthcare Data
💳 **PFI Preview:** For Financial Data
🏢 **PWI Preview:** For Workplace Data
💻 **PDI Preview:** For Digital Activity Data
📍 **PLI Preview:** For Location Data


That wraps up our #PIIMasking1M 7 days announcement! HUGE thanks for following along and for your engagement.
Explore ALL our releases, including these E-PII previews, in the Ai4Privacy Hugging Face Collection & show some love ❤️ if you find them useful!
🔗 Visit the Collection:https://huggingface.co/ai4privacy

Let's keep building safer AI, together!
MrDragonFox 
posted an update 3 months ago
view post
Post
3415
as a few of you know - i am working on a rather more elaborate-tts that can produce more interesting sounds in context of rp

early sneak peak is here -

MrDragonFox/mOrpheus_3B-1Base_early_preview-v1-25000

its based on orpheus - but really the model is irrelevant as i focus mostly on data augmentation / prep / pipelineing - its just the way to show progress

should be able to express fine even in a sfw context

probably the last release for a few weeks as i go back to the data pipeline and improve there ..

in the mean time, please do test and report problems or enjoyable generations you found - we have a growing discord community and i love to see what you get out of that early release !

(small colab is provided on the model page if you dont have the gpu to run that your self)
MrDragonFox 
posted an update 4 months ago
view post
Post
4504
yet a other audio datasets pre classified for events + audio aestetics

this time for german - 680h sampled from emilia yodas

timestamps for asr training or other fancier things available as nc in the raw repo

MrDragonFox/DE_Emilia_Yodas_680h

cc by 4.0 as by emilia yodas

raw events / transcriptions are cc by NC 4.0

MrDragonFox/DE_Emilia_Yodas_680h_raw_timestamps

the coming days i should push about 600h english + some japanese too same format
MikeDoes 
posted an update 4 months ago
MrDragonFox 
posted an update 4 months ago
view post
Post
2135
did a small emotive classified test dataset for all the tts tuners out there

MrDragonFox/Elise

3h total mit - single speaker voice

dataset is a copy of an existing one just added the emotional tags over 1200 samples - should be good enough to test if emotional tags stick in your finetune
  • 1 reply
·
MikeDoes 
posted an update 4 months ago
view post
Post
2792
🚀 We are quite excited to announce the Ai4Privacy Python library! 🎉

pip install ai4privacy to anonymize short english text with OpenPII Masking 500k labels

📊 Day 5/7 of PII Masking 1M announcements complete! ⏰