
Sarthak Malhotra

zarmalhotra

AI & ML interests

None yet

Recent Activity

upvoted an article 9 days ago

DABStep: Data Agent Benchmark for Multi-step Reasoning

new activity about 1 month ago

UCSC-VLAA/MedReason:3rd prize in reasoning dataset competition. Congratulations

new activity about 2 months ago

rekrek/subset_arcee-10k:Congratulations on winning Curator Spotlight in Reasoning Dataset Competition


upvoted an article 9 days ago

Article

DABStep: Data Agent Benchmark for Multi-step Reasoning

and 5 others • Feb 4 • 93

New activity in UCSC-VLAA/MedReason about 1 month ago

3rd prize in reasoning dataset competition. Congratulations

#3 opened about 1 month ago by zarmalhotra

New activity in rekrek/subset_arcee-10k about 2 months ago

Congratulations on winning Curator Spotlight in Reasoning Dataset Competition

#2 opened about 2 months ago by zarmalhotra

updated a Space about 2 months ago

README

🌍

liked 14 datasets about 2 months ago

liked a dataset 2 months ago

patrickfleith/instruction-freak-reasoning

Viewer • Updated May 22 • 179 • 63 • 4

reacted to ZennyKenny's post with 🔥 2 months ago

Post

3369

When I heard the Reasoning Dataset Competition deadline was extended to 9 May, I knew I had time to get in one more entry. 🔥🔥🔥

With the rise of Vibe Coding, and the potential risks that are introduced by humans letting LLMs build their apps for them, lots of people are (rightfully) concerned about the safety of the code that is hitting prod.

In response to that, I'm happy to present my final submission to the Reasoning Dataset Competition, an attempt to start benchmarking the ability of LLMs to identify unsafe and/or exploitable code by way of the CoSa (Code Safety) benchmark: ZennyKenny/cosa-benchmark-dataset

It is currently a curated set of 200 examples, calibrated on OpenAI's standard-issue models (GPT-4.1, o4 mini, and GPT-3.5 Turbo) as "baseline performance" (70% decile). Check it out and drop a ❤️ if you think it could be useful, or hit the Community section with suggestions / critiques.

3 replies
