SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use Paper • 2505.17332 • Published May 22, 2025
MVTamperBench: Evaluating Robustness of Vision-Language Models Paper • 2412.19794 • Published Dec 27, 2024