CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
Abstract
C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust as well as test cases that can be used to validate the correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that generating safe, idiomatic Rust remains a challenging problem for a variety of methods and techniques. We also provide insights into the errors LLMs typically make when transpiling code from C to safe Rust. The best-performing model, OpenAI o1, solves only 15 of the 100 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to transpilation systems that can reason about complex scenarios and help migrate legacy codebases from C into languages like Rust that ensure memory safety. The dataset and code are available at https://github.com/anirudhkhatry/CRUST-bench.
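To make the task concrete, here is a minimal, hypothetical illustration (not drawn from the dataset) of the kind of rewrite the benchmark targets: a C routine that walks a raw pointer/length pair is re-expressed as safe, idiomatic Rust over a slice.

```rust
// Hypothetical illustration, not taken from CRUST-Bench.
// The original C might be:
//
//     int sum(const int *xs, size_t n) {
//         int total = 0;
//         for (size_t i = 0; i < n; i++) total += xs[i];
//         return total;
//     }
//
// A safe Rust counterpart replaces the pointer/length pair with a slice,
// so bounds are enforced by construction and no unsafe code is needed.
pub fn sum(xs: &[i32]) -> i32 {
    xs.iter().sum()
}
```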
Community
🚀 Introducing CRUST-Bench, a dataset for C-to-Rust transpilation of full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
Transpiling C to Rust helps modernize legacy code with memory safety guarantees. CRUST-Bench evaluates whether transpilation methods yield safe, idiomatic Rust, using handcrafted interfaces and tests to ensure safety and validate correctness.
Our benchmark is the first to provide:
- Rust tests.
- Rust interfaces, which are necessary for the transpiled code to work with the tests (a sketch of such an interface-and-test pair follows below).
- A sizable number of real-scale transpilation problems.
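A minimal sketch of what such an interface-and-test pairing could look like (the names, file layout, and stubbing style here are our assumptions, not taken from the dataset): the interface fixes safe, idiomatic signatures, and the tests check behavior against the original C semantics.

```rust
// Hypothetical sketch; names and stubbing style are assumptions.
// The provided interface pins down safe signatures; the transpiler must
// supply the bodies, and the provided tests must pass.
pub struct Stack {
    items: Vec<i32>,
}

impl Stack {
    pub fn new() -> Self {
        todo!("transpiled implementation goes here")
    }
    pub fn push(&mut self, value: i32) {
        todo!("transpiled implementation goes here")
    }
    pub fn pop(&mut self) -> Option<i32> {
        todo!("transpiled implementation goes here")
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // Validates functional parity with the original C stack.
    #[test]
    fn push_then_pop_is_lifo() {
        let mut s = Stack::new();
        s.push(1);
        s.push(2);
        assert_eq!(s.pop(), Some(2));
        assert_eq!(s.pop(), Some(1));
        assert_eq!(s.pop(), None);
    }
}
```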
We evaluate state-of-the-art closed-source LLMs (like o1, Claude-3.7, and Gemini-1.5-Pro), open-source models (like QwQ-32B and Virtuoso-32B), and SWE-agent on CRUST-Bench.
Even the best model—OpenAI's o1—passes only 15/100 tasks in a single-shot setting.
Language models often fail to:
- Respect ownership rules
- Infer type information
- Follow idiomatic Rust interfaces
- Preserve correct lifetimes
In the paper, we provide a taxonomy of common LLM mistakes.
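As one illustration (our own hedged sketch, not an example from the paper's taxonomy), a literal translation of a C idiom that keeps a pointer into a buffer while the buffer is mutated violates Rust's ownership rules; the safe version must restructure the code rather than mirror the C.

```rust
// Hypothetical sketch of a common ownership failure (not from the paper).
//
// A literal C-to-Rust translation like this is rejected (error E0502):
//
//     fn keep_and_grow(items: &mut Vec<String>) -> &String {
//         let first = &items[0];            // immutable borrow of `items`
//         items.push(String::from("new"));  // mutable borrow while `first` is live
//         first
//     }
//
// A safe rewrite takes ownership of the needed value before mutating,
// so no borrow outlives the mutation.
fn keep_and_grow(items: &mut Vec<String>) -> String {
    let first = items[0].clone();
    items.push(String::from("new"));
    first
}
```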
Please read the full paper for more details.
This is an automated message from Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- LLM-Driven Multi-step Translation from C to Rust using Static Analysis (2025)
- RustMap: Towards Project-Scale C-to-Rust Migration via Program Analysis and LLM (2025)
- LLMigrate: Transforming "Lazy" Large Language Models into Efficient Source Code Migrators (2025)
- An Empirical Study of Rust-Specific Bugs in the rustc Compiler (2025)
- EnvBench: A Benchmark for Automated Environment Setup (2025)
- Automated Benchmark Generation for Repository-Level Coding Tasks (2025)
- VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation (2025)