
If you have any inquiries or wish to collaborate, contact us at hpai@bsc.es.

Welcome to the TuRTLe Model Leaderboard! TuRTLe is a unified evaluation framework designed to systematically assess Large Language Models (LLMs) in RTL (Register-Transfer Level) generation for hardware design. Evaluation criteria include syntax correctness, functional accuracy, synthesizability, and post-synthesis quality (PPA: Power, Performance, Area). TuRTLe integrates multiple benchmarks to highlight strengths and weaknesses of available LLMs. Use the filters below to explore different RTL benchmarks, simulators and models.
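To illustrate how these criteria compose into leaderboard scores, here is a minimal Python sketch that aggregates per-problem results into stage-wise pass rates. The record fields (`syntax_ok`, `functional_ok`, `synthesizable`) are hypothetical and not TuRTLe's actual schema; the real framework reports richer metrics, including PPA.

```python
# Minimal sketch of stage-wise pass rates over per-problem results.
# Field names are hypothetical, not TuRTLe's actual schema.
from dataclasses import dataclass

@dataclass
class Result:
    syntax_ok: bool      # compiles under the chosen simulator
    functional_ok: bool  # passes the benchmark testbench
    synthesizable: bool  # survives logic synthesis

def pass_rate(results: list[Result], stage: str) -> float:
    """Fraction of problems that pass the given evaluation stage."""
    return sum(getattr(r, stage) for r in results) / len(results)

results = [
    Result(True, True, True),
    Result(True, True, False),
    Result(True, False, False),
]
print(f"syntax:       {pass_rate(results, 'syntax_ok'):.2f}")      # 1.00
print(f"functional:   {pass_rate(results, 'functional_ok'):.2f}")  # 0.67
print(f"synthesis:    {pass_rate(results, 'synthesizable'):.2f}")  # 0.33
```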

UPDATE (SEPTEMBER 2025): Added gpt-oss-20b and gpt-oss-120b to the leaderboard.

UPDATE (JULY 2025): Our TuRTLe paper was accepted to MLCAD 2025, to be held in September in Santa Cruz, CA. We have also added Verilator as a new simulator alongside Icarus Verilog.

UPDATE (JUNE 2025): We have open-sourced our framework on GitHub and added 7 new recent models, for a total of 40 base and instruct models evaluated across 5 RTL benchmarks!

† Line Completion excludes "reasoning" models since this task targets quick auto-completion.
Additionally, for the Line Completion and Code Completion benchmarks we use the Base model variant (if available), and for Spec-to-RTL we use the Instruct model variant.
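As a purely illustrative sketch of this task-to-variant policy (the task keys and fallback rule are assumptions, not TuRTLe's actual configuration format):

```python
# Illustrative task-to-variant selection; hypothetical names, not
# TuRTLe's actual configuration.
VARIANT_BY_TASK = {
    "line_completion": "base",   # completion tasks prefer Base variants
    "code_completion": "base",
    "spec_to_rtl": "instruct",   # spec-to-RTL prefers Instruct variants
}

def pick_variant(task: str, available_variants: set[str]) -> str:
    """Prefer the task's variant, falling back to whatever the model offers."""
    preferred = VARIANT_BY_TASK[task]
    return preferred if preferred in available_variants else next(iter(available_variants))

# Example: a model released only in an instruct variant.
print(pick_variant("line_completion", {"instruct"}))  # -> "instruct"
```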