
If you have any inquiries or wish to collaborate, contact us at hpai@bsc.es.

Welcome to the TuRTLe Model Leaderboard! TuRTLe is a unified evaluation framework designed to systematically assess Large Language Models (LLMs) in RTL (Register-Transfer Level) generation for hardware design. Evaluation criteria include syntax correctness, functional accuracy, synthesizability, and post-synthesis quality (PPA: Power, Performance, Area). TuRTLe integrates multiple benchmarks to highlight strengths and weaknesses of available LLMs. Use the filters below to explore different RTL benchmarks, simulators and models.
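To illustrate how these criteria compose into leaderboard scores, here is a minimal Python sketch that aggregates per-problem results into stage-wise pass rates. The record fields (`syntax_ok`, `functional_ok`, `synthesizable`) are hypothetical and not TuRTLe's actual schema; the real framework reports richer metrics, including PPA.

```python
# Minimal sketch of stage-wise pass rates over per-problem results.
# Field names are hypothetical, not TuRTLe's actual schema.
from dataclasses import dataclass

@dataclass
class Result:
    syntax_ok: bool      # compiles under the chosen simulator
    functional_ok: bool  # passes the benchmark testbench
    synthesizable: bool  # survives logic synthesis

def pass_rate(results: list[Result], stage: str) -> float:
    """Fraction of problems that pass the given evaluation stage."""
    return sum(getattr(r, stage) for r in results) / len(results)

results = [
    Result(True, True, True),
    Result(True, True, False),
    Result(True, False, False),
]
print(f"syntax:       {pass_rate(results, 'syntax_ok'):.2f}")      # 1.00
print(f"functional:   {pass_rate(results, 'functional_ok'):.2f}")  # 0.67
print(f"synthesis:    {pass_rate(results, 'synthesizable'):.2f}")  # 0.33
```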

UPDATE (SEPTEMBER 2025): Added gpt-oss-20b and gpt-oss-120b to the leaderboard.

UPDATE (JULY 2025): Our TuRTLe paper was accepted to MLCAD 2025, to be held in September in Santa Cruz, CA. We have also added Verilator as a new simulator alongside Icarus Verilog.

UPDATE (JUNE 2025): We have open-sourced our framework on GitHub and added 7 new recent models, for a total of 40 base and instruct models evaluated across 5 RTL benchmarks!

† Line Completion excludes "reasoning" models since this task targets quick auto-completion.
Additionally, for the Line Completion and Code Completion benchmarks we use the Base model variant (if available), and for Spec-to-RTL we use the Instruct model variant.
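As a purely illustrative sketch of this task-to-variant policy (the task keys and fallback rule are assumptions, not TuRTLe's actual configuration format):

```python
# Illustrative task-to-variant selection; hypothetical names, not
# TuRTLe's actual configuration.
VARIANT_BY_TASK = {
    "line_completion": "base",   # completion tasks prefer Base variants
    "code_completion": "base",
    "spec_to_rtl": "instruct",   # spec-to-RTL prefers Instruct variants
}

def pick_variant(task: str, available_variants: set[str]) -> str:
    """Prefer the task's variant, falling back to whatever the model offers."""
    preferred = VARIANT_BY_TASK[task]
    return preferred if preferred in available_variants else next(iter(available_variants))

# Example: a model released only in an instruct variant.
print(pick_variant("line_completion", {"instruct"}))  # -> "instruct"
```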