
For inquiries or collaboration proposals, contact us at hpai@bsc.es.

Welcome to the TuRTLe Model Leaderboard! TuRTLe is a unified evaluation framework designed to systematically assess Large Language Models (LLMs) on RTL (Register-Transfer Level) generation for hardware design. Evaluation criteria include syntax correctness, functional accuracy, synthesizability, and post-synthesis quality (PPA: Power, Performance, Area). TuRTLe integrates multiple benchmarks to highlight the strengths and weaknesses of available LLMs. Use the filters below to explore different RTL benchmarks, simulators, and models.
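
As a rough illustration of how these four criteria relate, here is a minimal Python sketch that aggregates per-sample results into leaderboard-style scores, assuming the first three columns are pass rates in percent. The `ProblemResult` schema and `aggregate` helper are hypothetical, for illustration only, and are not TuRTLe's actual data model or API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProblemResult:
    """Outcome of one generated RTL sample (hypothetical schema)."""
    syntax_ok: bool    # parses/compiles without errors
    func_ok: bool      # passes the benchmark testbench
    synth_ok: bool     # synthesizes to a netlist
    ppa_score: float   # post-synthesis quality score in [0, 100]

def aggregate(results: List[ProblemResult]) -> dict:
    """Turn per-sample outcomes into leaderboard-style scores."""
    n = len(results)
    synthesized = [r for r in results if r.synth_ok]
    return {
        "Syntax": 100 * sum(r.syntax_ok for r in results) / n,
        "Functionality": 100 * sum(r.func_ok for r in results) / n,
        "Synthesis": 100 * sum(r.synth_ok for r in results) / n,
        # PPA is only meaningful for designs that actually synthesized
        "Post-Synthesis": (sum(r.ppa_score for r in synthesized) / len(synthesized)
                           if synthesized else 0.0),
    }

if __name__ == "__main__":
    demo = [
        ProblemResult(True, True, True, 80.0),   # fully correct sample
        ProblemResult(True, False, False, 0.0),  # compiles, fails tests
        ProblemResult(False, False, False, 0.0), # syntax error
    ]
    print(aggregate(demo))
```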

UPDATE (JULY 2025): Our TuRTLe paper has been accepted to MLCAD 2025, which will be held this September in Santa Cruz, California!

UPDATE (JULY 2025): Verilator has been added as an additional simulator alongside Icarus Verilog. You can now filter and compare results by simulator.

UPDATE (JUNE 2025): Our framework is now open-source on GitHub, and we have added 7 new recent models, for a total of 40 base and instruct models and 5 RTL benchmarks!

| Type | Model | Parameters (B) | Syntax | Functionality | Synthesis | Post-Synthesis |
|------|-------|----------------|--------|---------------|-----------|----------------|
| 🟢 | | 32.8 | 93.82 | 77.67 | 77.37 | 76.79 |