We are excited to release the MG-Verilog dataset for LLM-assisted Verilog code generation, as presented in our paper, "MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation," which won the Best Paper Award at the First IEEE International Workshop on LLM-Aided Design (LAD'24)! Please feel free to try our ready-to-use dataset on Hugging Face.
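To get started quickly, here is a minimal loading sketch using the Hugging Face datasets library; the repository id, split name, and field layout shown are illustrative assumptions, so please check the dataset card for the exact identifiers:

    # Minimal loading sketch; the repository id, split, and field names are assumptions.
    from datasets import load_dataset

    ds = load_dataset("GaTech-EIC/MG-Verilog", split="train")

    # Inspect one sample to see the Verilog code and its multi-grained descriptions.
    sample = ds[0]
    for field, value in sample.items():
        print(field, "->", str(value)[:120])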


Motivation
The proposed MG-Verilog dataset aims to alleviate the scarcity of high-quality hardware datasets, which currently hinders the development of LLM-assisted hardware design.
Key Features
Multi-grained descriptions for Verilog code samples: The MG-Verilog dataset pairs each Verilog code sample with natural language descriptions at varying levels of granularity, from high-level summaries down to detailed, block-level explanations. Inspired by the human learning process, this balance of coarse and fine detail aims to teach LLMs Verilog more effectively.
An automated dataset generation flow: To enable scalable, low-cost labeling of arbitrary Verilog code samples, we designed an automated dataset generation flow. It lets users from various backgrounds produce their own MG-Verilog-style multi-grained datasets from their own data (see the illustrative sketch after this feature list).
Better fine-tuned models with MG-Verilog: LLMs fine-tuned on the MG-Verilog dataset consistently generate better Verilog code than those fine-tuned on baseline datasets, especially when handling instructions of varying granularity. This suggests that MG-Verilog can further improve the user-friendliness of LLM-assisted Verilog code generation.
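As a rough illustration of the automated flow mentioned above (not the exact pipeline from the paper), the sketch below shows how a single Verilog sample could be labeled at several granularities; query_llm is a placeholder for whatever LLM API you use, and the prompt wording is only an example:

    # Illustrative sketch of multi-grained labeling; not the exact MG-Verilog pipeline.
    # query_llm is a placeholder for whatever LLM API you use.
    from typing import Callable, Dict

    PROMPTS: Dict[str, str] = {
        "high_level_summary": "Summarize what this Verilog module does in one sentence:\n{code}",
        "detailed_summary": "Describe this Verilog module in detail, including its ports and behavior:\n{code}",
        "block_summary": "Explain what each always/assign block in this Verilog code does:\n{code}",
    }

    def label_sample(code: str, query_llm: Callable[[str], str]) -> Dict[str, str]:
        # Ask the LLM to describe the same code at each granularity level.
        return {level: query_llm(prompt.format(code=code)) for level, prompt in PROMPTS.items()}

    if __name__ == "__main__":
        verilog = (
            "module counter(input clk, input rst, output reg [3:0] q);\n"
            "  always @(posedge clk) q <= rst ? 4'd0 : q + 4'd1;\n"
            "endmodule"
        )
        # Plug in your own LLM call; an echo stub keeps the sketch runnable.
        print(label_sample(verilog, query_llm=lambda p: "<description from your LLM>"))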

We have also prepared a short demonstration of our framework.
For more technical details, please check out:
Our paper: https://arxiv.org/abs/2407.01910
GitHub repo: https://github.com/GATECH-EIC/mg-verilog