🚀 We are excited to release the 𝗠𝗚-𝗩𝗲𝗿𝗶𝗹𝗼𝗴 dataset for LLM-assisted Verilog code generation, as presented in our paper, “𝗠𝗚-𝗩𝗲𝗿𝗶𝗹𝗼𝗴: 𝗠𝘂𝗹𝘁𝗶-𝗴𝗿𝗮𝗶𝗻𝗲𝗱 𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗧𝗼𝘄𝗮𝗿𝗱𝘀 𝗘𝗻𝗵𝗮𝗻𝗰𝗲𝗱 𝗟𝗟𝗠-𝗮𝘀𝘀𝗶𝘀𝘁𝗲𝗱 𝗩𝗲𝗿𝗶𝗹𝗼𝗴 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻,” which won 𝘁𝗵𝗲 𝗕𝗲𝘀𝘁 𝗣𝗮𝗽𝗲𝗿 𝗔𝘄𝗮𝗿𝗱 at the First IEEE International Workshop on LLM-Aided Design (LAD’24)! Please feel free to try our ready-to-use dataset at HuggingFace.
Motivation
The proposed MG-Verilog dataset aims to alleviate the scarcity of high-quality hardware datasets, which currently hinders the development of LLM-assisted hardware design.
Key Features
- ✨ 𝗠𝘂𝗹𝘁𝗶-𝗴𝗿𝗮𝗶𝗻𝗲𝗱 𝗱𝗲𝘀𝗰𝗿𝗶𝗽𝘁𝗶𝗼𝗻𝘀 𝗳𝗼𝗿 𝗩𝗲𝗿𝗶𝗹𝗼𝗴 𝗰𝗼𝗱𝗲 𝘀𝗮𝗺𝗽𝗹𝗲𝘀: The MG-Verilog dataset pairs each Verilog code sample with natural language descriptions at several levels of granularity, from coarse to fine. Inspired by the human learning process, MG-Verilog uses this mix of coarse and fine descriptions to teach LLMs more effectively.
- 🔧 𝗔𝗻 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝗱𝗮𝘁𝗮𝘀𝗲𝘁 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 𝗳𝗹𝗼𝘄: To enable scalable, low-cost labeling of arbitrary Verilog code samples, we designed an automated dataset generation flow. Users from various backgrounds can apply it to their own data to produce multi-grained datasets similar to MG-Verilog.
- 📈 𝗕𝗲𝘁𝘁𝗲𝗿 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗲𝗱 𝗺𝗼𝗱𝗲𝗹𝘀 𝘄𝗶𝘁𝗵 𝗠𝗚-𝗩𝗲𝗿𝗶𝗹𝗼𝗴: LLMs fine-tuned on the MG-Verilog dataset consistently generate better Verilog code than those fine-tuned on baseline datasets, especially when the instructions vary in granularity. This suggests that MG-Verilog can make LLM-assisted Verilog code generation more user-friendly.
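To make the multi-grained idea concrete, here is a minimal Python sketch of one code sample carrying descriptions at two levels of detail, expanded into instruction-code pairs for fine-tuning. Everything in it is an assumption for illustration only: the `MultiGrainedSample` class, the `coarse` and `fine` labels, the `to_training_pairs` helper, and the half-adder example are hypothetical and do not reflect the actual MG-Verilog schema or generation flow.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MultiGrainedSample:
    """One Verilog module with descriptions at several granularities.

    Hypothetical schema for illustration; not the real MG-Verilog format.
    """
    code: str
    descriptions: Dict[str, str]  # granularity label -> natural language description


def to_training_pairs(sample: MultiGrainedSample) -> List[Tuple[str, str]]:
    """Emit one (instruction, code) pair per granularity level."""
    return [(text, sample.code) for text in sample.descriptions.values()]


half_adder = MultiGrainedSample(
    code=(
        "module half_adder(input a, input b, output sum, output carry);\n"
        "  assign sum = a ^ b;\n"
        "  assign carry = a & b;\n"
        "endmodule\n"
    ),
    descriptions={
        # Hypothetical granularity labels.
        "coarse": "Implement a half adder.",
        "fine": (
            "Create a module half_adder with 1-bit inputs a and b. "
            "Drive output sum with a XOR b and output carry with a AND b."
        ),
    },
)

# Every description maps to the same target code, so a model sees both
# terse and detailed instructions for one design.
pairs = to_training_pairs(half_adder)
for instruction, _ in pairs:
    print(instruction)
```

In this sketch the same Verilog module appears once per description, which is one simple way a fine-tuned model could learn to handle both brief and detailed user instructions.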
For more technical details, please check out:
𝗢𝘂𝗿 𝗽𝗮𝗽𝗲𝗿: https://arxiv.org/abs/2407.01910
𝗚𝗶𝘁𝗛𝘂𝗯 𝗿𝗲𝗽𝗼: https://github.com/GATECH-EIC/mg-verilog