Processor functional verification plays a crucial role in ensuring the quality of processor designs. Traditional techniques such as Constrained Random Verification (CRV) struggle to achieve high functional coverage because of the vast instruction space of processors. While LLM-based techniques show potential, merely instructing LLMs has notable limitations, especially for functional points that require deep semantic understanding. To tackle these challenges, we propose a novel technique, FLAME, which leverages Retrieval-Augmented Generation (RAG), Chain-of-Thought (CoT), and a functional-coverage-guided feedback mechanism. FLAME establishes semantic mappings between functional points and instructions, enabling the iterative generation of valid and effective test cases. Evaluation on four widely-used open-source processor designs shows that FLAME outperforms typical and state-of-the-art baselines, improving functional coverage by an average of 34.25% to 220% while reducing the time required to reach the same functional coverage by up to 86.13%. Moreover, ablation analysis highlights the vital role of each component in the framework's overall effectiveness. This work demonstrates the superiority of our LLM-based technique FLAME in enhancing processor functional verification.
The above figure shows an overview of our novel technique FLAME for processor functional verification, which is dedicated to covering more functional points automatically. FLAME is divided into three parts: knowledge base construction, LLM-assisted test generation, and functional-coverage-guided feedback. FLAME begins by collecting extensive processor-design-related information to build a comprehensive knowledge base that provides essential background information. Then, FLAME uses RAG to retrieve information related to the target functional points from this knowledge base and generates high-quality test cases based on our devised Documents—Instructions—Programs CoT. Finally, a functional-coverage-guided feedback mechanism provides the previously generated test cases and their coverage results to the LLM as a reference for the next iteration of generation. Note that, because of the cost of LLMs, our technique is not applied to functional points that are easily addressed in practice, specifically those that widely-used CRV methods already cover efficiently, following an existing study. In other words, FLAME focuses on the functional points that form bottlenecks, in order to remain cost-effective.
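The three parts described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not FLAME's implementation: the word-overlap `retrieve` function stands in for a real RAG retriever, and `generate` and `measure` are hypothetical callables that stand in for the LLM call and the coverage-collecting simulation, respectively.

```python
from typing import Callable, Dict, List, Set


def retrieve(knowledge_base: Dict[str, str], query: str, k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (a stand-in for RAG retrieval)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), name)
        for name, text in knowledge_base.items()
    ]
    # Highest overlap first; ties are broken by document name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [knowledge_base[name] for score, name in scored[:k] if score > 0]


def feedback_loop(
    targets: Set[str],
    knowledge_base: Dict[str, str],
    generate: Callable[[str, List[str], List[str]], str],  # hypothetical LLM call
    measure: Callable[[str], Set[str]],  # hypothetical simulation + coverage report
    max_rounds: int = 3,
) -> Set[str]:
    """Iterate until every target functional point is covered or rounds run out."""
    covered: Set[str] = set()
    history: List[str] = []  # earlier test cases together with what they covered
    for _ in range(max_rounds):
        remaining = sorted(targets - covered)
        if not remaining:
            break
        for point in remaining:
            context = retrieve(knowledge_base, point)
            test_case = generate(point, context, history)
            hit = measure(test_case) & targets
            covered |= hit
            # The history is what the next generation call sees as feedback.
            history.append(f"{test_case} -> covered {sorted(hit)}")
    return covered
```

Passing the growing `history` into `generate` is the only feedback channel in this sketch; a real setup would also need to bound its length to fit the model's context window.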
The corresponding figure shows the prompt template used in the Interpretation stage. We first define the LLM's role (a professional processor verification engineer for a specific ISA) and specify the conditions (the association between functional points and ISA instructions) in the Background portion. Next, we incorporate both the target functional points and the retrieved information (processor design documentation and functional point definitions) into the prompt's Context portion. Finally, in the Instruction portion, we instruct the LLM to explain the documentation and definitions of the target functional points, with particular attention to the related input instructions. The LLM returns enriched functional point information, including the corresponding processor functionalities and behaviors as well as the associated instruction types.
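As an illustration of the Background, Context, and Instruction layout described above, the following sketch assembles such a prompt as a plain string. The wording of each part, the bracketed section labels, the function name, and the example inputs (including the ISA name and the functional point `fp_div_by_zero`) are all assumptions made for this example; the actual template is the one given in the figure.

```python
from typing import List


def build_interpretation_prompt(
    isa: str,
    functional_points: List[str],
    design_docs: List[str],
    definitions: List[str],
) -> str:
    """Assemble a three-part prompt: Background, Context, Instruction."""
    background = (
        f"You are a professional processor verification engineer for the {isa} ISA. "
        "Every functional point is associated with ISA instructions."
    )
    context = "\n".join(
        ["Target functional points:"]
        + [f"- {point}" for point in functional_points]
        + ["Retrieved processor design documentation:"]
        + [f"- {doc}" for doc in design_docs]
        + ["Retrieved functional point definitions:"]
        + [f"- {definition}" for definition in definitions]
    )
    instruction = (
        "Explain the documentation and definition of each target functional point, "
        "especially the input instructions related to it."
    )
    return (
        f"[Background]\n{background}\n\n"
        f"[Context]\n{context}\n\n"
        f"[Instruction]\n{instruction}"
    )


# Hypothetical inputs, for illustration only.
prompt = build_interpretation_prompt(
    isa="RISC-V",
    functional_points=["fp_div_by_zero"],
    design_docs=["Placeholder excerpt describing the divider's zero-divisor behavior."],
    definitions=["fp_div_by_zero: a division executes with a zero divisor."],
)
```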
The corresponding figure shows the prompt template used in the Analyzation stage. As in the Interpretation stage, we first define the LLM's role and background information. Next, we incorporate both the target functional points and the related details into the prompt's Context portion, which includes the detailed functional point interpretation (LLM's Response From Interpretation Stage) and the Retrieved ISA Instruction Specification. Finally, in the Instruction portion, we direct the LLM to analyze the required instructions and their usages. The LLM then generates enriched instruction details, including their usages and parameters, which are directly related to the target functional points and overall processor functionalities.
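One simple way to picture how the Context portion of this stage could be filled is to look up, in an ISA specification table, the instructions that the Interpretation response mentions. The sketch below does this with plain word matching; this lookup strategy, the function names, and the tiny specification excerpt are assumptions for illustration, and the source does not state how the ISA specification is actually retrieved.

```python
import re
from typing import Dict, List


def find_mnemonics(text: str, isa_spec: Dict[str, str]) -> List[str]:
    """Return spec mnemonics mentioned in the text, in order of first mention."""
    found: List[str] = []
    for word in re.findall(r"[A-Za-z][A-Za-z0-9.]*", text):
        # Lower-case the word and drop a trailing sentence period before lookup.
        key = word.lower().rstrip(".")
        if key in isa_spec and key not in found:
            found.append(key)
    return found


def build_analyzation_context(interpretation: str, isa_spec: Dict[str, str]) -> str:
    """Combine the Interpretation response with the matching spec entries."""
    entries = [
        f"- {mnemonic}: {isa_spec[mnemonic]}"
        for mnemonic in find_mnemonics(interpretation, isa_spec)
    ]
    return "\n".join(
        [
            "LLM's Response From Interpretation Stage:",
            interpretation,
            "Retrieved ISA Instruction Specification:",
        ]
        + entries
    )


# Hypothetical spec excerpt and Interpretation response, for illustration only.
spec = {"div": "signed division", "rem": "signed remainder", "add": "integer addition"}
response = "This point is exercised by DIV and REM. A zero divisor is needed for div."
context = build_analyzation_context(response, spec)
```

Plain word matching is fragile: if the table contained a mnemonic such as `and`, the English word "and" would match it, which is one reason a semantic retriever is a better fit in practice.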
The corresponding figure shows the prompt template used in the Generation stage. We first specify the conditions (the relations between the C program and instructions) in the Background section. Next, we incorporate the instruction information and the counterexamples (presented in Section III.C) into the prompt's Context section. Finally, we instruct the LLM to generate the desired program by combining different syntactic features of the programming language. The LLM returns a high-quality C program as the test case, which corresponds to the target instructions.
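Taken together, the three stages form a chain in which each response becomes part of the next prompt. The sketch below shows only that data flow. The prompt wording, the `run_cot` function, and the `fake_llm` stub are hypothetical; the counterexample is treated here as a previously rejected program, which is an assumption, since the definition is given in Section III.C.

```python
from typing import Callable, List


def run_cot(
    llm: Callable[[str], str],
    functional_point: str,
    documents: List[str],
    isa_spec: List[str],
    counterexamples: List[str],
) -> str:
    """Chain Interpretation -> Analyzation -> Generation, feeding each answer forward."""
    interpretation = llm(
        "Stage: Interpretation\n"
        f"Functional point: {functional_point}\n"
        "Documents:\n" + "\n".join(documents)
    )
    instructions = llm(
        "Stage: Analyzation\n"
        f"Interpretation:\n{interpretation}\n"
        "ISA specification:\n" + "\n".join(isa_spec)
    )
    program = llm(
        "Stage: Generation\n"
        f"Instructions:\n{instructions}\n"
        "Counterexamples to avoid:\n" + "\n".join(counterexamples) + "\n"
        "Write a C program that exercises these instructions."
    )
    return program


# Stub standing in for a real model call: it records prompts and returns canned text.
prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts.append(prompt)
    return f"<answer {len(prompts)}>"


result = run_cot(
    fake_llm,
    functional_point="fp_div_by_zero",  # hypothetical functional point name
    documents=["placeholder design documentation"],
    isa_spec=["div rd, rs1, rs2"],
    counterexamples=["int main(void) { return 1 / 0; }"],
)
```

Swapping `fake_llm` for a real client only requires a callable that takes a prompt string and returns the model's text.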
RQ1 Summary: Generating test cases at the source-code level (i.e., as C programs) with LLMs achieves a higher pass rate. Additionally, PCL34B achieves the best trade-off among the four LLMs, covering the most functional points within the controlled time and reducing the time needed to reach the same number of functional points by up to 85.46%.
RQ2 Summary: FLAME achieves the largest functional coverage improvement among the compared techniques and significantly reduces the time required to reach the same functional coverage, by up to 86.13%. Furthermore, FLAME tends to generate fewer test cases while maintaining relatively high validity.
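For readers who want to reproduce such percentages from their own measurements, the sketch below shows one plausible way to compute them. Both formulas are assumptions, since this summary does not define the metrics: time reduction is taken relative to the baseline's time, and coverage improvement relative to the baseline's coverage. The numbers in the usage lines are made up and are not results from the evaluation.

```python
def time_reduction(baseline_time: float, technique_time: float) -> float:
    """Percentage of the baseline's time saved when reaching the same coverage."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return (baseline_time - technique_time) / baseline_time * 100.0


def coverage_improvement(baseline_points: int, technique_points: int) -> float:
    """Relative gain in covered functional points, as a percentage of the baseline."""
    if baseline_points <= 0:
        raise ValueError("baseline_points must be positive")
    return (technique_points - baseline_points) / baseline_points * 100.0


# Made-up measurements, for illustration only.
saved = time_reduction(baseline_time=100.0, technique_time=25.0)
gain = coverage_improvement(baseline_points=20, technique_points=30)
```

Under this relative definition an improvement above 100% is possible, namely whenever the technique covers more than twice as many functional points as the baseline.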
RQ3 Summary: The experiments show that each component of FLAME contributes materially to the framework's functional coverage improvement. In addition, two modules in FLAME, the ISA specification and the counterexamples, contribute the most to the overall pass rate of test case generation.
We propose a novel LLM-based test generation framework, FLAME, to cover more functional points in processor verification automatically. By leveraging RAG, CoT, and a functional-coverage-guided feedback mechanism, FLAME establishes semantic mappings between functional points and instructions, enabling the iterative generation of valid and effective test cases. Evaluation on four widely-used processor designs demonstrates that FLAME surpasses the baselines in functional coverage improvement while significantly reducing the time required to reach the same functional coverage. Additionally, ablation analysis underscores the critical contribution of each component to the overall effectiveness. Future work will focus on expanding the evaluation to a broader range of processor designs and on exploring more efficient LLM strategies for further enhancement.