
Think Like a Domain Expert: Semiconductor Design Expert Edition
Why Expert-Level Domain-Specific Models will Advance the Future of Semiconductor Design
Enterprises today face rapid change, brought on by the promise GenAI holds for their businesses. Today we will discuss the semiconductor industry, which faces a turning point of its own, one that expert AI systems like those created by Articul8 can begin to address. As companies push toward more advanced chip designs and manufacturing processes, the need for AI-driven automation has never been greater. However, general-purpose AI models (large language models, or LLMs) struggle with the specialized requirements of chip design, testing, and verification, leading to inaccuracies and unreliable results.
At Articul8, we take a different approach. By combining domain-specific models (DSMs) with reasoning capabilities, we create expert-class AI tailored for complex domains like semiconductor engineering. Our Verilog-capable DSM (A8-Semicon) not only outperforms the latest open-source state-of-the-art (SOTA) models by up to 2X on our tests but also matches or surpasses proprietary models like Google Gemini Flash 2.0 and GPT-4o – at a fraction of the computational cost.
Can “Thinking” Models Help Build Better Expert-Class DSMs?
The recent surge of interest in the cost-effectiveness of DeepSeek-R1 has overshadowed a more fundamental question: can reasoning models be used to improve general-purpose LLM workflows? At Articul8, our research on autonomous workflows combined with reasoning models is showing promising results: transparently exposing "thought processes" can enhance the effectiveness of expert-level domain-specific models.
Early in our product development journey, we recognized that general-purpose LLMs alone are not enough to solve complex, industry-specific problems. We validated our hypothesis through close collaboration with subject-matter experts (SMEs) across diverse fields like financial analysis, manufacturing, supply chain, design engineering, aerospace, and risk analysis. The most effective approach required integrating DSMs, task-specific LLMs, and many other types of models and tools into a cohesive, autonomous system.
Beyond model selection, expert AI systems must also meet critical business requirements such as transparency, reproducibility, and observability. As customers increasingly demand visibility into every step of an AI-generated answer, our approach ensures that domain experts can trust and verify the logic behind it.
Building a Next-Generation Platform for Expert DSMs
At Articul8, we are building a platform that delivers expert-level DSMs while ensuring they adhere to the end-user requirements of speed, scalability, security, and sustainable costs – what we call the 4 S's. Additionally, our ModelMesh™ ensures actionable and accurate insights – the 2 A's. We built ModelMesh™ to serve the needs of SMEs solving enterprise-grade complex problems. ModelMesh™ is the autonomous layer that decides, selects, executes, and evaluates the right models at run-time. Think of it as a reasoning system that figures out what to run, when, and in what sequence based on the task and the context, and that evaluates the answers at every step to decide what to do next. This is why we see reasoning models as essential to the evolution of expert GenAI systems: each step in the ModelMesh™ can reason, in addition to the overall system reasoning, to solve a problem. This enables more reliable and interpretable decision-making while dramatically improving model performance.
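To make this select-execute-evaluate pattern concrete, here is a minimal conceptual sketch of such a run-time orchestration loop. Everything here – the registry, model names, and evaluator – is an illustrative assumption, not Articul8's actual ModelMesh™ implementation:

```python
# Conceptual sketch of a select-execute-evaluate orchestration loop.
# All names are illustrative assumptions, not the actual ModelMesh(TM).

# Hypothetical registry mapping task types to candidate models.
MODEL_REGISTRY: dict[str, list[str]] = {
    "verilog_codegen": ["a8-semicon", "general-coder"],
    "summarize": ["general-llm"],
}

def run_model(model: str, prompt: str) -> str:
    """Placeholder for an actual model call (API or local inference)."""
    raise NotImplementedError

def score_answer(task: str, answer: str) -> float:
    """Placeholder evaluator, e.g. a compile check for generated Verilog."""
    raise NotImplementedError

def orchestrate(task: str, prompt: str, threshold: float = 0.8,
                max_steps: int = 3) -> str:
    """Try candidate models in order, evaluating after every step."""
    best_answer, best_score = "", 0.0
    for model in MODEL_REGISTRY.get(task, [])[:max_steps]:
        answer = run_model(model, prompt)
        score = score_answer(task, answer)  # evaluate at every step
        if score >= threshold:
            return answer                   # good enough: stop early
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer                      # fall back to the best attempt
```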
Introducing Our Blog Series: “Think Like a Domain Expert”
To help unpack this journey and share our learnings with you, we are launching a new blog series exploring how the A8 platform enhances expert-level problem-solving across a range of industries, including semiconductors, aerospace, energy production and transmission, automotive, oil and gas, telecommunications, financial services, and more.
The first blog sets the stage by tackling a constrained yet unsolved industry challenge: generating, understanding, and explaining the Verilog code used in the semiconductor design process. As more companies look to design highly advanced processors, engineers and scientists are constantly looking for tools that will allow them to improve semiconductor design, deliver ROI faster, and stay ahead of the competition. With this scenario as a backdrop, we will demonstrate how the A8 platform provides enhanced visibility into the "thinking" applied to expert-level problems and significantly improves the overall quality of solutions.
Semiconductor Design: A Customer-Driven Challenge
The Electronic Design Automation (EDA) process comprises a series of steps essential for designing and manufacturing semiconductor chips. The process begins with design, where engineers create a high-level description of the chip's functionality, including what features it should have and how they should work. Next, the high-level design is converted into a more detailed description of the chip's digital circuitry, known as the Register-Transfer Level (RTL) design. This is where Verilog comes in: Verilog is a hardware description language used to describe the digital circuitry of a chip at the RTL level. It is used to write code that defines how the chip's components, such as logic gates and memory, should behave and interact with each other. After the RTL design is complete, the design is converted into a physical layout, which describes the actual placement of the components on the chip. This is followed by verification, where the design is tested to ensure it works as intended. Finally, the design is manufactured, and the chip is produced.
Verilog plays a crucial role in the EDA process, as it allows engineers to create a detailed, functional description of the chip's digital circuitry. This description can then be used to simulate the behavior of the chip, verify its functionality, and ultimately produce a working chip. Without Verilog or similar hardware description languages, it would be much more difficult to design and manufacture complex semiconductor chips. The business value of automated Verilog generation lies in its ability to improve the efficiency, productivity, and accuracy of the digital design process.
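To make this concrete, below is a minimal sketch of what RTL-level Verilog looks like and how its "compilability" can be checked automatically. It uses the open-source Icarus Verilog compiler and assumes `iverilog` is installed; this is an illustration, not Articul8's actual tooling:

```python
# Minimal sketch: check whether a Verilog snippet compiles, using the
# open-source Icarus Verilog compiler (assumes `iverilog` is on PATH).
import os
import subprocess
import tempfile

# A small RTL-level module: a 2-to-1 multiplexer.
VERILOG_SRC = """
module mux2 (
    input  wire a,
    input  wire b,
    input  wire sel,
    output wire y
);
    assign y = sel ? b : a;
endmodule
"""

def compiles(verilog_source: str) -> bool:
    """Return True if iverilog accepts the source without errors."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "design.v")
        out = os.path.join(tmp, "design.out")
        with open(src, "w") as f:
            f.write(verilog_source)
        result = subprocess.run(["iverilog", "-o", out, src],
                                capture_output=True, text=True)
        return result.returncode == 0

print(compiles(VERILOG_SRC))  # expected: True
```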
Why General-Purpose Models Fall Short
General-purpose open-weight models such as LLaMA and Qwen, and closed models such as GPT-4o, Gemini Flash 2.0, Claude, and others, can generate code that seems correct at first glance but often fails to compile – rendering it useless. For widely used languages like C, general-purpose models can generate mostly functional code. However, for specialized languages like Verilog, even the best state-of-the-art (SOTA) general-purpose LLMs perform very poorly (roughly a 34% average pass rate), as shown in the image below.

Figure 1: Benchmark of open-weight models. The left panel shows the compilation success rate of code generated by different open-weight models. The right panel shows the pass@1 rate (success in executing and obtaining the right outputs on the first code generation request) for the same models.
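Both metrics in Figure 1 reduce to simple proportions over first-attempt generations. The sketch below shows how they might be aggregated; the record format is an illustrative assumption, not our actual benchmark harness:

```python
# Sketch of aggregating the two Figure 1 metrics, assuming one record
# per benchmark problem with boolean outcomes (illustrative schema).
records = [
    {"compiled": True,  "passed_testbench": True},
    {"compiled": True,  "passed_testbench": False},
    {"compiled": False, "passed_testbench": False},
]

def rate(records: list[dict], key: str) -> float:
    """Fraction of problems whose first generation satisfies `key`."""
    return sum(r[key] for r in records) / len(records)

compile_rate = rate(records, "compiled")          # compilation success rate
pass_at_1 = rate(records, "passed_testbench")     # pass@1: first try passes
print(f"compile rate={compile_rate:.2f}, pass@1={pass_at_1:.2f}")
```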
How We Built a Verilog-Capable DSM
To ensure our Verilog code generation DSM is both domain-relevant and high-performing, we developed a systematic test bench designed to evaluate models across a spectrum of tasks. Our benchmark dataset consists of curated samples spanning three proficiency levels: 1) college-level – fundamental syntax and simple module structures, 2) graduate-level – more intricate logic and combinational circuits, and 3) expert-level – complex designs, including finite state machines (FSMs), optimized architectures, and hardware constraints.
To train an expert-level semiconductor DSM, we structured our dataset pipeline around three broad core tasks: 1) code completion (extending partially written Verilog snippets), 2) code generation (producing syntactically and structurally correct Verilog modules), and 3) code debugging (identifying compilation errors, applying fixes, and iteratively refining outputs). The data generation pipeline is shown in the figure below. This approach ensures our evaluation process reflects real-world Verilog challenges while providing a meaningful test of domain-specific LLM performance.
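As an illustration of what task-structured training samples for these three tasks might look like, consider the hypothetical, simplified records below (not the actual dataset schema):

```python
# Hypothetical, simplified examples of the three task types used to
# structure the training data (illustrative, not the actual schema).
samples = [
    {   # 1) Code completion: extend a partially written snippet.
        "task": "completion",
        "prompt": "module counter(input clk, input rst, output reg [3:0] q);\n"
                  "  always @(posedge clk) begin\n",
        "target": "    if (rst) q <= 4'd0;\n    else q <= q + 4'd1;\n"
                  "  end\nendmodule\n",
    },
    {   # 2) Code generation: produce a full module from an instruction.
        "task": "generation",
        "prompt": "Write a Verilog module for a 2-to-1 multiplexer.",
        "target": "module mux2(input a, b, sel, output y);\n"
                  "  assign y = sel ? b : a;\nendmodule\n",
    },
    {   # 3) Code debugging: fix a snippet that fails to compile.
        "task": "debugging",
        "prompt": "Fix this code:\nmodule bad(input a, output y)\n"
                  "  assign y = a;\nendmodule\n",  # missing ';' after ports
        "target": "module bad(input a, output y);\n"
                  "  assign y = a;\nendmodule\n",
    },
]
```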

Figure 2: Data generation process.
The data generation pipeline had the following steps:
- Data Collection & Preparation: We gathered a large corpus of Verilog code from public sources, applying careful selection criteria to ensure the quality and relevance of the data. The collected code then underwent a series of processing steps to verify the code's integrity and “compilability”, allowing us to identify and separate code that required further refinement.
- Automated Debugging & Correction: To ensure dataset quality, we employed a rigorous validation process for non-compiling code, analyzing and addressing errors and then verifying the functionality of the corrected code. Any code that failed to meet our quality standards was excluded. This ensured that only high-quality, functional Verilog samples were included in the dataset, eliminating incomplete or flawed code that could introduce inconsistencies in training.
- Code Completion & Instruction Generation: Once the dataset was validated, we moved into the instruction generation and code completion phase. We structured this step to enhance the model’s ability to understand and complete partially written Verilog snippets while preserving logical and syntactical accuracy. The specific steps taken were:
- Verified Verilog snippets were paired with instructions, ensuring alignment with real-world coding patterns.
- To diversify the dataset, we incorporated an existing openly available dataset, with steps to refine low-quality instruction-answer pairs into high-quality training data.
- Using Abstract Syntax Tree (AST) parsing, we split Verilog code at varying proportions, ensuring greater diversity in completion tasks (a simplified sketch of this splitting step appears after this list).
This structured approach to data generation enabled the downstream model to recognize and complete Verilog structures more effectively, leading to improved contextual understanding and generalization.
- Thought Refinement & Final Verification: To further enhance model reasoning, we optimized for instruction-following and chain-of-thought reasoning. This step enabled the resulting model to:
- Reflect on code structure and logical accuracy
- Identify potential improvements in Verilog syntax and module interactions
- Generate thoughtful, step-by-step refinements to improve output quality
Before finalizing the dataset, every generated Verilog snippet underwent an additional quality check to ensure correctness. This final layer of validation reinforced accuracy, consistency, and real-world applicability.
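Referring back to the splitting step in the code completion phase above: the real pipeline splits on the Abstract Syntax Tree, whereas the sketch below splits verified snippets line-wise at several proportions, as a simple stand-in to illustrate the idea:

```python
# Simplified stand-in for completion-pair generation: the real pipeline
# splits on the AST; here we split line-wise at varying proportions.
def make_completion_pairs(verilog_source: str,
                          proportions=(0.25, 0.5, 0.75)):
    """Yield (prefix, completion) pairs from one verified snippet."""
    lines = verilog_source.strip().splitlines(keepends=True)
    for p in proportions:
        cut = max(1, min(len(lines) - 1, round(len(lines) * p)))
        prefix = "".join(lines[:cut])       # given to the model
        completion = "".join(lines[cut:])   # expected continuation
        yield prefix, completion

snippet = """module mux2(input a, b, sel, output y);
  assign y = sel ? b : a;
endmodule
"""
for prefix, completion in make_completion_pairs(snippet):
    print(repr(prefix), "->", repr(completion))
```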
We employed a multi-stage training process, starting from a previously trained instruction-following model (such as LLaMA 3.1 8B Instruct), as shown in the figure below. The threshold lengths were chosen to optimize Verilog understanding and generation. The stages progressed through multiple input lengths and were designed to enable the domain-specific model (A8-Semicon) to mimic the EDA industry standards and accepted practices that a domain expert would follow.
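Schematically, such length-staged training might look like the following. The stage lengths and the `finetune` helper are purely hypothetical placeholders; the actual thresholds and training stack are not disclosed in this post:

```python
# Schematic of length-staged fine-tuning. Stage lengths and the
# `finetune` helper are hypothetical placeholders, not actual values.
def finetune(model, dataset, max_input_len: int):
    """Placeholder for one supervised fine-tuning stage."""
    raise NotImplementedError

def staged_training(base_model, dataset,
                    stage_lengths=(2048, 4096, 8192)):
    """Run successive fine-tuning stages over increasing input lengths."""
    model = base_model  # e.g. an instruction-tuned starting checkpoint
    for max_len in stage_lengths:
        # Restrict each stage to samples within its length threshold.
        subset = [ex for ex in dataset if ex["n_tokens"] <= max_len]
        model = finetune(model, subset, max_input_len=max_len)
    return model
```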

Figure 3: Multi-stage training process for the domain-specific model.
Performance Results: 2X Better Than SOTA Models
The resulting model performance is shown in the plots below. Two different tests were performed: (1) compilation success of the generated Verilog code and (2) testbench pass rate of the generated code. In both cases, five runs were performed for every model in an automated fashion, both to generate a statistically viable sample and to remove human-induced bias. The A8-Semicon model is more than 2X better than its similarly sized open peers on both tests, and it outperforms far larger open SOTA models such as LLaMA 3.3 70B and Qwen2.5 Coder 32B. Further, the A8-Semicon model matched or outperformed proprietary models like GPT-4o and the latest Gemini Flash 2.0 on both tests, despite being dramatically smaller. It should be noted that Claude 3 Sonnet and DeepSeek-R1 slightly (about 8%) outperform the A8-Semicon model on the testbench test, though A8-Semicon matches or exceeds them on compilation success. Combining superior performance with significantly lower deployment cost, our A8-Semicon model is a significantly better solution than current SOTA models for semiconductor EDA workloads. These results also highlight that careful data curation and integrated "thinking" capabilities lead to major improvements in reasoning and accuracy, and are therefore critical to building high-performing expert DSMs.

Figure 4: Comparing A8-Semicon model performance with open and closed models for Verilog Compilation Success Rate.

Figure 5: Comparing A8-Semicon model performance with open and closed models for Verilog Testbench Pass Rate Test.
| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3.1 8B Instruct | 0.37 | 0.33 | 0.33 | 0.33 | 0.27 | 0.33 |
| LLaMA 3.3 70B Instruct | 0.46 | 0.43 | 0.43 | 0.42 | 0.35 | 0.42 |
| Qwen 2.5 Coder 32B Instruct | 0.51 | 0.46 | 0.45 | 0.48 | 0.42 | 0.46 |
| A8-Semicon | 0.74 | 0.78 | 0.70 | 0.69 | 0.73 | 0.73 |
| GPT-4o | 0.59 | 0.62 | 0.60 | 0.63 | 0.62 | 0.61 |
| Claude 3 Sonnet | 0.72 | 0.72 | 0.71 | 0.74 | 0.71 | 0.72 |
| DeepSeek R1 | 0.71 | 0.72 | 0.69 | 0.71 | 0.72 | 0.71 |
| Flash 2.0 | 0.72 | 0.69 | 0.71 | 0.71 | 0.66 | 0.70 |
Table 1: Full results comparing A8-Semicon model performance with open and closed models for Verilog Compilation Success Rate Test. The table presents compilation success rates across five test runs for each model.
| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3.1 8B Instruct | 0.19 | 0.21 | 0.20 | 0.21 | 0.19 | 0.20 |
| LLaMA 3.3 70B Instruct | 0.31 | 0.30 | 0.30 | 0.26 | 0.27 | 0.29 |
| Qwen 2.5 Coder 32B Instruct | 0.40 | 0.35 | 0.32 | 0.40 | 0.33 | 0.36 |
| A8-Semicon | 0.36 | 0.49 | 0.48 | 0.44 | 0.46 | 0.47 |
| GPT-4o | 0.44 | 0.46 | 0.44 | 0.48 | 0.44 | 0.45 |
| Claude 3 Sonnet | 0.50 | 0.50 | 0.51 | 0.52 | 0.53 | 0.51 |
| DeepSeek R1 | 0.51 | 0.52 | 0.51 | 0.52 | 0.49 | 0.51 |
| Flash 2.0 | 0.47 | 0.46 | 0.46 | 0.46 | 0.42 | 0.45 |
Table 2: Full results comparing A8-Semicon model performance with open and closed models for Verilog Testbench Pass Rate Test. The table presents Verilog Testbench Pass Rates across five test runs for each model.
Stay tuned for our upcoming paper with full technical details.

We compared the inference costs of various models using standard API service providers. For open-source models, we considered Together.ai, Fireworks.ai, and Amazon Bedrock, based on model availability.
| Model Name | Input ($/1M tokens) | Output ($/1M tokens) | Service Provider |
| --- | --- | --- | --- |
| LLaMA 3.1 8B Instruct | 0.20 | 0.20 | Fireworks.ai / Together.ai |
| LLaMA 3.3 70B Instruct | 0.72 | 0.72 | Amazon Bedrock |
| Qwen 2.5 Coder 32B Instruct | 0.80 | 0.80 | Together.ai |
| GPT-4o | 2.50 | 10.00 | OpenAI |
| Claude 3 Sonnet | 3.00 | 15.00 | Amazon Bedrock / Anthropic |
| Claude 3.5 Sonnet v2 | 3.00 | 15.00 | Amazon Bedrock / Anthropic |
| DeepSeek R1 (FP8) | 0.55 | 2.19 | DeepSeek |
| Gemini Flash 2.0 | 0.15 | 0.60 | Google Vertex |
This table shows the standard API pricing for input and output tokens for each model; all figures represent costs per one million tokens. Because A8-Semicon is built from an 8B-parameter base, its inference cost is comparable to the least expensive models in the table, making it the most cost-effective option. Combined with its high accuracy, the A8-Semicon model is, based on the tests above, the best choice for production-scale deployment on semiconductor design workloads. Note that individual model costs are only component costs; at the system level, several of the large models become prohibitively expensive for any production deployment at scale.
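As a quick worked comparison using the table's published prices, and assuming a hypothetical workload of one million input plus one million output tokens:

```python
# Worked cost comparison from the table's per-1M-token prices, for a
# hypothetical workload of 1M input + 1M output tokens.
prices = {  # model: (input $/1M tokens, output $/1M tokens)
    "LLaMA 3.1 8B Instruct": (0.20, 0.20),
    "Qwen 2.5 Coder 32B Instruct": (0.80, 0.80),
    "GPT-4o": (2.50, 10.00),
    "Claude 3 Sonnet": (3.00, 15.00),
    "DeepSeek R1 (FP8)": (0.55, 2.19),
    "Gemini Flash 2.0": (0.15, 0.60),
}
for model, (inp, out) in prices.items():
    print(f"{model}: ${inp + out:.2f} per 1M in + 1M out tokens")
# e.g. GPT-4o costs $12.50 versus $0.40 for an 8B-class model
# on this workload, a roughly 30X difference per token processed.
```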
Expanding the Impact Across Industries
Developing a reasoning domain-specific Large Language Model (DSM-R) proficient in Verilog code generation not only enhances hardware design automation but also establishes a framework applicable to other specialized programming languages across various industries. By building on top of our DSM-Rs with targeted datasets, businesses can achieve significant improvements in code accuracy and functionality for their own domains. This breakthrough extends far beyond semiconductor design. The same approach can be applied to other specialized domains, such as supply chain management and manufacturing, where specialized, often obscure, languages are prevalent. For instance, in manufacturing, LLMs can be utilized to automate the programming of Programmable Logic Controllers (PLCs). By training LLMs on these specific languages, companies can automate complex tasks, leading to more efficient workflows, reduced development times, and substantial cost savings. The ability to adapt LLMs to specific domains ensures outputs that are both contextually relevant and precise, thereby amplifying their business impact across sectors.
Next in the “Think Like a Domain Expert” Series
By following this rigorous, multi-stage pipeline, we developed a reasoning domain-specific Verilog model (A8-Semicon) that significantly outperforms state-of-the-art general-purpose LLMs and matches the performance of much larger closed models. Our approach ensures that the model not only produces functional Verilog code but also understands, debugs, and improves upon it – mirroring the expertise of a real-world semiconductor engineer.
This blog marks just the first step in our journey into building a world enabled by domain-specific expert AI. Next, we’ll explore how our DSMs transform industries like aerospace, automotive, and energy. Follow our series and learn how the Articul8 platform can help you stay ahead of your competition!
*Benchmark testing was performed over the first two weeks of February 2025.