RTL Synthesis and Other Backend Interview Questions (with Answers)

Q1: How would you speed up an ASIC design project by parallel computing? 
Which design stages can be distributed for parallel computing, which cannot, and what procedures are needed to keep the parallel runs consistent?

Mentioning the following important steps in parallel computing is essential:
1. Partitioning the design
2. Distributing partitioned tasks among multiple CPUs
3. Integrating the results

Backend ASIC design involves several stages, including floorplanning, placement, clock tree synthesis, routing, and physical verification. Each of these stages can benefit from parallel computing to some extent.

Floorplanning: Floorplanning determines the placement of major design blocks and sets up the chip-level power and signal distribution networks. Because it establishes the global structure of the chip, it is difficult to distribute; at best, alternative candidate floorplans can be explored in parallel.

Placement: Placement places the cells on the chip to optimize the design for timing, power, and area. In a hierarchical design, this stage can be parallelized by dividing the design into blocks and placing each block independently; a flat placement optimizes a single global objective and is much harder to distribute.

Clock tree synthesis: Clock tree synthesis involves constructing a clock distribution network that ensures a high-quality clock signal across the chip. This stage can be parallelized by dividing the clock network into smaller segments and synthesizing each segment independently.

Routing: Routing connects the cells and nets to implement the design's functionality. Detailed routing can be parallelized by dividing the chip into regions and routing each region independently; global routing, which allocates routing resources across the entire chip, is harder to distribute.

Physical verification: Physical verification involves checking the design for errors and ensuring that it meets the design rules and specifications. This stage can also be parallelized by dividing the design into smaller blocks and verifying each block independently.
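The partition/distribute/integrate pattern for a stage like physical verification can be sketched as follows. This is a minimal sketch: the block names and the toy `check_block()` rule stand in for invoking a real DRC engine on each partition.

```python
from multiprocessing import Pool

def check_block(block):
    """Toy per-block check: count wires below an assumed minimum width.
    Stand-in for running a real DRC engine on one partition."""
    name, widths = block
    MIN_WIDTH = 3  # hypothetical minimum-width rule
    return name, sum(1 for w in widths if w < MIN_WIDTH)

def verify_in_parallel(blocks, workers=4):
    """Distribute the partitioned checks across CPUs, then integrate."""
    with Pool(workers) as pool:
        results = pool.map(check_block, blocks)  # one task per partition
    # Integration step: merge per-block reports into one chip-level report
    return dict(results)
```

For example, `verify_in_parallel([("cpu", [4, 2, 5]), ("io", [1, 6, 2])])` returns per-block violation counts; the same three-step shape (partition, distribute, integrate) applies to block-level placement and routing.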

Which stages: The following answers are acceptable; others may be accepted given a reasonable explanation of why parallel computing can or cannot be used in a particular stage of the flow.

Can use parallel computing:
- Synthesis after partitioning
- Placement (hierarchical design)
- Detailed routing
- DRC
- Functional verification
- Timing Analysis (partition the timing graph)

Cannot use parallel computing:
- Synthesis before partitioning
- Floorplanning
- Flat Placement
- Global Routing

Constraints: Mention that care must be taken to keep partition boundaries (interface timing budgets, pin locations, and net names) consistent when the results are integrated back together.


Q2: What kinds of timing violations are in a typical timing analysis report? Explain!

- Setup time violations
- Hold time violations
- Clock skew violations
- Propagation delay violations
- Pulse width violations
- Recovery time violations
- Minimum delay violations
- Maximum delay violations
- External delay violations
- Transition violations

Timing analysis is a crucial step in the ASIC design flow that involves checking whether the timing requirements of a design are met. A typical timing analysis report highlights various types of timing violations that may occur in the design. Some common types of timing violations are:

Setup violations: A setup violation occurs when the data arriving at the input of a register is not stable for the required setup time before the capturing clock edge. This can cause metastability or incorrect data capture, leading to wrong outputs. Setup violations can be caused by factors such as long combinational paths, clock skew, or poor placement of the logic blocks.

Hold violations: A hold violation occurs when the data arriving at the input of a register changes sooner than the required hold time after the clock edge. This can corrupt the captured data, leading to incorrect outputs. Hold violations can be caused by factors such as very short data paths, clock skew, or poor placement of the logic blocks.

Clock skew violations: Clock skew violation occurs when the clock signal arrives at different parts of the design at different times, resulting in the skew. Clock skew violations can be caused by various factors, such as improper routing of the clock signal or improper placement of the clock buffers.

Propagation delay violations: Propagation delay violations occur when the signal takes longer than the specified delay to propagate through the design. This can cause incorrect outputs or timing violations in the downstream logic. Propagation delay violations can be caused by various factors, such as improper routing of the signal or improper placement of the logic blocks.

Pulse-width violations: Pulse-width violations occur when the pulse width of the clock or data signal is shorter than the specified duration, leading to timing violations. Pulse-width violations can be caused by various factors, such as improper clock or data signal generation or improper placement of the logic blocks.

Recovery time violations: A recovery time violation occurs when an asynchronous control signal (such as reset or preset) is deasserted too close to the active clock edge, so the register may not reliably return to synchronous operation. (The companion removal check requires the signal to stay asserted for a minimum time after the edge.) These violations are typically caused by improperly synchronized reset release or poorly handled clock domain crossings.

Overall, the timing analysis report helps designers identify and fix timing violations, ensuring that the design functions correctly and meets its timing requirements.
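The setup and hold checks above can be illustrated with a toy slack calculator for single-cycle paths. This is a minimal sketch: the sign convention (skew = capture clock arrival minus launch clock arrival) and ns units are assumptions.

```python
def setup_slack(clock_period, clk_to_q, data_path_delay, setup_time, skew=0.0):
    """Setup slack for a single-cycle path. Positive slack = check passes.
    Positive skew (late capture clock) helps setup."""
    required_time = clock_period + skew - setup_time  # latest allowed arrival
    arrival_time = clk_to_q + data_path_delay         # actual arrival
    return required_time - arrival_time

def hold_slack(clk_to_q, data_path_delay, hold_time, skew=0.0):
    """Hold slack: new data must not arrive before hold_time after the
    capturing edge. Positive skew makes the hold check harder."""
    earliest_arrival = clk_to_q + data_path_delay
    return earliest_arrival - (hold_time + skew)
```

For example, with a 10 ns period, 1 ns clock-to-q, 7 ns of logic, and 1 ns setup time, the setup slack is +1 ns; stretching the logic to 9 ns turns the path into a setup violation.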

Q3: List the possible techniques to fix a timing violation.

- Buffering - 
Buffers are inserted to drive a load that is too large for a single logic cell to drive efficiently. If a net is too long, it is broken and buffers are inserted to improve the transition, which improves the data-path timing and reduces setup violations.
To fix hold violations, buffers (or delay cells) are inserted to add delay on short data paths.

- Mapping - 
Mapping converts primitive logic cells found in a netlist to technology-specific logic gates found in the library on the timing-critical paths.

- Unmapping - 
Unmapping converts the technology-specific logic gates in the netlist to primitive logic gates on the timing critical paths.

- Pin swapping - 
Pin swapping optimization examines the slacks on the inputs of the gates on worst timing paths and optimizes the timing by swapping nets attached to the input pins, so the net with the least amount of slack is put on the fastest path through the gate without changing the function of the logic.

- Wire sizing

- Transistor (cell) sizing - Cell sizing is the process of assigning a drive strength, from the available variants in the library, to a cell instance in the design. If there is a low-drive-strength cell on a timing-critical path, it is replaced by a higher-drive-strength cell to reduce the violation.

- Re-routing

- Placement updates

- Re-synthesis (logic transformations)

- Cloning - Cell cloning decreases the load on a heavily loaded cell by replicating it: an identical cell is connected to the same inputs as the original, and the fanout is divided between the two copies to improve timing.

- Taking advantage of useful skew

- Logic re-structuring/Transformation (w/Resynthesis) - Rearrange logic to meet timing constraints on critical paths of design

- Making sure we don't have false violations (false path, etc.)

When a timing violation is detected during timing analysis, it is important to fix it to ensure that the design meets the timing requirements. Some of the techniques that can be used to fix a timing violation include:

Changing the clock frequency: One way to remove a timing violation is to reduce the clock frequency. The timing requirements can then be met, but at the cost of lower performance, and only if the specification allows it.

Inserting delay elements: Hold violations are fixed by inserting delay cells or buffers into short paths so that new data does not arrive too soon after the clock edge. On long nets, buffers also sharpen transitions and reduce wire delay, which helps setup paths.

Moving logic: Moving the logic to a different location on the chip can also help fix a timing violation. By moving the logic closer to the source or destination of the signal, the propagation delay can be reduced, thereby meeting the timing requirements.

Increasing the drive strength: Increasing the drive strength of the signal can help reduce the signal delay and fix a timing violation.

Re-optimizing the design: Re-optimizing the design can help fix timing violations by reconfiguring the logic to meet the timing requirements. This may involve changing the logic structure, adding pipeline stages, or optimizing the placement of the logic.

Using advanced techniques: Advanced techniques such as clock gating, retiming, and register duplication can also be used to fix timing violations.

It is important to note that these techniques are not always feasible or effective for every timing violation. The appropriate technique depends on the nature of the timing violation and the specific design constraints. Therefore, it is essential to carefully analyze the design and consider various techniques to determine the most appropriate solution.
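To see why buffer insertion works on long wires, note that the distributed-RC delay of an unbuffered wire grows quadratically with length, so splitting the wire into buffered segments makes the growth linear. A toy model, assuming ideal buffers with a fixed made-up delay and ignoring buffer parasitics:

```python
def wire_delay(r, c, length):
    """Elmore delay of an unbuffered distributed-RC wire: r*c*L^2/2."""
    return 0.5 * r * c * length * length

def buffered_delay(r, c, length, k, t_buf):
    """Wire split into k equal segments by k-1 ideal buffers, each with an
    assumed fixed delay t_buf; buffer input/output parasitics are ignored."""
    segment = length / k
    return k * wire_delay(r, c, segment) + (k - 1) * t_buf
```

With r = c = 1 and L = 10, the unbuffered delay is 50; a single buffer with t_buf = 5 brings it down to 30, because the total wire delay scales as k*(L/k)^2 = L^2/k.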


Q4: Give the linear time computation scheme for Elmore delay in an RC interconnect tree.

The following is acceptable...
- Elmore delay formula:
T(s,t) = sum over all nodes i on path(s,t) of R_i * C_i, where C_i is the total capacitance of the subtree rooted at node i; equivalently, the sum over all node capacitances, each weighted by the resistance shared between the path of interest and the path to that node.
- Explaining terms in formula
- Mentioning something that shows that it can be done in linear time ("lumped"
or "shared" resistances, "recursive" calculations, etc)

The Elmore delay is a widely used metric for characterizing signal delay in an RC interconnect tree. It is a simple yet reasonably accurate estimate that can be computed in linear time. The linear-time computation scheme can be described as follows:

Compute downstream capacitances (bottom-up): Traverse the tree in post-order and compute, for each node i, the total downstream capacitance C_i: the node's own capacitance plus the C values of its children.

Accumulate delays (top-down): Traverse the tree from the root and compute delay(i) = delay(parent(i)) + R_i * C_i, where R_i is the resistance of the edge into node i (the driver resistance at the root).

The delay at any sink t is then given by the Elmore formula:

T(s,t) = Σ_{i ∈ path(s,t)} R_i * C_i

where the sum runs over the nodes on the path from the source s to sink t.

This scheme has linear time complexity: each node is visited a constant number of times (once in each pass), so the total work is proportional to the number of nodes in the interconnect tree.
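The linear-time computation can be sketched in Python as a two-pass tree traversal (bottom-up capacitances, top-down delays). The tree encoding (child lists, per-node R and C arrays) is my own assumption:

```python
def elmore_delays(children, R, C, root=0):
    """Linear-time Elmore delay at every node of an RC tree.
    children[i]: child indices of node i; C[i]: node capacitance;
    R[i]: resistance of the edge into node i (R[root] = driver resistance).
    Returns delay[i] = sum of R[j] * Cdown[j] over j on path(root, i)."""
    n = len(C)
    # preorder via an explicit stack (parents always before descendants)
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])
    # pass 1 (bottom-up): total downstream capacitance at each node
    cap = list(C)
    for u in reversed(order):
        for v in children[u]:
            cap[u] += cap[v]
    # pass 2 (top-down): accumulate delay along root-to-node paths
    delay = [0.0] * n
    delay[root] = R[root] * cap[root]
    for u in order:
        for v in children[u]:
            delay[v] = delay[u] + R[v] * cap[v]
    return delay
```

For a three-node chain with all R = C = 1, the downstream capacitances are [3, 2, 1] and the delays [3, 5, 6]; each node is touched a constant number of times, hence linear time.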

Q5: Given a unit wire resistance "r" and a unit wire capacitance "c", a wire segment of length "l" and width "w" has resistance "rl/w" and capacitance "cwl". Can we reduce the Elmore delay by changing the width of a wire segment? Explain your answer.

You needed to mention that by scaling different segments by different amounts you can reduce the delay (e.g., wider segments near the root and narrower segments near the leaves). For a single segment in isolation, the delay is independent of width because the "w" terms cancel out.

For a single wire segment considered in isolation, changing the width does not help. The resistance is "rl/w" and the capacitance is "cwl", so the RC product is:

RC = (rl/w) * (cwl) = rcl^2

The width cancels out: uniformly widening a segment lowers its resistance but raises its capacitance by exactly the same factor.

In a tree (or a multi-segment line), however, a segment's resistance also multiplies the downstream capacitance of everything beyond it, and its capacitance loads all of the resistance upstream of it. Widening a segment near the root lowers the resistance seen by a large downstream capacitance while adding extra capacitance behind only a small upstream resistance; narrowing a segment near a leaf removes capacitive load from all the upstream resistance. Therefore, by tapering the wire (wide segments near the driver, narrow segments near the sinks), the Elmore delay can be reduced, even though scaling a single segment in isolation has no effect.
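A quick numeric check of this argument, using a lumped per-segment Elmore model with assumed unit r and c:

```python
R_UNIT, C_UNIT = 1.0, 1.0   # assumed unit wire resistance and capacitance

def seg_r(l, w):
    return R_UNIT * l / w    # resistance drops as the wire widens

def seg_c(l, w):
    return C_UNIT * w * l    # capacitance grows as the wire widens

def self_delay(l, w):
    """One segment driving only its own capacitance: r*c*l^2, w cancels."""
    return seg_r(l, w) * seg_c(l, w)

def two_seg_delay(l, w1, w2, c_load):
    """Lumped Elmore delay of two cascaded segments driving a load c_load."""
    d1 = seg_r(l, w1) * (seg_c(l, w1) + seg_c(l, w2) + c_load)
    d2 = seg_r(l, w2) * (seg_c(l, w2) + c_load)
    return d1 + d2
```

Here `self_delay(4, 1)` and `self_delay(4, 3)` both come out to 16 (up to rounding), confirming that width cancels for a lone segment, while `two_seg_delay(1, 2, 1, 10)` is smaller than `two_seg_delay(1, 1, 1, 10)`: widening only the upstream segment lowers the delay, which is exactly the tapering argument.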

Q6: Extend the ZST-DME algorithm to embed a binary tree such that the Elmore delay from the root to each leaf of the tree is identical.

You needed to mention that a new procedure is needed for calculating the Elmore delay assuming that certain merging points are chosen, instead of just the total downstream wire-length. The merging segment becomes a set of points with equal Elmore delay instead of just equal path length. You could refer to the paper "Low-Cost Single-Layer Clock Trees With Exact Zero Elmore Delay Skew" by Andrew B. Kahng and Chung-Wen Albert Tsao. Read on ...

The ZST-DME (Zero-Skew Tree construction via Deferred-Merge Embedding) algorithm builds a clock tree in which the delay from the root to every sink is identical. It operates in two phases: a bottom-up phase that computes, for each internal node of the given topology, a merging segment (the locus of feasible embedding points for that node), and a top-down phase that selects an exact location within each merging segment. In the basic formulation, merging segments are chosen so that path lengths to the sinks balance. To extend the algorithm so that the Elmore delay from the root to each leaf of a binary tree is identical, the following changes are needed:

Propagate subtree parameters: During the bottom-up phase, maintain for each subtree its total downstream capacitance and its (already balanced) sink delay, since both are needed to evaluate the Elmore delay at a candidate merging point.

Balance Elmore delay at each merge: When merging two subtrees connected by a wire of length d, choose the tapping point at distance x from the first subtree such that the Elmore delay through the length-x wire plus the first subtree's delay equals the Elmore delay through the length-(d-x) wire plus the second subtree's delay. The merging segment thus becomes a set of points of equal Elmore delay rather than equal path length.

Elongate wires when necessary: If no point on the connecting wire balances the delays (one subtree is too fast), elongate ("snake") the wire on the faster side until the delays match, just as wire elongation is used in the pathlength-balanced case.

Embed top-down: As in standard DME, walk down from the root and pick, within each merging segment, the exact location that minimizes wire length while preserving the computed delay balance.

Because every merge changes the downstream capacitance seen above it, the Elmore-based merging segments must be recomputed with these updated capacitances; this is the construction described by Kahng and Tsao for exact zero Elmore delay skew.
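The delay-balancing merge at the heart of this extension can be sketched as a one-dimensional search for the tapping point. This is a lumped-RC sketch with assumed unit wire parameters; a real implementation would use the standard closed-form solution for the merge point:

```python
def zero_skew_merge(t1, c1, t2, c2, d, r=1.0, c=1.0):
    """Find the tapping point x in [0, d] on a wire of length d joining two
    subtrees (sink delays t1, t2; downstream caps c1, c2) so the Elmore
    delays through both sides match. r, c: unit wire parameters (assumed).
    Returns None when no point balances and the wire must be elongated."""
    def skew(x):
        d1 = t1 + r * x * (c * x / 2 + c1)              # delay via side 1
        d2 = t2 + r * (d - x) * (c * (d - x) / 2 + c2)  # delay via side 2
        return d1 - d2                                  # increases with x
    if skew(0.0) > 0 or skew(d) < 0:
        return None          # no balance point on the wire: snake a side
    lo, hi = 0.0, d
    for _ in range(60):      # bisection on the monotone skew function
        mid = (lo + hi) / 2
        if skew(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Merging two identical subtrees across a wire of length 2 taps at the midpoint x = 1; if subtree 2 is slower (t2 > t1), the balance point moves past the midpoint, shortening the wire on the slow side.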

Q7: IPO (sometimes also referred to as "In-Place Optimization") tries to optimize the design timing by buffering long wires, resizing cells, restructuring logic etc.
Explain how these IPO steps affect the quality of the design in terms of area, congestion, timing slack.
(a) Why is this called "In-Place Optimization"?
(b) Why are the two IPO steps different?
(c) Why are both used?


a) IPO is called "In-Place Optimization" because it optimizes the design's timing by modifying the existing placement and routing rather than generating completely new ones. Cells are resized, inserted, or restructured at or near their current locations, with only local, incremental legalization, so the overall placement of the design is preserved.

b) The two steps in IPO, buffering and resizing, are different because they address different timing issues in the design. Buffering is used to address long wire delays and resizing is used to address cell delay issues. Buffering adds additional buffer cells to a long wire to reduce its delay, while resizing adjusts the size of the cells to improve their performance.

c) Both buffering and resizing are used because they address different timing issues and can be complementary to each other. Buffering is effective for addressing long wire delays, while resizing is effective for addressing cell delay issues. By combining these two techniques, the design can be optimized for both types of timing issues, leading to better timing performance.

In terms of the quality of the design, buffering and resizing can have different effects:

Area: Both buffering and resizing can increase the area of the design: buffering adds buffer cells, and upsizing substitutes larger cells. Some of this increase can be recovered by downsizing cells on paths with positive slack or removing redundant logic.

Congestion: Buffering can increase congestion in the design, especially in areas with long wires. Resizing can also increase congestion in the design by requiring larger cells to be placed. However, IPO tools typically have congestion-aware optimization algorithms that can help mitigate this effect.

Timing slack: IPO improves the timing slack of the design by reducing the delay of long wires and slow cells. By improving the timing slack, the design has a larger margin of error, which can be useful for reducing the probability of timing violations in the presence of process and temperature variations. However, IPO can also introduce new timing violations if not applied properly.

Q8: Clocking and Place-Route Flow. Consider the following steps:
- Clock sink placement
- Standard-cell global placement
- Standard-cell detailed placement
- Standard-cell ECO placement
- Clock buffer tree construction
- Global signal routing
- Detailed signal routing
- Bounded-skew (balanced) clock (sub)net routing
- Steiner clock (sub)net routing
- Clock sink useful skew scheduling (i.e., solving the linear program, etc.)
- Post-placement (global routing based) static timing analysis
- Post-detailed routing static timing analysis
(a) As a designer of a clock distribution flow for high-performance standard-cell based ASICs, how would you order these steps? Is it possible to use some steps more than once, others not at all (e.g., if subsumed by other steps).
(b) List the criteria used for assessing possible flows.
(c) What were the 3 next-best flows that you considered (describe as variants of your flow), and explain why you prefer your given answer.


(a) My basic flow:
(1) SC global placement
(2) post-placement STA
(3) clock sink useful-skew scheduling
(4) clock buffer tree construction that is useful-skew aware (cf. associative skew.)
(5) standard-cell ECO placement (to put the buffers into the layout)
(6) Steiner clock subnet routing at lower levels of the clock tree (following CTGen type paradigm)
(7) bounded-skew clock subnet routing at all higher levels of the clock tree, and as necessary even at lower levels, to enforce useful skews
(8) global signal routing
(9) detailed signal routing,
(10) post-detailed routing STA

(b) Criteria:
(1) likelihood of convergence with maximum clock frequency
(2) minimization of CPU time (by maximizing incremental steps, minimizing "detailed" steps, and minimizing iterations)
(3) make a good trade-off between wiring-based skew control and wire cost (this suggests Steiner routing at lower levels, bounded-skew routing at higher levels).
[Comment 1. Criteria NOT addressed: power, insertion delay, variant flow for hierarchical clocking or gated clocking.
Comment 2: I do not know of any technology for clock sink placement that can separate this from placement of remaining standard cells. So, my flow does not invoke this step. I also don't want post-route ECOs.]

(c) Variants:
(1) introduce Step 11: loop over Steps 3-10 (not adopted because cost benefit ratio was not attractive, and because there is a trial placement + global routing to drive useful-skew scheduling, buffer tree construction and ECO placement);
(2) after Steps 1-4, re-place the entire netlist (global, detailed placement) and then skip Step 5 (not adopted because the benefits of avoiding ECO placement and leveraging a good clock skeleton were felt to be small: the buffer tree will largely reflect the netlist structure, and re-placing can destroy assumptions made in Steps 3-4);
(3) can iterate the first 5 steps essentially by iterating: clock sink placement, (ECO placement for legalization), (incremental) standard-cell (global + detailed) placement (not adopted because I feel that any objective for standalone clock sink placement would be very "fuzzy", e.g., based on sizes of intersections of fan-in/fan-out cones of sequentially adjacent FFs)

Alternative:

a) As a designer of a clock distribution flow for high-performance standard-cell based ASICs, the order of the steps can vary depending on the specific design requirements and constraints. However, a possible order of the steps could be:

(1) Clock sink placement
(2) Standard-cell global placement
(3) Clock buffer tree construction
(4) Bounded-skew (balanced) clock (sub)net routing
(5) Steiner clock (sub)net routing
(6) Clock sink useful skew scheduling (i.e., solving the linear program, etc.)
(7) Standard-cell detailed placement
(8) Global signal routing
(9) Detailed signal routing
(10) Post-placement (global routing based) static timing analysis
(11) Standard-cell ECO placement
(12) Post-detailed routing static timing analysis

Some steps can be used more than once; for example, ECO placement and static timing analysis can be repeated to fix timing or congestion issues after each major change. Other steps may be subsumed or skipped; for example, standalone clock sink placement may be subsumed by global placement if the sinks are placed together with the rest of the standard cells.

b) The criteria used for assessing possible flows include:

- Timing: The flow should minimize clock skew and reduce timing violations.
- Area: The flow should minimize the area of the clock distribution network and the overall design.
- Power: The flow should minimize power consumption while meeting timing constraints.
- Congestion: The flow should minimize congestion in the clock and data routing networks.
- Scalability: The flow should scale to large designs with many clock sinks and complex routing requirements.
- Design rules: The flow should comply with the design rules and constraints of the foundry process.

c) Three next-best flows that could be considered as variants of the given flow are:

- Clock-first flow: In this flow, the clock sink placement and buffer tree construction steps are performed before the standard-cell global placement. This can reduce clock skew and simplify the clock routing process, but it may increase congestion and area and make it harder to meet timing constraints.

- Timing-driven flow: In this flow, the global and detailed placement steps are driven by timing constraints rather than physical optimization criteria. This can improve timing performance but may increase area and congestion.

- Congestion-driven flow: In this flow, the placement and routing steps are driven by congestion constraints rather than timing or physical optimization criteria. This can reduce congestion but may lead to suboptimal timing or area.

The given flow is preferred because it balances timing, area, and congestion constraints while maintaining scalability and compliance with design rules. It also follows a logical order that builds upon the previous steps, leading to a more efficient and effective clock distribution network.

Q9: If we migrate to the next technology node and double the gate count of a design, how would you expect the size of the LEF and routed DEF files to change? Explain your reasoning.

If we migrate to the next technology node and double the gate count of a design, we would expect the routed DEF file to roughly double (or more), while the LEF file grows only modestly. The two files scale differently because they describe different things:

LEF describes the technology and the library: layer and via definitions, design rules, and abstract views of the library cells. Its size depends on the process and on the number of library cells, not on the design's gate count, so doubling the design leaves it essentially unchanged. It may still grow somewhat at the new node because of additional metal layers and more complex design rules.

DEF describes the design instance: doubling the gate count doubles the COMPONENTS section and roughly doubles the NETS section, so the DEF grows at least linearly with gate count.

Routing detail: The routed DEF also records every wire segment and via. At a smaller node, nets tend to be decomposed into more segments (more routing tracks, multiple-patterning and via rules), so the routing section often grows faster than linearly with gate count.

More layers: If the new node uses more metal layers for routing, each additional layer adds wiring records to the routed DEF (and layer and via definitions to the LEF).

Overall, the routed DEF will likely more than double in size, while the LEF changes only with the layer and rule complexity of the new technology, not with the size of the design.
