
Investigation on timing analysis inaccuracies


Timing analysis inaccuracies due to crosstalk, simultaneous switching of multiple gate inputs, supply voltage variation, temperature, manufacturing variation, etc. are very common. How do you tackle them in real-life designs? How do you do it using PrimeTime (PT)?

Effect of WLM and target frequency on performance


How do you quantify the effect of Wire Load Models (WLM) and target frequency on the post-routing timing results?

RTL synthesis and other backend Interview Questions (with answers)


Q1: How would you speed up an ASIC design project by parallel computing? Which design stages can be distributed for parallel computing, which cannot, and what procedures are needed for maintaining parallel computing?
Ans: Mentioning the following important steps in parallel computing is essential:
1. Partitioning the design
2. Distributing partitioned tasks among multiple CPUs
3. Integrating the results


WHAT STAGES: The following answers are acceptable. Other answers may be accepted if you gave a reasonable explanation of why parallel computing can or cannot be used in a particular stage of the flow.
Can use parallel computing:
- Synthesis after partitioning
- Placement (hierarchical design)
- Detailed routing
- DRC
- Functional verification
- Timing Analysis (partition the timing graph)
Cannot use parallel computing:
- Synthesis before partitioning
- Floorplanning
- Flat Placement
- Global Routing
CONSTRAINTS: Mention that care must be taken to ensure that partition boundaries are consistent when the results are integrated back together (a toy partition/distribute/integrate sketch follows below).
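
As a toy illustration of the partition / distribute / integrate pattern above (a sketch only, not tied to any real EDA tool; the block names and the "check_partition" task are invented for this example), independent per-block checks can be fanned out across CPUs with Python's multiprocessing:

from multiprocessing import Pool

def check_partition(partition):
    """Stand-in for an independently runnable step (e.g. DRC on one block)."""
    name, cells = partition
    violations = [c for c in cells if c.endswith("_bad")]  # dummy rule for illustration
    return name, violations

if __name__ == "__main__":
    # 1. Partition the design (hard-coded toy partitions here).
    partitions = [("blk_a", ["u1", "u2_bad"]), ("blk_b", ["u3"]), ("blk_c", ["u4_bad"])]

    # 2. Distribute the partitioned tasks among multiple CPUs.
    with Pool(processes=3) as pool:
        results = pool.map(check_partition, partitions)

    # 3. Integrate the results (boundary-consistency checks omitted in this toy).
    print(dict(results))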

Q2: What kinds of timing violations are in a typical timing analysis report? Explain!
Ans: Acceptable answers...
- Setup time violations
- Hold time violations
- Minimum delay
- Maximum delay
- Slack
- External delay

Q3: List the possible techniques to fix a timing violation.
Ans: Acceptable answers...
- Buffering
Buffers are inserted to drive a load that is too large for a logic cell to drive efficiently. If a net is too long, it is broken up and buffers are inserted to improve the transition, which improves the data-path timing and reduces setup violations. To fix hold violations, buffers are inserted to add delay on the data path.
- Mapping - Mapping converts primitive logic cells found in a netlist to technology-specific logic gates from the library on the timing-critical paths.
- Unmapping - Unmapping converts the technology-specific logic gates in the netlist back to primitive logic gates on the timing-critical paths.
- Pin swapping - Pin swapping examines the slacks on the inputs of gates on the worst timing paths and optimizes timing by swapping the nets attached to the input pins, so that the net with the least slack is put on the fastest path through the gate, without changing the logic function.
- Wire sizing
- Transistor (cell) sizing - Cell sizing is the process of assigning a drive strength from the library to a specific cell instance in the design. If a low-drive-strength cell sits on a timing-critical path, it is replaced by a higher-drive-strength cell to reduce the violation (see the sizing sketch after this list).
- Re-routing
- Placement updates
- Re-synthesis (logic transformations)

- Cloning - Cell cloning is an optimization method that decreases the load on a heavily loaded cell by replicating the cell: an identical cell is connected to the same inputs as the original, so the fanout load is divided between the two copies, improving the timing.
- Taking advantage of useful skew
- Logic re-structuring/transformation (with re-synthesis) - Rearrange the logic on the critical paths of the design to meet timing constraints
- Making sure we don't have false violations (false paths, etc.)
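
As a minimal sketch of the cell-sizing idea (the cell data and the linear delay model, delay = intrinsic + R_drive * C_load, are illustrative assumptions, not a real library or delay model), a sizing pass could pick the smallest drive strength that meets the required arrival time:

from dataclasses import dataclass

@dataclass
class CellVariant:
    name: str            # e.g. a hypothetical "INVX1"
    intrinsic_ps: float  # intrinsic delay in picoseconds
    r_drive: float       # effective drive resistance in kOhm
    area: float          # relative area cost

def pick_drive_strength(variants, load_ff, required_ps):
    # Return the smallest-area variant whose estimated delay meets the target.
    for cell in sorted(variants, key=lambda c: c.area):
        delay = cell.intrinsic_ps + cell.r_drive * load_ff  # kOhm * fF = ps
        if delay <= required_ps:
            return cell
    return None  # even the largest cell fails -> buffer or restructure instead

inverters = [CellVariant("INVX1", 20.0, 4.0, 1.0),
             CellVariant("INVX2", 22.0, 2.0, 1.8),
             CellVariant("INVX4", 25.0, 1.0, 3.2)]
print(pick_drive_strength(inverters, load_ff=30.0, required_ps=60.0))  # -> INVX4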

Q4: Give the linear time computation scheme for Elmore delay in an RC interconnect tree.
Ans: The following is acceptable...
- Elmore delay formula
T(s,t) = sum over all nodes i on the path from s to t of R_i * C_i, where C_i is the total capacitance in the subtree rooted at node i; equivalently, the sum over all nodes k of the capacitance at node k times the resistance shared between the path of interest and the path from the root to node k.
- Explaining terms in formula
- Mentioning something that shows that it can be done in linear time ("lumped" or "shared" resistances, "recursive" calculations, etc.); a minimal sketch follows below

Q5: Given a unit wire resistance "r" and a unit wire capacitance "c", a wire segment of length "l" and width "w" has resistance "rl/w" and capacitance "cwl". Can we reduce the Elmore delay by changing the width of a wire segment? Explain your answer.
Ans: You needed to mention that by scaling different segments by different amounts, you can reduce the delay (e.g., wider segments near the root and narrower segments near the leaves). For a single isolated segment, the delay is independent of width because the "w" terms cancel: (rl/w)(cwl) = rcl^2. In a tree, however, a segment's width also sets the capacitive load seen through the resistance of the segments upstream of it, so non-uniform widths can reduce the overall Elmore delay; a small worked example follows.
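
A small worked example of this (illustrative normalized values; each segment is modeled as its resistance driving all of its downstream capacitance, with the segment capacitance lumped at its far end):

r, c = 1.0, 1.0      # unit resistance and capacitance (normalized)
l1 = l2 = 1.0        # two segments of equal length

def two_segment_delay(w1, w2, c_load=0.0):
    R1, C1 = r * l1 / w1, c * w1 * l1
    R2, C2 = r * l2 / w2, c * w2 * l2
    # Elmore delay at the far end: each resistance times its downstream capacitance.
    return R1 * (C1 + C2 + c_load) + R2 * (C2 + c_load)

print(two_segment_delay(1.0, 1.0))  # 3.0 with uniform width
print(two_segment_delay(2.0, 2.0))  # still 3.0: "w" cancels when all widths scale together
print(two_segment_delay(2.0, 1.0))  # 2.5: wider near the root, narrower near the leaf, reduces delay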

Q6: Extend the ZST-DME algorithm to embed a binary tree such that the Elmore delay from the root to each leaf of the tree is identical.
Ans: You needed to mention that a new procedure is needed for calculating the Elmore delay assuming that certain merging points are chosen, instead of just the total downstream wire length. The merging segment becomes a set of points with equal Elmore delay instead of just equal path length. You could refer to the paper "Low-Cost Single-Layer Clock Trees With Exact Zero Elmore Delay Skew" by Andrew B. Kahng and Chung-Wen Albert Tsao.

Q7: IPO (sometimes also referred to as "In-Place Optimization") tries to optimize the design timing by buffering long wires, resizing cells, restructuring logic etc.
Explain how these IPO steps affect the quality of the design in terms of area, congestion, timing slack.
(a) Why is this called "In-Place Optimization"?
(b) Why are the two IPO steps different?
(c) Why are both used?

Ans: IPO optimizes timing by buffer insertion and cell resizing. Important steps performed in IPO include fixing setup time, hold time, and max transition violations. Timing slack along all arcs is optimized by these operations. The increase in area and the improvement in timing slack depend upon the timing and IPO constraints.
(a) This step is referred to as "In-Place Optimization" because IPO performs resizing and buffer insertion in place (between cells in the rows). It does not perform placement optimization in this step.
(b) The first IPO step (IPO1) is performed after placement. It performs trial route --> extraction --> timing analysis to determine the quality of the placement, and setup and hold time fixing is done according to the result of the timing analysis. The second IPO step (IPO2) is performed after clock tree synthesis. CTS inserts clock buffers to balance skews among all flip-flops; IPO2 then optimizes the timing paths between flip-flops taking the actual clock skew into account.
(c) If the IPO2 step is not performed after CTS, the timing paths between flip-flops are not tuned for the actual clock skew. Even though NanoRoute performs timing optimization, it is mostly buffer insertion on long interconnects to fix setup and hold violations.

Q8: Clocking and Place-Route Flow. Consider the following steps:
- Clock sink placement
- Standard-cell global placement
- Standard-cell detailed placement
- Standard-cell ECO placement
- Clock buffer tree construction
- Global signal routing
- Detailed signal routing
- Bounded-skew (balanced) clock (sub)net routing
- Steiner clock (sub)net routing
- Clock sink useful skew scheduling (i.e., solving the linear program, etc.)
- Post-placement (global routing based) static timing analysis
- Post-detailed routing static timing analysis
(a) As a designer of a clock distribution flow for high-performance standard-cell based ASICs, how would you order these steps? Is it possible to use some steps more than once, and others not at all (e.g., if subsumed by other steps)?
(b) List the criteria used for assessing possible flows.
(c) What were the 3 next-best flows that you considered (describe as variants of your flow), and explain why you prefer your given answer.

Ans: (a) My basic flow:
(1) SC global placement
(2) post-placement STA
(3) clock sink useful-skew scheduling
(4) clock buffer tree construction that is useful-skew aware (cf. associative skew.)
(5) standard-cell ECO placement (to put the buffers into the layout)
(6) Steiner clock subnet routing at lower levels of the clock tree (following CTGen type paradigm)
(7) bounded-skew clock subnet routing at all higher levels of the clock tree, and as necessary even at lower levels, to enforce useful skews
(8) global signal routing
(9) detailed signal routing,
(10) post-detailed routing STA
(b) Criteria:
(1) likelihood of convergence with maximum clock frequency
(2) minimization of CPU time (by maximizing incremental steps, minimizing "detailed" steps, and minimizing iterations)
(3) make a good trade-off between wiring-based skew control and wire cost (this suggests Steiner routing at lower levels, bounded-skew routing at higher levels).
[Comment 1. Criteria NOT addressed: power, insertion delay, variant flow for hierarchical clocking or gated clocking.
Comment 2: I do not know of any technology for clock sink placement that can separate this from placement of remaining standard cells. So, my flow does not invoke this step. I also don't want post-route ECOs.]
(c) Variants:
(1) introduce Step 11: loop over Steps 3-10 (not adopted because cost benefit ratio was not attractive, and because there is a trial placement + global routing to drive useful-skew scheduling, buffer tree construction and ECO placement);
(2) after Steps 1-4, re-place the entire netlist (global, detailed placement) and then skip Step 5 (not adopted because the benefits of avoiding ECO placement and leveraging a good clock skeleton were felt to be small: the buffer tree will largely reflect the netlist structure, and re-placing can destroy assumptions made in Steps 3-4);
(3) can iterate the first 5 steps essentially by iterating: clock sink placement, (ECO placement for legalization), (incremental) standard-cell (global + detailed) placement (not adopted because I feel that any objective for standalone clock sink placement would be very "fuzzy", e.g., based on sizes of intersections of fan-in/fan-out cones of sequentially adjacent FFs)

Q9: If we migrate to the next technology node and double the gate count of a design, how would you expect the size of the LEF and routed DEF files to change? Explain your reasoning.
Ans: The LEF file will remain roughly the same size (same richness of cell library, say, between 500-1200 masters), modulo possible changes in conventions (e.g., CTLF used to be a part of LEF) and possible additional library model semantics (e.g., adding power modeling into LEF). The DEF file should at least double: the components and nets will double, and if there is extra routing complexity (more complex geometries, and more segments per connection due to antenna rules or badly scaling router heuristics), the DEF could grow significantly faster.

High level analysis of false paths


Sometimes the delay through a component is dependent upon the values on signals. This is because different paths in the circuit have different delays and some input values will prevent some paths from being exercised. Here are two simple examples:
  1. In a ripple-carry adder, if a carry generated at the least significant bit propagates all the way out of the MSB, it will take longer for the output to stabilize than if no carries are generated at all.
  2. In a state machine using a one-hot state encoding, paths that are exercised only when more than one state bit is '1' are false paths, because those states never occur.
Because of these effects, static timing analysis might be overly conservative and predict a delay that is greater than you will experience in practice. The most accurate delay analysis requires looking at the actual data values that will occur in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder.
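
A toy model of the ripple-carry example (unit gate delay assumed; the settling-time estimate is deliberately crude and only meant to show that the exercised delay depends on the data):

def settle_time(a, b, n_bits, gate_delay=1.0):
    # Estimate: length of the longest run of bit positions through which a carry travels.
    carry, longest, chain = 0, 0, 0
    for i in range(n_bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        new_carry = (ai & bi) | (carry & (ai ^ bi))
        chain = chain + 1 if (carry or new_carry) else 0
        longest = max(longest, chain)
        carry = new_carry
    return longest * gate_delay

print(settle_time(0b0001, 0b1111, 4))  # 4.0: carry ripples from LSB out of the MSB (slow case)
print(settle_time(0b0101, 0b1010, 4))  # 0.0: no carries are generated (fast case)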

Probabilistic Timing Analysis


Because of shrinking feature sizes and the decreasing faithfulness of the manufacturing process to design features, process variation has been one of the constant themes of IC designers as new process nodes are introduced. This article reviews the problem and proposes a "probabilistic" approach as a solution to analysis and management of variability.

Process variation may be new in the digital design framework, but it has long been the principal worry of analog designers, where it is known as mismatch. Regardless of its causes, variation can be global, where every chip from a lot is affected in the same way, or quasi-global, where wafers or dies may show different electrical characteristics. Such global variation has been relatively easy to model, especially when process modeling people have been able to characterize it with a single "sigma" parameter. Timing analyzers need to analyze a design under both worst-case and best-case timing conditions; usually, two extreme values of "sigma" sufficed to provide these two conditions. With the new process nodes, however, not only is it necessary to have several variational parameters, but individual device characteristics on a chip can differ independently, which is known as on-chip variation (OCV).

At the device level, process variation is modeled by a set of "random" parameters which modify the geometric parameters of the device and its model equations. Depending on the nature of the variation, these may affect all devices on the chip, or certain types of devices, or they may be specific to each instance of a device. Because of this OCV, it is important that correlation between the various variational parameters be accounted for. For example, the same physical effect is likely to change the length and width of a device simultaneously. If this is ignored, we may be looking at very pessimistic variation scenarios.

There are statistical methods which try to capture these correlations and reduce them to a few independent variables. Some fabs use parameters related to device geometries and model parameters. The number of such parameters may range from a few to tens, depending on the device. If one considers both global and local variations, the number of variables can quickly get out of hand. Variation is statistically modeled by a distribution function, usually Gaussian. Given the value of a variational parameter and a delta-interval around it, one can calculate the probability that the device/process will fall in that interval and will have specific electrical characteristics for that condition. Instead of having a specific value for a performance parameter such as delay, one gets a range of values with specific probabilities depending on the variational parameters.

To analyze the performance of digital designs, two approaches have emerged: statistical static timing analysis (SSTA) and multi-corner static timing analysis. SSTA tries to generate a probability distribution for the delay of a signal path from the delay distributions of the individual standard cells in the path. This is usually implemented using variation-aware libraries, which contain a sampling of cell timing at various discrete values of the variational parameters. Because of the dependence on a discrete library, this approach is practically limited to only a few global systematic variables, with a very coarse sampling of the variation space. Since it is a distribution-based analysis, it depends on the shape of the distributions of the primary variables; these are generally assumed to be Gaussian, but there is no reason to assume this. In fact, most process models may not even be centered. In addition, it becomes difficult to do input-slope-dependent delay calculation. Assumptions and simplifications can quickly make this approach drift from the goal. Since it has the probability distributions, it can report a confidence level for a timing violation. Implicit in this approach is the assumption that any path has a finite probability of being critical.

Multi-corner timing analysis is a kind of Monte Carlo analysis in disguise, and has been gaining popularity as a brute-force method. Someone who knows what he/she is doing decides on a set of extreme corner conditions. These are instances of the process variables, and cell libraries are generated for these conditions. Timing analysis is performed using these libraries. The number of libraries may be 10 to 20 or more. Naturally, this approach is still limited to a few global variational parameters. It is also difficult to ascertain the reliability of the timing analysis in terms of yield. The only way to increase the confidence level is to build more libraries and repeat the analysis with them. This process increases verification and analysis time, but does not guarantee coverage.

What we propose instead is probabilistic timing analysis. It can address both global and local variations, and it gives a lower confidence limit on timing analysis results which can be controlled by the designer. This turns the problem upside down. Since timing analysis is interested in the worst-case and best-case timing conditions of a chip, we ask the same question for the individual cells making up a design: we want to find the best/worst-case timing condition of each cell. While doing this, we need to limit our search and design space. For example, the interval (-1,1) covers 68.268% of the area under the normal (bell curve) distribution. If we search this interval for the sigma value that maximizes inverter delay and later use that value, we can only say that the probability that this value is the maximum delay is 0.68268. For the interval (-2,2), it is 0.95448. If we had searched a wider interval, our confidence level would go up even higher. If there were two process variables, and we had searched (-1,1)x(-1,1), our confidence would drop to 0.68268 x 0.68268, or 0.46605; a small calculation illustrating these coverage numbers follows.
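
The coverage numbers quoted above can be reproduced under the standard normal assumption used in the text, via the error function (P(-k < x < k) = erf(k / sqrt(2))):

import math

def coverage(k_sigma):
    return math.erf(k_sigma / math.sqrt(2.0))

print(coverage(1.0))                  # ~0.6827: searching (-1, 1)
print(coverage(2.0))                  # ~0.9545: searching (-2, 2)
print(coverage(1.0) * coverage(1.0))  # ~0.4661: two independent variables, (-1,1) x (-1,1)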

Although the lower confidence limits are set by the initial search intervals, the actual probabilities may be much higher. If the maximum had occurred at an extreme corner, one could expect that, as the search interval expands, we might see new maximum conditions. On the other hand, if the maximum occurred at a point away from the corners, it is most likely the absolute maximum. Typically, only one of the parameters (the one most tightly coupled to threshold voltage, for example) takes on the extreme values, and most others take intermediate values. In these cases it is effectively the same as if we had searched the interval (-inf, +inf). This behavior is consistent with the traditional approach, where a single parameter is used to control the best and worst timing corners.

One of the conceptual problems with our probabilistic approach is that each cell may end up with different extreme values of the global variables, which contradicts the definition of such variables. A flip-flop may have different global variable values than an inverter; even inverters of different strengths may have different sets. They are typically close to each other, however. There may be some pessimism associated with this condition.

It is easy to establish confidence levels on critical-path timing. If, for example, the global variables have a confidence level of 0.9, and the local random variables have 0.95, the confidence level for a path of 10 cells is 0.9 x 0.95^10 ≈ 0.539. Since the local variations of each gate are independent of each other, the intersection rule of probability applies: the probability of having 0.95 coverage for two independent cells is 0.95 x 0.95, for three it is 0.95 x 0.95 x 0.95, and so on (see the short calculation below). In reality, though, the minimum and maximum conditions for local variations are clustered around the center, away from the interval end points, which brings the confidence level back up to 0.9, the confidence level for the global variations. Alternatively, one can expand the search interval to cover more of the process space. Also keep in mind that the variation range of "real" random variables is much narrower than (-inf, +inf).
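
The path-level number above, written out (assuming, as in the text, a confidence level of 0.9 for the global variables and 0.95 per cell for the independent local variables):

global_conf = 0.9
local_conf_per_cell = 0.95
n_cells = 10
# Independent local variations: per-cell coverages multiply.
path_conf = global_conf * local_conf_per_cell ** n_cells
print(path_conf)  # ~0.539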

Library Technologies has implemented this probabilistic approach in its YieldOpt product. The user defines the confidence levels he/she would like to see, and identifies the global and local random parameters for each device. Confidence levels are converted to variation intervals assuming a normal distribution. This is the only place we make an assumption about the shape of the distributions; as a result, our approach has only a weak dependence on the probability distribution. In the probabilistic approach, we view the timing characteristics of a cell as functions of random process variables. For each variable, we define a search interval. The variables can be global or local random variables. Maximum and minimum timing conditions for each cell are determined for typical loads and input slopes, and two libraries are generated, one for each condition. Normally, we couple the worst process condition with high temperature and low voltage, and the best process condition with low temperature and high voltage.

The timing analysis flow is the traditional one, but depending on the number of random variables, searching for the extreme conditions becomes a very demanding task. We have developed methods and tools which can accomplish this task in a deterministic way. The YieldOpt product determines the appropriate process conditions for each cell and passes them on for characterization and library generation. Determining worst/best-case conditions may add about 0.1X to 2X overhead on top of characterization.

By Mehmet Cirit:
Mehmet Cirit is the founder and president of Library Technologies, Inc. (LTI). LTI develops and markets tools for design re-optimization for speed and low power with on-the-fly cell library creation, cell/memory characterization and modeling, circuit optimization, and process variation analysis tools such as YieldOpt.

Clock-Domain Crossing Verification Module


This Mentor Verification Academy module directly addresses CDC issues by introducing a set of steps for advancing an organization's clock-domain crossing verification skills, infrastructure, and metrics for measuring success, while identifying process areas that require improvement. The Clock-Domain Crossing Verification Module contains 7 sessions in total, including 1 demo.

Different types of simulations!


Functional simulation: Simulation of a design description. This is also called spec simulation or concept simulation. It is usually done at the highest level of abstraction and at the beginning of the project.

Behavioral simulation: Simulation of a digital circuit described in an HDL such as Verilog or VHDL. We simulate the behavior described in these language-based designs. This is the second step.

Static timing analysis: This tells us "What is the longest delay in my circuit?" Timing analysis finds the critical path and its delay. Timing analysis does not find the input vectors that activate the critical path. Done after synthesis, this is the third step.

Gate-level simulation: In this type of simulation, the post-layout delays are back-annotated to the design using SDF and then simulated. This gives performance close to that of the real chip. This is the final step.

Transistor-level or circuit-level simulation: Mainly for mixed-mode (mixed-signal) circuits. For a mixed-mode circuit, the complete design must be verified at the transistor level. This is an intermediate step, depending on how the design and the flow are set up.
Simulation conclusion:
  1. Behavioral simulation can only tell you if your design will not work.
  2. Pre-layout simulation estimates your design performance.
  3. Finding a critical path is difficult because you need to construct input vectors to exercise the right paths.
  4. Behavioral simulation and static timing analysis are the most widely used forms of simulation and analysis.
  5. Formal verification compares two different representations. It cannot prove your design will work.
  6. Switch-level simulation can check the behavior of circuits that may not always have nodes that are driven or that use logic that is not complementary.
  7. Transistor level simulation is used when you need to know the analog, rather than the digital, behavior of circuit voltages.
  8. There is a trade-off between accuracy and run time.