Showing posts with label Timing. Show all posts
Showing posts with label Timing. Show all posts

The effect of whitespace and aspect ratio on wirelength and timing

Whitespaces (empty space) are inserted in layouts in order to increase the routing resources of the chip. Have you ever studied the impact of whitespace (and aspect ratio) on timing and wirelength, by say increasing the whitespace from 0% to 100% and evaluate the impact on both wirelength and timing. Can you predict how this will look like? For a 300 mm wafer, can you parameterize the relationship between the number of dies produced, timing, die aspect ratio, wirelength and whitespace?

High level analysis of false paths

Sometimes the delay through a component is dependent upon the values on signals. This is because different paths in the circuit have different delays and some input values will prevent some paths from being exercised. Here are two simple examples:
  1. In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries generated at all.
  2. In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a '1'.
Because of these effects, static timing analysis might be overly conservative and predict a delay that is greater than you will experience in practice. The most accurate delay analysis requires looking at the actual data values that will occur in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder.

Probabilistic Timing Analysis

Because of shrinking feature sizes and the decreasing faithfulness of the manufacturing process to design features, process variation has been one of the constant themes of IC designers as new process nodes are introduced. This article reviews the problem and proposes a "probabilistic" approach as a solution to analysis and management of variability.

Process variation may be new in the digital design framework, but it has long been the principle worry of analog designers, known as mismatch. Regardless of its causes, variation can be global, where every chip from a lot can be effected in the same way, or quasi-global, where wafers or dies may show different electrical characteristics. Such global variation has been relatively easy to model, especially if process modeling people have been able to characterize it with a single "sigma" parameter. Timing analyzers need to analyze a design under both worst case and best case timing conditions. Usually, two extreme conditions of "sigma" sufficed to provide these two conditions. With the new process nodes , however, not only it is necessary to have several variational parameters, but individual device characteristics on a chip could differ independently, known as on-chip variation (OCV).

At the device level, process variation is modeled by a set of "random" parameters which modify the geometric parameters of the device and its model equations. Depending on the nature of the variation, these may effect all devices on the chip, or certain types of devices, or they may be specific to each instance of the device. Because of this OCV, it is important that correlation between various variational parameters be accounted for. For example, the same physical effect is likely to change the length and width of a device simultaneously. If this is ignored, we may be looking at very pessimistic variation scenarios.

There are some statistical methods which try to capture correlations and reduce them to a few independent variables. Some fabs use use parameters related to device geometries and model parameters. The number of such parameters may range from a few to tens, depending on the device. If one considers global and local variations, the number of variables quickly can get out of hand. Variation is statistically modeled by a distribution function, usually Gaussian. Given the value of a variational parameter, and a delta-interval around it, one can calculate the probability that the device/ process will be in that interval and will have specific electrical characteristics for that condition. Instead of having a specific value for a performance parameter such as delay, it will have a range of values with specific probabilities depending on the variational parameters.

To analyze the performance of digital designs, two approaches have emerged: statistical static timing analysis (SSTA) and multi-corner static timing analysis. SSTA tries to generate a probability distribution for a signal path from delay distributions of individual standard cells in the path. This is usually implemented by using variation-aware libraries, which contain a sampling of cell timing at various discrete values of the variational parameters. Because of the dependence on a discrete library, this approach is practically limited to only few global systematic variables, with a very coarse sampling of the variation space. Since it is a distribution-based analysis, it depends on the shape of primary variables. It is generally assumed these are Gaussian, but there is no reason to assume this. In fact, most process models may not even be centered. In addition, it becomes difficult to do input slope dependent-delay calculation. Assumptions and simplifications could quickly make this approach drift from the goal. Since it has the probability distributions, one can report a confidence level about a timing violation. Implicit in this approach is the assumption that any path has a finite probability of being critical.

Multi-corner timing analysis is kind of Monte-Carlo in disguise, and has been gaining popularity as a brute-force method. Someone who knows what he/she is doing decides on a set of extreme corner conditions. These are instances of process variables, and cell libraries are generated for these conditions. Timing analysis is performed using these libraries. The number of libraries may be 10 to 20 or more. Naturally, this approach is still limited to few global variational parameters. It is also difficult to ascertain the reliability of timing analysis, in terms of yield. The only way to increase the confidence level is by building more libraries and repeating the analysis with them. This process increases verification and analysis time, but does not guarantee coverage.

What we propose instead, is probabilistic timing analysis. It can address both global and local variations, and we can have a lower confidence limit on timing analysis results which can be controlled by the designer. This turns the problem upside down. Since timing analysis is interested in worst-case and best-case timing conditions of a chip, we ask the same question for individual cells making up a design. We want to find the best/ worst case timing condition of a cell. While doing this, we need to limit our search and design space. For example, the interval (-1,1) covers 68.268% of the area under the normal bell curve distribution. If we search this interval for sigma with maximum inverter delay and later use that value, we can only say that the probability that this value is the maximum delay is 0.68268. For the interval (-2,2), it is 0.95448. If we had searched a wider interval, our confidence level would go up even higher. If there were two process variables, and if we had searched (-1,1)(-1,1), our confidence would drop to 0.68268X0.68268, or 0.46605.

Although lower confidence limits are set by the initial search intervals, the actual probabilities may be much higher. If the maximum had occurred at extreme corners, one could expect that as the search interval expands, we might see new maximum conditions. On the other hand, if the maximum had occurred at a point away from the corners, most likely this is the absolute value. Typically, only one of the parameters, the one most tightly coupled to threshold voltage, for example, takes up the extreme values, and most others take intermediate values. In these cases it is effectively the same as if we searched the interval (-inf, +inf). This behavior is consistent with the traditional approach, where a single parameter is used to control best and worst timing corners.

One of the conceptual problem with our probabilistic approach is that each cell may have different sets of global variables, which contradicts the definition of such variables. A flip-flop may have different global variables than an inverter. Even inverters of different strengths may have different sets. They are typically close to each other, however. There may be some pessimism associated with this condition.

It is easy to establish confidence levels on critical path timing. If for example, global variables have a confidence level of 0.9, and local random variables have 0.95, the confidence level for a path of 10 cells is 0.9X0.95*10= 0.5349. Since local variations of each gate are independent of each other, intersection rule of probability should be followed, probability of having 0.95 coverage for two independent cells is 0.95X0.95, for three is 0.95X0.95X0.95, etc. In reality though, minimum and maximum conditions for local variations are clustered around the center, away from the interval end points, which brings confidence level to 0.9, confidence level for global variations. Alternatively, one can expand the search interval to cover more process space. Also keep in mind, the variation range of "real" random variables is much narrower than (-inf, +inf).

Library Technologies has implemented this probabilistic approach in its YieldOpt product. The user defines the confidence levels her/she would like to see, and identifies global and local random parameters for each device. Confidence levels are converted to variation intervals assuming a normal distribution. This is the only place we make an assumption about the shape of the distributions. As a result, our approach has a weak dependence on probability distribution. In the probabilistic approach, we view timing characteristics of a cell as functions of random process variables. For each variable, we define a search interval. The variables could be global and local random variables. Maximum and minimum timing conditions for each cell are determined for typical loads and input slopes. Two libraries are generated for each condition. Normally, we couple worst process condition with high temperature, low voltage; and best process condition with low temperature and high voltage.

Timing analysis flow is the traditional flow, and depending on the number of random variables, searching for extreme conditions becomes a very demanding task. We have developed methods and tools which can achieve this task in a deterministic way. The YieldOpt product determines appropriate process conditions for each cell and passes it over for characterization and library generation. Determining worst/best case conditions may add about 0.1X to 2X overhead on top of characterization.

By Mehmet Cirit:
Mehmet Cirit is the founder and president of Library Technologies, Inc. (LTI). LTI develops and markets tools for design re-optimization for speed and low power with on-the-fly cell library creation, cell/ memory characterization and modeling, circuit optimization, and process variation analysis tools such as YieldOpt.

Register re-timing

Register retiming is a sequential optimization technique that moves registers through the combinational logic gates of a design to optimize timing and area.

When you describe circuits at the RT-level prior to logic synthesis, it is usually very difficult and time-consuming, if not impossible, to find the optimal register locations and code them into the HDL description. With register retiming, the locations of the flip-flops in a sequential design can be automatically adjusted to equalize as nearly as possible the delays of the stages. This capability is particularly useful when some stages of a design exceed the timing goal while other stages fall short. If no path exceeds the timing goal, register retiming can be used to reduce the number of flip-flops, where possible.

Purely combinational designs can also be retimed by introducing pipelining into the design. In this case, you first specify the desired number of pipeline stages and the preferred flip-flop from the target library. The appropriate number of registers are added at the outputs of the design. Then the registers are moved through the combinational logic to retime the design for optimal clock period and area.

Register retiming leaves the behavior of the circuit at the primary inputs and primary outputs unchanged (unless you choose special options that do not preserve the reset state of the design or add pipeline stages). Therefore you do not need to change any simulation test benches developed for the original RTL design.

Retiming does, however, change the location, contents, and names of registers in the design. A verification strategy that uses internal register inputs and outputs as reference points will no longer work. Retiming can also change the function of hierarchical cells inside a design and add clock, clear, set, and enable pins to the interfaces of the hierarchical cells.

Timing closure impacted by DVFS!!

While designing systems with DVFS techniques, we need to look at the impact of temperature inversion on the performance of the design. An important criteria while selecting voltages and frequencies for a design, one must consider a range such that delay/voltage consistently increases or decreases.

What does this means?
We must always operate above the temperature inversion point.

Especially in low power UDSM process, combined use of reduced VDD and High Threshold voltage may greatly modify the temperature sensitiveness of the design. Due to this, worst case timing is no longer guaranteed at higher temperatures. So in order to guarantee correct behavior of the design, one has to verify the design at various PVT corners. This leads to a significant increase in the total turn around time of the design.

In a nutshell, delay increases with increase in temperature, but below a certain voltage, this relationship inverts and delay starts to decrease with increase in temperature. This is a function of threshold voltage (Threshold voltage and carrier mobility are temperature dependent). Due to this threshold voltage dependency, we have observed that non-critical paths suddenly become critical.

Having said this, as soon as Voltage/Delay relate randomly Voltage Scaling becomes a nightmare to implement and verify.

Note: If both threshold voltage and carrier mobility monotonically decrease with increase in temperature, Operating Voltages(range) defines the performance of the design.

Interview Question

Due to a miscommunication during design, you thought your circuit was supposed to have a supply voltage of 2.1 volts (threshold voltage is 0.7 volts) and a 25 ns cycle time, and you designed it to meet those specifications. Now your boss tells you you were supposed to have a 20 ns cycle time. To avoid redesigning the whole circuit, a co-worker suggests increasing the voltage of the circuit to decrease the delay to 20 ns. The same co-worker suggests picking some arbitrary number like 3.5 volts.
  1. Determine the new cycle time of your circuit with a 3.5 volt input voltage.
  2. Your boss is worried about the additional power consumption - calculate the increase in power consumption of your circuit at 3.5 volts, assuming activity factor and capacitance remain the same and neglecting short circuit and leakage power.
  3. To satisfy your boss, calculate the minimum voltage you would increase the supply voltage to, in order to allow your circuit to run at 20 ns. You may leave your answer in non-simplified numeric terms, but not in the form of an equation to solve.

Clock Latency & clock skew

Clock latency means, the number of clock pulses required by the ckt to give out the first output. Generally we will observe this in pipelined ckts.

Clock skew means the time difference between the arrival of clk edge at different FFs. This skew is due to different clock tree paths.

Max Frequency calculation

In the simplest form:
FF1 - combo - FF2 ( this is how things look physically for our consideration)
Tmin = Tclk2Q (FF1)+ Td(Comb0)+Tsu(FF2)

* mainly dependent on the critical path, and can do a good job by defining proper timing constraints during synthesis.

In detail:
Timing budget is the account of timing requirements or timing parameters necessary for a system to function properly. For synchronous systems to work, timing requirements must fit within one clock cycle. A timing-budget calculation involves many factors, including hold-time requirements and maximum operating frequency requirements. By calculating a timing budget, the limitations of conventional clocking methods can be seen.

Let's use an example for a system with standard clocking. Assume a memory controller interfacing with an SRAM. Both the SRAM and memory controller receive clock signals from the same clock source. It's assumed that clock traces are designed to match the trace delays. The relevant timing parameters are:
tSU (setup time) of memory controller
  • tH (hold time) of memory controller
  • tPD (propagation delay) of board trace
  • tCO (clock to output delay) of SRAM
  • tDOH (output data hold time) of SRAM
  • tSKEW (clock skew) of clock generator
  • tJIT (cycle-to-cycle jitter) of clock generator
  • tCYC (cycle time) of clock generator

The maximum-frequency calculation gives the minimum cycle time of the system if the worst-case input setup time, clock to output time, propagation delay, clock skew, and clock jitter are considered.

The maximum frequency is given by:

tCO(max, SRAM) + tPD(max) + tSU(max, CTRL) + tSKEW(max, CLK) + tJIT(max, CLK)

The hold-time calculation verifies that the system outputs data too fast, violating input hold time of the receiving device in the system. In this case, the worst-case condition occurs when the data is driven out at the earliest possible time.

The formula is given by:

tCO(min, SRAM) + tPD(min) - tSKEW(min, CLK) - tJIT(min, CLK) > tH(max, CTRL)

Now let's assume the following values for the timing parameters of our SRAM and memory controller. In this case, we will use a high-speed SRAM with a double-data-rate (DDR) interface, where data is driven by the SRAM with every rising and falling edge of the clock.

tSU = 0.5 ns
tH = 0.4 ns
tCO = 0.45 ns
tDOH* = -0.45 ns
tSKEW = ±0.2 ns
tJIT = ±0.2 ns

*tDOH <>

The minimum hold-time requirement is calculated as:

tDOH + tPD - tSKEW - tJIT > tH
-0.45 ns + tPD - 0.2 ns - 0.2 ns > 0.4 ns
-0.85ns + tPD > 0.4 ns
tPD > 1.25 ns

Assuming that the delay per inch of an FR4 board trace is 160 ps/in., the trace length from SRAM to memory controller must be at least 7.82 in. Using 1.2 ns for tPD, the maximum operating frequency is calculated below. Because the SRAM has a DDR interface, the timing budget is based on a half cycle:

tCO + tPD + tSU + tSKEW + tJIT < tCYC/2 0.45 ns + 1.25 ns + 0.5 ns + 0.2 ns + 0.2 ns < tCYC/2 2.6 ns < tCYC/2 5.2 ns < tCYC 192 MHz > fCYC

With a 7.82-in. FR4 trace length and typical timing parameters, the timing budget requirements are met for an operating frequency of up to 192 MHz. In systems that have limited board space, the 7.82-in. minimum trace-length constraint becomes a difficult requirement to satisfy in systems.

If it isn't possible to introduce a trace delay, the memory controller can satisfy the hold-time requirement by using a delay-locked loop/phase-locked loop (DLL/PLL) to phase-shift the clock signal to capture data at an earlier time. The memory controller will have to resynchronize captured data with the system clock. Using this method will introduce additional PLL/DLL jitter, which decreases the system's maximum operating frequency. With the added delay of the PLL, the minimum hold-time requirement becomes:

tDOH + tPD(trace) + tPLL/DLL_DELAY - tSKEW - tJIT > tHtCO + tPD + tSU + tSKEW + tJIT

Clock skew, clock jitter, and trace propagation delay can significantly limit system performance, even with the fastest SRAMs and ASICs/FPGAs available.

As mentioned earlier, the trace delay is approximately 160 ps/in. if an FR4 board is used. This is a significant number considering how the data-valid window at high frequencies has become 2 ns (e.g., for a 250-MHz, double-data-rate (DDR) device) and lower. Skew between the clock signals can also significantly reduce timing margins. We shall see that source-synchronous clocks can significantly reduce propagation delay, skew, and jitter, making timing closure more attainable.

Ways to increase frequency of operation

  • Check critical path and optimize it.
  • Add more timing constraints (over constrain).
  • pipeline the architecture to the max possible extent keeping in mind latency req's.

Application Specific Integrated Circuit ( ASIC )

An application-specific integrated circuit or ASIC comprises an integrated circuit (IC) with functionality customized for a particular use (equipment or project), rather than serving for general-purpose use.

For example, a chip designed solely to run a cash register is an ASIC. In contrast, a microprocessor is not application-specific, because users can adapt it to many purposes.

The initial ASICs used gate-array technology.

The British firm Ferranti produced perhaps the first gate-array, the ULA (Uncommitted Logic Array), around 1980. Customisation occurred by varying the metal interconnect mask. ULAs had complexities of up to a few thousand gates. Later versions became more generalized, with different base dies customised by both metal and polysilicon layers. Some base dies include RAM elements.

In the late 1980s, the availability of logic synthesis tools (such as Design Compiler) that could accept hardware description language descriptions using Verilog and VHDL and compile a high-level description into to an optimised gate level netlist brought "standard-cell" design into the fore-front. A standard-cell library consists of pre-characterized collections of gates (such as 2 input nor, 2 input nand, invertors, etc.) that the silicon compiler uses to translate the original source into a gate level netlist. This netlist is fed into a place and route tool to create a physical layout. Routing applications then place the pre-characterized cells in a matrix fashion, and then route the connections through the matrix. The final output of the "place & route" process comprises a data-base representing the various layers and polygons in GDS-II format that represent the different mask-layers of the actual chip.

Finally, designers can also take the "full-custom" route in implementing an ASIC. In this case, an individual description of each transistor occurs in building the circuit. A "full-custom" implementation may function five times faster than a "standard-cell" implementation. The "standard-cell" implementation can usually be implemented quite a bit quicker and with less risk of errors, than the "full-custom" choice.

As feature sizes have shrunk and design tools improved over the years, the maximum complexity (and hence functionality) has increased from 5000 gates to 20 million or more. Modern ASICs often include 32-bit processors and other large building-blocks. Many people refer to such an ASIC as a SoC - System on a Chip.

The use of intellectual property (IP) in ASICs has become a growing trend. Many ASIC houses have had standard cell libraries for years. However IP takes the reuse of designs to a new level. Designers of most complex digital ICs now utilise computer languages that describe electronics rather than code. Many organizations now sell tested functional blocks written in these languages. For example, one can purchase CPUs, ethernet or telephone interfaces.

For smaller designs and/or lower production volumes, ASICs have started to become a less attractive solution, as field-programmable gate arrays (FPGAs) grow larger, faster and more capable. Some SoCs consist of a microprocessor, various types of memory and a large FPGA.

So having said, this blog is dedicated to Digital Electronics, VLSI, ASICs, SOCs etc.