Showing posts with label Synthesis. Show all posts
Showing posts with label Synthesis. Show all posts

RTL synthesis and other backend Interview Questions (with answers)

Q1: How would you speed up an ASIC design project by parallel computing? Which design stages can be distributed for parallel computing, which cannot, and what procedures are needed for maintaining parallel computing?
Ans: Mentioning the following important steps in parallel computing is essential:
1. Partitioning the design
2. Distributing partitioned tasks among multiple CPUs
3. Integrating the results

WHAT STAGES: The following answers are acceptable. Others may be accepted if you gave a reasonable explanation of why you can or cannot use parallel computing in a particular stage of the flow.
Can use parallel computing:
- Synthesis after partitioning
- Placement (hierarchical design)
- Detailed routing
- Functional verification
- Timing Analysis (partition the timing graph)
Cannot use parallel computing:
- Synthesis before partitioning
- Floorplanning
- Flat Placement
- Global Routing
CONSTRAINTS: Mentioning that care must be taken to make sure that partition boundaries are consistent when integrating the results back together.

Q2: What kinds of timing violations are in a typical timing analysis report? Explain!
Ans: Acceptable answers...
- Setup time violations
- Hold time violations
- Minimum delay
- Maximum delay
- Slack
- External delay

Q3: List the possible techniques to fix a timing violation.
Ans: Acceptable answers...
- Buffering
Buffers are inserted in the design to drive a load that is too large for a logic cell to efficiently drive. If the net is too long then the net is broken and buffers are inserted to improve the transition which will ultimately improve the timing on data path and reduce the setup violation.
To reduce the hold violations buffers are inserted to add delay on data paths.- Mapping - Mapping converts primitive logic cells found in a netlist to technology-specific logic gates found in the library on the timing critical paths.
- Unmapping - Unmapping converts the technology-specific logic gates in the netlist to primitive logic gates on the timing critical paths.
- Pin swapping - Pin swapping optimization examines the slacks on the inputs of the gates on worst timing paths and optimizes the timing by swapping nets attached to the input pins, so the net with the least amount of slack is put on the fastest path through the gate without changing the function of the logic.
- Wire sizing
- Transistor (cell) sizing - Cell sizing is the process of assigning a drive strength for a specific cell in the library to a cell instance in the design.If there is a low drive strength cell in the timing critical path then this cell is replaced by higher drive strength cell to reduce the timing violation.
- Re-routing
- Placement updates
- Re-synthesis (logic transformations)

- Cloning - Cell cloning is a method of optimization that decreases the load of a very heavily loaded cell by replicating the cell. Replication is done by connecting an identical cell to the same inputs as the original cell.Cloning clones the cell to divide the fanout load to improve the timing.
- Taking advantage of useful skew
- Logic re-structuring/Transformation (w/Resynthesis) - Rearrange logic to meet timing constraints on critical paths of design
- Making sure we don't have false violations (false path, etc.)

Q4: Give the linear time computation scheme for Elmore delay in an RC interconnect tree.
Ans: The following is acceptable...
- Elmore delay formula
T = Sum over all nodes i in path (s,t) of Ri*Ci where Ci is the total capacitance in the sub tree rooted at node i, or alternatively, the sum over the capacitances at the nodes times the shared resistance between the path of interest and the path to the node.
- Explaining terms in formula
- Mentioning something that shows that it can be done in linear time ("lumped"
or "shared" resistances, "recursive" calculations, etc)

Q5: Given a unit wire resistance "r" and a unit wire capacitance "c", a wire segment of length "l" and width "w" has resistance "l/w" and capacitance "cwl". Can we reduce the Elmore delay by changing the width of a wire segment? Explain your answer.
Ans: You needed to mention that by scaling different segments by different amounts, you can reduce the delay (e.g. wider segments near the root and narrower segments near the leaves. Delay is independent of width because the "w" term cancels out.

Q6: Extend the ZST-DME algorithm to embed a binary tree such that the Elmore delay from the root to each leaf of the tree is identical.
Ans: You needed to mention that a new procedure is needed for calculating the Elmore delay assuming that certain merging points are chosen, instead of just the total downstream wire-length. The merging segment becomes a set of points with equal Elmore delay instead of just equal path length. You could refer the paper "Low-Cost Single-Layer Clock Trees With Exact Zero Elmore Delay Skew", Andrew B. Kahng and Chung-Wen Albert Tsao.

Q7: IPO (sometimes also referred to as "In-Place Optimization") tries to optimize the design timing by buffering long wires, resizing cells, restructuring logic etc.
Explain how these IPO steps affect the quality of the design in terms of area, congestion, timing slack.
(a) Why is this called "In-Place Optimization" ?
(b) Why are the two IPO steps different ?
(c) Why are both used ?

Ans: IPO optimizes timing by buffer insertion and cell resizing. Important steps that are performed in IPO include fixing {setup,hold} time, max. transition time violation. Timing slack along all arcs is optimized by these operations. Increase in area and reduction in timing slack depend upon timing and IPO constraints.
(a) This step is referred to as "In-Place Optimization" because IPO performs resizing and buffer in-place (between cells in the row). It does not perform placement optimization in this step.
(b) The first IPO1 step is performed after placement. It performs trial-route--> extraction --> timing analysis to determine the quality of placement. Setup and hold time fixing is done according to result of timing analysis. The second IPO step is performed after clock tree synthesis. CTS performs clock buffer insertion to balance skews among all flip-flops. IPO2 step optimizes timing paths between flip-flops taking the actual clock skew.
(c) If IPO2 step is not performed after CTS, then timing paths between flip-flops are not tuned for clock skew variation. Even though NanoRoute performs timing optimization, it is more of buffer insertion in long interconnect to fix setup and hold times.

Q8: Clocking and Place-Route Flow. Consider the following steps:
- Clock sink placement
- Standard-cell global placement
- Standard-cell detailed placement
- Standard-cell ECO placement
- Clock buffer tree construction
- Global signal routing
- Detailed signal routing
- Bounded-skew (balanced) clock (sub)net routing
- Steiner clock (sub)net routing
- Clock sink useful skew scheduling (i.e., solving the linear program, etc.)
- Post-placement (global routing based) static timing analysis
- Post-detailed routing static timing analysis
(a) As a designer of a clock distribution flow for high-performance standard-cell based ASICs, how would you order these steps? Is it possible to use some steps more than once, others not at all (e.g., if subsumed by other steps).
(b) List the criteria used for assessing possible flows.
(c) What were the 3 next-best flows that you considered (describe as variants of your flow), and explain why you prefer your given answer.

Ans:(a) My basic flow:
(1) SC global placement
(2) post-placement STA
(3) clock sink useful-skew scheduling
(4) clock buffer tree construction that is useful-skew aware (cf. associative skew.)
(5) standard-cell ECO placement (to put the buffers into the layout)
(6) Steiner clock subnet routing at lower levels of the clock tree (following CTGen type paradigm)
(7) bounded-skew clock subnet routing at all higher levels of the clock tree, and as necessary even at lower levels, to enforce useful skews
(8) global signal routing
(9) detailed signal routing,
(10) post-detailed routing STA
(1) likelihood of convergence with maximum clock frequency
(2) minimization of CPU time (by maximizing incremental steps, minimizing .detailed. steps, and minimizing iterations)
(3) make a good trade-off between wiring-based skew control and wire cost (this suggests Steiner routing at lower levels, bounded-skew routing at higher levels).
[Comment 1. Criteria NOT addressed: power, insertion delay, variant flow for hierarchical clocking or gated clocking.
Comment 2: I do not know of any technology for clock sink placement that can separate this from placement of remaining standard cells. So, my flow does not invoke this step. I also don't want post-route ECOs.]
(c) Variants:
(1) introduce Step 11: loop over Steps 3-10 (not adopted because cost benefit ratio was not attractive, and because there is a trial placement + global routing to drive useful-skew scheduling, buffer tree construction and ECO placement);
(2) after Steps 1-4, re-place the entire netlist (global, detailed placement) and then skip Step 5 (not adopted because benefits of avoiding ECO placement and leveraging a good clock skeleton were felt to be small-buffer tree will largely reflect the netlist structure, and replacing can destroy assumptions made in Steps 3-4);
(3) can iterate the first 5 steps essentially by iterating: clock sink placement, (ECO placement for legalization), (incremental) standard-cell (global + detailed) placement (not adopted because I feel that any objective for standalone clock sink placement would be very "fuzzy", e.g., based on sizes of intersections of fan-in/fan-out cones of sequentially adjacent FFs)

Q9: If we migrate to the next technology node and double the gate count of a design, how would you expect the size of the LEF and routed DEF files to change? Explain your reasoning.
Ans: The LEF file will remain roughly the same size (same richness of cell library, say, between 500-1200 masters), modulo possible changes in conventions (e.g., CTLF used to be a part of LEF) and modulo possible additional library model semantics (e.g., adding power modeling into LEF). The DEF file should at least double (the components and nets will double, but if there is extra routing complexity (more complex geometries, and more segments per connection due to antenna rules or badly scaling router heuristics) the DEF could grow significantly faster.

SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits

SPARK is a C-to-VHDL high-level synthesis framework that employs a set of innovative compiler, parallelizing compiler, and synthesis transformations to improve the quality of high-level synthesis results. SPARK takes behavioral ANSI-C code as input, schedules it using speculative code motions and loop transformations, runs an interconnect-minimizing resource binding pass and generates a finite state machine for the scheduled design graph. Finally, a backend code generation pass outputs synthesizable register-transfer level (RTL) VHDL. This VHDL can then by synthesized using logic synthesis tools into an ASIC or can be mapped onto a FPGA.

Boosting RTL Verification with High-Level Synthesis

Instead of prolonging the painful process of finding bugs in RTL code, the design flow needs to be geared toward creating bug-free RTL designs. This can be realized today by automating the generation of RTL from exhaustively verified C++ models. If done correctly, high-level synthesis (HLS) can produce RTL that matches the high-level source specification and is free of the errors introduced by manual coding.

Register re-timing

Register retiming is a sequential optimization technique that moves registers through the combinational logic gates of a design to optimize timing and area.

When you describe circuits at the RT-level prior to logic synthesis, it is usually very difficult and time-consuming, if not impossible, to find the optimal register locations and code them into the HDL description. With register retiming, the locations of the flip-flops in a sequential design can be automatically adjusted to equalize as nearly as possible the delays of the stages. This capability is particularly useful when some stages of a design exceed the timing goal while other stages fall short. If no path exceeds the timing goal, register retiming can be used to reduce the number of flip-flops, where possible.

Purely combinational designs can also be retimed by introducing pipelining into the design. In this case, you first specify the desired number of pipeline stages and the preferred flip-flop from the target library. The appropriate number of registers are added at the outputs of the design. Then the registers are moved through the combinational logic to retime the design for optimal clock period and area.

Register retiming leaves the behavior of the circuit at the primary inputs and primary outputs unchanged (unless you choose special options that do not preserve the reset state of the design or add pipeline stages). Therefore you do not need to change any simulation test benches developed for the original RTL design.

Retiming does, however, change the location, contents, and names of registers in the design. A verification strategy that uses internal register inputs and outputs as reference points will no longer work. Retiming can also change the function of hierarchical cells inside a design and add clock, clear, set, and enable pins to the interfaces of the hierarchical cells.

Verilog rules that can save your breath !

This article contains some thoughts of mine about how and engineer should write Verilog code for Synthesis, general rules that will save you headaches if you follow, and how a Verilog file should be layed out.
  • If you don't know what hardware the code you just wrote is, neither will the synthesizer.
  • Remember that Verilog is a Hardware Description Language (HDL) and as such it describes hardware not magical circuits that you can never actually build.
  • You should be able to draw a schematic for everything that you can write Verilog for.
  • Be sure to know what part of your circuit is combinational and which parts are sequential elements. If you do not know or the code is written to be too hard to figure this out, the synthesizer will probably not be able to figure it out either. I recomend making the combinational logic very separate from sequential logic. This prevents errors later. It also prevents level high latches from being synthesized where you meant to have flip-flops. I also recomend having a naming convention such that you can tell what is a state holding element at all times. I use "_f" post-pended to all registers that are flip-flops.
  • I recomend having a style for your inputs and outputs. I list them in the following order: outputs, inouts, special inputs such as clk and reset, inputs.
  • When instantiating a module, always put the names of the signals that you are conecting to inside of the module with the notations where you have period, module signal name, thing you are connecting. This prevents errors when you change underlying modules or someone resorts the parameters.
  • Unlike a language like C which is rather strongly typed, in Verilog, which is also strongly typed, everyting is of the same type and it is easy to reorder parameters and not get errors becasue everything is just a wire.
  • Example of wrong module instantiation: nand2 my_nand(C, A, B);
  • Example of correct module intantiation: nand2 my_nand(.out(C), .in1(A), .in2(B));
  • Make your circuit synchronous whenever possible. Synchronous design is much easier than asynchronous.
  • Also reduce the number of clock domains and clock boundaries whenever possible.
  • Also remember that crossing clock domains in FPGAs is difficult because LUT's glitch in different ways than normal circuits. This causes problems with asynchronous circuits.

Verilog files should be laid out like this..

  • define consts
  • declare outputs (these are _out)
  • declare inouts if any (these are _inout)
  • declare special inputs such as clk and reset
  • declare inputs (these are _in)
  • declare _all_ state (these are _f)
  • declare inputs to state with (these have same name as state but are _temp)
  • declare wires (naming not restricted except connections of two instantiated modules are usually of the form A_to_B_connection)
  • declare 'wire regs' (naming not restricted except connections are usually of the form A_to_B_connection and variables that are going to be outputs, but still need to be read and you don't want inouts get _internal postpended)
  • do assigns
  • do assigns to outputs
  • instantiations of other modules
  • combinational logic always @'s are next
    do the always @ (posedge clk ...) and put reset values here and assign _temps to _f's (ie state <= next_state). I personally think that there should be no conbinational logic inside of always @(posedge clk's) with the exception of conditional assignment (write enables) becasue all Verilog synthesizers understand write enables on flip-flops.


Logic synthesis is a process by which an abstract form of desired circuit behavior, typically register transfer level (RTL) or behavioral is turned into a design implementation in terms of logic gates. Some tools can generate bitstreams for programmable logic devices such as PALs or FPGAs, while others target the creation of ASICs. Logic synthesis is one aspect of electronic design automation.

History of Logic Synthesis
The roots of logic synthesis can be traced to the treatment of logic by George Boole (1815 to 1864), in what is now termed Boolean algebra. In 1938, Claude Shannon showed that the two-valued Boolean algebra can describe the operation of switching circuits. In the early days, logic design involved manipulating the truth table representations as Karnaugh maps. The Karnaugh map-based minimization of logic is guided by a set of rules on how entries in the maps can be combined. A human designer can only work with Karnaugh maps containing four to six variables.

The first step toward automation of logic minimization was the introduction of the Quine-McCluskey algorithm that could be implemented on a computer. This exact minimization technique presented the notion of prime implicants and minimum cost covers that would become the cornerstone of two-level minimization. Another area of early research was in state minimization and encoding of finite state machines (FSMs), a task that was the bane of designers. The applications for logic synthesis lay primarily in digital computer design. Hence, IBM and Bell Labs played a pivotal role in the early automation of logic synthesis. The evolution from discrete logic components to programmable logic arrays (PLAs) hastened the need for efficient two-level minimization, since minimizing terms in a two-level representation reduces the area in a PLA.

However, two-level logic circuits are of limited importance in a very-large-scale integration (VLSI) design; most designs use multiple levels of logic. An early system that was used to design multilevel circuits was LSS from IBM. It used local transformations to simplify logic. Work on LSS and the Yorktown Silicon Compiler spurred rapid research progress in logic synthesis in the 1980s. Several universities contributed by making their research available to the public; most notably, MIS [13] from University of California, Berkeley and BOLD from University of Colorado, Boulder. Within a decade, the technology migrated to commercial logic synthesis products offered by electronic design automation companies.

Behavioral synthesis
With the goal of increasing designer productivity, there has been a significant amount of research on synthesis of circuits specified at the behavioral level using a hardware description language (HDL). The goal of behavioral synthesis is to transform a behavioral HDL specification into a register transfer level (RTL) specification, which can be used as input to a gate-level logic synthesis flow. Behavioral optimization decisions are guided by cost functions that are based on the number of hardware resources and states required. These cost functions provide a coarse estimate of the combinational and sequential circuitry required to implement the design.

The tasks of scheduling, resource allocation, and sharing generate the FSM and the datapath of the RTL description of the design. Scheduling assigns operations to points in time, while allocation assigns each operation or variable to a hardware resource. Given a schedule, the allocation operation optimizes the amount of hardware required to implement the design.

Multi Level Logic Minimization
Typical practical implementations of a logic function utilize a multilevel network of logic elements. Starting from an RTL description of a design, the synthesis tool constructs a corresponding multilevel Boolean network.

Next, this network is optimized using several technology-independent techniques before technology-dependent optimizations are performed. The typical cost function during technology-independent optimizations is total literal count of the factored representation of the logic function (which correlates quite well with circuit area).

Finally, technology-dependent optimization transforms the technology-independent circuit into a network of gates in a given technology. The simple cost estimates are replaced by more concrete, implementation-driven estimates during and after technology mapping. Mapping is constrained by factors such as the available gates (logic functions) in the technology library, the drive sizes for each gate, and the delay, power, and area characteristics of each gate.

Commercial logic synthesis
Examples of software tools for logic synthesis are Design Compiler from Synopsys and the humorously named BuildGates, from Cadence Design Systems. Both of these target ASICs. Example of FPGA synthesis tools include Synplify from Synplicity, Leonardo and Precision from Mentor Graphics and BlastFPGA from Magma Design Automation.

Clock tree synthesis

Basics of Clock Tree Synthesis:
The main idea is to balance the skew between endpoints. They are built with the following constraints.

  1. Clock Skew: Difference between the clock arrival times.
  2. Clock Latency: Max delay between the clock root and clock leaf.
  3. Transition Time
  • Clock buffers are usually bigger in size and have a shorter transition time as well as a more even rise and fall times.
  • Clock nets are generally routed first and on higher metal layers with minimum detouring to give it the highest priority in routing.
  • Few other clock tree related topics that will be covered subsequently are
Signal Integrity issues/Clock nets aggressor/clock shielding:
Clock nets due to their importance have to be protected from becoming either aggressors or victims in SI closure. They are generally shielded with vdd or gnd to prevent that.
  1. Effective skew: Worst skew between two flops that are talking to each other. This is either equal to or lower than the worst skew.
  2. Useful skew: This is a concept where the skew (Difference in arrival of the clock at the flops is used to improve setup violations.
  3. Few links that have more detailed information about clock tree synthesis
  4. OCV: On chip variation
  5. Few links that have good information about clock tree

Low power design

Primarily design for low power depends on the characteristics design being accomplished. If it is a multi-million gate design we cannot implement any technique that is gate specific, it has to be a global technique.
  1. Multi-Vdd, variable Vdd and Multi-Vth seems to be a good global solution.
  2. Reducing the clock speed will result in low power consumption, but on the cost of performance.
  3. Using power headers and power footer transistors on logic gates cuts down power.
  4. You could separate the design in blocks, which can go in to sleep mode.
  5. Another solutions is variable VDD and variable frequency (as Intel or AMD do).This means, you adapt VDD and frequency to the necessary performance.
  6. Gated clocks and Logic Addressable clocks, dis adv - timing problems due to improper latching of signals, and difficult to test.
  7. -ve edge triggered flops, (nor+inv) = 1.5 gates, +ve edge triggered flops, (and+inv) = 2.5 gates, so -ve has less gates, less glitching and hence low power.

Any more thoughts and ideas are welcome.

Application Specific Integrated Circuit ( ASIC )

An application-specific integrated circuit or ASIC comprises an integrated circuit (IC) with functionality customized for a particular use (equipment or project), rather than serving for general-purpose use.

For example, a chip designed solely to run a cash register is an ASIC. In contrast, a microprocessor is not application-specific, because users can adapt it to many purposes.

The initial ASICs used gate-array technology.

The British firm Ferranti produced perhaps the first gate-array, the ULA (Uncommitted Logic Array), around 1980. Customisation occurred by varying the metal interconnect mask. ULAs had complexities of up to a few thousand gates. Later versions became more generalized, with different base dies customised by both metal and polysilicon layers. Some base dies include RAM elements.

In the late 1980s, the availability of logic synthesis tools (such as Design Compiler) that could accept hardware description language descriptions using Verilog and VHDL and compile a high-level description into to an optimised gate level netlist brought "standard-cell" design into the fore-front. A standard-cell library consists of pre-characterized collections of gates (such as 2 input nor, 2 input nand, invertors, etc.) that the silicon compiler uses to translate the original source into a gate level netlist. This netlist is fed into a place and route tool to create a physical layout. Routing applications then place the pre-characterized cells in a matrix fashion, and then route the connections through the matrix. The final output of the "place & route" process comprises a data-base representing the various layers and polygons in GDS-II format that represent the different mask-layers of the actual chip.

Finally, designers can also take the "full-custom" route in implementing an ASIC. In this case, an individual description of each transistor occurs in building the circuit. A "full-custom" implementation may function five times faster than a "standard-cell" implementation. The "standard-cell" implementation can usually be implemented quite a bit quicker and with less risk of errors, than the "full-custom" choice.

As feature sizes have shrunk and design tools improved over the years, the maximum complexity (and hence functionality) has increased from 5000 gates to 20 million or more. Modern ASICs often include 32-bit processors and other large building-blocks. Many people refer to such an ASIC as a SoC - System on a Chip.

The use of intellectual property (IP) in ASICs has become a growing trend. Many ASIC houses have had standard cell libraries for years. However IP takes the reuse of designs to a new level. Designers of most complex digital ICs now utilise computer languages that describe electronics rather than code. Many organizations now sell tested functional blocks written in these languages. For example, one can purchase CPUs, ethernet or telephone interfaces.

For smaller designs and/or lower production volumes, ASICs have started to become a less attractive solution, as field-programmable gate arrays (FPGAs) grow larger, faster and more capable. Some SoCs consist of a microprocessor, various types of memory and a large FPGA.

So having said, this blog is dedicated to Digital Electronics, VLSI, ASICs, SOCs etc.