Delay Locked Loop (DLL)


Why not a PLL:
PLLs have disadvantages that make their use in high-speed designs problematic, particularly when both high performance and high reliability are required. The PLL voltage-controlled oscillator (VCO) is the greatest source of problems. Variations in temperature, supply voltage, and manufacturing process affect the stability and operating performance of PLLs.

DLLs, however, are far less susceptible to these problems because they contain no oscillator. A DLL in its simplest form inserts a variable delay line between the external clock and the internal clock. The clock tree distributes the clock to all registers and then back to the feedback pin of the DLL. The control circuit of the DLL adjusts the delay so that the rising edges of the feedback clock align with those of the input clock. Once the edges of the clocks are aligned, the DLL is locked, and both the input buffer delay and the clock skew are effectively cancelled.

Advantages:
  • precision
  • stability
  • power management
  • noise sensitivity
  • jitter performance.


Phase locked loop (PLL)


PLL stands for 'Phase-Locked Loop' and is basically a closed loop frequency control system, whose functioning is based on the phase sensitive detection of phase difference between the input and output signals of the controlled oscillator (CO).


FIFO depth calculation


Assuming,
  • F1 = frequency of the writing side.
  • F2 = frequency of the reading side.
  • D = data burst.

Burst duration = D/F1
Data read during the burst, Rx = (D/F1) * F2, assuming the reader runs for the whole burst.
Extra storage needed by the end of the burst (assuming F1 > F2), Backlog = D - Rx = D * (F1 - F2)/F1

To accommodate latency or response time in the receiver, T, we need an additional T * F1 locations.
The receiver also needs time to read out this backlog, so the idle time between bursts must be long enough. This minimum idle time is called the mop-up time = Backlog/F2 = D * (F1 - F2)/(F1 * F2)

Note: If data is written for only part of a given number of cycles, and the reads either happen continuously or are likewise active for only part of a given number of cycles, the calculation has to account for the next burst.
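
The formulas above can be sketched in a few lines of Python. This is only an illustration under the same assumptions (a single burst, the reader active for the whole burst); the function names and the example numbers are made up for the sketch.

```python
import math

def fifo_backlog(burst_len, f_write, f_read):
    """Words left in the FIFO at the end of a burst: D * (F1 - F2) / F1."""
    if f_read >= f_write:
        return 0.0  # the reader keeps up, so nothing accumulates
    return burst_len * (f_write - f_read) / f_write

def fifo_depth(burst_len, f_write, f_read, latency=0.0):
    """Backlog plus T * F1 words for the receiver's response time, rounded up."""
    extra = latency * f_write
    return math.ceil(fifo_backlog(burst_len, f_write, f_read) + extra)

def mop_up_time(burst_len, f_write, f_read):
    """Minimum idle time between bursts needed to drain the backlog."""
    return fifo_backlog(burst_len, f_write, f_read) / f_read

# Illustrative numbers: 120-word burst, 80 MHz writer, 50 MHz reader.
depth = fifo_depth(burst_len=120, f_write=80e6, f_read=50e6)
idle = mop_up_time(burst_len=120, f_write=80e6, f_read=50e6)
print(depth, idle)
```

With these numbers the backlog is 45 words and the reader needs about 0.9 microseconds of idle time to drain it.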

Advanced Microcontroller Bus Architecture (AMBA)


Description:
The AMBA™ on-chip interconnect system is an established open specification that details a strategy for the interconnection and management of the functional blocks that make up a System-on-Chip (SoC). Its Advanced High-performance Bus (AHB) is a high-speed, high-bandwidth bus that supports multi-master bus management to maximize system performance. AHB serves the need for high-performance SoC design while aligning with current synthesis design flows. It facilitates "first-time-correct" development of systems with one or more high-performance peripherals, DMA controllers, on-chip memory and other interfaces. As an increasing number of companies adopt the AMBA system, it has rapidly emerged as the de facto standard for SoC interconnection and IP library development. AMBA enhances a reusable design methodology by defining a common backbone for SoC modules.

Flexibility:
The AMBA specification defines the protocol used to move data across an AMBA interconnect architecture without defining the architecture itself. This provides the system designer with the flexibility to create architectures ranging from a simple 'point-to-point' connection through to complex, high-performance architectures. AHB-Lite, a subset of AHB, enables further simplification and increased performance for interconnects with only a single master, while the Multi-layer AHB architecture allows the system designer to dramatically increase the capacity, and hence performance, of the architecture.

Some of the features..

  1. Single active rising edge clock.
  2. High-performance operation maximized by the ability to use the full clock cycle.
  3. Aligns with synthesis design flows.
  4. Multiple bus masters: optimizes system performance by sharing resources between different bus masters such as the main processor, DMA controllers or secondary processors.
  5. Pipelined and burst transfers: allow high-speed memory and peripheral access without the requirement for additional cycles on the bus.
  6. Burst transfers allow optimal use of memory interfaces by giving advance information of the nature of the transfers.
  7. Split transactions are supported.
  8. Maximize the use of the system bus bandwidth by enabling high latency slaves to release the system bus during the dead time while the slave is completing its transaction.
  9. Wide data bus configuration (32/64/128 up to 1024-bit wide).
  10. Support for high-bandwidth data-intensive application using wide on-chip memory.


SOC interconnect Bus


SOC interconnect bus: These buses are used within a chip to interconnect different IP cores with the surrounding interface and peripheral logic.

Some buses...

  1. Atlantic Interface, Avalon Bus Specification -- Altera corp.
  2. WISHBONE -- Opencores.org.
  3. AMBA: AHB (Advanced High-performance Bus), APB (Advanced Peripheral Bus), ASB (Advanced System Bus) -- ARM.
  4. CoreConnect Bus -- IBM.
  5. Open Core Protocol 'OCP' -- Open Core Protocol International Partnership (OCP-IP)

I2C bus


The Inter-IC bus, commonly known as the I²C ("eye-squared-see") bus, is a control bus that provides the communications link between integrated circuits in a system. Developed by Philips in the early 1980s, this simple two-wire bus with a software-defined protocol has evolved to become the de facto worldwide standard for system control, finding its way into everything from temperature sensors and voltage level translators to EEPROMs, general-purpose I/O, A/D and D/A converters, CODECs, and microprocessors of all kinds.


HDLC (High-level Data Link Control)


HDLC is a bit-oriented, link layer protocol for the transmission of data over synchronous networks. It is an ISO standard and a superset of IBM's SDLC (Synchronous Data Link Control) protocol. SDLC was the successful follow-up to the BISYNC communication protocol and was originally introduced with IBM SNA (Systems Network Architecture) products. Another name for HDLC is ADCCP (Advanced Data Communications Control Procedure), an ANSI standard, but HDLC is the widely accepted name for the protocol. There are some incompatibilities between SDLC and HDLC, depending on the vendor.

Two Internet RFCs are related to HDLC. These are RFC 2687 (PPP in a Real-Time Oriented HDLC-like Framing, September 1999) and RFC 1662 (PPP in HDLC-like Framing, July 1994).

Types of Timing Verification


Dynamic timing:

  1. The design is simulated in full timing mode.
  2. Not all possibilities tested as it is dependent on the input test vectors.
  3. Simulations in full timing mode are slow and require a lot of memory.
  4. Best method to check asynchronous interfaces or interfaces between different timing domains.

Static timing:

  1. The delays over all paths are added up.
  2. All possibilities, including false paths, verified without the need for test vectors.
  3. Much faster than simulations, hours as opposed to days.
  4. Not good with asynchronous interfaces or interfaces between different timing domains.

Timing delays are determined by the layout but vary with temperature, voltage and process factors.

  1. The ASIC vendor will supply factors based on their technology to allow verification under different environments. These factors are generally called best-case and worst-case military, industrial and commercial. The nominal value of the delay will be multiplied by a factor chosen by the designer to best represent the final operating conditions or the system requirements.
  2. However, some vendors have their technology characterised for all the possible operating points.
  3. Worst case military results in the longest delays and thus the most setup time violations.
  4. Best case commercial results in the shortest delays and thus the most hold time violations.

Calculation of an Estimated Delay:

  1. In synchronous design timing issues should be considered when choosing the algorithm. Avoid long paths before a register stage.
  2. Synthesis is constraint driven. This means that the synthesis tool will generate the circuit using timing as a critical factor.
  3. The libraries from the vendor will include the intrinsic delays of the cells.
  4. The wire load model is a statistically based estimate (provided by the vendor for the target die size) of the load a certain fan out will result in. This load is then used to calculate the propagation delay.
  5. Floorplanning is a method that allows information about the placement of a cell to be used in timing estimation. As most routing will be close to the ideal this is the dominant source of the timing delays.
  6. Rather than trusting a wire load model (which is becoming less accurate as interconnect delays start to dominate), floor planning can be used.
  7. This can give very accurate timing information provided the floor plan drives the layout.
  8. A floor plan will restrict the layout tool often resulting in a less efficient use of the silicon die. Also as it is a manual process human error can become a factor. Some vendors for this reason do not offer floor planning in their design flow.
  9. Physical synthesis is a new synthesis strategy. The synthesis tool will place the cells and calculate estimated delays based on the minimum distance in the x-y plane.
  10. If the synthesis fails then a new placement or new cells would be synthesised.
  11. The tool will output the netlist and the placement file.

Wire Load Model

  1. Statistical estimate of the load.
  2. If the estimate is too conservative then high drive cells are used and more power is consumed. If the estimate is too optimistic then there will be widespread timing problems.
  3. Main drawback is that no information about placement or routing is available.

Types of Delays


Intrinsic device delay: Time taken for the cell to change state due to a change on the input pins.
Interconnect delay: Delay due to wires. Dependent on layout. Smaller dimensions mean more resistance. Making the wires "taller" leads to more capacitance.

• The total delay is the sum of the gate delay and the interconnect delay. Delay is mostly determined by the layout (placement) but varies with temperature, voltage and process.
• Above 0.5 micron, interconnect delays are 20% of the path delay. With present technologies they are 40 to 60%, so delays are less predictable.
• Delays are even becoming a function of cross talk.

All about Clock skew & Short path


Clock Skew:
Differences in clock signal arrival times across the chip are called clock skew. It is a fundamental design principle that timing must satisfy register setup and hold time requirements. Both data propagation delay and clock skew are parts of these calculations. Clocking sequentially-adjacent registers on the same edge of a high-skew clock can potentially cause timing violations or even functional failures.

Short Path:
The problem of short data paths in the presence of clock skew is very similar to hold-time violations in flipflops. The problem arises when the data propagation delay between two adjacent flip-flops is less than the clock skew.
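
The short-path condition can be sketched as a hold-slack check. This is a minimal illustration, assuming the usual hold relation (minimum clock-to-Q plus minimum logic delay must exceed the hold time plus the skew); the function name, parameter names and numbers are hypothetical and not taken from any vendor tool.

```python
def hold_slack(t_clk2q_min, t_logic_min, t_hold, skew):
    """Hold slack in the same time unit as the inputs.

    skew is the capture-clock arrival minus the launch-clock arrival,
    so a positive skew eats into the hold margin.
    A negative result means a short-path (hold) violation.
    """
    return (t_clk2q_min + t_logic_min) - (t_hold + skew)

# Illustrative values in nanoseconds.
low_skew = hold_slack(t_clk2q_min=0.3, t_logic_min=0.2, t_hold=0.1, skew=0.1)
high_skew = hold_slack(t_clk2q_min=0.3, t_logic_min=0.2, t_hold=0.1, skew=0.6)
print(low_skew > 0, high_skew > 0)
```

The same data path passes with a low-skew clock and fails once the skew exceeds the data propagation delay minus the hold time.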

How to Measure Clock Skew:
The first step in coping with clock skew problems is to measure the clock skew. Users should perform a static timing analysis of the design after place-and-route to determine the amount of clock skew. The timing report gives a better picture.

The timing report is only valid if the user has specified one or more clock constraints. If the design clocks are not constrained, the report will be empty. The timing report has four sections as follows depending on the type of tool and vendor:
• Header: This section contains software version, design name, operating condition, device type, speed grade and Timer preferences.
• Clock Constraint Violation: This section reports the critical paths limiting any clock frequency constraint set in the General tab window.
• Max Delay Constraint Violation: This section reports the critical paths that are limiting any Max Delay constraint set in the Timer Path tab window.
• Min Delay Constraint Violation: In this section, short data paths that are susceptible to hold-time violations are listed.

In the timing report, the skew of the clock network is taken into account in calculating the slack. The report is sorted by slack for each section; a negative slack indicates a violation. The timing report is created based on the operating conditions set in the timer preferences.
Therefore, to examine the long data paths versus any clock or Max Delay Constraint, the user should export the report while the timer preferences are set to worst case/long paths. On the other hand, to identify all the possible hold-time violations, the report should be created while the timer preferences are set to best case/short paths. Users should note that after each change in the operating conditions in the Timer window, the "calculate delays" option should be selected before exporting the timing violation report.

Minimizing the Clock Skew:
The short-path problem is created by the existence of unacceptably large clock skew. Therefore, minimizing (i.e., nearly removing) the clock skew is the best approach to reduce the risk of short-path problems. Many FPGA devices offer global routing resources, which reduce skew.
If there are any free global resources available on the device, users should assign their clock signals to these resources. Maintaining the clock skew at a value less than the smallest register-to-register delay in the design by using low-skew global resources will improve the robustness of the design against any short-path problems.


Interview Questions (Intel)


# Have you studied buses? What types?
Ans: Processor-Memory Bus, I/O Bus, System Bus, Backplane Bus.

# Have you studied pipelining? List the 5 stages of a 5 stage pipeline. Assuming 1 clock per stage, what is the latency of an instruction in a 5 stage machine? What is the throughput of this machine ?
Ans: A method of executing a sequence of instructions in a single processor so that subsequent instructions in the sequence can begin execution before previous instructions complete execution.

5 Stages:
1. fetch instructions from memory
2. read registers and decode the instruction
3. execute the instruction or calculate an address
4. access an operand in data memory
5. write the result into a register

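Latency: The amount of time between when an instruction is issued and when it completes. With 1 clock per stage, that is 5 clock cycles.
Throughput: The number of instructions that complete in a span of time. Once the pipeline is full and there are no stalls, one instruction completes every clock cycle.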
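
A small sketch of the cycle count for an ideal pipeline (one clock per stage, no stalls or hazards, which is an assumption of this example). The function name is hypothetical.

```python
def pipeline_cycles(n_instructions, n_stages=5):
    """Total clocks to finish n instructions on an ideal pipeline (no stalls)."""
    if n_instructions <= 0:
        return 0
    # The first instruction takes n_stages clocks; each later one adds one clock.
    return n_stages + (n_instructions - 1)

latency = pipeline_cycles(1)                 # clocks for a single instruction
throughput = 1000 / pipeline_cycles(1000)    # instructions per clock over a long run
print(latency, round(throughput, 3))
```

For long instruction sequences the throughput approaches one instruction per clock.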

# How many bit combinations are there in a byte?
Ans: 256

# For a single-processor computer system, what is the purpose of a processor cache? Describe its operation.
Ans:

# Explain the operation considering a two processor computer system with a cache for each processor.
Ans:

# What are the main issues associated with multiprocessor caches and how might you solve them?
Ans:

# Explain the difference between write through and write back cache.

# Are you familiar with the term MESI?

# Are you familiar with the term snooping?
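Ans: Looking into a packet to obtain information. Usually used to verify data at the output of a logic core with built-in snoopers. In the cache context of the previous questions, snooping means each cache controller monitors the shared bus for transactions on lines it holds, which is how protocols such as MESI keep caches coherent.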

# Describe a finite state machine that will detect three consecutive coin tosses (of one coin) that results in heads.
Ans:
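
One possible answer, sketched in Python rather than as a state diagram: a Moore-style machine whose state is the number of consecutive heads seen so far (0 to 3), with the output asserted in state 3. The function name and the "H"/"T" encoding are assumptions for this sketch, and it treats overlapping runs as detections (a fourth head keeps the output high).

```python
def detect_three_heads(tosses):
    """Return, for each toss, whether the last three tosses were all heads."""
    state = 0          # number of consecutive heads seen so far, saturating at 3
    outputs = []
    for toss in tosses:
        if toss == "H":
            state = min(state + 1, 3)   # S0 -> S1 -> S2 -> S3, stay in S3 on heads
        else:
            state = 0                   # any tail returns to S0
        outputs.append(state == 3)
    return outputs

flags = detect_three_heads("HHTHHHH")
print(flags)
```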

# In what cases do you need to double clock a signal before presenting it to a synchronous state machine?
Ans:

# You have a driver that drives a long signal & connects to an input device. At the input device there is either overshoot, undershoot or signal threshold violations, what can be done to correct this problem?
Ans:

# What is the difference between = and == in C?
Ans: Assignment and Equality operators.

# Are you familiar with VHDL and/or Verilog?
Ans:

# What types of CMOS memories have you designed? What were their size? Speed?
Ans: SRAM, 10 Kbits, 50 MHz.

# What work have you done on full chip Clock and Power distribution? What process technology and budgets were used?
Ans:

# What types of I/O have you designed? What were their size? Speed? Configuration? Voltage requirements?
Ans:

# Process technology? What package was used and how did you model the package/system? What parasitic effects were considered?
Ans:

# What types of high speed CMOS circuits have you designed?
Ans: FFs and latch-based fast multipliers.

# What transistor level design tools are you proficient with? What types of designs were they used on?
Ans: PSPICE, MAGIC layout system; CMOS multiplier chip, 0.8 µm technology.

# What products have you designed which have entered high volume production?
Ans: TOE.

# What was your role in the silicon evaluation/product ramp? What tools did you use?
Ans:

# If not into production, how far did you follow the design and why did not you see it into production?
Ans:

# Explain how a MOSFET works.
Ans:

# Draw Vds-Ids curve for a MOSFET. Now, show how this curve changes (a) with increasing Vgs (b) with increasing transistor width (c) considering Channel Length Modulation
Ans:

# Explain the various MOSFET Capacitances & their significance
Ans:

# Draw a CMOS Inverter. Explain its transfer characteristics
Ans:

# Explain sizing of the inverter
Ans:

# How do you size NMOS and PMOS transistors to increase the threshold voltage?
Ans:

# What is Noise Margin? Explain the procedure to determine Noise Margin?
Ans:

# Give the expression for CMOS switching power dissipation.
Ans:

# What is Body Effect?
Ans:

# Describe the various effects of scaling?
Ans:

# Give the expression for calculating Delay in CMOS circuit
Ans: Tp = (tphl + tplh)/2, where tphl = 0.69 Reqn C and tplh = 0.69 Reqp C. C is the load capacitance, made up of the diffusion capacitances of the drain and the input capacitances of the fanout gates. Reqn and Reqp are the equivalent resistances of the pull-down and pull-up networks, obtained by averaging (integrating) the transistor's resistance over the output transition, which spans the saturation and resistive regions.
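
A quick numeric sketch of the first-order delay expression. The resistance and capacitance values are illustrative assumptions, not data for any particular process, and the function name is hypothetical.

```python
def propagation_delay(r_eq_n, r_eq_p, c_load):
    """First-order CMOS delay: average of the two 0.69 * Req * C terms."""
    t_phl = 0.69 * r_eq_n * c_load   # output falling, pull-down path
    t_plh = 0.69 * r_eq_p * c_load   # output rising, pull-up path
    return (t_phl + t_plh) / 2

# Assumed values: 10 kOhm equivalent resistances, 10 fF load.
tp = propagation_delay(r_eq_n=10e3, r_eq_p=10e3, c_load=10e-15)
tp_double_load = propagation_delay(r_eq_n=10e3, r_eq_p=10e3, c_load=20e-15)
print(tp, tp_double_load / tp)
```

In this model the delay is about 69 ps and scales linearly with the load capacitance.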

# What happens to delay if you increase load capacitance?
Ans: The delay increases. Since tp = 0.69 Req C, the propagation delay grows in proportion to the total load capacitance (the gate's own diffusion capacitance plus wiring and fanout capacitance).

# What happens to delay if we include a resistance at the output of a CMOS circuit?
Ans: The series resistance adds to Req, so the RC time constant and the delay both increase.

# What are the limitations in increasing the power supply to reduce delay?
Ans: Increase in Dynamic Power dissipation

# How does Resistance of the metal lines vary with increasing thickness and increasing length?
Ans: Resistance is directly proportional to length and inversely proportional to cross-sectional area. Thicker metal lines therefore have lower resistance, and increasing the length increases the resistance.

# What happens if we increase the number of contacts or via from one metal layer to the next?
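Ans: Vias placed in parallel between the same two layers reduce the effective contact resistance (and improve reliability); each additional via in series along a route adds contact resistance.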

# Draw a transistor level two input NAND gate. Explain its sizing (a) considering Vth (b) for equal rise and fall times
Ans:

# Let A & B be two inputs of the NAND gate. Say signal A arrives at the NAND gate later than signal B. To optimize delay, of the two series NMOS inputs A & B, which one would you place near the output?
Ans:

# Draw the stick diagram of a NOR gate. Optimize it.
Ans:

# For CMOS logic, give the various techniques you know to minimize power consumption
Ans:

# What is Charge Sharing? Explain the Charge Sharing problem while sampling data from a Bus
Ans:

# Why do we gradually increase the size of inverters in buffer design? Why not give the output of a circuit to one large inverter?
Ans:

# In the design of a large inverter, why do we prefer to connect small transistors in parallel (thus increasing effective width) rather than lay out one transistor with large width?
Ans:

# Given a layout, draw its transistor level circuit. (I was given a 3 input AND gate and a 2 input Multiplexer. You can expect any simple 2 or 3 input gates)
Ans:

# Give the logic expression for an AOI gate. Draw its transistor level equivalent. Draw its stick diagram
Ans:

# Why don’t we use just one NMOS or PMOS transistor as a transmission gate?
Ans: NMOS passes a clean 0 but a degraded 1, while PMOS passes a clean 1 but a degraded 0 (Ref: Kamran).

# For a NMOS transistor acting as a pass transistor, say the gate is connected to VDD, give the output for a square pulse input going from 0 to VDD
Ans:

# Draw a 6-T SRAM Cell and explain the Read and Write operations
Ans:

# Draw the Differential Sense Amplifier and explain its working. Any idea how to size this circuit? (Consider Channel Length Modulation)
Ans:

# What happens if we use an Inverter instead of the Differential Sense Amplifier?
# Draw the SRAM Write Circuitry
# Approximately, what were the sizes of your transistors in the SRAM cell? How did you arrive at those sizes?
# How does the size of PMOS Pull Up transistors (for bit & bit- lines) affect SRAM’s performance?
# What’s the critical path in a SRAM?
# Draw the timing diagram for a SRAM Read. What happens if we delay the enabling of Clock signal?
# Give a big picture of the entire SRAM Layout showing your placements of SRAM Cells, Row Decoders, Column Decoders, Read Circuit, Write Circuit and Buffers
# In a SRAM layout, which metal layers would you prefer for Word Lines and Bit Lines? Why?
# How can you model a SRAM at RTL Level?
# What’s the difference between Testing & Verification?
# For an AND-OR implementation of a two input Mux, how do you test for Stuck-At-0 and Stuck-At-1 faults at the internal nodes? (You can expect a circuit with some redundant logic)
# What is Latch Up? Explain Latch Up with cross section of a CMOS Inverter. How do you avoid Latch Up?

What are Embedded Systems?


Any electronic system that uses a CPU chip, but that is not a general-purpose workstation, desktop or laptop computer. Such systems generally use microprocessors, or they may use custom-designed chips or both. They are used in automobiles, planes, trains, space vehicles, machine tools, cameras, consumer and office appliances, cellphones, PDAs and other handhelds as well as robots and toys. The uses are endless, and billions of microprocessors are shipped every year for a myriad of applications. Although there are embedded versions of popular operating systems, low-cost consumer products can use chips that cost less than a dollar and have very limited storage for instructions. In such cases, the OS and application may be combined into one program.

In embedded systems, the software is permanently set into a read-only memory such as a ROM or flash memory chip, in contrast to a general-purpose computer that loads its programs into RAM each time. Sometimes, single board and rack mounted general-purpose computers are called "embedded computers" if used to control a single printer, drill press or other such device. See smart car, Windows XP Embedded, Embedded Linux and embedded language.

ASIC equivalent gates for Virtex


Resource            Equivalent ASIC gates
4-input LUT         6
4-input ROM         32
3-input LUT         n/a
16x1 RAM            64
32x1 RAM            128
16 Shift Reg LUT    64
CLB flop            8
CLB latch           5
IOB flop            8
IOB latch           5
IOB Sync latch      n/a
TBUF                3
Block RAM           16,384
BSCAN               48
Clk DLL             7,000
F5 MUX              3
F6 MUX              3
MUXCY               3
XORCY               3

With some quick math, one can estimate the typical ASIC gate count for a
Virtex 1000, which has a 64x96 CLB array:
(64 * 96 CLBs) * (2 slices/CLB) * (20 gates/slice) = 245,760 gates.

Tristate Buffers


You can think of tristate buffers as a way of turning a signal on and off. When the enable input at the top of the buffer is '1', the tristate buffer acts like a normal buffer. But when the enable input is '0', the buffer "turns off" by giving a very high impedance output. This effectively "disconnects" the buffer from the circuit. So if you need to turn off a signal, ground the enable input of the tristate buffer.
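
A toy model of this behavior, assuming an active-high enable as in the description above. High impedance is represented by None, and the bus resolution rule (at most one active driver) is an assumption of this sketch; all names are hypothetical.

```python
def tristate(data, enable):
    """Pass data through when enabled, otherwise return None (high impedance)."""
    return data if enable else None

def resolve_bus(driver_outputs):
    """Combine several tristate outputs that share one wire."""
    active = [v for v in driver_outputs if v is not None]
    if not active:
        return None                      # nobody drives the wire: it floats
    if len(set(active)) > 1:
        raise ValueError("bus contention: drivers disagree")
    return active[0]

# Two buffers share a wire; only the first one is enabled.
bus = resolve_bus([tristate(1, enable=True), tristate(0, enable=False)])
print(bus)
```

Grounding the enable of every buffer except one is what lets several sources share the same wire.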

behavioral & RTL


Multi-cycle functionality:
It is a fundamental characteristic of synthesizable RTL code that the complete functionality of each clocked process must be performed within a single clock cycle. Behavioral synthesis lifts this restriction. Clocked processes in synthesizable behavioral code may contain functionality that takes more than one clock cycle to execute.

The behavioral synthesis algorithms will create a schedule that determines how many clock cycles will be used. The behavioral synthesis tool automatically creates the finite state machine (FSM) that is required to implement this multi-cycle behavior in the generated RTL code.

In a traditional RTL design process, the designer is responsible for manually decomposing multi-cycle functionality into a set of single-cycle processes. Typically this entails the creation of multiple processes to implement the finite state machine, and the creation of processes for each operation and each output.

A behavioral synthesis tool performs this decomposition for the designer. The multi-cycle behavior can be expressed in a natural way in a single process leading to more efficient design specification and debug.

Loops:
Most algorithms include looping structures. Traditional RTL design imposes severe restrictions on the use of loops, or prohibits them outright. Some RTL logic synthesis tools permit for loops with fixed loop indices only. The loop body is restricted to being executed in a single cycle. Parallel hardware is inferred for each loop iteration.

These restrictions require the designer to transform the algorithm into a multi-cycle FSM adding substantial complexity to the designer's task. Behavioral design manages this complexity for the designer by permitting free use of loops. "While" loops and "for" loops with data-dependent loop indices are fully supported in a behavioral design flow. Loop termination constructs such as the C language "break" and "continue" keywords are permitted.

Memory access:
In general, reading and writing to memories requires complex multi-cycle protocols. In RTL design these are implemented as explicit FSMs. Worse, these accesses must usually be incorporated in an already complex FSM implementing an algorithm.

Behavioral synthesis permits them to be represented in an intuitive way as simple array accesses. An array is declared in the native syntax of the behavioral language in use, tool directives are provided to control the mapping of the array to a physical memory element, and the array elements are referenced using the array indexing syntax of the language. The behavioral synthesis tool instantiates the memory element and connects it to the rest of the circuit. It also develops the FSM for the memory access protocol and integrates this FSM with the rest of the algorithm.

Clock Latency & clock skew


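Clock latency, as used here, means the number of clock pulses the circuit requires before it produces its first output. It is generally observed in pipelined circuits. (In static timing analysis, the same term usually refers to the delay from the clock source to a register's clock pin.)

Clock skew means the time difference between the arrival of the clock edge at different flip-flops. This skew is due to different clock tree paths.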

gates from mux's


OR gate from 2:1 MUX:
Assumptions:
's' is the select line of the mux.
'I0' and 'I1' are the input data lines of the mux.
'Z' is the output of the mux.

'a' and 'b' are the inputs of the OR gate.


method 1 >>
Connect input 'b' to the select line 's' of the mux.
Connect input 'a' to the 'I0' input of the mux.
Connect the 'I1' input of the mux to LOGIC 1 (VCC).
Now the mux output 'Z' will be "a or b".

method 2 >>
In this method, instead of connecting the I1 line of the mux to VCC, connect (short) it to the select line 's' of the mux.

XOR gate from 2:1 mux:
Connect input 'b' to the select line.
Then connect 'a' to I0, and connect the inverse of 'a' to I1 (through an inverter).

If you reverse these (inverted 'a' to I0, and 'a' to I1), you get the XNOR operation.
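
The wirings above can be checked exhaustively with a small behavioral model. This is a sketch in Python rather than HDL; the function names are made up, and the mux is assumed to select I1 when s is 1.

```python
def mux2(s, i0, i1):
    """2:1 mux: output follows i1 when s is 1, otherwise i0."""
    return i1 if s else i0

def or_method1(a, b):
    return mux2(s=b, i0=a, i1=1)        # I1 tied to logic 1

def or_method2(a, b):
    return mux2(s=b, i0=a, i1=b)        # I1 shorted to the select line

def xor_gate(a, b):
    return mux2(s=b, i0=a, i1=1 - a)    # I1 gets the inverted 'a'

def xnor_gate(a, b):
    return mux2(s=b, i0=1 - a, i1=a)    # inputs reversed

# Compare against Python's bitwise operators for every input combination.
all_ok = all(
    or_method1(a, b) == (a | b)
    and or_method2(a, b) == (a | b)
    and xor_gate(a, b) == (a ^ b)
    and xnor_gate(a, b) == 1 - (a ^ b)
    for a in (0, 1)
    for b in (0, 1)
)
print(all_ok)
```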

Verification and Testing


Verification:
In order to verify the functional correctness of a design, one needs to capture the model of the behavior of the design in a formal language or use the design itself. In most commercial software development organizations, there is often no formal specification of the program under development. Formal verification is used routinely by only small pockets of the industrial software community, particularly in the areas of protocol verification and embedded systems. Where verification is practiced, the formal specifications of the system design (derived from the requirements) are compared to the functions that the code actually computes. The goal is to show that the code implements the specifications.

Testing:
Testing is clearly a necessary area for software or hardware validation. Typically, prior to coding the program, design reviews and code inspections are done as part of the static testing effort. Once the code is written, various other static analysis methods based on source code can be applied.

Typically, system testing targets key aspects of the product, such as recovery, security, stress, performance, hardware configurations, software configurations, etc. Testing during production and deployment typically involves some level of customer-acceptance criteria.

Load and stress testing.


One of the most common but unfortunate misuses of terminology is treating "load testing" and "stress testing" as synonymous. The consequence of this ignorant semantic abuse is usually that the system is neither properly "load tested" nor subjected to a meaningful stress test.

1. Stress testing is subjecting a system to an unreasonable load while denying it the resources (e.g., RAM, disc, mips, interrupts, etc.) needed to process that load. The idea is to stress a system to the breaking point in order to find bugs that will make that break potentially harmful. The system is not expected to process the overload without adequate resources, but to behave (e.g., fail) in a decent manner (e.g., not corrupting or losing data). Bugs and failure modes discovered under stress testing may or may not be repaired depending on the application, the failure mode, consequences, etc. The load (incoming transaction stream) in stress testing is often deliberately distorted so as to force the system into resource depletion.

2. Load testing is subjecting a system to a statistically representative (usually) load. The two main reasons for using such loads are to support software reliability testing and performance testing. The term "load testing" by itself is too vague and imprecise to warrant use. For example, do you mean "representative load," "overload," "high load," etc.? In performance testing, load is varied from a minimum (zero) to the maximum level the system can sustain without running out of resources or having transactions suffer (application-specific) excessive delay.

3. A third use of the term is as a test whose objective is to determine the maximum sustainable load the system can handle. In this usage, "load testing" is merely testing at the highest transaction arrival rate in performance testing.

Digital Logic Metastability Definitions


Max Frequency calculation


In the simplest form:
FF1 - combo - FF2 (this is how things look physically for our consideration)
Tmin = Tclk2Q(FF1) + Td(combo) + Tsu(FF2)
Fmax = 1/Tmin

* Fmax depends mainly on the critical path; defining proper timing constraints during synthesis helps the tool optimize it.
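
As a rough numeric illustration of the simple form, the sketch below adds the three terms and inverts the result. The delay values are invented for the example, and clock skew and jitter are ignored here.

```python
def max_frequency_hz(t_clk2q, t_comb, t_setup):
    """Fmax = 1 / (Tclk2Q + Td(combo) + Tsu), all times in seconds."""
    t_min = t_clk2q + t_comb + t_setup
    return 1.0 / t_min

# Assumed delays: 0.5 ns clock-to-Q, 3.0 ns logic, 0.5 ns setup.
f_max = max_frequency_hz(t_clk2q=0.5e-9, t_comb=3.0e-9, t_setup=0.5e-9)
print(f"{f_max / 1e6:.1f} MHz")
```

With a 4 ns critical path the register-to-register limit is about 250 MHz.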


In detail:
Timing budget is the account of timing requirements or timing parameters necessary for a system to function properly. For synchronous systems to work, timing requirements must fit within one clock cycle. A timing-budget calculation involves many factors, including hold-time requirements and maximum operating frequency requirements. By calculating a timing budget, the limitations of conventional clocking methods can be seen.

Let's use an example for a system with standard clocking. Assume a memory controller interfacing with an SRAM. Both the SRAM and memory controller receive clock signals from the same clock source. It's assumed that clock traces are designed to match the trace delays. The relevant timing parameters are:
  • tSU (setup time) of memory controller
  • tH (hold time) of memory controller
  • tPD (propagation delay) of board trace
  • tCO (clock to output delay) of SRAM
  • tDOH (output data hold time) of SRAM
  • tSKEW (clock skew) of clock generator
  • tJIT (cycle-to-cycle jitter) of clock generator
  • tCYC (cycle time) of clock generator

The maximum-frequency calculation gives the minimum cycle time of the system if the worst-case input setup time, clock to output time, propagation delay, clock skew, and clock jitter are considered.

The minimum cycle time, and hence the maximum frequency (1/tCYC), is given by:

tCO(max, SRAM) + tPD(max) + tSU(max, CTRL) + tSKEW(max, CLK) + tJIT(max, CLK) < tCYC

The hold-time calculation verifies that the system does not output data too fast, which would violate the input hold time of the receiving device in the system. In this case, the worst-case condition occurs when the data is driven out at the earliest possible time.

The formula is given by:

tCO(min, SRAM) + tPD(min) - tSKEW(min, CLK) - tJIT(min, CLK) > tH(max, CTRL)

Now let's assume the following values for the timing parameters of our SRAM and memory controller. In this case, we will use a high-speed SRAM with a double-data-rate (DDR) interface, where data is driven by the SRAM with every rising and falling edge of the clock.


tSU = 0.5 ns
tH = 0.4 ns
tCO = 0.45 ns
tDOH* = -0.45 ns
tSKEW = ±0.2 ns
tJIT = ±0.2 ns


*tDOH <>


The minimum hold-time requirement is calculated as:

tDOH + tPD - tSKEW - tJIT > tH
-0.45 ns + tPD - 0.2 ns - 0.2 ns > 0.4 ns
-0.85ns + tPD > 0.4 ns
tPD > 1.25 ns


Assuming that the delay per inch of an FR4 board trace is 160 ps/in., the trace length from SRAM to memory controller must be at least 7.82 in. Using 1.25 ns for tPD, the maximum operating frequency is calculated below. Because the SRAM has a DDR interface, the timing budget is based on a half cycle:

tCO + tPD + tSU + tSKEW + tJIT < tCYC/2
0.45 ns + 1.25 ns + 0.5 ns + 0.2 ns + 0.2 ns < tCYC/2
2.6 ns < tCYC/2
5.2 ns < tCYC
192 MHz > fCYC


With a 7.82-in. FR4 trace length and typical timing parameters, the timing budget requirements are met for an operating frequency of up to 192 MHz. In systems that have limited board space, the 7.82-in. minimum trace length is a difficult requirement to satisfy.
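The arithmetic of this worked example can be re-run with a short Python sketch. The numbers are the ones given in the example; the function names are made up here, and everything is kept in integer picoseconds only to avoid floating-point rounding noise.

```python
# Timing parameters from the worked example, in picoseconds
T_SU, T_H, T_CO, T_DOH = 500, 400, 450, -450
T_SKEW, T_JIT = 200, 200
FR4_DELAY_PS_PER_IN = 160

def min_trace_delay_ps():
    """Smallest tPD that satisfies tDOH + tPD - tSKEW - tJIT > tH."""
    return T_H - T_DOH + T_SKEW + T_JIT

def max_frequency_mhz(t_pd_ps, edges_per_cycle=2):
    """Highest clock frequency whose budget still fits.
    edges_per_cycle=2 models the DDR half-cycle budget."""
    budget_ps = T_CO + t_pd_ps + T_SU + T_SKEW + T_JIT
    t_cyc_ps = budget_ps * edges_per_cycle
    return 1e6 / t_cyc_ps  # 1/ps is THz, so scale to MHz

t_pd = min_trace_delay_ps()
print(t_pd)                                # 1250
print(t_pd / FR4_DELAY_PS_PER_IN)          # 7.8125
print(round(max_frequency_mhz(t_pd), 1))   # 192.3
```

The 7.8125 in. result is what the text rounds up to 7.82 in., since the trace must be at least that long.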


If it isn't possible to introduce a trace delay, the memory controller can satisfy the hold-time requirement by using a delay-locked loop/phase-locked loop (DLL/PLL) to phase-shift the clock signal to capture data at an earlier time. The memory controller will have to resynchronize captured data with the system clock. Using this method will introduce additional PLL/DLL jitter, which decreases the system's maximum operating frequency. With the added delay of the PLL, the minimum hold-time requirement becomes:

tDOH + tPD(trace) + tPLL/DLL_DELAY - tSKEW - tJIT > tH

and the maximum-frequency budget, which now includes the PLL/DLL jitter, becomes:

tCO + tPD + tSU + tSKEW + tJIT + tJIT_PLL/DLL < tCYC/2


Clock skew, clock jitter, and trace propagation delay can significantly limit system performance, even with the fastest SRAMs and ASICs/FPGAs available.


As mentioned earlier, the trace delay is approximately 160 ps/in. if an FR4 board is used. This is a significant number considering how the data-valid window at high frequencies has become 2 ns (e.g., for a 250-MHz, double-data-rate (DDR) device) and lower. Skew between the clock signals can also significantly reduce timing margins. We shall see that source-synchronous clocks can significantly reduce propagation delay, skew, and jitter, making timing closure more attainable.


Ways to increase frequency of operation


  • Check the critical path and optimize it.
  • Add more timing constraints (over-constrain).
  • Pipeline the architecture to the maximum possible extent, keeping latency requirements in mind.

When are DFT and Formal verification used?


DFT:
  • Detects manufacturing defects such as stuck-at-"0" or stuck-at-"1" faults.
  • Checks that the set of rules laid down at the initial design stage has been followed.

Formal verification:

  • Verification of the operation of the design, i.e., to see if the design follows the spec.
  • gate netlist == RTL (equivalence checking)
  • Uses mathematical proof, rather than simulation vectors, to check for equivalence.
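To make the "gate netlist == RTL" idea concrete, here is a toy Python sketch. It is only an illustration: both models are hypothetical, and it enumerates every input vector, whereas real equivalence checkers rely on techniques such as BDDs or SAT solving to avoid enumeration.

```python
from itertools import product

def rtl_model(a, b, sel):
    """Reference behaviour: 2-to-1 mux, picks b when sel is 1, else a."""
    return b if sel else a

def gate_model(a, b, sel):
    """Hypothetical synthesized netlist: (a AND NOT sel) OR (b AND sel)."""
    return (a & (1 - sel)) | (b & sel)

def equivalent(f, g, n_inputs):
    """Compare two combinational functions over all 2**n input vectors."""
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n_inputs))

print(equivalent(rtl_model, gate_model, 3))  # True
```

Because every input combination is covered, a mismatch cannot hide the way it can with a limited set of simulation vectors.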

Adv and DisAdv of Gated Clocks


Advantages:
  1. Used to save power by masking the clock to the flops.
  2. Used in clock switching circuits.
  3. Reduces routing burden and area to some extent.
  4. Example: Suppose there are 8 loadable D flops (DffL) with a common load signal. We can replace all of them with simple D flops (Dff) and a clock-gating circuit. This reduces the routing effort for the load signal to all the flops. The area saved is 8*(DffL - Dff), less the area of the added clock gate.

Disadvantages:

  1. There must not be any glitch on the gating signal, and the gating signal should transition only during the clock's inactive level.
  2. For DFT, the gating signal must be forced to a value that keeps the clock active during DFT testing.
  3. Introduces delay on the clock line.
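The glitch concern in item 1 can be seen in a small step-by-step Python simulation. This is a sketch under assumptions: the waveforms are invented, and the latch-based gate (enable passed through a latch that is transparent while the clock is low) is a commonly used technique that these notes do not describe themselves.

```python
def clock(t):
    """Square wave: high for two time steps, then low for two."""
    return 1 if t % 4 < 2 else 0

def and_gated(clk, en):
    """Naive gating: gated clock = clk AND en at every time step."""
    return [c & e for c, e in zip(clk, en)]

def latch_gated(clk, en):
    """Latch-based gating: the enable is only sampled while clk is low,
    so it cannot change the gated clock while clk is high."""
    out, held = [], 0
    for c, e in zip(clk, en):
        if c == 0:
            held = e          # latch is transparent during the inactive level
        out.append(c & held)
    return out

clk = [clock(t) for t in range(8)]
en = [0, 1, 1, 1, 1, 0, 0, 0]   # enable changes while clk is high (bad timing)

print(clk)                    # [1, 1, 0, 0, 1, 1, 0, 0]
print(and_gated(clk, en))     # [0, 1, 0, 0, 1, 0, 0, 0]
print(latch_gated(clk, en))   # [0, 0, 0, 0, 1, 1, 0, 0]
```

The plain AND gate produces two half-width pulses, while the latch-based version produces one clean full-width pulse.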


Setup and Hold times


The setup time is the time the data inputs must be valid before the clock/strobe signal.
  • tSU(chip-pin)= tSU(FF) - Tdelay_clk_min(chip-pin to FF-pin) + Tdelay_data_max(chip-pin to FF-pin)

The hold time is the time the data must remain valid after the clock/strobe signal.

  • Between flops, hold is met when clk2Q + Tcomb >= tH(FF) + T(clk-skew), where T(clk-skew) = clk diff b/w source and destination flops. If source sees clk at X and destination flop sees clk at Y, T(clk-skew) = Y-X
  • tH(chip-pin)= tH(FF) + Tdelay_clk_max(chip-pin to FF-pin) - Tdelay_data_min(chip-pin to FF-pin)

A zero setup time means that the time for the data to propagate within the component and load into the latch is less than the time for the clock to propagate and trigger the latch.

A zero hold time means either that the moment the clock is asserted, the latch no longer looks at its inputs, or that the clock path delay is shorter than the data path delay.

A negative setup or hold time means that there is an even larger difference in path delays, so that even if the data is sent later than the clock (for setup time), it still arrives at the latch first.
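How a pin-level setup time becomes zero or negative follows directly from the tSU(chip-pin) relation given earlier in this section. A minimal Python sketch, with a made-up function name and made-up delay values:

```python
def pin_setup_ns(tsu_ff, clk_delay_min, data_delay_max):
    """tSU(chip-pin) = tSU(FF) - Tdelay_clk_min + Tdelay_data_max, in ns."""
    return tsu_ff - clk_delay_min + data_delay_max

# Hypothetical flop with 0.5 ns setup and a 1.0 ns data path from the pin
print(pin_setup_ns(0.5, 1.0, 1.0))   # 0.5   equal clock and data delays
print(pin_setup_ns(0.5, 1.5, 1.0))   # 0.0   zero setup time at the pin
print(pin_setup_ns(0.5, 2.0, 1.0))   # -0.5  negative setup time at the pin
```

The longer the internal clock path is relative to the data path, the later the data may arrive at the pin and still be captured.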

Manufacturers avoid specifying negative values since this restricts later design and manufacturing decisions, but they often specify zero values since this simplifies usage in a system.

Wire load models


Wire load models contain all the information required by compile to estimate interconnect wiring delays. A typical wire load model definition contains: area, resistance, capacitance, slope and fanout. Area, resistance and capacitance are given per unit length of wire. The slope value is used to extend the wire length linearly for fanouts beyond the listed entries.

Generally, wire load models are used in ASIC design. They contain statistical values which are used in pre-layout simulation of the ASIC. Since actual resistance (R) and capacitance (C) values can only be extracted in the back end, after the place and route (P&R) phase, pre-layout simulation before P&R has to rely on these estimates.

A wire load model attempts to predict the capacitance and resistance of nets in the absence of placement and routing information. The estimated net capacitance and resistance are used for delay calculation. Technology library vendors supply statistical wire load models to support estimation of wire loads based on the number of fanout pins on a net. You can set wire load models manually or automatically.
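The fanout-based estimate can be sketched in a few lines of Python. Every number and name below is hypothetical and only illustrates the idea: look up a wire length by fanout, extend it linearly with the slope beyond the table, then scale by per-unit-length R and C.

```python
# Hypothetical wire load model; all values are illustrative only
WIRE_LOAD = {
    "resistance": 0.0005,   # kOhm per unit length
    "capacitance": 0.002,   # pF per unit length
    "slope": 2.0,           # extra length per fanout beyond the table
    "fanout_length": {1: 1.0, 2: 2.5, 3: 4.0, 4: 6.0},
}

def estimated_length(model, fanout):
    """Estimated net length: table lookup, linear extrapolation past the table."""
    table = model["fanout_length"]
    if fanout in table:
        return table[fanout]
    max_fanout = max(table)
    return table[max_fanout] + model["slope"] * (fanout - max_fanout)

def estimated_rc(model, fanout):
    """Estimated (resistance, capacitance) of a net with the given fanout."""
    length = estimated_length(model, fanout)
    return length * model["resistance"], length * model["capacitance"]

print(estimated_length(WIRE_LOAD, 3))   # 4.0
print(estimated_length(WIRE_LOAD, 6))   # 10.0
```

The estimated R and C then feed the delay calculation until real extracted values are available after P&R.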

Why interrupts are active low?


If you consider the transistor level of a module, the capacitance at the output terminal gets charged on a low-to-high transition and discharged on a high-to-low transition.

When the signal goes from high to low, it depends on the pull-down resistor that pulls it down, and it is relatively easier for the output capacitance to discharge than to charge. Hence people prefer using active-low signals.

Slack


Slack is defined as the difference between the required arrival time of a signal and its actual arrival time.

It should always be >= zero.

It is also defined as the difference between the clock period and the total path delay from one flop to the next, which includes the clock-to-Q delay of the source flop, the total combinational delay between the flops, and the setup time of the destination flop.

Slack-related problems arise on the critical paths of the design, i.e., the maximum-delay paths: clock-to-Q delay + propagation delay + setup time of the destination flop.

In any design, slack should never be negative. Negative slack means there is a timing violation (a setup or hold requirement is not met), and it becomes difficult to achieve the required frequency.
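A minimal Python sketch of the setup-slack definition above; the function name and the delay values are made up for illustration.

```python
def setup_slack_ns(clock_period, t_clk2q, t_comb, t_setup):
    """Slack = required arrival time - actual arrival time (all in ns)."""
    required = clock_period - t_setup   # data must arrive by this time
    arrival = t_clk2q + t_comb          # when the data actually arrives
    return required - arrival

# Hypothetical path: 0.5 ns clock-to-Q, 3.5 ns logic, 0.5 ns setup
print(setup_slack_ns(5.0, 0.5, 3.5, 0.5))   # 0.5   timing met at 200 MHz
print(setup_slack_ns(4.0, 0.5, 3.5, 0.5))   # -0.5  violation at 250 MHz
```

The same path passes or fails depending only on the clock period, which is why negative slack limits the achievable frequency.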

Polysilicon Vs Metal


Normally polysilicon has higher resistance than metal. For shorter distances we go with polysilicon, keeping the fabrication process in mind.

Using metal for short distances needs contacts, which increase the resistance significantly.
Poly has a higher melting point and can withstand the high-temperature phases that follow gate formation. So poly is preferred to metal, although it has higher resistivity.

NAND or NOR design


NAND is a better gate for design than NOR because, at the transistor level, the mobility of electrons is normally three times that of holes. A NAND gate puts its NMOS transistors in series, whereas a NOR gate puts the slower PMOS transistors in series, and thus the NAND is the faster gate.

Additionally, the gate leakage in NAND structures is much lower. If you consider the t_phl and t_plh delays, you will find that the delay profile is more symmetric for NAND. For NOR, one delay is much higher than the other (t_plh is the higher one, since the higher-resistance PMOS transistors are connected in series, which increases the resistance further).
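The asymmetry can be put into rough numbers with a small Python sketch. It assumes equal-size transistors, a PMOS on-resistance three times that of an NMOS (following the mobility ratio quoted above), and worst-case single-transistor conduction in the parallel network; the normalised values are illustrative only.

```python
R_N = 1.0   # on-resistance of one NMOS (normalised)
R_P = 3.0   # on-resistance of an equal-size PMOS, assuming 3x lower mobility

def nand2_resistances():
    """Return (pull_up, pull_down) resistance of a 2-input NAND."""
    pull_down = 2 * R_N   # two NMOS in series
    pull_up = R_P         # worst case: one of two parallel PMOS conducting
    return pull_up, pull_down

def nor2_resistances():
    """Return (pull_up, pull_down) resistance of a 2-input NOR."""
    pull_down = R_N       # worst case: one of two parallel NMOS conducting
    pull_up = 2 * R_P     # two PMOS in series
    return pull_up, pull_down

print(nand2_resistances())   # (3.0, 2.0)
print(nor2_resistances())    # (6.0, 1.0)
```

Since t_plh scales with the pull-up resistance and t_phl with the pull-down resistance, the NAND comes out at 3:2 while the NOR comes out at 6:1.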

FPGA & ASIC based design


The main difference between ASIC and FPGA based design is in the back end.
In FPGAs there is not much activity in the back end.

FPGA flow:
SPECIFICATION -> RTL DESIGN -> FUNCTIONAL SIMULATION -> SYNTHESIS -> TRANSLATION -> MAPPING -> PLACE & ROUTE -> BITSTREAM GENERATION (BITGEN) -> DOWNLOAD TO THE CHIP.

ASIC flow:
SPECIFICATION -> RTL DESIGN -> FUNCTIONAL SIMULATION -> SYNTHESIS -> EXTRACT RC VALUES -> DRC, LVS,etc., -> LIBRARY VENDOR SPECIFIC FILE FORMAT

Default paths and False paths


A path in a digital circuit that is not associated with a clock is known as a default path.

When identifying and calculating paths, we take into consideration the input point and the output point. Input points are usually clock pins and input ports. Output points are D inputs and output ports.

A false path is a logic path in the design that exists but should not be analyzed for timing. For example, a path can exist between two multiplexed logic blocks that are never selected at the same time, so that path is not valid for timing analysis. Declaring a path to be false removes all timing constraints from the path.

Another example of a false path is a path between flip-flops belonging to two clock domains that are asynchronous with respect to each other.

e.g., a scan multiplexer.

If false paths are not specified, synthesis spends more time optimizing logic that does not need it.

http://www.vlsichipdesign.com/static%20timing%20analysis.html

Coarse and Fine grained architectures


Coarse-grained architectures consist of fairly large logic blocks, often containing two or more look-up tables and two or more flip-flops. In a majority of these architectures, a four-input look-up table (think of it as a 16x1 ROM) implements the actual logic. The larger logic blocks usually correspond to improved performance.
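The "16x1 ROM" view of a four-input look-up table can be sketched in Python. The helper names and the example logic function are made up; the point is only that the four inputs form an address into 16 stored bits.

```python
def make_lut4(func):
    """Build the 16-entry truth table (the 'ROM contents') for a 4-input function."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]

def lut4_read(rom, a, b, c, d):
    """Read the output: the four inputs together form the ROM address."""
    return rom[(a << 3) | (b << 2) | (c << 1) | d]

# Hypothetical logic function: (a AND b) XOR (c OR d)
rom = make_lut4(lambda a, b, c, d: (a & b) ^ (c | d))

print(len(rom))                      # 16
print(lut4_read(rom, 1, 1, 0, 0))    # 1
print(lut4_read(rom, 1, 1, 1, 0))    # 0
```

Any function of four inputs fits in the same table, which is why the LUT is a convenient building block.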

Fine-grained architectures consist of simple basic cells (OR, AND, and NOT).