Showing posts with label FMEA. Show all posts

Why your first design will not work in the field

“Everyone should get a lecture on why their first design will not work in the field.” Here are a few of the primary reasons why getting a single design to work correctly for a few minutes in a lab is much easier than getting thousands of systems to work correctly for months at a time in dozens of countries around the world.

1. Did you forget to force your “unreachable” states to transition to an initial (reset) state? Clock glitches, power surges, radiation, strong electromagnetic interference, etc. will occasionally cause your system to jump to a state that is not defined. When this happens, your design should reset itself rather than crash or generate illegal outputs.

2. Do you have internal registers that you cannot access or test? If you can set a register, you must have some way of reading it from outside the chip. In many cases inaccessible or stale registers cause unexplained system behavior that cannot be debugged, and only a full system reset can return the system to a sane state.

3. Is there any chip in the system that controls your chip? That other chip may be buggy. You should be able to disable or override all of your external control lines, so that you can isolate the source of the problem.

4. Not enough decoupling capacitors on your board? The analog world is cruel and very unusual. Voltage spikes, current surges, crosstalk, etc. can all corrupt the integrity of digital signals. Trying to save a few cents on decoupling capacitors can cause headaches and significant financial costs in the future.

5. Did you only test your system in the lab, not in the real world? A product needs to run for months in the field before it has encountered the full range of known and unknown issues. Simulation and simple lab testing won’t catch all of the weirdness of the real world; an extended field run is the real stress test.

6. Did you not adequately test the corner cases and boundary conditions? Every corner case is as important as the main case. Even if some weird event happens only once every six months, if you do not handle it correctly, the bug can still make your system unreliable, unusable and, of course, unsellable.
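
Point 1 above can be illustrated with a small behavioral sketch. It is written in Python purely for illustration (a real design would express this in an HDL), and the state names, the 2-bit encoding and the transition table are all hypothetical. It models a state register in which one encoding, 0b11, is undefined, and shows the next-state logic sending any undefined encoding back to the reset state.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical 2-bit state encoding; 0b11 is deliberately left undefined.
    RESET = 0b00
    IDLE = 0b01
    BUSY = 0b10


# Hypothetical transition table: (current state, start input) -> next state.
TRANSITIONS = {
    (State.RESET, False): State.IDLE,
    (State.RESET, True): State.IDLE,
    (State.IDLE, False): State.IDLE,
    (State.IDLE, True): State.BUSY,
    (State.BUSY, False): State.IDLE,
    (State.BUSY, True): State.BUSY,
}


def next_state(raw_state: int, start: bool) -> State:
    """Compute the next state from the raw value held in the state register."""
    try:
        current = State(raw_state)
    except ValueError:
        # The register holds an undefined encoding (for example after a
        # glitch), so recover by going to the reset state.
        return State.RESET
    return TRANSITIONS[(current, start)]
```

The equivalent in hardware is a default branch in the next-state logic that drives the machine to reset, so the "unreachable" encoding is handled like any other state.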

FMEA - Failure Mode and Effects Analysis - Part 1

Failure Mode and Effects Analysis (FMEA):
A procedure for analyzing potential failure modes within a system, in order to classify them by severity or determine each failure's effect upon the system. It is widely used in the manufacturing sector in various phases of the product life cycle. Failure causes are any errors or defects in process, design, or item, especially ones that affect the customer, and can be potential or actual. Effects analysis refers to studying the consequences of those failures.

In this article we will see why we need FMEA in the semiconductor industry and how it can help save costs and retain customers while protecting the company's bottom line.

History & Background:
Failure mode: The manner by which a failure is observed; it generally describes the way the failure occurs.
Failure effect: The immediate consequences a failure has on the operation, function, or status of some item.
Local effect: The Failure effect as it applies to the item under analysis.
Next higher level effect: The Failure effect as it applies at the next higher indenture level.
End effect: The failure effect at the highest indenture level or total system.
Failure cause: Defects in design, process, quality, or part application, which are the underlying cause of the failure or which initiate a process which leads to failure.
Severity: The consequences of a failure mode. Severity considers the worst potential consequence of a failure, determined by the degree of injury, property damage, or system damage that could ultimately occur.
Indenture levels: An identifier for item complexity. Complexity increases as the levels get closer to one.

The FMEA process was originally developed by the US military in 1949 to classify failures "according to their impact on mission success and personnel/equipment safety". FMEA was later used on the 1960s Apollo space missions. In the 1980s it was used by the Ford Motor Company to reduce risks after one model of car, the Pinto, suffered a design flaw that failed to prevent the fuel tank from rupturing in a crash, leading to the possibility of the vehicle catching fire.


In FMEA, failures are prioritized according to how serious their consequences are, how frequently they occur and how easily they can be detected. An FMEA also documents current knowledge and actions about the risks of failures, for use in continuous improvement. FMEA is used during the design stage with the aim of avoiding future failures. Later it is used for process control, before and during ongoing operation of the process. Ideally, FMEA begins during the earliest conceptual stages of design and continues throughout the life of the product or service.

The purpose of FMEA is to take actions to eliminate or reduce failures, starting with the highest-priority ones. It may be used to evaluate risk management priorities for mitigating known threat-vulnerabilities. FMEA helps select remedial actions that reduce cumulative impacts of life-cycle consequences (risks) from a systems failure (fault).

It is used in many formal quality systems such as QS-9000 or ISO/TS 16949. The basic process is to take a description of the parts of a system, and list the consequences if each part fails. In most formal systems, the consequences are then evaluated by three criteria and associated risk indices:

  • Severity (S),
  • Likelihood of occurrence (O), also often known as probability (P), and
  • Inability of controls to detect it (D)

A simple FMEA scheme would be to have three indices, each ranging from 1 (lowest risk) to 10 (highest risk). For Detection, 1 means the control is absolutely certain to detect the problem and 10 means the control is certain not to detect the problem (or no control exists). The overall risk of each failure is then called the Risk Priority Number (RPN) and is the product of the Severity (S), Occurrence (O), and Detection (D) rankings: RPN = S × O × D. The RPN (ranging from 1 to 1000) is used to prioritize all potential failures and to decide upon actions that reduce the risk, usually by reducing the likelihood of occurrence and improving controls for detecting the failure.
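
As a minimal sketch of the RPN = S × O × D scheme, the Python below scores a few failure modes and sorts them by priority. The class name, the helper function, the failure mode names and all of the rankings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One FMEA worksheet row, with each ranking on a 1 to 10 scale."""
    name: str
    severity: int    # S: 1 = least serious, 10 = most serious
    occurrence: int  # O: 1 = least likely, 10 = most likely
    detection: int   # D: 1 = certain to be detected, 10 = certain not to be

    def __post_init__(self):
        rankings = (("severity", self.severity),
                    ("occurrence", self.occurrence),
                    ("detection", self.detection))
        for label, value in rankings:
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        # Risk Priority Number: RPN = S x O x D, ranging from 1 to 1000.
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative rankings only; real values come out of the team's FMEA sessions.
modes = [
    FailureMode("Stale, unreadable register", severity=5, occurrence=2, detection=9),
    FailureMode("FSM enters an undefined state", severity=8, occurrence=3, detection=6),
    FailureMode("Supply noise corrupts a signal", severity=6, occurrence=5, detection=4),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these made-up rankings the RPNs come out as 144, 120 and 90, so the undefined-state failure would be addressed first.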


If used as a top-down tool, FMEA may only identify major failure modes in a system. Fault tree analysis (FTA) is better suited for "top-down" analysis. When used as a "bottom-up" tool, FMEA can augment or complement FTA and identify many more causes and failure modes resulting in top-level symptoms. However, FMEA is not able to discover complex failure modes involving multiple failures within a subsystem, or to report expected failure intervals of particular failure modes up to the upper level subsystem or system.

Additionally, the multiplication of the severity, occurrence and detection rankings may result in rank reversals, where a less serious failure mode receives a higher RPN than a more serious failure mode. The reason for this is that the rankings are ordinal scale numbers, and multiplication is not a valid operation on them. The ordinal rankings only say that one ranking is better or worse than another, but not by how much. For instance, a ranking of "2" may not be twice as bad as a ranking of "1," or an "8" may not be twice as bad as a "4," but multiplication treats them as though they are. See Level of measurement for further discussion.
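
The rank reversal described above is easy to reproduce. The short Python sketch below uses two hypothetical failure modes with made-up rankings: the first is far more severe, yet the second ends up with the higher RPN.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number as the plain product of the three rankings."""
    return severity * occurrence * detection


# Hypothetical rankings chosen only to illustrate a rank reversal.
critical_mode = {"severity": 10, "occurrence": 2, "detection": 2}
moderate_mode = {"severity": 4, "occurrence": 4, "detection": 4}

critical_rpn = rpn(**critical_mode)  # 10 * 2 * 2 = 40
moderate_rpn = rpn(**moderate_mode)  # 4 * 4 * 4 = 64

# The less severe mode outranks the more severe one when sorted by RPN alone.
reversal = (moderate_mode["severity"] < critical_mode["severity"]
            and moderate_rpn > critical_rpn)
print(critical_rpn, moderate_rpn, reversal)
```

One possible mitigation, offered here as a suggestion rather than something the FMEA procedure prescribes, is to review high-severity items separately instead of relying on the RPN ordering alone.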

Application in the semiconductor industry:

Why use FMEA?

  1. If the products come back, customers will not.
  2. FMEA checks if the product meets requirements before Tape Out or Delivery.
  3. FMEA saves on R&D and re-design costs.
Which type of FMEA?
  1. System FMEA
    • Customer interface - Talk to the customer and get as much info as possible about the product in question.
    • Project resources - Analyze the resources available at your disposal for execution and delivery.
    • Architecture - Study the architecture as thoroughly as possible, ideally in a group brainstorming session.
  2. Design or Technology
    • Mechanical or Electrical
    • Technology
    • Software
  3. Fab or Assembly
    • Mechanical
    • Electrical
    • Testing
What is a failure?
  1. A re-design
  2. A failed test case
  3. Any crash
  4. All errata sheets
  5. All bugs
  6. All change requests
  7. All field failures
  8. All FARs & RMAs
  9. All test programs that did not work right the first time.
  10. Any project started without requirements defined by Marketing or not accepted by the customer.
Teams & Classification:
  1. Facilitator - The Quality Engineer
  2. Project Group - Architect, Programmers, RTL designers, Test, Quality, Marketing
  3. Support Group - Representatives from Test Departments, AEs, Sales.
Be Specific:
  1. Define scope of the FMEA - Make a picture
    • Serial IO
    • Refer to a block diagram of the product
    • Provide customer & market requirement docs by version
    • Provide BOM list for the product
    • Ensure the team includes all cross functional departments for efficiency.
  2. Create proper visual definitions for the scope of FMEA - This would involve extensive whiteboard sessions!
  3. Start with the top 3 riskiest and the newest features added. Be as specific as possible by using a product brief, a data sheet, etc.
To be continued....