Magnetic Certification 101

Introduction

Magnetic certification is a test that inspects the media surface for magnetic defects: voids or "holes" in the media where the magnetic field is not within specification to hold a "bit" of information. By definition, a disk is "certified" to a certain grade of magnetic errors, ranging from error free (the best grade) to grades that contain a certain count of errors.

In media testing, there are four key aspects of Magnetic Certification:

- Sensitivity
- Density
- Accuracy
- Throughput

Sensitivity is a combination of two parameters: the missing pulse clip level (measured in %) and the read element dimensions (particularly the width). If a pulse in the read-back signal drops below the clip level, it is regarded as a "missing pulse" (also known as MP, missing bit, or bad bit). A magnetic defect may cause an MP for a read head with a smaller read width, but not for one with a larger read width. Also see Critical Defect Size (CDS) below.

Density is related to the coverage area to be tested. It is measured in kfci (kilo flux changes per inch) in the circumferential direction and in ktpi (kilo tracks per inch) in the radial direction. For each track, a higher kfci means the track is sampled more often, since for each flux change only the transition area is actually tested. If a defect lies between two transitions, it does not affect the read-back signal the way it would if it lay on a transition. The potential total number of tracks on a surface depends on the size of the data area and the width of the read element. Total surface coverage for defect testing usually requires some amount of overlap, so the number of tested tracks is slightly higher than the number of actual data tracks. Track sampling is when fewer tracks are tested than there are actual data tracks.

Accuracy represents repeatability and reproducibility (often referred to as "R&R"). The aim of R&R is to reduce the spread of the test results when testing the same disk. When the same disk is tested repeatedly on the same tester, the ideal outcome is identical test results for all runs (repeatability). Similarly, when the same disk is tested on another tester, it should produce the same result as on the first tester (reproducibility). There are two main issues for accuracy: the tester system's noise and test head variability.

Throughput is measured in disks tested per unit of time. It is often quoted as disks per hour or disks per day, either per tester or per workcell (usually one robot arm serving up to 8 test spindles).

Test Methodologies

There are two distinct test methods for Magnetic Certification:

- Track-and-Settle
- Spiral

Track-and-Settle uses the same head to write and to read. A position carriage then moves the head to another track and the process continues.

For Spiral, separate write and read heads are used. Two separate carriages carrying the write and read heads move in synchronous spiral motion to scan the surface for defects. However, present implementations of spiral only do a cursory check for bad pulses during the spiral and stop the spiral motion once a bad pulse is detected; a standard track test is then performed for errors. See here for a more detailed discussion on Track and Spiral testing.

Basic Concepts

- Missing Pulse & Super Pulse
- Error
- Higher Level Errors
- MHz - Clock Rate & Processing Rate
- Critical Defect Size

Missing Pulse & Super Pulse

Covered concepts: Threshold, AGC, Missing Pulse, Super Pulse, Extra Pulse.

Figure: Missing Pulse Detection. Example of some MPs from an error channel with two simultaneous thresholds (65% and 85%). For a classic error channel, only the non-"FF" bytes are sent to the error processor, together with "n", the byte count since index. For a real-time error channel, the continuous data stream is sent.

The missing pulse is the fundamental result of magnetic certification. It occurs when a read-back signal pulse drops below a predefined threshold (expressed as a percentage). Depending on the region of the disk under test and on varying parameters, the signal amplitude (the 100% level) is not constant throughout the test. Only a recent group of pulses is used to determine the full 100% level. This "window" of pulses is also known as "Automatic Gain Control" or AGC. A typical AGC window is about 10 microseconds. For a 50 MHz signal (100 MFlux), this is about 500 pulses, so essentially the average of the last 500 pulses determines the 100% level. If the latest pulse measures only 75% with the clip level set at 76%, the pulse is considered a bad pulse, or missing pulse (MP, missing bit).

A Super Pulse (SP) is the opposite of an MP: it occurs when the pulse is larger than the set threshold, and the threshold for detecting an SP is above 100%. Detection of SPs can often be done by the same circuitry as MP detection, but with the comparator reprogrammed for the opposite polarity.

Although similar, an SP is not an Extra Pulse (EP). An EP is a signal that is still present after the disk (track) has been erased. The threshold for an EP is above 0% (typical usage is between 5 and 10%). Due to the increased test time (an EP scan takes the same amount of time as an MP scan) and limited benefits, EP tests are no longer used.

Error

Covered concepts: error processor, error category, single bit error, bit length, boundary condition, correctable/uncorrectable bit-lengths.

A missing pulse (MP) is not necessarily an error. When an MP is located, its location in the signal stream is identified and sent to the error processor. Depending on the user-defined definition (category) of the error, the MP is compared to previously flagged errors. Typically, if the number of consecutive MPs exceeds a certain count, an error is qualified and counted. The error processor usually has multiple error categories running in parallel as it processes an MP.

The basic descriptor of an error category includes: (i) Minimum Length, (ii) Maximum Length, (iii) Boundary. The measurements are in "bit length", or bit count, determined by the data clock. The Minimum Length is how many consecutive bits must be MPs to be considered an error. Similarly, the Maximum Length is the longest allowed run of consecutive bad bits. The Boundary is how many consecutive "good bits" must occur before the error is split into two errors; with fewer good bits than that, it is still considered one error.

In summary, an error usually occurs when multiple MPs are found in close proximity to each other, meeting the criteria of a predefined error category. In its strictest setting, the category can be defined to treat a single MP as an error (a single-bit error). Usually the categories are configured for various lengths, to track small, medium, and very large errors.

Note: The legacy Proquip MG has only two categories for its error processor, named "correctable" (shorter bit-length errors) and "uncorrectable" (longer bit-length errors).

Higher Level Errors

Covered concepts: track defects TD (n), tracks with errors TE (n), sector with errors SE (n), pattern grading, error grouping, scratches, patches, bit-length distribution.

At the end of the test, the error data for the surface can be processed as a whole to yield higher level errors. The most common is the Track Defect (n), or TDn, where n is a count of how many errors occur on the same track. For example, a TD4 is a track with 4 errors.
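The error-category descriptor and the per-track TD count described so far can be sketched together as follows. This is a minimal illustration, not any tester's actual error processor: the `Category` class, the function names, and the choice to measure an error's length as the span from its first to its last MP are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Category:
    """Hypothetical error-category descriptor; all values are bit lengths."""
    name: str
    min_len: int   # shortest span that still counts as an error
    max_len: int   # longest span this category accepts
    boundary: int  # consecutive good bits that split one error into two


def group_errors(mp_bits, boundary):
    """Group MP bit positions into (start, end) spans using the boundary rule."""
    spans = []
    for pos in sorted(mp_bits):
        # Number of good bits between the previous MP and this one.
        if spans and pos - spans[-1][1] - 1 < boundary:
            spans[-1][1] = pos           # too few good bits: same error
        else:
            spans.append([pos, pos])     # enough good bits: new error
    return [(start, end) for start, end in spans]


def categorize(mp_bits, cat):
    """Return the spans whose length (assumed: end - start + 1) fits the category."""
    return [(s, e) for s, e in group_errors(mp_bits, cat.boundary)
            if cat.min_len <= e - s + 1 <= cat.max_len]


def track_defects(mps_by_track, cat):
    """TDn per track: the number of qualified errors found on each track."""
    return {track: len(categorize(bits, cat))
            for track, bits in mps_by_track.items()}


small = Category("small", min_len=2, max_len=8, boundary=3)
mps_by_track = {0: [10, 11, 12, 40, 42, 100], 1: [], 2: [5, 6]}
print(track_defects(mps_by_track, small))
```

On track 0 the lone MP at bit 100 is grouped on its own and rejected by the minimum length, while bits 40 and 42 are joined because only one good bit separates them. A real error processor would run several such categories in parallel on the same MP stream.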
Similarly, TE is a count of how many tracks have errors. For example, if there are 100 errors on the surface but the TE is 25, all the errors are distributed over only 25 tracks. Similar to TE is sectors with errors, SE. The disk is divided into a number of sectors (wedges, typically 1024), and the errors are grouped by sector. For example, if there are 100 errors and the SE is 75, the errors are distributed over 75 different sectors.

Error grouping is similar to the boundary concept of an error category. If an error is in close proximity to the next error, the two can be "lumped" together and counted as only one error.

Pattern grading represents the most advanced concept in error processing. As its name indicates, a "scratch" is when the errors are lined up next to each other "geographically". Scratches are not necessarily straight lines (radial or circumferential); they can go in any direction, often with a specific "curvature". Other common patterns are groups or patches of errors in a small region. For example, a group of errors may appear in a 2 cm round region, or a patch of errors may appear on one quadrant of the disk.

A bit-length distribution is a series of bit-length bins, each holding the count of MPs with that particular error length. This is done at the track level as well as at the zone (specific radii range) and surface level.

Credence & Retry

Covered concepts: Trigger C&R, Full C&R

Credence and Retry (C&R) is a much misunderstood concept. Wrong usage can have the opposite of the intended result. The basis for C&R is a statistical effort to increase the probability that a detected error is real (as opposed to just noise). Since noise can either suppress or enhance a pulse, a comparator can falsely indicate that the pulse is missing (or super, depending on the desired detection). Most implementations of C&R are meant to prevent failing good disks, NOT to prevent passing bad disks.

The Credence number is the count of measurements in which the condition (such as an MP) must be met for the result to be valid; Retry is the total number of measurements. For example, a C&R of 4/5, or Credence=4/Retry=5, is a setting where the condition must be met at least 4 out of 5 times to be valid. There are two implementations, Trigger and Full; Trigger is the most widely used.

- Trigger C&R: The condition (such as an MP) must be met on the first measurement to trigger subsequent retries. For error testing this means the first measurement must produce an error for subsequent retries to take place. If a tested track results in no errors, the test immediately continues to the next track.
- Full C&R: The full number of measurements (retries) is done, even if the first measurement results in an unmet condition. Advanced implementations of Full C&R can skip any extra retries if the credence is met early.
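The difference between the two implementations above can be sketched as follows. This is a minimal illustration only: `measure` is a hypothetical callable that performs one measurement and returns `True` when the condition (such as an MP) is met, and `scripted` is a helper invented here to replay a fixed sequence of results.

```python
def trigger_cr(measure, credence, retry):
    """Trigger C&R: retries happen only if the first measurement meets the condition."""
    if not measure():
        return False              # nothing seen on the first pass: move on
    hits = 1
    for _ in range(retry - 1):
        hits += bool(measure())
    return hits >= credence


def full_cr(measure, credence, retry):
    """Full C&R: measure up to `retry` times, stopping early once credence is met."""
    hits = 0
    for _ in range(retry):
        hits += bool(measure())
        if hits >= credence:
            return True           # credence met early: skip remaining retries
    return False


def scripted(results):
    """Hypothetical stand-in for a tester: replays a fixed list of outcomes."""
    it = iter(results)
    return lambda: next(it)


# A real defect that noise happens to mask on the first pass only.
masked_once = [False, True, True, True, True]
print("Trigger:", trigger_cr(scripted(masked_once), credence=4, retry=5))
print("Full:   ", full_cr(scripted(masked_once), credence=4, retry=5))
```

With a 4/5 setting, the Trigger variant never retries because the first pass saw nothing, while the Full variant still collects four hits out of five and validates the error.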

Since noise happens at random, Trigger C&R skews the results toward minimizing over-rejections (failing good disks) instead of minimizing escapes (passing bad disks). The idea of C&R implies Full C&R, but only Trigger C&R is actually practiced.

MHz - Clock Rate & Processing Rate

MHz is a popularly misused term for the speed of a certifier. A fast clock does NOT mean faster throughput. The clock determines the writing speed; coupled with the spindle speed, it writes a series of pulses with a given linear density on the disk (kilo flux changes per inch: kfci). The true measurement of speed is the processing speed of the error processing unit (EPU). During the read-back process, depending on the threshold (ultimately the size of the magnetic event), many of the written pulses can be classified as "missing"; the rate at which these missing bits are categorized into errors is the true speed of the certifier. The processing speed is often many orders of magnitude lower than the clock. It is not unusual to see a 150 MHz (300 Mflux/s) system with only a 1.5 kHz (1.5 Kpulse/s) processing speed.

Categorization (error processing) is a time-consuming task. Each pulse that is flagged as an MP must be compared to potentially hundreds of previously flagged pulses, and also to hundreds of pulses to come. Test requirements also call for several categorization engines running in parallel on the same MP. It is not surprising to see EPU speeds on the order of kHz for just one category of errors. For a system with a realtime EPU, the processing is as fast as the data stream. This means a "realtime" 150 MHz clock system actually has a 300 MHz EPU.

Critical Defect Size

The CDS corresponds to the smallest detectable MP, idealized to a round shape. The diameter of this shape is the measurement of the critical defect size (CDS). It is a combination of the clip threshold and the read head width.
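As a quick numeric illustration of how these two parameters combine, the sketch below applies the rule of thumb that the CDS is roughly the "left-over" fraction of the clip level times the read width. The function name and the unit constant are illustrative choices, not part of any tester's software.

```python
MICROINCH_TO_NM = 25.4  # 1 micro-inch is 25.4 nanometers


def critical_defect_size(clip_percent, read_width_uin):
    """Approximate CDS in micro-inches: left-over clip fraction times read width."""
    return (100.0 - clip_percent) / 100.0 * read_width_uin


# Two clip levels on an assumed 10 micro-inch read element.
for clip in (65, 85):
    cds_uin = critical_defect_size(clip, 10.0)
    cds_nm = cds_uin * MICROINCH_TO_NM
    print(f"{clip}% clip: {cds_uin:.1f} micro-inch, about {cds_nm:.0f} nm")
```

Raising the clip level or narrowing the read element both shrink the CDS, which is why sensitivity depends on the pair rather than on the threshold alone.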
Due to the high aspect ratio (width over length) of a magnetic transition, a quick approximation of the CDS is the "left-over" percentage of the clip level times the read width. For example, for an MP threshold of 65% and a 10 micro-inch read element width, the CDS is 3.5 micro-inch (about 90 nanometers). Similarly, an 85% threshold represents a CDS of about 38 nm.

Advanced Concepts

- Thermal Asperity
- Defect Avalanche
- Test Head Threshold Compensation
- Categorized Triggering
- (Servo) Wedge Mode

Thermal Asperity

Covered concepts: Thermal Asperity, Baseline shift, Super Pulse.

Figure: Bit Level Thermal Asperity (TA) Detection. Example of a Thermal Asperity. Two separate comparators are used to detect the Thermal Asperity. In this case, the Super Pulse threshold is 150% and the Missing Pulse threshold is 75%. The Error Processor processes both data streams in concert to qualify the TA.

A thermal asperity (TA) occurs when the read element comes into physical contact with a defect on the disk surface. This can be loose debris or a physical defect. The impact causes the element to heat up, resulting in a sharp shift of the signal's baseline (usually the 0% position). As the element cools, the baseline gradually returns to its normal position. Typically, the larger the impact on the element, the larger the baseline shift and the longer it takes the element to cool. Depending on the system, the baseline shift can be positive or negative.

Two separate comparators are used in concert to detect and quantify the size (bit length) of the TA: one with a Super Pulse threshold (above 100%) and one with a standard Missing Pulse threshold (usually between 0 and 75%). The data clock is divided in half for each type, since every other pulse is either Super or Missing. A minimum bit-length is set as the criterion for the TA. When both the Super Pulse and the Missing Pulse thresholds are active simultaneously, a possible TA has occurred. If enough pulses (both Super and Missing) occur to satisfy the minimum bit-length, a TA is qualified.

Defect Avalanche

The Defect Avalanche (DA) represents the test threshold at which a significant number of the bits (pulses) under test become missing pulses (MPs). Ideally, all the pulses are at 100% and a threshold of 99% yields no MPs. In practice, the pulses have an average value of 100% and are normally distributed around the 100% level (some pulses are higher, some are lower).
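To make the distribution picture concrete, the sketch below estimates how many pulses would fall below a given threshold if the amplitudes were exactly normal around 100%. The standard deviation `SIGMA_PCT` and the pulse count `PULSES` are assumed values chosen only for illustration, and a real read-back signal will deviate from this idealized model.

```python
import math


def expected_mp_count(threshold_pct, sigma_pct, pulses):
    """Expected number of pulses below the threshold for normal amplitudes around 100%."""
    z = (threshold_pct - 100.0) / sigma_pct
    # Normal cumulative distribution, written with the error function.
    return pulses * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


SIGMA_PCT = 1.5      # assumed amplitude spread, in percent of the 100% level
PULSES = 1_000_000   # assumed number of pulses on the track under test

for threshold in (90.0, 93.0, 95.0, 97.0, 99.0):
    count = expected_mp_count(threshold, SIGMA_PCT, PULSES)
    print(f"threshold {threshold:5.1f}%  expected MPs {count:12.1f}")
```

The count stays near zero at low thresholds and then climbs steeply as the threshold approaches 100%; a smaller spread pushes that climb closer to 100%.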
Depending on the total number of pulses for the track under test, a count representing a number of standard deviations (for example, 3 sigma) away from 100% can be calculated. The test threshold that results in this count is the defect avalanche.

Figure: Example of a DA at 6000 MPs with a result of 95.1%.

The higher the DA for a given count, the better the system (lower noise). A usable DA(3sigma) is 90%. A DA(3sigma) of 95% and above represents the state of the art in signal to noise. A DA(3sigma) below 90% results in poor R&R (repeatability and reproducibility) and is not suitable for use. Note the sharp vertical slope once the curve reaches 95%; as a general rule of thumb, a low-noise system produces a steeper slope (closer to the ideal vertical slope).

The ability to measure DA depends on the accuracy and resolution of the comparator electronics. Accuracy of the comparator is usually measured using an ideal signal source, such as an Arbitrary Waveform Generator (AWG). Just as important as comparator accuracy is comparator resolution (incremental steps). To perform the DA test, the comparator usually needs an adjustment step of 0.01% and must produce repeatable results at 0.1% (or better).

Test Head Threshold Compensation

The main application of the Defect Avalanche test is to compensate for head variability and to reject "bad" heads. At the qualification stage for the type of heads to be used for testing, a statistically significant population of heads has its DA(3sigma) values measured on a reference disk. The desired result is a tight head population with a minimal DA(3sigma) spread. The 3 standard deviations of this population are usually within 2.0% (threshold) of the mean (center of distribution). For example, if the mean DA of all heads is found to be 95.2%, the +/- 3 sigma spread is from 93.2% to 97.2%. The low side of the population (93.2 to 95.2) is also referred to as "cold". A cold head usually has higher noise and thus a lower DA value.
Similarly, the other side is "hot". As the head ages through usage, the degradation direction is usually from hot to cold. Threshold Compensation is a calibration factor that shifts a hot or cold head toward the nominal mean value. Using the example above, a cold head with a DA value of 94.8% would have a correction value of 0.4%. This correction factor is applied to all comparators in the system.

Categorized Triggering

Triggering is typically a TTL-level signal indicating the occurrence of a specific event. For a magnetic test system, the trigger signal is used to trigger an external capture device, such as an oscilloscope. The trigger allows detailed inspection of the (read) signal at a specific point in time. For magnetic testing, the trigger must be sent within a few nanoseconds of the event. In this case, the events of interest are various types of errors. However, legacy systems only provide a trigger at the comparator level, giving only the ability to observe an MP. Advanced triggering uses the realtime category engine to qualify the error, giving the user the ability to see the signal for a specific type of error categorized from an MP or TA. The trigger can also be further gated to a range of sectors, limiting the number of triggers to just the signals of interest.

(Servo) Wedge Mode

Figure: Servo pattern wedges and tested track (to skip servo area).

Magnetic certification of disks previously written with servo information is done with a gating signal. This "read gate" signal is used to write, test, and erase just the data area of the disk, preserving the servo area. The standard technique is for the tester to derive the actuator geometry "on the fly" and generate the read gate signal for each tested surface. Depending on the r/w dimensions of the test heads, the correct test skew for each track may be needed. The wedge mode can be implemented in either track mode or spiral mode.
Existing spiral test platforms (such as the MG) can be upgraded to perform spiral with the correct skew geometry.

Issues

The testers in use today were designed for a much lower density era. With the rapid increase in density, test technology has not kept up, so compromises must be made in the Magnetic Certification test. The "Certification" process has changed from 100% testing/certification to a "sampling plan". The most basic sampling plan covers a smaller area than the full surface. More ambitious sampling plans test only some of the disks in a lot (and only some of the tracks on each disk). Depending on the company's test philosophy, testing for each surface now ranges (only) from 1 to 3% coverage, and the number of disks tested ranges from 10 to 100%.

Balancing Throughput, Sensitivity, and Density (TSD) is a complex manipulation of parameters, as well as of specification negotiations. Magnetic testing gets progressively more expensive as density increases. Testing technology reached its peak in terms of TSD in the early to mid 90s. With the demand for Throughput remaining in place, Sensitivity and Density numbers are going down to compensate.

Because of the ever-shrinking head geometry and the accompanying decrease in signal amplitude, the amount of testing has decreased over the years to keep throughput constant. The immediate casualty is the test Density (coverage), which has been reduced to astoundingly low numbers. The critical casualty is Sensitivity. Maintaining high sensitivity has an immediate and often unpredictable effect on throughput. A bad batch of disks (or heads) will immediately cripple throughput without any fast resolution of its root cause. The end result is a test threshold so low that the tester only sees the largest of all MPs. Lowering Density just lowers the probability of catching an error; lowering Sensitivity changes that probability to ZERO for a class of errors that the drives will encounter.
A sampling plan only makes sense if the testing of samples has the same level of criticality as the product's failure mode. Otherwise the AQL (acceptable quality level) and the LTPD (lot tolerance percent defective) are no longer applicable. Once Sensitivity crosses this line (the test's CDS is larger than the drive's CDS), the product's quality is living on borrowed time.

Bottom line: the most critical issue facing Magnetic Certification today is the tester's (defect) sensitivity. It is pointless to test if you can no longer detect the defects that affect today's disk drives.

A poor error channel (and filter) design may appear to have the desired Signal-to-Noise performance, but often results in a signal that no longer contains any errors (they have been filtered away). As a result, not all thresholds are created equal. A properly designed channel can detect defects over the entire range of thresholds. It is common to see a correctly designed 65% threshold outperform a poorly designed 85% threshold (and to see a properly designed 85% threshold reveal previously unseen errors).