July 23, 2024
PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters
  • We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters.
  • PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models.
  • We’re sharing results of our own case studies using PVF to measure the impact of SDCs in model parameters, as well as potential methods of detecting SDCs in model parameters.

Reliability is a critical aspect of any successful AI implementation. But the growing complexity and diversity of AI hardware systems also bring an increased risk of hardware faults such as bit flips. Manufacturing defects, aging components, or environmental factors can lead to data corruptions – errors or alterations in data that can occur during storage, transmission, or processing and result in unintended changes in information.

Silent data corruptions (SDCs), where an undetected hardware fault results in inaccurate application behavior, have become increasingly prevalent and difficult to detect. Within AI systems, an SDC can cause what’s referred to as parameter corruption, where AI model parameters are corrupted and their original values are altered.

When this occurs during AI inference/serving, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services.

Figure 1 shows an example of this, where a single bit flip can drastically alter the output of a ResNet model.

Figure 1: Flipping a random bit of one parameter in the 1st convolution (conv) layer of ResNet-18 drastically alters the model’s output.
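
For illustration, here is a minimal PyTorch sketch of the Figure 1 scenario. This is our own example, not the paper’s exact setup: it flips one random bit of one random weight in ResNet-18’s first conv layer and compares the top-1 prediction before and after.

```python
# Minimal sketch of the Figure 1 scenario (illustrative; not the paper's
# exact setup): flip one random bit of one parameter in ResNet-18's first
# conv layer and compare the model's top-1 prediction before and after.
import random
import struct

import torch
import torchvision.models as models


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 value via its IEEE-754 bit pattern."""
    (as_int,) = struct.unpack("I", struct.pack("f", value))
    (flipped,) = struct.unpack("f", struct.pack("I", as_int ^ (1 << bit)))
    return flipped


model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for a real input image

with torch.no_grad():
    clean_pred = model(x).argmax(dim=1).item()

    # Corrupt one random bit of one random weight in the first conv layer.
    weights = model.conv1.weight.data.view(-1)
    idx, bit = random.randrange(weights.numel()), random.randrange(32)
    weights[idx] = flip_bit(weights[idx].item(), bit)

    corrupted_pred = model(x).argmax(dim=1).item()

print(f"prediction before: {clean_pred}, after bit flip: {corrupted_pred}")
```

Flips in high-order exponent bits are the ones most likely to swing the output this dramatically, since they can change a weight’s magnitude by orders of magnitude.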

With this escalating threat in mind, there are two crucial questions: How vulnerable are AI models to parameter corruptions? And how do different components (such as modules and layers) of a model exhibit different levels of vulnerability to parameter corruptions?

Answering these questions is a critical part of delivering reliable AI systems and services, and it offers valuable insights for guiding AI hardware system design, such as when assigning AI model parameters or software variables to hardware blocks with differing fault protection capabilities. Furthermore, it can provide crucial information for formulating strategies to detect and mitigate SDCs in AI systems efficiently and effectively.

Parameter vulnerability factor (PVF) is a novel metric we’ve introduced with the aim of standardizing the quantification of AI model vulnerability against parameter corruptions. PVF is a versatile metric that can be tailored to different AI models/tasks and is also adaptable to different hardware fault models. Moreover, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on a model’s convergence capability.

What is PVF?

PVF is inspired by the architectural vulnerability factor (AVF) metric used within the computer architecture community. We define a model parameter’s PVF as the probability that a corruption in that particular model parameter will lead to an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments.
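
Stated as a formula (our paraphrase of the definition above, with the FI-based estimate made explicit), for a parameter or parameter group p:

```
PVF(p) = Pr[ model output is incorrect | corruption in parameter p ]
       ≈ (# FI trials on p producing an incorrect output) / (# FI trials on p)
```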

PVF has several features:

Parameter-level quantitative analysis

As a quantitative metric, PVF concentrates on parameter-level vulnerability, calculating the probability that a corruption in a specific model parameter will lead to an incorrect model output. This “parameter” can be defined at different scales and granularities, such as an individual parameter or a group of parameters.

Scalability across AI models/tasks

PVF is scalable and applicable across a wide range of AI models, tasks, and hardware fault models.

Provides insights for guiding AI system design

PVF can provide valuable insights for AI system designers, guiding them in making informed decisions about balancing fault protection with performance and efficiency. For example, engineers might leverage PVF to help map more vulnerable parameters to better-protected hardware blocks and explore tradeoffs among latency, power, and reliability by enabling a surgical approach to fault tolerance in selective locations instead of a catch-all/none approach.

Can be used as a standard metric for AI vulnerability/resilience evaluation

PVF has the potential to unify and standardize such practices, making it easier to compare the reliability of different AI systems/parameters and fostering open collaboration and progress across the industry and research community.

How PVF works

Similar to AVF, as a statistical concept PVF needs to be derived through a large number of statistically meaningful FI experiments. Figure 2 shows the overall flow for computing PVF through an FI process. We’ve presented a case study on open-source DLRM inference, with more details and additional example case studies available in our paper.
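
As a rough sketch of this flow (our own simplified rendering, assuming one random bit flip per trial and treating a changed top-1 prediction as an incorrect output, and reusing the flip_bit helper from the earlier snippet):

```python
# Rough sketch of the Figure 2 flow: estimate PVF for one named parameter
# tensor via repeated fault injection. Assumptions (ours): a single random
# bit flip per trial, and "incorrect output" means the top-1 prediction
# differs from the fault-free run. flip_bit is defined in the first snippet.
import copy
import random

import torch


def estimate_pvf(model, param_name, inputs, n_trials=1000):
    """Monte Carlo estimate of PVF for one parameter tensor."""
    model.eval()
    with torch.no_grad():
        golden = model(inputs).argmax(dim=1)  # fault-free reference output

    incorrect = 0
    for _ in range(n_trials):
        faulty = copy.deepcopy(model)  # fresh copy so faults don't accumulate
        weights = dict(faulty.named_parameters())[param_name].data.view(-1)
        idx = random.randrange(weights.numel())
        weights[idx] = flip_bit(weights[idx].item(), random.randrange(32))

        with torch.no_grad():
            if not torch.equal(faulty(inputs).argmax(dim=1), golden):
                incorrect += 1

    return incorrect / n_trials  # PVF estimate for this parameter tensor
```

In practice, n_trials must be large enough (and inputs representative enough) for the estimate to be statistically meaningful, which is the point made above.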

Figure 2: Computing PVF through FI.

Figure 3 illustrates the PVF of three DLRM parameter components – the embedding table, bot-MLP, and top-MLP – under 1, 2, 4, 8, 16, 32, 64, and 128 bit flips during each inference. We observe different vulnerability levels across different parts of DLRM. For example, under a single bit flip, the embedding table has a relatively low PVF; this is attributed to embedding tables being highly sparse, so a parameter corruption is only activated when the corrupted parameter is activated by the corresponding sparse feature. However, the top-MLP can have a PVF of 0.4% under even a single bit flip. This is significant – for every 1,000 inferences, four will be incorrect. This highlights the importance of protecting the specific vulnerable parameters of a given model based on the PVF measurement.

Figure 3: The PVF of DLRM parameters under random bit flips.

We observe that with 128 bit flips during each inference, PVF increases to 40% and 10% for the top-MLP and bot-MLP components, respectively, and we also observe several NaN values. The top-MLP component has a higher PVF than the bot-MLP; this is attributed to the top-MLP being closer to the final model output, and hence having less chance of being mitigated by the inherent error-masking probability of neural layers.

The applicability of PVF

PVF is a versatile metric in which the definition of an “incorrect output” (which may vary based on the model/task) can be adapted to suit user requirements. To adapt PVF to various hardware fault models, the method for calculating PVF remains consistent, as depicted in Figure 2; the only modification required is the manner in which the fault is injected, based on the assumed fault model.
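
In code terms (a hypothetical interface of our own devising, not from the paper), these two knobs can be expressed as callables plugged into the FI loop sketched earlier:

```python
# Hypothetical illustration (ours): the fault model and the "incorrect
# output" criterion are the two pluggable pieces; the PVF computation
# itself stays the same. flip_bit is the helper from the first snippet.
import random

import torch


def single_bit_flip(weights):
    """Fault model: flip one random bit of one random float32 element."""
    idx = random.randrange(weights.numel())
    weights[idx] = flip_bit(weights[idx].item(), random.randrange(32))


def top1_mismatch(output, golden):
    """Incorrect-output criterion for classification: top-1 label changes."""
    return not torch.equal(output.argmax(dim=1), golden.argmax(dim=1))


def token_mismatch(output, golden):
    """Alternative criterion, e.g., for generative tasks: any predicted
    token differs from the fault-free run."""
    return not torch.equal(output.argmax(dim=-1), golden.argmax(dim=-1))
```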

Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on a model’s convergence capability. During training, the model’s parameters are iteratively updated to minimize a loss function. A corruption in a parameter could potentially disrupt this learning process, preventing the model from converging to an optimal solution. By applying the PVF concept during training, we could quantify the probability that a corruption in each parameter would result in such a convergence failure.
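
One way a single training-phase trial might look is sketched below. The assumptions are entirely ours, including the arbitrary criterion that a final loss noticeably worse than a fault-free run counts as a convergence failure.

```python
# Sketch of a single training-phase FI trial (assumptions ours): inject a
# bit flip into one parameter at a chosen step, finish training, and count
# the trial as a convergence failure if the final loss stays noticeably
# above that of a fault-free run. flip_bit is the helper defined earlier.
import itertools
import random

import torch


def training_fi_trial(make_model, train_loader, param_name, inject_step,
                      n_steps, clean_final_loss, tol=0.05):
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    final_loss = float("inf")
    for step, (x, y) in enumerate(itertools.cycle(train_loader)):
        if step >= n_steps:
            break
        if step == inject_step:  # corrupt one parameter mid-training
            weights = dict(model.named_parameters())[param_name].data.view(-1)
            idx = random.randrange(weights.numel())
            weights[idx] = flip_bit(weights[idx].item(), random.randrange(32))
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        final_loss = loss.item()

    # Averaging failures over many such trials per parameter would yield a
    # training-phase PVF estimate.
    return final_loss > clean_final_loss * (1 + tol)
```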

Dr. DNA and further avenues of exploration for PVF

The logical progression after understanding AI vulnerability to SDCs is to identify and minimize their impact on AI systems. To initiate this, we’ve introduced Dr. DNA, a method designed to detect and mitigate SDCs that occur during deep learning model inference. Specifically, we formulate and extract a set of unique SDC signatures from the distribution of neuron activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference.
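
As a toy illustration of the underlying idea (not the paper’s actual signature extraction or algorithm): profile the range of each layer’s activations on clean inputs, then flag an inference whose activations leave those profiled ranges by a wide margin.

```python
# Toy illustration of the idea behind Dr. DNA (not the paper's actual
# algorithm): profile per-layer activation ranges on clean inputs, then
# flag a later inference whose activations fall well outside those ranges.
import torch


class ActivationMonitor:
    def __init__(self, model):
        self.ranges = {}        # layer name -> (min, max) seen while calibrating
        self.calibrating = True
        self.flagged = False
        for name, module in model.named_modules():
            if not list(module.children()):  # leaf modules only
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            if not torch.is_tensor(output):
                return
            lo, hi = output.min().item(), output.max().item()
            if self.calibrating:
                old = self.ranges.get(name, (lo, hi))
                self.ranges[name] = (min(old[0], lo), max(old[1], hi))
            elif name in self.ranges:
                plo, phi = self.ranges[name]
                margin = 0.5 * (phi - plo) + 1e-6  # tolerance band (arbitrary)
                self.flagged |= lo < plo - margin or hi > phi + margin
        return hook
```

Usage would be to run a batch of clean inputs with calibrating=True to build the ranges, then set calibrating=False for deployment; a flagged inference would trigger a mitigation step, such as reloading the affected weights.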

We perform an extensive evaluation across 10 representative DNN models used in three common tasks (vision, GenAI, and segmentation) – including ResNet, Vision Transformer, EfficientNet, and YOLO – under four different error models. Results show that Dr. DNA achieves a 100% SDC detection rate in most cases, a 95% detection rate on average, and a >90% detection rate across all cases, representing a 20-70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.

Read the research papers

PVF (Parameter Vulnerability Factor): A Novel Metric for Understanding AI Vulnerability Against SDCs in Model Parameters

Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations