
Outlier detection

Unusual or surprising things stick out and give us a reason to pay attention. 'Wait, there is something going on here', you may think when something out of the ordinary happens. The identification of rare items, events or observations, as well as abnormal behaviors or inconsistent sequences, seems important to almost any area of human endeavour, and can even be considered a prerequisite for survival. As an extreme case, perhaps one crucial function of the liberal arts is to reveal the singular instead of the similar?

Although humans may have an intuitive notion of outliers, it is in fact an ill-defined concept. Depending on the type of observation, its context, as well as the observer's intentions, the process of identification is often referred to as outlier, anomaly, change or novelty detection. Sometimes an outlier is due to variability in the measurement, sometimes it indicates experimental error, and sometimes it is generated by an altogether different mechanism. In general, outliers are said to deviate significantly from the majority of the data, or not to conform to some notion of normal behaviour. But what is considered ordinary, what features characterize a deviation, and significant in what sense?

Many approaches to outlier detection exist; here, the discussion is limited to one-dimensional digital signal processing (DSP), so we deal with univariate but time-dependent data. Frequently, we would like to detect outliers in a data stream, i.e. a time series of samples, and react to them in a timely manner, or sometimes correct them. Applications include impulsive or speckle noise suppression, change detection, early risk identification by profiling normal behavior and flagging anomalies, etc.

Normality from descriptive statistics

One approach to outlier detection assumes that signal sample values originate from a single distribution, or from a mixture of distributions. If we can characterize that distribution by its descriptive statistics, we can use these statistics to define normality. Typical univariate descriptive statistics include:

  • Quartiles, which divide sample data into four equal parts, or quarters: 25% of the sample values are below the first quartile (\( Q_1 \)), the second quartile (\( Q_2 \)) divides the data into two halves, and 75% of the values are below the third quartile (\( Q_3 \)).

  • Arithmetic mean (\( \mu \)), median (equals \( Q_2 \)), or geometric mean, which are all averages describing a central tendency of the sample values.
    Note that the geometric mean is most appropriate for describing the central tendency of data that is expressed as fractions, e.g. percentage change or growth rates, and can only be computed for positive values.

  • Standard deviation (\( \sigma \)), coefficient of variation (CV), interquartile range (IQR), or median absolute deviation (MAD) each provide descriptions of data variability.
    CV is defined as the ratio of the standard deviation to the mean, IQR is the difference between the third and first quartiles (the range spanned by the middle half of the data), and MAD is defined as the median of the absolute deviations from the data's median.

As an example, some descriptive statistics for Gaussian and Poisson distributions are shown below. For a Gaussian distribution, the median equals the mean, \( Q_1, Q_3 \approx \mu \mp 0.6745\,\sigma \), and MAD \( \approx \sigma / 1.4826 \). Note that in mathematical statistics the Gaussian distribution is known as the normal distribution, and a zero-mean, unit-variance Gaussian distribution as the standard normal distribution. However, in this text we prefer not to use the terms normal or standard distribution, since establishing what is normal is itself part of the effort to detect outliers.

A Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event.

Note that the concept of quartiles is typically associated with continuous distributions, and since the Poisson distribution is a discrete distribution, it doesn't have quartiles in the traditional sense. However, you can still consider quantiles or percentiles in the context of a Poisson distribution.
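To make these relations concrete, the following sketch (assuming NumPy is available; the seed and sample sizes are arbitrary choices) estimates the quoted statistics from simulated Gaussian and Poisson samples.

import numpy as np

rng = np.random.default_rng(seed=0)

# Gaussian samples with mu = 0 and sigma = 1
x = rng.normal(loc=0.0, scale=1.0, size=100_000)
q1, q2, q3 = np.percentile(x, [25, 50, 75])
mad = np.median(np.abs(x - np.median(x)))
print(np.mean(x), q2)   # mean and median are both approx. 0
print(q1, q3)           # approx. -0.6745 and +0.6745, i.e. mu -/+ 0.6745*sigma
print(mad)              # approx. 0.6745, i.e. sigma / 1.4826

# Poisson samples with a mean rate of 4 events per interval; being a discrete
# distribution, only quantiles/percentiles are reported here
p = rng.poisson(lam=4, size=100_000)
print(np.percentile(p, [25, 50, 75]))   # e.g. [3. 4. 5.]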

Definition of outliers

So, we can use a notion of central tendency and variability to characterize normality. Based on these descriptive statistics, outliers can now be defined as extreme values that lie outside the bulk of the sample values. Following this approach, it is common to define thresholds or fences to discriminate between outliers and normal samples. Such thresholds can be computed using the Q-function \( Q(z) \), i.e. the tail distribution function of the standard normal distribution. For example, for a Gaussian distribution, about 4.5% of the sample values fall outside the interval \( \mu \pm 2 \sigma \), and only about 0.3% fall outside \( \mu \pm 3 \sigma \). One approach is therefore to designate sample values beyond three sigma as outliers, i.e. values are flagged as outliers if they fall outside the range $$ [\mu - 3 \sigma, \mu + 3 \sigma] . $$
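These coverage numbers can be checked with a few lines of Python, using the identity \( Q(z) = \tfrac{1}{2}\,\text{erfc}(z/\sqrt{2}) \) for a standard normal variable; the function name below is just an illustrative choice.

from math import erfc, sqrt

def q_function(z: float) -> float:
    # tail probability P(X > z) for a zero-mean, unit-variance Gaussian
    return 0.5 * erfc(z / sqrt(2))

print(2 * q_function(2.0))   # approx. 0.0455, i.e. ~4.5% outside mu +/- 2*sigma
print(2 * q_function(3.0))   # approx. 0.0027, i.e. ~0.3% outside mu +/- 3*sigma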

Other methods are based on the IQR or MAD, since these statistics are relatively robust compared to the standard deviation. Note that for some distributions, such as the Lorentz (Cauchy) distribution, moments such as the mean and variance are not even defined. Tukey's fences, for example, are based on the IQR: values are flagged as outliers if they fall outside the range $$ [Q_1 - 1.5 \cdot \text{IQR}, Q_3 + 1.5 \cdot \text{IQR}] , $$ where \( \text{IQR} = Q_3 - Q_1 \).
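As a small batch illustration (assuming NumPy; the sample values and the helper name are made up for the example), Tukey's fences can be computed directly from the percentiles, before we turn to the streaming detectors further below.

import numpy as np

def tukey_fences(x, k: float = 1.5):
    # low and high fences based on the interquartile range
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

x = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 14.7, 9.7])
low, high = tukey_fences(x)
print((x < low) | (x > high))   # only the value 14.7 is flagged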

Yet another method is similar to the three-sigma rule, but now sigma is estimated as $$ \sigma \approx k \cdot \text{MAD} , $$ where \( \text{MAD} = \text{median} (| x - \text{median} ( x ) |) \), and \( k \) depends on prior knowledge of the distribution, e.g. \( k \approx 1.4826 \) for a Gaussian distribution.
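A similar batch sketch for this MAD-based variant, again with illustrative data and a hypothetical helper name, could look as follows.

import numpy as np

def mad_fences(x, k: float = 1.4826, n_sigma: float = 3.0):
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    sigma = k * mad                      # robust estimate of the standard deviation
    return med - n_sigma * sigma, med + n_sigma * sigma

x = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 14.7, 9.7])
print(mad_fences(x))   # approx. (9.11, 10.89), so 14.7 falls outside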

Beyond the fence

The motivation for outlier detection is to separate out rare samples that are possibly generated by some other process than the dominant mechanism. However, extreme values occur by chance in any distribution: even perfectly valid data contains them, especially when sample sizes get large. When selecting fences to discriminate between correct samples and outliers, there is always a trade-off between false positives and missed outliers. It can be useful to put a number to this uncertainty, i.e. to express the likelihood that a sample value does not originate from the dominant distribution, but is in fact an outlier. For that we turn to the cumulative distribution function (CDF).

A cumulative distribution function (\( \Phi(z) \)) expresses the probability that a variable takes a value less than or equal to a certain threshold (\( z \)). Likewise, the probability that a zero-mean Gaussian random variable takes a value larger than \( z \) can be expressed as \( Q(z) = 1 - \Phi(z) \). Now, if we detect a value at a certain threshold \( z \) or beyond, \( Q(z) \) can be interpreted as the likelihood that said value actually originated from the distribution, and the likelihood that it is an outlier equals \( \Phi(z) \). In case of a Gaussian distribution, the CDF equals $$ \Phi \left ( \frac{z -\mu} {\sigma} \right ) = \frac {1} {2} \left \lbrack 1 + \text{erf} \left ( \frac {z - \mu} {\sigma \sqrt{2}} \right ) \right \rbrack , $$ where \( \text{erf} (\cdot) \) is the error function.
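A minimal sketch of this computation, using the error function from Python's math module (the function name is an illustrative choice):

from math import erf, sqrt

def gaussian_cdf(z: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    # Phi((z - mu) / sigma), i.e. the Gaussian CDF from the formula above
    return 0.5 * (1.0 + erf((z - mu) / (sigma * sqrt(2))))

z = 2.5
phi = gaussian_cdf(z)
print(phi)          # approx. 0.9938: in the interpretation above, the likelihood that the value is an outlier
print(1.0 - phi)    # approx. 0.0062: the likelihood that it originated from the distribution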

TODO: Chernoff bound on the Q-function, \( Q(z) < e^{-z^2/2} \)? Approximation of \( 1 - p \approx e^{-n^2/2} / (n \sqrt{\pi/2}) \)? See https://en.wikipedia.org/wiki/Normal_distribution#Standard_deviation_and_coverage

Estimation of statistics

In order to define fences for outlier detection, the statistics of the data need to be estimated. We focus on sample streams, where outlier detection ideally happens in real time, i.e. an outlier is flagged during the same sample period in which the input was provided, or after a fixed delay.

Sliding window

One approach to estimating descriptive statistics such as the quartiles or MAD is to use a circular buffer that stores the last \( N \) samples. Circularity comes from the fact that when the end of the allocated buffer is reached, the pointer automatically wraps around to the beginning of the buffer. Comparable to a sliding window, a circular buffer allows us to estimate a local density of the sample values.

Let's first define a detector that uses Python's double-ended queue from the collections module, with bounded length. If the queue is full and a new item is added, an item from the opposite end is discarded, resulting in a circular buffer.

from math import isnan
from collections import deque
from statistics import quantiles

class OutlierDetector(object):

    def __init__(self, buffer_size: int = 11, k: float = 1.5):
        super().__init__()
        if buffer_size < 2:
            raise ValueError('Buffer size should be larger than 1')
        if k < 1 or k > 4:
            raise ValueError('Spread factor seems poorly chosen')
        # circular buffer holding the most recent buffer_size samples
        self.buffer = deque(maxlen=buffer_size)
        self.k = k

    def process(self, x: float) -> bool:
        # ignore missing or invalid samples
        if x is None or isnan(x):
            return None
        self.buffer.append(x)

If the queue is full, we use the quantiles function from the statistics module to estimate Tukey's fences. The new sample is compared to the fences, and if it exceeds one of them, an outlier is flagged.

    def process(self, x: float) -> bool:
        ...
        if len(self.buffer) == self.buffer.maxlen:
            # quantiles with n=4 returns the three cut points [Q1, Q2, Q3]
            q1, q2, q3 = quantiles(self.buffer, n=4)
            iqr = q3 - q1
            low_fence = q1 - self.k*iqr
            high_fence = q3 + self.k*iqr

            if x >= high_fence or x <= low_fence:
                return True

        return False

When applied to a zero-mean Gaussian process with \( \sigma = 1.0 \), it appears that the fences are somewhat underestimated, resulting in many more detected outliers than the 0.3% false positives that would be expected from the three-sigma rule. This is illustrated in the figure below, where a zero-mean Gaussian process is analyzed in a 100 sample sliding window to produce the upper and lower IQR-based fences, shown as thin blue lines. Theoretical fences are shown as horizontal dashed blue lines, and flagged outliers as vertical dotted red lines.
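One possible way to run such an experiment is sketched below, assuming NumPy for generating the test signal and the OutlierDetector class as defined above; the seed, stream length and resulting flagged fraction are illustrative.

import numpy as np

rng = np.random.default_rng(seed=0)
signal = rng.normal(loc=0.0, scale=1.0, size=10_000)

detector = OutlierDetector(buffer_size=100, k=1.5)
flags = [detector.process(x) for x in signal]
print(sum(flags) / len(flags))   # fraction of samples flagged as outliers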

Let's build a second detector with a similar structure, this time estimating the median and median absolute deviation (MAD) when the queue is filled.

from numpy import asarray, median

class OutlierDetectorMAD(object):
    ...

    def process(self, x: float) -> bool:
        ...

        if len(self.buffer) == self.buffer.maxlen:
            # convert the deque to an array for element-wise operations
            data = asarray(self.buffer)
            med = median(data)
            mad = median(abs(data - med))
            low_fence = med - self.k*mad
            high_fence = med + self.k*mad

            if x >= high_fence or x <= low_fence:
                return True

        return False

In this case, the fences seem to be a bit less biased, but more noisy.

Straightforward improvements can be made by increasing the window size, or by smoothing the fences with a low-pass filter, but both come at the expense of a reduced response time.
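As a sketch of the second idea, a fence estimate could for instance be passed through a first-order exponential smoother before thresholding; the class name, the alpha parameter and its default value are illustrative choices, not part of the detectors above.

class SmoothedFence(object):
    """First-order (exponential) low-pass smoothing of a fence estimate."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha      # smoothing factor: smaller is smoother, but slower
        self.value = None

    def update(self, fence: float) -> float:
        # initialize on the first call, then move a fraction alpha towards the input
        if self.value is None:
            self.value = fence
        else:
            self.value += self.alpha * (fence - self.value)
        return self.value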

Recursive estimation of median and MAD

An interesting alternative to sliding-window-based estimation is to apply an adaptation of the Frugal-1U-Median algorithm 1. Just one memory cell is used to store the previous median estimate, and the algorithm “drifts” towards the direction indicated by the new stream item. The approach can be applied to recursive estimation of both the median and the median absolute deviation. Here, the original algorithm has been modified by choosing a step size based on the MAD estimate.

from math import isnan

class OutlierDetectorFrugalMAD(object):

    def __init__(self, k: float = 4.0):
        super().__init__()
        self.k = k
        self.med, self.mad = None, None

    def process(self, x: float) -> bool:
        if x is None or isnan(x):
            return None
        # step size proportional to the current MAD estimate
        self.alpha = 0.1 if self.mad is None else self.mad/100

        # drift the median estimate towards the new sample
        self.med = (x if self.med is None
                    else self.med + self.alpha if x > self.med
                    else self.med - self.alpha)
        # drift the MAD estimate towards the new absolute deviation
        self.mad = (abs(x) if self.mad is None
                    else self.mad + self.alpha if abs(x - self.med) > self.mad
                    else self.mad - self.alpha)

        low_fence = self.med - self.k*self.mad
        high_fence = self.med + self.k*self.mad

        if x >= high_fence or x <= low_fence:
            return True

        return False

Despite the simplicity of the algorithm, the results after convergence are quite decent, and comparable to those of the 100 sample sliding window. The variance of the fence estimates probably depends on the step size in a similar manner as it depends on the window size.
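A hedged usage sketch (assuming NumPy for signal generation and the class as defined above) could evaluate the detector on a simulated Gaussian stream while discarding an initial burn-in period; the burn-in length of 1000 samples is an arbitrary choice here.

import numpy as np

rng = np.random.default_rng(seed=0)
signal = rng.normal(loc=0.0, scale=1.0, size=20_000)

detector = OutlierDetectorFrugalMAD(k=4.0)
flags = [detector.process(x) for x in signal]
print(sum(flags[1000:]) / len(flags[1000:]))   # flagged fraction after burn-in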

Conclusion

TODO

Also check means to define change rates (such as alpha), e.g. look into algorithms to live by


  1. Qiang Ma, S. Muthukrishnan, and Mark Sandler. Frugal Streaming for Estimating Quantiles: One (or two) memory suffices. July 2014. arXiv:1407.1121 [cs]. URL: http://arxiv.org/abs/1407.1121 (visited on 2022-09-19), doi:10.48550/arXiv.1407.1121