Deep learning for time series anomaly detection

Time series anomaly detection typically requires specification of purpose-built parameters or selection of models to fit the characteristics of normal and anomalous data being studied. These parameters may include the setting or selection of thresholds, window lengths, distance functions, transcoding functions, feature extractors, normalizers, or cluster counts.

The difficulty with parameter selection is that it is hard to find good parameters when the expected number of anomalies is small and the distributional form is unknown. Furthermore, anomaly detection systems are often brittle and have a high functional elasticity with respect to these parameters. Slight changes in the parameters can potentially have a large impact on accuracy.

Non-parametric models still require making distributional assumptions, specifying neural network or Bayesian architectures, or providing training data. However, despite "letting the data speak for themselves," even some non-parametric models suffer from a lack of robustness.

Assuming a distributional form for the normal, anomalous, or another condition allows parameters to be inferred from data. However, if the data does not follow these forms (which is often the case), the estimates that these models generate are not necessarily relevant. Often these methods embed distributional assumptions implicitly (e.g. in modeling noise) or become unwieldy to train and maintain in terms of complexity or computational requirements. They may also be hard to tune if undesirable behavior is discovered.

Neural networks are one class of methods capable of learning arbitrary forms of input. Many different architectures and regularizations have been proposed and studied. Recently, algorithmic and computational advances have allowed "deep" neural networks to be successfully trained and applied to different types of problems. Nonetheless, these are typically sensitive to architectural decisions and still require complex and often computationally substantial training effort.

In anomaly detection, typically only data from normal behavior is available. The problem can be described as learning to discriminate in this one-class setting or, alternatively, as thresholding reconstruction error. In either case, a good approach should be robust to "normal" (typical) noise while being sensitive to true outliers.

The approach described here attempts to learn how to reconstruct normal behavior by using a form of Recurrent Neural Network (RNN) known as the Echo State Network (ESN). RNNs can be interpreted as very deep networks where interconnections between layers are unfolded across time. Training these can be very challenging, but they also yield a potentially longer-term sensitivity than other architectures.

The ESN is computationally straightforward, which allows simple process-driven optimization of relatively few hyperparameters. The ESN has been successfully applied in numerous applications including classification, reconstruction, and simulation. Conceptually, the ESN is learning to project the input into a high dimensional space with recurrence. In this high dimensional space, a simple single regression is fit to the neuron activations. Once established initially, the neuron interconnections and weights are not reset during training.
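
The projection and the single regression described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation behind the results reported here: the function names `run_reservoir` and `fit_readout`, the tanh activation, the `leak` parameter, the bias column in `w_in`, and the ridge penalty are all choices made for the example.

```python
import numpy as np

def run_reservoir(u, w_in, w_res, leak=1.0):
    """Drive a fixed random reservoir with a 1-D input and collect its states.

    w_in has shape (n, 2): column 0 is a bias weight, column 1 scales the input.
    w_res has shape (n, n) and is never modified here.
    """
    n = w_res.shape[0]
    x = np.zeros(n)
    states = np.empty((len(u), n))
    for t, u_t in enumerate(u):
        # Recurrent projection of the current sample into the reservoir space.
        pre = np.tanh(w_in[:, 0] + w_in[:, 1] * u_t + w_res @ x)
        x = (1.0 - leak) * x + leak * pre
        states[t] = x
    return states

def fit_readout(states, targets, ridge=1e-6):
    """Fit the only trained part: a ridge-regularized linear readout."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ targets)
```

Predictions are then `states @ w_out`. Only `w_out` is learned; `w_in` and `w_res` are drawn once and left alone, which is why training reduces to a single linear solve.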

The Echo State Network is a form of Reservoir Computing, so called because it utilizes a reservoir of random neurons and interconnections. Though random, the ESN makes a requirement about the statistics of the reservoir. By training only the single regression function, deep neural network training issues such as vanishing gradients are avoided. Pretraining is also not required. The Echo State Network has been shown to be competitive with or superior to fully architected deep Recurrent Neural Networks in applications while requiring far less complexity and computational power.
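
The text does not say which reservoir statistic is constrained. In the ESN literature the requirement is commonly expressed through the spectral radius (largest absolute eigenvalue) of the recurrent weight matrix, usually kept below 1. The sketch below assumes that interpretation; `make_reservoir`, the default radius of 0.9, and the 10% connection density are illustrative choices, not values taken from this experiment.

```python
import numpy as np

def make_reservoir(n_neurons, spectral_radius=0.9, density=0.1, seed=0):
    """Draw a sparse random recurrent matrix and rescale its spectral radius."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=(n_neurons, n_neurons))
    # Keep roughly `density` of the connections; zero out the rest.
    w = w * (rng.random((n_neurons, n_neurons)) < density)
    current_radius = np.max(np.abs(np.linalg.eigvals(w)))
    # Scaling a matrix by a constant scales all of its eigenvalues by that constant.
    return w * (spectral_radius / current_radius)
```

The weights stay random; only their overall scale is adjusted, so the reservoir's response to old inputs fades rather than grows.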

To demonstrate this approach, ECG (electrocardiogram) data from the PhysioBank MIT-BIH Long-Term ECG Database was used. An ESN with 1,000 reservoir neurons was trained using 3,500 samples of a non-arrhythmic ECG signal and tested on 3,500 samples that contained four arrhythmic episodes.

Input to the ESN was the single sample value at each index, and the target was the next sample value. Reconstruction error was measured as the difference between the one-ahead predicted value and the actual value in the test sequence. Training and testing took less than a minute on a single core.
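
The input/target pairing and the error measure can be written down directly. This is a small sketch with hypothetical helper names (`one_step_pairs`, `reconstruction_error`). Taking the absolute value of the difference is an assumption, since the text only says "difference", and the predictor below is a trivial stand-in rather than a trained ESN.

```python
import numpy as np

def one_step_pairs(signal):
    """Split a 1-D signal into (input, target) pairs for next-sample prediction."""
    signal = np.asarray(signal, dtype=float)
    return signal[:-1], signal[1:]

def reconstruction_error(predicted, actual):
    """Absolute one-step-ahead error (the absolute value is an assumption)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.abs(predicted - actual)

# Stand-in predictor for illustration only: repeat the current sample.
# In the experiment the prediction would come from the trained ESN instead.
inputs, targets = one_step_pairs([1.0, 2.0, 4.0, 7.0])
error = reconstruction_error(inputs, targets)  # array([1., 2., 3.])
```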

Thresholding the reconstruction error with a simple static value was sufficient to recover all four anomalous episodes with no false positives.
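
Static thresholding of a per-sample error series, followed by grouping flagged samples into episodes, might look like the sketch below. The helper name `anomalous_episodes` and the rule that an episode is a run of consecutive above-threshold samples are assumptions for illustration; the text does not describe how episodes were delimited.

```python
import numpy as np

def anomalous_episodes(error, threshold):
    """Return (start, end) index pairs, end exclusive, of runs where error > threshold."""
    flags = np.asarray(error) > threshold
    # Pad with False so runs touching either end of the series still open and close.
    padded = np.concatenate(([False], flags, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    # Changes alternate between run starts and run ends.
    return [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]

episodes = anomalous_episodes([0.1, 5.0, 6.0, 0.2, 9.0], threshold=1.0)
# episodes == [(1, 3), (4, 5)]
```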

The ESN as used here does not require the specification of a window length, distance function, cluster count, parametric family, or even neural network architecture. It doesn't even require normalizing, transcoding, or preparing features with respect to the input data. No frequency filtering, centering, or other preprocessing was performed on the data. The ESN effectively learns features for itself.

This method attempts to transform a temporal anomaly detection task into a non-temporal task by converting anomalous temporal behavior into individual anomalous points irrespective of context (even though the transformed points are still indexed by time). This transform allows any number of other point-wise anomaly detection methods to be used, including a very simple static thresholding that relies on the hypothesis that normal and anomalous points in time are potentially very separated in the transformed non-temporal space.

Though the selection of a static threshold may seem like requiring a parameter after all, consider that this parameter does not specify anything about any temporal characteristic of the signal. The projection of the test signal into the reconstruction space does not require any parameters dealing with thresholds, window lengths, distance functions, etc. The few hyperparameters that are required by the ESN have relatively little impact on the results.

One could also look at the input data and say this is a trivial problem that is clearly separable with a static threshold even in the input space. However, using such a threshold implies a stationarity about the input that is a very bold assumption. It would also mean assuming that arrhythmia and normal behavior are separable by amplitude, which may not be the case.

Applying a threshold (or any other point-wise anomaly detection technique) in the reconstruction space instead expresses a confidence in the ability of the ESN's objective function to continue to fit the data (as the ESN may be retrained from time to time). Additionally, false positives (or false negatives) can be included in (or removed from) the training data to modify the model in a "non-parametric" fashion.

To test whether extreme reconstruction error for the ESN was only sensitive to extreme amplitude in the test signal, a separate test was run whereby the test signal was randomly perturbed by 6 sections of 10 contiguous samples clamped at 50% of the maximum value of the training signal. While each of these artificial anomaly sections yielded reconstruction error less than the log(10) static threshold, they had reconstruction error an order of magnitude larger than anything that wasn't identified as one of the four arrhythmic anomalies. Note that at 50% of the maximum value of the training set, these artificial sample values were substantially below the maximum sample values seen during any cardiac cycle in the training set.
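
The perturbation described above can be reproduced in outline as follows. The function name `clamp_sections`, the seed, and the way section positions are drawn (from a grid that keeps sections from overlapping or touching) are assumptions. The text only states that 6 sections of 10 contiguous samples were clamped at 50% of the training maximum.

```python
import numpy as np

def clamp_sections(test_signal, train_signal, n_sections=6, length=10,
                   fraction=0.5, seed=0):
    """Copy test_signal and clamp random sections to fraction * max(train_signal)."""
    rng = np.random.default_rng(seed)
    perturbed = np.array(test_signal, dtype=float)  # copy; the input is untouched
    level = fraction * np.max(train_signal)
    # Draw starts from a grid with spacing 2 * length so that sections
    # never overlap and are separated by at least `length` samples.
    slots = len(perturbed) // (2 * length)
    starts = np.sort(rng.choice(slots, size=n_sections, replace=False)) * 2 * length
    for s in starts:
        perturbed[s:s + length] = level
    return perturbed, starts
```

Feeding `perturbed` through the trained network and comparing the error inside and outside the returned `starts` gives the kind of check described in the paragraph above.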

Finally, please note that this is clearly not intended to be a formal scientific examination of this technique for all classes of anomaly detection problems (or even all instances of arrhythmia detection). The purpose is to demonstrate the potential power of a relatively new and highly adaptable form of neural network that appears to be well suited for many time series problems.