A critical challenge in anomaly detection for many organizations is that an anomaly can be hard to define.
Google Cloud has recently announced the public preview of new anomaly detection capabilities in BigQuery ML that use unsupervised machine learning to help users detect anomalies without needing labeled data. Users can now detect anomalies in training data or new input data using the new ML.DETECT_ANOMALIES function with autoencoder, k-means, and ARIMA_PLUS time series models.
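As a rough illustration of how such a call might be put together, the Python sketch below only builds the SQL text for an ML.DETECT_ANOMALIES query; it does not connect to BigQuery. The helper name and the model and table names are hypothetical, and the argument layout (a MODEL reference, a STRUCT carrying contamination, and an optional TABLE of new input data) is an assumption that should be checked against the BigQuery ML reference before use.

```python
def build_detect_anomalies_query(model, contamination, table=None):
    """Return SQL text for an ML.DETECT_ANOMALIES call (hypothetical helper).

    model         -- model path, e.g. "mydataset.my_model" (made-up name)
    contamination -- expected share of anomalous rows
    table         -- optional table of new input data; when omitted, the
                     sketch assumes the training data is scored instead
    """
    args = [
        f"MODEL `{model}`",
        f"STRUCT({contamination} AS contamination)",
    ]
    if table is not None:
        args.append(f"TABLE `{table}`")
    return "SELECT *\nFROM ML.DETECT_ANOMALIES(\n  " + ",\n  ".join(args) + "\n)"


# Made-up model and table names, for illustration only.
query = build_detect_anomalies_query(
    "mydataset.my_kmeans_model", 0.02, "mydataset.new_data"
)
print(query)
```

The resulting string could then be pasted into the BigQuery console or passed to a client library.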
To detect anomalies in non-time-series data, the following can be used:
Autoencoder models: When ML.DETECT_ANOMALIES is used with an autoencoder model, anomalies are identified based on the reconstruction error for each data point.
When an autoencoder model and data are provided as inputs, ML.DETECT_ANOMALIES first computes the mean_squared_error between each data point's original values and its reconstructed values. The contamination value supplied by the user then determines the threshold above which a data point is considered an anomaly.
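The reconstruction-error idea can be sketched in a few lines of plain Python. This is not BigQuery ML's implementation: the reconstructed values are supplied by hand rather than produced by a trained autoencoder, and the cutoff is taken from the scored rows themselves, which is a simplifying assumption. All function names and sample values are made up for illustration.

```python
def mean_squared_error(original, reconstructed):
    """Average squared difference between a row and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def cutoff_from_contamination(scores, contamination):
    """Score at or above which roughly `contamination` of the rows fall."""
    ranked = sorted(scores, reverse=True)
    n_anomalies = int(contamination * len(ranked))
    if n_anomalies == 0:
        return float("inf")  # nothing is flagged
    return ranked[n_anomalies - 1]


def detect_by_reconstruction_error(originals, reconstructions, contamination):
    """Return (error, is_anomaly) for every row."""
    errors = [mean_squared_error(o, r) for o, r in zip(originals, reconstructions)]
    cutoff = cutoff_from_contamination(errors, contamination)
    # Ties at the cutoff are all flagged, so slightly more rows than
    # contamination * len(rows) can be marked.
    return [(e, e >= cutoff) for e in errors]


# Made-up rows; the last one is reconstructed poorly.
originals = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [9.0, 9.0]]
reconstructions = [[1.1, 1.9], [2.0, 1.2], [1.4, 1.5], [4.0, 5.0]]

for error, is_anomaly in detect_by_reconstruction_error(originals, reconstructions, 0.25):
    print(round(error, 3), is_anomaly)
```

With a contamination of 0.25 and four rows, only the row with the largest reconstruction error is flagged.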
K-means clustering models: When ML.DETECT_ANOMALIES is used with a k-means model, anomalies are identified based on each input data point's normalized distance to its nearest cluster.
When a k-means model and data are provided as inputs, ML.DETECT_ANOMALIES first computes the absolute distance from each input data point to every cluster centroid in the model. It then normalizes each distance by the respective cluster radius. For every data point, ML.DETECT_ANOMALIES returns the nearest centroid_id based on normalized_distance. The contamination value provided by the user determines the threshold above which a data point is considered an anomaly.
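The distance steps described above can be sketched as follows. This is an illustration under stated assumptions, not the BigQuery ML implementation: Euclidean distance is assumed, the centroids and cluster radii are given directly instead of coming from a trained model, and the cutoff is derived from the scored points themselves. The function names and sample values are hypothetical.

```python
import math


def nearest_by_normalized_distance(point, centroids, radii):
    """Return (centroid_id, normalized_distance) for the closest cluster."""
    best_id, best_distance = None, math.inf
    for centroid_id, centroid in centroids.items():
        # Absolute (here: Euclidean) distance, scaled by the cluster radius.
        normalized = math.dist(point, centroid) / radii[centroid_id]
        if normalized < best_distance:
            best_id, best_distance = centroid_id, normalized
    return best_id, best_distance


def detect_by_normalized_distance(points, centroids, radii, contamination):
    """Return (centroid_id, normalized_distance, is_anomaly) per point."""
    scored = [nearest_by_normalized_distance(p, centroids, radii) for p in points]
    ranked = sorted((d for _, d in scored), reverse=True)
    n_anomalies = int(contamination * len(ranked))
    cutoff = ranked[n_anomalies - 1] if n_anomalies > 0 else math.inf
    return [(cid, d, d >= cutoff) for cid, d in scored]


# Made-up model: two clusters with different radii.
centroids = {1: (0.0, 0.0), 2: (10.0, 10.0)}
radii = {1: 1.0, 2: 2.0}
points = [(0.5, 0.0), (10.0, 12.0), (5.0, 5.0), (0.0, 1.0)]

for centroid_id, distance, is_anomaly in detect_by_normalized_distance(
    points, centroids, radii, 0.25
):
    print(centroid_id, round(distance, 3), is_anomaly)
```

The point (5.0, 5.0) is equally far from both centroids in absolute terms, but cluster 2 has the larger radius, so it is the nearest cluster by normalized distance. That point is also the one flagged in this sample.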
To detect anomalies in time-series data, users can use ML.DETECT_ANOMALIES with ARIMA_PLUS time series models. In this case, anomalies are identified based on the confidence interval for each timestamp.
One use case is detecting anomalies in historical data, such as cleaning up data for forecasting and modeling purposes. For example, a user with a large number of retail demand time series may want to quickly identify which stores and product categories had anomalous sales patterns, and then perform a deeper analysis of the cause behind such anomalies.
Another use case is forward-looking anomaly detection, such as detecting consumer behavior and pricing anomalies as early as possible. For example, a user with a large number of retail demand time series may want to identify which stores and product categories show anomalous sales patterns relative to the forecasts, so they can quickly respond to any unexpected spikes or dips.
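The confidence-interval check for time-series data can be sketched as below. This is a simplified stand-in, not ARIMA_PLUS itself: the predicted values and their standard errors are supplied by hand, and a symmetric normal interval is assumed. The function name, the confidence_level parameter, and the sample values are hypothetical.

```python
from statistics import NormalDist


def flag_outside_interval(actuals, predictions, std_errors, confidence_level=0.95):
    """Flag each timestamp whose actual value falls outside the interval."""
    # Two-sided z-score for the requested confidence level (normal assumption).
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    results = []
    for actual, predicted, std_error in zip(actuals, predictions, std_errors):
        lower = predicted - z * std_error
        upper = predicted + z * std_error
        results.append(
            {
                "lower_bound": lower,
                "upper_bound": upper,
                "is_anomaly": not (lower <= actual <= upper),
            }
        )
    return results


# Made-up daily sales: the third day spikes far above the prediction.
actuals = [100.0, 104.0, 150.0, 98.0]
predictions = [101.0, 102.0, 103.0, 100.0]
std_errors = [5.0, 5.0, 5.0, 5.0]

for row in flag_outside_interval(actuals, predictions, std_errors):
    print(round(row["lower_bound"], 1), round(row["upper_bound"], 1), row["is_anomaly"])
```

A wider confidence level produces wider bounds and therefore fewer flagged timestamps, which is the trade-off a user would tune when looking for unexpected spikes or dips.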