Using stumpi to detect anomaly #287

Darveesh · 2020-12-01T18:16:51Z

Darveesh
Dec 1, 2020

I am trying to learn about stumpi and how I can possibly use it to detect anomaly in a real-time time series signal. As a proof of concept, I basically have a linear signal with 10 existing/historical data points. I then proceed to add another 15 data points one at a time following the linear progression. Intuitively/conceptually, I'd expect there to be no discord as data is added to the time series signal. My proof of concept code is below. If you examine the output you will see that the global maximum index is indeed changing - giving the impression that a discord is out there. Now, the values are very small, and the library prints out a warning of sorts along these lines but I don't know how to work around that if that's really the issue here. Looking for advice / suggestion on how to go about my poc. In my experiment, is there a better way to show that as data points come in, there is indeed no discord? Conversely, I plan to introduce a few large values for my incoming data. I'd hope whatever scheme is used, would also identify real discords. Thank you.

import logging
import numpy as np
import stumpy

logging.basicConfig(format='%(asctime)s: %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
logger = logging.getLogger("TimeSeries")
logger.setLevel(logging.DEBUG)

timeSeries = []
for i in range(9, 20):
    timeSeries.append(i+0.3)
logger.debug(f"Initial time series values: {timeSeries}")

m=3
stream = stumpy.stumpi(timeSeries, m)
logger.debug(f"Initial MP values: {stream.P_}")

maxIndex = np.argwhere(stream.P_ == stream.P_.max()).flatten()[0]
logger.debug(f"Initial max index: {maxIndex}")

for i in range(20, 35):
    stream.update(i+0.3)
    logger.debug(f"MP values: {stream.P_}")
    maxIndex = np.argwhere(stream.P_ == stream.P_.max()).flatten()[0]
    logger.debug(f"New max index: {maxIndex}")

Output:

12/01/2020 09:50:54 AM: Initial time series values: [9.3, 10.3, 11.3, 12.3, 13.3, 14.3, 15.3, 16.3, 17.3, 18.3, 19.3]
12/01/2020 09:51:04 AM: A large number of values are smaller than 1e-05.
12/01/2020 09:51:04 AM: For a self-join, try setting `ignore_trivial = True`.
12/01/2020 09:51:05 AM: Initial MP values: [0. 0. 0. 0. 0. 0. 0. 0. 0.]
12/01/2020 09:51:05 AM: Initial max index: 0
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.86241959e-07]
12/01/2020 09:51:05 AM: New max index: 8
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.86241959e-07
 2.60663862e-07]
12/01/2020 09:51:05 AM: New max index: 7
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 2.86241959e-07 2.60663862e-07
 2.80363675e-07]
12/01/2020 09:51:05 AM: New max index: 6
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 2.86241959e-07 0.00000000e+00 2.80363675e-07
 0.00000000e+00]
12/01/2020 09:51:05 AM: New max index: 5
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 1.21057606e-07 0.00000000e+00 2.80363675e-07 0.00000000e+00
 1.21057606e-07]
12/01/2020 09:51:05 AM: New max index: 6
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.21057606e-07
 0.00000000e+00 2.80363675e-07 0.00000000e+00 1.21057606e-07
 1.59100864e-07]
12/01/2020 09:51:05 AM: New max index: 5
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 0.00000000e+00 1.21057606e-07 0.00000000e+00
 2.80363675e-07 0.00000000e+00 1.21057606e-07 1.59100864e-07
 1.89660818e-07]
12/01/2020 09:51:05 AM: New max index: 4
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 1.21057606e-07 0.00000000e+00 1.98247056e-07
 0.00000000e+00 1.21057606e-07 1.59100864e-07 1.89660818e-07
 1.98247056e-07]
12/01/2020 09:51:05 AM: New max index: 3
12/01/2020 09:51:05 AM: MP values: [1.21057606e-07 0.00000000e+00 1.98247056e-07 0.00000000e+00
 1.21057606e-07 1.59100864e-07 1.89660818e-07 1.98247056e-07
 1.69244657e-07]
12/01/2020 09:51:05 AM: New max index: 2
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 1.98247056e-07 0.00000000e+00 1.21057606e-07
 1.59100864e-07 1.89660818e-07 1.98247056e-07 1.69244657e-07
 4.88339949e-07]
12/01/2020 09:51:05 AM: New max index: 8
12/01/2020 09:51:05 AM: MP values: [1.98247056e-07 0.00000000e+00 1.21057606e-07 1.59100864e-07
 1.89660818e-07 1.98247056e-07 1.69244657e-07 4.88339949e-07
 6.47815022e-07]
12/01/2020 09:51:05 AM: New max index: 8
12/01/2020 09:51:05 AM: MP values: [0.00000000e+00 1.21057606e-07 1.59100864e-07 1.89660818e-07
 1.98247056e-07 1.69244657e-07 2.98767353e-07 6.47815022e-07
 2.86241959e-07]
12/01/2020 09:51:05 AM: New max index: 7
12/01/2020 09:51:05 AM: MP values: [1.21057606e-07 1.59100864e-07 1.89660818e-07 1.98247056e-07
 1.69244657e-07 2.98767353e-07 6.47815022e-07 2.86241959e-07
 3.30523744e-07]
12/01/2020 09:51:05 AM: New max index: 6
12/01/2020 09:51:05 AM: MP values: [1.59100864e-07 1.89660818e-07 1.98247056e-07 1.69244657e-07
 2.98767353e-07 6.47815022e-07 2.86241959e-07 3.30523744e-07
 6.49868324e-07]
12/01/2020 09:51:05 AM: New max index: 8
12/01/2020 09:51:05 AM: MP values: [1.89660818e-07 1.98247056e-07 1.69244657e-07 2.98767353e-07
 6.47815022e-07 2.86241959e-07 3.30523744e-07 6.49868324e-07
 4.06449488e-07]
12/01/2020 09:51:05 AM: New max index: 7

seanlaw · 2020-12-01T19:11:35Z

seanlaw
Dec 1, 2020
Maintainer

@Darveesh Thank you for your question and welcome to the STUMPY community!

There are a few things that come to mind. Firstly, anomalies are really really hard. One needs to first start off by clearly defining "what is an anomaly" as it relates to your data. This may require establishing what is "normal" and then setting thresholds. Secondly, stumpi compares a window of values (i.e., the shape after z-normalization) to other windows of the same length. So, currently, since all of your data points are monotonically increasing, then all windows are basically identical and, fundamentally, they all have a zero distance (or something close to zero depending on numerical precision). So, the answers that you are seeing aren't wrong. Also note that stumpi is not well suited for finding single point anomalies.

Maybe what you may want to do is to do is to compare the maximum matrix profile value at each iteration and see how it is changing and if that max value is "significant" relative to other values in the current window. This can be done with something like:

if stream.P_.max() > stream.P_.mean() + stream.P_.std()*2:
    logger.debug(f"New max index: {maxIndex}")

This checks and sees if the max value (for a window) is 2 standard deviations higher than the rest of the other values.

But, to reiterate, anomaly detection is hard and you'll need to define "what is an anomaly" first before you can really proceed. 🤷‍♂️

0 replies

Darveesh · 2020-12-02T17:04:43Z

Darveesh
Dec 2, 2020
Author

@seanlaw Thank you for the response and more importantly perhaps for making stumpy available and being a resource for us curious folks. I hear you about being clear regarding the definition of an anomaly. In my real world application we will have to be think on that a bit more as you suggest. However, as I learn about this library it's great to know about its capabilities and limitations.

As to the little experiment I am running here, I will try your suggestion. Initially I had the idea that maybe I should track the location of the max index from one iteration to the next, but if I I am understanding your suggestion correctly you are limiting the "analysis" to the iteration at hand and not "remembering" anything from past iterations. Thank you for the tip. Will see how it performs although I will now keep a broader mind about the difficulty in finding deviations in the signal.

0 replies

seanlaw · 2020-12-02T19:39:03Z

seanlaw
Dec 2, 2020
Maintainer

@Darveesh Awesome! Having an open discussion also helps me think through these things and I learn just as much from others. Let me know how it goes!

0 replies

Darveesh · 2020-12-03T22:11:47Z

Darveesh
Dec 3, 2020
Author

So I ended up combining the two approaches and also working around precision issues. I.e. I classify an anomaly per your suggestion (max value more than 2 std devs away from mean) in a particular iteration. However, I saw that rule alone wasn't good enough since in the next iteration, the same rule would get fired (except now it was shifted by one index). I didn't want to generate another anomaly "alert" in this new iteration. So when an anomaly is classified / detected in the current iteration, I save the (max) index. In the next iteration if an anomaly is detected but the (max) index now is merely one shift, I skip the alerting mechanism. The other case is when the remembered (max) index eventually goes out the sliding window. In that case I reset to the current window (max) index and raise an alert as well. It's not perfect perhaps but the results are not bad:

Code:

import logging
from random import random
import numpy as np
import stumpy

logging.basicConfig(format='%(asctime)s: %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
logger = logging.getLogger("TimeSeries")
logger.setLevel(logging.INFO)

timeSeries = []
for i in range(9, 20):
    timeSeries.append(i+0.3)
logger.info(f"Initial time series values: {timeSeries}")

m=3
stream = stumpy.stumpi(timeSeries, m)
logger.debug(f"Initial MP values: {stream.P_}")

lastMaxIndex = np.argwhere(stream.P_ == stream.P_.max()).flatten()[0]
logger.debug(f"Initial max index: {lastMaxIndex}")

for i in range(20, 35):
    logger.info(f"New data arrived: {i+0.3}")
    stream.update(i+0.3)
    logger.debug(f"MP values: {stream.P_}")
    maxMP = np.round(stream.P_.max(), 4)
    meanMP = np.round(stream.P_.mean(), 4)
    lastMaxIndex -= 1
    maxIndex = np.argwhere(stream.P_ == stream.P_.max()).flatten()[0]
    if maxMP > (meanMP + (stream.P_.std()*2)):
        if lastMaxIndex >= 0:
            if lastMaxIndex != maxIndex:
                logger.warning(f" Anomaly detected!")
                logger.debug(f" Anomaly at index: {maxIndex}")
                logger.warning(f" Neighboring data: {stream.T_[maxIndex:maxIndex+m]}")
        else:
            logger.warning(f" Anomaly detected!")
            logger.debug(f" Anomaly at index: {maxIndex}")
            logger.warning(f" Neighboring data: {stream.T_[maxIndex:maxIndex+m]}")
    lastMaxIndex = maxIndex

Output:

12/03/2020 01:57:56 PM: Initial time series values: [9.3, 10.3, 11.3, 12.3, 13.3, 14.3, 15.3, 16.3, 17.3, 18.3, 19.3]
12/03/2020 01:58:05 PM: A large number of values are smaller than 1e-05.
12/03/2020 01:58:05 PM: For a self-join, try setting `ignore_trivial = True`.
12/03/2020 01:58:06 PM: New data arrived: 20.3
12/03/2020 01:58:06 PM: New data arrived: 21.3
12/03/2020 01:58:06 PM: New data arrived: 22.3
12/03/2020 01:58:06 PM: New data arrived: 23.3
12/03/2020 01:58:06 PM: New data arrived: 24.3
12/03/2020 01:58:06 PM: New data arrived: 25.3
12/03/2020 01:58:06 PM: New data arrived: 26.3
12/03/2020 01:58:06 PM: New data arrived: 27.3
12/03/2020 01:58:06 PM: New data arrived: 28.3
12/03/2020 01:58:06 PM: New data arrived: 29.3
12/03/2020 01:58:06 PM: New data arrived: 30.3
12/03/2020 01:58:06 PM: New data arrived: 31.3
12/03/2020 01:58:06 PM: New data arrived: 32.3
12/03/2020 01:58:06 PM: New data arrived: 33.3
12/03/2020 01:58:06 PM: New data arrived: 34.3

If I break the linear progression (to some random values):

12/03/2020 02:07:28 PM: Initial time series values: [9.3, 10.3, 11.3, 12.3, 13.3, 14.3, 15.3, 16.3, 17.3, 18.3, 19.3]
12/03/2020 02:07:37 PM: A large number of values are smaller than 1e-05.
12/03/2020 02:07:37 PM: For a self-join, try setting `ignore_trivial = True`.
12/03/2020 02:07:38 PM: New data arrived: 20.3
12/03/2020 02:07:38 PM: New data arrived: 21.3
12/03/2020 02:07:38 PM: New data arrived: 22.3
12/03/2020 02:07:38 PM: New data arrived: 23.3
12/03/2020 02:07:38 PM: New data arrived: 24.3
12/03/2020 02:07:38 PM: New data arrived: 25.3
12/03/2020 02:07:38 PM: New data arrived: 26.3
12/03/2020 02:07:38 PM: New data arrived: 27.3
12/03/2020 02:07:38 PM: New data arrived: 28.3
12/03/2020 02:07:38 PM: New data arrived: 29.3
12/03/2020 02:07:38 PM: New data arrived: 30.3
12/03/2020 02:07:38 PM: New data arrived: 31.3
12/03/2020 02:07:38 PM: New data arrived: 32.3
12/03/2020 02:07:38 PM: New data arrived: 33.3
12/03/2020 02:07:38 PM: New data arrived: 34.3
12/03/2020 02:07:38 PM: New data arrived: 158.62291032509614
12/03/2020 02:07:38 PM:  Anomaly detected!
12/03/2020 02:07:38 PM:  Neighboring data: [ 33.3         34.3        158.62291033]
12/03/2020 02:07:38 PM: New data arrived: 172.95220180761913
12/03/2020 02:07:38 PM: New data arrived: 127.79968734689658
12/03/2020 02:07:38 PM:  Anomaly detected!
12/03/2020 02:07:38 PM:  Neighboring data: [158.62291033 172.95220181 127.79968735]
12/03/2020 02:07:38 PM: New data arrived: 178.77987950201054
12/03/2020 02:07:38 PM: New data arrived: 117.23118003528933
12/03/2020 02:07:38 PM: New data arrived: 40.3
12/03/2020 02:07:38 PM: New data arrived: 41.3
12/03/2020 02:07:38 PM: New data arrived: 42.3
12/03/2020 02:07:38 PM: New data arrived: 43.3
12/03/2020 02:07:38 PM: New data arrived: 44.3
12/03/2020 02:07:38 PM: New data arrived: 45.3
12/03/2020 02:07:38 PM: New data arrived: 46.3
12/03/2020 02:07:38 PM: New data arrived: 47.3
12/03/2020 02:07:38 PM: New data arrived: 48.3
12/03/2020 02:07:38 PM: New data arrived: 49.3
12/03/2020 02:07:38 PM: New data arrived: 50.3
12/03/2020 02:07:38 PM: New data arrived: 51.3
12/03/2020 02:07:38 PM: New data arrived: 52.3
12/03/2020 02:07:38 PM: New data arrived: 53.3
12/03/2020 02:07:38 PM: New data arrived: 54.3

0 replies

seanlaw · 2020-12-03T22:34:09Z

seanlaw
Dec 3, 2020
Maintainer

Very cool and thank you for sharing @Darveesh!

I think this is the intended approach when leveraging the STUMPY package. That is, our philosophy is (hopefully) to do the "hardest" (read: "most computationally intensive/complex") part for you and, in only a few extra lines of code, the user can make it accomplish their specific goals! This way, our job is to focus on the core and allow the user to build what is unique to their use case.

0 replies

Darveesh · 2020-12-03T23:17:15Z

Darveesh
Dec 3, 2020
Author

Indeed, thank you again for doing the heavy lifting and your support.

0 replies

Using stumpi to detect anomaly #287

Uh oh!

Uh oh!

Darveesh Dec 1, 2020

Replies: 6 comments

Uh oh!

Uh oh!

seanlaw Dec 1, 2020 Maintainer

Uh oh!

Darveesh Dec 2, 2020 Author

Uh oh!

seanlaw Dec 2, 2020 Maintainer

Uh oh!

Darveesh Dec 3, 2020 Author

Uh oh!

Uh oh!

seanlaw Dec 3, 2020 Maintainer

Uh oh!

Darveesh Dec 3, 2020 Author

Darveesh
Dec 1, 2020

seanlaw
Dec 1, 2020
Maintainer

Darveesh
Dec 2, 2020
Author

seanlaw
Dec 2, 2020
Maintainer

Darveesh
Dec 3, 2020
Author

seanlaw
Dec 3, 2020
Maintainer

Darveesh
Dec 3, 2020
Author