Top K motifs of Time Series #770

meeen · 2022-12-27T14:16:59Z

meeen
Dec 27, 2022

Hello,
based on this paper ( https://www.cs.ucr.edu/~eamonn/meaningless.pdf ) I want to find the top K motifs of my time series and then cluster them.

I have a univariate time series with floats as values.

I use this code to get the top k motifs:
s = number of steps (the window size)
k = number of motifs I want to find
top_motifs = list with the indices of the top motifs
m_seg = list of arrays with the values of the top motifs

top_motifs =[]
m_seg =[]
s = 144
k = 15

df_m = df1.copy()
for i in range(k):
    mps = stumpy.stump(df_m['Values'], s)
    motif = np.argsort(mps[:, 0])[0]
    top_motifs.append(motif)
    allmotifs = stumpy.motifs(df_m['Values'],mps[:, 0],max_distance=2.0,max_matches=25)[1][0]
    segment = np.copy(df1[motif:motif+144])
    scaler = MinMaxScaler()
    segment = scaler.fit_transform(segment)
    m_seg.append(segment)
    for l in allmotifs:
        df_m['Values'][l:l+144]= np.inf

Am I getting the top k motifs with this code?
Is there already a better implementation to get the top k motifs?

Thank you for your help :)

seanlaw · 2022-12-27T16:30:48Z

seanlaw
Dec 27, 2022
Maintainer

@meeen Thank you for your question and welcome to the STUMPY community. Unfortunately, I don't think what you are doing is correct. It looks like you are re-computing the same matrix profile 15 times and then grabbing the same top-1 motif every time. Instead, you should compute the matrix profile once and then extract the top 15 motifs once:

mp = stumpy.stump(df_m['Values'], s)
motif_distances, motif_indices = stumpy.motifs(df_m['Values'], mp[:, 0], max_motifs=k, max_distance=2.0, max_matches=25)

Please refer to the stumpy.motifs documentation for an explanation and please feel free to ask follow up questions if the documentation is unclear. motif_indices will contain (at most) k rows of motifs with the index of the motif being the first column and all subsequent matches (i.e., their indices) being in the remaining columns. And the distance from the motif to the individual matches can be found in motif_distances (again, each row corresponds to each motif and their matches).


meeen · 2023-02-17T02:28:50Z

meeen
Feb 17, 2023
Author

Thank you for your help

If I use the code you showed above, I basically get just the same shape (graph) over and over again: one motif and then 24 matches of the same motif, all looking almost identical. If I increase max_matches I get even more of the same shape. So this seems to be the most common motif in my time series, because it has a lot of matches? If I change max_motifs nothing changes. How do I get the next most common shape (motif?) in my time series, so that I have the top 10 motifs that look different, each with its own matches?

What I want to implement is what is described on the STUMPY website:
https://stumpy.readthedocs.io/en/latest/Tutorial_STUMPY_Basics.html

Find Top-K Motifs
Now that you’ve computed the matrix profile, mp, for your time series and identified the best global motif, you may be interested in discovering other motifs within your data. However, you’ll immediately learn that doing something like top_10_motifs_idx = np.argsort(mp[:, 0])[10] doesn’t actually get you what you want and that’s because this only returns the index locations that are likely going to be close to the global motif! Instead, after identifying the best motif (i.e., the matrix profile location with the smallest value), you first need to exclude the local area (i.e., an exclusion zone) surrounding the motif pair by setting their matrix profile values to np.inf before searching for the next motif. Then, you’ll need to repeat the “exclude-and-search” process for each subsequent motif. Luckily, STUMPY offers two additional functions, namely, stumpy.motifs and stumpy.match, that help simplify this process. While it is beyond the scope of this basic tutorial, we encourage you to check them out!
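To make the "exclude-and-search" description above concrete, here is a minimal NumPy-only sketch. It is not STUMPY's implementation: `top_k_motif_indices` and `excl_zone` are hypothetical names, the matrix profile values are made up, and for simplicity it only excludes the area around each candidate index (not around its nearest neighbor as well).

```python
import numpy as np

def top_k_motif_indices(P, k, excl_zone):
    """Greedy exclude-and-search over a 1D matrix profile (hypothetical helper)."""
    P = np.asarray(P, dtype=float).copy()  # work on a copy, leave the input intact
    found = []
    for _ in range(k):
        idx = int(np.argmin(P))
        if not np.isfinite(P[idx]):
            break  # everything has been excluded already
        found.append(idx)
        # Mask the local area around the candidate before the next search
        start = max(0, idx - excl_zone)
        stop = min(len(P), idx + excl_zone + 1)
        P[start:stop] = np.inf
    return found

# Made-up matrix profile values, for illustration only
P = np.array([5.0, 0.2, 0.1, 0.3, 5.0, 5.0, 0.8, 0.7, 0.9, 5.0])

print(np.argsort(P)[:3])             # [2 1 3] -> all next to the global minimum
print(top_k_motif_indices(P, 2, 1))  # [2, 7]  -> two distinct locations
```

Plain sorting returns indices that crowd around the best location, while masking an exclusion zone after each pick forces the second result to come from a different part of the series.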

My idea was to get the most common motif with stumpy.motifs and then delete this motif and all its matches from the time series with
df_m['Values'][l:l+144] = np.inf. Then I calculate a new matrix profile and extract the most common motif from it again, which is then the second most common motif of the original time series, because the most common one (with all its matches) was deleted.

How far am I off?


seanlaw · 2023-02-18T10:35:15Z

seanlaw
Feb 18, 2023
Maintainer

My idea was to get the most common motif with stumpy.motifs and then delete this motif and all its matches from the time series with
df_m['Values'][l:l+144] = np.inf. Then I calculate a new matrix profile and extract the most common motif from it again, which is then the second most common motif of the original time series, because the most common one (with all its matches) was deleted.
How far am I off?

In theory, stumpy.motifs is doing all of this for you. However, currently, it is making an assumption (a best guess) that the best motifs have a matrix profile (one-nearest neighbor) distance that is less than 2 standard deviations below the average matrix profile distance (i.e., np.nanmax([np.nanmean(P) - 2.0 * np.nanstd(P), np.nanmin(P)]) where P = mp[:, 0]). This is currently controlled by the cutoff parameter and any subsequence with a matrix profile distance that is greater than this cutoff is ignored.
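As a rough illustration of why this default can leave you with very few motifs, the sketch below evaluates the formula quoted above on made-up matrix profile values. `default_cutoff` is a hypothetical helper name, and the exact default used by your installed STUMPY version may differ.

```python
import numpy as np

def default_cutoff(P):
    """Cutoff formula as quoted in the reply above (hypothetical helper)."""
    P = np.asarray(P, dtype=float)
    return np.nanmax([np.nanmean(P) - 2.0 * np.nanstd(P), np.nanmin(P)])

# Made-up matrix profile values, for illustration only
P = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

cutoff = default_cutoff(P)
print(cutoff)               # 1.0
print(np.sum(P <= cutoff))  # 1
```

Here the mean minus two standard deviations falls below the smallest value, so the cutoff collapses to the minimum and only one subsequence qualifies as a motif candidate. Passing a larger cutoff relaxes this.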

Each row of your motif_distances or motif_indices should correspond to a different motif. How many rows exist in your motif_distances?

As stated in the stumpy.motifs docstring:

If you must return a shape of (max_motifs, max_matches), then you may consider specifying a smaller min_neighbors, a larger max_distance, and/or a larger cutoff. For example, while it is ill advised, setting min_neighbors=1, max_distance=np.inf, and cutoff=np.inf will ensure that the shape of the output arrays will be (max_motifs, max_matches). However, given the lack of constraints, the quality of each motif and the quality of each match may be drastically different. Setting appropriate conditions will help ensure appropriately constrained results that may be easier to interpret.

So, you may try setting min_neighbors=1, max_distance=np.inf, and cutoff=np.inf, which (per the docstring) will ensure that the shape of the output arrays is (max_motifs, max_matches):

mp = stumpy.stump(df_m['Values'], s)
motif_distances, motif_indices = stumpy.motifs(df_m['Values'], mp[:, 0], max_motifs=k, max_matches=25, min_neighbors=1, max_distance=np.inf, cutoff=np.inf)

This should, hopefully, get you pretty close to what you need. However, note that the quality of the matches is likely to diminish as you move towards the right (columns) or down (rows) of motif_distances since you have applied no constraints on the quality of the matches.

I hope that helps.


meeen · 2023-02-18T12:23:30Z

meeen
Feb 18, 2023
Author

Thank you, this was my problem. I tweaked the parameters and now I get really good results :) I use this for my bachelor thesis, where I have to search for patterns and anomalies in sensor data. I found the patterns and anomalies of a given length based on domain knowledge.

I read the Matrix Profile XX paper about the pan matrix profile and I tried the example. My pan matrix profile looks like this:
The orange line is the index for the first step size and its nearest neighbour from .M_. What does .M_ tell me? Is this not the best number of steps linked to the top k motifs of arbitrary length?

Also, the discords that I found were just the noisiest parts of my time series. Is there something I can do to tweak this to get more domain-relevant discords?

But in terms of finding patterns, the matrix profile with STUMPY is by far the best method I tested on my data.

Do you have any suggestions for interesting ideas to try?
I will look into clustering with the matrix profile next and your comparison to MERLIN.

Thank you for your help :)


seanlaw · 2023-02-18T22:12:47Z

seanlaw
Feb 18, 2023
Maintainer

I read the Matrix Profile XX paper about the pan matrix profile and I tried the example. My pan matrix profile looks like this:
The orange line is the index for the first step size and its nearest neighbour from .M_. What does .M_ tell me? Is this not the best number of steps linked to the top k motifs of arbitrary length?

Given the change in topic, would you mind creating a separate/new discussion on using stumpy.stimp? This will make it easier for people to search for this topic.

Also, the discords that I found were just the noisiest parts of my time series. Is there something I can do to tweak this to get more domain-relevant discords?

Unfortunately, this is domain dependent and so you'll have to traverse the set of discords and see if they make any sense.

Do you have any suggestions for interesting ideas to try? I will look into clustering with the matrix profile next and your comparison to MERLIN.

You may try looking at the following papers that were published by the original authors:

https://www.cs.ucr.edu/~eamonn/100_Time_Series_Data_Mining_Questions__with_Answers.pdf
https://www.cs.ucr.edu/~eamonn/Top_Ten_Things_Matrix_Profile.pdf

There are some wonderful ideas there.


MiguelGSilva · 2023-07-03T16:47:23Z

MiguelGSilva
Jul 3, 2023

Good afternoon.

I'm a bit confused about how the stumpy.motifs function can return the top k motifs when we provide the matrix profile computed by stumpy.stump(time_serie_array, m=m, k=1) as input. Does this mean that it selects the top motifs based on the distance of the first match to the motif sequence? If we instead wanted the top motifs where the average distance of the first 5 matches is minimized, would we need to use the matrix profile generated using stumpy.stump(time_serie_array, m=m, k=5) instead?

Thank you for your assistance.

6 replies

MiguelGSilva Jul 14, 2023

Apologies for the delayed response. I appreciate your insightful explanation, which was very helpful. However, I still have some confusion regarding how the "stumpy.motifs" function utilizes the matrix profile when k>1. I attempted to examine the source code and noticed that it utilizes np.argmin(P[-1]) to obtain the candidate motif. Does this imply that when k>1, the top motif is determined only by the distance of the k-th match, and that it does not consider the other distances in the matrix profile that we give as an input parameter?

To identify the top motifs based on the average distance of the first 5 matches, as you mentioned in the previous response, you would likely need to generate a custom matrix profile that includes an additional column representing the average of the matches, and use it as the input, right?

Thanks in advance for your help.

NimaSarajpoor Jul 14, 2023
Collaborator

However, I still have some confusion regarding how the "stumpy.motifs" function utilizes the matrix profile when k>1.

With k>1, the matrix profile values from stumpy.stump(T, m, k=k) are 2D, with shape (n - m + 1, k), where n = len(T). However, you should note that stumpy.motifs ONLY SUPPORTS a 1D array. (@seanlaw we may add it to the description of the param P in the docstring of the function motifs)

https://github.com/TDAmeritrade/stumpy/blob/60f77db185438c37225cecf35909034384026a6e/stumpy/motifs.py#L313-L317

So, at the end of the day, this function accepts a 1D array.

I attempted to examine the source code and noticed that it utilizes np.argmin(P[-1]) to obtain the candidate motif. Does this imply that when k>1, the top motif is determined only by the distance of the k-th match, and that it does not consider the other distances in the matrix profile that we give as an input parameter?

I think you missed something here. We have the following line in the code:
https://github.com/TDAmeritrade/stumpy/blob/60f77db185438c37225cecf35909034384026a6e/stumpy/motifs.py#L344
which results in adding a new dimension to the array. So, e.g. if P is [10, 20, 30, 40], then it will be changed to [[10, 20, 30, 40]]. We then pass this array to _motifs. Hence, I think the input P in the function _motifs is a 2D array with shape (1, ...).
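The reshaping described above can be reproduced in isolation with plain NumPy. This is only a sketch that assumes the linked line does something equivalent to `P[np.newaxis, :]`; it does not call STUMPY, and `P_2d` is an illustrative name.

```python
import numpy as np

P = np.array([10, 20, 30, 40])  # the 1D example from the reply above

# Add a leading dimension, turning shape (4,) into shape (1, 4)
P_2d = P[np.newaxis, :]
print(P_2d)        # [[10 20 30 40]]
print(P_2d.shape)  # (1, 4)

# With a single row, P_2d[-1] is just the original 1D profile,
# so np.argmin(P_2d[-1]) picks the same index as np.argmin(P)
print(np.argmin(P_2d[-1]) == np.argmin(P))  # True
```

Under that assumption, `P[-1]` inside the internal function selects the only row rather than the k-th neighbor column.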

To identify the top motifs based on the average distance of the first 5 matches, as you mentioned in the previous response, you would likely need to generate a custom matrix profile that includes an additional column representing the average of the matches, and use it as the input, right?

Yes, that would be my approach. You can use stumpy.stump(T, m, k=k) to find top-k neighbours. For now, let's say we are okay with considering neighbors that are trivially close to each other. So, in this case, I will just do stumpy.stump(T, m, k=5), and then get the top-k matrix profile from the output, which is a 2D array, and then I take average of each row. That would be the average distance between a subsequence and its top-5 neighbors. As I explained before, to avoid neighbors being trivially close to each other, you may need to choose k >> 5 and then get top-5 neighbors from it.
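A small NumPy-only sketch of the row-averaging step described above. The array `topk_P` is a made-up stand-in for the top-k distance columns (here k=3) that you would take from the stumpy.stump(T, m, k=k) output. How you slice those columns out of the real output is not shown here and should be checked against the stumpy.stump documentation.

```python
import numpy as np

# Stand-in for the top-k distances, shape (n - m + 1, k); values are made up
topk_P = np.array([
    [0.5, 2.0, 3.5],
    [0.4, 0.6, 0.8],
    [1.0, 1.0, 1.0],
    [0.1, 4.0, 4.9],
])

# Average distance from each subsequence to its k nearest neighbors
avg_P = topk_P.mean(axis=1)

print(np.argmin(topk_P[:, 0]))  # 3 -> best by nearest-neighbor distance only
print(np.argmin(avg_P))         # 1 -> best by average over the top-k neighbors
```

The resulting `avg_P` is a 1D array, so it has the shape that stumpy.motifs expects for its P argument. Note that the two criteria can pick different subsequences, as they do here.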

MiguelGarcaoSilva Jul 15, 2023

Perfect, thank you very much for clarifying!

NimaSarajpoor Jul 15, 2023
Collaborator

@MiguelGarcaoSilva
And thank you for the question :)

seanlaw Jul 17, 2023
Maintainer

@seanlaw we may add it to the description of the param P in the docstring of the function motifs

@NimaSarajpoor Done! See commit 6597d45

Also, I think this would be a good example to include in awesome STUMPY
