mihailescum DM #219

seanlaw · 2020-07-23T14:34:41Z

seanlaw
Jul 23, 2020
Maintainer

@mexxexx Carrying on our earlier conversations here

mihailescum · 2020-07-24T07:15:09Z

mihailescum
Jul 24, 2020

I remembered what I had in mind yesterday. It wasn't about precision, because honestly I have no idea about error propagation; it was because I looked into Numba's fastmath.

So this fastmath option is actually handled by LLVM, the compiler framework Numba is built on. It means that certain flags are set to relax the IEEE 754 standard. They are described here.

The flags I want to talk about are called nnan and ninf.

nnan
No NaNs - Allow optimizations to assume the arguments and result are not NaN. If an argument is a nan, or the result would be a nan, it produces a poison value instead.

ninf
No Infs - Allow optimizations to assume the arguments and result are not +/-Inf. If an argument is +/-Inf, or the result would be +/-Inf, it produces a poison value instead.

So this means that LLVM assumes that the input contains neither NaN nor Inf. You can see this if you try out np.isnan(np.nan) in fastmath mode: it returns False. The check is simply optimized away by the compiler, which assumes that the input won't be NaN. However, np.isinf(np.inf) still seems to work in fastmath mode, and the product of a number and np.inf still evaluates to np.inf. This is what we use for the NaN/Inf handling, right? (At least for now, without your new code.)

I'm not sure whether this is expected behavior or whether it is actually the same on all platforms. I think this could change in any Numba version, so we shouldn't expect the result to be np.inf but a poison value instead, whatever this means.

So what would I suggest? Actually, overcoming this is simple. Either we store a boolean array of valid subsequences, but this would require quite some changes in the codebase. Alternatively, instead of setting the means of invalid subsequences to np.inf, we could set the corresponding standard deviations to -1 and, in compute_distance, return a negative value if the subsequence is illegal.

Actually, now that I'm thinking about it, the matrix profile can have Inf values, so maybe we should disable the ninf flag altogether?

What do you think?

0 replies

seanlaw · 2020-07-24T15:48:58Z

seanlaw
Jul 24, 2020
Maintainer Author

@mexxexx Thank you for these thoughts! I had encountered this fastmath problem when writing up the aamp code and, as you pointed out, I ended up using a boolean array of valid subsequences via this kind of pre-processing step:

T_A[np.isinf(T_A)] = np.nan
T_A_subseq_isfinite = np.all(np.isfinite(core.rolling_window(T_A, m)), axis=1)
T_A[np.isnan(T_A)] = 0

M_T, Σ_T = core.compute_mean_std(T_A, m)

Notice that the computation of the mean and stddev comes AFTER we set all illegal values to zero, so all of the sliding window means and stddevs should be finite numbers. I really like that the key information is stored in T_A_subseq_isfinite because it is explicit and does not require us to keep track of anything further (i.e., by referencing the value of the subsequence mean or subsequence stddev like we are currently doing). I think this is a good decision and will make it easier to maintain in the long run.
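For anyone reading along, here is a minimal, self-contained sketch of that pre-processing step. It is only an illustration: `preprocess` is a hypothetical helper name, and NumPy's `sliding_window_view` is used as a stand-in for `core.rolling_window` and `core.compute_mean_std`, so the real STUMPY code may differ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def preprocess(T, m):
    """Hypothetical helper: flag finite subsequences, then zero out bad values."""
    T = np.asarray(T, dtype=np.float64).copy()
    # Treat +/-inf like NaN so there is only one "illegal" marker to handle
    T[np.isinf(T)] = np.nan
    # A subsequence is valid only if every one of its m values is finite
    subseq_isfinite = np.all(np.isfinite(sliding_window_view(T, m)), axis=1)
    # Replace illegal values with zero BEFORE computing the rolling statistics
    T[np.isnan(T)] = 0.0
    windows = sliding_window_view(T, m)
    return T, subseq_isfinite, windows.mean(axis=1), windows.std(axis=1)


T = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.inf, 7.0, 8.0, 9.0])
T_clean, isfinite, mu, sigma = preprocess(T, m=3)
```

With this input only the last subsequence, [7, 8, 9], is flagged as valid, while every rolling mean and stddev is a finite number.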

And then, later, you can compute the distance and then decide whether you want to update the matrix profile:

D_squared = _calculate_squared_distance(...)

if T_A_subseq_isfinite[i+k] and T_B_subseq_isfinite[i]:
    if D_squared < P[thread_idx, i]:
        P[thread_idx, i] = D_squared
        I[thread_idx, i] = i + k

    if ignore_trivial and D_squared < P[thread_idx, i + k]:
        P[thread_idx, i + k] = D_squared
        I[thread_idx, i + k] = i

This all feels very natural

Actually, now that I'm thinking about it, the matrix profile can have Inf values, so maybe we should disable the ninf flag altogether?

I wonder if we can simply initialize the matrix profile values and fill it with np.finfo(np.float64).max (1.7976931348623157e+308) or some value close to this max. And then perform a post-processing step where we convert any matrix profile values that equal the max back to np.inf.
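A rough sketch of that idea, assuming a plain 1-D profile array. `SENTINEL`, `init_profile`, and `finalize_profile` are hypothetical names used only for illustration, and the candidate distances are made up.

```python
import numpy as np

# Largest finite float64, used in place of np.inf while the kernel runs
SENTINEL = np.finfo(np.float64).max


def init_profile(n):
    """Start every entry at the finite sentinel instead of np.inf."""
    return np.full(n, SENTINEL, dtype=np.float64)


def finalize_profile(P):
    """Post-processing: entries that were never updated become np.inf."""
    P = P.copy()
    P[P == SENTINEL] = np.inf
    return P


P = init_profile(5)
# Illustrative (index, squared distance) candidates
for i, d_squared in [(0, 1.5), (2, 0.25), (3, 9.0), (2, 4.0)]:
    if d_squared < P[i]:
        P[i] = d_squared

P_final = finalize_profile(P)
```

The array stays finite for the whole update loop, and only the untouched entries (indices 1 and 4 here) are turned into np.inf at the end.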

0 replies

seanlaw · 2020-08-08T02:29:49Z

seanlaw
Aug 8, 2020
Maintainer Author

@mexxexx When we do _calculate_squared_distance, can you recall why we follow:

if σ_Q < STDDEV_THRESHOLD or Σ_T < STDDEV_THRESHOLD:
    D_squared = m
else:
    denom = m * σ_Q * Σ_T
    if np.abs(denom) < DENOM_THRESHOLD:  # pragma nocover
        denom = DENOM_THRESHOLD
    D_squared = np.abs(2 * m * (1.0 - (QT - m * μ_Q * M_T) / denom))

if σ_Q < STDDEV_THRESHOLD and Σ_T < STDDEV_THRESHOLD:
    D_squared = 0

and not:

if σ_Q < STDDEV_THRESHOLD and Σ_T < STDDEV_THRESHOLD:
    D_squared = 0
elif σ_Q < STDDEV_THRESHOLD or Σ_T < STDDEV_THRESHOLD:
    D_squared = m
else:
    denom = m * σ_Q * Σ_T
    if np.abs(denom) < DENOM_THRESHOLD:  # pragma nocover
        denom = DENOM_THRESHOLD
    D_squared = np.abs(2 * m * (1.0 - (QT - m * μ_Q * M_T) / denom))
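As far as the returned value is concerned, the two orderings appear to be equivalent: both give 0 when both stddevs are below the threshold, m when exactly one is, and the general expression otherwise. The plain-Python sketch below checks this on a small grid. The threshold values are placeholders assumed for the example, and the function names are hypothetical.

```python
import itertools

STDDEV_THRESHOLD = 1e-7  # placeholder value, assumed for this sketch
DENOM_THRESHOLD = 1e-14  # placeholder value, assumed for this sketch


def _general_case(m, QT, mu_Q, sigma_Q, M_T, Sigma_T):
    # Shared branch: z-normalized squared distance from the dot product QT
    denom = m * sigma_Q * Sigma_T
    if abs(denom) < DENOM_THRESHOLD:
        denom = DENOM_THRESHOLD
    return abs(2 * m * (1.0 - (QT - m * mu_Q * M_T) / denom))


def variant_override(m, QT, mu_Q, sigma_Q, M_T, Sigma_T):
    # First ordering: "or" branch, then a trailing "and" check that overrides it
    if sigma_Q < STDDEV_THRESHOLD or Sigma_T < STDDEV_THRESHOLD:
        D_squared = m
    else:
        D_squared = _general_case(m, QT, mu_Q, sigma_Q, M_T, Sigma_T)
    if sigma_Q < STDDEV_THRESHOLD and Sigma_T < STDDEV_THRESHOLD:
        D_squared = 0
    return D_squared


def variant_elif(m, QT, mu_Q, sigma_Q, M_T, Sigma_T):
    # Second ordering: a single if/elif/else chain
    if sigma_Q < STDDEV_THRESHOLD and Sigma_T < STDDEV_THRESHOLD:
        return 0
    elif sigma_Q < STDDEV_THRESHOLD or Sigma_T < STDDEV_THRESHOLD:
        return m
    return _general_case(m, QT, mu_Q, sigma_Q, M_T, Sigma_T)


m = 8
sigmas = [0.0, 1e-9, 0.5, 2.0]
mismatches = [
    (s_q, s_t)
    for s_q, s_t in itertools.product(sigmas, sigmas)
    if variant_override(m, 3.0, 0.1, s_q, 0.2, s_t)
    != variant_elif(m, 3.0, 0.1, s_q, 0.2, s_t)
]
```

So any difference between the two would be in readability or in how the compiler handles the branches, not in the result.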

0 replies

mihailescum · 2020-08-13T22:10:07Z

mihailescum
Aug 13, 2020

Sorry for the late reply!

I wonder if we can simply initialize the matrix profile values and fill it with np.finfo(np.float64).max (1.7976931348623157e+308) or some value close to this max. And then perform a post-processing step where we convert any matrix profile values that equal the max back to np.inf.

That's a good idea. I think that the normalized euclidean distance is bounded by 4m^2 or so.
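For reference, a tighter bound can be read off the squared-distance expression quoted earlier in this thread. This assumes σ_Q and Σ_T are population standard deviations, so that the fraction is the Pearson correlation ρ between the two subsequences:

```latex
D^2 = 2m\left(1 - \frac{QT - m\,\mu_Q M_T}{m\,\sigma_Q \Sigma_T}\right) = 2m\,(1 - \rho),
\qquad \rho \in [-1, 1]
\quad\Longrightarrow\quad
0 \le D^2 \le 4m, \qquad 0 \le D \le 2\sqrt{m}.
```

Under that assumption any finite value above 4m (for the squared distance) would work as a sentinel, and np.finfo(np.float64).max is comfortably above it.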

And no, I cannot recall why we did it this way. Did you find out?

0 replies

seanlaw · 2020-08-16T20:41:29Z

seanlaw
Aug 16, 2020
Maintainer Author

For your information, the Pearson correlation version of STUMPY has now been implemented into stump, stumped, and scrump!

0 replies

mihailescum · 2020-08-16T21:18:42Z

mihailescum
Aug 16, 2020

That's great to hear!

0 replies

seanlaw · 2020-08-17T11:45:50Z

seanlaw
Aug 17, 2020
Maintainer Author

I also removed those poorly conceived/written _get_orders_ranges and _get_max_order_idx functions in scrump (see this comment for a refresher). They have now been replaced with the cleaner core._count_diagonal_ndist and core._get_array_ranges. core._count_diagonal_ndist simply takes a list of diagonals (along with m, n_A, and n_B) and returns a vector with the number of distances that are computed along each diagonal. core._get_array_ranges is nearly identical to _get_orders_ranges but it is a lot less confusing and simply uses the built-in NumPy np.linspace and np.searchsorted pair to figure out how best to split the diagonals into even chunks. I point this out because it is being reused everywhere we traverse diagonals and need to split the work across multiple threads. It's actually really nice to code with now, since it takes care of the parallelization easily.

Additionally, I added a core.preprocess_diagonal. It's just a preprocessing function that returns the components needed for diagonal traversal.
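To make the diagonal-splitting idea concrete, here is a rough sketch of how counting distances per diagonal and then chunking with np.linspace and np.searchsorted could look. These are NOT the actual core._count_diagonal_ndist or core._get_array_ranges implementations; the function names, signatures, and the exact chunking rule are assumptions made for illustration.

```python
import numpy as np


def count_diagonal_ndist(diags, m, n_A, n_B):
    """Distances along each diagonal k of the (n_A-m+1) x (n_B-m+1) distance matrix."""
    diags = np.asarray(diags, dtype=np.int64)
    n_rows = n_A - m + 1
    n_cols = n_B - m + 1
    # Diagonal k >= 0 starts at (0, k); diagonal k < 0 starts at (-k, 0)
    start_row = np.maximum(0, -diags)
    start_col = np.maximum(0, diags)
    ndist = np.minimum(n_rows - start_row, n_cols - start_col)
    return np.maximum(ndist, 0)


def get_array_ranges(ndist, n_chunks):
    """Split the diagonals into n_chunks contiguous [start, stop) ranges of similar work."""
    cumsum = np.cumsum(ndist)
    # Evenly spaced cumulative-work targets, one per chunk
    targets = np.linspace(0, cumsum[-1], n_chunks + 1)[1:]
    # Each chunk ends with the first diagonal whose cumulative work reaches its target
    stops = np.searchsorted(cumsum, targets, side="left") + 1
    stops = np.minimum(stops, len(ndist))
    starts = np.concatenate(([0], stops[:-1]))
    return np.column_stack((starts, stops))


diags = np.arange(1, 8)
ndist = count_diagonal_ndist(diags, m=3, n_A=10, n_B=10)
ranges = get_array_ranges(ndist, n_chunks=2)
```

Here the seven diagonals hold 7, 6, ..., 1 distances, and the two chunks cover diagonals [0, 3) and [3, 7). The split is only roughly even, since a diagonal is never divided between threads in this sketch.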

0 replies

seanlaw · 2020-08-24T17:39:23Z

seanlaw
Aug 24, 2020
Maintainer Author

Also, I added aamp, aamped, aampi, and gpu_aamp functions to the API since it seems that many people have been asking for them.

0 replies

mihailescum · 2020-08-25T11:40:59Z

mihailescum
Aug 25, 2020

Good to hear, I'm pretty sure that will benefit some people.
I have almost finished my contract and will have some time in the following weeks to advance the motif discovery tool, to finally have something there!

0 replies

seanlaw · 2020-08-25T13:04:57Z

seanlaw
Aug 25, 2020
Maintainer Author

That's great! Congratulations on wrapping up this current chapter of your journey and welcome back to STUMPY! We've missed you! 😃

0 replies

seanlaw · 2020-09-18T13:56:31Z

seanlaw
Sep 18, 2020
Maintainer Author

@mexxexx Did you get a chance to look at the MDL comments? I was wondering whether you had any thoughts.

If I understood it correctly, MDL is a post-processing step that doesn't seem to be too well developed. That is, it feels like one possible solution for choosing the best number of dimensions but it's still an educated guess? Having said that, I don't have a better approach than what is proposed. Although, there is a difference in how one would do MDL for z-normalized distances and non-normalized distances. The former discretizes at the subsequence level while the latter discretizes at the global level (which makes sense).

0 replies

mihailescum · 2020-09-18T14:00:16Z

mihailescum
Sep 18, 2020

@seanlaw Yes I read the comments, but didn't reply since I couldn't add anything useful 😄

I also understand it as a best guess. Honestly, for the moment I don't think it's a huge requirement for stumpy, especially since it requires having a working subspace implementation.

0 replies

seanlaw · 2020-09-18T14:03:01Z

seanlaw
Sep 18, 2020
Maintainer Author

I also understand it as a best guess. Honestly, for the moment I don't think it's a huge requirement for stumpy, especially since it requires having a working subspace implementation.

Ahh yes, I was just looking up the subspace issue and trying to remember if subspaces depended on MDL and I didn't see any comments on it that related the two. It sounds like MDL depends on subspaces or that if you compute MDL then you get ranked subspaces as a by-product?

0 replies

mihailescum · 2020-09-18T14:09:01Z

mihailescum
Sep 18, 2020

If I remember correctly, you use MDL to find out how many dimensions your motif actually has. Let's say your time series has three dimensions. Then you could have two interesting dimensions and one being only noise, so MDL should yield that the motif has two dimensions.

However, to be able to compute the MDL, you (and this is a guess, but it appears to be crucial information) need to know which of the dimensions form your one-, two-, and three-dimensional motifs; otherwise, how would you even know what to compare? And this is exactly the information encoded in the subspace.

0 replies

seanlaw · 2020-09-18T14:34:15Z

seanlaw
Sep 18, 2020
Maintainer Author

If I remember correctly, you use MDL to find out how many dimensions your motif actually has. Let's say your time series has three dimensions. Then you could have two interesting dimensions and one being only noise, so MDL should yield that the motif has two dimensions.

However, to be able to compute the MDL, you (and this is a guess, but it appears to be crucial information) need to know which of the dimensions form your one-, two-, and three-dimensional motifs; otherwise, how would you even know what to compare? And this is exactly the information encoded in the subspace.

Okay, I'll have to go back and look at this more closely with your points in mind. Thank you and have a great weekend!

0 replies

mihailescum · 2020-09-18T14:42:32Z

mihailescum
Sep 18, 2020

Thank you, you too! Feel free to ask if you have questions. I didn't fully understand the MDL, but discussion always helps in my experience.

0 replies

seanlaw · 2020-10-17T19:21:23Z

seanlaw
Oct 17, 2020
Maintainer Author

I've been thinking, for mstump, is there any reason why we don't/can't just compute all of the 1-dimensional matrix profiles for each dimension separately and then sort them in the end? Is there a reason why we must compute all of the matrix profiles for one window, sort it, and then proceed to the next window? This actually seems inefficient from a memory caching standpoint.

0 replies

seanlaw · 2020-10-17T23:01:24Z

seanlaw
Oct 17, 2020
Maintainer Author

Never mind. I thought through it and remembered that it's not that simple!

0 replies

seanlaw · 2021-02-06T03:34:16Z

seanlaw
Feb 6, 2021
Maintainer Author

@mihailescum I just released v1.8.0 and now all of the main functions have a normalize parameter. When normalize=True, the z-normalized version of the function is executed. However, when normalize=False, instead of embedding a ton of if/else logic into each function, I've added a function decorator (see core.non_normalized()) that detects normalize == False and, if so, automatically calls the complementary non-normalized version of the same function.
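The dispatch pattern can be sketched roughly as follows. This is a simplified illustration and not the real core.non_normalized(): its actual signature and argument handling may differ, and toy_stump and toy_aamp are made-up stand-ins for a z-normalized function and its non-normalized counterpart.

```python
import functools


def non_normalized(non_norm_func):
    """Hypothetical decorator factory: reroute calls when normalize=False."""

    def decorator(norm_func):
        @functools.wraps(norm_func)
        def wrapper(*args, normalize=True, **kwargs):
            if not normalize:
                # Hand the same arguments to the non-normalized counterpart
                return non_norm_func(*args, **kwargs)
            return norm_func(*args, **kwargs)

        return wrapper

    return decorator


def toy_aamp(T, m):
    # Made-up non-normalized implementation
    return ("non-normalized", len(T) - m + 1)


@non_normalized(toy_aamp)
def toy_stump(T, m):
    # Made-up z-normalized implementation
    return ("z-normalized", len(T) - m + 1)


result_norm = toy_stump([1, 2, 3, 4], 2)
result_non_norm = toy_stump([1, 2, 3, 4], 2, normalize=False)
```

The body of the z-normalized function stays free of if/else branching because the decision happens once, in the wrapper.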

0 replies

seanlaw · 2022-01-27T22:39:47Z

seanlaw
Jan 27, 2022
Maintainer Author

@mihailescum I think I figured out Minimum Description Length! Going to push a commit in the next few days

2 replies

mihailescum Jan 28, 2022

That's great to hear! Did you already try it out on anything practical? Do you think it will make life easier when dealing with multidimensional profiles?

seanlaw Jan 28, 2022
Maintainer Author

If you are interested, the code can be found in this commit 9fd6c69. So far, I've only tested it on the Multidimensional Motif Discovery Tutorial. While it isn't guaranteed to be a silver bullet, I played around with it a little by adding decoy (random walk) time series and then rearranging the order of the time series and it was still able to pick out the correct ones. At the end of the day, at minimum, I think that it will be far better/easier to select k using MDL rather than plotting the matrix profile values and then looking for the "elbow".

In case you were curious, here was the detailed MDL explanation that I was able to obtain by asking on StackOverflow: https://stackoverflow.com/questions/70430148/computing-description-length/70823650#70823650
