Optimize using in-place math; small cleanups #6

Status: Open. Wants to merge 5 commits into `master`.
Conversation

@dnadlinger (Contributor):

I was recently using ExpmV.jl to solve a Lindblad master equation for a time-independent system of coupled harmonic oscillators, i.e. a very sparse A that stays constant over many expmv iterations. In this use case, the loop part of expmv_fun was definitely the bottleneck for my whole application, given that I could pre-compute the select_taylor_degree result.

This PR changes the code to explicitly allocate some copies of vectors before the loop and then use in-place operations to avoid excessive GC pressure (according to @time, my program churned through something like 2 TB of GC memory in a couple of minutes). It also replaces the infinity-norm calculation with the iamax BLAS function. All in all, this yields a severalfold performance gain in my use case.
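For illustration, here is a minimal, hypothetical sketch of the allocation-avoiding pattern described above (not the PR's actual code, and written against current Julia; the 2015 PR used that era's equivalents of `mul!`). Work vectors are allocated once before the loop and reused with in-place operations:

```julia
using LinearAlgebra

# Hypothetical illustration: apply a truncated Taylor series of exp(A) to b,
# reusing preallocated buffers inside the hot loop instead of allocating
# fresh vectors (and GC garbage) on every iteration.
function taylor_expmv!(f, A, b, m)
    copyto!(f, b)        # f accumulates the partial sums, starting at A^0 * b
    v = copy(b)          # v holds the current Taylor term A^k * b / k!
    w = similar(b)       # scratch buffer for the matrix-vector product
    for k in 1:m
        mul!(w, A, v)    # w = A*v with no fresh allocation
        v, w = w, v      # swap buffers instead of copying
        v ./= k          # in-place scale: v is now A^k * b / k!
        f .+= v          # accumulate in place
    end
    return f
end

A = [0.0 1.0; -1.0 0.0]
b = [1.0, 0.0]
f = similar(b)
taylor_expmv!(f, A, b, 25)   # f ≈ exp(A) * b = [cos(1), -sin(1)]
```

The buffer swap `v, w = w, v` is the key trick: the two vectors trade roles each iteration, so no per-iteration allocation survives into the next GC cycle.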

There is likely still quite a bit more to optimize here, both in terms of performance and code style. I'm new to Julia, so I'd appreciate any pointers.

From the PR's commit messages:

- "The statistic is currently wrong for us anyway, and not very useful for using the algorithm (as opposed to developing it). If somebody is really interested in this, they can just supply a custom matrix multiplication operator in Julia."
- "This does not necessarily have a big impact on performance, but helps when looking at the generated code (or `code_warntype`, for that matter)."
@dnadlinger (Contributor, Author):

In fact, my simulation is so limited by the sparse-matrix–dense-vector multiplication inside the expmv_fun loop that I hacked together a patch to use the SuiteSparse/CHOLMOD implementation of it for another small speed gain (see my fork). It's really ugly at this point, though, and should be fixed in Base instead.

@marcusps (Owner):

Thanks for the contribution! I'll look at the code a bit more carefully before merging, but it looks interesting.

I also have a branch where I properly implement 1-norm estimation, but I am running into what appear to be numerical instabilities for sufficiently large dimensions and densities; once that is sorted out, it should lead to big performance gains.
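For readers unfamiliar with the technique, here is a rough single-vector sketch of iterative 1-norm estimation in the Higham–Tisseur style (the production algorithm, as in `normest1`, works on blocks of vectors and has proper convergence tests; all names here are invented for illustration):

```julia
using LinearAlgebra

# Simplified sketch: estimate opnorm(A, 1) using only matrix-vector products,
# which is what makes the approach attractive for large sparse matrices.
function onenorm_est(A; maxiter = 5)
    n = size(A, 1)
    x = fill(1.0 / n, n)        # start from the uniform vector, norm(x, 1) == 1
    est = 0.0
    for _ in 1:maxiter
        y = A * x
        est = norm(y, 1)        # current lower bound on opnorm(A, 1)
        z = A' * sign.(y)       # subgradient step toward the maximizing column
        j = argmax(abs.(z))
        x = zeros(n)
        x[j] = 1.0              # probe the most promising unit vector next
    end
    return est
end

A = [1.0 2.0; 3.0 4.0]
onenorm_est(A)                  # matches opnorm(A, 1) == 6.0 for this matrix
```

Because every probe vector has unit 1-norm, the estimate is always a lower bound on the true norm, and in practice it is usually exact or very close after a few iterations.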

@dnadlinger (Contributor, Author):

Going to submit the iamax optimization to Base.

Regarding the 1-norm estimation: This would definitely be useful for me in the future, but since I can cache the select_taylor_degree call for a number of iterations right now, it is currently not super important performance-wise.

@dnadlinger dnadlinger changed the title Optimize using in-place math and BLAS; small cleanups Optimize using in-place math; small cleanups Oct 30, 2015
@dnadlinger (Contributor, Author):

D'oh, forgot that the notion of the absolute value of a complex number in BLAS is abs(re) + abs(im) instead of the mathematical definition. The iamax optimization is thus not valid for complex numbers. :/
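To make the pitfall concrete (my own illustration, not from the thread): BLAS's `izamax` ranks complex entries by |re| + |im|, so it can pick a different entry than the true infinity norm does.

```julia
using LinearAlgebra

# izamax scores complex entries by |re| + |im|, not the modulus sqrt(re^2 + im^2):
z = [3.0 + 4.0im, 6.0 + 0.0im]

norm(z, Inf)            # 6.0 — the moduli are 5.0 and 6.0, so entry 2 wins
BLAS.iamax(z)           # 1   — BLAS scores 3 + 4 = 7 against 6 + 0 = 6
abs(z[BLAS.iamax(z)])   # 5.0, which is NOT the infinity norm
```

For real vectors the two notions coincide, which is why the optimization is valid there but silently wrong for complex inputs.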

```diff
@@ -1,6 +1,6 @@
 export expmv

-function expmv(t, A, b; M = [], prec = "double", shift = false, full_term = false, prnt = false)
+function expmv(t, A, b; M::Array{Float64, 2} = Array(Float64, 0, 0), prec = "double", shift = false, full_term = false, prnt = false)
```
Review comment from a Contributor:
Why not make M an argument that dispatches on its type instead? Then you can use M = nothing vs. M as a matrix, and the compiler will eliminate the extraneous branches when M === nothing.
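A minimal sketch of the design being suggested (all names here are hypothetical, invented for illustration only):

```julia
# Hypothetical sketch of dispatching on M's type instead of branching on an
# empty-array sentinel. `degree_stub` stands in for select_taylor_degree.
degree_stub(A) = fill(1.0, 1, 1)

expmv_sketch(t, A, b; M = nothing) = _expmv_sketch(t, A, b, M)

# M not supplied: compute it, then forward to the matrix method.
_expmv_sketch(t, A, b, ::Nothing) = _expmv_sketch(t, A, b, degree_stub(A))

# M supplied: the compiler specializes on the concrete matrix type, so the
# "was M given?" branch disappears from the generated code entirely.
_expmv_sketch(t, A, b, M::AbstractMatrix) = t .* (A * b)   # placeholder body

expmv_sketch(2.0, [1.0 0.0; 0.0 1.0], [1.0, 2.0])          # uses the Nothing path
```

Both call paths end up in the same concrete method, so the branch cost is paid once at compile time per argument-type combination rather than on every call.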

```julia
tol =
    if prec == "double"
        2.0^(-53)
    elseif prec == "half"
```
Review comment from a Contributor:
While it won't really affect performance, I think these should be symbols instead of strings, to match common Julia style.
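A sketch of the symbol-based style being suggested. Only the 2.0^(-53) double-precision value appears in this PR's diff; the :single and :half tolerances below are my assumption (the standard unit roundoffs), and the function name is invented:

```julia
# Hypothetical sketch: precision selected by a Symbol rather than a String.
tol_for(prec::Symbol) =
    prec === :double ? 2.0^(-53) :
    prec === :single ? 2.0^(-24) :   # assumed, not from the PR
    prec === :half   ? 2.0^(-10) :   # assumed, not from the PR
    throw(ArgumentError("unknown precision: $prec"))

tol_for(:double)   # symbols are interned, so === comparison is a pointer check
```

Symbols also make typos louder: `:doubel` fails the comparison chain and throws, whereas string comparisons tend to get sprinkled across the code.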

Reply from @marcusps (Owner):
I have a working branch that tries to address these style issues (and also implements proper 1-norm estimation). Since so much code is shared, I am still not dispatching on the type, but rather branching on it (instead of having precision specified as an argument, I take it to be implicitly specified by the use of Float64 vs Float32, etc.). It is probably best to move the discussion to that branch's code -- normest1.

@ChrisRackauckas (Contributor):

Is there any reason why this PR stalled?

@marcusps (Owner):

marcusps commented Nov 19, 2016

Lack of time.

@dnadlinger (Contributor, Author):

@ChrisRackauckas: Your points are valid, but changing the API is a whole separate story from the transparent optimisations I did last year.

@marcusps (Owner):

I've updated the benchmarking code, and it is not entirely clear whether the changes you made make much of a difference. That said, the cleanups are probably improvements, so I will look at it a bit more carefully and will likely merge your changes.

@dnadlinger (Contributor, Author):

dnadlinger commented Nov 26, 2016

This was a long time ago, so I'm a bit hazy on the details. IIRC you should see a massive reduction in the number of GC allocations, at least for a precomputed select_taylor_degree.

For context, below is some code I quickly threw together last year while prototyping the physics code I mentioned. (Yes, the code quality is quite horrible, but it was a quick hack to figure out how to translate the physics into reasonably performant Julia before writing the actual thing.) The Qu* functions are thin wrappers around complex arrays; in the end, times contains 2001 points (i.e. 2000 iterations with the same select_taylor_degree result), and l is a 40000×40000 complex matrix with sparsity about 1e-4.

```julia
using ExpmV
using QuBase
using QuDynamics

super_pre_post(pre, post) = kron(transpose(post), pre)

dissipator(a) = super_pre_post(a, a') - 0.5 * super_pre_post(a'a, speye(size(a)...)) - 0.5 * super_pre_post(speye(size(a)...), a'a)

super_hamiltonian(h) = -1im * (super_pre_post(h, speye(size(h)...)) - super_pre_post(speye(size(h)...), h))

function lindblad_op(h, collapse_ops::Vector)
    super = super_hamiltonian(coeffs(h))
    for c_op in collapse_ops
        super += dissipator(coeffs(c_op))
    end
    super
end

function sim(n)
    Δ = 1e-5 * 2π
    ϵ = 0.5 * 1e-2
    κ = 15 * 2π
    k = -0.7κ / (2 * sqrt(3))
    times = 0.0:10:20000

    ψi = complex(statevec(2, FiniteBasis(n)))
    a = lowerop(n)
    h = Δ * a'a + ϵ * (a' + a) # + k * (a' * a' + a * a)
    l = lindblad_op(h, [sqrt(κ) * a])

    ρ = ψi * ψi'
    dims_ρ = size(ρ)
    Type_ρ = QuBase.similar_type(ρ)
    bases_ρ = bases(ρ)
    ρvec = coeffs(vec(full(ρ)))

    nbars = Array(Float64, 0)
    @time precalc = ExpmV.select_taylor_degree(l, 1)
    for (t0, t1) in zip(times[1:end-1], times[2:end])
        ρvec = ExpmV.expmv(t1 - t0, l, ρvec, M = precalc)
        ρ = Type_ρ(reshape(ρvec, dims_ρ), bases_ρ)
        push!(nbars, real(trace(ρ * a'a)))
    end

    return times[2:end], nbars, ρ
end

@time times, nbars, ρfin = sim(200)
```

The run reports:

```
(typeof(l), size(l), nnz(l)) = (SparseMatrixCSC{Complex{Float64},Int64}, (40000,40000), 199000)
```

@marcusps (Owner):

marcusps commented Nov 27, 2016 via email

@dnadlinger (Contributor, Author):

dnadlinger commented Nov 27, 2016

I just had a quick glance over the benchmarks, and it seems like you don't have any for the precomputed case? It's been a year since I looked at the time profiles, but IIRC select_taylor_degree would be quite expensive for the above use case if it wasn't for the 2000 iterations.

@marcusps (Owner):

marcusps commented Nov 27, 2016 via email

@dnadlinger (Contributor, Author):

Ah, sure – just wanted to make sure it was clear enough in which scenario the "several times speedup" remark from my original post applies.
