Optimize using in-place math; small cleanups #6

Status: Open. Wants to merge 5 commits into `master`.
Conversation

@dnadlinger (Contributor):

I was recently using ExpmV.jl to solve a Lindblad master equation for a time-independent system of coupled harmonic oscillators, i.e. a very sparse A that stays constant over many expmv iterations. In this use case, the loop part of expmv_fun was definitely the bottleneck for my whole application, given that I could pre-compute the select_taylor_degree result.

This PR changes the code to explicitly allocate some copies of vectors before the loop and then use in-place operations to avoid excessive GC pressure (according to @time, my program churned through something like 2 TB of GC memory in a couple of minutes). It also replaces the infinity-norm calculation with the iamax BLAS function. All in all, this yields a severalfold performance gain in my use case.
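For illustration, here is a minimal, hypothetical sketch of the allocation-avoiding pattern described above (not the PR's actual code, and written against current Julia; the 2015 PR used that era's equivalents of `mul!`). Work vectors are allocated once before the loop and reused with in-place operations:

```julia
using LinearAlgebra

# Hypothetical illustration: apply a truncated Taylor series of exp(A) to b,
# reusing preallocated buffers inside the hot loop instead of allocating
# fresh vectors (and GC garbage) on every iteration.
function taylor_expmv!(f, A, b, m)
    copyto!(f, b)        # f accumulates the partial sums, starting at A^0 * b
    v = copy(b)          # v holds the current Taylor term A^k * b / k!
    w = similar(b)       # scratch buffer for the matrix-vector product
    for k in 1:m
        mul!(w, A, v)    # w = A*v with no fresh allocation
        v, w = w, v      # swap buffers instead of copying
        v ./= k          # in-place scale: v is now A^k * b / k!
        f .+= v          # accumulate in place
    end
    return f
end

A = [0.0 1.0; -1.0 0.0]
b = [1.0, 0.0]
f = similar(b)
taylor_expmv!(f, A, b, 25)   # f ≈ exp(A) * b = [cos(1), -sin(1)]
```

The buffer swap `v, w = w, v` is the key trick: the two vectors trade roles each iteration, so no per-iteration allocation survives into the next GC cycle.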

There is likely still quite a bit more to optimize here, both in terms of performance and code style. I'm new to Julia, so I'd appreciate any pointers.

From the PR's commit messages:

- "The statistic is currently wrong for us anyway, and not very useful for using the algorithm (as opposed to developing it). If somebody is really interested in this, they can just supply a custom matrix multiplication operator in Julia."
- "This does not necessarily have a big impact on performance, but helps when looking at the generated code (or `code_warntype`, for that matter)."
@dnadlinger (Contributor, Author):

In fact, my simulation is so limited by the sparse-matrix–dense-vector multiplication inside the expmv_fun loop that I hacked together a patch to use the SuiteSparse/CHOLMOD implementation of it for another small speed gain (see my fork). It's really ugly at this point, though, and should be fixed in Base instead.

@marcusps (Owner):

Thanks for the contribution! I'll look at the code a bit more carefully before merging, but it looks interesting.

I also have a branch where I properly implement 1-norm estimation, but I am running into what appear to be numerical instabilities for sufficiently large dimensions and densities; once that is sorted out, it should lead to big performance gains.
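For readers unfamiliar with the technique, here is a rough single-vector sketch of iterative 1-norm estimation in the Higham–Tisseur style (the production algorithm, as in `normest1`, works on blocks of vectors and has proper convergence tests; all names here are invented for illustration):

```julia
using LinearAlgebra

# Simplified sketch: estimate opnorm(A, 1) using only matrix-vector products,
# which is what makes the approach attractive for large sparse matrices.
function onenorm_est(A; maxiter = 5)
    n = size(A, 1)
    x = fill(1.0 / n, n)        # start from the uniform vector, norm(x, 1) == 1
    est = 0.0
    for _ in 1:maxiter
        y = A * x
        est = norm(y, 1)        # current lower bound on opnorm(A, 1)
        z = A' * sign.(y)       # subgradient step toward the maximizing column
        j = argmax(abs.(z))
        x = zeros(n)
        x[j] = 1.0              # probe the most promising unit vector next
    end
    return est
end

A = [1.0 2.0; 3.0 4.0]
onenorm_est(A)                  # matches opnorm(A, 1) == 6.0 for this matrix
```

Because every probe vector has unit 1-norm, the estimate is always a lower bound on the true norm, and in practice it is usually exact or very close after a few iterations.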

@dnadlinger (Contributor, Author):

Going to submit the iamax optimization to Base.

Regarding the 1-norm estimation: This would definitely be useful for me in the future, but since I can cache the select_taylor_degree call for a number of iterations right now, it is currently not super important performance-wise.

@dnadlinger dnadlinger changed the title Optimize using in-place math and BLAS; small cleanups Optimize using in-place math; small cleanups Oct 30, 2015
@dnadlinger (Contributor, Author):

D'oh, forgot that the notion of the absolute value of a complex number in BLAS is abs(re) + abs(im) instead of the mathematical definition. The iamax optimization is thus not valid for complex numbers. :/
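To make the pitfall concrete (my own illustration, not from the thread): BLAS's `izamax` ranks complex entries by |re| + |im|, so it can pick a different entry than the true infinity norm does.

```julia
using LinearAlgebra

# izamax scores complex entries by |re| + |im|, not the modulus sqrt(re^2 + im^2):
z = [3.0 + 4.0im, 6.0 + 0.0im]

norm(z, Inf)            # 6.0 — the moduli are 5.0 and 6.0, so entry 2 wins
BLAS.iamax(z)           # 1   — BLAS scores 3 + 4 = 7 against 6 + 0 = 6
abs(z[BLAS.iamax(z)])   # 5.0, which is NOT the infinity norm
```

For real vectors the two notions coincide, which is why the optimization is valid there but silently wrong for complex inputs.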

```diff
@@ -1,6 +1,6 @@
 export expmv

-function expmv(t, A, b; M = [], prec = "double", shift = false, full_term = false, prnt = false)
+function expmv(t, A, b; M::Array{Float64, 2} = Array(Float64, 0, 0), prec = "double", shift = false, full_term = false, prnt = false)
```
Review comment from a Contributor:
Why not make M an argument that dispatches on its type instead? Then you can use M = nothing vs. M as a matrix, and the compiler will eliminate the extraneous branches when M === nothing.
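A minimal sketch of the design being suggested (all names here are hypothetical, invented for illustration only):

```julia
# Hypothetical sketch of dispatching on M's type instead of branching on an
# empty-array sentinel. `degree_stub` stands in for select_taylor_degree.
degree_stub(A) = fill(1.0, 1, 1)

expmv_sketch(t, A, b; M = nothing) = _expmv_sketch(t, A, b, M)

# M not supplied: compute it, then forward to the matrix method.
_expmv_sketch(t, A, b, ::Nothing) = _expmv_sketch(t, A, b, degree_stub(A))

# M supplied: the compiler specializes on the concrete matrix type, so the
# "was M given?" branch disappears from the generated code entirely.
_expmv_sketch(t, A, b, M::AbstractMatrix) = t .* (A * b)   # placeholder body

expmv_sketch(2.0, [1.0 0.0; 0.0 1.0], [1.0, 2.0])          # uses the Nothing path
```

Both call paths end up in the same concrete method, so the branch cost is paid once at compile time per argument-type combination rather than on every call.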

```julia
tol =
    if prec == "double"
        2.0^(-53)
    elseif prec == "half"
```
Review comment from a Contributor:
While it won't really affect performance, I think these should be symbols instead of strings, to match common Julia style.
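A sketch of the symbol-based style being suggested. Only the 2.0^(-53) double-precision value appears in this PR's diff; the :single and :half tolerances below are my assumption (the standard unit roundoffs), and the function name is invented:

```julia
# Hypothetical sketch: precision selected by a Symbol rather than a String.
tol_for(prec::Symbol) =
    prec === :double ? 2.0^(-53) :
    prec === :single ? 2.0^(-24) :   # assumed, not from the PR
    prec === :half   ? 2.0^(-10) :   # assumed, not from the PR
    throw(ArgumentError("unknown precision: $prec"))

tol_for(:double)   # symbols are interned, so === comparison is a pointer check
```

Symbols also make typos louder: `:doubel` fails the comparison chain and throws, whereas string comparisons tend to get sprinkled across the code.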

Reply from @marcusps (Owner):
I have a working branch that tries to address these style issues (and also implements proper 1-norm estimation). Since so much code is shared, I am still not dispatching on the type, but rather branching on it (instead of having precision specified as an argument, I take it to be implicitly specified by the use of Float64 vs Float32, etc.). It is probably best to move the discussion to that branch's code -- normest1.

@ChrisRackauckas (Contributor):

Is there any reason why this PR stalled?

@marcusps (Owner):

marcusps commented Nov 19, 2016

Lack of time.

@dnadlinger (Contributor, Author):

@ChrisRackauckas: Your points are valid, but changing the API is a whole separate story from the transparent optimisations I did last year.

@marcusps (Owner):

I've updated the benchmarking code, and it is not entirely clear whether the changes you made make much of a difference. That said, the cleanups are probably improvements, so I will look at it a bit more carefully and will likely merge your changes.

@dnadlinger (Contributor, Author):

dnadlinger commented Nov 26, 2016

This was a long time ago, so I'm a bit hazy on the details. IIRC you should see a massive reduction in the number of GC allocations, at least for a precomputed select_taylor_degree.

For context, below is some code I quickly threw together last year while prototyping the physics code I mentioned. (Yes, the code quality is quite horrible, but it was a quick hack to figure out how to translate the physics into reasonably performant Julia before writing the actual thing.) The Qu* functions are thin wrappers around complex arrays; in the end, times contains 2001 points (i.e. 2000 iterations with the same select_taylor_degree result), and l is a 40000×40000 complex matrix with sparsity about 1e-4.

```julia
using ExpmV
using QuBase
using QuDynamics

super_pre_post(pre, post) = kron(transpose(post), pre)

dissipator(a) = super_pre_post(a, a') - 0.5 * super_pre_post(a'a, speye(size(a)...)) - 0.5 * super_pre_post(speye(size(a)...), a'a)

super_hamiltonian(h) = -1im * (super_pre_post(h, speye(size(h)...)) - super_pre_post(speye(size(h)...), h))

function lindblad_op(h, collapse_ops::Vector)
    super = super_hamiltonian(coeffs(h))
    for c_op in collapse_ops
        super += dissipator(coeffs(c_op))
    end
    super
end

function sim(n)
    Δ = 1e-5 * 2π
    ϵ = 0.5 * 1e-2
    κ = 15 * 2π
    k = -0.7κ / (2 * sqrt(3))
    times = 0.0:10:20000

    ψi = complex(statevec(2, FiniteBasis(n)))
    a = lowerop(n)
    h = Δ * a'a + ϵ * (a' + a) # + k * (a' * a' + a * a)
    l = lindblad_op(h, [sqrt(κ) * a])

    ρ = ψi * ψi'
    dims_ρ = size(ρ)
    Type_ρ = QuBase.similar_type(ρ)
    bases_ρ = bases(ρ)
    ρvec = coeffs(vec(full(ρ)))

    nbars = Array(Float64, 0)
    @time precalc = ExpmV.select_taylor_degree(l, 1)
    for (t0, t1) in zip(times[1:end-1], times[2:end])
        ρvec = ExpmV.expmv(t1 - t0, l, ρvec, M = precalc)
        ρ = Type_ρ(reshape(ρvec, dims_ρ), bases_ρ)
        push!(nbars, real(trace(ρ * a'a)))
    end

    return times[2:end], nbars, ρ
end

@time times, nbars, ρfin = sim(200)
```

The run reports:

```
(typeof(l), size(l), nnz(l)) = (SparseMatrixCSC{Complex{Float64},Int64}, (40000,40000), 199000)
```

@marcusps (Owner):

marcusps commented Nov 27, 2016 via email

@dnadlinger (Contributor, Author):

dnadlinger commented Nov 27, 2016

I just had a quick glance over the benchmarks, and it seems like you don't have any for the precomputed case? It's been a year since I looked at the time profiles, but IIRC select_taylor_degree would be quite expensive for the above use case if it wasn't for the 2000 iterations.

@marcusps (Owner):

marcusps commented Nov 27, 2016 via email

@dnadlinger (Contributor, Author):

Ah, sure – just wanted to make sure it was clear enough in which scenario the "several times speedup" remark from my original post applies.
