Optimize using in-place math; small cleanups #6
base: master
Conversation
The statistic is currently wrong for us anyway, and not very useful for using the algorithm (as opposed to developing it). If somebody is really interested in this, they can just supply a custom matrix multiplication operator in Julia.
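For illustration, such a custom matrix multiplication operator could be a small counting wrapper along these lines. This is purely a sketch: the type and its methods are made up, and extra methods (norms, opnorm, etc.) may be needed depending on what expmv actually calls on its matrix argument.
# Purely illustrative sketch (not part of ExpmV): a wrapper that counts
# matrix-vector products so the operation-count statistic can be recovered
# outside the algorithm itself.
mutable struct CountingOperator{TA}
    A::TA
    nmul::Int
end
CountingOperator(A) = CountingOperator(A, 0)

Base.size(C::CountingOperator, dims...) = size(C.A, dims...)
Base.eltype(C::CountingOperator) = eltype(C.A)

function Base.:*(C::CountingOperator, x::AbstractVecOrMat)
    C.nmul += 1                 # count every product applied through the wrapper
    return C.A * x
end

# Usage: wrap the matrix, run the algorithm, then inspect C.nmul.
C = CountingOperator(rand(3, 3))
C * rand(3); C * rand(3)
C.nmul                          # == 2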
This does not necessarily have a big impact on performance, but helps when looking at the generated code (or code_warntype, for that matter).
In fact, my simulation is so much limited by the sparse-matrix-dense-vector multiplication inside the …
Thanks for the contribution! I'll look at the code a bit more carefully before merging, but it looks interesting. I also have a branch where I properly implement 1-norm estimation; I am currently running into what appear to be numerical instabilities for sufficiently large dimensions and densities, but once that is sorted out it should lead to big performance gains.
Regarding the 1-norm estimation: This would definitely be useful for me in the future, but since I can cache the …
D'oh, forgot that the notion of the absolute value of a complex number in BLAS is …
@@ -1,6 +1,6 @@
 export expmv

-function expmv(t, A, b; M = [], prec = "double", shift = false, full_term = false, prnt = false)
+function expmv(t, A, b; M::Array{Float64, 2} = Array(Float64, 0, 0), prec = "double", shift = false, full_term = false, prnt = false)
Why not make M an optional argument instead, so that it dispatches? Then you can use M = nothing vs. M as a vector, and the extraneous branches will be compiled away when M == nothing.
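A minimal sketch of what that suggestion might look like (the function and placeholder below are illustrative, not the actual ExpmV signatures):
# Illustrative only: with `nothing` as the default, Julia specialises the
# method body on the type of M, so the `M === nothing` branch is resolved at
# compile time instead of being checked against an empty array at run time.
function expmv_sketch(t, A, b; M = nothing)
    taylor_params = M === nothing ? ones(1, 1) : M   # placeholder for a select_taylor_degree-style precomputation
    # ... the real algorithm would use taylor_params in its main loop ...
    return t .* (A * b), taylor_params
end

A = [1.0 0.0; 0.0 2.0]; b = [1.0, 1.0]
expmv_sketch(0.1, A, b)                     # computes the parameters itself
expmv_sketch(0.1, A, b; M = ones(1, 1))     # uses a precomputed M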
if prec == "double" | ||
2.0^(-53) | ||
tol = | ||
if prec == "half" |
While it won't affect performance really, I think to match common Julia syntax these should be symbols instead of strings.
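For illustration, a symbol-based version of the tolerance selection might look roughly like this (the helper name is made up; 2.0^-53 matches the "double" value in the diff, and the single/half values are the corresponding unit roundoffs):
# Illustrative sketch: pick the truncation tolerance from a Symbol rather
# than comparing strings.
function tolerance(prec::Symbol)
    prec === :double && return 2.0^-53
    prec === :single && return 2.0^-24
    prec === :half   && return 2.0^-10
    throw(ArgumentError("unknown precision: $prec"))
end

tolerance(:double)   # 1.1102230246251565e-16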
I have a working branch that tries to address these style issues (and also implements proper 1-norm estimation). Since so much code is shared, I am still not dispatching on the type, but rather branching on it (instead of having precision specified as an argument, I take it to be implicitly specified by the use of Float64 vs Float32, etc.). It is probably best to move the discussion to that branch's code -- normest1.
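As an illustration of the "precision implied by the element type" idea (this is not the branch's actual code, just a sketch):
# Illustrative sketch: derive the truncation tolerance from the element type
# of the input instead of taking a prec keyword argument.
default_tol(x::AbstractVecOrMat) = eps(real(float(eltype(x)))) / 2

default_tol(rand(Float64, 4))            # 2.0^-53
default_tol(rand(Float32, 4))            # 2.0f0^-24
default_tol(rand(Complex{Float64}, 4))   # same as Float64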
Is there any reason why this PR stalled?
Lack of time.
@ChrisRackauckas: Your points are valid, but changing the API is a whole separate story from the transparent optimisations I did last year.
I've updated the benchmarking code, and it is not entirely clear whether the changes you made make much of a difference. That being said, the cleanups are probably improvements, so I will look at them a bit more carefully and will likely merge your changes.
This has been a long time ago, so I'm a bit hazy on the details. IIRC you should see a massive reduction in the number of GC allocations, at least for precomputed select_taylor_degree.
For context, below is some code I quickly threw together last year while prototyping the physics code I mentioned. (Yes, the code quality is quite horrible, but it was a quick hack trying to figure out how to translate the physics into reasonably performant Julia before writing the actual thing.) The Qu* stuff is just a thin wrapper around complex Arrays; in the end, times is 2001 (i.e. 2000 iterations with the same select_taylor_degree), and l is a 40000x40000 complex matrix.
using ExpmV
using QuBase
using QuDynamics
super_pre_post(pre, post) = kron(transpose(post), pre)
dissipator(a) = super_pre_post(a, a') - 0.5 * super_pre_post(a'a, speye(size(a)...)) - 0.5 * super_pre_post(speye(size(a)...), a'a)
super_hamiltonian(h) = -1im * (super_pre_post(h, speye(size(h)...)) - super_pre_post(speye(size(h)...), h))
function lindblad_op(h, collapse_ops::Vector)
super = super_hamiltonian(coeffs(h))
for c_op in collapse_ops
super += dissipator(coeffs(c_op))
end
super
end
function sim(n)
Δ = 1e-5 * 2π
ϵ = 0.5 * 1e-2 #2π
κ = 15 * 2π
k = -0.7κ/(2 * sqrt(3))
times = 0.0:10:20000
ψi = complex(statevec(2, FiniteBasis(n)))
a = lowerop(n)
h = Δ * a'a + ϵ * (a' + a) # + k * (a' * a' + a * a)
l = lindblad_op(h, [sqrt(κ) * a])
ρ = ψi * ψi'
dims_ρ = size(ρ)
Type_ρ = QuBase.similar_type(ρ)
bases_ρ = bases(ρ)
ρvec = coeffs(vec(full(ρ)))
nbars = Array(Float64, 0)
@time precalc = ExpmV.select_taylor_degree(l, 1)
for (t0, t1) in zip(times[1 : end-1], times[2:end])
ρvec = ExpmV.expmv(t1 - t0, l, ρvec, M=precalc)
ρ = Type_ρ(reshape(ρvec, dims_ρ), bases_ρ)
push!(nbars, real(trace(ρ * a'a)))
end
return times[2:end], nbars, ρ
end
@time times, nbars, ρfin = sim(200)
Thanks, that helps quite a bit. Under the new benchmarking I can see your pull request reducing median memory usage by ~50% and median runtime by ~10%, so it does seem to lead to significant improvements.
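For reference, a comparison like that could be taken with BenchmarkTools along these lines (everything here is illustrative, not the repository's actual benchmark suite; sizes and density are made up):
# Illustrative only: measure median runtime and allocated memory of a single
# expmv call, to be compared before and after the patch.
using BenchmarkTools, SparseArrays
using ExpmV

A = sprandn(1000, 1000, 0.01)
b = randn(1000)

trial = @benchmark ExpmV.expmv(1.0, $A, $b)
median(trial).time   # median runtime in nanoseconds
trial.memory         # bytes allocated per evaluation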
I just had a quick glance over the benchmarks, and it seems like you don't have any for the precomputed case? It's been a year since I looked at the time profiles, but IIRC select_taylor_degree would be quite expensive for the above use case if it weren't for the 2000 iterations.
That's right, I haven't gotten around to the precomputed case yet, but there is already some gain.
Ah, sure – just wanted to make sure it was clear enough in which scenario the "several times speedup" remark from my original post applies.
I was recently using ExpmV.jl for solving a Lindblad master equation for a time-independent system of coupled harmonic oscillators, i.e. a very sparse A that is constant over many expmv iterations. In this use case, the loop part of expmv_fun was definitely the bottleneck for my whole application, given that I could pre-compute the select_taylor_degree result.
This PR changes the code to explicitly allocate some copies of vectors before the loop and then use in-place operations to avoid excessive GC usage (according to @time, my program churned through something like 2 TB of GC memory in a couple of minutes). It also replaces the calculation of the infinity norm by the iamax BLAS function. All in all, this leads to a several-fold performance gain in my use case.
There is likely still quite a bit more to optimize here, both in terms of performance and code style. I'm new to Julia, so I'd appreciate any pointers.
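To make the two ideas concrete, here is a rough, self-contained sketch of the kind of change described above (buffer and function names are made up; this is not the actual patch):
# Illustrative sketch: pre-allocated work vectors updated with in-place
# operations, plus an infinity norm obtained via BLAS iamax instead of
# allocating abs.(x).
using LinearAlgebra

function taylor_apply!(f, p, tmp, A, b, t, m)
    # f, p, tmp are work vectors allocated once, outside any outer loop.
    copyto!(f, b)
    copyto!(p, b)
    for j in 1:m
        mul!(tmp, A, p)      # tmp = A * p without allocating a new vector
        p, tmp = tmp, p      # swap buffer bindings instead of copying
        rmul!(p, t / j)      # p .*= t / j, in place
        f .+= p              # accumulate the j-th Taylor term in place
    end
    return f                 # f ≈ exp(t * A) * b for large enough m
end

# For real vectors, iamax gives the index of the largest-magnitude entry.
# (Note: for complex vectors BLAS uses |re| + |im|, so this only approximates
# the true infinity norm there.)
inf_norm(x::Vector{Float64}) = abs(x[LinearAlgebra.BLAS.iamax(x)])

# Usage (sizes made up): buffers are reused across repeated calls.
n = 4; A = randn(n, n); b = randn(n)
f, p, tmp = similar(b), similar(b), similar(b)
taylor_apply!(f, p, tmp, A, b, 0.1, 30)
inf_norm(f)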