
Issues with using ChaNGa and how to address them are listed here.


Compiler Problems

There is an issue with GCC 6.1.X and 6.2.X and Charm++, evidently an over-aggressive optimization that results in a crash immediately after reading in the particles. To work around this, either use an earlier compiler version, or add -fno-lifetime-dse to the charm build command. See https://charm.cs.illinois.edu/redmine/issues/1045 for more details.
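As an illustration, a charm build command with the workaround appended might look like the following. This is only a sketch: the netlrts-linux-x86_64 target and --with-production option are assumptions, so substitute the architecture and options you normally use.

# extra flags appended to the charm build line are passed through to the compiler
./build ChaNGa netlrts-linux-x86_64 --with-production -fno-lifetime-dse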

Problems on Startup: understanding charmrun

For the "net" builds of charm++/ChaNGa, the common problem is starting ChaNGa on multiple nodes of your compute cluster. For MPI and other builds, this is taken care of by the cluster infrastructure, but for net builds, you are directly facing this problem.

"charmrun", which gets built when you "make" ChaNGa, is the program that handles this. If your cluster does have MPI installed, the easiest way to start things up is with

charmrun +p<procs> ++mpiexec ChaNGa cosmo.param

However, if your "mpiexec" is not the way you start an MPI program on your cluster, then you may need to write a wrapper. E.g. for the TACC clusters (stampede and lonestar) a wrapper would contain:

#!/bin/csh
# discard the first two arguments charmrun passes, then launch the rest via ibrun
shift; shift; exec ibrun $*

and you would call it with:

charmrun +p<procs> ++mpiexec ++remote-shell mympiexec ChaNGa cosmo.param
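Putting this together, a batch job on a TACC-style system might invoke charmrun as shown below. This is a rough sketch, not a verified script: the Slurm directives, node and process counts are placeholders, and the wrapper is assumed to be the mympiexec script described above.

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 32
#SBATCH -t 01:00:00
# launch ChaNGa across the allocated nodes through the mympiexec wrapper
./charmrun +p32 ++mpiexec ++remote-shell mympiexec ./ChaNGa cosmo.param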

If MPI is not available, then charmrun will look at a nodelist file which has the format:

group main
  host node1
  host node2

In order for this to work, you need to be able to ssh into those nodes without a password. If your cluster is not set up to enable this by default, set up passwordless login using public keys. If you have interactive access to the compute nodes (e.g. with qsub -I), a quick way to test this within the interactive session is to execute the command ssh node1 $PWD/ChaNGa. If ChaNGa starts and gives a help message, then things are set up correctly. Otherwise the error message can help you diagnose the problem. Potential problems include: host keys not installed, user public keys not installed, and shared libraries not accessible.
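If your cluster runs under a batch scheduler, the nodelist can be generated from the hosts assigned to the job. Here is a minimal sketch, assuming PBS and its $PBS_NODEFILE host file (Slurm users would derive the host list differently) and using charmrun's ++nodelist option:

#!/bin/sh
# build a charmrun nodelist from the hosts allocated to this job
echo "group main" > nodelist
for host in `sort -u $PBS_NODEFILE`; do
  echo "  host $host" >> nodelist
done
./charmrun +p8 ++nodelist nodelist ./ChaNGa cosmo.param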

Runtime Asserts

There are many sanity checks within the code using the assert() call. Here are some common ones with explanations of what has gone wrong.

   <code>Assertion "bInBox" failed in file TreePiece.cpp line 622</code>

This happens when running with periodic boundary conditions and a particle is WAY outside the fiducial box. This is an indication of bad initial conditions or "superluminal" velocities.

   <code>Assertion "numIterations < 1000" failed in file Sorter.cpp line 806.</code>

Here domain decomposition has failed to divide the particles evenly among the domains to within a reasonable tolerance. This could be due to a pathological particle distribution, such as having all particles on top of each other. One solution is to loosen the tolerance by increasing the "ddTolerance" constant in ParallelGravity.h and recompile.
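If you do need to loosen the tolerance, the change is a one-line edit followed by a rebuild. A rough sketch of that cycle is shown below; the exact definition and default value of ddTolerance in ParallelGravity.h may differ in your version of the source.

# find the constant, raise its value by hand, then rebuild ChaNGa
grep -n ddTolerance ParallelGravity.h
$EDITOR ParallelGravity.h   # increase the ddTolerance value
make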

Out of Memory

Memory use can be an issue in large simulations. One of the biggest current uses of memory in ChaNGa is the caching of off-processor data. This can be lowered by decreasing the depth of the cache "lines" with "-d" or "nCacheDepth". The default is 4, and the size of a line scales as 2^d. Higher values mean more remote data is fetched at once, reducing latency costs at the price of higher memory use.
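For example, to drop the cache depth from the default of 4 to 2, pass the option on the command line or set it in the parameter file. This is a sketch only; the process count and launch syntax are placeholders from the examples above.

# fetch smaller cache lines (line size scales as 2^d): d = 2 instead of the default 4
./charmrun +p8 ./ChaNGa -d 2 cosmo.param
# equivalently, in the .param file:  nCacheDepth = 2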

Deadlocks

Deadlocks are hard to track down. One common deadlock is that a process gets held up in a lock within malloc() or free(). This will happen if you link with "-memory os" instead of using the charm++ default memory allocator and the OS malloc is not thread safe.
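A quick, if crude, way to check whether your build picked up the "-memory os" flag mentioned above is simply to search the build files for it; if it appears, rebuild without it so that charm++'s default, thread-safe allocator is used.

# search the build configuration for the "-memory os" link flag
grep -n -e '-memory os' Makefile*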

CUDA Specific Issues

The CUDA implementation is still experimental. A common error is:

   Fatal CUDA Error all CUDA-capable devices are busy or unavailable at cuda-hybrid-api.cu:571

This means either 1) there are no GPUs on the host, or 2) more than one process is trying to access the GPU. For scenario 2, you might have more than one ChaNGa process on the host competing for the GPU. Either run in SMP mode with only one process per GPU host, or use the CUDA Multi-Process Service (MPS) to handle this situation. On Cray machines, setting the environment variable CRAY_CUDA_MPS=1 enables this. However, many compute clusters do not support this.
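As an illustration, an SMP-style launch with a single ChaNGa process per GPU node might look like the following. The process and thread counts are placeholders, and ++ppn (worker threads per process) requires an SMP build of charm++.

# one SMP process per GPU node, many worker threads per process
./charmrun +p30 ++ppn 15 ++mpiexec ./ChaNGa cosmo.param
# on Cray systems, CUDA MPS can be enabled instead:
export CRAY_CUDA_MPS=1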