Issues
Issues with using ChaNGa and how to address them are listed here.
If you have issues not addressed here, please report them as an issue on the ChaNGa GitHub page.
There is an issue with GCC 6.1.X and 6.2.X and Charm++, evidently an over-optimization that results in a crash immediately after reading in the particles. To work around this, either use an earlier compiler version, or add -fno-lifetime-dse
to the charm build command. See https://charm.cs.illinois.edu/redmine/issues/1045 for more details.
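For example, a Charm++ build line with the flag appended might look like the following; the build target and network layer here are illustrative (use whatever you normally build with), and the extra flag is simply passed through to the compiler:
./build ChaNGa netlrts-linux-x86_64 --with-production -fno-lifetime-dse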
For the "net" builds of charm++/ChaNGa, the common problem is starting ChaNGa on multiple nodes of your compute cluster. For MPI and other builds, this is taken care of by the cluster infrastructure, but for net builds, you are directly facing this problem.
"charmrun", which gets built when you "make" ChaNGa, is the program that handles this. If your cluster does have MPI installed, the easiest way to start things up is with
charmrun +p<procs> ++mpiexec ChaNGa cosmo.param
However, if your "mpiexec" is not the way you start an MPI program on your cluster, then you may need to write a wrapper. E.g., for the TACC clusters (Stampede and Lonestar) a wrapper would contain:
#!/bin/csh
shift; shift; exec ibrun $*

and you would call it with:
charmrun +p<procs> ++mpiexec ++remote-shell mympiexec ChaNGa cosmo.param
If MPI is not available, then charmrun will look at a nodelist file which has the format:
group main
host node1
host node2

In order for this to work, you need to be able to ssh into those nodes without a password. If your cluster is not set up to enable this by default, set up passwordless login using public keys. If you have interactive access to the compute nodes (e.g. with
qsub -I
) then a quick way to test this within the interactive session is to execute the command
ssh node1 $PWD/ChaNGa
If ChaNGa starts and gives a help message, then things are set up correctly. Otherwise the error message can help you diagnose the problem. Potential problems include: host keys not installed, user public keys not installed, and shared libraries not accessible.
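With the nodelist file in place and passwordless ssh working, a run can then be launched with something like the following; the file name and path are illustrative, and charmrun will also look for a default nodelist file if none is specified:
charmrun +p<procs> ++nodelist ./nodelist ChaNGa cosmo.param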
Some messages are extraneous. One example is:
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
You do not need to add +isomalloc_sync to your command line; ChaNGa handles thread migration in another way.
There are many sanity checks within the code using the assert() call. Here are some common ones with explanations of what has gone wrong.
Assertion "bInBox" failed in file TreePiece.cpp line 622
This happens when running with periodic boundary conditions and a particle is WAY outside the fiducial box. This is an indication of bad initial conditions or "superluminal" velocities.
------------- Processor 0 Exiting: Called CmiAbort ------------ Reason: SFC Domain decomposition has not converged
Here domain decomposition has failed to divide the particles evenly among the domains to within a reasonable tolerance. This could be due to a pathological particle distribution, such as having all particles on top of each other. One solution is to loosen the tolerance by increasing the "ddTolerance" constant in ParallelGravity.h and recompile. If the above message is also accompanied with many messages like:
Truncated tree with 17 particle bucket
Truncated tree with 26 particle bucket
then larger sorting keys may be needed. Try running configure with "--enable-bigkeys", and recompiling.
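A minimal rebuild sequence might look like the following, run in the ChaNGa build directory; the clean step is a precaution to ensure everything is rebuilt with the larger keys:
./configure --enable-bigkeys
make clean
make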
------------- Processor 0 Exiting: Called CmiAbort ------------ Reason: [CkIO] llapi_file_get_stripe error
This is a recent (2019) error on Pleiades with a newer implementation of the Lustre file system. The "stripe" refers to how a file is split across many disks for high I/O performance. The workaround (until the Charm interface to Lustre catches up with the newer Lustre API) is to explicitly set the striping on the directory in which the snapshots are being written. An example command is
lfs setstripe -S 1048576 -c 4 .
The final "." refers to the current directory, so this command should be run in the directory in which the snapshots are written. Update, March 1, 2019: this problem is now appearing on Blue Waters and Stampede2. The same workaround is applicable on these systems. Meanwhile, the charm development team is working on a true fix.
------------- Processor 0 Exiting: Called CmiAbort ------------ Reason: starlog file format mismatch
The starlog file starts with a number that is the size of each starlog event. ChaNGa checks this number against what it thinks the starlog event size is and issues this complaint if they don't match. The two obvious reasons for a mismatch are: 1) the starlog file is corrupt or 2) ChaNGa has been recompiled with a different configuration (e.g. H2 cooling vs no H2 cooling) in the middle of a run.
In either case the quickest way to get going again is to move the starlog file out of the way, and restart from an output.
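For example, assuming the run's output prefix is "simname" (the actual file name follows your simulation's output name):
# Move the old starlog aside so a fresh one is started on restart
mv simname.starlog simname.starlog.old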
Memory use can be an issue in large simulations. One of the current big uses of memory in ChaNGa is the caching of off-processor data. This can be lowered by decreasing the depth of the cache "lines" with "-d" or "nCacheDepth". The default is 4, and the size of a line scales as 2^d. Higher values mean more remote data is fetched at once, reducing latency costs at the price of higher memory use.
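For example, either of the following would lower the cache depth from the default of 4 to 2; the command-line flag and parameter name are the ones mentioned above:
# On the command line:
charmrun +p<procs> ChaNGa -d 2 cosmo.param
# Or in the parameter file (cosmo.param):
nCacheDepth = 2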
Deadlocks are hard to track down. One common deadlock occurs when a process gets held up in a lock within malloc() or free(). This will happen if you link with "-memory os" instead of using the charm++ default memory allocator and the OS malloc is not thread safe.
The CUDA implementation is still experimental.
Fatal CUDA Error all CUDA-capable devices are busy or unavailable at cuda-hybrid-api.cu:571.
This means either 1) there are no GPUs on the host, or 2) more than one process is trying to access the GPU. In scenario 2, more than one ChaNGa process on the host is competing for the GPU; either run in SMP mode with only one process per GPU host, or use the CUDA Multi-Process Service (MPS) to handle this situation. For Cray machines, setting the environment variable CRAY_CUDA_MPS=1
enables this. However, many compute clusters do not support this.
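Where MPS is supported, the setting just needs to be in place before ChaNGa launches. A minimal sketch for a bash-style Cray job script (use setenv in csh scripts; the launch command depends on your build and scheduler):
# Enable the CUDA Multi-Process Service on Cray systems
export CRAY_CUDA_MPS=1
# ...then launch ChaNGa as usual, e.g. with charmrun or aprun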