This repository was archived by the owner on Sep 25, 2022. It is now read-only.
Peter Monks edited this page Jul 15, 2016 · 37 revisions

Q. I gave the Alfresco process a bunch more memory / CPU, but my import didn't speed up. Shouldn't it have gotten a lot faster?

A. No. Bulk imports are usually I/O bound, so adding more CPU or memory capacity when neither is the bottleneck isn't going to help much, if at all. Instead I'd focus on the classical performance tuning process:

  1. Identify the performance objective (so you know when to stop)
  2. Measure the system
  3. Identify the bottleneck
  4. Fix the bottleneck
  5. Measure the system again. If the performance objective isn't met, go to step 3.
  6. ???
  7. PROFIT!!!1

Step #1 is critically important; without it, this process becomes an infinite loop!
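As a rough illustration of why the objective matters, the loop above can be sketched in Python. Everything here is hypothetical: `tune`, `measure`, `fix_bottleneck` and `max_rounds` are names made up for this sketch and are not part of the Bulk Import Tool.

```python
from typing import Callable, Optional


def tune(measure: Callable[[], float],
         objective: float,
         fix_bottleneck: Callable[[], bool],
         max_rounds: int = 10) -> Optional[float]:
    """Repeat measure -> fix until throughput meets the objective.

    Returns the throughput that met the objective, or None if we ran
    out of fixes (or rounds) first.
    """
    for _ in range(max_rounds):
        throughput = measure()        # steps 2 and 5: measure the system
        if throughput >= objective:   # step 1 is what lets the loop end
            return throughput
        if not fix_bottleneck():      # steps 3 and 4: identify and fix
            return None               # nothing left to fix
    return None


# Toy usage: a pretend system whose throughput doubles with each fix.
state = {"rate": 100.0}


def fake_measure() -> float:
    return state["rate"]


def fake_fix() -> bool:
    state["rate"] *= 2
    return True


result = tune(fake_measure, objective=350.0, fix_bottleneck=fake_fix)
```

Without the `objective` check there is no natural exit: the only thing stopping the loop would be the arbitrary `max_rounds` cap.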


Q. Can I run imports on more than one server in an Alfresco cluster?

A. Yes, though it may not accomplish much if your bottleneck is in a shared component (database, contentstore, network, source filesystem - see previous question).

Related Q. I tried to run the Bulk Import Tool on multiple cluster nodes and got a JobLockService exception.

A. You're using the embedded fork, which is a cluster-singleton process (only one instance can run in the cluster at a time). This is one of numerous reasons to avoid the embedded fork.


Q. Why are the instantaneous rates on the graphs so bursty?

A. To avoid double counting (e.g. during a transactional retry), the tool only "counts" the target data when a transaction is committed. This makes the various target counters appear to be a lot more bursty than they actually are. The best solution is to focus on the moving average, since it's a better indicator of overall throughput.
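To make the effect concrete, here is a small Python simulation. It is only a sketch under assumptions: `CommitCounter` and `moving_average` are hypothetical helpers, the commit interval of 5 seconds is invented, and the tool's real moving average may be computed differently.

```python
from collections import deque


class CommitCounter:
    """Counts nodes only when a transaction commits (hypothetical helper)."""

    def __init__(self) -> None:
        self.committed = 0
        self._pending = 0

    def record(self, nodes: int) -> None:
        self._pending += nodes      # work done, but not yet visible

    def rollback(self) -> None:
        self._pending = 0           # a retry must not be double counted

    def commit(self) -> None:
        self.committed += self._pending
        self._pending = 0


def moving_average(samples, window):
    """Simple moving average over the last `window` samples."""
    recent = deque(maxlen=window)
    averages = []
    for sample in samples:
        recent.append(sample)
        averages.append(sum(recent) / len(recent))
    return averages


# A worker writes a steady 10 nodes per second but commits every 5 seconds.
counter = CommitCounter()
instantaneous = []
previous = 0
for second in range(1, 11):
    counter.record(10)
    if second % 5 == 0:
        counter.commit()
    instantaneous.append(counter.committed - previous)
    previous = counter.committed

smoothed = moving_average(instantaneous, window=5)
```

The underlying work is perfectly steady, yet `instantaneous` shows zeros punctuated by bursts of 50, while `smoothed` settles on the true rate of 10 nodes per second once the window has filled.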


Q. After a little while I'm seeing long periods of zero instantaneous activity, followed by a solitary large burst. What's going on?

A. This is partly related to the previous question, and is something I've observed in my test environment too. While I'm not 100% sure I know the answer, what I think is happening is that transaction commits across the various worker threads end up falling into alignment. Initially I figured it was just because I was starting all of the worker threads at the same time, but after adding in staggered startup logic what I saw was that the "coherence pattern" would eventually re-emerge anyway. It's possible this is specific to the database I'm testing on (MySQL 5.6.25) but regardless, I'd be very keen to hear from a database expert who might be able to explain the observed behaviour in more detail.


Q. At the start of an import, I see a high "nodes imported per second" reading, but "bytes imported per second" is stuck on zero. What's happening?

A. The tool imports the entire directory structure first, before importing any files. Directories count as nodes in the repository, but are (obviously) empty - they contain no data.
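The following Python sketch mimics that ordering on a throwaway directory tree. It is an illustration only: `plan_import` and `simulate` are hypothetical names, and the real tool's scanning and counting logic is not shown in this FAQ.

```python
import os
import tempfile


def plan_import(source_root):
    """Return (directories, files) so every directory can be imported first."""
    directories, files = [], []
    for current, subdirs, filenames in os.walk(source_root):
        subdirs.sort()  # make the traversal order deterministic
        for name in subdirs:
            directories.append(os.path.join(current, name))
        for name in sorted(filenames):
            files.append(os.path.join(current, name))
    return directories, files


def simulate(source_root):
    """Return (nodes_so_far, bytes_so_far) after each imported item."""
    directories, files = plan_import(source_root)
    nodes = total_bytes = 0
    progress = []
    for _ in directories:            # directories are nodes with no content
        nodes += 1
        progress.append((nodes, total_bytes))
    for path in files:               # files add both a node and some bytes
        nodes += 1
        total_bytes += os.path.getsize(path)
        progress.append((nodes, total_bytes))
    return progress


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    with open(os.path.join(root, "a", "b", "doc.txt"), "wb") as handle:
        handle.write(b"hello")
    progress = simulate(root)
```

Here `progress` shows the node count climbing while the byte count stays at zero until the first file is reached, which is the same shape you see on the graphs at the start of an import.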


Q. At the start of an import, I see "Threads: 0 active of 0 total", but the import seems to be progressing. Why is this?

A. The tool imports the directory structure and the first couple of batches of content on a single thread:

  1. the directories because the batches may have dependencies (and multi-threaded importing only approximately imports in on-disk order, so multi-threading would introduce the risk of out-of-order imports)
  2. the first couple of batches as a performance optimisation for very small imports (for small imports the cost of spinning up the multi-threaded import machinery outweighs the benefits).

During this single-threaded phase the worker threads haven't been created yet, and so the tool reports that zero threads are active (it's reporting on the size of the thread pool). Arguably it should report that 1 thread is active, even though that thread is not part of the thread pool - feel free to raise an issue if you think this is problematic.
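The two phases can be sketched in Python as follows. This is not the tool's implementation: `run_import`, `single_threaded_batches` and `pool_size` are hypothetical names, and the status value recorded here is simply the configured pool size, standing in for whatever the tool reads from its thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def run_import(batches, import_batch, single_threaded_batches=2, pool_size=4):
    """Import the first few batches inline, and the rest on a worker pool.

    Returns (results, status), where status records the pool size that a
    pool-based status line would have reported during each phase.
    """
    results, status = [], []
    head = batches[:single_threaded_batches]
    tail = batches[single_threaded_batches:]

    # Phase 1: the calling thread does the work. No pool exists yet, so a
    # status line that reports on the pool sees zero threads.
    for batch in head:
        results.append(import_batch(batch))
        status.append(0)

    # Phase 2: only larger imports pay the cost of creating the pool.
    if tail:
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            results.extend(pool.map(import_batch, tail))  # keeps input order
            status.append(pool_size)
    return results, status


# Toy usage: "importing" a batch just returns how many items it held.
results, status = run_import([[1, 2], [3], [4, 5, 6], [7]], len)
small_results, small_status = run_import([[1]], len)
```

For the very small import, the pool is never created at all, so the reported thread count stays at zero for the whole run even though all of the work was completed.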


Q. What does batch "weight" mean?

A. Nothing. "Weight" is a unitless value that's used only for comparing the approximate size of each imported node while constructing batches. It's intended to be proportional to the amount of work the database will have to do while importing that node, but the value itself is meaningless: it's not "number of nodes" or "number of database rows" or anything like that.
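One way to picture how a weight like this might be used is the Python sketch below. It is an assumption-laden illustration: `build_batches`, `max_batch_weight` and the example weights are all invented here, and the tool's actual weighting formula and batching rules are not described in this FAQ.

```python
def build_batches(items, weight_of, max_batch_weight):
    """Group items, in order, into batches capped by total weight.

    An item that is heavier than the cap on its own still gets a batch
    to itself, so nothing is ever dropped.
    """
    batches, current, current_weight = [], [], 0
    for item in items:
        weight = weight_of(item)
        # Close the current batch if this item would push it over the cap.
        if current and current_weight + weight > max_batch_weight:
            batches.append(current)
            current, current_weight = [], 0
        current.append(item)
        current_weight += weight
    if current:
        batches.append(current)
    return batches


# Hypothetical (name, weight) pairs; the numbers have no unit.
items = [("a", 3), ("b", 4), ("c", 8), ("d", 1), ("e", 2)]
batches = build_batches(items, weight_of=lambda item: item[1],
                        max_batch_weight=8)
batch_names = [[name for name, _ in batch] for batch in batches]
```

Only the relative sizes matter: doubling every weight and the cap together would produce exactly the same batches, which is why the number itself carries no meaning.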


