From 8d623a9509001f14647876a5f583bad06b97b448 Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Sat, 23 Mar 2024 12:04:29 +0000
Subject: [PATCH 1/8] Add a page on parallel and async libraries.

Resolves #178.
---
 docs/pages/parallel-async.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
 create mode 100644 docs/pages/parallel-async.md

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
new file mode 100644
index 00000000..bd3ee4d6
--- /dev/null
+++ b/docs/pages/parallel-async.md
@@ -0,0 +1,35 @@
+---
+title: Parallel and asynchronous processing
+layout: default
+---
+
+Python has a good ecosystem of libraries for multiprocessing (threads and GPU
+parallelisation), as well as asynchronous processing. Here, we list those that
+we've found to be useful, particularly for research applications and previous
+ARC projects.
+
+🟠 tools in the following may be preferred over 🟢, if there are external
+reasons to use a specific interface or parallelisation scheme. Possibly due to
+the nature of the research problem, the high-performance computing resources
+available or simply due to pre-existing code using a library like [pandas].
+
+| Name | Short description | 🚦 |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-: |
+| [multiprocess] | A fork of [multiprocessing] which uses `dill` instead of `pickle` to allow serializing wider range of object types including nested / anonymous functions. We've found this rather more simple to work with. | 🟢 |
+| [multiprocessing] | The standard library module for distributing tasks across multiple processes | 🟠 |
+| [Cython] | Has [support for OpenMP based parallelism](https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html) | 🟠 |
+| [mpi4py] | support for MPI based parallelism | 🟠 |
+| [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
+| [numba] | [Support for parallelism via `jit(parallel=True)`](https://numba.pydata.org/numba-doc/latest/user/parallel.html). | 🟠 |
+| [jax] | [Support for parallelising NumPy / scientific computing like operations using functional transforms](https://jax.readthedocs.io/en/latest/jax-101/06-parallelism.html). | 🟠 |
+
+
+
+[multiprocess]: https://multiprocess.readthedocs.io/en/latest/
+[multiprocessing]: https://docs.python.org/3/library/multiprocessing.html
+[Cython]: https://cython.readthedocs.io/
+[mpi4py]: https://mpi4py.readthedocs.io/
+[pandas]: https://pandas.pydata.org/
+[dask]: https://docs.dask.org/
+[numba]: https://numba.pydata.org/
+[jax]: https://jax.readthedocs.io/

From b81e460f04ae58f8448afd95db3fd41c311b5adb Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 10:52:09 +0000
Subject: [PATCH 2/8] First easy suggestions from code review.

Co-authored-by: David Stansby
---
 docs/pages/parallel-async.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index bd3ee4d6..cb6fd59d 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -4,9 +4,7 @@ layout: default
 ---

 Python has a good ecosystem of libraries for multiprocessing (threads and GPU
-parallelisation), as well as asynchronous processing. Here, we list those that
-we've found to be useful, particularly for research applications and previous
-ARC projects.
+parallelisation), as well as asynchronous processing.
 🟠 tools in the following may be preferred over 🟢, if there are external
 reasons to use a specific interface or parallelisation scheme. Possibly due to

From f5c1bb41170e7aba2ec47ad37e0fa03dddf37965 Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 11:08:31 +0000
Subject: [PATCH 3/8] =?UTF-8?q?Missing=20title=20=F0=9F=98=B1?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/pages/parallel-async.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index cb6fd59d..faeb53f9 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -3,6 +3,8 @@ title: Parallel and asynchronous processing
 layout: default
 ---

+# Parallel and asynchronous processing
+
 Python has a good ecosystem of libraries for multiprocessing (threads and GPU
 parallelisation), as well as asynchronous processing.

From c7650c7c6616384128bf89c25697b717700d1cfb Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 11:10:27 +0000
Subject: [PATCH 4/8] Split into thread/proc vs compiler parallel.

Co-authored-by: David Stansby
---
 docs/pages/parallel-async.md | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index faeb53f9..8cd99939 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -13,15 +13,22 @@ reasons to use a specific interface or parallelisation scheme. Possibly due to
 the nature of the research problem, the high-performance computing resources
 available or simply due to pre-existing code using a library like [pandas].
+## Thread- and process-based parallelism
+
 | Name | Short description | 🚦 |
 | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-: |
 | [multiprocess] | A fork of [multiprocessing] which uses `dill` instead of `pickle` to allow serializing wider range of object types including nested / anonymous functions. We've found this rather more simple to work with. | 🟢 |
+| [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
 | [multiprocessing] | The standard library module for distributing tasks across multiple processes | 🟠 |
-| [Cython] | Has [support for OpenMP based parallelism](https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html) | 🟠 |
 | [mpi4py] | support for MPI based parallelism | 🟠 |
-| [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
-| [numba] | [Support for parallelism via `jit(parallel=True)`](https://numba.pydata.org/numba-doc/latest/user/parallel.html). | 🟠 |
-| [jax] | [Support for parallelising NumPy / scientific computing like operations using functional transforms](https://jax.readthedocs.io/en/latest/jax-101/06-parallelism.html). | 🟠 |
+
+## Compiler-based parallelism
+
+| Name | Short description | 🚦 |
+| -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-: |
+| [Cython] | Has [support for OpenMP based parallelism](https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html) | 🟠 |
+| [numba] | [Support for parallelism via `jit(parallel=True)`](https://numba.pydata.org/numba-doc/latest/user/parallel.html). | 🟠 |
+| [jax] | [Support for parallelising NumPy / scientific computing like operations using functional transforms](https://jax.readthedocs.io/en/latest/jax-101/06-parallelism.html). | 🟠 |

From 1f8f26132ef85eee197cbc0fa646cb8a91d329b2 Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 11:33:43 +0000
Subject: [PATCH 5/8] Add threading to the table as red and add to blurb.

Lines added to the blurb hacked from Matt's suggestions in #323.
More details about the GIL and PEP703.

Co-authored-by: Matt Graham
---
 docs/pages/parallel-async.md | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index 8cd99939..038951b4 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -5,22 +5,29 @@ layout: default

 # Parallel and asynchronous processing

-Python has a good ecosystem of libraries for multiprocessing (threads and GPU
-parallelisation), as well as asynchronous processing.
+Python has a good ecosystem of libraries for parallelising the processing of tasks,
+as well as asynchronous processing.

-🟠 tools in the following may be preferred over 🟢, if there are external
-reasons to use a specific interface or parallelisation scheme. Possibly due to
-the nature of the research problem, the high-performance computing resources
-available or simply due to pre-existing code using a library like [pandas].
+Parallelisation in Python is typically _process-based_ with code parallelised
+across multiple Python processes each with their own interpreter or makes use of
+tools which run the tasks to be parallelised outside of the Python interpreter,
+using for example Python wrappers around external code which uses thread-based
+parallelism.

-## Thread- and process-based parallelism
+🟠 tools in the following should be chosen, if there are external reasons to use
+a specific interface or parallelisation scheme. Possibly due to the nature of
+the research problem, the high-performance computing resources available or
+simply due to pre-existing code using a library like [pandas].
+
+## Process-based (and thread-based) parallelism

 | Name | Short description | 🚦 |
 | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-: |
 | [multiprocess] | A fork of [multiprocessing] which uses `dill` instead of `pickle` to allow serializing wider range of object types including nested / anonymous functions. We've found this rather more simple to work with. | 🟢 |
 | [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
-| [multiprocessing] | The standard library module for distributing tasks across multiple processes | 🟠 |
-| [mpi4py] | support for MPI based parallelism | 🟠 |
+| [multiprocessing] | The standard library module for distributing tasks across multiple processes. | 🟠 |
+| [mpi4py] | Support for MPI based parallelism. | 🟠 |
+| [threading] | The standard library module for multi-threading. Due to the _global interpreter lock_ [currently][PEP703] only one thread can execute Python code at a time. | 🔴 |

 ## Compiler-based parallelism

@@ -30,10 +37,12 @@ available or simply due to pre-existing code using a library like [pandas].
 | [numba] | [Support for parallelism via `jit(parallel=True)`](https://numba.pydata.org/numba-doc/latest/user/parallel.html). | 🟠 |
 | [jax] | [Support for parallelising NumPy / scientific computing like operations using functional transforms](https://jax.readthedocs.io/en/latest/jax-101/06-parallelism.html). | 🟠 |

-
+

 [multiprocess]: https://multiprocess.readthedocs.io/en/latest/
 [multiprocessing]: https://docs.python.org/3/library/multiprocessing.html
+[threading]: https://docs.python.org/3/library/threading.html
+[PEP703]: https://peps.python.org/pep-0703/
 [Cython]: https://cython.readthedocs.io/
 [mpi4py]: https://mpi4py.readthedocs.io/
 [pandas]: https://pandas.pydata.org/

From 9f88ee68c10aa0089eedb1c7d2d364eb762a8bb0 Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 12:04:17 +0000
Subject: [PATCH 6/8] Update docs/pages/parallel-async.md

Co-authored-by: David Stansby
---
 docs/pages/parallel-async.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index 038951b4..26dae3b3 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -23,7 +23,7 @@ simply due to pre-existing code using a library like [pandas].

 | Name | Short description | 🚦 |
 | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-: |
-| [multiprocess] | A fork of [multiprocessing] which uses `dill` instead of `pickle` to allow serializing wider range of object types including nested / anonymous functions. We've found this rather more simple to work with. | 🟢 |
+| [multiprocess] | A fork of [multiprocessing] which uses `dill` instead of `pickle` to allow serializing wider range of object types including nested / anonymous functions. We've found this easier to use than `multiprocessing`. | 🟢 |
 | [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
 | [multiprocessing] | The standard library module for distributing tasks across multiple processes. | 🟠 |
 | [mpi4py] | Support for MPI based parallelism. | 🟠 |

From a726c9a58e532d4a6c155d68859016b6b28af711 Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 12:04:39 +0000
Subject: [PATCH 7/8] Update docs/pages/parallel-async.md

Co-authored-by: David Stansby
---
 docs/pages/parallel-async.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index 26dae3b3..f991057a 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -39,7 +39,7 @@ simply due to pre-existing code using a library like [pandas].

-[multiprocess]: https://multiprocess.readthedocs.io/en/latest/
+[multiprocess]: https://multiprocess.readthedocs.io/en/stable/
 [multiprocessing]: https://docs.python.org/3/library/multiprocessing.html
 [threading]: https://docs.python.org/3/library/threading.html
 [PEP703]: https://peps.python.org/pep-0703/

From 4e4b158418b3511e39c84587dbc2a469b1aa17bc Mon Sep 17 00:00:00 2001
From: Sam Cunliffe
Date: Mon, 25 Mar 2024 12:05:52 +0000
Subject: [PATCH 8/8] Prettier

---
 docs/pages/parallel-async.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/pages/parallel-async.md b/docs/pages/parallel-async.md
index f991057a..776368d8 100644
--- a/docs/pages/parallel-async.md
+++ b/docs/pages/parallel-async.md
@@ -21,13 +21,13 @@ simply due to pre-existing code using a library like [pandas].
 ## Process-based (and thread-based) parallelism

-| Name | Short description | 🚦 |
-| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-: |
+| Name | Short description | 🚦 |
+| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-: |
 | [multiprocess] | A fork of [multiprocessing] which uses `dill` instead of `pickle` to allow serializing wider range of object types including nested / anonymous functions. We've found this easier to use than `multiprocessing`. | 🟢 |
-| [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
-| [multiprocessing] | The standard library module for distributing tasks across multiple processes. | 🟠 |
-| [mpi4py] | Support for MPI based parallelism. | 🟠 |
-| [threading] | The standard library module for multi-threading. Due to the _global interpreter lock_ [currently][PEP703] only one thread can execute Python code at a time. | 🔴 |
+| [dask] | Aims to make scaling existing code in familiar libraries (`numpy`, [pandas], `scikit-learn`, ...) easy. | 🟠 |
+| [multiprocessing] | The standard library module for distributing tasks across multiple processes. | 🟠 |
+| [mpi4py] | Support for MPI based parallelism. | 🟠 |
+| [threading] | The standard library module for multi-threading. Due to the _global interpreter lock_ [currently][PEP703] only one thread can execute Python code at a time. | 🔴 |

 ## Compiler-based parallelism