Commit 4c80ff4

Author: j.m.o.rantaharju@swansea.ac.uk
Merge github.com:edbennett/high-performance-python into gh-pages
2 parents 79b41a3 + eff7d19

2 files changed: +285 -3 lines

_episodes/06-numpy-scipy.md

Lines changed: 42 additions & 0 deletions
@@ -391,6 +391,48 @@ express any other way without using explicit `for` loops.
> You'll want to use `np.random.random` for your random numbers. Take
> a look at the documentation for that function to see what arguments
> it takes that will be helpful for this problem.
>
>> ## Solution
>>
>> Two possible solutions:
>>
>> ~~~
>> import numpy as np
>>
>> def numpy_pi_1(number_of_samples):
>>     # Generate all of the random numbers at once to avoid loops
>>     samples = np.random.random(size=(number_of_samples, 2))
>>
>>     # Use the same np.einsum trick that we used in the previous example
>>     # Since we are comparing with 1, we don't need the square root
>>     squared_distances = np.einsum('ij,ij->i', samples, samples)
>>
>>     # Identify all instances of a distance below 1;
>>     # "sum" the True elements to count them
>>     within_circle_count = np.sum(squared_distances < 1)
>>
>>     return within_circle_count / number_of_samples * 4
>>
>>
>> def numpy_pi_2(number_of_samples):
>>     xs = np.random.random(size=number_of_samples)
>>     ys = np.random.random(size=number_of_samples)
>>
>>     r_squareds = xs ** 2 + ys ** 2
>>
>>     within_circle_count = np.sum(r_squareds < 1)
>>
>>     return within_circle_count / number_of_samples * 4
>> ~~~
>> {: .language-python}
>>
>> While these are competitive with each other in performance, which
>> is the fastest depends on various factors. As always, test your
>> own specific workloads and hardware setup to see how solutions
>> perform on them.
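>>
>> To compare them on your own machine, a minimal timing sketch using
>> the standard-library `timeit` module (the sample count and repeat
>> settings here are arbitrary choices):
>>
>> ~~~
>> import timeit
>>
>> # Time each implementation on the same workload; take the best of
>> # three repeats to reduce noise from other running processes
>> for fn in (numpy_pi_1, numpy_pi_2):
>>     best = min(timeit.repeat(lambda: fn(1_000_000), number=10, repeat=3))
>>     print(f"{fn.__name__}: {best / 10:.4f} s per call")
>> ~~~
>> {: .language-python}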
> {: .solution}
{: .challenge}

_episodes/08-pathos.md

Lines changed: 243 additions & 3 deletions
@@ -1,9 +1,17 @@
---
title: "Multiprocessing with Pathos"
-teaching: 40
-exercises: 20
+teaching: 30
+exercises: 30
questions:
- "How can I distribute subtasks across multiple cores or nodes?"
+objectives:
+- "Be able to distribute work to multiple cores using parallel pools"
+- "Know where to look to extend this to multiple nodes"
+- "Understand how to break up more monolithic tasks into subtasks that can be parallelised"
+keypoints:
+- "The `ParallelPool` in Pathos can spread work to multiple cores."
+- "Look closely at loops and other large operations in your program to identify where there is no data dependency, and so parallelism can be added."
+- "Pyina has functions that extend the parallel map to multiple nodes."
---

So far, we have distributed processes to multiple cores and nodes

@@ -83,7 +91,7 @@ def run_in_parallel(betas, ms, h, iteration_count):
    iteration_counts = [iteration_count] * len(betas)

    # Generate a list of filenames to output results to
-    filenames = [f'b{beta}_m{m}.dat'
+    filenames = [f'mc_data/b{beta}_m{m}.dat'
                  for beta, m in zip(betas, ms)]

    # Create a parallel pool to run the functions

@@ -102,10 +110,242 @@ if __name__ == '__main__':
~~~
{: .language-python}

Save this as `mc_pathos.py`, and run it via:

~~~
$ mkdir mc_data # Make sure that the directory to hold the output data is created
$ python mc_pathos.py
~~~
{: .language-bash}

This will run the Monte Carlo simulation for the seven parameter sets
$(\beta=0.5, m=0.5)$, $(\beta=0.5, m=1.0)$, $(\beta=1.0, m=0.5)$,
$(\beta=1.0, m=1.0)$, $(\beta=1.0, m=2.0)$, $(\beta=2.0, m=1.0)$,
$(\beta=2.0, m=2.0)$. Each uses the same values of `h` and
`iteration_count`, and the filename in each case is decided
programmatically.

## Parameter scans

We've just used 10 cores to perform 7 independent computations. Even
if the computations all take the same length of time, three cores are
sitting idle, so we are 70% efficient at best. If we have seven jobs
that are long enough that we want to use HPC for them, then we'd be
better off using seven independent SLURM jobs; Pathos isn't useful
there. This kind of workflow comes into its own when you have hundreds
or thousands of different values to scan through. But hardcoding lists
of hundreds or thousands of elements is tedious, especially when there
will be lots of repeated values.

Quite often what we'd really like to do is take a set of possible
values for each variable, and then scan over all possible
combinations. With two variables this allows a heatmap or contour plot
to be produced, and with three a volumetric representation could be
generated.

The first tool in our arsenal is the `product` function from the
`itertools` module. This generates tuples containing all possible
combinations of the lists it is given as arguments. Let's check this
at a Python interpreter:

~~~
$ python
~~~
{: .language-bash}

~~~
>>> from itertools import product
>>> numbers = [1, 2, 3, 4, 5]
>>> letters = ['a', 'b', 'c', 'd', 'e']
>>> list(product(numbers, letters))
~~~
{: .language-python}

~~~
[(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (1, 'e'), (2, 'a'), (2, 'b'),
(2, 'c'), (2, 'd'), (2, 'e'), (3, 'a'), (3, 'b'), (3, 'c'), (3, 'd'),
(3, 'e'), (4, 'a'), (4, 'b'), (4, 'c'), (4, 'd'), (4, 'e'), (5, 'a'),
(5, 'b'), (5, 'c'), (5, 'd'), (5, 'e')]
~~~
{: .output}

We need to use the `list` function here to create a list that we can
read, as by default `product` returns an iterator. (Great for looping
through, and good for passing to other functions, but not useful for
seeing the results on screen.)
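
For example, evaluating the bare `product` call shows only the
iterator object, not its contents (the exact memory address will
vary):

~~~
>>> product(numbers, letters)
~~~
{: .language-python}

~~~
<itertools.product object at 0x7f...>
~~~
{: .output}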

Looking closely, this isn't quite what we want yet. We need all the
first elements to be in one list, and all the second elements to be in
another (and so on). Fortunately, there's another built-in function
that can help us here.

`zip` is most often used to take a few long lists and return one long
list of short tuples, pairing up corresponding elements. But if
instead we give it many short tuples, we'll get back a short list of
long tuples&mdash;one containing all the first elements, one
containing all the second elements, and so on.

The only extra ingredient we need to make this work is a way to pass
the elements of the result of `product` as separate arguments, rather
than passing the iterable as a single argument. This is done with the
`*` operator, which when placed before a list (or other iterable)
means "expand this iterable into multiple arguments". Putting these
together:

~~~
>>> list(zip(*product(numbers, letters)))
~~~
{: .language-python}

~~~
[(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5,
5, 5), ('a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b',
'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e')]
~~~
{: .output}

We now have two tuples of the form that we want to pass in to
`map`. All that remains is to extract them from the list that they're
in. Just like before, this is done with `*`.

Adapting the previous example to scan $\beta$, $m$, and $h$:

~~~
from pathos.pools import ParallelPool
from itertools import product

from mc import run_mc

def run_in_parallel(betas, ms, hs, iteration_count):
    # Check that the parameter lists are all the same length
    assert len(betas) == len(ms)
    assert len(betas) == len(hs)

    # Generate a list repeating the same value of iteration_count
    iteration_counts = [iteration_count] * len(betas)

    # Generate a list of filenames to output results to
    filenames = [f'mc_data/b{beta}_m{m}_h{h}.dat'
                 for beta, m, h in zip(betas, ms, hs)]

    # Create a parallel pool to run the functions
    pool = ParallelPool(nodes=10)

    # Run the work
    pool.map(run_mc, ms, hs, betas, iteration_counts, filenames)

if __name__ == '__main__':
    betas = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0]
    ms = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0]
    hs = [0.25, 0.5, 1.0, 2.5]
    run_in_parallel(
        *zip(*product(betas, ms, hs)),
        1000
    )
~~~
{: .language-python}

Saving this as `mc_pathos_scan.py` and running it will now generate
324 (that is, 9 &times; 9 &times; 4) output files in the `mc_data`
directory.

> ## Processing file lists
>
> Not all workloads of this kind are parameter scans; some might
> instead process large numbers of files. Revisit the Fourier example
> from the section on GNU Parallel, and try converting it to work with
> Pathos instead.
>
> You might want to do this in the following steps:
>
> 1. Convert the program to be a function, separating out the
>    behaviour that relates to `argparse` from the desired
>    computational behaviour.
> 2. Create a second file, as we have done here, to call this new
>    function using Pathos. For now, hardcode a list of images to
>    process. (A sketch of this step is given after the challenge.)
> 3. Now, adjust the program to read the image list from a file, but
>    hardcode the filename.
> 4. Finally, adjust the program to read the image list filename from
>    the command line, e.g. using `argparse`.
{: .challenge}
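
As a starting point for step 2, here is a minimal sketch. It assumes a
hypothetical module `fourier.py` exposing a function
`process_image(filename)` that performs the computation; both names
are placeholders for whatever your refactored code provides, and the
pool size is arbitrary:

~~~
from pathos.pools import ParallelPool

# Placeholder import: rename to match your refactored Fourier example
from fourier import process_image

# Step 2: hardcode the list of images to process for now
image_filenames = ['image1.png', 'image2.png', 'image3.png']

# Spread the per-file work across a pool of worker processes
pool = ParallelPool(nodes=4)
pool.map(process_image, image_filenames)
~~~
{: .language-python}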

> ## Parameters to scan on the command line
>
> In general, we don't want to hard-code our parameters, even if there
> are only a few and we are using `itertools.product` to generate the
> longer lists.
>
> Adapt the Monte Carlo example above so that it can read the `betas`,
> `ms`, and `hs` lists from the command line. (The `action="append"`
> keyword argument to `add_argument` will be useful here; the
> `argparse` documentation will tell you more.)
{: .challenge}
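
As a hint, here is a minimal sketch of how `action="append"` collects
repeated options into lists (the option names are illustrative
choices, not fixed by the lesson):

~~~
from argparse import ArgumentParser

parser = ArgumentParser(description="Monte Carlo parameter scan")

# Each repeated use of an option appends another value to its list
parser.add_argument('--beta', type=float, action='append', required=True)
parser.add_argument('--m', type=float, action='append', required=True)
parser.add_argument('--h', type=float, action='append', required=True)

# e.g. python mc_pathos_scan.py --beta 0.5 --beta 1.0 --m 1.0 --h 0.25
args = parser.parse_args()
print(args.beta, args.m, args.h)
~~~
{: .language-python}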

<!--- REMOVING THIS SECTION UNTIL IT IS MORE PERFORMANT

## Chunking

Sometimes, we have a large problem that is taking too long, and we
want to parallelise it so that it will finish faster. Unlike the above
cases, it's not immediately obvious how one could split it into subtasks
that can be run independently.

Take for example the multi-dimensional array-broadcasting example that
we followed through in the Numpy episode. This is just performing a
single operation on a large array&mdash;surely that doesn't decompose
into many small tasks?

--->

> ## Surprisingly parallelisable tasks
>
> Sometimes you need to think more carefully about how a task can be
> parallelised. For example, looking back to the example of
> calculating $\pi$ from the previous section, on the surface we are
> only performing a single computation that returns a single
> number. However, in fact each Monte Carlo sample is independent, so
> could be generated on a separate processor.
>
> Try adapting one of the solutions to the $\pi$ example in the Numpy
> episode to run multiple streams in parallel. Each stream will want
> to report back the count of points inside the circle; totalising and
> calculating the value of $\pi$ then needs to happen in serial.
{: .challenge}
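
One possible shape for this is sketched below, reusing the
`ParallelPool` from earlier in this episode. `count_within_circle` is
a helper written for this sketch, and the stream and sample counts are
arbitrary:

~~~
import numpy as np
from pathos.pools import ParallelPool

def count_within_circle(number_of_samples):
    # One independent Monte Carlo stream: count points inside the circle
    samples = np.random.random(size=(number_of_samples, 2))
    squared_distances = np.einsum('ij,ij->i', samples, samples)
    return np.sum(squared_distances < 1)

if __name__ == '__main__':
    stream_count = 10
    samples_per_stream = 1_000_000

    # Each stream reports back its own count...
    pool = ParallelPool(nodes=stream_count)
    counts = pool.map(count_within_circle,
                      [samples_per_stream] * stream_count)

    # ...and totalising and calculating pi happen in serial
    total_samples = stream_count * samples_per_stream
    print(4 * sum(counts) / total_samples)
~~~
{: .language-python}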

## Multiple nodes

While it is possible to start processes on more than one node using
the Pathos library directly, this is easier to do using Pyina, which
is another part of the Pathos framework.

It can be used very similarly to the Pathos library, by creating a
process pool and then using a map function across that pool. The
difference is that Pyina will interact with Slurm to correctly
position tasks on each node.

A toy example:

~~~
from pyina.ez_map import ez_map

# An example function: report which host each task ran on
def host(id):
    import socket
    return "Rank: %d -- %s" % (id, socket.gethostname())

# Launch the parallel map of the target function
results = ez_map(host, range(100), nodes=10)
for result in results:
    print(result)
~~~
{: .language-python}

Pyina makes use of the Message Passing Interface (MPI) to communicate
between nodes and launch instances of Python; "rank" in this context
is an MPI term referring to the index of the process, which is used
for bookkeeping. A full study of what MPI can do (in Python or
otherwise) is beyond the scope of this lesson!
