ResourceMultiProc plugin and runtime profiler #1372

Merged
merged 94 commits into from
May 12, 2016
94 commits
b5a6024
Removed S3 datasink stuff
pintohutch Jan 13, 2016
70ca457
Started adding in logic for num_threads and changed names of real mem…
pintohutch Jan 13, 2016
36e1446
Added cmd-level threads and memory profiling
pintohutch Jan 14, 2016
61b8d0c
Resolved conflicts from re-basing and pulling in nipype master
pintohutch Jan 14, 2016
a3c9be7
Merge branch 'nipy-master' into resource_multiproc
pintohutch Jan 14, 2016
43c0d56
remove MultiProc, MultiprocPlugin is default
carolFrohlich Jan 15, 2016
0bb6d79
change old namespaces
carolFrohlich Jan 19, 2016
a68e0e6
Added initial num_threads monitoring code
pintohutch Jan 19, 2016
5dac574
Merged Carol's changes
pintohutch Jan 19, 2016
97e7333
Manual merge of s3_datasink and resource_multiproc branch for cpac run
pintohutch Feb 3, 2016
08a485d
Manual merge of s3_datasink and resource_multiproc branch for cpac run
pintohutch Feb 3, 2016
e5945e9
Changed resources fetching to its function and try-blocked it in case…
pintohutch Feb 3, 2016
9cb7a68
Fixed pickling bug of instance method by passing profiling flag inste…
pintohutch Feb 4, 2016
2de8786
Merge pull request #6 from nipy/master
pintohutch Feb 4, 2016
6e3a7b5
Merged resource_multiproc into s3_multiproc
pintohutch Feb 4, 2016
fe0a352
Merged resource_multiproc into s3_multiproc
pintohutch Feb 4, 2016
7cc0731
Merge branch 's3_multiproc' into resource_multiproc
pintohutch Feb 4, 2016
7b3d19f
Re-pulled in changes from github
pintohutch Feb 4, 2016
5733af9
Fixed crash related to try blocking
pintohutch Feb 4, 2016
544dddf
Removed forcing of runtime_profile to be off:
pintohutch Feb 4, 2016
c074299
Made when result is None that the end stats are N/A
pintohutch Feb 4, 2016
a4e3ae6
Added try-blocks around the runtime profile stats in callback logger
pintohutch Feb 4, 2016
e25ac8c
Cleaned up some code and removed recursion from get_num_threads
pintohutch Feb 5, 2016
d714a03
Added check for runtime having 'get' attribute
pintohutch Feb 9, 2016
27ee192
Removed print statements
pintohutch Feb 12, 2016
c99f834
Removed more print statements and touched up some code to be more lik…
pintohutch Feb 12, 2016
07461cf
Added a fix for the recursive symlink bug (was happening because whil…
pintohutch Feb 12, 2016
116a6a1
Removed node.run level profiling
pintohutch Feb 12, 2016
c1376c4
Updated keyword in result dictionary to runtime instead of cmd-level
pintohutch Feb 17, 2016
e3f54c1
Removed afni centrality interface (will do that in another branch)
pintohutch Feb 18, 2016
cbd08e0
Added back in the automaskinputspec
pintohutch Feb 18, 2016
29bcd80
Added unit tests for runtime_profiler
pintohutch Feb 22, 2016
a170644
Added import and reduced resources used
pintohutch Feb 22, 2016
250b6d3
Added runtime_profile to run by default unless the necessary packages…
pintohutch Feb 23, 2016
f4b0b73
Cleaned up some of the code to PEP8 and checked for errors
pintohutch Feb 23, 2016
9d19e14
Changed memory parameters to be memory_gb to be more explicit, used r…
pintohutch Feb 24, 2016
0388305
Added checks for python deps and added method using builtin std libra…
pintohutch Feb 25, 2016
1e4ce5b
Fixed exception formatting and import error
pintohutch Feb 25, 2016
ace7368
Removed 'Error' from logger info message when memory_profiler or psut…
pintohutch Mar 1, 2016
a515c77
Added more code for debugging runtime profiler
pintohutch Mar 8, 2016
e1d19cb
improve thread draw algorithm
carolFrohlich Mar 8, 2016
0fdc671
Wrote my own get memory function - seems to work much better
pintohutch Mar 9, 2016
9ad5d24
Restructured unittests and num_threads logic
pintohutch Mar 10, 2016
a52395a
Ignored sleeping
pintohutch Mar 10, 2016
2b0a6e2
Remove proc terminology from variable names
pintohutch Mar 10, 2016
97d33bb
Updated dictionary names for results dict
pintohutch Mar 11, 2016
36ded7b
Added recording timestamps
pintohutch Mar 11, 2016
340a7b7
minor bugs
carolFrohlich Mar 15, 2016
7062ec8
Just passed all unittests
pintohutch Mar 15, 2016
cf16091
Fixed a small bug in multiproc and added 90% of the user documentatio…
pintohutch Mar 15, 2016
126fa5d
Merge pull request #9 from FCP-INDI/resource_multiproc
pintohutch Mar 16, 2016
7c90d5a
Added some details about gantt chart
pintohutch Mar 16, 2016
8fce738
partial commit to gantt chart
carolFrohlich Mar 25, 2016
950bedb
Merge pull request #11 from FCP-INDI/resource_multiproc
pintohutch Mar 25, 2016
13fddc8
Fixed up gantt chart to plot real time memory
pintohutch Mar 25, 2016
54a2c63
Finished working prototype of gantt chart generator
pintohutch Mar 28, 2016
8ca97c8
remove white space, add labels
carolFrohlich Mar 30, 2016
6ac0bc3
Changed thread count logic
pintohutch Mar 30, 2016
5c2f2c1
Merge branch 'debug_runtime_prof' of https://github.com/fcp-indi/nipy…
pintohutch Mar 30, 2016
7a8383b
Experimented with process STATUS
pintohutch Apr 1, 2016
be1ec62
Added global watcher
pintohutch Apr 20, 2016
6fe8391
Debug code
pintohutch Apr 22, 2016
fcaec79
Cleaned up debug code
pintohutch Apr 22, 2016
de6a7f1
Removed print debug statement
pintohutch Apr 22, 2016
47003a4
Merge pull request #16 from nipy/master
pintohutch Apr 22, 2016
e937bdc
Finished documentation with gantt chart image
pintohutch Apr 22, 2016
2683de8
Merge branch 'debug_runtime_prof' of https://github.com/fcp-indi/nipy…
pintohutch Apr 22, 2016
7513f40
Updated docs with dependencies
pintohutch Apr 25, 2016
a0920b7
Merge pull request #17 from FCP-INDI/debug_runtime_prof
pintohutch Apr 25, 2016
3814ae9
Moved gantt chart to proper images folder
pintohutch Apr 25, 2016
bdcffca
Merge pull request #18 from FCP-INDI/debug_runtime_prof
pintohutch Apr 25, 2016
0189dd8
Fixed some failing unit tests
pintohutch Apr 25, 2016
de3b298
Modified thread-monitoring logic and ensured unit tests pass
pintohutch Apr 26, 2016
6f4c2e7
Lowered memory usage for unit tests
pintohutch Apr 26, 2016
0cca692
Changed runtime unit test tolerance to GB instead of %
pintohutch Apr 27, 2016
3f7649c
Merge pull request #19 from nipy/master
pintohutch Apr 27, 2016
8f1b104
Added AttributeError to exception for more specific error handling
pintohutch Apr 27, 2016
ffe30da
Removed unexpected indentation from resource scheduler and profiler u…
pintohutch Apr 28, 2016
cf08147
Updated docs to reflect png
pintohutch Apr 28, 2016
35bdb2d
Fixed some errors
pintohutch Apr 29, 2016
d83d849
Reversed logic for afni OMP_NUM_THREADS
pintohutch May 2, 2016
b21bca6
Added traceback crash reporting
pintohutch May 2, 2016
79d1988
More specific exception handling addressed
pintohutch May 2, 2016
fd788f8
Added safeguard for not taking 100% of system memory when plugin_arg …
pintohutch May 2, 2016
d6cafa2
Fix attribute error related to terminal_output attribute
pintohutch May 3, 2016
9b46c49
Trying to disable some unittests for TravisCI debugging
pintohutch May 5, 2016
46f3275
Commented out more unittests for TravisCI debugging
pintohutch May 9, 2016
41c6928
Enabled runtime profiler testing option
pintohutch May 9, 2016
f0a3889
Added one test back in
pintohutch May 9, 2016
0305930
Added threads unit test back in
pintohutch May 9, 2016
b566b22
Changed max memory to 1 GB
pintohutch May 9, 2016
e7eac16
Fixed some typos
pintohutch May 9, 2016
e600d57
Added missing mark
pintohutch May 10, 2016
9ee881f
Merge branch 'resource_multiproc' of https://github.com/fcp-indi/nipy…
pintohutch May 10, 2016
Binary file added doc/users/images/gantt_chart.png
160 changes: 160 additions & 0 deletions doc/users/resource_sched_profiler.rst
@@ -0,0 +1,160 @@
.. _resource_sched_profiler:

=============================================
Resource Scheduling and Profiling with Nipype
=============================================
The latest version of Nipype supports system resource scheduling and profiling.
These features allow users to ensure high throughput of their data processing
while also controlling the amount of computing resources a given workflow will
use.


Specifying Resources in the Node Interface
==========================================
Each ``Node`` instance interface has two parameters that specify its expected
thread and memory usage: ``num_threads`` and ``estimated_memory_gb``. If a
particular node is expected to use 8 threads and 2 GB of memory:

::

import nipype.pipeline.engine as pe
from nipype.interfaces.fsl import Smooth  # any interface works here

node = pe.Node(interface=Smooth(), name='smooth_node')
node.interface.num_threads = 8
node.interface.estimated_memory_gb = 2

If these parameters are never set, they default to 1 thread and 1 GB of RAM.


Resource Scheduler
==================
The ``MultiProc`` workflow plugin schedules node execution based on the
resources used by the currently running nodes and the total resources available
to the workflow. The plugin uses the plugin arguments ``n_procs`` and
``memory_gb`` to set the maximum resources a workflow can use. To limit a
workflow to using 8 cores and 10 GB of RAM:

::

args_dict = {'n_procs' : 8, 'memory_gb' : 10}
workflow.run(plugin='MultiProc', plugin_args=args_dict)

If these values are not specified, the plugin assumes it can use all of the
processors and memory on the system. For example, if the machine has 16 cores
and 12 GB of RAM, the workflow will internally assume those values for
``n_procs`` and ``memory_gb``, respectively.
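
For illustration, the system totals that the plugin would fall back to can be
queried directly with ``psutil``. This is only a sketch of the idea, not the
plugin's internal code:

::

 # Illustration only: the system totals the scheduler falls back to when
 # 'n_procs' and 'memory_gb' are not given in plugin_args.
 import psutil

 n_procs = psutil.cpu_count()
 memory_gb = psutil.virtual_memory().total / (1024.0 ** 3)
 print('Defaults would be n_procs=%d, memory_gb=%.1f' % (n_procs, memory_gb))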

The plugin then queues eligible nodes for execution based on their expected
usage, via the ``num_threads`` and ``estimated_memory_gb`` interface
parameters. If the plugin sees that only 3 of its 8 processors and 4 GB of its
10 GB of RAM are in use by running nodes, it will attempt to execute the next
available node as long as its ``num_threads <= 5`` and
``estimated_memory_gb <= 6``. If this is not the case, it continues checking
every available node in the queue until it finds one that meets these
conditions, or waits for an executing node to finish and free up the necessary
resources. Queue priority goes to the nodes with the largest
``estimated_memory_gb``, followed by those with the largest expected
``num_threads``.
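
As a rough sketch of this rule (illustrative only, not the plugin's actual
implementation), the eligibility check can be written as:

::

 # Minimal sketch of the scheduling rule described above; the helper
 # function and names are illustrative, not part of the MultiProc plugin.
 def next_eligible(pending, running, total_procs, total_memory_gb):
     """Return the first pending (threads, mem_gb) job that fits, or None.

     Both lists hold (num_threads, estimated_memory_gb) tuples.  Pending
     jobs are tried largest estimated memory first, then most threads.
     """
     free_procs = total_procs - sum(threads for threads, _ in running)
     free_memory_gb = total_memory_gb - sum(mem for _, mem in running)
     for threads, mem in sorted(pending, key=lambda job: (job[1], job[0]),
                                reverse=True):
         if threads <= free_procs and mem <= free_memory_gb:
             return (threads, mem)
     return None

 # With 3 of 8 processors and 4 GB of 10 GB in use, a (5, 6) job still fits:
 print(next_eligible([(5, 6), (6, 7)], [(3, 4)], 8, 10))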


Runtime Profiler and using the Callback Log
===========================================
It is not always easy to estimate the amount of resources a particular function
or command uses. To help with this, Nipype provides some feedback about the
system resources used by every node during workflow execution via the built-in
runtime profiler. The runtime profiler is automatically enabled if the
psutil_ Python package is installed and found on the system.

.. _psutil: https://pythonhosted.org/psutil/

If the package is not found, the workflow will run normally without the runtime
profiler.

The runtime profiler records the number of threads and the amount of memory (GB)
used as ``runtime_threads`` and ``runtime_memory_gb`` in the Node's
``result.runtime`` attribute. Since the node object is pickled and written to
disk in its working directory, these values are available for analysis after
node or workflow execution by manually parsing the pickle file contents.
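
For example, a node's result pickle can be loaded back after execution to
inspect these values. In this sketch the ``result_<nodename>.pklz`` filename
and the example path are assumptions about the working-directory layout and
should be adapted to your setup:

::

 # Hedged sketch: load a pickled node result and print its runtime stats.
 from nipype.utils.filemanip import loadpkl

 # Example path; point it at the node's actual working directory
 result = loadpkl('/working_dir/resample_node/result_resample_node.pklz')
 print(getattr(result.runtime, 'runtime_threads', 'N/A'))
 print(getattr(result.runtime, 'runtime_memory_gb', 'N/A'))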

Nipype also provides a logging mechanism for saving node runtime statistics to
a JSON-style log file via the ``log_nodes_cb`` logger function. This is enabled
by setting the ``status_callback`` parameter to point to this function in the
``plugin_args`` when using the ``MultiProc`` plugin.

::

from nipype.pipeline.plugins.callback_log import log_nodes_cb
args_dict = {'n_procs' : 8, 'memory_gb' : 10, 'status_callback' : log_nodes_cb}

To set the filepath for the callback log, the ``'callback'`` logger must be
configured.

::

# Set path to log file
import logging
callback_log_path = '/home/user/run_stats.log'
logger = logging.getLogger('callback')
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler(callback_log_path)
logger.addHandler(handler)

Finally, the workflow can be run.

::

workflow.run(plugin='MultiProc', plugin_args=args_dict)

After the workflow finishes executing, the log file at
``/home/user/run_stats.log`` can be parsed for the runtime statistics. Here is
an example of what the contents would look like:

::

{"name":"resample_node","id":"resample_node",
"start":"2016-03-11 21:43:41.682258",
"estimated_memory_gb":2,"num_threads":1}
{"name":"resample_node","id":"resample_node",
"finish":"2016-03-11 21:44:28.357519",
"estimated_memory_gb":"2","num_threads":"1",
"runtime_threads":"3","runtime_memory_gb":"1.118469238281"}

Here it can be seen that the number of threads was underestimated while the
amount of memory needed was overestimated. The next time this workflow is run,
the user can update the node's ``num_threads`` and ``estimated_memory_gb``
interface parameters to reflect these measurements and achieve higher pipeline
throughput. Note that the ``runtime_threads`` value is sometimes higher than
expected, particularly for multi-threaded applications. Tools can implement
multi-threading in different ways under the hood; the profiler simply traverses
the process tree and returns all running threads associated with the process,
some of which may be thread-monitoring daemons or transient processes.
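
A minimal sketch for extracting these statistics from the callback log with
the standard ``json`` module (assuming each record is written as a single JSON
object per line; the example entries above are wrapped only for readability):

::

 # Hedged sketch: compare estimated vs. measured usage for finished nodes.
 import json

 with open('/home/user/run_stats.log') as log_file:
     for line in log_file:
         entry = json.loads(line)
         if 'finish' not in entry:  # skip the 'start' records
             continue
         print('%s: estimated %s GB / %s threads, measured %s GB / %s threads'
               % (entry['name'], entry['estimated_memory_gb'],
                  entry['num_threads'], entry.get('runtime_memory_gb', 'N/A'),
                  entry.get('runtime_threads', 'N/A')))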


Visualizing Pipeline Resources
==============================
Nipype can visualize a workflow's execution based on each node's runtime and
system resource usage. It does this using the log file generated by the
callback logger after workflow execution, as shown above.
The pandas_ Python package is required to use this feature.

.. _pandas: http://pandas.pydata.org/

::

from nipype.pipeline.plugins.callback_log import log_nodes_cb
args_dict = {'n_procs' : 8, 'memory_gb' : 10, 'status_callback' : log_nodes_cb}
workflow.run(plugin='MultiProc', plugin_args=args_dict)

# ...workflow finishes and writes callback log to '/home/user/run_stats.log'

from nipype.utils.draw_gantt_chart import generate_gantt_chart
generate_gantt_chart('/home/user/run_stats.log', cores=8)
# ...creates gantt chart in '/home/user/run_stats.log.html'

The ``generate_gantt_chart`` function creates an HTML file that can be viewed
in a browser. Below is an example of the Gantt chart displayed in a web
browser. Note that hovering the cursor over any node or resource bubble shows
additional information in a pop-up.

.. image:: images/gantt_chart.png
   :width: 100 %
4 changes: 4 additions & 0 deletions nipype/interfaces/afni/base.py
@@ -157,6 +157,10 @@ def __init__(self, **inputs):
else:
self._output_update()

# Update the OMP_NUM_THREADS environment variable from the
# interface's num_threads estimate (defaults to 1 if unset)
os.environ['OMP_NUM_THREADS'] = str(self.num_threads)

def _output_update(self):
""" i think? updates class private attribute based on instance input
in fsl also updates ENVIRON variable....not valid in afni