Skip to content

[WIP,ENH] Revision to the resource profiler #2193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 43 commits into from

Conversation

oesteban
Copy link
Contributor

@oesteban oesteban commented Sep 21, 2017

This PR revises the resource profiler with the following objectives:

  • Increase robustness (and making sure it does not crash nipype)
  • Extend profiling to all interfaces (including pure python)

The increase of robustness will be expected from:

  1. Trying to reduce (or remove at all if possible) the logger callback
    to register the estimations of memory and cpus. This could be achieved
    by making interfaces responsible or keeping track of their resources
    to then collect all results after execution of the node.

  2. Centralize profiler imports, like the config or logger object
    so that the applicability of the profiler is checked only once.

This first commit just creates one new module nipype.utils.profiler, and
moves the related functions in there.

Important: this PRs removes the old filemanip logger and replaces it by a more generic utils. Documentation has been updated accordingly.

This PR revises the resource profiler with the following objectives:

  - Increase robustness (and making sure it does not crash nipype)
  - Extend profiling to all interfaces (including pure python)

The increase of robustness will be expected from:

  1. Trying to reduce (or remove at all if possible) the logger callback
     to register the estimations of memory and cpus. This could be achieved
     by making interfaces responsible or keeping track of their resources
     to then collect all results after execution of the node.

  2. Centralize profiler imports, like the config or logger object
     so that the applicability of the profiler is checked only once.

This first commit just creates one new module nipype.utils.profiler, and
moves the related functions in there.
@oesteban
Copy link
Contributor Author

Hi @satra, @effigies,

Before I get further with this PR, I'd love your feedback on these issues:

  • Renaming logger filemanip -> utils and attach events from nipype.utils.filemanip, nipype.utils.provenance and nipype.utils.profiler to it.
  • Make interfaces responsible of monitoring themselves, and do it in a separate process so nipype is robust against errors in monitoring (see tear-up and tear-down). Additionally: all interfaces can measure their performance now (not only commandline interfaces and niu.Function).
  • Since the interface logs itself and they are meant to work in shared filesystems, a new .prof file is generated in the interface base directory containing the traces of memory and cpus. That will be retrieved and finally the peaks saved in the runtime object, allowing for removal of the profiler callback and log. WDYT?.
  • I had to simplify the run interface logic, trying to have the least lines possible within try ... except.

@effigies
Copy link
Member

This all seems reasonable on its face. I haven't really looked into the profiler before, so I don't have strong opinions of cases to consider. I'll try to have a more detailed look and might have more to say...

setup.py Outdated
@@ -148,6 +148,7 @@ def main():
entry_points='''
[console_scripts]
nipypecli=nipype.scripts.cli:cli
nipype_mprof=nipype.utils.profiler:main
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we put this under nipypecli? it would be nice to have a single cli.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can even remove this executable.

@satra
Copy link
Member

satra commented Sep 25, 2017

@oesteban - in principle this looks reasonable, here are my questions:

  1. is the profiler stable across os x and linux? i see that duration is now being returned from the profiler, so that would mean that we need to have the profiler on to get this. i'm fine with that, but that would mean that the profiler should be on by default and be robust across platforms (i.e. not require sudo on some).

  2. has this been tested with linux cgroups limiting threads? does this have an impact on performance, since this is running a separate process. as opposed to the previous case where it was a nested subprocess.

@oesteban
Copy link
Contributor Author

Thanks for the comments :)

I am currently finishing this and will start testing all those things soon.

Regarding the duration: I am still working on making the run function robuster, but in principle the storing the duration will be the same way it was even before the resource profiler was included.

I'm more concerned about the cgroups, as you say this Popen will fork a new subprocess, but I guess this is new to the eyes of the scheduler (and thus will take one thread out). I'll find out how cgroups work here.

@satra
Copy link
Member

satra commented Sep 25, 2017

@oesteban - it may still be listed under the parent process, but the question is how much of the resources does that thread take up. possibly not much, but just worth checking.

@satra
Copy link
Member

satra commented Sep 25, 2017

@oesteban - you can use pstree to check.

@codecov-io
Copy link

codecov-io commented Sep 25, 2017

Codecov Report

Merging #2193 into master will increase coverage by 0.02%.
The diff coverage is 91.52%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2193      +/-   ##
=========================================
+ Coverage   72.27%   72.3%   +0.02%     
=========================================
  Files        1174    1175       +1     
  Lines       58738   58685      -53     
  Branches     8454    8443      -11     
=========================================
- Hits        42453   42432      -21     
+ Misses      14924   14885      -39     
- Partials     1361    1368       +7
Flag Coverage Δ
#smoketests 72.3% <91.52%> (+0.02%) ⬆️
#unittests 69.99% <86.44%> (+0.04%) ⬆️
Impacted Files Coverage Δ
nipype/interfaces/afni/tests/test_auto_Notes.py 85.71% <ø> (ø) ⬆️
...interfaces/ants/tests/test_auto_AntsJointFusion.py 85.71% <ø> (ø) ⬆️
nipype/interfaces/afni/tests/test_auto_ECM.py 85.71% <ø> (ø) ⬆️
...ipype/interfaces/brainsuite/tests/test_auto_Tca.py 85.71% <ø> (ø) ⬆️
...ipype/interfaces/afni/tests/test_auto_ABoverlap.py 85.71% <ø> (ø) ⬆️
...faces/camino/tests/test_auto_ComputeTensorTrace.py 85.71% <ø> (ø) ⬆️
...faces/camino/tests/test_auto_TrackBedpostxDeter.py 85.71% <ø> (ø) ⬆️
...terfaces/freesurfer/tests/test_auto_MRIsCombine.py 85.71% <ø> (ø) ⬆️
nipype/interfaces/afni/tests/test_auto_Axialize.py 85.71% <ø> (ø) ⬆️
nipype/interfaces/afni/tests/test_auto_Means.py 85.71% <ø> (ø) ⬆️
... and 773 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f422f0...03c8d2e. Read the comment docs.

@oesteban
Copy link
Contributor Author

This is nearing a review-able status. Main changes:

I would really appreciate comments from @satra about the new approach using Event and a general look with fresh eyes (@effigies?) will be very useful as well.

Copy link
Member

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable. Some comments.

*utils_level*
How detailed the logs regarding nipype utils like file operations
(for example overwriting warning) or the resource profiler should be
(possible values: ``INFO`` and ``DEBUG``; default value:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commas around "like file operations (...) or the resource profiler".

@@ -146,6 +147,13 @@ Execution
crashfiles allow portability across machines and shorter load time.
(possible values: ``pklz`` and ``txt``; default value: ``pklz``)

*resource_monitor*
Enables monitoring the resources occupation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indicate this is a boolean, note default value.

@@ -74,6 +74,13 @@ Optional arguments::
n_procs : Number of processes to launch in parallel, if not set number of
processors/threads will be automatically detected

memory_gb : Total memory available to be shared by all simultaneous tasks
currently running, if not set it will be automatically estimated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "automatically set to 90% of system RAM"?

self._fmlogger.setLevel(logging.getLevelName(config.get('logging',
'filemanip_level')))
self._utlogger.setLevel(logging.getLevelName(config.get('logging',
'utils_level')))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is, people who've set filemanip_level are just going to get less logging and no explanation.

What about checking for filemanip_level and, if set, set utils_level and provide a deprecation notice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will work on this

__docformat__ = 'restructuredtext'

if sys.version_info < (3, 3):
setattr(sp, 'DEVNULL', os.devnull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this? I don't see that it's used anywhere. If you do want to do this, you can't use paths. You need to pass a file descriptor, such as open(os.devnull, 'rb+').

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from the previous implementation with a process.


free_memory_gb = self.memory_gb - busy_memory_gb
free_processors = self.processors - busy_processors

# Check all jobs without dependency not run
jobids = np.flatnonzero((self.proc_done == False) & \
jobids = np.flatnonzero((self.proc_done == False) &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~self.proc_done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.proc_done is not a numpy array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong. Changing to ~self.proc_done

(self.procs[jobid].overwrite == None and
not self.procs[jobid]._interface.always_run))):
if hash_exists and not self.procs[jobid].overwrite and \
not self.procs[jobid]._interface.always_run:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't equivalent to what it replaced. Is that intentional?


#if it is a start node, add to unifinished nodes
if 'start' in node:
node['start'] = parser.parse(node['start'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the parser import from this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -251,14 +218,15 @@ def _send_procs_to_workers(self, updatehash=False, graph=None):
key=lambda item: (self.procs[item]._interface.estimated_memory_gb,
self.procs[item]._interface.num_threads))

if str2bool(config.get('execution', 'profile_runtime')):
resource_monitor = str2bool(config.get('execution', 'resource_monitor', 'false'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to check for profile_runtime, use value and raise deprecation warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a deprecation on the config itself


proflogger = logging.getLogger('utils')

resource_monitor = str2bool(config.get('execution', 'resource_monitor'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or raise deprecation warning here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get now raises warning (it is not DeprecationWarning bc they are filtered by default)

@oesteban oesteban closed this Sep 27, 2017
@oesteban oesteban deleted the enh/ReviseResourceProfiler branch September 27, 2017 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants