-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add notebook and code to run light_curve_generator at scale #215
Conversation
d87185b
to
20ecea7
Compare
736922b
to
10713db
Compare
945ba5d
to
d3b2f3f
Compare
ef72431
to
13bd6ea
Compare
d011cc4
to
ccc1285
Compare
9c3042f
to
4518d51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a first pass at the text and trying to run the notebook on irsakusp. There is so much information here, my brain is still trying to wrap itself around the process, but to that end the code appears to be really helpful and well thought out! My first wave of comments are below. Maybe I will wait for you to address the fourth comment before I do further review.
General comments
- After reading through it all and thinking about the overall structure of the tutorials, I would still like some parallelization to be in the light_curve_generator.md . Two reasons, 1) the intermediate sized use cases, and 2) to make sure people see multiprocessing in as many notebooks as reasonable (in case they never click through to the scale_up.md). Can we easily add back in what was in the main branch prior to this PR. I like that code cell rather than the example in the scale_up because it doesn't use the helper and scripts so people don't necessarily need to understand that additional complication.
- When that gets added back in, I think we should edit the text in the cell 4. Parallel processing the archives to include something like " The below cells show how to increase the speed of the multi-archive search using python's built in parallel processing by taking advantage of the multiple available CPUs on Fornax. The following will speed up searches of light curves for numbers of targets ranging from a few tens to a few thousand. For larger sample sizes please see the tutorial in the related notebook scale_up.md in the same folder as this one. Running the below multiprocessing calls on very large samples (hundreds of thousands) will not work because of the way the platform is setup to cull users which appear to be inactive. "
- suggested text for light_curves folder readme.md below. Now that we have so many notebooks in this folder it makes sense to describe them.
- It would be great if the notebook could 'run' to completion. Right now it doesn't for me, for likely two reasons. 1). the cells which have $bash commands give errors. Can we maybe make these cells 'raw' type. Just a suggestion for how to leave these in the notebook, but have the notebook still run. 2) in the markdown cells you have sections with ```{code-cell} that aren't getting executed for me on irsakusp. eg., the imports in the intro cell. I guess I am assuming that the notebook does actually run some things, ie., reading the output of the logs and plotting some things.
Time Domain
In this set of Use Case Scenario we work towards creating multiband light curves from multiple archival and publication resources at scale and classifying and analysing them with machine learning tools. Tutorials included in this folder are:
- light_curve_generator.md This notebook automatically retrieves target positions from the literature and then queries archival sources for light curves of those targets. This notebook is intended to be run on a small number of sources (<~ few hundred)
- scale_up.md This notebook uses the same functions as light_curve_generator(above) but is able to generate light curves for large number of sources (~1000 -> millions?)
- light_curve_classifier.md This notebook takes output from light_curve generator and trains a ML classifier to be able to differentiate amongst the samples based on their light curves.
- ML_AGNzoo.md This notebook takes output from the light_curve_generator(above) and visualizes/compares different labelled samples on a reduced dimension grid.
@jkrick Thanks for all of your feedback. I haven't gone through all of it yet, but noticed your comments about the formatting issues in particular. The notebook ran to completion for me, though I manually converted it to a .ipynb first. The imports cell should obviously be a code cell (and it appears to be in the raw markdown, but not the rendered version). For the cells with bash commands, I chose the markdown format because it provides language-aware syntax highlighting and so is easier to read, though I'm still trying to figure out if there's a better way to include it in a notebook. But regardless, those cells should definitely not be executing. Seems like the cell definitions got mangled. I'll work on it. |
It should be possible to choose a shell type for a code block, not just python. But if they are not supposed to be executing as part of the notebook, then choosing the markdown and syntax highlight is certainly the right way to go. |
This is ready to review again. There are two potential issues:
@jkrick I think I've addressed all of your comments.
Done
Done. I also added the same keyword arguments to this section as given in the serial section for all archive calls to close the loop on #230 (comment).
Added.
I fixed all the formatting issues, at least as far as I can tell. Current version runs to completion for me on irsakusp (.md file) and smce (after manually converting .md -> .ipynb). One note: The bash syntax highlighting in the markdown cells renders nicely when using Theme: JupyterLab Light, but has some shadow-like effect that makes it difficult to read using JupyterLab Dark. This is unfortunate, but I don't know of a better solution at the moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One requirements error and a bunch of comments. Sorry if I got a little carried away on the editing.
light_curves/scale_up.md
Outdated
Notebook sections are: | ||
|
||
- "Overview" describes functionality of the included bash script and python helper functions. Compares some parallel processing options. | ||
- "Example 1" shows how to launch a large-scale run using the bash script, monitor its progress automatically, and diagnose a problem (out of RAM). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if these "Example 1", etc. could be links to the sections. I had trouble making this work in the documentation .md, but it should be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and have tried but am also unable to get it to work. I tried the only option from the myst-parser docs that seemed like it should work in all cases (Annotating a syntax block with (target)=
). External links to files or URLs seem to work, but not links to a section within the notebook. I'm guessing this doesn't work because the markdown in different notebook cells is isolated from the others in some way.
Interested to hear if @bsipocz knows of a solution.
Co-authored-by: jkrick <jkrick@caltech.edu>
Add notebook and code to run light_curve_generator at scale 388f106
Closes #195, #137
Adds a new notebook, scale_up.md, and related code to run the light_curve_generator code at scale.
If reviewing, I recommend reading through the notebook first, then looking at code if interested.
New files in the light_curves directory:
Updated existing files: