Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add notebook and code to run light_curve_generator at scale #215

Merged
merged 11 commits into from
Apr 1, 2024

Conversation

troyraen
Copy link
Contributor

@troyraen troyraen commented Jan 26, 2024

Closes #195, #137

Adds a new notebook, scale_up.md, and related code to run the light_curve_generator code at scale.

If reviewing, I recommend reading through the notebook first, then looking at code if interested.

New files in the light_curves directory:

  • scale_up.md
    • notebook demonstrating how to launch large scale runs, monitor them automatically, and diagnose a problem (out of RAM)
  • code_src/helpers/scale_up.py
    • python code to facilitate large scale runs
  • code_src/helpers/scale_up.sh
    • bash script to execute and monitor large scale runs
  • code_src/helpers/top.py
    • python code to parse top output into pandas dataframes and make figures
  • output/lightcurves-demo-SDSS-500k/logs/
    • log files generated previously by running the bash script to get light curves for 500,000 SDSS objects -- used to demonstrate large scale runs without having to actually execute one on the fly

Updated existing files:

  • README.md
    • add text from @jkrick describing the notebooks in this directory
  • light_curve_generator.md
    • update the parallel section, reference the scale_up.md notebook
  • code_src/ztf_functions.py
    • make the workers print their PIDs

@troyraen troyraen force-pushed the issues/195 branch 2 times, most recently from d87185b to 20ecea7 Compare January 26, 2024 21:03
@troyraen troyraen force-pushed the raen/fix/leaked-semaphore branch 2 times, most recently from 736922b to 10713db Compare January 26, 2024 23:25
Base automatically changed from raen/fix/leaked-semaphore to main January 29, 2024 18:55
@troyraen troyraen changed the base branch from main to raen/enhance/sample_selection February 20, 2024 00:15
@troyraen troyraen force-pushed the raen/enhance/sample_selection branch from 945ba5d to d3b2f3f Compare February 20, 2024 02:05
@troyraen troyraen force-pushed the issues/195 branch 2 times, most recently from ef72431 to 13bd6ea Compare February 20, 2024 22:34
@troyraen troyraen force-pushed the raen/enhance/sample_selection branch from d011cc4 to ccc1285 Compare February 25, 2024 00:06
Base automatically changed from raen/enhance/sample_selection to main February 25, 2024 00:07
@troyraen troyraen changed the base branch from main to raen/cleanup/section-organization February 27, 2024 23:22
@troyraen troyraen changed the base branch from raen/cleanup/section-organization to main February 29, 2024 18:03
@troyraen troyraen force-pushed the issues/195 branch 2 times, most recently from 9c3042f to 4518d51 Compare March 12, 2024 17:43
@troyraen troyraen changed the title Add shell script to to run light curve functions at scale Add notebook and code to run light_curve_generator at scale Mar 12, 2024
@troyraen troyraen marked this pull request as ready for review March 13, 2024 07:56
@troyraen troyraen self-assigned this Mar 13, 2024
Copy link
Contributor

@jkrick jkrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a first pass at the text and trying to run the notebook on irsakusp. There is so much information here, my brain is still trying to wrap itself around the process, but to that end the code appears to be really helpful and well thought out! My first wave of comments are below. Maybe I will wait for you to address the fourth comment before I do further review.

General comments

  1. After reading through it all and thinking about the overall structure of the tutorials, I would still like some parallelization to be in the light_curve_generator.md . Two reasons, 1) the intermediate sized use cases, and 2) to make sure people see multiprocessing in as many notebooks as reasonable (in case they never click through to the scale_up.md). Can we easily add back in what was in the main branch prior to this PR. I like that code cell rather than the example in the scale_up because it doesn't use the helper and scripts so people don't necessarily need to understand that additional complication.
  2. When that gets added back in, I think we should edit the text in the cell 4. Parallel processing the archives to include something like " The below cells show how to increase the speed of the multi-archive search using python's built in parallel processing by taking advantage of the multiple available CPUs on Fornax. The following will speed up searches of light curves for numbers of targets ranging from a few tens to a few thousand. For larger sample sizes please see the tutorial in the related notebook scale_up.md in the same folder as this one. Running the below multiprocessing calls on very large samples (hundreds of thousands) will not work because of the way the platform is setup to cull users which appear to be inactive. "
  3. suggested text for light_curves folder readme.md below. Now that we have so many notebooks in this folder it makes sense to describe them.
  4. It would be great if the notebook could 'run' to completion. Right now it doesn't for me, for likely two reasons. 1). the cells which have $bash commands give errors. Can we maybe make these cells 'raw' type. Just a suggestion for how to leave these in the notebook, but have the notebook still run. 2) in the markdown cells you have sections with ```{code-cell} that aren't getting executed for me on irsakusp. eg., the imports in the intro cell. I guess I am assuming that the notebook does actually run some things, ie., reading the output of the logs and plotting some things.

Time Domain
In this set of Use Case Scenario we work towards creating multiband light curves from multiple archival and publication resources at scale and classifying and analysing them with machine learning tools. Tutorials included in this folder are:

  1. light_curve_generator.md This notebook automatically retrieves target positions from the literature and then queries archival sources for light curves of those targets. This notebook is intended to be run on a small number of sources (<~ few hundred)
  2. scale_up.md This notebook uses the same functions as light_curve_generator(above) but is able to generate light curves for large number of sources (~1000 -> millions?)
  3. light_curve_classifier.md This notebook takes output from light_curve generator and trains a ML classifier to be able to differentiate amongst the samples based on their light curves.
  4. ML_AGNzoo.md This notebook takes output from the light_curve_generator(above) and visualizes/compares different labelled samples on a reduced dimension grid.

light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
@troyraen
Copy link
Contributor Author

@jkrick Thanks for all of your feedback. I haven't gone through all of it yet, but noticed your comments about the formatting issues in particular. The notebook ran to completion for me, though I manually converted it to a .ipynb first. The imports cell should obviously be a code cell (and it appears to be in the raw markdown, but not the rendered version). For the cells with bash commands, I chose the markdown format because it provides language-aware syntax highlighting and so is easier to read, though I'm still trying to figure out if there's a better way to include it in a notebook. But regardless, those cells should definitely not be executing. Seems like the cell definitions got mangled. I'll work on it.

@bsipocz
Copy link
Member

bsipocz commented Mar 14, 2024

For the cells with bash commands, I chose the markdown format because it provides language-aware syntax highlighting and so is easier to read, though I'm still trying to figure out if there's a better way to include it in a notebook. But regardless, those cells should definitely not be executing. Seems like the cell definitions got mangled. I'll work on it.

It should be possible to choose a shell type for a code block, not just python. But if they are not supposed to be executing as part of the notebook, then choosing the markdown and syntax highlight is certainly the right way to go.

@troyraen
Copy link
Contributor Author

This is ready to review again. There are two potential issues:

  1. TESS and HCV return no data for the Yang sample. The helper still writes parquet files for these, they are just empty. This sometimes causes the next cell (read the parquet files to a dataframe) to throw an error. I'm not sure why it works sometimes but not other times. Right now I think the best option is probably to avoid writing the parquet file if there is no data.
  2. The Gaia call throws an error when using astroquery v0.4.7 (newest version). See Bug: Astroquery version for Gaia #207.

@jkrick I think I've addressed all of your comments.

  1. ... Can we easily add back in what was in the main branch prior to this PR.

Done

  1. ... edit the text in the cell 4. Parallel processing

Done. I also added the same keyword arguments to this section as given in the serial section for all archive calls to close the loop on #230 (comment).

  1. suggested text for light_curves folder readme.md below.

Added.

  1. It would be great if the notebook could 'run' to completion.

I fixed all the formatting issues, at least as far as I can tell. Current version runs to completion for me on irsakusp (.md file) and smce (after manually converting .md -> .ipynb). One note: The bash syntax highlighting in the markdown cells renders nicely when using Theme: JupyterLab Light, but has some shadow-like effect that makes it difficult to read using JupyterLab Dark. This is unfortunate, but I don't know of a better solution at the moment.

@troyraen troyraen requested a review from jkrick March 25, 2024 08:12
Copy link
Contributor

@jkrick jkrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One requirements error and a bunch of comments. Sorry if I got a little carried away on the editing.

light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
Notebook sections are:

- "Overview" describes functionality of the included bash script and python helper functions. Compares some parallel processing options.
- "Example 1" shows how to launch a large-scale run using the bash script, monitor its progress automatically, and diagnose a problem (out of RAM).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if these "Example 1", etc. could be links to the sections. I had trouble making this work in the documentation .md, but it should be possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and have tried but am also unable to get it to work. I tried the only option from the myst-parser docs that seemed like it should work in all cases (Annotating a syntax block with (target)=). External links to files or URLs seem to work, but not links to a section within the notebook. I'm guessing this doesn't work because the markdown in different notebook cells is isolated from the others in some way.

Interested to hear if @bsipocz knows of a solution.

light_curves/scale_up.md Show resolved Hide resolved
light_curves/scale_up.md Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
light_curves/scale_up.md Outdated Show resolved Hide resolved
@troyraen troyraen merged commit 388f106 into main Apr 1, 2024
3 checks passed
@troyraen troyraen deleted the issues/195 branch April 1, 2024 06:58
github-actions bot pushed a commit that referenced this pull request Apr 1, 2024
Add notebook and code to run light_curve_generator at scale 388f106
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make a script to run light curve functions at scale
3 participants