Add notebook and code to run light_curve_generator at scale #215

troyraen · 2024-01-26T20:05:29Z

Closes #195, #137

Adds a new notebook, scale_up.md, and related code to run the light_curve_generator code at scale.

If reviewing, I recommend reading through the notebook first, then looking at code if interested.

New files in the light_curves directory:

scale_up.md
- notebook demonstrating how to launch large scale runs, monitor them automatically, and diagnose a problem (out of RAM)
code_src/helpers/scale_up.py
- python code to facilitate large scale runs
code_src/helpers/scale_up.sh
- bash script to execute and monitor large scale runs
code_src/helpers/top.py
- python code to parse top output into pandas dataframes and make figures
output/lightcurves-demo-SDSS-500k/logs/
- log files generated previously by running the bash script to get light curves for 500,000 SDSS objects -- used to demonstrate large scale runs without having to actually execute one on the fly
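
As a rough illustration of what parsing top output involves, here is a minimal sketch. It is not the code in code_src/helpers/top.py: the function name `parse_top_snapshot`, the sample text, and the assumed column layout (Linux `top` run in batch mode) are all assumptions made for this example.

```python
# Sample text in the layout that Linux `top -b -n 1` typically prints (assumed, not taken from the PR logs).
SAMPLE_TOP_OUTPUT = """\
top - 10:00:00 up 1 day,  2:03,  1 user,  load average: 1.00, 0.50, 0.25
Tasks: 120 total,   2 running, 118 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  1.0 sy,  0.0 ni, 86.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64000.0 total,  32000.0 free,  30000.0 used,   2000.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  33000.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4321 jovyan    20   0 9876543 2097152  10240 R  99.0  12.5   1:23.45 python
   4322 jovyan    20   0 1234567  524288   8192 S   3.0   3.1   0:02.10 python
"""


def parse_top_snapshot(text):
    """Return one dict per process row from a single top snapshot (hypothetical helper)."""
    lines = text.splitlines()
    # Locate the column header; the process rows follow it.
    header_index = next(i for i, line in enumerate(lines) if line.split()[:1] == ["PID"])
    columns = lines[header_index].split()
    rows = []
    for line in lines[header_index + 1:]:
        if not line.strip():
            break  # a blank line ends the process table
        # COMMAND may contain spaces, so split at most len(columns) - 1 times.
        fields = line.split(None, len(columns) - 1)
        row = dict(zip(columns, fields))
        # Convert the columns used for monitoring to numbers; the rest stay as strings.
        row["PID"] = int(row["PID"])
        row["%CPU"] = float(row["%CPU"])
        row["%MEM"] = float(row["%MEM"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    for process in parse_top_snapshot(SAMPLE_TOP_OUTPUT):
        print(process["PID"], process["%CPU"], process["%MEM"], process["COMMAND"])
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` if a dataframe is wanted for plotting memory use over time.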

Updated existing files:

README.md
- add text from @jkrick describing the notebooks in this directory
light_curve_generator.md
- update the parallel section, reference the scale_up.md notebook
code_src/ztf_functions.py
- make the workers print their PIDs
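
One common way to have pool workers announce their PIDs is a pool initializer. The sketch below assumes that approach and uses hypothetical names (`init_worker`, `square`); it is not the `_init_worker` helper added in this PR, whose contents are not shown here.

```python
import multiprocessing
import os


def init_worker():
    """Runs once in each worker process when the pool starts it."""
    print(f"[pid={os.getpid()}] worker started", flush=True)


def square(x):
    """Stand-in for a real task; returns the PID of the process that ran it and the result."""
    return os.getpid(), x * x


def main():
    # Each of the two workers calls init_worker() once before taking any tasks.
    with multiprocessing.Pool(processes=2, initializer=init_worker) as pool:
        results = pool.map(square, range(6))
    for pid, value in results:
        print(f"pid {pid} returned {value}")


if __name__ == "__main__":
    main()
```

Knowing the worker PIDs makes it possible to match rows in top output to specific workers when tracking down memory problems.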

jkrick

I've made a first pass at the text and tried running the notebook on irsakusp. There is so much information here that my brain is still trying to wrap itself around the process, but the code appears to be really helpful and well thought out! My first wave of comments is below. Maybe I will wait for you to address the fourth comment before I do further review.

General comments

1. After reading through it all and thinking about the overall structure of the tutorials, I would still like some parallelization to be in light_curve_generator.md, for two reasons: 1) the intermediate-sized use cases, and 2) to make sure people see multiprocessing in as many notebooks as reasonable (in case they never click through to scale_up.md). Can we easily add back in what was in the main branch prior to this PR? I like that code cell better than the example in scale_up because it doesn't use the helper and scripts, so people don't need to understand that additional complication.
2. When that gets added back in, I think we should edit the text in the cell "4. Parallel processing the archives" to include something like: "The cells below show how to increase the speed of the multi-archive search using Python's built-in parallel processing, taking advantage of the multiple CPUs available on Fornax. This will speed up light curve searches for samples ranging from a few tens to a few thousand targets. For larger samples, please see the related notebook scale_up.md in the same folder as this one. Running the multiprocessing calls below on very large samples (hundreds of thousands) will not work because of the way the platform is set up to cull users who appear to be inactive."
3. Suggested text for the light_curves folder README.md is below. Now that we have so many notebooks in this folder, it makes sense to describe them.
4. It would be great if the notebook could 'run' to completion. Right now it doesn't for me, likely for two reasons: 1) The cells which have $bash commands give errors. Can we maybe make these cells 'raw' type? That is just a suggestion for how to leave them in the notebook but still have the notebook run. 2) In the markdown cells you have sections with ```{code-cell} that aren't getting executed for me on irsakusp, e.g., the imports in the intro cell. I am assuming that the notebook does actually run some things, i.e., reading the output of the logs and plotting it.
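
For readers unfamiliar with the kind of multiprocessing cell discussed in the first two comments, a minimal sketch follows. It assumes one worker process per archive call via `multiprocessing.Pool.apply_async`; the archive functions and `collect_light_curves` are hypothetical stand-ins, not the functions in light_curve_generator.md.

```python
import multiprocessing


def query_archive_a(targets):
    """Hypothetical stand-in for one archive's light curve query."""
    return [("archive_a", target) for target in targets]


def query_archive_b(targets):
    """Hypothetical stand-in for a second archive's light curve query."""
    return [("archive_b", target) for target in targets]


def collect_light_curves(targets, archive_functions, processes=2):
    """Run each archive query in a worker process and merge the results."""
    with multiprocessing.Pool(processes=processes) as pool:
        # Submit every archive call up front so they run concurrently.
        pending = [pool.apply_async(func, (targets,)) for func in archive_functions]
        merged = []
        for job in pending:
            # get() blocks until that call finishes and re-raises any worker error.
            merged.extend(job.get())
    return merged


if __name__ == "__main__":
    rows = collect_light_curves(["obj1", "obj2"], [query_archive_a, query_archive_b])
    print(rows)
```

Results are collected inside the `with` block because leaving it terminates the pool.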

Time Domain
In this set of Use Case Scenarios, we work toward creating multiband light curves from multiple archival and publication resources at scale, then classifying and analysing them with machine learning tools. Tutorials included in this folder are:

- light_curve_generator.md: This notebook automatically retrieves target positions from the literature and then queries archival sources for light curves of those targets. It is intended to be run on a small number of sources (up to a few hundred).
- scale_up.md: This notebook uses the same functions as light_curve_generator (above) but is able to generate light curves for large numbers of sources (~1000 to millions?).
- light_curve_classifier.md: This notebook takes output from light_curve_generator and trains an ML classifier to differentiate among the samples based on their light curves.
- ML_AGNzoo.md: This notebook takes output from light_curve_generator (above) and visualizes and compares different labelled samples on a reduced-dimension grid.

troyraen · 2024-03-14T22:45:32Z

@jkrick Thanks for all of your feedback. I haven't gone through all of it yet, but noticed your comments about the formatting issues in particular. The notebook ran to completion for me, though I manually converted it to a .ipynb first. The imports cell should obviously be a code cell (and it appears to be in the raw markdown, but not the rendered version). For the cells with bash commands, I chose the markdown format because it provides language-aware syntax highlighting and so is easier to read, though I'm still trying to figure out if there's a better way to include it in a notebook. But regardless, those cells should definitely not be executing. Seems like the cell definitions got mangled. I'll work on it.

bsipocz · 2024-03-14T23:10:44Z

> For the cells with bash commands, I chose the markdown format because it provides language-aware syntax highlighting and so is easier to read, though I'm still trying to figure out if there's a better way to include it in a notebook. But regardless, those cells should definitely not be executing. Seems like the cell definitions got mangled. I'll work on it.

It should be possible to choose a shell type for a code block, not just Python. But if they are not supposed to be executed as part of the notebook, then choosing markdown with syntax highlighting is certainly the right way to go.

troyraen · 2024-03-20T17:14:33Z

This is ready to review again. There are two potential issues:

- TESS and HCV return no data for the Yang sample. The helper still writes parquet files for these, but they are empty. This sometimes causes the next cell (which reads the parquet files into a dataframe) to throw an error. I'm not sure why it works sometimes but not others. Right now I think the best option is probably to avoid writing the parquet file if there is no data.
- The Gaia call throws an error when using astroquery v0.4.7 (the newest version). See Bug: Astroquery version for Gaia #207.
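
The option proposed in the first item, skipping the write when there is no data, could look something like the sketch below. The function name `write_parquet_if_nonempty` is hypothetical and this is not the helper's actual code; `df` is assumed to be a pandas DataFrame, for which `len(df)` gives the number of rows.

```python
from pathlib import Path


def write_parquet_if_nonempty(df, path):
    """Write df to a parquet file only if it has rows. Return True if a file was written."""
    if len(df) == 0:
        # Nothing to save, so no empty parquet file is left behind for later cells to trip on.
        return False
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)  # pandas needs pyarrow or fastparquet installed for this call
    return True
```

A caller can use the return value to log which archives produced no data, and the reading step then only needs to handle files that exist.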

@jkrick I think I've addressed all of your comments.

... Can we easily add back in what was in the main branch prior to this PR.

Done

... edit the text in the cell 4. Parallel processing

Done. I also added the same keyword arguments to this section as given in the serial section for all archive calls to close the loop on #230 (comment).

suggested text for light_curves folder readme.md below.

Added.

It would be great if the notebook could 'run' to completion.

I fixed all the formatting issues, at least as far as I can tell. Current version runs to completion for me on irsakusp (.md file) and smce (after manually converting .md -> .ipynb). One note: The bash syntax highlighting in the markdown cells renders nicely when using Theme: JupyterLab Light, but has some shadow-like effect that makes it difficult to read using JupyterLab Dark. This is unfortunate, but I don't know of a better solution at the moment.

jkrick

One requirements error and a bunch of comments. Sorry if I got a little carried away on the editing.

jkrick · 2024-03-25T16:33:24Z

light_curves/scale_up.md

+Notebook sections are:
+
+- "Overview" describes functionality of the included bash script and python helper functions. Compares some parallel processing options.
+- "Example 1" shows how to launch a large-scale run using the bash script, monitor its progress automatically, and diagnose a problem (out of RAM).


It would be nice if these "Example 1", etc. could be links to the sections. I had trouble making this work in the documentation .md, but it should be possible.

I agree and have tried but am also unable to get it to work. I tried the only option from the myst-parser docs that seemed like it should work in all cases (Annotating a syntax block with (target)=). External links to files or URLs seem to work, but not links to a section within the notebook. I'm guessing this doesn't work because the markdown in different notebook cells is isolated from the others in some way.

Interested to hear if @bsipocz knows of a solution.

Co-authored-by: jkrick <jkrick@caltech.edu>

Add notebook and code to run light_curve_generator at scale 388f106

troyraen force-pushed the issues/195 branch 2 times, most recently from d87185b to 20ecea7 Compare January 26, 2024 21:03

troyraen force-pushed the raen/fix/leaked-semaphore branch 2 times, most recently from 736922b to 10713db Compare January 26, 2024 23:25

troyraen force-pushed the issues/195 branch from 764c331 to 414fb45 Compare January 27, 2024 01:16

Base automatically changed from raen/fix/leaked-semaphore to main January 29, 2024 18:55

troyraen force-pushed the issues/195 branch from 012471e to 77058ba Compare January 29, 2024 19:29

troyraen mentioned this pull request Jan 31, 2024

Document amount of CPU and memory needed to run full notebook at scale #137

Closed

This was referenced Feb 17, 2024

MAINT: cleanup shoobyFeb after premature merge of #225 #229

Merged

Standardize function names and signatures #230

Merged

Enhance sample_selection.py #231

Merged

troyraen force-pushed the issues/195 branch from 65d50ce to 4593d9c Compare February 20, 2024 00:11

troyraen changed the base branch from main to raen/enhance/sample_selection February 20, 2024 00:15

troyraen force-pushed the issues/195 branch from 4593d9c to 8b40907 Compare February 20, 2024 02:03

troyraen force-pushed the raen/enhance/sample_selection branch from 945ba5d to d3b2f3f Compare February 20, 2024 02:05

troyraen force-pushed the issues/195 branch 2 times, most recently from ef72431 to 13bd6ea Compare February 20, 2024 22:34

troyraen force-pushed the raen/enhance/sample_selection branch from d011cc4 to ccc1285 Compare February 25, 2024 00:06

Base automatically changed from raen/enhance/sample_selection to main February 25, 2024 00:07

troyraen force-pushed the issues/195 branch from 7b6373f to 96c1f27 Compare February 25, 2024 00:23

troyraen mentioned this pull request Feb 27, 2024

Reorganize notebook sections #233

Closed

troyraen force-pushed the issues/195 branch from aa6e698 to 7eef80b Compare February 27, 2024 23:22

troyraen changed the base branch from main to raen/cleanup/section-organization February 27, 2024 23:22

troyraen force-pushed the issues/195 branch from a1b4b9c to 5ac1f08 Compare February 29, 2024 18:03

troyraen changed the base branch from raen/cleanup/section-organization to main February 29, 2024 18:03

troyraen force-pushed the issues/195 branch 2 times, most recently from 9c3042f to 4518d51 Compare March 12, 2024 17:43

bsipocz reviewed Mar 12, 2024

light_curves/output/lightcurves-demo-SDSS-500k/logs/gaia.log (outdated, resolved)

troyraen changed the title ~~Add shell script to to run light curve functions at scale~~ Add notebook and code to run light_curve_generator at scale Mar 12, 2024

troyraen force-pushed the issues/195 branch from c0645fb to efa5022 Compare March 13, 2024 07:52

troyraen marked this pull request as ready for review March 13, 2024 07:56

troyraen added the use case: light curves label Mar 13, 2024

troyraen self-assigned this Mar 13, 2024

troyraen requested review from jkrick, xoubish and bsipocz March 13, 2024 17:02

jkrick requested changes Mar 14, 2024

troyraen requested a review from jkrick March 25, 2024 08:12

jkrick approved these changes Mar 26, 2024

troyraen mentioned this pull request Mar 30, 2024

ERROR: pip reports boto3/botocore version conflict #262

Closed

troyraen and others added 10 commits March 31, 2024 23:27

add helpers/scale_up.py

8792791

use helpers.scale_up._init_worker in ztf_functions

3ae34e3

add helpers/scale_up.sh

c9922cd

add helpers/top.py

41fae25

add output/lightcurves-demo-SDSS-500k/logs

0e224bb

add scale_up.md

a2aca01

fix cell types and formatting

be1dce5

Apply suggestions from code review

5b395d5

Co-authored-by: jkrick <jkrick@caltech.edu>

add pointers to scale_up.md

db1ea42

remove obsolete file sample_lc.py

6d3ff86

troyraen force-pushed the issues/195 branch from ea274f5 to 6d3ff86 Compare April 1, 2024 06:36

note cpu and ram needed to run the notebook

fca778a

troyraen merged commit 388f106 into main Apr 1, 2024
3 checks passed

troyraen deleted the issues/195 branch April 1, 2024 06:58

github-actions bot pushed a commit that referenced this pull request Apr 1, 2024

Merge pull request #215 from /issues/195

80f3bf0
