
ARD workflow for time series analysis of ACCESS-OM2-01 daily output #462

Open · sb4233 opened this issue Sep 30, 2024 · 7 comments
Labels: ARD workflows for analysis-ready data

@sb4233 commented Sep 30, 2024

Hi,
I have been trying to do some spectral analysis using variables from ACCESS-OM2 output. Because the data is large and chunked, any kind of analysis is very slow. For example, I am calculating the coherence between two variables (using scipy.signal.coherence) at every grid point of a specific domain (356x500). The actual calculation takes only about 3-4 minutes once the data is in memory (non-chunked), but on the chunked data it takes forever, since the data has to be loaded into memory first.

As a cheap alternative, I found that saving the data to disk as early as possible in my calculation helps (for example, saving just after selecting the variable over the region of interest), i.e., reducing the number of operations performed while the data is still chunked. But even then it takes several hours per variable to write the netCDF file.
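To illustrate, this is roughly what I mean (a sketch only; the variable names and slice bounds here are placeholders):

```python
# Subset the region of interest and write it out immediately, before doing
# anything else while the data is still chunked. Names are placeholders.
u_sub = ds['u'].sel(yu_ocean=slice(-60, -30), xu_ocean=slice(-230, -180))
u_sub.to_netcdf('u_subset.nc')  # even this write takes hours per variable
```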

I wanted to know if there is a better way to chunk large datasets effectively so that processing time can be reduced as much as possible.

Maybe a method could be added to the cosima cookbook that dynamically chunks large datasets based on the operation being performed on them? I am new to this kind of programming, so any help would be much appreciated :)

@Thomas-Moore-Creative Thomas-Moore-Creative self-assigned this Sep 30, 2024
@Thomas-Moore-Creative Thomas-Moore-Creative added the ARD workflows for analysis-ready data label Sep 30, 2024
@Thomas-Moore-Creative (Collaborator)

thanks @sb4233 - do you mind if we edit the issue title so that it's more focused and descriptive?

Also: can you provide details on the specific ACCESS-OM2 data variables you are trying to calculate against? Are they 2D? 3D? What frequency?

THANKS

@navidcy (Collaborator) commented Oct 1, 2024

Btw, @sb4233, note that the cosima_cookbook Python package is deprecated, so there won't be any method added to it.

I think the issue is that the data is chunked in time based on how the files are saved as netCDF (e.g., one file every 3 months for the 0.1-degree model output). So if you need to do a time-series analysis at every point, you need to rechunk in time. I've bumped into this before and didn't find a better solution, but perhaps I was just naïve!
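Something like this, roughly (a sketch only; the dimension names and chunk sizes are assumptions):

```python
# Rechunk so that each grid point's full time series sits in a single chunk:
# time=-1 collapses the time axis into one chunk per spatial tile.
ds = ds.chunk({'time': -1, 'yt_ocean': 50, 'xt_ocean': 50})
```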

Btw, you might want to have a look at the xrft package? Sorry if I've misunderstood and this is not useful.
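For example, something like the below gives you a dask-aware spectrum along time (a minimal sketch, not coherence itself; the variable name is an assumption and it relies on the rechunk above):

```python
import xrft

# Power spectrum along 'time', computed lazily with dask;
# xrft requires the FFT dimension to sit in a single chunk.
ps = xrft.power_spectrum(sst, dim='time')
```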

@marc-white (Contributor)

@sb4233 would you be able to add some code snippets so we can see what you're trying to do?

@sb4233 (Author) commented Oct 1, 2024

> thanks @sb4233 - do you mind if we edit the issue title so that it's more focused and descriptive?
>
> Also: can you provide details on the specific ACCESS-OM2 data variables you are trying to calculate against? Are they 2D? 3D? What frequency?
>
> THANKS

Yeah sure, please go ahead and edit the title.
As for the details of my use case:

  • I am using u, v, and SST from ACCESS-OM2-01, which is at daily frequency.
  • My data arrays are 3D, i.e., (time, lat, lon); u and v are taken at a particular depth level.
  • The dimensions of my data are (time: 18250, lat: 356, lon: 500).
  • My aim is to calculate the magnitude-squared coherence between each component of the current and SST.
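For scale, that is a lot of data per variable (a quick back-of-the-envelope estimate, assuming float32 values):

```python
# (time, lat, lon) = (18250, 356, 500) at 4 bytes per float32 value
18250 * 356 * 500 * 4 / 1e9  # ≈ 13 GB per variable, before any rechunking
```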

> Btw, @sb4233, note that the cosima_cookbook Python package is deprecated, so there won't be any method added to it.
>
> I think the issue is that the data is chunked in time based on how the files are saved as netCDF (e.g., one file every 3 months for the 0.1-degree model output). So if you need to do a time-series analysis at every point, you need to rechunk in time. I've bumped into this before and didn't find a better solution, but perhaps I was just naïve!
>
> Btw, you might want to have a look at the xrft package? Sorry if I've misunderstood and this is not useful.

Thanks for the suggestion; it seems like xrft could be useful, as it utilises the dask API.

> @sb4233 would you be able to add some code snippets so we can see what you're trying to do?

Nothing special; essentially I'm just using the function below, which wraps scipy.signal.coherence, to calculate the magnitude-squared coherence between two data arrays at every grid point (i, j), and I use joblib.Parallel to loop over i, j in parallel:

from scipy.signal import coherence, get_window

def compute_coherence_slice(data_slice1, data_slice2, i, j, fs, nperseg, window, noverlap, nfft):
    # Welch-based magnitude-squared coherence for one grid point's time series
    f, Cxy = coherence(data_slice1, data_slice2, fs=fs, nperseg=nperseg,
                       window=get_window(window, nperseg), noverlap=noverlap, nfft=nfft)
    return Cxy, f
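And the calling loop is roughly this (a sketch; u, sst, and the Welch parameters are placeholders):

```python
from joblib import Parallel, delayed

# u and sst are (time, lat, lon) numpy arrays already loaded into memory
ntime, nlat, nlon = u.shape
results = Parallel(n_jobs=-1)(
    delayed(compute_coherence_slice)(
        u[:, i, j], sst[:, i, j], i, j,
        fs=1.0, nperseg=365, window='hann', noverlap=182, nfft=None,
    )
    for i in range(nlat)
    for j in range(nlon)
)
```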

@Thomas-Moore-Creative Thomas-Moore-Creative changed the title Better and faster processing of large datasets using cosima cookbook ARD workflow for time series analysis of ACCESS-OM2-01 daily output Oct 1, 2024
@Thomas-Moore-Creative (Collaborator)

Hey @sb4233, hopefully that new title is representative of your use case (one shared by others).

Next steps might be to access the daily ACCESS-OM2-01 output via the intake catalog (with helpful xarray kwargs), then write temporary ARD Zarr collections for u, v, and SST to scratch/vn19? A rough sketch is below; I'll have a go at this in my spare time tonight or tomorrow, but you or others might get there first.
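Something along these lines (a rough sketch; it assumes the access_nri_intake catalogue is available, and the experiment name, variable name, and chunk sizes below are placeholder assumptions):

```python
import intake

# Open the ACCESS-NRI intake catalog and pick a daily 0.1-degree experiment
cat = intake.cat.access_nri
esm = cat['01deg_jra55v13_ryf9091']  # placeholder experiment name
ds = esm.search(variable='surface_temp', frequency='1day').to_dask()

# Rechunk for time-series analysis, then write a temporary ARD Zarr store
sst = ds['surface_temp'].chunk({'time': -1, 'yt_ocean': 50, 'xt_ocean': 50})
sst.to_dataset().to_zarr('/scratch/vn19/ard/ACCESS-OM2-01/sst_daily.zarr', mode='w')
```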

Looking forward to documenting better practice for these specific use cases with you and others.

@Thomas-Moore-Creative (Collaborator)

@sb4233 - a very useful ref from @dougiesquire et al.

And for storage of any temporary intermediate ARD collections on vn19, let's please use: /scratch/vn19/ard/ACCESS-OM2-01

@Thomas-Moore-Creative (Collaborator)

@sb4233 et al

Here's the kind of overall workflow I'm suggesting each of these specific heuristics could contribute to:
[Image: diagram of the suggested overall ARD workflow, from the OMO2024 poster]

You can see and download our full poster from OMO2024 here: https://go.csiro.au/FwLink/climate_ARD
