Detecting corrupted or incomplete downloads #81

Open
sigmafelix opened this issue May 15, 2024 · 5 comments
@sigmafelix
Collaborator

sigmafelix commented May 15, 2024

@mitchellmanware

When running the beethoven pipeline in 2022, I found that at least one GEOS-CF chemical file had been downloaded incompletely (the file causing the error was 2 MB, only about one-fortieth the size of a typical GEOS-CF chemical file). Post-download checking or detection of incomplete files would be helpful for users who want to download a large set of files from the internet.

I will replace the problematic file with a newly downloaded one. @kyle-messier, could you change the write permission of the input/geos directory in the team project folder?

Considerations

  • I suggest two approaches.
    • One is to use file hashes (e.g., SHA-256 or MD5 checksums), which some data sources provide. If this information is retrievable from JSON metadata or an HTTP response header, we could quickly verify the downloaded files against it.
    • The other is to leverage summary statistics of the downloaded files, assuming the network is reliable enough that most files were downloaded properly. The fs package includes many handy functions to summarize files in tibbles. We could then compare each file's size with the typical size, or with a statistic computed over all downloaded files, to flag files that are probably corrupted or incomplete.
      • A challenge remains with this approach when file sizes are so heterogeneous that size statistics are uninformative (e.g., MODIS tiles differ drastically in size depending on the number of effective data cells or NA/NaN values, unlike the full space-time grids of modeling products such as GEOS-CF and NARR).
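
A minimal sketch of the two approaches above, using only base R. The function names verify_md5 and flag_small_files and the min_ratio threshold are hypothetical and chosen for illustration; expected_md5 is assumed to be a checksum published by the data provider, which not every source offers.

```r
# Approach 1: compare a local file's MD5 digest with a published checksum.
# `expected_md5` is an assumption: it must come from the data provider.
verify_md5 <- function(path, expected_md5) {
  actual <- unname(tools::md5sum(path))  # NA if the file cannot be read
  !is.na(actual) & tolower(actual) == tolower(expected_md5)
}

# Approach 2: flag files much smaller than the median size of the batch.
# `min_ratio` is an arbitrary illustrative threshold, not a tested default.
flag_small_files <- function(paths, min_ratio = 0.5) {
  sizes <- file.size(paths)              # NA for missing files
  typical <- stats::median(sizes, na.rm = TRUE)
  data.frame(
    path = paths,
    size = sizes,
    suspect = is.na(sizes) | sizes < min_ratio * typical
  )
}
```

The size-based check only makes sense for products with near-constant file sizes (such as the full-grid modeling products mentioned above); for heterogeneous products it would likely produce many false positives.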
@mitchellmanware
Collaborator

mitchellmanware commented May 15, 2024

Thanks for bringing this up @sigmafelix. Creating a file size check function, following the first suggested approach, would be relatively simple with the httr::GET and file.size functions.

> head(u)
[1] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4"
[2] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0130z.nc4"
[3] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0230z.nc4"
[4] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0330z.nc4"
[5] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0430z.nc4"
[6] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0530z.nc4"
> download.file(
+   u[1],
+   "/Users/manwareme/Desktop/geos_example.nc4"
+ )
trying URL 'https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4'
Content type 'application/octet-stream' length 7418695 bytes (7.1 MB)
==================================================
downloaded 7.1 MB

> file.size("/Users/manwareme/Desktop/geos_example.nc4")
[1] 7418695
> httr::GET(u[1])$headers$`content-length`
[1] "7418695"
> (file.size("/Users/manwareme/Desktop/geos_example.nc4")
+   == as.numeric(httr::GET(u[1])$headers$`content-length`))
[1] TRUE

My immediate concern with this approach is its performance at scale. Retrieving the size with httr::GET is quick for a single URL, but it slows substantially even with a relatively small number of URLs (n = 24, equivalent to one day's worth of GEOS-CF data).

> microbenchmark(
+   httr::GET(u[1]),
+   lapply(u, httr::GET),
+   times = 5
+ )
Unit: milliseconds
                 expr       min       lq       mean    median        uq        max neval
      httr::GET(u[1])  119.9206  124.586   364.5255  209.3135  461.9255   906.8822     5
 lapply(u, httr::GET) 3672.5812 3858.853 10505.7787 4478.3872 5308.7727 35210.2997     5
> 3672.5812/119.9206 # min relative performance
[1] 30.62511
> 35210.2997/906.8822 # max relative performance
[1] 38.82566
> 10505.7787/364.5255 # mean relative performance
[1] 28.82042
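
One possible way to reduce this cost: httr::GET transfers the full response body, so each size lookup above probably re-downloads the file. A HEAD request asks the server for headers only. The sketch below assumes the server answers HEAD requests with a Content-Length header, which has not been verified for this data source; remote_size and check_sizes are hypothetical names.

```r
# Expected size in bytes as reported by the server. Uses a HEAD request so
# the file body is not transferred. Returns NA if the header is absent.
# Assumption: the server supports HEAD and reports Content-Length.
remote_size <- function(url) {
  response <- httr::HEAD(url)
  len <- httr::headers(response)[["content-length"]]
  if (is.null(len)) NA_real_ else as.numeric(len)
}

# Compare local file sizes with expected sizes (no network access needed).
check_sizes <- function(paths, expected) {
  actual <- file.size(paths)  # NA for missing files
  data.frame(
    path = paths,
    actual = actual,
    expected = expected,
    complete = !is.na(actual) & !is.na(expected) & actual == expected
  )
}
```

A call might look like check_sizes(local_paths, vapply(u, remote_size, numeric(1))), where local_paths is a hypothetical vector of downloaded files in the same order as u.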

@mitchellmanware
Collaborator

The httr2 functions may offer performance benefits, but I have only tested them with 24 files so far. I will do some more comparisons between the httr and httr2 functions.

> httr2_requester <- function(url) {
+   httr2::request(url) |> httr2::req_perform()
+ }
> microbenchmark(
+   lapply(u, httr2_requester),
+   lapply(u, httr::GET),
+   times = 5
+ )
Unit: seconds
                       expr      min       lq     mean   median       uq       max neval
 lapply(u, httr2_requester) 4.667334 4.713385 5.121353 4.919260  5.50979  5.796994     5
       lapply(u, httr::GET) 4.002419 5.042487 7.500990 6.052272 10.32270 12.085070     5
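
For comparison, a headers-only request can also be expressed with httr2. This is a sketch, not a benchmarked alternative: build_head_request and httr2_remote_size are hypothetical names, and the server is assumed to answer HEAD requests with a Content-Length header.

```r
# Build a HEAD request so that only headers are requested, not the file body.
build_head_request <- function(url) {
  httr2::req_method(httr2::request(url), "HEAD")
}

# Perform the request and read the reported size in bytes (NA if absent).
# Assumption: the server supports HEAD and reports Content-Length.
httr2_remote_size <- function(url) {
  resp <- httr2::req_perform(build_head_request(url))
  len <- httr2::resp_header(resp, "content-length")
  if (is.null(len)) NA_real_ else as.numeric(len)
}
```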

@sigmafelix
Collaborator Author

@mitchellmanware Thank you for sharing the possible solutions. Checking the exit status of wget at the shell script level could be another option (cf. https://stackoverflow.com/questions/2717303/check-wgets-return-value).
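
A rough sketch of the same idea from within R, assuming wget is installed and on the PATH. The helper names are hypothetical; the exit code meanings follow the GNU Wget manual. A zero exit status indicates that wget reported no problem, which is not by itself proof that the file is intact.

```r
# Meanings of wget exit codes, as documented in the GNU Wget manual.
wget_status_messages <- c(
  "0" = "no problems occurred",
  "1" = "generic error",
  "2" = "parse error",
  "3" = "file I/O error",
  "4" = "network failure",
  "5" = "SSL verification failure",
  "6" = "username/password authentication failure",
  "7" = "protocol error",
  "8" = "server issued an error response"
)

# Translate an exit status into a human-readable message.
describe_wget_status <- function(status) {
  msg <- wget_status_messages[as.character(status)]
  unname(ifelse(is.na(msg), "unknown status", msg))
}

# Run wget and report its exit status (assumes wget is available).
download_with_wget <- function(url, destfile) {
  status <- system2("wget", c("--quiet", "-O", shQuote(destfile), shQuote(url)))
  list(ok = status == 0L, status = status,
       message = describe_wget_status(status))
}
```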

@mitchellmanware
Collaborator

@sigmafelix

Was this addressed in the most recent PR? If not, I will include it in the next round of manuscript-related changes.

@sigmafelix
Collaborator Author

@mitchellmanware It has not been addressed yet. I think we could proceed with the manuscript without this functionality and add it in the next version of the package.

@mitchellmanware mitchellmanware self-assigned this Jul 17, 2024
@mitchellmanware mitchellmanware added the enhancement New feature or request label Jul 17, 2024