@florisvdh
Member

This new code builds upon existing code by @ToonHub to update watersurfaces_refpoints. Note that the latter data source is also being maintained in the n2khab-samplingframes repo.

This project is intended for maintenance of the data source at Zenodo, so it aims to store only the code in a repo, not the data. The data are written as vc-data ({git2rdata}) so that diffs stay small.
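As a minimal sketch of what writing vc-data looks like, assuming the {git2rdata} package is installed: the toy values and the temporary root directory below are hypothetical; only the file stem watersurfaces_refpoints and the column names are taken from this PR.

```r
library(git2rdata)

# Hypothetical toy table; the real data source has many more rows.
refpoints <- data.frame(
  polygon_id = c("B_0002", "A_0001"),
  x = c(160000, 150000),
  y = c(210000, 200000),
  in_polygon = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Temporary directory standing in for the versioned data folder (assumption).
root <- tempfile("refpoints_demo")
dir.create(root)

# write_vc() stores a plain-text .tsv file plus a .yml metadata file.
# A fixed sorting gives a stable row order, which helps to keep diffs small.
write_vc(
  refpoints,
  file = "watersurfaces_refpoints",
  root = root,
  sorting = "polygon_id"
)

list.files(root)

# read_vc() restores the data frame using the metadata in the .yml file.
refpoints_read <- read_vc("watersurfaces_refpoints", root = root)
```

Because the rows are written in a fixed order, appending or changing a few points should only touch the corresponding lines of the .tsv file.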

Here, version watersurfaces_refpoints_v6 is created from data source versions watersurfaces_refpoints_v5 and watersurfaces_hab_v6.

To make this a data source of more generic use, the GRTS address from watersurfaces_refpoints_v5 has been dropped, since it can always be regenerated. Also, in_object has been renamed to in_polygon. It has been kept because it refers to polygons that may be scattered over different versions of watersurfaces_hab; it shows the spatial relation with the polygon at creation time.

Care has now been taken to be able to adopt existing points if these are overlapped by 'new' polygons (i.e. with a new polygon_id), provided that those points have in_polygon = TRUE.

The x and y coordinates assume CRS EPSG:31370, which we will need to include in Zenodo documentation.
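Since the table only stores plain x and y columns, a reader has to supply the CRS explicitly. A minimal sketch with {sf}, using hypothetical coordinates and IDs (this is not the planned {n2khab} reading function):

```r
library(sf)

# Hypothetical coordinates in Belgian Lambert 72 (EPSG:31370), in metres.
refpoints <- data.frame(
  polygon_id = c("A_0001", "B_0002"),
  x = c(150000, 160000),
  y = c(200000, 210000),
  in_polygon = c(TRUE, FALSE)
)

# Turn the x/y columns into point geometries and attach the assumed CRS.
refpoints_sf <- st_as_sf(refpoints, coords = c("x", "y"), crs = 31370)

# Inspect the EPSG code that is now attached to the object.
st_crs(refpoints_sf)$epsg
```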

A reading function in {n2khab} is a next step.

The checksums of the respective versions since version watersurfaces_refpoints_v4 have been stored and can be used for verification. Versions are intended to be stored at Zenodo starting with watersurfaces_refpoints_v4.

Compiled HTML: update_watersurfaces_refpoints.html.zip

To reproduce, this project requires an n2khab_data setup that supports concurrent versions, by implementing approach [3] of inbo/n2khab#113, so you have e.g.:

$ tree 20_processed/_versions/watersurfaces_*
20_processed/_versions/watersurfaces_hab
├── watersurfaces_hab_2018
│   └── watersurfaces_hab.gpkg
├── watersurfaces_hab_v4
│   └── watersurfaces_hab.gpkg
├── watersurfaces_hab_v5
│   └── watersurfaces_hab.gpkg
├── watersurfaces_hab_v6
│   └── watersurfaces_hab.gpkg
└── watersurfaces_hab_v6.1_interim
    └── watersurfaces_hab.gpkg
20_processed/_versions/watersurfaces_refpoints
├── watersurfaces_refpoints_v4
│   ├── watersurfaces_refpoints.tsv
│   └── watersurfaces_refpoints.yml
├── watersurfaces_refpoints_v5
│   ├── watersurfaces_refpoints.tsv
│   └── watersurfaces_refpoints.yml
├── watersurfaces_refpoints_v6
│   ├── watersurfaces_refpoints.csv
│   ├── watersurfaces_refpoints.tsv
│   └── watersurfaces_refpoints.yml
└── watersurfaces_refpoints_v6.1_interim
    ├── watersurfaces_refpoints.tsv
    └── watersurfaces_refpoints.yml

11 directories, 14 files

Apart from non-matching polygon_id, we also require a spatial non-match
with existing reference points before deciding to define a new point.

This more elegantly caters for the various cases of ID matching of habitatmap
polygons, either through polygon_id or polygon_id_habitatmap, between
watersurfaces_hab versions, by more directly testing spatial relationships
through existing points.

As a consequence, some of the checks can be dropped, and the whole process is
both simplified and made more complete with respect to adopting existing points.

Note that recycled old points for polygons with a new ID still lead to adding
these as new rows in watersurfaces_refpoints.
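
The decision rule described above can be sketched with {sf} as follows. This is only an illustration under assumptions, not the code of this PR: the toy geometries, the IDs, the helper square() and the action column are all hypothetical.

```r
library(sf)

# Hypothetical helper to build a square toy polygon.
square <- function(xmin, ymin, size = 10) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin), c(xmin + size, ymin + size),
    c(xmin, ymin + size), c(xmin, ymin)
  )))
}

# Polygons of the new watersurfaces_hab version (toy data).
polygons_new <- st_sf(
  polygon_id = c("P1", "P2_new", "P3_new"),
  geometry = st_sfc(square(0, 0), square(20, 0), square(40, 0), crs = 31370)
)

# Existing reference points (toy data): one for P1 and one for a former
# polygon "P2_old" whose point now lies inside P2_new.
refpoints_old <- st_as_sf(
  data.frame(
    polygon_id = c("P1", "P2_old"),
    x = c(5, 25),
    y = c(5, 5),
    in_polygon = c(TRUE, TRUE)
  ),
  coords = c("x", "y"), crs = 31370
)

# Only points with in_polygon = TRUE are candidates for adoption.
candidates <- refpoints_old[refpoints_old$in_polygon, ]

# ID match and spatial match are evaluated separately per new polygon.
id_match <- polygons_new$polygon_id %in% refpoints_old$polygon_id
spatial_match <- lengths(st_intersects(polygons_new, candidates)) > 0

# A new point is only defined when both the ID and the spatial match fail.
polygons_new$action <- ifelse(
  id_match, "keep existing row",
  ifelse(spatial_match, "adopt existing point as new row", "define new point")
)

st_drop_geometry(polygons_new)
```

In this toy case P1 keeps its row, P2_new adopts the old point as a new row, and only P3_new needs a newly defined point.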
@ToonHub
Contributor

ToonHub commented Oct 14, 2025

Nice!

Just some details:

  • For v5 of watersurfaces_refpoints I used the files in habitatwatersurfaces_cycle1/output of the n2khab-mhq-design repo, but I do not get the same xxh64sum. However, when comparing with the files of the v5 version in the Google Drive folder, the data appear to be identical. It is not clear to me why the two sources result in a different xxh64sum.

  • The following code only works when the directory refpoints_path_current (watersurfaces_refpoints_v6) already exists.

    file.copy(
      file.path(
        refpoints_path_previous,
        str_c("watersurfaces_refpoints", c(".tsv", ".yml"))
      ),
      refpoints_path_current,
      overwrite = TRUE
    ) %>%
      invisible()

So I suggest adding the following code before the copy:

    if (!dir.exists(refpoints_path_current)) {
      dir.create(refpoints_path_current)
    }

@ToonHub
Contributor

ToonHub commented Oct 14, 2025

See below the compiled HTML I get. Notice that the xxh64sum values of the resulting git2rdata files differ from those in the HTML you provided above. Yet the hash and the data hash in the git2rdata yml file are the same. Strange...

update_watersurfaces_refpoints.zip

@florisvdh
Member Author

> Following code only works when the directory refpoints_path_current (watersurfaces_refpoints_v6) already exists.
>
>     file.copy(
>       file.path(
>         refpoints_path_previous,
>         str_c("watersurfaces_refpoints", c(".tsv", ".yml"))
>       ),
>       refpoints_path_current,
>       overwrite = TRUE
>     ) %>%
>       invisible()
>
> So, I suggest to add following code:
>
>     if (!dir.exists(refpoints_path_current)) {
>       dir.create(refpoints_path_current)
>     }

Thanks; implemented (slightly altered) in 90b307b.

@florisvdh
Member Author

Regarding the different checksums of the local {git2rdata} files on Windows and the ones saved in the git / GitHub repo (yet pushed from the same Windows machine): we found that carriage return (CR) characters were inserted in all lines of the files in the working directory on Windows.

This needs further attention; it may be related to ropensci/git2rdata#49 or to git or RStudio behaviour on Windows (git can indeed add or remove the CR character on Windows, depending on settings such as core.autocrlf).

In the worst case, we could refrain from using file checksums and use in-memory checksums in R with digest::digest(..., algo = "xxhash64"), but that would still be inconvenient for file sharing and integrity checks, and it would pose an extra challenge for future file version checking by the {n2khab} package.
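
A minimal sketch of the effect, assuming the {digest} package: the same toy table (hypothetical values) is written once with LF and once with CRLF line endings. The file checksums differ, while the in-memory checksums of the data read back into R agree.

```r
library(digest)

# Hypothetical toy table standing in for the refpoints data.
refpoints <- data.frame(
  polygon_id = c("A_0001", "B_0002"),
  x = c(150000.5, 160000.25),
  in_polygon = c(TRUE, FALSE)
)

# Write through a binary connection so that eol is written as is on every OS.
write_tsv <- function(data, path, eol) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  write.table(data, con, sep = "\t", eol = eol, quote = FALSE,
              row.names = FALSE)
}

path_lf <- tempfile(fileext = ".tsv")
path_crlf <- tempfile(fileext = ".tsv")
write_tsv(refpoints, path_lf, eol = "\n")
write_tsv(refpoints, path_crlf, eol = "\r\n")

# File checksums hash the raw bytes, so the extra CR characters matter.
file_sum_lf <- digest(path_lf, algo = "xxhash64", file = TRUE)
file_sum_crlf <- digest(path_crlf, algo = "xxhash64", file = TRUE)

# In-memory checksums hash the parsed data, which ignores line endings.
mem_sum_lf <- digest(read.delim(path_lf), algo = "xxhash64")
mem_sum_crlf <- digest(read.delim(path_crlf), algo = "xxhash64")

file_sum_lf == file_sum_crlf
mem_sum_lf == mem_sum_crlf
```

The in-memory variant is robust to line-ending conversion, but it depends on how the file is parsed in R, which is part of why it is less convenient for checking shared files.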

To be further looked at.

@florisvdh
Member Author

Updated compiled HTML: update_watersurfaces_refpoints.html.zip
