Skip to content

99 import dataset from renku aware repo #765

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 12, 2019

Conversation

mohammad-alisafaee
Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee commented Oct 22, 2019

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Description

This PR enables copying metadata from other Renku projects when adding files to datasets. ATM, only creator is copied over from the original dataset's metadata.

Moreover, if fixes some minor issues with renku dataset add and renku dataset update.

Fixes #99
Fixes #763

Type of change

Please select relevant options.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

Do not create pull request unless you checked all points.

@mohammad-alisafaee mohammad-alisafaee requested a review from a team as a code owner October 22, 2019 13:42
Copy link
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! A couple of minor comments in the code and two main issues:

  1. when the remote repo is cloned, it is not done with GIT_LFS_SKIP_SMUDGE=1 so all of the data is downloaded
  2. I cannot manage to generate any triples for the files I add - I can't tell if this is an issue with the recent json-ld context changes or with this PR in particular - @Panaetius?
  3. This PR allows me to import single files, but what if I want to import an entire dataset? We should add a renku provider in addition to Zenodo and Dataverse probably. Perhaps that should be a separate PR.

@mohammad-alisafaee mohammad-alisafaee force-pushed the 99-import-dataset-from-renku-aware-repo branch 2 times, most recently from 1e12a04 to fd2a35b Compare October 31, 2019 07:06
@rokroskar
Copy link
Member

Using #782 this looks pretty good:

image

There are two things I'm not sure about.

  1. git@renkulab.io:rok.roskar/meteorology.git/data/meteorology/MAWS__SMSAWS__20161123.txt is not a valid git URL. I know that it's meant to show the path within the repo but it's not usable by any client. I can't do git clone <URL-from-above and I can't paste it into my browser. Furthermore, the git@ makes it not a valid URL which is also a bit confusing. It could have the git+ssh:// prefix, but it still won't resolve to anything useful. I'm wondering though if specifying the URL here is even needed since we point to the more complete source information which includes the commit and other information.

  2. I know we discussed this already in person but I'm still not super comfortable with isBasedOn - I'm wondering if you have considered any alternatives? Maybe prov:wasDerivedFrom would be better (also because it's a provenance issue, i.e. "where did this file come from")? But really, since the file has not been modified using the schema:sameAs might be the most appropriate.

@mohammad-alisafaee mohammad-alisafaee force-pushed the 99-import-dataset-from-renku-aware-repo branch 3 times, most recently from 0be8bd0 to 69832ac Compare November 11, 2019 14:44
@mohammad-alisafaee
Copy link
Contributor Author

2. I know we discussed this already in person but I'm still not super comfortable with isBasedOn - I'm wondering if you have considered any alternatives? Maybe prov:wasDerivedFrom would be better (also because it's a provenance issue, i.e. "where did this file come from")? But really, since the file has not been modified using the schema:sameAs might be the most appropriate.

As we discussed, we need to use isBasedOn because we still don't have a single URL/URI that can point to a file to a Git repo. We need to point to a DatasetFile entity but schema:sameAs can only be a URL.

rokroskar
rokroskar previously approved these changes Nov 11, 2019
@rokroskar rokroskar added this to the sprint-2019-10-24 milestone Nov 11, 2019
@mohammad-alisafaee
Copy link
Contributor Author

@rokroskar please just review the last commit about re-raising exceptions.

raise ParameterError(
'Could not find paths/URLs: \n{0}'.format('\n'.join(urls))
)
) from e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL raise...from :)

@mohammad-alisafaee mohammad-alisafaee merged commit 6bfb8be into master Nov 12, 2019
@mohammad-alisafaee mohammad-alisafaee deleted the 99-import-dataset-from-renku-aware-repo branch November 12, 2019 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add dataset update to the docs Import datasets from renku-aware repos
2 participants