Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support the --xidmap option in Bulkload* #4917

Closed
MichelDiz opened this issue Mar 11, 2020 · 6 comments · Fixed by #5090
Closed

Support the --xidmap option in Bulkload* #4917

MichelDiz opened this issue Mar 11, 2020 · 6 comments · Fixed by #5090
Labels
area/bulk-loader Issues related to bulk loading. kind/bug Something is broken. status/accepted We accept to investigate/work on it.

Comments

@MichelDiz
Copy link
Contributor

MichelDiz commented Mar 11, 2020

Update

Previously it was believed that the --store_xids flag had the same behave as --xidmap. However, it is not true. But the --xidmap flag is needed in Bulkloader.

What version of Dgraph are you using?

Dgraph version   : v2.0.0-beta1
Dgraph SHA-256   : 178663a98a3d59879a3d5c42928c89eb5f83afc2bfc0093272941e7a53515847
Commit SHA-1     : 6fac5d7c4
Commit timestamp : 2020-01-30 14:45:54 +1100
Branch           : HEAD
Go version       : go1.13.7

Have you tried reproducing the issue with the latest release?

Yes.

Steps to reproduce the issue (command/config used to run Dgraph).

Update: Ignore these steps, go to my last comment below.

dgraph bulk --store_xids -f ./agrovoc_2019-11-04_lod.nt --format=rdf -s schema.sch
dgraph bulk -x -f ./agrovoc_2019-11-04_lod.nt --format=rdf -s schema.sch

Expected behavior and actual result.

It should create an XID folder to be used in later imports via Live loader.

Origin of this issue: https://discuss.dgraph.io/t/bulk-loader-x-option/6115

@MichelDiz MichelDiz added kind/bug Something is broken. status/accepted We accept to investigate/work on it. area/bulk-loader Issues related to bulk loading. labels Mar 11, 2020
@danielmai
Copy link
Contributor

Bulk loader's --store_xids flag is not the same as the live loader's --xidmap flag. If --store_xids is set, then the bulk loader writes out xid edges into Dgraph during the bulk load process.

The --store_xids flag isn't intended to write out an xid-uid mapping to a separate directory.

@MichelDiz
Copy link
Contributor Author

Okay, but what is the purpose? How do we use --store_xids?

Both uses very similar flags. If they don't do the same behavior, this should be stated somewhere in the docs. And/Or use a different flag. Also, an example of usage should be documented.

@manishrjain
Copy link
Contributor

Should be a simple change. Let's bring the two in-sync.

@MichelDiz
Copy link
Contributor Author

MichelDiz commented Mar 27, 2020

Okay guys, after some time trying to identify the contexts. I will share my findings.

1 - In fact --store_xids and --xidmap are totally different things.

2 - As there is no documentation or any specific test for both* (actually there are tests for XIDMAP, not the --store_xids). It is suggested that they would behave in the same way by having similar names. But it's not true.

Perhaps the tests have been lost over time, as this feature is very old and is related to RDFs.

It is necessary to update the documentation (I can do this) and also to share the two functions between both loaders in my opinion. Both features available in the two tools would be very good for users.

3 - About the tools.

--xidmap:

A feature that we only have in Live load. It is very useful to create a mutation pipeline using Live load. Where we can reuse the blank nodes (and XIDs) with each new data ingestion without having to use the Upsert Block for example.

It is very useful for those who maintain consistent control over the use of Blank nodes naming (and XIDs too, e.g. data coming from RDF triple stores).

Liveloader asks the user to enter a path so that it can be saved in posting lists (I guess it is posting lists, but they are files from Badger). That way you can reuse the XIDs with each new data ingestion just by indicating the location of the previously saved XIDs.

This functionality does not exist in the Bulk loader as my tests concluded.

--store_xids:

A feature that exists only in Bulkloader. It takes the XID or blank node and saves it to the same node as a property of it with the predicate name as "xid". e.g.:

<_:MichelDiz> <name> "Michel" .

It will be recorded as 

{
  "data": {
    "q": [
      {
        "name": "Michel",
        "xid": "_:MichelDiz"
      }
    ]
  }

This feature is not compatible with --xidmap.

This functionality at first does not seem useful. But I'm sure it's related to the approach on external IDs https://docs.dgraph.io/mutations/#external-ids

e.g.:

<0x321> <xid> "https://www.themoviedb.org/person/32-robin-wright" .

It can be useful in this case and we can use Upsert Block. But it is not useful for those who need to ingest large amounts of data. Only small cases.

@vitorhirota
Copy link

vitorhirota commented Mar 27, 2020

Thanks for the detailed info @MichelDiz!

So, the --store_xids option in bulkload is working after all. But the original expected behaviour and result problem, to have bulkload create an xid map the same as live loader, is still open.

Should we close this issue and open a new one?

@MichelDiz
Copy link
Contributor Author

This issue is for Bulkloader and it is fine (it is already assigned internaly), in this case now to support --xidmap in bulk loader. Maybe open a new one for Liveloader to support --store_xids there if the case.

@MichelDiz MichelDiz changed the title The --store_xids option in Bulkload isn't working The --xidmap option in Bulkload isn't working Mar 27, 2020
@MichelDiz MichelDiz changed the title The --xidmap option in Bulkload isn't working Support the --xidmap option in Bulkload* Mar 27, 2020
@all-seeing-code all-seeing-code linked a pull request Apr 3, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/bulk-loader Issues related to bulk loading. kind/bug Something is broken. status/accepted We accept to investigate/work on it.
Development

Successfully merging a pull request may close this issue.

4 participants