Add option to ignore error instead of throwing error #260

Open
mhoangvslev opened this issue Mar 23, 2022 · 5 comments

Comments

@mhoangvslev

While working with dirty data, I realised that being able to skip bad rows when parsing RDF is very useful. This feature was suggested in issue #117 but was met with strong opposition. I would like to bring it up once more, in the hope that attitudes have changed since.

The program should give the option to warn-instead-of-error for these reasons:

  1. I know that the errors are minor and am willing to drop the faulty triples.
  2. I want to go all the way through the file first, get the list of all lines with errors, and bulk-edit my huge RDF file (579GB) instead of fixing the errors one by one. When the faulty triples are at the end of the file, this is painful and takes a lot of dev time.

@mielvds
Member

mielvds commented Mar 23, 2022

I think this is something for the SERD parser, rather than HDT, no?

@mhoangvslev
Author

From the user's point of view, I don't see an option for it. Can you give me a hint?

@drobilla
Contributor

serd already has a lax parsing mode for roughly this purpose, although (as you might expect) things can go horribly wrong with syntactically invalid Turtle or TriG documents and drop a ton of data on the floor. It works fine for line-based formats like N-Triples and N-Quads though.
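
To illustrate why line-based formats recover so cleanly, here is a minimal Python sketch of lenient, line-by-line reading. It is not serd's implementation: the `read_lenient` helper and the regular expression are hypothetical and deliberately simplified (the real N-Triples grammar has stricter IRI and escape rules). The point is only that a bad line can be skipped and recorded without affecting any other line.

```python
import re
from typing import Iterable, List, Tuple

# Simplified N-Triples terms (an assumption for illustration, not the full grammar).
_IRI = r'<[^<>"\s]*>'
_BNODE = r"_:[A-Za-z0-9_][A-Za-z0-9_.-]*"
_LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"'                       # quoted string with escapes
    r"(?:\^\^" + _IRI + r"|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?"  # optional datatype or language tag
)
_TRIPLE = re.compile(
    rf"^\s*(?:{_IRI}|{_BNODE})\s+{_IRI}\s+(?:{_IRI}|{_BNODE}|{_LITERAL})\s*\.\s*(?:#.*)?$"
)


def read_lenient(lines: Iterable[str]) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Return (well-formed lines, [(line number, rejected line), ...])."""
    good: List[str] = []
    errors: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are neither triples nor errors
        if _TRIPLE.match(stripped):
            good.append(stripped)
        else:
            # Record the problem and keep going instead of aborting the run.
            errors.append((lineno, stripped))
    return good, errors


if __name__ == "__main__":
    sample = [
        '<http://ex.org/a> <http://ex.org/p> "ok" .',
        '<http://ex.org/b> <http://ex.org/p> "unterminated .',
        '<http://ex.org/c> <http://ex.org/p> <http://ex.org/d> .',
    ]
    kept, rejected = read_lenient(sample)
    print(len(kept), "kept;", "bad lines:", [n for n, _ in rejected])
```

In Turtle or TriG a single statement can span many lines and depend on earlier prefixes, so there is no equally safe point at which to resume after an error.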

@mhoangvslev
Author

Let's consider my second point. I am willing to fix the errors, and I want the full list of errors to fix in one go instead of repeating a launch-fix-launch cycle.

@drobilla
Contributor

@mhoangvslev You could use serdi on the command line to strip the bad triples out yourself before loading it. It uses the same parser, so should encounter the same errors as hdt-cpp but be much quicker to use as a tool for this. With lax parsing (-l) it should print all the errors encountered in one run.

I usually do this from a text editor with a compilation mode that understands GCC warning syntax (Vim, Emacs, etc.) so you can jump immediately to each error.
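
For the "get the whole list first" workflow, the diagnostics from such a run can also be collected into a per-file list of line numbers. The Python sketch below assumes the stderr of a `serdi -l` run was saved to a log, and that each diagnostic contains a GCC-style `path:line:col: message`, optionally preceded by `error: `. That exact shape is an assumption and should be checked against real output; `collect_errors` and the sample messages are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed diagnostic shape: optional "error: " prefix, then path:line:col: message.
_DIAG = re.compile(
    r"^(?:error:\s*)?(?P<path>.+?):(?P<line>\d+):(?P<col>\d+):\s*(?P<msg>.*)$"
)


def collect_errors(log_lines: Iterable[str]) -> Dict[str, List[Tuple[int, int, str]]]:
    """Group (line, column, message) tuples by the file they refer to."""
    by_file: Dict[str, List[Tuple[int, int, str]]] = defaultdict(list)
    for raw in log_lines:
        m = _DIAG.match(raw.strip())
        if m is None:
            continue  # not a diagnostic in the assumed shape; ignore it
        by_file[m["path"]].append((int(m["line"]), int(m["col"]), m["msg"]))
    return dict(by_file)


if __name__ == "__main__":
    # Hypothetical log lines, made up for illustration only.
    log = [
        "error: data.nt:12:40: bad literal",
        "data.nt:97:3: missing object",
        "some unrelated line",
    ]
    for path, items in collect_errors(log).items():
        print(path, [line for line, _, _ in items])
```

The resulting line numbers can then drive a bulk edit, for example by deleting or exporting exactly those lines in a single pass over the file.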
