Please make the parser more robust #724

Open
cskyan opened this issue Mar 23, 2017 · 5 comments
Labels
discussion; enhancement (New feature or request); help wanted (Extra attention is needed); parsing (Related to a parsing.)

Comments

@cskyan

cskyan commented Mar 23, 2017

I encountered some ParseError exceptions when parsing N-Triples files. Some of the causes, such as empty lines or codec issues, are easy to fix at runtime. I would like the parser to pre-process the files and either handle these problems or skip the invalid records. At the very least, we need to know which lines in the data file are problematic, because we cannot be sure that downloaded files strictly follow the standard format. If the package simply raises the exception without recovering, parsing the whole file takes more time. When processing a large data set, the impact of skipping some records may be acceptable.
In my case, I directly modified this line of code, inserting a continue statement so that the program proceeds. Otherwise, I cannot get the remaining data once a ParseError occurs. I know that skipping the exception this way is not a good approach, but it is the fastest way to continue my project. I hope this suggestion will be accepted.

@joernhees joernhees added the discussion, enhancement, and parsing labels Mar 27, 2017
@joernhees
Member

hmm, i'm against changing the default behavior (it's correct to raise an error if the input format is broken)... but maybe providing an ignore_errors flag via parse() would be a nice feature here...
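
A minimal sketch of what such a flag could look like for a line-oriented format like N-Triples. Everything here is a hypothetical illustration and not rdflib's API: the `parse_lines` function, the toy `parse_triple_line` helper, the stand-in `ParseError` class, and the `ignore_errors` parameter name are all assumptions. The default stays strict, as suggested above.

```python
import re


class ParseError(Exception):
    """Stand-in error type for this sketch (not rdflib's class)."""


# Loose pattern for "<s> <p> <o> ." lines. Real N-Triples also allows
# literals and blank nodes, which this toy pattern does not cover.
_TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.$")


def parse_triple_line(line):
    """Parse one line into a (subject, predicate, object) tuple."""
    match = _TRIPLE.match(line.strip())
    if match is None:
        raise ParseError("invalid line: %r" % line)
    return match.groups()


def parse_lines(lines, ignore_errors=False):
    """Parse an iterable of lines.

    By default a malformed line raises ParseError. With
    ignore_errors=True the malformed line is skipped instead.
    """
    triples = []
    for line in lines:
        if not line.strip():
            continue  # empty lines carry no data
        try:
            triples.append(parse_triple_line(line))
        except ParseError:
            if not ignore_errors:
                raise  # default: broken input is an error
    return triples
```

Calling `parse_lines(lines)` keeps the current strict behavior, while `parse_lines(lines, ignore_errors=True)` returns only the records that parsed cleanly.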

@joernhees joernhees added this to the rdflib 5.0.0 milestone Mar 27, 2017
@joernhees joernhees added the help wanted label Mar 27, 2017
@cskyan
Author

cskyan commented Apr 2, 2017

Sure, ignore_errors would be a helpful feature when parsing raw data. The program should also write the ignored records to the log. Another feature I would like is incremental parsing: if my raw data contains some invalid records, I could correct them after the first pass and then parse only the revised records. It would save a lot of time if this could also be implemented.
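
The two requests above (log the skipped records, then re-parse only the corrected ones) could be sketched roughly as follows. This is an illustration under stated assumptions, not an rdflib feature: `parse_with_rejects`, `reparse_corrected`, the logger name, and the toy `split_triple` parser are all hypothetical, and the toy parser signals bad input with `ValueError`.

```python
import logging

logger = logging.getLogger("tolerant_parser")  # hypothetical logger name


def split_triple(line):
    """Toy line parser (assumption): expects 'subject predicate object .'"""
    parts = line.split()
    if len(parts) != 4 or parts[3] != ".":
        raise ValueError("expected 'subject predicate object .', got %r" % line)
    return tuple(parts[:3])


def parse_with_rejects(lines, parse_line):
    """Return (parsed, rejects).

    rejects is a list of (line_number, raw_line) pairs, so the caller
    knows exactly which lines of the input need fixing.
    """
    parsed, rejects = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip empty lines silently
        try:
            parsed.append(parse_line(line))
        except ValueError as exc:
            logger.warning("line %d skipped: %s", lineno, exc)
            rejects.append((lineno, line))
    return parsed, rejects


def reparse_corrected(parsed, corrected_lines, parse_line):
    """Second pass: parse only the corrected records and append them.

    Line numbers in the returned rejects refer to corrected_lines,
    not to the original file.
    """
    more, still_bad = parse_with_rejects(corrected_lines, parse_line)
    return parsed + more, still_bad
```

The first pass yields the good records plus a list of rejected lines; after fixing those by hand, only the fixed lines go through `reparse_corrected`, so the rest of the file is not parsed again.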

@gromgull
Member

gromgull commented Apr 4, 2017

This is a good feature for the #283 clean-up; there will be an error callback then.
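
The design of the error callback planned for #283 is not described in this thread. As a rough sketch of the general idea only, a callback could decide per record whether parsing continues. The names `parse_with_callback`, `on_error`, and the toy `split_triple` parser are assumptions made for illustration.

```python
def split_triple(line):
    """Toy line parser (assumption): expects 'subject predicate object .'"""
    parts = line.split()
    if len(parts) != 4 or parts[3] != ".":
        raise ValueError("expected 'subject predicate object .', got %r" % line)
    return tuple(parts[:3])


def parse_with_callback(lines, parse_line, on_error=None):
    """Parse lines, delegating error handling to on_error.

    on_error(lineno, line, exc) is called for each malformed line.
    If it returns a true value, the line is skipped; if it returns a
    false value, or no callback was given, the error is re-raised.
    """
    results = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # empty lines carry no data
        try:
            results.append(parse_line(line))
        except ValueError as exc:
            if on_error is None or not on_error(lineno, line, exc):
                raise
    return results


# Usage: collect the numbers of the bad lines and keep going.
bad_lines = []


def remember(lineno, line, exc):
    bad_lines.append(lineno)
    return True  # tell the parser to skip this record


triples = parse_with_callback(["a b c .", "oops", "d e f ."], split_triple, remember)
```

With no callback the function behaves strictly, so the default stays the same as today; skipping, logging, or aborting becomes the caller's choice.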

@mehak16163

mehak16163 commented May 28, 2020

Hi,
I have attempted to solve this issue. Please review my PR #1080 and suggest any changes that should be made.

@karish-grover

I have been working on this issue for quite some time now. I have incorporated handling for a few parsing errors, but I was wondering if you could explain in detail the exact errors that you want us to handle, so that I can keep them in mind before sending out a PR.


6 participants