Streaming parsers #411

Open
wants to merge 6 commits into
base: 8.x

gromgull
Member

This is a very much incomplete branch for reworking the interface between graphs and parsers.

By introducing a new Sink object, it becomes possible to write streaming RDF processors that process the triples "as they come in". Since you do not store them all in a graph, you can work on files much larger than what fits in memory.
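
To make the idea concrete, here is a rough sketch of what a sink-based streaming processor could look like. It is not the code in this branch and not an RDFLib API: the `triple` method name, the `PredicateCounter` class, and the `stream_parse` function are all hypothetical, and the line pattern is a toy that only handles simple N-Triples-like input.

```python
import io
import re
from typing import Protocol, TextIO


class Sink(Protocol):
    """Hypothetical sink interface; the method name `triple` is an assumption."""

    def triple(self, s: str, p: str, o: str) -> None: ...


class PredicateCounter:
    """A sink that keeps only per-predicate counts, never the triples themselves."""

    def __init__(self) -> None:
        self.counts = {}

    def triple(self, s: str, p: str, o: str) -> None:
        self.counts[p] = self.counts.get(p, 0) + 1


# Toy pattern: subject, predicate, object, then a terminating " .".
# Deliberately simplified; this is NOT a conforming N-Triples parser.
_LINE = re.compile(r"^\s*(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")


def stream_parse(source: TextIO, sink: Sink) -> None:
    """Read the source line by line and hand each triple straight to the sink."""
    for line in source:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        sink.triple(*match.groups())


data = io.StringIO(
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n'
    '<http://example.org/a> <http://example.org/name> "Alice" .\n'
    '<http://example.org/b> <http://example.org/knows> <http://example.org/a> .\n'
)
sink = PredicateCounter()
stream_parse(data, sink)
print(sink.counts)
```

Because the parser only ever holds one line and the sink only holds its counters, memory use in this sketch stays flat no matter how large the input is. A sink that adds each triple to a graph would recover the current load-everything behaviour behind the same interface.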

As usual, a pull-request to trigger Travis.

@joernhees
Member

👍

@gromgull mentioned this pull request Jul 28, 2014
@joernhees added the enhancement, parsing, and performance labels Feb 19, 2015
@ExplodingCabbage
Contributor

I'd benefit from this feature, and so would (adding up the views) at least 983 + 467 + 5704 + 90 = 7244 other people. I'm going to take a look and see if I can get it to a state where the tests pass and there are no merge conflicts, although I'm not familiar with the codebase and the (existing) code around it is a horrible mess in at least a couple of ways that immediately struck me:

  • there are Parser and InputSource interfaces, but how exactly they're meant to behave is unclear, and all their methods are simply documented "TODO:"
  • there's both an NTParser and an NTriplesParser, without any explanation of the difference between them

I'll see if I can figure it all out, but no promises.

@ExplodingCabbage
Contributor

Good lord, 470 errors and 154 failures after merging this into today's code. I might give up on this exercise and leave it to somebody who understands both the codebase and RDF itself better than I do, but I'll keep poking a little first...

@gromgull
Member Author

gromgull commented Feb 7, 2016

step 1 should be to rebase this on the current master - it has changed a bit since I did this work.

@gromgull
Member Author

gromgull commented Feb 7, 2016

The NTriplesParser was (once upon a time) a standalone project, without RDFLib; NTriples is the wrapper that makes it fit the RDFLib parser interface. I think I removed it in some commit here somewhere?

@gromgull added this to the rdflib 5.0.0 milestone Jan 12, 2017
All add/remove methods now raise an Exception if passed a graph.
The Graph API works on Graph objects and will pack/unpack as required.
@gromgull
Member Author

@nicholascar this is another thing I would consider for a 6.0.0 release!

I am not sure if the work here is even sensible as a starting point any more, but making a unified "sink" object across parsers seems like a good idea.

@nicholascar
Member

@gromgull yes I’ve seen this work and agree: unified would be good! I’ve tagged it for 6.0.0 now so it’s on the radar.

Might be one of those good architectural tidy-ups once things like ditching the Py2 and perhaps graph IDs parts have been actioned.

5 participants