Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference to line/byte from each segment #26

Open
nerdoc opened this issue Aug 24, 2020 · 4 comments
Open

Reference to line/byte from each segment #26

nerdoc opened this issue Aug 24, 2020 · 4 comments

Comments

@nerdoc
Copy link
Member

nerdoc commented Aug 24, 2020

Btw: It would be nice if each segment had a reference to the exact input line (byte position start/end) to ease debugging. That might also help handling malformed files. (I just swapped our self-written edifact parser against pydifact and it was a pretty pleasant experience. However exact byte positions was the one feature we lost in that process.)

Originally posted by @FelixSchwarz in #24 (comment)

@nerdoc
Copy link
Member Author

nerdoc commented Aug 24, 2020

This is not completely trivial. pydifact stores its data in lists, and does not create Token objects. This is mainly for performance reasons. Creating an own object for each Token would cause much more memory usage. But OTOH, it's not possible to store information additionally to the Token into the list, without breaking compatibility to all structures using Tokens and Segments. One approach would be, creating a separate list in parallel, where line information is stored with the matching index like in the token list. This seems a bit clumsy to me, but maybe is the most straightforward way to go.

Refactoring pydifact to use Token objects (with a linenumber/byte attribute is a no-go IMHO, as it uses to much overhead.
Even the "parallel list" approach could be made "switch-off"-able - to use less memory when parsing bigger EDI files, at the cost of one additional if clause in each iter.
I'll test that a bit.

@nerdoc
Copy link
Member Author

nerdoc commented Aug 24, 2020

OMG. Big mistake. I should read my own code. Tokens ARE in fact objects, so we can store line information in them. ;-)

@FelixSchwarz
Copy link

I also like using namedtuple. Previously our own parser used something like this

Segment = namedtuple('Segment', ('data', 'line_nr', 'offset_start', 'offset_end', 'raw_data'))

You could easily add some extra methods there for extra convenience like you do now for Token.

@nerdoc
Copy link
Member Author

nerdoc commented Aug 24, 2020

Maybe it would be an option to skip the step of creating Segments as nested lists, and instead directly create the Segment objects using the factory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants