[Feature] Get Element Attributes without the whole Tree #11

Closed
TuSKan opened this issue Feb 28, 2020 · 11 comments

@TuSKan

TuSKan commented Feb 28, 2020

Hi, thank you for the great package. It's very fast and the only one that works for me.

I would appreciate a way to access an element's attributes without loading the whole tree.
In my case, a "root" tag carries important attributes and every other tag is its child, so parsing it loads the whole file into memory.
I will be happy to make a PR if you give me some directions.

@tamerh
Owner

tamerh commented Feb 28, 2020

Hi, glad that it works for you.

Have you tried SkipElements? Does it not fit your use case?

If not, a PR would be fine. This requires a new function, something like SkipAllElements, which sets a bool variable; in the parse function, instead of getting the element tree, it should call the existing skipElement function. A test case is also needed for this new functionality.

@TuSKan
Author

TuSKan commented Feb 28, 2020

See my example below.
I have a 1 GB file with many "Cli" tags and I need the "Doc" tag attributes.
If I use it like this: parser := xmlparser.NewXMLParser(bufio.NewReader(xml), "Doc", "Cli") it will try to load the whole file.

Right now I'm using this: parser := xmlparser.NewXMLParser(bufio.NewReader(xml), "Cli") and it works perfectly and very fast, but I also need the "Doc" attributes.

<?xml version="1.0" encoding="UTF-8"?>
<Doc COD="123456" DtBase="2019-06" DtGeracao="2019-08-13 08:43:22" PercDocProcess="97.30" Protocolo="183107896" VolPercProcess="99.99">
    <Cli Cd="987654321" CoobAss="0.00" CoobRec="0.00" IniRelactCli="1996-07-05" QtdIf="7" QtdOpManif="0" QtdOpJud="0" QtdOp="29" RespTotManif="0.00" RespTotJud="0.00" RiscoIndVendor="0.00" Tp="1">
        <Op Mod="0202" VincME="N">
            <Venc v110="1539.13" v120="1495.77" v130="1453.64" v140="4124.82" v150="6146.99" v160="11665.43" v165="8332.77" v170="5952.33" v175="4136.92"/>
        </Op>
        <Op Mod="0203" VincME="N">
            <Venc v220="652.00" v110="1180.68" v120="420.00" v130="323.22" v140="612.13"/>
        </Op>
        <Op Mod="0204" VincME="N">
            <Venc v110="4120.03"/>
        </Op>
        <Op Mod="0210" VincME="N">
            <Venc v260="3424.56"/>
        </Op>
        <Op Mod="0213" VincME="N">
            <Venc v110="7.59" v130="3.71" v140="10.88" v150="20.81" v160="38.05" v165="33.77" v170="29.95" v175="26.60" v180="60.51"/>
        </Op>
        <Op Mod="0218" VincME="N">
            <Venc v110="173.65" v130="84.74" v140="249.23" v150="476.59" v160="872.00" v165="773.88" v170="686.75" v175="609.46" v180="1379.38" v255="1870.84" v260="76.59"/>
        </Op>
        <Op Mod="0299" VincME="N">
            <Venc v110="3041.08" v120="557.46" v130="1758.50" v140="5161.27" v150="9338.00" v160="17893.71" v165="15698.76" v170="10190.19" v175="8690.20" v180="19667.23"/>
        </Op>
        <Op Mod="0401" VincME="N">
            <Venc v220="186.75" v110="2037.26" v130="991.11" v140="2867.14" v150="5288.14" v160="8998.07" v165="3294.93" v230="591.18" v240="1027.72" v245="1027.72" v250="157.11"/>
        </Op>
        <Op Mod="0901" VincME="N">
            <Venc v220="2717.77" v110="3476.04" v120="2562.20" v130="2555.44" v140="7376.33" v150="14060.32" v160="25467.68" v165="22285.38" v170="19477.54" v175="17002.25" v180="81603.60" v190="17106.54" v210="4998.51" v230="2724.85"/>
        </Op>
        <Op Mod="1304" VincME="N">
            <Venc v110="42.22"/>
        </Op>
        <Op Mod="1901" VincME="N">
            <Venc v20="199.05"/>
        </Op>
    </Cli>
</Doc>

@tamerh
Owner

tamerh commented Feb 28, 2020

So, to confirm: there is one Doc tag and many Cli tags in your file, right?

Then I guess that's not possible without loading everything. The following also does not work, because it expects independent Cli and Doc tags.

parser := xmlparser.NewXMLParser(bufio.NewReader(xml), "Doc", "Cli")

As a workaround, you can read just the following Doc line from the file, create a strings.NewReader from it, and parse it separately:

<Doc COD="123456" DtBase="2019-06" DtGeracao="2019-08-13 08:43:22" PercDocProcess="97.30" Protocolo="183107896" VolPercProcess="99.99">
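
A minimal sketch of this workaround, assuming the Doc line has already been read from the file as a string. It uses Go's standard encoding/xml package rather than xmlparser, and the helper name openingTagAttrs is hypothetical. The decoder stops at the first start element, so the missing closing tag is not a problem.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// openingTagAttrs (hypothetical helper) parses one opening tag, such as the
// <Doc ...> line read from the file, and returns the element name and its
// attributes. It stops at the first start element it sees.
func openingTagAttrs(line string) (string, map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(line))
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", nil, err // io.EOF here means no element was found
		}
		if start, ok := tok.(xml.StartElement); ok {
			attrs := make(map[string]string, len(start.Attr))
			for _, a := range start.Attr {
				attrs[a.Name.Local] = a.Value
			}
			return start.Name.Local, attrs, nil
		}
	}
}

func main() {
	// Shortened version of the Doc line from the example file.
	line := `<Doc COD="123456" DtBase="2019-06" Protocolo="183107896">`
	name, attrs, err := openingTagAttrs(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(name, attrs["COD"], attrs["DtBase"])
}
```

The Cli elements can then be streamed separately with the parser configured for "Cli" only, as in the earlier comment.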

@tamerh
Owner

tamerh commented Feb 28, 2020

I mean it is not possible without reading the file twice. Otherwise, you can parse first for Doc while skipping the Cli tags, which will not load everything, and then parse again for the Cli tags.

@TuSKan
Author

TuSKan commented Feb 29, 2020

> so to confirm 1 Doc tag and many Cli tags right in your file?

Yes! Exactly this.

I was thinking of two options:

  1. More elegant, but breaks compatibility
    a) Specify in NewXMLParser the tags of interest, independent of the hierarchy
    b) In the stream loop, you have two functions: GetAttrib and GetChilds. Only at that moment does getElementTree run.
  2. Less elegant, but does not break compatibility
    a) Like SkipOuterElements, create a new function GetOuterAtrribs that does not run getElementTree.

Let me know if this makes sense to you.

@tamerh
Owner

tamerh commented Mar 1, 2020

Hi,
your suggestions sound good to me. For the first one, how about implementing it without breaking compatibility? For instance, by default it works as it does now, parsing the children. But if extra settings are passed for a tag, it parses via GetAttrib or GetChilds as you suggested.

@TuSKan
Author

TuSKan commented Mar 1, 2020

I spent a couple of hours on it and it's not that easy.
Looping through goroutine channels is very smart.
As implemented, elements are pushed to the channel queue based on the hierarchy, and because of the buffer it's not possible to run getElementTree lazily.
So I don't know how to do it.
Can you give me some ideas?

@tamerh
Owner

tamerh commented Mar 2, 2020

Let's start with the simple way: each tag can be flagged for attribute-only parsing with a new function, something like ParseAttributesOnly(loopElements ...string)

Then, when looping over each tag here, if the attribute-only flag is true it adds the element directly to the channel without getting the element tree; otherwise it loads the element tree as it does now.

This way, I assume it will first parse Doc with attributes only and then continue to Cli with its children. Then let's see if it breaks anything.
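
The attribute-only flag idea can be sketched with Go's standard encoding/xml package. This is not the xml-stream-parser implementation; the names Node, Element, stream and streamString are hypothetical and exist only for this illustration, and the sample document is a trimmed stand-in for the real file. Flagged tags go onto the channel with their attributes only, while other loop tags have their subtree decoded first.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Node is a generic element tree (hypothetical, used only in this sketch).
type Node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Children []Node     `xml:",any"`
}

// Element is what the stream emits for each matched tag.
type Element struct {
	Name     string
	Attrs    map[string]string
	Children []Node // stays nil for attribute-only tags
	Err      error
}

// stream emits every element named in loopTags. Tags that are also listed in
// attrsOnly are sent as soon as their start tag is read, without a subtree.
func stream(r io.Reader, loopTags, attrsOnly []string) <-chan Element {
	want, shallow := map[string]bool{}, map[string]bool{}
	for _, t := range loopTags {
		want[t] = true
	}
	for _, t := range attrsOnly {
		shallow[t] = true
	}
	out := make(chan Element)
	go func() {
		defer close(out)
		dec := xml.NewDecoder(r)
		for {
			tok, err := dec.Token()
			if err == io.EOF {
				return
			}
			if err != nil {
				out <- Element{Err: err}
				return
			}
			start, ok := tok.(xml.StartElement)
			if !ok || !want[start.Name.Local] {
				continue
			}
			el := Element{Name: start.Name.Local, Attrs: map[string]string{}}
			for _, a := range start.Attr {
				el.Attrs[a.Name.Local] = a.Value
			}
			if !shallow[el.Name] {
				// Full parsing: consume the subtree up to the matching end tag.
				var n Node
				if err := dec.DecodeElement(&n, &start); err != nil {
					out <- Element{Err: err}
					return
				}
				el.Children = n.Children
			}
			out <- el
		}
	}()
	return out
}

// streamString is a convenience wrapper for in-memory input.
func streamString(doc string, loopTags, attrsOnly []string) <-chan Element {
	return stream(strings.NewReader(doc), loopTags, attrsOnly)
}

// sample is a trimmed, illustrative stand-in for the real file.
const sample = `<Doc COD="123456" DtBase="2019-06">
  <Cli Cd="987654321"><Op Mod="0202"><Venc v110="1539.13"/></Op></Cli>
  <Cli Cd="2"><Op Mod="0203"/></Cli>
</Doc>`

func main() {
	for el := range streamString(sample, []string{"Doc", "Cli"}, []string{"Doc"}) {
		if el.Err != nil {
			panic(el.Err)
		}
		fmt.Println(el.Name, el.Attrs, len(el.Children))
	}
}
```

Because Doc is emitted before its children are read, the decoder simply continues into the Doc body and picks up each Cli in turn, so memory use is bounded by one Cli subtree at a time. The channel here is unbuffered, so a consumer that stops early would leave the goroutine blocked; a real implementation would need a way to cancel.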

TuSKan pushed a commit to TuSKan/xml-stream-parser that referenced this issue Mar 2, 2020
@TuSKan
Author

TuSKan commented Mar 2, 2020

Works great! This is more like the second approach.
p := getparser("examples", "tag1").ParseAttributesOnly("examples")
What do you think?

@tamerh
Owner

tamerh commented Mar 2, 2020

All right, maybe I didn't fully get your first idea, but compatibility is a requirement for me. Your changes look good to me if the tests are passing. I added a minor comment. If you can also reflect this new feature in the README in a simple way, that would be great.

tamerh added a commit that referenced this issue Mar 2, 2020
#11 [New feature]: ParseAttributesOnly
@tamerh tamerh closed this as completed Mar 2, 2020
@TuSKan
Author

TuSKan commented Mar 2, 2020

Thanks for your prompt reply! :)
