Try optimizing file processing. #60

tdresser · 2024-01-06T23:54:03Z

Parsing a 10MB xml file is pretty slow currently.

I don't have a good intuition for what part is slow. Do you think there might be some low hanging fruit here? Any hunches?

The first step would be profiling.

dangowans · 2024-01-08T16:57:53Z

Thanks for reaching out @tdresser . You are likely right that any slow downs would come from the parser step.

I'm by no means attached to the current parser, xml2js. It was most likely picked because of the significant usage and the Typescript support.

If there's something faster out there, I'm definitely not opposed to switching things up.

file downloaded from from https://green-button.github.io/samples/ to test issue #40

dangowans · 2024-01-08T19:59:22Z

So it looks like with a 26 MB test file, GitHub Actions takes 3.5s and my development machine takes 2.5s. Are your results similar?

tdresser · 2024-01-08T22:10:32Z

I think I'm using the term "Parsing" inaccurately here.

What I'm trying to refer to is "Time from file upload until it's done processing and I can see the data on the dashboard."
I don't think that's dominated by the XML to json conversion. I'm not sure what all ends up getting done, but it takes O(minutes) for me.

I guess there's a bunch of database writes in there?

Oh, that likely means this issue is on the wrong repo. I'm happy to move this over if there is anything worth looking at.

dangowans · 2024-01-09T14:14:36Z

Ok, I think I understand now. I'm going to move this issue to the EMILE repository as the parser doesn't do the uploading, database writing, or visualizing.

tdresser · 2024-01-13T18:42:26Z

Thanks for moving.

I've done a little bit of poking around, and I think I've got a fairly minimal change in mind that should improve things a bunch, assuming I'm not completely misunderstanding how this works!

The vast majority of the time (based on a couple performance traces) is spent in refreshAggregatedEnergyDataTables.

IIUC, this needs to be run after any asset is updated.
Today, we run this after every data point is added via addEnergyData.
I think when processing a file, we could do a single call to refreshAggregatedEnergyDataTables per assetID modified.

Does that make sense? I guess today you might be able to view data from a partially processed file, and this would eliminate that possibility, but on my machine, I think this would be a ~10x speedup.

Am I understanding this correctly?

tdresser changed the title ~~Consider adopting a faster xml parser~~ Try optimizing parsing. Jan 7, 2024

dangowans referenced this issue in cityssm/node-green-button-parser Jan 8, 2024

add 26MB test file

98322c6

file downloaded from from https://green-button.github.io/samples/ to test issue #40

dangowans added enhancement New feature or request help wanted Extra attention is needed labels Jan 8, 2024

dangowans transferred this issue from cityssm/node-green-button-parser Jan 9, 2024

tdresser changed the title ~~Try optimizing parsing.~~ Try optimizing file processing. Jan 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try optimizing file processing. #60

Try optimizing file processing. #60

tdresser commented Jan 6, 2024 •

edited

Loading

dangowans commented Jan 8, 2024

dangowans commented Jan 8, 2024

tdresser commented Jan 8, 2024 •

edited

Loading

dangowans commented Jan 9, 2024

tdresser commented Jan 13, 2024

Try optimizing file processing. #60

Try optimizing file processing. #60

Comments

tdresser commented Jan 6, 2024 • edited Loading

dangowans commented Jan 8, 2024

dangowans commented Jan 8, 2024

tdresser commented Jan 8, 2024 • edited Loading

dangowans commented Jan 9, 2024

tdresser commented Jan 13, 2024

tdresser commented Jan 6, 2024 •

edited

Loading

tdresser commented Jan 8, 2024 •

edited

Loading