Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geojson optimization #444

Merged
merged 16 commits into from
Oct 18, 2022
Merged

Conversation

hellkite500
Copy link
Member

@hellkite500 hellkite500 commented Sep 15, 2022

After looking through #443, it became clear that a large part of the initialization overhead in FeatureCollection was due to an excessive amount of copies of strings (and some other data). This PR utilized std::move to reduce the amount of copies of data that go through the system. Also, I took the opportunity to finish implementing the JSONProperty using the boost::variant for all data types.

This work when used on the partition generator with the optimization in #443 was able to reduce the runtime of the conus run to about 8 minutes.

I noticed a point of diminishing returns working on this code, and investigated briefely. Turns out, on my machine, I'll likely never get better than about 8 minutes on the conus run. It literally takes about 5 minutes just to DESTRUCT the property tree as the stack unrolls. This is because I was pushing ~56GB RAM out of my 16GB phsical memory, so the swapping was intense.

To test this idea a bit, I replaced all geometries in the catchment_data.geosjon with a simple Point geometry, instead of a MultiPolygon. I then ran the partitioning using that input, and completed the entire thing in 1 minute and 20 seconds!

Additions

Removals

Changes

  • A lot of resource stealing using std::move
  • Finish implementing JSONProperty using the boost::variant that was introduced to utilize List parameters in ngen

Testing

  1. Ran many partition runs

Notes

  • May want to consider implementing move constructors for JSONProperty and the various Feature types.

Todos

  • Actually was able to remove one from the code this time!

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Target Environment support

  • Linux
  • MacOS

@hellkite500
Copy link
Member Author

I haven't built all ngen tests locally yet, so I didn't see these failing until the PR went up. I'll dig into these a little bit and see what is happening.

@@ -277,7 +274,7 @@ namespace geojson {
}
}
else if (child.first == "id") {
id = child.second.data();
id = std::move(child.second.data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that std::move might end up ripping stuff off off child here. It's hard to tell here without having example docs in front of me, but it looks like there's a chance that these std::move operations might modify the tree and bubble up the updates, harming reuse. There might need to be some comments highlighting this so it's easier to identify while skimming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once processed and recorded does the state of that node in the ptree really matter? The reference in the tree should still be valid if I understand the semantics of move correctly, but even if it isn't does it matter in this context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I'm not sure - if the ptree isn't used afterwards, no one cares, but std::move stands to "replace" values (I'm assuming arrays?) when moving. Strings, for example, stand to be converted to empty strings after movement (new location still gets the correct value, though). This specific line isn't a good example; I just picked one of the moves to put a comment about moves. I think the biggest risk is when moving stuff into vectors. I'm not the expert, so I can easily be missing 90% of what's going on. That's sort of the reason I suggested adding comments instead of actually changing code. If someone comes along further down the line (most likely a goober like me) and expects the state of the tree to remain the same they might be in for a rude awakening.

std::move was a really good idea, so I'm not disputing its use.

@christophertubbs
Copy link
Contributor

How computationally expensive would it be to wrap the boost::get calls in a function that calls boost::get in a try-catch block and prints out some of the identifiers that fail? That seems to be the problem with the unit tests and, if it's being run into here, I'd imagine it'd be run into elsewhere. It would be cool if we could get the full path, something like failed on "root->foo->bar->5->foobar->0", but just seeing something like failed on "foobar" might provide enough context to find the issue in the parsing.

Copy link
Contributor

@donaldwj donaldwj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is the addition of a raw pointer to JSONProperty, I think a shared pointer or unique pointer should be usable, depending on if ownership is being taken of the data.

@hellkite500
Copy link
Member Author

Finally, tests passing! I also pushed a couple clean up commits and some additional JSONProperty test coverage.

@hellkite500
Copy link
Member Author

My only concern is the addition of a raw pointer to JSONProperty, I think a shared pointer or unique pointer should be usable, depending on if ownership is being taken of the data.

This is really just a view to backing storage that is managed via stl containers in the JSONProperty class, but was required to indirectly access the data via the variant and subsequently any visitor (such as the vector visitor for lists) that may need to operate on the data.

This pointer is never allocated to anything, and the only way to set it is via the constructor of the struct by passing the reference it should point to when the struct is created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants