
API functionality revamp, text fixes, README revamp #15

Merged (44 commits) on May 24, 2021

Conversation

@prakaa (Contributor) commented May 8, 2021

API functionality revamp (type inference for some API functions), test fixes, and major README changes

Initial PR made 8/5/2021. Leaving the PR open while further changes are made; as these are incorporated into the PR, I will tick tasks off.

API (Type inference & other changes)

Initial fixes

  • Tests and the GUI require data to be interpreted as strings. This interferes with the API functionality introduced in #11 (Beefing up command line dynamic data handling functionality), as parquet and feather files save a schema that includes column types
    • The initial solution bypassed this issue without changing lots of tests: an additional parameter, parse_data_types, was added, which defaults to True for the API and is set to False in a GUI wrapper function (following the structure of other functions wrapped for the GUI). This parses data types on reading the AEMO csv (see the sketch below).
    • However, this could lead to user error: cached data may be stored as string datatypes, but parse_data_types will not parse the data types when reading existing files.
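A minimal sketch of how the parse_data_types flag is intended to behave, assuming the argument style of the %timeit call in the Testing section below (the table name and cache path are illustrative):

```python
from nemosis import data_fetch_methods

# API usage: parse_data_types defaults to True, so numeric and datetime
# columns are returned with inferred dtypes rather than as strings.
dispatch_data = data_fetch_methods.dynamic_data_compiler(
    "2018/01/01 00:00:00", "2018/01/01 23:55:00",
    "DISPATCHLOAD", "./nemosis_cache",
    parse_data_types=True,
)

# GUI-style usage: leave everything as strings, since the GUI
# relies on string joins.
gui_data = data_fetch_methods.dynamic_data_compiler(
    "2018/01/01 00:00:00", "2018/01/01 23:55:00",
    "DISPATCHLOAD", "./nemosis_cache",
    parse_data_types=False,
)
```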

Further functionality

  • Should ensure new changes do not 'break' cache use - will add separate cache functionality that caches type-inferred data
  • Add a new cache_compiler option that has the typical cache args from dynamic_data_compiler built in (e.g. keep_csv=False, fformat="parquet" or fformat="feather", and data_merge=False). It will infer data types when CSVs from AEMO are downloaded and read in.
  • Add tests for cache_compiler
  • parse_data_types will remain but will parse the data types of the DataFrame regardless of file type (i.e. parsing occurs when a cache or a new file is read, not just when a new file is read). Data from csv will always be read in as strings
    • parsing is implemented after dynamic_data_compiler has concatenated the list of DataFrames that _dynamic_data_fetch_loop returns. Parsing before concatenation can lead to typed columns being reverted to object once concatenation occurs (e.g. INTERVENTION went from Int to object).
    • parsing is also done before filtering data with filter_cols and filter_values. If a user provides a numeric filter value (e.g. RAISE5MIN=5), the unparsed DataFrame will have all columns as objects and therefore return an empty DataFrame (unless the user provides RAISE5MIN="5"). This is not expected behaviour, so parsing occurs before filtering (see the sketch after this list). Datetimes can be filtered using user-provided datetime strings or datetime objects
    • GUI wrapper should have parse_data_types=False since GUI uses string joins
    • API users will have parse_data_types=True since operations on columns may require them to be numeric
  • Make modules, inner functions and variables in data_fetch_methods private and push key functions from data_fetch_methods into package namespace (i.e. so that from nemosis import dynamic_data_compiler is possible)
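A minimal sketch of the filtering behaviour described above; the filter_cols/filter_values convention (a list of columns paired with a tuple of value lists) follows NEMOSIS's existing README examples, and DISPATCHLOAD/RAISE5MIN are illustrative:

```python
from nemosis import dynamic_data_compiler

# Parsing happens before filtering, so a numeric filter value matches
# a numeric column. Without parsing, every column would be dtype object
# and the integer 5 would match nothing, silently returning an
# empty DataFrame.
raise_data = dynamic_data_compiler(
    "2018/01/01 00:00:00", "2018/01/01 23:55:00",
    "DISPATCHLOAD", "./nemosis_cache",
    filter_cols=["RAISE5MIN"],
    filter_values=([5],),  # numeric value, not "5"
)
```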

Code readability

  • Internals of dynamic_data_compiler and cache_compiler will be broken out into private functions.

Readme

  • Workflow section for API user
  • Rewrite dynamic_data_compiler section with more advanced filtering examples.
  • Include cache_compiler, with a note that it will delete csvs in a cache. However, if it detects pre-cached feather or parquet files, it will not do anything (e.g. if cache_compiler is run in the GUI cache, it will print that the cache has already been compiled). See the sketch after this list.
  • Remove submodule import - users can now directly import main functions from nemosis
  • Python syntax highlighting
  • Table of contents
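A sketch of the cache_compiler workflow the README will document, using the direct package-level imports introduced in this PR (the exact keyword arguments, e.g. fformat, are assumptions based on the description above):

```python
from nemosis import cache_compiler, dynamic_data_compiler

cache = "./nemosis_cache"

# Compile the cache once: downloads the AEMO CSVs, writes type-inferred
# parquet files and deletes the CSVs. If pre-cached feather or parquet
# files are detected, it only reports that the cache is already compiled.
cache_compiler(
    "2018/01/01 00:00:00", "2018/02/01 00:00:00",
    "DISPATCHLOAD", cache, fformat="parquet",
)

# Subsequent API calls read straight from the compiled cache and
# return a typed DataFrame.
dispatch_data = dynamic_data_compiler(
    "2018/01/01 00:00:00", "2018/01/01 23:55:00",
    "DISPATCHLOAD", cache, fformat="parquet",
)
```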

Changes to tests

  1. FCAS Causer Pays (4s data) tests were failing as the dates of the data to be downloaded were more than 60 days old (only 2 months of data are available). Modified tests to pull data based on the current date. Tests are skipped for year boundaries if the year boundary is more than 60 days ago. Tests also now check that the length of the Causer Pays file is appropriate +/- 1 entry (if the data starts at 00:00:03, for example, there is one less entry than would be calculated for data starting at 00:00:00).
  2. Test suite data dates updated to 2018, and dates across test suites overlapped - this reduces the amount of data that needs to be downloaded and hence improves testing speed. However, the size of downloaded data is still significant, so all dynamic_data_compiler calls in tests are set to release feather files and delete original CSVs.
  3. Changing test suite data is a problem where expected length is doubled due to intervention rows. Refactored test code to handle cases where interventions are an issue.
  4. Change the pandas testing import based on a deprecation warning (see the sketch below).
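The deprecation in item 4 most likely refers to pandas.util.testing, which pandas deprecated in favour of pandas.testing; a minimal sketch of the updated import:

```python
import pandas as pd

# Deprecated path: from pandas.util.testing import assert_frame_equal
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"INTERVENTION": [0, 1]})
assert_frame_equal(expected, expected.copy())
```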

Other

  • FCAS variables file URL and name changed to reflect AEMO website
  • data_fetch_methods.py, filters.py and test_data_fetch_methods.py styled (flake8)

Testing

  • Test suite run to ensure newer changes to data_fetch_methods work. Report for tests:
    Test Report.pdf

    • tests were modified (FCAS changes, updated and overlapped data dates, intervention handling) and commit f963eb0 passed
    • since f963eb0, caching and new parsing functionality have been incorporated; this new functionality should pass the tests as they stood at f963eb0
  • New changes tested for API (spot checks) with fresh install of Python on Ubuntu 20.04

    • with basic settings, dynamic_data_compiler downloads the DISPATCHLOAD csv and releases a feather file. The returned DataFrame is typed (which should happen for API users), but the saved feather file had columns as objects/strings.
    • "legacy" code will work (i.e. data_fetch_methods.dynamic_data_compiler vs just dynamic_data_compiler)
    • cache_compiler releases parquet/feather for DISPATCHLOAD and deletes the csv in the cache. The remaining file is typed. Different compression engines were passed to the write function and this worked. The file was then reloaded using dynamic_data_compiler, and this worked, with a typed DataFrame loaded.
  • Quick performance test:

    • %timeit data_fetch_methods.dynamic_data_compiler("2018/01/01 00:00:00", "2018/01/01 23:55:00", "DISPATCHLOAD", './alt_data') with a precompiled feather cache
    • f963eb0 (following initial fixes): 833 +/- 15 ms, 7 runs
    • final commit in this PR (5026915): 902 +/- 21.7 ms, 7 runs; the slowdown is likely due to additional if and try-except blocks, and is relatively negligible.
  • New changes tested for GUI (spot checks)

@prakaa prakaa changed the title CLI data type inferal, FCAS 4s test fixes, minor test refactoring + updates API type inferral, API caching function, test fixes May 22, 2021
@prakaa prakaa marked this pull request as draft May 22, 2021 01:55
@prakaa prakaa changed the title API type inferral, API caching function, test fixes API type inference, API caching function, test fixes May 22, 2021
@prakaa prakaa changed the title API type inference, API caching function, test fixes API functionality revamp, test fixes May 22, 2021
@prakaa prakaa changed the title API functionality revamp, test fixes API functionality revamp, text fixes, README revamp May 23, 2021
@prakaa prakaa marked this pull request as ready for review May 23, 2021 04:56
@prakaa (Contributor, Author) commented May 23, 2021

@nick-gorman see outline of all changes above. GUI still needs to be tested (checkbox unticked)

@nick-gorman (Member) commented:
Looks good Abi, I'll merge, compile the GUI, draft a release and publish to pypi

@nick-gorman nick-gorman merged commit c1ea130 into UNSW-CEEM:master May 24, 2021