Fix #400: Re-merge YAML & CLI #403

Merged · 189 commits · Jan 16, 2024
c5987e1
First stab at porting various functions over to polars... lots to go
idiom-bytes Nov 22, 2023
c1fedf7
TOHLCV df initialization and type checking added. 2/8 pdutil tests pa…
idiom-bytes Nov 22, 2023
2a0a94b
black formatted
idiom-bytes Nov 22, 2023
a579c4d
Fixing initialization and improving test. datetime is not generated i…
idiom-bytes Nov 22, 2023
c5efe9f
Restructured pdutil a bit to reduce DRY and utilize schema more stric…
idiom-bytes Nov 22, 2023
6c1f3ec
test initializing the df and datetime
idiom-bytes Nov 22, 2023
b7a60b2
improve init test to show exception without timestamp
idiom-bytes Nov 22, 2023
754aaee
fixing test_concat such that it verifies that schemas must match, and…
idiom-bytes Nov 22, 2023
fce6718
saving parquet enforces datetime and transform. updated test_load_app…
idiom-bytes Nov 22, 2023
f9e6cb8
black formatted
idiom-bytes Nov 22, 2023
d0dccaf
data_eng tests are passing
idiom-bytes Nov 22, 2023
026ad39
initial data_eng tests are passing w/ black, mypy, and pylint.
idiom-bytes Nov 22, 2023
2006f1a
_merge_parquet_dfs updated and create_xy test_1 is passing. all data_…
idiom-bytes Nov 23, 2023
933f1d8
2exch_2coins_2signals is passing
idiom-bytes Nov 23, 2023
6264c61
Added polars support for fill_nans, has_nans, and create_xy__handle_n…
idiom-bytes Nov 23, 2023
1b5b665
Starting to deprecate references to pandas and csv in data_factory.
idiom-bytes Nov 23, 2023
a511bd5
Black formatted
idiom-bytes Nov 23, 2023
444862b
Deprecated csv logic in DataFactory and created tests around get_hist…
idiom-bytes Nov 23, 2023
6666618
All tests should be passing.
idiom-bytes Nov 24, 2023
4e06884
Fix #370: YAML & CLI (#371)
trentmc Nov 24, 2023
80faf81
Update CI to use pdr instead of scripts/ (#399)
trizin Nov 24, 2023
25bfa3e
Replace long try/except with _safe*() function; rename pdutil -> plut…
trentmc Nov 27, 2023
148cb94
Update entrypoint script to use pdr cli (#406)
trizin Nov 27, 2023
6d7661a
Add main.py back (#404)
trizin Nov 27, 2023
43a43df
Merge from issue388-refactor-csvs-pandas, plus many changes
trentmc Nov 27, 2023
fa3b672
Merge branch 'yaml-cli2' of https://github.com/oceanprotocol/pdr-back…
trentmc Nov 27, 2023
f7bbca4
make black happy
trentmc Nov 27, 2023
27ac78e
small bug fix
trentmc Nov 27, 2023
97370c7
many bug fixes. Still >=1 left
trentmc Nov 27, 2023
036ca3e
fix warning
trentmc Nov 27, 2023
c96464d
Add support for polars where needed
trentmc Nov 27, 2023
6b4383f
tweak docstring
trentmc Nov 27, 2023
6ff502d
Fix #408: test_sim_engine failing in yaml-cli2, bc hist_df is s not m…
trentmc Nov 27, 2023
1b6556c
BaseContract tests that Web3PP type is input
trentmc Nov 27, 2023
4c78f72
goes with previous commit
trentmc Nov 27, 2023
24e4345
tweak - lowercase
trentmc Nov 27, 2023
9ca66e7
Bug fix - fix failing tests
trentmc Nov 27, 2023
e24cfcb
Remove unwanted file
trentmc Nov 28, 2023
9e9abc5
(a) better organize ppss.yaml for usability (b) ensure user isn't ann…
trentmc Nov 28, 2023
7855bef
add a more precise test for modeling
trentmc Nov 28, 2023
8c7d0c4
make black happy
trentmc Nov 28, 2023
516d6e9
Small refactor: make transform_df() part of helper routine
trentmc Nov 28, 2023
1dc2836
Fix #414: Split data_factory into (1) CEX -> parquet -> df (2) df -> …
trentmc Nov 28, 2023
4dc204c
Fix #415: test_cli_do_dfbuyer.py is hanging #415
trentmc Nov 28, 2023
67c2b96
test create_xy() even more. Clarify the order of timestamps
trentmc Nov 28, 2023
5ba906c
Add a model-building test, using data shaped like data from test_mode…
trentmc Nov 28, 2023
06aa946
Fix #416: [YAML branch] No Feeds Found - data_pp.py changes pair stan…
trentmc Nov 29, 2023
995df8b
For barge#391: update to *not* use barge's predictoor branch
trentmc Nov 29, 2023
97512a6
Merge branch 'main' into yaml-cli2
trentmc Nov 29, 2023
a1db778
Merge branch 'main' into yaml-cli2
trentmc Nov 30, 2023
4d6e8ff
Update vps.md: nicer order of operations
trentmc Nov 30, 2023
21a65d4
For #417, #418 in yaml-cli2 branch. publisher TUSD -> USDT
trentmc Nov 30, 2023
7fd76e3
Merge branch 'main' into yaml-cli2
trentmc Nov 30, 2023
50ab141
Merge branch 'main' into yaml-cli2
trentmc Nov 30, 2023
5a47226
Merge branch 'main' into yaml-cli2
trentmc Nov 30, 2023
49d546e
remove default_network from ppss.yaml (obsolete)
trentmc Nov 30, 2023
610062d
Fix #427 - time now
trentmc Nov 30, 2023
8156b30
Fix #428: test_get_hist_df - FileNotFoundError. Includes lots of extr…
trentmc Dec 1, 2023
bedf654
remove dependency that we don't need, which caused problems
trentmc Dec 1, 2023
151e79c
Merge branch 'main' into yaml-cli2
trentmc Dec 1, 2023
851dab0
Fix #421: Add cli + logic to calculate and plot traction metrics (PR …
idiom-bytes Dec 1, 2023
2d95c57
bug fix: YAML_FILE
trentmc Dec 1, 2023
ede6efe
fix breaking test; clean it up too
trentmc Dec 1, 2023
0e7b021
add barge-calls.md
trentmc Dec 1, 2023
7dd9d0f
Fix #433. Calculate metrics and draw plots for epoch-based stats (PR …
idiom-bytes Dec 4, 2023
de9b650
git merge main
trentmc Dec 5, 2023
1785713
Tweak barge-calls.md
trentmc Dec 6, 2023
9905f60
Tweak barge-calls.md: more compactly show RPC_URL calc
trentmc Dec 6, 2023
af6ce32
update stake_token
trentmc Dec 6, 2023
a3789f9
Merge branch 'yaml-cli2' of https://github.com/oceanprotocol/pdr-back…
trentmc Dec 6, 2023
59f3462
bug fix
trentmc Dec 6, 2023
206428a
Update release-process.md: bug fix
trentmc Dec 6, 2023
8e571d6
Tweak barge-calls.md
trentmc Dec 6, 2023
482e9fe
git merge main
trentmc Dec 6, 2023
909a5ab
Tune #405 (PR #406): Update entrypoint.sh script to use pdr CLI
trentmc Dec 6, 2023
9168f78
Update vps.md: docker doesn't need to prompt to delete
trentmc Dec 6, 2023
f5be9ac
Update vps.md: add docker-stop instrs
trentmc Dec 6, 2023
ac7670e
allow CLI to have NETWORK_OVERRIDE, for more flexibility from barge
trentmc Dec 6, 2023
784ed60
Merge branch 'yaml-cli2' of https://github.com/oceanprotocol/pdr-back…
trentmc Dec 6, 2023
ef2d6b2
fix pylint issue
trentmc Dec 6, 2023
52c11f3
Update barge-calls.md: link to barge.md
trentmc Dec 6, 2023
8e4bb22
Update release-process.md: fix typo
trentmc Dec 6, 2023
823c02f
touch
trentmc Dec 6, 2023
c73502a
Update vps.md: more instrs around waiting for barge to be ready
trentmc Dec 6, 2023
d53fbb0
add unit tests for cli_module
trentmc Dec 7, 2023
699fb1c
Merge branch 'yaml-cli2' of https://github.com/oceanprotocol/pdr-back…
trentmc Dec 7, 2023
5362ac2
Towards #437: [YAML] Publisher error 'You must set RPC_URL environmen…
trentmc Dec 7, 2023
93720d1
Bug fixes
trentmc Dec 7, 2023
f66cee2
refactor tweaks to predictoor and trader
trentmc Dec 7, 2023
3988f29
Clean up some envvar stuff. Document ppss vars better.
trentmc Dec 7, 2023
8f2f2d4
publish_assets.py now supports barge-pytest and barge-predictoor-bot
trentmc Dec 7, 2023
8addd57
bug fix
trentmc Dec 7, 2023
1dffc03
bug fix the previous 'bug fix'
trentmc Dec 7, 2023
720aa64
Clean up how dfbuyer/predictoor/trader agents get feeds: web3_pp.quer…
trentmc Dec 8, 2023
75b943b
fix breaking subgraph tests. Still breakage in trader & dfbuyer (that…
trentmc Dec 9, 2023
5b19f0b
Fix failing tests in trader, dfbuyer. And greatly speed up the tests…
trentmc Dec 10, 2023
711905e
Fix bugs for failing tests of https://github.com/oceanprotocol/pdr-ba…
trentmc Dec 10, 2023
f667f46
fix tmpdir bug
trentmc Dec 10, 2023
9f112fd
Fix (hopefully) failing unit test - restricted region in querying bin…
trentmc Dec 10, 2023
c9bbf41
consolidate gas_price setting, make it consistent; set gas_price to 0…
trentmc Dec 10, 2023
94c74b3
fix linter complaints
trentmc Dec 10, 2023
ae9fe21
Fix remaining failing unit tests for predictoor_batcher
trentmc Dec 10, 2023
d1eea21
Finish the consolidation of gas pricing. All tests pass
trentmc Dec 10, 2023
96d157f
Merge branch 'main' into yaml-cli2
trentmc Dec 11, 2023
63d9de3
Update vps.md: add debugging info
trentmc Dec 11, 2023
0053961
add to/from wei utility. Copied from ocean.py
trentmc Dec 11, 2023
c118763
tweak docs in conftest_ganache
trentmc Dec 11, 2023
21b8b2f
tweaks from black for wei
trentmc Dec 11, 2023
8f21c7b
Make fixed_rate.py and its test easier to understand via better var n…
trentmc Dec 11, 2023
21d0d89
Make predictoor_contract.py easier to understand via better var nami…
trentmc Dec 11, 2023
b0dc81d
test fixed_rate calcBaseInGivenOutDT
trentmc Dec 11, 2023
ffb693a
Refactor predictoor_contract: push utility methods out of the class, …
trentmc Dec 11, 2023
535de12
Tweak docstrings for fixed_rate.py
trentmc Dec 11, 2023
3f6d60f
Improve DX: show dev what the parameters are. Improve UX: print when …
trentmc Dec 11, 2023
f88a958
Improve DX & UX for predictoor_contract
trentmc Dec 11, 2023
bb5ab0f
Tweak UX (prints)
trentmc Dec 11, 2023
1df1da3
Update vps.md: export PATH
trentmc Dec 11, 2023
5524b1f
Logging for predictoor is way better: more calm yet more informative.…
trentmc Dec 11, 2023
ef02b65
Merge branch 'yaml-cli2' of https://github.com/oceanprotocol/pdr-back…
trentmc Dec 11, 2023
11b6bd0
TraderAgent -> BaseTraderAgent
trentmc Dec 12, 2023
4460259
Rename parquet_dfs -> rawohlcv_dfs; hist_df -> mergedohlcv_df; update…
trentmc Dec 12, 2023
568f93e
apply black to test_plutil.py
trentmc Dec 12, 2023
055f0c7
apply black to test_model_data_factory.py
trentmc Dec 12, 2023
e564f75
apply black to ohlcv_data_factory.py
trentmc Dec 12, 2023
1559ec3
refactor test_ohlcv_data_factory: cleanup mocks; remove redundant tes…
trentmc Dec 12, 2023
f11d881
Fix #443: [YAML] yaml timescale is 5m, yet predictoor logs s_per_epoc…
trentmc Dec 12, 2023
833ec16
Update feed str() to give full address; and order to be similar to pr…
trentmc Dec 12, 2023
d45cd34
Small bug fix: not printing properly
trentmc Dec 12, 2023
e0de941
Tweak: logging in predictoor_contract.py
trentmc Dec 12, 2023
c3fe7b9
Tweak: logging in trueval_agent_single.py
trentmc Dec 12, 2023
20171b6
Two bug fixes: pass in web3_pp not web3_config to PredictoorContract …
trentmc Dec 12, 2023
a3948cc
enhance typechecking
trentmc Dec 12, 2023
54a2c96
tweak payout.py: make args passed more obvious
trentmc Dec 12, 2023
626e773
fix broken unit test
trentmc Dec 12, 2023
c67fd2d
make black happy
trentmc Dec 12, 2023
9644408
fix breaking unit test
trentmc Dec 12, 2023
3e045c0
Tweak predictoor_contract DX & UX
trentmc Dec 12, 2023
9bfb404
Improve trueval: Have fewer layers of try/except, better DX via docst…
trentmc Dec 12, 2023
9d620cf
Rename TruevalAgentBase -> BaseTruevalAgent
trentmc Dec 12, 2023
0137395
(a) Fix #445: merge 3 trueval agent files into 1. (b) Fix #448 contra…
trentmc Dec 13, 2023
9aa6c49
Fix #450: test_contract_main[barge-pytest] fails
trentmc Dec 13, 2023
addf921
renaming pq_data_factory to ohlcv_data_factory
idiom-bytes Dec 14, 2023
6874ead
Removing all TODOs
idiom-bytes Dec 14, 2023
422b12e
Fix #452: Add clean code guidelines README
trentmc Dec 14, 2023
8bca663
Merge branch 'yaml-cli2' of https://github.com/oceanprotocol/pdr-back…
trentmc Dec 14, 2023
7e8a87e
removing dangling _ppss() inside predictoor_agent_runner.py
idiom-bytes Dec 14, 2023
e4fadc6
Fixing linter
idiom-bytes Dec 14, 2023
9f33dc2
Fix #454: Refactor: Rename MEXCOrder -> MexcOrder, ERC721Factory
trentmc Dec 15, 2023
67a8cde
Fix #455: Cyclic import issue
trentmc Dec 15, 2023
8fe215d
Fix #454 redux: the first commit introduced a bug, this one fixes the…
trentmc Dec 15, 2023
d3029a7
Fix #436 - Implement GQL data factory (PR #438)
idiom-bytes Dec 15, 2023
d759284
Fix #350: [Sim] Tweaks to plot title
trentmc Dec 17, 2023
f70aece
Merge branch 'main' into yaml-cli2
trentmc Dec 17, 2023
3c0b3e3
make black happy
trentmc Dec 17, 2023
974799a
Fix #446: [YAML] Rename/move files & dirs for proper separation among…
trentmc Dec 17, 2023
bbfef0d
Fix #459: In CI, polars error: col timestamp_right already exists (#460)
trentmc Dec 18, 2023
a4c117b
Fix #397: Remove need to specify 'stake_token' in ppss.yaml (#461)
trentmc Dec 18, 2023
aa999b7
Docs fixes (#456)
calina-c Dec 19, 2023
80e9fb6
Make Feeds objects instead of tuples. (#464)
calina-c Dec 20, 2023
b5496c6
Move and rename utils (#467)
calina-c Dec 20, 2023
1dec231
Objectify pairstr. (#470)
calina-c Dec 21, 2023
bd01d71
Towards #462: Separate lake and aimodel SS, lake command (#473)
calina-c Dec 29, 2023
a0244c9
[Lake] integrate pdr_subscriptions into GQL Data Factory (#469)
idiom-bytes Jan 4, 2024
80725f8
Improve DRY (#475)
calina-c Jan 4, 2024
1fb88ac
Add Code climate. (#484)
calina-c Jan 5, 2024
243ae7e
Adds manual trigger to pytest workflow.
calina-c Jan 5, 2024
3a5c639
issue483: move the logic from subgraph_slot.py (#489)
kdetry Jan 9, 2024
56392e4
Add some test coverage (#488)
calina-c Jan 9, 2024
27f94d2
git merge main
trentmc Jan 10, 2024
cd69814
Fix #501: ModuleNotFoundError: No module named 'flask' (PR #504)
trentmc Jan 10, 2024
77d7b4c
Fix #509: Refactor test_update_rawohlcv_files (PR #508)
trentmc Jan 10, 2024
ff66cbf
Merge branch 'main' into yaml-cli2
trizin Jan 10, 2024
69d3c8e
Fix #505: polars.exceptions.ComputeError: datatypes of join keys don'…
trentmc Jan 10, 2024
4959b5d
Fix #517: aimodel_data_factory.py missing data: binance:BTC/USDT:None…
trentmc Jan 11, 2024
474ec9c
Towards #494: Improve coverage 2 (#498)
calina-c Jan 11, 2024
3f24fdd
Fix #519: aimodel_data_factory.py missing data col: binance:ETH/USDT:…
trentmc Jan 12, 2024
e21f9a2
Replace `dftool` with `pdr` (#522)
trizin Jan 12, 2024
413d065
Fix #525: Plots pop up unwanted in tests. (PR #528)
calina-c Jan 13, 2024
fdb4545
Issue 519 feed dependencies (#529)
calina-c Jan 13, 2024
63891cd
Update to #519: remove do_verify, it's redundant (#532)
trentmc Jan 13, 2024
3eae232
Fix #507: fix asyncio issues (PR #531)
calina-c Jan 13, 2024
35a2799
#413 - YAML thorough system level tests (#527)
trizin Jan 15, 2024
11f05d2
Adds incremental waiting for subgraph tries. (#534)
calina-c Jan 15, 2024
9402e5f
Add publisher feeds filtering. (#533)
calina-c Jan 15, 2024
995a103
Pass the ppss.web3_pp instead of web3_config into WrappedToken class …
trizin Jan 15, 2024
9422c82
Fix #542: Add code climate usage to developer flow READMEs
trentmc Jan 16, 2024
d14c702
#538 - check network main subgraph query fails (#539)
trizin Jan 16, 2024
11b7244
#540 - YAML CLI topup and check network actions require address file …
trizin Jan 16, 2024
a851f79
Remove predictoor2 ref from pytest
trizin Jan 16, 2024
Fix #408: test_sim_engine failing in yaml-cli2, bc hist_df is s not ms. Proper testing and documentation were added as part of the fix.
trentmc committed Nov 27, 2023
commit 6ff502d865b12a63e44730eb86e6c23dd33f05aa
95 changes: 71 additions & 24 deletions pdr_backend/data_eng/data_factory.py
@@ -1,6 +1,6 @@
import os
import sys
from typing import Dict, List, Union
from typing import Dict, List, Tuple, Union

from enforce_typing import enforce_types
import numpy as np
@@ -32,18 +32,59 @@

@enforce_types
class DataFactory:
"""
Roles:
- From each CEX API, fill >=1 parquet_dfs -> parquet files data lake
- From parquet_dfs, fill 1 hist_df -- historical data across all CEXes
- From hist_df, create (X, y, x_df) -- for model building

Where:
parquet_dfs -- dict of [exch_str][pair_str] : df
And df has columns of: "open", "high", .., "volume", "datetime"
(and index = timestamp)

hist_df -- polars DataFrame with cols like:
"timestamp",
"binanceus:ETH-USDT:open",
"binanceus:ETH-USDT:high",
"binanceus:ETH-USDT:low",
"binanceus:ETH-USDT:close",
"binanceus:ETH-USDT:volume",
...
"datetime",
(and no index)

And:
X -- 2d array of [sample_i, var_i] : value -- inputs for model
y -- 1d array of [sample_i] -- target outputs for model

x_df -- *pandas* DataFrame with cols like:
"binanceus:ETH-USDT:open:t-3",
"binanceus:ETH-USDT:open:t-2",
"binanceus:ETH-USDT:open:t-1",
"binanceus:ETH-USDT:high:t-3",
"binanceus:ETH-USDT:high:t-2",
"binanceus:ETH-USDT:high:t-1",
...
"datetime",
(and index = 0, 1, .. -- nothing special)

Finally:
- "timestamp" values are ut: int is unix time, UTC, in ms (not s)
- "datetime" values are python datetime.datetime, UTC
"""
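The "Finally" convention in the docstring above (ut = unix time as int, UTC, in ms) is easy to get wrong by a factor of 1000. A stdlib-only sketch of the conversion; `ut_to_dt` / `dt_to_ut` are illustrative helper names, not functions from this repo:

```python
from datetime import datetime, timezone

def ut_to_dt(ut_ms: int) -> datetime:
    # "timestamp" values are unix time in ms (not s), UTC
    return datetime.fromtimestamp(ut_ms / 1000, tz=timezone.utc)

def dt_to_ut(dt: datetime) -> int:
    # inverse: python datetime (UTC) -> unix time in ms
    return int(dt.timestamp() * 1000)

ut = 1686805500000  # a "timestamp" value like those in the tests below
dt = ut_to_dt(ut)
print(dt.isoformat())      # 2023-06-15T05:05:00+00:00
assert dt_to_ut(dt) == ut  # round-trips exactly
```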

def __init__(self, pp: DataPP, ss: DataSS):
self.pp = pp
self.ss = ss

def get_hist_df(self) -> pd.DataFrame:
def get_hist_df(self) -> pl.DataFrame:
"""
@description
Get historical dataframe, across many exchanges & pairs.

@return
hist_df -- df w/ cols={exchange_str}:{pair_str}:{signal}+"datetime",
and index=timestamp
hist_df -- *polars* Dataframe. See class docstring
"""
print("Get historical data, across many exchanges & pairs: begin.")

@@ -60,7 +101,10 @@ def get_hist_df(self) -> pd.DataFrame:
hist_df = self._merge_parquet_dfs(parquet_dfs)

print("Get historical data, across many exchanges & pairs: done.")
return hist_df.to_pandas()

# postconditions
assert isinstance(hist_df, pl.DataFrame)
return hist_df

def _update_parquet(self, fin_ut: int):
print(" Update parquet.")
@@ -203,11 +247,10 @@ def _load_parquet(self, fin_ut: int) -> Dict[str, Dict[str, pl.DataFrame]]:
def _merge_parquet_dfs(self, parquet_dfs: dict) -> pl.DataFrame:
"""
@arguments
parquet_dfs -- dict [exch_str][pair_str] : df
where df has cols={signal_str}+"datetime", and index=timestamp
parquet_dfs -- see class docstring

@return
hist_df -- df w/ cols={exch_str}:{pair_str}:{signal_str}+"datetime",
and index=timestamp
hist_df -- see class docstring
"""
# init hist_df such that it can do basic operations
print(" Merge parquet DFs.")
@@ -257,31 +300,34 @@ def _merge_parquet_dfs(self, parquet_dfs: dict) -> pl.DataFrame:
# TO DO: Move to model_factory/model + use generic df<=>serialize<=>parquet
def create_xy(
self,
hist_df: pd.DataFrame, # not a pl.DataFrame, by design
hist_df: pl.DataFrame,
testshift: int,
do_fill_nans: bool = True,
):
) -> Tuple[np.ndarray, np.ndarray, pd.DataFrame]:
"""
@arguments
hist_df -- df w cols={exch_str}:{pair_str}:{signal_str}+"datetime",
and index=timestamp
hist_df -- *polars* DataFrame. See class docstring
testshift -- to simulate across historical test data
do_fill_nans -- if any values are nan, fill them? (Via interpolation)
If you turn this off and hist_df has nans, then X/y/etc gets nans

@return --
X -- 2d array of [sample_i, var_i] : value
y -- 1d array of [sample_i]
x_df -- df w/ cols={exch_str}:{pair_str}:{signal}:t-{x} + "datetime"
index=0,1,.. (nothing special)
X -- 2d array of [sample_i, var_i] : value -- inputs for model
y -- 1d array of [sample_i] -- target outputs for model
x_df -- *pandas* DataFrame. See class docstring.
"""
if not isinstance(hist_df, pd.DataFrame):
raise ValueError("hist_df should be a pd.DataFrame")
# preconditions
assert isinstance(hist_df, pl.DataFrame), pl.__class__
assert "timestamp" in hist_df.columns
assert "datetime" in hist_df.columns

# condition inputs
if do_fill_nans and has_nan(hist_df):
hist_df = fill_nans(hist_df)

ss = self.ss
x_df = pd.DataFrame()

# main work
x_df = pd.DataFrame() # build this up

target_hist_cols = [
f"{exch_str}:{pair_str}:{signal_str}"
Expand All @@ -290,7 +336,7 @@ def create_xy(

for hist_col in target_hist_cols:
assert hist_col in hist_df.columns, f"missing data col: {hist_col}"
z = hist_df[hist_col].tolist() # [..., z(t-3), z(t-2), z(t-1)]
z = hist_df[hist_col].to_list() # [..., z(t-3), z(t-2), z(t-1)]
maxshift = testshift + ss.autoregressive_n
N_train = min(ss.max_n_train, len(z) - maxshift - 1)
if N_train <= 0:
@@ -315,13 +361,14 @@
# eg y = [BinEthC_-1, BinEthC_-2, ..., BinEthC_-450, BinEthC_-451]
pp = self.pp
hist_col = f"{pp.exchange_str}:{pp.pair_str}:{pp.signal_str}"
z = hist_df[hist_col].tolist()
z = hist_df[hist_col].to_list()
y = np.array(_slice(z, -testshift - N_train - 1, -testshift))

# postconditions
assert X.shape[0] == y.shape[0]
assert X.shape[0] <= (ss.max_n_train + 1)
assert X.shape[1] == ss.n
assert isinstance(x_df, pd.DataFrame)

# return
return X, y, x_df
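The windowing in create_xy() above (each series z sliced into t-n..t-1 input columns, plus a target shifted back by testshift to simulate historical test data) can be sketched in plain numpy. `make_lagged_xy` and `slice_neg` are illustrative stand-ins mirroring `_slice()`, not the repo's exact helpers:

```python
import numpy as np

def slice_neg(z: list, st: int, fin: int) -> list:
    # like the diff's _slice(): negative indices, fin == 0 means "through the end"
    return [z[i] for i in range(st, fin)]

def make_lagged_xy(z: list, ar_n: int, testshift: int, max_n_train: int):
    """Build X rows of [t-ar_n, ..., t-1] values and y targets at t,
    all shifted back `testshift` steps to simulate historical test data."""
    maxshift = testshift + ar_n
    N_train = min(max_n_train, len(z) - maxshift - 1)
    assert N_train > 0, "not enough data"
    X = np.column_stack([
        slice_neg(z, -(testshift + d) - N_train - 1, -(testshift + d))
        for d in range(ar_n, 0, -1)  # columns ordered t-ar_n, ..., t-1
    ])
    y = np.array(slice_neg(z, -testshift - N_train - 1, -testshift))
    return X, y

z = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]  # newest value last
X, y = make_lagged_xy(z, ar_n=3, testshift=0, max_n_train=5)
print(X[-1].tolist(), y[-1])  # [4, 3, 2] 1  -- mirrors the tests below
X, y = make_lagged_xy(z, ar_n=3, testshift=1, max_n_train=5)
print(X[-1].tolist(), y[-1])  # [5, 4, 3] 2
```

Shifting testshift by 1 slides every window one step into the past, which is exactly how the tests below probe different test points.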
@@ -366,7 +413,7 @@ def safe_fetch_ohlcv(
exch -- eg ccxt.binanceus()
symbol -- eg "BTC/USDT". NOT "BTC-USDT"
timeframe -- eg "1h", "1m"
since -- Timestamp of first candle. In unix time (in ms)
since -- timestamp of first candle. In unix time (in ms)
limit -- max # candles to retrieve

@return
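The safe_fetch_ohlcv() contract documented above (fetch candles, never raise) can be sketched as below. The stub exchange and the exact error behavior (print a warning, return None) are assumptions for illustration; a real `exch` would be e.g. `ccxt.binanceus()`:

```python
def safe_fetch_ohlcv(exch, symbol: str, timeframe: str, since: int, limit: int):
    """Return a list of [ut_ms, open, high, low, close, volume] candles,
    or None if the exchange call raises."""
    try:
        return exch.fetch_ohlcv(symbol=symbol, timeframe=timeframe,
                                since=since, limit=limit)
    except Exception as e:
        print(f"  **WARNING: exchange error {e}; returning None")
        return None

class StubExchange:
    """Stands in for a ccxt exchange object in this sketch."""
    def fetch_ohlcv(self, symbol, timeframe, since, limit):
        if "/" not in symbol:
            raise ValueError(f"bad symbol {symbol}")  # wants "BTC/USDT", NOT "BTC-USDT"
        return [[since, 9.0, 11.0, 8.0, 10.0, 100.0]]

print(safe_fetch_ohlcv(StubExchange(), "BTC/USDT", "5m", 1686805500000, 1))
print(safe_fetch_ohlcv(StubExchange(), "BTC-USDT", "5m", 1686805500000, 1))  # warns, then None
```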
93 changes: 53 additions & 40 deletions pdr_backend/data_eng/test/test_data_factory.py
@@ -62,7 +62,7 @@ def _data_ss(parquet_dir, input_feeds, st_timestr=None, fin_timestr=None):


@enforce_types
def _assert_shapes(ss: DataSS, X: np.ndarray, y: np.ndarray, x_df: pd.DataFrame):
def _assert_pd_df_shape(ss: DataSS, X: np.ndarray, y: np.ndarray, x_df: pd.DataFrame):
assert X.shape[0] == y.shape[0]
assert X.shape[0] == (ss.max_n_train + 1) # 1 for test, rest for train
assert X.shape[1] == ss.n
@@ -108,10 +108,14 @@ def test_update_parquet5(tmpdir):

@enforce_types
def _test_update_parquet(st_timestr: str, fin_timestr: str, tmpdir, n_uts):
"""n_uts -- expected # timestamps. Typically int. If '>1K', expect >1000"""
"""
@arguments
n_uts -- expected # timestamps. Typically int. If '>1K', expect >1000
"""

# setup: uts helpers
def _calc_ut(since: int, i: int) -> int:
"""Return a ut : unix time, in ms, in UTC time zone"""
return since + i * MS_PER_5M_EPOCH

def _uts_in_range(st_ut: int, fin_ut: int) -> List[int]:
@@ -226,20 +230,40 @@ def _addval(DATA: list, val: float) -> list:
}


@enforce_types
def test_hist_df_shape(tmpdir):
_, _, data_factory = _data_pp_ss_1feed(tmpdir, "binanceus h ETH-USDT")
hist_df = data_factory._merge_parquet_dfs(ETHUSDT_PARQUET_DFS)
assert isinstance(hist_df, pl.DataFrame)
assert hist_df.columns == [
"timestamp",
"binanceus:ETH-USDT:open",
"binanceus:ETH-USDT:high",
"binanceus:ETH-USDT:low",
"binanceus:ETH-USDT:close",
"binanceus:ETH-USDT:volume",
"datetime",
]
assert hist_df.shape == (12, 7)
assert len(hist_df["timestamp"]) == 12
assert ( # pylint: disable=unsubscriptable-object
hist_df["timestamp"][0] == 1686805500000
)


@enforce_types
def test_create_xy__input_type(tmpdir):
# hist_df should be pl
_, _, data_factory = _data_pp_ss_1feed(tmpdir, "binanceus h ETH-USDT")
hist_df = data_factory._merge_parquet_dfs(ETHUSDT_PARQUET_DFS)
assert not isinstance(hist_df, pd.DataFrame)
assert isinstance(hist_df, pl.DataFrame)

# create_xy() input should be pd
data_factory.create_xy(hist_df.to_pandas(), testshift=0)
# create_xy() input should be pl
data_factory.create_xy(hist_df, testshift=0)

# create_xy() inputs shouldn't be pl
with pytest.raises(ValueError):
data_factory.create_xy(hist_df, testshift=0)
# create_xy() inputs shouldn't be pd
with pytest.raises(AssertionError):
data_factory.create_xy(hist_df.to_pandas(), testshift=0)


@enforce_types
@@ -248,9 +272,8 @@ def test_create_xy__1exchange_1coin_1signal(tmpdir):
hist_df = data_factory._merge_parquet_dfs(ETHUSDT_PARQUET_DFS)

# =========== initial testshift (0)
# At model level, we use pandas not polars. Hence "to_pandas()"
X, y, x_df = data_factory.create_xy(hist_df.to_pandas(), testshift=0)
_assert_shapes(ss, X, y, x_df)
X, y, x_df = data_factory.create_xy(hist_df, testshift=0)
_assert_pd_df_shape(ss, X, y, x_df)

assert X[-1, :].tolist() == [4, 3, 2] and y[-1] == 1
assert X[-2, :].tolist() == [5, 4, 3] and y[-2] == 2
@@ -269,9 +292,9 @@ def test_create_xy__1exchange_1coin_1signal(tmpdir):
assert x_df["binanceus:ETH-USDT:high:t-2"].tolist() == [9, 8, 7, 6, 5, 4, 3, 2]
assert X[:, 2].tolist() == [9, 8, 7, 6, 5, 4, 3, 2]

# =========== now have a different testshift (1 not 0). Note "to_pandas()"
X, y, x_df = data_factory.create_xy(hist_df.to_pandas(), testshift=1)
_assert_shapes(ss, X, y, x_df)
# =========== now have a different testshift (1 not 0)
X, y, x_df = data_factory.create_xy(hist_df, testshift=1)
_assert_pd_df_shape(ss, X, y, x_df)

assert X[-1, :].tolist() == [5, 4, 3] and y[-1] == 2
assert X[-2, :].tolist() == [6, 5, 4] and y[-2] == 3
@@ -290,11 +313,11 @@ def test_create_xy__1exchange_1coin_1signal(tmpdir):
assert x_df["binanceus:ETH-USDT:high:t-2"].tolist() == [10, 9, 8, 7, 6, 5, 4, 3]
assert X[:, 2].tolist() == [10, 9, 8, 7, 6, 5, 4, 3]

# =========== now have a different max_n_train. Note "to_pandas()"
# =========== now have a different max_n_train
ss.d["max_n_train"] = 5

X, y, x_df = data_factory.create_xy(hist_df.to_pandas(), testshift=0)
_assert_shapes(ss, X, y, x_df)
X, y, x_df = data_factory.create_xy(hist_df, testshift=0)
_assert_pd_df_shape(ss, X, y, x_df)

assert X.shape[0] == 5 + 1 # +1 for one test point
assert y.shape[0] == 5 + 1
@@ -330,8 +353,8 @@ def test_create_xy__2exchanges_2coins_2signals(tmpdir):

data_factory = DataFactory(pp, ss)
hist_df = data_factory._merge_parquet_dfs(parquet_dfs)
X, y, x_df = data_factory.create_xy(hist_df.to_pandas(), testshift=0)
_assert_shapes(ss, X, y, x_df)
X, y, x_df = data_factory.create_xy(hist_df, testshift=0)
_assert_pd_df_shape(ss, X, y, x_df)

found_cols = x_df.columns.tolist()
target_cols = [
@@ -405,23 +428,19 @@ def test_create_xy__handle_nan(tmpdir):
# =========== initial testshift (0)
# run create_xy() and force the nans to stick around
# -> we want to ensure that we're building X/y with risk of nan
X, y, x_df = data_factory.create_xy(
hist_df.to_pandas(), testshift=0, do_fill_nans=False
)
X, y, x_df = data_factory.create_xy(hist_df, testshift=0, do_fill_nans=False)
assert has_nan(X) and has_nan(y) and has_nan(x_df)

# nan approach 1: fix externally
hist_df2 = fill_nans(hist_df)
assert not has_nan(hist_df2)

# nan approach 2: explicitly tell create_xy to fill nans
X, y, x_df = data_factory.create_xy(
hist_df.to_pandas(), testshift=0, do_fill_nans=True
)
X, y, x_df = data_factory.create_xy(hist_df, testshift=0, do_fill_nans=True)
assert not has_nan(X) and not has_nan(y) and not has_nan(x_df)

# nan approach 3: create_xy fills nans by default (best)
X, y, x_df = data_factory.create_xy(hist_df.to_pandas(), testshift=0)
X, y, x_df = data_factory.create_xy(hist_df, testshift=0)
assert not has_nan(X) and not has_nan(y) and not has_nan(x_df)
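All three nan-handling paths above bottom out in fill-by-interpolation. A stdlib-only sketch of what that means for one series; `fill_nans_1d` is an illustrative stand-in, not the repo's polars-based fill_nans():

```python
import math

def fill_nans_1d(vals):
    """Linearly interpolate interior NaNs; copy edge values outward."""
    out = list(vals)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    assert known, "an all-NaN series cannot be filled"
    for i in range(len(out)):
        if not math.isnan(out[i]):
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:       # leading NaNs: back-fill
            out[i] = out[nxt]
        elif nxt is None:      # trailing NaNs: forward-fill
            out[i] = out[prev]
        else:                  # interior NaNs: linear interpolation
            frac = (i - prev) / (nxt - prev)
            out[i] = out[prev] + frac * (out[nxt] - out[prev])
    return out

nan = float("nan")
print(fill_nans_1d([nan, 1.0, nan, nan, 4.0, nan]))
```

For the input above this yields approximately [1.0, 1.0, 2.0, 3.0, 4.0, 4.0]: edges copied, interior values interpolated between the known points 1.0 and 4.0.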


Expand Down Expand Up @@ -451,7 +470,7 @@ def mock_merge_parquet_dfs(*args, **kwargs): # pylint: disable=unused-argument

# call and assert
hist_df = data_factory.get_hist_df()
assert isinstance(hist_df, pd.DataFrame)
assert isinstance(hist_df, pl.DataFrame)
assert len(hist_df) == 3

assert mock_update_parquet.called
@@ -481,7 +500,6 @@ def mock_merge_parquet_dfs(*args, **kwargs):  # pylint: disable=unused-argument

# call and assert
hist_df = data_factory.get_hist_df()
assert isinstance(hist_df, pd.DataFrame)
assert len(hist_df) == 3

assert mock_update_parquet.called
@@ -503,21 +521,18 @@ def test_get_hist_df(tmpdir):
)
data_factory = DataFactory(pp, ss)

hist_df = data_factory.get_hist_df()

# call and assert
hist_df = data_factory.get_hist_df()
assert isinstance(hist_df, pd.DataFrame)

# 289 records created
assert len(hist_df) == 289

# binanceus is returning valid data
assert hist_df["binanceus:BTC-USDT:high"].isna().sum() == 0
assert hist_df["binanceus:ETH-USDT:high"].isna().sum() == 0
assert not has_nan(hist_df["binanceus:BTC-USDT:high"])
assert not has_nan(hist_df["binanceus:ETH-USDT:high"])

# kraken is returning nans
assert hist_df["kraken:BTC-USDT:high"].isna().sum() == 289
assert has_nan(hist_df["kraken:BTC-USDT:high"])

# assert head is oldest
head_timestamp = hist_df.head(1)["timestamp"].to_list()[0]
Expand All @@ -537,16 +552,14 @@ def test_exchange_hist_overlap(tmpdir):

# call and assert
hist_df = data_factory.get_hist_df()
assert isinstance(hist_df, pd.DataFrame)

# 289 records created
assert len(hist_df) == 289

# assert head is oldest and tail is latest
assert (
hist_df.head(1)["timestamp"].to_list()[0]
< hist_df.tail(1)["timestamp"].to_list()[0]
)
# assert head is oldest
head_timestamp = hist_df.head(1)["timestamp"].to_list()[0]
tail_timestamp = hist_df.tail(1)["timestamp"].to_list()[0]
assert head_timestamp < tail_timestamp

# let's get more data from exchange with overlap
_, _, data_factory2 = _data_pp_ss_1feed(