Commit 318e9cf

docs: data validation
1 parent 2839a3d commit 318e9cf

2 files changed, +121 -0 lines changed

.vitepress/config.mts

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ export default defineConfig({
       text: 'Data Management',
       items: [
         { text: 'Downloading Market Data', link: '/data-downloading' },
+        { text: 'Inspecting & Validating Data', link: '/data-validation' },
       ]
     },
     {

data-validation.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Inspecting & Validating Data

The quality of your backtest results depends directly on the quality of your historical data. Before running a strategy, inspect your downloaded data to confirm that it is complete and consistent. Gaps, duplicates, or out-of-order records can produce misleading backtest results.

Stochastix provides a dedicated command-line tool, `stochastix:data:info`, to help you with this process.

## The `data:info` Command

This command reads a `.stchx` binary data file and displays its metadata and a sample of its contents. With the `--validate` option, it can also run a full consistency check on the data.

### Command Signature

```bash
make sf c="stochastix:data:info <file-path> [options]"
```

### Argument

* **`file-path`**: The full path to the `.stchx` file you want to inspect.

### Example

```bash
make sf c="stochastix:data:info data/market/binance/ETH_USDT/1d.stchx"
```

## Inspecting File Contents

When run without any options, the command provides a quick overview of the file:

1. **Header Metadata**: The key information from the file's header, such as the `Symbol`, the `Timeframe`, and the total `Number of Records` in the file.
2. **Data Sample**: The first 5 and last 5 records in the file. This gives you a quick sanity check that the timestamps and price ranges look correct.

```bash
📊 Stochastix STCHXBF1 File Information 📊
==========================================

File: /app/data/market/okx/ETH_USDT/1d.stchx
Size: 17,584 bytes

Header Metadata
---------------

------------------- ----------
 Magic Number        STCHXBF1
 Format Version      1
 Header Length       64
 Record Length       48
 Timestamp Format    1
 OHLCV Format        1
 Symbol              ETH/USDT
 Timeframe           1d
 Number of Records   365
------------------- ----------

Data Sample (Head & Tail)
-------------------------

------------ --------------------- ------------- ------------- ------------- ------------- ------------
 Timestamp    Date (UTC)            Open          High          Low           Close         Volume
------------ --------------------- ------------- ------------- ------------- ------------- ------------
 1672531200   2023-01-01 00:00:00   1,196.39000   1,204.70000   1,191.27000   1,200.43000   26,631.66
 1672617600   2023-01-02 00:00:00   1,200.27000   1,224.64000   1,192.90000   1,214.00000   75,316.11
 1672704000   2023-01-03 00:00:00   1,214.00000   1,220.00000   1,204.98000   1,214.51000   37,567.06
 1672790400   2023-01-04 00:00:00   1,214.51000   1,273.55000   1,212.73000   1,256.73000   175,177.68
 1672876800   2023-01-05 00:00:00   1,256.74000   1,259.98000   1,243.00000   1,251.34000   58,564.63
 ...          ...                   ...           ...           ...           ...           ...
 1703635200   2023-12-27 00:00:00   2,230.68000   2,392.94000   2,212.01000   2,378.35000   196,149.91
 1703721600   2023-12-28 00:00:00   2,378.36000   2,445.80000   2,335.27000   2,344.17000   223,327.62
 1703808000   2023-12-29 00:00:00   2,344.18000   2,385.27000   2,255.01000   2,299.15000   213,180.88
 1703894400   2023-12-30 00:00:00   2,299.14000   2,322.69000   2,267.72000   2,291.65000   97,952.85
 1703980800   2023-12-31 00:00:00   2,291.73000   2,321.39000   2,256.01000   2,282.13000   90,254.81
------------ --------------------- ------------- ------------- ------------- ------------- ------------
```
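
The header fields in the sample output can be cross-checked against the reported file size and the sampled timestamps. The following is a rough sketch, assuming the file is a single header followed by fixed-length records with no trailing bytes. That layout is inferred from the `Header Length` and `Record Length` fields and is not stated by the tool. The helper name `expected_file_size` is hypothetical.

```python
def expected_file_size(header_length: int, record_length: int, num_records: int) -> int:
    """Size in bytes, assuming one header followed by fixed-length records."""
    return header_length + record_length * num_records


# Values copied from the sample output.
size = expected_file_size(header_length=64, record_length=48, num_records=365)
print(size)  # 17584, which matches the reported "Size: 17,584 bytes"

# The first and last sampled timestamps are also consistent with 365 daily records.
SECONDS_PER_DAY = 86_400
first_ts, last_ts = 1672531200, 1703980800
record_count = (last_ts - first_ts) // SECONDS_PER_DAY + 1
print(record_count)  # 365
```

If either number disagreed with the header, that would be a hint to run the full validation described in the next section.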
## Validating Data Consistency

The most important feature of the `data:info` command is the `--validate` flag. With this option, the tool iterates through every record in the file and checks for common data quality issues.

```bash
make sf c="stochastix:data:info data/market/binance/ETH_USDT/1d.stchx --validate"
```

The validation checks for three types of problems:

1. **Gaps**: The time difference between each pair of consecutive records is checked. If it doesn't match the file's timeframe (e.g., 86,400 seconds for a `1d` file), it is flagged as a gap.
2. **Duplicates**: Any record whose timestamp is exactly the same as the previous record's is flagged.
3. **Out of Order**: Timestamps must always increase. Any timestamp that is less than the previous one is flagged.

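
The three checks can be sketched in a few lines of Python. This is an illustrative approximation, not Stochastix's actual implementation. The function `validate_timestamps`, its return shape, and the decision to classify zero and negative differences separately from gaps are all assumptions made for the example.

```python
def validate_timestamps(timestamps: list[int], expected_interval: int) -> dict[str, list[tuple[int, int]]]:
    """Classify each record by comparing its timestamp with the previous one.

    Returns (index, value) pairs. The value is the observed difference for
    gaps, and the offending timestamp for duplicates and out-of-order records.
    """
    issues = {"gaps": [], "duplicates": [], "out_of_order": []}
    for i in range(1, len(timestamps)):
        diff = timestamps[i] - timestamps[i - 1]
        if diff == 0:
            issues["duplicates"].append((i, timestamps[i]))
        elif diff < 0:
            issues["out_of_order"].append((i, timestamps[i]))
        elif diff != expected_interval:
            issues["gaps"].append((i, diff))
    return issues


day = 86_400
start = 1672531200  # 2023-01-01 00:00:00 UTC
# One day is skipped, then a timestamp repeats, then an earlier timestamp reappears.
sample = [start, start + day, start + 3 * day, start + 3 * day, start + day]
result = validate_timestamps(sample, expected_interval=day)
print(result)
```

For this sample, the sketch reports one gap at index 2 (a difference of 172,800 seconds), one duplicate at index 3, and one out-of-order record at index 4.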
### Interpreting the Validation Output

* **If the data is clean**, you will see an `[OK]` status:

```bash
🔍 Data Consistency Validation
-----------------------------

 [OK] Data appears consistent.
```

* **If problems are found**, you will see an `[ERROR]` status followed by a detailed list of every issue, including the index of the problematic record:

```bash
🔍 Data Consistency Validation
-----------------------------

 [ERROR] Found 2 issue(s).

Gaps:
-----
 ! [WARNING] - At index 452: Diff: 172800s, Expected: 86400s

Duplicates:
-----------
 ! [WARNING] - At index 788: Timestamp 1698883200
```
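
The `Diff` and `Expected` values in a gap warning indicate how many records are missing at that position. The following is a minimal sketch, assuming the gap is a whole multiple of the timeframe. The helper `missing_records` is hypothetical and is not part of Stochastix.

```python
def missing_records(diff_seconds: int, expected_seconds: int) -> int:
    """Number of records absent between two consecutive records."""
    if diff_seconds <= 0 or diff_seconds % expected_seconds != 0:
        raise ValueError("diff must be a positive whole multiple of the timeframe")
    return diff_seconds // expected_seconds - 1


# Values from the gap warning: Diff: 172800s, Expected: 86400s
missing = missing_records(172_800, 86_400)
print(missing)  # 1, so one daily record is missing before index 452
```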

::: tip Best Practice
Always run your downloaded data through the `--validate` check before using it for backtesting. Clean data is the bedrock of trustworthy results. If you find errors, it's best to re-download the data from the exchange or another source.
:::