Skip to content

Latest commit

 

History

History
280 lines (193 loc) · 19 KB

README.md

File metadata and controls

280 lines (193 loc) · 19 KB

pySecMaster

An automated framework to store and maintain financial data.

AGPLv3

The goal of the system is to have a central repository of interrelated finance data that can be used for strategy backtests and live trading systems.

Data types that can be stored includes historical and live stock prices (daily, minute, tick), option chains, corporate actions, economic events, IPO pricings and financial statements.

Contents:

Quick Start Guides

With Docker (Recommended)

  1. Install Docker Community Edition (CE) on your computer

  2. Clone (or download) this repository to your computer

  3. Navigate to pySecMaster/pySecMaster

  4. Open the /utilities/user_dir.py file: - Change the default name (josh) to your system user name (line 30) - Add your Quandl API key to both the root and your user sections (lines 28 and 50) - [optional] Change the main_password (lines 16 and 38) and pysecmaster_password (lines 23 and 45) in both the root and your user sections; Also change the passwords within the pySecMaster/pySecMaster/database/init.sql file on lines 2 and 12 to the same respective passwords you used in the user_dir.py file; You can also change or remove the special remote_users password and users created between lines 16 and 33 within the init.sql file

  5. Run docker-compose up -d postgres to start the Postgres container running as a daemon, where it will continue running in the background.

  6. Run docker-compose up app to run the pySecMaster code. You would run this to update the prices daily. If you change a script, run docker-compose up --build app to have the updated script loaded into the pySecMaster container. The system will utilize the variables specified here. To view the arguments you can provide, either run pySecMaster/pySecMaster/pySecMaster.py -h or view the arguments here.

Without Docker using the Script

  1. Download and install both PostgreSQL and Psycopg2 to your computer. Installing psycopg2 on Windows can be challenging, but I found it easy to use the wheel provided on Christoph Gohlke's Windows Binaries for Python page.

  2. Create a new user in Postgres to use with pySecMaster (i.e. pysecmaster)

  3. Clone or download this repository to your computer

  4. Navigate to pySecMaster/pySecMaster

  5. Open the /utilities/user_dir.py file: - Change the default name (josh) to your system user name (line 30) - Add your Quandl API key to both the root and your user sections (lines 28 and 50) - [optional] Change the main_password (lines 16 and 38) and pysecmaster_password (lines 23 and 45) in both the root and your user sections; Also change the passwords within the pySecMaster/pySecMaster/database/init.sql file on lines 2 and 12 to the same respective passwords you used in the user_dir.py file; You can also change or remove the special remote_users password and users created between lines 16 and 33 within the init.sql file

  6. Install the python libraries with pip3 install -r requirements.txt

  7. Run python pySecMaster/pySecMaster/pySecMaster.py --daily-downloads quandl.wiki -v for the system to start building itself. It'll download Quandl daily prices and run the cross validator for all price values. To view the arguments you can provide, either run pySecMaster/pySecMaster/pySecMaster.py -h or view the arguments here

Retrieve Database Values

  1. To retrieve the data from the PostgreSQL database, open the pySecMaster/pySecMaster/query_data.py file in a code editor (Vim, PyCharm, Sublime, etc.)

  2. Navigate to the query options (lines 242 - 250): change any of the options within this section to alter the query. Be aware that certain variables may be ignored depending on what type of query is run (i.e. minute data only comes from Google Finance). It is possible to retrieve very specific data by writing a custom SQL query. This file also includes a price adjustment calculation, which calculates the correct historical adjusted prices based on the dividend and splits. By default the data is returned as a pandas DataFrame, which can be manipulated to any format (visual, CSV, JSON, chart, etc.), or even sent to another file for further processing.

  3. You can now save and run python pySecMaster/pySecMaster/query_data.py

Database

This system utilizes PostgreSQL for the database engine. Postgres provides an extremely flexible yet powerful database experience. Furthermore, Postgres allows the database to be stored on a remote server, accessible by multiple users.

Currently, there are 24 tables that make up the pySecMaster database. You can view the database table structure here.

TSID

All of the data tables utilize a custom symbol ID, called a tsid (trading system ID). This allows for consistent data nomenclature across the system.

TSID Structure

The tsid structure is composed of the following (Note 1):

<ticker>.<tsid exchange abbreviation>.<integer of duplicate>

Since Apple (AAPL) is traded on NASDAQ (tsid exchange abbreviation is Q), it's tsid symbol is:

AAPL.Q.0

Walmart (WMT) is traded on NYSE (tsid exchange abbreviation is N), thus it's tsid symbol is:

WMT.N.0

TSID Creation

The tsid creation process requires a unique ID as the backbone. At the moment, the CSI Data's CSI Number system is used as the backbone for ensuring that there are no tsid duplicates. It is possible to use another vendor's ID structure as the backbone (Bloomberg, RIC, etc.), or create a custom one (using a predefined base).

The biggest hindrance to using CSI Data's CSI Number system is that it restricts tsid codes to only the US, Toronto and London based exchanges (as those are the only exchanges they list). I've considering using the EODData symbol database to enable the tsid structure to expand to all other global exchanges, but haven't implemented this yet.

You can view (or download) the CSI Data stock factsheet here.

TSID Exchange Abbreviations

Custom exchange abbreviations are used in the tsid structure to allow for naming flexibility and prevent duplicate abbreviations.

All abbreviations can be found by looking at the tsid_symbol column within exchanges.csv in pySecMaster/load_tables (or from the exchange table of the database).

Some common exchange abbreviation include:

Exchange Name TSID Exchange Abbreviation
American Stock Exchange AMEX
New York Stock Exchange (NYSE) N
New York Stock Exchange ARCA NARCA
NASDAQ Q
OTC Markets Pink Sheets PINK
London Stock Exchange LON
Toronto Stock Exchange TSX

Symbology

The symbology table is used as a translator between the tsid symbol and other symbol structures (Quandl codes, Yahoo Finance codes, etc.) (Note 2). This structure enables future symbol structures to be seamlessly added to the table to allow for external database communication (RIC, Bloomberg, etc.).

Not only does this translation ability allows you convert one source's symbol to another, but it allows you to query any source's symbols based on characteristics stored in other tables (exchange, sector, industry, etc.).

By default, the symbology table links the tsid symbol to these data sources (Note 3):

Source Name Source Code
CSI Data's CSI Number csi_data
Quandl's Google database quandl_goog
Quandl's WIKI database quandl_wiki
Seeking Alpha seeking_alpha
Yahoo Finance yahoo

Data Types

This system is built around the idea of having extractor modules 'plug-in' to the database. Therefore, it is designed for you to build your own data extractors for any type of data you want to store in line with the tsid structure.

The default extractors handle daily and minute price data, along with basic exchange information. I have built extra tables that can have extractors built to fill in data. If you have ideas on additional tables to include, please create an issue with your idea.

Default Extractors

  • Daily Historical Stock Prices (Quandl and Yahoo Finance (complete history); Google Finance (restricted to prior 50 days))
  • Minute Historical Stock Prices (Google Finance (restricted to prior 15 days))
  • Exchange Information
  • Symbology generator (symbol translator)

Custom Extractors (aka, build your own)

  • Cash Dividends (NASDAQ)
  • Corporate Activities (conference calls, earnings data, splits) (Yahoo)
  • Economic Events Data (Yahoo)
  • Financial Statement Data (may require table modification as I haven't tried this yet) (SEC Edgar)
  • Historical Indices Membership (?)
  • IPO Pricings Data (Yahoo)

Cross Validator

The cross validator automatically selects the most likely prices from all available sources.

The validator uses a cumulative score to select the most likely value, where the individual data source weights are specified in the data_vendor table of the database. The weights range from 0 to 100, with 100 holding the greatest weight.

The system is setup to work with as many data sources as available, so future data sources can be easily implemented into this consensus value. It is possible for all prior consensus values to be replaced by new values. Alternatively, it is possible for only values over the past n days to be replaced.

Currently, the only way to run the cross validator is to run the system through the pySecMaster.py script. The validator has not been implemented into the GUI yet.

This can be multi-processed based on tsids. By default, 5 threads are used. This value is dependent on the disk and processor speed, so you may need to lower this value.

Adjusted Prices

The query_data.py file includes a function for calculating the adjusted prices based on the stock's dividends and splits. You can read more about this calculation here.

Run without Docker using the GUI

NOTE: The Quandl data download is currently broken. Use the pysecmaster.py instead.

  1. Download and install both PostgreSQL and Psycopg2 to your computer. Installing psycopg2 on Windows can be challenging, but I found it easy to use the wheel provided on Christoph Gohlke's Windows Binaries for Python page.

  2. Create a new Postgres user to use with pySecMaster (i.e. pysecmaster)

  3. Clone or download this repository to your computer

  4. Navigate to the pySecMaster/pySecMaster and run main_gui.py

  5. Within the GUI, provide the Postgres admin user and password, along with the database name you want to use for the pySecMaster (i.e. pysecmaster). Also, enter the user, password, host and port number that will be used to access this new database.

Quandl daily data

  1. Enter a Quandl API Key (free at https://www.quandl.com)

  2. In the Data tab, change Download Source combo-box to quandl

  3. In the Data tab, change Selection combo-box to: - wiki if you want all Quandl WIKI data (Note 4) (~3,000 symbols) - goog if you want all US, Toronto and London Quandl Google Finance data (~38,000 symbols) - goog_etf if you want all Quandl Google Finance ETF data (Note 5)) (~3,700 symbols) - goog_us_main_no_end_date if you want main US exchange Quandl Google Finance data (Note 6)) (~15,000 symbols)

  4. If you have a HDD, I'd recommend changing the Threads count in System Settings tab to 2 (SSD's can handle 8 threads). If you see the database constantly being locked, lower this number.

  5. Click on the Ok button, and the database will start building itself

  6. You can save your settings either when you exit the GUI or by going to File -> Save Settings [ctrl + s]

Yahoo Finance daily data (currently broken - 2/19/18)

  1. In the Data tab, change Download Source combo-box to yahoo

  2. In the Data tab, change Selection combo-box to: - all if you want all US, Toronto and London Yahoo Finance data (~38,000 symbols) - us_main if you want main US exchange Yahoo Finance data that's been active within the prior two years (Note 6) (~9,000 symbols) - us_main_no_end_date if you want main US exchange Yahoo Finance data (Note 6) (~15,000 symbols) - us_canada_london if you want all US, Toronto and London Yahoo Finance data that's been active within the prior two years (~25,000 symbols)

  3. If you have a HDD, I'd recommend changing the Threads count in System Settings tab to 2 (SSD's can handle 8 threads). If you see the database constantly being locked, lower this number.

  4. Click on the Ok button, and the database will start building itself with daily data from Yahoo Finance

  5. You can save your settings either when you exit the GUI or by going to File -> Save Settings [ctrl + s]

Google Finance minute data (currently broken - 2/19/18)

  1. In the Data tab, change Download Source combo-box to google

  2. In the Data tab, change Selection combo-box to: - all if you want all US, Toronto and London Google Finance data (~38,000 symbols) - us_main if you want main US exchange Google Finance data that's been active within the prior two years (Note 6) (~9,000 symbols) - us_canada_london if you want all US, Toronto and London Google Finance data that's been active within the prior two years (~25,000 symbols)

  3. In the Data tab, change Interval combo-box to minute

  4. If you have a HDD, I'd recommend changing the Threads count in System Settings tab to 2 (SSD's can handle 8 threads). If you see the database constantly being locked, lower this number.

  5. Click on the Ok button, and the database will start building itself with minute data from Google Finance

  6. You can save your settings either when you exit the GUI or by going to File -> Save Settings [ctrl + s]

System Requirements

  • Python 3.4+
  • Numpy 1.14.0
  • Pandas 0.22.0
  • Psycopg2 2.6.2
  • SqlAlchemy 1.2.2
  • PostgreSQL 9.5+
  • PyQt 4.11+
  • More than 20GB of storage space (daily Quandl WIKI data is about 4 GB, while a year's worth of Google Finance minute data can become 50+ GB)

User Requirements

Future Goals

  • Add Quandl_YAHOO to symbology
  • Add custom holiday table

Additional Info

To view the PostgreSQL database, you can pgAdmin program that is installed when you download PostgreSQL. This allows you to view and edit all characteristics of the database.

Notes

  • Note 1: I have not implemented the integer of duplicates yet, so all tsid symbols have a 0 (zero) value for that. This is only relevant when you have access to delisted stock data, and for tickers that overlap active tickers (I.E. ABG on NYSE).
  • Note 2: All source codes created for the symbology table are built from scratch using a methodology that closely follows the true symbol structure. This means that there will be occurrences where to symbology built symbol does not match the true symbol. Create an issue if you see this happening.
  • Note 3: Google Finance does not have a symbol structure; they only require a ticker and unique exchange abbreviation as separate fields. Thus the Google Finance extractor uses the tsid structure as a symbol source.
  • Note 4: The symbology table actually includes about 9,000 symbols classified as quandl_wiki, but only about 3,000 of those actually have data from Quandl. I did this because I do not have a high quality list of all the WIKI codes (I don't trust Quandl's list), thus to ensures that a good percent of WIKI codes are downloaded.
  • Note 5: All ETF symbols are derived from the CSI Data Stock Factsheet, which includes listed and delisted ETFs
  • Note 6: US main exchanges include AMEX, NYSE, BATS, NASDAQ (CM, GM, GS) and NYSE ARCA; includes stocks and ETFs

Disclaimer

Before using this software, be sure to understand and follow the terms of all data providers. I am not responsible for how you use this software, so please be responsible in your use of it! Please see the following links for some information:

For further information, please seek legal counsel.

License (GNU AGPLv3)

pySecMaster - An automated system to store and maintain financial data.

Copyright (C) 2016 Josh Schertz

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.