FinMLKit is an open-source, lightweight financial machine learning library designed to be simple, blazing fast, and easy to contribute to. Whether you’re a seasoned quant or a beginner in finance and programming, FinMLKit welcomes contributions from everyone.
The main goal of this library is to provide a solid foundation for financial machine learning, enabling users to process raw trades data and generate different types of bars, intra-bar features (e.g., footprints), bar-level features (indicators), and labels for supervised learning.
To get started with FinMLKit, you can simply install it via pip:
pip install finmlkit

Or clone the repository and install it locally:
git clone https://github.com/quantscious/finmlkit.git
cd finmlkit
pip install .

See the examples directory to learn about the practical usage of the library: how to process trades data and build bars, features, and labels for machine learning models.
Build your own crypto database with binance2h5.py, ready to be processed by FinMLKit. This script downloads raw trades data from Binance in monthly chunks and processes it into HDF5 format compatible with FinMLKit.
You can find this in the scripts directory. Example usage:
python scripts/binance2h5.py \
--market spot \
--tickers BTCUSDT ETHUSDT ADAUSDT ADAUSDC \
--start 2021-01 \
--end now \
--workdir /Users/you/data \
--workers 4 \
--overwrite-klines 1

The documentation is available at finmlkit.readthedocs.io.
By default, logging is directed to the console at INFO level, but you can change this and also enable file-based logging by setting the appropriate environment variables.
- If FMK_LOG_FILE_PATH is defined, logs are written to both the specified file and the console.
- If FMK_LOG_FILE_PATH is not set, logging defaults to console-only output.

To apply these settings, export the environment variables in your terminal before running your application:
export FMK_LOG_FILE_PATH=/path/to/your/logfile.log
export FMK_FILE_LOGGER_LEVEL=DEBUG
export FMK_CONSOLE_LOGGER_LEVEL=WARNING

If you want to suppress console output, set FMK_CONSOLE_LOGGER_LEVEL to, for example, WARNING.
FinMLKit is an open-source, lightweight financial data processing library with a focus on preparing data and labels for ML models. It is specialized for high-frequency trading (HFT) and builds on the most granular data level: price tick data (raw trades). This enables intra-bar features (e.g., footprints, flow imbalance) that provide additional information to ML models compared to conventional and ubiquitous OHLCV data. Working with large amounts of raw data requires a special design approach to ensure speed and efficiency, which is why FinMLKit is built with Numba for high-performance computation and parallelization. To illustrate: aggregating raw trades into OHLCV bars with Pandas takes around 60x longer than with FinMLKit, so a task that takes a minute in Pandas takes roughly a second here. In the performance test notebook we ran a fun comparison between FinMLKit and MLFinPy on bar construction speed and demonstrated a more than 600x speedup. This highlights the efficiency and power of FinMLKit for processing large amounts of raw financial data.
So FinMLKit is built on Python's Numba for high-performance computation, while Pandas serves only as a thin wrapper for convenient data handling. Numba's Just-In-Time (JIT) compilation converts Python code into machine code, significantly improving performance, especially in iterative tasks that can be parallelized. Pandas, while great for structuring and managing data, is slow and cumbersome for such operations, so we let it shine where it excels (structuring and managing data) while Numba powers the core algorithmic computations. This way, we avoid slow, opaque pandas operations and keep the core functions efficient and explicit.
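To make the explicit-loop style concrete, here is a minimal, self-contained sketch of an adjusted exponential moving average written as a plain loop that Numba can JIT-compile (an illustrative example, not FinMLKit's actual implementation; the decorator falls back to a no-op if Numba is not installed):

```python
import numpy as np

try:
    from numba import njit  # compile the loop to machine code when available
except ImportError:
    def njit(func):  # graceful fallback: run as plain Python
        return func

@njit
def ewma_adjusted(x, alpha):
    """Adjusted EWMA: weights (1-alpha)^k, normalized over the observed window."""
    out = np.empty_like(x)
    num = 0.0  # running weighted sum of observations
    den = 0.0  # running sum of weights
    for i in range(x.shape[0]):
        num = x[i] + (1.0 - alpha) * num
        den = 1.0 + (1.0 - alpha) * den
        out[i] = num / den
    return out

prices = np.array([1.0, 2.0, 3.0])
print(ewma_adjusted(prices, 0.5))  # matches pandas ewm(alpha=0.5, adjust=True).mean()
```

Because the kernel is a bare loop over a NumPy array, the same function runs unchanged with or without JIT compilation, which is also what makes `NUMBA_DISABLE_JIT=1` testing possible.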
Key principles are Simplicity, Speed, and Accessibility (SSA):
- Simplicity 🧩 No complex frameworks, no opaque pandas operations, just efficient, explicit, well-documented algorithms.
- Speed ⚡ Core functions built with Numba for high-performance computation and parallelization in mind.
- Accessibility 🌍 The goal is to make it easy for anyone – regardless of their background – to contribute and enhance the library, fostering an open-source collaborative spirit.
FinMLKit is an open-source toolkit designed to make advanced, reproducible financial machine learning accessible to both researchers and practitioners. Many existing pipelines still rely on outdated conventions like time bars, fixed-window labels, and oversimplified features—not because they are optimal, but because better alternatives are often harder to implement and scale. FinMLKit addresses this gap by providing a research-grade foundation for working directly with raw trade data, including information-driven bar types, path-aware labeling with the Triple Barrier Method, microstructure features like volume profiles and footprints, and sample weighting for overlapping events—all powered by high-performance, Numba-accelerated internals.
This project aims not only to offer tools, but to foster collaboration. By open-sourcing the core infrastructure, we invite contributors to improve, extend, and build on a shared foundation—raising the methodological standard across both academia and industry. FinMLKit is structured to support reproducible research, with clean APIs, modular design, and citable releases (see citation info at the bottom). Our vision is to democratize access to advanced techniques, make rigorous pipelines more practical, and accelerate the adoption of robust, transparent practices in financial ML.
The foundation of any financial ML pipeline is robust data handling. FinMLKit provides comprehensive tools for ingesting, preprocessing, validating, and storing high-frequency trading data at scale. The data preprocessing module transforms raw, inconsistent trade feeds into clean, validated datasets ready for bar construction and analysis.
Data Ingestion & Preprocessing:
- TradesData - Raw trades preprocessing with timestamp normalization, trade merging, and side inference
- Data integrity validation with gap detection and discontinuity analysis
- Multi-format timestamp support (s, ms, μs, ns) with automatic unit inference
- Trade ID validation and missing data percentage calculation
- Memory-efficient processing with chunking support for large datasets
Storage & Retrieval:
- HDF5-based storage with monthly partitioning for efficient time-range queries
- Compressed storage with multiple backends (blosc:lz4, blosc:zstd)
- Metadata-driven data discovery and range validation
- Multiprocessing support for large dataset operations
- H5Inspector - Comprehensive HDF5 file analysis and integrity reporting
- AddTimeBarH5 - Automated time bar generation and persistence (extending the raw trade data h5 file)
- TimeBarReader - Efficient time bar loading with flexible resampling capabilities
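Automatic timestamp-unit inference, mentioned above, can be done from the magnitude of the epoch values alone. The helper below is a hypothetical sketch of that idea (not FinMLKit's actual function): it assumes timestamps are Unix epochs from the modern era, so each unit occupies a distinct band of digits.

```python
def infer_epoch_unit(ts: int) -> str:
    """Guess the unit of a Unix epoch timestamp from its magnitude.

    Assumes dates roughly between 2001 and 2286, so the bands do not
    overlap: seconds ~1e9, ms ~1e12, us ~1e15, ns ~1e18.
    """
    if ts < 1e11:
        return "s"
    elif ts < 1e14:
        return "ms"
    elif ts < 1e17:
        return "us"
    return "ns"

print(infer_epoch_unit(1_620_000_000))      # "s"  (May 2021, seconds)
print(infer_epoch_unit(1_620_000_000_000))  # "ms" (same instant, milliseconds)
```

Inference like this only needs to inspect a single value per file, so it adds no measurable cost to ingestion.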
Bars, constructed from preprocessed trades data, are the primary data structure in FinMLKit, representing the historical price data of an asset. Bars can be in the form of OHLCV (Open, High, Low, Close, Volume) or any other format that includes the necessary information for analysis (e.g., footprint data, directional features). Bars are used as input for indicators, strategies, and other components of the library. In summary, the bars module is responsible for processing structured trades data into analytical data structures optimized for financial machine learning.
Data Structures:
- OHLCV bars with VWAP and trade statistics
- Directional features (e.g., buy/sell tick count, volume, dollar value, min./max. cumulative volume, etc.)
- Trade size features (e.g., are there large trade block prints in the bar?)
- Bar footprints with order flow imbalance detection
Bar Types:
- Time bars
- Tick bars
- Volume bars
- Dollar bars
- CUSUM bars
- Imbalance bars
- Run bars
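To make the information-driven bar idea concrete, here is a minimal, self-contained sketch of dollar-bar sampling (illustrative only; FinMLKit's actual implementation is Numba-accelerated and far richer): a new bar closes each time the cumulative traded dollar value since the last close crosses a threshold.

```python
import numpy as np

def dollar_bar_indices(prices, volumes, threshold):
    """Return the trade index at which each dollar bar closes.

    A bar closes when cumulative price * volume since the last close
    reaches `threshold` dollars; the accumulator then resets.
    """
    closes = []
    cum = 0.0
    for i in range(len(prices)):
        cum += prices[i] * volumes[i]
        if cum >= threshold:
            closes.append(i)
            cum = 0.0
    return closes

prices = np.array([100.0, 101.0, 99.0, 100.0, 102.0])
volumes = np.array([1.0, 2.0, 1.0, 3.0, 1.0])
# per-trade dollar values: 100, 202, 99, 300, 102
print(dollar_bar_indices(prices, volumes, threshold=300.0))  # [1, 3]
```

Tick and volume bars follow the same pattern with a different accumulator; imbalance and run bars additionally adapt the threshold from the observed order flow.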
Everything that processes bars data (candlestick/OHLCV, directional features, or footprints) and calculates derived values from it is considered a feature. This includes moving averages, RSI, MACD, etc. Here, we focus on more unconventional indicators that are not commonly found in other libraries and that build on our advanced data structures such as footprints (for example, volume profile features). Features are the building blocks of trading strategies and are used to generate signals for buying or selling assets.
FeatureKit Framework:
- Dual-backend architecture (pandas for development/prototyping, Numba for production)
- SISO, MISO, SIMO, MIMO transform patterns for flexible feature engineering
- Compose class for sequential transform chaining
- Mathematical operations and function composition with the Feature wrapper class
- Computational graph: visualize dependencies and compute in topological order
- Optimized, caching-aware execution: reuse precomputed columns across features and pipelines
- Reproducibility: JSON serialization of Features/FeatureKit with full config export/import
- Integration with external libraries (e.g., TA-Lib) via the ExternalFunction transform wrapper
- FeatureKit for batch feature computation with optional performance profiling
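The SISO/MISO/SIMO/MIMO names refer to how many input and output columns a transform consumes and produces. As a conceptual sketch of the dual-backend idea (hypothetical class and method names, not the library's real API), a SISO transform can pair a plain loop, standing in for a Numba-compiled kernel, with a vectorized prototype path that must produce identical results:

```python
import numpy as np

class SISOTransform:
    """Conceptual single-input/single-output transform with two backends.

    Hypothetical sketch: `loop_backend` stands in for a Numba-compiled
    production kernel, `vector_backend` for a pandas/NumPy prototype.
    Both compute the same simple momentum feature x[i] - x[i - window].
    """
    def __init__(self, window: int):
        self.window = window

    def loop_backend(self, x: np.ndarray) -> np.ndarray:
        out = np.full(x.shape[0], np.nan)
        for i in range(self.window, x.shape[0]):
            out[i] = x[i] - x[i - self.window]
        return out

    def vector_backend(self, x: np.ndarray) -> np.ndarray:
        out = np.full(x.shape[0], np.nan)
        out[self.window:] = x[self.window:] - x[:-self.window]
        return out

t = SISOTransform(window=2)
x = np.array([1.0, 2.0, 4.0, 7.0])
print(t.loop_backend(x))    # [nan nan 3. 5.]
print(t.vector_backend(x))  # identical result from the vectorized path
```

Keeping both backends behind one interface lets you prototype quickly in pandas/NumPy and switch to the compiled kernel in production while asserting the outputs agree.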
Implemented Features:
- Adjusted Exponential Moving Average
- Standard Volatility Estimators
- Volume Profile Indicators: Commitment of Traders (COT), Buy/Sell Imbalance price levels, High Volume Nodes (HVN), Low Volume Nodes (LVN), Point of Control (POC)
- CUSUM structural break monitoring feature (Chu-Stinchcombe-White CUSUM test on levels, based on Homm and Breitung (2011))
- And many more... Consult the documentation for a complete list of implemented transform examples. Feel free to build your own features for your specific needs; the framework design makes this straightforward.
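As one concrete example of a volume-profile feature, the Point of Control (POC) is simply the price level that traded the most volume within a bar. A minimal illustrative sketch (not FinMLKit's implementation), with bucketing to a hypothetical tick size:

```python
import numpy as np

def point_of_control(prices, volumes, tick_size=0.5):
    """Return the price level with the greatest traded volume.

    Prices are bucketed to the nearest `tick_size`; ties resolve to the
    lowest level via np.argmax on the sorted unique levels.
    """
    levels = np.round(np.asarray(prices) / tick_size) * tick_size
    uniq, inv = np.unique(levels, return_inverse=True)
    vol_at_level = np.bincount(inv, weights=np.asarray(volumes, dtype=float))
    return uniq[np.argmax(vol_at_level)]

prices = [100.1, 100.4, 100.6, 100.4, 99.9]
volumes = [1.0, 3.0, 2.0, 1.5, 0.5]
# level 100.5 accumulates 3.0 + 2.0 + 1.5 = 6.5, level 100.0 only 1.5
print(point_of_control(prices, volumes))  # 100.5
```

HVN/LVN detection extends the same histogram by looking for local maxima and minima in `vol_at_level` rather than the single global peak.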
Labels are the target values that we want to predict in a supervised learning problem. Currently, the Triple Barrier Method is implemented with meta-label support, which is an advanced approach in financial machine learning.
- Triple Barrier Method
- Meta-Labeling
- Label Concurrency weights
- Return Attribution weights
- Class Imbalance weights
- CUSUM Filter
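A compact sketch of the Triple Barrier logic (illustrative only, not FinMLKit's API): starting at an event, walk the price path forward until it first touches the profit-taking barrier (+1), the stop-loss barrier (-1), or the vertical time barrier (0).

```python
import numpy as np

def triple_barrier_label(prices, start, pt, sl, horizon):
    """Label one event by the first barrier its return path touches.

    pt/sl are absolute return thresholds (e.g., 0.02 for 2%); horizon is
    the maximum number of steps before the vertical barrier fires.
    """
    p0 = prices[start]
    end = min(start + horizon, len(prices) - 1)
    for i in range(start + 1, end + 1):
        ret = prices[i] / p0 - 1.0
        if ret >= pt:
            return 1   # upper (profit-taking) barrier hit first
        if ret <= -sl:
            return -1  # lower (stop-loss) barrier hit first
    return 0           # vertical barrier: neither horizontal barrier hit

prices = np.array([100.0, 100.5, 101.0, 102.5, 101.0])
print(triple_barrier_label(prices, start=0, pt=0.02, sl=0.02, horizon=4))  # 1
```

In practice the horizontal thresholds are usually scaled by a rolling volatility estimate per event, and the first-touch index also feeds the concurrency and return-attribution weights listed above.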
FinMLKit implements methods from trusted sources, including renowned academic papers and books. The primary reference is Marcos Lopez de Prado’s Advances in Financial Machine Learning, which lays the foundation for many of the algorithms and methods in this package. We prioritize transparency and accuracy in both the implementation and explanation of these methodologies. Each algorithm should be accompanied by detailed documentation that:
- Cites the original sources from which the methods were derived (papers, books, and other trusted research).
- Describes the algorithms comprehensively, explaining the theory behind them and how they are applied in practice.

By ensuring that the algorithms are well-documented, with clear references to their origins, we aim to foster trust and enable users to fully understand the underlying mechanics of the tools they are using. This also makes it easier for contributors to extend the package, knowing exactly how each method works and what references to consult.
FinMLKit is designed to be well-documented, with detailed explanations of each algorithm, method, and function. It uses reStructuredText-style docstrings to provide clear and concise documentation for each function, class, and module. This makes it easier for users to understand how to use the library and what each function does. It uses Sphinx to generate the documentation and automatically deploy it to finmlkit.readthedocs.io. This way, users can access the documentation online and easily navigate through the library's features and functionalities. This framework also enables the creation of tutorials and in-depth descriptions of the methods.
We aim to make FinMLKit as easy to contribute to as possible. Whether it’s fixing bugs, adding new features, or improving documentation, your contribution matters. Let’s work together to make the common ground for financial machine learning!
See CONTRIBUTING.md for detailed guidelines on bug reports, new features, enhancements, documentation, and testing.
Star the repo, cite it in your work, file issues, propose features, and share benchmark results. Let’s make better defaults the norm.
To run the full test suite locally, disable Numba's JIT compiler:
NUMBA_DISABLE_JIT=1 pytest -q

Alternatively, use the helper scripts ./local_test.sh (JIT enabled) or ./local_test_nojit.sh (JIT disabled). See tests/README.md for more guidance.
FinMLKit is built with speed in mind. We use Numba for high-performance computation, allowing us to avoid slow, opaque pandas operations and focus on efficient, explicit code in the core functions. This way, we can ensure that the library is fast and efficient, even when dealing with large datasets or complex algorithms.
Some results are collected below to demonstrate the effectiveness of the Numba framework:
- Exponentially Weighted Moving Average (EWMA) calculation: 4x speedup compared to the Pandas function
- Standard Volatility Estimator: 8.12x speedup compared to Pandas implementation
- CUSUM monitoring for structural breaks: 6.25x speedup with parallelization compared to non-parallelized implementation.
- OHLCV Time Bar generation: 100x speedup compared to Pandas implementation.
If you use FinMLKit in your research or publications, we kindly ask that you cite it. Use the "Cite this repository" option in the GitHub sidebar for ready-to-use citation details in formats like BibTeX and APA. For persistent DOIs, check the Zenodo archive linked below.
