feat: dynamically create sf-hamilton-core package
#1376
Changes from all commits
3c66fc4
588638c
c1deb44
c63886b
543d267
399b4a7
6d4fbfa
44ba8d8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| name: Unit tests (hamilton-core) | ||
|
|
||
| on: | ||
| workflow_dispatch: | ||
|
|
||
| pull_request: | ||
| branches: | ||
| - main | ||
| paths: | ||
| - '.github/**' | ||
| - 'hamilton/**' | ||
| - 'tests/**' | ||
| - 'pyproject.toml' | ||
|
|
||
| jobs: | ||
| test: | ||
| name: "Unit Tests (hamilton-core)" | ||
| runs-on: ubuntu-latest | ||
| env: | ||
| UV_PRERELEASE: "allow" | ||
| HAMILTON_TELEMETRY_ENABLED: false | ||
|
|
||
| steps: | ||
| - name: Install Graphviz on Linux | ||
| if: runner.os == 'Linux' | ||
| run: sudo apt-get update && sudo apt-get install --yes --no-install-recommends graphviz | ||
|
|
||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Install uv and set the python version | ||
| uses: astral-sh/setup-uv@v6 | ||
| with: | ||
| python-version: "3.12" # most popular Python version | ||
| enable-cache: true | ||
| cache-dependency-glob: "uv.lock" | ||
| activate-environment: true | ||
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| uv venv | ||
| . .venv/bin/activate | ||
| uv pip install ./hamilton-core[core-tests] | ||
|
|
||
| # NOTE `test_caching.py` is the older caching mechanism | ||
| - name: Test hamilton main package | ||
| run: | | ||
| uv run pytest tests/ --ignore tests/integrations --ignore tests/plugins --ignore tests/test_caching.py |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| hamilton/_hamilton |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| # Read carefully | ||
|
|
||
| > Use at your own risk | ||
| This directory contains code for the package `sf-hamilton-core`. It is a drop-in replacement for `sf-hamilton`, with the following changes: | ||
| - disable plugin autoloading | ||
| - make `pandas` and `numpy` optional dependencies | ||
| - remove the `networkx` dependency (currently unused) | ||
|
|
||
| This makes the Hamilton package a much lighter install and reduces the long library loading time. | ||
|
|
||
| ## As a user | ||
| If you want to try `sf-hamilton-core`, you need to: | ||
| 1. Remove your current Hamilton installation: `pip uninstall sf-hamilton` | ||
| 2. Install Hamilton core: `pip install sf-hamilton-core` | ||
| 3. Check the installation: `pip list` should include `sf-hamilton-core` and no longer include `sf-hamilton`. | ||
|
|
||
| This installs a different Python package that is also named `hamilton`, with fewer dependencies and plugin autoloading disabled. | ||
|
|
||
| It should be a drop-in replacement and your existing Hamilton code should just work. However, if you rely on plugins (e.g., parquet materializers, dataframe result builders), you will need to load them manually. | ||
|
|
||
|
|
||
| ## How does it work | ||
|
|
||
|
|
||
| ## Why is another package `sf-hamilton-core` necessary | ||
| This package exists to avoid backwards-incompatible changes for people who `pip install sf-hamilton` and use it in production. It is a temporary solution until a major release, `sf-hamilton==2.0.0`, can allow breaking changes and a more robust solution. | ||
|
|
||
| ### Disable plugin autoloading | ||
| Hamilton has a generous number of plugins (`pandas`, `polars`, `mlflow`, `spark`). To give a good user experience, Hamilton autoloads plugins based on the Python libraries available in the current environment. For example, `to.mlflow()` becomes available if `mlflow` is installed. Autoloaded features notably include materializers like `from_.parquet` and `to.parquet`, and data validators (pydantic, pandera, etc.). | ||
|
|
||
| The issue with this approach is that Python environments with a lot of dependencies, common in data science, can be very slow to start because of all the imports. Currently, Hamilton allows disabling autoloading via a user config or Python code. This requires manual setup and is not the best default for some users. | ||
|
|
||
| ### `pandas` and `numpy` dependencies | ||
| Hamilton was initially created for workflows that used `pandas` and `numpy` heavily. For this reason, `numpy` and `pandas` are imported at the top level of the `hamilton.base` module. Because of the package structure, as a Hamilton user, you import `pandas` and `numpy` every time you import `hamilton`. | ||
|
|
||
| A reasonable change would be to move `numpy` and `pandas` to a "lazy" location. Then, these dependencies would only be imported when features requiring them are used, and they could be removed from `pyproject.toml`. Unfortunately, the plugin autoloading defaults make this solution a significant breaking change and therefore unsatisfactory. | ||
|
|
||
| Since plugins are loaded based on the Python packages available, removing `pandas` and `numpy` would disable the loading of these plugins. This would break the popular CSV and parquet materializers. | ||
|
|
||
| ### `networkx` dependency | ||
| The `sf-hamilton[visualization]` extra currently includes `networkx` as a dependency, though it is barely used. Only a single function requires it, and that function could be implemented in pure Python. This has been made even easier by the addition of `graphlib` to the standard library in Python 3.9. |
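The README's point about `graphlib` can be illustrated with a minimal sketch. It assumes the function that needs `networkx` boils down to a topological ordering of nodes, which the README does not state; `dependencies` and `execution_order` are hypothetical names, not Hamilton APIs.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: node name -> names of the nodes it depends on.
dependencies = {
    "raw": set(),
    "cleaned": {"raw"},
    "features": {"cleaned"},
    "model": {"features", "cleaned"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return node names so that every node appears after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
```

`TopologicalSorter` raises `graphlib.CycleError` on cyclic input, so a pure-Python replacement along these lines would also cover cycle detection without an extra dependency.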
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| import importlib.util | ||
| import pathlib | ||
| import sys | ||
| from types import ModuleType | ||
| from typing import Any | ||
|
|
||
|
|
||
| def _load_hamilton_module() -> ModuleType: | ||
| """Patch this relative import in the Hamilton core repository | ||
|
|
||
| ```python | ||
| # hamilton/__init__.py | ||
| try: | ||
| from .version import VERSION as __version__ # noqa: F401 | ||
| except ImportError: | ||
| from version import VERSION as __version__ # noqa: F401 | ||
| ``` | ||
| """ | ||
|
|
||
| origin_path = pathlib.Path(__file__).parent / "_hamilton" / "__init__.py" | ||
| origin_spec = importlib.util.spec_from_file_location("hamilton", origin_path) | ||
| origin_module = importlib.util.module_from_spec(origin_spec) | ||
|
|
||
| # The following lines are only required if we don't modify `hamilton/__init__.py` | ||
| # source_segment = "from version import VERSION as __version__" | ||
| # # the namespace `hamilton._hamilton` is only temporarily available; it will be removed | ||
| # # by the end of this initialization | ||
| # patched_segment = "from hamilton._hamilton.version import VERSION as __version__" | ||
|
|
||
| # source_code = pathlib.Path(origin_path).read_text() | ||
| # patched_code = source_code.replace(source_segment, patched_segment) | ||
|
|
||
| # exec(patched_code, origin_module.__dict__) | ||
| # sys.modules["hamilton"] = origin_module | ||
|
|
||
| origin_spec.loader.exec_module(origin_module) | ||
| return origin_module | ||
|
|
||
|
|
||
| def _load_hamilton_registry_module(): | ||
| module_path = pathlib.Path(__file__).parent / "_hamilton" / "registry.py" | ||
| module_spec = importlib.util.spec_from_file_location("hamilton.registry", module_path) | ||
| module = importlib.util.module_from_spec(module_spec) | ||
| module_spec.loader.exec_module(module) | ||
| return module | ||
|
|
||
|
|
||
| def _create_proxy_module() -> ModuleType: | ||
| proxy_module = ModuleType(__name__) | ||
| sys.modules[__name__] = proxy_module | ||
| return proxy_module | ||
|
|
||
|
|
||
| _registry_module = _load_hamilton_registry_module() | ||
| # disable plugin autoloading | ||
| _registry_module.disable_autoload() | ||
|
|
||
| _origin_module = _load_hamilton_module() | ||
| _proxy_module = _create_proxy_module() | ||
|
|
||
|
|
||
| def __getattr__(name: str) -> Any: | ||
| try: | ||
| return getattr(_origin_module, name) | ||
| except AttributeError: | ||
| raise | ||
|
|
||
|
|
||
| # `getattr()` must be available to build the package | ||
| _proxy_module.__getattr__ = __getattr__ |
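The proxy approach in this file relies on module-level `__getattr__` (PEP 562): when normal attribute lookup on a module fails, Python calls the `__getattr__` found in the module's namespace. A self-contained sketch of that mechanism, using the standard library `math` module as a stand-in for the copied `_hamilton` package; `make_proxy` and `math_proxy` are hypothetical names for illustration only.

```python
import math
import sys
from types import ModuleType
from typing import Any


def make_proxy(name: str, origin: ModuleType) -> ModuleType:
    """Create a module whose missing attributes are looked up on `origin`."""
    proxy = ModuleType(name)

    def _forward(attr: str) -> Any:
        # Called only when normal lookup on `proxy` fails (PEP 562).
        return getattr(origin, attr)

    proxy.__getattr__ = _forward
    return proxy


# `math` stands in for the real package; "math_proxy" is a hypothetical name.
proxy = make_proxy("math_proxy", math)
sys.modules["math_proxy"] = proxy

import math_proxy  # resolved from sys.modules, no file on disk needed

root = math_proxy.sqrt(16.0)
```

Because `getattr(origin, attr)` already raises `AttributeError` for unknown names, the forwarding function needs no `try`/`except` of its own.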
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| import os | ||
| import pathlib | ||
| import re | ||
| import shutil | ||
| import sys | ||
|
|
||
| import tomllib | ||
| from setuptools import setup | ||
|
|
||
| os.chdir(os.path.abspath(os.path.dirname(__file__))) | ||
|
|
||
|
|
||
| def copy_hamilton_library(): | ||
| setup_dir = pathlib.Path(__file__).resolve().parent | ||
| source_dir = (setup_dir.parent / "hamilton").resolve() | ||
| dest_dir = (setup_dir / "hamilton" / "_hamilton").resolve() | ||
|
|
||
| # Safety checks | ||
| if not source_dir.is_dir(): | ||
| print(f"Error: Source directory does not exist: {source_dir}") | ||
| sys.exit(1) | ||
|
|
||
| if not str(dest_dir).startswith(str(setup_dir)): | ||
| print(f"Error: Destination directory {dest_dir} is outside the setup directory {setup_dir}") | ||
| sys.exit(1) | ||
|
|
||
| # Remove destination if it exists to avoid errors or stale files | ||
| if dest_dir.exists(): | ||
| print("delete: ", dest_dir) | ||
| shutil.rmtree(dest_dir) | ||
|
|
||
| # Copy entire directory tree from source to destination | ||
| print(f"copy from: {source_dir}; to {dest_dir}") | ||
| shutil.copytree(source_dir, dest_dir) | ||
|
|
||
|
|
||
| def get_version(): | ||
| version_path = pathlib.Path(__file__).parent / "hamilton" / "_hamilton" / "version.py" | ||
| content = version_path.read_text() | ||
| match = re.search(r"^VERSION\s*=\s*\(([^)]+)\)", content, re.MULTILINE) | ||
| if match: | ||
| version_tuple_str = match.group(1) # "1, 88, 0" | ||
| # Parse tuple string into list of integers | ||
| version_parts = [part.strip() for part in version_tuple_str.split(",")] | ||
| version_str = ".".join(version_parts) | ||
| return version_str | ||
|
|
||
|
|
||
| copy_hamilton_library() | ||
|
|
||
| pyproject_path = pathlib.Path(__file__).parents[1] / "pyproject.toml" | ||
| pyproject = tomllib.loads(pyproject_path.read_text()) | ||
| project = pyproject["project"] | ||
|
|
||
| readme_file = project.get("readme", None) | ||
| console_scripts = [ | ||
| f"{name}={target}" | ||
| for name, target in project.get("entry-points", {}).get("console_scripts", {}).items() | ||
| ] | ||
| install_requires = list(set(project.get("dependencies", [])).difference(set(["pandas", "numpy"]))) | ||
| extras_require = { | ||
| **project.get("optional-dependencies", {}), | ||
| **{"visualization": ["graphviz"]}, # drop networkx | ||
| **{ | ||
| "core-tests": [ # dependencies required to run unit tests; used in CI | ||
| "pytest", | ||
| "pytest-asyncio", | ||
| "pandas", | ||
| "typer", | ||
| "networkx", | ||
| "graphviz", | ||
| ] | ||
| }, | ||
| } | ||
|
|
||
|
|
||
| setup( | ||
| name="sf-hamilton-core", | ||
| version=get_version(), | ||
| description=project.get("description", ""), | ||
| long_description=pathlib.Path(readme_file).read_text() if readme_file else "", | ||
| long_description_content_type="text/markdown" if readme_file else None, | ||
| python_requires=project.get("requires-python", None), | ||
| license=project.get("license", {}).get("text", None), | ||
| keywords=project.get("keywords", []), | ||
| author=", ".join(a["name"] for a in project.get("authors", [])), | ||
| author_email=", ".join(a["email"] for a in project.get("authors", [])), | ||
| classifiers=project.get("classifiers", []), | ||
| install_requires=install_requires, | ||
| extras_require=extras_require, | ||
| entry_points={"console_scripts": console_scripts}, | ||
| project_urls=project.get("urls", {}), | ||
| packages=["hamilton"], | ||
| package_data={"hamilton": ["*.json", "*.md", "*.txt"]}, | ||
| ) |
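One thing to check in the `install_requires` line above: the set difference only removes entries that are exactly `"pandas"` or `"numpy"`. If the `dependencies` list in `pyproject.toml` carries version specifiers (an assumption, since that file is not shown in this diff), those entries would stay in. A hedged sketch that compares by project name instead; `requirement_name` and `drop_requirements` are hypothetical helpers, and the regular expression only covers the leading name of a requirement string.

```python
import re

# Leading distribution name of a PEP 508 requirement string.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(requirement: str) -> str:
    """Return the normalized project name of a requirement string."""
    match = _NAME_RE.match(requirement)
    if match is None:
        raise ValueError(f"Cannot parse requirement: {requirement!r}")
    # PEP 503 normalization: lowercase, runs of -, _, . become a single dash.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def drop_requirements(requirements: list[str], excluded: set[str]) -> list[str]:
    """Keep the requirements whose name is not in `excluded`, preserving order."""
    return [req for req in requirements if requirement_name(req) not in excluded]


# Hypothetical input; the real list lives in pyproject.toml.
deps = ["numpy", "pandas>=1.0", "typing_extensions > 4.0.0", "typing_inspect"]
kept = drop_requirements(deps, {"pandas", "numpy"})
```

Unlike the `set` difference, this keeps the original ordering of the dependency list, which makes the generated metadata stable between builds.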
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,7 @@ | ||
| try: | ||
| from .version import VERSION as __version__ # noqa: F401 | ||
| except ImportError: | ||
| from version import VERSION as __version__ # noqa: F401 | ||
| from hamilton.version import VERSION as __version__ # noqa: F401 | ||
|
|
||
| # this supposedly is required for namespace packages to work. | ||
| __path__ = __import__("pkgutil").extend_path(__path__, __name__) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,21 +20,22 @@ | |
| It cannot import hamilton.graph, or hamilton.driver. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import abc | ||
| import collections | ||
| import logging | ||
| from typing import Any, Dict, List, Optional, Tuple, Type, Union | ||
|
|
||
| import numpy as np | ||
| import pandas as pd | ||
| from pandas.core.indexes import extension as pd_extension | ||
| from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type, Union | ||
|
|
||
| from hamilton import htypes | ||
| from hamilton.lifecycle import api as lifecycle_api | ||
|
|
||
| try: | ||
| from . import htypes, node | ||
| except ImportError: | ||
| import node | ||
| if TYPE_CHECKING: | ||
|
Contributor
comment as to importance of this
Contributor
Author
Changes:
I don't have the "why" for this code: try:
from . import htypes, node
except ImportError:
import node
Contributor
Yeah sorry I meant in the code leave a note/comment as to the importance :)
Contributor
E.g. this is for hamilton-core to work... |
||
| import numpy as np | ||
| import pandas as pd | ||
|
|
||
| import hamilton.node as node | ||
|
|
||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
@@ -120,6 +121,8 @@ def pandas_index_types( | |
| :param outputs: the dict we're trying to create a result from. | ||
| :return: dict of all index types, dict of time series/categorical index types, dict if there is no index | ||
| """ | ||
| import pandas as pd | ||
|
|
||
| all_index_types = collections.defaultdict(list) | ||
| time_indexes = collections.defaultdict(list) | ||
| no_indexes = collections.defaultdict(list) | ||
|
|
@@ -131,6 +134,8 @@ def index_key_name(pd_object: Union[pd.DataFrame, pd.Series]) -> str: | |
|
|
||
| def get_parent_time_index_type(): | ||
| """Helper to pull the right time index parent class.""" | ||
| from pandas.core.indexes import extension as pd_extension | ||
|
|
||
| if hasattr(pd_extension, "NDArrayBackedExtensionIndex"): | ||
| index_type = pd_extension.NDArrayBackedExtensionIndex | ||
| else: | ||
|
|
@@ -220,6 +225,8 @@ def build_result(**outputs: Dict[str, Any]) -> pd.DataFrame: | |
|
|
||
| :param outputs: the outputs to build a dataframe from. | ||
| """ | ||
| import pandas as pd | ||
|
|
||
| # TODO check inputs are pd.Series, arrays, or scalars -- else error | ||
| output_index_type_tuple = PandasDataFrameResult.pandas_index_types(outputs) | ||
| # this next line just log warnings | ||
|
|
@@ -255,6 +262,7 @@ def build_dataframe_with_dataframes(outputs: Dict[str, Any]) -> pd.DataFrame: | |
| :param outputs: The outputs to build the dataframe from. | ||
| :return: A dataframe with the outputs. | ||
| """ | ||
| import pandas as pd | ||
|
|
||
| def get_output_name(output_name: str, column_name: str) -> str: | ||
| """Add function prefix to columns. | ||
|
|
@@ -300,6 +308,8 @@ def input_types(self) -> List[Type[Type]]: | |
| return [Any] | ||
|
|
||
| def output_type(self) -> Type: | ||
| import pandas as pd | ||
|
|
||
| return pd.DataFrame | ||
|
|
||
|
|
||
|
|
@@ -365,6 +375,8 @@ def build_result(**outputs: Dict[str, Any]) -> np.matrix: | |
| :param outputs: function_name -> np.array. | ||
| :return: numpy matrix | ||
| """ | ||
| import numpy as np | ||
|
|
||
| # TODO check inputs are all numpy arrays/array like things -- else error | ||
| num_rows = -1 | ||
| columns_with_lengths = collections.OrderedDict() | ||
|
|
@@ -402,6 +414,8 @@ def input_types(self) -> List[Type[Type]]: | |
| return [Any] # Typing | ||
|
|
||
| def output_type(self) -> Type: | ||
| import pandas as pd | ||
|
|
||
| return pd.DataFrame | ||
|
|
||
|
|
||
|
|
||
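The `base.py` changes above follow a common deferred-import pattern: the heavy library is imported under `TYPE_CHECKING` for annotations and again inside each function that actually uses it. A minimal sketch of the same pattern, with the standard library `fractions` module standing in for `pandas`; this `build_result` is a toy function for illustration, not the Hamilton method of the same name.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by type checkers; never executed, so no import cost at runtime.
    from fractions import Fraction


# The quoted annotation is never evaluated at runtime, which has the same
# effect here as `from __future__ import annotations` in the diff above.
def build_result(**outputs: Any) -> "dict[str, Fraction]":
    """Convert each output to a Fraction, importing the dependency on first use."""
    from fractions import Fraction  # deferred import, paid only when called

    return {name: Fraction(value) for name, value in outputs.items()}


result = build_result(half="1/2", three=3)
```

With this layout, importing the module never touches the optional dependency; a missing library only surfaces as an `ImportError` when a function that needs it is called.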