[Docs] Clarify usage of include_package_data/package_data/exclude_package_data on package data files #4643

Merged
merged 21 commits into from
Sep 26, 2024
Changes from 5 commits
75 changes: 40 additions & 35 deletions docs/userguide/datafiles.rst
@@ -2,13 +2,14 @@
Data Files Support
====================

Old packaging installation methods in the Python ecosystem
have traditionally allowed installation of "data files", which
are placed in a platform-specific location. However, the most common use case
for data files distributed with a package is for use *by* the package, usually
by including the data files **inside the package directory**.

Setuptools focuses on this most common type of data files and offers three ways
Old packaging installation methods in the Python ecosystem have
traditionally allowed the inclusion of "data files" (files beyond
:ref:`the default set <manifest>` ), which are placed in a platform-specific
location. However, the most common use case for data files distributed
with a package is for use *by* the package, usually by including the
data files **inside the package directory**.

``Setuptools`` focuses on this most common type of data files and offers three ways
Contributor

Thank you for your suggestion. I understand that your main objective was to clarify what “data files” are.

While this intention is very much appreciated and welcome, the proposed parenthetical seems to suggest: data file = file not included in the "default set". This could lead to confusion, as what defines a data file is not fundamentally related to whether a file is included in the default set or not. For example, we could change setuptools to start including .json files automatically, but that would not make them more or less “data file”-y.

Could you please have a look at the following suggestion (which adds a new paragraph before the original text)? Would this address your concerns?

Suggested change
Old packaging installation methods in the Python ecosystem have
traditionally allowed the inclusion of "data files" (files beyond
:ref:`the default set <manifest>` ), which are placed in a platform-specific
location. However, the most common use case for data files distributed
with a package is for use *by* the package, usually by including the
data files **inside the package directory**.
``Setuptools`` focuses on this most common type of data files and offers three ways
In the Python ecosystem, the term "data files" is used in various complex scenarios
and can have nuanced meanings.
For the purposes of this documentation, we define "data files" as non-Python files
that are installed alongside Python modules and packages on the user's machine
when they install a :term:`distribution <Distribution Package>` from PyPI
or via a ``.whl`` file.
These files are typically intended for use at runtime by the package itself or
to influence the behavior of other packages or systems.
They may also be referred to as "resource files."
Old packaging installation methods in the Python ecosystem
have traditionally allowed installation of "data files", which
are placed in a platform-specific location. However, the most common use case
for data files distributed with a package is for use *by* the package, usually
by including the data files **inside the package directory**.
Setuptools focuses on this most common type of data files and offers three ways

Does this look good to you?

Contributor Author

Thank you for your suggestion. I understand that your main objective was to clarify what “data files” are.

Yes, exactly. This is the opening sentence of the Data Files Support section, so we have to make sure users understand what counts as a "data file" in the first place, so that they can decide whether they need these keywords at all.

For example C source code/READMEs are not "Python" files, but we don't need to declare them to include them into the package. We might therefore need to clarify the following definition of "data files":

we define "data files" as non-Python files

the proposed parenthetical seems to suggest: data file = file not included in the "default set". This could lead to confusion, as what defines a data file is not fundamentally related to whether a file is included in the default set or not.

Thanks a ton for the clarification! Yes, my sketch would indeed need more discussion and checking.

They may also be referred to as "resource files."

Personally, I think the following terms are already very confusing: "package data", "data file", "source distribution", "files inside the wheel". The differences between them are pretty unclear. Can we somehow define or clarify them, and perhaps avoid introducing more synonyms ("resource files" in this case)?

Contributor

I have no problem with omitting "resource files"; I only mentioned the term because other packages have started calling them that (e.g. importlib_resources). Please feel free to use and modify my suggestion to align with your vision for the contribution.


For example C source code/READMEs are not "Python" files, but we don't need to declare them to include them into the package.

Yes, in this case C/README files are not data files, because they don't fit the remaining part of the suggested definition:

... that are installed alongside Python modules and packages on the user's machine
when they install a :term:`distribution <Distribution Package>` from PyPI
or via a ``.whl`` file.
These files are typically intended for use at runtime by the package itself or
to influence the behavior of other packages or systems.

C files are not installed alongside Python files on the user's machine and are also not intended to be used at runtime, right?
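
To make the "intended for use at runtime by the package itself" part of the definition concrete, here is a minimal sketch of how a package might read one of its own data files once installed. It assumes Python 3.9+ (for `importlib.resources.files`); the helper name `read_package_text` and the names `mypkg` / `data1.txt` in the usage note are purely illustrative and not part of this PR.

```python
from importlib.resources import files


def read_package_text(package, filename, encoding="utf-8"):
    """Read a text data file that was installed inside ``package``.

    ``package`` is an importable package name; ``filename`` is a path
    relative to that package's directory. Both are hypothetical inputs
    chosen by the caller.
    """
    # files() returns a Traversable rooted at the installed package,
    # so this works regardless of where the package ended up on disk.
    return files(package).joinpath(filename).read_text(encoding=encoding)
```

With the example tree from the docs, `read_package_text("mypkg", "data1.txt")` would return the contents of `data1.txt`, provided the file actually made it into the installed package.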


I just noticed this from official Python docs:

MANIFEST.in does not affect binary distributions such as wheels.

That is not 100% precise, is it? MANIFEST.in determines what goes into the sdist, and the contents of the sdist then influence what goes into the wheel (especially when include-package-data = true, which is the default)... So there is a potential indirect effect there. That is because the build process works more or less like in the mermaidjs diagram below [1]:

graph LR
    src[fa:fa-laptop-code source code] --> b1(fa:fa-hammer)
    
    subgraph build process
      b1 --> sdist(fa:fa-file-code sdist) --> b2(fa:fa-hammer)
      b2 --> wheel(fa:fa-box wheel)
       
      build([build dependencies]) --> b1
      build([build dependencies]) --> b2
    end

    subgraph installation process
        wheel --> pkg
        pkg(fa:fa-box-open) --> inst[fa:fa-python installed packages]
        runtime([runtime dependencies]) --> inst
    end

    classDef deps fill:#2aa198,stroke:#333
    classDef env fill:#6c71c4,stroke:#333
    build:::deps
    runtime:::deps

Footnotes

  1. The "build process" creates distribution artifacts/packages. sdist distributions are meant to be platform independent (but may contain varying levels of optimisation - e.g. compiling Python to C via Cython). wheel distributions contain the final files meant to be directly copied to the user's machine, and therefore may be platform specific.
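
One way to observe the indirect effect described above is to compare what ended up in the sdist with what ended up in the wheel. The following is only a rough sketch using the standard library; the function names are hypothetical, and it assumes the usual layouts (an sdist is a `.tar.gz` with a single top-level `name-version/` directory, a wheel is a zip archive).

```python
import tarfile
import zipfile
from pathlib import PurePosixPath


def sdist_members(path):
    """Return file paths in an sdist, without the top-level 'name-version/' directory."""
    members = set()
    with tarfile.open(path, "r:gz") as archive:
        for member in archive.getmembers():
            parts = PurePosixPath(member.name).parts
            # Skip directories and anything not nested under the top-level folder.
            if member.isfile() and len(parts) > 1:
                members.add("/".join(parts[1:]))
    return members


def wheel_members(path):
    """Return file paths in a wheel (directory entries are skipped)."""
    with zipfile.ZipFile(path) as archive:
        return {name for name in archive.namelist() if not name.endswith("/")}


def non_python(paths):
    """Keep only the paths that are not ``.py`` files, sorted for stable output."""
    return sorted(p for p in paths if not p.endswith(".py"))
```

Calling `non_python(sdist_members(...))` and `non_python(wheel_members(...))` on the two artifacts of the same build (for example, files under a `dist/` directory) shows which non-Python files from the sdist were carried over into the wheel.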

of specifying which files should be included in your packages, as described in
the following section.

@@ -19,10 +20,11 @@ Configuration Options

.. _include-package-data:

include_package_data
--------------------
1. ``include_package_data``
---------------------------

First, you can use the ``include_package_data`` keyword.

For example, if the package tree looks like this::

project_root_directory
@@ -35,16 +37,34 @@ For example, if the package tree looks like this::
├── data1.txt
└── data2.txt

and you supply this configuration:
When at least one of the following conditions is met:
Contributor Author

The include_package_data keyword only has an effect when one of these conditions is met. Perhaps it would be good to state them up front?


1. These files are included via the :ref:`MANIFEST.in <Using MANIFEST.in>` file,
like so::

include src/mypkg/*.txt
include src/mypkg/*.rst

2. They are being tracked by a revision control system such as Git, Mercurial
or SVN, **AND** you have configured an appropriate plugin such as
:pypi:`setuptools-scm` or :pypi:`setuptools-svn`.
(See the section below on :ref:`Adding Support for Revision
Control Systems` for information on how to configure such plugins.)

then all the ``.txt`` and ``.rst`` files will be included into
the source distribution.

To further include them in the ``wheels``, you need to use the
``include_package_data`` keyword:

.. tab:: pyproject.toml

.. code-block:: toml

[tool.setuptools]
# ...
# By default, include-package-data is true in pyproject.toml, so you do
# NOT have to specify this line.
# By default, include-package-data is true in pyproject.toml,
# so you do NOT have to specify this line.
include-package-data = true

[tool.setuptools.packages.find]
@@ -76,33 +96,18 @@ and you supply this configuration:
include_package_data=True
)

then all the ``.txt`` and ``.rst`` files will be automatically installed with
your package, provided:

1. These files are included via the :ref:`MANIFEST.in <Using MANIFEST.in>` file,
like so::

include src/mypkg/*.txt
include src/mypkg/*.rst

2. OR, they are being tracked by a revision control system such as Git, Mercurial
or SVN, and you have configured an appropriate plugin such as
:pypi:`setuptools-scm` or :pypi:`setuptools-svn`.
(See the section below on :ref:`Adding Support for Revision
Control Systems` for information on how to write such plugins.)

.. note::
.. versionadded:: v61.0.0
The default value for ``tool.setuptools.include-package-data`` is ``True``
The default value for ``tool.setuptools.include-package-data`` is ``true``
when projects are configured via ``pyproject.toml``.
This behaviour differs from ``setup.cfg`` and ``setup.py``
(where ``include_package_data=False`` by default), which was not changed
(where ``include_package_data`` is ``False`` by default), which was not changed
to ensure backwards compatibility with existing projects.

.. _package-data:

package_data
------------
2. ``package_data``
-------------------

By default, ``include_package_data`` considers **all** non ``.py`` files found inside
the package directory (``src/mypkg`` in this case) as data files, and includes those that
@@ -172,7 +177,7 @@ file, nor require to be added by a revision control system plugin.

.. note::
If your glob patterns use paths, you *must* use a forward slash (``/``) as
the path separator, even if you are on Windows. Setuptools automatically
the path separator, even if you are on Windows. ``Setuptools`` automatically
DanielYang59 marked this conversation as resolved.
converts slashes to appropriate platform-specific separators at build time.

.. important::
@@ -271,8 +276,8 @@ we specify that ``data1.rst`` from ``mypkg1`` alone should be captured as well.

.. _exclude-package-data:

exclude_package_data
--------------------
3. ``exclude_package_data``
---------------------------

Sometimes, the ``include_package_data`` or ``package_data`` options alone
aren't sufficient to precisely define what files you want included. For example,
@@ -450,7 +455,7 @@ With :ref:`package-data`, the configuration might look like this:
}
)

In other words, we allow Setuptools to scan for namespace packages in the ``src`` directory,
In other words, we allow ``Setuptools`` to scan for namespace packages in the ``src`` directory,
which enables the ``data`` directory to be identified, and then, we separately specify data
files for the root package ``mypkg``, and the namespace package ``data`` under the package
``mypkg``.