DataHub requires internet access for ingestion to work #13523

Open
@kha84

Describe the bug

After a successful installation of DataHub on a secured machine without internet access, ingestion fails because it attempts to download packages from PyPI (https://pypi.python.org/simple/wheel/).

To Reproduce
Steps to reproduce the behavior:

  1. Install a new instance of DataHub on a machine by following the quickstart guide: https://docs.datahub.com/docs/quickstart
  2. Turn off internet access on that machine
  3. Log in to DataHub as admin
  4. Go to Ingestion -> Create new source -> select Postgres (my specific example) -> enter any values for host / port / user / password / database name / data source name
  5. Click save & run ingestion
  6. See that the ingestion process for the new data source starts, runs for some time, and then fails
  7. Click on "Details" and see in the "Logs" section that it tried to create a Python venv and reach PyPI, then failed:
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '8bb47f33-cddb-4db7-9369-edb2faddd142',
 'infos': ['2025-05-15 14:15:14.802052 INFO: Starting execution for task with name=RUN_INGEST',
           "2025-05-15 14:17:16.043186 INFO: Failed to execute 'datahub ingest', exit code 2",
           '2025-05-15 14:17:16.043688 INFO: Caught exception EXECUTING task_id=8bb47f33-cddb-4db7-9369-edb2faddd142, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/home/datahub/.venv/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 139, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/home/datahub/.venv/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 402, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv doesn't exist.. minting..
Using CPython 3.10.17 interpreter at: /usr/bin/python
Creating virtual environment at: /tmp/datahub/ingest/venv-postgres-f9103e0adae041e3
Using Python 3.10.17 environment at: /tmp/datahub/ingest/venv-postgres-f9103e0adae041e3
error: Failed to fetch: `https://pypi.python.org/simple/wheel/`
  Caused by: Request failed after 3 retries
  Caused by: error sending request for url (https://pypi.python.org/simple/wheel/)
  Caused by: operation timed out
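
To confirm that the failure is purely a network reachability problem and not specific to DataHub, the fetch from the log can be reproduced with a few lines of standard-library Python run on the same host (or inside the executor container). This is only a diagnostic sketch: the helper name `can_reach`, the 5-second timeout, and the injectable `opener` parameter are illustrative choices, not part of DataHub or its executor.

```python
import urllib.request

# URL taken verbatim from the "Failed to fetch" line in the ingestion log.
INDEX_URL = "https://pypi.python.org/simple/wheel/"


def can_reach(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds.

    `opener` is a hypothetical hook so the check can be exercised without a
    network; by default it is the real urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts all derive
        # from OSError, so DNS failures, refused connections, HTTP error
        # responses and timeouts are all reported as "not reachable".
        return False
```

On a machine in the state described above, `can_reach(INDEX_URL)` would be expected to return `False` after the timeout, matching the "operation timed out" error in the log.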

Expected behavior
After installation, DataHub features should work out of the box, without needing to download additional packages from the internet.
