
feat(ingestion): add MicroStrategy connector#16992

Draft
brock-acryl wants to merge 22 commits into master from feat-ingestion-microstrategy


brock-acryl (Contributor) commented Apr 11, 2026

Summary

  • Adds a new DataHub ingestion connector for MicroStrategy, supporting metadata extraction via the MicroStrategy REST API
  • Ingests projects, folders, dashboards (dossiers), reports, Intelligent Cubes, and Library datasets as DataHub containers, dashboards, charts, and datasets with full SDK V2 entity emission
  • Extracts lineage from cubes to dashboards/reports and optionally from physical warehouse tables to cubes via SQL parsing, including column-level lineage
  • Ships with stateful ingestion support for automatic stale entity removal when objects are deleted in MicroStrategy

Key Features

Metadata coverage:

  • Projects → DataHub containers; folders → nested sub-containers
  • Dashboards/dossiers → Dashboard entities with embedded chart stubs and ownership
  • Reports → Chart entities with report-type subtypes and cube/dataset lineage
  • Intelligent Cubes → Dataset entities with schema (attributes + metrics), SQL view definition, and warehouse upstream lineage
  • Library datasets → Dataset entities

Lineage:

  • Report → Cube and Dashboard → Cube lineage via dataSource.id resolution
  • Warehouse table → Cube lineage by parsing the SQL returned from GET /api/v2/cubes/{id}/sqlView (handles Snowflake, MySQL, Teradata, and unquoted identifier styles)
  • Column-level lineage via SqlParsingAggregator; warehouse platform auto-detected from /api/datasources with four-tier fallback
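
The quoting styles mentioned above differ mainly in the delimiter wrapped around each identifier part. The following is a minimal sketch of normalizing them, not the connector's actual parser (which, per this PR, goes through SQL parsing); `split_table_reference` is a hypothetical helper, and it assumes double quotes (Snowflake, Teradata), backticks (MySQL), or no quoting at all.

```python
import re
from typing import List

# Hypothetical helper for illustration only; the connector itself relies on SQL parsing.
# Each alternative captures one identifier part: "double quoted", `backticked`, or bare.
_IDENTIFIER_PART = re.compile(r'"([^"]+)"|`([^`]+)`|([^.\s"`]+)')


def split_table_reference(ref: str) -> List[str]:
    """Split a dotted table reference into unquoted parts."""
    parts: List[str] = []
    for match in _IDENTIFIER_PART.finditer(ref):
        # Exactly one group is non-None for every match.
        parts.append(next(g for g in match.groups() if g is not None))
    return parts


snowflake_style = split_table_reference('"SALES"."PUBLIC"."ORDERS"')
mysql_style = split_table_reference("`sales`.`orders`")
bare_style = split_table_reference("sales.public.orders")
```

Quoted parts may contain dots or spaces, which is why the quoted alternatives are tried before the bare one.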

Test plan

  • Unit tests in tests/unit/test_microstrategy_source.py cover: config validation, project/folder/dashboard/report/cube/dataset pattern filtering, warehouse platform detection tiers, cube schema extraction, ownership extraction, cross-project lineage registry resolution, error handling continuity, and API call reduction flags (63 tests)
  • Integration tests in tests/integration/microstrategy/test_microstrategy_mock.py cover full end-to-end ingest against a mocked REST API with golden file comparison for both standard and warehouse-lineage configurations (2 tests)
  • All 65 tests pass: pytest tests/unit/test_microstrategy_source.py tests/integration/microstrategy/test_microstrategy_mock.py
  • Lint clean: ruff check + ruff format pass with no errors
  • Documentation added at docs/sources/microstrategy/microstrategy.md with all 35 config fields documented

…hance client functionality

- Introduced comprehensive documentation for the MicroStrategy connector, detailing capabilities, prerequisites, installation, and configuration options.
- Updated `MicroStrategyClient` to require `project_id` for dashboard definitions, ensuring accurate API requests.
- Enhanced validation in `MicroStrategyConnectionConfig` to enforce authentication rules for anonymous and credential modes.
- Improved metadata extraction logic in `MicroStrategySource` to include project context for dashboards and reports.
- Added integration test setup instructions for real data scenarios, including options for trial instances and mock servers.
…or improved metadata extraction

- Added detailed concept mapping and lineage extraction information to the MicroStrategy connector documentation.
- Introduced new configuration options in `MicroStrategyConfig` for filtering cubes, dashboards, reports, and datasets to optimize API calls.
- Implemented error handling for unavailable projects in `MicroStrategyClient`, raising specific exceptions for better debugging.
- Enhanced metadata extraction logic to support column-level lineage and warehouse lineage using SQL parsing.
- Updated unit tests to validate new configuration options and error handling mechanisms.
… client

- Implemented `get_attribute_expression` and `get_metric_expression` methods in `MicroStrategyClient` to fetch human-readable expressions for attributes and metrics.
- Introduced `include_field_formulas` configuration option in `MicroStrategyConfig` to control the fetching of attribute and metric expressions.
- Enhanced `MicroStrategySource` to utilize the new expression retrieval methods for input fields when `include_report_definitions` is enabled.
- Updated input field construction to include expressions as field descriptions, improving metadata visibility.
…eage processing

- Removed the synthetic __datasource entity for legacy documents, allowing each chart stub to directly reference its specific warehouse tables.
- Updated the logic for emitting embedded chart stubs to utilize per-dataset warehouse lineage, enhancing clarity and accuracy in lineage representation.
- Simplified the dashboard metadata extraction process by eliminating unnecessary dataset linking, improving overall performance and maintainability.
…tegy entity configuration

- Updated the version of the data-platforms configuration from v6 to v7.
- Added configuration for the MicroStrategy data platform, including entity URN, type, aspect name, and logo URL.
…s and improvements

- Updated documentation to clarify that domains are not supported and must be configured manually after ingestion.
- Added new configuration options for filtering cubes, dashboards, reports, and datasets to optimize API calls and reduce unnecessary processing.
- Introduced a context manager in `MicroStrategyClient` for managing project headers during API requests, improving code clarity and reliability.
- Enhanced error handling in the client to provide more informative logging for authentication and permission errors.
- Added integration tests to validate the functionality of the MicroStrategy connector, ensuring comprehensive coverage of entity types and lineage extraction.
- Removed deprecated subtype mappings and streamlined the configuration for better maintainability.
…re-fetching

- Introduced a new configuration option `max_workers` to control the maximum number of threads for pre-fetching cube metadata, improving ingestion performance.
- Updated the MicroStrategy client to utilize per-request headers for project IDs, allowing concurrent API calls without session header conflicts.
- Enhanced the cube pre-fetching logic to fetch SQL views and schema data in parallel, reducing overall ingestion time.
- Improved documentation to reflect the new configuration and its implications for API rate limits and debugging.
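
The parallel pre-fetch described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `prefetch_cubes`, `fetch_sql_view`, and `fetch_schema` are hypothetical names, and only `max_workers` comes from the commit message. A failed fetch is recorded as `None` so one bad cube does not stop the others.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def _result_or_none(future: "Future[Any]") -> Optional[Any]:
    """Return the future's result, or None if the fetch raised."""
    try:
        return future.result()
    except Exception:
        return None


def prefetch_cubes(
    cube_ids: Iterable[str],
    fetch_sql_view: Callable[[str], Any],  # hypothetical: fetches a cube's SQL view
    fetch_schema: Callable[[str], Any],    # hypothetical: fetches a cube's schema
    max_workers: int = 4,
) -> Dict[str, Tuple[Optional[Any], Optional[Any]]]:
    """Fetch SQL view and schema for every cube concurrently."""
    results: Dict[str, Tuple[Optional[Any], Optional[Any]]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit both calls for every cube up front so they overlap.
        futures = {
            cube_id: (pool.submit(fetch_sql_view, cube_id), pool.submit(fetch_schema, cube_id))
            for cube_id in cube_ids
        }
        for cube_id, (sql_future, schema_future) in futures.items():
            results[cube_id] = (_result_or_none(sql_future), _result_or_none(schema_future))
    return results


def _demo_sql(cube_id: str) -> str:
    if cube_id == "bad":
        raise RuntimeError("SQL view unavailable")
    return f"SELECT * FROM t_{cube_id}"


prefetched = prefetch_cubes(["a", "bad"], _demo_sql, lambda cid: {"id": cid}, max_workers=2)
```

Setting `max_workers=1` makes the calls run one at a time, which is the kind of setting the documentation note about rate limits and debugging presumably refers to.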
github-actions bot added the ingestion, product, and devops labels Apr 11, 2026
…ility

- Changed the return type of the `_deepcopy_wrapper` function from `ExpressionCore` to `Expr` to align with the updated sqlglot library definitions.
- Ensured compatibility with cooperative timeout support in the deepcopy implementation.

codecov bot commented Apr 11, 2026

Codecov Report

❌ Patch coverage is 61.42558% with 184 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...c/datahub/ingestion/source/microstrategy/client.py | 54.61% | 182 Missing ⚠️ |
| ...c/datahub/ingestion/source/microstrategy/config.py | 96.61% | 2 Missing ⚠️ |



alwaysmeticulous bot commented Apr 11, 2026

🔴 Meticulous spotted visual differences in 566 of 1422 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 4c49151 feat(microstrategy): case-aware upstream URN resolution for warehouse li.... This comment will update as new commits are pushed.

The microstrategy source was registered in setup.py but missing from the
generated pyproject.toml, causing the checkLockFile CI task to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov bot commented Apr 11, 2026

Bundle Report

Changes will increase total bundle size by 6.75kB (0.03%) ⬆️. This is within the configured threshold ✅

Detailed changes
| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 22.74MB | 6.75kB (0.03%) ⬆️ |

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 6.75kB | 12.49MB | 0.05% |

…ting

- Add `microstrategy` extras to setup.py (usage_common | sqlglot_lib) so
  validate-plugin-deps can install and import the plugin correctly; sqlparse
  was missing from the install because the extras key was absent entirely
- Regenerate pyproject.toml and uv.lock via updateLockFile
- Rename docs/sources/microstrategy/microstrategy.md →
  microstrategy_pre.md to satisfy docGen naming convention (must be
  README.md or <plugin>_{pre,post}.md)
- Run ruff format on tests/unit/test_microstrategy_source.py (Would reformat)
- Run mdPrettierWrite to format README.md and microstrategy_pre.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… file

- Fix `_request()` return type to `Any` to resolve 13+ mypy return-value errors
- Replace `List[Aspect]` TypeVar usage with `List[Any]` (TypeVar unbound outside generics)
- Fix `add_observed_query` to use `ObservedQuery(...)` dataclass instead of kwargs
- Call `.as_workunit()` on `gen_metadata()` output (returns MCPW, not MetadataWorkUnit)
- Add `Dict[str, Any]` annotation to `_MSTR_TYPE_MAP` to fix arg-type error
- Add `# type: ignore[method-assign]` to 24 MagicMock assignments in tests
- Fix `client._base_url` → `client.base_url` (correct attribute name)
- Add `# type: ignore[union-attr]` to entity.urn and as_workunits() test calls
- Rewrite microstrategy_pre.md with H3 baseline (H2 is disallowed in _pre.md)
- Create README.md with required Overview + Concept Mapping sections
- Create microstrategy_post.md with Capabilities, Limitations, Troubleshooting
- Create microstrategy_recipe.yml with minimal working example config
- Regenerate microstrategy_mces_golden.json against live demo.microstrategy.com

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rror visibility, dead code

- Add `get_workunit_processors()` override to wire `StaleEntityRemovalHandler`
  into the pipeline (was created but never invoked — deletion detection was broken)
- Add `MicroStrategySourceReport` with `report_dropped()` for pattern-filtered
  folders, dashboards, reports, and cubes
- Promote registry lookup failures and report definition fetch failures from
  DEBUG to WARNING and emit `report_warning()` for operator visibility
- Add `aggregator.close()` in `finally` block in `_emit_column_lineage_from_sql`
- Replace fragile `assert project_id is not None` guards with explicit `raise ValueError`
- Fix `_MSTR_TYPE_MAP` annotation from `Dict[str, Any]` to `Dict[str, Callable[[], Any]]`
- Delete five unused documentation-style classes from `constants.py` (~117 lines)
- Narrow `_response_json_dict` except to `(ValueError, JSONDecodeError)` in client
- Add explanatory comment to `_request() -> Any` return type in client
- Promote warehouse platform detection fallback messages from DEBUG to INFO

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot requested a deployment to datahub-wheels (Preview) April 12, 2026 02:13 Abandoned
…s_ingestion

Mock was patching `get_cube_schema` but production code calls `get_cube()`.
The mock never fired — the test was exercising the happy path in disguise
and silently passing without testing the failure-recovery path it claimed to cover.
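
The failure mode described above is easy to reproduce in isolation. In this sketch, `FakeClient` and `load_cube` are hypothetical stand-ins (only the method names `get_cube` and `get_cube_schema` come from the commit message): patching a method the code under test never calls leaves the happy path running, and checking `.called` on the mock exposes it.

```python
from typing import Any, Dict, Optional
from unittest.mock import MagicMock


class FakeClient:
    """Hypothetical stand-in for the real client."""

    def get_cube(self, cube_id: str) -> Dict[str, Any]:
        return {"id": cube_id}

    def get_cube_schema(self, cube_id: str) -> Dict[str, Any]:
        return {"id": cube_id, "schema": []}


def load_cube(client: Any, cube_id: str) -> Optional[Dict[str, Any]]:
    """Code under test: calls get_cube(), recovering from failures."""
    try:
        return client.get_cube(cube_id)
    except RuntimeError:
        return None


# Wrong target: the mock is installed, but load_cube() never touches it.
wrong_client = FakeClient()
wrong_client.get_cube_schema = MagicMock(side_effect=RuntimeError("boom"))
wrong_result = load_cube(wrong_client, "c1")
wrong_mock_fired = wrong_client.get_cube_schema.called

# Right target: the failure-recovery branch is actually exercised.
right_client = FakeClient()
right_client.get_cube = MagicMock(side_effect=RuntimeError("boom"))
right_result = load_cube(right_client, "c1")
```

Asserting on the mock itself (for example `assert_called_once_with`) in addition to the result is one way to keep a test like this from passing vacuously.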

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…solidate constants, remove dead client methods

- Extract `_AttrFormInfo` NamedTuple and `_iter_attr_forms()` static method to
  eliminate ~50 lines of parallel attribute-form iteration logic that existed
  identically in both `_build_input_fields` and `_build_cube_schema_metadata`
- Move all iServer error codes and dossier subtype constants from source.py
  local definitions into constants.py as the single source of truth; import
  them in source.py to remove the duplicate `ISERVER_PROJECT_UNAVAILABLE`
- Delete five dead methods from client.py: `get_dashboard_definition`
  (compatibility shim), `get_model_cube`, `get_model_tables`, `get_model_facts`,
  `get_lineage_for_object` (all superseded by the sqlView approach)
- Update test that tested the deleted `get_dashboard_definition` shim to
  directly test `get_dossier_definition` (the underlying implementation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused `DomainRegistry` import and instantiation — `domain_registry`
  was created in `__init__` but `get_domain_urn()` was never called anywhere;
  the DOMAINS capability is correctly annotated `supported=False`
- Add `MicroStrategyClient.close()` to release the underlying `requests.Session`
  connection pool after ingestion completes
- Override `MicroStrategySource.close()` to call `client.close()` then
  `super().close()`, ensuring the HTTP connection pool is always released

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rations catalog

- Regenerate datahub.json with microstrategy connector entry (56 lines)
  containing capabilities, platform name, and support status
- Remove api_connector=true flag from microstrategy in integrations_catalog.json;
  the flag is reserved for third-party API connectors, not native DataHub source
  plugins, and its presence caused docgen.py to crash with KeyError: 'microstrategy'

Fixes CI failures:
  - ci (3.10/3.11/3.12, testQuick): "Check autogenerated JSON files are up-to-date"
  - gh-pages: "Build Docs" (KeyError: 'microstrategy' in docgen.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

datahub-connector-tests bot commented Apr 12, 2026

Connector Tests Results

All connector tests passed for commit 4c49151

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

…se, fix container hierarchy

**Thread-safe client**
- Replace _project_context/session.headers mutation with per-request extra_headers
- Add threading.Lock for token refresh serialization
- Route DELETE methods through _request() for token refresh safety
- Remove 500 from retry status_forcelist (MSTR 500s are permanent app errors)
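
A rough sketch of the first two points, under stated assumptions: `TokenHolder` and its methods are hypothetical, not the PR's `MicroStrategyClient`. It shows a lock serializing token refresh so concurrent workers that hit an expired token trigger only one re-login, and headers being built per request instead of mutated on a shared session. The header names `X-MSTR-AuthToken` and `X-MSTR-ProjectID` are the ones the MicroStrategy REST API uses; how the connector wires them is assumed here.

```python
import threading
from typing import Callable, Dict, Optional


class TokenHolder:
    """Hypothetical helper: shares one auth token safely across threads."""

    def __init__(self, login: Callable[[], str]) -> None:
        self._login = login
        self._lock = threading.Lock()
        self._token: Optional[str] = None

    def get(self) -> str:
        with self._lock:
            if self._token is None:
                self._token = self._login()
            return self._token

    def refresh(self, stale_token: str) -> str:
        """Re-login only if nobody else has replaced the stale token yet."""
        with self._lock:
            if self._token is None or self._token == stale_token:
                self._token = self._login()
            return self._token

    def headers(self, project_id: Optional[str] = None) -> Dict[str, str]:
        # A fresh dict per request: no shared session state is mutated.
        headers = {"X-MSTR-AuthToken": self.get()}
        if project_id is not None:
            headers["X-MSTR-ProjectID"] = project_id
        return headers


login_calls = []


def _fake_login() -> str:
    login_calls.append(1)
    return f"token-{len(login_calls)}"


holder = TokenHolder(_fake_login)
first_token = holder.get()

# Eight workers all observe the same expired token at once.
workers = [threading.Thread(target=holder.refresh, args=(first_token,)) for _ in range(8)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Comparing against the stale token inside the lock is what keeps the eight workers from each logging in again.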

**Parallel prefetch for dashboards, reports, and expression cache**
- Parallel dashboard definition + warehouse SQL fetch via ThreadPoolExecutor
- Parallel report definition + warehouse lineage fetch
- Pre-warm expression cache for field formulas before entity processing

**Auto-detect warehouse database and schema from connection strings**
- Fetch connection strings via GET /api/datasources/connections/{id}
- Parse DATABASE/db/schema params from JDBC/ODBC connection strings
- warehouse_lineage_database and warehouse_lineage_schema now optional overrides
- Fix _qualify_table_name to handle 2-part names (prepend database for Snowflake)
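
As an illustration of the approach in this list, the sketch below pulls `DATABASE`/`db`/`schema` parameters out of a connection string and uses them to qualify short table names. Both functions are hypothetical simplifications (the real `_qualify_table_name` is not shown in this PR page), and the connection strings are made-up examples.

```python
import re
from typing import Dict, Optional

# Matches DATABASE=..., db=..., or schema=... in ODBC (";"-separated)
# or JDBC ("&"-separated) style strings. Illustrative only.
_PARAM = re.compile(r"\b(database|db|schema)\s*=\s*([^;&\s]+)", re.IGNORECASE)


def parse_connection_params(connection_string: str) -> Dict[str, str]:
    """Return {'database': ..., 'schema': ...} for whichever params are present."""
    found: Dict[str, str] = {}
    for key, value in _PARAM.findall(connection_string):
        normalized = "schema" if key.lower() == "schema" else "database"
        found.setdefault(normalized, value)  # first occurrence wins
    return found


def qualify_table_name(
    name: str, database: Optional[str] = None, schema: Optional[str] = None
) -> str:
    """Expand 1- or 2-part names toward database.schema.table when possible."""
    parts = name.split(".")
    if len(parts) == 1 and schema:
        parts = [schema] + parts
    if len(parts) == 2 and database:
        parts = [database] + parts
    return ".".join(parts)


odbc = parse_connection_params(
    "DRIVER={Snowflake};SERVER=acct.example.com;DATABASE=SALES;SCHEMA=PUBLIC"
)
jdbc = parse_connection_params("jdbc:snowflake://acct.example.com/?db=SALES&schema=PUBLIC")
qualified = qualify_table_name("PUBLIC.ORDERS", **odbc)
```

Explicit config values such as `warehouse_lineage_database` would take precedence over the parsed ones, since the commit describes them as overrides.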

**Fix container hierarchy for SDK V2 entities**
- Pass ContainerKey directly to parent_container instead of .as_urn() string
- Fixes missing container aspect and empty browsePathsV2 on dashboards/charts/datasets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix type annotation in TestQualifyTableName._make_source to satisfy
mypy. Regenerate live integration golden file against demo instance
to reflect container hierarchy and browse path changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… lineage

When a DataHub graph is available, try original-case and lowercase URN
variants against the catalog and return whichever actually exists (same
strategy as the SQL schema resolver). When the graph is unavailable,
fall back to the new convert_lineage_urns_to_lowercase config flag
(default True) so URNs match warehouse-ingested assets like Snowflake.
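
The resolution order described here can be sketched as below. This is a minimal illustration, not the PR's implementation: `resolve_upstream_urn` and the `exists` callback are hypothetical, and falling back to the flag when neither variant is found in the catalog is an assumption the commit message does not spell out.

```python
from typing import Callable, Optional


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def resolve_upstream_urn(
    platform: str,
    table_name: str,
    exists: Optional[Callable[[str], bool]] = None,  # hypothetical catalog lookup
    convert_lineage_urns_to_lowercase: bool = True,
) -> str:
    original = make_dataset_urn(platform, table_name)
    lowered = make_dataset_urn(platform, table_name.lower())
    if exists is not None:
        # Graph available: prefer whichever variant is actually in the catalog.
        for candidate in (original, lowered):
            if exists(candidate):
                return candidate
    # No graph (or, by assumption, no match): fall back to the config flag.
    return lowered if convert_lineage_urns_to_lowercase else original


catalog = {make_dataset_urn("snowflake", "sales.public.orders")}
with_graph = resolve_upstream_urn("snowflake", "SALES.PUBLIC.ORDERS", exists=catalog.__contains__)
without_graph = resolve_upstream_urn("snowflake", "SALES.PUBLIC.ORDERS")
```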

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Labels

devops: PR or Issue related to DataHub backend & deployment
ingestion: PR or Issue related to the ingestion of metadata
product: PR or Issue related to the DataHub UI/UX
