feat: Power BI history details, table lists with DAX, plus new functionalities
RadekBuczkowski committed May 12, 2024
1 parent cf1d14a commit 184ff79
Showing 6 changed files with 2,475 additions and 402 deletions.
239 changes: 212 additions & 27 deletions docs/power_bi/README.md
@@ -1,46 +1,59 @@

# PowerBi and PowerBiClient classes

The `PowerBi` and `PowerBiClient` classes contain logic for refreshing
PowerBI datasets and for checking if the last dataset refresh completed
successfully.
The `PowerBi` and `PowerBiClient` classes contain logic for refreshing PowerBI
datasets, and for checking if the last refresh of an entire dataset or its
selected tables completed successfully. The logic can also be used to show
refresh histories of datasets, and to list dataset tables with their recent
refresh times. The same data can also be returned as a Spark data frame.

For easier PowerBI credential handling (service principal or AD user),
the first parameter to the `PowerBi` constructor must be a `PowerBiClient`
class object.

## PowerBI Permissions

To enable PowerBI API access in your PowerBI, you need to enable the setting
"Service principals can use Fabric APIs" in Fabric (see the screen-shot).
Additionally, you need to specify the user group that should have access
to the API.
To allow access to the PowerBI API, you need to enable the setting
"Service principals can use Fabric APIs" in the Admin Portal in Fabric
(see the screen-shot). There, you also need to specify the user group that
should have access to the API.

![Power BI admin settings](./admin_settings.png)

Apart from this, each PowerBI dataset should have a user or service principal
attached, that is part of this user group.

Additionally, to access individual tables and their refresh times in PowerBI,
the class must be able to execute DAX queries. This requires additional
permissions. The "Dataset Execute Queries REST API" option, found under
"Integration settings" in the Admin Portal, must also be enabled.

The same user or service principal must have dataset read and build
permissions in each individual dataset:

![Power BI admin settings](./user_permissions.png)


## Links

[Register an App and give the needed permissions. A very good how-to-guide can be found here.](https://www.sqlshack.com/how-to-access-power-bi-rest-apis-programmatically/)

[How to Refresh a Power BI Dataset with Python.](https://pbi-guy.com/2022/01/07/refresh-a-power-bi-dataset-with-python/)

### API documentation:

[Datasets - Refresh Dataset In Group](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/refresh-dataset-in-group)

[Datasets - Get Refresh History In Group](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-history-in-group)

[Get Workspaces](https://learn.microsoft.com/en-us/rest/api/power-bi/groups/get-groups)

[Get Datasets](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-datasets-in-group)

[Trigger A Dataset Refresh](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/refresh-dataset-in-group)

[Get Refresh History](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-history-in-group)

[Get Refresh History Details](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-execution-details-in-group)

[Execute DAX Queries](https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries-in-group)

# Usage of PowerBi and PowerBiClient classes

## Step 1: Create PowerBI credentials

The client ID, client secret, and tenant ID values should be stored in a key vault,
and loaded from the key vault or Databricks secret scope.
The client ID, client secret, and tenant ID values should be stored
in a key vault, and loaded from the key vault or Databricks secret scope.

```python
# example PowerBiClient credentials object
@@ -79,6 +92,19 @@ Available workspaces:
+----+--------------------------------------+----------------+
```

To get additional information about each workspace, use the
show_workspaces() or the get_workspaces() method instead.
The first method shows a list of workspaces, and the second returns
a Spark data frame, with the list of workspaces.

```python
# example listing of available workspaces
from spetlr.power_bi.PowerBi import PowerBi

client = MyPowerBiClient()
PowerBi(client).show_workspaces()
```

## Step 3: List available datasets

If no dataset parameter is specified, a list of available datasets
@@ -109,13 +135,44 @@ Available datasets:
+----+--------------------------------------+----------------+
```

To get additional information about each dataset, use the
show_datasets() or the get_datasets() method.
The first method shows a list of datasets, and the second returns
a Spark data frame, with the list of datasets.

If you don't specify any workspace, datasets from all workspaces
will be collected!

```python
# example listing of available datasets
from spetlr.power_bi.PowerBi import PowerBi

client = MyPowerBiClient()
PowerBi(client, workspace_name="Finance").show_datasets()

# alternatively:
PowerBi(client, workspace_id="614850c2-3a5c-4d2d-bcaa-d3f20f32a2e0").show_datasets()

# alternatively:
PowerBi(client).show_datasets()

```

## Step 4: Check the status and time of the last refresh of a given dataset

The check() method can be used to check the status and time of the last
refresh of a dataset. An exception will be raised if the last refresh failed,
or if the last refresh finished more than the given number of minutes ago.
The number of minutes can be specified in the optional
"max_minutes_after_last_refresh" parameter (default is 12 hours).
refresh of an entire dataset, or of individual dataset tables. An exception
will be raised if the last refresh failed, or if the last refresh finished more
than the given number of minutes ago. The number of minutes can be specified
in the optional "max_minutes_after_last_refresh" parameter
(default is 12 hours).

If you want to check only selected tables in the dataset, you can
specify the optional "table_names" parameter with a list of table names.
If the list is not empty, only the selected tables will be checked,
and the table that was refreshed first will be used for checking.
To only show the list of tables, specify an empty array:
table_names=[]
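
The rule described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the implementation used by the class: the function name `check_freshness`, the table names, and the exception types are hypothetical, and only the "oldest table decides" rule and the 12-hour default come from the text.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(refresh_times, max_minutes_after_last_refresh=12 * 60, now=None):
    """Return the oldest refresh time, or raise if it is too old.

    refresh_times maps a table name to its last refresh time (UTC).
    Hypothetical helper, not part of spetlr.
    """
    if not refresh_times:
        raise ValueError("No refresh times to check")
    now = now or datetime.now(timezone.utc)
    # The table that was refreshed first (the oldest one) decides the result.
    table, oldest = min(refresh_times.items(), key=lambda item: item[1])
    age = now - oldest
    if age > timedelta(minutes=max_minutes_after_last_refresh):
        raise RuntimeError(
            f"Table '{table}' was last refreshed {age} ago, which is more "
            f"than {max_minutes_after_last_refresh} minutes"
        )
    return oldest


# Made-up refresh times for two hypothetical tables.
now = datetime(2024, 2, 1, 12, 0, tzinfo=timezone.utc)
times = {
    "Invoices": datetime(2024, 2, 1, 9, 15, tzinfo=timezone.utc),
    "Customers": datetime(2024, 2, 1, 11, 30, tzinfo=timezone.utc),
}
oldest = check_freshness(times, max_minutes_after_last_refresh=180, now=now)
print(oldest.isoformat())
```

With a limit of 180 minutes the check passes, because the oldest table is 165 minutes old; lowering the limit to 60 minutes would raise an exception.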

You can also specify the optional "local_timezone_name" parameter to show
the last refresh time of the PowerBI dataset in a local time zone.
@@ -156,16 +213,27 @@ at 2024-02-01 10:15 (local time) !
## Step 5: Start a new refresh of a given dataset without waiting

The start_refresh() method starts a new refresh of the given PowerBI
dataset asynchronously. You need to call the check() method after waiting
for some sufficient time (e.g. from a separate monitoring job) to verify
if the refresh succeeded.
dataset asynchronously. To verify if the refresh succeeded, you need to
call the check() method after waiting some sufficiently long time
(e.g. from a separate monitoring job).

If you want to refresh only selected tables in the dataset, you can
specify the optional "table_names" parameter with a list of table names.
If the list is not empty, only the selected tables will be refreshed.
(Note: It is not possible to list available tables programmatically
using the PowerBI API, like you can do with workspaces and datasets.
You have to check the table names visually in PowerBI.)
To only show the list of tables, specify an empty array:
table_names=[]

If you set the optional "mail_on_failure" or "mail_on_completion"
parameters to True, an e-mail will be sent to the dataset owner when
the refresh fails or completes respectively. This is only supported for
regular Azure AD users. Service principals cannot send emails!

Additionally, you can set the optional "number_of_retries" parameter to
specify the number of retries on transient errors when calling refresh().
The "number_of_retries" parameter only works with enhanced API requests
(i.e. when the "table_names" parameter is also specified), and it will
be ignored otherwise.
Default is 0 (no retries). E.g. 1 means two attempts in total.

All parameters can only be specified in the constructor.
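
To make the difference between a normal and an enhanced refresh request more concrete, here is a minimal sketch of how a request body could be built. How the class actually maps its parameters is an assumption: `build_refresh_body` is a hypothetical helper, while `objects` and `retryCount` are field names from the "Refresh Dataset In Group" REST API documentation linked above.

```python
def build_refresh_body(table_names=None, number_of_retries=0):
    """Build a JSON body for the refresh REST call (illustrative sketch)."""
    if table_names is None:
        # Normal refresh of the entire dataset: no enhanced parameters are
        # sent, so number_of_retries has no effect here.
        return {}
    if len(table_names) == 0:
        # Per the text above, an empty list only lists the tables.
        raise ValueError("An empty table list does not trigger a refresh")
    # Enhanced refresh: only the selected tables, retried by the service.
    return {
        "objects": [{"table": name} for name in table_names],
        "retryCount": number_of_retries,
    }


print(build_refresh_body())
print(build_refresh_body(["Invoices", "Customers"], number_of_retries=1))
```

The sketch leaves out mail notifications; it only shows why the retry count is tied to the enhanced request.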

@@ -209,12 +277,18 @@ If you want to refresh only selected tables in the dataset, you can
specify the optional "table_names" parameter with a list of table names.
If the list is not empty, only selected tables will be refreshed
(and the previous refresh time will be ignored).
To only show the list of tables, specify an empty array:
table_names=[]

Additionally, you can set the optional "number_of_retries" parameter to
specify the number of retries on transient errors when calling refresh().
Default is 0 (no retries). E.g. 1 means two attempts in total.
It is used only when the "timeout_in_seconds" parameter allows it,
so you need to set the "timeout_in_seconds" parameter high enough.
The "number_of_retries" parameter is handled in a loop in this class,
and unlike in the start_refresh() method, it will work both with normal
refreshes (i.e. when "table_names" is not specified) and with enhanced
refreshes (i.e. when "table_names" is specified).
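
The interplay between "number_of_retries" and "timeout_in_seconds" can be illustrated with a small retry loop. This is a sketch under stated assumptions, not the code of the class: `refresh_with_retries`, the `run_refresh` callable, and the use of `ConnectionError` as the transient error are all hypothetical.

```python
import time


def refresh_with_retries(run_refresh, number_of_retries, timeout_in_seconds,
                         clock=time.monotonic):
    """Call run_refresh() up to number_of_retries + 1 times within the timeout."""
    deadline = clock() + timeout_in_seconds
    attempts = 0
    last_error = None
    # A new attempt starts only if retries remain and the timeout allows it.
    while attempts <= number_of_retries and clock() < deadline:
        attempts += 1
        try:
            return run_refresh(), attempts
        except ConnectionError as error:  # stand-in for a transient error
            last_error = error
    raise RuntimeError(f"Refresh failed after {attempts} attempt(s)") from last_error


calls = []


def flaky_refresh():
    """Fail on the first call and succeed on the second (demo only)."""
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError("transient error")
    return "Completed"


result, attempts = refresh_with_retries(
    flaky_refresh, number_of_retries=1, timeout_in_seconds=60
)
print(result, attempts)
```

With `number_of_retries=1` the second attempt succeeds, which matches the "1 means two attempts in total" rule; with `number_of_retries=0` the same failure would be reported immediately.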

You can also specify the optional "local_timezone_name" parameter to
show the last refresh time of the PowerBI dataset in a local time zone.
@@ -264,6 +338,13 @@ entries for each dataset, depending on the number of refreshes in the last 3 day
The most recent 60 are kept if they are all less than 3 days old. Entries
more than 3 days old are deleted when there are more than 20 entries."

If you don't specify any dataset and/or workspace, the history across all
datasets/workspaces will be collected. The datasets must be refreshable
and workspaces cannot be read-only to be included in the combined list.
To exclude specific PowerBI creators from the list, specify the optional
"exclude_creators" parameter, e.g.:
exclude_creators=["amelia@contoso.com"]

You can also specify the optional "local_timezone_name" parameter to convert
refresh times in the data frame to a local timezone. Depending on the parameter,
the names of the time columns in the data frame will have the suffix
@@ -297,15 +378,119 @@ df = PowerBi(client,
dataset_id="b1f0a07e-e348-402c-a2b2-11f3e31181ce",
local_timezone_name="Europe/Copenhagen").get_history()

df.display()
```

The "RefreshType" column has the following meaning:

RefreshType | Meaning
--- | ---
OnDemand | The refresh was triggered interactively through the Power BI portal.
OnDemandTraining | The refresh was triggered interactively through the Power BI portal with automatic aggregations training.
Scheduled | The refresh was triggered by a dataset refresh schedule setting.
ViaApi | The refresh was triggered by an API call, e.g. by using this class without the "table_names" parameter specified.
ViaEnhancedApi | The refresh was triggered by an enhanced API call, e.g. by using this class with the "table_names" parameter specified.
ViaXmlaEndpoint | The refresh was triggered through the Power BI public XMLA endpoint.

Only "ViaApi" and "ViaEnhancedApi" refreshes can be triggered by this class.
"ViaApi" are refreshes without the "table_names" parameter specified,
and "ViaEnhancedApi" are refreshes with the "table_names" parameter specified.

To see what tables were specified with each completed refresh marked as
"ViaEnhancedApi", you can use the show_history_details() and get_history_details()
methods, as shown below. They work in the same fashion and have the same parameters
as the show_history() and get_history() methods.
You can then use the "RequestId" column to join the data frames returned
by get_history() and get_history_details().

```python
# example show and get refresh history details
from spetlr.power_bi.PowerBi import PowerBi

client = MyPowerBiClient()
PowerBi(client,
        workspace_name="Finance",
        dataset_name="Invoicing",
        local_timezone_name="Europe/Copenhagen").show_history_details()

# alternatively:
df = PowerBi(client,
             workspace_id="614850c2-3a5c-4d2d-bcaa-d3f20f32a2e0",
             dataset_id="b1f0a07e-e348-402c-a2b2-11f3e31181ce",
             local_timezone_name="Europe/Copenhagen").get_history_details()

df.display()
```
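
The join on "RequestId" can be sketched without Spark, using plain lists of dictionaries. The rows below are made up, the "TableName" column is a hypothetical stand-in for whatever the details data frame contains, and `left_join` is an illustrative helper. With the Spark data frames returned by the two methods, the equivalent would be a left join on the "RequestId" column.

```python
from collections import defaultdict


def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on a shared key column."""
    by_key = defaultdict(list)
    for row in right_rows:
        by_key[row[key]].append(row)
    joined = []
    for left in left_rows:
        # Keep history rows without details (e.g. "ViaApi" refreshes).
        matches = by_key.get(left[key]) or [{}]
        for right in matches:
            joined.append({**left, **right})
    return joined


# Made-up rows standing in for get_history() and get_history_details().
history = [
    {"RequestId": "r1", "RefreshType": "ViaEnhancedApi"},
    {"RequestId": "r2", "RefreshType": "ViaApi"},
]
details = [
    {"RequestId": "r1", "TableName": "Invoices"},
    {"RequestId": "r1", "TableName": "Customers"},
]

for row in left_join(history, details, "RequestId"):
    print(row)
```

The enhanced refresh "r1" appears once per table, while the normal refresh "r2" is kept as a single row without table information.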

## Step 8: Show and get the tables in a given dataset

The show_tables() and get_tables() methods can be used to show and get
the list of tables used in a given dataset and their last refresh time.
The show_tables() method displays a Pandas data frame with the list of tables,
and the get_tables() method returns the actual data frame converted to
a Spark data frame.

If you don't specify any dataset and/or workspace, all tables across all
datasets/workspaces will be collected. Datasets requiring an effective
identity will be automatically skipped from the list (effective
identity is not supported by this class).
To exclude specific PowerBI creators from the list, specify the optional
"exclude_creators" parameter, e.g.
exclude_creators=["amelia@contoso.com"]
This can prevent "Skipped unauthorized" warnings.

You can also specify the optional "local_timezone_name" parameter to convert
table refresh times to a local timezone. Depending on the parameter,
the names of the time columns in the data frame will have the suffix
"Utc" or "Local".

All above parameters can only be specified in the constructor.
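
The effect of "local_timezone_name" on the time columns can be illustrated with the standard library. This is a sketch, not the code of the class: `localize_refresh_time` is a hypothetical helper, and only the "Utc" and "Local" suffixes and the time zone name come from the text.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def localize_refresh_time(refresh_time_utc, local_timezone_name=None):
    """Return (column_suffix, time) for a UTC refresh time."""
    if local_timezone_name is None:
        # No local time zone requested: keep UTC and the "Utc" suffix.
        return "Utc", refresh_time_utc
    local_time = refresh_time_utc.astimezone(ZoneInfo(local_timezone_name))
    return "Local", local_time


utc_time = datetime(2024, 2, 1, 9, 15, tzinfo=timezone.utc)
suffix, local_time = localize_refresh_time(utc_time, "Europe/Copenhagen")
print(suffix, local_time.strftime("%Y-%m-%d %H:%M"))  # Local 2024-02-01 10:15
```

Copenhagen is one hour ahead of UTC in February, so 09:15 UTC is shown as 10:15 local time.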

```python
# example show and get the table list
from spetlr.power_bi.PowerBi import PowerBi

client = MyPowerBiClient()
PowerBi(client,
        workspace_name="Finance",
        dataset_name="Invoicing",
        local_timezone_name="Europe/Copenhagen").show_tables()

# alternatively:
df = PowerBi(client,
             workspace_id="614850c2-3a5c-4d2d-bcaa-d3f20f32a2e0",
             dataset_id="b1f0a07e-e348-402c-a2b2-11f3e31181ce",
             local_timezone_name="Europe/Copenhagen").get_tables()

df.display()

```

# Testing

Due to license restrictions, testing requires a valid PowerBI license.
Because of this, testing must be executed manually in each project
that uses spetlr to refresh datasets.
Due to license restrictions, integration testing requires a valid
PowerBI license. Because of this, integration testing of this class
must be executed manually in each project that uses spetlr.

Recommended integration tests should include all above examples, i.e.
listing of workspaces and datasets, checking a refresh, and possibly
Binary file added docs/power_bi/user_permissions.png