Eurostat 2.0 #1041

Open
wants to merge 13 commits into
base: 2.0
Conversation

@adrian-wojcik (Contributor) commented Sep 19, 2024

Summary

Adds the Eurostat connector.

Importance

Makes Eurostat available in Prefect 2.0.

Checklist

This PR:

  • follows the guidelines laid out in CONTRIBUTING.md
  • links relevant issue(s)
  • adds/updates tests (if appropriate)
  • adds/updates docstrings (if appropriate)
  • adds an entry in CHANGELOG.md

@trymzet (Contributor) left a comment

Added my comments

CHANGELOG.md (resolved)
Comment on lines 3 to 35
Structure for the Eurostat API connector.
This module provides functionalities for connecting to Eurostat API and download
the datasets. It includes the following features:
- Pulling json file with all data from specific dataset.
- Creating pandas Data Frame from pulled json file.
- Creating dataset parameters validation if specified.
Typical usage example:
eurostat = Eurostat()
eurostat.to_df(
dataset_code: str,
params: dict = None,
columns: list = None,
tests: dict = None,
)
Functions:
get_parameters_codes(dataset_code: str, url: str): Validate available API request
parameters and their codes.
validate_params(dataset_code: str, url: str, params: dict): Validates given
parameters against the available parameters in the dataset
eurostat_dictionary_to_df(*signals: list): Function for creating DataFrame from
JSON pulled from Eurostat
to_df(dataset_code: str, params: dict = None, columns: list = None,
tests: dict = None): Function responsible for getting response and creating
DataFrame using method 'eurostat_dictionary_to_df' with validation of provided
parameters and their codes if needed.
Contributor

All this info should be part of relevant docstrings (of the class and its methods); no need to repeat this info here.

Suggested change: remove the lines quoted above.
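
One way to act on this review comment is to move the description onto the class and its methods, where `help()` and documentation tools pick it up. The following is a minimal sketch only: the constructor arguments are assumptions based on the signatures quoted above, and the method body is a placeholder rather than the PR's actual code.

```python
from __future__ import annotations


class Eurostat:
    """Connector for downloading datasets from the Eurostat API.

    Args:
        dataset_code: Code of the Eurostat dataset to download.
        params: Optional mapping of parameter ID to a single parameter code.
    """

    def __init__(self, dataset_code: str, params: dict[str, str] | None = None):
        # Assumed constructor; the real class may accept more arguments.
        self.dataset_code = dataset_code
        self.params = params

    def to_df(self) -> list[dict]:
        """Download the dataset and return it in tabular form.

        Placeholder body: the real method would return a pandas DataFrame.
        """
        return []
```

With the text attached to the objects themselves, `help(Eurostat)` and `help(Eurostat.to_df)` show it without a separate module-level summary to keep in sync.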

Comment on lines 5 to 28
This module provides an intermediate wrapper between the prefect flow and the connector:
- Generate the Eurostat Cloud API connector.
- Create and return a pandas Data Frame with the response of the API.
Typical usage example:
data_frame = eurostat_to_df(
dataset_code: str,
params: dict = None,
columns: list = None,
tests: dict = None,
)
Functions:
eurostat_to_df(
dataset_code: str,
params: dict = None,
columns: list = None,
tests: dict = None,
):
Task to download data from Eurostat Cloud API.
Contributor

This info should be part of relevant function docstrings, no need to repeat this info here.

Suggested change: remove the lines quoted above.

Comment on lines 5 to 35
This module provides a prefect flow function to use the Eurostat connector:
- Call to the prefect task wrapper to get a final Data Frame from the connector.
- Upload that data to Azure Data Lake Storage.
Typical usage example:
eurostat_to_adls(
dataset_code: str,
params: dict = None,
columns: list = None,
tests: dict = None,
adls_path: str = None,
adls_credentials_secret: str = None,
overwrite_adls: bool = False,
adls_config_key: str = None,
)
Functions:
eurostat_to_adls(
dataset_code: str,
params: dict = None,
columns: list = None,
tests: dict = None,
adls_path: str = None,
adls_credentials_secret: str = None,
overwrite_adls: bool = False,
adls_config_key: str = None,
):
Flow to download data from Eurostat Cloud API and upload to ADLS.
Contributor

All this info should be part of relevant function docstrings, no need to repeat this info here.

Suggested change: remove the lines quoted above.

Comment on lines 68 to 76
A dictionary with optional URL parameters. The key represents the
parameter ID, while the value is the code for a specific parameter,
for example 'params = {'unit': 'EUR'}' where "unit" is the parameter
to set and "EUR" is the specific parameter code. You can add more
than one parameter, but only one code per parameter! So you CANNOT
provide a list of codes, e.g., 'params = {'unit': ['EUR', 'USD',
'PLN']}'. This parameter is REQUIRED in most cases to pull a specific
dataset from the API. Both the parameter and code must be provided
as a string! Defaults to None.
Contributor

All this info should just be included in typing: params: dict[str, str] | None = None. BTW you need to fix all the typing for the CI check to pass - I suggest running pre-commit locally before committing. See https://github.com/dyvenia/viadot/blob/2.0/CONTRIBUTING.md#pre-commit-hooks
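
The annotation tells type checkers and readers that each parameter maps to exactly one string code. Type hints are not enforced at runtime, so if the connector should also reject a list of codes when called, a small check is still needed. A minimal sketch, assuming a hypothetical helper named `validate_param_types` that is not part of the PR:

```python
from __future__ import annotations


def validate_param_types(params: dict[str, str] | None = None) -> dict[str, str]:
    """Return the parameters if every key and code is a string.

    Hypothetical helper for illustration only.
    """
    if params is None:
        return {}
    for key, code in params.items():
        # Reject lists of codes and any other non-string value.
        if not isinstance(key, str) or not isinstance(code, str):
            msg = f"Parameter {key!r} must map to a single string code, got {code!r}."
            raise TypeError(msg)
    return params
```

Calling `validate_param_types({"unit": "EUR"})` returns the dictionary unchanged, while `validate_param_types({"unit": ["EUR", "USD"]})` raises `TypeError`.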

def test_eurostat_dictionary_to_df():
    """Test eurostat_dictionary_to_df method from source class."""
    eurostat = EurostatMock(dataset_code="")  # You can pass an empty string or None
Contributor

Suggested change
- eurostat = EurostatMock(dataset_code="")  # You can pass an empty string or None
+ eurostat = EurostatMock(dataset_code="")

Comment on lines 52 to 55
URL = (
    "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0"
    "/data/ILC_DI04?format=JSON&lang=EN"
)
Contributor

This should be at the top of the module
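
A short sketch of the layout this comment asks for: imports first, then module-level constants, then the tests that use them. The test function below is hypothetical and only demonstrates that the constant is visible to every test in the module.

```python
from urllib.parse import parse_qs, urlsplit

# Module-level constant, defined once right after the imports.
URL = (
    "https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0"
    "/data/ILC_DI04?format=JSON&lang=EN"
)


def test_url_requests_json():
    # Hypothetical test: checks the query string of the shared constant.
    query = parse_qs(urlsplit(URL).query)
    assert query["format"] == ["JSON"]
    assert query["lang"] == ["EN"]
```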

Comment on lines 79 to 82
task = Eurostat(dataset_code="ILC_DI04").to_df()

assert isinstance(task, pd.DataFrame)
assert not task.empty
Contributor

Suggested change
- task = Eurostat(dataset_code="ILC_DI04").to_df()
- assert isinstance(task, pd.DataFrame)
- assert not task.empty
+ df = Eurostat(dataset_code="ILC_DI04").to_df()
+ assert isinstance(df, pd.DataFrame)
+ assert not df.empty

Comment on lines 63 to 67
task = Eurostat(dataset_code="ILC_DI04E")

with pytest.raises(ValueError, match="DataFrame is empty!"):
    with caplog.at_level(logging.ERROR):
        task.to_df()
Contributor

Suggested change
- task = Eurostat(dataset_code="ILC_DI04E")
- with pytest.raises(ValueError, match="DataFrame is empty!"):
-     with caplog.at_level(logging.ERROR):
-         task.to_df()
+ eurostat = Eurostat(dataset_code="ILC_DI04E")
+ with pytest.raises(ValueError, match="DataFrame is empty!"):
+     with caplog.at_level(logging.ERROR):
+         eurostat.to_df()

For a valid dataset code
"""
task = Eurostat(dataset_code="ILC_DI04").to_df()
Contributor

It looks like this test doesn't use the mocked source, so it's not a unit test. Either rewrite it (and the other tests below that do the same) so that it doesn't actually connect to the API, or move all these integration tests into the tests/integration directory.
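
To illustrate the first option, the HTTP call can be replaced with a mock so the test never touches the network. This is a sketch under stated assumptions: `fetch_dataset` is a hypothetical stand-in for whatever the connector uses to call the API, the URL is a placeholder, and the payload is invented for the example rather than real Eurostat data.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch


def fetch_dataset(url: str) -> dict:
    """Hypothetical stand-in for the connector's HTTP call."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def test_fetch_dataset_without_network():
    # Invented payload; only its round trip through the function matters here.
    payload = {"id": ["geo", "time"], "value": {"0": 1.5}}

    # The fake response must work as a context manager and expose read().
    fake_response = MagicMock()
    fake_response.__enter__.return_value.read.return_value = json.dumps(
        payload
    ).encode()

    # Patch the attribute that fetch_dataset looks up, so no request is sent.
    with patch("urllib.request.urlopen", return_value=fake_response) as mocked:
        result = fetch_dataset("https://example.invalid/data/ILC_DI04")

    mocked.assert_called_once_with("https://example.invalid/data/ILC_DI04")
    assert result == payload
```

Tests written this way can stay in the unit test suite; tests that need the live API belong with the integration tests.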

3 participants