All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

- Added `BusinessCore` source class
- Added `BusinessCoreToParquet` task class
- Added `verify` parameter to `handle_api_response()`.
- Added `to_parquet()` in `base.py`
- Added new source class `SAPRFCV2` in `sap_rfc.py` with new approximation.
- Added new parameter `rfc_replacement` to `sap_rfc_to_adls.py` to replace an extra separator character within a string column to avoid conflicts.
- Added `rfc_unique_id` in `SAPRFCV2` to merge chunks on this column.
- Added `close_connection()` to `SAPRFC` and `SAPRFCV2`
- Removed the `try-except` block and added new logic to remove extra separators in the `sap_rfc.py` source file, to avoid a mismatch in column lengths between iterative connections to SAP tables.
- When `SAP` tables are updated while the `sap_rfc.py` script is running and the data is read in chunks, the rows in the next chunk may no longer line up with the previous ones.
- Fixed the `sap_rfc.py` source file so it no longer breaks when both an extra separator appears in a row and new rows are added to the SAP table between iterations.

- Added `anonymize_df` task function to `task_utils.py` to anonymize data in selected columns of the DataFrame.
- Added `Hubspot` source class
- Added `HubspotToDF` task class
- Added `HubspotToADLS` flow class
- Added `CustomerGauge` source class
- Added `CustomerGaugeToDF` task class
- Added `CustomerGaugeToADLS` flow class

- Added `validate_date_filter` parameter to `Epicor` source, `EpicorOrdersToDF` task and `EpicorOrdersToDuckDB` flow. This parameter enables the user to decide whether or not the filter should be validated.
- Added `Mediatool` source class
- Added `MediatoolToDF` task class
- Added `MediatoolToADLS` flow class
- Added option to disable `check_dtypes_sort` in `ADLSToAzureSQL` flow.
- Added `query` parameter to `BigQueryToADLS` flow and `BigqueryToDF` task to allow entering a custom SQL query.
- Added new endpoint `conversations/details/query` connection to `Genesys` task.
- Added new task `filter_userid` in `GenesysToADLS` flow to filter by a list of user IDs passed by the user.

- Changed parameter name in `BigQueryToADLS` flow - from `credentials_secret` to `credentials_key`
- Added `view_type_time_sleep` to the Genesys `queue_performance_detail_view`.
- Added `FileNotFoundError` to catch failures in `MindfulToCSV` and when creating SQL tables.
- Added `check_dtypes_sort` task into `ADLSToAzureSQL` to check whether dtypes are properly sorted.
- Added `timeout` parameter to all `Task`s where it can be added.
- Added `timeout` parameter to all `Flow`s where it can be added.
- Added `adls_bulk_upload` task function to `task_utils.py`
- Added `get_survey_list` into the `Mindful` source file.

- Updated `genesys_to_adls.py` flow with the `adls_bulk_upload` task
- Updated `mindful_to_adls.py` flow with the `adls_bulk_upload` task
- Changed `MindfulToCSV` task to download surveys info.
- Added into `Genesys` the new view type `AGENT`.

- Changed data extraction logic for `Outlook` data.
- Added `credentials_loader` function in utils
- Added new columns to `Epicor` source - `RequiredDate` and `CopperWeight`
- Added timeout to `DuckDBQuery` and `SAPRFCToDF`
- Added support for SQL queries with comments to `DuckDB` source (see the sketch after this list)
- Added "WITH" to query keywords in `DuckDB` source
- Added `avro-python3` library to `requirements`
- Changed `duckdb` version to `0.5.1`
- Added new column into DataFrames created with `Mindful`.
- Added region parameter as an entry argument in `MindfulToADLS`.

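The following is a minimal sketch, using the plain `duckdb` Python package directly rather than viadot's `DuckDB` source wrapper, of the kind of statement the two entries above refer to: a query that starts with a comment and uses a `WITH` (CTE) clause.

```python
# Illustration only - uses the duckdb package directly, not viadot's DuckDB source.
import duckdb

con = duckdb.connect(database=":memory:")
con.execute(
    "CREATE TABLE sales AS SELECT * FROM (VALUES (1, 10.0), (2, 25.5)) AS t(id, amount)"
)

query = """
-- a leading comment like this, followed by a WITH clause
WITH totals AS (
    SELECT SUM(amount) AS total FROM sales
)
SELECT total FROM totals
"""
df = con.execute(query).fetchdf()  # returns the result as a pandas DataFrame
print(df)
```
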
- Fixed incorrect `if_exists="delete"` handling in `DuckDB.create_table_from_parquet()`
- Fixed `test_duckdb_to_sql_server.py` tests - revert to a previous version
- Removed `test__check_if_schema_exists()` test

- Added new column named `_viadot_downloaded_at_utc` in Genesys files with the datetime when it is created.
- Added sftp source class `SftpConnector`
- Added sftp tasks `SftpToDF` and `SftpList`
- Added sftp flows `SftpToAzureSQL` and `SftpToADLS`
- Added new source file `mindful` to connect with the Mindful API.
- Added new task file `mindful` to be called by the Mindful flow.
- Added new flow file `mindful_to_adls` to upload data from the Mindful API to ADLS.
- Added `recursive` parameter to `AzureDataLakeList` task

- Added `protobuf` library to requirements
- Added new flow - `SQLServerTransform` and new task `SQLServerQuery` to run queries on SQL Server
- Added `duckdb_query` parameter to `DuckDBToSQLServer` flow to enable creating a table from the output of SQL queries
- Added handling of an empty DF in `set_new_kv()` task
- Added `update_kv` and `filter_column` params to `SAPRFCToADLS` and `SAPToDuckDB` flows and added `set_new_kv()` task in `task_utils`
- Added Genesys API source `Genesys`
- Added tasks `GenesysToCSV` and `GenesysToDF`
- Added flows `GenesysToADLS` and `GenesysReportToADLS`
- Added `query` parameter to `PrefectLogs` flow

- Updated requirements.txt
- Changed `handle_api_response()` method by adding more request methods and a context manager
- Added `rfc_character_limit` parameter in `SAPRFCToDF` task, `SAPRFC` source, `SAPRFCToADLS` and `SAPToDuckDB` flows
- Added `on_bcp_error` and `bcp_error_log_path` parameters in `BCPTask`
- Added ability to process queries whose results exceed SAP's character-per-row limit in `SAPRFC` source
- Added new flow `PrefectLogs` for extracting all logs from Prefect with details
- Added `PrefectLogs` flow

- Changed `CheckColumnOrder` task and `ADLSToAzureSQL` flow to handle appending to a non-existing table
- Changed task order in `EpicorOrdersToDuckDB`, `SAPToDuckDB` and `SQLServerToDuckDB` - casting the DF to string before adding metadata
- Changed `add_ingestion_metadata_task()` to not add a metadata column when the input DataFrame is empty
- Changed `check_if_empty_file()` logic according to changes in `add_ingestion_metadata_task()`
- Changed accepted values of the `if_empty` parameter in `DuckDBCreateTableFromParquet`
- Updated `.gitignore` to ignore files with the `*.bak` extension and to ignore `credentials.json` in any directory
- Changed logger messages in `AzureDataLakeRemove` task

- Fixed handling of an empty response in `SAPRFC` source
- Fixed issue in `BCPTask` when the log file couldn't be opened.
- Fixed log being printed too early in `Salesforce` source, which would sometimes cause a `KeyError`
- `raise_on_error` now behaves correctly in `upsert()` when receiving incorrect return codes from Salesforce

- Removed option to run multiple queries in `SAPRFCToADLS`
- Added `error_log_file_path` parameter in `BCPTask` that enables setting the name of the error log file
- Added `on_error` parameter in `BCPTask` that tells what to do if a bcp error occurs.
- Added error log file and `on_bcp_error` parameter in `ADLSToAzureSQL`
- Added handling of POST requests in `handle_api_response()` and added it to the `Epicor` source.
- Added `SalesforceToDF` task
- Added `SalesforceToADLS` flow
- Added `overwrite_adls` option to `BigQueryToADLS` and `SharepointToADLS`
- Added `cast_df_to_str` task in `utils.py` and added it to `EpicorToDuckDB`, `SAPToDuckDB`, `SQLServerToDuckDB`
- Added `if_empty` parameter in `DuckDBCreateTableFromParquet` task and in `EpicorToDuckDB`, `SAPToDuckDB`, `SQLServerToDuckDB` flows to check if the output Parquet file is empty and handle it properly.
- Added `check_if_empty_file()` and `handle_if_empty_file()` in `utils.py`
- Added new connector - Outlook. Created `Outlook` source, `OutlookToDF` task and `OutlookToADLS` flow.
- Added new connector - Epicor. Created `Epicor` source, `EpicorToDF` task and `EpicorToDuckDB` flow.
- Enabled Databricks Connect in the image. To enable, follow this guide.
- Added `MySQL` source and `MySqlToADLS` flow
- Added `SQLServerToDF` task
- Added `SQLServerToDuckDB` flow which downloads data from a SQL Server table, loads it to a Parquet file and then uploads it to DuckDB
- Added complete proxy set up in `SAPRFC` example (`viadot/examples/sap_rfc`)

- Changed default name for the Prefect secret holding the name of the Azure KV secret storing Sendgrid credentials
- Added `func` parameter to `SAPRFC`
- Added `SAPRFCToADLS` flow which downloads data from SAP database to a pandas DataFrame, exports the DF to CSV and uploads it to Azure Data Lake.
- Added `adls_file_name` in `SupermetricsToADLS` and `SharepointToADLS` flows
- Added `BigQueryToADLS` flow class which enables extracting data from BigQuery.
- Added `Salesforce` source
- Added `SalesforceUpsert` task
- Added `SalesforceBulkUpsert` task
- Added C4C secret handling to `CloudForCustomersReportToADLS` flow (`c4c_credentials_secret` parameter)

- Fixed `get_flow_last_run_date()` incorrectly parsing the date
- Fixed C4C secret handling (tasks now correctly read the secret as the credentials, rather than assuming the secret is a container for credentials for all environments and trying to access a specific key inside it). In other words, tasks now assume the secret holds credentials, rather than a dict of the form `{env: credentials, env2: credentials2}`
- Fixed `utils.gen_bulk_insert_query_from_df()` failing with > 1000 rows due to the INSERT clause limit, by chunking the data into multiple INSERTs (see the sketch after this list)
- Fixed `get_flow_last_run_date()` incorrectly parsing the date
- Fixed `MultipleFlows` when one flow is passed and when the last flow fails.

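A minimal, hypothetical sketch of the chunking approach described in the `gen_bulk_insert_query_from_df()` fix above (not viadot's actual implementation): rows are split into batches of at most 1000, the limit SQL Server imposes on a single `INSERT ... VALUES` clause.

```python
# Hypothetical illustration only - not the real gen_bulk_insert_query_from_df().
# Values are rendered with repr() for brevity; a real implementation must quote/escape properly.
import pandas as pd


def build_chunked_inserts(df: pd.DataFrame, table: str, chunk_size: int = 1000) -> list:
    columns = ", ".join(df.columns)
    queries = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start : start + chunk_size]
        values = ",\n".join(
            "(" + ", ".join(repr(v) for v in row) + ")"
            for row in chunk.itertuples(index=False)
        )
        queries.append(f"INSERT INTO {table} ({columns})\nVALUES\n{values};")
    return queries


# 2500 rows -> three INSERT statements (1000 + 1000 + 500 rows), each under the limit.
df = pd.DataFrame({"id": range(2500), "name": ["a"] * 2500})
print(len(build_chunked_inserts(df, "dbo.my_table")))  # 3
```
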
- Added `AzureDataLakeRemove` task

- Changed name of task file from `prefect` to `prefect_date_range`
- Fixed out of range issue in `prefect_date_range`
- bumped version

- Added `custom_mail_state_handler` task that sends email notification using a custom SMTP server.
- Added new function `df_clean_column` that cleans data frame columns from special characters
- Added `df_clean_column` util task that removes special characters from a pandas DataFrame
- Added `MultipleFlows` flow class which enables running multiple flows in a given order.
- Added `GetFlowNewDateRange` task to change date range based on Prefect flows
- Added `check_col_order` parameter in `ADLSToAzureSQL`
- Added new source `ASElite`
- Added KeyVault support in `CloudForCustomers` tasks
- Added `SQLServer` source
- Added `DuckDBToDF` task
- Added `DuckDBTransform` flow
- Added `SQLServerCreateTable` task
- Added `credentials` param to `BCPTask`
- Added `get_sql_dtypes_from_df` and `update_dict` util tasks
- Added `DuckDBToSQLServer` flow
- Added `if_exists="append"` option to `DuckDB.create_table_from_parquet()`
- Added `get_flow_last_run_date` util function
- Added `df_to_dataset` task util for writing DataFrames to data lakes using `pyarrow`
- Added retries to Cloud for Customers tasks
- Added `chunksize` parameter to `C4CToDF` task to allow pulling data in chunks
- Added `chunksize` parameter to `BCPTask` task to allow more control over the load process
- Added support for SQL Server's custom `datetimeoffset` type
- Added `AzureSQLToDF` task
- Added `AzureDataLakeRemove` task
- Added `AzureSQLUpsert` task

- Changed the base class of `AzureSQL` to `SQLServer`
- `df_to_parquet()` task now creates directories if needed
- Added several more separators to check for automatically in `SAPRFC.to_df()`
- Upgraded `duckdb` version to 0.3.2

- Fixed bug with `CheckColumnOrder` task
- Fixed OpenSSL config for old SQL Servers still using TLS < 1.2
- `BCPTask` now correctly handles custom SQL Server port
- Fixed `SAPRFC.to_df()` ignoring user-specified separator
- Fixed temporary CSV generated by the `DuckDBToSQLServer` flow not being cleaned up
- Fixed some mappings in `get_sql_dtypes_from_df()` and optimized performance
- Fixed `BCPTask` - the case when the file path contained a space
- Fixed credential evaluation logic (`credentials` is now evaluated before `config_key`)
- Fixed "$top" and "$skip" values being ignored by `C4CToDF` task if provided in the `params` parameter
- Fixed `SQL.to_df()` incorrectly handling queries that begin with whitespace

- Removed `autopick_sep` parameter from `SAPRFC` functions. The separator is now always picked automatically if not provided.
- Removed `dtypes_to_json` task (moved to `task_utils.py`)

- fixed an issue with schema info within `CheckColumnOrder` class.

- `ADLSToAzureSQL` - added `remove_tab` parameter to remove unnecessary tab separators from data.

- fixed an issue with the returned DF within `CheckColumnOrder` class.

- new source `SAPRFC` for connecting with SAP using the `pyRFC` library (requires pyrfc as well as the SAP NW RFC library that can be downloaded here)
- new source `DuckDB` for connecting with the `DuckDB` database
- new task `SAPRFCToDF` for loading data from SAP to a pandas DataFrame
- new tasks, `DuckDBQuery` and `DuckDBCreateTableFromParquet`, for interacting with DuckDB
- new flow `SAPToDuckDB` for moving data from SAP to DuckDB
- Added `CheckColumnOrder` task
- C4C connection with `url` and `report_url` documentation
- `SQLiteInsert` check if DataFrame is empty or object is not a DataFrame
- KeyVault support in `SharepointToDF` task
- KeyVault support in `CloudForCustomers` tasks

- pinned Prefect version to 0.15.11
- `df_to_csv` now creates dirs if they don't exist
- `ADLSToAzureSQL` - when data in CSV columns has unnecessary "\t" characters, they are now removed

- fixed an issue with `duckdb` calls seeing initial db snapshot instead of the updated state (#282)
- C4C connection with `url` and `report_url` optimization
- column mapper in C4C source

- new option to `ADLSToAzureSQL` Flow - `if_exists="delete"`
- `SQL` source: `create_table()` already handles `if_exists`; now it handles a new option for `if_exists()`
- `C4CToDF` and `C4CReportToDF` tasks are provided as a class instead of function

- Appending issue within `CloudForCustomers` source

- An early return bug in `UKCarbonIntensity` in `to_df` method

- authorization issue within `CloudForCustomers` source

- Added support for file path to `CloudForCustomersReportToADLS` flow
- Added `flow_of_flows` list handling
- Added support for JSON files in `AzureDataLakeToDF`
- `Supermetrics` source: `to_df()` now correctly handles `if_empty` in case of empty results
- `Sharepoint` and `CloudForCustomers` sources will now provide an informative `CredentialError` which is also raised early. This will make issues with input credentials immediately clear to the user.
- Removed `set_key_value` from `CloudForCustomersReportToADLS` flow

- Added `Sharepoint` source
- Added `SharepointToDF` task
- Added `SharepointToADLS` flow
- Added `CloudForCustomers` source
- Added `c4c_report_to_df` task
- Added `c4c_to_df` task
- Added `CloudForCustomersReportToADLS` flow
- Added `df_to_csv` task to `task_utils.py`
- Added `df_to_parquet` task to `task_utils.py`
- Added `dtypes_to_json` task to `task_utils.py`

- `ADLSToAzureSQL` - fixed path to csv issue.
- `SupermetricsToADLS` - fixed local json path issue.

- CI/CD: `dev` image is now only published on push to the `dev` branch
- Docker:
  - updated registry links to use the new `ghcr.io` domain
  - `run.sh` now also accepts the `-t` option. When run in standard mode, it will only spin up the `viadot_jupyter_lab` service. When run with `-t dev`, it will also spin up `viadot_testing` and `viadot_docs` containers.

- `ADLSToAzureSQL` - fixed path parameter issue.

- Added `SQLiteQuery` task
- Added `CloudForCustomers` source
- Added `CloudForCustomersToDF` and `CloudForCustomersToCSV` tasks
- Added `CloudForCustomersToADLS` flow
- Added support for parquet in `CloudForCustomersToDF`
- Added style guidelines to the `README`
- Added local setup and commands to the `README`

- Changed CI/CD algorithm
  - the `latest` Docker image is now only updated on release and is the same exact image as the latest release
  - the `dev` image is released only on pushes and PRs to the `dev` branch (so dev branch = dev image)

- Modified `ADLSToAzureSQL` - `read_sep` and `write_sep` parameters added to the flow.

- Fixed `ADLSToAzureSQL` breaking in `"append"` mode if the table didn't exist (#145).
- Fixed `ADLSToAzureSQL` breaking in promotion path for csv files.

- Added flows library docs to the references page
- Moved task library docs page to topbar
- Updated docs for task and flows
- Added `start` and `end_date` parameters to `SupermetricsToADLS` flow
- Added a tutorial on how to pull data from `Supermetrics`
- Added documentation (both docstrings and MKDocs docs) for multiple tasks
- Added `start_date` and `end_date` parameters to the `SupermetricsToAzureSQL` flow
- Added a temporary workaround `df_to_csv_task` task to the `SupermetricsToADLS` flow to handle mixed dtype columns not handled automatically by DataFrame's `to_parquet()` method (see the sketch below)

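Below is a small illustration, not viadot's actual `df_to_csv_task`, of why a mixed-dtype object column can break `DataFrame.to_parquet()` and of one possible mitigation (casting to a uniform string dtype); the flow's workaround goes through CSV instead. Assumes pandas with pyarrow installed.

```python
# Illustration only - shows the mixed-dtype problem the workaround addresses.
import pandas as pd

df = pd.DataFrame({"value": [1, "two", 3.0]})  # object column with mixed types

try:
    df.to_parquet("data.parquet")  # pyarrow usually cannot infer a single Arrow type here
except Exception as exc:
    print(f"to_parquet failed: {exc}")
    df.astype(str).to_parquet("data.parquet")  # cast to a uniform dtype, then write
```
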
- Modified `RunGreatExpectationsValidation` task to use the built-in support for evaluation parameters added in Prefect v0.15.3
- Modified `SupermetricsToADLS` and `ADLSGen1ToAzureSQLNew` flows to align with this recipe for reading the expectation suite JSON. The suite now has to be loaded before flow initialization in the flow's python file and passed as an argument to the flow's constructor.
- Modified `RunGreatExpectationsValidation`'s `expectations_path` parameter to point to the directory containing the expectation suites instead of the Great Expectations project directory, which was confusing. The project directory is now only used internally and not exposed to the user.
- Changed the logging of docs URL for `RunGreatExpectationsValidation` task to use GE's recipe from the docs

- Added a test for `SupermetricsToADLS` flow
- Added a test for `AzureDataLakeList` task
- Added PR template for new PRs
- Added a `write_to_json` util task to the `SupermetricsToADLS` flow. This task dumps the input expectations dict to the local filesystem as is required by Great Expectations. This allows the user to simply pass a dict with their expectations and not worry about the project structure required by Great Expectations
- Added `Shapely` and `imagehash` dependencies required for full `visions` functionality (installing `visions[all]` breaks the build)
- Added more parameters to control CSV parsing in the `ADLSGen1ToAzureSQLNew` flow
- Added `keep_output` parameter to the `RunGreatExpectationsValidation` task to control Great Expectations output to the filesystem
- Added `keep_validation_output` parameter and `cleanup_validation_clutter` task to the `SupermetricsToADLS` flow to control Great Expectations output to the filesystem

- Removed `SupermetricsToAzureSQLv2` and `SupermetricsToAzureSQLv3` flows
- Removed `geopy` dependency

- Added support for parquet in `AzureDataLakeToDF`
- Added proper logging to the `RunGreatExpectationsValidation` task
- Added the `viz` Prefect extra to requirements to allow flow visualization
- Added a few utility tasks in `task_utils`
- Added `geopy` dependency
- Tasks:
  - `AzureDataLakeList` - for listing files in an ADLS directory
- Flows:
  - `ADLSToAzureSQL` - promoting files to conformed, operations, creating an SQL table and inserting the data into it
  - `ADLSContainerToContainer` - copying files between ADLS containers
- Renamed `ReadAzureKeyVaultSecret` and `RunAzureSQLDBQuery` tasks to match Prefect naming style
- Flows:
  - `SupermetricsToADLS` - changed csv to parquet file extension. File and schema info are loaded to the `RAW` container.

- Removed the broken version autobump from CI
- Flows:
  - `SupermetricsToADLS` - supporting immutable ADLS setup

- A default value for the `ds_user` parameter in `SupermetricsToAzureSQLv3` can now be specified in the `SUPERMETRICS_DEFAULT_USER` secret
- Updated multiple dependencies
- Fixed "Local run of `SupermetricsToAzureSQLv3` skips all tasks after `union_dfs_task`" (#59)
- Fixed the `release` GitHub action

- Sources:
  - `AzureDataLake` (supports gen1 & gen2)
  - `SQLite`
- Tasks:
  - `DownloadGitHubFile`
  - `AzureDataLakeDownload`
  - `AzureDataLakeUpload`
  - `AzureDataLakeToDF`
  - `ReadAzureKeyVaultSecret`
  - `CreateAzureKeyVaultSecret`
  - `DeleteAzureKeyVaultSecret`
  - `SQLiteInsert`
  - `SQLiteSQLtoDF`
  - `AzureSQLCreateTable`
  - `RunAzureSQLDBQuery`
  - `BCPTask`
  - `RunGreatExpectationsValidation`
  - `SupermetricsToDF`
- Flows:
  - `SupermetricsToAzureSQLv1`
  - `SupermetricsToAzureSQLv2`
  - `SupermetricsToAzureSQLv3`
  - `AzureSQLTransform`
  - `Pipeline`
  - `ADLSGen1ToGen2`
  - `ADLSGen1ToAzureSQL`
  - `ADLSGen1ToAzureSQLNew`
- Examples:
  - Hello world flow
  - Supermetrics Google Ads extract

- Tasks now use secrets for credential management (Azure tasks use Azure Key Vault secrets)
- SQL source now has a default query timeout of 1 hour
- Fix `SQLite` tests
- Multiple stability improvements with retries and timeouts
- Moved from poetry to pip
- Fix `AzureBlobStorage`'s `to_storage()` method missing the final upload blob part