CommonTransforms

CommonTransforms is a Python class that uses PySpark libraries to apply common transformations to a Spark dataframe.

Getting Started

Use %run magic command to include this notebook. The %run magic command will work equally well for Synapse Spark Pool and Databricks

%run "/<notebook path in workspace>/CommonTransforms"

Then instantiate the class by passing the dataframe

df = spark.read.csv(path)
ct = CommonTransforms(df)

Function Reference

CommonTransforms supports the following functions:

1. trim

Removes leading and trailing spaces from all string columns in the dataframe

Parameters: None
Usage:

df = ct.trim()

2. replaceNull

Replace null values in dataframe with a default value. The default value is applied to all columns or a subset of columns passed as a list. The default value could be numeric, string, date, timestamp, boolean or dictionary object. When dictionary object is passed, custom default values can be applied to specified columns. The default value is only applied to the columns of same data type. For e.g. if the default value is a string only the string columns which are null are replaced and the numeric columns are untouched.

Parameters:
- value - int, long, float, string, bool date, timestamp or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.
- subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
Usage:

df = ct.replaceNull(0)

df = ct.replaceNull("NA")

df = ct.replaceNull("1900-01-01T00:00:00","start_datetime")
df = ct.replaceNull("9999-12-31T23:59:59","end_datetime")

df = ct.ct.replaceNull({"passenger_count":1,"store_and_fwd_flag":"N","tip_amount":0})

3. deDuplicate

Delete duplicate records from dataframe with option to consider a subset of key columns.

Parameters:
- subset - optional list of column names to consider.
Usage:

df = ct.deDuplicate()

df = ct.deDuplicate(["col1","col2"])

4. utc_to_local

Convert all or a subset of timestamp columns from UTC to timestamp in local timezone

Parameters:
- localTimeZone - your target timezone specified as Country/City. Here is the list of timezones.
- subset - optional list of column names to consider.
Usage:

df = ct.utc_to_local("Australia/Sydney")

df = ct.utc_to_local("Australia/Sydney",["pickup_datetime","dropoff_datetime"])

5. local_to_utc

Convert all or a subset of timestamp columns from local timezone to UTC

Parameters:
- localTimeZone - your source timezone specified as Country/City. Here is the list of timezones.
- subset - optional list of column names to consider.
Usage:

df = ct.local_to_utc("Australia/Sydney")

df = ct.utc_to_local("Australia/Sydney",["recorded_datetime"])

6. changeTimezone

Converts all or selected timestamps in dataframe from one timezone to another.

Parameters:
- fromTimezone - specified as Country/City. Here is the list of timezones.
- toTimezone - specified as Country/City
Usage:

df = ct.changeTimezone("Australia/Sydney","America/New_York")

7. dropSysColumns

Drop columns that are either system or non-business from dataframe

Parameters:
columns - list of columns to be dropped
Usage:

df = ct.dropSysColumns(["col1","col2"])

8. addChecksumCol

Create a checksum column using all columns of the dataframe

Parameters:
colName - Name of the new checksum column
Usage:

df = ct.addChecksumCol("checksum")

9. julian_to_calendar

Converts a 5-digit or 7-digit Julian date to a calendar date

Parameters:
subset - a mandatory list of columns that contain a Julian date value
Usage:

df = df.withColumn("sys_date1",lit(20275)) #Date in Julian Format
df = ct.julian_to_calendar("sys_date1")
# Output=2020-10-01

10. calendar_to_julian

Converts a calendar date to 5-digit Julian date

Parameters:
subset - a mandatory list of columns that contain a date
Usage:

df = df.withColumn("sys_date2",lit("2020-10-01").cast("date")) #Date in Gregorian Format
df = ct.calendar_to_julian("sys_date2")
# Output=20275

11. addLitCols

Add a set of literal value columns to dataframe passed as dictionary parameter. For e.g adding audit columns to a dataframe

Parameters:
colDict - a mandatory dict object that contains key(column) and values.
Usage:

audit={"audit_key":66363,
       "pipeline_id":"56f633",
       "start_datetime": "2020-10-01T10:00:00",
       "end_datetime": "2020-10-01T10:02:05"}
df = ct.addLitCols(audit)

Implementation References

Microsoft Fabric data platform using ELT Framework
Azure Databricks data platform using ELT Framework
Azure Synapse data platform using ELT Framework

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

CommonTransforms

Getting Started

Function Reference

1. trim

2. replaceNull

3. deDuplicate

4. utc_to_local

5. local_to_utc

6. changeTimezone

7. dropSysColumns

8. addChecksumCol

9. julian_to_calendar

10. calendar_to_julian

11. addLitCols

Implementation References

Files

README.md

Latest commit

History

README.md

File metadata and controls

CommonTransforms

Getting Started

Function Reference

1. trim

2. replaceNull

3. deDuplicate

4. utc_to_local

5. local_to_utc

6. changeTimezone

7. dropSysColumns

8. addChecksumCol

9. julian_to_calendar

10. calendar_to_julian

11. addLitCols

Implementation References