This repository contains Python sample code for an Azure Databricks orientation. It covers a range of common usage scenarios, from beginner to intermediate level.
Section 1
- Mounting ADLS (Azure Data Lake Storage Gen2)
- Exploring sample data (csv) in ADLS with Pandas
- Plotting column value counts with Matplotlib & Seaborn
- Plotting data distributions with Matplotlib & Seaborn
- Plotting relationships between columns with Seaborn
- Plotting pair plots with Seaborn to find the set of features that best explains the relationship between two variables or forms the most clearly separated clusters
- Plotting correlations between columns/features with Matplotlib & Seaborn
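The mount step at the top of this section could be sketched roughly as follows. This is only an illustration: it assumes service principal (OAuth) authentication, which may not be what the notebooks use, and `build_adls_mount_args` together with the account, container, mount name and credential values are hypothetical placeholders. `dbutils.fs.mount` is only available inside a Databricks notebook, so that call is left as a comment.

```python
def build_adls_mount_args(account, container, mount_name,
                          client_id, client_secret, tenant_id):
    """Build keyword arguments for dbutils.fs.mount (hypothetical helper)."""
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {
        # abfss:// is the URI scheme for ADLS Gen2 storage accounts
        "source": f"abfss://{container}@{account}.dfs.core.windows.net/",
        "mount_point": f"/mnt/{mount_name}",
        "extra_configs": configs,
    }


# All values below are placeholders (assumptions), not real resources.
mount_args = build_adls_mount_args(
    account="examplestore",
    container="data",
    mount_name="sampledata",
    client_id="<application-id>",
    client_secret="<client-secret>",
    tenant_id="<tenant-id>",
)

# Inside a Databricks notebook the mount would then be created with:
# dbutils.fs.mount(**mount_args)
```

In practice the client secret would normally come from a secret scope rather than being written into the notebook.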
Section 2
- Mounting Azure Blob Storage
- Exploring sample data (json) in Azure Blob Storage with json and Pandas
- Flattening the first level of nested column data
- Flattening the second level of nested column data
- Plotting relationships between columns with Seaborn
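The two flattening steps in this section can be illustrated with a small standard-library sketch. The record layout, the field names and the `flatten_level` helper are all hypothetical, chosen only to show the idea of expanding one nested level at a time. The notebook itself may well do this differently, for example with Pandas.

```python
import json

# Hypothetical two-level nested sample; not taken from data/raw_nyc_phil.json.
raw = json.loads("""
{"programs": [
  {"season": "2001-02", "works": [
    {"title": "Work 1", "soloists": [{"name": "Soloist X", "role": "Piano"}]},
    {"title": "Work 2", "soloists": [{"name": "Soloist Y", "role": "Violin"},
                                     {"name": "Soloist Z", "role": "Cello"}]}
  ]},
  {"season": "2002-03", "works": [
    {"title": "Work 3", "soloists": []}
  ]}
]}
""")


def flatten_level(records, nested_key, prefix):
    """Return one row per child under nested_key, copying the parent fields.

    Child fields are renamed to '<prefix>_<field>'. Parents with no
    children produce no rows.
    """
    rows = []
    for rec in records:
        parent = {k: v for k, v in rec.items() if k != nested_key}
        for child in rec.get(nested_key) or []:
            row = dict(parent)
            row.update({f"{prefix}_{k}": v for k, v in child.items()})
            rows.append(row)
    return rows


# First level: one row per (program, work).
work_rows = flatten_level(raw["programs"], "works", "work")

# Second level: one row per (program, work, soloist).
soloist_rows = flatten_level(work_rows, "work_soloists", "soloist")

for row in soloist_rows:
    print(row)
```

The resulting list of flat dictionaries can be passed straight to `pandas.DataFrame(...)` for plotting.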
Section 3
- Connecting to Azure SQL Database to read source data via the Spark JDBC driver
- Querying Azure SQL Database with SQL via the Spark JDBC driver
- Installing msodbcsql17 with pyodbc
- Connecting to Azure SQL Database to read source data into Pandas via the pyodbc driver
- Making a subset of data from the ADLS dataframe
- Making a subset of data from the SQL (pyodbc) dataframe
- Appending the two dataframe subsets into one
- Plotting column value counts with Matplotlib & Seaborn
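The two connection routes in this section take differently shaped connection strings. The sketch below builds both. The helper names, server, database and credentials are hypothetical placeholders, and the calls that need a Spark session or an installed ODBC driver are shown only as comments.

```python
def build_jdbc_url(server, database, port=1433):
    """JDBC URL for the Microsoft SQL Server driver (hypothetical helper)."""
    return f"jdbc:sqlserver://{server}:{port};database={database}"


def build_odbc_conn_str(server, database, user, password):
    """ODBC connection string for the driver installed by msodbcsql17."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID={user};PWD={password}"
    )


# Placeholder values (assumptions), not real resources.
server = "example-server.database.windows.net"
jdbc_url = build_jdbc_url(server, "exampledb")
odbc_conn_str = build_odbc_conn_str(server, "exampledb", "<user>", "<password>")

# Spark route, inside a Databricks notebook:
# props = {"user": "<user>", "password": "<password>",
#          "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
# spark_df = spark.read.jdbc(url=jdbc_url, table="<schema.table>", properties=props)

# Pandas route, once msodbcsql17 and pyodbc are installed:
# import pyodbc, pandas as pd
# conn = pyodbc.connect(odbc_conn_str)
# pandas_df = pd.read_sql("SELECT ... FROM <schema.table>", conn)
```

Note that this simple string formatting does not escape special characters such as `;` or `}` in a password.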
Section 4
- Exploring sample data (csv) in ADLS with Pandas
- Data cleaning with Pandas
- Saving cleaned data back to ADLS
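A minimal sketch of what a Pandas cleaning pass of this kind might look like is shown below. The inline CSV, its column names and the specific cleaning rules are illustrative assumptions, not the contents of the notebook or of the sample data file. The write-back path is a placeholder and is left as a comment.

```python
import io

import pandas as pd

# Hypothetical messy input standing in for a CSV read from the mount.
csv_text = """Identifier,Place of Publication,Date of Publication,Title
206,London,1879 [1878],Book A
216,London; Virtue & Yorston,1868,Book B
218,London,,Book C
472,Oxford,[1851],Book D
"""
df = pd.read_csv(io.StringIO(csv_text)).set_index("Identifier")

# Keep the first four-digit year found in the free-text date column.
year = df["Date of Publication"].str.extract(r"(\d{4})", expand=False)
df["Date of Publication"] = pd.to_numeric(year)

# Keep only the place name before any ';' and trim whitespace.
df["Place of Publication"] = (
    df["Place of Publication"].str.split(";").str[0].str.strip()
)

# Drop rows where no year could be recovered.
cleaned = df.dropna(subset=["Date of Publication"])
print(cleaned)

# Saving back to the mounted storage (path is a placeholder):
# cleaned.to_csv("/dbfs/mnt/<mount-name>/<cleaned-file>.csv")
```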
Section 5
- Data cleaning and preparation with PySpark
List of Files
- data/ > sample source data directory
- data/pima-indians-diabetes-data.csv > Pima Indians Diabetes Database in CSV format
- data/pima-indians-diabetes-data-2.csv > Pima Indians Diabetes Database in CSV format with a column header row
- data/raw_nyc_phil.json > New York Philharmonic Performance History in JSON format
- data/BL-Flickr-Images-Book.csv > Sample CSV data for data cleaning
- Samples_for_Orientation_MASKED.ipynb > Notebook exported from Azure Databricks (for Sections 1 to 3)
- Samples_for_Orientation_MASKED.html > HTML export (with results and visuals) from Azure Databricks (for Sections 1 to 3)
- Samples_for_Orientation_2_MASKED.ipynb > Notebook exported from Azure Databricks (for Section 4)
- Samples_for_Orientation_2_MASKED.html > HTML export (with results and visuals) from Azure Databricks (for Section 4)
- Data_Cleansing_and_Preparation_with_PySpark_MASKED.ipynb > Notebook exported from Azure Databricks (for Section 5)
- Data_Cleansing_and_Preparation_with_PySpark_MASKED.html > HTML export (with results and visuals) from Azure Databricks (for Section 5)