snowflake_proj15_aws_load_handling_json

Project Workflow

Create AWS Account (Free Plan)
- Sign up for an AWS account.
Create S3 Bucket
- Go to Services > S3 > Create bucket.
- Bucket Name: debashisbucket1999.
- Click Create bucket.
Set Up Folders
- Inside the bucket, create two folders:
  - csv
  - json
Upload Files
- Upload netflix_titles.csv to the csv folder.
- Upload musical_instruments.json to the json folder.
Create IAM Role & Policy
- Go to AWS Console > Services > IAM > Roles (under Access Management).
- Create Role → Select AWS Account → Require External ID (Snowflake integration).
- Add permission: Search S3, select AmazonS3FullAccess.
- Role Name: debashis_snowflake.
- Create the role.
Create Storage Integration in Snowflake
- Create an integration object s3_int with:
  - STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::825442073569:role/debashis_snowflake'
  - STORAGE_ALLOWED_LOCATIONS = ('s3://debashisbucket1999/csv/', 's3://debashisbucket1999/json/')
Grant Trust Between AWS & Snowflake
- Run DESC INTEGRATION s3_int; in Snowflake.
- Copy AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from the output.
- In AWS Console > IAM > Roles > debashis_snowflake > Trust Relationships:
  - Edit trust policy.
  - Add AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values.
- Save and update policy (this step establishes trust from AWS side).
Create Snowflake Objects
- Create target tables.
- Create file format objects.
- Create stage object (attach s3_int and file format object).
Load Data into Snowflake
- Use COPY command to load data from S3 into respective Snowflake tables.
- CSV → Loaded directly into structured table.
- JSON → Loaded into raw table.
Extra Step: Handling JSON
- Parsed JSON data using $1.
- Extracted columns, applied data types, and formatted structured tables for analytics.

Step 1: Introduce Columns from JSON

SELECT 
  $1:asin,
  $1:helpful,
  $1:overall,
  $1:reviewText,
  $1:reviewTime,
  $1:reviewerID,
  $1:reviewTime,
  $1:reviewerName,
  $1:summary,
  $1:unixReviewTime
FROM @json_folder;

Explanation of JSON Processing in Snowflake

FROM @json_folder

Reads data directly from the Snowflake stage (@json_folder) where the JSON file is stored.
Each row in the file is treated as a JSON object stored in column $1.

$1:fieldName

$1 = the raw JSON object.
:fieldName = extracts the value of a specific key from the JSON.

Example:

$1:asin → Extracts the product ID.
$1:reviewText → Extracts the customer review text.
$1:helpful → Extracts an array showing helpfulness votes.

Why this step?

Converts semi-structured JSON into structured columns.
Facilitates exploration and validation of fields within the JSON.
Enables subsequent transformations and cleansing (e.g., converting dates, flattening arrays).

Significance:

Serves as the initial step for processing JSON in Snowflake.
Provides a clear view of all attributes, aiding in planning data type conversions and further data cleansing.

Step 2: Format Columns & Use DATE Function

SELECT 
  $1:asin::STRING AS ASIN,
  $1:helpful AS helpful,
  $1:overall AS overall,
  $1:reviewText::STRING AS reviewtext,
  $1:reviewTime::STRING,
  $1:reviewerID::STRING,
  $1:reviewTime::STRING,
  $1:reviewerName::STRING,
  $1:summary::STRING,
  DATE($1:unixReviewTime::int) AS Reviewtime
FROM @json_folder;

Explanation: JSON Data Processing and Type Casting in Snowflake

Casting with `::STRING`

Converts JSON fields into proper string format for consistency.
Example: Fields like asin, reviewText, reviewerName, summary are cast to strings for easier querying.

Keeping Raw Fields (`helpful`, `overall`)

helpful → Retained as an array (e.g., [0,0]).
overall → Numeric rating (float) that does not require casting yet.

Using `DATE()` Function

unixReviewTime in JSON is stored as a Unix timestamp (e.g., 1393545600).
Casting with ::int ensures it is treated as a number.
Wrapping in DATE() converts it into a readable date format (e.g., 2014-02-28).
Purpose: Facilitates filtering, aggregating, or partitioning by date.

Why this step?

Ensures all fields have consistent Snowflake data types.
Makes text fields queryable (instead of remaining as semi-structured VARIANT).
Converts raw Unix timestamps into actual date columns for analysis.

Significance:

Transforms raw JSON attributes into cleaned and typed columns.
Essential preparation step before loading into curated tables or using in reporting.

Step 3: Converting `reviewTime` String Using `DATE_FROM_PARTS`

SELECT 
  $1:asin::STRING AS ASIN,
  $1:helpful AS helpful,
  $1:overall AS overall,
  $1:reviewText::STRING AS reviewtext,
  DATE_FROM_PARTS(
      RIGHT($1:reviewTime::STRING,4),             -- Year (last 4 characters)
      LEFT($1:reviewTime::STRING,2),              -- Month (first 2 characters)
      SUBSTRING($1:reviewTime::STRING,4,2)        -- Day (middle part)
  ) AS review_date,
  $1:reviewerID::STRING,
  $1:reviewTime::STRING,
  $1:reviewerName::STRING,
  $1:summary::STRING,
  DATE($1:unixReviewTime::int) AS unixReviewtime
FROM @json_folder;

Explanation: Handling `reviewTime` in JSON Data for Snowflake

Problem with `reviewTime`

In the JSON file, reviewTime is stored as a string (e.g., "02 28, 2014").
Snowflake does not automatically recognize this format as a date.

Using `DATE_FROM_PARTS()`

This function constructs a valid date from Year, Month, and Day components.
String Manipulation Breakdown:
- RIGHT(..., 4) → Extracts the year (e.g., 2014).
- LEFT(..., 2) → Extracts the month (e.g., 02).
- SUBSTRING(..., 4, 2) → Extracts the day (e.g., 28).
Combined, these produce DATE_FROM_PARTS(2014, 02, 28) → 2014-02-28.

Why is this still a difficulty?

The original string contains a comma (e.g., "02 28, 2014").
Parsing requires careful positioning of substring indexes or removing punctuation using REPLACE().
Direct casting to a date does not work; the string must be cleaned or transformed first.

Why is this step important?

Highlights the challenge of converting messy string dates into a structured DATE type.
Ensures consistency between reviewTime (string) and unixReviewTime (numeric timestamp).
Prepares data for time-based analysis, such as filtering, grouping, or trending.

Error Encountered in Processing `reviewTime` (Step 3)

Error

Error Message: Numeric value '6,' is not recognized
Cause: The reviewTime string in the JSON is formatted as "MM DD, YYYY" (e.g., "06 28, 2014"). The comma after the day gets included in the substring, causing Snowflake to read invalid values like 6, instead of 28.

Why this happens

DATE_FROM_PARTS requires clean integers for Year, Month, and Day.
Due to the comma in the original string, substring extraction (e.g., SUBSTRING(..., 4, 2)) does not always return valid numbers.

Key Takeaway

Direct substring parsing from messy date strings can fail.
The raw string must be cleaned (e.g., remove the comma) before using DATE_FROM_PARTS.
A better approach will be demonstrated in the next step.

Step 4: Handling Comma Issue in `reviewTime` with `DATE_FROM_PARTS`

SELECT 
  $1:asin::STRING AS ASIN,
  $1:helpful AS helpful,
  $1:overall AS overall,
  $1:reviewText::STRING AS reviewtext,
  DATE_FROM_PARTS( 
    RIGHT($1:reviewTime::STRING,4),  -- Year
    LEFT($1:reviewTime::STRING,2),   -- Month
    CASE 
      WHEN SUBSTRING($1:reviewTime::STRING,5,1) = ',' 
           THEN SUBSTRING($1:reviewTime::STRING,4,1)   -- Single-digit day (e.g., '6,')
      ELSE SUBSTRING($1:reviewTime::STRING,4,2)        -- Two-digit day (e.g., '28')
    END
  ) AS review_date,
  $1:reviewerID::STRING,
  $1:reviewTime::STRING,
  $1:reviewerName::STRING,
  $1:summary::STRING,
  DATE($1:unixReviewTime::int) AS UnixReviewtime
FROM @json_folder;

Explanation: Resolving `reviewTime` Parsing Issues in Snowflake

Problem Recap (from Step 3)

reviewTime values are formatted as "MM DD, YYYY" (e.g., "06 28, 2014" or "06 6, 2014").
When the day is a single digit, a comma follows immediately (e.g., "6,"), causing parsing errors in DATE_FROM_PARTS.

Solution with `CASE`

Use SUBSTRING($1:reviewTime::STRING, 5, 1) to check if the 5th character is a comma:
- If yes -> Indicates a single-digit day (e.g., "6,"), so extract only the 4th character.
- If no -> Indicates a two-digit day (e.g., "28"), so extract the 4th and 5th characters.
This ensures the day portion is always a valid number (e.g., 6 or 28) before building the date.

Why this works

Prevents the error: Numeric value '6,' is not recognized.
Reliably extracts the correct day value regardless of single-digit or double-digit formats.

Significance

Consistently converts reviewTime into a proper Snowflake DATE type.
Handles both single-digit and double-digit day formats reliably.
Ensures uniformity with the already-converted unixReviewTime date field.

Step 5: Create Destination Table

CREATE OR REPLACE TABLE reviews (
  asin STRING,
  helpful STRING,
  overall STRING,
  reviewtext STRING,
  reviewtime DATE,
  reviewerid STRING,
  reviewername STRING,
  summary STRING,
  unixreviewtime DATE
);

Explanation: Creating a Structured Table for Review Data in Snowflake

`CREATE OR REPLACE TABLE`

Creates a new table or replaces an existing one with the same name.
Ensures the script can be re-run without errors.

Columns Defined Here

asin STRING - Product ID.
helpful STRING - Stores helpfulness votes (array, e.g., [0,0], kept as string initially).
overall STRING - Review rating (numeric but stored as string initially).
reviewtext STRING - Full review text.
reviewtime DATE - Converted date from JSON reviewTime.
reviewerid STRING - Reviewer’s unique ID.
reviewername STRING - Reviewer’s name.
summary STRING - Short review title/summary.
unixreviewtime DATE - Converted date from Unix timestamp.

Why this step?

Transitions from raw semi-structured JSON to a structured relational table.
Ensures data types are properly set (e.g., dates as DATE, text as STRING).
Facilitates downstream analytics, reporting, and joins with other tables.

Significance

Establishes the target schema for cleaned review data.
This table serves as the destination for inserting/transforming results from earlier cleansing steps.

Step 6: Load Cleaned Data into Destination Table

COPY INTO reviews
FROM (
  SELECT 
    $1:asin::STRING AS ASIN,
    $1:helpful AS helpful,
    $1:overall AS overall,
    $1:reviewText::STRING AS reviewtext,
    DATE_FROM_PARTS( 
      RIGHT($1:reviewTime::STRING,4),        -- Year
      LEFT($1:reviewTime::STRING,2),         -- Month
      CASE 
        WHEN SUBSTRING($1:reviewTime::STRING,5,1) = ',' 
             THEN SUBSTRING($1:reviewTime::STRING,4,1)  -- Single-digit day
        ELSE SUBSTRING($1:reviewTime::STRING,4,2)       -- Double-digit day
      END
    ) AS reviewtime,
    $1:reviewerID::STRING AS reviewerid,
    $1:reviewerName::STRING AS reviewername,
    $1:summary::STRING AS summary,
    DATE($1:unixReviewTime::int) AS unixreviewtime
  FROM @json_folder
);

Explanation: Loading Data into the `reviews` Table in Snowflake

`COPY INTO reviews`

Loads data into the reviews table created earlier.
Takes the results of the subquery (cleaned JSON projection) and inserts them as rows.

Subquery Transformations

Extracts and casts JSON fields into appropriate Snowflake data types.
Handles reviewTime (string with commas, e.g., "06 28, 2014") using CASE + DATE_FROM_PARTS to ensure correct date parsing.
Converts unixReviewTime (Unix timestamp) into a proper DATE format.

Why this step?

Moves data from semi-structured JSON in the Snowflake stage to a structured relational table.
Ensures the table contains clean, typed, and ready-to-use data.

Significance

Completes the data ingestion pipeline, transforming raw JSON into a fully structured relational form in Snowflake.
Enables queries, aggregations, and analytics without the overhead of JSON parsing.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
AWS_load _handling_json.sql		AWS_load _handling_json.sql
README.md		README.md

debashisdash1999/snowflake_proj15_aws_load_handling_json

Folders and files

Latest commit

History

Repository files navigation

snowflake_proj15_aws_load_handling_json

Project Workflow

Step 1: Introduce Columns from JSON

Explanation of JSON Processing in Snowflake

FROM @json_folder

$1:fieldName

Example:

Why this step?

Significance:

Step 2: Format Columns & Use DATE Function

Explanation: JSON Data Processing and Type Casting in Snowflake

Casting with ::STRING

Keeping Raw Fields (helpful, overall)

Using DATE() Function

Why this step?

Significance:

Step 3: Converting reviewTime String Using DATE_FROM_PARTS

Explanation: Handling reviewTime in JSON Data for Snowflake

Problem with reviewTime

Using DATE_FROM_PARTS()

Why is this still a difficulty?

Why is this step important?

Error Encountered in Processing reviewTime (Step 3)

Error

Why this happens

Key Takeaway

Step 4: Handling Comma Issue in reviewTime with DATE_FROM_PARTS

Explanation: Resolving reviewTime Parsing Issues in Snowflake

Problem Recap (from Step 3)

Solution with CASE

Why this works

Significance

Step 5: Create Destination Table

Explanation: Creating a Structured Table for Review Data in Snowflake

CREATE OR REPLACE TABLE

Columns Defined Here

Why this step?

Significance

Step 6: Load Cleaned Data into Destination Table

Explanation: Loading Data into the reviews Table in Snowflake

COPY INTO reviews

Subquery Transformations

Why this step?

Significance

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Casting with `::STRING`

Keeping Raw Fields (`helpful`, `overall`)

Using `DATE()` Function

Step 3: Converting `reviewTime` String Using `DATE_FROM_PARTS`

Explanation: Handling `reviewTime` in JSON Data for Snowflake

Problem with `reviewTime`

Using `DATE_FROM_PARTS()`

Error Encountered in Processing `reviewTime` (Step 3)

Step 4: Handling Comma Issue in `reviewTime` with `DATE_FROM_PARTS`

Explanation: Resolving `reviewTime` Parsing Issues in Snowflake

Solution with `CASE`

`CREATE OR REPLACE TABLE`

Explanation: Loading Data into the `reviews` Table in Snowflake

`COPY INTO reviews`

Packages