Skip to content

End-to-end project integrating AWS S3 with Snowflake. Implemented a secure connection using IAM role and Snowflake storage integration, created stages and file formats, and loaded CSV (Netflix Titles) and JSON (Musical Instruments) data into Snowflake. Processed and transformed JSON into structured tables for analysis.

Notifications You must be signed in to change notification settings

debashisdash1999/snowflake_proj15_aws_load_handling_json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

snowflake_proj15_aws_load_handling_json

Project Workflow

  1. Create AWS Account (Free Plan)

    • Sign up for an AWS account.
  2. Create S3 Bucket

    • Go to Services > S3 > Create bucket.
    • Bucket Name: debashisbucket1999.
    • Click Create bucket.
  3. Set Up Folders

    • Inside the bucket, create two folders:
      • csv
      • json
  4. Upload Files

    • Upload netflix_titles.csv to the csv folder.
    • Upload musical_instruments.json to the json folder.
  5. Create IAM Role & Policy

    • Go to AWS Console > Services > IAM > Roles (under Access Management).
    • Create Role → Select AWS Account → Require External ID (Snowflake integration).
    • Add permission: Search S3, select AmazonS3FullAccess.
    • Role Name: debashis_snowflake.
    • Create the role.
  6. Create Storage Integration in Snowflake

    • Create an integration object s3_int with:
      • STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::825442073569:role/debashis_snowflake'
      • STORAGE_ALLOWED_LOCATIONS = ('s3://debashisbucket1999/csv/', 's3://debashisbucket1999/json/')
  7. Grant Trust Between AWS & Snowflake

    • Run DESC INTEGRATION s3_int; in Snowflake.
    • Copy AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from the output.
    • In AWS Console > IAM > Roles > debashis_snowflake > Trust Relationships:
      • Edit trust policy.
      • Add AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values.
    • Save and update policy (this step establishes trust from AWS side).
  8. Create Snowflake Objects

    • Create target tables.
    • Create file format objects.
    • Create stage object (attach s3_int and file format object).
  9. Load Data into Snowflake

    • Use COPY command to load data from S3 into respective Snowflake tables.
    • CSV → Loaded directly into structured table.
    • JSON → Loaded into raw table.
  10. Extra Step: Handling JSON

    • Parsed JSON data using $1.
    • Extracted columns, applied data types, and formatted structured tables for analytics.

Step 1: Introduce Columns from JSON

SELECT 
  $1:asin,
  $1:helpful,
  $1:overall,
  $1:reviewText,
  $1:reviewTime,
  $1:reviewerID,
  $1:reviewTime,
  $1:reviewerName,
  $1:summary,
  $1:unixReviewTime
FROM @json_folder;

Explanation of JSON Processing in Snowflake

FROM @json_folder

  • Reads data directly from the Snowflake stage (@json_folder) where the JSON file is stored.
  • Each row in the file is treated as a JSON object stored in column $1.

$1:fieldName

  • $1 = the raw JSON object.
  • :fieldName = extracts the value of a specific key from the JSON.

Example:

  • $1:asin → Extracts the product ID.
  • $1:reviewText → Extracts the customer review text.
  • $1:helpful → Extracts an array showing helpfulness votes.

Why this step?

  • Converts semi-structured JSON into structured columns.
  • Facilitates exploration and validation of fields within the JSON.
  • Enables subsequent transformations and cleansing (e.g., converting dates, flattening arrays).

Significance:

  • Serves as the initial step for processing JSON in Snowflake.
  • Provides a clear view of all attributes, aiding in planning data type conversions and further data cleansing.

Step 2: Format Columns & Use DATE Function

SELECT 
  $1:asin::STRING AS ASIN,
  $1:helpful AS helpful,
  $1:overall AS overall,
  $1:reviewText::STRING AS reviewtext,
  $1:reviewTime::STRING,
  $1:reviewerID::STRING,
  $1:reviewTime::STRING,
  $1:reviewerName::STRING,
  $1:summary::STRING,
  DATE($1:unixReviewTime::int) AS Reviewtime
FROM @json_folder;

Explanation: JSON Data Processing and Type Casting in Snowflake

Casting with ::STRING

  • Converts JSON fields into proper string format for consistency.
  • Example: Fields like asin, reviewText, reviewerName, summary are cast to strings for easier querying.

Keeping Raw Fields (helpful, overall)

  • helpful → Retained as an array (e.g., [0,0]).
  • overall → Numeric rating (float) that does not require casting yet.

Using DATE() Function

  • unixReviewTime in JSON is stored as a Unix timestamp (e.g., 1393545600).
  • Casting with ::int ensures it is treated as a number.
  • Wrapping in DATE() converts it into a readable date format (e.g., 2014-02-28).
  • Purpose: Facilitates filtering, aggregating, or partitioning by date.

Why this step?

  • Ensures all fields have consistent Snowflake data types.
  • Makes text fields queryable (instead of remaining as semi-structured VARIANT).
  • Converts raw Unix timestamps into actual date columns for analysis.

Significance:

  • Transforms raw JSON attributes into cleaned and typed columns.
  • Essential preparation step before loading into curated tables or using in reporting.

Step 3: Converting reviewTime String Using DATE_FROM_PARTS

SELECT 
  $1:asin::STRING AS ASIN,
  $1:helpful AS helpful,
  $1:overall AS overall,
  $1:reviewText::STRING AS reviewtext,
  DATE_FROM_PARTS(
      RIGHT($1:reviewTime::STRING,4),             -- Year (last 4 characters)
      LEFT($1:reviewTime::STRING,2),              -- Month (first 2 characters)
      SUBSTRING($1:reviewTime::STRING,4,2)        -- Day (middle part)
  ) AS review_date,
  $1:reviewerID::STRING,
  $1:reviewTime::STRING,
  $1:reviewerName::STRING,
  $1:summary::STRING,
  DATE($1:unixReviewTime::int) AS unixReviewtime
FROM @json_folder;

Explanation: Handling reviewTime in JSON Data for Snowflake

Problem with reviewTime

  • In the JSON file, reviewTime is stored as a string (e.g., "02 28, 2014").
  • Snowflake does not automatically recognize this format as a date.

Using DATE_FROM_PARTS()

  • This function constructs a valid date from Year, Month, and Day components.
  • String Manipulation Breakdown:
    • RIGHT(..., 4) → Extracts the year (e.g., 2014).
    • LEFT(..., 2) → Extracts the month (e.g., 02).
    • SUBSTRING(..., 4, 2) → Extracts the day (e.g., 28).
  • Combined, these produce DATE_FROM_PARTS(2014, 02, 28)2014-02-28.

Why is this still a difficulty?

  • The original string contains a comma (e.g., "02 28, 2014").
  • Parsing requires careful positioning of substring indexes or removing punctuation using REPLACE().
  • Direct casting to a date does not work; the string must be cleaned or transformed first.

Why is this step important?

  • Highlights the challenge of converting messy string dates into a structured DATE type.
  • Ensures consistency between reviewTime (string) and unixReviewTime (numeric timestamp).
  • Prepares data for time-based analysis, such as filtering, grouping, or trending.

Error Encountered in Processing reviewTime (Step 3)

Error

  • Error Message: Numeric value '6,' is not recognized
  • Cause: The reviewTime string in the JSON is formatted as "MM DD, YYYY" (e.g., "06 28, 2014"). The comma after the day gets included in the substring, causing Snowflake to read invalid values like 6, instead of 28.

Why this happens

  • DATE_FROM_PARTS requires clean integers for Year, Month, and Day.
  • Due to the comma in the original string, substring extraction (e.g., SUBSTRING(..., 4, 2)) does not always return valid numbers.

Key Takeaway

  • Direct substring parsing from messy date strings can fail.
  • The raw string must be cleaned (e.g., remove the comma) before using DATE_FROM_PARTS.
  • A better approach will be demonstrated in the next step.

Step 4: Handling Comma Issue in reviewTime with DATE_FROM_PARTS

SELECT 
  $1:asin::STRING AS ASIN,
  $1:helpful AS helpful,
  $1:overall AS overall,
  $1:reviewText::STRING AS reviewtext,
  DATE_FROM_PARTS( 
    RIGHT($1:reviewTime::STRING,4),  -- Year
    LEFT($1:reviewTime::STRING,2),   -- Month
    CASE 
      WHEN SUBSTRING($1:reviewTime::STRING,5,1) = ',' 
           THEN SUBSTRING($1:reviewTime::STRING,4,1)   -- Single-digit day (e.g., '6,')
      ELSE SUBSTRING($1:reviewTime::STRING,4,2)        -- Two-digit day (e.g., '28')
    END
  ) AS review_date,
  $1:reviewerID::STRING,
  $1:reviewTime::STRING,
  $1:reviewerName::STRING,
  $1:summary::STRING,
  DATE($1:unixReviewTime::int) AS UnixReviewtime
FROM @json_folder;

Explanation: Resolving reviewTime Parsing Issues in Snowflake

Problem Recap (from Step 3)

  • reviewTime values are formatted as "MM DD, YYYY" (e.g., "06 28, 2014" or "06 6, 2014").
  • When the day is a single digit, a comma follows immediately (e.g., "6,"), causing parsing errors in DATE_FROM_PARTS.

Solution with CASE

  • Use SUBSTRING($1:reviewTime::STRING, 5, 1) to check if the 5th character is a comma:
    • If yes -> Indicates a single-digit day (e.g., "6,"), so extract only the 4th character.
    • If no -> Indicates a two-digit day (e.g., "28"), so extract the 4th and 5th characters.
  • This ensures the day portion is always a valid number (e.g., 6 or 28) before building the date.

Why this works

  • Prevents the error: Numeric value '6,' is not recognized.
  • Reliably extracts the correct day value regardless of single-digit or double-digit formats.

Significance

  • Consistently converts reviewTime into a proper Snowflake DATE type.
  • Handles both single-digit and double-digit day formats reliably.
  • Ensures uniformity with the already-converted unixReviewTime date field.

Step 5: Create Destination Table

CREATE OR REPLACE TABLE reviews (
  asin STRING,
  helpful STRING,
  overall STRING,
  reviewtext STRING,
  reviewtime DATE,
  reviewerid STRING,
  reviewername STRING,
  summary STRING,
  unixreviewtime DATE
);

Explanation: Creating a Structured Table for Review Data in Snowflake

CREATE OR REPLACE TABLE

  • Creates a new table or replaces an existing one with the same name.
  • Ensures the script can be re-run without errors.

Columns Defined Here

  • asin STRING - Product ID.
  • helpful STRING - Stores helpfulness votes (array, e.g., [0,0], kept as string initially).
  • overall STRING - Review rating (numeric but stored as string initially).
  • reviewtext STRING - Full review text.
  • reviewtime DATE - Converted date from JSON reviewTime.
  • reviewerid STRING - Reviewer’s unique ID.
  • reviewername STRING - Reviewer’s name.
  • summary STRING - Short review title/summary.
  • unixreviewtime DATE - Converted date from Unix timestamp.

Why this step?

  • Transitions from raw semi-structured JSON to a structured relational table.
  • Ensures data types are properly set (e.g., dates as DATE, text as STRING).
  • Facilitates downstream analytics, reporting, and joins with other tables.

Significance

  • Establishes the target schema for cleaned review data.
  • This table serves as the destination for inserting/transforming results from earlier cleansing steps.

Step 6: Load Cleaned Data into Destination Table

COPY INTO reviews
FROM (
  SELECT 
    $1:asin::STRING AS ASIN,
    $1:helpful AS helpful,
    $1:overall AS overall,
    $1:reviewText::STRING AS reviewtext,
    DATE_FROM_PARTS( 
      RIGHT($1:reviewTime::STRING,4),        -- Year
      LEFT($1:reviewTime::STRING,2),         -- Month
      CASE 
        WHEN SUBSTRING($1:reviewTime::STRING,5,1) = ',' 
             THEN SUBSTRING($1:reviewTime::STRING,4,1)  -- Single-digit day
        ELSE SUBSTRING($1:reviewTime::STRING,4,2)       -- Double-digit day
      END
    ) AS reviewtime,
    $1:reviewerID::STRING AS reviewerid,
    $1:reviewerName::STRING AS reviewername,
    $1:summary::STRING AS summary,
    DATE($1:unixReviewTime::int) AS unixreviewtime
  FROM @json_folder
);

Explanation: Loading Data into the reviews Table in Snowflake

COPY INTO reviews

  • Loads data into the reviews table created earlier.
  • Takes the results of the subquery (cleaned JSON projection) and inserts them as rows.

Subquery Transformations

  • Extracts and casts JSON fields into appropriate Snowflake data types.
  • Handles reviewTime (string with commas, e.g., "06 28, 2014") using CASE + DATE_FROM_PARTS to ensure correct date parsing.
  • Converts unixReviewTime (Unix timestamp) into a proper DATE format.

Why this step?

  • Moves data from semi-structured JSON in the Snowflake stage to a structured relational table.
  • Ensures the table contains clean, typed, and ready-to-use data.

Significance

  • Completes the data ingestion pipeline, transforming raw JSON into a fully structured relational form in Snowflake.
  • Enables queries, aggregations, and analytics without the overhead of JSON parsing.

About

End-to-end project integrating AWS S3 with Snowflake. Implemented a secure connection using IAM role and Snowflake storage integration, created stages and file formats, and loaded CSV (Netflix Titles) and JSON (Musical Instruments) data into Snowflake. Processed and transformed JSON into structured tables for analysis.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published