-
Create AWS Account (Free Plan)
- Sign up for an AWS account.
-
Create S3 Bucket
- Go to Services > S3 > Create bucket.
- Bucket Name:
debashisbucket1999. - Click Create bucket.
-
Set Up Folders
- Inside the bucket, create two folders:
csvjson
- Inside the bucket, create two folders:
-
Upload Files
- Upload
netflix_titles.csvto the csv folder. - Upload
musical_instruments.jsonto the json folder.
- Upload
-
Create IAM Role & Policy
- Go to AWS Console > Services > IAM > Roles (under Access Management).
- Create Role → Select AWS Account → Require External ID (Snowflake integration).
- Add permission: Search S3, select AmazonS3FullAccess.
- Role Name:
debashis_snowflake. - Create the role.
-
Create Storage Integration in Snowflake
- Create an integration object
s3_intwith:STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::825442073569:role/debashis_snowflake'STORAGE_ALLOWED_LOCATIONS = ('s3://debashisbucket1999/csv/', 's3://debashisbucket1999/json/')
- Create an integration object
-
Grant Trust Between AWS & Snowflake
- Run
DESC INTEGRATION s3_int;in Snowflake. - Copy
AWS_IAM_USER_ARNandSTORAGE_AWS_EXTERNAL_IDfrom the output. - In AWS Console > IAM > Roles > debashis_snowflake > Trust Relationships:
- Edit trust policy.
- Add
AWS_IAM_USER_ARNandSTORAGE_AWS_EXTERNAL_IDvalues.
- Save and update policy (this step establishes trust from AWS side).
- Run
-
Create Snowflake Objects
- Create target tables.
- Create file format objects.
- Create stage object (attach
s3_intand file format object).
-
Load Data into Snowflake
- Use COPY command to load data from S3 into respective Snowflake tables.
- CSV → Loaded directly into structured table.
- JSON → Loaded into raw table.
-
Extra Step: Handling JSON
- Parsed JSON data using
$1. - Extracted columns, applied data types, and formatted structured tables for analytics.
- Parsed JSON data using
SELECT
$1:asin,
$1:helpful,
$1:overall,
$1:reviewText,
$1:reviewTime,
$1:reviewerID,
$1:reviewTime,
$1:reviewerName,
$1:summary,
$1:unixReviewTime
FROM @json_folder;- Reads data directly from the Snowflake stage (
@json_folder) where the JSON file is stored. - Each row in the file is treated as a JSON object stored in column
$1.
$1= the raw JSON object.:fieldName= extracts the value of a specific key from the JSON.
$1:asin→ Extracts the product ID.$1:reviewText→ Extracts the customer review text.$1:helpful→ Extracts an array showing helpfulness votes.
- Converts semi-structured JSON into structured columns.
- Facilitates exploration and validation of fields within the JSON.
- Enables subsequent transformations and cleansing (e.g., converting dates, flattening arrays).
- Serves as the initial step for processing JSON in Snowflake.
- Provides a clear view of all attributes, aiding in planning data type conversions and further data cleansing.
SELECT
$1:asin::STRING AS ASIN,
$1:helpful AS helpful,
$1:overall AS overall,
$1:reviewText::STRING AS reviewtext,
$1:reviewTime::STRING,
$1:reviewerID::STRING,
$1:reviewTime::STRING,
$1:reviewerName::STRING,
$1:summary::STRING,
DATE($1:unixReviewTime::int) AS Reviewtime
FROM @json_folder;- Converts JSON fields into proper string format for consistency.
- Example: Fields like
asin,reviewText,reviewerName,summaryare cast to strings for easier querying.
helpful→ Retained as an array (e.g.,[0,0]).overall→ Numeric rating (float) that does not require casting yet.
unixReviewTimein JSON is stored as a Unix timestamp (e.g.,1393545600).- Casting with
::intensures it is treated as a number. - Wrapping in
DATE()converts it into a readable date format (e.g.,2014-02-28). - Purpose: Facilitates filtering, aggregating, or partitioning by date.
- Ensures all fields have consistent Snowflake data types.
- Makes text fields queryable (instead of remaining as semi-structured
VARIANT). - Converts raw Unix timestamps into actual date columns for analysis.
- Transforms raw JSON attributes into cleaned and typed columns.
- Essential preparation step before loading into curated tables or using in reporting.
SELECT
$1:asin::STRING AS ASIN,
$1:helpful AS helpful,
$1:overall AS overall,
$1:reviewText::STRING AS reviewtext,
DATE_FROM_PARTS(
RIGHT($1:reviewTime::STRING,4), -- Year (last 4 characters)
LEFT($1:reviewTime::STRING,2), -- Month (first 2 characters)
SUBSTRING($1:reviewTime::STRING,4,2) -- Day (middle part)
) AS review_date,
$1:reviewerID::STRING,
$1:reviewTime::STRING,
$1:reviewerName::STRING,
$1:summary::STRING,
DATE($1:unixReviewTime::int) AS unixReviewtime
FROM @json_folder;- In the JSON file,
reviewTimeis stored as a string (e.g.,"02 28, 2014"). - Snowflake does not automatically recognize this format as a date.
- This function constructs a valid date from Year, Month, and Day components.
- String Manipulation Breakdown:
RIGHT(..., 4)→ Extracts the year (e.g.,2014).LEFT(..., 2)→ Extracts the month (e.g.,02).SUBSTRING(..., 4, 2)→ Extracts the day (e.g.,28).
- Combined, these produce
DATE_FROM_PARTS(2014, 02, 28)→2014-02-28.
- The original string contains a comma (e.g.,
"02 28, 2014"). - Parsing requires careful positioning of substring indexes or removing punctuation using
REPLACE(). - Direct casting to a date does not work; the string must be cleaned or transformed first.
- Highlights the challenge of converting messy string dates into a structured
DATEtype. - Ensures consistency between
reviewTime(string) andunixReviewTime(numeric timestamp). - Prepares data for time-based analysis, such as filtering, grouping, or trending.
- Error Message:
Numeric value '6,' is not recognized - Cause: The
reviewTimestring in the JSON is formatted as"MM DD, YYYY"(e.g.,"06 28, 2014"). The comma after the day gets included in the substring, causing Snowflake to read invalid values like6,instead of28.
DATE_FROM_PARTSrequires clean integers for Year, Month, and Day.- Due to the comma in the original string, substring extraction (e.g.,
SUBSTRING(..., 4, 2)) does not always return valid numbers.
- Direct substring parsing from messy date strings can fail.
- The raw string must be cleaned (e.g., remove the comma) before using
DATE_FROM_PARTS. - A better approach will be demonstrated in the next step.
SELECT
$1:asin::STRING AS ASIN,
$1:helpful AS helpful,
$1:overall AS overall,
$1:reviewText::STRING AS reviewtext,
DATE_FROM_PARTS(
RIGHT($1:reviewTime::STRING,4), -- Year
LEFT($1:reviewTime::STRING,2), -- Month
CASE
WHEN SUBSTRING($1:reviewTime::STRING,5,1) = ','
THEN SUBSTRING($1:reviewTime::STRING,4,1) -- Single-digit day (e.g., '6,')
ELSE SUBSTRING($1:reviewTime::STRING,4,2) -- Two-digit day (e.g., '28')
END
) AS review_date,
$1:reviewerID::STRING,
$1:reviewTime::STRING,
$1:reviewerName::STRING,
$1:summary::STRING,
DATE($1:unixReviewTime::int) AS UnixReviewtime
FROM @json_folder;reviewTimevalues are formatted as"MM DD, YYYY"(e.g.,"06 28, 2014"or"06 6, 2014").- When the day is a single digit, a comma follows immediately (e.g.,
"6,"), causing parsing errors inDATE_FROM_PARTS.
- Use
SUBSTRING($1:reviewTime::STRING, 5, 1)to check if the 5th character is a comma:- If yes -> Indicates a single-digit day (e.g.,
"6,"), so extract only the 4th character. - If no -> Indicates a two-digit day (e.g.,
"28"), so extract the 4th and 5th characters.
- If yes -> Indicates a single-digit day (e.g.,
- This ensures the day portion is always a valid number (e.g.,
6or28) before building the date.
- Prevents the error:
Numeric value '6,' is not recognized. - Reliably extracts the correct day value regardless of single-digit or double-digit formats.
- Consistently converts
reviewTimeinto a proper SnowflakeDATEtype. - Handles both single-digit and double-digit day formats reliably.
- Ensures uniformity with the already-converted
unixReviewTimedate field.
CREATE OR REPLACE TABLE reviews (
asin STRING,
helpful STRING,
overall STRING,
reviewtext STRING,
reviewtime DATE,
reviewerid STRING,
reviewername STRING,
summary STRING,
unixreviewtime DATE
);- Creates a new table or replaces an existing one with the same name.
- Ensures the script can be re-run without errors.
asin STRING- Product ID.helpful STRING- Stores helpfulness votes (array, e.g.,[0,0], kept as string initially).overall STRING- Review rating (numeric but stored as string initially).reviewtext STRING- Full review text.reviewtime DATE- Converted date from JSONreviewTime.reviewerid STRING- Reviewer’s unique ID.reviewername STRING- Reviewer’s name.summary STRING- Short review title/summary.unixreviewtime DATE- Converted date from Unix timestamp.
- Transitions from raw semi-structured JSON to a structured relational table.
- Ensures data types are properly set (e.g., dates as
DATE, text asSTRING). - Facilitates downstream analytics, reporting, and joins with other tables.
- Establishes the target schema for cleaned review data.
- This table serves as the destination for inserting/transforming results from earlier cleansing steps.
COPY INTO reviews
FROM (
SELECT
$1:asin::STRING AS ASIN,
$1:helpful AS helpful,
$1:overall AS overall,
$1:reviewText::STRING AS reviewtext,
DATE_FROM_PARTS(
RIGHT($1:reviewTime::STRING,4), -- Year
LEFT($1:reviewTime::STRING,2), -- Month
CASE
WHEN SUBSTRING($1:reviewTime::STRING,5,1) = ','
THEN SUBSTRING($1:reviewTime::STRING,4,1) -- Single-digit day
ELSE SUBSTRING($1:reviewTime::STRING,4,2) -- Double-digit day
END
) AS reviewtime,
$1:reviewerID::STRING AS reviewerid,
$1:reviewerName::STRING AS reviewername,
$1:summary::STRING AS summary,
DATE($1:unixReviewTime::int) AS unixreviewtime
FROM @json_folder
);- Loads data into the
reviewstable created earlier. - Takes the results of the subquery (cleaned JSON projection) and inserts them as rows.
- Extracts and casts JSON fields into appropriate Snowflake data types.
- Handles
reviewTime(string with commas, e.g.,"06 28, 2014") usingCASE+DATE_FROM_PARTSto ensure correct date parsing. - Converts
unixReviewTime(Unix timestamp) into a properDATEformat.
- Moves data from semi-structured JSON in the Snowflake stage to a structured relational table.
- Ensures the table contains clean, typed, and ready-to-use data.
- Completes the data ingestion pipeline, transforming raw JSON into a fully structured relational form in Snowflake.
- Enables queries, aggregations, and analytics without the overhead of JSON parsing.