
Final Project: Transforming and Analyzing Data with SQL

Project/Goals

The main purpose of this project is to practice cleaning, transforming, and analyzing data with SQL.


Process

This project followed these steps:


  1. Data extraction, importing, and loading

At this stage, I took the time to familiarize myself with the raw data before importing it into my workbench. During this phase, I faced the challenge of importing the dataset in a readable file format: I had to convert the .csv file to CSV UTF-8 to make it work. It is also important to note that before importing the dataset into my tables, which were created one by one with SQL queries in the pgAdmin GUI, I did some cleaning of the data in the spreadsheet by renaming the columns and fixing the data types (more about this is mentioned in the 'cleaning_data' file);
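
For illustration, here is a minimal sketch of how one of those tables could be created and loaded in PostgreSQL. The table and column names are assumptions chosen to match the data described in this README, not the exact project schema; everything is imported as TEXT and the types are fixed during cleaning:

```sql
-- Hypothetical table for the sessions data (the real dataset has many more columns).
CREATE TABLE all_sessions (
    full_visitor_id           TEXT,
    visit_id                  TEXT,
    country                   TEXT,
    city                      TEXT,
    total_transaction_revenue TEXT
);

-- Load the comma-delimited CSV UTF-8 file. In psql, \copy reads the file
-- client-side; pgAdmin's Import/Export dialog does the equivalent.
\copy all_sessions FROM 'all_sessions.csv' WITH (FORMAT csv, HEADER true);
```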


  2. Data cleaning and transformation

As mentioned above, data cleaning started as soon as I first opened the raw data. Looking at the raw data in the spreadsheets gave me the chance to see how it was organized, formatted, and structured. I checked whether the columns and rows were consistently organized (e.g. the number of columns and their data types, which columns were filled with values, which had missing or NULL values, and which were duplicated). I also removed double and single quotes and renamed columns with underscores for better readability.
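
Some of these steps were done in the spreadsheet, but they all have direct SQL equivalents. A sketch, again with assumed table and column names:

```sql
-- Rename a column that survived import in camelCase to snake_case.
ALTER TABLE all_sessions
    RENAME COLUMN "pageTitle" TO page_title;

-- Strip stray double and single quotes from a text column.
UPDATE all_sessions
SET city = REPLACE(REPLACE(city, '"', ''), '''', '');

-- Fix a data type that was imported as TEXT, treating empty strings as NULL.
ALTER TABLE all_sessions
    ALTER COLUMN total_transaction_revenue TYPE NUMERIC
    USING NULLIF(total_transaction_revenue, '')::NUMERIC;
```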


  3. Data analysis from different questions and/or perspectives

At this point, I started to explore the data using the questions contained in the assignment plus the ones I found interesting to ask. For example, since this is an e-commerce company, it would be interesting to see whether there is a relationship between the following (a sample query for the first of these follows the list):

  • the transaction revenue generated by countries and cities considered economically developed;
  • the customer and the product bought;
  • the customer sentiment shared on the company website and the product bought;
  • the customer profile and their social lifestyle; and
  • some variables spread across different tables, such as the unit cost of an item still in production versus the product already created to be sold. Most of these were addressed in this project.
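
As a concrete sample of the first question, a query along these lines (same assumed names as above) ranks locations by the revenue they generated:

```sql
SELECT country,
       city,
       SUM(total_transaction_revenue) AS total_revenue
FROM   all_sessions
WHERE  total_transaction_revenue IS NOT NULL
GROUP  BY country, city
ORDER  BY total_revenue DESC
LIMIT  10;
```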

  4. Develop and implement a QA process to validate transformed data against raw data

As we know, just glancing at the results of my queries is not enough to confirm that the data is ready to be analyzed. So my QA (Quality Assurance) strategies were applied not only at the end of this project, but throughout my analysis (step 3), when I ran different queries to check whether the outputs of the tables and columns matched each other. Overall, my QA involved the following (two sample checks follow the list):

  • checking the number of columns and their names after running a query, besides the output itself,
  • performing some statistical analysis,
  • writing descriptive comments before running queries,
  • making the queries readable (uppercased keywords, aliases for columns, whitespace, indentation, and running queries in separate blocks).
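
Two sample checks of this kind, assuming the raw import was kept in a separate table named all_sessions_raw:

```sql
-- 1. Row counts should match between the raw import and the cleaned table.
SELECT (SELECT COUNT(*) FROM all_sessions_raw) AS raw_rows,
       (SELECT COUNT(*) FROM all_sessions)     AS clean_rows;

-- 2. Duplicated visit identifiers should not survive the cleaning.
SELECT visit_id, COUNT(*) AS occurrences
FROM   all_sessions
GROUP  BY visit_id
HAVING COUNT(*) > 1;
```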

Results

At first, the data looked unstructured, unformatted, and did not tell anything at all. It looked as if it had been randomly combined and joined into tables without consistency. This changed a little once I started cleaning it up: from having no apparent semantic content at all, it began to show traits of relationships between users and countries, the products bought, the transaction revenue generated, the sentiment scores, and the users' social groups. Even though I did not have the time to go deeper into all the questions addressed here, overall I can highlight that:

  • the variables unit_cost (the cost of a part or component of a product) and product_cost (the final product) are interrelated: even though I am not completely sure what each means, because they were not described in the database, the average price of the product (USD 28.81) could be almost 10 times the average unit cost (USD 3.11); a sketch of this comparison follows this list;

  • the customer with the visit_id value '1497154760' spent USD 298 and USD 109.99 on 2 products sold by the e-commerce company. This was the highest purchase in the whole company. From the dataset, it was possible to infer that this customer (i) came through the organic search channel, (ii) is not socially engaged, and (iii) has the full_visitor_id value of 5.07878E+16;

  • although the product Women's V-Neck Tee Charcoal has the same sentiment score as other products sold by the company, it is possible to infer, since it is one of the products with the highest number of orders, that it is more appreciated or liked by the customers;

  • the top 10 countries with the highest total transaction revenue generated by the company's orders were: United States, Israel, Australia, Canada, Switzerland, Uganda, Montenegro, Venezuela, Cambodia, and Sweden. As for cities, these had the highest total transaction revenue: Atlanta, Sunnyvale, Tel Aviv-Yafo, Los Angeles, Sydney, Seattle, Chicago, Palo Alto, and San Francisco;

  • most of the products ordered by customers were related to the home office category in first place and the reusable bags category in second;

  • the USA (USD 13,154,170,000), Israel, Australia, Canada, and Switzerland were the top five countries with the highest total revenue.
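
For the first bullet, the comparison comes down to two averages. A sketch of that query, assuming a products table with unit_cost and product_cost columns (the exact names were not documented in the database):

```sql
SELECT ROUND(AVG(unit_cost), 2)    AS avg_unit_cost,
       ROUND(AVG(product_cost), 2) AS avg_product_cost
FROM   products;
```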

Those were some of the results found during the analysis of the dataset.


Challenges

One of the first challenges I encountered in this assignment was related to the .csv file that had to be imported into my workbench. After a while, the solution was to convert it to a CSV UTF-8 file so it would import properly, because that format is delimited by commas.

The second challenge was the short time available for the analysis of the database and the QA: I spent most of my time trying to understand the type of the demo dataset file before importing the spreadsheets and, after that, on the cleaning.

The third challenge was trying to understand the value and importance behind NULL or empty values. I did drop and alter tables and columns, but it is important to keep in mind that this process can impact the purpose of the company's analysis. Also, some NULL values were hidden and I could only see them by joining tables.
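
One way those hidden NULLs surface is through an outer join: rows on one side with no match on the other come back with NULL in every joined column. A sketch with assumed table and key names:

```sql
-- Sessions that have no matching analytics row; for these, even the
-- joined key a.visit_id comes back NULL.
SELECT s.full_visitor_id,
       s.visit_id
FROM   all_sessions s
LEFT JOIN analytics a ON a.visit_id = s.visit_id
WHERE  a.visit_id IS NULL;
```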

The fourth challenge is related to the cleaning. Just when I thought I was done with the cleaning, I noticed that the cleaning almost never ends! So patience is a virtue here, because changes might never be good enough. Similarly, it is hard to find quality data, so the approach must stay open to changes and iterations along all the steps of data cleaning and transformation. I also noticed there is some hierarchy in the data to be understood, but I could not find the time to explore it yet.


Future Goals

If I had more time, first of all I would have used more functions to clean up the data, such as TRIM, LPAD/RPAD, ROUND, among others. I would also have approached the data with less time pressure, and I would certainly have come up with more questions and insights to address and uncover in the story of this dataset.
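
For reference, a quick sketch of those functions on the hypothetical columns used above (in PostgreSQL the padding functions are LPAD and RPAD):

```sql
SELECT TRIM(BOTH ' ' FROM country)         AS trimmed_country, -- strip stray spaces
       LPAD(full_visitor_id, 20, '0')      AS padded_id,       -- left-pad to a fixed width
       ROUND(total_transaction_revenue, 2) AS rounded_revenue  -- round money values
FROM   all_sessions;
```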
