The main purpose of this project is to practice cleaning, transforming and analysing data with SQL.
The project followed these steps:
- Data importing and extraction/loading
At this stage, I took the time to familiarize myself with the raw data before importing it into my workbench. The main challenge here was importing the dataset in a readable format, so I had to convert the .csv file to the CSV UTF-8 format to make it work. It is also important to note that, before importing the dataset into my tables, which were created one by one with SQL queries in the pgAdmin GUI, I did some cleaning in the spreadsheet by renaming the columns and adjusting the data types (more about this is mentioned in the 'cleaning_data' file); a sketch of the table creation and import is shown below.
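A minimal sketch of that setup, assuming a table named `all_sessions` with only a handful of the actual columns (the real schema followed the renamed spreadsheet headers):

```sql
-- Hypothetical table for one of the spreadsheets; column names are illustrative.
CREATE TABLE all_sessions (
    full_visitor_id            TEXT,
    country                    VARCHAR(100),
    city                       VARCHAR(100),
    total_transaction_revenue  NUMERIC,
    product_sku                VARCHAR(50)
);

-- Load the comma-delimited, UTF-8 encoded CSV. In psql this is \copy;
-- pgAdmin's Import/Export dialog runs an equivalent COPY behind the scenes.
\copy all_sessions FROM 'all_sessions.csv' WITH (FORMAT csv, HEADER true);
```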
- Data cleaning and transformation
As mentioned above, the data cleaning started as soon as I first got into the raw data. I looked at the raw data in the spreadsheets to see how it was organized, formatted and structured: I checked whether the columns and rows were consistently organized (e.g. the number of columns and their data types, which columns were filled with values, which ones had missing or NULL values, and which were duplicated). I also removed double and single quotes and renamed columns with underscores for better readability; the SQL equivalent of these steps is sketched below.
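A rough sketch of those checks expressed as SQL (I actually did the renaming and quote removal in the spreadsheet before import, and the table and column names here are assumptions):

```sql
-- Rename a camelCase column to snake_case for readability (assumed column name).
ALTER TABLE all_sessions
    RENAME COLUMN "channelGrouping" TO channel_grouping;

-- Strip stray double and single quotes from a text column.
UPDATE all_sessions
SET    city = REPLACE(REPLACE(city, '"', ''), '''', '');

-- Count the rows missing a value in a given column.
SELECT COUNT(*) AS missing_city
FROM   all_sessions
WHERE  city IS NULL OR city = '';

-- Flag values that appear more than once on a candidate key.
SELECT full_visitor_id, COUNT(*) AS occurrences
FROM   all_sessions
GROUP  BY full_visitor_id
HAVING COUNT(*) > 1;
```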
- Data analysis from different questions and/or perspectives
At this point, I started to explore the data using the questions contained in the assignment, plus others I found interesting to ask. For example, as this is an ecommerce company, it would be interesting to see if there is a relationship between:
- the transaction revenue generated by countries and cities considered economically developed;
- the customer and the product bought;
- the customer sentiment shared on the company website and the product bought;
- the customer profile and their social lifestyle; and
- some variables spread across different tables, such as the unit cost of an item still in manufacturing and the final product created to be sold.
Most of these questions were addressed in this project; one of them is sketched in SQL right after this list.
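For instance, the sentiment-versus-orders question could be phrased roughly like this; the table names (`products`, `sales_report`) and column names are assumptions about the schema, not a description of the exact one used:

```sql
-- Hypothetical join between a products table (with a sentiment score) and a
-- sales table (with order counts), to relate sentiment to how often a product sells.
SELECT p.product_name,
       ROUND(AVG(p.sentiment_score)::numeric, 2) AS avg_sentiment,
       SUM(s.total_ordered)                      AS total_ordered
FROM   sales_report s
JOIN   products p ON p.product_sku = s.product_sku
GROUP  BY p.product_name
ORDER  BY total_ordered DESC;
```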
- Develop and implement a QA process to validate transformed data against raw data
As we know, just looking at the results of my queries is not enough to confirm that the data is correct and ready to be analyzed. So my QA (Quality Assurance) strategies were applied not only at the end of this project, but throughout my analysis (step 3), when I performed different queries to check whether the outputs of the tables and columns matched each other. Overall, my QA involved:
- checking the number and names of the columns after running a query, in addition to their output,
- performing some statistical analysis,
- writing descriptive comments before running the queries,
- making the queries readable (uppercased keywords, aliases for columns, whitespace, indentation, and running queries in separate steps or blocks).
Some of these checks are sketched below.
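A few illustrative QA checks, assuming table names such as `all_sessions` and `all_sessions_raw` (the raw-import name is hypothetical):

```sql
-- Confirm the column names and data types of a transformed table.
SELECT column_name, data_type
FROM   information_schema.columns
WHERE  table_name = 'all_sessions';

-- Compare the row count of the cleaned table against the raw import.
SELECT (SELECT COUNT(*) FROM all_sessions_raw) AS raw_rows,
       (SELECT COUNT(*) FROM all_sessions)     AS clean_rows;

-- A quick statistical sanity check on a numeric column.
SELECT MIN(total_transaction_revenue)          AS min_revenue,
       MAX(total_transaction_revenue)          AS max_revenue,
       ROUND(AVG(total_transaction_revenue), 2) AS avg_revenue
FROM   all_sessions;
```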
At first, the data looked unstructured, unformatted and not telling anything at all. It looked as if it had been randomly combined and joined into tables without consistency. However, this changed a little once I started to clean it up: from having barely any semantic content, it began to show traits of the relationships between users and countries, the products bought, the transaction revenue generated, the sentiment score and the social group of the customers. Even though I did not have time to go deeper into all the questions raised here, overall I can highlight that:
- the variables `unit_cost` (a part or component of a product) and `product-cost` (the final product) are interrelated: even though I am not completely sure what each one means, because they were not described in the database, the average unit cost (USD 3.11) is roughly one tenth of the final product price (USD 28.81);
- one customer, with the `visit_id` value of '1497154760', spent USD 298 and USD 109.99 on 2 products sold by the ecommerce company. This customer made the highest purchase in the whole company. From the dataset, it was possible to compile that this customer is (i) channeled in an organic search group, (ii) not socially engaged, and (iii) has the `full_visitor_id` value of 5.07878E+16;
- although the product named "Women's V-Neck Tee Charcoal" has the same sentiment score as other products sold by the company, it is possible to infer, since it is one of the products with a higher number of orders, that it is more appreciated or liked by the customers;
- the top 10 countries with the highest total transaction revenue generated by orders from the company were: United States, Israel, Australia, Canada, Switzerland, Uganda, Montenegro, Venezuela, Cambodia and Sweden (a sketch of this ranking query follows the list). Regarding the cities, the following were among the top ones with the highest total transaction revenue: Atlanta, Sunnyvale, Tel Aviv-Yafo, Los Angeles, Sydney, Seattle, Chicago, Palo Alto and San Francisco;
- most of the products ordered by the customers were related to the home office category in the first place, and to the reusable bags category in the second;
- the USA (USD 13,154,170,000), Israel, Australia, Canada and Switzerland were the top five countries with the highest total revenue.
These were some of the results found during the analysis of the dataset.
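As an illustration, the country ranking mentioned above could come from a query shaped like this (the table and column names are assumptions based on how the results are described, not the exact schema):

```sql
-- Rank countries by total transaction revenue, ignoring sessions with no revenue.
SELECT country,
       SUM(total_transaction_revenue) AS total_revenue
FROM   all_sessions
WHERE  total_transaction_revenue IS NOT NULL
GROUP  BY country
ORDER  BY total_revenue DESC
LIMIT  10;
```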
One of the first challenges I encountered in this assignment was related to the .csv file that had to be imported into my workbench. After a while, the solution was to convert it to a CSV UTF-8 file so it could be properly imported, since that format is comma-delimited and UTF-8 encoded.
The second challenge was the short time available to analyze the database and do the QA: I spent most of my time trying to understand the type of the demo dataset file before importing the spreadsheets and, after that, on the cleaning.
The third challenge was trying to understand the value and importance behind the NULL or empty values. I did drop and alter tables and columns, but it is important to keep in mind that this process can impact the purpose of the company's analysis. Also, some NULL values were hidden and I could only see them through joining tables, as illustrated below.
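For example, a LEFT JOIN is one way such hidden NULLs surface; this sketch uses assumed table and column names:

```sql
-- Products with no matching row in the sales table come back with a NULL order
-- total, even though neither table shows a NULL on its own.
SELECT p.product_sku,
       p.product_name,
       s.total_ordered
FROM   products p
LEFT JOIN sales_report s ON s.product_sku = p.product_sku
WHERE  s.total_ordered IS NULL;
```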
The fourth challenge was related to the cleaning itself. When I thought I was done with the cleaning, I noticed that the cleaning almost never ends! So patience is a virtue here, because the changes might never feel good enough. Similarly, it is hard to find quality data, so the approach must remain open to changes and iterations throughout all the steps of data cleaning and transformation. I also noticed there is some hierarchy in the data to be understood, but I could not find the time to explore it yet.
If I had more time, first of all I would have used more functions to clean up the data, such as `TRIM`, `LPAD`/`RPAD` and `ROUND`, among others (small examples below). Also, I would have approached the data with less time pressure, and I would certainly have come up with more questions and insights to address and uncover in the story of this dataset.
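As a small, hypothetical illustration of those functions on assumed column names:

```sql
-- Trim, pad and round in one pass over the sessions table.
SELECT TRIM(BOTH ' ' FROM city)            AS trimmed_city,     -- drop surrounding spaces
       RPAD(product_sku, 14, '0')          AS padded_sku,       -- pad SKUs to a fixed width
       ROUND(total_transaction_revenue, 2) AS revenue_rounded   -- round revenue to 2 decimals
FROM   all_sessions;
```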