Coursework for TIE-22306 Data-Intensive Programming

##by Daniel "3ICE" Berezvai, Arjun Venkat, and Juho Peltonen

###Task 1 & 2: ####What are the top-10 best selling products in terms of total sales? & What are the top-10 browsed products?

flume-config
run-flume.sh
ProductCount.java
- Regex used: https://regex101.com/r/eDt96O/1
- TESTING: Run/Debug configuration → Program arguments: /home/hduser/Desktop/log-data/ out2
- PRODUCTION: Compile and then execute with hadoop jar productcount.jar fi.tut.ProductCount log-data out2
SampleOutput.txt
psql
To answer these questions:
- start some essential daemons: sh start-daemons.sh
- Copy log data to HDFS: ./run-flume.sh
- run hadoop job: TODO how do we do this so result is in HDFS
- run sqoop job: ./sqoop.sh
- top 10 best selling products: psql -f top-10-selling-products.sql
- top 10 browsed products: psql -f top-10-browsed-products.sql
- stop daemons: sh stop-daemons.sh

###Task 3: ####What anomaly is there between these two? The fifth most purchased product is not in the top 10 most browsed (Under Armour Girls' Toddler Spine Surge Runni) and the second most browsed item (adidas Kids' RG III Mid Football Cleat) is not purchased frequently enough to be in the top ten profitable items.

(Initial guess was: Order of magnitude more people are buying a product, than browsing it. How do they buy without browsing? This is not the answer, because the logs are from one day while the purchases are from months of data.)

###Task 4: ####What are the most popular browsing hours?

hour | amount

------+--------

21 | 31307

20 | 30225

22 | 29288

23 | 21371

19 | 19995

18 | 10303

11 | 8589

17 | 6771

12 | 5563

10 | 5008

##Guidelines Since the managers of the company don’t use Hadoop but a RDBMS, all the data must be transferred to PostgreSQL. Therefore, the detailed tasks are:

Transfer Apache logs (with Apache Flume) to the HDFS
Compute the frequencies of viewing of different products using MapReduce (Question 2)
Compute the viewing hour data with MapReduce (Q4)
Transfer the results (with Apache Sqoop) to PostgreSQL
Find answer to the questions in PostgreSQL using SQL (Q1-4)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Coursework for TIE-22306 Data-Intensive Programming

Files

README.md

Latest commit

History

README.md

File metadata and controls

Coursework for TIE-22306 Data-Intensive Programming