Class: dsmft-paris-14
Final exam date: January 19th, 2022
Explore the 6 projects covering the 6 course blocks, plus the MySQL project »
Table of projects
- Project 1 # Kayak
- Project 2 # Speed Dating
- Project 3 # Uber Pickups
- Project 4 # Disaster Tweet
- Project 5 # Wine-O-Meter
- Project 6 # Final project: Start-up Prediction
Project 🚧
The marketing team needs help on a new project. After doing some user research, the team discovered that 70% of their users who are planning a trip would like to have more information about the destination they are going to.
In addition, user research shows that people tend to be distrustful of the information they read when they don't know the brand that produced the content.
Therefore, Kayak Marketing Team would like to create an application that will recommend where people should plan their next holidays. The application should be based on real data about:
- Weather.
- Hotels in the area.
The application should then be able to recommend the best destinations and hotels based on the above variables at any given time.
Goals 🎯
As the project has just started, your team doesn't have any data that can be used to create this application. Therefore, your job will be to:
- Scrape data about destinations.
- Get weather data for each destination.
- Get hotel info for each destination.
- Store all the information above in a data lake.
- Extract, transform and load the cleaned data from your data lake to a data warehouse.
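The transform step of the ETL pipeline above can be sketched in miniature. The field names (`city`, `temp_c`, `hotel`, `rating`) are illustrative placeholders, not the project's actual schema:

```python
import json

# Hypothetical raw records as they might sit in the data lake
# (field names are made up for illustration).
raw_records = [
    {"city": "Paris", "temp_c": "12.5", "hotel": "Hotel du Nord", "rating": "8.1"},
    {"city": "Lyon",  "temp_c": None,   "hotel": "Le Central",    "rating": "7.4"},
    {"city": "Paris", "temp_c": "12.5", "hotel": "Hotel du Nord", "rating": "8.1"},  # duplicate
]

def transform(records):
    """Clean raw records: drop rows with missing values, deduplicate, cast types."""
    seen, cleaned = set(), []
    for r in records:
        if any(v is None for v in r.values()):
            continue  # drop incomplete rows
        key = json.dumps(r, sort_keys=True)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append({"city": r["city"], "temp_c": float(r["temp_c"]),
                        "hotel": r["hotel"], "rating": float(r["rating"])})
    return cleaned

rows = transform(raw_records)
```

In the real project the cleaned rows would then be loaded into the data warehouse instead of kept in memory.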
Challenge description
We will start a new data visualization and exploration project. Your goal will be to try to understand love! It's a very complicated subject, so we've simplified it: your goal is to understand what happens during a speed dating event, and especially what influences whether participants get a second date.
This is a Kaggle competition; you can find more details here:
Take some time to read the description of the challenge and try to understand each of the variables in the dataset. The document Speed Dating - Variable Description.md will help you.
Rendering
To be successful in this project, you will need to produce a descriptive analysis of the main factors that influence getting a second date.
Over the next few days, you'll learn how to use python libraries like seaborn, plotly and bokeh to produce data visualizations that highlight relevant facts about the dataset.
For today, you can start exploring the dataset with pandas to extract some statistics.
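A first pandas exploration might look like the sketch below. The column names (`same_interests`, `match`) are made up for illustration and are not the challenge's exact schema:

```python
import pandas as pd

# Toy rows mimicking the speed-dating data (illustrative columns only):
# did the pair share interests, and did the date lead to a match?
df = pd.DataFrame({
    "same_interests": [1, 1, 0, 0, 1, 0],
    "match":          [1, 0, 0, 0, 1, 1],
})

# Share of dates leading to a match, split by shared interests
rate = df.groupby("same_interests")["match"].mean()
```

This kind of grouped statistic is a good starting point before moving on to seaborn, plotly, or bokeh visualizations.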
Part 1: Exploratory Data Analysis
A demo of the project can be seen at the following link: https://share.streamlit.io/huynam1012/uber_deploy-/main
Company's Description 📇
Uber is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Since then, Uber has expanded its activities to food delivery with Uber Eats, package delivery, freight transportation and even urban mobility with Jump Bike and Lime, which the company funded.
The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮
Project 🚧
One of the main pain points Uber's team found is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.
(If you are not familiar with the bay area, check out Google Maps)
Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users accept waiting 5-7 minutes; beyond that, they cancel their ride.
Therefore, Uber's data team would like to work on a project where their app would recommend hot zones in major cities to be in at any given time of day.
Goals 🎯
Uber already has data about pickups in major cities. Your objective is to create algorithms that determine the hot zones drivers should head to. Therefore you will:
- Create an algorithm to find hot zones.
- Visualize results on a nice dashboard.
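One minimal way to find hot zones — counting pickups in a lat/lon grid — can be sketched as follows. The coordinates are made up, and a real solution would more likely use a clustering algorithm such as KMeans or DBSCAN on Uber's pickup data:

```python
import math
from collections import Counter

# Toy pickup coordinates (lat, lon); the real project loads Uber's CSVs.
pickups = [(37.7951, -122.4004), (37.7942, -122.4013), (37.7963, -122.4028),
           (37.7612, -122.4355)]

def hot_zones(points, cell=0.01, top=1):
    """Bucket pickups into a grid of `cell`-degree squares and return
    the busiest cells with their pickup counts."""
    counts = Counter((math.floor(lat / cell), math.floor(lon / cell))
                     for lat, lon in points)
    return counts.most_common(top)

top_cells = hot_zones(pickups)
```

A grid count like this makes a reasonable baseline before trying density-based clustering, which handles zone shapes and sizes more gracefully.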
Project description: https://www.kaggle.com/c/nlp-getting-started
Machine Learning in production
Wine-o-meter is a future unicorn application. It allows wine producers to predict the quality score of their wine based on physicochemical inputs. The startup behind this innovation is convinced of its ability to disrupt the world of wine. 🍷
Project 🚧
The data science team has worked together on creating the best model for predicting the quality score (from 0 to 10) of multiple wines. The next step is to integrate this model into the mobile application. The development team expects an API endpoint it can call to query the model and display the result inside the application.
Your job is to put the trained model into production. Luckily, the team provided you with their work:
- model.joblib, the trained model serialized with the joblib library,
- the notebook used to train the model (Model_Training.ipynb) and the dataset (winequality.csv), so you can have a look,
- a test notebook (Test_Endpoint.ipynb) so you can check that your endpoint meets the requirements.
Goals 🎯
Your mission is to put this model into production by building an API. To succeed you need to:
- provide a /predict endpoint,
- provide short documentation for the developer team at the index of your website.
The details are described in the following link: https://wine-hnt.herokuapp.com
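A minimal sketch of such an API, assuming Flask (the project does not mandate a specific framework). A stub stands in for `model.joblib`, which is not available here:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real app you would load the provided model once at startup:
#   import joblib; model = joblib.load("model.joblib")
# A stub stands in so this sketch runs without the file.
class StubModel:
    def predict(self, rows):
        return [5 for _ in rows]  # constant "quality" score

model = StubModel()

@app.route("/")
def index():
    # Minimal documentation for the developer team, as the goals require
    return "POST a JSON payload {'input': [[...features...]]} to /predict"

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["input"]
    return jsonify({"prediction": list(model.predict(features))})
```

The Test_Endpoint.ipynb notebook would then POST wine features to the deployed URL and check the JSON response shape.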
Part 1: Exploratory Data Analysis and Machine Learning
Part 4: Scrape Companies Information
Part 5: Demo Dashboard by Power BI
This is the final project as a Jedha student. The goal of the project is to develop an application that predicts the success of start-ups based on a few characteristics. Authors: Huy-Nam Tran, Matthieu Verrecchia, Said Soufyan and Amir Benseddik.
The technical expectations are described at the following link: https://predictstartup.herokuapp.com
The data can be found on Kaggle:
Company's Description 📇
Walmart Inc. is an American multinational retail corporation, headquartered in Bentonville, Arkansas, that operates a chain of hypermarkets, discount department stores, and grocery stores across the United States. The company was founded by Sam Walton in 1962.
Project 🚧
Walmart's marketing service has asked you to build a machine learning model able to estimate the weekly sales in their stores with the best possible precision. Such a model would help them better understand how sales are influenced by economic indicators, and might be used to plan future marketing campaigns.
Goals 🎯
The project can be divided into three steps:
- Part 1: perform an EDA and all the preprocessing necessary to prepare the data for machine learning
- Part 2: train a linear regression model (baseline)
- Part 3: avoid overfitting by training a regularized regression model
Scope of this project 🖼️
For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales.
Deliverable 📬
To complete this project, your team should:
- Create some visualizations
- Train at least one linear regression model that predicts the amount of weekly sales as a function of the other variables
- Assess the model's performance using a metric relevant for regression problems
- Interpret the model's coefficients to identify which features matter most for the prediction
- Train at least one regularized model (Lasso or Ridge) to reduce overfitting
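For the project you would use scikit-learn's `Lasso` or `Ridge`. To see why the penalty reduces overfitting, here is a NumPy sketch of ridge regression's closed-form solution on synthetic data:

```python
import numpy as np

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def ridge_fit(X, y, alpha):
    """Closed-form ridge: w = (X'X + alpha*I)^-1 X'y.
    alpha=0 recovers ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols   = ridge_fit(X, y, alpha=0.0)    # plain least squares
w_ridge = ridge_fit(X, y, alpha=10.0)   # penalized -> smaller coefficients
```

The penalty shrinks the coefficient vector toward zero, trading a little bias for lower variance, which is exactly what limits overfitting on the Walmart features.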
Challenge : predict conversions 🏆🏆
In this project, you will take part in a machine learning competition like those organized on https://www.kaggle.com/. You will work with Jupyter notebooks as usual, but in the end you'll submit your model's predictions to your teacher/TA, so your model's performance will be evaluated independently. The scores achieved by the different teams will be stored on a leaderboard 🏅🏅
Description of a machine learning challenge 🚴🚴
In machine learning challenges, the dataset is always split into two files:
- data_train.csv contains labelled data: both X (the explanatory variables) and Y (the target to be predicted). You will use this file to train your model as usual: train/test split, preprocessing, performance assessment, trying different models, fine-tuning hyperparameters, etc.
- data_test.csv contains "new" examples that have not been used to train the model, in the same format as data_train.csv but unlabelled: the target Y has been removed from the file. Once you've trained a model, you will use data_test.csv to make predictions that you will send to the organizing team. They can then assess your model's performance independently, which prevents cheating 🤸
Your model's predictions are compared to the true labels, and a leaderboard is published with the scores of all the competing teams.
All participants are told which metric will be used to compute the scores. Make sure you use the same metric to evaluate your train/test performance!
Company's Description 📇
www.datascienceweekly.org is a well-known newsletter curated by independent data scientists. Anyone can register their e-mail address on the website to receive weekly news about data science and its applications!
Project 🚧
The data scientists who created the newsletter would like to better understand the behaviour of the users visiting their website. They would like to know whether it's possible to build a model that predicts if a given user will subscribe to the newsletter, using just a few pieces of information about that user. They would also like to analyze the model's parameters to highlight the features that best explain user behaviour, and maybe discover a new lever for action to improve the newsletter's conversion rate.
They designed a competition aiming at building a model that predicts conversions (i.e. whether a user will subscribe to the newsletter). To do so, they open-sourced a dataset containing some data about the traffic on their website. To rank the competing teams, they decided to use the F1 score.
Goals 🎯
The project can be divided into four steps:
- Part 1: perform an EDA and the preprocessing, then train a baseline model with the file data_train.csv
- Part 2: improve your model's F1 score on your test set (you can try feature engineering, feature selection, regularization, non-linear models, hyperparameter optimization by grid search, etc.)
- Part 3: once you're satisfied with your model's score, use it to make predictions with the file data_test.csv. You will have to dump the predictions into a .csv file that will be sent to Kaggle (actually, to your teacher/TA 🤓). You can make as many submissions as you want, so feel free to try different models!
- Part 4: take some time to analyze your best model's parameters. Are there any levers for action that would help improve the newsletter's conversion rate? What recommendations would you make to the team?
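Dumping predictions for submission (Part 3) can be as simple as the sketch below. The column name `converted` is an assumption for illustration, since the exact submission format is set by the teacher/TA:

```python
import csv
import io

# Hypothetical predictions for the unlabeled data_test.csv rows
predictions = [0, 1, 1, 0]

# Build the submission in memory; in practice you would write to a file,
# e.g. open("predictions.csv", "w", newline=""), instead of StringIO.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["converted"])            # header: the target column (assumed name)
writer.writerows([[p] for p in predictions])
submission = buf.getvalue()
```

Keeping one row per test example, in the original order of data_test.csv, is what lets the organizers line your predictions up with the true labels.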
Deliverable 📬
To complete this project, your team should:
- Create some relevant figures for the EDA
- Train at least one model that predicts conversions and evaluate its performance (F1 score, confusion matrices)
- Make at least one submission to the leaderboard
- Analyze your best model's parameters and make some recommendations to improve the conversion rate in the future
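Since the F1 score decides the leaderboard ranking, it helps to know exactly what it computes. Here is a plain-Python sketch; in practice you would use `sklearn.metrics.f1_score`:

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN): the harmonic mean of precision
    and recall, which is the challenge's ranking metric."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Because F1 ignores true negatives, it is a sensible choice here: most visitors do not convert, so plain accuracy would reward a model that predicts "no conversion" for everyone.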