
Crunchbase_analysis

Crunchbase is a platform for finding business information about private and public companies. Crunchbase data covers investments and funding, founding members and individuals in leadership positions, mergers and acquisitions, news, and industry trends. The objective of this project is to understand the factors that influence the status of North American startups, using Crunchbase data. The companies fall into three classes: operating, acquired, and closed. To assign these classes we employ classification algorithms, which map a set of attribute values to a categorical target value represented by a class attribute. In the end, we developed a model that predicts the status (operating, acquired, or closed) of a company based on its market and its fundraising.

We will start with some pre-processing (scaling the data, basically) and then cover the following algorithms: Random Forest, XGBoost, Decision Tree, and a Keras Sequential model.

For each algorithm, we will compute the model accuracy, the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE).
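As a rough, minimal sketch of how these four models might be set up (the hyperparameters below are illustrative defaults, not the notebook's exact values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# The three sklearn-style classifiers compared in this notebook.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "XGBoost": XGBClassifier(),
}

def build_sequential(n_features, n_classes=3):
    # Three output units for the three statuses: operating, acquired, closed.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(n_features,)),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```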

Define the Problem:

It's difficult to know which company to invest in, so this project aims to provide a simple visualization of the data that makes it easier for investors to decide. The purpose of this notebook is to try out a few algorithms for solving classification problems. The algorithms presented classify the status of companies that participated in investment series from 2012 to 2017.

Gather the Data:

The dataset comes from the open dataset published by Crunchbase.

The columns of this set are:

Column             Dtype
permalink          object
name               object
homepage_url       object
category_list      object
market             object
funding_total_usd  int64
status             object
country_code       object
state_code         object
region             object
city               object
funding_rounds     int64
founded_at         object
founded_month      object
founded_quarter    object
founded_year       float64
first_funding_at   object
last_funding_at    object

Prepare Data for Consumption:

The normal processes in data wrangling, such as data architecture, governance, and extraction, are out of scope.

  • For this step we used some libraries to remove special characters and to normalize the data (see the sketch after this list).
  • We will use the popular scikit-learn and keras libraries to develop our machine learning algorithms.
  • For data visualization, we will use the matplotlib and seaborn libraries.
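A minimal sketch of that cleaning step, assuming the raw CSV stores funding_total_usd as text with commas and dashes (the file name is a placeholder):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("crunchbase.csv")  # placeholder file name

# Strip the special characters (commas, dashes, spaces) from the funding
# column and coerce it to a number; unparseable values become NaN.
df["funding_total_usd"] = pd.to_numeric(
    df["funding_total_usd"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)

# Scale the numeric features so they share a comparable range (used later
# for modeling; df keeps the raw values for the exploratory plots).
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["funding_total_usd", "funding_rounds"]].fillna(0))
```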

Perform Exploratory Analysis:

Data analysis is the step where we describe and visualize the information. Data types and values are described, and we define the feature variables (X) and the target variable (y).

The graph shows the percentage of companies in each status: 86% of the companies are operating, 7.7% were acquired, and 5.4% were closed.

[img]
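A minimal sketch of how that chart can be produced, assuming the cleaned df from the preparation step:

```python
import matplotlib.pyplot as plt

# Percentage of companies in each status (operating, acquired, closed).
status_pct = df["status"].value_counts(normalize=True) * 100
status_pct.plot(kind="bar")
plt.ylabel("percent of companies")
plt.title("Company status")
plt.show()
```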

Top startups by market

Next, we describe the markets of the companies.

[img]

Distribution of total funding

The most popular categories are Software & Mobile, perhaps because these two categories scale easily. The next variable to analyze is the total funding in USD.

[img]
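A minimal sketch of that plot, run on the raw (unscaled) funding values; the log transform is an assumption to handle the heavy right skew:

```python
import numpy as np
import matplotlib.pyplot as plt

# log1p keeps zero-funding companies representable while compressing
# the long right tail of the distribution.
funding = df["funding_total_usd"].dropna()
np.log1p(funding).plot(kind="hist", bins=50)
plt.xlabel("log(1 + funding_total_usd)")
plt.show()
```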

Where are the biggest companies?

To understand the problem and the numbers, we will highlight unicorn companies such as Uber, Alibaba, Cloudera, and Facebook.

[img]

Model Data:

Our main objective is to understand how the category and the funding affect the status of the company.

We will convert categorical data to dummy variables for mathematical analysis. There are multiple ways to encode categorical variables; we will use the sklearn and pandas functions. In this step, we also define our X (independent) and y (dependent) variables for data modeling.
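A minimal sketch of this encoding, using pandas get_dummies for the market feature and sklearn's LabelEncoder for the target (the exact feature set shown is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

# One-hot encode the market column; the numeric funding features pass through.
X = pd.get_dummies(
    df[["market", "funding_total_usd", "funding_rounds"]],
    columns=["market"],
)

# Encode the three status labels (operating / acquired / closed) as integers.
y = LabelEncoder().fit_transform(df["status"])
```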

Validate and Implement Data Model:

When it comes to data modeling, the beginner's question is always: "Which is the best machine learning algorithm?" We will compare the accuracy, the Mean Absolute Error, and the Root Mean Squared Error.

[img]
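A sketch of how those metrics can be computed for one fitted model; clf, X_test, and y_test are assumed to come from the split shown in the next step:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Score one fitted classifier on the held-out test set. MAE and RMSE are
# computed on the integer-encoded class labels, as in the comparison above.
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("MAE:     ", mean_absolute_error(y_test, y_pred))
mse = mean_squared_error(y_test, y_pred)
print("MSE:     ", mse)
print("RMSE:    ", np.sqrt(mse))
```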

It's important to use different subsets: training data to build our model and test data to evaluate it. Cross-Validation (CV) is basically a shortcut to split and score our model multiple times, so we can get an idea of how well it will perform on unseen data.

[img]
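A minimal sketch of the split-and-score workflow, assuming the X and y defined in the modeling step (fold count and split size are illustrative):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set, then cross-validate on the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

clf.fit(X_train, y_train)  # final fit for the test-set evaluation
```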

In addition to CV, we used a customized sklearn train/test splitter to allow a little more randomness in our test scoring. The image above shows the default CV split. The results of the MLAs are the following:

MLA            Accuracy  Standard deviation
Random Forest  84.6%     0.03%
Decision Tree  85.8%     0.002%
XGBoost        85.8%     0.005%
Sequential     86.7%     0.03%

Conclusion

From the data we can conclude that the most important variables for getting acquired are related to fundraising and the tech industry: if a company raises a large amount of funding and is software related, it is highly probable that it will be acquired, and it could be a good idea to invest in it.

The most significant variables are:

Feature            Importance
funding_total_usd  0.525
funding_rounds     0.054
Curated Web        0.022
Public Relations   0.008
Software           0.008
Web Hosting        0.007
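As an illustration, a sketch of how such a ranking can be extracted from the fitted random forest, assuming the clf and X from the earlier sketches:

```python
import pandas as pd

# Rank features by the random forest's impurity-based importances; the
# top entries correspond to the table above.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(6))
```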
