I built YC-Scraper to create a dataset of all the companies in the Y Combinator directory. You can search for companies by industry, region, company size, and more in this directory.
Y Combinator is a startup accelerator that has invested in over 4,000 companies that have a combined valuation of over $600B. The overall goal of Y Combinator is to help startups really take off.
You must have Firefox and geckodriver installed. You can install geckodriver by running `brew install geckodriver`.
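Before running the scraper, it can help to confirm that the driver is actually reachable. The following is a minimal sketch, not part of this repository: `missing_prerequisites` is a hypothetical helper that assumes the tools are expected on your `PATH`. Only `geckodriver` is checked by default, because Firefox is often installed as an application bundle that does not expose a `firefox` executable on `PATH`.

```python
import shutil


def missing_prerequisites(executables=("geckodriver",)):
    """Return the names from `executables` that cannot be found on PATH.

    Hypothetical helper for illustration; the default tuple is an assumption.
    """
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All checked prerequisites were found on PATH.")
```

An empty list means every checked executable was found.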
The required Python packages are listed in `requirements.txt`.
1. Clone this repository
2. Move to the `yc-scraper` directory
3. [Optional] Create an environment for `yc-scraper` (for example by `conda create --name <env_name> --file requirements.txt`)
4. Run `python yc_links_extractor.py`. This will fetch the individual URLs for the spider to crawl.
5. Run `scrapy runspider scrapy-project/ycombinator/spiders/yscraper.py -o output.jl`. This generates a JSON lines file which you can read with Pandas:

       import pandas as pd
       df = pd.read_json('./output.jl', lines=True)
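If Pandas is not available, a JSON lines file can also be read with the standard library, since each line is a standalone JSON object. The sketch below is illustrative only: `read_json_lines` and `count_tags` are hypothetical helpers, the two records are made up, and it assumes each record carries a `tags` list as described in the attribute table.

```python
import io
import json
from collections import Counter


def read_json_lines(stream):
    """Parse one JSON object per non-empty line of a text stream."""
    return [json.loads(line) for line in stream if line.strip()]


def count_tags(records):
    """Count how often each tag appears across all records."""
    counts = Counter()
    for record in records:
        # Treat a missing or null "tags" value as an empty list.
        counts.update(record.get("tags") or [])
    return counts


# Two made-up records standing in for the contents of output.jl.
sample = io.StringIO(
    '{"company_name": "Example A", "tags": ["SaaS", "Fintech"]}\n'
    '{"company_name": "Example B", "tags": ["SaaS"]}\n'
)
records = read_json_lines(sample)
tag_counts = count_tags(records)
print(tag_counts.most_common(1))  # [('SaaS', 2)]
```

To run this against the real file, replace the `io.StringIO` sample with `open('./output.jl', encoding='utf-8')`.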
Check out the dataset I published on Kaggle.com.
Attribute | Description | Data Type |
---|---|---|
company_id | Company id provided by YC | int |
company_name | Company name | string |
short_description | One-line description of the company | string |
long_description | Long description of the company | string |
batch | Batch name provided by YC | string |
status | Company status | string |
tags | Industry tags | list |
location | Company location | string |
country | Company country | string |
year_founded | Year the company was founded | int |
num_founders | Number of founders | int |
founders_names | Full names of the founders | list |
team_size | Number of employees | int |
website | Company website | string |
cb_url | Company Crunchbase url | string |
linkedin_url | Company LinkedIn url | string |
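The attribute table above can be mirrored as a typed record when working with the data in plain Python. This is a sketch rather than code from the scraper: the `Company` class and `company_from_record` helper are hypothetical, and marking the numeric fields as optional is my assumption, since a scraped page may omit a value.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Company:
    """One row of the dataset, following the attribute table."""
    company_id: int
    company_name: str
    short_description: str = ""
    long_description: str = ""
    batch: str = ""
    status: str = ""
    tags: List[str] = field(default_factory=list)
    location: str = ""
    country: str = ""
    year_founded: Optional[int] = None   # assumed optional
    num_founders: Optional[int] = None   # assumed optional
    founders_names: List[str] = field(default_factory=list)
    team_size: Optional[int] = None      # assumed optional
    website: str = ""
    cb_url: str = ""
    linkedin_url: str = ""


def company_from_record(record):
    """Build a Company from a dict, skipping unknown keys and null values."""
    known = {f.name for f in fields(Company)}
    cleaned = {k: v for k, v in record.items() if k in known and v is not None}
    return Company(**cleaned)


example = company_from_record(
    {"company_id": 1, "company_name": "Example Co", "year_founded": None, "extra": "ignored"}
)
print(example.company_name, example.year_founded)  # Example Co None
```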
Note: I excluded 'long_description', 'cb_url', and 'linkedin_url' from the sample data for brevity.
company_id | company_name | short_description | batch | status | tags | location | country | year_founded | num_founders | founders_names | team_size | website |
---|---|---|---|---|---|---|---|---|---|---|---|---|
240 | Stripe | Economic infrastructure for the internet. | S09 | Active | ['Fintech', 'Banking as a Service', 'SaaS'] | San Francisco | US | | 2 | ['John Collison', 'Patrick Collison'] | 7000 | http://stripe.com |
271 | Airbnb | Book accommodations around the world. | W09 | Public | ['Travel', 'Marketplace'] | San Francisco | US | 2008 | 3 | ['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia'] | 6132 | http://airbnb.com |
325 | Dropbox | Backup and share files in the cloud. | S07 | Public | [] | San Francisco | US | 2008 | 2 | ['Arash Ferdowsi', 'Drew Houston'] | 4000 | http://dropbox.com |
379 | | The frontpage of the internet. | S05 | Acquired | ['Community', 'Social', 'Social Media', 'Social Network'] | San Francisco | US | | 1 | ['Steve Huffman'] | 201 | http://reddit.com |
439 | Coinbase | Buy, sell, and manage cryptocurrencies. | S12 | Public | ['Crypto / Web3'] | San Francisco | US | 2012 | 1 | ['Brian Armstrong'] | 6112 | https://www.coinbase.com |
531 | DoorDash | Restaurant delivery. | S13 | Public | ['E-commerce', 'Marketplace'] | San Francisco | US | 2013 | 3 | ['Andy Fang', 'Stanley Tang', 'Tony Xu'] | 8600 | http://doordash.com |
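The `batch` values in the sample data, such as `S09` and `W09`, combine a season letter with a two-digit year. A small sketch of splitting them apart follows. `parse_batch` is a hypothetical helper, and it assumes the `W`/`S` convention (Winter/Summer) and years from 2000 onward. Batch codes in other formats are rejected rather than guessed.

```python
import re

# Assumed mapping from the leading letter to a season name.
SEASONS = {"W": "Winter", "S": "Summer"}


def parse_batch(batch):
    """Split a batch code like 'S09' into ('Summer', 2009)."""
    match = re.fullmatch(r"([WS])(\d{2})", batch)
    if match is None:
        raise ValueError(f"Unrecognized batch code: {batch!r}")
    season, year = match.groups()
    return SEASONS[season], 2000 + int(year)


print(parse_batch("S09"))  # ('Summer', 2009)
print(parse_batch("W09"))  # ('Winter', 2009)
```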
Author: Miguel Corral Jr.
Email: corraljrmiguel@gmail.com
LinkedIn: https://www.linkedin.com/in/imiguel
GitHub: https://github.com/corralm
Distributed under the MIT license. See LICENSE for more information.