This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in awesome-awesomeness and sindresorhus's awesome list.
- 1000 Genomes
- American Gut (Microbiome Project)
- Collaborative Research in Computational Neuroscience (CRCNS)
- EBI ArrayExpress
- ENCODE project
- Ensembl Genomes
- Gene Expression Omnibus (GEO)
- Gene Ontology (GO)
- Global Biotic Interactions (GloBI)
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- MIT Cancer Genomics Data
- NIH Microarray data (FTP)
- OpenSNP genotypes data
- Pathguide: Protein-Protein Interactions Catalog
- Protein Data Bank
- PubChem Project
- PubGene (now Coremine Medical)
- Sequence Read Archive (SRA)
- Stanford Microarray Data
- The Catalogue of Life
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
- Australian Weather
- Brazilian Weather - Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WorldClim - Global Climate Data
- WU Historical Weather Worldwide
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- NIST complex networks data collection
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Small Network Data
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- ClueWeb09 - 1B web pages
- ClueWeb12 - 733M web pages
- CommonCrawl Web Data over 7 years
- CRAWDAD Wireless datasets from Dartmouth Univ.
- Criteo click-through data
- Open Mobile Data by MobiPerf
- UCSD Network Telescope, IPv4 /8 net
- Challenges in Machine Learning
- CrowdANALYTIX dataX
- D4D Challenge of Orange
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- Yelp Dataset Challenge
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
- BODC - marine data of ~22K vars
- Cambridge, MA, US, GIS data on GitHub
- EOSDIS - NASA's earth observing system data
- Factual Global Location Data
- Geo Spatial Data from ASU
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Landsat 8 on AWS
- List of all countries in all languages
- Natural Earth - vectors and rasters of the world
- OpenAddresses
- OpenStreetMap (OSM)
- Reverse Geocoder using OSM data & additional high-resolution data files
- TIGER/Line - U.S. boundaries and roads
- TwoFishes - Foursquare's coarse geocoder
- TZ Timezones shapefiles
- World countries in multiple formats
- Antwerp, Belgium
- Austin, TX, US
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Austria (data.gv.at)
- Belgium
- Brazil
- Cambridge, MA, US
- Canada
- Chicago
- Dallas Open Data
- Denver Open Data
- Durham, NC Open Data
- England LGInform
- EuroStat
- FedStats
- Finland
- France
- Germany
- Ghent, Belgium
- Glasgow, Scotland, UK
- Guardian world governments
- Houston Open Data
- Indian Government Data
- Indonesian Data Portal
- London Datastore, UK
- Los Angeles Open Data
- MassGIS, Massachusetts, U.S.
- Mexico
- Netherlands
- New Zealand
- NYC betanyc
- NYC Open Data
- OECD
- Oklahoma
- Open Government Data (OGD) Platform India
- Oregon
- Portland, Oregon
- Puerto Rico Government
- Rio de Janeiro, Brazil
- Romania
- Russia
- San Francisco Data sets
- Seattle
- Singapore Government Data
- South Africa
- Switzerland
- Texas Open Data
- The World Bank
- U.K. Government Data
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. National Center for Education Statistics (NCES)
- U.S. Open Government
- UK 2011 Census Open Atlas Project
- United Nations
- Uruguay
- Vancouver, BC Open Data Catalog
- EHDP Large Health Data Sets
- Gapminder World, demographic databases
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- MeSH, the vocabulary thesaurus used for indexing articles for PubMed
- Number of Ebola Cases and Deaths in Affected Countries (2014)
- Open-ODS (structure of the UK NHS)
- The Cancer Genome Atlas project (TCGA) and BigQuery table
- 10k US Adult Faces Database
- 2GB of Photos of Cats (Original down - 20 Aug 2015) or Archive version
- Affective Image Classification
- Animals with attributes
- Face Recognition Benchmark
- ImageNet (in WordNet hierarchy)
- Indoor Scene Recognition
- International Affective Picture System, UFL
- Massive Visual Memory Stimuli, MIT
- Stanford Dogs Dataset
- SUN database, MIT
- The Oxford-IIIT Pet Dataset
- YouTube Faces Database
- Delve Datasets for classification and regression (Univ. of Toronto)
- Discogs Monthly Data
- eBay Online Auctions (2012)
- IMDb Database
- Keel Repository for classification, regression and time series
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Natural History Museum (London) Data Portal
- Rijksmuseum Historical Art Collection
- Tate Collection metadata
- The Getty vocabularies
- Canada Science and Technology Museums Corporation's Open Data
- Blogger Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Google Books Ngrams (2.2TB)
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Machine Translation of European languages
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- SMS Spam Collection in English
- USENET postings corpus of 2005~2011
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
- CERN Open Data Portal
- NASA Exoplanet Archive
- NSSDC (NASA) data of 550 spacecraft
- Sloan Digital Sky Survey (SDSS) - Mapping the Universe
- Amazon
- Archive.org Datasets
- CMU JASA data archive
- CMU StatLab collections
- Data360
- Datamob.org
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Numbrary
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- StatSci.org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- Academic Torrents of data sharing from UMB
- Archive-it from Internet Archive
- Datahub.io
- DataMarket (Qlik)
- Freebase.com of people, places, and things
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Open Data Certificates (beta)
- Statista.com - statistics and studies
- 72 hours #gamergate scrape
- Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape
- May 2011 Calufa Twitter Scrape
- Network Twitter Data
- Social Twitter Data
- Twitter Data for Sentiment Analysis
- Ancestry.com Forum Dataset over 10 years
- CMU Enron Email of 150 users
- EDRM Enron Email of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- FBI Hate Crime 2013 - aggregated data
- Foursquare from UMN/Sarwat (2013)
- GDELT Global Events Database
- General Social Survey (GSS) since 1972
- GetGlue - users rating TV shows
- GitHub Collaboration Archive
- Google Scholar citation relations
- MIT Reality Mining Dataset
- Mobile Social Networks from UMASS
- PewResearch Internet Survey Project
- Political Polarity Data
- Reddit Comments
- Skytrax' Air Travel Reviews Dataset
- SourceForge.net Research Data
- StackExchange Data Explorer
- Texas Inmates Executed Since 1984
- Titanic Survival Data Set
- Twitter Graph of entire Twitter site
- UCB's Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UNIMI/LAW Social Network Datasets
- Universities Worldwide
- UPJOHN for Labor Employment Research
- Yahoo! Graph and Social Data
- YouTube Video Social Graph in 2007, 2008
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman's Baseball Database
- Retrosheet Baseball Statistics
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Time Series Data Library (TSDL) from MU
- UC Riverside Time Series Dataset
- Airlines OD Data 1987-2008
- Bay Area Bike Share Data
- Bike Share Systems (BSS) collection
- GeoLife GPS Trajectory from Microsoft Research
- Hubway Million Rides in MA
- Marine Traffic - ship tracks, port calls and more
- NYC Taxi Trip Data 2009-
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Uber trip data April 2014 to September 2014
- OpenFlights - airport, airline and route data
- Plane Crash Database, since 1920
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- OpenDataMonitor: An overview of available open data resources in Europe
- OpenDataNetwork: A search engine of all Socrata powered data portals ranging from small cities to federal agencies and non-profits
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- Zenodo: An open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.