Datasets for NLP, CV, and speech
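https://github.com/brightmart/nlp_chinese_corpus Chinese corpora:
1. Wikipedia (wiki2019zh): 1 million well-structured Chinese entries
2. News corpus (news2016zh): 2.5 million news articles with keywords and descriptions
3. Encyclopedia Q&A (baike2018qa): 1.5 million question-answer pairs labeled with question type
4. Community Q&A, JSON version (webtext2019zh): 4.1 million high-quality community question-answer pairs, suitable for training very large models
5. Translation corpus (translation2019zh): 5.2 million Chinese-English sentence pairs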
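https://github.com/crownpku/Small-Chinese-Corpus Small datasets:
Latitude/longitude coordinates of Chinese provinces and cities: city_location/
Postal codes of Chinese provinces and cities: postal_provinces/
National administrative division and urban-rural classification codes (2015): china_geo_code/
Chinese idioms (chengyu): chengyu/
Chinese personal names, plus character names from Jin Yong's novels, Romance of the Three Kingdoms, and Dream of the Red Chamber: chi_names/
Chinese named entity recognition data sample: NER_chi/
Chinese relation extraction data sample: relation_multiple_chi/
Chinese reading comprehension data sample: reading_comprehension_chi/
Chinese visual question answering data (based on MSCOCO): Chinese_Visual_QA_pairs/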
https://github.com/SophonPlus/ChineseNlpCorpus Sentiment/opinion/review data, Chinese named entity recognition, recommender systems, FAQ question answering systems
https://zhuanlan.zhihu.com/p/35423943?utm_source=wechat_timeline&utm_medium=social&wechatShare=2&from=timeline&isappinstalled=0 https://github.com/Embedding/Chinese-Word-Vectors 100+ pre-trained Chinese word vectors
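https://github.com/fighting41love/funNLP Chinese and English sensitive words, language detection, Chinese and international mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese personal name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment scores, stop words, reactionary-term word list, violence and terrorism word list, traditional/simplified Chinese conversion, simulating Chinese pronunciation with English, Wang Feng lyrics generator, job title lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand and parts lexicon, time expression extraction, segmentation of unspaced English text, collection of Chinese word vectors, company name list, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figure lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data.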
https://github.com/STAIR-Lab-CIT/STAIR-actions Large-scale video dataset for action recognition http://actions.stair.center
http://crcv.ucf.edu/data/UCF101.php Action recognition data set
http://bbcsfx.acropolis.org.uk These 16,016 BBC Sound Effects are made available by the BBC in WAV format to download for use under the terms of the RemArc Licence.
https://github.com/awesomedata/awesome-public-datasets A topic-centric list of high-quality open datasets in public domains.
http://www.vldb.org/pvldb/vol9/p993-abedjan.pdf Detecting Data Errors: Where are we and what needs to be done?
https://arxiv.org/abs/1801.07237 Smoke: Fine-grained Lineage at Interactive Speed
https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf Data Programming: Creating Large Training Sets, Quickly
http://www.vldb.org/pvldb/vol9/p948-krishnan.pdf ActiveClean: Interactive Data Cleaning For Statistical Modeling
https://www.zhihu.com/question/19969760 What public data sources are available for data analysis and mining?
https://github.com/Featuretools/featuretools An open-source Python framework for automated feature engineering https://www.featuretools.com
https://github.com/lorien/awesome-web-scraping List of libraries, tools and APIs for web scraping and data processing.
https://github.com/scrapy/scrapy A fast high-level web crawling & scraping framework for Python
https://github.com/gaojiuli/gain Web crawling framework based on asyncio.
https://github.com/kootenpv/sky Aims at next-generation web crawling, where machine intelligence is used to speed up the development, maintenance, and reliability of crawling.
https://github.com/HoloClean/HoloClean A Machine Learning System for Data Enrichment. http://www.holoclean.io
https://github.com/HazyResearch/snorkel A system for quickly generating training data with weak supervision http://snorkel.stanford.edu
https://blog.modeanalytics.com/python-data-cleaning-libraries/
https://github.com/NathanEpstein/Dora
https://github.com/rhiever/datacleaner
https://github.com/HHammond/PrettyPandas
https://github.com/LuminosoInsight/python-ftfy
http://brettromero.com/data-science-kaggle-walkthrough-cleaning-data/
Crawling and scraping:
https://github.com/sangaline/advanced-web-scraping-tutorial The Zipru scraper developed in the Advanced Web Scraping Tutorial.
https://github.com/speed/newcrawler Free Web Scraping Tool with Java http://www.newcrawler.com
https://github.com/leonsim/simhash A Python Implementation of Simhash Algorithm
https://github.com/lorien/grab Web Scraping Framework http://grablib.org
https://github.com/yahoo/anthelion Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages https://labs.yahoo.com/publications/6…
https://github.com/vega/voyager Recommendation-Powered Visualization Tool for Data Exploration http://vega.github.io/voyager
https://orange.biolab.si Open source machine learning and data visualization for novice and expert. Interactive data analysis workflows with a large toolbox.
https://github.com/wireservice/agate A Python data analysis library that is optimized for humans instead of machines. http://agate.readthedocs.org/
https://github.com/wireservice/csvkit A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
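https://github.com/jobbole/awesome-python-cn Chinese edition of the awesome Python resource list, covering web frameworks, web crawlers, template engines, databases, data visualization, image processing, and more; continuously updated by Jobbole.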
http://wp.sigmod.org/?p=2288 Data Cleaning Is a Machine Learning Problem That Needs Data Systems Help!
http://serialmentor.com/dataviz/ Fundamentals of Data Visualization
http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ Advanced Data Analysis from an Elementary Point of View
https://www.zhihu.com/question/34444491 Recommended websites for learning data analysis, big data, and data mining
https://github.com/tdpetrou/Learn-Pandas Tutorials on how to use pandas effectively to do data analysis
https://github.com/BrambleXu/pydata-notebook Python for Data Analysis, 2nd Edition (2017): Chinese translation notes
https://github.com/wesm/pydata-book Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media
https://github.com/iamseancheney/pythonbooks/blob/master/Python%20for%20Data%20Analysis,%202nd%20Edition.pdf Python for Data Analysis, 2nd Edition (PDF)
https://www.jianshu.com/p/04d180d90a3f Chinese translation of Python for Data Analysis, 2nd Edition