Research Document
(PDF) An Entity Resolution approach to isolate instances of Human Trafficking online
(PDF) Identifying human trafficking indicators in the UK online sex market
Git version control and the GitHub Desktop client, for getting started with Git quickly if you have no prior experience: Getting Started with Git and GitHub Desktop
Python 3 Tutorial - Learn Python in 30 Minutes.
JetBrains PyCharm: Python IDE (get the Pro version free with a @tcd.ie JetBrains student account)
Python Reference Documentation
Regex (Regular Expressions):
Quick-Start: Regex Cheat Sheet
Scrapy syntax note: get() is the same as extract_first(), and getall() is the same as extract().
Web Scraping in Python using Scrapy (with multiple examples)
Scraping The Steam Game Store With Scrapy
Handling JavaScript In Scrapy With Splash
Jupyter Notebook:
Project Jupyter | Installing Jupyter
Plotly:
We could use APIs, but there's a different one for each site, and some don't offer REST APIs, so this is impractical for our project. We should use a general purpose web scraper.
Example site to scrape: Jobs | Locanto™ Job Market Ireland
According to the resources below, if we build our own web scraper, Scrapy is the best fit: it is purpose-built for web scraping and scales well, and it can be paired with Splash for compatibility with JavaScript-driven websites.
Top 7 Python Web Scraping Tools For Data Scientists
Choose the Best Python Web Scraping Library for Your Application
Top Web Scraping Python Libraries Compared
A potential alternative is the ParseHub API, but since it is limited to 5 requests per second with a 25/sec queue, it is neither scalable nor practical for our project.
Main tool:
Other potentially useful tools:
- NLTK, CoreNLP, and spaCy are NLP libraries. CoreNLP is written in Java and can be used from Python. spaCy is the easiest to work with because it is object-based rather than string-based, while NLTK is more flexible. spaCy and CoreNLP perform better than NLTK. CoreNLP supports only 8 languages; the latest spaCy supports 64+.
- Keras and PyTorch are machine learning APIs. PyTorch is lower-level and more mathematical, while Keras is higher-level. Keras can run on top of TensorFlow or Theano.
- Gensim is a package primarily for topic modelling. Gensim can be used with spaCy.
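The "object rather than string based" point about spaCy in the list above can be illustrated with a minimal sketch, assuming spaCy is installed. The sample sentence is invented, and spacy.blank("en") is used so that no trained model download is assumed; it provides tokenization only, not tagging or entity recognition.

```python
import spacy

text = "Contact Anna in Dublin today."

# String-based view: plain Python strings, no linguistic attributes.
string_tokens = text.split()

# Object-based view: spacy.blank("en") builds a tokenizer-only English
# pipeline, so no trained model download is assumed here.
nlp = spacy.blank("en")
doc = nlp(text)

# Each token is an object carrying attributes alongside its text.
token_info = [(t.text, t.is_alpha, t.is_punct) for t in doc]

print(string_tokens)
print(token_info)
```

The whitespace split leaves the full stop attached to the last word, whereas the spaCy tokens separate it and expose flags such as is_punct directly on each token.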
Visualisation tool (Python Plotly Dash):
GitHub - plotly.py: The interactive graphing library for Python
Project by:
- Michael Makarenko - CS Third Year
- Honglin Li - CSB Third Year
- Ajchan Mamedov - CS Third Year
- Robert Jones - CSB Second Year
- Joseph Hand - CS Second Year
- Sheena O’Reilly - CS Second Year
- Rían Walter - CS Second Year
Under Guidance of:
- Marco Blasio
- John McGrath
Intellectual Property of IBM™