Research Document

Michael Makarenko edited this page Apr 12, 2022 · 7 revisions

Research

Papers

(PDF) "Spotting the signs" of trafficking recruitment online: exploring the characteristics of advertisements targeted at migrant job-seekers

(PDF) An Entity Resolution approach to isolate instances of Human Trafficking online

(PDF) Identifying human trafficking indicators in the UK online sex market

Learning Resources

Git version control and the GitHub Desktop client, for getting started with Git quickly if you have no prior experience: Getting Started with Git and GitHub Desktop

Python:

Python 3 Tutorial - Learn Python in 30 Minutes.

JetBrains PyCharm: Python IDE (get the Pro version free with a @tcd.ie JetBrains student account)

Python Reference Documentation

Regex (Regular Expressions):

Quick-Start: Regex Cheat Sheet

Scrapy:

Scrapy syntax note: get() is the same as extract_first() and getall() is the same as extract()

Web Scraping in Python using Scrapy (with multiple examples)

Scraping The Steam Game Store With Scrapy

Scrapy Docs

Scrapy spider crawling rules

Scrapy link extractors

Handling JavaScript In Scrapy With Splash

Jupyter Notebook:

What is a Jupyter Notebook?

Project Jupyter | Installing Jupyter

Plotly:

Python Plotly tutorial

Plot Data From Csv

Web Scraping

We could use site APIs, but each site has a different one and some sites don't offer a REST API at all, so this is impractical for our project. We should use a general-purpose web scraper instead.

Example site to scrape: Jobs | Locanto™ Job Market Ireland

According to the resources below, if we are to build our own web scraper, we would be best off using Scrapy, since it is purpose-made for web scraping and is scalable, together with the Splash library for compatibility with JavaScript-heavy websites.

Top 7 Python Web Scraping Tools For Data Scientists

Choose the Best Python Web Scraping Library for Your Application

Top Web Scraping Python Libraries Compared

A potential alternative is the ParseHub API, but since it's limited to 5 requests per second with a 25/sec queue, it's neither scalable nor practical.
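A rough back-of-the-envelope sketch of what the 5 requests per second cap implies. The figure of 100,000 pages is a hypothetical workload chosen for illustration, and the estimate ignores queueing and response times, so it is a lower bound only.

```python
import math


def min_fetch_seconds(num_pages: int, requests_per_second: int = 5) -> int:
    """Lower bound on wall-clock seconds to issue num_pages requests at a fixed rate cap."""
    return math.ceil(num_pages / requests_per_second)


pages = 100_000  # hypothetical number of advertisements to fetch
seconds = min_fetch_seconds(pages)
print(f"{pages} pages need at least {seconds} s (about {seconds / 3600:.1f} h)")
```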

Natural Language Processing

Main tool:

Other potentially useful tools:

  • NLTK, CoreNLP, and spaCy are NLP libraries. CoreNLP is written in Java and can be used from Python. spaCy is the easiest to work with because it is object-based rather than string-based, while NLTK is more flexible. spaCy and CoreNLP perform better than NLTK. CoreNLP supports only 8 languages; the latest spaCy supports 64+.
  • Keras and PyTorch are machine learning APIs. PyTorch is lower-level and more mathematical, while Keras is higher-level and can run on top of TensorFlow or Theano.
  • Gensim is a package primarily for topic modelling. It can be used together with spaCy.
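As an illustration of what "object rather than string based" means for spaCy, here is a minimal sketch. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download; the example sentence is made up.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("No experience needed.")

# Each token is an object carrying attributes, not a bare string.
for token in doc:
    print(token.text, token.is_alpha, token.is_punct)

# Filtering by attribute instead of by string comparison.
words = [token.lower_ for token in doc if not token.is_punct]
print(words)
```

Part-of-speech tags, entities, and similar annotations would require loading a trained pipeline rather than a blank one.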

Visualisation

Visualisation tool: Plotly Dash (Python):

GitHub - plotly.py: The interactive graphing library for Python

Plotly Python Graphing Library

Dash Overview