⚠️ This repository is based on PhD research that seeks to identify radicalisation on online platforms. As a result, text, themes, and content relating to far-right extremism are present in this repository. Please proceed with care. ⚠️

Samaritans - Call 116 123 | ACT Early | actearly.uk | Prevent advice line 0800 011 3764

📍 Pinpoint is a suite of functionality for building and using a binary classifier for the identification of extremist content. 💻

Pinpoint

Pinpoint is a suite of functionality for building a binary classifier for the identification of far-right extremist content. This tooling builds on the methodology in the paper "Radical Mind: Identifying Signals to Detect Extremist Content on Twitter" by Mariam Nouh, Jason R.C. Nurse, and Michael Goldsmith.

Installation

python -m pip install git+https://github.com/CartographerLabs/Pinpoint.git

Datasets

Parler dataset

The first dataset was taken from "A Large Open Dataset from the Parler Social Network". It was split into two separate datasets using the Log-Likelihood tooling from the Parler Toolbox repository. To do this, 100 posts in the dataset were manually marked as either violent extremist or non-extremist, and the tooling was then used to identify the top 30 keywords relating to violent far-right extremism. A subset of these is shown below:

  • genocidal
  • fire
  • destroyers
  • democraticnazi
  • fucker
  • tribunals
  • invoke
  • squad
  • punch
  • tyrannical
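The keyword ranking described above can be illustrated with the log-likelihood (G2) keyness statistic commonly used in corpus linguistics. This is a minimal sketch, not the Parler Toolbox's own code: the function names `log_likelihood` and `top_keywords`, the whitespace tokenisation, and the filter for over-represented words are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(freq_target, freq_reference, size_target, size_reference):
    """Return the log-likelihood (G2) keyness score for one word.

    freq_* are the word's counts in each corpus; size_* are the total
    token counts of each corpus.
    """
    total = size_target + size_reference
    combined = freq_target + freq_reference
    expected_target = size_target * combined / total
    expected_reference = size_reference * combined / total
    score = 0.0
    # A zero observed count contributes nothing to the sum.
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_reference > 0:
        score += freq_reference * math.log(freq_reference / expected_reference)
    return 2.0 * score


def top_keywords(target_posts, reference_posts, limit=30):
    """Rank words that are over-represented in target_posts (hypothetical helper)."""
    # Assumption: lower-cased whitespace tokenisation is good enough for a sketch.
    target_counts = Counter(w for p in target_posts for w in p.lower().split())
    reference_counts = Counter(w for p in reference_posts for w in p.lower().split())
    size_target = sum(target_counts.values())
    size_reference = sum(reference_counts.values())
    if size_target == 0 or size_reference == 0:
        return []
    scored = []
    for word, freq in target_counts.items():
        ref_freq = reference_counts.get(word, 0)
        # Keep only words that are relatively more frequent in the target corpus.
        if freq / size_target > ref_freq / size_reference:
            score = log_likelihood(freq, ref_freq, size_target, size_reference)
            scored.append((score, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:limit]]
```

Here `target_posts` would hold the manually marked violent extremist posts and `reference_posts` the non-extremist ones; `limit=30` mirrors the list size mentioned above.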

Once these violent-extremist keywords had been collected, the dataset was split: text posts containing any of the keywords were marked as violent far-right extremist, and posts containing none were marked as the baseline. The text posts were then converted to CSV and annotated with the LIWC Text Analysis Engine.
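The keyword-based split can be sketched as follows. The function name `split_by_keywords` and the whole-word, case-insensitive matching are assumptions; the README does not state how keywords were matched against posts.

```python
import re


def split_by_keywords(posts, keywords):
    """Split posts into (extremist, baseline) lists by keyword presence.

    Hypothetical helper: a post is labelled extremist if any keyword
    appears in it as a whole word, ignoring case.
    """
    keyword_set = {keyword.lower() for keyword in keywords}
    extremist, baseline = [], []
    for post in posts:
        # Tokenise into lower-cased runs of letters, digits, and apostrophes.
        tokens = set(re.findall(r"[a-z0-9']+", post.lower()))
        if tokens & keyword_set:
            extremist.append(post)
        else:
            baseline.append(post)
    return extremist, baseline
```

Whole-word matching avoids labelling a post because a keyword happens to occur inside a longer word; substring matching would be a looser alternative.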

Stormfront dataset

The second dataset, used to build a corpus of known radical language, was extracted from the "Hate speech dataset from a white supremacist forum" and converted to CSV format.

Example Usage

from Pinpoint.FeatureExtraction import *
from Pinpoint.RandomForest import *

# Performs feature extraction from the provided Extremist, Counterpoise, and Baseline datasets.
extractor = feature_extraction(violent_words_dataset_location=r"datasets/swears",
                               baseline_training_dataset_location=r"datasets/far-right/LIWC2015 Results (Storm_Front_Posts).csv")

extractor.MAX_RECORD_SIZE = 250000

extractor.dump_training_data_features(
    feature_file_path_to_save_to=r"outputs/training_features.json",
    extremist_data_location=r"datasets/far-right/LIWC2015 Results (extreamist-messages.csv).csv",
    baseline_data_location=r"datasets/far-right/LIWC2015 Results (non-extreamist-messages.csv).csv")

# Trains a model off the features file created in the previous stage
model = random_forest()

model.RADICAL_LANGUAGE_ENABLED = True
model.BEHAVIOURAL_FEATURES_ENABLED = True
model.PSYCHOLOGICAL_SIGNALS_ENABLED = True

model.train_model(features_file=r"outputs/training_features.json",
                  force_new_dataset=True,
                  model_location=r"outputs/far-right-baseline.model")

model.create_model_info_output_file(location_of_output_file="outputs/far-right-baseline-output.txt",
                                    training_data_csv_location=r"outputs/training_features.json.csv")

Outputs

Once a model has been trained, it is pickled and saved as a re-loadable file in the tooling's output directory for future use. A text file is also created detailing the specifications and accuracy scores of the trained model. Examples of both are included in the provided folder.
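Because the model is saved as a pickle, it can in principle be restored with Python's standard `pickle` module. The sketch below shows only the generic round-trip; `save_model` and `load_model` are hypothetical helper names, not part of Pinpoint's documented API, and the README does not show Pinpoint's own loading call.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Pickle any Python object to path, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously pickled object from path."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

Restoring a file such as `outputs/far-right-baseline.model` this way assumes it is a plain pickle and that Pinpoint is installed, since unpickling needs the original classes to be importable. Only unpickle files from a source you trust, as loading a pickle can execute arbitrary code.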