Skip to content

Latest commit

 

History

History
36 lines (21 loc) · 2.66 KB

README.mdown

File metadata and controls

36 lines (21 loc) · 2.66 KB

#The Language of Fraud

“There’s a kind of fascination with the thought that a computer sleuth can discover things that are hidden there in the text. Things about the style of the writing that the reader can’t detect and the author can’t do anything about, a kind of signature or DNA or fingerprint of the way they write.” -- Peter Millican on use of forensic linguistics

Language use is constant. While other indicators for fraud, such as IP addresses, bank accounts, can be changed, language use is constant and indicative.

The inspiration for this project comes from an in-class case study we did on fraud detection. Bag of words approaches and text length showed some promising results for fraud detection. Based on that, I wanted to see if some deep learning approaches with language would yield better results as it would give a larger feature set with context in language use. As indicated in the above quote, authors have characteristic writing that is unique and traceable. If the use of language in perpetuating fraud could be thought of as a genre, I wanted to try to find via computational means the patterns of usage that indicate this 'genre of fraud'.

For featurization of the text descriptions, I chose the Stanford Core Parser because it gave a rich feature set should I choose to extend it further than I did for the current model. In this model, I have used only the syntactic depdendencies and part-of-speech tags given by the parser. Word2Vec was used for the featurization of the words themselves for two reasons: it trains very quickly and the gensim library within python allows for its ease of use. I then created a sparse matrix with a single dependency within a sentence represented by a row and then built a model using scikit-learn's logistic regression classifier.

The scoring in this model is such that every sentence is given a score by averaging the binary fraud/not fraud scores of its dependencies. Every event is then scored as fraudulent given that one sentence within is indicated as fraudulent.

The scripts for building the model are in the build scripts folder and the scoring scripts are within the scoring scripts folder.

This project is meant as a proof of concept model, not a working application.
###Required technologies for this product:

####NLP

####Python libraries

  • Pandas
  • Gensim
  • Scikit-learn
  • Numpy

####Database

  • Postgres