Initial Commit
Cat Bear committed Jun 25, 2016
0 parents commit 8102894
Showing 8 changed files with 70 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -0,0 +1 @@
# Expelliarmus
65 changes: 65 additions & 0 deletions expelliarmus.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python


from fuzzywuzzy import fuzz
from subprocess import Popen, PIPE
import docx2txt as doc
from os import listdir
from os.path import isfile, join
import hashlib
import argparse
from collections import defaultdict
def removeNonAscii(string):
    # Removes all characters that aren't ASCII-compatible. Sorry if you don't speak American
    return "".join(i for i in string if ord(i) < 128)
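def documentToText(path):
    if path.endswith(".doc"):
        cmd = ['antiword', path]
        p = Popen(cmd, stdout=PIPE)
        stdout, stderr = p.communicate()
        # communicate() returns bytes on Python 3, so decode before filtering
        return removeNonAscii(stdout.decode("utf-8", "ignore"))
    elif path.endswith(".docx"):
        return removeNonAscii(doc.process(path))
    elif path.endswith(".txt"):
        # Read raw bytes and decode leniently; the with block closes the file
        with open(path, 'rb') as inputFile:
            return removeNonAscii(inputFile.read().decode("utf-8", "ignore"))
    # Unsupported extensions contribute no text
    return ""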
def getHashes(path):
    # Read the file once and feed the same bytes to both digests
    with open(path, 'rb') as afile:
        data = afile.read()
    md5 = hashlib.md5(data).hexdigest()
    sha1 = hashlib.sha1(data).hexdigest()
    return (md5, sha1)

parser = argparse.ArgumentParser(description="Compares every pair of files in a given directory and reports a similarity ratio for each pair, to help judge whether any of them were copied")
parser.add_argument('-d', '--dir', required=True, help='Directory containing the files to compare')
parser.add_argument('-t', '--threshold', type=int, default=90, help='User-defined threshold. Any ratio >= this threshold is printed')
args = parser.parse_args()


files = [f for f in listdir(args.dir) if isfile(join(args.dir, f))]

hashes = defaultdict(list)
scanned = list()
ratios = list()
count = 0
for i in files:
    path = join(args.dir, i)  # join() works whether or not --dir ends with a slash
    hashes[getHashes(path)].append(i)
    for j in scanned:
        ratios.append((i, j.split("/")[-1], fuzz.ratio(documentToText(path), documentToText(j))))
        count += 1  # number of pairwise comparisons performed
    scanned.append(path)

for ratio in ratios:
    if ratio[2] >= args.threshold:
        print("Files worth looking at: ", ratio)
for key in hashes:
    if len(hashes[key]) > 1:
        print("Hash match: " + str(hashes[key]))
print("Count: " + str(count))
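
Review note on expelliarmus.py: the comparison loop calls documentToText twice for every pair, so each file is re-read (and antiword re-run for .doc files) once per comparison. A minimal sketch of extracting each text once and then comparing all pairs is below. It is not the commit's code: `similarity`, `compare_all`, and the in-memory `texts` dict are hypothetical names, and `difflib.SequenceMatcher` from the standard library is swapped in for `fuzz.ratio` so the sketch runs without fuzzywuzzy installed. Scores from the two may differ.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in for fuzz.ratio (assumption): a 0-100 integer score from difflib.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def compare_all(texts, threshold=90):
    # texts maps a file name to its already-extracted text, so each
    # document is converted once no matter how many pairs it appears in.
    hits = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            hits.append((name_a, name_b, score))
    return hits


if __name__ == "__main__":
    # Hypothetical sample texts, not the files from this commit.
    sample = {
        "a.txt": "the quick brown fox jumps over the lazy dog",
        "b.txt": "the quick brown fox jumped over the lazy dog",
        "c.txt": "completely unrelated words here",
    }
    for hit in compare_all(sample):
        print("Files worth looking at: ", hit)
```

In the script itself, the same effect could be had by filling a dict of `documentToText(path)` results before the pairwise loop starts.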

1 change: 1 addition & 0 deletions files/1.txt
@@ -0,0 +1 @@
Now, the usual political thing to do when charges are made against you is to either ignore them or to deny them without giving details. I believe we've had enough of that in the United States, particularly with the present Administration in Was D.C. To me the office of the Vice Presidency of the United States is a great office, and I feel that the people have got to have confidence in the integrity of the men who run for that office and who might obtain it.
1 change: 1 addition & 0 deletions files/2.txt
@@ -0,0 +1 @@
Now, the usual political thing to do when charges are made against you is to either ignore them or to deny them without giving details. I believe we've had enough of that in the United States, particularly with the present Administration in Washington, D.C. To me the office of the Vice Presidency of the United States is a great office, and I feel that the people have got to have confidence in the integrity of the men who run for that office and who might obtain it. Merge
1 change: 1 addition & 0 deletions files/3.txt
@@ -0,0 +1 @@
This is an email full of dirty words in an attempt to hack and slash and hopefully convince you there is a merger. Thank you. Burt Maclan FBI
Binary file added files/Hacks and Pranks.docx
Binary file not shown.
Binary file added files/WaterPollution.doc
Binary file not shown.
1 change: 1 addition & 0 deletions files/copy.txt
@@ -0,0 +1 @@
Now, the usual political thing to do when charges are made against you is to either ignore them or to deny them without giving details. I believe we've had enough of that in the United States, particularly with the present Administration in Was D.C. To me the office of the Vice Presidency of the United States is a great office, and I feel that the people have got to have confidence in the integrity of the men who run for that office and who might obtain it.
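
Review note on the sample files: as shown in this diff, files/1.txt and files/copy.txt appear to carry the same text, while files/2.txt differs slightly ("Washington," and a trailing "Merge"). That would make the first pair a candidate for the script's hash-match report and the third file a candidate for the fuzzy-ratio report. A minimal sketch of the hash-grouping half is below. `group_by_digest` and the byte strings are hypothetical: the strings are shortened stand-ins, not the real file contents, and whether the real files match byte for byte (trailing newlines included) cannot be confirmed from the diff.

```python
import hashlib
from collections import defaultdict


def group_by_digest(contents):
    # contents maps a file name to its raw bytes (hypothetical in-memory input).
    groups = defaultdict(list)
    for name, data in sorted(contents.items()):
        groups[hashlib.sha1(data).hexdigest()].append(name)
    # Only digests shared by more than one file indicate exact copies.
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    # Shortened stand-ins for the sample files, for illustration only.
    demo = {
        "1.txt": b"present Administration in Was D.C.",
        "copy.txt": b"present Administration in Was D.C.",
        "2.txt": b"present Administration in Washington, D.C. Merge",
    }
    print(group_by_digest(demo))
```

An exact digest match only catches byte-identical copies, which is why the script pairs it with a fuzzy ratio for near-copies such as a file with a few words changed.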
