Commit
Cat Bear committed Jun 25, 2016
0 parents
commit 8102894
Showing 8 changed files with 70 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Expelliarmus |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
#!/usr/bin/env python | ||
|
||
|
||
from fuzzywuzzy import fuzz | ||
from subprocess import Popen, PIPE | ||
import docx2txt as doc | ||
from os import listdir | ||
from os.path import isfile, join | ||
import hashlib | ||
import argparse | ||
from collections import defaultdict | ||
def removeNonAscii(string): | ||
    # Removes all characters that aren't ASCII compatible. Sorry if you don't speak American | ||
    return "".join(i for i in string if ord(i) < 128) | ||
def documentToText(path): | ||
    # Extract plain text from a .doc, .docx, or .txt file; any other type yields "" | ||
    if path.endswith(".doc"): | ||
        # antiword must be installed and on the PATH | ||
        p = Popen(['antiword', path], stdout=PIPE) | ||
        stdout, stderr = p.communicate() | ||
        # communicate() returns bytes on Python 3, so decode before filtering | ||
        return removeNonAscii(stdout.decode("utf-8", "ignore")) | ||
    elif path.endswith(".docx"): | ||
        return removeNonAscii(doc.process(path)) | ||
    elif path.endswith(".txt"): | ||
        # The with block closes the file as soon as it has been read | ||
        with open(path) as inputFile: | ||
            return removeNonAscii(inputFile.read()) | ||
    return "" | ||
def getHashes(path): | ||
    # Read the file once and feed the same bytes to both digests | ||
    with open(path, 'rb') as afile: | ||
        data = afile.read() | ||
    md5 = hashlib.md5(data).hexdigest() | ||
    sha1 = hashlib.sha1(data).hexdigest() | ||
    return (md5, sha1) | ||
|
||
parser = argparse.ArgumentParser(description="Compare every pair of files in a directory and report a similarity ratio for each pair, to help judge whether any of them were copied.") | ||
parser.add_argument('-d', '--dir', required=True, help='Root directory containing the files to compare') | ||
parser.add_argument('-t', '--threshold', type=int, default=90, help='User-defined threshold; any ratio >= this value is printed') | ||
args = parser.parse_args() | ||
|
||
|
||
files = [ f for f in listdir(args.dir) if isfile(join(args.dir,f))] | ||
|
||
hashes = defaultdict(list) | ||
scanned = list() | ||
ratios = list() | ||
count = 0 | ||
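for i in files: | ||
    # join() adds the separator when --dir lacks a trailing slash | ||
    path = join(args.dir, i) | ||
    hashes[getHashes(path)].append(i) | ||
    # Compare this file against every file scanned so far | ||
    for j in scanned: | ||
        ratios.append((i, j.split("/")[-1], fuzz.ratio(documentToText(path), documentToText(j)))) | ||
        count += 1 | ||
    scanned.append(path) | ||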
|
||
for ratio in ratios: | ||
    # The --threshold help text promises ratios >= the threshold | ||
    if ratio[2] >= args.threshold: | ||
        print("Files worth looking at: ", ratio) | ||
for key in hashes: | ||
    if len(hashes[key]) > 1: | ||
        print("Hash match: " + str(hashes[key])) | ||
print("Count: " + str(count)) | ||
|
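The hashing step in the script above can also be done in a single pass without holding the whole file in memory. The following is a minimal sketch, not part of the commit: the name `file_digests` and the `chunk_size` parameter are hypothetical, and it assumes both MD5 and SHA-1 are available in the local `hashlib` build.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return (md5_hex, sha1_hex) for the file at path, reading it once.

    Hypothetical helper: the name and chunk_size default are assumptions,
    not taken from the commit.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Feed the same chunk to both digests so the file is read once
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

Because the return value is the same `(md5, sha1)` tuple shape, it could be used as a dictionary key in the same way as the result of `getHashes`.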
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Now, the usual political thing to do when charges are made against you is to either ignore them or to deny them without giving details. I believe we've had enough of that in the United States, particularly with the present Administration in Was D.C. To me the office of the Vice Presidency of the United States is a great office, and I feel that the people have got to have confidence in the integrity of the men who run for that office and who might obtain it. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Now, the usual political thing to do when charges are made against you is to either ignore them or to deny them without giving details. I believe we've had enough of that in the United States, particularly with the present Administration in Washington, D.C. To me the office of the Vice Presidency of the United States is a great office, and I feel that the people have got to have confidence in the integrity of the men who run for that office and who might obtain it. Merge |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
This is an email full of dirty words in an attempt to hack and slash and hopefully convince you there is a merger. Thank you. Burt Maclan FBI |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Now, the usual political thing to do when charges are made against you is to either ignore them or to deny them without giving details. I believe we've had enough of that in the United States, particularly with the present Administration in Was D.C. To me the office of the Vice Presidency of the United States is a great office, and I feel that the people have got to have confidence in the integrity of the men who run for that office and who might obtain it. |
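The sample files in this commit appear to exercise both checks in the script: some carry identical text, while another differs by only a few words. The sketch below illustrates the pairwise comparison on short in-memory strings. It is an illustration under stated assumptions, not the script's method: the standard library's `difflib.SequenceMatcher` is swapped in for `fuzz.ratio` so the example has no third-party dependency, its scores may differ from fuzzywuzzy's, and the names `similarity`, `flag_pairs`, and `samples` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 integer score (a stdlib stand-in for fuzz.ratio)."""
    # autojunk=False turns off difflib's heuristic that ignores very
    # common characters in long inputs
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return int(round(100 * matcher.ratio()))


def flag_pairs(texts, threshold=90):
    """Return (name_a, name_b, score) for every pair scoring >= threshold.

    texts maps a file name to its already-extracted text, so each text is
    extracted once rather than once per comparison.
    """
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    return flagged


# Hypothetical sample data, much shorter than the files in the commit
samples = {
    "a.txt": "the people have got to have confidence",
    "b.txt": "the people have got to have confidence",
    "c.txt": "the people have got to have confidence. Merge",
    "d.txt": "an unrelated note about a merger",
}
for pair in flag_pairs(samples):
    print(pair)
```

Passing a dictionary of extracted texts keeps extraction separate from comparison, which avoids re-running the document conversion for every pair.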