The Italian version of this README can be found here.
Use Spark to cluster documents given their content and their Wikipedia categories.
The project is split into different folders:
- `dataset` contains the input dataset
- `output` contains the processing output dataset, such as intermediate Spark computations
- `results` contains some `csv` reports used to make plots
- `latex` contains the `tex` file of our report
- `src` contains the code needed to perform computations and plots
To make handling classes and parameters simpler, we wrote a Python script,
namely `make.py`. Its documentation is available through `python3 make.py --help`.
Example usage: `python make.py --class Cluster (...args for Java main...)`

The relevant classes with the main Spark procedures can be found in
`src/main/java/it/unipd/dei/dm1617/examples/` and are described in the next section.
They are split into different groups, each one providing a different processing
step. The parameters of each class can be found in its `main` method.
- preprocessing
  - `CategoriesPreprocessing.java` counts articles per category
  - `TfidfCategories.java` ranks categories by their relevance
  - `TfIdf.java` builds the *bag-of-words* model (sketched below)
  - `Word2VecFit.java` trains the word2vec model on the text corpus
  - `Doc2Vec.java` loads the word2vec model and writes the vector corresponding to each document into the `output/` folder
- clustering
  - `Cluster.java` clusters the input data and outputs the trained separation model (sketched below)
- result evaluation
  - `HopkinsStatistic.java` computes the Hopkins statistic on the vectorized corpus
  - `EvaluationLDA.java` inspects the LDA fit output
  - `NMIRankedCategories.java` computes the NMI score considering only one category per document
  - `NMIOverlappingCategories.java` computes the NMI score considering multiple categories per document
  - `SimpleSilhouetteCoefficient.java` computes the simple silhouette score given a `(vector, clusterID)` dataset (sketched below)
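To give an idea of what the bag-of-words preprocessing step looks like, here is a minimal sketch using Spark ML's `Tokenizer`, `HashingTF` and `IDF`. It is illustrative only: the input path, column names and number of features are assumptions, not the values used by `TfIdf.java`.

```java
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TfIdfSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("TfIdfSketch").getOrCreate();

    // Hypothetical input: a dataset with a "text" column holding the article body.
    Dataset<Row> docs = spark.read().json("dataset/articles.json");

    // Split raw text into tokens.
    Tokenizer tokenizer = new Tokenizer()
        .setInputCol("text").setOutputCol("words");
    Dataset<Row> tokenized = tokenizer.transform(docs);

    // Hash tokens into sparse term-frequency vectors.
    HashingTF tf = new HashingTF()
        .setInputCol("words").setOutputCol("rawFeatures")
        .setNumFeatures(1 << 18);
    Dataset<Row> featurized = tf.transform(tokenized);

    // Rescale term frequencies by inverse document frequency.
    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(featurized);
    Dataset<Row> tfidf = idfModel.transform(featurized);

    tfidf.select("features").write().parquet("output/tfidf");
    spark.stop();
  }
}
```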
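The clustering step can be sketched in a similarly hedged way with Spark ML's `KMeans`: the model is fit on the document vectors and a cluster id is attached to each document. Paths, column names and the value of k are placeholders, not the ones used by `Cluster.java`.

```java
import java.io.IOException;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClusterSketch {
  public static void main(String[] args) throws IOException {
    SparkSession spark = SparkSession.builder()
        .appName("ClusterSketch").getOrCreate();

    // Document vectors produced by the vectorization step (e.g. tf-idf or doc2vec).
    Dataset<Row> vectors = spark.read().parquet("output/tfidf");

    // Fit a k-means model on the "features" column (k chosen arbitrarily here).
    KMeans kmeans = new KMeans().setK(20).setSeed(1L).setFeaturesCol("features");
    KMeansModel model = kmeans.fit(vectors);

    // Attach a cluster id ("prediction") to each document and persist both outputs.
    Dataset<Row> clustered = model.transform(vectors);
    clustered.write().parquet("output/clusters");
    model.save("output/kmeans-model");
    spark.stop();
  }
}
```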
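Finally, a sketch of the simple silhouette score: for each document, `a` is the distance to its own cluster centroid, `b` is the distance to the closest other centroid, and the score is the mean of `(b - a) / max(a, b)`. This version reads the centroids back from a saved k-means model; `SimpleSilhouetteCoefficient.java` may derive them differently from the `(vector, clusterID)` dataset.

```java
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SimpleSilhouetteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("SimpleSilhouetteSketch").getOrCreate();

    // Clustered documents: a "features" vector and a "prediction" cluster id.
    Dataset<Row> clustered = spark.read().parquet("output/clusters");
    KMeansModel model = KMeansModel.load("output/kmeans-model");
    Vector[] centers = model.clusterCenters();

    double score = clustered.toJavaRDD().map(row -> {
      Vector v = row.getAs("features");
      int own = row.<Integer>getAs("prediction");
      // a = distance to own centroid, b = distance to nearest other centroid.
      double a = Math.sqrt(Vectors.sqdist(v, centers[own]));
      double b = Double.MAX_VALUE;
      for (int c = 0; c < centers.length; c++) {
        if (c == own) continue;
        b = Math.min(b, Math.sqrt(Vectors.sqdist(v, centers[c])));
      }
      return (b - a) / Math.max(a, b);
    }).reduce(Double::sum) / clustered.count();

    System.out.println("Simple silhouette: " + score);
    spark.stop();
  }
}
```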
In `src/` and `results/` you can find Python scripts that process `output/`
files and build the relevant plots.
`src/hierarchicalClustering.py` tried to use the scipy clustering library, but this
approach was dropped because of its RAM requirements.