- Install VMware for your OS.
- Install the Cloudera VM.
- Java 1.8 must be installed on the system.
- Update Java to 1.8 on Cloudera (follow these steps: URL1, URL2).
- Install Gephi.
- Check out this project, or download the entire project directory and extract it.
- Open IntelliJ. Navigate to File -> Open and select the project directory.
- Open the Terminal tab in IntelliJ and run the following commands:
Compile and run the unit tests:
sbt clean compile test
Compile the application and package it as a jar:
sbt clean compile assembly
- Open the workspace in IntelliJ.
- Open the file DBLPParser.java.
- Go to Edit Configurations and make the following changes:
  a. Set the main class to com.uic.mapreduce.xml.DBLPParser
  b. In VM options, set -Xmx6G (this increases the memory available to the JVM for this task)
  c. Provide the paths to the DBLP dataset and to UIC_authors.txt (included in the project) as program arguments, e.g.: F:\UIC\441\mapreduce\dblp\dblp.xml .\src\main\resources\UIC_authors.txt
- Run the file.
It generates a file in the logs folder containing the comma-separated UIC authors of each article, inproceedings, proceedings, book, incollection, and phdthesis entry.
Sample Output:
Robert H. Sloan,Ugo A. Buy
Bhaskar DasGupta
Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
Ajay D. Kshemkalyani
Luc Renambot,Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
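
The filtering that yields lines like these can be pictured with a short sketch. It is not the project's DBLPParser: the class `UicAuthorFilter`, the method `toLogLine`, and the author "Jane Doe" are hypothetical, and XML parsing is left out entirely. It assumes that each publication's authors are already available as a list and that names must match the UIC list exactly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper (not part of the project): keeps only the authors of
// one publication that also appear in the UIC author list.
class UicAuthorFilter {

    // Returns the matching authors as one comma-separated line, in their
    // original order; returns an empty string when nobody matches.
    static String toLogLine(List<String> publicationAuthors, Set<String> uicAuthors) {
        return publicationAuthors.stream()
                .map(String::trim)
                .filter(uicAuthors::contains)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Set<String> uic = new HashSet<>(Arrays.asList("Robert H. Sloan", "Ugo A. Buy"));
        // "Jane Doe" is a made-up non-UIC co-author used for illustration.
        List<String> authors = Arrays.asList("Robert H. Sloan", "Jane Doe", "Ugo A. Buy");
        System.out.println(toLogLine(authors, uic)); // prints "Robert H. Sloan,Ugo A. Buy"
    }
}
```

A publication with no UIC author produces an empty string here; whether the real parser writes or skips such lines is not stated in this README.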
Run `sbt clean compile assembly` to create a jar named author-map-dblp.jar under \target\scala-2.12.
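
This jar contains the AuthorMapping job that is later run on Hadoop. Its source is not reproduced in this README, so the following is only a sketch of the kind of counting that could turn the author lines above into per-pair totals. It is plain Java rather than a Hadoop Mapper/Reducer, and the class `AuthorPairCounter` and the method `countPairs` are hypothetical. Judging from the Hadoop sample output in this README, it assumes that each author is also paired with themselves (giving a per-author publication count) and that the two names in a pair are ordered alphabetically.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-machine illustration of co-author pair counting.
class AuthorPairCounter {

    // Each input line holds the comma-separated UIC authors of one publication.
    // Returns "author1,author2" -> number of publications they share,
    // including self-pairs such as "X,X".
    static Map<String, Integer> countPairs(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key
        for (String line : logLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] authors = line.split(",");
            for (int i = 0; i < authors.length; i++) {
                authors[i] = authors[i].trim();
            }
            Arrays.sort(authors); // assumed ordering within a pair
            for (int i = 0; i < authors.length; i++) {
                for (int j = i; j < authors.length; j++) { // j == i is the self-pair
                    counts.merge(authors[i] + "," + authors[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Luc Renambot,Andrew E. Johnson",
                "Andrew E. Johnson",
                "Luc Renambot,Andrew E. Johnson");
        countPairs(lines).forEach((pair, n) -> System.out.println(pair + ", " + n));
    }
}
```

In a MapReduce job, the inner loops would correspond roughly to the map phase (emit each pair with a count of 1) and the `merge` call to the reduce phase (sum the counts per pair).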
- Start the Cloudera VM instance in VMware.
- Get the IP address of the VM from the VMware network settings once the VM is up and running.
- Using WinSCP or another tool or command (based on your OS), transfer the jar and the log file to a location on the VM.
- Use the default username and password provided by Cloudera.
- Navigate to the directory on the VM where these files are stored.
- Create an input directory on Hadoop:
hadoop fs -mkdir input_dir
- Transfer the log file to the input directory:
hadoop fs -put <logfilename> input_dir
- Run the jar from the directory that contains it:
hadoop jar author-map-dblp.jar AuthorMapping input_dir output_dir
- Once the job has completed, copy the output from Hadoop to the local VM directory:
hadoop fs -get output_dir/part-r-00000 ./
Sample Output:
A. Prasad Sistla,A. Prasad Sistla, 102
A. Prasad Sistla,Bing Liu 0001, 1
A. Prasad Sistla,Isabel F. Cruz, 2
A. Prasad Sistla,Lenore D. Zuck, 6
A. Prasad Sistla,Robert H. Sloan, 1
A. Prasad Sistla,V. N. Venkatakrishnan, 8
Ajay D. Kshemkalyani,Ajay D. Kshemkalyani, 112
Ajay D. Kshemkalyani,Ugo Buy, 1
- Move this file to a folder on the host system.
- Open Gephi and the workspace provided in the `logs` folder of this project.
- Import the CSV file (rename the file moved from the Cloudera VM so that it has a .csv extension).
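
Renaming the file is all these steps require. If an explicit header row is preferred for Gephi's spreadsheet import, the conversion could be sketched as below. This is an optional alternative, not part of the project: the class `GephiCsvConverter` and the method `toEdgeCsv` are hypothetical, and the code assumes every output line has the form `author1,author2, count` shown in the sample output, with no commas inside author names. Self-pairs are kept and would show up as self-loops.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical converter from the job's output lines to an edge-list CSV.
class GephiCsvConverter {

    // Adds a "Source,Target,Weight" header and normalizes spacing.
    // Lines that do not split into exactly three fields are skipped.
    static List<String> toEdgeCsv(List<String> hadoopLines) {
        List<String> csv = new ArrayList<>();
        csv.add("Source,Target,Weight");
        for (String line : hadoopLines) {
            String[] parts = line.split(",");
            if (parts.length != 3) {
                continue; // blank or unexpected line
            }
            csv.add(parts[0].trim() + "," + parts[1].trim() + "," + parts[2].trim());
        }
        return csv;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: GephiCsvConverter <part-r-00000> <output.csv>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), toEdgeCsv(lines));
    }
}
```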
Output Graph:
- Scala - Scala combines object-oriented and functional programming in one concise, high-level language
- SBT - sbt is a build tool for Scala & Java
- Cloudera - Cloudera QuickStart VMs (single-node cluster)
- Hadoop - framework that allows for the distributed processing of large data sets