Specify a website and a depth, and the program will crawl that website and every link it finds, up to the specified depth, then output the links it found.
Input: files in an input HDFS folder containing the list of links to crawl. Output: files in an output HDFS folder containing the links found.
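Conceptually, the crawl step can be pictured as a Hadoop map task that fetches each input URL and emits the links found on that page. The sketch below is only an illustration of that idea, not the actual Robot.java source; the class name, the regex-based link extraction, and the key/value types are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: one input line = one URL to fetch; every link found
// on that page is emitted as a key. Duplicates can be collapsed by a reducer.
public class CrawlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Naive href extractor; a real crawler would use a proper HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    context.write(new Text(m.group(1)), NullWritable.get());
                }
            }
        } catch (IOException e) {
            // Skip pages that cannot be fetched.
        }
    }
}
```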
- Install Hadoop (you can clone docker-hadoop):
git clone https://github.com/big-data-europe/docker-hadoop
- Build and run the Docker containers in the directory where you cloned docker-hadoop in step 1:
docker-compose up -d
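You can check that the containers (namenode, datanode, etc.) are up with:
docker ps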
- Build the jar file of Robot.java with the dependency "hadoop-core-1.2.1.jar"
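For example, assuming Robot.java sits in a code/ package directory and hadoop-core-1.2.1.jar is in the working directory (both paths are assumptions), the jar can be built with:
javac -classpath hadoop-core-1.2.1.jar -d classes code/Robot.java
jar cf ${NAME_OF_JAR_FILE} -C classes .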
- Copy the jar file and input files to the docker container
docker cp ${NAME_OF_JAR_FILE} namenode:/tmp
docker cp input/input-github.txt namenode:/tmp
docker cp input/input-stackoverflow.txt namenode:/tmp
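Each input file is expected to be a plain-text list of seed URLs, one per line. For example, input-github.txt might contain (hypothetical content):
https://github.com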
- Copy the input files to the hdfs
docker exec -it namenode hdfs dfs -mkdir /user/root/input
docker exec -it namenode hdfs dfs -put /tmp/input-github.txt /user/root/input
docker exec -it namenode hdfs dfs -put /tmp/input-stackoverflow.txt /user/root/input
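You can verify that the files landed in HDFS with:
docker exec -it namenode hdfs dfs -ls /user/root/input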
- Run the jar file
docker exec -it namenode hadoop jar /tmp/${NAME_OF_JAR_FILE} code.Robot /user/root/input /user/root/output -depth ${DEPTH}
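The -depth argument implies the crawl is iterative: each round's discovered links become the next round's input. The sketch below shows one way such a driver could chain MapReduce jobs; it reuses the hypothetical CrawlMapper from above and is an assumption about how Robot.java might be structured, not its actual source.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: run one crawl job per depth level, feeding each
// round's output back in as the next round's input.
// Expected args (matching the run command): <input> <output> -depth <n>
public class CrawlDriver {
    public static void main(String[] args) throws Exception {
        String input = args[0];                 // e.g. /user/root/input
        String output = args[1];                // e.g. /user/root/output
        int depth = Integer.parseInt(args[3]);  // value after the -depth flag

        Path current = new Path(input);
        for (int level = 1; level <= depth; level++) {
            Job job = new Job(new Configuration(), "crawl level " + level);
            job.setJarByClass(CrawlDriver.class);
            job.setMapperClass(CrawlMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, current);
            // Intermediate levels write to temporary folders; the last level
            // writes to the final output folder.
            Path next = (level == depth) ? new Path(output)
                                         : new Path(output + "-level" + level);
            FileOutputFormat.setOutputPath(job, next);
            job.waitForCompletion(true);
            current = next;
        }
    }
}
```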
- Copy the output files from the hdfs to the docker container
docker exec -it namenode hdfs dfs -get /user/root/output /tmp
- Copy the output files from the docker container to the host
docker cp namenode:/tmp/output output/
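Alternatively, you can inspect the results directly in HDFS without copying them out (assuming the default part-file naming):
docker exec -it namenode hdfs dfs -cat /user/root/output/part-*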
The included output file was generated with a depth of 2.