This repository is for assignment 2 of COMP90024. The project is divided into four parts: crawler, web server, application server and CouchDB database. The working directory is laid out as follows:
.
├── AURIN\Datasets // Holds all AURIN data and the programs that process it
├── Crawler // All code related to sentiment analysis and the crawler
├── ServerDockerFile // Scripts that build the image and upload it to Docker Hub
├── _config.yml
├── ansible // Old version of the Ansible code
├── ansible_final // Final version of the Ansible code
├── ccc_demo_0 // All files for both the web server and the application server
├── couchdb // Scripts to set up CouchDB and the CouchDB cluster
├── keys // Twitter tokens of group members
└── startserver.sh // Script used by Docker to create the servers
- Ansible handles the automation. Its working directory is laid out as follows:
├── host_vars
│ ├── nectar.yaml // Defined variables for Nectar
│ └── setSwarm.yml // Defined variables for swarm
├── hosts // Hosts that playbook accesses
├── roles // Includes all roles
│ ├── buildImage // Build docker service image
│ ├── createService // Create service for swarm leader
│ ├── copyFromGit // Copy files from github
│ ├── initFirstManager // Create swarm leader
│ ├── joinAsManager // Join swarm as a manager
│ ├── joinSwarm // Join swarm as a worker
│ ├── openstack-common // Install dependencies to run openstack
│ ├── openstack-images // Grab images from openstack API
│ ├── openstack-instance // Create instances for Nectar
│ ├── openstack-security-group // Create security groups
│ ├── openstack-volume // Create volumes
│ ├── setDockerEnv // Setup docker configuration
│ ├── setEnvironment // Setup environment for instances
│ └── updateService // Update service in the future
├── openrc.sh // Interact with openstack API
├── runIns.sh // Launch instances
├── installEnv.sh // Install software
├── runDocker.sh // Deploy swarm
├── configure_cluster.sh // Set configurations for couchdb
├── install_couchdb.sh // Install couchdb
├── nectar.yaml // Playbook that sets up the instances
├── setEnv.yaml // Playbook that sets up the environment
├── setSwarm.yml // Playbook that sets up the swarm
└── updateSer.yml // Playbook that updates the service
- Download the repository to your machine with the following command:
git clone https://github.com/kuldeepsuhag/COMP90024-ASSIGNMENT-2.git
- Enter the `./ansible_final/ansible` folder using `cd ansible_final/ansible/`.
- Paste your own `openrc.sh` into that folder.
- Then run `runIns.sh` to set up the instance:
sudo sh ./runIns.sh
To set up multiple instances, change the variables under the host_vars directory and run the same command again.
- To set up the software environment, first copy the IP addresses of all your instances into the `./ansible/hosts` file and into `configure_cluster.sh`. Then run the following command:
sudo sh ./installTest.sh
The script sets up the whole environment and builds a CouchDB cluster from the provided IP addresses. You can check the cluster by opening ```http://your_ip_address:5984/_membership``` in a browser.
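The same check can be scripted instead of done in a browser. The sketch below is only an illustration: it assumes the endpoint answers without authentication, and that nodes are named in the form `couchdb@<ip>`, which may not match your configuration. The function names and the `10.0.0.x` addresses are placeholders, not part of this repository.

```python
import json
from urllib.request import urlopen


def fetch_membership(ip, port=5984, timeout=5):
    """Fetch the /_membership document from one CouchDB node.

    Assumes the endpoint is reachable without credentials.
    """
    with urlopen(f"http://{ip}:{port}/_membership", timeout=timeout) as resp:
        return json.load(resp)


def missing_nodes(membership, expected_ips):
    """Return the expected IPs that do not appear in cluster_nodes.

    Assumes node names look like 'couchdb@<ip>'; only the part after
    the '@' is compared.
    """
    joined = {name.split("@", 1)[-1] for name in membership.get("cluster_nodes", [])}
    return [ip for ip in expected_ips if ip not in joined]


if __name__ == "__main__":
    # Placeholder addresses; replace them with the IPs from your hosts file.
    expected = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    doc = fetch_membership(expected[0])
    print("missing nodes:", missing_nodes(doc, expected))
```

An empty list means every expected address shows up in `cluster_nodes`.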
- Before starting the crawler, log into the master node and install the NLTK dataset so that no errors occur while crawling:
python3
>>> import nltk
>>> nltk.download('words')
Then run
python3 ./Crawler/harvestor/run_crawler.py your_twitter_token
to start the crawler. The crawler collects data using your Twitter token. The following tasks run at the same time:
- Sentiment analysis: normalizes each string and classifies it into one of three types: Positive, Negative or Neutral.
- Topic parsing: assigns tweets to different topics (e.g. wrath, sloth, arson).
- Time point partition: splits the 24 hours of a day into five named time slots, such as 12:00am to 03:59am (midnight) and 04:00am to 07:59am (early morning).
- Instagram data processing: collects data from Instagram and processes it in the same way as the Twitter data.
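To make the labelling and time partition steps concrete, here is a rough sketch. It is not the crawler's actual code. Only the first two time slots come from the description above; the other three slot names and boundaries are invented for illustration. The sentiment function assumes a polarity score in [-1, 1] produced by some analyser, and the threshold is likewise an assumption.

```python
from datetime import datetime

# (start_hour, end_hour_exclusive, name)
TIME_SLOTS = [
    (0, 4, "midnight"),        # 12:00am to 03:59am, from the description
    (4, 8, "early morning"),   # 04:00am to 07:59am, from the description
    (8, 12, "morning"),        # hypothetical slot
    (12, 18, "afternoon"),     # hypothetical slot
    (18, 24, "evening"),       # hypothetical slot
]


def time_slot(moment, slots=TIME_SLOTS):
    """Return the name of the slot that contains moment.hour."""
    for start, end, name in slots:
        if start <= moment.hour < end:
            return name
    raise ValueError(f"no slot covers hour {moment.hour}")


def sentiment_label(score, threshold=0.05):
    """Map a polarity score to one of the three labels.

    Both the score range and the threshold are assumptions.
    """
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


if __name__ == "__main__":
    print(time_slot(datetime(2019, 5, 1, 3, 59)))
    print(sentiment_label(-0.4))
```

Keeping the slots in a table means the boundaries can be changed without touching the lookup logic.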
- In setSwarm.yml under the host_vars directory, set each docker-swarm node's IP address as follows:
docker-swarm-1 ansible_ssh_host=172.26.38.
docker-swarm-2 ansible_ssh_host=172.26.38.
docker-swarm-3 ansible_ssh_host=172.26.37.
docker-swarm-4 ansible_ssh_host=172.26.38.
- Then, in the hosts file in the root Ansible directory, define one leader:
[leader]
docker-swarm-1
- An odd number of managers (including the leader):
[managers]
docker-swarm-2
docker-swarm-3
- And put the rest of the swarm nodes under the workers host group:
[workers]
docker-swarm-4
- Before running runDocker.sh, add the following configuration to the `~/.ssh/config` file to allow a git clone through the proxy:
Host github.com
Hostname ssh.github.com
Port 443
User git
ProxyCommand nc -X connect -x wwwproxy.unimelb.edu.au:8000 %h %p
You must also generate an SSH key on the instance and add it to the GitHub repository.
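Before running the playbook, it can help to sanity-check the host groups described earlier: exactly one leader, and an odd number of managers once the leader is counted. The following is a minimal sketch, not part of this repository. It assumes a simple INI-style hosts file with `[leader]`, `[managers]` and `[workers]` sections, and the function names are hypothetical.

```python
def parse_groups(inventory_text):
    """Parse '[group]' headers and the host names listed under them."""
    groups = {}
    current = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups.setdefault(current, [])
        elif current is not None:
            # Keep only the host name, dropping any per-host variables.
            groups[current].append(line.split()[0])
    return groups


def check_swarm_groups(groups):
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    leaders = groups.get("leader", [])
    managers = groups.get("managers", [])
    if len(leaders) != 1:
        problems.append(f"expected exactly one leader, found {len(leaders)}")
    total = len(leaders) + len(managers)
    if total % 2 == 0:
        problems.append(f"manager count including the leader is even ({total})")
    return problems


if __name__ == "__main__":
    example = """
[leader]
docker-swarm-1
[managers]
docker-swarm-2
docker-swarm-3
[workers]
docker-swarm-4
"""
    print(check_swarm_groups(parse_groups(example)))
```

The odd count matters because swarm managers need a majority to agree, so a fourth manager adds no extra fault tolerance over three.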
Then, when you execute `runDocker.sh`, the Ansible playbook `setSwarm.yml` automatically clones or pulls from git, builds the image, pushes it to Docker Hub and starts the service with the image that was just pushed.
After completing all the steps above, you can check the web page by entering any one of the IP addresses defined as `ansible_ssh_host` in a browser.
If a new version of the server is uploaded to git, simply run `update.sh`, which performs a git pull on the swarm leader.
The Ansible script then builds a new image from the updated code and performs a rolling update on the web services currently running on the instances.
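For reference, a rolling update in Docker Swarm is normally triggered with `docker service update`. The sketch below only builds such a command from Python and is not what the playbook in this repository necessarily runs. The service name `web`, the image tag, and the parallelism and delay values are all hypothetical.

```python
import subprocess


def rolling_update_command(service, image, parallelism=1, delay="10s"):
    """Build a 'docker service update' command that replaces tasks in batches."""
    return [
        "docker", "service", "update",
        "--image", image,
        "--update-parallelism", str(parallelism),  # tasks updated at once
        "--update-delay", delay,                   # pause between batches
        service,
    ]


def run_rolling_update(service, image):
    """Run the update on a swarm manager; raises if docker exits non-zero."""
    subprocess.run(rolling_update_command(service, image), check=True)


if __name__ == "__main__":
    # Hypothetical names; print the command instead of running it.
    print(" ".join(rolling_update_command("web", "example/webserver:v2")))
```

Updating one task at a time with a delay keeps the remaining replicas serving requests while each new container starts.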