Commit b5177b6

Author: Thanos Giannakopoulos

Initial commit

1. updated README
2. implementation for mobility patterns
3. implementation for event detection
4. generated data and figures/maps

1 parent 9ca5b5f · commit b5177b6

22 files changed: +165755 −14 lines

.gitignore

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Remove useless files
+code/local/__pycache__/*
+code/local/.ipynb_checkpoints/*
+code/.ipynb_checkpoints/*
+code/local/googleAPI
+code/local/key.txt
+code/local/parser.py
+
+# Remove unnecessary code files (to be added later)
+code/local/libraries.py
+code/local/utils_event_detection.py
+code/local/utils_mobility.py
+
+# Remove unnecessary data files (to be added later)
+data/tweets_*

README.md

Lines changed: 37 additions & 14 deletions
@@ -1,34 +1,57 @@
 # Mobility patterns/events in Switzerland with Twitter
 
-## Abstract
+This README file includes information about the project assignment, as well as the project implementation.
 
-We are given a dataset containing tweets in Switzerland starting from 2012. The first goal of the project is to analyse the data and reconstruct mobility flows of the users. More concretely, we will try to get insights into high-frequency migration patterns in the swiss territory. This could be achieved by focusing on "frontaliers" who commute daily from France and Germany to Geneva and Zurich respectively. In addition, we will try to detect changes in migration patterns when special events take place (e.g., a new Alpine tunnel gets opened). The second task of the project is to detect events. Here, we will focus on dates and locations of such events, as well as positive or negative sentiment.
+## Project Assignment
 
-## Data Description
+### Abstract
 
-The dataset consists of 80GB of tweets in Switzerland, which are collected using the Twitter API starting from 2012. The tweets have been downloaded in json format and processed such that only a subset of the attributes is kept. Thus, we have information about the user id, the date and time of the tweet and the tweet (We would know if we have more attributes for our data once we get access to the dataset). Moreover, all tweets have geo-location data, which is crucial for our analysis. Finally, the dataset is stored in the HDFS of the IC cluster.
+We are given a dataset containing tweets in Switzerland starting from 2010. The first goal of the project is to analyse the data and reconstruct mobility flows of the users. More concretely, we will try to get insights into high-frequency migration patterns in the Swiss territory. This could be achieved by focusing on "frontaliers" who commute daily from France and Germany to Geneva and Zurich respectively. In addition, we will try to detect changes in migration patterns when special events take place (e.g., a new Alpine tunnel gets opened). The second task of the project is to detect events. Here, we will focus on the dates and locations of such events, as well as positive or negative sentiment.
 
-## Feasibility and Risks
+### Data Description
+
+The dataset consists of 5GB of tweets in Switzerland, collected using the Twitter API starting from 2010. The tweets have been downloaded in JSON format and processed such that only a subset of the attributes is kept. Thus, we have information about the user id, the date and time of the tweet, the tweet text, etc. Moreover, all tweets have geo-location data, which is crucial for our analysis. Finally, the dataset is stored in the HDFS of the IC cluster.
+
+### Feasibility and Risks
 
 The feasibility of the project depends strongly on the dataset. The reconstruction of mobility flows of users requires their location at different times during each day. In our case, tweets are the only source of geo-located information. In order for our project to be feasible, users should tweet uniformly during the day so that their mobility patterns are revealed. For example, people who work in Zurich but live in Germany should tweet both during the day and during the evening, so that we have information about their working and living places respectively.
 
 With respect to the second task, we need to set certain thresholds and make certain assumptions in order to perform a successful analysis. More concretely, we should set a reasonable threshold for the number of tweets that indicates that an event took place (e.g. more than 100 tweets within a distance of 100 m may indicate that an event took place). Moreover, the assumptions that many events take place late in the evening and especially during the weekend may help in properly filtering our data and performing a successful analysis.
 
 The feasibility and risks of our project could be further assessed once the dataset is provided.
 
-## Deliverables
+### Deliverables
 
-We will use Spark for exploratory data analysis. We will perform data filtering and aggregations according to the needs of each task. As soon as we have aggregated results, we will use Python for data analysis and visualization. The deliverables of our project are going to be:
+We will use Spark in order to handle the given dataset properly. We will perform data filtering and aggregations according to the needs of each task. As soon as we have aggregated results, we will use Python for data analysis and visualization. The deliverables of our project are going to be:
 
-- A well documented and commented Spark application for data filtering and aggregation written in Scala.
+- A well documented and commented Spark application for data filtering and aggregation written in Python.
 - A Python notebook with data analysis and visualization of our results.
 - Our assumptions and data analysis pipeline will be well documented in the Python notebook.
 
-
-## Timeplan
+### Timeplan
 A rough timeplan of our project is as follows:
 
-- Phase 1: Exploratory data analysis, data cleaning and data wraggling until mid of November (depending also on when we will get access to the cluster).
-- Phase 2: Detection of mobility patterns. We aim at presenting some of our findings in mid December.
-- Phase 3: Further improvement on mobility pattern detection. In case the results are satisfying, we will start working on the event detection. This phase will be terminated in the mid of January
-- Phase 4: Starting from mid January, we will be working on the final report in order to present numerically and visually our findings. The project symposium will take place at the end of January.
+- Phase 1: Exploratory data analysis, data cleaning and data wrangling.
+- Phase 2: Detection of mobility patterns.
+- Phase 3: Event detection.
+- Phase 4: Partnering with a team that focuses on the visualization part (to be done).
+- Phase 5: Sentiment analysis of the tweets (to be done).
+
+## Project Implementation
+
+The provided dataset contains tweets from 2010 to 2016. We choose to perform a yearly analysis for all our data analysis tasks. We focus on the machine learning side of the project and provide a simple but insightful visualization of our results. We also try to partner with a team that focuses on the visualization part of the project and is responsible for producing more complex plots.
+
+### Mobility Patterns
+
+The [mobility_patterns](code/local/mobility_patterns.ipynb) notebook contains our work on detecting mobility patterns in Switzerland. It contains both the code and the assumptions we made in order to produce our results. These assumptions mainly concern:
+* which users to keep in order to produce our results
+* how we determined whether a user is at work
+* how we determined each user's locations of residence and work
+For more information and a detailed analysis of how we approach the problem of mobility patterns, please refer to the [mobility_patterns](code/local/mobility_patterns.ipynb) notebook.
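The home/work inference implied by the assumptions above can be illustrated with a small sketch. This is not the notebook's implementation: the working-hours window and the `(user_id, timestamp, location)` input format are assumptions made purely for the example. The idea is that a user's most frequent tweet location during working hours approximates their workplace, and their most frequent location outside those hours approximates their residence.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumed working window; the notebook may use a different one.
WORK_HOURS = range(9, 18)  # 09:00-17:59

def infer_home_work(tweets):
    """tweets: iterable of (user_id, datetime, location) tuples.
    Returns {user_id: {"work": location or None, "home": location or None}}."""
    work_locs = defaultdict(Counter)
    home_locs = defaultdict(Counter)
    for user, ts, loc in tweets:
        if ts.hour in WORK_HOURS:
            work_locs[user][loc] += 1   # tweet during working hours
        else:
            home_locs[user][loc] += 1   # tweet in the evening/night
    result = {}
    for user in set(work_locs) | set(home_locs):
        work = work_locs[user].most_common(1)
        home = home_locs[user].most_common(1)
        result[user] = {
            "work": work[0][0] if work else None,
            "home": home[0][0] if home else None,
        }
    return result
```

For example, a user tweeting from Zurich at 10:00 and 12:00 and from Konstanz at 21:00 would be assigned Zurich as a workplace and Konstanz as a residence.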
+
+### Event Detection
+The [event_detection](code/local/event_detection.ipynb) notebook contains our work on detecting events in Switzerland based on geolocated information. It contains both the code and the assumptions we made in order to produce our results. These assumptions mainly concern:
+* the minimum number of tweets required to declare the presence of an event
+* the distance between tweets referring to the same event (tweets for the same event should be posted from the same area, i.e. within a small distance)
+* the metrics we introduce in order to better detect events (e.g. an event should have tweets from numerous users)
+For more information and a detailed analysis of how we approach the problem of event detection, please refer to the [event_detection](code/local/event_detection.ipynb) notebook.
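The thresholding idea behind these assumptions (a minimum tweet count within a small area, plus a minimum number of distinct users) can be sketched as follows. The grid-cell binning, the cell size, and the exact thresholds are illustrative assumptions, not the notebook's implementation.

```python
from collections import Counter
import math

# Illustrative parameters, not the notebook's actual values.
CELL_DEG = 0.001   # ~100 m in latitude per grid cell
MIN_TWEETS = 100   # the example threshold mentioned in the README
MIN_USERS = 20     # assumed: an event should have tweets from many users

def detect_events(tweets):
    """tweets: iterable of (user_id, lat, lon) tuples.
    Returns the grid cells flagged as candidate events."""
    counts = Counter()
    users = {}
    for user, lat, lon in tweets:
        # Bin each tweet into a coarse grid cell.
        cell = (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))
        counts[cell] += 1
        users.setdefault(cell, set()).add(user)
    # A cell is a candidate event only if it passes both thresholds.
    return [cell for cell, n in counts.items()
            if n >= MIN_TWEETS and len(users[cell]) >= MIN_USERS]
```

A real pipeline would likely use proper haversine distances or density-based clustering rather than a fixed grid, but the grid makes the two thresholds (tweet volume and user diversity) explicit.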
