webtrackR is an R package to preprocess and analyse web tracking data in conjunction with survey data of panelists. The package is built on top of data.table and can thus comfortably handle very large web tracking datasets.
You can install the development version of webtrackR from GitHub with:
# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")
The package adds an S3 class called wt_dt, which inherits most of its functionality from the data.table class. A summary and a print method are included in the package.
Raw web tracking data is assumed to have (at least) the following variables:
- panelist_id: the person whose data is tracked
- url: the website the person is visiting
- timestamp: when the website was visited
All preprocessing functions check if these are present. Otherwise an error is thrown.
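The check described above might look roughly like the following sketch. The helper name check_required_vars and the toy data are hypothetical and not part of webtrackR; the package's own check may differ in detail.

```r
library(data.table)

# Hypothetical helper (not part of webtrackR): stop if a required column is missing
check_required_vars <- function(x, required = c("panelist_id", "url", "timestamp")) {
  missing_vars <- setdiff(required, names(x))
  if (length(missing_vars) > 0) {
    stop("missing required variables: ", paste(missing_vars, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data with all three variables present (values made up for illustration)
ok <- data.table(
  panelist_id = c("a", "a"),
  url = c("https://example.com/", "https://example.org/news"),
  timestamp = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:05:00"), tz = "UTC")
)
check_required_vars(ok)

# Dropping the timestamp column triggers the error; capture its message
res <- tryCatch(
  check_required_vars(ok[, .(panelist_id, url)]),
  error = function(e) conditionMessage(e)
)
```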
Several other variables can be derived from these with the package:
- duration: how much time was spent on a website (use add_duration() and aggregate_duration() to summarize consecutive visits to the same website)
- domain: the top-level domain of a URL (use extract_domain())
- type and prev_type: using a domain dictionary to classify domains and previously visited domains (use classify_domains())
- url dummy variables: add a dummy variable indicating whether a URL falls into a category or not (e.g. political website) (use create_urldummy())
- panelist data: add e.g. survey data to the webtrack data (use add_panelist_data())
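To make the first two derived variables concrete, here is a minimal sketch in plain data.table. It is not the package's implementation: it assumes that a visit lasts until the same panelist's next timestamp, and it uses a naive regular expression to pull the host out of a URL. The toy data and the name wt_toy are made up for illustration.

```r
library(data.table)

# Toy tracking data (values made up for illustration)
wt_toy <- data.table(
  panelist_id = c("a", "a", "a", "b", "b"),
  url = c("https://example.com/", "https://example.com/page",
          "https://example.org/", "https://example.com/", "https://example.net/"),
  timestamp = as.POSIXct(
    c("2020-01-01 10:00:00", "2020-01-01 10:00:30", "2020-01-01 10:02:00",
      "2020-01-01 09:00:00", "2020-01-01 09:01:00"),
    tz = "UTC"
  )
)

# Order visits within each panelist
setorder(wt_toy, panelist_id, timestamp)

# Assumed definition: duration = seconds until the panelist's next visit
# (the last visit of each panelist has no successor and stays NA)
wt_toy[, duration := shift(as.numeric(timestamp), type = "lead") - as.numeric(timestamp),
       by = panelist_id]

# Naive host extraction: drop the scheme and everything after the first "/"
wt_toy[, domain := sub("^[a-z]+://([^/]+).*$", "\\1", url)]
```

How the last visit of each panelist is treated is a design decision; this sketch simply leaves it as NA, and the package may handle it differently.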
A typical workflow looks like this:
# load webtrack data as data.table
library(data.table)
library(webtrackR)
# webtrack data
wt <- fread("<path/to/file>")
# domain dictionary (there is also an inbuilt dictionary)
domain_dict <- fread("<path/to/file>")
# dummy file (should just be a vector of urls)
political_urls <- c("...")
# survey data
survey <- fread("<path/to/file>")
# convert to wt_dt object
wt <- as.wt_dt(wt)
wt <- add_duration(wt)
wt <- extract_domain(wt)
# classify domains and only return rows with type news
wt <- classify_domains(wt, domain_classes = domain_dict, return.only = "news")
# create a dummy variable for political news
wt <- create_urldummy(wt, dummy = political_urls, name = "political")
# add survey data
wt <- add_panelist_data(wt, data = survey)
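Conceptually, the last step attaches each panelist's survey answers to every one of their visits. The following sketch shows the same idea with a plain data.table left join. It assumes that the survey table also contains a panelist_id column; the variable leaning and all values are hypothetical, and the package function may work differently internally.

```r
library(data.table)

# Toy tracking data (values made up for illustration)
wt_toy <- data.table(
  panelist_id = c("a", "a", "b"),
  url = c("https://example.com/", "https://example.org/", "https://example.com/")
)

# Toy survey data; "leaning" is a hypothetical survey variable
survey_toy <- data.table(
  panelist_id = c("a", "b"),
  leaning = c("left", "right")
)

# Left join: keep every visit and attach the panelist's survey answers
wt_joined <- merge(wt_toy, survey_toy, by = "panelist_id", all.x = TRUE)
```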
The top 500 Bakshy scores are available in the package:
data("bakshy")
Create an audience network:
audience_network(wt, cutoff = 3, type = "pmi")
cutoff indicates the minimal duration for a visit to be counted. type can be one of “pmi”, “phi”, “disparity”, “sdsm”, or “fdsm”.
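As an illustration of what a cutoff and a “pmi” weight can mean, the sketch below computes textbook pointwise mutual information between domains from a panelist-by-domain incidence matrix, after dropping visits shorter than the cutoff. This is an assumption about the general idea, not a description of what audience_network() computes internally; the toy data and all object names are made up.

```r
library(data.table)

# Toy visits with durations in seconds (values made up for illustration)
visits <- data.table(
  panelist_id = c("a", "a", "b", "b", "c", "d"),
  domain = c("x.com", "y.com", "x.com", "y.com", "x.com", "z.com"),
  duration = c(10, 5, 8, 1, 20, 7)
)

# Only visits of at least `cutoff` seconds count as a visit
cutoff <- 3
kept <- unique(visits[duration >= cutoff, .(panelist_id, domain)])

# Panelist-by-domain incidence matrix (1 = panelist visited the domain)
pan <- sort(unique(kept$panelist_id))
dom <- sort(unique(kept$domain))
inc <- matrix(0, nrow = length(pan), ncol = length(dom), dimnames = list(pan, dom))
inc[cbind(kept$panelist_id, kept$domain)] <- 1

# Co-visit counts and textbook PMI: log(p(i, j) / (p(i) * p(j)))
n <- nrow(inc)
co <- crossprod(inc)
p <- diag(co) / n
pmi <- log((co / n) / outer(p, p))
```

Pairs of domains without any shared panelist get a PMI of -Inf in this sketch; a network built from such a matrix would typically keep only pairs with positive values.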