webtrackR

webtrackR is an R package to preprocess and analyse web tracking data in conjunction with survey data of panelists. The package is built on top of data.table and can thus comfortably handle very large web tracking datasets.

Installation

You can install the development version of webtrackR from GitHub with:

# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")

S3 class

The package adds an S3 class called wt_dt, which inherits most of its functionality from the data.table class. A summary method and a print method are included in the package.
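To illustrate what inheriting from data.table means in practice, here is a minimal sketch of an S3 class layered on top of data.table. The names tracking_dt and as_tracking_dt are hypothetical and chosen for this example only; this is not the package's wt_dt implementation.

```r
library(data.table)

# Hypothetical constructor: prepend a class name so that data.table
# methods still apply to the object.
as_tracking_dt <- function(x) {
  x <- as.data.table(x)
  setattr(x, "class", c("tracking_dt", class(x)))
  x
}

# Hypothetical print method: add a header line, then defer to data.table.
print.tracking_dt <- function(x, ...) {
  cat("tracking data with", nrow(x), "visits\n")
  NextMethod()
}

visits <- as_tracking_dt(data.frame(
  panelist_id = c("a", "a", "b"),
  url = c("https://example.com/1", "https://example.org/", "https://example.com/2")
))

# data.table syntax keeps working because the object is still a data.table
subset_a <- visits[panelist_id == "a"]
```

Because the data.table class stays in the class vector, subsetting, grouping and by-reference updates should behave as they do for an ordinary data.table.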

Preprocessing

Raw web tracking data is assumed to have (at least) the following variables:

  • panelist_id: the person whose data is tracked
  • url: the website the person is visiting
  • timestamp: when the website was visited

All preprocessing functions check whether these variables are present and throw an error if any of them is missing.
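A check of this kind could look roughly like the following sketch. The helper check_required_vars is hypothetical and only illustrates the idea; the package's internal check may differ.

```r
# The three variables named above
required_vars <- c("panelist_id", "url", "timestamp")

# Hypothetical helper: stop with an informative message if a column is missing
check_required_vars <- function(x, vars = required_vars) {
  missing_vars <- setdiff(vars, names(x))
  if (length(missing_vars) > 0) {
    stop("missing required variable(s): ", paste(missing_vars, collapse = ", "))
  }
  invisible(TRUE)
}

complete <- data.frame(panelist_id = "a", url = "https://example.com", timestamp = 1)
incomplete <- data.frame(url = "https://example.com")

ok <- check_required_vars(complete)
failed <- try(check_required_vars(incomplete), silent = TRUE)
```

Running the same kind of check on your own data before preprocessing can catch naming mismatches (for example a differently named ID column) early.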

Several other variables can be derived from these with the package:

  • duration: how much time was spent on a website (use add_duration(), and aggregate_duration() to summarize consecutive visits to the same website)
  • domain: the top-level domain of a URL (use extract_domain())
  • type and prev_type: the class of the current and of the previously visited domain, based on a domain dictionary (use classify_domains())
  • URL dummy variables: a dummy variable indicating whether a URL falls into a category, e.g. political websites (use create_urldummy())
  • panelist data: additional data on panelists, e.g. survey data, joined to the web tracking data (use add_panelist_data())
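As an illustration of how a duration variable can be derived from timestamps alone, the sketch below computes the time until the next visit of the same panelist using plain data.table. This is an assumption about the general idea, not the implementation of add_duration(), which may treat the last visit and long gaps differently.

```r
library(data.table)

visits <- data.table(
  panelist_id = c("a", "a", "a", "b", "b"),
  url = c("https://example.com/1", "https://example.org/",
          "https://example.com/2", "https://example.net/", "https://example.org/x"),
  timestamp = as.POSIXct(
    c("2020-01-01 10:00:00", "2020-01-01 10:00:30", "2020-01-01 10:02:00",
      "2020-01-01 09:00:00", "2020-01-01 09:05:00"),
    tz = "UTC"
  )
)

# Order visits chronologically within each panelist
setorder(visits, panelist_id, timestamp)

# Duration in seconds = timestamp of the next visit minus the current one;
# the last visit of each panelist has no successor and stays NA
visits[, duration := as.numeric(shift(timestamp, type = "lead")) - as.numeric(timestamp),
       by = panelist_id]
```

In this sketch the last visit of every panelist keeps a missing duration, so a real workflow needs a rule for those rows.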

A typical workflow looks like this:

# load webtrack data as data.table
library(data.table)
library(webtrackR)

# webtrack data
wt <- fread("<path/to/file>")

# domain dictionary (there is also an inbuilt dictionary)
domain_dict <- fread("<path/to/file>")

# dummy file (should just be a vector of urls)
political_urls <- c("...")

# survey data
survey <- fread("<path/to/file>")

# convert to wt_dt object
wt <- as.wt_dt(wt)

wt <- add_duration(wt)
wt <- extract_domain(wt)

# classify domains and only return rows with type news
wt <- classify_domains(wt, domain_classes = domain_dict, return.only = "news")

# create a dummy variable for political news
wt <- create_urldummy(wt, dummy = political_urls, name = "political")

# add survey data
wt <- add_panelist_data(wt, data = survey)

Analysis

Ideology

The top 500 Bakshy scores are available in the package:

data("bakshy")

Audience Networks

Create an audience network:

audience_network(wt, cutoff = 3, type = "pmi")
  • cutoff indicates the minimum duration for a visit to be counted.
  • type can be one of "pmi", "phi", "disparity", "sdsm", or "fdsm".
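To give an intuition for the "pmi" option, the sketch below computes pointwise mutual information between domains from a toy visit table in base R, using the textbook definition log(p(i, j) / (p(i) p(j))), where the probabilities are shares of panelists. The data and variable names are made up for this example, and audience_network() may compute and threshold the measure differently.

```r
# Toy data: which panelist visited which domain (hypothetical)
visits <- data.frame(
  panelist_id = c("a", "a", "b", "b", "c", "d"),
  domain = c("x.com", "y.com", "x.com", "y.com", "x.com", "z.com")
)

# Panelist-by-domain incidence matrix (1 = visited at least once)
incidence <- (unclass(table(visits$panelist_id, visits$domain)) > 0) * 1
n_panelists <- nrow(incidence)

# Share of panelists visiting each domain, and each pair of domains
p_single <- colMeans(incidence)
p_joint <- crossprod(incidence) / n_panelists

# PMI: positive values mean two domains share more audience than expected
# under independence; pairs with no shared audience get -Inf
pmi <- log(p_joint / outer(p_single, p_single))
diag(pmi) <- NA
```

Here x.com and y.com share two of four panelists, which gives a PMI of log(4/3), while x.com and z.com share none.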
