-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
128 lines (96 loc) · 4.94 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# adaR <img src="man/figures/logo.png" align="right" height="139" alt="" />
<!-- badges: start -->
[![R-CMD-check](https://github.com/gesistsa/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gesistsa/adaR/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/adaR)](https://CRAN.R-project.org/package=adaR)
[![CRAN Downloads](https://cranlogs.r-pkg.org/badges/adaR)](https://CRAN.R-project.org/package=adaR)
[![Codecov test coverage](https://codecov.io/gh/gesistsa/adaR/branch/main/graph/badge.svg)](https://app.codecov.io/gh/gesistsa/adaR?branch=main)
[![ada-url Version](https://img.shields.io/badge/ada_url-2.9.0-blue)](https://github.com/ada-url/ada)
<!-- badges: end -->
adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a
[WHATWG](https://url.spec.whatwg.org/#url-parsing)-compliant and fast URL parser written in modern C++ .
It implements several auxilliary functions to work with urls:
- public suffix extraction (top level domain excluding private domains) like [psl](https://github.com/hrbrmstr/psl)
- fast c++ implementation of `utils::URLdecode` (~40x speedup)
More general information on URL parsing can be found in the introductory vignette via `vignette("adaR")`.
`adaR` is part of a series of R packages to analyse webtracking data:
- [webtrackR](https://github.com/schochastics/webtrackR): preprocess raw webtracking data
- [domainator](https://github.com/schochastics/domainator): classify domains
- [adaR](https://github.com/gesistsa/adaR): parse urls
## Installation
You can install the development version of adaR from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("gesistsa/adaR")
```
The version on CRAN can be installed with
```r
install.packages("adaR")
```
## Example
This is a basic example which shows all the returned components of a URL.
```{r example}
library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
```
```c++
/*
* https://user:pass@example.com:1234/foo/bar?baz#quux
* | | | | ^^^^| | |
* | | | | | | | `----- hash_start
* | | | | | | `--------- search_start
* | | | | | `----------------- pathname_start
* | | | | `--------------------- port
* | | | `----------------------- host_end
* | | `---------------------------------- host_start
* | `--------------------------------------- username_end
* `--------------------------------------------- protocol_end
*/
```
It solves some problems of urltools with more complex urls.
```{r better}
urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m
5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
```
A "raw" url parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)) but for this to carry over to R is tricky.
The performance is still compatible with `urltools::url_parse` with the noted advantage in accuracy in some
practical circumstances.
```{r faster}
bench::mark(
ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
iterations = 1, check = FALSE
)
```
For further benchmark results, see `benchmark.md` in `data_raw`.
There are four more groups of functions available to work with url parsing:
- `ada_get_*()` get a specific component
- `ada_has_*()` check if a specific component is present
- `ada_set_*()` set a specific component from URLS
- `ada_clear_*()` remove a specific component from URLS
## Public Suffix extraction
`public_suffix()` extracts their top level domain from the [public suffix list](https://publicsuffix.org/), **excluding** private domains.
```{r public_suffix}
urls <- c(
"https://subsub.sub.domain.co.uk",
"https://domain.api.gov.uk",
"https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
```
If you are wondering about the last url. The list also contains wildcard suffixes such as `*.kawasaki.jp` which need to be matched.
## Acknowledgement
The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), a very early pioneer in Computer Science.