The goal of {cranology} is to provide tools to scrape data from the CRAN and PPM websites, as well as useful datasets to explore the evolution of the number of packages on CRAN.
cranology::plot_cran_monthly_package_number()
You can install the development version of {cranology} with:
remotes::install_github("ThinkR-open/cranology")
library(cranology)
All packages ever available on CRAN.
cran_packages_history
#> # A tibble: 44,613 × 10
#> file_name date time size package_name last_archived
#> <chr> <dttm> <chr> <chr> <chr> <dttm>
#> 1 A3/ 2015-08-16 21:05:00 21:05 - A3 2015-08-16 21:05:00
#> 2 aaMI/ 2010-07-30 12:17:00 12:17 - aaMI 2010-07-30 12:17:00
#> 3 aaSEA/ 2022-06-21 05:12:00 05:12 - aaSEA 2022-06-21 05:12:00
#> 4 AATtools/ 2024-08-16 09:10:00 09:10 - AATtools 2024-08-16 09:10:00
#> 5 aba/ 2022-03-27 06:29:00 06:29 - aba 2022-03-27 06:29:00
#> 6 abbyyR/ 2023-11-03 04:42:00 04:42 - abbyyR 2023-11-03 04:42:00
#> 7 abc.data/ 2024-03-24 10:15:00 10:15 - abc.data 2024-03-24 10:15:00
#> 8 abc/ 2022-05-19 07:20:00 07:20 - abc 2022-05-19 07:20:00
#> 9 abcADM/ 2023-03-02 11:13:00 11:13 - abcADM 2023-03-02 11:13:00
#> 10 ABCanalysis/ 2017-03-13 13:31:00 13:31 - ABCanalysis 2017-03-13 13:31:00
#> # ℹ 44,603 more rows
#> # ℹ 4 more variables: archive <lgl>, first_date <dttm>, n_versions <int>,
#> # last_modified <dttm>
The evolution of the number of packages on CRAN since its beginning.
cran_monthly_package_number
#> # A tibble: 324 × 2
#> date number_packages
#> <date> <dbl>
#> 1 1997-10-08 1
#> 2 1997-11-08 1
#> 3 1997-12-08 1
#> 4 1998-01-08 2
#> 5 1998-02-08 2
#> 6 1998-03-08 3
#> 7 1998-04-08 5
#> 8 1998-05-08 6
#> 9 1998-06-08 6
#> 10 1998-07-08 7
#> # ℹ 314 more rows
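As a small illustration of the kind of exploration this dataset allows, the sketch below computes how many packages were added between consecutive monthly snapshots. It uses a toy data frame (`monthly`, a name chosen here for illustration) that copies the first rows printed above, so it runs without {cranology} installed; the `added` column is not part of the package's dataset.

```r
# Toy copy of the first rows of `cran_monthly_package_number` shown above.
monthly <- data.frame(
  date = as.Date(c(
    "1997-10-08", "1997-11-08", "1997-12-08",
    "1998-01-08", "1998-02-08", "1998-03-08"
  )),
  number_packages = c(1, 1, 1, 2, 2, 3)
)

# Packages added since the previous snapshot (NA for the first row,
# which has no earlier snapshot to compare against).
monthly$added <- c(NA, diff(monthly$number_packages))

monthly
```

The same two lines should apply to the full dataset by replacing `monthly` with `cran_monthly_package_number`, assuming its rows are sorted by date as in the printed output.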
Both the cran_packages_history and cran_monthly_package_number datasets are generated by the function scrape_cran(). The scraping process is quite time-consuming and relies on the {furrr} package to scrape the CRAN pages in parallel.
future::plan(future::multisession)
scrape_cran_history()
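For readers unfamiliar with {furrr}, the following sketch shows the general pattern of mapping a function over many pages under a `multisession` plan. It is not the package's actual scraping code: `fetch_page()` and the `example.org` URLs are hypothetical placeholders, and the block falls back to a sequential `vapply()` when {furrr} or {future} is not installed.

```r
# Hypothetical stand-in for a function that scrapes a single page.
fetch_page <- function(url) {
  Sys.sleep(0.1) # simulate network latency
  nchar(url)     # placeholder "result" for the page
}

# Placeholder URLs, not real CRAN pages.
urls <- paste0("https://example.org/page/", 1:8)

if (requireNamespace("furrr", quietly = TRUE) &&
    requireNamespace("future", quietly = TRUE)) {
  # Run the calls in two background R sessions.
  future::plan(future::multisession, workers = 2)
  results <- furrr::future_map_int(urls, fetch_page)
  # Restore the default sequential plan afterwards.
  future::plan(future::sequential)
} else {
  # Sequential fallback when the parallel backend is unavailable.
  results <- vapply(urls, fetch_page, integer(1), USE.NAMES = FALSE)
}

results
```

Setting the plan once before the call, as in the snippet above, lets the user choose how much parallelism to use without changing the scraping function itself.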
{cranology} also includes the get_package_number_ppm() function to more quickly get the number of packages that were available on CRAN at any given date.
dates <- seq(
  from = as.Date("2018-04-10", "%Y-%m-%d"),
  by = "1 year",
  length.out = 4
)
get_package_number_ppm(dates)
#> Scraping ppm...
#> Scraping number packages on: 2018-04-10
#> Scraping number packages on: 2019-04-10
#> Scraping number packages on: 2020-04-10
#> Scraping number packages on: 2021-04-10
#> date number_packages
#> 1 2018-04-10 12415
#> 2 2019-04-10 14025
#> 3 2020-04-10 15548
#> 4 2021-04-10 17388
Be careful, though: this only works for dates after 2014-09-17, the day PPM first went online.
get_package_number_ppm("2013-08-28")
#> Error: Some dates are anterior to ppm launch:
#> 1: 2013-08-28
For earlier dates, use cran_monthly_package_number. Here is a naïve example:
date_before_ppm <- as.Date("2013-08-28")
cran_monthly_package_number[
  min(
    which(
      cran_monthly_package_number$date >= date_before_ppm
    )
  ),
]
#> # A tibble: 1 × 2
#> date number_packages
#> <date> <dbl>
#> 1 2013-09-08 4904
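The naïve lookup above returns the first snapshot on or after the requested date, so the count may include packages published after that date. One possible variant, sketched below, returns the latest snapshot on or before the date instead. `snapshot_on_or_before()` is a hypothetical helper written for this illustration, not a {cranology} function, and the toy `monthly` data frame copies the first rows of the dataset printed earlier so the block runs on its own. It assumes the rows are sorted by date.

```r
# Toy copy of the first rows of `cran_monthly_package_number`.
monthly <- data.frame(
  date = as.Date(c(
    "1997-10-08", "1997-11-08", "1997-12-08",
    "1998-01-08", "1998-02-08", "1998-03-08"
  )),
  number_packages = c(1, 1, 1, 2, 2, 3)
)

# Hypothetical helper: latest snapshot taken on or before `date`.
# Assumes `data` is sorted by ascending date.
snapshot_on_or_before <- function(data, date) {
  date <- as.Date(date)
  idx <- which(data$date <= date)
  if (length(idx) == 0) {
    # The date precedes the first snapshot: return an empty data frame.
    return(data[0, ])
  }
  data[max(idx), ]
}

snapshot_on_or_before(monthly, "1998-03-20")
```

Returning a zero-row data frame for dates before the first snapshot avoids the warning and invalid index that `min(which(...))` or `max(which(...))` would produce on an empty match.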
The scrape_cran() function is essentially a tidyversification of this GitHub gist written by @daroczig.
Please note that this project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.