Import Articles from RISJbot Using the 'tm' Text Mining Framework

tm.plugin.risjbot

tm.plugin.risjbot provides a tm Source function for creating corpora from articles scraped by the RISJbot webcrawler. RISJbot is a Scrapy/Python project designed to collect the full text and metadata of news articles from the web, using sites' own sitemaps and RSS feeds as sources. It produces a number of output formats, including JSONLines files, which this package can read into tm sources.

To allow better integration with other tm data sources, the package supports arbitrary mappings between field names in the RISJbot JSONLines file and metadata field names in the eventual tm::PlainTextDocument.

Installation

You can install the development version of tm.plugin.risjbot from GitHub with:

devtools::install_github("pmyteh/tm.plugin.risjbot")

Example

This example shows you how to create a tm::VCorpus object using this package's source function:

library(tm)                 # provides VCorpus()
library(tm.plugin.risjbot)

# 'input.jl' is a JSONLines file produced by RISJbot
s <- RISJbotSource('input.jl')
corp <- VCorpus(s)
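
Once a corpus exists, the standard tm accessors can be used to inspect it. The sketch below is illustrative only: it builds a small in-memory stand-in corpus (`demo`, a hypothetical name, created with tm's VectorSource) so that it runs without a RISJbot file. A corpus created from RISJbotSource should be inspectable in the same way, but the metadata fields actually present depend on the JSONLines input and on any field mappings, so none are assumed here.

```r
library(tm)

# Stand-in corpus so the snippet runs without a RISJbot output file.
# In practice, use the corpus built from RISJbotSource() instead.
demo <- VCorpus(VectorSource(c("First article text.",
                               "Second article text.")))

# Number of documents in the corpus
length(demo)

# Pull out a single document (a tm::PlainTextDocument)
doc <- demo[[1]]

# Document-level metadata; for a RISJbot corpus this is where the
# mapped JSONLines fields would be expected to appear
meta(doc)

# The document text itself
content(doc)
```

The same calls (`length()`, `[[`, `meta()`, `content()`) are generic across tm corpora, which is why a stand-in corpus is enough to show the pattern.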
