
Generic performer scrapers #203

Merged
Leopere merged 8 commits into stashapp:develop from WithoutPants:performer_scraper on Nov 19, 2019
Conversation

WithoutPants
Collaborator

Adds configurable third-party scrapers.

A new scrapers_path configuration key is added. This defaults to a scrapers subdirectory in the same directory as the config file.
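For example, the key might be set in the config file like this (a hedged sketch; the path value is hypothetical, and I am assuming the yaml config format):

scrapers_path: /home/user/.stash/scrapers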

Scrapers are configured by adding json configuration files to the scrapers directory. An example configuration for an IAFD scraper is as follows:

{
	"name": "IAFD",
	"type": "PERFORMER",
	"method": "SCRIPT",
	"urls": ["iafd.com"],	
	"get_performer_names": ["python", "iafdScrape.py", "query"],
	"get_performer": ["python", "iafdScrape.py", "scrape"],
	"get_performer_url": ["python", "iafdScrape.py", "scrapeURL"]
}

The name, type and method fields are required. name is the name displayed in the Scrape with... drop-down button. type and method currently support only PERFORMER and SCRIPT respectively.

The urls field is optional and specifies the URL substrings that are used to detect whether a scraper supports scraping from a given URL.

The get_performer_names and get_performer fields are mandatory for scrapers that are visible in the Scrape with... drop-down. They specify the script arguments to run to query performer names, and to scrape a performer from one of the returned results.

Likewise, the get_performer_url field is required where urls is specified; it specifies the script arguments to run to scrape a performer from a given URL.

The system pushes the script input into the script's stdin stream. For get_performer_names, the expected input is: {"name": "<performer name>"}. The script is expected to output a json string to stdout; any errors or status messages should be output to stderr. The expected output for get_performer_names is a json representation of a ScrapedPerformer graphql fragment. Only name is required, but other fields can be used as context for the subsequent get_performer call. For example, in the attached script, get_performer_names fills in the URL field so that the subsequent get_performer call can use it to scrape the performer.
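For illustration, a hypothetical get_performer_names exchange might look like the following (the names and URL are invented, and I am assuming multiple results are returned as a json array of fragments):

Input (stdin):

{"name": "jane"}

Output (stdout):

[
	{"name": "Jane Example", "url": "https://www.iafd.com/person.rme/perfid=janeexample"},
	{"name": "Jane Sample"}
]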

The get_performer script expects a json representation of a ScrapedPerformer graphql fragment as input. It returns the full representation of the performer as a ScrapedPerformer graphql object.
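Continuing the hypothetical example, get_performer receives one of those fragments on stdin and fills in the remaining fields (the field names below are illustrative, not an exhaustive list of the ScrapedPerformer fields):

Input (stdin):

{"name": "Jane Example", "url": "https://www.iafd.com/person.rme/perfid=janeexample"}

Output (stdout):

{
	"name": "Jane Example",
	"url": "https://www.iafd.com/person.rme/perfid=janeexample",
	"birthdate": "1990-01-01",
	"ethnicity": "Caucasian",
	"height": "170"
}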

The get_performer_url script expects a json string as input, as follows: {"url": "<url>"}. It returns a full ScrapedPerformer graphql json object, as per get_performer.

It probably makes more sense with an example, so I've attached a python IAFD scraper script and config json. Note that it is quite slow, and requires python 2.7 and the lxml and cssselect modules.

iafd.zip
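To make the script-side contract concrete, here is a minimal hypothetical skeleton of such a script (this is not the attached IAFD scraper; the lookup logic is stubbed out, and I am again assuming name queries return a json array):

# scrape_skeleton.py - minimal sketch of the stdin/stdout scraper protocol.
# Invoked as: python scrape_skeleton.py [query|scrape|scrapeURL]
import json
import sys

def query(frag):
    # Return candidate performer fragments for the queried name.
    # A real scraper would search the target site here.
    return [{"name": frag["name"].title()}]

def scrape(frag):
    # Fill in the full performer details using the fragment as context.
    frag["ethnicity"] = "unknown"  # illustrative field only
    return frag

def scrape_url(frag):
    # Scrape a performer directly from the given url.
    return {"name": "placeholder", "url": frag["url"]}

if __name__ == "__main__":
    mode = sys.argv[1]
    frag = json.loads(sys.stdin.read())
    if mode == "query":
        result = query(frag)
    elif mode == "scrape":
        result = scrape(frag)
    elif mode == "scrapeURL":
        result = scrape_url(frag)
    else:
        # Errors and status messages go to stderr, not stdout.
        sys.stderr.write("unknown mode: %s\n" % mode)
        sys.exit(1)
    # The json result is the only thing written to stdout.
    print(json.dumps(result))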

The UI for the scraper menu is mostly unchanged, aside from adding the newly configured scrapers. The new feature is that the user can enter a performer URL, and if the URL matches the urls field of a scraper, a button appears which, when clicked, scrapes the URL and sets the performer fields. I have added this functionality to the built-in Freeones scraper as well.

[screenshot]

There will obviously need to be some proper wiki documentation written up for this, but I hope this gives a decent idea of the feature.

Future ideas are to add scraper methods for http-json and http-graphql, and to add scrapers for scenes - which should be substantially less work than this, since most of the framework is done. Getting performer images would be a nice bonus as well.

@StashAppDev StashAppDev changed the base branch from master to develop November 16, 2019 15:46
@bnkai
Collaborator

bnkai commented Nov 18, 2019

Tested; works as intended and doesn't interfere with the existing freeones scraper.
This needs to be documented in a lot more detail in the WIKI though; I had to read the explanation for the scrapers a few times to understand what's happening.
I think the url matching is an even cooler addition, although we need to document it somewhere as well.

@Leopere Leopere self-requested a review November 19, 2019 02:48
@Leopere Leopere added the feature Pull requests that add a new feature label Nov 19, 2019
@Leopere Leopere merged commit 1724706 into stashapp:develop Nov 19, 2019
@WithoutPants WithoutPants mentioned this pull request Feb 4, 2020
10 tasks
@WithoutPants WithoutPants deleted the performer_scraper branch May 15, 2020 07:12