
Generic performer scrapers #203

Merged
Leopere merged 8 commits into stashapp:develop from WithoutPants:performer_scraper on Nov 19, 2019
Conversation

WithoutPants
Collaborator

Adds configurable third-party scrapers.

A new scrapers_path configuration key is added. This defaults to a scrapers subdirectory in the same directory as the config file.
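For example, the key might be set in the config file like this (a hedged sketch; the path value is hypothetical, and I am assuming the yaml config format):

scrapers_path: /home/user/.stash/scrapers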

Scrapers are configured by adding json configuration files to the scrapers directory. An example configuration for an IAFD scraper is as follows:

{
	"name": "IAFD",
	"type": "PERFORMER",
	"method": "SCRIPT",
	"urls": ["iafd.com"],	
	"get_performer_names": ["python", "iafdScrape.py", "query"],
	"get_performer": ["python", "iafdScrape.py", "scrape"],
	"get_performer_url": ["python", "iafdScrape.py", "scrapeURL"]
}

The name, type and method fields are required. name is the name displayed in the Scrape with... drop-down button. type and method currently support only PERFORMER and SCRIPT respectively.

The urls field is optional and specifies the URL substrings that are used to detect whether a scraper supports scraping from a given URL.

The get_performer_names and get_performer fields are mandatory for scrapers that are visible in the Scrape with... drop-down. They specify the script arguments to run to query performer names, and to scrape a performer from one of the returned results.

Likewise, the get_performer_url field is required where urls is specified; it specifies the script arguments to run to scrape a performer from a given URL.

The system pushes the script input into the script's stdin stream. For get_performer_names, the expected input is: {"name": "<performer name>"}. The script is expected to output a json string to stdout; any errors or status messages should be output to stderr. The expected output for get_performer_names is a json representation of a ScrapedPerformer graphql fragment. Only name is required, but other fields can be used as context for the subsequent get_performer call. For example, in the attached script, get_performer_names fills in the URL field so that the subsequent get_performer call can use it to scrape the performer.
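For illustration, a hypothetical get_performer_names exchange might look like the following (the names and URL are invented, and I am assuming multiple results are returned as a json array of fragments):

Input (stdin):

{"name": "jane"}

Output (stdout):

[
	{"name": "Jane Example", "url": "https://www.iafd.com/person.rme/perfid=janeexample"},
	{"name": "Jane Sample"}
]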

The get_performer script expects a json representation of a ScrapedPerformer graphql fragment as input. It returns the full representation of the performer as a ScrapedPerformer graphql object.
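Continuing the hypothetical example, get_performer receives one of those fragments on stdin and fills in the remaining fields (the field names below are illustrative, not an exhaustive list of the ScrapedPerformer fields):

Input (stdin):

{"name": "Jane Example", "url": "https://www.iafd.com/person.rme/perfid=janeexample"}

Output (stdout):

{
	"name": "Jane Example",
	"url": "https://www.iafd.com/person.rme/perfid=janeexample",
	"birthdate": "1990-01-01",
	"ethnicity": "Caucasian",
	"height": "170"
}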

The get_performer_url script expects a json string as input, as follows: {"url": "<url>"}. It returns a full ScrapedPerformer graphql json object, as per get_performer.

It probably makes more sense with an example, so I've attached a python IAFD scraper script and config json. Note that it is quite slow, and requires python 2.7 and the lxml and cssselect modules.

iafd.zip
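To make the script-side contract concrete, here is a minimal hypothetical skeleton of such a script (this is not the attached IAFD scraper; the lookup logic is stubbed out, and I am again assuming name queries return a json array):

# scrape_skeleton.py - minimal sketch of the stdin/stdout scraper protocol.
# Invoked as: python scrape_skeleton.py [query|scrape|scrapeURL]
import json
import sys

def query(frag):
    # Return candidate performer fragments for the queried name.
    # A real scraper would search the target site here.
    return [{"name": frag["name"].title()}]

def scrape(frag):
    # Fill in the full performer details using the fragment as context.
    frag["ethnicity"] = "unknown"  # illustrative field only
    return frag

def scrape_url(frag):
    # Scrape a performer directly from the given url.
    return {"name": "placeholder", "url": frag["url"]}

if __name__ == "__main__":
    mode = sys.argv[1]
    frag = json.loads(sys.stdin.read())
    if mode == "query":
        result = query(frag)
    elif mode == "scrape":
        result = scrape(frag)
    elif mode == "scrapeURL":
        result = scrape_url(frag)
    else:
        # Errors and status messages go to stderr, not stdout.
        sys.stderr.write("unknown mode: %s\n" % mode)
        sys.exit(1)
    # The json result is the only thing written to stdout.
    print(json.dumps(result))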

The UI for the scraper menu is mostly unchanged, aside from adding the newly configured scrapers. The new feature is that the user can enter a performer URL, and if the URL matches the urls field of a scraper, a button appears which, when clicked, scrapes the URL and sets the performer fields. I have added this functionality to the built-in Freeones scraper as well.

[screenshot]

There will obviously need to be some proper wiki documentation written up for this, but I hope this gives a decent idea of the feature.

Future ideas are to add scraper methods for http-json and http-graphql, and to add scrapers for scenes - which should be substantially less work than this, since most of the framework is done. Getting performer images would be a nice bonus as well.

@StashAppDev StashAppDev changed the base branch from master to develop November 16, 2019 15:46
@bnkai
Collaborator

bnkai commented Nov 18, 2019

Tested; works as intended and doesn't interfere with the existing freeones scraper.
This needs to be documented in a lot more detail in the WIKI though; I had to read the explanation for the scrapers a few times to understand what's happening.
I think the url matching is an even cooler addition, although we need to document it somewhere as well.

@Leopere Leopere self-requested a review November 19, 2019 02:48
@Leopere Leopere added the feature Pull requests that add a new feature label Nov 19, 2019
@Leopere Leopere merged commit 1724706 into stashapp:develop Nov 19, 2019
@WithoutPants WithoutPants mentioned this pull request Feb 4, 2020
10 tasks
@WithoutPants WithoutPants deleted the performer_scraper branch May 15, 2020 07:12