Adds configurable third-party scrapers.
A new `scrapers_path` configuration key is added. This defaults to a `scrapers` subdirectory in the same directory as the config file.

Scrapers are configured by adding scraper configurations in the form of json files in the scrapers directory. An example configuration for an IAFD scraper looks something like this (the script arguments here are illustrative - they are just whatever command line runs your script; the real config is in the attached zip):
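```json
{
  "name": "IAFD",
  "type": "PERFORMER",
  "method": "SCRIPT",
  "urls": ["iafd.com"],
  "get_performer_names": ["python", "iafd.py", "query"],
  "get_performer": ["python", "iafd.py", "scrape"],
  "get_performer_url": ["python", "iafd.py", "scrape_url"]
}
```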
The `name`, `type` and `method` fields are required. `name` is the name that is displayed in the `Scrape with...` drop-down button. `type` and `method` currently support only `PERFORMER` and `SCRIPT` respectively.

The `urls` field is optional and specifies the URL substring that will be used to detect scraping from a URL.
The `get_performer_names` and `get_performer` fields are mandatory for scrapers that are visible in the `Scrape with...` drop-down. They specify the script arguments to run for querying performer names, and for scraping a performer from the previous results. Likewise, the `get_performer_url` field is required where `urls` is specified, and specifies the script arguments to run to scrape a performer from a given URL.

The system pushes the script input into the script's stdin stream.
For `get_performer_names`, the expected input to the script is as follows: `{"name": "<performer name>"}`. The script is expected to output a JSON string to stdout; any errors or status messages should be output to stderr. The expected output for `get_performer_names` is a JSON representation of a `ScrapedPerformer` graphql fragment. Only `name` is required, however other fields can be used as context for the subsequent `get_performer` call. For example, in the attached script, `get_performer_names` fills in the `url` field, so that the subsequent `get_performer` call can use it to scrape the performer.
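To make the contract concrete, a stripped-down `get_performer_names` script looks something like this (the search itself is stubbed out and the result URL is made up, and I'm emitting a list of fragments, one per matching name; the attached script does the real IAFD search):

```python
#!/usr/bin/env python
# Minimal sketch of a get_performer_names script. The search is a stub;
# the example URL is a placeholder, not a real IAFD path.
import json
import sys

# Stash writes {"name": "<performer name>"} to stdin.
query = json.load(sys.stdin)

# Errors and status messages must go to stderr, never stdout.
sys.stderr.write("querying names for: %s\n" % query["name"])

# Each result is a partial ScrapedPerformer fragment. Only "name" is
# required; extra fields such as "url" are carried through to the
# subsequent get_performer call.
results = [
    {"name": query["name"], "url": "http://www.iafd.com/person/example"}
]

# The JSON output must be the only thing written to stdout.
json.dump(results, sys.stdout)
```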
The `get_performer` script input expects a JSON representation of a `ScrapedPerformer` graphql fragment. It returns the full representation of the performer as a `ScrapedPerformer` graphql object.
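Sketched the same way (the scraping is stubbed out, and the field names filled in below are just examples of what a `ScrapedPerformer` object can carry):

```python
#!/usr/bin/env python
# Minimal sketch of a get_performer script. A real scraper would fetch
# and parse the page at fragment["url"]; this stub fills in dummy values.
import json
import sys

# Stash writes the ScrapedPerformer fragment chosen from the
# get_performer_names results to stdin.
fragment = json.load(sys.stdin)

sys.stderr.write("scraping performer: %s\n" % fragment.get("name"))

# Return the full ScrapedPerformer object on stdout. The values here
# are placeholders for whatever the site scrape produces.
performer = dict(fragment)
performer.update({
    "birthdate": "1990-01-01",
    "height": "170",
    "ethnicity": "Caucasian",
})

json.dump(performer, sys.stdout)
```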
The `get_performer_url` script input expects a JSON string as follows: `{"url": "<url>"}`. It returns a full `ScrapedPerformer` graphql JSON object, as per `get_performer`.

It probably makes more sense with a complete example, so I've attached a Python IAFD scraper script and config JSON. Note that it is quite slow, and requires Python 2.7 with the lxml and cssselect modules.
iafd.zip
The UI for the scraper menu is mostly unchanged, apart from adding the new configured scrapers. The new feature is that the user can enter a URL for a performer; if the URL matches the `urls` field of a scraper, a button appears which, when clicked, scrapes and sets the performer fields from the URL. I have added this functionality to the built-in Freeones scraper.

There will obviously need to be some proper wiki documentation written up for this, but I hope this gives a decent idea of the feature.
Future ideas are to add scraper methods for http-json and http-graphql, and to add scrapers for scenes, which should be substantially less work than this since most of the framework is done. Getting performer images would be a nice bonus as well.