Hi,
Has anyone used extruct on AWS Lambda? I tested an extruct function there, and extraction fails for RDFa. The other default metadata types work fine.
A simple test case:
import pprint as pp
import requests
from extruct.rdfa import RDFaExtractor
import config_files.logging_config as log

logger = log.logger

def main():
    try:
        import extruct
        logger.info("Testing importing extruct which loaded successfully")
        import rdflib
        logger.info("Testing importing rdflib which loaded successfully")
        import extruct.rdfa
        logger.info("Testing importing rdfa which loaded successfully")
        from extruct.rdfa import RDFaExtractor
        logger.info("Testing importing RDFaExtractor which loaded successfully")
    except ImportError as e:
        logger.error("failed to import : {}".format(e))
    try:
        url = 'https://www.littlewoods.com/ri-plus-floral-trumpet-sleeve-top/1600159211.prd'
        r = requests.get(url)
        rdfae = RDFaExtractor()
        rdfa_json = rdfae.extract(r.text, base_url=None)
        pp.pprint(rdfa_json)
    except Exception as e:
        logger.exception("Failed to extract rdfa. Error: {}".format(e))

main()
The extruct portion of the pipenv graph output from when I build the artifact.zip file:
extruct==0.7.1
  - lxml [required: Any, installed: 3.6.0]
  - mf2py [required: Any, installed: 1.1.2]
    - BeautifulSoup4 [required: >=4.6.0, installed: 4.7.1]
      - soupsieve [required: >=1.2, installed: 1.6.2]
    - html5lib [required: >=1.0.1, installed: 1.0.1]
      - six [required: >=1.9, installed: 1.11.0]
      - webencodings [required: Any, installed: 0.5.1]
    - requests [required: >=2.18.4, installed: 2.18.4]
      - certifi [required: >=2017.4.17, installed: 2018.11.29]
      - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
      - idna [required: >=2.5,<2.7, installed: 2.6]
      - urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
  - rdflib [required: Any, installed: 4.2.2]
    - isodate [required: Any, installed: 0.6.0]
      - six [required: Any, installed: 1.11.0]
    - pyparsing [required: Any, installed: 2.3.0]
  - rdflib-jsonld [required: Any, installed: 0.4.0]
    - rdflib [required: >=4.2, installed: 4.2.2]
      - isodate [required: Any, installed: 0.6.0]
        - six [required: Any, installed: 1.11.0]
      - pyparsing [required: Any, installed: 2.3.0]
  - six [required: Any, installed: 1.11.0]
  - w3lib [required: Any, installed: 1.19.0]
    - six [required: >=1.4.1, installed: 1.11.0]
When I run this locally in the same pipenv environment (Ubuntu 17.10, Docker 17.12.0-ce, pipenv==v2018.11.26), I don't see any issues. On Lambda invocation, the log shows the following, ending in a stack trace:
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing extruct which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdflib which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdfa which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing RDFaExtractor which loaded successfully
2019-01-10 14:32:51,753:ERROR:pid 1:Failed to extract rdfa. Error: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
Traceback (most recent call last):
  File "/var/task/rdflib/plugin.py", line 100, in get
    p = _plugins[(name, kind)]
KeyError: ('json-ld', <class 'rdflib.serializer.Serializer'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/task/metadata_extractor/rdfa_extract_poc.py", line 15, in main
    rdfa_json = rdfae.extract(r.text, base_url=None)
  File "/var/task/extruct/rdfa.py", line 35, in extract
    return self.extract_items(tree, base_url=base_url, expanded=expanded)
  File "/var/task/extruct/rdfa.py", line 48, in extract_items
    jsonld_string = g.serialize(format='json-ld', auto_compact=not expanded).decode('utf-8')
  File "/var/task/rdflib/graph.py", line 940, in serialize
    serializer = plugin.get(format, Serializer)(self)
  File "/var/task/rdflib/plugin.py", line 103, in get
    "No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
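A check that might narrow this down, sketched under assumptions I have not confirmed: rdflib discovers third-party plugins such as json-ld through setuptools entry points (group `rdf.plugins.serializer`), so if artifact.zip was built without the packages' `.dist-info` or `.egg-info` metadata, that lookup could come back empty on Lambda even though the imports succeed. The helper name `entry_point_names` is hypothetical, and the snippet assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def entry_point_names(group):
    """Return the sorted names of installed entry points in `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable interface.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    # Entry-point group that rdflib scans for serializer plugins.
    print(entry_point_names("rdf.plugins.serializer"))
```

If the local run lists json-ld and the Lambda run does not, that would point at the packaging step rather than at extruct itself.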
I have been scratching my head over this but can't figure it out. What should I try? Thanks in advance.
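One workaround that may be worth trying, assuming the cause really is missing entry-point metadata in the bundle, is to register the JSON-LD plugins explicitly before calling the extractor. This is only a sketch: `register_jsonld_plugins` is a hypothetical helper name, and it relies on `rdflib.plugin.register` plus the `rdflib_jsonld.serializer` and `rdflib_jsonld.parser` modules from the rdflib 4.2.2 and rdflib-jsonld 0.4.0 versions listed above.

```python
def register_jsonld_plugins():
    """Register rdflib-jsonld's plugins by hand (assumes both packages import)."""
    # Imports are local so the helper can be defined even where rdflib is absent.
    from rdflib import plugin
    from rdflib.parser import Parser
    from rdflib.serializer import Serializer

    # Same name-to-class mappings that rdflib-jsonld declares as entry points.
    plugin.register("json-ld", Serializer, "rdflib_jsonld.serializer", "JsonLDSerializer")
    plugin.register("json-ld", Parser, "rdflib_jsonld.parser", "JsonLDParser")
```

Calling `register_jsonld_plugins()` once at the top of `main()`, before `RDFaExtractor()` is created, would be the intended usage.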