Problem parsing rdfa in aws lambda #102

Open
@kamilliano

Hi,
Has anyone used extruct on AWS Lambda? I tested running extruct there, and extraction fails for RDFa. The other default metadata types work fine.

A simple test case:

import pprint as pp
import requests
from extruct.rdfa import RDFaExtractor
import config_files.logging_config as log

logger = log.logger

def main():

    try:
        import extruct
        logger.info("Testing importing extruct which loaded successfully")
        import rdflib
        logger.info("Testing importing rdflib which loaded successfully")
        import extruct.rdfa
        logger.info("Testing importing rdfa which loaded successfully")
        from extruct.rdfa import RDFaExtractor
        logger.info("Testing importing RDFaExtractor which loaded successfully")

    except ImportError as e:
        logger.error("failed to import : {}".format(e))

    try:
        url = 'https://www.littlewoods.com/ri-plus-floral-trumpet-sleeve-top/1600159211.prd'
        r = requests.get(url)
        rdfae = RDFaExtractor()
        rdfa_json = rdfae.extract(r.text, base_url=None)

        pp.pprint(rdfa_json)

    except Exception as e:
        logger.exception("Failed to extract rdfa. Error: {}".format(e))

main()

The relevant part of the pipenv graph output for extruct, taken when I build the artifact.zip file:

extruct==0.7.1
  - lxml [required: Any, installed: 3.6.0]
  - mf2py [required: Any, installed: 1.1.2]
    - BeautifulSoup4 [required: >=4.6.0, installed: 4.7.1]
      - soupsieve [required: >=1.2, installed: 1.6.2]
    - html5lib [required: >=1.0.1, installed: 1.0.1]
      - six [required: >=1.9, installed: 1.11.0]
      - webencodings [required: Any, installed: 0.5.1]
    - requests [required: >=2.18.4, installed: 2.18.4]
      - certifi [required: >=2017.4.17, installed: 2018.11.29]
      - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
      - idna [required: >=2.5,<2.7, installed: 2.6]
      - urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
  - rdflib [required: Any, installed: 4.2.2]
    - isodate [required: Any, installed: 0.6.0]
      - six [required: Any, installed: 1.11.0]
    - pyparsing [required: Any, installed: 2.3.0]
  - rdflib-jsonld [required: Any, installed: 0.4.0]
    - rdflib [required: >=4.2, installed: 4.2.2]
      - isodate [required: Any, installed: 0.6.0]
        - six [required: Any, installed: 1.11.0]
      - pyparsing [required: Any, installed: 2.3.0]
  - six [required: Any, installed: 1.11.0]
  - w3lib [required: Any, installed: 1.19.0]
    - six [required: >=1.4.1, installed: 1.11.0]

When I run this locally in the same pipenv environment (Ubuntu 17.10, Docker 17.12.0-ce, pipenv==v2018.11.26), I don't see any issues. On Lambda invocation, I get the following log output and stack trace:

2019-01-10 14:32:49,092:INFO:pid 1:Testing importing extruct which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdflib which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdfa which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing RDFaExtractor which loaded successfully
2019-01-10 14:32:51,753:ERROR:pid 1:Failed to extract rdfa. Error: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
Traceback (most recent call last):
  File "/var/task/rdflib/plugin.py", line 100, in get
    p = _plugins[(name, kind)]
KeyError: ('json-ld', <class 'rdflib.serializer.Serializer'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/task/metadata_extractor/rdfa_extract_poc.py", line 15, in main
    rdfa_json = rdfae.extract(r.text, base_url=None)
  File "/var/task/extruct/rdfa.py", line 35, in extract
    return self.extract_items(tree, base_url=base_url, expanded=expanded)
  File "/var/task/extruct/rdfa.py", line 48, in extract_items
    jsonld_string = g.serialize(format='json-ld', auto_compact=not expanded).decode('utf-8')
  File "/var/task/rdflib/graph.py", line 940, in serialize
    serializer = plugin.get(format, Serializer)(self)
  File "/var/task/rdflib/plugin.py", line 103, in get
    "No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
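
The KeyError in rdflib/plugin.py suggests that the json-ld serializer was never registered. As far as I understand, rdflib 4.x discovers third-party plugins such as rdflib-jsonld through setuptools entry points in the rdf.plugins.serializer group, and those depend on the package metadata (the .dist-info or .egg-info directories) being present next to the code. Below is a minimal diagnostic sketch, assuming Python 3.8+ and assuming that missing metadata in artifact.zip is the cause (not confirmed). The helper name serializer_entry_points is hypothetical.

```python
from importlib import metadata


def serializer_entry_points(group="rdf.plugins.serializer"):
    """Return the sorted names of entry points visible in `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry point collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict mapping group name to entry points.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    names = serializer_entry_points()
    print("serializer entry points:", names)
    print("json-ld discoverable:", "json-ld" in names)
```

Running this both locally and inside the Lambda bundle should show whether the json-ld entry point is visible in one environment but not the other.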

I have been scratching my head over this but can't figure it out. What should I try? Thanks in advance.
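
One thing that might be worth trying, assuming the cause is that the entry-point metadata for rdflib-jsonld is not visible inside the Lambda bundle (an unconfirmed guess), is to register the serializer explicitly with rdflib.plugin.register before calling RDFaExtractor.extract. The module path and class name below are the ones rdflib-jsonld ships. The helper name ensure_jsonld_serializer is hypothetical.

```python
def ensure_jsonld_serializer():
    """Register the rdflib-jsonld serializer if rdflib does not know it yet.

    Returns "unavailable" if rdflib cannot be imported, "present" if a
    json-ld serializer is already registered, "registered" otherwise.
    """
    try:
        from rdflib import plugin
        from rdflib.serializer import Serializer
    except ImportError:
        return "unavailable"

    try:
        # Raises PluginException when no json-ld serializer is registered.
        plugin.get("json-ld", Serializer)
        return "present"
    except plugin.PluginException:
        # Registration is lazy: the module is only imported on first use.
        plugin.register(
            "json-ld", Serializer,
            "rdflib_jsonld.serializer", "JsonLDSerializer",
        )
        return "registered"


if __name__ == "__main__":
    print(ensure_jsonld_serializer())
```

Calling ensure_jsonld_serializer() once when the handler module is loaded, before rdfae.extract(...), would be enough if this is indeed the cause.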
