A Scrapy pipeline module that persists items to a Postgres table automatically.
Here's an example configuration showing the automatic item pipeline, with a custom JSONB field:
```python
# settings.py
from sqlalchemy.dialects.postgresql import JSONB

ITEM_PIPELINES = {
    'pgpipeline.PgPipeline': 300,
}

PG_PIPELINE = {
    'connection': 'postgresql://localhost:5432/scrapy_db',
    'table_name': 'demo_items',
    'pkey': 'item_id',
    'ignore_identical': ['item_id', 'job_id'],
    'types': {
        'some_data': JSONB
    },
    'onconflict': 'upsert'
}
```

All columns, tables, and indices are automatically created.
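Items yielded by your spiders are then persisted according to this configuration. For context, here is a minimal sketch of such a spider; the spider name, start URL, and selectors are hypothetical:

```python
# A minimal sketch, assuming the PG_PIPELINE config above; the spider
# name, start URL, and CSS selectors are hypothetical placeholders.
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/items']

    def parse(self, response):
        for row in response.css('.item'):
            # Field names line up with the config: 'item_id' is the
            # pkey and 'some_data' lands in a JSONB column.
            yield {
                'item_id': row.attrib.get('data-id'),
                'job_id': self.name,
                'some_data': {'title': row.css('::text').get()},
            }
```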
The configuration keys are:

- `pkey`: a primary key for this item (other than the database-generated `id`).
- `ignore_identical`: a set of fields by which we identify duplicates and skip the insert.
- `types`: keys specified here will use the type given; otherwise types are guessed.
- `onconflict`: one of `upsert` | `ignore` | `non-null`. `ignore` will skip inserting on conflict and `upsert` will update the existing row; `non-null` will upsert only values that are not `None` and thus avoid removing existing values (see the sketch below).
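To make the `onconflict` modes concrete, here is a rough sketch of the Postgres `ON CONFLICT` semantics they map to, expressed with SQLAlchemy; the table definition and values are hypothetical, and this is not the pipeline's actual implementation:

```python
# A rough sketch of the semantics behind 'onconflict'; not the
# pipeline's actual code. The table and values are hypothetical.
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
demo_items = Table(
    'demo_items', metadata,
    Column('item_id', String, primary_key=True),
    Column('job_id', String),
)

values = {'item_id': 'a1', 'job_id': 'run-42'}
stmt = insert(demo_items).values(**values)

# 'ignore': leave the existing row alone on a primary-key conflict.
ignore_stmt = stmt.on_conflict_do_nothing(index_elements=['item_id'])

# 'upsert': overwrite the existing row's columns on conflict.
upsert_stmt = stmt.on_conflict_do_update(
    index_elements=['item_id'],
    set_={k: v for k, v in values.items() if k != 'item_id'},
)

# 'non-null': like upsert, but only set columns whose new value is
# not None, so missing fields don't wipe existing data.
non_null_stmt = stmt.on_conflict_do_update(
    index_elements=['item_id'],
    set_={k: v for k, v in values.items()
          if k != 'item_id' and v is not None},
)
```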
Set up a development environment:

```
$ pip install -r requirements.txt
```
Dependencies are listed in two places:

- `requirements.txt`, for the development environment
- `setup.py`, under `install_requires`, for packaging (a minimal sketch follows):

```python
install_requires=['peppercorn'],
```
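For orientation, a minimal `setup.py` sketch showing where `install_requires` sits; the name, version, and package list here are hypothetical placeholders, not the project's actual packaging file:

```python
# A minimal setup.py sketch; the metadata values are hypothetical.
from setuptools import setup

setup(
    name='pgpipeline',
    version='0.1.0',
    packages=['pgpipeline'],
    install_requires=['peppercorn'],
)
```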
Then:

```
$ make dist && make release
```
Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).
To all Contributors - you make this happen, thanks!
Copyright (c) 2017 Dotan Nahum @jondot. See LICENSE for further details.