Skip to content

Conversation

@meomancer
Copy link
Contributor

@meomancer meomancer commented Aug 13, 2021

References: #7945

This PR implements copying of remote layers and documents in the GeonodeLegacyHarvester. It also includes a number of bug fixes.

Copying remote data is configured by means of editing a Harvester's harvester type specific configuration parameter. The following additional configuration parameters are now available:

  • copy_documents
  • copy_datasets

They both default to False, which means that by default a harvester will not copy remote data. Setting them to True will cause GeoNode to copy the remote data when performing harvesting.

The video below shows the proposed implementation in action:

2021-09-21.17-35-34_18.6.12.mp4

In the video I am demonstrating harvesting with two harvesters simultaneously. One of the harvesters (copys pcrafi) performs a copy of the remote data onto the local GeoNode, while the other (links pcrafi) merely copies the metadata. The video also demonstrates an error with the copying of a remote raster (the error is intended here) and how this is recorded in the corresponding harvesting session. The harvested resources appear on the GeoNode frontend.

Improvements

  • The HarvestingSession model is now more useful, as it records the progress of harvesting operations as they are on-going
  • The harvesting section of the django admin now includes a number of additional fields in order to ease navigation between harvesters, harvesting sessions and harvestable resources.
  • It is now possible to harvest individual harvestable resources directly from the admin section without needing to go to the harvester page. This is mainly useful for testing.

Fixes

Metadata fields for resource's thumbnail, URL extension (for documents) and subtype (for datasets) are now correctly provided.

When harvesting is merely referencing the remote resources (i.e. we are not copying the actual data, just the metadata):

  • Only metadata is copied over to the local GeoNode
  • Remote thumbnail is used (fixes geosolutions/nexus-geonode#450)
  • Remote resource URL is used

Notes

In the above video we can see that for remote documents, the GeoNode frontend offers a download of the file in the main page. This seems like a bug of the frontend, so I'm leaving it out of the current PR's scope (which is already quite big).

fixes #7945

giohappy and others added 30 commits January 20, 2021 09:37
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.2 to 1.26.3.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/1.26.3/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.2...1.26.3)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Toni <toni.schoenbuchner@csgis.de>
…6881)

* [Fixes GeoNode#6880] Circle CI upload tests fail irregulary

* CircleCI test fix: sometimes expires due to upload timeout in the test environment

* - Avoid infinite loop on upload testing

* Revert "CircleCI test fix: sometimes expires due to upload timeout in the test environment"

This reverts commit 66139fd.

Co-authored-by: Alessio Fabiani <alessio.fabiani@geo-solutions.it>
Co-authored-by: afabiani <alessio.fabiani@gmail.com>
…de#6911)

* get meaningful document filenames on download

* - Strip extension from document title before slugify it (e.g.: image.jpg instead of imagejpg.jpg)

Co-authored-by: afabiani <alessio.fabiani@gmail.com>
Co-authored-by: Alessio Fabiani <alessio.fabiani@geo-solutions.it>
…ng slash at the end of GEOSERVER_LOCATION (GeoNode#6913)

* [Fixes GeoNode#6916] gsimporter.api.NotFound caused by missing trailing slash at the end of GEOSERVER_LOCATION

* [Fixes GeoNode#6916] unit test for GEOSERVER_LOCATION
Bumps [django-cors-headers](https://github.com/adamchainz/django-cors-headers) from 3.6.0 to 3.7.0.
- [Release notes](https://github.com/adamchainz/django-cors-headers/releases)
- [Changelog](https://github.com/adamchainz/django-cors-headers/blob/master/HISTORY.rst)
- [Commits](adamchainz/django-cors-headers@3.6.0...3.7.0)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [amqp](https://github.com/celery/py-amqp) from 5.0.3 to 5.0.5.
- [Release notes](https://github.com/celery/py-amqp/releases)
- [Changelog](https://github.com/celery/py-amqp/blob/master/Changelog)
- [Commits](celery/py-amqp@v5.0.3...v5.0.5)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pip](https://github.com/pypa/pip) from 21.0 to 21.0.1.
- [Release notes](https://github.com/pypa/pip/releases)
- [Changelog](https://github.com/pypa/pip/blob/master/NEWS.rst)
- [Commits](pypa/pip@21.0...21.0.1)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [coverage](https://github.com/nedbat/coveragepy) from 5.3.1 to 5.4.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](coveragepy/coveragepy@coverage-5.3.1...coverage-5.4)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pytest](https://github.com/pytest-dev/pytest) from 6.2.1 to 6.2.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/master/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@6.2.1...6.2.2)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [djangorestframework-gis](https://github.com/openwisp/django-rest-framework-gis) from 0.16 to 0.17.
- [Release notes](https://github.com/openwisp/django-rest-framework-gis/releases)
- [Changelog](https://github.com/openwisp/django-rest-framework-gis/blob/master/CHANGES.rst)
- [Commits](openwisp/django-rest-framework-gis@v0.16.0...v0.17.0)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… it has… (GeoNode#6923)

* [Fixes GeoNode#6922][REST API v2] Expose the curated thumbnail URL if it has been uploaded

* - Add REST APIs test suite to CircleCI
* [Cleanup and Refactor] Remove QGIS server backend dependencies

* [Cleanup and Refactor] Remove QGIS server backend dependencies

* - Fix LGTM issues
…_auth_middleware

Feature#650 basic auth middleware
Copy link
Member

@afabiani afabiani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far it almost works fine but...

ISSUE 1

I noticed that in the case I leave the option

{
   "harvest_datasets": true,
   "copy_datasets": false
}

The layer won't be populated not accessible.

E.g.: by using the demo instance https://development.demo.geonode.org/ as a source I get the following

image

image

The dataset is empty.

ISSUE 2

Trying to harvest a Map fails with an exception, the session and the harvester are then blocker forever.

Internal Server Error: /en/admin/harvesting/harvestableresource/
Traceback (most recent call last):
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/contrib/admin/options.py", line 616, in wrapper
    return self.admin_site.admin_view(view)(*args, **kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/utils/decorators.py", line 130, in _wrapped_view
    response = view_func(request, *args, **kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/views/decorators/cache.py", line 44, in _wrapped_view_func
    response = view_func(request, *args, **kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/contrib/admin/sites.py", line 232, in inner
    return view(request, *args, **kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/utils/decorators.py", line 43, in _wrapper
    return bound_method(*args, **kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/utils/decorators.py", line 130, in _wrapped_view
    response = view_func(request, *args, **kwargs)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/contrib/admin/options.py", line 1723, in changelist_view
    response = self.response_action(request, queryset=cl.get_queryset(request))
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/django/contrib/admin/options.py", line 1408, in response_action
    response = func(self, request, queryset)
  File "/mnt/c/Work/geonode/geonode/harvesting/admin.py", line 365, in harvest_selected_resources
    tasks.harvest_resources.apply_async(args=(harvestable_resource_ids, harvesting_session.pk))
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/task.py", line 569, in apply_async
    return self.apply(args, kwargs, task_id=task_id or uuid(),
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/task.py", line 790, in apply
    ret = tracer(task_id, args, kwargs, request)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/trace.py", line 467, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/trace.py", line 450, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/mnt/c/Work/geonode/geonode/harvesting/tasks.py", line 147, in harvest_resources
    harvesting_workflow.apply_async()
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/canvas.py", line 1408, in apply_async
    return self.apply(args, kwargs,
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/canvas.py", line 1427, in apply
    args=(tasks.apply(args, kwargs).get(propagate=propagate),),
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/canvas.py", line 1117, in apply
    return app.GroupResult(group_id, [
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/canvas.py", line 1118, in <listcomp>
    sig.apply(args=args, kwargs=kwargs, **options) for sig, _, _ in tasks
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/canvas.py", line 186, in apply
    return self.type.apply(args, kwargs, **options)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/task.py", line 790, in apply
    ret = tracer(task_id, args, kwargs, request)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/trace.py", line 467, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/home/afabiani/.virtualenvs/geonode/lib/python3.8/site-packages/celery/app/trace.py", line 450, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/mnt/c/Work/geonode/geonode/harvesting/tasks.py", line 174, in _harvest_resource
    harvested_resource_info = worker.get_resource(harvestable_resource, harvesting_session_id)
  File "/mnt/c/Work/geonode/geonode/harvesting/harvesters/geonodeharvester.py", line 299, in get_resource
    resource_descriptor = self._get_resource_details(
  File "/mnt/c/Work/geonode/geonode/harvesting/harvesters/geonodeharvester.py", line 444, in _get_resource_details
    result = self._get_resource_descriptor(
  File "/mnt/c/Work/geonode/geonode/harvesting/harvesters/geonodeharvester.py", line 547, in _get_resource_descriptor
    additional_params_handler = {
KeyError: 'maps'

ISSUE 3

Whenever an harvest session fails, there's no way to recover the harvester, which remains busy forever.

@giohappy
Copy link
Contributor

Ragarding ISSUE 3 this is a long standing request that was already raised in the past @ricardogsilva. If a job breaks there's no way to recover it but going into the DB and manually changing the status back to ready. If we need a dedicated issue for this please open it @ricardogsilva.

@ricardogsilva
Copy link
Member

ricardogsilva commented Sep 26, 2021

@afabiani I will make the necessary adjustments. Some clarification wrt the issues you raise:

for Issue 1, the current GeonodeLegacyHarvester has been designed for harvesting older GeoNode (read pre-v2 API) instances. The fact that it does not work on the development demo of GeoNode, while unfortunate, is not too surprising. The plan is to create another worker that is able to serve modern GeoNode instances - the code will be a lot cleaner, since this legacy harvester has some additional complexity to compensate for the fact that the GeoNode API used to be a bit rough. I think the ideal solution will be to have a single harvester that is able to distinguish between older and newer GeoNode deployments. We ar e not there yet though.

For issue 2, that is a clear bug, which I will fix ASAP. The current harvester worker does not claim to support maps, so it should just fail in an orderly fashion.

For issue 3, that is a by-product of the bug described above probably. Generally speaking, the code is trying to catch known exceptions in the code so that it avoids entering that invalid state. The fact that it is still happening is a sign of some error in the code which is not being properly caught.

@giohappy
Copy link
Contributor

@ricardogsilva regarding ISSUE 1, I expect the legacy harvester to work even on 3.3.x (the development demo) and master, since nothing from the old API and CSW has been dropped so far.

@afabiani
Copy link
Member

@ricardogsilva just to be more clear a bit, notice that about ISSUE #1, if I enable copy_datasets=True it works fine.
The issue seems to be happening in the case I wanted to leave the data on the remote service.

@afabiani
Copy link
Member

btw, any chance to fix also this one while looking for the other issues?

======================================================================
FAIL: test_perform_harvesting (geonode.harvesting.tests.test_admin.HarvesterAdminTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/unittest/mock.py", line 1325, in patched
    return func(*newargs, **newkeywargs)
  File "/usr/src/geonode/geonode/harvesting/tests/test_admin.py", line 73, in test_perform_harvesting
    self.assertEqual(self.harvester.status, models.Harvester.STATUS_PERFORMING_HARVESTING)
AssertionError: 'ready' != 'harvesting-resources'
- ready
+ harvesting-resources

@afabiani
Copy link
Member

ISSUE 4

Raster datasets seem to be not harvested correctly even when the copy_dataeset option has been set to True

image

@lgtm-com
Copy link

lgtm-com bot commented Sep 28, 2021

This pull request introduces 2 alerts when merging 45fc7c2 into bf23880 - view on LGTM.com

new alerts:

  • 2 for Clear-text logging of sensitive information

@ricardogsilva
Copy link
Member

@afabiani

My latest commit should fix the issues you reported.

I fixed some missing metadata fields not being correctly passed to the resource manager.

I also disabled harvesting of maps for the GeoNode harvester. The reason for this is that API does not really provide enough information for us to be able to replicate a map locally - there is no info about the map's layers, etc.

To the best of my knowledge documents and datasets are now correctly harvested, regardless of being of type vector or raster and regardless of them being copied or just referenced from the remote.

I cannot guarantee 100% that this works for all GeoNode remote deployments, as I'm not sure how stable the old API had been between releases. Seems to work OK with those deployments that I am using as a test bed.

As for the failing test, I don't seem to get it locally. I'm waiting to let the CI build be done to check it output.

@ricardogsilva ricardogsilva force-pushed the 7945-harvest-layer-data branch from 45fc7c2 to a9ec56a Compare September 29, 2021 08:19
@lgtm-com
Copy link

lgtm-com bot commented Sep 29, 2021

This pull request introduces 2 alerts when merging a9ec56a into bf23880 - view on LGTM.com

new alerts:

  • 2 for Clear-text logging of sensitive information

@afabiani
Copy link
Member

@ricardogsilva there still something not fully working with the remote resources and in particular with the WMS harvester, but I'll postpone those fixes to #8149 which will include this PR too.

@afabiani afabiani merged commit 2164b02 into GeoNode:master Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed CLA Bot: community license agreement signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Harvesting - GeoNode Client] Harvesting of layers data

9 participants