
Parse cache data from a different page #29

Open
tomasbedrich opened this issue Jul 22, 2015 · 8 comments

@tomasbedrich
Owner

Use this URL to parse geocache data (possible 2× speedup): http://www.geocaching.com/seek/cdpf.aspx?guid=182a3463-e46e-4401-8697-3ad3ac2a1a42&lc=10
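A minimal sketch of fetching and parsing that print page (requests + BeautifulSoup; in practice an authenticated session is probably required, and the selector below is an illustrative assumption, not taken from the live page):

import requests
from bs4 import BeautifulSoup

# lc= controls how many logs the print page includes
response = requests.get(
    "http://www.geocaching.com/seek/cdpf.aspx",
    params={"guid": "182a3463-e46e-4401-8697-3ad3ac2a1a42", "lc": 10},
)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Illustrative only -- the real selectors have to be read off the page source.
title = soup.find("h2")
print(title.get_text(strip=True) if title else "cache title element not found")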

@weinshec
Collaborator

weinshec commented Sep 1, 2016

Do you have any clue how to retrieve the GUID of a cache in the first place, without loading the usual details page? It wouldn't make sense if one had to parse the details page first 😆

@FriedrichFroebel
Collaborator

The log page provides a link to the listing which contains the GUID of the cache: https://www.geocaching.com/seek/log.aspx?wp=GC3RPVZ
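A minimal sketch of pulling the GUID out of that log page (regex-based, assuming the listing link still carries a guid= query parameter; the page most likely needs an authenticated session):

import re
import requests

def guid_from_log_page(session, wp):
    # The log page links back to the listing via a guid= parameter.
    response = session.get(
        "https://www.geocaching.com/seek/log.aspx", params={"wp": wp}
    )
    response.raise_for_status()
    match = re.search(
        r"guid=([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})",
        response.text,
        re.IGNORECASE,
    )
    return match.group(1) if match else None

print(guid_from_log_page(requests.Session(), "GC3RPVZ"))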

@tomasbedrich
Owner Author

It is also possible to fetch the GUID using the load_quick() method:

GET https://tiles01.geocaching.com/map.details?i=GC4HTZW
{  
   "status":"success",
   "data":[  
      {  
         "name":"Zmijovec (Amorphophallus)",
         "gc":"GC3RPVZ",
         "g":"182a3463-e46e-4401-8697-3ad3ac2a1a42",
         "available":true,
         "archived":false,
         "subrOnly":false,
         "li":false,
         "fp":"115",
         "difficulty":{  
            "text":4.5,
            "value":"4_5"
         },
         "terrain":{  
            "text":1.0,
            "value":"1"
         },
         "hidden":"11/18/2012",
         "container":{  
            "text":"Regular",
            "value":"regular.gif"
         },
         "type":{  
            "text":"Traditional Cache",
            "value":2
         },
         "owner":{  
            "text":"Lindbergh007",
            "value":"850af23d-fc83-4d3c-b93a-9e5d6ae359c9"
         }
      }
   ]
}
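For illustration, extracting the GUID from that response could look like this (plain requests here; inside pycaching the call would go through the logged-in session, and the endpoint may expect the usual browser headers):

import requests

def guid_from_map_details(gc_code):
    response = requests.get(
        "https://tiles01.geocaching.com/map.details", params={"i": gc_code}
    )
    response.raise_for_status()
    payload = response.json()
    if payload.get("status") != "success" or not payload.get("data"):
        raise ValueError("unexpected map.details response")
    # The "g" field of the first data entry is the cache GUID.
    return payload["data"][0]["g"]

print(guid_from_map_details("GC3RPVZ"))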

The question is whether it will be faster to make two lightweight requests or one heavy one.

I would suggest adding GUID parsing to load_quick() and creating a temporary load_by_guid() method which would populate some basic Cache info using the print page. Then we could measure which is faster.
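A rough sketch of how such a temporary method could hang together (method and helper names here are assumptions based on this thread, not the actual pycaching API):

class Cache:
    # ...existing pycaching Cache attributes and methods...

    def load_by_guid(self):
        # Make sure the GUID is known; load_quick() can provide it cheaply.
        if not getattr(self, "guid", None):
            self.load_quick()
        # Fetch the lightweight print page instead of the full details page.
        # _fetch_print_page() is a placeholder for whatever request helper
        # the implementation ends up using.
        soup = self._fetch_print_page(self.guid, logs=10)
        # Populate basic attributes (selector illustrative only).
        self.name = soup.find("h2").get_text(strip=True)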

@weinshec
Collaborator

weinshec commented Sep 4, 2016

That sounds reasonable to me. I would like to give it a try and will report on the performance comparison as soon as I have a first implementation.

@weinshec
Collaborator

weinshec commented Sep 9, 2016

So here are some numbers... I used the timeit module to profile a call to Cache.load() and compared it with a call to Cache.load_quick() parsing the GUID, followed by Cache.load_by_guid() requesting the print page. The results are mean times over 100 calls each.

Scenario 1 (load()): 1.28 seconds
Scenario 2 (load_quick() + load_by_guid()): 0.85 seconds

So it seems that two lightweight calls are faster than one heavy one, although not by a factor of 2. Should we go for scenario 2 and rely on two requests, or stick with scenario 1 and a single request?
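For reference, a minimal sketch of the kind of timeit comparison described above (assumes a logged-in pycaching session; load_by_guid() is the method proposed in this thread, not an existing API):

import timeit
import pycaching

geocaching = pycaching.login("username", "password")

def scenario_1():
    geocaching.get_cache("GC3RPVZ").load()

def scenario_2():
    cache = geocaching.get_cache("GC3RPVZ")
    cache.load_quick()
    cache.load_by_guid()

for label, fn in [("load()", scenario_1),
                  ("load_quick() + load_by_guid()", scenario_2)]:
    mean = timeit.timeit(fn, number=100) / 100
    print("{}: {:.2f} s".format(label, mean))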

@tomasbedrich
Owner Author

tomasbedrich commented Sep 9, 2016

Nice! I think we should do some refactoring before replacing the original load() method. The big picture is to have multiple load_by_xxx() methods which would actually fill the data, and a lightweight load() method which would decide which one to use.
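Roughly something like this (a sketch only; the method names besides load() and load_by_guid() are illustrative):

class Cache:
    def load(self):
        # Lightweight dispatcher: pick the cheapest loader that can
        # fill the requested attributes.
        if getattr(self, "guid", None):
            self.load_by_guid()       # print page, cheap
        else:
            self.load_by_details()    # classic details page, heavier

    def load_by_guid(self):
        ...  # scrape the print page (cdpf.aspx)

    def load_by_details(self):
        ...  # scrape the regular cache details page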

But for now, the best you can do is to create a pull request for a separate load_by_guid() method which would first check for the presence of a GUID (possibly calling load_quick() if needed) and then scrape as many cache details as possible.

Then I would do the refactoring on my own, because it may be a little more complex.

@tomasbedrich
Owner Author

Following discussion with @twlare from #75:

First of all, there are some gotchas regarding the completeness of loaded attributes. The mentioned refactoring may help, as it would allow end users to control which attributes matter to them, so missing ones wouldn't hurt. It may also be good to check whether anything has changed on the cache "print page" (new/removed attributes there).

Please feel free to continue working on #74, but it will need some rebasing onto the current master.

So in summary, what is left to do: check the status of the code (does it still work? any new/removed attributes?), rebase, maybe refactor, and, most importantly, switch the primary algorithm behind the load() method (which must stay backwards-compatible: the same API and the same set of loaded attributes).

@tomasbedrich
Owner Author

Resurrecting the thread with an email received from Dave:


I have one possible suggestion: when I have scraped in the past I have found that the most reliable way to get the cache info is to request the gpx file, which can be obtained through a very simple POST request.

s = requests.Session()
cookies = [s.cookies.set(c['name'], c['value']) for c in request_cookies_browser]
URL = 'https://www.geocaching.com/geocache/GC9EFK2_colorful-pairs'
params = {'__EVENTTARGET': '', '__EVENTARGUMENT': '',
          'ctl00$ContentBody$lnkGPXDownload': 'GPX+file'}
response = s.post(URL, params)
GPX = response.text

I don't know if it works for non-premium members, but it is very fast and contains almost everything you want, including (usually) about 10 logs. It could speed up the get_cache() function a lot.
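If the GPX route were taken, reading basic fields out of the returned document needs only the standard library; a minimal sketch (the namespace URIs are assumptions based on typical geocaching.com GPX 1.0 exports):

import xml.etree.ElementTree as ET

NS = {
    "gpx": "http://www.topografix.com/GPX/1/0",
    "groundspeak": "http://www.groundspeak.com/cache/1/0/1",
}

root = ET.fromstring(GPX)  # GPX text from the snippet above
for wpt in root.findall("gpx:wpt", NS):
    code = wpt.findtext("gpx:name", namespaces=NS)  # GC code
    cache = wpt.find("groundspeak:cache", NS)
    name = cache.findtext("groundspeak:name", namespaces=NS) if cache is not None else None
    print(code, name, wpt.get("lat"), wpt.get("lon"))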
