Simplify online check of data API client #627

Merged: 6 commits merged from feature/api-client-online-check into develop on Feb 16, 2023

Conversation

@peanutfun (Member) commented on Jan 23, 2023:

This fixes an issue where the online check ignores system-wide proxy settings. The socket interface ignores system-wide proxy settings by default, whereas requests does consider them. I therefore propose to remove any use of socket, which also simplifies the online query a lot.

Changes proposed in this PR:

  • Use requests for the online check instead of socket (see the sketch after this list).
  • Make Client.online a property method instead of an attribute computed once at initialization.
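
For illustration, a minimal sketch of the property-based check (a simplified stand-in, not the actual api_client code; error handling for the fully offline case is discussed further below):

import requests


class Client:
    """Stripped-down stand-in for climada.util.api_client.Client."""

    url = "https://climada.ethz.ch"

    @property
    def online(self):
        """Re-check connectivity on every access instead of caching it once."""
        # requests honors system-wide proxy settings such as http_proxy,
        # which the previous socket-based check ignored.
        # NOTE: when the machine is fully offline this raises ConnectionError;
        # handling that case is discussed further down in this thread.
        return requests.head(self.url, timeout=1).status_code == 200

With this, client.online re-evaluates the check on every access instead of returning a value stored at initialization.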

Open questions:

  • Am I missing something, or is there no unit test for the data API client?

PR Author Checklist

PR Reviewer Checklist

This fixes an issue where the online check would ignore proxy
settings (whereas the regular downloading does not).

* Use 'requests' for online check instead of 'socket'.
* Make 'Client.online' a property method instead of a fixed property.
@emanuel-schmid (Collaborator) commented:

Oh? I wasn't aware of that. How do the system-wide proxy settings inhibit the online query?

The reason why socket was chosen was mainly the timeout specification. I couldn't figure out a way to get quick timeouts with requests, and I wanted to prevent the client by all means from hanging around for a minute just to find out that there is no internet connection.
For the same reason, the online field is calculated once and then stored, rather than calculated repeatedly.

And yes, no real unit tests. All the tests for the api_client are in the "integration" test module climada.test.test_api_client.
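
A genuine unit test could mock out the network call, though. For example (hypothetical, with an inline stand-in for the proposed check rather than the actual Client):

import unittest
from unittest import mock

import requests


def online_check(url="https://climada.ethz.ch", timeout=1):
    """Inline stand-in for the requests-based online check proposed here."""
    return requests.head(url, timeout=timeout).status_code == 200


class TestOnlineCheck(unittest.TestCase):

    @mock.patch("requests.head")
    def test_online(self, head_mock):
        # No real network traffic: requests.head is replaced by a mock.
        head_mock.return_value.status_code = 200
        self.assertTrue(online_check())
        head_mock.assert_called_once_with("https://climada.ethz.ch", timeout=1)

    @mock.patch("requests.head")
    def test_not_reachable(self, head_mock):
        head_mock.return_value.status_code = 503
        self.assertFalse(online_check())


if __name__ == "__main__":
    unittest.main()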

@peanutfun (Member, Author) commented:

The reason why socket was chosen was mainly the timeout specification.

Ah, I missed that! requests accepts a timeout parameter for all its queries, see https://requests.readthedocs.io/en/latest/api/. Fixed in 4339aeb

For the same reason the online field is calculated once and then stored and not calculated repeatedly.

We do not have this use case often, but I would rather have online calculated repeatedly, because a lot could happen to the internet connection between starting a client and requesting a download 😅 Wouldn't a timeout of 1 second be fine for that?

@peanutfun (Member, Author) commented:

Oh? I wasn't aware of that. How do the system-wide proxy settings inhibit the online query?

Sorry, I almost forgot this question 🙈 I noticed the issue on my MeteoSwiss machine, where I have to use their proxy. Downloading data actually works just fine, but the online query of the data API client fails. This is because the requests library follows environment variable settings like http_proxy by default, whereas socket does not. As the network settings of the MCH machines prohibit connecting to the internet without the proxy, a plain request to the target URL will fail. I am not sure how the proxy settings are incorporated into the request, but they somehow have to be 🤷
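
For illustration (not CLIMADA code; this assumes a machine behind a mandatory proxy):

import urllib.request

import requests

# Both the standard library and requests read proxy settings from environment
# variables such as http_proxy / https_proxy:
print(urllib.request.getproxies())

# Behind a mandatory proxy, this HEAD request goes through the configured
# proxy and succeeds ...
print(requests.head("https://climada.ethz.ch", timeout=1).status_code)

# ... while a direct connection such as
#   socket.create_connection(("climada.ethz.ch", 443), timeout=1)
# bypasses the proxy settings entirely and is blocked by the network.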

@emanuel-schmid (Collaborator) commented:

Wouldn't a timeout of 1 second be fine for that?

I still think we should stick to the current behavior, exactly because a lot can happen to the internet connection.
For those working offline, there is no benefit if they have to wait a second for every call to the API.
For those working with a wobbly connection there would be a benefit, even if it meant waiting for a minute. And in case the connection, by sheer luck, got established when they created the Client, they would wait for the default timeout with every call.

@peanutfun (Member, Author) commented:

I do not agree 😬 If you are working offline, requests immediately throws a ConnectionError, without waiting for the timeout. The timeout is the maximum time the client waits to connect to the server, if there is a connection to establish in the first place.

And in the other case, where a connection was there when the Client was created and later is not (or vice versa), the new behavior is vastly more useful.
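
For example (a sketch only; the behavior may differ depending on OS and network setup):

import requests

try:
    # 'timeout' caps how long we wait for the server once a connection can be
    # attempted at all.
    requests.head("https://climada.ethz.ch", timeout=1)
    print("online")
except requests.exceptions.ConnectionError:
    # Typically raised quickly when the machine is fully offline, e.g. because
    # DNS resolution already fails.
    print("offline")
except requests.exceptions.Timeout:
    # Raised when the server does not answer within the timeout.
    print("no answer within 1 s")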

@emanuel-schmid (Collaborator) commented:

Really? It doesn't for me. I just checked again (went offline and ran requests.head) and it took about 10 seconds. 🧐

@peanutfun (Member, Author) commented:

Wow, even with the latest timeout addition? 😳

@emanuel-schmid (Collaborator) commented:

Sadly yes, even with timeout=1. 😞

@peanutfun (Member, Author) commented:

You are working on Windows, right? Maybe the behavior differs for each OS. I'll roll it back to the 'static' attribute!

@emanuel-schmid (Collaborator) commented:

Maybe the behavior differs for each OS

Wouldn't surprise me.

emanuel-schmid and others added 2 commits February 10, 2023 17:02
Co-authored-by: Lukas Riedel <34276446+peanutfun@users.noreply.github.com>
@emanuel-schmid (Collaborator) commented:

I've just run a manual offline test and it failed:

from climada.util.api_client import Client
client = Client()
client.get_dataset_info(name=ds_name, status='test_dataset')

got this:

--------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\util\connection.py:72, in create_connection(address, timeout, source_address, socket_options)
     68     return six.raise_from(
     69         LocationParseError(u"'%s', label empty or too long" % host), None
     70     )
---> 72 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     73     af, socktype, proto, canonname, sa = res

File ~\miniconda3\envs\climada_py39\lib\socket.py:954, in getaddrinfo(host, port, family, type, proto, flags)
    953 addrlist = []
--> 954 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    955     af, socktype, proto, canonname, sa = res

gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
...

@peanutfun (Member, Author) commented:

Sorry, I can't follow that. What do you mean by "manual offline test"? Where does the error occur, and what is the complete traceback?

@emanuel-schmid (Collaborator) commented:

I just went offline and ran the 3 lines.
Complete traceback:

---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\util\connection.py:72, in create_connection(address, timeout, source_address, socket_options)
     68     return six.raise_from(
     69         LocationParseError(u"'%s', label empty or too long" % host), None
     70     )
---> 72 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     73     af, socktype, proto, canonname, sa = res

File ~\miniconda3\envs\climada_py39\lib\socket.py:954, in getaddrinfo(host, port, family, type, proto, flags)
    953 addrlist = []
--> 954 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    955     af, socktype, proto, canonname, sa = res

gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connectionpool.py:386, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    385 try:
--> 386     self._validate_conn(conn)
    387 except (SocketTimeout, BaseSSLError) as e:
    388     # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connectionpool.py:1042, in HTTPSConnectionPool._validate_conn(self, conn)
   1041 if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
-> 1042     conn.connect()
   1044 if not conn.is_verified:

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connection.py:358, in HTTPSConnection.connect(self)
    356 def connect(self):
    357     # Add certificate verification
--> 358     self.sock = conn = self._new_conn()
    359     hostname = self.host

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connection.py:186, in HTTPConnection._new_conn(self)
    185 except SocketError as e:
--> 186     raise NewConnectionError(
    187         self, "Failed to establish a new connection: %s" % e
    188     )
    190 return conn

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001C854DB8550>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
File ~\miniconda3\envs\climada_py39\lib\site-packages\requests\adapters.py:489, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    488 if not chunked:
--> 489     resp = conn.urlopen(
    490         method=request.method,
    491         url=url,
    492         body=request.body,
    493         headers=request.headers,
    494         redirect=False,
    495         assert_same_host=False,
    496         preload_content=False,
    497         decode_content=False,
    498         retries=self.max_retries,
    499         timeout=timeout,
    500     )
    502 # Send the request.
    503 else:

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    785     e = ProtocolError("Connection aborted.", e)
--> 787 retries = retries.increment(
    788     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    789 )
    790 retries.sleep()

File ~\miniconda3\envs\climada_py39\lib\site-packages\urllib3\util\retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    591 if new_retry.is_exhausted():
--> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPSConnectionPool(host='climada.ethz.ch', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001C854DB8550>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[5], line 2
      1 from climada.util.api_client import Client
----> 2 client = Client()
      3 client.get_dataset_info(name=ds_name, status='test_dataset')

File c:\users\me\git\climada_python\climada\util\api_client.py:284, in Client.__init__(self, cache_enabled)
    282 self.chunk_size = CONFIG.data_api.chunk_size.int()
    283 self.cache = Cacher(cache_enabled)
--> 284 self.online = self._online()

File c:\users\me\git\climada_python\climada\util\api_client.py:263, in Client._online(self)
    260 query_url = urlunsplit((parse_result.scheme, parse_result.netloc, "", "", ""))
    262 # NOTE: 'timeout' might not work as intended, depending on OS and network status
--> 263 return requests.head(query_url, timeout=1).status_code == 200

File ~\miniconda3\envs\climada_py39\lib\site-packages\requests\api.py:100, in head(url, **kwargs)
     89 r"""Sends a HEAD request.
     90 
     91 :param url: URL for the new :class:`Request` object.
   (...)
     96 :rtype: requests.Response
     97 """
     99 kwargs.setdefault("allow_redirects", False)
--> 100 return request("head", url, **kwargs)

File ~\miniconda3\envs\climada_py39\lib\site-packages\requests\api.py:59, in request(method, url, **kwargs)
     55 # By using the 'with' statement we are sure the session is closed, thus we
     56 # avoid leaving sockets open which can trigger a ResourceWarning in some
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~\miniconda3\envs\climada_py39\lib\site-packages\requests\sessions.py:587, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    582 send_kwargs = {
    583     "timeout": timeout,
    584     "allow_redirects": allow_redirects,
    585 }
    586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
    589 return resp

File ~\miniconda3\envs\climada_py39\lib\site-packages\requests\sessions.py:701, in Session.send(self, request, **kwargs)
    698 start = preferred_clock()
    700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
    703 # Total elapsed time of the request (approximately)
    704 elapsed = preferred_clock() - start

File ~\miniconda3\envs\climada_py39\lib\site-packages\requests\adapters.py:565, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    561     if isinstance(e.reason, _SSLError):
    562         # This branch is for urllib3 v1.22 and later.
    563         raise SSLError(e, request=request)
--> 565     raise ConnectionError(e, request=request)
    567 except ClosedPoolError as e:
    568     raise ConnectionError(e, request=request)

ConnectionError: HTTPSConnectionPool(host='climada.ethz.ch', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001C854DB8550>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

@peanutfun (Member, Author) commented:

Thanks! I guess we can simply wrap the request in a try/except block and return False if we catch a ConnectionError?
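
Something along these lines, for example (a sketch only; check_online is a placeholder name, and catching the broader RequestException rather than just ConnectionError also covers the timeout case):

import requests


def check_online(url="https://climada.ethz.ch", timeout=1):
    """Return True if a HEAD request to the API host succeeds, else False."""
    try:
        return requests.head(url, timeout=timeout).status_code == 200
    except requests.exceptions.RequestException:
        # Covers ConnectionError as well as timeouts and other request errors.
        return False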

Review comment on climada/util/api_client.py (outdated, resolved)
Co-authored-by: Lukas Riedel <34276446+peanutfun@users.noreply.github.com>
@emanuel-schmid merged commit 591f76b into develop on Feb 16, 2023
@emanuel-schmid (Collaborator) commented:

it's even faster now 😍

@emanuel-schmid deleted the feature/api-client-online-check branch on February 16, 2023.