
Skipping page https://pypi.org/simple/idna/ because of Content-Type: unknown #5345

Closed
njsmith opened this issue Apr 28, 2018 · 34 comments
Labels
auto-locked (Outdated issues that have been locked by automation) · C: cache (Dealing with cache and files in it) · type: bug (A confirmed bug or unintended behavior)

Comments

@njsmith
Member

njsmith commented Apr 28, 2018

  • Pip version: 10.0.1
  • Python version: Python 3.5
  • Operating system: MacOS

Description:

We've been hitting an intermittent error in downloading packages from PyPI using pip 10.0.1. So far it seems to only hit our MacOS, python 3.5 CI system (not python 3.6 on MacOS, and not any versions on other OSes). The actual package varies, but we always get an error like

  Could not find a version that satisfies the requirement pyOpenSSL (from -r test-requirements.txt (line 4)) (from versions: )

or

  Could not find a version that satisfies the requirement pytest>=3.3 (from -r test-requirements.txt (line 1)) (from versions: )

We finally managed to trigger the error with pip's -vvv flag enabled, and it said:

  Looking up "https://pypi.org/simple/idna/" in the cache
  Current age based on date: 31182
  Freshness lifetime from max-age: 600
  Freshness lifetime from request max-age: 600
  https://pypi.org:443 "GET /simple/idna/ HTTP/1.1" 304 0
  Skipping page https://pypi.org/simple/idna/ because of Content-Type: unknown
  Could not find a version that satisfies the requirement idna (from trio==0.4.0+dev) (from versions: )

(Full log here)

So to my ignorant eye, it looks like:

  • pip found a cache entry for the https://pypi.org/simple/idna/ page
  • It asked pypi if the cache was accurate, and pypi said "304 Not Modified", i.e., yeah, go ahead, use your cache
  • Then pip somehow decided that... the cached entry, maybe, had "Content-Type: unknown" and freaked out?

I don't know how to debug further, so... help?

(our bug: python-trio/trio#508)

@mithrandi

FWIW we're seeing the same thing on Debian Linux, only with PyPy. If this has something to do with the state of the cache, the OS/interpreter may be a red herring and some file only used on that platform is corrupt in the cache.

@njsmith
Member Author

njsmith commented Apr 28, 2018

@mithrandi is yours also intermittent and affecting different packages each time? Seems like a weird pattern for a corrupt state file, but could be...

@njsmith
Member Author

njsmith commented Apr 28, 2018

@mithrandi also... I notice that I'm getting this on our jenkins builder – if you're hitting it in twisted (?), then maybe it's in buildbot? I'm wondering if it might have something to do with multiple pips accessing the same cache simultaneously, in CI systems that have less isolation between builds than like, Travis-CI does.

@mithrandi

mithrandi commented Apr 28, 2018

@njsmith This is my day job's CI; it's a Docker container started by Drone.io. We run two concurrent builds (in separate containers, but sharing the same cache directory), one on CPython 2.7 (which I haven't seen fail this way) and one on PyPy (which seems to be failing quite frequently as of late), so your theory about concurrent access seems to hold up.

EDIT: Also, it seems to be totally different packages each time, so the concurrency seems more likely to be at fault than a permanently corrupt something in the cache.

@alex
Member

alex commented Apr 29, 2018

We've occasionally seen this in CI for pyca/cryptography, but on our Windows builders. Our windows builders are the only ones which share the cache between multiple concurrent builds, so that seems related 🤔

@njsmith
Member Author

njsmith commented Apr 29, 2018

Pip has some nice options for controlling cache access, like --cache-dir=$WHATEVER and --no-cache-dir, so I guess we might be able to check this hypothesis by e.g. disabling the cache in the build that's working and seeing if that fixes the build that's breaking. But that's tricky with an intermittent issue like this b/c it's hard to know whether it's fixed or whether you just got lucky :-/

@ssbarnea
Contributor

ssbarnea commented May 3, 2018

Apparently this problem started right after the CDN migration last month and I would be really surprised if it were not linked to it. On our CI system I identified at least 130 failures, and the errors look like https://gist.github.com/ssbarnea/78df2a80137cca04ca72872b1d14498f#file-pypi-pip-failure-log-L34

The package name is irrelevant, and running again does not reproduce the same error.

Disabling the local cache would be a disaster, as we have a huge number of builds and we would end up putting a very high load on the CDN. Also, I doubt the platform has anything to do with this, as I can confirm that it happens on RHEL/CentOS/MacOS.

@JackLeo

JackLeo commented May 3, 2018

We identified that the problem occurs when multiple jobs run pip install at the same time on the same node. As a workaround, we use pip's --cache-dir option and each unique step in the pipeline uses its own cache dir (/tmp/<pipeline>_<step_name>/). This bloats disk space use, but at least it's somewhat faster and does not produce load on the CDN.
Under Unix, pip should also respect XDG_CACHE_HOME, so you might just need to export a single variable on each step. I will let you know if the problem re-emerges with this in place.

@ssbarnea
Contributor

ssbarnea commented May 3, 2018

I think I found a temporary workaround for avoiding this bug on Jenkins slaves with multiple workers: define XDG_CACHE_HOME=$HOME/.cache/$EXECUTOR_NUMBER in the master configuration.

This means that each executor will have a different cache directory, so they avoid clashing with each other. As the number of executors is limited, this also limits the increase in disk space needed.

I hope my comment will not be taken as a reason for not fixing this issue as soon as possible, as this bug really does break CI/CD systems.

@di
Member

di commented May 3, 2018

I'm not sure how meaningful this is, but I noticed that in @ssbarnea's log, pip doesn't actually seem to be initiating a new connection to PyPI to get the 304.

Here's a debug log I reproduced when the cache is stale and pip reaches back out to PyPI and gets a 304:

1 location(s) to search for versions of deprecation:
* https://pypi.org/simple/deprecation/
Getting page https://pypi.org/simple/deprecation/
Looking up "https://pypi.org/simple/deprecation/" in the cache
Current age based on date: 10000
Freshness lifetime from max-age: 600
Freshness lifetime from request max-age: 600
Starting new HTTPS connection (1): pypi.org
https://pypi.org:443 "GET /simple/deprecation/ HTTP/1.1" 304 0
Analyzing links from page https://pypi.org/simple/deprecation/
...

Here's the log from https://gist.github.com/ssbarnea/78df2a80137cca04ca72872b1d14498f#file-pypi-pip-failure-log-L34:

1 location(s) to search for versions of deprecation:
* https://pypi.org/simple/deprecation/
Getting page https://pypi.org/simple/deprecation/
Looking up "https://pypi.org/simple/deprecation/" in the cache
Current age based on date: 6767
Freshness lifetime from max-age: 600
Freshness lifetime from request max-age: 600
https://pypi.org:443 "GET /simple/deprecation/ HTTP/1.1" 304 0
Skipping page https://pypi.org/simple/deprecation/ because of Content-Type: unknown
...

Note the lack of a Starting new HTTPS connection (1): pypi.org from urllib3.

@di
Member

di commented May 3, 2018

I'd be curious if this is reproducible with 9.0.3; it looks like two of the related vendored libraries (cachecontrol and requests) got updated since then.

@ssbarnea
Contributor

ssbarnea commented May 4, 2018

@di Most of our jobs upgrade pip first and do everything else after, so this problem reproduces with the latest pip, 10.0.1.

@pradyunsg
Member

If someone could check whether pip 9.0.3 also faces this issue, that would be nice.

That would narrow the cause of this issue down to something in pip 10.

@jdufresne
Contributor

The CI system I work with sees this bug with pip 9.0.1. I also tested with 9.0.3 and confirmed it is affected too.

@vrillusions

vrillusions commented May 4, 2018

I'm also seeing this issue, but with a couple of differences. First off, here's the part of our job that inits the virtualenv and then does pip install:

if [[ ! -d .venv ]]; then
    virtualenv .venv
fi
source .venv/bin/activate

set -x
rm -f pip-install.log
pip install --log pip-install.log --upgrade pip setuptools
pip install --log pip-install.log --build .venv/tmp -r requirements.txt

The jobs are set up to clean the working directory before each run (so essentially the .venv directory is erased each time, since it's in the .gitignore file). The difference with ours is that it's running on a fairly quiet Jenkins server; no other jobs are running at the same time. I haven't had time to investigate. Another workaround I'm considering is running a local proxy server (see https://packaging.python.org/guides/index-mirrors-and-caches/): since package versions won't change often, once it connects successfully it shouldn't hit the issue again. I'm going to try ssbarnea's suggestion of XDG_CACHE_HOME=$HOME/.cache/$EXECUTOR_NUMBER and see if that helps.

(edit)
Well, disregard that; I just remembered we have one job that runs every 10 minutes, but it's hidden from the view I was using. So it's likely the same multiple-jobs-running-at-the-same-time issue.

@cjerdonek
Member

Here's the line where this is happening, FYI:

# The check for archives above only works if the url ends with
# something that looks like an archive. However that is not a
# requirement of an url. Unless we issue a HEAD request on every
# url we cannot know ahead of time for sure if something is HTML
# or not. However we can check after we've downloaded it.
content_type = resp.headers.get('Content-Type', 'unknown')
if not content_type.lower().startswith("text/html"):
    logger.debug(
        'Skipping page %s because of Content-Type: %s',
        link,
        content_type,
    )
    return

@cjerdonek
Member

I don't claim to be an expert on this, but after searching, it seems like it's actually the recommended / expected behavior for 304 responses not to include a "Content-Type" header (which is triggering the "unknown"):

The server generating a 304 response MUST generate any of the following header fields that would have been sent in a 200 (OK) response to the same request: Cache-Control, Content-Location, Date, ETag, Expires, and Vary.

Since the goal of a 304 response is to minimize information transfer when the recipient already has one or more cached representations, a sender SHOULD NOT generate representation metadata other than the above listed fields unless said metadata exists for the purpose of guiding cache updates (e.g., Last-Modified might be useful if the response does not have an ETag field).

(from https://tools.ietf.org/html/rfc7232#section-4.1 )

If this is the case, then does it mean the CDN serving the 304 is behaving differently from before?
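
This is easy to check against PyPI directly with a raw conditional request (a minimal sketch using requests, outside pip; it assumes the 200 response for the index page carries an ETag to revalidate against):

import requests

url = "https://pypi.org/simple/idna/"

# First request: a normal 200 with a Content-Type (and, assumed, an ETag).
first = requests.get(url)
print(first.status_code, first.headers.get("Content-Type"))

# Revalidation: ask the server whether our cached copy is still current.
etag = first.headers.get("ETag")
second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
print(second.status_code)                   # 304 if unchanged
print(second.headers.get("Content-Type"))   # typically absent on a 304, per the RFC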

@di
Member

di commented May 4, 2018

Presumably any pip that is getting a 304 has already gotten the page (since it knows the age) and thus has it in cache, so cachecontrol should discard the 304 and return the original 200 from cache (with a content type). This is what usually happens.

What seems to be happening is that either:

  • the 304 response is making it into the cache somehow, or
  • when multiple jobs are running, one of them is able to bypass cachecontrol and get the actual 304 response that the other has received.

This is speculation on my part however, I haven't been able to reproduce this.

We could rule out the first case by inspecting pip's http cache to see if any 304s are in there.
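
A minimal sketch of such an inspection (assumptions: pip's default HTTP cache location on Linux, ~/.cache/pip/http, and the "cc=4," msgpack serialization format that cachecontrol's file cache writes):

import msgpack
from pathlib import Path

# Assumption: default pip cache location on Linux; adjust for other platforms.
cache_root = Path.home() / ".cache" / "pip" / "http"

for path in cache_root.rglob("*"):
    if not path.is_file():
        continue
    blob = path.read_bytes()
    if not blob.startswith(b"cc=4,"):        # cachecontrol's serialization prefix
        continue
    entry = msgpack.unpackb(blob[len(b"cc=4,"):], raw=False)
    resp = entry.get("response", {})
    status = resp.get("status")
    ctype = resp.get("headers", {}).get("Content-Type")
    if status == 304 or ctype is None:
        print(path, status, ctype)           # any hit here would point at the first case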

Perhaps the second case has something to do with connection pooling?

@njsmith
Member Author

njsmith commented May 4, 2018

If I had a reliable way to reproduce this, I'd modify my pip so that when it happened, it dumped a lot more information about what exactly pip thinks it's looking at (like, all the headers and body). I suspect that seeing the bad data would help us narrow down where it was coming from.
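
If it helps anyone trying to reproduce, the same caching stack can be driven outside pip with the standalone requests and CacheControl packages (a sketch; pip vendors its own copies of both, but it is the same library underneath). Point it at a throwaway cache directory and dump whatever comes back:

import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

# Hypothetical throwaway cache directory for the experiment.
session = CacheControl(requests.Session(), cache=FileCache(".pip-debug-cache"))

resp = session.get("https://pypi.org/simple/idna/")
print(resp.status_code)                   # cachecontrol should always hand back a 200 here,
print(resp.headers.get("Content-Type"))   # even when the wire exchange was a 304 revalidation
print(dict(resp.headers))                 # dump everything, per the suggestion above

Running it twice, more than ten minutes apart (the page's max-age is 600 per the logs above), forces the revalidation path that pip is hitting.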

pradyunsg added the type: bug (A confirmed bug or unintended behavior) and C: cache (Dealing with cache and files in it) labels on May 6, 2018
@ssbarnea
Contributor

ssbarnea commented May 8, 2018

The problem is still here, and it seems that my workaround doesn't work, or at least doesn't always work, as I still see failures after implementing it.

10:56:04 Collecting chardet<3.1.0,>=3.0.2 (from requests!=2.12.2,!=2.13.0,>=2.10.0->infrared==2.0.1.dev2219)
10:56:04   1 location(s) to search for versions of chardet:
10:56:04   * https://pypi.python.org/simple/chardet/
10:56:04   Getting page https://pypi.python.org/simple/chardet/
10:56:04   Looking up "https://pypi.python.org/simple/chardet/" in the cache
10:56:04   Returning cached "301 Moved Permanently" response (ignoring date and etag information)
10:56:04   Looking up "https://pypi.org/simple/chardet/" in the cache
10:56:04   Current age based on date: 1874
10:56:04   Freshness lifetime from max-age: 600
10:56:04   Freshness lifetime from request max-age: 600
10:56:04   https://pypi.org:443 "GET /simple/chardet/ HTTP/1.1" 304 0
10:56:04   Skipping page https://pypi.python.org/simple/chardet/ because of Content-Type: unknown
10:56:04   Could not find a version that satisfies the requirement chardet<3.1.0,>=3.0.2 (from requests!=2.12.2,!=2.13.0,>=2.10.0->infrared==2.0.1.dev2219) (from versions: )
10:56:04 Cleaning up...
10:56:04 
No matching distribution found for chardet<3.1.0,>=3.0.2 (from requests!=2.12.2,!=2.13.0,>=2.10.0->infrared==2.0.1.dev2219)

But here is the surprise: the place where this happened was a job that was running tox targets in parallel!

I bet I will be able to recreate the issue easily now.

@ssbarnea
Contributor

ssbarnea commented May 8, 2018

Well, I did a few tests and it seems that I am unable to replicate the error with parallel execution of tox.

I managed to find the cached response, here is its content:

0000000: 6363 3d34 2c82 a872 6573 706f 6e73 6587  cc=4,..response.
0000010: a768 6561 6465 7273 de00 10ab 5265 7472  .headers....Retr
0000020: 792d 4166 7465 72a1 30ab 582d 5365 7276  y-After.0.X-Serv
0000030: 6564 2d42 79b2 6361 6368 652d 6463 6131  ed-By.cache-dca1
0000040: 3737 3337 2d44 4341 ae43 6f6e 7465 6e74  7737-DCA.Content
0000050: 2d4c 656e 6774 68a3 3132 32a4 4461 7465  -Length.122.Date
0000060: bd54 7565 2c20 3137 2041 7072 2032 3031  .Tue, 17 Apr 201
0000070: 3820 3230 3a33 393a 3337 2047 4d54 aa43  8 20:39:37 GMT.C
0000080: 6f6e 6e65 6374 696f 6ea5 636c 6f73 65b0  onnection.close.
0000090: 582d 5853 532d 5072 6f74 6563 7469 6f6e  X-XSS-Protection
00000a0: ad31 3b20 6d6f 6465 3d62 6c6f 636b b658  .1; mode=block.X
00000b0: 2d43 6f6e 7465 6e74 2d54 7970 652d 4f70  -Content-Type-Op
00000c0: 7469 6f6e 73a7 6e6f 736e 6966 66a8 4c6f  tions.nosniff.Lo
00000d0: 6361 7469 6f6e d920 6874 7470 733a 2f2f  cation. https://
00000e0: 7079 7069 2e6f 7267 2f73 696d 706c 652f  pypi.org/simple/
00000f0: 6368 6172 6465 742f ac58 2d43 6163 6865  chardet/.X-Cache
0000100: 2d48 6974 73a1 30d9 2158 2d50 6572 6d69  -Hits.0.!X-Permi
0000110: 7474 6564 2d43 726f 7373 2d44 6f6d 6169  tted-Cross-Domai
0000120: 6e2d 506f 6c69 6369 6573 a46e 6f6e 65ad  n-Policies.none.
0000130: 4163 6365 7074 2d52 616e 6765 73a5 6279  Accept-Ranges.by
0000140: 7465 73a7 582d 4361 6368 65a3 4849 54a3  tes.X-Cache.HIT.
0000150: 5669 61ab 312e 3120 7661 726e 6973 68af  Via.1.1 varnish.
0000160: 582d 4672 616d 652d 4f70 7469 6f6e 73a4  X-Frame-Options.
0000170: 6465 6e79 ac43 6f6e 7465 6e74 2d54 7970  deny.Content-Typ
0000180: 65b8 7465 7874 2f68 746d 6c3b 2063 6861  e.text/html; cha
0000190: 7273 6574 3d55 5446 2d38 a653 6572 7665  rset=UTF-8.Serve
00001a0: 72a7 5661 726e 6973 68a4 626f 6479 c47a  r.Varnish.body.z
00001b0: 3c68 746d 6c3e 3c68 6561 643e 3c74 6974  <html><head><tit
00001c0: 6c65 3e33 3031 204d 6f76 6564 2050 6572  le>301 Moved Per
00001d0: 6d61 6e65 6e74 6c79 3c2f 7469 746c 653e  manently</title>
00001e0: 3c2f 6865 6164 3e3c 626f 6479 3e3c 6365  </head><body><ce
00001f0: 6e74 6572 3e3c 6831 3e33 3031 204d 6f76  nter><h1>301 Mov
0000200: 6564 2050 6572 6d61 6e65 6e74 6c79 3c2f  ed Permanently</
0000210: 6831 3e3c 2f63 656e 7465 723e 3c2f 626f  h1></center></bo
0000220: 6479 3e3c 2f68 746d 6c3e a673 7461 7475  dy></html>.statu
0000230: 73cd 012d a672 6561 736f 6eb1 4d6f 7665  s..-.reason.Move
0000240: 6420 5065 726d 616e 656e 746c 79a6 7374  d Permanently.st
0000250: 7269 6374 c3ae 6465 636f 6465 5f63 6f6e  rict..decode_con
0000260: 7465 6e74 c2a7 7665 7273 696f 6e0b a476  tent..version..v
0000270: 6172 7980                                ary.

@dstufft
Member

dstufft commented May 8, 2018

That looks like the cache for https://pypi.python.org/simple/chardet/, but we actually need the one for https://pypi.org/simple/chardet/.

@alex
Member

alex commented May 8, 2018

@ssbarnea FWIW you probably want to pipe that to xxd so we can get the exact bytes.

@ssbarnea
Contributor

ssbarnea commented May 9, 2018

@alex @dstufft I updated the dump of the cached entry. I hope you will be able to come up with a hotfix for it, as these issues cause lots of problems with our builds.

My XDG_CACHE_HOME=$HOME/.cache/$EXECUTOR_NUMBER workaround avoided the problem with parallel builds BUT it did nothing on builds that are running (tox) targets in parallel (via detox or just bash and &/wait).

rhosqeauto pushed a commit to redhat-openstack/infrared that referenced this issue May 9, 2018
Replaces parallel tox execution with
standard sequential execution.

Keeps a comment pointing to the pip bug that
made us disable parallel execution.

Removes bash code that was duplicating
parallel execution support, a feature provided by
the tox companion module detox (which shares the same issue tracker as tox).

Change-Id: I835d0b87fa74bfe0a00a6ee0deb985912129e029
See: pypa/pip#5345
@di
Member

di commented May 9, 2018

@ssbarnea That still looks like the cache for https://pypi.python.org/simple/chardet/ (a redirect), but we need the cache for https://pypi.org/simple/chardet/.

rhosqeauto pushed a commit to redhat-openstack/infrared that referenced this issue May 9, 2018
Replaces parallel tox execution with
standard sequential execution.

Keeps a comment pointing to the pip bug that
made us disable parallel execution.

Removes bash code that was duplicating
parallel execution support, a feature provided by
the tox companion module detox (which shares the same issue tracker as tox).

Change-Id: I835d0b87fa74bfe0a00a6ee0deb985912129e029
See: pypa/pip#5345
(cherry picked from commit 94ed037)
@njsmith
Member Author

njsmith commented May 10, 2018

BTW, in case anyone else is foolish like me and tries to use XDG_CACHE_HOME on MacOS: it doesn't work, pip doesn't respect that on MacOS.

I think setting PIP_CACHE_DIR=... should work on all OSes. (Setting PIP_NO_CACHE_DIR=yes ought to work, but is broken: #5385)

@ssbarnea
Contributor

Any chance of getting this fixed? Even implementing the workaround in Jenkins seems to be a real struggle because of a newly found bug.

I ended up doing a very ugly workaround in a shared library which overrides sh and injects this fix:

      # workaround for pip cache isolation between executors, see https://github.com/pypa/pip/issues/5345
      if [[ -z ${XDG_CACHE_HOME+x} || ${XDG_CACHE_HOME} =~ \$ ]]; then
        if [ ! -z ${EXECUTOR_NUMBER+x} ]; then
          XDG_CACHE_HOME=${HOME}/.cache/${EXECUTOR_NUMBER}
          export XDG_CACHE_HOME
          echo "INFO: isolated XDG_CACHE_HOME to ${XDG_CACHE_HOME}"
        fi
      fi

This is because in pipeline job types Jenkins no longer interpolates environment variables. See https://issues.jenkins-ci.org/browse/JENKINS-51260

... or we can all set PIP_NO_CACHE_DIR=yes, add 5-10 minutes to each job, and put a huge load on the PyPI CDN.

@cjerdonek
Member

cjerdonek commented May 12, 2018

If you compare the log messages leading up to the problem:

Freshness lifetime from max-age: 600
Freshness lifetime from request max-age: 600
https://pypi.org:443 "GET /simple/idna/ HTTP/1.1" 304 0

with the code:

if 'max-age' in cc:
    freshness_lifetime = cc['max-age']
    logger.debug('Freshness lifetime from request max-age: %i',
                 freshness_lifetime)

if 'min-fresh' in cc:
    min_fresh = cc['min-fresh']
    # adjust our current age by our min fresh
    current_age += min_fresh
    logger.debug('Adjusted current age from min-fresh: %i',
                 current_age)

# Return entry if it is fresh enough
if freshness_lifetime > current_age:
    logger.debug('The response is "fresh", returning cached response')
    logger.debug('%i > %i', freshness_lifetime, current_age)
    return resp

# we're not fresh. If we don't have an Etag, clear it out
if 'etag' not in headers:
    logger.debug(
        'The cached response is "stale" with no etag, purging'
    )
    self.cache.delete(cache_url)

# return the original handler
return False

it appears that the response causing the problem isn't from the cache.

Otherwise, I'd expect to see a message like the following in the logs:

The response is "fresh", returning cached response.

Also, the docstring for this method says, "Return a cached response if it exists in the cache, otherwise return False." And in this case, from the logs, we can see it's returning False.

@dstufft
Member

dstufft commented May 12, 2018 via email

@ssbarnea
Contributor

There is something that I do not really understand: clearly the problem appears when pip gets a 404 (not found) message, which is clearly not something that should ever be cached.

@njsmith
Member Author

njsmith commented May 12, 2018

It's a 304, not a 404.

Pip is looking at the cache, and finding an entry, but the entry is old so pip doesn't trust that it's still accurate. So it asks pypi "hey, have there been any changes to this page since (date)?" and pypi says "304" which means "nope, it's still the same, so you can go ahead and use the version you already have, no need to download it again".

@timgraham

I also see this issue intermittently on djangoci.com.

@wondernath

Any update on this issue? I am finding similar issues too

olovy added a commit to libris/xl_auth that referenced this issue Aug 27, 2019
Use separate pip cache dir per worker and parallel step
see pypa/pip#5345
@chrahunt
Member

This looks the same as #6970, based on:

  1. response returned is 304
  2. the returned response is being passed back to pip, seemingly disregarding the "cached" data
  3. this seems to happen on concurrent cache usage

We determined that the likely cause is that the file-based CacheControl cache is not safe for concurrent read/write. See my comment here for the specific sequence of events that can lead to this.

This should be a lot less likely in 19.3+ because we implemented our own (safer) mechanism for updating the file-based cache; however, it is still possible if an external process deletes files from the cache.
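
For context, the usual way to make a file-based cache update safe against concurrent readers is write-to-a-temporary-file-then-rename, so a reader never observes a half-written entry; a minimal sketch of that idea (not necessarily pip's exact implementation):

import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, then rename it over the
    # destination; concurrent readers see either the old file or the new
    # one, never a partially written entry.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)   # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise

Even with atomic replacement, an external process deleting cache files can still race with a reader, which matches the remaining caveat above.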

I will close this eagerly in favor of tracking the "complete" solution in #6970, but if anyone has evidence that this is different, please speak up!

lock bot added the auto-locked (Outdated issues that have been locked by automation) label on Nov 23, 2019
lock bot locked as resolved and limited conversation to collaborators on Nov 23, 2019