
Images not pullable due to wrong cache info in redis #19695

Closed · chrifey opened this issue Dec 11, 2023 · 9 comments

chrifey commented Dec 11, 2023

Hi,

we are facing an issue that occurs from time to time and we are stuck with it.

Environment:
Harbor 2.8.4
RHEL 8.8
3 Node HA Setup
external redis (5.0.3) with redis sentinel
external postgres 13 with pgpool

harbor.yml (redacted ansible template)


external_database:
  harbor:
    host: {{ postgresPcpVIP }}
    port: 9999
    db_name: {{ postgresHarborDbName }}
    username: {{ postgresHarborUser }}
    password: {{ harborDbPassword }}
    ssl_mode: disable
    max_idle_conns: 2
    max_open_conns: 0
  notary_signer:
    host: {{ postgresPcpVIP }}
    port: 9999
    db_name: {{ postgresNotarySignerDbName }}
    username: {{ postgresHarborUser }}
    password: {{ harborDbPassword }}
    ssl_mode: disable
    max_idle_conns: 2
    max_open_conns: 0
  notary_server:
    host: {{ postgresPcpVIP }}
    port: 9999
    db_name: {{ postgresNotaryServerDbName }}
    username: {{ postgresHarborUser }}
    password: {{ harborDbPassword }}
    ssl_mode: disable
    max_idle_conns: 2
    max_open_conns: 0

{% with redisHosts = namespace(entities=[]) %}
{% for redisHost in ansible_play_hosts_all %}
{% set redisHosts.entities = redisHosts.entities + [redisHost] %}
{% endfor %}

external_redis:
  host: {{ redisHosts.entities | join(':26379,') }}:26379
{% endwith %}
  sentinel_master_set: redis-cluster
  registry_db_index: 1
  jobservice_db_index: 2
  chartmuseum_db_index: 3
  trivy_db_index: 5
  idle_timeout_seconds: 30

data_volume: /data

jobservice:
  max_job_workers: 10

notification:
  webhook_job_max_retry: 10

chart:
  absolute_url: disabled

log:
  level: info
  local:
    rotate_count: 50
    rotate_size: 200M
    location: /var/log/harbor

{% if harborTrivyEnable %}
trivy:
  ignore_unfixed: false
  skip_update: false
  insecure: false
{% endif %}

proxy:
  http_proxy:
  https_proxy:
  no_proxy:
  components:
    - core
    - jobservice
    - trivy

metric:
  enabled: true
  port: 9090
  path: /metrics

{% if harborS3StorageServiceEnabled %}
storage_service:
  s3:
    accesskey: {{ harborS3StorageService.accesskey }}
    secretkey: {{ harborS3StorageService.secretkey }}
    region: {{ harborS3StorageService.region }}
    regionendpoint: {{ harborS3StorageService.regionendpoint }}
    bucket: {{ harborS3StorageService.bucket }}
    encrypt: {{ harborS3StorageService.encrypt }}
    secure: {{ harborS3StorageService.secure }}
    v4auth: {{ harborS3StorageService.v4auth }}
    chunksize: {{ harborS3StorageService.chunksize }}
    multipartcopychunksize: {{ harborS3StorageService.multipartcopychunksize }}
    multipartcopymaxconcurrency: {{ harborS3StorageService.multipartcopymaxconcurrency }}
    multipartcopythresholdsize: {{ harborS3StorageService.multipartcopythresholdsize }}
    rootdirectory: {{ harborS3StorageService.rootdirectory }}
{% endif %}

The issue looks like the following:

  • Image is uploaded to the registry
  • Client pulls the image
    • From time to time the pull fails with a "filesystem layer verification failed for digest sha256:******cb5e" error

The problematic blob exists in the Postgres DB and also in the S3 storage (with the correct checksum):

harbor_int=> select * from artifact_blob where digest_blob = 'sha256:******cb5e';
   id    |                                digest_af                                |                               digest_blob                               |       creation_time
---------+-------------------------------------------------------------------------+-------------------------------------------------------------------------+----------------------------
 2331089 | sha256:******cb5e | sha256:******cb5e | 2023-11-06 11:50:35.855172
(1 row)

While debugging the logs we found that, in the problematic case, the proxy answers the HEAD and GET requests for the blob with "200 0", i.e. a 0-byte body.

We then searched in Redis and are now fairly sure that all cases of non-pullable images are caused by incorrect cache information in Redis. In every case we investigated, the Redis entry for a blob hash of a "defective" image contained the value "0" for the key "size". After deleting the Redis entry, the image is immediately pullable again.
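
For reference, a minimal sketch of the check and the workaround (deleting the stale entry), assuming the cache keys follow the blobs::sha256:<digest> pattern shown further down in this thread and that the registry cache lives in Redis DB 1 (our registry_db_index); the digest is a placeholder:

# inspect the cached metadata for the affected blob
redis-cli -n 1 HGETALL "blobs::sha256:<digest>"

# if the "size" field is "0" although the blob is present in Postgres and S3,
# drop the stale entry; afterwards the image was immediately pullable again for us
redis-cli -n 1 DEL "blobs::sha256:<digest>"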

The issue we now have is that we do not know under which circumstances this happens or how to debug it further.

We would be happy about any suggestions.

@wy65701436 (Contributor)

What's the Harbor version, and which Redis key did you use to query the size?

wy65701436 self-assigned this Dec 11, 2023

chrifey (Author) commented Dec 11, 2023

Hey, Harbor 2.8.4 and Redis 5.0.3 (as shipped with RHEL 8.8).

alexanderdalloz commented Dec 11, 2023

While the problem already occurred on Harbor 2.8.4, we have meanwhile updated to v2.9.1-5cbb1b01 and are still seeing the issue.

It happened again today.

The 0-byte response from Harbor is easy to see in the proxy.log:

Dec 11 10:20:49 internal.system.tld proxy[2067]: 10.11.132.150 - "GET /v2/project/repo/blobs/sha256:63f8139b85337c153baccf7ff1ef62dd36b18052ab2dfb1f4a4c93af1439f943 HTTP/1.1" 200 0 "-" "containers/5.21.1 (github.com/containers/image)" 0.055 0.055 .

Querying the blob sha256:63f8139b85337c153baccf7ff1ef62dd36b18052ab2dfb1f4a4c93af1439f943 via echo "HGETALL ${blob_hash}" | redis-cli -x -n 1 confirms that Redis is the source of the misinformation in the key size.

That blob was listed in the Redis registry_db_index with size 0.

At that time a replication was running (it runs every 10 minutes) and produced lots of log entries. Could this collide in some way?

@alexanderdalloz

Another example.

# tail -F /opt/logs/harbor/proxy.log | grep --line-buffered -E 'GET.*HTTP/1.1" 200 0 '
Dec 12 09:13:50 172.18.0.1 proxy[1996]: 10.11.132.150 - "GET /v2/pcd/vault-status/blobs/sha256:04d1bc68688901b4cf1a0fa4602fdf5a432162f9b189e5ad11d88390d62406d6 HTTP/1.1" 200 0 "-" "Go-http-client/1.1" 0.017 0.017 .
# echo 'HGETALL blobs::sha256:04d1bc68688901b4cf1a0fa4602fdf5a432162f9b189e5ad11d88390d62406d6' | redis-cli -x -n 1
1) "mediatype"
2) "application/octet-stream"
3) "size"
4) "0"
5) "digest"
6) "sha256:04d1bc68688901b4cf1a0fa4602fdf5a432162f9b189e5ad11d88390d62406d6"

alexanderdalloz commented Dec 18, 2023

@wy65701436 any hint on what to check?
In one Harbor instance we currently store 202323 artifacts, with roughly 1200 new ones per day. At that volume the issue occurs frequently, several times per day. On the other instances with fewer new images it happens less often.
The problem occurs with images of all sizes.

kzynn commented Jan 18, 2024

In the meantime we have investigated further and found more information about the circumstances that lead to the issue.

It occurs when an image is pushed and then pulled immediately afterwards (only a few seconds apart). However, the problem only shows up when the image is pulled via the Docker image digest (not the manifest digest) or when one of the image layers is pulled directly via its layer digest.

These failing pulls (response HTTP 200 with size 0 in the log, as mentioned above) seem to leave the corresponding faulty entries in the Redis cache. But we still don't really know why these pulls fail in the first place or where the wrong size info originates from.
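
For illustration, a rough sketch of the second trigger path (fetching a layer directly by its digest right after the push); the registry host, project, image and digest are placeholders, and it assumes basic auth is accepted on the v2 API:

# push an image, then request one of its layers by digest within a few seconds
docker push harbor.example.com/myproject/myimage:latest

# LAYER is a layer digest of the pushed image, e.g. taken from
# "docker manifest inspect harbor.example.com/myproject/myimage:latest"
LAYER='sha256:<layer-digest>'
curl -s -o /dev/null -w '%{http_code} %{size_download}\n' \
  -u "$HARBOR_USER:$HARBOR_PASS" \
  "https://harbor.example.com/v2/myproject/myimage/blobs/$LAYER"
# a failing case prints "200 0" here, matching the proxy.log entries above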

@alexanderdalloz

We had temporarily deactivated the automatic vulnerability scan after push for selected projects, hoping that this would prevent the issue, but it still occurred.
Please advise what to do, as this issue is a real pain: it happens quite often on our Harbor instance with ~2,500 new image pushes per day.

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions bot added the Stale label Mar 20, 2024

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.

github-actions bot closed this as not planned Apr 20, 2024