
Images not pullable due to wrong cache info in redis #19695

Closed · chrifey opened this issue Dec 11, 2023 · 9 comments

chrifey commented Dec 11, 2023

Hi,

we are facing an issue that occurs from time to time and we are stuck with it.

Environment:
Harbor 2.8.4
RHEL 8.8
3 Node HA Setup
external redis (5.0.3) with redis sentinel
external postgres 13 with pgpool

harbor.yml (redacted ansible template)


external_database:
  harbor:
    host: {{ postgresPcpVIP }}
    port: 9999
    db_name: {{ postgresHarborDbName }}
    username: {{ postgresHarborUser }}
    password: {{ harborDbPassword }}
    ssl_mode: disable
    max_idle_conns: 2
    max_open_conns: 0
  notary_signer:
    host: {{ postgresPcpVIP }}
    port: 9999
    db_name: {{ postgresNotarySignerDbName }}
    username: {{ postgresHarborUser }}
    password: {{ harborDbPassword }}
    ssl_mode: disable
    max_idle_conns: 2
    max_open_conns: 0
  notary_server:
    host: {{ postgresPcpVIP }}
    port: 9999
    db_name: {{ postgresNotaryServerDbName }}
    username: {{ postgresHarborUser }}
    password: {{ harborDbPassword }}
    ssl_mode: disable
    max_idle_conns: 2
    max_open_conns: 0

{% with redisHosts = namespace(entities=[]) %}
{% for redisHost in ansible_play_hosts_all %}
{% set redisHosts.entities = redisHosts.entities + [redisHost] %}
{% endfor %}

external_redis:
  host: {{ redisHosts.entities | join(':26379,') }}:26379
{% endwith %}
  sentinel_master_set: redis-cluster
  registry_db_index: 1
  jobservice_db_index: 2
  chartmuseum_db_index: 3
  trivy_db_index: 5
  idle_timeout_seconds: 30

data_volume: /data

jobservice:
  max_job_workers: 10

notification:
  webhook_job_max_retry: 10

chart:
  absolute_url: disabled

log:
  level: info
  local:
    rotate_count: 50
    rotate_size: 200M
    location: /var/log/harbor

{% if harborTrivyEnable %}
trivy:
  ignore_unfixed: false
  skip_update: false
  insecure: false
{% endif %}

proxy:
  http_proxy:
  https_proxy:
  no_proxy:
  components:
    - core
    - jobservice
    - trivy

metric:
  enabled: true
  port: 9090
  path: /metrics

{% if harborS3StorageServiceEnabled %}
storage_service:
  s3:
    accesskey: {{ harborS3StorageService.accesskey }}
    secretkey: {{ harborS3StorageService.secretkey }}
    region: {{ harborS3StorageService.region }}
    regionendpoint: {{ harborS3StorageService.regionendpoint }}
    bucket: {{ harborS3StorageService.bucket }}
    encrypt: {{ harborS3StorageService.encrypt }}
    secure: {{ harborS3StorageService.secure }}
    v4auth: {{ harborS3StorageService.v4auth }}
    chunksize: {{ harborS3StorageService.chunksize }}
    multipartcopychunksize: {{ harborS3StorageService.multipartcopychunksize }}
    multipartcopymaxconcurrency: {{ harborS3StorageService.multipartcopymaxconcurrency }}
    multipartcopythresholdsize: {{ harborS3StorageService.multipartcopythresholdsize }}
    rootdirectory: {{ harborS3StorageService.rootdirectory }}
{% endif %}

The issue looks like the following:

  • Image is uploaded to the registry
  • Client pulls the image
    • From time to time the pull fails with a "filesystem layer verification failed for digest sha256:******cb5e" error

The problematic blob exists in the Postgres DB and also in the S3 storage (with the correct checksum):

harbor_int=> select * from artifact_blob where digest_blob = 'sha256:******cb5e';
   id    |                                digest_af                                |                               digest_blob                               |       creation_time
---------+-------------------------------------------------------------------------+-------------------------------------------------------------------------+----------------------------
 2331089 | sha256:******cb5e | sha256:******cb5e | 2023-11-06 11:50:35.855172
(1 row)

While debugging the logs we found that, in the problematic case, the proxy answers the HEAD and GET requests for the blob with "200 0", i.e. a 0-byte body.

We then searched in Redis and are now fairly sure that all cases of non-pullable images are caused by incorrect cache information in Redis. In every case we investigated, the Redis entry for a blob hash of a "defective" image contained the value "0" for the key "size". After deleting the Redis entry, the image is immediately pullable again.
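
For reference, a minimal sketch of the check and the workaround (deleting the stale entry), assuming the cache keys follow the blobs::sha256:<digest> pattern shown further down in this thread and that the registry cache lives in Redis DB 1 (our registry_db_index); the digest is a placeholder:

# inspect the cached metadata for the affected blob
redis-cli -n 1 HGETALL "blobs::sha256:<digest>"

# if the "size" field is "0" although the blob is present in Postgres and S3,
# drop the stale entry; afterwards the image was immediately pullable again for us
redis-cli -n 1 DEL "blobs::sha256:<digest>"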

The issue we now have is that we do not know under which circumstances this happens or how to debug it further.

We would be happy about any suggestions.

@wy65701436 (Contributor)

What's the Harbor version, and which Redis key did you use to query the size?

wy65701436 self-assigned this Dec 11, 2023

chrifey (Author) commented Dec 11, 2023

Hey, Harbor 2.8.4 and Redis 5.0.3 (as shipped with RHEL 8.8).

alexanderdalloz commented Dec 11, 2023

While the problem already occurred on Harbor 2.8.4, we have meanwhile updated to v2.9.1-5cbb1b01 and are still seeing the issue.

It happened again today.

The 0-byte response from Harbor is easy to see in the proxy.log:

Dec 11 10:20:49 internal.system.tld proxy[2067]: 10.11.132.150 - "GET /v2/project/repo/blobs/sha256:63f8139b85337c153baccf7ff1ef62dd36b18052ab2dfb1f4a4c93af1439f943 HTTP/1.1" 200 0 "-" "containers/5.21.1 (github.com/containers/image)" 0.055 0.055 .

Querying the blob sha256:63f8139b85337c153baccf7ff1ef62dd36b18052ab2dfb1f4a4c93af1439f943 via echo "HGETALL ${blob_hash}" | redis-cli -x -n 1 confirms that Redis is the source of the misinformation in the key size.

That blob was listed in the Redis registry_db_index with size 0.

At that time a replication was running (it runs every 10 minutes) and produced lots of log entries. Could this collide in some way?

@alexanderdalloz

Another example.

# tail -F /opt/logs/harbor/proxy.log | grep --line-buffered -E 'GET.*HTTP/1.1" 200 0 '
Dec 12 09:13:50 172.18.0.1 proxy[1996]: 10.11.132.150 - "GET /v2/pcd/vault-status/blobs/sha256:04d1bc68688901b4cf1a0fa4602fdf5a432162f9b189e5ad11d88390d62406d6 HTTP/1.1" 200 0 "-" "Go-http-client/1.1" 0.017 0.017 .
# echo 'HGETALL blobs::sha256:04d1bc68688901b4cf1a0fa4602fdf5a432162f9b189e5ad11d88390d62406d6' | redis-cli -x -n 1
1) "mediatype"
2) "application/octet-stream"
3) "size"
4) "0"
5) "digest"
6) "sha256:04d1bc68688901b4cf1a0fa4602fdf5a432162f9b189e5ad11d88390d62406d6"

alexanderdalloz commented Dec 18, 2023

@wy65701436 any hint on what to check?
In one Harbor instance we currently store 202323 artifacts, with roughly 1200 new ones per day. At that volume the issue occurs frequently, several times per day. On the other instances with fewer new images it happens less often.
The problem occurs with images of all sizes.

kzynn commented Jan 18, 2024

In the meantime we have investigated further and found more information about the circumstances that lead to the issue.

It occurs when an image is pushed and then pulled immediately afterwards (only a few seconds apart). However, the problem only shows up when the image is pulled via the Docker image digest (not the manifest digest) or when one of the image layers is pulled directly via its layer digest.

These failing pulls (response HTTP 200 with size 0 in the log, as mentioned above) seem to leave the corresponding faulty entries in the Redis cache. But we still don't really know why these pulls fail in the first place or where the wrong size info originates from.
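
For illustration, a rough sketch of the second trigger path (fetching a layer directly by its digest right after the push); the registry host, project, image and digest are placeholders, and it assumes basic auth is accepted on the v2 API:

# push an image, then request one of its layers by digest within a few seconds
docker push harbor.example.com/myproject/myimage:latest

# LAYER is a layer digest of the pushed image, e.g. taken from
# "docker manifest inspect harbor.example.com/myproject/myimage:latest"
LAYER='sha256:<layer-digest>'
curl -s -o /dev/null -w '%{http_code} %{size_download}\n' \
  -u "$HARBOR_USER:$HARBOR_PASS" \
  "https://harbor.example.com/v2/myproject/myimage/blobs/$LAYER"
# a failing case prints "200 0" here, matching the proxy.log entries above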

@alexanderdalloz

We had temporarily deactivated the automatic vulnerability scan after push for selected projects, hoping that this would prevent the issue, but it still occurred.
Please advise what to do, as this issue is a real pain: it happens quite often on our Harbor instance with ~2,500 new image pushes per day.

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions bot added the Stale label Mar 20, 2024

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.

github-actions bot closed this as not planned Apr 20, 2024