[BUG] After upgrade to 3005.3, failure to fetch latest pillar data after being updated in git pillar repo branch #65467

Open
Ikramromdhani opened this issue Oct 26, 2023 · 27 comments
Labels
Bug broken, incorrect, or confusing behavior Pillar Regression The issue is a bug that breaks functionality known to work in previous releases.

Comments

@Ikramromdhani

Description of Issue

A minion/master is unable to get new pillar data from an alternative branch in git after it was updated in the remote git pillar repository. This was working with 3005.1, but broke after upgrading to 3005.3. It also seems to be the case for those using 3006.3. Please refer to the community discussion: https://saltstackcommunity.slack.com/archives/C7K04SEJC/p1697576330136199

Setup

Amazon Linux 2 Salt master, configured with a git repository for states and pillar data.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce Issue

Master config

ext_pillar:
- git:
  - __env__ ssh://git@xxxxxxxxxxxxxxxxx/repo.git:
    - root: pillar
git_pillar_pubkey: xxxxxxxxxxxxxxxx
git_pillar_privkey: xxxxxxxxxxxxxxxxxxx
git_pillar_update_interval: 60
pillar_merge_lists: True

Push an update to an existing pillar called test in repo.git on branch testpillar. On a minion, run salt-call pillar.get test pillarenv=testpillar

Expected behavior

The new value of pillar test should be visible in the output of the salt call on the minion.

Actual behavior

The minion gets the old data for pillar test. Please note that when executing the same command after deleting /var/cache/salt/master/git_pillar, the minion gets the correct value.

Versions Report

Salt Version:
          Salt: 3005.3

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: 1.5.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: 1.10.1
        Python: 3.9.18 (main, Sep 18 2023, 18:18:39)
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: amzn 2
        locale: utf-8
       machine: x86_64
       release: 4.14.318-241.531.amzn2.x86_64
        system: Linux
       version: Amazon Linux 2

@OrangeDog
Contributor

3005.2 changed how git pillars were cached to fix a security issue. Having to clear the cache after upgrading is probably expected.

@OrangeDog OrangeDog added Bug broken, incorrect, or confusing behavior Regression The issue is a bug that breaks functionality known to work in previous releases. Pillar needs-triage labels Oct 30, 2023
@HL-SaltBae

Having to clear the cache after upgrading is probably expected.

This does NOT only affect the pillar cache across upgrades.

With a running 3006.3 master, the master will NOT refresh a git_pillar branch after the first time that branch is fetched.

Steps to test:

  1. Have git_pillar configured with __env__ like in the bug report
  2. Create a new branch in the git_pillar repo
  3. Have any minion pull the pillar data using pillar.items pillarenv=<branchname>
  4. Notice that the data aligns with what's in the branch
  5. Commit a new change to the branch in the git_pillar repo
  6. Do any or all of the following:
    • Wait for the master loop_interval to expire and observe that the master log file indicates that the git_pillar remote is up-to-date
    • run sudo salt-run git_pillar.update on the master and observe that the return for the git_pillar repo in question is "None"
    • restart the master
    • run saltutil.refresh_pillar on the minion in question
  7. Repeat pulling pillar data with any minion using pillar.items pillarenv=<branchname>
  8. Notice that the minion does NOT receive new pillar data from the new commit on the branch, and instead returns the same pillar data that was on the branch when the master first fetched the branch
  9. Repeat 5,6,7 as many times as you want
  10. Ultimately, stop the master process, rm -rf /var/cache/salt/master/git_pillar, and restart the master process
  11. Go to step 3 and repeat forever
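
For anyone who wants to script the check, here is a minimal sketch of steps 6-8 using the same commands listed above. It is meant to be run on the master and assumes salt-call and salt-run are on the PATH; the branch name is a placeholder for your own repo.

# Minimal sketch of steps 6-8 above, run on the master.
# Assumes salt-call and salt-run are on the PATH; BRANCH is a placeholder.
import subprocess

BRANCH = "testpillar"  # the git_pillar branch / pillarenv under test


def run(cmd):
    """Run a command, echo it, and return its stdout for comparison."""
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.strip())
    return result.stdout


# Step 6: ask the master to update the git_pillar remotes.
run(["salt-run", "git_pillar.update"])

# Steps 7-8: pull the pillar data for the branch; with the bug present this
# keeps returning the data from the first fetch of the branch, not the new commit.
run(["salt-call", "pillar.items", f"pillarenv={BRANCH}"])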

@Ikramromdhani
Author

Ikramromdhani commented Nov 3, 2023

Are there any chances we may get a fix for this soon? It affects both 3006 and 3005, and it was introduced by the CVE patches, so it makes sense to have it fixed in both versions. This kind of issue makes it harder to upgrade away from Salt versions with security problems while keeping everything working as desired.

@HL-SaltBae

@cmcmarrow do you have any insight into this? I think this regression happened as a result of your CVE patch 6120bca

@ITJamie
Contributor

ITJamie commented Jan 31, 2024

There's been no GitHub activity on @cmcmarrow's account since September. Safe to say he's no longer involved in the project ☹️

@Ikramromdhani
Author

I would like to add another detail: we tried the newly released version 3005.5, which contains additional CVE patches, and we noticed that the pillar refresh issue is partially fixed and now happens only on branches containing /, for example feature/test/pillar. I hope this helps fix the issue.

@HL-SaltBae

I had wondered if / in the branch name might be a factor, but someone on the Slack said they tested that and it happened with or without the /. Good news if true!

@dmurphy18
Contributor

@Ikramromdhani Can you upgrade to the latest Salt 3006 and retry? Salt 3005 passed its bug-fix end of support (August 2023) and is about to run out of CVE support on Feb 25th 2024, 24 days from now.

At this stage nothing will be done for Salt 3005 given that it is dead in 24 days; that is hardly enough time for a debug/test/fix/release cycle, especially since we would have to build classic packages, and that alone takes a couple of days.

@dmurphy18 dmurphy18 self-assigned this Feb 1, 2024
@dmurphy18 dmurphy18 added this to the Sulfur v3006.7 milestone Feb 1, 2024
@dmurphy18
Contributor

dmurphy18 commented Feb 2, 2024

@Ikramromdhani Rereading this, I remember Chad talking about how caching was changed so that the cache is now only updated when it should be, rather than as before, where the cache was needlessly updated as a side effect of other commands being run. I need to go back and find the actual commit, etc.; it was last year, around autumn.

And yup, Chad is gone, moved on to greener pastures, so I am reading up on GitFS, etc. to get fully familiar (I have to cover more areas since there are fewer people on the core team now). I will update next week with the code lines that changed, but I expect you were relying on a side effect of an operation that has since been fixed (namely, incorrectly updating the cache). I will get the full details for you.

That said, that doesn't mean there isn't an error too.

@donkopotamus

donkopotamus commented Feb 15, 2024

We are also observing this error after recently upgrading from salt-3004 to salt-3006.4:

We have observed that, when we use __env__ to effectively map pillarenv to a branch name:

  • if we create a new branch feature/testing that is not already in the cache, this will be pulled correctly; and
  • if we then add a new commit to it, or amend HEAD, the changes will not be pulled.

However, if we create a new branch named testing, then it will be pulled, and subsequent changes to it will also be pulled. That is, it seems that only branches with a / in them do not update. Unfortunately, we use branch names like release/* or feature/*, as many people do, so it's a severe issue for us at the moment!

(This evidence conflicts with #65467 (comment), although the commenter had not actually confirmed it themselves)

Note:

  • We are using GitPython rather than pygit2
  • We do not see this issue with gitfs ... changes to our state files are pulled ... it is only git_pillar

@Ikramromdhani
Author

Ikramromdhani commented Feb 15, 2024

@donkopotamus, that is what I was about to confirm. I tested today with 3006.6 using pygit2 and still have issues pulling pillar changes when the pillarenv branch has a / in it.
@dmurphy18 any clues whether this can make it into 3006.7 or 3007?

@HL-SaltBae

I observed that the cache directory organizes cached branches by foldername=branchname, and that a / in the branch name does result in a subfolder being created.

For example, with three pillar branches:

  • foo
  • feature/bar
  • feature/boo

you might see a structure like this:

cache_dir/
    foo/top.sls
    feature/
        bar/top.sls
        boo/top.sls

If this turns out to be the actual issue, then the obvious solution would appear to be escaping the / in branch names or converting it to a different separator, for example:

cache_dir/
    foo/top.sls
    feature\/bar/top.sls
    feature\/boo/top.sls

or

cache_dir/
    foo/top.sls
    feature-bar/top.sls
    feature-boo/top.sls

or

cache_dir/
    foo/top.sls
    feature-SLASH-bar/top.sls
    feature-SLASH-boo/top.sls
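
To make that concrete, here is a small sketch of how such an encoding could work; sanitize_branch is a hypothetical helper for illustration only, not existing Salt code.

# Illustration only: two of the encodings suggested above, mapping a branch
# name to a single, flat cache directory component. Not Salt code.
import os


def sanitize_branch(branch, scheme="dash"):
    """Return a branch name with '/' converted so it is one path segment."""
    if scheme == "dash":
        return branch.replace("/", "-")        # feature/bar -> feature-bar
    if scheme == "token":
        return branch.replace("/", "-SLASH-")  # feature/bar -> feature-SLASH-bar
    raise ValueError(f"unknown scheme: {scheme}")


cache_dir = "/var/cache/salt/master/git_pillar"
for branch in ("foo", "feature/bar", "feature/boo"):
    print(os.path.join(cache_dir, sanitize_branch(branch)))
# Without sanitizing, os.path.join(cache_dir, "feature/bar") produces the
# nested feature/ subdirectory shown in the layout above.

Note that the plain "-" scheme can collide with a branch literally named feature-bar, which is why a more explicit token such as -SLASH- may be the safer choice.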

@dmurphy18
Contributor

dmurphy18 commented Feb 15, 2024

@Ikramromdhani It won't make it into Salt 3006.7; that is close to release and waiting on clean tests in the pipeline. Possibly 3007, but it is on my backlog; I am dealing with #65816, which may be related (similar code changes for pillarenv went in at the same time for gitfs locks), and am writing tests for that fix now.
So it is on the block, but there are limited hands since the buyout. I'll get to it when I can.

@max-arnold
Contributor

Not sure if it helps, but it looks like the CVE fix in 3005.2, and then #65017 in 3005.3 which mitigated a negative effect of that fix, probably contributed to this issue.

@dmurphy18
Contributor

dmurphy18 commented Feb 15, 2024

Yup, the set of changes that Chad made, and that had to be adjusted, applied to the gitfs changes as well as the pillarenv and cache changes, etc. That is why I expect some linkage to the gitfs changes.
Those changes are in the 3006.x branches too.

@Ikramromdhani
Author

Ikramromdhani commented Mar 6, 2024

Just a heads up that we tested 3006.7 and the problem is still there. Any chance of getting a fix for this in Salt 3007? This problem is holding us back from performing Salt upgrades.

@Ikramromdhani
Author

@dmurphy18 since the fix did not make it into 3007, I would just like to ask whether there is a specific version to target? Thank you.

@dmurphy18
Contributor

@Ikramromdhani Working on the fix for Git FS, and Git Pillar is right after it. The fix for Git FS is done, but I have to refactor a new test after PR review (#65937); after that I will work on Git Pillar, since it will possibly be affected by the GitFS changes.

@dwoz dwoz modified the milestones: Sulfur v3006.8, Sulfur v3006.9 May 1, 2024
@Ikramromdhani
Author

Do we have a timeline on when this fix will be released? We have been trying to upgrade since last year and this is the only issue holding us back. I checked the release notes for 3006.9 and did not see it included there, so I was wondering whether it is going to be included in 3006.10, and if so, do we know when that will be released? This way we can plan our upgrade accordingly.
Thank you!

@dmurphy18
Contributor

@Ikramromdhani Actually working on this at the moment; it would help to just check that it still occurs on 3006.9. I am attempting to set up a similar environment.

@mdschmitt

I'm seeing it in 3006.9, yes.

@Ikramromdhani
Author

Thank you for confirming, @mdschmitt. @dmurphy18 would the fix be part of 3006.10? And if so, do we have a timeline for when the release will take place? I tried to look in the documentation and was not able to find a release calendar or announcements.

@dmurphy18
Contributor

dmurphy18 commented Oct 8, 2024

@Ikramromdhani Sorry, but this is on the back-burner and I have to get back to it; given the reduced size of the team, some fireballs got thrown which had to be handled. Be assured it is on the list, and the fix is half done, but tests remain to be written, if memory serves. Apologies, but there is only so much time in the day with a smaller team, and a laptop rebuild didn't help either.

@dmurphy18
Contributor

dmurphy18 commented Oct 11, 2024

Interestingly, running steps 1-6 above gave the following, which could explain why removing /var/cache/salt/master/git_pillar clears the problem, and why the problem is not related to parsing but something else:

[root@ip-10-2-15-188 ~]# salt-run git_pillar.update
[WARNING ] git_pillar_global_lock is enabled and update lockfile /var/cache/salt/master/git_pillar/work/ytJQkKKFhtVSWNNtBJtuZsiCG5luCEeu2TYadrIBNq0=/master/update.lk is present for git_pillar remote '__env__ https://github.com/dmurphy18/test_repo.git' on machine_id ec213523339d64177a5b48e038e84fbb with pid '116273'. Process 116273 obtained the lock for machine_id ec213523339d64177a5b48e038e84fbb, current machine_id ec213523339d64177a5b48e038e84fbb
[WARNING ] Update lock file is present for git_pillar remote '__env__ https://github.com/dmurphy18/test_repo.git', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/git_pillar/work/ytJQkKKFhtVSWNNtBJtuZsiCG5luCEeu2TYadrIBNq0=/master/update.lk or running 'salt-run cache.clear_git_lock git_pillar type=update' will allow updates to continue for this remote.
__env__ https://github.com/dmurphy18/test_repo.git:
    False
[root@ip-10-2-15-188 ~]#
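
The warning text itself names two remedies: deleting the update.lk file or running salt-run cache.clear_git_lock git_pillar type=update. Below is a rough sketch of scripting that recovery path using only the commands quoted in the warning; the retry flow is illustrative, not an official procedure.

# Sketch: clear a stuck git_pillar update lock and retry, using the exact
# commands mentioned in the warning above. Illustrative only; confirm the
# lock really is stale before clearing it on a production master.
import subprocess


def salt_run(*args):
    """Run a salt-run command and print its output."""
    result = subprocess.run(["salt-run", *args], capture_output=True, text=True)
    print(result.stdout.strip())
    return result


# First attempt: with a stale lock this logs the warning and returns False.
salt_run("git_pillar.update")

# Clear the update lock as the warning suggests...
salt_run("cache.clear_git_lock", "git_pillar", "type=update")

# ...then try the update again.
salt_run("git_pillar.update")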

I am using a simplified /etc/salt/master, since using ssh only complicates access:

ext_pillar:
  - git:
    - __env__ https://github.com/dmurphy18/test_repo.git:
      - root: pillar

pillar_merge_lists: True
git_pillar_provider: gitpython

test_repo.git has main and test_pillar branches.

[root@ip-10-2-15-188 ~]# salt-call --local test.version
local:
    3006.9+174.ga90e6b3b23
[root@ip-10-2-15-188 ~]# 

@dmurphy18
Contributor

Given this appears to be an issue with '/' in the name, I think the changes from PR #65017 to salt/utils/gitfs.py could be the culprit. The added code:

        if hasattr(self, "name"):
            self._cache_basehash = self.name
        else:
            hash_type = getattr(hashlib, self.opts.get("hash_type", "md5"))
            # We loaded this data from yaml configuration files, so, its safe
            # to use UTF-8
            self._cache_basehash = str(
                base64.b64encode(hash_type(self.id.encode("utf-8")).digest()),
                encoding="ascii",  # base64 only outputs ascii
            ).replace(
                "/", "_"
            )  # replace "/" with "_" to not cause trouble with file system

whereas previously it always took the hash-and-encode path and replaced '/' with '_'. Here, when self.name is used directly, it is not sanitized, and functions such as os.listdir will interpret the '/' as a path separator.

This is just a thought that I wanted to record and not forget; I am not saying it is the problem, just a probable cause.
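
For illustration, here is a minimal sketch of the shape a fix along those lines could take, i.e. sanitizing the name branch the same way the hashed id is sanitized. cache_basehash below is a standalone stand-in for illustration, not the actual Salt method or patch.

# Possible shape of a fix, for illustration only (not the actual Salt patch):
# give the configured name the same '/' -> '_' treatment the hashed id gets,
# so the cache base hash is always a single path segment.
import base64
import hashlib


def cache_basehash(name, remote_id, hash_type_name="md5"):
    """Return a cache directory component that contains no path separators."""
    if name is not None:
        # Sanitize the name branch as well, instead of using it verbatim.
        return name.replace("/", "_")
    hash_type = getattr(hashlib, hash_type_name)
    digest = hash_type(remote_id.encode("utf-8")).digest()
    # base64 only outputs ASCII; '/' is replaced so it cannot split the path.
    return str(base64.b64encode(digest), encoding="ascii").replace("/", "_")


print(cache_basehash("feature/bar", "https://example.invalid/repo.git"))
# -> feature_bar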

@dmurphy18
Contributor

dmurphy18 commented Oct 25, 2024

@Ikramromdhani Working away on this but wanting to make sure I am addressing everything.

Please refer to community discussion: https://saltstackcommunity.slack.com/archives/C7K04SEJC/p1697576330136199

Unfortunately the contents of that Slack discussion are gone, and I was wondering if you have them saved anywhere they could be shared with me. I am writing tests and want to make sure I have all the bases covered.

Well, I found an archive, and there was nothing much in it apart from this bug.
Can you detail the issue as you see it, and the exact steps to reproduce it in your environment? Otherwise I will only end up fixing @HL-SaltBae's problem. I want to make sure I fix both; from the above it seems that it is fixed unless you have a / embedded in the branch name.
