Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LMDB corruption recovery -- follow-up #3880

Merged
merged 7 commits into from
Oct 28, 2019

Conversation

vpodzime
Copy link
Contributor

@vpodzime vpodzime commented Oct 8, 2019

@vpodzime vpodzime added the WIP Work in Progress label Oct 8, 2019
@lgtm-com
Copy link

lgtm-com bot commented Oct 8, 2019

This pull request introduces 10 alerts and fixes 25 when merging 3aebce9d951c23b54a9e52239a61dae44a8a9546 into 4ebcf91 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 1 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 25 for Pointer argument is dereferenced without checking for NULL

@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch 2 times, most recently from a4eb259 to 3f1a347 Compare October 10, 2019 12:54
@lgtm-com
Copy link

lgtm-com bot commented Oct 10, 2019

This pull request introduces 12 alerts and fixes 25 when merging 3f1a347e7c043ff64f010cb490c3f335c422045e into e9b38c1 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 3 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 25 for Pointer argument is dereferenced without checking for NULL

@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch from 3f1a347 to 5f1b7b7 Compare October 18, 2019 15:25
@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch from 5f1b7b7 to 6a9bf01 Compare October 20, 2019 14:56
@vpodzime vpodzime removed the WIP Work in Progress label Oct 20, 2019
@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch 5 times, most recently from b7c1c67 to 300930a Compare October 21, 2019 13:40
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging 300930a65cdf649862fd8434c907ee474e6ba982 into 7b2938d - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch 3 times, most recently from 137b4a5 to e370fa5 Compare October 21, 2019 15:05
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging e370fa51f9ef7c16314cfb6c622effe8f3535496 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch from e370fa5 to 31f0b45 Compare October 21, 2019 18:40
C doesn't support exceptions so in the LMDB corruption handling
callback we have no way to backtrack to the code that run the
LMDB operation that hit the corruption and tell it that things
went wrong. Our only chance is to exit the process. To avoid our
exit handlers from potentially messing up the freshly-repaired
LMDB file, we have to either use _exit() or make sure the LMDB
file is not touched by the exit handlers. This change implements
the latter.

We also use special exit codes to signal that LMDB corruption was
detected and either repaired or failed to be reparied.

Ticket: CFE-3127
Changelog: None
We need to make sure the file is repaired for future use.
@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch from 31f0b45 to b5ee6a3 Compare October 21, 2019 18:45
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging b5ee6a332f818d784eb1c8834e1f7c384a375586 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch from b5ee6a3 to e36cf72 Compare October 21, 2019 19:14
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging e36cf72ce436b86825318bbf98a39c43c0ad0fe7 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

olehermanse
olehermanse previously approved these changes Oct 21, 2019
Copy link
Member

@olehermanse olehermanse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good! (Please make sure it passes jenkins before merging)

cf-agent/cf-agent.c Outdated Show resolved Hide resolved
cf-execd/cf-execd.c Outdated Show resolved Hide resolved
cf-check/repair.c Outdated Show resolved Hide resolved
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging 2e12ad980a547d10324d4e2ccca762e5ec467fc3 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

@vpodzime vpodzime force-pushed the master-lmdb_recover_avoid__exit branch from 2e12ad9 to 9e3c789 Compare October 22, 2019 08:24
@lgtm-com
Copy link

lgtm-com bot commented Oct 22, 2019

This pull request introduces 11 alerts and fixes 1 when merging 9e3c7899fa39388114310beff99a6b3171aa8a23 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

olehermanse
olehermanse previously approved these changes Oct 22, 2019
SIGBUS means unaligned memory access or an attempt to access area
outside of mmap(). In most cases for us this means an LMDB
corruption. Thus if a process gets SIGBUS, it touches a flag file
that triggers a DB repair attempt in the next agent or cf-execd
run.

Ticket: CFE-3127
Changelog: Received SIGBUS now triggers a repair of local DBs
repair_lmdb_file() is the function that repairs the given LMDB
file and thus it should be the same function recording the repair
timestamp. However, some complexities are out of its scope. Let's
make it either take a file descriptor for the repair timestamp
file, assuming that file locking and such stuff is taken care of,
or do the simple repair timestamp recording on its own.

Also treat its return code in a more appropriate way -- only a
real failure should be considered failure. If the data
recovery/replication fails, but the LMDB file is moved out of the
way (removed), it should be considered as successful repair.
File locking in libntech is now a bit more abstract and
versatile.
@lgtm-com
Copy link

lgtm-com bot commented Oct 23, 2019

This pull request introduces 10 alerts and fixes 1 when merging 3817090 into 36379af - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 1 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions

@vpodzime
Copy link
Contributor Author

@cf-bottom run this throught he new shiny jenkins, please! Incl. exotics.

@cf-bottom
Copy link

Alright, I triggered a build:

Build Status

(with exotics)

Jenkins: https://ci.cfengine.com/job/pr-pipeline/4006/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-4006/

@olehermanse
Copy link
Member

@cf-bottom One more jenkins, with exotics.

@cf-bottom
Copy link

@olehermanse olehermanse merged commit f110266 into cfengine:master Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants