Skip to content

Conversation

@ericstj
Copy link
Member

@ericstj ericstj commented Sep 4, 2020

Fix #39009
Ensure an initial non-concurrent call to OpenLDAP

OpenLDAP requires a single call for initialization before any other concurrent call.

This fixes asserts and segfaults we were seeing when calling OpenLDAP concurrently.

@ericstj ericstj requested a review from joperezr September 4, 2020 18:10
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ericstj ericstj force-pushed the ds.p-concurrent.init branch from dbdff59 to bde6e65 Compare September 4, 2020 21:36
@ericstj
Copy link
Member Author

ericstj commented Sep 5, 2020

All checks have passed

I think I'm going to need to give up on getting this to repro in CI. It's very rare. I was able to get a repro in WSL locally. Rather than calling through S.DS.P I just put the ldap_init PInvoke directly in a test program and used the same method as this test and had it run for 10,000 iterations. I was able to hit one version of this crash:

dotnet: ../nptl/pthread_mutex_lock.c:79: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
./test.sh: line 1: 18048 Aborted                 (core dumped) ( dotnet /mnt/c/scratch/ds.p.linux/bin/Debug/netcoreapp3.1/ds.p.linux.dll )

I'm trying a few times without a fix to make sure I can repro then will try with the fix to make sure it goes away.

@joperezr
Copy link
Member

joperezr commented Sep 5, 2020

Yeah I was going to suggest doing PInvoke directly to at least confirm this is the cause, but having that regression test would have been nice. We do a bit if work on that constructor and run several operations before that call so I do believe it will be really har ld to get consistent repro going directly from S.DS.P unfortunately.

@ericstj ericstj force-pushed the ds.p-concurrent.init branch from 9e96c91 to 7dbc4b1 Compare September 5, 2020 23:13
@ericstj
Copy link
Member Author

ericstj commented Sep 5, 2020

So without the fix I hit this about 1 / 3000 times when directly invoking the PInvoke. After the fix I don't hit it at all in 10000 iterations. I'm running more to confirm I no longer repro.

@ericstj ericstj marked this pull request as ready for review September 5, 2020 23:13
@danmoseley
Copy link
Member

Probably want to minimize the diff for porting purposes

@ericstj
Copy link
Member Author

ericstj commented Sep 5, 2020

Probably want to minimize the diff for porting purposes

I did specifically consider this. The reason for the larger diff was that this library did not create a class specific to libldap. Rather than put a static constructor on the Interop class (which could conflict with other usage) I moved all the libLdap Pinvokes to their own class. This change was done in a way that "if it compiles, it is correct" at least from the rename perspective.

I could make a smaller change in release that adds the static constructor to the "Interop" class. at the moment this library only uses libldap, if you think a smaller diff is worth having the potential maintenance issue in the future.

OpenLDAP requires a single call for initialization before any other concurrent call.

This fixes asserts and segfaults we were seeing when calling OpenLDAP concurrently.
@danmoseley
Copy link
Member

Ah - I see. Makes sense.

@joperezr
Copy link
Member

joperezr commented Sep 7, 2020

While I'm fine with the change proposed here, another option if we want to minimize the changes and still not add a static constructor to Interop class would be to move the static constructor one level up to LdapConnection class right? This is the primary one that interacts with libldap anyway.

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments but looks good! Thanks so much for fixing!

@ericstj
Copy link
Member Author

ericstj commented Sep 8, 2020

So 150K iterations after fix and no-repro after seeing about once every 3K before locally. I think this fixed it. Will merge in master and open RC2 port. Will test more in RC2.

@ericstj ericstj merged commit c15455a into dotnet:master Sep 8, 2020
@ericstj
Copy link
Member Author

ericstj commented Sep 8, 2020

/backport to release/5.0-rc2

@github-actions
Copy link
Contributor

github-actions bot commented Sep 8, 2020

Started backporting to release/5.0-rc2: https://github.com/dotnet/runtime/actions/runs/243980448

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

__pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed. running DirectoryServices tests on Linux coreclr

5 participants