Skip to content

Conversation

@joerchan
Copy link
Contributor

@joerchan joerchan commented May 18, 2021

Fix deadlock when db_hash_commit has to wait for the delayed work to
finish. This creates a deadlock if the delayed work for database hash
calculation needs to store the hash since the settings API is locked
when calling the commit callback.
Remove call to k_work_cancel_delayable_sync from db_hash_commit in order
to avoid the deadlock. Instead move comparing of the stored hash to the
delayed work and reschedule the work with no wait.

Fixes: #35058

@github-actions github-actions bot added area: Bluetooth area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels May 18, 2021
@joerchan joerchan force-pushed the bt-fix-hash-settings-deadlock branch from 0707a37 to 927d0cb Compare May 19, 2021 07:22
@joerchan joerchan marked this pull request as ready for review May 19, 2021 07:23
@joerchan joerchan requested review from Vudentz and jhedberg as code owners May 19, 2021 07:23
@joerchan joerchan requested a review from pabigot May 19, 2021 07:23
@joerchan joerchan force-pushed the bt-fix-hash-settings-deadlock branch from 927d0cb to be16bbc Compare May 19, 2021 07:37
@joerchan joerchan added the bug The issue is a bug, or the PR is fixing a bug label May 19, 2021
Copy link
Contributor

@pabigot pabigot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The settings API's use of "load", "set", and "commit" seem to be non-intuitive. I don't understand what this code does.

But that's irrelevant. This appears to eliminate the deadlock. But it may violate user expectations that settings_commit() will block until the settings have actually been committed (whatever that means). The effect of the commit will be delayed until the work item is completed, no?

From the perspective of eliminating the deadlock this seems fine. I can't speak to whether the resulting behavior is acceptable.

I do believe there's a data race involving stored_hash but that wouldn't be new.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the semantics of settings_commit allow this? The commit will not have completed by the time the call returns because it doesn't wait. Unless setting the LOAD flag ensures that any operation that depends on the commit will detect the incomplete operation and complete it before the work handler does.

If that makes any sense and is also correct perhaps a comment could be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The settings API's use of "load", "set", and "commit" seem to be non-intuitive. I don't understand what this code does.

You get a "set" for each value stored, and then "commit" once all values have been loaded, i.e no more "set" callbacks.

The value has been "commited" as we have the stored_hash value. The next thing we wish to do is to compare it against the calculated hash value, and store the new value if they are different. That don't have to be part of the the commit here.

So we can push this to the db_hash work, since this was already the work responsible for the calculation.
As long as peers that read the hash characteristic value receive the correct value then this works as expected.

Can you elaborate on the data race on stored_hash?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on the data race on stored_hash?

I don't see anything that prevents one thread from accessing it via db_hash_set() while the work thread is accessing it via db_hash_process(). Only a problem when preemptable threads are involved or SMP is present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be strictly sequential, db_hash_set is only called before db_hash_commit, and db_hash_work would only read the stored_hash when it has been submitted from db_hash_commit due to the flag set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still a data race if a second set/commit occurs in a thread that preempts the work thread. Unlikely, but not impossible, unless all set/commits come from something that runs on the same work queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settings load should only happen once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the cancel fail? If so what happens?

Copy link
Contributor Author

@joerchan joerchan May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would now be from a work handler so shouldn't fail in being canceled, right?

EDIT:
If it would fail then we would send out a service changed indication to all connected peers. That would be uneccesary as they would have to rediscover a database that hasn't changed. But having peers connected at this early in the startup phase is very unlikely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it can fail, and we need to do something, we should check the return value and do whatever.

If you don't think we need to do anything, I think we should cast the return value to (void) to make it explicit.

Copy link
Contributor

@pabigot pabigot May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, preemption of the work thread and SMP could both create a situation where the cancel could fail.

https://docs.zephyrproject.org/latest/reference/kernel/threads/workqueue.html#check-return-values shows recommended documentary practices. This comment provides more background on the recommendation and why a cast to (void) is not useful by itself.

Fix deadlock when db_hash_commit has to wait for the delayed work to
finish. This creates a deadlock if the delayed work for database hash
calculation needs to store the hash since the settings API is locked
when calling the commit callback.
Remove call to k_work_cancel_delayable_sync from db_hash_commit in order
to avoid the deadlock. Instead move comparing of the stored hash to the
delayed work and reschedule the work with no wait.

Signed-off-by: Joakim Andersson <joakim.andersson@nordicsemi.no>
@joerchan joerchan force-pushed the bt-fix-hash-settings-deadlock branch from be16bbc to 1d498db Compare May 21, 2021 10:05
Comment on lines 251 to +253
uint8_t hash[16];
#if defined(CONFIG_BT_SETTINGS)
uint8_t stored_hash[16];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a GATT_DB_HASH_SIZE or similar macro?

@carlescufi
Copy link
Member

@mniestroj could you please check if this solves the issue?

@joerchan
Copy link
Contributor Author

mniestroj could you please check if this solves the issue?

@carlescufi FYI: We ran into the same issue here as well eventually, and @MarekPieta has confirmed that this fixed the issue for us.

Copy link
Member

@mniestroj mniestroj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm it fixes #35058.

@nashif nashif merged commit 7986cae into zephyrproject-rtos:main May 25, 2021
@joerchan joerchan deleted the bt-fix-hash-settings-deadlock branch May 25, 2021 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth bug The issue is a bug, or the PR is fixing a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bluetooth: deadlock when canceling db_hash.work from settings commit handler

7 participants