
Conversation

@simonjbeaumont
Contributor

The read_records_where function in the database layer (used by the
get_all_records and get_all_records_where APIs) was reading the database
multiple times by calling find_refs_with_filter[1] to get the refs that matched
the query and then calling read_record[2] for each of these refs.

This violates point 2 of the locking strategy stated at the top of the module:
read-only functions must call get_database only once to ensure they operate on
a consistent snapshot.

Since [1] and [2] each make get_database calls, the get_all_records* functions
make n+1 calls to get_database for a table with n records. Because of this,
deleting a record during a get_all_records_where results in a DBCache_NotFound
exception with parameter "missing row".

This commit adds internal variants of functions [1] and [2] that take an
actual instance of Database.t rather than a Db_ref.t (which is a Database.t
ref ref).

Signed-off-by: Si Beaumont <simon.beaumont@citrix.com>

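The fix described above can be sketched roughly as follows. This is a minimal
illustration using simplified stand-in types, not xapi's real ones: the names
get_database, find_refs_with_filter and read_record follow the PR text, but
here a "database" is just an association list of refs to field lists.

```ocaml
(* Hypothetical sketch: a Database.t stand-in mapping refs to fields. *)
type database = (string * (string * string) list) list

(* A Db_ref.t is described in the PR as a Database.t ref ref. *)
type db_ref = database ref ref

(* Dereference twice to obtain an immutable snapshot of the database. *)
let get_database (t : db_ref) : database = !(!t)

(* Internal variants operate on a concrete snapshot (a Database.t)
   rather than re-reading the database through the Db_ref.t. *)
let find_refs_with_filter_internal (db : database) pred =
  List.filter (fun (_, fields) -> pred fields) db |> List.map fst

let read_record_internal (db : database) rf = List.assoc rf db

(* Public API: call get_database exactly once, then pass the resulting
   snapshot to the internal variants, so a concurrent delete cannot
   invalidate a ref between the filter and the per-ref reads. *)
let read_records_where (t : db_ref) pred =
  let db = get_database t in
  find_refs_with_filter_internal db pred
  |> List.map (fun rf -> (rf, read_record_internal db rf))
```

With the previous structure, each helper called get_database itself, so the
refs returned by the filter could point at rows deleted before the subsequent
read_record calls ran; with a single snapshot, every lookup sees the same
consistent state.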
@jonludlam
Contributor

So any ideas about why we're suddenly hitting this issue now?

@simonjbeaumont
Contributor Author

Nope. I feel like I've seen stuff like this while triaging issues, but assumed it was fallout from an error path. It maybe wasn't this issue, but the race is definitely there, as shown by the included unit test.

I don't have any insights into why we're only seeing this now... Nothing in this part of the code has changed for some years.

I guess it's only likely to happen, if at all, when there are a lot of records in a particular table; otherwise the likelihood of the thread yielding inside the two-line function read_records_where is low.

@jonludlam
Contributor

OK. PR looks good to me. In it goes!

jonludlam pushed a commit that referenced this pull request May 13, 2014
CA-134765: Fix race condition in database layer
@jonludlam jonludlam merged commit 58458cc into xapi-project:master May 13, 2014
@simonjbeaumont
Contributor Author

For future reference, the source branch had the wrong name. The CA number in the title and commit message are correct and I have pushed to the correctly-named branch if needed for backports: https://github.com/simonjbeaumont/xen-api/tree/ca-134765

@robhoes
Member

robhoes commented May 13, 2014

Nice patch! I'm sure this fixes some of the more mysterious issues we have seen in the past...
