
Conversation

@simonjbeaumont
Contributor

The read_records_where function in the database layer (used by the
get_all_records and get_all_records_where APIs) was reading the database
multiple times by calling find_refs_with_filter[1] to get the refs that matched
the query and then calling read_record[2] for each of these refs.

This violates point 2 of the locking strategy stated at the top of the module:
read-only functions must call get_database only once to ensure they operate on
a consistent snapshot.

Since [1] and [2] each make get_database calls, the get_all_records* functions
make n+1 calls to get_database for a table with n records. Because of this,
deleting a record during a get_all_records_where results in a DBCache_NotFound
exception with parameter "missing row".

This commit adds internal variants of functions [1] and [2] that take an
actual instance of Database.t rather than a Db_ref.t (which is a Database.t
ref ref).

Signed-off-by: Si Beaumont <simon.beaumont@citrix.com>

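The fix described above can be sketched roughly as follows. This is a minimal
illustration using simplified stand-in types, not xapi's real ones: the names
get_database, find_refs_with_filter and read_record follow the PR text, but
here a "database" is just an association list of refs to field lists.

```ocaml
(* Hypothetical sketch: a Database.t stand-in mapping refs to fields. *)
type database = (string * (string * string) list) list

(* A Db_ref.t is described in the PR as a Database.t ref ref. *)
type db_ref = database ref ref

(* Dereference twice to obtain an immutable snapshot of the database. *)
let get_database (t : db_ref) : database = !(!t)

(* Internal variants operate on a concrete snapshot (a Database.t)
   rather than re-reading the database through the Db_ref.t. *)
let find_refs_with_filter_internal (db : database) pred =
  List.filter (fun (_, fields) -> pred fields) db |> List.map fst

let read_record_internal (db : database) rf = List.assoc rf db

(* Public API: call get_database exactly once, then pass the resulting
   snapshot to the internal variants, so a concurrent delete cannot
   invalidate a ref between the filter and the per-ref reads. *)
let read_records_where (t : db_ref) pred =
  let db = get_database t in
  find_refs_with_filter_internal db pred
  |> List.map (fun rf -> (rf, read_record_internal db rf))
```

With the previous structure, each helper called get_database itself, so the
refs returned by the filter could point at rows deleted before the subsequent
read_record calls ran; with a single snapshot, every lookup sees the same
consistent state.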
@jonludlam
Contributor

So any ideas about why we're suddenly hitting this issue now?

@simonjbeaumont
Contributor Author

Nope. I feel like I've seen stuff like this while triaging issues, but assumed it was fallout from an error path. It maybe wasn't this issue, but the race is definitely there, as shown by the included unit test.

I don't have any insights into why we're only seeing this now... Nothing in this part of the code has changed for some years.

I guess it's only likely to happen, if at all, when there are a lot of records in a particular table; otherwise the likelihood of the thread yielding inside the two-line function read_records_where is low.

@jonludlam
Contributor

OK. PR looks good to me. In it goes!

jonludlam pushed a commit that referenced this pull request May 13, 2014
CA-134765: Fix race condition in database layer
@jonludlam jonludlam merged commit 58458cc into xapi-project:master May 13, 2014
@simonjbeaumont
Contributor Author

For future reference, the source branch had the wrong name. The CA number in the title and commit message are correct and I have pushed to the correctly-named branch if needed for backports: https://github.com/simonjbeaumont/xen-api/tree/ca-134765

@robhoes
Member

robhoes commented May 13, 2014

Nice patch! I'm sure this fixes some of the more mysterious issues we have seen in the past...
