
Possible cache_key collision within collect_subfields #91

Closed
rafalp opened this issue May 14, 2020 · 12 comments

rafalp commented May 14, 2020

Today I noticed apollo-client displaying console warnings about queried fields missing from the server's response:

[Screenshot 2020-05-14 at 21:28: apollo-client console warnings about missing fields]

At first I thought Apollo Client was mixing up the fragments, but I inspected the server's response to the query in GraphQL Playground and realized that, at random, my query reuses the result of one fragment where another fragment is expected:

[Screenshot 2020-05-14 at 22:05: server response reusing the result of the wrong fragment]

This happens at random. Sometimes I consistently get a valid query result; other times I keep running into an invalid one.

Here's a recording of this happening, on Vimeo:

https://vimeo.com/418611363

I suspect a race condition within the query executor: moving the category fields to the bottom of the thread selection seems to resolve the issue, perhaps because it gives the event loop a few more iterations for the previous subset to be cleaned up before the new subset is used?

I haven't seen this error occur for the first fragment used within the query, only for the later one.

rafalp changed the title from "For complex queries using multiple fragments on same type, randomly wrong fragments subset of fields is returned in query result" to "For complex queries using multiple representations of same type, randomly wrong subset of fields is returned in wrong place." on May 14, 2020

rafalp commented May 14, 2020

I have removed the fragments from my query, replacing them with the fields written out verbatim, but the error still occurs. So it seems the problem is not that fragments are used, but that there are multiple fields within the query returning the same type (in this case, Category).


Cito commented May 14, 2020

Which version of graphql-core are you using? And what do you use for serving the responses?


rafalp commented May 14, 2020

It's GraphQL Core 3.0.4 on Uvicorn (ASGI).


Cito commented May 14, 2020

Can you post a simpler test case that allows me to reproduce the bug using only graphql-core, not other components such as Uvicorn or apollo-client?


rafalp commented May 15, 2020

I've poked around the code in the executor and verified that there is indeed a possible collision in which collect_subfields constructs the same cache_key for both the query.category and query.threads.items.*.category paths, leading to an invalid result being reused.

I'll see over the weekend how to create a minimal repro example that triggers the cache key collision.
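
For context, the memoization I'm suspecting follows this general pattern, keyed on the identity of the field_nodes list. This is a simplified, self-contained sketch with illustrative names, not the actual graphql-core source:

# Simplified, illustrative sketch of an id()-keyed subfields cache
# (not the actual graphql-core implementation; names are made up).
from typing import Any, Dict, List, Tuple

_subfields_cache: Dict[Tuple[str, int], Dict[str, Any]] = {}

def collect_subfields_memoized(return_type_name: str,
                               field_nodes: List[Any]) -> Dict[str, Any]:
    # The key combines the return type with the identity of the list object.
    # If the list built for query.category is freed and a new list built for
    # query.threads.items.*.category is allocated at the same address, both
    # paths map to the same key and the stale entry is reused.
    cache_key = (return_type_name, id(field_nodes))
    cached = _subfields_cache.get(cache_key)
    if cached is not None:
        return cached
    result = _merge_selection_sets(field_nodes)
    _subfields_cache[cache_key] = result
    return result

def _merge_selection_sets(field_nodes: List[Any]) -> Dict[str, Any]:
    # Stand-in for the real subfield collection work.
    return {"selected": list(field_nodes)}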

rafalp changed the title from "For complex queries using multiple representations of same type, randomly wrong subset of fields is returned in wrong place." to "Possible cache_key collision within collect_subfields" on May 15, 2020

Cito commented May 15, 2020

Thank you. Yes, that caching mechanism might have something to do with it. I will certainly look into this after you post your example.


rafalp commented May 16, 2020

I don't have a repro, but I've modified the query executor to print id(field_nodes) as soon as this value is retrieved from fields.items(). Here's one such example:

# Debug: print the identity of each field_nodes list as the executor
# iterates over the collected fields.
for response_name, field_nodes in fields.items():
    print(response_name, id(field_nodes))

I found that, at random, different field_nodes objects would end up with the same id() value as query.category's field_nodes:

category 140331388205184
...
isClosed 140331388205184
...
isClosed 140331388205184
...and so on...

This was surprising, but it's to be expected according to the docs:

Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.

CPython implementation detail: This is the address of the object in memory

This felt like it should be super rare, but apparently it's super easy to hit, because CPython uses reference counting and reuses freed addresses immediately; to produce a non-unique id you only have to call id() on short-lived objects in a loop:

>>> import time
>>> def test():
...     print(id(time.time()))
...
>>> for i in range(5):
...     test()
...
4318953832
4318953832
4318953832
4318953832
4318953832

So I decided to verify the contents of field_nodes to be sure that they were different even when the field_nodes id was the same. I added the following code to collect_subfields:

if return_type.name == "Category":
    # Debug: print the list's id and the names of the selected subfields.
    selections = field_nodes[0].selection_set.selections
    print(id(field_nodes), [n.name.value for n in selections])

Here is the result for the first four calls when the collision occurs:

# query.category
140699773915392 ['id', 'name', 'slug', 'threads', 'posts', 'parent', '__typename']
# query.category.parent
140699725555904 ['id', 'name', 'slug', '__typename']
# query.threads.items.category
140699773915392 ['id', 'name', 'slug', 'color', 'icon', 'parent', '__typename']
# query.threads.items.category.parent
140699725555904 ['id', 'name', 'slug', '__typename']

The ids are repeated, but the objects are different.

For this bug to occur, you just need to ask for the same type in more than one branch of the query and wait. Eventually you'll roll a six and get a duplicate id() for both fields, producing a wrong cache hit.

Since this issue only occurs when the allocator happens to place a new field_nodes list at an address previously occupied by another one, I don't think a reliable repro can be implemented. :|
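
For illustration only, here is a minimal, self-contained sketch (not a repro of the executor bug itself) of how an id()-keyed cache can serve a stale entry once CPython reuses the address of a freed list:

# Minimal illustration of a wrong hit in an id()-keyed cache. Whether the
# ids actually collide depends on the allocator, so the second call only
# *may* print the stale value.
cache = {}

def first_item_cached(nodes):
    key = id(nodes)  # identity of the list, not its contents
    if key not in cache:
        cache[key] = nodes[0]
    return cache[key]

print(first_item_cached(["category fields"]))  # caches under the list's address
# The first list is freed here; a new list may be allocated at the same address.
print(first_item_cached(["thread fields"]))    # may wrongly print 'category fields'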


Cito commented May 16, 2020

@rafalp You're probably right, and this could explain your observation and why it is hard to reproduce. I will investigate that caching mechanism more closely tomorrow and implement it in a safer way. Thanks for reporting and helping with this!


Cito commented May 17, 2020

@rafalp I just looked into this a bit more closely.

Your analysis is correct: the problem was indeed using the id of the list of field nodes as the cache key. This is problematic in two ways. First, the cache is not hit when a different list of the same nodes is used as input, which may happen in practice (there are such cases in the test suite); this makes the cache less effective (not by much, but measurably). Second, the ids may be reused when such lists are destroyed and recreated. This may also happen in practice, as you unfortunately experienced.

Btw, the memoization in GraphQL.js suffers from the first problem, because they use the list as the key in a WeakMap. I will suggest an improvement there.

I have now changed this in graphql-core so that the ids of all the individual field nodes are used as the elements of a tuple that serves as the cache key. These ids are stable for the duration of the execution.

The reason I use the ids of the nodes instead of the nodes themselves is that hashing the nodes (which happens when looking up the key in the cache) is so slow that it would completely eat up the advantage of memoization. The nodes have a custom comparison method in order to make nodes with the same attributes equal, and therefore they also need a custom hash method.
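
In other words, the key moves from the identity of the list to a tuple of the identities of the individual nodes, roughly like this (a simplified sketch of the idea, not the literal code from 35c67b7):

from typing import Any, List, Tuple

def subfields_cache_key(field_nodes: List[Any]) -> Tuple[int, ...]:
    # Key on the identities of the individual field nodes, which are stable
    # for the whole execution, instead of id(field_nodes), the identity of
    # the containing list, which can be reused after the list is freed.
    # Two different lists holding the same nodes now produce the same key,
    # which also improves the cache hit rate.
    return tuple(id(node) for node in field_nodes)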

Again, thanks a lot for finding and analyzing that subtle issue! If you can confirm that the change in 35c67b7 solves your problem, I will publish a bugfix release.


rafalp commented May 17, 2020

I've tested your change locally and can confirm that even when the id() returned for field_nodes is the same, the cache collision no longer occurs and both fields return valid subsets. 👍


Cito commented May 17, 2020

Great, thank you! Will try to publish bugfix releases for the 3.0.x and 3.1.x branches tomorrow evening.


Cito commented May 18, 2020

Ok, 3.0.5 and 3.1.1 with this fix are released now.

@Cito Cito closed this as completed May 18, 2020