Description
I've run into an issue where serializing a GraphQL DocumentNode to a string and then parsing that string back into a DocumentNode turns certain Unicode characters into surrogate pairs, which can no longer be encoded as UTF-8.
This code snippet demonstrates the problem:
import graphql

# A single code point outside the Basic Multilingual Plane.
value = "\U000a90e5"
print(f"Value before serializing: {value!r}")
encoded = value.encode("utf8")
print(f"UTF-8 encoded: {encoded!r}")

# Build the AST by hand for: { hello(user: "<value>") }
query = graphql.DocumentNode(
    definitions=[
        graphql.OperationDefinitionNode(
            operation=graphql.OperationType.QUERY,
            selection_set=graphql.SelectionSetNode(
                selections=[
                    graphql.FieldNode(
                        name=graphql.NameNode(
                            kind="name",
                            value="hello",
                        ),
                        arguments=[
                            graphql.ArgumentNode(
                                name=graphql.NameNode(value="user"),
                                value=graphql.StringValueNode(value=value),
                            )
                        ],
                    )
                ]
            ),
        )
    ]
)

# Round trip: AST -> text -> AST.
serialized_query = graphql.print_ast(query)
print(f"Serialized query: {serialized_query}")
parsed_query = graphql.parse(serialized_query)

# Pull the string argument back out of the parsed document.
value = parsed_query.definitions[0].selection_set.selections[0].arguments[0].value.value
print(f"Value after serializing: {value!r}")
encoded = value.encode("utf8")  # raises UnicodeEncodeError
print(f"UTF-8 encoded: {encoded!r}")
Given the Unicode character \U000a90e5, which is UTF-8 encodable, placing this value in a DocumentNode tree and serializing the AST to text transforms the character into the surrogate pair \uda64\udce5. Converting this text back into a DocumentNode via graphql.parse() and then extracting the argument value shows that the value has been modified and is no longer UTF-8 encodable. The last line of the snippet raises the error: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
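For reference, here is a small standalone sketch (no graphql-core needed) of where the two code units come from and one way to merge them back after parsing. The helper names split_into_surrogates and merge_surrogates are made up for this illustration and are not part of graphql-core. I'm assuming the parsed string contains only well-formed pairs, so treat it as a possible workaround rather than a fix.

```python
def split_into_surrogates(char: str) -> str:
    """Return the UTF-16 surrogate pair for a single code point above U+FFFF."""
    offset = ord(char) - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return chr(high) + chr(low)


def merge_surrogates(text: str) -> str:
    """Recombine surrogate pairs in a str into the code points they represent.

    Hypothetical workaround helper: "surrogatepass" lets the encoder write the
    surrogates as raw UTF-16 code units, and the decoder then reads each
    high/low pair as one character.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


original = "\U000a90e5"
pair = split_into_surrogates(original)
assert pair == "\uda64\udce5"          # the pair seen after the round trip

repaired = merge_surrogates(pair)
assert repaired == original
print(repaired.encode("utf8"))         # encodes without raising
```

Applying merge_surrogates to the value extracted from the parsed document should give back the original character. A lone, unpaired surrogate would make the decode step raise UnicodeDecodeError, so this only covers the paired case.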
Environment:
- Python 3.8.5
- graphql-core 3.1.4