Unicode characters get transformed into surrogate pairs by graphql.print_ast() #128

Closed
@dkbarn

I've run into an issue where serializing a GraphQL DocumentNode into a string and then parsing that string back into a DocumentNode transforms certain Unicode characters into surrogate pairs, which makes them impossible to encode as UTF-8.

This code snippet demonstrates the problem:

import graphql

value = "\U000a90e5"
print(f"Value before serializing: {value!r}")

encoded = value.encode("utf8")
print(f"UTF-8 encoded: {encoded!r}")

query = graphql.DocumentNode(
    definitions=[
        graphql.OperationDefinitionNode(
            operation=graphql.OperationType.QUERY,
            selection_set=graphql.SelectionSetNode(
                selections=[
                    graphql.FieldNode(
                        name=graphql.NameNode(
                            kind="name",
                            value="hello",
                        ),
                        arguments=[
                            graphql.ArgumentNode(
                                name=graphql.NameNode(value="user"),
                                value=graphql.StringValueNode(value=value)
                            )
                        ]
                    )
                ]
            )
        )
    ]
)

serialized_query = graphql.print_ast(query)

print(f"Serialized query: {serialized_query}")

parsed_query = graphql.parse(serialized_query)

value = parsed_query.definitions[0].selection_set.selections[0].arguments[0].value.value
print(f"Value after serializing: {value!r}")

encoded = value.encode("utf8")
print(f"UTF-8 encoded: {encoded!r}")

The Unicode character \U000a90e5 is UTF-8 encodable. However, placing this value in a DocumentNode tree and serializing the AST into text with graphql.print_ast() transforms the character into the surrogate pair \uda64\udce5. Converting the text back into a DocumentNode via graphql.parse() and then extracting the argument value shows that the value has been modified and is no longer UTF-8 encodable. The last line in this snippet produces the error: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
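To make the transformation concrete, here is a minimal standard-library sketch that reproduces the surrogate pair arithmetically and shows one possible way to recombine the pair on the caller's side. The helper names to_surrogate_pair and merge_surrogates are hypothetical and are not part of graphql-core; this illustrates the UTF-16 arithmetic only and does not claim to reflect how print_ast() or parse() work internally.

```python
def to_surrogate_pair(ch: str) -> str:
    """Split one character outside the BMP into its UTF-16 surrogate pair.

    Hypothetical helper for illustration; not a graphql-core function.
    """
    offset = ord(ch) - 0x10000
    if offset < 0:
        raise ValueError("character is in the BMP and has no surrogate pair")
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return chr(high) + chr(low)


def merge_surrogates(text: str) -> str:
    """Recombine surrogate pairs into the code points they represent.

    Hypothetical helper: "surrogatepass" lets the surrogates through the
    UTF-16 encoder, and the decoder then reads each pair as one code point.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


original = "\U000a90e5"
pair = to_surrogate_pair(original)
print(ascii(pair))                 # the same pair reported in this issue

repaired = merge_surrogates(pair)
print(repaired == original)
print(repaired.encode("utf8"))     # encodes without UnicodeEncodeError
```

Applying something like merge_surrogates to the value returned by graphql.parse() might serve as a stopgap, assuming the surrogates always arrive in well-formed pairs. An unpaired surrogate would make the decode step raise UnicodeDecodeError.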

Environment:

  • Python 3.8.5
  • graphql-core 3.1.4
