Decide how to deal with str/unicode #1141

Closed
@JukkaL

We should agree on how we expect the string types to be used in Python 2 code.

There are at least four ways we can approach this:

  1. Make str usually valid when unicode is expected. This is how mypy currently works, and it is similar to how PEP 484 defines bytearray / bytes compatibility. This corresponds to runtime semantics, but it's not safe, as non-ASCII characters in str objects will sometimes make programs blow up at runtime. A 7-bit str instance is almost always valid at runtime when unicode is expected.
  2. Get rid of the str -> unicode promotion and use Union[str, unicode] everywhere (or create an alias for it). This is almost like approach 1, except that we have a different name for unicode, more complex error messages, and a more complex programming model due to the proliferation of union types. There is potential for some additional type safety by using just unicode in user code.
  3. Enforce explicit str / unicode distinction in Python 2 code, similar to Python 3 (str would behave more or less like Python 3 bytes), and discourage union types. This will make it harder to annotate existing Python 2 programs which often use the two types almost interchangeably, but it will make programs safer.
  4. Have three different string types: bytes (distinct from str) means 8-bit str instances -- these aren't compatible with unicode. str means ASCII str instances. These are compatible with bytes and unicode, but not the other way around. unicode means unicode instances and isn't special. A string literal will have implicit type str or bytes depending on whether it only has ASCII characters. This approach should be pretty safe and potentially also makes it fairly easy to adapt existing code, though not as easy as with approach 1.
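
To make the runtime hazard in approach 1 and the literal rule in approach 4 concrete, here is a rough sketch. It is a Python 3 simulation, not mypy code: bytes stands in for Python 2 str, and Python 3 str stands in for Python 2 unicode. The helper names (implicit_promote, promotes_safely, literal_type) are made up for illustration.

```python
# Python 3 simulation of Python 2 semantics (an assumption of this sketch):
# bytes plays the role of Python 2 str, str plays the role of Python 2 unicode.


def implicit_promote(data: bytes) -> str:
    """Mimic Python 2's implicit str -> unicode coercion, which uses ASCII."""
    return data.decode("ascii")


def promotes_safely(data: bytes) -> bool:
    """Return True if the implicit coercion would succeed at runtime."""
    try:
        implicit_promote(data)
    except UnicodeDecodeError:
        return False
    return True


def literal_type(data: bytes) -> str:
    """Approach 4 (hypothetical helper): classify a literal by its contents."""
    return "str" if all(byte < 128 for byte in data) else "bytes"


ascii_value = b"abc"
utf8_value = "caf\u00e9".encode("utf-8")  # contains non-ASCII bytes

print(promotes_safely(ascii_value), literal_type(ascii_value))
print(promotes_safely(utf8_value), literal_type(utf8_value))
```

Under approach 1 both values would type check where unicode is expected, even though only the first survives the coercion. Under approach 4 the second value would be typed as bytes and rejected.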

These choices also affect how stubs should be written, so it would be best if every tool using typeshed took the same approach:

  • For approach 1, stubs should usually use str, unicode or AnyStr. This is how many stubs are written already.
  • For approach 2, stubs should use str, Union[str, unicode] or AnyStr for attributes and function arguments, and return types could additionally use plain unicode. Return types would in general be hard to specify precisely, as it may be difficult to predict whether a function called with str or a combination of str and unicode returns str, unicode or Union[str, unicode]. In approach 1 we can safely fall back to unicode if unsure. AnyStr would be less useful as we could easily have mixed function arguments like (str, unicode) (see the typeshed issues mentioned below for more about this).
  • For approach 3, stubs would usually use either str, unicode or AnyStr, but unicode wouldn't accept plain str objects.
  • For approach 4, stubs could use three different types (bytes, str, unicode) in addition to AnyStr, and these would all behave differently. Unlike the first three approaches, AnyStr would range over str, unicode and bytes in Python 2 mode.
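
As a rough illustration of how the same signatures might be annotated, here is a sketch using PEP 484 type comments, which are ordinary comments at runtime. The functions concat, describe and describe_union are hypothetical and not taken from typeshed. The notes on what a checker would accept under each approach restate the descriptions above rather than the verified behaviour of any tool.

```python
from typing import AnyStr, Union  # used only inside the type comments


def concat(a, b):
    # type: (AnyStr, AnyStr) -> AnyStr
    # AnyStr forces both arguments to be the same string type, so a mixed
    # (str, unicode) call is not expressible with this signature.
    return a + b


def describe(name):
    # type: (unicode) -> unicode
    # Approach 1: a str argument would also be accepted via promotion.
    # Approach 3: only unicode would be accepted.
    return name.upper()


def describe_union(name):
    # type: (Union[str, unicode]) -> Union[str, unicode]
    # Approach 2: the union is spelled out, and the return type is
    # correspondingly less precise.
    return name.upper()
```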

Note that mypy currently assumes approach 1 and I don't know how well the other approaches would work in practice.

[This was adapted from a comment on #1135; see the original issue for more discussion. Also, https://github.com/python/typeshed/issues/50 is relevant.]
