Decide how to deal with str/unicode

We should agree on how we expect the string types to be used in Python 2 code.

There are at least four ways we can approach this:
1.    Make `str` usually valid when `unicode` is expected. This is how mypy currently works, and it is similar to how PEP 484 defines `bytearray` / `bytes` compatibility. It matches runtime semantics, since a 7-bit `str` instance is almost always valid at runtime when `unicode` is expected. It is not safe, though: non-ASCII characters in `str` objects will sometimes make programs blow up at runtime.
2.    Get rid of the `str -> unicode` promotion and use `Union[str, unicode]` everywhere (or create an alias for it). This is almost like approach 1, except that we have a different name for `unicode`, more complex error messages, and a more complex programming model due to the proliferation of union types. There is potential for some additional type safety by using just `unicode` in user code.
3.    Enforce explicit `str` / `unicode` distinction in Python 2 code, similar to Python 3 (`str` would behave more or less like Python 3 `bytes`), and discourage union types. This will make it harder to annotate existing Python 2 programs which often use the two types almost interchangeably, but it will make programs safer.
4.    Have three different string types. `bytes` (distinct from `str`) means 8-bit `str` instances; these aren't compatible with `unicode`. `str` means ASCII `str` instances; these are compatible with `bytes` and `unicode`, but not the other way around. `unicode` means `unicode` instances and isn't special. A string literal will have implicit type `str` or `bytes` depending on whether it only has ASCII characters. This approach should be pretty safe and potentially also makes it fairly easy to adapt existing code, though harder than with approach 1.
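
To make the differences concrete, here is a rough sketch of the implicit compatibility rules each approach implies. This is only a toy model: the names `PROMOTIONS`, `is_compatible` and `literal_type` are made up for illustration and are not part of mypy, and type names are plain strings. Approaches 2 and 3 look identical here because they differ in convention (unions everywhere vs. unions discouraged), not in what is implicitly compatible.

```python
# Hypothetical model, not mypy's API: for each approach, which other
# types a value of a given type is implicitly accepted as.
PROMOTIONS = {
    1: {"str": {"unicode"}},           # str -> unicode promotion
    2: {},                             # no promotion; spell out Union[str, unicode]
    3: {},                             # str and unicode strictly distinct
    4: {"str": {"bytes", "unicode"}},  # ASCII-only str fits both; not the reverse
}


def is_compatible(approach, source, target):
    """Return True if `source` is accepted where `target` is expected."""
    if source == target:
        return True
    return target in PROMOTIONS[approach].get(source, set())


def literal_type(approach, raw):
    """Implicit type of a non-unicode string literal, given its raw bytes."""
    if approach == 4:
        # bytearray yields ints on both Python 2 and Python 3
        is_ascii = all(b < 128 for b in bytearray(raw))
        return "str" if is_ascii else "bytes"
    return "str"
```

For example, `is_compatible(1, "str", "unicode")` holds while `is_compatible(3, "str", "unicode")` does not, and under approach 4 a literal containing a non-ASCII byte gets type `bytes`, which is then not accepted where `unicode` is expected.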

These also affect how stubs should be written and thus it would be best if every tool using typeshed could use the same approach:
-    For approach 1, stubs should usually use `str`, `unicode` or `AnyStr`. This is how many stubs are written already.
-    For approach 2, stubs should use `str`, `Union[str, unicode]` or `AnyStr` for attributes and function arguments, and return types could additionally use plain `unicode`. Return types would in general be hard to specify precisely, as it may be difficult to predict whether a function called with `str` or a combination of `str` and `unicode` returns `str`, `unicode` or `Union[str, unicode]`. In approach 1 we can safely fall back to `unicode` if unsure. `AnyStr` would be less useful as we could easily have mixed function arguments like `(str, unicode)` (see the typeshed issues mentioned below for more about this).
-    For approach 3, stubs would usually use either `str`, `unicode` or `AnyStr`, but `unicode` wouldn't accept plain `str` objects.
-    For approach 4, stubs could use three different types (`bytes`, `str`, `unicode`) in addition to `AnyStr`, and these would all behave differently. Unlike the first three approaches, `AnyStr` would range over `str`, `unicode` and `bytes` in Python 2 mode.
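
As a rough illustration of how the same signature might be spelled under the first three approaches, consider a made-up two-argument helper. The `join_v*` names are hypothetical and not taken from typeshed. Python 2 style type comments are used so that the snippet also executes as plain Python; the signatures are meant to be read in Python 2 mode.

```python
from typing import AnyStr, Union  # on Python 2 this needs the `typing` backport


# Approach 1: `unicode` also accepts `str` through the promotion, and
# `unicode` is the safe fallback for the return type when unsure.
def join_v1(a, b):
    # type: (unicode, unicode) -> unicode
    return a + b


# Approach 2: no promotion, so mixed arguments are spelled as unions and
# the return type is hard to state more precisely than a union.
def join_v2(a, b):
    # type: (Union[str, unicode], Union[str, unicode]) -> Union[str, unicode]
    return a + b


# Approach 3: strict distinction; both arguments must have the same
# string type, and `unicode` would not accept a plain `str`.
def join_v3(a, b):
    # type: (AnyStr, AnyStr) -> AnyStr
    return a + b
```

Under approach 4 the signature could look like `join_v1`, with the difference that only ASCII `str` values, and not 8-bit `bytes` values, would be accepted for the `unicode` arguments.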

Note that mypy currently assumes approach 1 and I don't know how well the other approaches would work in practice.

[This was adapted from a comment on #1135; see the original issue for more discussion. Also, https://github.com/python/typeshed/issues/50 is relevant.]


Decide how to deal with str/unicode #1141

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Decide how to deal with str/unicode #1141

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions