We should agree on how we expect the string types to be used in Python 2 code.
There are at least four ways we can approach this:
- Make `str` usually valid when `unicode` is expected. This is how mypy currently works, and this is similar to how PEP 484 defines `bytearray`/`bytes` compatibility. This will correspond to runtime semantics, but it's not safe as non-ascii characters in `str` objects will result in programs sometimes blowing up. A 7-bit `str` instance is almost always valid at runtime when `unicode` is expected.
- Get rid of the `str -> unicode` promotion and use `Union[str, unicode]` everywhere (or create an alias for it). This is almost like approach 1, except that we have a different name for `unicode`, more complex error messages, and a more complex programming model due to the proliferation of union types. There is potential for some additional type safety by using just `unicode` in user code.
- Enforce explicit `str`/`unicode` distinction in Python 2 code, similar to Python 3 (`str` would behave more or less like Python 3 `bytes`), and discourage union types. This will make it harder to annotate existing Python 2 programs, which often use the two types almost interchangeably, but it will make programs safer.
- Have three different string types: `bytes` (distinct from `str`) means 8-bit `str` instances -- these aren't compatible with `unicode`. `str` means ascii `str` instances. These are compatible with `bytes` and `unicode`, but not the other way around. `unicode` means `unicode` instances and isn't special. A string literal will have implicit type `str` or `bytes` depending on whether it only has ascii characters. This approach should be pretty safe and potentially also makes it fairly easy to adapt existing code, but harder than with approach 1.
The choice of approach also affects how stubs should be written, so it would be best if every tool using typeshed could use the same approach:
- For approach 1, stubs should usually use `str`, `unicode` or `AnyStr`. This is how many stubs are written already.
- For approach 2, stubs should use `str`, `Union[str, unicode]` or `AnyStr` for attributes and function arguments, and return types could additionally use plain `unicode`. Return types would in general be hard to specify precisely, as it may be difficult to predict whether a function called with `str` or a combination of `str` and `unicode` returns `str`, `unicode` or `Union[str, unicode]`. In approach 1 we can safely fall back to `unicode` if unsure. `AnyStr` would be less useful as we could have mixed function arguments like `(str, unicode)` easily (see the typeshed issues mentioned below for more about this).
- For approach 3, stubs would usually use either `str`, `unicode` or `AnyStr`, but `unicode` wouldn't accept plain `str` objects.
- For approach 4, stubs could use three different types (`bytes`, `str`, `unicode`) in addition to `AnyStr`, and these would all behave differently. Unlike the first three approaches, `AnyStr` would range over `str`, `unicode` and `bytes` in Python 2 mode.
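As a rough illustration of how the same function might be annotated under approaches 1 and 2, here is a sketch using Python 2 style type comments. The function names `shout_v1`, `shout_v2` and `join_pair` are hypothetical and not taken from typeshed. The annotations live in comments, so the module also runs unchanged on Python 3, where `unicode` does not exist at runtime.

```python
from typing import AnyStr, Union  # referenced only inside the type comments


# Approach 1: annotate with unicode and rely on the str -> unicode promotion
# so that str arguments are accepted as well.
def shout_v1(text):
    # type: (unicode) -> unicode
    return text.upper()


# Approach 2: no promotion, so the argument is spelled as a union. The return
# type follows the input type at runtime, so it becomes a union as well.
def shout_v2(text):
    # type: (Union[str, unicode]) -> Union[str, unicode]
    return text.upper()


# AnyStr ties the result type to the argument type, but it expects both
# arguments to be the same string type, which is the limitation described
# above for mixed (str, unicode) calls.
def join_pair(a, b):
    # type: (AnyStr, AnyStr) -> AnyStr
    return a + b
```

How a given checker treats each signature depends on which approach it implements; the comments above only restate the trade-offs from the list.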
Note that mypy currently assumes approach 1 and I don't know how well the other approaches would work in practice.
[This was adapted from a comment on #1135; see the original issue for more discussion. Also, https://github.com/python/typeshed/issues/50 is relevant.]