From 9b58a292ea42b87fcc26fffa0c59400c4c01c297 Mon Sep 17 00:00:00 2001 From: Mariatta Date: Thu, 2 Feb 2017 09:58:20 -0800 Subject: [PATCH] Convert PEP 263 to reST (#191) - convert the content to text/x-rst - update http://python.org/sf/* url with https://bugs.python.org/issue* --- pep-0263.txt | 351 +++++++++++++++++++++++++++------------------------ 1 file changed, 189 insertions(+), 162 deletions(-) diff --git a/pep-0263.txt b/pep-0263.txt index 518de5bee6b..a6a3d8d24ab 100644 --- a/pep-0263.txt +++ b/pep-0263.txt @@ -6,133 +6,144 @@ Author: mal@lemburg.com (Marc-André Lemburg), martin@v.loewis.de (Martin von Löwis) Status: Final Type: Standards Track +Content-Type: text/x-rst Created: 06-Jun-2001 Python-Version: 2.3 -Post-History: +Post-History: + Abstract +======== + +This PEP proposes to introduce a syntax to declare the encoding of +a Python source file. The encoding information is then used by the +Python parser to interpret the file using the given encoding. Most +notably this enhances the interpretation of Unicode literals in +the source code and makes it possible to write Unicode literals +using e.g. UTF-8 directly in an Unicode aware editor. - This PEP proposes to introduce a syntax to declare the encoding of - a Python source file. The encoding information is then used by the - Python parser to interpret the file using the given encoding. Most - notably this enhances the interpretation of Unicode literals in - the source code and makes it possible to write Unicode literals - using e.g. UTF-8 directly in an Unicode aware editor. Problem +======= + +In Python 2.1, Unicode literals can only be written using the +Latin-1 based encoding "unicode-escape". This makes the +programming environment rather unfriendly to Python users who live +and work in non-Latin-1 locales such as many of the Asian +countries. Programmers can write their 8-bit strings using the +favorite encoding, but are bound to the "unicode-escape" encoding +for Unicode literals. - In Python 2.1, Unicode literals can only be written using the - Latin-1 based encoding "unicode-escape". This makes the - programming environment rather unfriendly to Python users who live - and work in non-Latin-1 locales such as many of the Asian - countries. Programmers can write their 8-bit strings using the - favorite encoding, but are bound to the "unicode-escape" encoding - for Unicode literals. Proposed Solution +================= + +I propose to make the Python source code encoding both visible and +changeable on a per-source file basis by using a special comment +at the top of the file to declare the encoding. - I propose to make the Python source code encoding both visible and - changeable on a per-source file basis by using a special comment - at the top of the file to declare the encoding. +To make Python aware of this encoding declaration a number of +concept changes are necessary with respect to the handling of +Python source code data. - To make Python aware of this encoding declaration a number of - concept changes are necessary with respect to the handling of - Python source code data. Defining the Encoding +===================== - Python will default to ASCII as standard encoding if no other - encoding hints are given. +Python will default to ASCII as standard encoding if no other +encoding hints are given. - To define a source code encoding, a magic comment must - be placed into the source files either as first or second - line in the file, such as: +To define a source code encoding, a magic comment must +be placed into the source files either as first or second +line in the file, such as:: - # coding= + # coding= - or (using formats recognized by popular editors) +or (using formats recognized by popular editors):: - #!/usr/bin/python - # -*- coding: -*- + #!/usr/bin/python + # -*- coding: -*- - or +or:: - #!/usr/bin/python - # vim: set fileencoding= : + #!/usr/bin/python + # vim: set fileencoding= :: - More precisely, the first or second line must match the regular - expression "^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)". - The first group of this - expression is then interpreted as encoding name. If the encoding - is unknown to Python, an error is raised during compilation. There - must not be any Python statement on the line that contains the - encoding declaration. If the first line matches the second line - is ignored. +More precisely, the first or second line must match the regular +expression "``^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)```". +The first group of this +expression is then interpreted as encoding name. If the encoding +is unknown to Python, an error is raised during compilation. There +must not be any Python statement on the line that contains the +encoding declaration. If the first line matches the second line +is ignored. - To aid with platforms such as Windows, which add Unicode BOM marks - to the beginning of Unicode files, the UTF-8 signature - '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well - (even if no magic encoding comment is given). +To aid with platforms such as Windows, which add Unicode BOM marks +to the beginning of Unicode files, the UTF-8 signature +'``\xef\xbb\xbf``' will be interpreted as 'utf-8' encoding as well +(even if no magic encoding comment is given). + +If a source file uses both the UTF-8 BOM mark signature and a +magic encoding comment, the only allowed encoding for the comment +is 'utf-8'. Any other encoding will cause an error. - If a source file uses both the UTF-8 BOM mark signature and a - magic encoding comment, the only allowed encoding for the comment - is 'utf-8'. Any other encoding will cause an error. Examples +======== - These are some examples to clarify the different styles for - defining the source code encoding at the top of a Python source - file: +These are some examples to clarify the different styles for +defining the source code encoding at the top of a Python source +file: - 1. With interpreter binary and using Emacs style file encoding - comment: +1. With interpreter binary and using Emacs style file encoding + comment:: - #!/usr/bin/python - # -*- coding: latin-1 -*- - import os, sys - ... + #!/usr/bin/python + # -*- coding: latin-1 -*- + import os, sys + ... - #!/usr/bin/python - # -*- coding: iso-8859-15 -*- - import os, sys - ... + #!/usr/bin/python + # -*- coding: iso-8859-15 -*- + import os, sys + ... - #!/usr/bin/python - # -*- coding: ascii -*- - import os, sys - ... + #!/usr/bin/python + # -*- coding: ascii -*- + import os, sys + ... - 2. Without interpreter line, using plain text: +2. Without interpreter line, using plain text:: - # This Python file uses the following encoding: utf-8 - import os, sys - ... + # This Python file uses the following encoding: utf-8 + import os, sys + ... - 3. Text editors might have different ways of defining the file's - encoding, e.g. +3. Text editors might have different ways of defining the file's + encoding, e.g.:: - #!/usr/local/bin/python - # coding: latin-1 - import os, sys - ... + #!/usr/local/bin/python + # coding: latin-1 + import os, sys + ... - 4. Without encoding comment, Python's parser will assume ASCII - text: +4. Without encoding comment, Python's parser will assume ASCII + text:: - #!/usr/local/bin/python - import os, sys - ... + #!/usr/local/bin/python + import os, sys + ... - 5. Encoding comments which don't work: +5. Encoding comments which don't work: - Missing "coding:" prefix: + 1. Missing "coding:" prefix:: #!/usr/local/bin/python # latin-1 import os, sys ... - Encoding comment not on line 1 or 2: + 2. Encoding comment not on line 1 or 2:: #!/usr/local/bin/python # @@ -140,125 +151,141 @@ Examples import os, sys ... - Unsupported encoding: + 3. Unsupported encoding:: #!/usr/local/bin/python # -*- coding: utf-42 -*- import os, sys ... + Concepts +======== - The PEP is based on the following concepts which would have to be - implemented to enable usage of such a magic comment: +The PEP is based on the following concepts which would have to be +implemented to enable usage of such a magic comment: - 1. The complete Python source file should use a single encoding. - Embedding of differently encoded data is not allowed and will - result in a decoding error during compilation of the Python - source code. +1. The complete Python source file should use a single encoding. + Embedding of differently encoded data is not allowed and will + result in a decoding error during compilation of the Python + source code. - Any encoding which allows processing the first two lines in the - way indicated above is allowed as source code encoding, this - includes ASCII compatible encodings as well as certain - multi-byte encodings such as Shift_JIS. It does not include - encodings which use two or more bytes for all characters like - e.g. UTF-16. The reason for this is to keep the encoding - detection algorithm in the tokenizer simple. + Any encoding which allows processing the first two lines in the + way indicated above is allowed as source code encoding, this + includes ASCII compatible encodings as well as certain + multi-byte encodings such as Shift_JIS. It does not include + encodings which use two or more bytes for all characters like + e.g. UTF-16. The reason for this is to keep the encoding + detection algorithm in the tokenizer simple. - 2. Handling of escape sequences should continue to work as it does - now, but with all possible source code encodings, that is - standard string literals (both 8-bit and Unicode) are subject to - escape sequence expansion while raw string literals only expand - a very small subset of escape sequences. +2. Handling of escape sequences should continue to work as it does + now, but with all possible source code encodings, that is + standard string literals (both 8-bit and Unicode) are subject to + escape sequence expansion while raw string literals only expand + a very small subset of escape sequences. - 3. Python's tokenizer/compiler combo will need to be updated to - work as follows: +3. Python's tokenizer/compiler combo will need to be updated to + work as follows: - 1. read the file + 1. read the file - 2. decode it into Unicode assuming a fixed per-file encoding + 2. decode it into Unicode assuming a fixed per-file encoding - 3. convert it into a UTF-8 byte string + 3. convert it into a UTF-8 byte string - 4. tokenize the UTF-8 content + 4. tokenize the UTF-8 content - 5. compile it, creating Unicode objects from the given Unicode data - and creating string objects from the Unicode literal data - by first reencoding the UTF-8 data into 8-bit string data - using the given file encoding + 5. compile it, creating Unicode objects from the given Unicode data + and creating string objects from the Unicode literal data + by first reencoding the UTF-8 data into 8-bit string data + using the given file encoding + +Note that Python identifiers are restricted to the ASCII +subset of the encoding, and thus need no further conversion +after step 4. - Note that Python identifiers are restricted to the ASCII - subset of the encoding, and thus need no further conversion - after step 4. Implementation +============== + +For backwards-compatibility with existing code which currently +uses non-ASCII in string literals without declaring an encoding, +the implementation will be introduced in two phases: - For backwards-compatibility with existing code which currently - uses non-ASCII in string literals without declaring an encoding, - the implementation will be introduced in two phases: +1. Allow non-ASCII in string literals and comments, by internally + treating a missing encoding declaration as a declaration of + "iso-8859-1". This will cause arbitrary byte strings to + correctly round-trip between step 2 and step 5 of the + processing, and provide compatibility with Python 2.2 for + Unicode literals that contain non-ASCII bytes. - 1. Allow non-ASCII in string literals and comments, by internally - treating a missing encoding declaration as a declaration of - "iso-8859-1". This will cause arbitrary byte strings to - correctly round-trip between step 2 and step 5 of the - processing, and provide compatibility with Python 2.2 for - Unicode literals that contain non-ASCII bytes. + A warning will be issued if non-ASCII bytes are found in the + input, once per improperly encoded input file. - A warning will be issued if non-ASCII bytes are found in the - input, once per improperly encoded input file. +2. Remove the warning, and change the default encoding to "ascii". - 2. Remove the warning, and change the default encoding to "ascii". +The builtin ``compile()`` API will be enhanced to accept Unicode as +input. 8-bit string input is subject to the standard procedure for +encoding detection as described above. - The builtin compile() API will be enhanced to accept Unicode as - input. 8-bit string input is subject to the standard procedure for - encoding detection as described above. +If a Unicode string with a coding declaration is passed to ``compile()``, +a ``SyntaxError`` will be raised. - If a Unicode string with a coding declaration is passed to compile(), - a SyntaxError will be raised. +SUZUKI Hisao is working on a patch; see [2]_ for details. A patch +implementing only phase 1 is available at [1]_. - SUZUKI Hisao is working on a patch; see [2] for details. A patch - implementing only phase 1 is available at [1]. Phases - Implementation of steps 1 and 2 above were completed in 2.3, - except for changing the default encoding to "ascii". +====== + +Implementation of steps 1 and 2 above were completed in 2.3, +except for changing the default encoding to "ascii". + +The default encoding was set to "ascii" in version 2.5. + - The default encoding was set to "ascii" in version 2.5. - Scope +===== + +This PEP intends to provide an upgrade path from the current +(more-or-less) undefined source code encoding situation to a more +robust and portable definition. - This PEP intends to provide an upgrade path from the current - (more-or-less) undefined source code encoding situation to a more - robust and portable definition. References +========== - [1] Phase 1 implementation: - http://python.org/sf/526840 - [2] Phase 2 implementation: - http://python.org/sf/534304 +.. [1] Phase 1 implementation: + https://bugs.python.org/issue526840 + +.. [2] Phase 2 implementation: + https://bugs.python.org/issue534304 History +======= + +- 1.10 and above: see CVS history +- 1.8: Added '.' to the coding RE. +- 1.7: Added warnings to phase 1 implementation. Replaced the + Latin-1 default encoding with the interpreter's default + encoding. Added tweaks to ``compile()``. +- 1.4 - 1.6: Minor tweaks +- 1.3: Worked in comments by Martin v. Loewis: + UTF-8 BOM mark detection, Emacs style magic comment, + two phase approach to the implementation - 1.10 and above: see CVS history - 1.8: Added '.' to the coding RE. - 1.7: Added warnings to phase 1 implementation. Replaced the - Latin-1 default encoding with the interpreter's default - encoding. Added tweaks to compile(). - 1.4 - 1.6: Minor tweaks - 1.3: Worked in comments by Martin v. Loewis: - UTF-8 BOM mark detection, Emacs style magic comment, - two phase approach to the implementation Copyright +========= + +This document has been placed in the public domain. - This document has been placed in the public domain. - -Local Variables: -mode: indented-text -indent-tabs-mode: nil -sentence-end-double-space: t -fill-column: 70 -coding: utf-8 -End: +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: