💫 Tidy up and auto-format .py files (#2983)

## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist  - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
explosion · Nov 30, 2018 · eddeb36 · eddeb36
1 parent 852bc2a
commit eddeb36
Show file tree

Hide file tree

Showing 268 changed files with 25,653 additions and 17,881 deletions.
diff --git a/.flake8 b/.flake8
@@ -1,4 +1,13 @@
 [flake8]
-ignore = E203, E266, E501, W503
+ignore = E203, E266, E501, E731, W503
 max-line-length = 80
 select = B,C,E,F,W,T4,B9
+exclude =
+    .env,
+    .git,
+    __pycache__,
+    lemmatizer.py,
+    lookup.py,
+    _tokenizer_exceptions_list.py,
+    spacy/lang/fr/lemmatizer,
+    spacy/lang/nb/lemmatizer
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -186,13 +186,99 @@ sure your test passes and reference the issue in your commit message.
 ## Code conventions
 
 Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
-Regular line length is **80 characters**, with some tolerance for lines up to
-90 characters if the alternative would be worse — for instance, if your list
-comprehension comes to 82 characters, it's better not to split it over two lines.
-You can also use a linter like [`flake8`](https://pypi.python.org/pypi/flake8)
-or [`frosted`](https://pypi.python.org/pypi/frosted) – just keep in mind that
-it won't work very well for `.pyx` files and will complain about Cython syntax
-like `<int*>` or `cimport`.
+As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
+formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
+Python modules. If you've built spaCy from source, you'll already have both
+tools installed.
+
+**⚠️ Note that formatting and linting is currently only possible for Python
+modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
+
+### Code formatting
+
+[`black`](https://github.com/ambv/black) is an opinionated Python code
+formatter, optimised to produce readable code and small diffs. You can run
+`black` from the command-line, or via your code editor. For example, if you're
+using [Visual Studio Code](https://code.visualstudio.com/), you can  add the
+following to your `settings.json` to use `black` for formatting and auto-format
+your files on save:
+
+```json
+{
+    "python.formatting.provider": "black",
+    "[python]": {
+        "editor.formatOnSave": true
+    }
+}
+```
+
+[See here](https://github.com/ambv/black#editor-integration) for the full
+list of available editor integrations.
+
+#### Disabling formatting
+
+There are a few cases where auto-formatting doesn't improve readability – for
+example, in some of the the language data files like the `tag_map.py`, or in
+the tests that construct `Doc` objects from lists of words and other labels.
+Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
+for that particular code. Here's an example:
+
+```python
+# fmt: off
+text = "I look forward to using Thingamajig.  I've been told it will make my life easier..."
+heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
+deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
+        "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
+        "poss", "nsubj", "ccomp", "punct"]
+# fmt: on
+```
+
+### Code linting
+
+[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
+style. It scans one or more files and outputs errors and warnings. This feedback
+can help you stick to general standards and conventions, and can be very useful
+for spotting potential mistakes and inconsistencies in your code. The most
+important things to watch out for are syntax errors and undefined names, but you
+also want to keep an eye on unused declared variables or repeated
+(i.e. overwritten) dictionary keys. If your code was formatted with `black`
+(see above), you shouldn't see any formatting-related warnings.
+
+The [`.flake8`](.flake8) config defines the configuration we use for this
+codebase. For example, we're not super strict about the line length, and we're
+excluding very large files like lemmatization and tokenizer exception tables.
+
+Ideally, running the following command from within the repo directory should
+not return any errors or warnings:
+
+```bash
+flake8 spacy
+```
+
+#### Disabling linting
+
+Sometimes, you explicitly want to write code that's not compatible with our
+rules. For example, a module's `__init__.py` might import a function so other
+modules can import it from there, but `flake8` will complain about an unused
+import. And although it's generally discouraged, there might be cases where it
+makes sense to use a bare `except`.
+
+To ignore a given line, you can add a comment like `# noqa: F401`, specifying
+the code of the error or warning we want to ignore. It's also possible to
+ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
+are some examples:
+
+```python
+# The imported class isn't used in this file, but imported here, so it can be
+# imported *from* here by another module.
+from .submodule import SomeClass  # noqa: F401
+
+try:
+    do_something()
+except:  # noqa: E722
+    # This bare except is justified, for some specific reason
+    do_something_else()
+```
 
 ### Python conventions
 

diff --git a/bin/cythonize.py b/bin/cythonize.py
@@ -35,41 +35,49 @@
 import argparse
 
 
-HASH_FILE = 'cythonize.json'
+HASH_FILE = "cythonize.json"
 
 
-def process_pyx(fromfile, tofile, language_level='-2'):
-    print('Processing %s' % fromfile)
+def process_pyx(fromfile, tofile, language_level="-2"):
+    print("Processing %s" % fromfile)
     try:
         from Cython.Compiler.Version import version as cython_version
         from distutils.version import LooseVersion
-        if LooseVersion(cython_version) < LooseVersion('0.19'):
-            raise Exception('Require Cython >= 0.19')
+
+        if LooseVersion(cython_version) < LooseVersion("0.19"):
+            raise Exception("Require Cython >= 0.19")
 
     except ImportError:
         pass
 
-    flags = ['--fast-fail', language_level]
-    if tofile.endswith('.cpp'):
-        flags += ['--cplus']
+    flags = ["--fast-fail", language_level]
+    if tofile.endswith(".cpp"):
+        flags += ["--cplus"]
 
     try:
         try:
-            r = subprocess.call(['cython'] + flags + ['-o', tofile, fromfile],
-                                env=os.environ) # See Issue #791
+            r = subprocess.call(
+                ["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
+            )  # See Issue #791
             if r != 0:
-                raise Exception('Cython failed')
+                raise Exception("Cython failed")
         except OSError:
             # There are ways of installing Cython that don't result in a cython
             # executable on the path, see gh-2397.
-            r = subprocess.call([sys.executable, '-c',
-                                'import sys; from Cython.Compiler.Main import '
-                                'setuptools_main as main; sys.exit(main())'] + flags +
-                                ['-o', tofile, fromfile])
+            r = subprocess.call(
+                [
+                    sys.executable,
+                    "-c",
+                    "import sys; from Cython.Compiler.Main import "
+                    "setuptools_main as main; sys.exit(main())",
+                ]
+                + flags
+                + ["-o", tofile, fromfile]
+            )
             if r != 0:
-                raise Exception('Cython failed')
+                raise Exception("Cython failed")
     except OSError:
-        raise OSError('Cython needs to be installed')
+        raise OSError("Cython needs to be installed")
 
 
 def preserve_cwd(path, func, *args):
@@ -89,12 +97,12 @@ def load_hashes(filename):
 
 
 def save_hashes(hash_db, filename):
-    with open(filename, 'w') as f:
+    with open(filename, "w") as f:
         f.write(json.dumps(hash_db))
 
 
 def get_hash(path):
-    return hashlib.md5(open(path, 'rb').read()).hexdigest()
+    return hashlib.md5(open(path, "rb").read()).hexdigest()
 
 
 def hash_changed(base, path, db):
@@ -109,25 +117,27 @@ def hash_add(base, path, db):
 
 def process(base, filename, db):
     root, ext = os.path.splitext(filename)
-    if ext in ['.pyx', '.cpp']:
-        if hash_changed(base, filename, db) or not os.path.isfile(os.path.join(base, root + '.cpp')):
-            preserve_cwd(base, process_pyx, root + '.pyx', root + '.cpp')
-            hash_add(base, root + '.cpp', db)
-            hash_add(base, root + '.pyx', db)
+    if ext in [".pyx", ".cpp"]:
+        if hash_changed(base, filename, db) or not os.path.isfile(
+            os.path.join(base, root + ".cpp")
+        ):
+            preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
+            hash_add(base, root + ".cpp", db)
+            hash_add(base, root + ".pyx", db)
 
 
 def check_changes(root, db):
     res = False
     new_db = {}
 
-    setup_filename = 'setup.py'
-    hash_add('.', setup_filename, new_db)
-    if hash_changed('.', setup_filename, db):
+    setup_filename = "setup.py"
+    hash_add(".", setup_filename, new_db)
+    if hash_changed(".", setup_filename, db):
         res = True
 
     for base, _, files in os.walk(root):
         for filename in files:
-            if filename.endswith('.pxd'):
+            if filename.endswith(".pxd"):
                 hash_add(base, filename, new_db)
                 if hash_changed(base, filename, db):
                     res = True
@@ -150,8 +160,10 @@ def run(root):
         save_hashes(db, HASH_FILE)
 
 
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Cythonize pyx files into C++ files as needed')
-    parser.add_argument('root', help='root directory')
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Cythonize pyx files into C++ files as needed"
+    )
+    parser.add_argument("root", help="root directory")
     args = parser.parse_args()
     run(args.root)
diff --git a/bin/load_reddit.py b/bin/load_reddit.py
@@ -15,12 +15,13 @@
 
 class Reddit(object):
     """Stream cleaned comments from Reddit."""
-    pre_format_re = re.compile(r'^[\`\*\~]')
-    post_format_re = re.compile(r'[\`\*\~]$')
-    url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
-    link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
 
-    def __init__(self, file_path, meta_keys={'subreddit': 'section'}):
+    pre_format_re = re.compile(r"^[\`\*\~]")
+    post_format_re = re.compile(r"[\`\*\~]$")
+    url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
+    link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")
+
+    def __init__(self, file_path, meta_keys={"subreddit": "section"}):
         """
         file_path (unicode / Path): Path to archive or directory of archives.
         meta_keys (dict): Meta data key included in the Reddit corpus, mapped
@@ -45,28 +46,30 @@ def __iter__(self):
                         continue
                     comment = ujson.loads(line)
                     if self.is_valid(comment):
-                        text = self.strip_tags(comment['body'])
-                        yield {'text': text}
+                        text = self.strip_tags(comment["body"])
+                        yield {"text": text}
 
     def get_meta(self, item):
-        return {name: item.get(key, 'n/a') for key, name in self.meta.items()}
+        return {name: item.get(key, "n/a") for key, name in self.meta.items()}
 
     def iter_files(self):
         for file_path in self.files:
             yield file_path
 
     def strip_tags(self, text):
-        text = self.link_re.sub(r'\1', text)
-        text = text.replace('&gt;', '>').replace('&lt;', '<')
-        text = self.pre_format_re.sub('', text)
-        text = self.post_format_re.sub('', text)
-        text = re.sub(r'\s+', ' ', text)
+        text = self.link_re.sub(r"\1", text)
+        text = text.replace("&gt;", ">").replace("&lt;", "<")
+        text = self.pre_format_re.sub("", text)
+        text = self.post_format_re.sub("", text)
+        text = re.sub(r"\s+", " ", text)
         return text.strip()
 
     def is_valid(self, comment):
-        return comment['body'] is not None \
-            and comment['body'] != '[deleted]' \
-            and comment['body'] != '[removed]'
+        return (
+            comment["body"] is not None
+            and comment["body"] != "[deleted]"
+            and comment["body"] != "[removed]"
+        )
 
 
 def main(path):
@@ -75,16 +78,18 @@ def main(path):
         print(ujson.dumps(comment))
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     import socket
+
     try:
         BrokenPipeError
     except NameError:
         BrokenPipeError = socket.error
     try:
         plac.call(main)
-    except BrokenPipeError: 
+    except BrokenPipeError:
         import os, sys
+
         # Python flushes standard streams on exit; redirect remaining output
         # to devnull to avoid another BrokenPipeError at shutdown
         devnull = os.open(os.devnull, os.O_WRONLY)

diff --git a/requirements.txt b/requirements.txt
@@ -11,7 +11,10 @@ ujson>=1.35
 dill>=0.2,<0.3
 regex==2018.01.10
 requests>=2.13.0,<3.0.0
+pathlib==1.0.1; python_version < "3.4"
+# Development dependencies
 pytest>=4.0.0,<5.0.0
 pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
-pathlib==1.0.1; python_version < "3.4"
+black==18.9b0
+flake8>=3.5.0,<3.6.0