Skip to content

Commit

Permalink
💫 Tidy up and auto-format .py files (#2983)
Browse files Browse the repository at this point in the history
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
  • Loading branch information
ines authored Nov 30, 2018
1 parent 852bc2a commit eddeb36
Show file tree
Hide file tree
Showing 268 changed files with 25,653 additions and 17,881 deletions.
11 changes: 10 additions & 1 deletion .flake8
Original file line number Diff line number Diff line change
@@ -1,4 +1,13 @@
[flake8]
ignore = E203, E266, E501, W503
ignore = E203, E266, E501, E731, W503
max-line-length = 80
select = B,C,E,F,W,T4,B9
exclude =
.env,
.git,
__pycache__,
lemmatizer.py,
lookup.py,
_tokenizer_exceptions_list.py,
spacy/lang/fr/lemmatizer,
spacy/lang/nb/lemmatizer
100 changes: 93 additions & 7 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -186,13 +186,99 @@ sure your test passes and reference the issue in your commit message.
## Code conventions

Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
Regular line length is **80 characters**, with some tolerance for lines up to
90 characters if the alternative would be worse — for instance, if your list
comprehension comes to 82 characters, it's better not to split it over two lines.
You can also use a linter like [`flake8`](https://pypi.python.org/pypi/flake8)
or [`frosted`](https://pypi.python.org/pypi/frosted) – just keep in mind that
it won't work very well for `.pyx` files and will complain about Cython syntax
like `<int*>` or `cimport`.
As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
Python modules. If you've built spaCy from source, you'll already have both
tools installed.

**⚠️ Note that formatting and linting is currently only possible for Python
modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**

### Code formatting

[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:

```json
{
"python.formatting.provider": "black",
"[python]": {
"editor.formatOnSave": true
}
}
```

[See here](https://github.com/ambv/black#editor-integration) for the full
list of available editor integrations.

#### Disabling formatting

There are a few cases where auto-formatting doesn't improve readability – for
example, in some of the the language data files like the `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:

```python
# fmt: off
text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
"nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
"poss", "nsubj", "ccomp", "punct"]
# fmt: on
```

### Code linting

[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
style. It scans one or more files and outputs errors and warnings. This feedback
can help you stick to general standards and conventions, and can be very useful
for spotting potential mistakes and inconsistencies in your code. The most
important things to watch out for are syntax errors and undefined names, but you
also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.

The [`.flake8`](.flake8) config defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.

Ideally, running the following command from within the repo directory should
not return any errors or warnings:

```bash
flake8 spacy
```

#### Disabling linting

Sometimes, you explicitly want to write code that's not compatible with our
rules. For example, a module's `__init__.py` might import a function so other
modules can import it from there, but `flake8` will complain about an unused
import. And although it's generally discouraged, there might be cases where it
makes sense to use a bare `except`.

To ignore a given line, you can add a comment like `# noqa: F401`, specifying
the code of the error or warning we want to ignore. It's also possible to
ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
are some examples:

```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass # noqa: F401

try:
do_something()
except: # noqa: E722
# This bare except is justified, for some specific reason
do_something_else()
```

### Python conventions

Expand Down
74 changes: 43 additions & 31 deletions bin/cythonize.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,41 +35,49 @@
import argparse


HASH_FILE = 'cythonize.json'
HASH_FILE = "cythonize.json"


def process_pyx(fromfile, tofile, language_level='-2'):
print('Processing %s' % fromfile)
def process_pyx(fromfile, tofile, language_level="-2"):
print("Processing %s" % fromfile)
try:
from Cython.Compiler.Version import version as cython_version
from distutils.version import LooseVersion
if LooseVersion(cython_version) < LooseVersion('0.19'):
raise Exception('Require Cython >= 0.19')

if LooseVersion(cython_version) < LooseVersion("0.19"):
raise Exception("Require Cython >= 0.19")

except ImportError:
pass

flags = ['--fast-fail', language_level]
if tofile.endswith('.cpp'):
flags += ['--cplus']
flags = ["--fast-fail", language_level]
if tofile.endswith(".cpp"):
flags += ["--cplus"]

try:
try:
r = subprocess.call(['cython'] + flags + ['-o', tofile, fromfile],
env=os.environ) # See Issue #791
r = subprocess.call(
["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
) # See Issue #791
if r != 0:
raise Exception('Cython failed')
raise Exception("Cython failed")
except OSError:
# There are ways of installing Cython that don't result in a cython
# executable on the path, see gh-2397.
r = subprocess.call([sys.executable, '-c',
'import sys; from Cython.Compiler.Main import '
'setuptools_main as main; sys.exit(main())'] + flags +
['-o', tofile, fromfile])
r = subprocess.call(
[
sys.executable,
"-c",
"import sys; from Cython.Compiler.Main import "
"setuptools_main as main; sys.exit(main())",
]
+ flags
+ ["-o", tofile, fromfile]
)
if r != 0:
raise Exception('Cython failed')
raise Exception("Cython failed")
except OSError:
raise OSError('Cython needs to be installed')
raise OSError("Cython needs to be installed")


def preserve_cwd(path, func, *args):
Expand All @@ -89,12 +97,12 @@ def load_hashes(filename):


def save_hashes(hash_db, filename):
with open(filename, 'w') as f:
with open(filename, "w") as f:
f.write(json.dumps(hash_db))


def get_hash(path):
return hashlib.md5(open(path, 'rb').read()).hexdigest()
return hashlib.md5(open(path, "rb").read()).hexdigest()


def hash_changed(base, path, db):
Expand All @@ -109,25 +117,27 @@ def hash_add(base, path, db):

def process(base, filename, db):
root, ext = os.path.splitext(filename)
if ext in ['.pyx', '.cpp']:
if hash_changed(base, filename, db) or not os.path.isfile(os.path.join(base, root + '.cpp')):
preserve_cwd(base, process_pyx, root + '.pyx', root + '.cpp')
hash_add(base, root + '.cpp', db)
hash_add(base, root + '.pyx', db)
if ext in [".pyx", ".cpp"]:
if hash_changed(base, filename, db) or not os.path.isfile(
os.path.join(base, root + ".cpp")
):
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
hash_add(base, root + ".cpp", db)
hash_add(base, root + ".pyx", db)


def check_changes(root, db):
res = False
new_db = {}

setup_filename = 'setup.py'
hash_add('.', setup_filename, new_db)
if hash_changed('.', setup_filename, db):
setup_filename = "setup.py"
hash_add(".", setup_filename, new_db)
if hash_changed(".", setup_filename, db):
res = True

for base, _, files in os.walk(root):
for filename in files:
if filename.endswith('.pxd'):
if filename.endswith(".pxd"):
hash_add(base, filename, new_db)
if hash_changed(base, filename, db):
res = True
Expand All @@ -150,8 +160,10 @@ def run(root):
save_hashes(db, HASH_FILE)


if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Cythonize pyx files into C++ files as needed')
parser.add_argument('root', help='root directory')
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Cythonize pyx files into C++ files as needed"
)
parser.add_argument("root", help="root directory")
args = parser.parse_args()
run(args.root)
41 changes: 23 additions & 18 deletions bin/load_reddit.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,13 @@

class Reddit(object):
"""Stream cleaned comments from Reddit."""
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')

def __init__(self, file_path, meta_keys={'subreddit': 'section'}):
pre_format_re = re.compile(r"^[\`\*\~]")
post_format_re = re.compile(r"[\`\*\~]$")
url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

def __init__(self, file_path, meta_keys={"subreddit": "section"}):
"""
file_path (unicode / Path): Path to archive or directory of archives.
meta_keys (dict): Meta data key included in the Reddit corpus, mapped
Expand All @@ -45,28 +46,30 @@ def __iter__(self):
continue
comment = ujson.loads(line)
if self.is_valid(comment):
text = self.strip_tags(comment['body'])
yield {'text': text}
text = self.strip_tags(comment["body"])
yield {"text": text}

def get_meta(self, item):
return {name: item.get(key, 'n/a') for key, name in self.meta.items()}
return {name: item.get(key, "n/a") for key, name in self.meta.items()}

def iter_files(self):
for file_path in self.files:
yield file_path

def strip_tags(self, text):
text = self.link_re.sub(r'\1', text)
text = text.replace('&gt;', '>').replace('&lt;', '<')
text = self.pre_format_re.sub('', text)
text = self.post_format_re.sub('', text)
text = re.sub(r'\s+', ' ', text)
text = self.link_re.sub(r"\1", text)
text = text.replace("&gt;", ">").replace("&lt;", "<")
text = self.pre_format_re.sub("", text)
text = self.post_format_re.sub("", text)
text = re.sub(r"\s+", " ", text)
return text.strip()

def is_valid(self, comment):
return comment['body'] is not None \
and comment['body'] != '[deleted]' \
and comment['body'] != '[removed]'
return (
comment["body"] is not None
and comment["body"] != "[deleted]"
and comment["body"] != "[removed]"
)


def main(path):
Expand All @@ -75,16 +78,18 @@ def main(path):
print(ujson.dumps(comment))


if __name__ == '__main__':
if __name__ == "__main__":
import socket

try:
BrokenPipeError
except NameError:
BrokenPipeError = socket.error
try:
plac.call(main)
except BrokenPipeError:
except BrokenPipeError:
import os, sys

# Python flushes standard streams on exit; redirect remaining output
# to devnull to avoid another BrokenPipeError at shutdown
devnull = os.open(os.devnull, os.O_WRONLY)
Expand Down
5 changes: 4 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,10 @@ ujson>=1.35
dill>=0.2,<0.3
regex==2018.01.10
requests>=2.13.0,<3.0.0
pathlib==1.0.1; python_version < "3.4"
# Development dependencies
pytest>=4.0.0,<5.0.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
pathlib==1.0.1; python_version < "3.4"
black==18.9b0
flake8>=3.5.0,<3.6.0
Loading

0 comments on commit eddeb36

Please sign in to comment.