testing multiprocessing for faster finds! #63

Merged · 2 commits · Mar 27, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -12,6 +12,7 @@ and **Merged pull requests**. Critical items to know are:
Referenced versions in headers are tagged on Github, in parentheses are for pypi.

## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master)
- multiprocessing to speed up checks (0.0.26)
- bug fix for verbose option to only print file names that have failures (0.0.25)
- adding option to print a summary that contains file names and urls (0.0.24)
- updating container base to use debian buster and adding certifi (0.0.23)
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
4 changes: 3 additions & 1 deletion README.md
@@ -8,7 +8,9 @@
This is a python module to collect urls over static files (code and documentation)
and then test for and report broken links. If you are interested in using
this as a GitHub action, see [urlchecker-action](https://github.com/urlstechie/urlchecker-action). There are also container
bases available on [quay.io/urlstechie/urlchecker](https://quay.io/repository/urlstechie/urlchecker?tab=tags).
bases available on [quay.io/urlstechie/urlchecker](https://quay.io/repository/urlstechie/urlchecker?tab=tags). As of version
0.0.26, we use multiprocessing so the checks run a lot faster, and you can set `URLCHECKER_WORKERS` to change the number of workers
(defaults to 9). If you don't want multiprocessing, use version 0.0.25 or earlier.
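The worker-count behavior the README describes can be sketched in a few lines; `get_worker_count` is a hypothetical helper for illustration, mirroring how the new `Workers` class reads `URLCHECKER_WORKERS`:

```python
import os


def get_worker_count(default=9):
    # URLCHECKER_WORKERS overrides the documented default of 9 workers
    return int(os.environ.get("URLCHECKER_WORKERS", default))


os.environ["URLCHECKER_WORKERS"] = "4"
print(get_worker_count())  # 4
```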

## Module Documentation

2 changes: 1 addition & 1 deletion docs/source/fileproc.rst
@@ -1,5 +1,5 @@
urlchecker.core.fileproc
==========================
========================


.. automodule:: urlchecker.core.fileproc
11 changes: 2 additions & 9 deletions urlchecker/__init__.py
@@ -1,10 +1,3 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.

"""

from urlchecker.version import __version__

assert __version__
2 changes: 1 addition & 1 deletion urlchecker/client/__init__.py
@@ -2,7 +2,7 @@

"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
6 changes: 3 additions & 3 deletions urlchecker/client/check.py
@@ -1,6 +1,6 @@
"""
client/github.py: entrypoint for interaction with a GitHub repository.
Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
"""

import re
@@ -106,9 +106,9 @@ def main(args, extra):
if args.verbose:
print("\n\U0001F914 Uh oh... The following urls did not pass:")
for file_name, result in checker.checks.items():
if result.failed:
if result["failed"]:
print_failure(file_name + ":")
for url in result.failed:
for url in result["failed"]:
print_failure(" " + url)
else:
print("\n\U0001F914 Uh oh... The following urls did not pass:")
84 changes: 61 additions & 23 deletions urlchecker/core/check.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
@@ -12,6 +12,7 @@
import re
import sys
from urlchecker.core import fileproc
from urlchecker.core.worker import Workers
from urlchecker.core.urlproc import UrlCheckResult


@@ -41,6 +42,8 @@ def __init__(
"""
# Initiate results object, and checks lookup (holds UrlCheck) for each file
self.results = {"passed": set(), "failed": set(), "excluded": set()}

# Results organized by filename
self.checks = {}

# Save run parameters
@@ -123,12 +126,18 @@ def save_results(self, file_path, sep=",", header=None, relative_paths=True):
else:
file_name = os.path.relpath(file_name)

[writer.writerow([url, "failed", file_name]) for url in result.failed]
[
writer.writerow([url, "failed", file_name])
for url in result["failed"]
]
[
writer.writerow([url, "excluded", file_name])
for url in result.excluded
for url in result["excluded"]
]
[
writer.writerow([url, "passed", file_name])
for url in result["passed"]
]
[writer.writerow([url, "passed", file_name]) for url in result.passed]

return file_path

@@ -161,27 +170,56 @@ def run(
exclude_urls = exclude_urls or []
exclude_patterns = exclude_patterns or []

# loop through files files
for file_name in file_paths:

# Instantiate a checker to extract urls
checker = UrlCheckResult(
file_name=file_name,
exclude_patterns=exclude_patterns,
exclude_urls=exclude_urls,
print_all=self.print_all,
)

# Check the urls
checker.check_urls(retry_count=retry_count, timeout=timeout)
# Run with multiprocessing
tasks = {}
funcs = {}
workers = Workers()

# Update flattened results
self.results["failed"].update(checker.failed)
self.results["passed"].update(checker.passed)
self.results["excluded"].update(checker.excluded)
# loop through files
for file_name in file_paths:

# Save the checker in the lookup
self.checks[file_name] = checker
# Export parameters and functions, use the same check task for all
tasks[file_name] = {
"file_name": file_name,
"exclude_patterns": exclude_patterns,
"exclude_urls": exclude_urls,
"print_all": self.print_all,
"retry_count": retry_count,
"timeout": timeout,
}
funcs[file_name] = check_task

results = workers.run(funcs, tasks)
for file_name, result in results.items():
self.checks[file_name] = result
self.results["failed"].update(result["failed"])
self.results["passed"].update(result["passed"])
self.results["excluded"].update(result["excluded"])

# A flattened dict of passed and failed
return self.results


def check_task(*args, **kwargs):
"""
A checking task, the default we use
"""
# Instantiate a checker to extract urls
checker = UrlCheckResult(
file_name=kwargs["file_name"],
exclude_patterns=kwargs.get("exclude_patterns", []),
exclude_urls=kwargs.get("exclude_urls", []),
print_all=kwargs.get("print_all", True),
)

# Check the urls
checker.check_urls(
retry_count=kwargs.get("retry_count", 2), timeout=kwargs.get("timeout", 5)
)

# Update flattened results
return {
"failed": checker.failed,
"passed": checker.passed,
"excluded": checker.excluded,
}
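The fan-out/merge pattern that the new `run` method and `check_task` implement above can be sketched without a real pool; `serial_run` and `check_task_stub` are hypothetical stand-ins for `Workers.run` and `check_task`, kept only to show the task/function dictionaries and the result merge:

```python
def check_task_stub(file_name, urls=(), **kwargs):
    # Hypothetical stand-in for check_task: classify each url for one file
    passed = {u for u in urls if "good" in u}
    return {"passed": passed, "failed": set(urls) - passed, "excluded": set()}


def serial_run(funcs, tasks):
    # Stand-in for Workers.run: same contract, but runs tasks in-process
    return {name: funcs[name](**params) for name, params in tasks.items()}


# One task (a dict of keyword arguments) and one function per file
tasks = {
    "a.md": {"file_name": "a.md", "urls": ["https://good.example", "https://bad.example"]},
    "b.md": {"file_name": "b.md", "urls": ["https://good.example/docs"]},
}
funcs = {name: check_task_stub for name in tasks}

# Merge per-file results into a flattened results dict, as run() does
results = {"passed": set(), "failed": set(), "excluded": set()}
for name, result in serial_run(funcs, tasks).items():
    for key in results:
        results[key].update(result[key])

print(sorted(results["failed"]))  # ['https://bad.example']
```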
2 changes: 1 addition & 1 deletion urlchecker/core/exclude.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
2 changes: 1 addition & 1 deletion urlchecker/core/fileproc.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
2 changes: 1 addition & 1 deletion urlchecker/core/urlmarker.py
@@ -4,7 +4,7 @@
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
7 changes: 1 addition & 6 deletions urlchecker/core/urlproc.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
@@ -168,14 +168,9 @@ def check_urls(self, urls=None, retry_count=1, timeout=5):
# if no urls are found, mention it if required
if not urls:
if self.print_all:
if self.file_name:
print("\n", self.file_name, "\n", "-" * len(self.file_name))
print("No urls found.")
return

if self.file_name:
print("\n", self.file_name, "\n", "-" * len(self.file_name))

# init seen urls list
seen = set()

109 changes: 109 additions & 0 deletions urlchecker/core/worker.py
@@ -0,0 +1,109 @@
"""

Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.

"""

import itertools
import multiprocessing
import os
import time
import signal
import sys

from urlchecker.logger import get_logger

logger = get_logger()


class Workers:
def __init__(self, workers=None):

if workers is None:
workers = int(os.environ.get("URLCHECKER_WORKERS", 9))
self.workers = workers
logger.debug(f"Using {self.workers} workers for multiprocess.")

def start(self):
logger.debug("Starting multiprocess")
self.start_time = time.time()

def end(self):
self.end_time = time.time()
self.runtime = self.end_time - self.start_time
logger.debug(f"Ending multiprocess, runtime: {self.runtime} sec")

def run(self, funcs, tasks):
"""run will send each task, a dict of keyword arguments, through its
matching function using the multiprocessing pool.

Parameters
==========
funcs: the functions to run with multiprocessing.pool, a dictionary
with lookup by the task name
tasks: a dict of tasks, each task name (key) mapped to a
dict of keyword arguments to process
"""
# Number of tasks must == number of functions
assert len(funcs) == len(tasks)

# Keep track of some progress for the user
progress = 1

# if we don't have tasks, don't run
if not tasks:
return {}

# results will also have the same key to look up
finished = dict()
results = []

try:
pool = multiprocessing.Pool(self.workers, init_worker)

self.start()
for key, params in tasks.items():
func = funcs[key]
result = pool.apply_async(multi_wrapper, multi_package(func, [params]))

# Store the key with the result
results.append((key, result))

while len(results) > 0:
pair = results.pop()
key, result = pair
result.wait()
progress += 1
finished[key] = result.get()

self.end()
pool.close()
pool.join()

except (KeyboardInterrupt, SystemExit):
logger.error("Keyboard interrupt detected, terminating workers!")
pool.terminate()
sys.exit(1)

except:
logger.exit("Error running task")

return finished


# Supporting functions for MultiProcess Worker
def init_worker():
signal.signal(signal.SIGINT, signal.SIG_IGN)


def multi_wrapper(func_args):
function, kwargs = func_args
return function(**kwargs)


def multi_package(func, kwargs):
zipped = zip(itertools.repeat(func), kwargs)
return zipped
2 changes: 1 addition & 1 deletion urlchecker/logger.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
2 changes: 1 addition & 1 deletion urlchecker/main/github.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
2 changes: 1 addition & 1 deletion urlchecker/main/utils.py
@@ -1,6 +1,6 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
4 changes: 2 additions & 2 deletions urlchecker/version.py
@@ -1,13 +1,13 @@
"""

Copyright (c) 2020-2021 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.

"""

__version__ = "0.0.25"
__version__ = "0.0.26"
AUTHOR = "Ayoub Malek, Vanessa Sochat"
AUTHOR_EMAIL = "superkogito@gmail.com, vsochat@stanford.edu"
NAME = "urlchecker"