-
Notifications
You must be signed in to change notification settings - Fork 15
Report: community detection library #85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| from collections import defaultdict | ||
| from itertools import chain | ||
| import logging | ||
| from igraph import Graph |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty line is missed https://www.python.org/dev/peps/pep-0008/#imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed in 6a19934
| @@ -0,0 +1,6 @@ | |||
| numpy==1.14.2 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to specify this deps?
sourced-ml already has them: https://github.com/src-d/ml/blob/master/requirements.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, you are right. Fixed in 6a19934
| :return: JSON with data, indptr | ||
| """ | ||
|
|
||
| log = logging.getLogger("community-detector") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From TODO in PR description:
Consider if log should be removed
I would keep logging. I see you use info level. The default level is warning, so they won't pollute output. But logs will be useful for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to debug level in 6a19934. Is that ok or do you think it would be better as warning for any reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both debug&info are okay. None of them would appear by default.
| algorithm_params={}): | ||
| """ | ||
| Runs Community Detection analysis on the given connected components. | ||
| :param cc: numpy.ndarray with the connected components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to code looks like it doesn't have to be numpy.ndarray. But if I'm wrong, please specify shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I just took the type used in apollo, but a list works with the current code. I've changed the doc and the test in 6a19934 to make sure we don't break it in the future.
| shape=input['id_to_buckets_shape']) | ||
|
|
||
|
|
||
| class TestServer(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops! copy&paste from test_server.py. Fixed in 6a19934
| :param buckets_matrix: scipy.sparse.csr_matrix with the buckets | ||
| :param edges: The method to generate the graph's edges: | ||
| - linear: linear and fast, but may not fit some of the CD algorithms, or all to all within a bucket | ||
| - quadratic: slow, but surely fits all the algorithms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we should check edges param in code as it is in doc? Currently code accepts linear or 1 as linear and everything else as quadratic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've added it in 6a19934
|
@carlosms nice work! A bit confused by:
One more thing - could you also please make sure that the code follows the style guild, like https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments etc ? Also, in case if some major bodies of code come from another project - it's a good idea to keep an attribution by putting a link in comments, etc. As soon as it's addressed - please ping for another pass. |
| """ | ||
| Runs Community Detection analysis on the given connected components. | ||
| :param cc: numpy.ndarray with the connected components | ||
| :param buckets_matrix: scipy.sparse.csr_matrix with the buckets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please document relationship of cc & buckets?
| buckindices = buckets_matrix.indices | ||
| buckindptr = buckets_matrix.indptr | ||
| total_nvertices = buckets_matrix.shape[0] | ||
| linear = edges in ("linear", "1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edges == "linear". according to the check above it can't be 1.
|
about:
I would keep it possible pass different params, but wrote tests only for the subset we really need for our purposes. (also document it) @bzz what do you think? |
|
@bzz said:
I wanted to say that the PR is not finished, but I wanted to get some feedback and also had some questions on how to continue. I'll stop using the WIP in the title and just write a proper sentence in the description from now on...
Those are the things I'd like to do before merging. The unchecked items are not done yet.
My bad, I was not aware that we are following the Google Python style for docs. and just used the same comment style as Apollo (which I believe uses sphinx style). This is now changed in 6a19934.
Very true. Added in 6a19934. |
Based on the apollo cmd command, uses the same code but simplified Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
ebb1446 to
b8832ba
Compare
| """Runs Community Detection analysis on the given connected components. | ||
|
|
||
| Based largely on the Apollo detect_communities() code from | ||
| https://github.com/src-d/apollo/blob/master/apollo/graph.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grr, GH is not showing reviews from commits in the PR 6a19934#r28488078 :/
| @@ -0,0 +1,5 @@ | |||
| python-igraph==0.7.1.post6 | |||
| scipy==1.0.1 | |||
| cassandra_driver >= 3.12.0, <4.0 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we use Cassandra somewhere in fixture generation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because apollo is not available in pip, it is cloned in ./apollo. Instead of installing apollo/requirements.txt, the minimum dependencies to call apollo are included here.
Without cassandra driver, it breaks with:
Traceback (most recent call last):
File "generator.py", line 5, in <module>
from apollo.graph import ConnectedComponentsModel, CommunitiesModel, detect_communities
File "./apollo/apollo/graph.py", line 17, in <module>
from apollo.cassandra_utils import get_db, patch_tables, BatchedHashResolver, Session
File "./apollo/apollo/cassandra_utils.py", line 9, in <module>
from cassandra.cluster import Cluster, Session, NoHostAvailable
ModuleNotFoundError: No module named 'cassandra'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for tests generation, so it's fine. Thank you for explaining.
But this makes me think that may be all of ConnectedComponentsModel, CommunitiesModel and even CommunityDetector are a candidates to be proposed refactored out of graph.py (to minimize dependency footprint) in the future.
Not sure if I understood it right, but by looking at the code it seems like this library already supports and exposes on API-level exactly the same set of algorithms and params as Apollo, which is very good. If the question is weather those should be exposed to the Gemini CLI - I guess that should be take care of in subsequent PRs #60 and the answer is that for now, only a hard-coded defaults should be used, same ones as apollo does by default. Does it answer your question? |
|
|
||
|
|
||
| class CommunityDetector: | ||
| def __init__(self, algorithm, config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, it would be nice to have a references to origins of the code, coming from other projects i.e https://github.com/src-d/apollo/blob/6b370b5f34ba9e31cf3310e70a2eff35dd978faa/apollo/graph.py#L267 here.
BTW, there seems to be something we would be interested to include here from src-d/apollo@8006248
|
@bzz thanks for the feedback, it should all be addressed now.
Yes thank you, I wanted to confirm if the current code is OK. In the design doc. we had "1 algorithm, no pySpark", so I wanted to check if you wanted to force the lib to use 1 algorithm only. Limiting it later in the CLI as you said makes sense. |
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Ports the changes from src-d/apollo@8006248 Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
| def print_usage(_): | ||
| parser.print_usage() | ||
|
|
||
| handler = print_usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler:
handler = lambda _: parser.print_usage()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much nicer, thank you. This was a copy&paste of pieces from apollo.
|
|
||
| def build_csr_matrix(input): | ||
| return csr_matrix( | ||
| (input['id_to_buckets_data'], input['id_to_buckets_indices'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to avoid using input as a variable name in python. Because it's built-in function: https://docs.python.org/3/library/functions.html#input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super useful heads up, thank you!
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
|
Great job, @carlosms ! Should be good to merge now |
Fixes #59.
There are 2 commits:
The code basically consists of the apollo code in graph.py#detect_communities() with these differences:
There is a test that uses numpy
.npzfiles as input, and as output to compare. These npz files can be regenerated withgenerator.py, but do not need to be created every time the test is run.The source data is the
ccs.asdffile. This file was created with theapollo cccommand using the data from https://github.com/src-d/backlog/issues/1222.TODO:
Add tests for more (all?) theMaybe later on. The code is ready to run any algorithm, but for now the CLI will only use the default, which is already tested.edges,algorithm, andalgorithm_paramsQuestion: do we want to support all the
edges,algorithm, andalgorithm_params? If not, should it be a subset, or a unique default?