Skip to content

Conversation

@carlosms
Copy link
Contributor

@carlosms carlosms commented Apr 3, 2018

Fixes #59.

There are 2 commits:

  • 52ca345 is the relevant one
  • 75b7f2b moves the current python code for feature-extractor to a new subfolder

The code basically consists of the apollo code in graph.py#detect_communities() with these differences:

  • Instead of a file, the connected components input is given as method arguments
  • Instead of writing a file, the output is returned
  • There is no option to use spark
  • The code has been stripped of all non-essential dependencies

There is a test that uses numpy .npz files as input, and as output to compare. These npz files can be regenerated with generator.py, but do not need to be created every time the test is run.
The source data is the ccs.asdf file. This file was created with the apollo cc command using the data from https://github.com/src-d/backlog/issues/1222.

TODO:

  • Add new tests to travis.yml
  • Add tests for more (all?) the edges, algorithm, and algorithm_params Maybe later on. The code is ready to run any algorithm, but for now the CLI will only use the default, which is already tested.
  • Consider if log should be removed => left with debug level

Question: do we want to support all the edges, algorithm, and algorithm_params? If not, should it be a subset, or a unique default?

@carlosms carlosms requested review from bzz and smacker April 3, 2018 16:48
from collections import defaultdict
from itertools import chain
import logging
from igraph import Graph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed in 6a19934

@@ -0,0 +1,6 @@
numpy==1.14.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to specify this deps?
sourced-ml already has them: https://github.com/src-d/ml/blob/master/requirements.txt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you are right. Fixed in 6a19934

:return: JSON with data, indptr
"""

log = logging.getLogger("community-detector")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From TODO in PR description:

Consider if log should be removed

I would keep logging. I see you use info level. The default level is warning, so they won't pollute output. But logs will be useful for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to debug level in 6a19934. Is that ok or do you think it would be better as warning for any reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both debug&info are okay. None of them would appear by default.

algorithm_params={}):
"""
Runs Community Detection analysis on the given connected components.
:param cc: numpy.ndarray with the connected components
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to code looks like it doesn't have to be numpy.ndarray. But if I'm wrong, please specify shape.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I just took the type used in apollo, but a list works with the current code. I've changed the doc and the test in 6a19934 to make sure we don't break it in the future.

shape=input['id_to_buckets_shape'])


class TestServer(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops! copy&paste from test_server.py. Fixed in 6a19934

:param buckets_matrix: scipy.sparse.csr_matrix with the buckets
:param edges: The method to generate the graph's edges:
- linear: linear and fast, but may not fit some of the CD algorithms, or all to all within a bucket
- quadratic: slow, but surely fits all the algorithms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think we should check edges param in code as it is in doc? Currently code accepts linear or 1 as linear and everything else as quadratic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've added it in 6a19934

@bzz
Copy link
Contributor

bzz commented Apr 4, 2018

@carlosms nice work! A bit confused by:

  • WIP in Issue title - this usually communicates "do not look here, it's not ready yet". Is that the case?
  • un-checked TODO items. What is the status of those?

One more thing - could you also please make sure that the code follows the style guild, like https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments etc ?

Also, in case if some major bodies of code come from another project - it's a good idea to keep an attribution by putting a link in comments, etc.

As soon as it's addressed - please ping for another pass.

"""
Runs Community Detection analysis on the given connected components.
:param cc: numpy.ndarray with the connected components
:param buckets_matrix: scipy.sparse.csr_matrix with the buckets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please document relationship of cc & buckets?

buckindices = buckets_matrix.indices
buckindptr = buckets_matrix.indptr
total_nvertices = buckets_matrix.shape[0]
linear = edges in ("linear", "1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edges == "linear". according to the check above it can't be 1.

@smacker
Copy link
Contributor

smacker commented Apr 4, 2018

about:

Question: do we want to support all the edges, algorithm, and algorithm_params? If not, should it be a subset, or a unique default?

I would keep it possible pass different params, but wrote tests only for the subset we really need for our purposes. (also document it)

@bzz what do you think?

@carlosms carlosms changed the title [WIP] Report: community detection library Report: community detection library Apr 4, 2018
@carlosms
Copy link
Contributor Author

carlosms commented Apr 4, 2018

@bzz said:

@carlosms nice work! A bit confused by:

WIP in Issue title - this usually communicates "do not look here, it's not ready yet". Is that the case?

I wanted to say that the PR is not finished, but I wanted to get some feedback and also had some questions on how to continue. I'll stop using the WIP in the title and just write a proper sentence in the description from now on...

un-checked TODO items. What is the status of those?

Those are the things I'd like to do before merging. The unchecked items are not done yet.

One more thing - could you also please make sure that the code follows the style guild, like https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments etc ?

My bad, I was not aware that we are following the Google Python style for docs. and just used the same comment style as Apollo (which I believe uses sphinx style). This is now changed in 6a19934.

Also, in case if some major bodies of code come from another project - it's a good idea to keep an attribution by putting a link in comments, etc.

Very true. Added in 6a19934.

@bzz bzz mentioned this pull request Apr 5, 2018
3 tasks
carlosms added 2 commits April 9, 2018 15:28
Based on the apollo cmd command, uses the
same code but simplified

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms carlosms force-pushed the issue-59 branch 3 times, most recently from ebb1446 to b8832ba Compare April 9, 2018 14:05
"""Runs Community Detection analysis on the given connected components.

Based largely on the Apollo detect_communities() code from
https://github.com/src-d/apollo/blob/master/apollo/graph.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grr, GH is not showing reviews from commits in the PR 6a19934#r28488078 :/

@@ -0,0 +1,5 @@
python-igraph==0.7.1.post6
scipy==1.0.1
cassandra_driver >= 3.12.0, <4.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use Cassandra somewhere in fixture generation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because apollo is not available in pip, it is cloned in ./apollo. Instead of installing apollo/requirements.txt, the minimum dependencies to call apollo are included here.

Without cassandra driver, it breaks with:

Traceback (most recent call last):
  File "generator.py", line 5, in <module>
    from apollo.graph import ConnectedComponentsModel, CommunitiesModel, detect_communities
  File "./apollo/apollo/graph.py", line 17, in <module>
    from apollo.cassandra_utils import get_db, patch_tables, BatchedHashResolver, Session
  File "./apollo/apollo/cassandra_utils.py", line 9, in <module>
    from cassandra.cluster import Cluster, Session, NoHostAvailable
ModuleNotFoundError: No module named 'cassandra'

Copy link
Contributor

@bzz bzz Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for tests generation, so it's fine. Thank you for explaining.

But this makes me think that may be all of ConnectedComponentsModel, CommunitiesModel and even CommunityDetector are a candidates to be proposed refactored out of graph.py (to minimize dependency footprint) in the future.

@bzz
Copy link
Contributor

bzz commented Apr 9, 2018

@carlosms

Question: do we want to support all the edges, algorithm, and algorithm_params?

Not sure if I understood it right, but by looking at the code it seems like this library already supports and exposes on API-level exactly the same set of algorithms and params as Apollo, which is very good.

If the question is weather those should be exposed to the Gemini CLI - I guess that should be take care of in subsequent PRs #60 and the answer is that for now, only a hard-coded defaults should be used, same ones as apollo does by default.

Does it answer your question?



class CommunityDetector:
def __init__(self, algorithm, config):
Copy link
Contributor

@bzz bzz Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, it would be nice to have a references to origins of the code, coming from other projects i.e https://github.com/src-d/apollo/blob/6b370b5f34ba9e31cf3310e70a2eff35dd978faa/apollo/graph.py#L267 here.

BTW, there seems to be something we would be interested to include here from src-d/apollo@8006248

@carlosms
Copy link
Contributor Author

carlosms commented Apr 9, 2018

@bzz thanks for the feedback, it should all be addressed now.

Does it answer your question?

Yes thank you, I wanted to confirm if the current code is OK. In the design doc. we had "1 algorithm, no pySpark", so I wanted to check if you wanted to force the lib to use 1 algorithm only. Limiting it later in the CLI as you said makes sense.

carlosms added 3 commits April 9, 2018 19:04
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Ports the changes from
src-d/apollo@8006248

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms
Copy link
Contributor Author

carlosms commented Apr 9, 2018

@bzz, @smacker, please take another look when you have time.

def print_usage(_):
parser.print_usage()

handler = print_usage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpler:

handler = lambda _: parser.print_usage()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer, thank you. This was a copy&paste of pieces from apollo.


def build_csr_matrix(input):
return csr_matrix(
(input['id_to_buckets_data'], input['id_to_buckets_indices'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to avoid using input as a variable name in python. Because it's built-in function: https://docs.python.org/3/library/functions.html#input

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super useful heads up, thank you!

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@bzz
Copy link
Contributor

bzz commented Apr 10, 2018

Great job, @carlosms ! Should be good to merge now

@carlosms carlosms merged commit c7016f5 into src-d:master Apr 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants