Report: Connected Components #77

smacker · 2018-03-21T17:56:40Z

Solves #58

bzz

Overall looks pretty good 👍 but some improvements in code style, DB DDL, integration test on CI, etc are needed.

Please let me know when the comments are addressed and I will be happy to make another pass.

bzz · 2018-03-22T03:01:18Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+      .asScala
+      .map(_.getByte("hashtable"))
+      .toList.
+      sorted


.toList
.sorted

bzz · 2018-03-22T03:10:02Z

src/main/scala/tech/sourced/gemini/ReportApp.scala

      gemini.applySchema(cassandra)

+      gemini.findConnectedComponents(cassandra)
+      System.exit(2)


Would anything below be executed?

yeah :) that's why it's WIP :)

bzz · 2018-03-22T03:13:12Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+        list = list :+ i
+        elementToBuckets + (el -> list)
+      }
+    }


Do you think this method can be further improved by refactoring it a little bit, taking into consideration http://twitter.github.io/effectivescala/#Collections-Style ?

The problem is that there is no intermediate result that I could name. It's a for loop inside a for loop, written in a functional style.
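For illustration only, here is a minimal sketch of how such a nested loop could be split into named steps. It is not the code from this PR: the object name, the function name `elementToBuckets`, and the assumption that buckets are lists of integer element ids are all hypothetical.

```scala
object ElementToBucketsSketch {
  /** Hypothetical helper: maps each element id to the indices of the buckets that contain it. */
  def elementToBuckets(buckets: List[List[Int]]): Map[Int, List[Int]] = {
    // Named intermediate result: one (element, bucketIndex) pair per membership.
    val memberships = for {
      (bucket, bucketIdx) <- buckets.zipWithIndex
      element <- bucket
    } yield element -> bucketIdx

    // Group the pairs by element and keep only the bucket indices.
    memberships
      .groupBy { case (element, _) => element }
      .map { case (element, pairs) => element -> pairs.map(_._2) }
  }
}
```

The for comprehension replaces the two nested loops, and `memberships` gives the intermediate step a name.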

bzz · 2018-03-22T03:14:41Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+
+  def makeBuckets(): List[List[Int]] = {
+    val (buckets, elementIds) = getHashtables()
+      .foldLeft(List[List[Int]](), collection.mutable.Map[String, Int]()) {


If it's important to have a mutable version here for some reason, here and everywhere else it's better to follow http://twitter.github.io/effectivescala/#Formatting-Imports

This is the only place where I use a mutable structure. IMO, it fits here. It's possible to rewrite it using an immutable map, but that would complicate the code without any advantage. Usage of the mutable map is scoped to this function only.
It works similarly to a cache, and using a mutable map for a cache is recommended in https://docs.scala-lang.org/overviews/collections/maps.html
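As a rough sketch of the pattern discussed here (a mutable map used as a cache, confined to one function, and imported the way Effective Scala suggests), assuming the map assigns dense integer ids to sha1 strings. The names `ElementIdsSketch` and `assignIds` are hypothetical and not taken from the PR.

```scala
import scala.collection.mutable

object ElementIdsSketch {
  /**
   * Hypothetical helper: replaces every sha1 with a dense integer id,
   * reusing the same id when a sha1 appears again.
   */
  def assignIds(groups: List[List[String]]): (List[List[Int]], Map[String, Int]) = {
    // The mutable map never leaves this function; callers get an immutable copy.
    val ids = mutable.Map[String, Int]()
    val buckets = groups.map(_.map(sha1 => ids.getOrElseUpdate(sha1, ids.size)))
    (buckets, ids.toMap)
  }
}
```

Importing `scala.collection.mutable` and writing `mutable.Map` keeps it visible at the call site that the collection is mutable.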

bzz · 2018-03-22T03:16:12Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+import com.datastax.driver.core.{Session, SimpleStatement}
+import org.slf4j.{Logger => Slf4jLogger}
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer


Not sure why it's important to have a mutable collection here, but here and below it's a good idea to follow http://twitter.github.io/effectivescala/#Collections-Use

bzz · 2018-03-22T03:16:17Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+class DBConnectedComponents(
+                              log: Slf4jLogger,
+                              conn: Session,
+                              keyspace: String = Gemini.defautKeyspace) extends ConnectedComponents(log) {


Here and below, for the class declarations, it's a good idea to check that the style guide is followed: http://docs.scala-lang.org/style/declarations.html#classes

In this case, the preferred way would be:

class DBConnectedComponents(
    log: Slf4jLogger,
    conn: Session,
    keyspace: String = Gemini.defautKeyspace)
  extends ConnectedComponents(log) {
}

I moved extends to a new line. But the indent is still large; IntelliJ IDEA does it.

bzz · 2018-03-22T03:21:58Z

src/main/scala/tech/sourced/gemini/Gemini.scala

    ReportExpandedGroup(duplicates)
  }

+  def findConnectedComponents(conn: Session): Unit = {


AFAICS, it would be reasonable to have a concrete return type of Map[Int, Set[Int]] as part of the gemini API here.
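To make the suggested signature concrete, here is a minimal sketch of a function with that return type. It assumes the keys are component ids and the values are sets of element ids, and that two elements are connected when they share a bucket. Both are assumptions for illustration, and the merging strategy below is a simple quadratic one, not necessarily what the PR implements.

```scala
object ConnectedComponentsSketch {
  /**
   * Hypothetical sketch: groups elements into connected components.
   *  1. Treat every non-empty bucket as a set of connected elements.
   *  2. Merge the bucket with all components found so far that overlap it.
   *  3. Number the resulting components.
   */
  def findConnectedComponents(buckets: List[List[Int]]): Map[Int, Set[Int]] = {
    val components = buckets
      .map(_.toSet)
      .filter(_.nonEmpty)
      .foldLeft(List.empty[Set[Int]]) { (found, bucket) =>
        val (overlapping, disjoint) = found.partition(_.exists(bucket.contains))
        overlapping.foldLeft(bucket)(_ ++ _) :: disjoint
      }

    components.zipWithIndex.map { case (elements, id) => id -> elements }.toMap
  }
}
```

Returning the map instead of Unit lets callers, including tests, inspect the result without going through the database or a file.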

bzz · 2018-03-22T03:23:48Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+    }
+  }
+
+  def findInBuckets(


It may be a good idea to add some ScalaDoc with very high-level description of the steps of the algorithm, so the reader has some idea of what is being done below.

bzz · 2018-03-22T03:25:34Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+
+  def getHashValues(hashtable: Byte): Iterable[FileHash]
+
+  def makeBuckets(): List[List[Int]] = {


It may be a good idea to add some ScalaDoc with very high-level description of the steps of the algorithm, so the reader has some idea of what is being done below.

bzz · 2018-03-22T03:29:10Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+  }
+
+  def getHashValues(hashtable: Byte): Iterable[FileHash] = {
+    val cql = s"SELECT sha1, value FROM $keyspace.hashtables WHERE hashtable=$hashtable"


Do you think it would be possible for the hashtables table to follow the convention we use for the meta table?

Sorry, I don't really get the idea behind the meta table. But I added the table name as a parameter.
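A minimal sketch of what passing the table name as a parameter could look like for the query above. The object and method names are hypothetical; only the CQL text and the `table` parameter come from this thread.

```scala
object HashtableQuerySketch {
  /** Hypothetical helper: builds the CQL used to read one hashtable from a configurable table. */
  def selectHashValues(keyspace: String, table: String, hashtable: Byte): String =
    s"SELECT sha1, value FROM $keyspace.$table WHERE hashtable=$hashtable"
}
```

For example, `selectHashValues("gemini", "hashtables", 1.toByte)` yields the same statement as before, with the keyspace and table filled in from the arguments.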

smacker · 2018-03-26T16:06:17Z

@bzz thanks a lot for the review! Most of the comments are addressed; the others are answered.
From my point of view, there is only one open issue, the one about the meta table. Could you please provide a small example of how you want to use it?

carlosms · 2018-04-03T15:26:09Z

For the record, a problem we found and discussed via chat:
The parquet file created by the code in this PR cannot be read from python. Tested with parquet-python, fastparquet, and pyarrow.parquet. Apparently Avro may be using some unique or non-standard format.

bzz · 2018-04-04T07:18:53Z

src/main/resources/schema.cql

 CREATE KEYSPACE IF NOT EXISTS __KEYSPACE__ WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
 USE __KEYSPACE__;
 CREATE TABLE IF NOT EXISTS __KEYSPACE__.blob_hash_files (sha1 ascii, repo text, commit ascii, path text, PRIMARY KEY (sha1, repo, commit, path));
+CREATE TABLE IF NOT EXISTS __KEYSPACE__.hashtables (sha1 ascii, hashtable tinyint, value blob, PRIMARY KEY (hashtable, value, sha1));


Let's keep a newline at the end of the file

smacker · 2018-04-04T09:14:51Z

Also:
Despite the problem with the parquet file, I would rather merge it because it implements querying the DB. We can address Python/Scala communication in a separate PR once both this and #85 are merged.

bzz · 2018-04-04T09:42:11Z

@carlosms thank you for the useful follow-up! Indeed, after pip install pyarrow

import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('cc.parquet') # works
table2 = pq.read_table('cc.parquet')            # breaks

on cc.parquet.zip produced by this branch.

@smacker I would suggest spending a bit of time right here to make sure that reading the data produced by this PR works.

To make the serialization format compatible between different languages, we need to make sure that the schema produced by this tool conforms to the spec https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

Right now, AFAIK, it does not. You can see that by building parquet-tools (https://github.com/apache/parquet-mr/tree/master/parquet-tools#build) and then running

java -jar target/parquet-tools-*.jar schema cc.parquet

which results in a 2-level list encoding instead of the 3-level one from the doc above

message cc {
  required group elements (LIST) {
    repeated int32 array;
  }
}

According to https://issues.apache.org/jira/browse/PARQUET-1122, there is a setting conf.set("parquet.avro.write-old-list-structure", "false"); that might help.

If you could try to make it generate the right schema for list encoding and then upload an example .parquet file here, I will be happy to assist with further debugging.

Agreed: if we are still not able to read the file after the schema is correct, 👍 for moving it to a separate issue and merging this and #85

smacker · 2018-04-04T10:03:42Z

Oh! Thanks a lot for java -jar target/parquet-tools-*.jar schema cc.parquet. It wasn't obvious from the readme that there is such a command, or what it produces. :-D

With your awesome tips, I'll try to fix compatibility. I'm changing PR to WIP state.

smacker · 2018-04-04T13:05:55Z

The PR is updated. Now I'm able to read the generated parquet file using pyarrow.

Don't pay attention to continuous-integration/travis-ci/push. I pushed the branch to the wrong remote.

smacker · 2018-04-05T09:18:49Z

rebased

bzz · 2018-04-05T09:22:08Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+                             log: Slf4jLogger,
+                             conn: Session,
+                             table: String,
+                             keyspace: String = Gemini.defautKeyspace)


Could you please check that the code follows the Scala style guide?
I.e. http://docs.scala-lang.org/style/indentation.html#methods-with-numerous-arguments

Most probably you mean https://docs.scala-lang.org/style/declarations.html#classes

But the long indentation is made by IntelliJ IDEA. We might need #74 to change it.
For now, as I remember, we agreed to format code using the default "reformat code" feature of IntelliJ.

This makes my eyes hurt, but let's do as you suggest.

bzz · 2018-04-05T09:32:55Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+  def makeBuckets(): List[List[Int]] = {
+    val (buckets, elementIds) = getHashtables()
+      .foldLeft(List[List[Int]](), mutable.Map[String, Int]()) {
+        (result, hashtable) =>


It may be nice to be consistent in how function arguments are specified, e.g. as in https://github.com/smacker/gemini/blob/d96462356ac84906cb30fc1135f44d320701b604/src/main/scala/tech/sourced/gemini/ConnectedComponents.scala#L63

bzz

Looks great to me.

Good to go, as soon as the 2 minor things above are addressed.

carlosms

👍

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

smacker · 2018-04-05T16:30:09Z

Note: this code does what is described in the issue.
But it might or might not be enough for finding communities. If so, improvements will be handled in #60 or in a separate task.

bzz reviewed Mar 22, 2018

smacker changed the title ~~[WIP] Report: Connected Components~~ Report: Connected Components Mar 26, 2018

bzz reviewed Apr 4, 2018

smacker changed the title ~~Report: Connected Components~~ [WIP] Report: Connected Components Apr 4, 2018

smacker changed the title ~~[WIP] Report: Connected Components~~ Report: Connected Components Apr 4, 2018

smacker requested a review from carlosms April 5, 2018 09:20

bzz reviewed Apr 5, 2018

bzz approved these changes Apr 5, 2018

carlosms approved these changes Apr 5, 2018

smacker added 6 commits April 5, 2018 17:45

Implement findConnectedComponents

ccae4b8

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

Parquet serialization

2408fb7

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

create hashtables table

519e43b

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

new line at the end of cql file

0849d1d

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

Write parquet compatible with python readers

391bd36

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

reformat code

1e5566d

Signed-off-by: Maxim Sukharev <maxim@sourced.tech>

smacker merged commit d415673 into src-d:master Apr 5, 2018

