Skip to content

Conversation

@smacker
Copy link
Contributor

@smacker smacker commented Mar 21, 2018

Solves #58

Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good 👍 but some improvements in code style, DB DDL, integration test on CI, etc are needed.

Please let me know when the comments are addressed and I will be happy to make another pass.

.asScala
.map(_.getByte("hashtable"))
.toList.
sorted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.toList
.sorted

gemini.applySchema(cassandra)

gemini.findConnectedComponents(cassandra)
System.exit(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would anything below be executed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah :) that's why it's WIP :)

list = list :+ i
elementToBuckets + (el -> list)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this method can be further improved by refactoring it a little bit, taking into consideration http://twitter.github.io/effectivescala/#Collections-Style ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is: there is no intermediate result that I could name. It's for loop inside for loop in a functional style.


def makeBuckets(): List[List[Int]] = {
val (buckets, elementIds) = getHashtables()
.foldLeft(List[List[Int]](), collection.mutable.Map[String, Int]()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's important to have a mutable version here for some reason, here and everywhere else it's better to follow http://twitter.github.io/effectivescala/#Formatting-Imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only place where I use mutable structure. Imo, it fits here. It's possible to rewrite using the immutable map but it would complicate code without any advantages. Usage of a mutable map is scoped to only this function.
It works similar to cache and using mutable map for cache is recommended in https://docs.scala-lang.org/overviews/collections/maps.html

import com.datastax.driver.core.{Session, SimpleStatement}
import org.slf4j.{Logger => Slf4jLogger}
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why it's important to have mutable collection here, but here and below, it's good idea to follow http://twitter.github.io/effectivescala/#Collections-Use

class DBConnectedComponents(
log: Slf4jLogger,
conn: Session,
keyspace: String = Gemini.defautKeyspace) extends ConnectedComponents(log) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, for the class declarations, its a good idea to check that the style guide is followed http://docs.scala-lang.org/style/declarations.html#classes

In this case, the preferred way would be:

class DBConnectedComponents(
    log: Slf4jLogger,
    conn: Session,
    keyspace: String = Gemini.defautKeyspace) 
  extends ConnectedComponents(log) {
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved extends to a new line. But indent is still large, IntelliJ IDEA does it.

ReportExpandedGroup(duplicates)
}

def findConnectedComponents(conn: Session): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, it would be reasonable to have concrete return type of Map[Int, Set[Int]] as part of gemini API here.

}
}

def findInBuckets(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to add some ScalaDoc with very high-level description of the steps of the algorithm, so the reader has some idea of what is being done below.


def getHashValues(hashtable: Byte): Iterable[FileHash]

def makeBuckets(): List[List[Int]] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to add some ScalaDoc with very high-level description of the steps of the algorithm, so the reader has some idea of what is being done below.

}

def getHashValues(hashtable: Byte): Iterable[FileHash] = {
val cql = s"SELECT sha1, value FROM $keyspace.hashtables WHERE hashtable=$hashtable"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be possible for hashtables table, to follow the convention we use for meta table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don't really get the idea behind meta table. But I added table name as a parameter.

@smacker smacker changed the title [WIP] Report: Connected Components Report: Connected Components Mar 26, 2018
@smacker
Copy link
Contributor Author

smacker commented Mar 26, 2018

@bzz thanks a lot for the review! most of the comments are addressed others are answered.
From my point of view, there is only one issue with meta table. Could you please provide a small example how do you want to use it?

@carlosms
Copy link
Contributor

carlosms commented Apr 3, 2018

For the record, a problem we found and discussed via chat:
The parquet file created by the code in this PR cannot be read from python. Tested with parquet-python, fastparquet, and pyarrow.parquet. Apparently Avro may be using some unique or non-standard format.

CREATE KEYSPACE IF NOT EXISTS __KEYSPACE__ WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE __KEYSPACE__;
CREATE TABLE IF NOT EXISTS __KEYSPACE__.blob_hash_files (sha1 ascii, repo text, commit ascii, path text, PRIMARY KEY (sha1, repo, commit, path));
CREATE TABLE IF NOT EXISTS __KEYSPACE__.hashtables (sha1 ascii, hashtable tinyint, value blob, PRIMARY KEY (hashtable, value, sha1)); No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep a newline at the end of the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@smacker
Copy link
Contributor Author

smacker commented Apr 4, 2018

Also:
Despite of problem with parquet file, I would better merge it because it implements querying DB. We can address python/scala communication in separate PR when both this and #85 get merged.

@bzz
Copy link
Contributor

bzz commented Apr 4, 2018

@carlosms thank you for useful followup! Indeed, after pip install pyarrow

import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('cc.parquet') # works
table2 = pq.read_table('cc.parquet')            # breaks

on cc.parquet.zip produced by this branch.

@smacker I would suggest to try and spend a bit of time right here, to make sure reading the data produced by this PR works.

To make serialization format compatible between different languages, we need to make sure that the schema, produced by this tool, conforms the spec https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

Right now AFAIK it does not, you can see that by https://github.com/apache/parquet-mr/tree/master/parquet-tools#build and then

java -jar target/parquet-tools-*.jar schema cc.parquet

which results in 2-level list encoding, instead of 3-level one from the doc above

message cc {
  required group elements (LIST) {
    repeated int32 array;
  }
}

According to https://issues.apache.org/jira/browse/PARQUET-1122, there is a setting conf.set("parquet.avro.write-old-list-structure", "false"); that might help.

If you could try to make it generate the right schema for list encoding and then upload an example of .parquet file here - will be happy to assist with further debugging.

Agree, if we will not be able to have this fixed after the schema is correct - 👍 for moving it to a separate issue and merging this and #85

@smacker
Copy link
Contributor Author

smacker commented Apr 4, 2018

oh! thanks a lot for java -jar target/parquet-tools-*.jar schema cc.parquet. It wasn't obvious there is such command and what it produces according to readme. :-D

With your awesome tips, I'll try to fix compatibility. I'm changing PR to WIP state.

@smacker smacker changed the title Report: Connected Components [WIP] Report: Connected Components Apr 4, 2018
@smacker smacker changed the title [WIP] Report: Connected Components Report: Connected Components Apr 4, 2018
@smacker
Copy link
Contributor Author

smacker commented Apr 4, 2018

PR is updated. Now I'm able to read generated parquet using pyarrow.

don't pay attention to continuous-integration/travis-ci/push. I pushed the branch to a wrong remote.

@smacker
Copy link
Contributor Author

smacker commented Apr 5, 2018

rebased

@smacker smacker requested a review from carlosms April 5, 2018 09:20
log: Slf4jLogger,
conn: Session,
table: String,
keyspace: String = Gemini.defautKeyspace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please that the code follows Scala styleguide?
I.e http://docs.scala-lang.org/style/indentation.html#methods-with-numerous-arguments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most probably you mean https://docs.scala-lang.org/style/declarations.html#classes

but the long indentation is made by IntelliJ IDEA. We might need #74 to change it.
But for now, as I remember we agreed to format code using default "reformat code" feature of IntelliJ.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes my eyes hurt, but let's do as you suggest.

def makeBuckets(): List[List[Int]] = {
val (buckets, elementIds) = getHashtables()
.foldLeft(List[List[Int]](), mutable.Map[String, Int]()) {
(result, hashtable) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me.

Good to :shipit:, as soon as 2 minor things above are addressed.

Copy link
Contributor

@carlosms carlosms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

smacker added 6 commits April 5, 2018 17:45
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
@smacker
Copy link
Contributor Author

smacker commented Apr 5, 2018

Note: this code does what is described in the issue.
But it might or might not be enough for finding communities. If so, improvements will be handled in #60 or in a separate task.

@smacker smacker merged commit d415673 into src-d:master Apr 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants