-
Notifications
You must be signed in to change notification settings - Fork 15
Report: Connected Components #77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks pretty good 👍 but some improvements in code style, DB DDL, integration test on CI, etc are needed.
Please let me know when the comments are addressed and I will be happy to make another pass.
| .asScala | ||
| .map(_.getByte("hashtable")) | ||
| .toList. | ||
| sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.toList
.sorted
| gemini.applySchema(cassandra) | ||
|
|
||
| gemini.findConnectedComponents(cassandra) | ||
| System.exit(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would anything below be executed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah :) that's why it's WIP :)
| list = list :+ i | ||
| elementToBuckets + (el -> list) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this method can be further improved by refactoring it a little bit, taking into consideration http://twitter.github.io/effectivescala/#Collections-Style ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is: there is no intermediate result that I could name. It's for loop inside for loop in a functional style.
|
|
||
| def makeBuckets(): List[List[Int]] = { | ||
| val (buckets, elementIds) = getHashtables() | ||
| .foldLeft(List[List[Int]](), collection.mutable.Map[String, Int]()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's important to have a mutable version here for some reason, here and everywhere else it's better to follow http://twitter.github.io/effectivescala/#Formatting-Imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the only place where I use mutable structure. Imo, it fits here. It's possible to rewrite using the immutable map but it would complicate code without any advantages. Usage of a mutable map is scoped to only this function.
It works similar to cache and using mutable map for cache is recommended in https://docs.scala-lang.org/overviews/collections/maps.html
| import com.datastax.driver.core.{Session, SimpleStatement} | ||
| import org.slf4j.{Logger => Slf4jLogger} | ||
| import scala.collection.JavaConverters._ | ||
| import scala.collection.mutable.ListBuffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why it's important to have mutable collection here, but here and below, it's good idea to follow http://twitter.github.io/effectivescala/#Collections-Use
| class DBConnectedComponents( | ||
| log: Slf4jLogger, | ||
| conn: Session, | ||
| keyspace: String = Gemini.defautKeyspace) extends ConnectedComponents(log) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below, for the class declarations, its a good idea to check that the style guide is followed http://docs.scala-lang.org/style/declarations.html#classes
In this case, the preferred way would be:
class DBConnectedComponents(
log: Slf4jLogger,
conn: Session,
keyspace: String = Gemini.defautKeyspace)
extends ConnectedComponents(log) {
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved extends to a new line. But indent is still large, IntelliJ IDEA does it.
| ReportExpandedGroup(duplicates) | ||
| } | ||
|
|
||
| def findConnectedComponents(conn: Session): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS, it would be reasonable to have concrete return type of Map[Int, Set[Int]] as part of gemini API here.
| } | ||
| } | ||
|
|
||
| def findInBuckets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a good idea to add some ScalaDoc with very high-level description of the steps of the algorithm, so the reader has some idea of what is being done below.
|
|
||
| def getHashValues(hashtable: Byte): Iterable[FileHash] | ||
|
|
||
| def makeBuckets(): List[List[Int]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a good idea to add some ScalaDoc with very high-level description of the steps of the algorithm, so the reader has some idea of what is being done below.
| } | ||
|
|
||
| def getHashValues(hashtable: Byte): Iterable[FileHash] = { | ||
| val cql = s"SELECT sha1, value FROM $keyspace.hashtables WHERE hashtable=$hashtable" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would be possible for hashtables table, to follow the convention we use for meta table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I don't really get the idea behind meta table. But I added table name as a parameter.
|
@bzz thanks a lot for the review! most of the comments are addressed others are answered. |
|
For the record, a problem we found and discussed via chat: |
src/main/resources/schema.cql
Outdated
| CREATE KEYSPACE IF NOT EXISTS __KEYSPACE__ WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; | ||
| USE __KEYSPACE__; | ||
| CREATE TABLE IF NOT EXISTS __KEYSPACE__.blob_hash_files (sha1 ascii, repo text, commit ascii, path text, PRIMARY KEY (sha1, repo, commit, path)); | ||
| CREATE TABLE IF NOT EXISTS __KEYSPACE__.hashtables (sha1 ascii, hashtable tinyint, value blob, PRIMARY KEY (hashtable, value, sha1)); No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep a newline at the end of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
|
Also: |
|
@carlosms thank you for useful followup! Indeed, after import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('cc.parquet') # works
table2 = pq.read_table('cc.parquet') # breakson cc.parquet.zip produced by this branch. @smacker I would suggest to try and spend a bit of time right here, to make sure reading the data produced by this PR works. To make serialization format compatible between different languages, we need to make sure that the schema, produced by this tool, conforms the spec https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists Right now AFAIK it does not, you can see that by https://github.com/apache/parquet-mr/tree/master/parquet-tools#build and then
which results in 2-level list encoding, instead of 3-level one from the doc above According to https://issues.apache.org/jira/browse/PARQUET-1122, there is a setting If you could try to make it generate the right schema for list encoding and then upload an example of .parquet file here - will be happy to assist with further debugging. Agree, if we will not be able to have this fixed after the schema is correct - 👍 for moving it to a separate issue and merging this and #85 |
|
oh! thanks a lot for With your awesome tips, I'll try to fix compatibility. I'm changing PR to WIP state. |
|
PR is updated. Now I'm able to read generated parquet using pyarrow. don't pay attention to |
|
rebased |
| log: Slf4jLogger, | ||
| conn: Session, | ||
| table: String, | ||
| keyspace: String = Gemini.defautKeyspace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please that the code follows Scala styleguide?
I.e http://docs.scala-lang.org/style/indentation.html#methods-with-numerous-arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most probably you mean https://docs.scala-lang.org/style/declarations.html#classes
but the long indentation is made by IntelliJ IDEA. We might need #74 to change it.
But for now, as I remember we agreed to format code using default "reformat code" feature of IntelliJ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes my eyes hurt, but let's do as you suggest.
| def makeBuckets(): List[List[Int]] = { | ||
| val (buckets, elementIds) = getHashtables() | ||
| .foldLeft(List[List[Int]](), mutable.Map[String, Int]()) { | ||
| (result, hashtable) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be nice to be consistent with style of how function arguments are specified like i.e in https://github.com/smacker/gemini/blob/d96462356ac84906cb30fc1135f44d320701b604/src/main/scala/tech/sourced/gemini/ConnectedComponents.scala#L63
bzz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me.
Good to
, as soon as 2 minor things above are addressed.
carlosms
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
Signed-off-by: Maxim Sukharev <maxim@sourced.tech>
|
Note: this code does what is described in the issue. |
Solves #58