[SPARK-17073] [SQL] generate column-level statistics #15090
Conversation
Test build #65347 has finished for PR 15090 at commit
      table_no_cols,
      isDataSourceTable = true,
      hasSizeInBytes = true,
      expectedRowCounts = Some(10))
  }
}

private def checkColStats(
This test suite is getting bigger and bigger. For column stats, let us create a new file?
I used `checkTableStats` in some cases for column stats, so maybe put all test cases for table/column stats into a separate file?
Maybe move these utility functions to `org.apache.spark.sql.QueryTest`? I am not 100% sure whether this is the best home, but we definitely need to split this suite into smaller suites.
I'm planning to create a new suite for column stats test cases, put all the common methods into a new trait called `StatisticsTest`, and let the two suites extend it. How about that?
Sounds good to me.
@@ -98,8 +98,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
      ctx.identifier != null &&
      ctx.identifier.getText.toLowerCase == "noscan") {
      AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
    } else {
    } else if (ctx.identifierSeq() == null) {
Since this PR changes the Parser, please update the comment of this function to reflect the latest changes. In addition, please add test cases in `DDLCommandSuite` to verify the Parser's behavior.
This has a bug. It will jump to this branch if users input:
`ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS`
For the analyze column command, users should know exactly what they want to do, so they need to specify the columns; otherwise, we don't compute statistics for columns. AFAIK, Hive will generate stats for all columns in this case, but I don't think we should do that. At least, we could provide another command like `FOR ALL COLUMNS` to do this.
Then, throw an exception here.
Yeah, I'm also thinking of doing this :)
@gatorsmile The tests for parsing analyze commands are in `StatisticsSuite`, so I'll add those verifying tests in it instead of `DDLCommandSuite`.
As mentioned in the comment, we are going to change the "ANALYZE" syntax in SqlBase.g4, i.e. make the identifierSeq non-optional, which is different from Hive. Is this ok? @rxin @hvanhovell
Like Hive, I think we should implement a built-in function,
Test build #65349 has finished for PR 15090 at commit
 * which will be used in query optimizations.
 */
case class AnalyzeColumnCommand(
    tableName: String,
Please use a `TableIdentifier` here.
Looks like `AnalyzeTableCommand` is also using String instead of `TableIdentifier`. Want to change as well?
@viirya Yeah, I'll fix that too.
def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = {
  val lowerCaseNames = columnNames.map(_.toLowerCase)
  val attributes =
It is nicer to resolve them using the available resolution functionality in LogicalPlan.
Yeah, then we can let the resolve function deal with case sensitivity. Thanks.
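A minimal sketch of what that could look like (assuming `relation`, `columnNames`, and `sparkSession` from the surrounding command; exact signatures may differ slightly across Spark versions):

```scala
import org.apache.spark.sql.AnalysisException

// Let the plan resolve each user-supplied column name; the resolver already
// encodes the session's case-sensitivity setting.
val resolver = sparkSession.sessionState.conf.resolver
val attributes = columnNames.map { col =>
  relation.resolveQuoted(col, resolver).getOrElse(
    throw new AnalysisException(s"Invalid column name: $col"))
}
```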
case class BasicColStats(
    dataType: DataType,
    numNulls: Long,
Any reason why `numNulls` is the only one not wrapped in `Option` here?
nvm. I got it.
s"${otherRelation.nodeName}.") | ||
} | ||
|
||
def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit = { |
private?
It's in def run.
Can this then be defined outside? I know this is about personal taste, but it has caused confusion for at least two of the reviewers here.
Then we will need to pass many parameters to this method.
  }
}

object ColumnStats extends Enumeration {
`private[sql]`?
I guess it is not encouraged to add `private[sql]` in this case. Please see 511f52f.
I would think it is more like the case of `SparkPlanInfo`?
Anyway, no strong opinion on this.
Test build #65475 has finished for PR 15090 at commit
    }.toMap

    val statistics =
      Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), basicColStats = colStats)
One question: it seems we overwrite all existing column statistics, no matter whether we collected statistics for other columns before. Is it better to only replace the column statistics of the same columns? We can keep the column statistics of the other columns. @wzhfy What do you think?
I think that we should document the policy for (partial) statistics updates very well. It is important that an end user understands what is going on. The current policy is to drop them as soon as we touch the table stats.
In this case I think it is important to consider a few elements:
- Relative column statistics typically change less than table statistics, so they do not need to be changed as often.
- When do we consider column statistics to be stale (less trustworthy)? If the size of the table has changed by x%? If the statistics are older than x time?
I think this is an open question. IMO, as a computing framework like Spark, it is difficult to decide when to recollect the stats, because that depends on how often the data is changed and how it is changed, i.e. how customers use Spark. For different applications, this can be quite different. So I think maybe we should leave the decision to users or someone like DBAs who know their applications well, let them decide when to run the "ANALYZE" commands and refresh the stats.
I am just not so sure that we should follow an all-or-nothing collection approach. Currently `ANALYZE TABLE` discards column-level statistics, and `ANALYZE COLUMN` discards statistics on any column that has not been included in the given analyze command.
In the databases I have worked with, you could typically refresh column stats independently. I would be a big proponent of doing so.
@hvanhovell Agree. If we want to support refresh column stats independently, we should do as @viirya said. But the consistency between the kept column stats and the refreshed column stats still needs to be guaranteed by users.
For the "all or nothing" statistics approach, we can better maintain statistics consistency. However, from the viewpoint of usability, a user may want to collect column statistics for column A and column B of a given table first. After a while, he may want to collect column statistics for column C and column D because of the need for new queries. A user may intuitively specify only columns C and D in the new ANALZYE command because he expects the statistics of column A and B to be there as well. Hence, refreshing column stats independently can better match a user's expectation.
Yes we can keep stats of other columns, but we can't be sure if they are still correct between two analyze column commands, unless we recollect them. It's again the consistency issue, like between analyze table and analyze column. |
Test build #65477 has finished for PR 15090 at commit
Looks pretty solid. I have left some more comments.
      avgColLen = if (row.isNullAt(4)) None else Some(row.getDouble(4)),
      maxColLen = if (row.isNullAt(5)) None else Some(row.getLong(5)),
      numTrues = if (row.isNullAt(6)) None else Some(row.getLong(6)),
      numFalses = if (row.isNullAt(7)) None else Some(row.getLong(7) - row.getLong(0)))
It might be easier to write the source expression as `Sum(If(Not(e), one, zero))`.
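A quick sketch of that idea (the `one`/`zero` literals mirror the helpers assumed from the surrounding code; `booleanCounts` is just an illustrative name):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, If, Literal, Not}
import org.apache.spark.sql.catalyst.expressions.aggregate.Sum

// Count trues and falses directly, instead of deriving the false count
// from the total row count and the null count afterwards.
def booleanCounts(e: Expression): (Expression, Expression) = {
  val one = Literal(1L)
  val zero = Literal(0L)
  (Sum(If(e, one, zero)), Sum(If(Not(e), one, zero)))
}
```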
def apply(e: Expression, relativeSD: Double): CreateStruct = {
  var statistics = e.dataType match {
    case n: NumericType =>
Why not combine `NumericType` with `TimestampType` & `DateType`? E.g. `_: NumericType | TimestampType | DateType`?
This will cause a compilation error: illegal variable in pattern alternative.
@wzhfy Hm, are you doubly sure that causes a compilation error? I just checked out your PR and it seems to compile fine.

```
case _: NumericType | TimestampType | DateType =>
  Seq(Max(e), Min(e), HyperLogLogPlusPlus(e, relativeSD), nullDouble, nullLong, nullLong,
    nullLong)
```

If you keep the `n` variable then it will cause a compilation error, but if we don't use it, it'd be fine.
Oh, I made a mistake, thanks! @HyukjinKwon
        Seq(nullBoolean, nullBoolean, two, nullDouble, nullLong, Sum(If(e, one, zero)),
          Sum(If(e, zero, one)))
      case otherType =>
        throw new AnalysisException("ANALYZE command is not supported for data type: " +
This can be a bit clearer for the end user. Please add the name of the column & mention that you are analyzing columns.
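For example, something along these lines (a sketch only; the helper name and exact wording are made up, `attr` stands for the column being analyzed):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical helper: include the column name and make clear that the failure
// is about column-level statistics, not ANALYZE in general.
def unsupportedColumnTypeError(attr: Attribute): AnalysisException =
  new AnalysisException(
    s"Analyzing columns is not supported for column ${attr.name} of data type ${attr.dataType}.")
```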
    BasicColStats(
      dataType = e.dataType,
      numNulls = row.getLong(0),
      max = if (row.isNullAt(1)) None else Some(row.get(1, e.dataType)),
It might be nicer to add a helper method here.
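One possible shape of such a helper (hypothetical name; a sketch under the assumption that the stats are read back from an `InternalRow` as above):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Wrap "null ordinal => None, otherwise Some(extracted value)" in one place,
// instead of repeating the isNullAt check for every field.
def getOption[T](row: InternalRow, ordinal: Int)(get: Int => T): Option[T] =
  if (row.isNullAt(ordinal)) None else Some(get(ordinal))

// e.g. max = getOption(row, 1)(i => row.get(i, e.dataType))
//      numTrues = getOption(row, 6)(row.getLong)
```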
@@ -32,67 +34,25 @@ import org.apache.spark.sql.execution.datasources.LogicalRelation
 * Analyzes the given table in the current database to generate statistics, which will be
 * used in query optimizations.
 */
case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand {
case class AnalyzeTableCommand(
    tableIdent: TableIdentifier,
+1
@@ -563,6 +563,13 @@ object SQLConf {
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefault(10L)

  val NDV_MAX_ERROR =
    SQLConfigBuilder("spark.sql.ndv.maxError")
Could you place this under `spark.sql.statistics`?
OK
    assert(colStats.min == expectedColStats.min)
    if (expectedColStats.ndv.isDefined) {
      // ndv is an approximate value, so we just make sure we have the value
      assert(colStats.ndv.get >= 0)
you can also check if the value is within 3 standard deviations.
How to get the standard deviations?
It should be:
- estimate column statistics several times,
- calculate the mean and standard deviation of ndv over the multiple estimates,
- check whether the real ndv value is within (mean - 3 * sd, mean + 3 * sd).
Please correct me if this is wrong.
Why 3 standard deviations?
I think for tests in this suite, we just need to make sure we get the stats; we should leave the accuracy test to `HyperLogLogPlusPlusSuite`, right?
`assert(colStats.ndv.get >= 0)` might be too loose. Can we check if it is more than (or equal to) 0 and less than (or equal to) the number of values in the column?
Because the check here for the estimated ndv value is not really related to the actual ndv value, we can't make sure we really get the estimated ndv back, especially since we only use `ndv > 0` to check it. In an extreme case, we could replace the ndv expression with a literal(0) in the aggregation, and the test would not catch it.
Of course my previous suggestion `0 <= ndv <= count(col)` also fails this standard. Using sd to check it is more reliable. I am not strongly against this, but it should be better. Let's see what @hvanhovell and @cloud-fan think.
I think I may know what @hvanhovell meant. The relative standard deviation (rsd) is the parameter we pass to HyperLogLogPlusPlus, so the code may look like this:

```
// ndv is an approximate value, so we make sure we have the value, and it should be
// within 3*SD's of the given rsd.
assert(colStats.ndv.get >= 0)
if (expectedColStats.ndv.get == 0) {
  assert(colStats.ndv.get == 0)
} else if (expectedColStats.ndv.get > 0) {
  val rsd = spark.sessionState.conf.ndvMaxError
  val error = math.abs((colStats.ndv.get / expectedColStats.ndv.get.toDouble) - 1.0d)
  assert(error <= rsd * 3.0d, "Error should be within 3 std. errors.")
}
```
What do you think?
I just think it is better to test this a bit more thoroughly. The HLL++ tests use a similar test.
BTW: sorry for the very cryptic initial comment.
@@ -401,7 +401,10 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
    var statsProperties: Map[String, String] =
      Map(STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString())
    if (stats.rowCount.isDefined) {
      statsProperties += (STATISTICS_NUM_ROWS -> stats.rowCount.get.toString())
      statsProperties += STATISTICS_NUM_ROWS -> stats.rowCount.get.toString()
General question: Is this how Hive stores this?
Yes, we just give it a different name in Spark.
 */
override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = withOrigin(ctx) {
  if (ctx.partitionSpec == null &&
    ctx.identifier != null &&
    ctx.identifier.getText.toLowerCase == "noscan") {
    AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier).toString)
    AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier))
  } else if (ctx.identifierSeq() == null) {
Why not make the identifierSeq non-optional in the grammar? Saves a lot of typing:)
Yeah, good idea
StructField(name = "c3", dataType = IntegerType, nullable = true) :: | ||
StructField(name = "c4", dataType = LongType, nullable = true) :: Nil) | ||
val expectedBasicStats = BasicColStats( | ||
dataType = ByteType, numNulls = 2, max = Some(3), min = Some(1), ndv = Some(3)) |
It is weird that this works. The Min and the Max value should not be equal for Short and Long types.
Can you explain more about this?
What's "basic" about? Are we going to have something that's not basic in the future (e.g. histogram)? If yes, should those go into a separate class or just in ColumnStats? |
@rxin Yeah, I think it's better to move histograms into ColumnStats than to maintain two members like BasicColStats and Histograms. Let me rename |
  }
}

case class BasicColStats(
Add a section of comments explaining the meaning of each statistic? E.g., readers may not know what `ndv` means.
@viirya thanks for the advice.
      case o => o
    }
    assert(operators.size == 1)
    if (operators.head.getClass != c) {
I don't think this is good style; parsing a SQL string, constructing an expected logical plan, and then comparing them looks much clearer.
ok, then I'll fix it
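A sketch of the suggested test style (table and column names are made up; assumes the usual parser and `comparePlans` helpers from `PlanTest` are available in the suite):

```scala
// Parse the SQL text and compare the resulting plan with the expected command.
val parsed = spark.sessionState.sqlParser.parsePlan(
  "ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS key, value")
val expected = AnalyzeColumnCommand(TableIdentifier("t"), Seq("key", "value"))
comparePlans(parsed, expected)
```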
Test build #66131 has finished for PR 15090 at commit
Test build #66150 has finished for PR 15090 at commit
Test build #66188 has finished for PR 15090 at commit
    val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
    val columnStats =
      AnalyzeColumnCommand(tableIdent, columns.map(_.name)).computeColStats(spark, relation)._2
    expectedColStatsSeq.foreach { expected =>
How about `...foreach { case (field, expectedStat) =>`? Then we use `field.name` instead of `expected._1.name`.
Yeah, that's better.
 */
case class ColumnStat(statRow: InternalRow) {

  def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
We have `NumericType`, can we use that?
But we need to include DateType and TimestampType also, so I use AtomicType.
      dataType = IntegerType,
      colStat = colStat,
      expectedColStat = ColumnStat(InternalRow.fromSeq(
        Seq(0L, values.max, values.min, values.distinct.length.toLong))),
I think we should write `InternalRow(0L, values.max, ...)` directly.
df.write.format("json").saveAsTable(tmpTable) | ||
|
||
sql(s"CREATE TABLE $table (c1 int) USING PARQUET") | ||
sql(s"INSERT INTO $table SELECT * FROM $tmpTable") |
We can just write `INSERT INTO $table SELECT 1`, so we don't need the temp table.
    val totalSize = BigInt(catalogTable.properties.get(STATISTICS_TOTAL_SIZE).get)
    // TODO: we will compute "estimatedSize" when we have column stats:
    // average size of row * number of rows
    if (catalogTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX)).nonEmpty) {
Should we make `catalogTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX))` a variable? It's used twice.
LGTM except some minor comments, thanks for working on it!
      if (!attributesToAnalyze.contains(expr)) {
        attributesToAnalyze += expr
      } else {
        logWarning(s"Duplicated column: $col")
You need to explain the context. Otherwise, the log message becomes useless.
    val tableIdent = TableIdentifier(table, Some("default"))
    val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
    val columnStats =
      AnalyzeColumnCommand(tableIdent, columns.map(_.name)).computeColStats(spark, relation)._2
Nit:

```
val (_, columnStats) =
  AnalyzeColumnCommand(tableIdent, columnsToAnalyze).computeColStats(spark, relation)
```
    val tableIdent = TableIdentifier(table, Some("default"))
    val relation = spark.sessionState.catalog.lookupRelation(tableIdent)
    val columnStats =
      AnalyzeColumnCommand(tableIdent, columnsToAnalyze).computeColStats(spark, relation)._2
The same here.
Test build #66226 has finished for PR 15090 at commit
retest this please
    Seq(numNulls(e), avgLength(e), maxLength(e))
  }

  def booleanColumnStat(e: Expression): Seq[Expression] = {
All the above 13 functions should be private, right?
    }
  }

  def apply(e: Attribute, relativeSD: Double): CreateStruct = e.dataType match {
Nit: `e` -> `attr`
Test build #66227 has finished for PR 15090 at commit
    }
    if (duplicatedColumns.nonEmpty) {
      logWarning(s"Duplicated columns ${duplicatedColumns.mkString("(", ", ", ")")} detected " +
        s"when analyzing columns ${columnNames.mkString("(", ", ", ")")}, ignoring them.")
How about this?
logWarning("A duplicate column name was detected in `ANALYZE TABLE` statement. " +
s"Input columns: ${columnNames.mkString("(", ", ", ")")}. " +
s"Duplicate columns: ${duplicatedColumns.mkString("(", ", ", ")")}.")
Could you add a positive test case when we turn on the case sensitivity? The scenario is like: withTable(table) {
withSQLConf("spark.sql.caseSensitive" -> "true") {
sql(s"CREATE TABLE $table (c1 int, C1 double) STORED AS PARQUET")
sql(s"INSERT INTO $table SELECT 1, 3.0")
sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS c1, C1")
...
}
} Note: the above is just the SQL statements. We need the verification logics. This is to verify whether the case of column names is preserved in ANALYZE COLUMN. |
Another test case for Unicode column names in ANALYZE COLUMN:

```
// scalastyle:off
// non ascii characters are not allowed in the source code, so we disable the scalastyle.
val colName1 = "`列1`"
val colName2 = "`列2`"
// scalastyle:on
withTable(table) {
  sql(s"CREATE TABLE $table ($colName1 int, $colName2 double) STORED AS PARQUET")
  sql(s"INSERT INTO $table SELECT 1, 3.0")
  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS $colName2, $colName1")
  ...
}
```
LGTM except the above minor comments. Test cases mentioned above need to be added to If possible, also add a test case for
LGTM except for one minor comment regarding the ndv config document.
Alright, looking at the number of outstanding comments, it looks like none of the issues are structural anymore. I'm going to merge this in order to speed up this work. @wzhfy please make sure you fix the remaining issues in follow-up pull requests. Merging into master.
@rxin Thanks, I'll fix them in the follow-up PR.
@gatorsmile Hive tables don't support case-sensitive column names, so I used data source tables in the added test cases. The PR has already been sent: #15360.
What changes were proposed in this pull request?
Generate basic column statistics for all the atomic types:
Also support storing and loading these statistics.
One thing to notice:
We support analyzing columns independently, e.g.:
```
-- sql1:
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;
-- sql2:
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;
```
When running sql2 to collect column stats for `value`, we don't remove the stats of column `key`, which was analyzed in sql1 and not in sql2. As a result, users need to guarantee consistency between sql1 and sql2. If the table has been changed before sql2, users should re-analyze column `key` when they want to analyze column `value`:

```
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;
```
How was this patch tested?
add unit tests