Skip to content

Conversation

@shfshihuafeng
Copy link
Contributor

DRILL-8526: Hive Predicate Push Down for ORC and Parquet

Description

Drill do not support filter push down for orc format.When a large amount of data is filtered out, Predicate PushDown can significantly improve the query performance of ORC format.

I have implemented this featur and supported the data types are as follows:
int
bigint,
INYINT,
SMALLINT,
double,
float,
BOOLEAN
date,
TIMESTAMP,
decimal
string,
varchar
char

Testing

Through comparative testing , ORC format with filter pushdown achieves nearly a 20x performance improvement over execution without pushdown on following sql.

1.Import TPCH data. In my test, the amount of Table lineitem was 6001215.
2.set "store.hive.optimize_scan_filter_pushdown =true" by the option on drill web
3. execute sql select * from hive.lineitem_o where L_ORDERKEY=1;

(1)store.hive.optimize_scan_filter_pushdown =false
image

(2)store.hive.optimize_scan_filter_pushdown =true

image

Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR! There are a few minor comments, but the biggest thing that we'll need to merge this are some unit tests. Can you please add some to confirm that this is working as you intend?

RuntimeException> {
private static final Logger logger = LoggerFactory.getLogger(HiveFilterBuilder.class);

final private HiveScan groupScan;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please be consistent with private final vs final private

@cgivre
Copy link
Contributor

cgivre commented Jun 26, 2025

Wow! This is a serious performance improvement. Nicely done! One other thing... I've found that the limit was also not pushed down in a lot of storage plugins. It is usually fairly easy to add. Would you consider checking to see if the limit pushdown is there and if not, add it as well?

@cgivre cgivre added enhancement PRs that add a new functionality to Drill doc-impacting PRs that affect the documentation performance PRs that Improve Performance labels Jun 26, 2025
@shfshihuafeng
Copy link
Contributor Author

Thank you for this PR! There are a few minor comments, but the biggest thing that we'll need to merge this are some unit tests. Can you please add some to confirm that this is working as you i

Wow! This is a serious performance improvement. Nicely done! One other thing... I've found that the limit was also not pushed down in a lot of storage plugins. It is usually fairly easy to add. Would you consider checking to see if the limit pushdown is there and if not, add it as well?

Thank you for this PR! There are a few minor comments, but the biggest thing that we'll need to merge this are some unit tests. Can you please add some to confirm that this is working as you intend?

@cgivre I did some test for this feature, including data type testing, logical expression testing, and some special expressions.as following:

  1. data types: Int,bigint,INYINT,SMALLINT,double,float,BOOLEAN,date,TIMESTAMP,decimal,string,varchar and char
  2. logical expression testing: and ,or
  3. some special expression: is null is false and is not true and so on

i will add some unit tests for it

@shfshihuafeng
Copy link
Contributor Author

Thank you for this PR! There are a few minor comments, but the biggest thing that we'll need to merge this are some unit tests. Can you please add some to confirm that this is working as you intend?

import org.apache.hive.com.esotericsoftware.kryo.io.Output;

public class HiveFilter {
private SearchArgument searchArgument;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make final.

@shfshihuafeng shfshihuafeng force-pushed the hive_orc_pushdown branch 2 times, most recently from 019ce0d to 0d20e3d Compare June 27, 2025 05:21
@shfshihuafeng shfshihuafeng requested a review from cgivre June 28, 2025 01:47
Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted some minor changes. Just a reminder but we really need unit tests in order to merge this.

Also, have you considered adding a limit pushdown? It is usually pretty easy to do and only involves:

  • Implementing two methods in the group scan (HiveScan) which are: supportsLimitPushdown and applyLimit.
  • Passing the limit through the subscans.
  • Adding some logic in the readers to stop when the limit is reached.
    Maybe it would be best to open a new JIRA for that, but IMHO, it is one of the easiest and most effective pushdowns that can be implemented yet Drill didn't seem to do for all the plugins.

}
return false;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please add a blank line at the end of all classes. Here and elsewhere.

private SearchArgument.Builder builder;
private HashMap<String, SqlTypeName> dataTypeMap;

HiveFilterBuilder(LogicalExpression le, HashMap<String, SqlTypeName> dataTypeMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be public?


private boolean allExpressionsConverted = false;
private SearchArgument.Builder builder;
private HashMap<String, SqlTypeName> dataTypeMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make any variables final which can be made final.

logger.warn("Unsupported logical operator:{} for push down", functionName);
return builder;
}
for (int i = 0; i < args.size(); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace with enhanced for-loop?

private PredicateLeaf.Type convertLeafType(SqlTypeName sqlTypeName) {
PredicateLeaf.Type type = null;
switch (sqlTypeName) {
case BOOLEAN:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please indent switch statement.

@shfshihuafeng
Copy link
Contributor Author

I submitted some minor changes. Just a reminder but we really need unit tests in order to merge this.

Also, have you considered adding a limit pushdown? It is usually pretty easy to do and only involves:

  • Implementing two methods in the group scan (HiveScan) which are: supportsLimitPushdown and applyLimit.
  • Passing the limit through the subscans.
  • Adding some logic in the readers to stop when the limit is reached.
    Maybe it would be best to open a new JIRA for that, but IMHO, it is one of the easiest and most effective pushdowns that can be implemented yet Drill didn't seem to do for all the plugins.

@cgivre ok, I will add some unit tests and supoort limit push down later

@shfshihuafeng
Copy link
Contributor Author

I submitted some minor changes. Just a reminder but we really need unit tests in order to merge this.
Also, have you considered adding a limit pushdown? It is usually pretty easy to do and only involves:

  • Implementing two methods in the group scan (HiveScan) which are: supportsLimitPushdown and applyLimit.
  • Passing the limit through the subscans.
  • Adding some logic in the readers to stop when the limit is reached.
    Maybe it would be best to open a new JIRA for that, but IMHO, it is one of the easiest and most effective pushdowns that can be implemented yet Drill didn't seem to do for all the plugins.

@cgivre ok, I will add some unit tests and support limit pushdown

@cgivre
Copy link
Contributor

cgivre commented Jun 30, 2025

I submitted some minor changes. Just a reminder but we really need unit tests in order to merge this.
Also, have you considered adding a limit pushdown? It is usually pretty easy to do and only involves:

  • Implementing two methods in the group scan (HiveScan) which are: supportsLimitPushdown and applyLimit.
  • Passing the limit through the subscans.
  • Adding some logic in the readers to stop when the limit is reached.
    Maybe it would be best to open a new JIRA for that, but IMHO, it is one of the easiest and most effective pushdowns that can be implemented yet Drill didn't seem to do for all the plugins.

@cgivre ok, I will add some unit tests and support limit pushdown

@shfshihuafeng Just to be clear, we can't merge this without unit tests.

You don't have to do the limit pushdown if you don't want to or aren't able to, but I just wanted to recommend it as it can be a big performance benefit for not much work. Alternatively, you can create a separate pull request for that and do it later. It is your choice..

@shfshihuafeng
Copy link
Contributor Author

shfshihuafeng commented Jul 1, 2025

I submitted some minor changes. Just a reminder but we really need unit tests in order to merge this.

Also, have you considered adding a limit pushdown? It is usually pretty easy to do and only involves:

  • Implementing two methods in the group scan (HiveScan) which are: supportsLimitPushdown and applyLimit.
  • Passing the limit through the subscans.
  • Adding some logic in the readers to stop when the limit is reached.
    Maybe it would be best to open a new JIRA for that, but IMHO, it is one of the easiest and most effective pushdowns that can be implemented yet Drill didn't seem to do for all the plugins.

@cgivre I think it is best to open a new JIRA,so you can refer to following link for hive limit push down,and i commit pr for this
JIRA: https://issues.apache.org/jira/browse/DRILL-8527
@cgivre i am adding unit tests for Drill-8526. It may take some time

@shfshihuafeng shfshihuafeng force-pushed the hive_orc_pushdown branch 3 times, most recently from f1c632f to f3ef7a7 Compare July 19, 2025 15:44
@shfshihuafeng
Copy link
Contributor Author

@cgivre I added unit tests which were not run on JDK 11+. Therefore, I submitted another PR in version drill 18 to run these unit tests on jdk 8.

@shfshihuafeng shfshihuafeng requested a review from cgivre July 19, 2025 16:02
Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1 Unless there are any objections, I think we should merge this. It doesn't break anything and apparently the Hive unit tests were not running (and already broken).

@shfshihuafeng It would be nice if you could fix them and assist with updating the Hive client to a newer version, but since this is adding value for you, I'd like to get it merged.

@cgivre cgivre merged commit 6cea55d into apache:master Aug 7, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-impacting PRs that affect the documentation enhancement PRs that add a new functionality to Drill performance PRs that Improve Performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants