Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port remainder of window.rs to sqllogictest #6234

Merged
merged 3 commits into from
May 10, 2023
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 4, 2023

Draft as it builds on #6235

Which issue does this PR close?

Closes #6233

Rationale for this change

Make the tests easier to maintain -- sqlloigictest is easier to update and faster to build

What changes are included in this PR?

  1. port window.rs tests to sqllogictest including the data files

Are these changes tested?

This is all tests

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 4, 2023

let schema = Arc::new(Schema::new(vec![ts_field, inc_field, desc_field]));

let batch = RecordBatch::try_new(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data is now checked in as a CSV file




# Columns in the table are a,b,c,d. Source is CsvExec which is ordered by
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests used the same fixture, so I had to port them too if I wanted to remove the get_testdata fixture

ProjectionExec: expr=[inc_col@1 as inc_col, desc_col@2 as desc_col, SUM(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 1 PRECEDING AND 4 FOLLOWING@3 as SUM(annotated_data_finite.inc_col), SUM(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 1 PRECEDING AND 8 FOLLOWING@4 as SUM(annotated_data_finite.desc_col), SUM(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] ROWS BETWEEN 1 PRECEDING AND 5 FOLLOWING@5 as SUM(annotated_data_finite.desc_col), MIN(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 10 PRECEDING AND 1 FOLLOWING@6 as MIN(annotated_data_finite.inc_col), MIN(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 5 PRECEDING AND 1 FOLLOWING@7 as MIN(annotated_data_finite.desc_col), MIN(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] ROWS BETWEEN 1 PRECEDING AND 10 FOLLOWING@8 as MIN(annotated_data_finite.inc_col), MAX(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 10 PRECEDING AND 1 FOLLOWING@9 as MAX(annotated_data_finite.inc_col), MAX(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 5 PRECEDING AND 1 FOLLOWING@10 as MAX(annotated_data_finite.desc_col), MAX(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] ROWS BETWEEN 1 PRECEDING AND 10 FOLLOWING@11 as MAX(annotated_data_finite.inc_col), COUNT(UInt8(1)) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING@12 as COUNT(UInt8(1)), COUNT(UInt8(1)) ORDER BY [annotated_data_finite.ts DESC NULLS FIRST] ROWS BETWEEN 8 PRECEDING AND 1 FOLLOWING@13 as COUNT(UInt8(1)), SUM(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 10 PRECEDING AND 1 FOLLOWING@14 as SUM(annotated_data_finite.inc_col), SUM(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 5 PRECEDING AND 1 FOLLOWING@15 as SUM(annotated_data_finite.desc_col), SUM(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] ROWS BETWEEN 1 PRECEDING AND 10 FOLLOWING@16 as SUM(annotated_data_finite.inc_col), MIN(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 10 PRECEDING AND 1 FOLLOWING@17 as MIN(annotated_data_finite.inc_col), MIN(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 5 PRECEDING AND 1 FOLLOWING@18 as MIN(annotated_data_finite.desc_col), MIN(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] ROWS BETWEEN 1 PRECEDING AND 10 FOLLOWING@19 as MIN(annotated_data_finite.inc_col), MAX(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 10 PRECEDING AND 1 FOLLOWING@20 as MAX(annotated_data_finite.inc_col), MAX(annotated_data_finite.desc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 5 PRECEDING AND 1 FOLLOWING@21 as MAX(annotated_data_finite.desc_col), MAX(annotated_data_finite.inc_col) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] ROWS BETWEEN 1 PRECEDING AND 10 FOLLOWING@22 as MAX(annotated_data_finite.inc_col), COUNT(UInt8(1)) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] RANGE BETWEEN 4 PRECEDING AND 8 FOLLOWING@23 as COUNT(UInt8(1)), COUNT(UInt8(1)) ORDER BY [annotated_data_finite.ts ASC NULLS LAST] ROWS BETWEEN 8 PRECEDING AND 1 FOLLOWING@24 as COUNT(UInt8(1))]
BoundedWindowAggExec: wdw=[SUM(annotated_data_finite.inc_col): Ok(Field { name: "SUM(annotated_data_finite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(10)), end_bound: Following(Int32(1)) }, SUM(annotated_data_finite.desc_col): Ok(Field { name: "SUM(annotated_data_finite.desc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(5)), end_bound: Following(Int32(1)) }, SUM(annotated_data_finite.inc_col): Ok(Field { name: "SUM(annotated_data_finite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(1)), end_bound: Following(UInt64(10)) }, MIN(annotated_data_finite.inc_col): Ok(Field { name: "MIN(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(10)), end_bound: Following(Int32(1)) }, MIN(annotated_data_finite.desc_col): Ok(Field { name: "MIN(annotated_data_finite.desc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(5)), end_bound: Following(Int32(1)) }, MIN(annotated_data_finite.inc_col): Ok(Field { name: "MIN(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(1)), end_bound: Following(UInt64(10)) }, MAX(annotated_data_finite.inc_col): Ok(Field { name: "MAX(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(10)), end_bound: Following(Int32(1)) }, MAX(annotated_data_finite.desc_col): Ok(Field { name: "MAX(annotated_data_finite.desc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(5)), end_bound: Following(Int32(1)) }, MAX(annotated_data_finite.inc_col): Ok(Field { name: "MAX(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(1)), end_bound: Following(UInt64(10)) }, COUNT(UInt8(1)): Ok(Field { name: "COUNT(UInt8(1))", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(4)), end_bound: Following(Int32(8)) }, COUNT(UInt8(1)): Ok(Field { name: "COUNT(UInt8(1))", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(8)), end_bound: Following(UInt64(1)) }], mode=[Sorted]
BoundedWindowAggExec: wdw=[SUM(annotated_data_finite.inc_col): Ok(Field { name: "SUM(annotated_data_finite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(4)), end_bound: Following(Int32(1)) }, SUM(annotated_data_finite.desc_col): Ok(Field { name: "SUM(annotated_data_finite.desc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(8)), end_bound: Following(Int32(1)) }, SUM(annotated_data_finite.desc_col): Ok(Field { name: "SUM(annotated_data_finite.desc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(5)), end_bound: Following(UInt64(1)) }, MIN(annotated_data_finite.inc_col): Ok(Field { name: "MIN(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(1)), end_bound: Following(Int32(10)) }, MIN(annotated_data_finite.desc_col): Ok(Field { name: "MIN(annotated_data_finite.desc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(1)), end_bound: Following(Int32(5)) }, MIN(annotated_data_finite.inc_col): Ok(Field { name: "MIN(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(10)), end_bound: Following(UInt64(1)) }, MAX(annotated_data_finite.inc_col): Ok(Field { name: "MAX(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(1)), end_bound: Following(Int32(10)) }, MAX(annotated_data_finite.desc_col): Ok(Field { name: "MAX(annotated_data_finite.desc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(1)), end_bound: Following(Int32(5)) }, MAX(annotated_data_finite.inc_col): Ok(Field { name: "MAX(annotated_data_finite.inc_col)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(10)), end_bound: Following(UInt64(1)) }, COUNT(UInt8(1)): Ok(Field { name: "COUNT(UInt8(1))", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Int32(2)), end_bound: Following(Int32(6)) }, COUNT(UInt8(1)): Ok(Field { name: "COUNT(UInt8(1))", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(1)), end_bound: Following(UInt64(8)) }], mode=[Sorted]
CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_1.csv]]}, projection=[ts, inc_col, desc_col], output_ordering=[ts@0 ASC NULLS LAST], has_header=true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is neat you can see here what the output ordering is (so it is clearer that the test has sorted inputs)

ProjectionExec: expr=[SUM(annotated_data_infinite.inc_col) ORDER BY [annotated_data_infinite.ts ASC NULLS LAST] ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING@4 as sum1, SUM(annotated_data_infinite.inc_col) ORDER BY [annotated_data_infinite.ts DESC NULLS FIRST] ROWS BETWEEN 3 PRECEDING AND UNBOUNDED FOLLOWING@2 as sum2, COUNT(annotated_data_infinite.inc_col) ORDER BY [annotated_data_infinite.ts ASC NULLS LAST] ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING@5 as count1, COUNT(annotated_data_infinite.inc_col) ORDER BY [annotated_data_infinite.ts DESC NULLS FIRST] ROWS BETWEEN 3 PRECEDING AND UNBOUNDED FOLLOWING@3 as count2, ts@0 as ts]
BoundedWindowAggExec: wdw=[SUM(annotated_data_infinite.inc_col): Ok(Field { name: "SUM(annotated_data_infinite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(1)) }, COUNT(annotated_data_infinite.inc_col): Ok(Field { name: "COUNT(annotated_data_infinite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(1)) }], mode=[Sorted]
BoundedWindowAggExec: wdw=[SUM(annotated_data_infinite.inc_col): Ok(Field { name: "SUM(annotated_data_infinite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(3)) }, COUNT(annotated_data_infinite.inc_col): Ok(Field { name: "COUNT(annotated_data_infinite.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(3)) }], mode=[Sorted]
CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_1.csv]]}, projection=[ts, inc_col], infinite_source=true, output_ordering=[ts@0 ASC NULLS LAST], has_header=true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can see infinite_source=true which shows that the source is infinite (and I think makes the intent of the test clearer)

@alamb alamb force-pushed the alamb/window branch 2 times, most recently from a4021a0 to 8b102b8 Compare May 4, 2023 20:33
@alamb alamb changed the title Port remainder of window.rs to sqllogictest Port remainder of window.rs to sqllogictest May 4, 2023
@alamb alamb marked this pull request as ready for review May 5, 2023 12:59
@alamb alamb requested a review from mustafasrepo May 8, 2023 15:24
STORED AS CSV
WITH HEADER ROW
WITH ORDER (a ASC, b ASC, c ASC)
LOCATION 'tests/data/window_2.csv';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these tests we wanted to enforce these queries can run with infinite sources. Hence I think it is better to add OPTIONS('infinite_source' 'true') here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you this is a good spot.

I misread the setup used in the test

    let ctx = get_test_context2(&tmpdir, true, session_config).await?;	

the true means to use infinite source, as you say. I fixed it in d24f563

WITH HEADER ROW
WITH ORDER (a ASC, b ASC, c ASC)
LOCATION 'tests/data/window_2.csv';

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current version

RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }, Column { name: "a", index: 1 }], 4), input_partitions=4
  RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 

stack messes ordering. Hence to not mess up ordering, when source is infinite. We should run query with single partition. Hence, we should add here

statement ok
set datafusion.execution.target_partitions = 1;

also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In d24f563

Comment on lines 1957 to 1962
AggregateExec: mode=FinalPartitioned, gby=[b@0 as b, a@1 as a], aggr=[SUM(annotated_data_finite2.c)]
CoalesceBatchesExec: target_batch_size=8192
RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }, Column { name: "a", index: 1 }], 4), input_partitions=4
AggregateExec: mode=Partial, gby=[b@1 as b, a@0 as a], aggr=[SUM(annotated_data_finite2.c)], ordering_mode=FullyOrdered
RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, c], output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST, c@2 ASC NULLS LAST], has_header=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above changes would tun this section into

  AggregateExec: mode=Single, gby=[b@1 as b, a@0 as a], aggr=[SUM(annotated_data_finite2.c)], ordering_mode=FullyOrdered
    CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, c], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST, c@2 ASC NULLS LAST], has_header=true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed -- that is what happened in d24f563

Comment on lines 1991 to 1996
AggregateExec: mode=FinalPartitioned, gby=[d@0 as d, a@1 as a], aggr=[SUM(annotated_data_finite2.c)]
CoalesceBatchesExec: target_batch_size=8192
RepartitionExec: partitioning=Hash([Column { name: "d", index: 0 }, Column { name: "a", index: 1 }], 4), input_partitions=4
AggregateExec: mode=Partial, gby=[d@2 as d, a@0 as a], aggr=[SUM(annotated_data_finite2.c)], ordering_mode=PartiallyOrdered
RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, c, d], output_ordering=[a@0 ASC NULLS LAST], has_header=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this section would turn into

  AggregateExec: mode=Single, gby=[d@2 as d, a@0 as a], aggr=[SUM(annotated_data_finite2.c)], ordering_mode=PartiallyOrdered
    CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST], has_header=true

@mustafasrepo
Copy link
Contributor

Thanks @alamb for this PR. I added some comments. Other than these this PR is LGTM!. However, I wonder whether we should keep an example test case each file. Because, tests in .rs files are quite useful during development. Novices can use these example test cases as starting point when they work on new functionality. I am not really sure, whether it will be helpful. What are you thoughts about it?

@alamb
Copy link
Contributor Author

alamb commented May 10, 2023

Other than these this PR is LGTM!.

Thank you for the review @mustafasrepo -- I will make the changes

However, I wonder whether we should keep an example test case each file. Because, tests in .rs files are quite useful during development.

Can you explain how SQL tests in .rs files are more useful during development than ones in .slt files? I suspect you are using a different workflow from me (which is totally good and understandable, but I therefore lack context).

When I am working on some new feature, I found it quite easy to make some new_feature.slt file with just the SQL I want to run and then test with cargo test --test sqllogictests -- new_feature. This is often better for me than .rs tests because I can change the test and re-run it immediately without having to wait for it to recompile.

Novices can use these example test cases as starting point when they work on new functionality. I am not really sure, whether it will be helpful. What are you thoughts about it?

My thoughts on this topic are:

  1. I agree that for people starting out with DataFusion development .rs tests are easier (because they are familiar to all Rust developers). sqllogictests are harder because they require learning another tool (though I think the barrier is pretty low)
  2. However, for SQL based tests, sqllogictest based tests are much easier to maintain (specifically updating the tests with cargo test --test sqllogictests -- --complete). In my opinion this long term benefit is worth the short term cost

The reason I am spending effort porting (and asking others to help) port the existing tests to sqllogictest is precisely to discourage new SQL tests written in rust (so the code is easier to maintain long term)

🤔 if this makes sense to you, perhaps I can try to incorporate some of this rational into the documentation to make it clearer.

@mustafasrepo
Copy link
Contributor

Other than these this PR is LGTM!.

Thank you for the review @mustafasrepo -- I will make the changes

However, I wonder whether we should keep an example test case each file. Because, tests in .rs files are quite useful during development.

Can you explain how SQL tests in .rs files are more useful during development than ones in .slt files? I suspect you are using a different workflow from me (which is totally good and understandable, but I therefore lack context).

When I am working on some new feature, I found it quite easy to make some new_feature.slt file with just the SQL I want to run and then test with cargo test --test sqllogictests -- new_feature. This is often better for me than .rs tests because I can change the test and re-run it immediately without having to wait for it to recompile.

Novices can use these example test cases as starting point when they work on new functionality. I am not really sure, whether it will be helpful. What are you thoughts about it?

My thoughts on this topic are:

  1. I agree that for people starting out with DataFusion development .rs tests are easier (because they are familiar to all Rust developers). sqllogictests are harder because they require learning another tool (though I think the barrier is pretty low)
  2. However, for SQL based tests, sqllogictest based tests are much easier to maintain (specifically updating the tests with cargo test --test sqllogictests -- --complete). In my opinion this long term benefit is worth the short term cost

The reason I am spending effort porting (and asking others to help) port the existing tests to sqllogictest is precisely to discourage new SQL tests written in rust (so the code is easier to maintain long term)

🤔 if this makes sense to you, perhaps I can try to incorporate some of this rational into the documentation to make it clearer.

My motivation was mainly 1. However, there are lots of example sqllogictest, one can see the pattern easily. Also it is not clear that single example test case in .rs file shouldn't grow, and examples should in the end go to .slt files. It should be communicated in every PR. In this case, I think it is better to have all sql tests in .slt files. Thanks for clearification.

SortExec: expr=[b@1 ASC NULLS LAST,a@0 ASC NULLS LAST,c@2 ASC NULLS LAST]
BoundedWindowAggExec: wdw=[SUM(annotated_data_finite2.c): Ok(Field { name: "SUM(annotated_data_finite2.c)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(2)), end_bound: Following(UInt64(1)) }, SUM(annotated_data_finite2.c): Ok(Field { name: "SUM(annotated_data_finite2.c)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Following(UInt64(1)), end_bound: Following(UInt64(5)) }], mode=[Sorted]
SortExec: expr=[a@0 ASC NULLS LAST,d@3 ASC NULLS LAST,b@1 ASC NULLS LAST,c@2 ASC NULLS LAST]
BoundedWindowAggExec: wdw=[SUM(annotated_data_finite2.c): Ok(Field { name: "SUM(annotated_data_finite2.c)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(2)), end_bound: Following(UInt64(1)) }, SUM(annotated_data_finite2.c): Ok(Field { name: "SUM(annotated_data_finite2.c)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(5)), end_bound: CurrentRow }], mode=[Sorted]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I guess, after some indentation level, indent goes to beginning, to be able to see deep queries better. Is it true, if so is there any setting for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 that is an excellent question -- I will look into it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into it also, it seems that space at the beginning ignored during comparison. Hence spaces at the beginning are arbitrary. I have changed the indentation for couple of lines they all passes from the tests.

Copy link
Contributor Author

@alamb alamb May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this -- and it seems like the sqllogictest comparisons actually ignore the leading space. I suspect that the sqloigctest-rs --complete is doing some sort of normalizing. I will see if I can find some better way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #6328 -- I will propose a fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed fix #6329

@alamb alamb merged commit 8a11248 into apache:main May 10, 2023
@alamb alamb deleted the alamb/window branch May 10, 2023 16:06
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request May 11, 2023
* Port remainder of window.rs to sqllogictest

* fix groupby.slt to use unbounded inputs
@alamb
Copy link
Contributor Author

alamb commented May 15, 2023

My motivation was mainly 1. However, there are lots of example sqllogictest, one can see the pattern easily. Also it is not clear that single example test case in .rs file shouldn't grow, and examples should in the end go to .slt files. It should be communicated in every PR. In this case, I think it is better to have all sql tests in .slt files. Thanks for clearification.

I tried to capture some of this conversation in #6357

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port remaining window.rs tests to sqllogictests
2 participants