Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix datafusion-cli print output #8895

Merged
merged 3 commits into from
Jan 18, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
127 changes: 120 additions & 7 deletions datafusion-cli/src/print_format.rs
Original file line number Diff line number Diff line change
Expand Up @@ -161,23 +161,29 @@ impl PrintFormat {
maxrows: MaxRows,
with_header: bool,
) -> Result<()> {
if batches.is_empty() || batches[0].num_rows() == 0 {
// filter out any empty batches
let batches: Vec<_> = batches
.iter()
.filter(|b| b.num_rows() > 0)
.cloned()
.collect();
if batches.is_empty() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix (thanks @Jefffrey for finding this). The rest of the PR is tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep the intended behavior, I think this should probably be:

Suggested change
if batches.is_empty() {
if batches.is_empty() || batches.iter().all(|b| b.num_rows() == 0) {

However, I think the intention is wrong. We should totally print the header even if there's no data. This entire "don't print anything if the result is empty" is super confusing, at least in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I think the intention is wrong. We should totally print the header even if there's no data. This entire "don't print anything if the result is empty" is super confusing, at least in my opinion.

This is a good call -- I will update and add a test

I believe the logic is to support generating output incrementally ("Streaming") rather than waiting for the entire output to be generated as Vec<RecordBatch>.

However, that logic is sort of split between this module and print_format.rs which I agree is confusing

cc @berkaysynnada

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 2ef1f9d

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen for queries that return empty results?

e.g.

select * from table where 1=0

Will this print nothing at all, not even the table outline?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to mention that a similar behavior exists for streaming results. If the first batch of the stream is empty, the header will not be printed, even if it was set to true (this issue existed prior to this PR). Perhaps we should print an empty record batch for the header before these checks if the flag is set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this print nothing at all, not even the table outline?

Right, it will NOT print the outline / header. Also see #8895 (comment) , i.e. I also find this rather confusing. However this PR here just fixes the status quo to not swallow the output for aggregates. I personally think we should change the behavior in a follow-up to at least print the table outline / header even for empty results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I didn't realize the "print nothing if no data" was a previous issue before that streaming PR

Then yeah I agree can fix that separately if it wasn't technically a regression

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #8904 to track (and I confirmed the "print nothing if no data" behavior was present in versions 33 and 25).

return Ok(());
}

match self {
Self::Csv | Self::Automatic => {
print_batches_with_sep(writer, batches, b',', with_header)
print_batches_with_sep(writer, &batches, b',', with_header)
}
Self::Tsv => print_batches_with_sep(writer, batches, b'\t', with_header),
Self::Tsv => print_batches_with_sep(writer, &batches, b'\t', with_header),
Self::Table => {
if maxrows == MaxRows::Limited(0) {
return Ok(());
}
format_batches_with_maxrows(writer, batches, maxrows)
format_batches_with_maxrows(writer, &batches, maxrows)
}
Self::Json => batches_to_json!(ArrayWriter, writer, batches),
Self::NdJson => batches_to_json!(LineDelimitedWriter, writer, batches),
Self::Json => batches_to_json!(ArrayWriter, writer, &batches),
Self::NdJson => batches_to_json!(LineDelimitedWriter, writer, &batches),
}
}
}
Expand All @@ -189,7 +195,7 @@ mod tests {

use super::*;

use arrow::array::Int32Array;
use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::error::Result;

Expand Down Expand Up @@ -351,4 +357,111 @@ mod tests {

Ok(())
}

#[test]
fn test_print_batches_empty_batches() -> Result<()> {
let batch = one_column_batch();
let empty_batch = RecordBatch::new_empty(batch.schema());

#[rustfmt::skip]
let expected =&[
"+---+",
"| a |",
"+---+",
"| 1 |",
"| 2 |",
"| 3 |",
"+---+\n",
];

PrintBatchesTest::new()
.with_format(PrintFormat::Table)
.with_batches(vec![empty_batch.clone(), batch, empty_batch])
.with_expected(expected)
.run();
Ok(())
}

#[test]
fn test_print_batches_empty_batches_no_header() -> Result<()> {
let empty_batch = RecordBatch::new_empty(one_column_batch().schema());

// empty batches should not print a header
let expected = &[""];

PrintBatchesTest::new()
.with_format(PrintFormat::Table)
.with_batches(vec![empty_batch])
.with_header(true)
.with_expected(expected)
.run();
Ok(())
}

struct PrintBatchesTest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to rework the rest of the tests in this module as a follow on PR using this struct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format: PrintFormat,
batches: Vec<RecordBatch>,
maxrows: MaxRows,
with_header: bool,
expected: Vec<&'static str>,
}

impl PrintBatchesTest {
fn new() -> Self {
Self {
format: PrintFormat::Table,
batches: vec![],
maxrows: MaxRows::Unlimited,
with_header: false,
expected: vec![],
}
}

/// set the format
fn with_format(mut self, format: PrintFormat) -> Self {
self.format = format;
self
}

/// set the batches to convert
fn with_batches(mut self, batches: Vec<RecordBatch>) -> Self {
self.batches = batches;
self
}

/// set whether to include a header
fn with_header(mut self, with_header: bool) -> Self {
self.with_header = with_header;
self
}

/// set expected output
fn with_expected(mut self, expected: &[&'static str]) -> Self {
self.expected = expected.to_vec();
self
}

/// run the test
fn run(self) {
let mut buffer: Vec<u8> = vec![];
self.format
.print_batches(&mut buffer, &self.batches, self.maxrows, self.with_header)
.unwrap();
let actual = String::from_utf8(buffer).unwrap();
let expected = self.expected.join("\n");
assert_eq!(
actual, expected,
"actual:\n\n{actual}expected:\n\n{expected}"
);
}
}

/// return a batch with one column and three rows
fn one_column_batch() -> RecordBatch {
RecordBatch::try_from_iter(vec![(
"a",
Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef,
)])
.unwrap()
}
}