Invalid argument error: the data type binary has no natural order #7343

Closed
JayjeetAtGithub opened this issue Aug 21, 2023 · 5 comments
Labels
bug Something isn't working

Comments


JayjeetAtGithub commented Aug 21, 2023

Describe the bug

Running the query below against the ClickBench multi-file (partitioned) dataset,

SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime"), "SearchPhrase" LIMIT 10;

fails with this error:

Arrow error: Invalid argument error: The data type type Binary has no natural order

To Reproduce

Download the data with:

 ./benchmarks/bench.sh data clickbench_partitioned

This creates a hits_multi directory containing the Parquet files.

Then run the query above, substituting it for {query}:

datafusion-cli -c "CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits_multi';" "{query}"

Expected behavior

The query should run successfully and return results.

Additional context

DataFusion 29.0.0

@JayjeetAtGithub JayjeetAtGithub added the bug Something isn't working label Aug 21, 2023

alamb commented Aug 21, 2023

This looks similar to #7039, which @jonahgao fixed by adding a coercion from Binary to Utf8 for comparisons. I think we could do something similar here.
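
To illustrate the idea of coercing binary to UTF-8 before comparing, here is a minimal standard-library sketch. It is not DataFusion's coercion code: the helper name `binary_not_equal_str` is hypothetical, and it only mirrors the `"SearchPhrase" <> ''` filter from the query under the assumption that the binary values hold valid UTF-8.

```rust
use std::str;

/// Hypothetical helper (not a DataFusion API): evaluate `value <> literal`
/// by first interpreting the binary value as UTF-8 text.
/// Returns `None` when the bytes are not valid UTF-8, that is, when the
/// coercion itself would fail.
fn binary_not_equal_str(value: &[u8], literal: &str) -> Option<bool> {
    // Coerce Binary -> Utf8; bail out if the bytes are not valid text.
    let as_text = str::from_utf8(value).ok()?;
    // With both sides as strings, the comparison is well defined.
    Some(as_text != literal)
}
```

Once both sides are strings the comparison has an unambiguous meaning; the `None` case shows why such a coercion is only safe when the binary column really contains text.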

@alamb alamb added the good first issue Good for newcomers label Aug 21, 2023

alamb commented Aug 21, 2023

Marking as a good first issue, as there is a reproducer and I think the fix should be relatively straightforward.

tustvold commented

This may be fixed in the next arrow release, which adds Binary support to lexsort.
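
As a rough picture of what a lexicographic sort over an Int32 column and a Binary column means, here is a standard-library sketch. It is not the arrow implementation: `Row` and `sort_by_b_then_a` are hypothetical names, and it assumes binary values are ordered byte by byte, which is what Rust's ordering on `Vec<u8>` provides.

```rust
use std::cmp::Ordering;

/// One row of a two-column table: `b` plays the Int32 column and
/// `a` the Binary column (names borrowed from the reproducer in this thread).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    b: i32,
    a: Vec<u8>,
}

/// Hypothetical stand-in for `ORDER BY b, a`: sort by `b` first and
/// break ties by comparing the binary values byte by byte.
fn sort_by_b_then_a(rows: &mut [Row]) {
    rows.sort_by(|x, y| match x.b.cmp(&y.b) {
        // Equal first keys: fall back to the bytewise order of the binary key.
        Ordering::Equal => x.a.cmp(&y.a),
        other => other,
    });
}
```

Calling `sort_by_b_then_a` on rows with equal `b` values orders them by their bytes, so `b"b"` sorts before `b"c"`.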

@alamb alamb removed the good first issue Good for newcomers label Aug 21, 2023

yjshen commented Aug 27, 2023

A slightly simpler way to reproduce without the need for datafusion-cli:

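use std::sync::Arc;

use datafusion::arrow::array::{BinaryArray, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::assert_batches_eq;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn binary_order() -> Result<()> {
    // Column "a" is Binary, the type named in the error; "b" is a plain Int32.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Binary, false),
        Field::new("b", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(BinaryArray::from(vec![
                "a".as_bytes(),
                "b".as_bytes(),
                "c".as_bytes(),
            ])),
            Arc::new(Int32Array::from(vec![1, 2, 3])),
        ],
    )?;

    let ctx = SessionContext::new();
    ctx.register_batch("aa", batch)?;
    // Same shape as the failing query: filter on the binary column, then
    // sort by another column followed by the binary column.
    let result = ctx.sql("select a from aa where a <> '' order by b, a").await?;
    let result = result.collect().await?;
    // Binary values are printed as hex: "a" = 61, "b" = 62, "c" = 63.
    let expected = vec![
        "+----+",
        "| a  |",
        "+----+",
        "| 61 |",
        "| 62 |",
        "| 63 |",
        "+----+",
    ];
    assert_batches_eq!(expected, &result);
    Ok(())
}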


alamb commented Oct 16, 2023

I have verified that this now works in DataFusion 34. Tests were added in #7839.

@alamb alamb closed this as completed Oct 16, 2023