Skip to content

Perf: load default Utf8View for CSV datatype #16243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

zhuqi-lucas
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the datasource Changes to the datasource crate label Jun 3, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Jun 3, 2025
@alamb
Copy link
Contributor

alamb commented Jun 3, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing default_utf8_for_unkown_type (7586ca2) to 5b08b84 diff
Benchmarks: h2o_small_window
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 3, 2025

🤖: Benchmark completed

Details

Comparing HEAD and default_utf8_for_unkown_type
--------------------
Benchmark h2o_window.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ default_utf8_for_unkown_type ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │  1907.12ms │                    1874.72ms │    no change │
│ QQuery 2     │  3435.06ms │                    3435.28ms │    no change │
│ QQuery 3     │  3836.65ms │                    4366.66ms │ 1.14x slower │
│ QQuery 4     │  1008.06ms │                     959.61ms │    no change │
│ QQuery 5     │  2238.31ms │                    2291.12ms │    no change │
│ QQuery 6     │  2998.08ms │                    2978.58ms │    no change │
│ QQuery 7     │  2948.78ms │                    2920.08ms │    no change │
│ QQuery 8     │ 11870.28ms │                   12098.03ms │    no change │
│ QQuery 9     │   866.09ms │                     852.79ms │    no change │
│ QQuery 10    │   938.85ms │                     940.99ms │    no change │
│ QQuery 11    │   910.25ms │                     894.20ms │    no change │
│ QQuery 12    │  1769.99ms │                    1741.19ms │    no change │
└──────────────┴────────────┴──────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 34727.52ms │
│ Total Time (default_utf8_for_unkown_type)   │ 35353.24ms │
│ Average Time (HEAD)                         │  2893.96ms │
│ Average Time (default_utf8_for_unkown_type) │  2946.10ms │
│ Queries Faster                              │          0 │
│ Queries Slower                              │          1 │
│ Queries with No Change                      │         11 │
└─────────────────────────────────────────────┴────────────┘

@zhuqi-lucas
Copy link
Contributor Author

It looks like no performance improvement for h2o_window benchmark result...

@alamb
Copy link
Contributor

alamb commented Jun 4, 2025

It looks like no performance improvement for h2o_window benchmark result...

Now that I think about it, the h2o benchmark may not have any string columns 🤔

Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?

@zhuqi-lucas
Copy link
Contributor Author

It looks like no performance improvement for h2o_window benchmark result...

Now that I think about it, the h2o benchmark may not have any string columns 🤔

Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?

Thank you @alamb , this is a good point. Do some investigation from benchmark code now.

# Runs the tpch benchmark
run_tpch() {
    SCALE_FACTOR=$1
    if [ -z "$SCALE_FACTOR" ] ; then
        echo "Internal error: Scale factor not specified"
        exit 1
    fi
    TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"

    RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
    echo "RESULTS_FILE: ${RESULTS_FILE}"
    echo "Running tpch benchmark..."
    # Optional query filter to run specific query
    QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
    debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}" $QUERY
}
    /// File format: `csv` or `parquet`
    #[structopt(short = "f", long = "format", default_value = "csv")]
    file_format: String,

It looks like we default to parquet for tpch, but it also supports csv, i will try to create a PR to support csv from tpch benchmark parameters.

Because from the generator code for tpch, we also generate the CSV format, so it's reasonable for us to support CSV benchmark also, i will create a PR soon. Thanks

 # Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
    FILE="${TPCH_DIR}/supplier.tbl"
    if test -f "${FILE}"; then
        echo " tbl files exist ($FILE exists)."
    else
        echo " creating tbl files with tpch_dbgen..."
        docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
    fi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read from csv default datatype setting to utf8view
2 participants