Add ParquetObjectReader::with_runtime #6612

Merged
merged 7 commits into from
Nov 2, 2024
Switch ParquetObjectReader runtime tests to not depend on tokio_unstable anymore
itsjunetime committed Oct 24, 2024
commit 80befa1e0bf3a9253784e88b3234d3e99944c869
4 changes: 0 additions & 4 deletions .cargo/config.toml

This file was deleted.

6 changes: 1 addition & 5 deletions parquet/Cargo.toml
@@ -28,10 +28,6 @@ readme = "README.md"
 edition = { workspace = true }
 rust-version = "1.70.0"

-# we need this for the arrow::async_reader::test_runtime_is_used test
-[lints.rust]
-unexpected_cfgs = { level = "warn", check-cfg = ["cfg(tokio_unstable)"] }
-
 [target.'cfg(target_arch = "wasm32")'.dependencies]
 ahash = { version = "0.8", default-features = false, features = ["compile-time-rng"] }

@@ -85,7 +81,7 @@ lz4_flex = { version = "0.11", default-features = false, features = ["std", "fra
 zstd = { version = "0.13", default-features = false }
 serde_json = { version = "1.0", features = ["std"], default-features = false }
 arrow = { workspace = true, features = ["ipc", "test_utils", "prettyprint", "json"] }
-tokio = { version = "1.0", default-features = false, features = ["macros", "rt", "io-util", "fs"] }
+tokio = { version = "1.0", default-features = false, features = ["macros", "rt-multi-thread", "io-util", "fs"] }
 rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
 object_store = { version = "0.11.0", default-features = false, features = ["azure"] }

51 changes: 44 additions & 7 deletions parquet/src/arrow/async_reader/store.rs
@@ -173,11 +173,18 @@ impl AsyncFileReader for ParquetObjectReader {

 #[cfg(test)]
 mod tests {
-    use std::sync::Arc;
+    use std::{
+        convert::Infallible,
+        sync::{
+            atomic::{AtomicUsize, Ordering},
+            Arc,
+        },
+    };

     use futures::TryStreamExt;

     use arrow::util::test_util::parquet_test_data;
+    use futures::FutureExt;
     use object_store::local::LocalFileSystem;
     use object_store::path::Path;
     use object_store::{ObjectMeta, ObjectStore};
@@ -233,14 +240,24 @@ mod tests {
     #[tokio::test]
     // We need to mark this with the `target_has_atomic` because the spawned_tasks_count() fn is
     // only available for that cfg
-    #[cfg(all(target_has_atomic = "64", tokio_unstable))]
     async fn test_runtime_is_used() {
-        let rt = tokio::runtime::Builder::new_current_thread()
+        let num_actions = Arc::new(AtomicUsize::new(0));
+
+        let (a1, a2) = (num_actions.clone(), num_actions.clone());
+        let rt = tokio::runtime::Builder::new_multi_thread()
+            .on_thread_park(move || {
+                a1.fetch_add(1, Ordering::Relaxed);
+            })
+            .on_thread_unpark(move || {
+                a2.fetch_add(1, Ordering::Relaxed);
+            })
             .build()
             .unwrap();

         let (meta, store) = get_meta_store().await;

+        let initial_actions = num_actions.load(Ordering::Relaxed);
+
         let reader = ParquetObjectReader::new(store, meta).with_runtime(rt.handle().clone());

         let builder = ParquetRecordBatchStreamBuilder::new(reader).await.unwrap();
@@ -250,13 +267,33 @@ mod tests {
         assert_eq!(batches.len(), 1);
         assert_eq!(batches[0].num_rows(), 8);

-        // According to tokio documentation for the `spawned_tasks_count` method, this number
-        // starts at 0 when the runtime is created. So this check should actually verify what we
-        // want.
-        assert!(rt.metrics().spawned_tasks_count() > 0);
+        assert!(num_actions.load(Ordering::Relaxed) - initial_actions > 0);

         // Runtimes have to be dropped in blocking contexts, so we need to move this one to a new
         // blocking thread to drop it.
         tokio::runtime::Handle::current().spawn_blocking(move || drop(rt));
     }
+
+    #[tokio::test]
+    async fn test_runtime_thread_id_different() {
+        let rt = tokio::runtime::Builder::new_multi_thread()
+            .worker_threads(1)
+            .build()
+            .unwrap();
+
+        let (meta, store) = get_meta_store().await;
+
+        let reader = ParquetObjectReader::new(store, meta).with_runtime(rt.handle().clone());
+
+        let current_id = std::thread::current().id();
+
+        let other_id = reader
+            .spawn(|_, _| async move { Ok::<_, Infallible>(std::thread::current().id()) }.boxed())

Suggested change
-            .spawn(|_, _| async move { Ok::<_, Infallible>(std::thread::current().id()) }.boxed())
+            .spawn(|_, _| async move { Ok::<_, ParquetError>(std::thread::current().id()) }.boxed())

Would remove the need for the std::convert::Infallible conversion
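
A minimal, std-only sketch of why naming the error type in the turbofish avoids the extra conversion. Everything here is hypothetical: `DemoError` stands in for `ParquetError`, and `run_elsewhere` stands in for the reader's `spawn`, which is assumed (not confirmed by this page) to accept any error type convertible into the crate error.

```rust
use std::fmt;
use std::thread::{self, ThreadId};

/// Hypothetical stand-in for a crate error type such as `ParquetError`.
#[derive(Debug)]
pub struct DemoError(pub String);

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "demo error: {}", self.0)
    }
}

impl std::error::Error for DemoError {}

/// Hypothetical helper: runs `task` on a new OS thread and converts its
/// error into `DemoError`. Assumed to mirror a `spawn` that is generic
/// over the task's error type.
pub fn run_elsewhere<T, E, F>(task: F) -> Result<T, DemoError>
where
    F: FnOnce() -> Result<T, E> + Send + 'static,
    T: Send + 'static,
    E: Into<DemoError> + Send + 'static,
{
    thread::spawn(task)
        .join()
        .map_err(|_| DemoError("worker thread panicked".to_string()))?
        .map_err(Into::into)
}

/// Returns the id of the worker thread. Because the closure names
/// `DemoError` directly, the identity conversion is used and no
/// `From<Infallible>` impl is required.
pub fn worker_thread_id() -> Result<ThreadId, DemoError> {
    run_elsewhere(|| Ok::<_, DemoError>(thread::current().id()))
}
```

Calling `worker_thread_id()` from the main thread should yield an id different from `std::thread::current().id()`, which is the same property the test above asserts with `assert_ne!`.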

I had this repo checked out and open in my editor, so I made this change myself in 8d24cd7 to speed up getting this PR merged.

It results in a nice simplification.

+            .await
+            .unwrap();
+
+        assert_ne!(current_id, other_id);
+
+        tokio::runtime::Handle::current().spawn_blocking(move || drop(rt));

Can you also add unit tests for each of the three APIs in ParquetObjectReader where spawn is used?

  • get_bytes
  • get_byte_ranges
  • get_metadata?

+    }
 }
7 changes: 7 additions & 0 deletions parquet/src/errors.rs
@@ -107,6 +107,13 @@ impl From<str::Utf8Error> for ParquetError {
     }
 }

+#[cfg(test)]
+impl From<std::convert::Infallible> for ParquetError {

This is a nice improvement too. Thank you. Maybe it is worth adding publicly as well.

I'm not sure about this; the whole point of Infallible is that it can't be constructed and so doesn't need to be handled.

@alamb Oct 30, 2024

Well, it can't be constructed, but it often does need to be "handled" (aka to transform a Result<.., Infallible> to Result<.., Error> type expected by an API)

I don't feel strongly about this particular code.
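
For context, a small sketch of the conversion being discussed, using a hypothetical `SketchError` rather than `ParquetError`. The `From<Infallible>` impl has an empty `match` body because no value of `Infallible` can exist, and its only purpose is to let `?` compile where an API hands back `Result<_, Infallible>`. The function names and the `> 100` limit are illustrative assumptions.

```rust
use std::convert::Infallible;

/// Hypothetical error type used only for this illustration.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    General(String),
}

impl From<Infallible> for SketchError {
    fn from(value: Infallible) -> Self {
        // `Infallible` has no variants, so this match needs no arms
        // and this function can never actually be called.
        match value {}
    }
}

/// A step that cannot fail but still returns a `Result`.
pub fn infallible_step(x: u32) -> Result<u32, Infallible> {
    Ok(x + 1)
}

/// A fallible pipeline that mixes the infallible step with a real check.
pub fn pipeline(x: u32) -> Result<u32, SketchError> {
    // `?` compiles only because `From<Infallible>` exists for `SketchError`.
    let y = infallible_step(x)?;
    if y > 100 {
        return Err(SketchError::General(format!("{y} is too large")));
    }
    Ok(y * 2)
}
```

The trade-off raised in the replies applies here too: the `?` on `infallible_step` reads as if that call could fail, even though it cannot.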

> aka to transform a Result<.., Infallible> to Result<.., Error> type expected by an API)

Right, but this is a little funky because it makes code look more fallible than it is. Often you can use an infallible version of the API, i.e. into() instead of try_into(), but sometimes you do have to either unwrap() or let _ = ...

FWIW Rust 1.82 gives us a very nice way to handle this, but I'm not sure whether our MSRV policy covers tests.

let Ok(value) = expression();
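
For toolchains older than 1.82, a comparable effect can be sketched with an empty `match` on the error, which also avoids `unwrap()`. The `unwrap_infallible` helper below is a hypothetical name, and `expression` is a placeholder for whatever call returns the `Result`.

```rust
use std::convert::Infallible;

/// Placeholder for an API that returns a `Result` but can never fail.
pub fn expression() -> Result<u32, Infallible> {
    Ok(42)
}

/// Hypothetical helper: extracts the value from a `Result` whose error
/// type is uninhabited, without any panicking code path.
pub fn unwrap_infallible<T>(result: Result<T, Infallible>) -> T {
    match result {
        Ok(value) => value,
        // `never` has type `Infallible`; an empty match proves to the
        // compiler that this arm is unreachable.
        Err(never) => match never {},
    }
}
```

With this helper, `unwrap_infallible(expression())` yields the inner value directly, and the call site makes it visible that no error handling is being skipped.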

Removed in 8d24cd7

+    fn from(value: std::convert::Infallible) -> Self {
+        match value {}
+    }
+}
+
 #[cfg(feature = "arrow")]
 impl From<ArrowError> for ParquetError {
     fn from(e: ArrowError) -> ParquetError {