Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(index): implement file index format and interface #37

Closed
wants to merge 2 commits into from

Conversation

devillove084
Copy link
Contributor

Implement FileIndex related interfaces, implement Format format, and write related tests to ensure the correct file structure.

close #34

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks a lot for your PR. This PR is quite large and covers many decision-making points. I suggest splitting it into smaller ones.

@@ -27,6 +27,9 @@ license.workspace = true
version.workspace = true

[dependencies]
rand = "0.8.5"
async-trait = "0.1.81"
lazy_static = "1.5.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use once_cell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -35,3 +38,4 @@ serde_bytes = "0.11.15"
snafu = "0.8.3"
typed-builder = "^0.18"
opendal = "0.48"
tokio = { version = "1.39.2", features = ["full"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to avoid enabling full features, please only pick what you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK


use crate::spec::{DataField, DataType};

pub fn to_map_key(map_column_name: &str, key_name: &str) -> String {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The common mod in Rust is unusual; I prefer to repeat them in the code and extract them while needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

/// - `BODY`: column index bytes + column index bytes + column index bytes + .......

#[derive(Debug)]
pub struct FileIndexFormat;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an empty struct to provide create_writer/create_reader is not a common pattern. I suggest implementing FileIndexReader and FileIndexWriter that can accept a FileIO instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what is needed here is called FileIndexFormatReader and FileIndexFormatWriter.

pub struct FileIndexFormat;

impl FileIndexFormat {
pub fn create_writer<W: AsyncWrite + Unpin>(writer: W) -> Writer<W> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better for Paimon to provide its own io traits rather than depending directly on AsyncRead/AsyncWrite.

  1. We can avoid depending on tokio directly.
  2. We can allow users to implement io trait to provide best performance.

Most storage services in paimon are not good for AsyncRead + AsyncSeek, we should better avoid that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll take the time to carefully study and learn from the approach used in Iceberg-rust.

}

#[allow(dead_code)]
pub struct Writer<W: AsyncWrite + Unpin> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should build our io module first.

  • structs like InputFile & OutputFile
  • traits like FileRead and FileWrite.

We can follow the iceberg pattern: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/io/file_io.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will mark this PR as Draft later, and then add the relevant IO to-do items in the corresponding Issue.

use tokio::io::{AsyncRead, AsyncSeek};

#[async_trait]
pub trait SeekableInputStream: AsyncRead + AsyncSeek + Unpin {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest not introducing traits like this. We should avoid using anything that accepts or consumes AsyncSeek.

We can start a seperate PR for this to add IO trait in paimon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I'll start with the changes to the IO module. I really appreciate your helpful suggestions.

@devillove084 devillove084 marked this pull request as draft August 4, 2024 10:20
@devillove084
Copy link
Contributor Author

close it and move on!

@devillove084 devillove084 deleted the feat/impl_file_index branch September 11, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement FileIndex
3 participants