Skip to content

feat(java): support get lance schema with field ids#4207

Merged
jackye1995 merged 6 commits intolance-format:mainfrom
majin1102:lance_schema
Jul 15, 2025
Merged

feat(java): support get lance schema with field ids#4207
jackye1995 merged 6 commits intolance-format:mainfrom
majin1102:lance_schema

Conversation

@majin1102
Copy link
Contributor

@majin1102 majin1102 commented Jul 11, 2025

close #4202

#4119 would depend this PR to get field ids. Or we can't even construct a unit-test for replaceFieldMetadata

I struggled to consiter whether we need to build a flat LanceField in Java(compared with python thin LanceField). consider this case: I want to config all string type as compressed by zstd.

Thin LanceField would:

  1. get the type from arrow schema
  2. get the field id from lance schema
  3. replace field config

The flat LanceField could be more friendly to eliminate the case that we need to get type from arrow shema and field id from LanceField.

I also noticed there has been a pk config in rust lance field. I could raise another PR to add it into JAVA LanceField if it is stable

@github-actions github-actions bot added enhancement New feature or request java labels Jul 11, 2025
@majin1102
Copy link
Contributor Author

Please take a look when you have time @jackye1995 @yanghua

import java.util.Map;
import java.util.stream.Collectors;

public class LanceField {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to be exactly the same as the rust model:

pub struct Field {
    pub name: String,
    pub id: i32,
    // TODO: Find way to move these next three fields to private
    pub parent_id: i32,
    pub logical_type: LogicalType,
    pub metadata: HashMap<String, String>,
    pub encoding: Option<Encoding>,
    pub nullable: bool,

    pub children: Vec<Field>,

    /// Dictionary value array if this field is dictionary.
    pub dictionary: Option<Dictionary>,
    pub storage_class: StorageClass,
    pub unenforced_primary_key: bool,
}

this is missing some fields, and fields like dictionary is not matching exactly.

Copy link
Contributor Author

@majin1102 majin1102 Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I originally thought keeping the same with arrow fields with the additional info like field id could be right but conservative. Let me look deep into the model details.

BTW, how do we know the exactly primary key sequence. like:
primary key(A, B), we need to know A before B right? by field id or what others? @jackye1995

Copy link
Contributor Author

@majin1102 majin1102 Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @westonpace.

I didn't see encoding, dictonary, unenforced_primary_key putting in the metadata when converting to an arrow field.

I also noticed you said only field id is needed for lance schema.

What do you think of the upper three aligning with JAVA lance shema, are they unnecessary for schema readers? (especially dictionary, there may be an Array in it. I wish you could elaberase more about this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionary and encoding are not important (only used for legacy 0.1 format). unenforced_primary_key could be important. We should probably include this.

(especially dictionary, there may be an Array in it. I wish you could elaberase more about this)

In the 0.1 format we implemented dictionary at the table level. We would store indices in the file and use the dictionary from the manifest to decode.

This field is unused in 2.0 (the current default) and 2.1 (the next default) and will be phased out at some point if we drop support for 0.1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

primary key(A, B), we need to know A before B right? by field id or what others? @jackye1995

I think primary key is expressed as a set, no ordering is needed. Because either A, B or B, A uniquely identifies a row.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

primary key(A, B), we need to know A before B right? by field id or what others? @jackye1995

I think primary key is expressed as a set, no ordering is needed. Because either A, B or B, A uniquely identifies a row.

Yes, makes sense.....this is datalake, not database...

private final int id;
private final String name;
private final boolean nullable;
private final ArrowType type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment, I think we need a logicalType and a class LogicalType, and logical type should have a field name (corresponding to the unnamed string in lance_core::datatypes::LogicalType) to record the string type name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😰 I'd really like to get rid of lance_core::datatypes::LogicalType from Rust. It's a relic I don't think we need anymore. Originally, we had a slightly different type system (because we couldn't support all Arrow data types), but at this point we implement most of the Arrow type system. I think it's much simpler to just use Arrow data types and error when we hit a type we don't expect.

Copy link
Contributor Author

@majin1102 majin1102 Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😰 I'd really like to get rid of lance_core::datatypes::LogicalType from Rust. It's a relic I don't think we need anymore. Originally, we had a slightly different type system (because we couldn't support all Arrow data types), but at this point we implement most of the Arrow type system. I think it's much simpler to just use Arrow data types and error when we hit a type we don't expect.

Yeah, I found map and union types are not supported yet. Are they in the plan? @westonpace
I also believe keeping ArrowType could be more friendly to engines and users so that they don't need to deal with two almost same type systems. What do you think? @jackye1995

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I found map and union types are not supported yet. Are they in the plan?

I think we will add them when someone needs them. Map will be easy. Union is a little more tricky but we can support.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's much simpler to just use Arrow data types and error when we hit a type we don't expect.

I see, that makes sense. I thought we want to reserve the right to extend the Arrow types with that setup, but if we want to be fully Arrow then let's just use Arrow type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to extend Arrow we can use Arrow's builtin extension type mechanism.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@jackye1995 jackye1995 merged commit 2b2d2c2 into lance-format:main Jul 15, 2025
8 checks passed
@majin1102 majin1102 deleted the lance_schema branch September 10, 2025 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Brings java lance schema into java module

3 participants