feat(java): support get lance schema with field ids#4207
feat(java): support get lance schema with field ids#4207jackye1995 merged 6 commits intolance-format:mainfrom
Conversation
|
Please take a look when you have time @jackye1995 @yanghua |
| import java.util.Map; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| public class LanceField { |
There was a problem hiding this comment.
I think we want to be exactly the same as the rust model:
pub struct Field {
pub name: String,
pub id: i32,
// TODO: Find way to move these next three fields to private
pub parent_id: i32,
pub logical_type: LogicalType,
pub metadata: HashMap<String, String>,
pub encoding: Option<Encoding>,
pub nullable: bool,
pub children: Vec<Field>,
/// Dictionary value array if this field is dictionary.
pub dictionary: Option<Dictionary>,
pub storage_class: StorageClass,
pub unenforced_primary_key: bool,
}
this is missing some fields, and fields like dictionary is not matching exactly.
There was a problem hiding this comment.
Sure, I originally thought keeping the same with arrow fields with the additional info like field id could be right but conservative. Let me look deep into the model details.
BTW, how do we know the exactly primary key sequence. like:
primary key(A, B), we need to know A before B right? by field id or what others? @jackye1995
There was a problem hiding this comment.
Hi @westonpace.
I didn't see encoding, dictonary, unenforced_primary_key putting in the metadata when converting to an arrow field.
I also noticed you said only field id is needed for lance schema.
What do you think of the upper three aligning with JAVA lance shema, are they unnecessary for schema readers? (especially dictionary, there may be an Array in it. I wish you could elaberase more about this)
There was a problem hiding this comment.
Dictionary and encoding are not important (only used for legacy 0.1 format). unenforced_primary_key could be important. We should probably include this.
(especially dictionary, there may be an Array in it. I wish you could elaberase more about this)
In the 0.1 format we implemented dictionary at the table level. We would store indices in the file and use the dictionary from the manifest to decode.
This field is unused in 2.0 (the current default) and 2.1 (the next default) and will be phased out at some point if we drop support for 0.1.
There was a problem hiding this comment.
primary key(A, B), we need to know A before B right? by field id or what others? @jackye1995
I think primary key is expressed as a set, no ordering is needed. Because either A, B or B, A uniquely identifies a row.
There was a problem hiding this comment.
primary key(A, B), we need to know A before B right? by field id or what others? @jackye1995
I think primary key is expressed as a set, no ordering is needed. Because either
A, BorB, Auniquely identifies a row.
Yes, makes sense.....this is datalake, not database...
| private final int id; | ||
| private final String name; | ||
| private final boolean nullable; | ||
| private final ArrowType type; |
There was a problem hiding this comment.
similar comment, I think we need a logicalType and a class LogicalType, and logical type should have a field name (corresponding to the unnamed string in lance_core::datatypes::LogicalType) to record the string type name.
There was a problem hiding this comment.
😰 I'd really like to get rid of lance_core::datatypes::LogicalType from Rust. It's a relic I don't think we need anymore. Originally, we had a slightly different type system (because we couldn't support all Arrow data types), but at this point we implement most of the Arrow type system. I think it's much simpler to just use Arrow data types and error when we hit a type we don't expect.
There was a problem hiding this comment.
😰 I'd really like to get rid of
lance_core::datatypes::LogicalTypefrom Rust. It's a relic I don't think we need anymore. Originally, we had a slightly different type system (because we couldn't support all Arrow data types), but at this point we implement most of the Arrow type system. I think it's much simpler to just use Arrow data types and error when we hit a type we don't expect.
Yeah, I found map and union types are not supported yet. Are they in the plan? @westonpace
I also believe keeping ArrowType could be more friendly to engines and users so that they don't need to deal with two almost same type systems. What do you think? @jackye1995
There was a problem hiding this comment.
Yeah, I found map and union types are not supported yet. Are they in the plan?
I think we will add them when someone needs them. Map will be easy. Union is a little more tricky but we can support.
There was a problem hiding this comment.
I think it's much simpler to just use Arrow data types and error when we hit a type we don't expect.
I see, that makes sense. I thought we want to reserve the right to extend the Arrow types with that setup, but if we want to be fully Arrow then let's just use Arrow type.
There was a problem hiding this comment.
If we want to extend Arrow we can use Arrow's builtin extension type mechanism.
close #4202
#4119 would depend this PR to get field ids. Or we can't even construct a unit-test for replaceFieldMetadata
I struggled to consiter whether we need to build a flat LanceField in Java(compared with python thin LanceField). consider this case: I want to config all string type as compressed by zstd.
Thin LanceField would:
The flat LanceField could be more friendly to eliminate the case that we need to get type from arrow shema and field id from LanceField.
I also noticed there has been a pk config in rust lance field. I could raise another PR to add it into JAVA LanceField if it is stable