ARROW-271: Update Field structure to be more explicit #124

Closed
26 changes: 20 additions & 6 deletions format/Message.fbs
@@ -91,17 +91,31 @@ union Type {
JSONScalar
}

/// ----------------------------------------------------------------------
/// The possible types of a vector

enum VectorType: short {
/// used in List type Dense Union and variable length primitive types (String, Binary)
/// used in List type, Dense Union and variable length primitive types (String, Binary)
OFFSET,
/// fixed length primitive values
VALUES,
/// Bit vector indicated if each value is null
/// actual data, either fixed width primitive types in slots or variable width delimited by an OFFSET vector
DATA,
/// Bit vector indicating if each value is null
VALIDITY,
/// Type vector used in Union type
TYPE
}

/// ----------------------------------------------------------------------
/// represents the physical layout of a buffer
/// buffers have fixed width slots of a given type

table VectorLayout {
/// the width of a slot in the buffer (typically 1, 8, 16, 32 or 64)
bit_width: short;
Contributor

What would the bit width be for a data vector for strings? I'm not entirely clear what this means in all cases (or how it would be used).

Member Author

8 since that's how many bits you have in between offsets.

Member Author

It is more useful in cases where the bit_width is less obvious:

  • what is the dictionary id size?
  • what's the decimal byte size?
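To make the string case in this thread concrete, here is a rough sketch of how the three per-vector bit widths would translate into buffer sizes. `SlotLayoutSketch` and its methods are hypothetical names for illustration, not part of this PR. It assumes the widths used by `VectorLayout` in this change (1-bit validity, 32-bit offsets, 8-bit data slots), assumes one more offset than there are values, and ignores padding and alignment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the VectorLayout table in this PR:
// a vector purpose plus the width of one slot in bits.
final class SlotLayoutSketch {
  enum Purpose { VALIDITY, OFFSET, DATA }

  final Purpose purpose;
  final int bitWidth;

  SlotLayoutSketch(Purpose purpose, int bitWidth) {
    if (bitWidth <= 0) {
      throw new IllegalArgumentException("bitWidth invalid: " + bitWidth);
    }
    this.purpose = purpose;
    this.bitWidth = bitWidth;
  }

  // Bytes needed for slotCount slots, rounded up to a whole byte.
  // Padding and alignment are deliberately ignored in this sketch.
  long byteSize(long slotCount) {
    return (slotCount * bitWidth + 7) / 8;
  }

  // Layouts for a string column: 1-bit validity, 32-bit offsets, 8-bit data slots.
  static List<SlotLayoutSketch> utf8Layouts() {
    return Arrays.asList(
        new SlotLayoutSketch(Purpose.VALIDITY, 1),
        new SlotLayoutSketch(Purpose.OFFSET, 32),
        new SlotLayoutSketch(Purpose.DATA, 8));
  }

  // Returns {validityBytes, offsetBytes, dataBytes} for the given strings.
  // Assumption: one more offset than values (each start plus the final end).
  static long[] utf8BufferSizes(List<String> values) {
    long dataSlots = 0;
    for (String v : values) {
      dataSlots += v.getBytes(StandardCharsets.UTF_8).length;
    }
    List<SlotLayoutSketch> layouts = utf8Layouts();
    return new long[] {
        layouts.get(0).byteSize(values.size()),
        layouts.get(1).byteSize(values.size() + 1),
        layouts.get(2).byteSize(dataSlots)
    };
  }
}
```

For the three values "ab", "cde" and "", this gives 1 byte of validity, 16 bytes of offsets and 5 bytes of data. The data vector's width of 8 only says that each slot between two offsets is one byte.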

/// the purpose of the vector
type: VectorType;
}

/// ----------------------------------------------------------------------
/// A field represents a named column in a record / row batch or child of a
/// nested type.
@@ -120,10 +134,10 @@ table Field {
dictionary: long;
// children apply only to Nested data types like Struct, List and Union
children: [Field];
/// the buffers produced for this type (as derived from the Type)
/// layout of buffers produced for this type (as derived from the Type)
/// does not include children
Contributor

What does "does not include children" mean? I would expect this to list all buffers for a batch.

Member Author

This is Field, which is part of the Schema (FieldNode, not Field, will contain all the Buffers for a RecordBatch).
Each Field has a vector layout and children. Each child defines its own layout.
For example:

  • a Tuple only has a validity vector (children define their own).
  • a List has a validity vector and an offset vector (and a single child)
  • an int has a validity vector and a data vector (and no children)
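The three cases above can be sketched as a small tree walk. `FieldLayoutSketch`, `Kind`, `ownVectors` and `allVectors` are hypothetical names used only for illustration and are not the classes in this PR. The sketch assumes that a field's own vectors come first, followed by each child's vectors in order, which matches how the TypeLayout javadoc describes the ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a schema field: a name, a kind, and child fields.
final class FieldLayoutSketch {
  enum Kind { TUPLE, LIST, INT }

  final String name;
  final Kind kind;
  final List<FieldLayoutSketch> children;

  FieldLayoutSketch(String name, Kind kind, FieldLayoutSketch... children) {
    this.name = name;
    this.kind = kind;
    this.children = Arrays.asList(children);
  }

  // Vectors owned by this field only; children are NOT included.
  List<String> ownVectors() {
    switch (kind) {
      case TUPLE:
        return Arrays.asList("VALIDITY");
      case LIST:
        return Arrays.asList("VALIDITY", "OFFSET");
      case INT:
        return Arrays.asList("VALIDITY", "DATA");
      default:
        throw new IllegalStateException("unknown kind: " + kind);
    }
  }

  // Depth-first flattening: this field's vectors, then each child's.
  List<String> allVectors() {
    List<String> out = new ArrayList<>();
    for (String vector : ownVectors()) {
      out.add(name + "." + vector);
    }
    for (FieldLayoutSketch child : children) {
      out.addAll(child.allVectors());
    }
    return out;
  }
}
```

With a List field `xs` holding an int child `item`, `ownVectors()` on `xs` returns only VALIDITY and OFFSET, while `allVectors()` yields `xs.VALIDITY`, `xs.OFFSET`, `item.VALIDITY`, `item.DATA`. The second list is closer to what a record batch would carry.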

/// each recordbatch will return instances of those Buffers.
buffers: [ VectorType ];
layout: [ VectorLayout ];
}

/// ----------------------------------------------------------------------
@@ -34,6 +34,8 @@

<#include "/@includes/vv_imports.ftl" />

import org.apache.arrow.flatbuf.Precision;

/**
* Nullable${minor.class} implements a vector of values which could be null. Elements in the vector
* are first checked against a fixed length vector of boolean values. Then the element is retrieved
@@ -97,9 +99,9 @@ public final class ${className} extends BaseDataValueVector implements <#if type
<#elseif minor.class == "Time">
field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), null);
<#elseif minor.class == "Float4">
field = new Field(name, true, new FloatingPoint(org.apache.arrow.flatbuf.Precision.SINGLE), null);
field = new Field(name, true, new FloatingPoint(Precision.SINGLE), null);
<#elseif minor.class == "Float8">
field = new Field(name, true, new FloatingPoint(org.apache.arrow.flatbuf.Precision.DOUBLE), null);
field = new Field(name, true, new FloatingPoint(Precision.DOUBLE), null);
<#elseif minor.class == "TimeStamp">
field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(""), null);
<#elseif minor.class == "IntervalDay">
@@ -21,7 +21,7 @@

public class ArrowVectorType {

public static final ArrowVectorType VALUES = new ArrowVectorType(VectorType.VALUES);
public static final ArrowVectorType DATA = new ArrowVectorType(VectorType.DATA);
public static final ArrowVectorType OFFSET = new ArrowVectorType(VectorType.OFFSET);
public static final ArrowVectorType VALIDITY = new ArrowVectorType(VectorType.VALIDITY);
public static final ArrowVectorType TYPE = new ArrowVectorType(VectorType.TYPE);
@@ -49,6 +49,8 @@
import org.apache.arrow.vector.types.pojo.ArrowType.Union;
import org.apache.arrow.vector.types.pojo.ArrowType.Utf8;

import com.google.common.base.Preconditions;

/**
* The layout of vectors for a given type
* It defines its own vectors followed by the vectors for the children
@@ -182,7 +184,7 @@ public TypeLayout visit(IntervalYear type) { // TODO: check size

public TypeLayout(List<VectorLayout> vectors) {
super();
this.vectors = vectors;
this.vectors = Preconditions.checkNotNull(vectors);
}

public TypeLayout(VectorLayout... vectors) {
@@ -205,4 +207,22 @@ public List<ArrowVectorType> getVectorTypes() {
public String toString() {
return "TypeLayout{" + vectors + "}";
}

@Override
public int hashCode() {
return vectors.hashCode();
}

@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TypeLayout other = (TypeLayout) obj;
return vectors.equals(other.vectors);
}

}
@@ -17,21 +17,24 @@
*/
package org.apache.arrow.vector.schema;

import static org.apache.arrow.vector.schema.ArrowVectorType.DATA;
import static org.apache.arrow.vector.schema.ArrowVectorType.OFFSET;
import static org.apache.arrow.vector.schema.ArrowVectorType.TYPE;
import static org.apache.arrow.vector.schema.ArrowVectorType.VALIDITY;
import static org.apache.arrow.vector.schema.ArrowVectorType.VALUES;

public class VectorLayout {
import com.google.common.base.Preconditions;
import com.google.flatbuffers.FlatBufferBuilder;

public class VectorLayout implements FBSerializable {

private static final VectorLayout VALIDITY_VECTOR = new VectorLayout(VALIDITY, 1);
private static final VectorLayout OFFSET_VECTOR = new VectorLayout(OFFSET, 32);
private static final VectorLayout TYPE_VECTOR = new VectorLayout(TYPE, 32);
private static final VectorLayout BOOLEAN_VECTOR = new VectorLayout(VALUES, 1);
private static final VectorLayout VALUES_64 = new VectorLayout(VALUES, 64);
private static final VectorLayout VALUES_32 = new VectorLayout(VALUES, 32);
private static final VectorLayout VALUES_16 = new VectorLayout(VALUES, 16);
private static final VectorLayout VALUES_8 = new VectorLayout(VALUES, 8);
private static final VectorLayout BOOLEAN_VECTOR = new VectorLayout(DATA, 1);
private static final VectorLayout VALUES_64 = new VectorLayout(DATA, 64);
private static final VectorLayout VALUES_32 = new VectorLayout(DATA, 32);
private static final VectorLayout VALUES_16 = new VectorLayout(DATA, 16);
private static final VectorLayout VALUES_8 = new VectorLayout(DATA, 8);

public static VectorLayout typeVector() {
return TYPE_VECTOR;
@@ -68,14 +71,21 @@ public static VectorLayout byteVector() {
return dataVector(8);
}

private final int typeBitWidth;
private final short typeBitWidth;

private final ArrowVectorType type;

private VectorLayout(ArrowVectorType type, int typeBitWidth) {
super();
this.type = type;
this.typeBitWidth = typeBitWidth;
this.type = Preconditions.checkNotNull(type);
this.typeBitWidth = (short)typeBitWidth;
if (typeBitWidth <= 0) {
throw new IllegalArgumentException("bitWidth invalid: " + typeBitWidth);
}
}

public VectorLayout(org.apache.arrow.flatbuf.VectorLayout layout) {
this(new ArrowVectorType(layout.type()), layout.bitWidth());
}

public int getTypeBitWidth() {
@@ -90,4 +100,28 @@ public ArrowVectorType getType() {
public String toString() {
return String.format("{width=%s,type=%s}", typeBitWidth, type);
}

@Override
public int hashCode() {
return 31 * (31 + type.hashCode()) + typeBitWidth;
}

@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
VectorLayout other = (VectorLayout) obj;
return type.equals(other.type) && (typeBitWidth == other.typeBitWidth);
}

@Override
public int writeTo(FlatBufferBuilder builder) {
return org.apache.arrow.flatbuf.VectorLayout.createVectorLayout(builder, typeBitWidth, type.getType());
}


}
@@ -20,12 +20,11 @@

import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField;

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

import org.apache.arrow.vector.schema.ArrowVectorType;
import org.apache.arrow.vector.schema.TypeLayout;
import org.apache.arrow.vector.schema.VectorLayout;

import com.google.common.collect.ImmutableList;
import com.google.flatbuffers.FlatBufferBuilder;
@@ -37,7 +36,7 @@ public class Field {
private final List<Field> children;
private final TypeLayout typeLayout;

public Field(String name, boolean nullable, ArrowType type, List<Field> children) {
private Field(String name, boolean nullable, ArrowType type, List<Field> children, TypeLayout typeLayout) {
this.name = name;
this.nullable = nullable;
this.type = type;
@@ -46,34 +45,37 @@ public Field(String name, boolean nullable, ArrowType type, List<Field> children
} else {
this.children = children;
}
this.typeLayout = TypeLayout.getTypeLayout(type);
this.typeLayout = typeLayout;
}

public Field(String name, boolean nullable, ArrowType type, List<Field> children) {
this(name, nullable, type, children, TypeLayout.getTypeLayout(type));
}

public static Field convertField(org.apache.arrow.flatbuf.Field field) {
String name = field.name();
boolean nullable = field.nullable();
ArrowType type = getTypeForField(field);
List<ArrowVectorType> buffers = new ArrayList<>();
for (int i = 0; i < field.buffersLength(); ++i) {
buffers.add(new ArrowVectorType(field.buffers(i)));
ImmutableList.Builder<org.apache.arrow.vector.schema.VectorLayout> layout = ImmutableList.builder();
for (int i = 0; i < field.layoutLength(); ++i) {
layout.add(new org.apache.arrow.vector.schema.VectorLayout(field.layout(i)));
}
ImmutableList.Builder<Field> childrenBuilder = ImmutableList.builder();
for (int i = 0; i < field.childrenLength(); i++) {
childrenBuilder.add(convertField(field.children(i)));
}
List<Field> children = childrenBuilder.build();
Field result = new Field(name, nullable, type, children);
TypeLayout typeLayout = result.getTypeLayout();
if (typeLayout.getVectors().size() != field.buffersLength()) {
List<ArrowVectorType> types = new ArrayList<>();
for (int i = 0; i < field.buffersLength(); i++) {
types.add(new ArrowVectorType(field.buffers(i)));
}
throw new IllegalArgumentException("Deserialized field does not match expected vectors. expected: " + typeLayout.getVectorTypes() + " got " + types);
}
Field result = new Field(name, nullable, type, children, new TypeLayout(layout.build()));
return result;
}

public void validate() {
TypeLayout expectedLayout = TypeLayout.getTypeLayout(type);
if (!expectedLayout.equals(typeLayout)) {
throw new IllegalArgumentException("Deserialized field does not match expected vectors. expected: " + expectedLayout + " got " + typeLayout);
}
}

public int getField(FlatBufferBuilder builder) {
int nameOffset = builder.createString(name);
int typeOffset = type.getType(builder);
@@ -82,18 +84,19 @@ public int getField(FlatBufferBuilder builder) {
childrenData[i] = children.get(i).getField(builder);
}
int childrenOffset = org.apache.arrow.flatbuf.Field.createChildrenVector(builder, childrenData);
short[] buffersData = new short[typeLayout.getVectors().size()];
int[] buffersData = new int[typeLayout.getVectors().size()];
for (int i = 0; i < buffersData.length; i++) {
buffersData[i] = typeLayout.getVectors().get(i).getType().getType();
VectorLayout vectorLayout = typeLayout.getVectors().get(i);
buffersData[i] = vectorLayout.writeTo(builder);
}
int buffersOffset = org.apache.arrow.flatbuf.Field.createBuffersVector(builder, buffersData );
int layoutOffset = org.apache.arrow.flatbuf.Field.createLayoutVector(builder, buffersData);
org.apache.arrow.flatbuf.Field.startField(builder);
org.apache.arrow.flatbuf.Field.addName(builder, nameOffset);
org.apache.arrow.flatbuf.Field.addNullable(builder, nullable);
org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeType());
org.apache.arrow.flatbuf.Field.addType(builder, typeOffset);
org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset);
org.apache.arrow.flatbuf.Field.addBuffers(builder, buffersOffset);
org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset);
return org.apache.arrow.flatbuf.Field.endField(builder);
}

@@ -22,6 +22,8 @@
import static org.junit.Assert.assertEquals;

import org.apache.arrow.flatbuf.UnionMode;
import static org.junit.Assert.assertEquals;

import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint;
import org.apache.arrow.vector.types.pojo.ArrowType.Int;
import org.apache.arrow.vector.types.pojo.ArrowType.List;