3 changes: 2 additions & 1 deletion cpp/src/arrow/ipc/metadata-internal.cc
@@ -220,7 +220,8 @@ static Status FieldToFlatbuffer(
auto fb_children = fbb.CreateVector(children);

*offset = flatbuf::CreateField(
fbb, fb_name, field->nullable, type_enum, type_data, fb_children);
fbb, fb_name, field->nullable, type_enum, type_data, field->dictionary,
fb_children);

return Status::OK();
}
11 changes: 8 additions & 3 deletions cpp/src/arrow/type.h
@@ -144,8 +144,13 @@ struct ARROW_EXPORT Field {
// Fields can be nullable
bool nullable;

Field(const std::string& name, const TypePtr& type, bool nullable = true)
: name(name), type(type), nullable(nullable) {}
// optional dictionary id if the field is dictionary encoded
// 0 means it's not dictionary encoded
int64_t dictionary;

Field(const std::string& name, const TypePtr& type, bool nullable = true,
int64_t dictionary = 0)
: name(name), type(type), nullable(nullable), dictionary(dictionary) {}

bool operator==(const Field& other) const { return this->Equals(other); }

@@ -154,7 +159,7 @@ struct ARROW_EXPORT Field {
bool Equals(const Field& other) const {
return (this == &other) ||
(this->name == other.name && this->nullable == other.nullable &&
this->type->Equals(other.type.get()));
this->dictionary == other.dictionary && this->type->Equals(other.type.get()));
}

bool Equals(const std::shared_ptr<Field>& other) const { return Equals(*other.get()); }
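
The role of the new `dictionary` member in equality can be illustrated with a simplified stand-in. The sketch below is not the Arrow `Field` struct: `SimpleField` is a hypothetical name and the `type` member is omitted to keep it self-contained. It assumes that two fields differing only in dictionary id should compare unequal, which requires comparing against `other.dictionary` rather than the object's own member.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for arrow::Field (type member omitted).
struct SimpleField {
  std::string name;
  bool nullable;
  // Dictionary id; 0 means the field is not dictionary encoded.
  int64_t dictionary;

  SimpleField(const std::string& name, bool nullable = true, int64_t dictionary = 0)
      : name(name), nullable(nullable), dictionary(dictionary) {}

  // Compare every member against the corresponding member of `other`.
  bool Equals(const SimpleField& other) const {
    return (this == &other) ||
           (name == other.name && nullable == other.nullable &&
            dictionary == other.dictionary);
  }
};
```

With this definition, `SimpleField("f0")` and `SimpleField("f0", true, 1)` compare unequal, while two default-constructed fields with the same name compare equal.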
37 changes: 37 additions & 0 deletions format/Layout.md
@@ -583,6 +583,43 @@ even if the null bitmap of the parent union array indicates the slot is
null. Additionally, a child array may have a non-null slot even if
the types array indicates that a slot contains a different type at the index.

## Dictionary encoding
Contributor

An example to point at would be helpful here, I think.

Member @wesm, Aug 16, 2016
As an example, you could have the following data:

type: List<String>

[
 ['a', 'b'],
 ['a', 'b'],
 ['a', 'b'],
 ['c', 'd', 'e'],
 ['c', 'd', 'e'],
 ['c', 'd', 'e'],
 ['c', 'd', 'e'],
 ['a', 'b']
 ]

In dictionary-encoded form, this could appear as:

data List<String> (dictionary-encoded, dictionary id i)
indices: [0, 0, 0, 1, 1, 1, 1, 0]

dictionary i

type: List<String>

[
 ['a', 'b'],
 ['c', 'd', 'e'],
]

Member Author

Will add.

When a field is dictionary encoded, its values are represented by an array of Int32 indices, each giving the position of the value in the dictionary.
The dictionary is received as a DictionaryBatch whose id is referenced by the dictionary attribute of the Field table defined in the metadata (Message.fbs).
The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch.
When a Schema references a dictionary id, a DictionaryBatch for this id must be sent before any RecordBatch.

As an example, you could have the following data:
```
type: List<String>

[
['a', 'b'],
['a', 'b'],
['a', 'b'],
['c', 'd', 'e'],
['c', 'd', 'e'],
['c', 'd', 'e'],
['c', 'd', 'e'],
['a', 'b']
]
```
In dictionary-encoded form, this could appear as:
```
data List<String> (dictionary-encoded, dictionary id i)
indices: [0, 0, 0, 1, 1, 1, 1, 0]

dictionary i

type: List<String>

[
['a', 'b'],
['c', 'd', 'e'],
]
```
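
The encoding in the example above can be sketched in C++. This is a minimal illustration under stated assumptions, not Arrow's implementation or memory layout: `ListOfString`, `DictionaryEncoded`, `DictionaryEncode` and `DictionaryDecode` are hypothetical names, values are held in plain `std::vector`s, and dictionary entries are assumed to be numbered in order of first appearance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one List<String> value.
using ListOfString = std::vector<std::string>;

struct DictionaryEncoded {
  std::vector<int32_t> indices;          // one Int32 index per logical slot
  std::vector<ListOfString> dictionary;  // distinct values, in first-seen order
};

// Replace each value with the index of its entry in the dictionary,
// adding a new entry the first time a value is seen.
DictionaryEncoded DictionaryEncode(const std::vector<ListOfString>& values) {
  DictionaryEncoded out;
  std::map<ListOfString, int32_t> seen;
  for (const auto& value : values) {
    auto it = seen.find(value);
    if (it == seen.end()) {
      const int32_t index = static_cast<int32_t>(out.dictionary.size());
      it = seen.emplace(value, index).first;
      out.dictionary.push_back(value);
    }
    out.indices.push_back(it->second);
  }
  return out;
}

// Recover the original values by looking each index up in the dictionary.
std::vector<ListOfString> DictionaryDecode(const DictionaryEncoded& encoded) {
  std::vector<ListOfString> values;
  values.reserve(encoded.indices.size());
  for (const int32_t index : encoded.indices) {
    assert(index >= 0 &&
           static_cast<std::size_t>(index) < encoded.dictionary.size());
    values.push_back(encoded.dictionary[static_cast<std::size_t>(index)]);
  }
  return values;
}
```

Encoding the eight lists from the example yields one index per input slot and a dictionary with two entries, and decoding returns the original sequence.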

## References

Apache Drill Documentation - [Value Vectors][6]
6 changes: 5 additions & 1 deletion format/Message.fbs
@@ -84,6 +84,10 @@ table Field {
name: string;
nullable: bool;
type: Type;
// present only if the field is dictionary encoded
Contributor

Do you want to remove/update the TODO ~line 169?

Member

One thing that isn't immediately obvious to me is the exact structure of the DictionaryBatch message, so it would be good to clarify this where there is that TODO. Is each DictionaryBatch constrained to only contain the dictionary for a single vector/array? If a DictionaryBatch can contain the dictionaries for multiple arrays (which seems complicated) then some more specification is needed.

Member Author

I will add a DictionaryBatch description. I vote for a single dictionary per DictionaryBatch. This is simpler, and the overhead of the DictionaryBatch header is minimal.

// will point to a dictionary provided by a DictionaryBatch message
dictionary: long;
// children apply only to Nested data types like Struct, List and Union
children: [Field];
}

@@ -165,8 +169,8 @@ table RecordBatch {
/// For sending dictionary encoding information. Any Field can be
/// dictionary-encoded, but in this case none of its children may be
/// dictionary-encoded.
/// There is one dictionary batch per dictionary
///
/// TODO(wesm): To be documented in more detail

table DictionaryBatch {
id: long;