Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] add list.value_counts() #2902

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Conversation

andrewgazelka
Copy link
Member

@andrewgazelka andrewgazelka commented Sep 24, 2024

still a few things to do particular with types being null or not... but at least should not crash now

TODO

  • add desc
  • make sure works for nested types "if no, or if we don't intend to support that usecase, we should assert the types allowed and add this to the doc"

daft/expressions/expressions.py Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
Copy link

codspeed-hq bot commented Sep 24, 2024

CodSpeed Performance Report

Merging #2902 will not alter performance

Comparing andrew/list-value_counts (fa23718) with main (cd59c73)

Summary

✅ 17 untouched benchmarks

Copy link

codecov bot commented Sep 24, 2024

Codecov Report

Attention: Patch coverage is 77.87307% with 129 lines in your changes missing coverage. Please review.

Project coverage is 78.36%. Comparing base (46c5a7e) to head (fa23718).
Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
src/common/tracing/src/lib.rs 15.38% 22 Missing ⚠️
src/daft-core/src/array/fixed_size_list_array.rs 0.00% 20 Missing ⚠️
src/daft-core/src/array/mod.rs 39.28% 17 Missing ⚠️
src/daft-functions/src/list/value_counts.rs 61.90% 16 Missing ⚠️
src/daft-table/src/lib.rs 44.44% 10 Missing ⚠️
src/daft-dsl/src/functions/map/get.rs 52.94% 8 Missing ⚠️
src/daft-core/src/array/ops/from_arrow.rs 76.92% 6 Missing ⚠️
src/daft-core/src/series/ops/list.rs 57.14% 6 Missing ⚠️
src/daft-micropartition/src/micropartition.rs 88.88% 5 Missing ⚠️
src/daft-schema/src/schema.rs 54.54% 5 Missing ⚠️
... and 6 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2902      +/-   ##
==========================================
+ Coverage   76.34%   78.36%   +2.01%     
==========================================
  Files         597      603       +6     
  Lines       71388    71374      -14     
==========================================
+ Hits        54504    55934    +1430     
+ Misses      16884    15440    -1444     
Flag Coverage Δ
?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
daft/expressions/expressions.py 93.67% <100.00%> (+0.29%) ⬆️
src/daft-core/src/array/list_array.rs 84.11% <100.00%> (+0.58%) ⬆️
src/daft-core/src/array/ops/arrow2/comparison.rs 98.70% <100.00%> (ø)
src/daft-core/src/array/ops/cast.rs 87.50% <100.00%> (+0.02%) ⬆️
src/daft-core/src/array/ops/groups.rs 98.18% <100.00%> (ø)
src/daft-core/src/array/ops/map.rs 87.50% <100.00%> (+14.77%) ⬆️
src/daft-core/src/array/ops/sparse_tensor.rs 100.00% <100.00%> (ø)
src/daft-core/src/array/serdes.rs 87.74% <100.00%> (+0.32%) ⬆️
src/daft-core/src/array/struct_array.rs 79.71% <ø> (ø)
src/daft-core/src/datatypes/logical.rs 74.07% <ø> (ø)
... and 28 more

... and 78 files with indirect coverage changes

@andrewgazelka andrewgazelka force-pushed the andrew/list-value_counts branch 4 times, most recently from b983578 to ba181af Compare September 25, 2024 19:10
@andrewgazelka andrewgazelka changed the title value counts [FEAT] add list.value_counts() Sep 25, 2024
@github-actions github-actions bot added the enhancement New feature or request label Sep 25, 2024
@andrewgazelka andrewgazelka marked this pull request as ready for review September 25, 2024 21:13
@@ -2922,6 +2922,17 @@ def join(self, delimiter: str | Expression) -> Expression:
delimiter_expr = Expression._to_expression(delimiter)
return Expression._from_pyexpr(native.list_join(self._expr, delimiter_expr._expr))

# todo: do we want type to be a Map expression? how should we do this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isnt a way to express the static type of an Expression in Python. It's just an Expression until it gets resolved against a dataframe's schema. At that point we can then use the Expression to generate a schema.

Let's rm this comment too before merge!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if I am right but can't we return something like

class ExpressionMapNamespace(ExpressionNamespace):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great question -- not really because then we'd lose access to the other Expression functionality such as Expression.is_null().

There generally isn't a great solution here given the dynamic nature of how we're constructing these Expressions tbh. We can only definitively know what output type an Expression would produce once it is resolved against a schema. For example:

col("a") + col("b")

Should this expression produce a int32 type? The answer is: it depends. Because if col("a") and col("b") were string types, and we resolve this expression against that schema, then the output type will be a string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we intended to remove this comment?

daft/expressions/expressions.py Show resolved Hide resolved
src/arrow2/src/offset.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/list_array.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing what I have right now, seems you wanted to iterate a bit more on it.

Could we also update the PR description so we get a proper commit message?

@@ -2922,6 +2922,17 @@ def join(self, delimiter: str | Expression) -> Expression:
delimiter_expr = Expression._to_expression(delimiter)
return Expression._from_pyexpr(native.list_join(self._expr, delimiter_expr._expr))

# todo: do we want type to be a Map expression? how should we do this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we intended to remove this comment?

src/arrow2/src/array/map/mod.rs Show resolved Hide resolved
let outer_is_map = matches!(target, DataType::Map { .. });

if outer_is_map {
// we can do simple conversion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think a more useful comment would be explaining why we can do this simple conversion, instead of stating that it's a simple conversion before doing a simple conversion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/arrow2/src/array/mod.rs Outdated Show resolved Hide resolved
let from = format!("{:#?}", self.data_type());
let to = format!("{:#?}", data_type);


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uber nit: double newline intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

believe I changed this

let diff = diff
.unified_diff();

panic!("{diff}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we write a more descriptive error message to go with the diff of the types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

believe I changed this

@@ -24,7 +24,7 @@ def test_daft_iceberg_table_open(local_iceberg_tables):


WORKING_SHOW_COLLECT = [
"test_all_types",
# "test_all_types",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leave a comment on why we're skipping this test, and create a corresponding github issue for the broken test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/table/map/test_map_get.py Outdated Show resolved Hide resolved
src/lib.rs Outdated
@@ -96,6 +96,7 @@ pub mod pylib {
#[pymodule]
fn daft(py: Python, m: &Bound<PyModule>) -> PyResult<()> {
refresh_logger(py)?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uber nit: stray whitespace from some previous change.

{"list_col": [["a", "b", "a", "c"], ["b", "b", "c"], ["a", "a", "a"], [], ["d", None, "d"]]}
)

# # Apply list_value_counts operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uber nit: remove double comment?

Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing offline---what happens with lists of nested values?

Also, could we add this new expression to docs/source/api_docs/expressions.rst?

},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isn't it typical to leave the comma?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not according to our current rustfmt rules, no

if let Ok(arrow_dtype) = field.dtype.to_physical().to_arrow() {
if !arrow_dtype.eq(data.data_type()) {
if let Ok(expected_arrow_physical_type) = physical_field.dtype.to_arrow()
// For instance Int32 -> Int 32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this comment referring to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think it was when I was debugging; fixed

e0ec7e0

if let Ok(expected_arrow_physical_type) = physical_field.dtype.to_arrow()
// For instance Int32 -> Int 32
{
let arrow_data_type = arrow_array.data_type(); // logical type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: comment unnecessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/daft-core/src/series/ops/concat.rs Show resolved Hide resolved
src/daft-micropartition/src/ops/eval_expressions.rs Outdated Show resolved Hide resolved
src/daft-schema/src/dtype.rs Outdated Show resolved Hide resolved
@@ -638,7 +649,29 @@ impl From<&ArrowType> for DataType {
ArrowType::FixedSizeList(field, size) => {
Self::FixedSizeList(Box::new(field.as_ref().data_type().into()), *size)
}
ArrowType::Map(field, ..) => Self::Map(Box::new(field.as_ref().data_type().into())),
ArrowType::Map(field, ..) => {
// todo: TryFrom in future? want in second pass maybe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we doing this TryForm? Fine if we don't

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaychia and I want to do a TryFrom since it is fallible

let flat_child = &*flat_child;

let is_equal = build_is_equal(
flat_child, flat_child, true, // todo: should nans be considered equal?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the verdict on this? NaNs should not be equal, but it would be interesting to allow the user to count the number of NaNs easily.

Copy link
Member Author

@andrewgazelka andrewgazelka Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are NaNs considered invalid according to daft? If they are they should be ignored anyway but I agree we should make false probably. 🤔

if they aren't invalid we could get weird behavior. simulating in my head something like

NaN: 1
NaN: 1

...

etc

for value_counts

src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for all the changes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants