-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] add list.value_counts()
#2902
base: main
Are you sure you want to change the base?
Conversation
CodSpeed Performance ReportMerging #2902 will not alter performanceComparing Summary
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2902 +/- ##
==========================================
+ Coverage 76.34% 78.36% +2.01%
==========================================
Files 597 603 +6
Lines 71388 71374 -14
==========================================
+ Hits 54504 55934 +1430
+ Misses 16884 15440 -1444
Flags with carried forward coverage won't be shown. Click here to find out more.
|
b983578
to
ba181af
Compare
daft/expressions/expressions.py
Outdated
@@ -2922,6 +2922,17 @@ def join(self, delimiter: str | Expression) -> Expression: | |||
delimiter_expr = Expression._to_expression(delimiter) | |||
return Expression._from_pyexpr(native.list_join(self._expr, delimiter_expr._expr)) | |||
|
|||
# todo: do we want type to be a Map expression? how should we do this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isnt a way to express the static type of an Expression in Python. It's just an Expression
until it gets resolved against a dataframe's schema. At that point we can then use the Expression to generate a schema.
Let's rm this comment too before merge!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct me if I am right but can't we return something like
Daft/daft/expressions/expressions.py
Line 3072 in 6e0944a
class ExpressionMapNamespace(ExpressionNamespace): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah great question -- not really because then we'd lose access to the other Expression functionality such as Expression.is_null()
.
There generally isn't a great solution here given the dynamic nature of how we're constructing these Expressions tbh. We can only definitively know what output type an Expression would produce once it is resolved against a schema. For example:
col("a") + col("b")
Should this expression produce a int32 type? The answer is: it depends. Because if col("a") and col("b") were string types, and we resolve this expression against that schema, then the output type will be a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we intended to remove this comment?
a4e1c7b
to
e063b04
Compare
9aee66d
to
4a6850a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing what I have right now, seems you wanted to iterate a bit more on it.
Could we also update the PR description so we get a proper commit message?
daft/expressions/expressions.py
Outdated
@@ -2922,6 +2922,17 @@ def join(self, delimiter: str | Expression) -> Expression: | |||
delimiter_expr = Expression._to_expression(delimiter) | |||
return Expression._from_pyexpr(native.list_join(self._expr, delimiter_expr._expr)) | |||
|
|||
# todo: do we want type to be a Map expression? how should we do this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we intended to remove this comment?
src/arrow2/src/array/map/mod.rs
Outdated
let outer_is_map = matches!(target, DataType::Map { .. }); | ||
|
||
if outer_is_map { | ||
// we can do simple conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think a more useful comment would be explaining why we can do this simple conversion, instead of stating that it's a simple conversion before doing a simple conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/arrow2/src/array/mod.rs
Outdated
let from = format!("{:#?}", self.data_type()); | ||
let to = format!("{:#?}", data_type); | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uber nit: double newline intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
believe I changed this
src/arrow2/src/array/mod.rs
Outdated
let diff = diff | ||
.unified_diff(); | ||
|
||
panic!("{diff}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we write a more descriptive error message to go with the diff of the types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
believe I changed this
@@ -24,7 +24,7 @@ def test_daft_iceberg_table_open(local_iceberg_tables): | |||
|
|||
|
|||
WORKING_SHOW_COLLECT = [ | |||
"test_all_types", | |||
# "test_all_types", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should leave a comment on why we're skipping this test, and create a corresponding github issue for the broken test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/lib.rs
Outdated
@@ -96,6 +96,7 @@ pub mod pylib { | |||
#[pymodule] | |||
fn daft(py: Python, m: &Bound<PyModule>) -> PyResult<()> { | |||
refresh_logger(py)?; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uber nit: stray whitespace from some previous change.
{"list_col": [["a", "b", "a", "c"], ["b", "b", "c"], ["a", "a", "a"], [], ["d", None, "d"]]} | ||
) | ||
|
||
# # Apply list_value_counts operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uber nit: remove double comment?
78fa527
to
dcc0a36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussing offline---what happens with lists of nested values?
Also, could we add this new expression to docs/source/api_docs/expressions.rst
?
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isn't it typical to leave the comma?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not according to our current rustfmt
rules, no
src/daft-core/src/array/mod.rs
Outdated
if let Ok(arrow_dtype) = field.dtype.to_physical().to_arrow() { | ||
if !arrow_dtype.eq(data.data_type()) { | ||
if let Ok(expected_arrow_physical_type) = physical_field.dtype.to_arrow() | ||
// For instance Int32 -> Int 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this comment referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think it was when I was debugging; fixed
src/daft-core/src/array/mod.rs
Outdated
if let Ok(expected_arrow_physical_type) = physical_field.dtype.to_arrow() | ||
// For instance Int32 -> Int 32 | ||
{ | ||
let arrow_data_type = arrow_array.data_type(); // logical type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: comment unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -638,7 +649,29 @@ impl From<&ArrowType> for DataType { | |||
ArrowType::FixedSizeList(field, size) => { | |||
Self::FixedSizeList(Box::new(field.as_ref().data_type().into()), *size) | |||
} | |||
ArrowType::Map(field, ..) => Self::Map(Box::new(field.as_ref().data_type().into())), | |||
ArrowType::Map(field, ..) => { | |||
// todo: TryFrom in future? want in second pass maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we doing this TryForm
? Fine if we don't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaychia and I want to do a TryFrom since it is fallible
let flat_child = &*flat_child; | ||
|
||
let is_equal = build_is_equal( | ||
flat_child, flat_child, true, // todo: should nans be considered equal? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the verdict on this? NaNs should not be equal, but it would be interesting to allow the user to count the number of NaNs easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are NaNs considered invalid according to daft? If they are they should be ignored anyway but I agree we should make false probably. 🤔
if they aren't invalid we could get weird behavior. simulating in my head something like
NaN: 1
NaN: 1
...
etc
for value_counts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for all the changes!
still a few things to do particular with types being null or not... but at least should not crash now
TODO