Add support for some operations with decimals #988
Conversation
2d16ef0 to 83afad1
Instead of relying on the calculated dtype from the backend, we take the maximum scale from the numbers passed to `from_list/2`, or we cast to the scale given at creation.
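A minimal sketch of that scale-inference idea, assuming decimals are represented as `{coef, scale}` tuples (the module and function names here are illustrative, not the PR's actual helpers):

```elixir
defmodule DecimalScale do
  # Hypothetical helper mirroring the idea above: the series scale is the
  # maximum scale among the values passed in. Not Explorer's actual code.
  def infer_scale(decimals) do
    decimals
    |> Enum.map(fn {_coef, scale} -> scale end)
    |> Enum.max()
  end

  # Casting a value up to a target scale scales the unscaled integer
  # (coef) by a power of ten.
  def rescale({coef, scale}, target) when target >= scale do
    {coef * Integer.pow(10, target - scale), target}
  end
end

# 1.5 (scale 1) together with 2.25 (scale 2) yields a series scale of 2.
IO.inspect(DecimalScale.infer_scale([{15, 1}, {225, 2}]))
```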
Great job! This looks like it was a ton of work. And I think the tradeoffs (like the not implemented errors) are very reasonable.
test/explorer/data_frame_test.exs
Outdated
# Casting works, but the df will be different if we don't pass the precision,
# because it will take the default precision for decimals, which is "38".
# The second problem of not passing precision is that scale will be handled differently.
# This example would divide the integers by 10^2 without the precision.
This comment seems important. Would it be feasible to write tests to demonstrate the behavior it references?
I will try to add some test cases for that. The only problem is that we have a constraint that raises when the "out DF" is different from what we calculate. I will check if it's possible to ignore that.
I think I'm covering most of what I said in the comment, especially after removing the support for nil precision and scale. There is still an issue when we do an arithmetic operation and the backend returns a dtype without a precision, but we are casting at the end.
@@ -2577,6 +2601,7 @@ defmodule Explorer.Series do

  * floats: #{Shared.inspect_dtypes(@float_dtypes, backsticks: true)}
  * integers: #{Shared.inspect_dtypes(@integer_types, backsticks: true)}
Btw, there is a typo here: it should be `backticks`, but we can fix it later. :D
@@ -3395,8 +3427,18 @@ defmodule Explorer.Series do
  defp cast_to_add({:datetime, p, tz}, {:duration, p}), do: {:datetime, p, tz}
  defp cast_to_add({:duration, p}, {:datetime, p, tz}), do: {:datetime, p, tz}
  defp cast_to_add({:duration, p}, {:duration, p}), do: {:duration, p}

  defp cast_to_add({:decimal, p1, s1}, {:decimal, p2, s2}),
    do: {:decimal, maybe_max(p1, p2), maybe_max(s1, s2)}
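Judging from the name, `maybe_max/2` presumably tolerates a `nil` on either side (the backend sometimes returns a dtype without a precision, as noted above). A self-contained sketch of what such a helper and the cast rule could look like; this is an assumption from the diff, not the PR's actual implementation:

```elixir
defmodule DecimalCast do
  # Sketch of a nil-tolerant max, guessed from the name `maybe_max/2`:
  # when either side is unknown (nil), fall back to the known one.
  defp maybe_max(nil, b), do: b
  defp maybe_max(a, nil), do: a
  defp maybe_max(a, b), do: max(a, b)

  # The addition cast rule from the diff, using the helper above.
  def cast_to_add({:decimal, p1, s1}, {:decimal, p2, s2}),
    do: {:decimal, maybe_max(p1, p2), maybe_max(s1, s2)}
end
```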
Is there some rule we need to follow here? For example, is there a maximum value for precision and scale?
According to ChatGPT:
Yes, in Apache Arrow's decimal128 type, the precision and scale are constrained as follows:
Precision: This represents the total number of digits that can be stored, both before and after the decimal point. For decimal128, the maximum precision is 38 digits. This means that it can store up to 38 significant digits.
Scale: This defines how many of the digits are allocated to the fractional part (i.e., after the decimal point). The scale can be any value between 0 and the precision value. For example, if you have a precision of 38 and set a scale of 10, then 28 digits can be used before the decimal point and 10 digits after.
Thus, the maximum precision is 38, and the scale can be anywhere from 0 to 38, depending on the application needs.
So I think we are good, but I'd encapsulate this logic in a function. :)
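Encapsulating those constraints could look something like this sketch (the module and function names are illustrative, not part of Explorer):

```elixir
defmodule DecimalDtype do
  # Arrow decimal128 bounds: precision in 1..38, scale in 0..precision.
  @max_precision 38

  def valid?({:decimal, precision, scale})
      when is_integer(precision) and is_integer(scale) do
    precision in 1..@max_precision and scale in 0..precision
  end

  def valid?(_), do: false
end
```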
Co-authored-by: José Valim <jose.valim@dashbit.co>
Just some minor nits and ship it!!!
This adds support for decimals in some of the functions that work with integers and floats, as well as support for creating decimal series with `Explorer.Series.from_list/2`. Decimal support is still experimental in Polars, so we won't be able to support all the operations for now. Some of the operations return `f64` series or float results, and some of the functions raise exceptions from Polars because they are not implemented in the backend.