Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If a precision timestamp literal is encountered there is no way to know the precision #611

Open
westonpace opened this issue Mar 19, 2024 · 8 comments

Comments

@westonpace
Copy link
Member

Currently precision timestamp literals are defined as:

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
      uint64 precision_timestamp = 34;
      uint64 precision_timestamp_tz = 35;

However, there is no way to know if the precision is 6 or less because the literal doesn't have to be associated with a type. For example, consider:

SELECT * FROM my_table WHERE my_ts_col < timestamp (3) '2004-10-19 10:23:54.123'

This will turn into a plan with a literal value in the "less than" expression:

"scalarFunction":{
  "functionReference":3,
  "outputType":{..},
  "arguments":[
    { ... },
   ...
   "literal":{ "precision_timestamp": 1098181434123000 }

However, there is no way of knowing what the precision of that timestamp is. I could infer it by tracing back and realizing that the other value in the expression has precision X but:

  • That's a lot of work, and I don't have to do this for other literals
  • They don't technically have to be the same, depending on what range of functions the producer / consumer supports
@EpsilonPrime
Copy link
Member

This behavior is identical to what happens with fixed_char, var_char, and decimal literals. The way this is addressed in Isthmus is to immediately cast the literal to the appropriate precision. For instance, a fixed character would be represented with the string "fruit" and then later is cast to the actual desired with such as fixed_char[20].

@westonpace
Copy link
Member Author

First, I'm not sure that's true with varchar or decimal, the precision is part of the literal:

    message VarChar {
      string value = 1;
      uint32 length = 2;
    }
    message Decimal {
      // little-endian twos-complement integer representation of complete value
      // (ignoring precision) Always 16 bytes in length
      bytes value = 1;
      // The maximum number of digits allowed in the value.
      // the maximum precision is 38.
      int32 precision = 2;
      // declared scale of decimal literal
      int32 scale = 3;
    }

With fixed_char / fixed_binary I can at least form some kind of "minimum precision literal". E.g. given the literal "foo" I can turn it into a fixed_char<3>.

I can't do that with timestamp. Given the literal 1710989532 how am I supposed to know if this is:

Timestamp(0) - Thursday, March 21, 2024 2:52:12 AM
Timestamp(3) - Tuesday, January 20, 1970 7:16:29.532 PM
Timestamp(6) - Thursday, January 1, 1970 12:28:30.989532 AM
Timestamp(9) - Thursday, January 1, 1970 12:00:01.710989532 AM

All of those are valid timestamps.

@EpsilonPrime
Copy link
Member

The core issue is that there isn't a fixed issue of the literal (it's either microseconds or nanoseconds but not both). I'd argue that the definition should just be nanoseconds since epoch.

Having the precision in the type would certainly make things easier for consumers (by greatly reducing the number of casts where literals are present).

The precision of varchar and decimal might be related to an older specification version, I've got examples of explicit casting to add precision. It also might be that the casting is there because the SQL parser is recognizing string an integer values and needed to cast them after the fact.

@westonpace
Copy link
Member Author

Ah, good point, there are really only two possibilities, not four.

Yes, if it was just nanoseconds then that would be easier. That was part of the original proposal. Unfortunately, the range of 64-bit nanosecond timestamps is only about ~550 years. At the time we declared 10k years was the range of our timestamp data type. I believe we ended up dropping any declaration of range as part of the PR.

Still, nanoseconds-only would be rather limiting (we support 10k years on date, interval_year and interval_day).

However, if we want to do nanoseconds-only, then we could make it a 128-bit integer.

It also might be that the casting is there because the SQL parser is recognizing string an integer values and needed to cast them after the fact.

That seems very possible. In SQL there are only a few literal types like strings (e.g. 'foo') and numbers (e.g. 11). Even timestamps are just "strings" in the beginning. Late inference is very much a necessity. The fact that Substrait doesn't require this is definitely convenient for consumers.

Although, some of that late inference is still often needed anyways. SQL parsers will often turn a numeric literal either into the smallest available integer type or just always use u64. Either way a consumer will probably have to apply some kind of cast to apply the literal in an operation so that it can fit the literal to the data type.

@westonpace
Copy link
Member Author

However, if we want to do nanoseconds-only, then we could make it a 128-bit integer.

That being said, I would still rather just declare the precision as part of the message. This is more consistent with our current definition of decimal / varchar and then I don't have to worry about dealing with bigint.

@EpsilonPrime
Copy link
Member

I'm onboard with fully specifying the literal. We should also include the missing timezone part in the tz variant as well.

@westonpace
Copy link
Member Author

We should also include the missing timezone part in the tz variant as well.

I'm not sure what you mean by this. The timestamp_tz data type does not store a timezone-per-cell. What is the missing timezone part?

westonpace added a commit to apache/arrow that referenced this issue Apr 11, 2024
### Rationale for this change

See #40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: #40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
raulcd pushed a commit to apache/arrow that referenced this issue Apr 11, 2024
### Rationale for this change

See #40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: #40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue Apr 15, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@Blizzara
Copy link
Contributor

Blizzara commented Jul 12, 2024

Done by #659

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants