[SPARK-27328][SQL] Add 'deprecated' in ExpressionDescription for extended usage and SQL doc #24259


Closed
wants to merge 5 commits into from

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Mar 31, 2019

What changes were proposed in this pull request?

This PR proposes two things:

  1. Add deprecated field to ExpressionDescription so that it can be shown in our SQL function documentation (https://spark.apache.org/docs/latest/api/sql/), and it can be shown via DESCRIBE FUNCTION EXTENDED.

  2. While I am here, add some more restrictions for note() and since(). It looks like some documentation is broken due to a malformed note:

    [Screenshot: Screen Shot 2019-03-31 at 3 00 53 PM]

    It should start with 4 spaces and end with a newline. I added some asserts and fixed the existing instances while I was here. This is technically a breaking change, but I think it's too trivial to note anywhere (and we're in Spark 3.0.0).
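The 4-space / trailing-newline rule can be sketched as a small check. This is a hypothetical Python sketch for illustration only; the actual asserts in this PR live in Spark's Scala code, and the function name here is made up:

```python
def check_extended_field(text: str) -> bool:
    """Return True if `text` follows the 4-space / trailing-newline format.

    Hypothetical mirror of the format restriction described above: a
    non-empty field must start with a newline, end with a newline, and
    indent every non-blank line by at least 4 spaces.
    """
    if text == "":
        return True  # empty fields are allowed
    if not text.startswith("\n") or not text.endswith("\n"):
        return False
    # every non-blank interior line must be indented by at least 4 spaces
    return all(
        line.startswith("    ") or line.strip() == ""
        for line in text.split("\n")[1:-1]
    )

# a well-formed field passes; a bare single-line string does not
assert check_extended_field("\n    Deprecated since 3.0.0.\n")
assert not check_extended_field("Deprecated since 3.0.0.")
```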

This PR adds the deprecated property to from_utc_timestamp and to_utc_timestamp (they are deprecated as of #24195) as examples of using this field.

Now it shows the deprecation information as below:

  • SQL documentation is shown as below:

    [Screenshot: Screen Shot 2019-03-31 at 3 07 31 PM]

  • DESCRIBE FUNCTION EXTENDED from_utc_timestamp;:

    Function: from_utc_timestamp
    Class: org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp
    Usage: from_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.
    Extended Usage:
        Examples:
          > SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul');
           2016-08-31 09:00:00
    
        Since: 1.5.0
    
        Deprecated:
          Deprecated since 3.0.0. See SPARK-25496.
    
    

How was this patch tested?

Manually tested via:

  • For documentation verification:

    $ cd sql
    $ sh create-docs.sh
    
  • For checking description:

    $ ./bin/spark-sql
    
    spark-sql> DESCRIBE FUNCTION EXTENDED from_utc_timestamp;
    spark-sql> DESCRIBE FUNCTION EXTENDED to_utc_timestamp;
    

@HyukjinKwon HyukjinKwon changed the title [SPARK-27328][SQL] Create 'deprecate' property in ExpressionDescription for SQL function documentation [SPARK-27328][SQL] Add 'deprecated' in ExpressionDescription for SQL function documentation Mar 31, 2019
@HyukjinKwon HyukjinKwon changed the title [SPARK-27328][SQL] Add 'deprecated' in ExpressionDescription for SQL function documentation [SPARK-27328][SQL] Add 'deprecated' in ExpressionDescription for extended usage and SQL doc Mar 31, 2019
@rxin
Contributor

rxin commented Mar 31, 2019

The issue here is that this isn't quite sufficient, because users don't always read docs (as a matter of fact, for people that have something working, they don't need to look up the docs to get it work), and SQL users also don't see the warning messages in logs.

@SparkQA

SparkQA commented Mar 31, 2019

Test build #104133 has finished for PR 24259 at commit ba4f976.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Thanks, @rxin. I agree that in practice users tend not to read the docs even though they are supposed to. I was thinking it could be a matter of whether we officially document the deprecation in some way (for instance, I don't think many users read release docs or the actual JIRAs often ..).

I quickly made a rough attempt to add a warning, with all default settings, in spark-sql:

spark-sql> SELECT from_utc_timestamp(timestamp(1), "UTC");
19/03/31 18:23:40 WARN FromUTCTimestamp: Don't use from_utc_timestamp.
19/03/31 18:23:40 WARN FromUTCTimestamp: Don't use from_utc_timestamp.
19/03/31 18:23:43 INFO CodeGenerator: Code generated in 215.181876 ms
19/03/31 18:23:43 INFO SparkContext: Starting job: processCmd at CliDriver.java:376
19/03/31 18:23:43 INFO DAGScheduler: Got job 0 (processCmd at CliDriver.java:376) with 1 output partitions
19/03/31 18:23:43 INFO DAGScheduler: Final stage: ResultStage 0 (processCmd at CliDriver.java:376)
19/03/31 18:23:43 INFO DAGScheduler: Parents of final stage: List()

I think Scala, Python, and SparkR users also see similar output. Yes, SQL itself can be different in nature, but I think we need a better mechanism to encourage users to read these warnings in this case.

As for throwing an error (proposed and being discussed at #24195), I think it can become problematic to manage all those configurations in the future. For instance, it currently sounds like we will go ahead with this approach from now on. As we know, Spark has a lot of configurations and flags, and I am worried about the maintenance overhead (I remember reading an extreme example of flags).

At least this PR encourages Spark developers to note deprecated SQL functions, which can then be accessed at runtime (as well as shown in the SQL documentation).

@HyukjinKwon
Member Author

Or how about having one configuration to enable and disable all deprecated SQL expressions? I think I can extend this PR further to add such a mechanism if throwing an error is inevitable.

@srowen
Member

srowen commented Mar 31, 2019

I don't mind this support. I also think a WARN level message is pretty good.
For Spark 3 I'm not against removing this functionality entirely, though I still have a slight preference for just deprecating it as much as possible.

* for example, "2.2.0". It should not start with non-number characters.
*
* `deprecated()` contains deprecation information for the expression optionally, for example,
* "Deprecated since 2.2.0. Use something else instead". This property should have 4
Contributor

not related to this PR, but why do we have such a weird requirement for the text?

Member Author

To make it pretty in the SQL documentation. Actually, we could loosen the requirement by adding some logic to handle it in gen-sql-markdown.py; however, I thought it better to keep it strict so we have a uniform code style.

@cloud-fan
Contributor

I'm fine with this approach, together with the proposal to have one configuration to enable and disable all deprecated SQL expressions. The config can be added in another PR, though.

@HyukjinKwon
Member Author

WDYT, @rxin and @srowen?

My main concern is that we may start to have a configuration for each deprecation, ending up with many configurations in the future. I would prefer not to throw an error on deprecation, but I think I'm fine either way since that's not my main concern.

"""

if deprecated != "":
    deprecated = "\n".join(map(lambda n: n[4:], deprecated.split("\n")))
Member Author

@cloud-fan, for instance this logic matters.

If we put some strings in ..

@ExpressionDescription(
  ...
  deprecated = """
    blah blah blah
    I am super long line
    
    1. one
    2. I am markdown

    - bullet marks
    - aa

    You know what? actually you can write a markdown table as well.

    | Tables        | Are           | Cool  |
    | ------------- |:-------------:| -----:|
    | col 3 is      | right-aligned | $1600 |
    | col 2 is      | centered      |   $12 |
  """)

gen-sql-markdown.py here splits the doc into lines and strips the first 4 characters from each. The trailing newline produces proper indentation between fields.
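For illustration, the stripping done by that snippet can be reproduced standalone. This is a minimal sketch; the variable contents here are made up to mirror the from_utc_timestamp example:

```python
# A field following the mandated format: leading newline, 4-space indent
# on each line, trailing newline.
deprecated = "\n    Deprecated since 3.0.0.\n    See SPARK-25496.\n"

# Split into lines and drop the first 4 characters (the mandated
# indentation) from each; Markdown inside keeps its relative indentation.
stripped = "\n".join(line[4:] for line in deprecated.split("\n"))

print(stripped)
# →
# Deprecated since 3.0.0.
# See SPARK-25496.
```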

Contributor

will people use multi-line string literals in Scala? e.g.

deprecated = """
  |xxx
  |xxx
""".stripMargin

Member Author

That cannot be used here due to a compilation error. It seems annotation parameters only allow constants. There was a discussion about this before, when I first added the examples.

Contributor

what if the expression is inside an object? e.g.

object A {
  @ExpressionDescription(
    note = """
      blabla
    """ )
  class B
}

Then the string has 6 spaces at the beginning.

It looks to me that we can relax the format a little bit: just trim all the spaces at the beginning. Then we can even use a single line string.

Member Author

@HyukjinKwon HyukjinKwon Apr 9, 2019

Yes, in that case it might be a problem, although there seems to be no such case so far after this fix (I skimmed the built docs).

The thing we should take into account is that we shouldn't trim all the whitespace, because spacing matters in Markdown rendering. For instance, if we have a doc like ..

object A {
  @ExpressionDescription(
    note = """
      blabla
      - first
          blabla
      - second
          blabla
    """ )
  class B
}

We should only trim the leading 6 spaces from each line. So it brings a bit of complication. It's doable, though.

If you don't mind, I will handle it later. I was thinking it's better to have a single format for now. In general we should encourage more verbose and detailed docs anyway (rather than single-line docs).
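The relaxation being discussed is essentially what Python's textwrap.dedent does: strip only the longest common leading whitespace, preserving the relative indentation that Markdown rendering needs. A sketch of the idea, not what this PR implements:

```python
import textwrap

# A note nested inside an object, so every line carries 6 leading spaces,
# while the nested Markdown bullets add indentation on top of that.
note = """
      blabla
      - first
          blabla
      - second
          blabla
"""

# dedent removes only the common 6-space margin; the extra 4 spaces on
# the nested lines survive, so the Markdown still renders correctly.
dedented = textwrap.dedent(note)
print(dedented)
```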

Contributor

ah that makes sense, it's more tricky than I thought...

@rxin
Contributor

rxin commented Apr 2, 2019

So the issue with a single flag for all the deprecated functions is that the more functions we deprecate, the more likely it is that one will just leave the flag on permanently, and as a result it becomes completely useless. Why do you think it's a lot of overhead to add new flags? Isn't it just a few lines of code? We don't even need to define an extra function in the SQLConf object.

@HyukjinKwon
Member Author

HyukjinKwon commented Apr 2, 2019

It is only a few lines of code, yes. I was worried about starting to accumulate many flags under SQLConf (where we already have a lot). From what I have seen, that list always grows and never shrinks.

Adding a configuration for each deprecation, together with throwing an exception, looked like it would make it harder to deprecate things and move away from legacy behavior (for instance, the SparkR and PySpark tests had to be fixed a few times in #24195).

@HyukjinKwon
Member Author

Or, how about adding a configuration that lists the deprecated functions? That way, I suspect everything would be fine.

@cloud-fan
Contributor

Another point of view: how often do we deprecate SQL functions? from/to_utc_timestamp is the first one, and I don't know whether we will deprecate more. It seems better to generalize once we hit something more than once.

@HyukjinKwon
Member Author

My impression was that it's been difficult to deprecate SQL functions, so it hasn't been done often (I suspect there are some more cases that we should deprecate).

But .. okay ... we can revisit this when we see this issue again.

BTW, I believe this PR itself can be merged independently anyway, since it just adds some more notes about SQL function deprecation.

@cloud-fan
Contributor

cloud-fan commented Apr 2, 2019

BTW, I believe this PR itself can be merged independently anyway, since it just adds some more notes about SQL function deprecation.

+1, it makes the generated doc better

@rxin
Contributor

rxin commented Apr 2, 2019

Sorry guys, your arguments don't really hold. If we deprecate only a small number of expressions, why does it matter that for each of them we'd need to add a few lines? If we don't want to pollute SQLConf because it is getting long, we can create a new object called SQLConfDeprecated or something like that.

A single global deprecation flag doesn't work. Imagine a library you use that uses one of those deprecated configs. Now you will never get any warning anywhere ...

@cloud-fan
Contributor

cloud-fan commented Apr 2, 2019

@rxin yea, we will go with the "fail by default, with a config to fall back" approach. FYI, #24195 has been merged.

This PR is just to add a new entry in the generated SQL function doc, to indicate that a function is deprecated since version X.

[Screenshot: deprecation entry in the generated SQL function doc]

@HyukjinKwon
Member Author

Yes, this one just adds some more info about deprecation. As for the conf with failing, I'm okay with going ahead for now and talking about it later if it comes up often.

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104180 has finished for PR 24259 at commit ba4adb2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104201 has finished for PR 24259 at commit 1db146c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104202 has finished for PR 24259 at commit f4cd97f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 2, 2019

Test build #104203 has finished for PR 24259 at commit 4736081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Apr 3, 2019

Test build #104228 has finished for PR 24259 at commit 4736081.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Apr 3, 2019

Test build #104238 has finished for PR 24259 at commit 4736081.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hmm? I haven't seen such test failures before.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Apr 3, 2019

Test build #104245 has finished for PR 24259 at commit 4736081.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nandorKollar
Contributor

@HyukjinKwon I'm not familiar with Python, so I'm not sure this is helpful. :) Isn't this failure somehow related to this? Was the pytz version updated on this node? I could see some version updates here:

The following NEW packages will be INSTALLED:

    blas:            1.0-mkl                
    ca-certificates: 2019.1.23-0            
    certifi:         2018.8.24-py35_1       
...
    pytz:            2018.5-py35_0          

but these are related to Python 3, and the test failed with Python 2.

@srowen
Member

srowen commented Apr 4, 2019

See #24266

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104388 has finished for PR 24259 at commit 4736081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I think this is ready for a look.

@cloud-fan
Contributor

LGTM except one comment: https://github.com/apache/spark/pull/24259/files#r273316356

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f16dfb9 Apr 9, 2019
@HyukjinKwon
Member Author

Thanks all!
