[SPARK-36642][SQL] Add df.withMetadata pyspark API
This PR adds the PySpark API `df.withMetadata(columnName, metadata)`. The Scala API was added in apache#33853.

### What changes were proposed in this pull request?

To make it easy to use and modify semantic annotations, we want a shorter API for updating the metadata of a DataFrame column. Currently, updating metadata without changing the column name requires `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))`, which is too verbose. This PR adds the syntactic-sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality.

### Why are the changes needed?

Some background on how often metadata is updated: we are working on inferring semantic data types, using them in AutoML, and storing the semantic annotations in column metadata. In many cases we will ask the user to update the metadata, either to correct a wrong inference or to add an annotation where the inference is weak.

### Does this PR introduce _any_ user-facing change?

Yes. A syntactic-sugar API `df.withMetadata("col1", metadata=metadata)` that achieves the same functionality as `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))`.

### How was this patch tested?

Doctest.

Closes apache#34021 from liangz1/withMetadataPython.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
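The equivalence the PR describes can be sketched in plain Python, using a dictionary to stand in for a DataFrame's per-column metadata. This is only an illustration of the shorthand relationship; the `schema` model and function names below are hypothetical, not PySpark APIs:

```python
# Hypothetical model: a schema maps column name -> metadata dict.
# This illustrates the equivalence described in the PR; it is NOT PySpark.

def with_column_metadata_verbose(schema, name, metadata):
    """The verbose pattern: rebuild the column entry, keeping its name
    but replacing its metadata (analogous to withColumn + alias with metadata)."""
    if name not in schema:
        raise KeyError(name)
    updated = dict(schema)  # copy, so the original schema is not mutated
    updated[name] = metadata
    return updated

def with_metadata(schema, name, metadata):
    """The sugar the PR proposes: same result as the verbose form, shorter call."""
    return with_column_metadata_verbose(schema, name, metadata)

schema = {"col1": {}, "col2": {"unit": "kg"}}
annotated = with_metadata(schema, "col1", {"semantic_type": "categorical"})
```

In real PySpark, `withMetadata` similarly leaves all other columns and the original DataFrame untouched, returning a new DataFrame whose named column carries the given metadata dict.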