[SPARK-52433][PYTHON] Unify the string coercion in createDataFrame #51140

Closed
wants to merge 3 commits into from

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Jun 10, 2025

What changes were proposed in this pull request?

Unify the string coercion in createDataFrame

Why are the changes needed?

Currently, PySpark Classic and PySpark Connect behave differently when a dict value is coerced to `StringType`:

from pyspark.sql.types import StructType, StructField, StringType

# One row whose "broker" value is a dict, although the schema declares it as a string.
items = [{'id': '5558382', 'broker': {'teamId': 3398, 'contactEmail': 'abc.xyz@123.ca'}}]

schema = StructType([
    StructField("id", StringType()),
    StructField("broker", StringType()),
])

# Assumes an active SparkSession bound to `spark`.
df = spark.createDataFrame(items, schema=schema)

df.show(truncate=False)

PySpark Classic:

+-------+------------------------------------------+
|id     |broker                                    |
+-------+------------------------------------------+
|5558382|{contactEmail=abc.xyz@123.ca, teamId=3398}|
+-------+------------------------------------------+

PySpark Connect:

+-------+--------------------------------------------------+
|id     |broker                                            |
+-------+--------------------------------------------------+
|5558382|{'teamId': 3398, 'contactEmail': 'abc.xyz@123.ca'}|
+-------+--------------------------------------------------+

The latter seems more reasonable: the dicts should be converted into strings on the Python side.
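The two renderings above can be reproduced in plain Python without a Spark session. The sketch below is illustrative only: `python_side_str` and `java_map_style_str` are hypothetical helpers, not functions from PySpark or from this PR, and the second one merely imitates the `key=value` rendering shown for Classic (keys are sorted only to make the output deterministic).

```python
# Illustrative only: plain Python, no Spark session required.
broker = {"teamId": 3398, "contactEmail": "abc.xyz@123.ca"}


def python_side_str(value):
    """Hypothetical helper: coerce a value the way Python's str() does."""
    return str(value)


def java_map_style_str(value):
    """Hypothetical helper imitating the key=value rendering shown for Classic.

    Keys are sorted here only so the output is deterministic.
    """
    return "{" + ", ".join(f"{key}={value[key]}" for key in sorted(value)) + "}"


print(python_side_str(broker))     # matches the Connect output above
print(java_map_style_str(broker))  # matches the Classic output above
```

Converting on the Python side means the result is whatever `str()` produces for the value, which is the representation a Python user would normally expect.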

Does this PR introduce any user-facing change?

Yes, in PySpark Classic.
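Code that depends on the exact string form of a dict column may prefer to serialize nested values explicitly before calling `createDataFrame`, so the result does not depend on which coercion is applied. A minimal sketch, assuming a hypothetical helper `stringify_nested` and JSON as the chosen encoding (neither is part of this PR):

```python
import json


def stringify_nested(row):
    """Hypothetical helper: JSON-encode dict and list values, leave others as-is."""
    return {
        key: json.dumps(value, sort_keys=True) if isinstance(value, (dict, list)) else value
        for key, value in row.items()
    }


items = [{"id": "5558382", "broker": {"teamId": 3398, "contactEmail": "abc.xyz@123.ca"}}]
prepared = [stringify_nested(row) for row in items]

# `prepared` now holds only plain strings for the StringType columns and could be
# passed to spark.createDataFrame(prepared, schema=schema) in place of `items`.
print(prepared[0]["broker"])
```

The trade-off is that the column then holds JSON text rather than a Python `repr`, which is a choice of this sketch rather than the behavior the PR introduces.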

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng
Contributor Author

thanks, merged to master

@zhengruifeng zhengruifeng deleted the py_create_df branch June 16, 2025 02:32
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jun 25, 2025
### What changes were proposed in this pull request?
Unify the string coercion in createDataFrame

### Why are the changes needed?
Currently, PySpark Classic and PySpark Connect behave differently when a dict value is coerced to `StringType`:

```python
from pyspark.sql.types import StructType, StructField, StringType

# One row whose "broker" value is a dict, although the schema declares it as a string.
items = [{'id': '5558382', 'broker': {'teamId': 3398, 'contactEmail': 'abc.xyz@123.ca'}}]

schema = StructType([
    StructField("id", StringType()),
    StructField("broker", StringType()),
])

# Assumes an active SparkSession bound to `spark`.
df = spark.createDataFrame(items, schema=schema)

df.show(truncate=False)
```

PySpark Classic:
```
+-------+------------------------------------------+
|id     |broker                                    |
+-------+------------------------------------------+
|5558382|{contactEmail=abc.xyz@123.ca, teamId=3398}|
+-------+------------------------------------------+
```

PySpark Connect:
```
+-------+--------------------------------------------------+
|id     |broker                                            |
+-------+--------------------------------------------------+
|5558382|{'teamId': 3398, 'contactEmail': 'abc.xyz@123.ca'}|
+-------+--------------------------------------------------+
```

The latter seems more reasonable: the dicts should be converted into strings on the Python side.

### Does this PR introduce _any_ user-facing change?
Yes, in PySpark Classic.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#51140 from zhengruifeng/py_create_df.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>