Add initial support for Ray dataset #166
Conversation
@ericl, @wuisawesome, do you have any suggestions on how to write an ArrowBlock to the object store in Java? Currently what we can do is to write a …
@carsonwang I filed this PR to support that, would it work for your use case? ray-project/ray#17186 For now, maybe making a memory copy is the way to go. We can add this as a TODO.
@ericl, thanks a lot! That will be very helpful.
Implements RDD to ensure locality
# TODO: how to branch on type of block?
sample = ray.get(blocks[0])
@ericl, how should we branch on the type of a block? Maybe we can save it in the metadata or the BlockList class?
I think we should guarantee the block type in the dataset code, or pass in the type explicitly in the call here.
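The sampling approach in the snippet above can be sketched as a small dispatch helper. This is only an illustration of the branch-on-type idea from this thread, not the actual dataset code: the assumed block kinds (an Arrow table or a pandas DataFrame) come from the discussion, and the type check is done lazily on qualified names so the sketch does not require pyarrow at import time.

```python
def infer_block_type(sample):
    """Classify a materialized block (e.g. sample = ray.get(blocks[0])).

    Hypothetical sketch: dispatch on the module of the sample's type,
    assuming blocks are either pyarrow Tables or pandas DataFrames.
    """
    qualname = f"{type(sample).__module__}.{type(sample).__name__}"
    if qualname.startswith("pyarrow"):
        return "arrow"
    if qualname.startswith("pandas"):
        return "pandas"
    raise TypeError(f"unsupported block type: {qualname}")
```

As the reply above notes, guaranteeing the block type in the dataset code (or passing it in explicitly) avoids the cost of materializing a sample block at all.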
driver_cp_key = "spark.driver.extraClassPath"
driver_cp = ":".join(glob.glob(RAYDP_CP) + glob.glob(RAY_CP))
if driver_cp_key in extra_conf:
    extra_conf[driver_cp_key] = driver_cp + ":" + extra_conf[driver_cp_key]
else:
    extra_conf[driver_cp_key] = driver_cp
We need to add RAY_CP to Spark's driver extra classpath because the serialized owner_address is needed to register ownership in Java, but we also need to parse it to extract the IP address for use in Spark. This might not be needed once we use a cross-language call to start our RayAppMaster; in that case we may not need to register ownership(?)
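The IP-extraction step mentioned above could look roughly like the helper below. Note the heavy caveat: Ray's actual serialized owner_address is a protobuf message, so real code would decode that first; this sketch assumes a plain "ip:port" string purely for illustration, and the function name is hypothetical.

```python
def node_ip_from_address(address: str) -> str:
    """Hypothetical helper: pull the node IP out of an owner address.

    Assumes a plain "ip:port" string; Ray's real serialized
    owner_address is a protobuf and would need decoding first.
    """
    ip, sep, _port = address.rpartition(":")
    return ip if sep else address
```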
python/raydp/spark/dataset.py
Outdated
locations = []
for block in blocks:
    # address is needed for locality
    locations.append(ray.worker.global_worker.core_worker.get_owner_address(block))
Will get_owner_address return multiple locations if the block is stored on more than one node?
This isn't a public API of Ray. If you really need it, you should call it from ray/data code and pass in the addresses here as an optional hint.
That way raydp isn't depending on private Ray APIs.
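The optional-hint pattern suggested above can be sketched as follows. All names here are hypothetical; the point is only that callers with access to owner addresses (e.g. ray/data code) pass them in, while raydp itself falls back to no locality preference instead of calling private Ray APIs.

```python
def resolve_locations(blocks, location_hints=None):
    """Hypothetical sketch of the optional-hint pattern.

    If the caller supplies one owner address per block, use them;
    otherwise return no preference and let Spark schedule freely.
    """
    if location_hints is not None:
        if len(location_hints) != len(blocks):
            raise ValueError("expected one location hint per block")
        return location_hints
    return [None] * len(blocks)  # no preference
```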
I have added a developer API to Ray. For now, addresses are needed to register ownership in Java.
python/raydp/spark/dataset.py
Outdated
blocks_df = spark.createDataFrame(ref_list, schema)
return blocks_df.mapInPandas(_convert_blocks_to_dataframe, schema)

def _convert_by_rdd(spark: sql.SparkSession, blocks: Dataset, schema: StructType) -> DataFrame:
Is blocks a List[ObjectRef] instead of Dataset?
Yes, I called get_blocks on the dataset. Actually, it should be List[ObjectRef[Block]].
words_df = spark.createDataFrame([('look',), ('spark',), ('tutorial',), ('spark',), ('look',), ('python',)], ['word'])
ds = raydp.spark.spark_dataframe_to_ray_dataset(words_df)
df = raydp.spark.ray_dataset_to_spark_dataframe(spark, ds)
assert words_df.toPandas().equals(df.toPandas())
Can we add a corresponding unit test on the Ray side to verify it isn't breaking raydp?
dfs.append(data.to_pandas())
yield pd.concat(dfs)

def _convert_by_udf(spark: sql.SparkSession,
When converting by UDF, we also want to use the customized RDD so that we can have data locality.
@@ -61,6 +61,12 @@ def test_spark_driver_and_executor_hostname(spark_on_ray_small):
    driver_bind_address = conf.get("spark.driver.bindAddress")
    assert node_ip_address == driver_bind_address

def test_ray_dataset_from_and_to_spark(spark_on_ray_small):
Can you also add a test that creates a Ray Dataset directly and then converts it to a Spark dataframe?
python/raydp/spark/dataset.py
Outdated
schema: StructType) -> DataFrame:
    s = StructType([StructField("ref", BinaryType(), False)])
    ref_list = [(ray.cloudpickle.dumps(block),) for block in blocks]
    blocks_df = spark.createDataFrame(ref_list, schema)
Can we first create an RDD for ref_list here, solving the locality issue and specifying the correct partition number?
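A minimal sketch of that suggestion, assuming a pyspark SparkSession: building the RDD explicitly via parallelize pins the partition count (one partition per block ref here) instead of letting createDataFrame pick it. The function name is hypothetical, and locality preferences would still need the custom RDD discussed elsewhere in this PR; this sketch only addresses the partition number.

```python
def refs_to_dataframe(spark, ref_list, schema, num_partitions=None):
    # Default to one partition per serialized block ref so that the
    # downstream mapInPandas sees one block per partition.
    n = num_partitions or max(len(ref_list), 1)
    rdd = spark.sparkContext.parallelize(ref_list, n)
    return spark.createDataFrame(rdd, schema)
```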
The two PRs depend on each other to pass the tests, which is a deadlock. I have tested this in my local environment, and the newly added tests are also in the Ray PR, so they will be covered there. I'm merging this now and will update later if needed.
ObjectId id = new ObjectId(obj);
ObjectRefImpl<T> ref = new ObjectRefImpl<>(id, clazz);
((RayRuntimeInternal) Ray.internal()).getObjectStore()
    .registerOwnershipInfoAndResolveFuture(id, null, ownerAddress);
registerOwnershipInfoAndResolveFuture is already invoked in the constructor of ObjectRefImpl. I don't think we need to call it again here.
This PR has to use the Ray nightly build.