Support loading of RDDs (of case classes) from CSV. #52

Closed
rayortigas wants to merge 2 commits

rayortigas

I'm still in RDD-land, so I'd like something like this to avoid writing things like:

val rdd = sqlContext.csvFile(path, useHeader = false).map { row =>
  Foo(row.getString(0).toInt, row.getString(1).toInt, row.getString(2).toDouble)
}

Instead, we can write:

val rdd = sqlContext.csvFileToRDD[Foo](path, useHeader = false)

I tried to be minimally invasive here by building on top of csvFile. With more refactoring, I probably would've teased out some stuff in CsvRelation, but I hope this PR is useful in its present form.
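
For readers who want a feel for how a row of strings could be turned into a case class instance, here is a minimal sketch using Scala runtime reflection with `TypeTag` and `ClassTag`. It is not the code from this PR and it does not touch Spark. `TypedRowSketch`, `converter`, `parserFor`, and the field names of `Foo` are hypothetical names chosen for illustration, and the sketch assumes Scala 2.11 or later with scala-reflect on the classpath. Like the PR, it only accepts primitive-like fields and case classes that are top-level or nested in an object.

```scala
import scala.reflect.{ClassTag, classTag}
import scala.reflect.runtime.universe._

object TypedRowSketch {
  // Hypothetical example type: two Ints and a Double, mirroring the Foo above.
  case class Foo(a: Int, b: Int, c: Double)

  // Picks a String parser for one constructor parameter type.
  // Anything outside this small set of types is rejected.
  private def parserFor(t: Type): String => Any =
    if (t =:= typeOf[Int]) (s: String) => s.toInt
    else if (t =:= typeOf[Long]) (s: String) => s.toLong
    else if (t =:= typeOf[Double]) (s: String) => s.toDouble
    else if (t =:= typeOf[Boolean]) (s: String) => s.toBoolean
    else if (t =:= typeOf[String]) (s: String) => s
    else throw new IllegalArgumentException(s"Unsupported field type: $t")

  // Builds a function that converts one row of string fields into a T.
  // The reflection work happens once here, not once per row.
  def converter[T: TypeTag: ClassTag]: Seq[String] => T = {
    val classSymbol = typeOf[T].typeSymbol.asClass
    require(classSymbol.isCaseClass, s"$classSymbol is not a case class")

    val mirror = runtimeMirror(classTag[T].runtimeClass.getClassLoader)
    val ctor = classSymbol.primaryConstructor.asMethod
    val construct = mirror.reflectClass(classSymbol).reflectConstructor(ctor)

    // One parser per constructor parameter, in declaration order.
    val parsers = ctor.paramLists.head.map(p => parserFor(p.typeSignature))

    (fields: Seq[String]) => {
      require(fields.length == parsers.length,
        s"Expected ${parsers.length} fields but got ${fields.length}")
      val args = parsers.zip(fields).map { case (parse, s) => parse(s) }
      construct(args: _*).asInstanceOf[T]
    }
  }
}
```

With this sketch, `TypedRowSketch.converter[TypedRowSketch.Foo]` yields a function that maps `Seq("1", "2", "3.5")` to `Foo(1, 2, 3.5)`. A malformed field throws (for example a `NumberFormatException`), which loosely corresponds to fail-fast behavior.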

Regards,
Ray

Squashed commit of the following:

commit e75167f
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Sat Apr 18 15:39:30 2015 -0700

    Test for rejection of case classes with non-primitive fields.

commit c4a1de0
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Sat Apr 18 11:54:53 2015 -0700

    Don't inherit from csv.CsvContext.

commit 674672d
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 19:37:52 2015 -0700

    Add TSV support.

commit e93ec4c
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 19:22:52 2015 -0700

    Add comment about not handling inner case classes.

commit 1495f51
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 19:22:38 2015 -0700

    Add test for headerless CSV.

commit 6f7fcf3
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 19:12:19 2015 -0700

    Add test for permissive mode (which is invalid).

commit ccbb6ba
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 19:10:54 2015 -0700

    Add test for fail-fast mode.

commit fb0f50d
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 19:04:33 2015 -0700

    Add test.

commit 51a9868
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 17:21:13 2015 -0700

    Move RDD-related methods to own package.

commit f5a2c2c
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 16:31:10 2015 -0700

    Use TypeTag and ClassTag instead of manifest.

commit ffed4fc
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 15:41:32 2015 -0700

    Express csvFileToRDD() in terms of csvFile().

commit b52f582
Author: Ray Ortigas <rayo@linkedin.com>
Date:   Fri Apr 17 15:38:43 2015 -0700

    First cut at typed RDD.
@rxin
Contributor

rxin commented Apr 19, 2015

@rayortigas this seems like something that can easily live outside of the CSV package. There isn't anything specific to CSV about this one.

As a matter of fact it probably deserves to either be part of the DataFrame API, or just an implicit conversion on DataFrame to add the following:

// or called toTyped, or typedRDD
def toTypedRDD[T : scala.reflect.runtime.universe.TypeTag : scala.reflect.ClassTag]: RDD[T] = {
   ...
}

@rayortigas
Author

@rxin I'd love for DataFrames to support it directly... I picked CSV first because the conversion was more straightforward (just a row of primitives). :D

Maybe I'll put together a PR for spark proper that handles more complex objects? I see what ScalaReflection is doing (and I think I saw the latest refactoring), so I'll take a cue from that.

@rayortigas
Author

OK, I opened apache/spark#5713. Thanks for the suggestion @rxin!

@rayortigas rayortigas closed this Apr 27, 2015
2 participants