
Python API for Spark Streaming #10


Closed
wants to merge 636 commits into from
Commits
636 commits
ac3440f
[SPARK-2859] Update url of Kryo project in related docs
gchen Aug 5, 2014
74f82c7
SPARK-2380: Support displaying accumulator values in the web UI
pwendell Aug 5, 2014
41e0a21
SPARK-1680: use configs for specifying environment variables on YARN
tgravescs Aug 5, 2014
cc491f6
[SPARK-2864][MLLIB] fix random seed in word2vec; move model to local
mengxr Aug 5, 2014
acff9a7
[SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.k…
rxin Aug 5, 2014
1aad911
[SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercep…
miccagiann Aug 5, 2014
2643e66
SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection
Aug 6, 2014
d94f599
[sql] rename project name in pom.xml of hive-thriftserver module
scwf Aug 6, 2014
d0ae3f3
[SPARK-2650][SQL] Try to partially fix SPARK-2650 by adjusting initia…
liancheng Aug 6, 2014
69ec678
[SPARK-2854][SQL] Finalize _acceptable_types in pyspark.sql
yhuai Aug 6, 2014
1d70c4f
[SPARK-2866][SQL] Support attributes in ORDER BY that aren't in SELECT
marmbrus Aug 6, 2014
82624e2
[SPARK-2806] core - upgrade to json4s-jackson 3.2.10
avati Aug 6, 2014
b70bae4
[SQL] Tighten the visibility of various SQLConf methods and renamed s…
rxin Aug 6, 2014
5a826c0
[SQL] Fix logging warn -> debug
marmbrus Aug 6, 2014
63bdb1f
SPARK-2294: fix locality inversion bug in TaskManager
CodingCat Aug 6, 2014
c7b5201
[MLlib] Use this.type as return type in k-means' builder pattern
Aug 6, 2014
ee7f308
[SPARK-1022][Streaming][HOTFIX] Fixed zookeeper dependency of Kafka
tdas Aug 6, 2014
09f7e45
[SPARK-2157] Enable tight firewall rules for Spark
andrewor14 Aug 6, 2014
4878911
[SPARK-2875] [PySpark] [SQL] handle null in schemaRDD()
davies Aug 6, 2014
a6cd311
[SPARK-2678][Core][SQL] A workaround for SPARK-2678
liancheng Aug 6, 2014
d614967
[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically
nchammas Aug 6, 2014
4e98236
SPARK-2566. Update ShuffleWriteMetrics incrementally
sryza Aug 6, 2014
25cff10
[SPARK-2852][MLLIB] API consistency for `mllib.feature`
mengxr Aug 6, 2014
e537b33
[PySpark] Add blanklines to Python docstrings so example code renders…
rnowling Aug 6, 2014
c6889d2
[HOTFIX][Streaming] Handle port collisions in flume polling test
andrewor14 Aug 6, 2014
4e00833
SPARK-2882: Spark build now checks local maven cache for dependencies
GregOwen Aug 6, 2014
17caae4
[SPARK-2583] ConnectionManager error reporting
sarutak Aug 7, 2014
4201d27
SPARK-2879 [BUILD] Use HTTPS to access Maven Central and other repos
srowen Aug 7, 2014
a263a7e
HOTFIX: Support custom Java 7 location
pwendell Aug 7, 2014
a120d07
WIP
giwa Aug 7, 2014
ffd1f59
[SPARK-2887] fix bug of countApproxDistinct() when have more than one…
davies Aug 7, 2014
47ccd5e
[SPARK-2851] [mllib] DecisionTree Python consistency update
jkbradley Aug 7, 2014
75993a6
SPARK-2879 part 2 [BUILD] Use HTTPS to access Maven Central and other…
srowen Aug 7, 2014
8d1dec4
[mllib] DecisionTree Strategy parameter checks
jkbradley Aug 7, 2014
b9e9e53
[SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms
mengxr Aug 7, 2014
80ec5ba
SPARK-2905 Fixed path sbin => bin
dosoft Aug 7, 2014
32096c2
SPARK-2899 Doc generation is back to working in new SBT Build.
ScrapCodes Aug 7, 2014
6906b69
SPARK-2787: Make sort-based shuffle write files directly when there's…
mateiz Aug 8, 2014
4c51098
SPARK-2565. Update ShuffleReadMetrics as blocks are fetched
sryza Aug 8, 2014
9de6a42
[SPARK-2904] Remove non-used local variable in SparkSubmitArguments
sarutak Aug 8, 2014
9a54de1
[SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD
erikerlandson Aug 8, 2014
9016af3
[SPARK-2888] [SQL] Fix addColumnMetadataToConf in HiveTableScan
yhuai Aug 8, 2014
0489cee
[SPARK-2908] [SQL] JsonRDD.nullTypeToStringType does not convert all …
yhuai Aug 8, 2014
c874723
[SPARK-2877] [SQL] MetastoreRelation should use SparkClassLoader when…
yhuai Aug 8, 2014
45d8f4d
[SPARK-2919] [SQL] Basic support for analyze command in HiveQl
yhuai Aug 8, 2014
b7c89a7
[SPARK-2700] [SQL] Hidden files (such as .impala_insert_staging) shou…
chutium Aug 8, 2014
74d6f62
[SPARK-1997][MLLIB] update breeze to 0.9
mengxr Aug 8, 2014
ec79063
[SPARK-2897][SPARK-2920]TorrentBroadcast does use the serializer clas…
witgo Aug 8, 2014
1c84dba
[Web UI]Make decision order of Worker's WebUI port consistent with Ma…
WangTaoTheTonic Aug 9, 2014
43af281
[SPARK-2911] apply parent[T](j) to clarify UnionRDD code
erikerlandson Aug 9, 2014
28dbae8
[SPARK-2635] Fix race condition at SchedulerBackend.isReady in standa…
li-zhihui Aug 9, 2014
b431e67
[SPARK-2861] Fix Doc comment of histogram method
Aug 9, 2014
e45daf2
[SPARK-1766] sorted functions to meet pedantic requirements
Aug 10, 2014
4f4a988
[SPARK-2894] spark-shell doesn't accept flags
sarutak Aug 10, 2014
5b6585d
Updated Spark SQL README to include the hive-thriftserver module
rxin Aug 10, 2014
482c5af
Turn UpdateBlockInfo into case class.
rxin Aug 10, 2014
3570119
Remove extra semicolon in Task.scala
witgo Aug 10, 2014
1d03a26
[SPARK-2950] Add gc time and shuffle write time to JobLogger
shivaram Aug 10, 2014
28dcbb5
[SPARK-2898] [PySpark] fix bugs in deamon.py
davies Aug 10, 2014
b715aa0
[SPARK-2937] Separate out samplyByKeyExact as its own API in PairRDDF…
dorx Aug 10, 2014
90ae568
WIP added test case
giwa Aug 11, 2014
ba28a8f
[SPARK-2936] Migrate Netty network module from Java to Scala
rxin Aug 11, 2014
2cfd3a0
added basic operation test cases
giwa Aug 11, 2014
db0a303
delete waste file
giwa Aug 11, 2014
3334169
fixed PEP-008 violation
giwa Aug 11, 2014
e8c7bfc
remove export PYSPARK_PYTHON in spark submit
giwa Aug 11, 2014
bdde697
removed unnesessary changes
giwa Aug 11, 2014
a65f302
edited the comment to add more precise description
giwa Aug 11, 2014
db06a81
[PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python …
JoshRosen Aug 11, 2014
3733866
[SPARK-2952] Enable logging actor messages at DEBUG level
rxin Aug 11, 2014
90a6484
added mapValues and flatMapVaules WIP for glom and mapPartitions test
giwa Aug 11, 2014
7712e72
[SPARK-2931] In TaskSetManager, reset currentLocalityIndex after reco…
JoshRosen Aug 12, 2014
32638b5
[SPARK-2515][mllib] Chi Squared test
dorx Aug 12, 2014
6fab941
[SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface
Aug 12, 2014
490ecfa
[SPARK-2844][SQL] Correctly set JVM HiveContext if it is passed into …
ahirreddy Aug 12, 2014
21a95ef
[SPARK-2590][SQL] Added option to handle incremental collection, disa…
liancheng Aug 12, 2014
e83fdcd
[sql]use SparkSQLEnv.stop() in ShutdownHook
scwf Aug 12, 2014
647aeba
[SQL] A tiny refactoring in HiveContext#analyze
yhuai Aug 12, 2014
c9c89c3
[SPARK-2965][SQL] Fix HashOuterJoin output nullabilities.
ueshin Aug 12, 2014
c686b7d
[SPARK-2968][SQL] Fix nullabilities of Explode.
ueshin Aug 12, 2014
bad21ed
[SPARK-2650][SQL] Build column buffers in smaller batches
marmbrus Aug 12, 2014
5d54d71
[SQL] [SPARK-2826] Reduce the memory copy while building the hashmap …
chenghao-intel Aug 12, 2014
9038d94
[SPARK-2923][MLLIB] Implement some basic BLAS routines
mengxr Aug 12, 2014
f0060b7
[MLlib] Correctly set vectorSize and alpha
Ishiihara Aug 12, 2014
882da57
fix flaky tests
davies Aug 12, 2014
c235b83
SPARK-2830 [MLlib]: re-organize mllib documentation
atalwalkar Aug 13, 2014
676f982
[SPARK-2953] Allow using short names for io compression codecs
rxin Aug 13, 2014
246cb3f
Use transferTo when copy merge files in ExternalSorter
colorant Aug 13, 2014
2bd8126
[SPARK-1777 (partial)] bugfix: make size of requested memory correctly
liyezhang556520 Aug 13, 2014
fe47359
[SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatistical…
dorx Aug 13, 2014
869f06c
[SPARK-2963] [SQL] There no documentation about building to use HiveS…
sarutak Aug 13, 2014
c974a71
[SPARK-3013] [SQL] [PySpark] convert array into list
davies Aug 13, 2014
434bea1
[SPARK-2983] [PySpark] improve performance of sortByKey()
davies Aug 13, 2014
7ecb867
[MLLIB] use Iterator.fill instead of Array.fill
mengxr Aug 13, 2014
bdc7a1a
[SPARK-3004][SQL] Added null checking when retrieving row set
liancheng Aug 13, 2014
13f54e2
[SPARK-2817] [SQL] add "show create table" support
tianyi Aug 13, 2014
9256d4a
[SPARK-2994][SQL] Support for udfs that take complex types
marmbrus Aug 14, 2014
376a82e
[SPARK-2650][SQL] More precise initial buffer size estimation for in-…
liancheng Aug 14, 2014
9fde1ff
[SPARK-2935][SQL]Fix parquet predicate push down bug
marmbrus Aug 14, 2014
905dc4b
[SPARK-2970] [SQL] spark-sql script ends with IOException when EventL…
sarutak Aug 14, 2014
63d6777
[SPARK-2986] [SQL] fixed: setting properties does not effect
Aug 14, 2014
0c7b452
SPARK-3020: Print completed indices rather than tasks in web UI
pwendell Aug 14, 2014
0704b86
WIP: solved partitioned and None is not recognized
giwa Aug 14, 2014
9497b12
[SPARK-3006] Failed to execute spark-shell in Windows OS
tsudukim Aug 14, 2014
e424565
[Docs] Add missing <code> tags (minor)
andrewor14 Aug 14, 2014
69a57a1
[SPARK-2995][MLLIB] add ALS.setIntermediateRDDStorageLevel
mengxr Aug 14, 2014
d069c5d
[SPARK-3029] Disable local execution of Spark jobs by default
aarondav Aug 14, 2014
080541a
broke something
giwa Aug 14, 2014
6b8de0e
SPARK-2893: Do not swallow Exceptions when running a custom kryo regi…
GrahamDennis Aug 14, 2014
078f3fb
[SPARK-3011][SQL] _temporary directory should be filtered out by sqlC…
josephsu Aug 14, 2014
add75d4
[SPARK-2927][SQL] Add a conf to configure if we always read Binary co…
yhuai Aug 14, 2014
fde692b
[SQL] Python JsonRDD UTF8 Encoding Fix
ahirreddy Aug 14, 2014
267fdff
[SPARK-2925] [sql]fix spark-sql and start-thriftserver shell bugs whe…
scwf Aug 14, 2014
eaeb0f7
Minor cleanup of metrics.Source
rxin Aug 14, 2014
9622106
[SPARK-2979][MLlib] Improve the convergence rate by minimizing the co…
Aug 14, 2014
a7f8a4f
Revert [SPARK-3011][SQL] _temporary directory should be filtered out…
marmbrus Aug 14, 2014
a75bc7a
SPARK-3009: Reverted readObject method in ApplicationInfo so that App…
jacek-lewandowski Aug 14, 2014
fa5a08e
Make dev/mima runnable on Mac OS X.
rxin Aug 14, 2014
2112638
all tests are passed if numSlice is 2 and the numver of each input is…
giwa Aug 15, 2014
655699f
[SPARK-3027] TaskContext: tighten visibility and provide Java friendl…
rxin Aug 15, 2014
3a8b68b
[SPARK-2468] Netty based block server / client module
rxin Aug 15, 2014
9422a9b
[SPARK-2736] PySpark converter and example script for reading Avro files
kanzhang Aug 15, 2014
500f84e
[SPARK-2912] [Spark QA] Include commit hash in Spark QA messages
nchammas Aug 15, 2014
e1b85f3
SPARK-2955 [BUILD] Test code fails to compile with "mvn compile" with…
srowen Aug 15, 2014
fba8ec3
Add caching information to rdd.toDebugString
Aug 15, 2014
536def4
basic function test cases are passed
giwa Aug 15, 2014
a14c7e1
modified streaming test case to add coment
giwa Aug 15, 2014
7589c39
[SPARK-2924] remove default args to overloaded methods
avati Aug 15, 2014
fd9fcd2
Revert "[SPARK-2468] Netty based block server / client module"
pwendell Aug 15, 2014
e3033fc
remove waste duplicated code
giwa Aug 15, 2014
0afe5cb
SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetr…
sryza Aug 15, 2014
c703229
[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unor…
jkbradley Aug 15, 2014
cc36487
[SPARK-3046] use executor's class loader as the default serializer cl…
rxin Aug 16, 2014
89ae38a
added saveAsTextFiles and saveAsPickledFiles
giwa Aug 16, 2014
5d25c0b
[SPARK-3078][MLLIB] Make LRWithLBFGS API consistent with others
mengxr Aug 16, 2014
2e069ca
[SPARK-3001][MLLIB] Improve Spearman's correlation
mengxr Aug 16, 2014
ea9c873
added TODO coments
giwa Aug 16, 2014
c9da466
[SPARK-3015] Block on cleaning tasks to prevent Akka timeouts
andrewor14 Aug 16, 2014
a83c772
[SPARK-3045] Make Serializer interface Java friendly
rxin Aug 16, 2014
20fcf3d
[SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager
JoshRosen Aug 16, 2014
b4a0592
[SQL] Using safe floating-point numbers in doctest
liancheng Aug 16, 2014
4bdfaa1
[SPARK-3076] [Jenkins] catch & report test timeouts
nchammas Aug 16, 2014
76fa0ea
[SPARK-2677] BasicBlockFetchIterator#next can wait forever
sarutak Aug 16, 2014
7e70708
[SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLa…
mengxr Aug 16, 2014
ac6411c
[SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs
mengxr Aug 16, 2014
379e758
[SPARK-3035] Wrong example with SparkContext.addFile
iAmGhost Aug 16, 2014
2fc8aca
[SPARK-1065] [PySpark] improve supporting for large broadcast
davies Aug 16, 2014
bc95fe0
In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
witgo Aug 17, 2014
fbad722
[SPARK-3077][MLLIB] fix some chisq-test
mengxr Aug 17, 2014
73ab7f1
[SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up
jkbradley Aug 17, 2014
318e28b
SPARK-2881. Upgrade snappy-java to 1.1.1.3.
pwendell Aug 18, 2014
5ecb08e
Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException whe…
marmbrus Aug 18, 2014
bfa09b0
[SQL] Improve debug logging and toStrings.
marmbrus Aug 18, 2014
9924328
[SPARK-1981] updated streaming-kinesis.md
cfregly Aug 18, 2014
95470a0
[HOTFIX][STREAMING] Allow the JVM/Netty to decide which port to bind …
harishreedharan Aug 18, 2014
c77f406
[SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a chec…
mengxr Aug 18, 2014
5173f3c
SPARK-2884: Create binary builds in parallel with release script.
pwendell Aug 18, 2014
df652ea
SPARK-2900. aggregate inputBytes per stage
sryza Aug 18, 2014
3c8fa50
[SPARK-3097][MLlib] Word2Vec performance improvement
Ishiihara Aug 18, 2014
eef779b
[SPARK-2842][MLlib]Word2Vec documentation
Ishiihara Aug 18, 2014
d8b593b
add comments
giwa Aug 18, 2014
e7ebb08
removed wasted print in DStream
giwa Aug 18, 2014
636090a
added sparkContext as input parameter in StreamingContext
giwa Aug 18, 2014
a3d2379
added gorupByKey testcase
giwa Aug 18, 2014
665bfdb
added testcase for combineByKey
giwa Aug 18, 2014
5c3a683
initial commit for pySparkStreaming
giwa Jul 9, 2014
e497b9b
comment PythonDStream.PairwiseDStream
Jul 15, 2014
6e0d9c7
modify dstream.py to fix indent error
Jul 16, 2014
9af03f4
added reducedByKey not working yet
Jul 16, 2014
dcf243f
implementing transform function in Python
Jul 16, 2014
c5518b4
modified the code base on comment in https://github.com/tdas/spark/pu…
Jul 16, 2014
3758175
add coment for hack why PYSPARK_PYTHON is needed in spark-submit
Jul 16, 2014
e551e13
add coment for hack why PYSPARK_PYTHON is needed in spark-submit
Jul 16, 2014
2adca84
remove not implemented DStream functions in python
Jul 16, 2014
5594bd4
revert pom.xml
Jul 16, 2014
490e338
sorted the import following Spark coding convention
Jul 16, 2014
856d98e
add empty line
Jul 16, 2014
4ce4058
remove unused import in python
Jul 16, 2014
02f618a
initial commit for socketTextStream
Jul 16, 2014
4b69fb1
fied input of socketTextDStream
Jul 16, 2014
57fb740
added doctest for pyspark.streaming.duration
Jul 17, 2014
967dc26
fixed typo of network_workdcount.py
Jul 17, 2014
7f7c5d1
delete old file
Jul 17, 2014
d25d5cf
added reducedByKey not working yet
Jul 16, 2014
0b8b7d0
reduceByKey is working
Jul 17, 2014
d1ee6ca
edit python sparkstreaming example
Jul 18, 2014
a9f4ecb
added count operation but this implementation need double check
Jul 19, 2014
05459c6
fix map function
Jul 20, 2014
9fa249b
clean up code
Jul 20, 2014
aeaf8a5
clean up codes
Jul 20, 2014
5e822d4
remove waste file
Jul 20, 2014
4eff053
Implemented DStream.foreachRDD in the Python API using Py4J callback …
tdas Jul 23, 2014
4caae3f
Added missing file
tdas Aug 1, 2014
c9fc124
Added extra line.
tdas Aug 1, 2014
19ddcdd
tried to restart callback server
Aug 2, 2014
b47b5fd
Kill py4j callback server properly
Aug 3, 2014
b6468e6
Removed the waste line
giwa Aug 3, 2014
b8d7d24
implemented reduce and count function in Dstream
giwa Aug 4, 2014
189dcea
clean up examples
giwa Aug 4, 2014
79c5809
added stop in StreamingContext
giwa Aug 4, 2014
5a9b525
clean up dstream.py
giwa Aug 4, 2014
ea4b06b
initial commit for testcase
giwa Aug 4, 2014
5d22c92
WIP
giwa Aug 4, 2014
c880a33
update comment
giwa Aug 4, 2014
1fd12ae
WIP
giwa Aug 4, 2014
c05922c
WIP: added PythonTestInputStream
giwa Aug 5, 2014
1f68b78
WIP
giwa Aug 7, 2014
3dda31a
WIP added test case
giwa Aug 11, 2014
7f96294
added basic operation test cases
giwa Aug 11, 2014
fa75d71
delete waste file
giwa Aug 11, 2014
8efa266
fixed PEP-008 violation
giwa Aug 11, 2014
3a671cc
remove export PYSPARK_PYTHON in spark submit
giwa Aug 11, 2014
774f18d
removed unnesessary changes
giwa Aug 11, 2014
33c0f94
edited the comment to add more precise description
giwa Aug 11, 2014
4f2d7e6
added mapValues and flatMapVaules WIP for glom and mapPartitions test
giwa Aug 11, 2014
9767712
WIP: solved partitioned and None is not recognized
giwa Aug 14, 2014
35933e1
broke something
giwa Aug 14, 2014
7051a84
all tests are passed if numSlice is 2 and the numver of each input is…
giwa Aug 15, 2014
99e4bb3
basic function test cases are passed
giwa Aug 15, 2014
580fbc2
modified streaming test case to add coment
giwa Aug 15, 2014
94f2b65
remove waste duplicated code
giwa Aug 15, 2014
e9fab72
added saveAsTextFiles and saveAsPickledFiles
giwa Aug 16, 2014
4aa99e4
added TODO coments
giwa Aug 16, 2014
6d8190a
add comments
giwa Aug 18, 2014
14d4c0e
removed wasted print in DStream
giwa Aug 18, 2014
97742fe
added sparkContext as input parameter in StreamingContext
giwa Aug 18, 2014
e162822
added gorupByKey testcase
giwa Aug 18, 2014
e70f706
added testcase for combineByKey
giwa Aug 18, 2014
f1798c4
merge with master
giwa Aug 18, 2014
185fdbf
merge with master
giwa Aug 19, 2014
199e37f
adopted the latest compression way of python command
giwa Aug 19, 2014
58150f5
Changed the test case to focus the test operation
giwa Aug 19, 2014
09a28bf
improve testcases
giwa Aug 19, 2014
268a6a5
Changed awaitTermination not to call awaitTermincation in Scala. Just…
giwa Aug 19, 2014
4dedd2d
change test case not to use awaitTermination
giwa Aug 19, 2014
171edeb
clean up
giwa Aug 20, 2014
f0ea311
clean up code
giwa Aug 21, 2014
1d84142
remove unimplement test
giwa Aug 21, 2014
583e66d
move tests for streaming inside streaming directory
giwa Aug 21, 2014
b7dab85
improve test case
giwa Aug 21, 2014
0d30109
fixed pep8 violation
giwa Aug 21, 2014
24f95db
clen up examples
giwa Aug 21, 2014
9c85e48
clean up exmples
giwa Aug 21, 2014
7339df2
fixed typo
giwa Aug 21, 2014
9d1de23
revert pom.xml
giwa Aug 21, 2014
4f82c89
remove duplicated import
giwa Aug 21, 2014
50fd6f9
revert pom.xml
giwa Aug 21, 2014
93f7637
fixed explanaiton
giwa Aug 21, 2014
acfcaeb
revert pom.xml
giwa Aug 21, 2014
3b27bd4
remove the last brank line
giwa Aug 21, 2014
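
Taken together, the commit messages above outline the shape of the Python streaming API this pull request builds up: a StreamingContext that wraps a SparkContext, a socketTextStream input, DStream transformations such as map/flatMap/reduceByKey, output via saveAsTextFiles, and awaitTermination/stop on the context. The sketch below is assembled only from those names as a rough usage example; the module path, argument order, and batch-interval type are assumptions, not taken from the patch.

```python
# Hypothetical usage sketch based on names in the commit messages above.
# Module path and exact signatures are assumptions, not part of this patch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingWordCount")
ssc = StreamingContext(sc, 1)  # "added sparkContext as input parameter in StreamingContext"

lines = ssc.socketTextStream("localhost", 9999)        # "initial commit for socketTextStream"
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))        # "reduceByKey is working"
counts.saveAsTextFiles("wordcounts")                    # "added saveAsTextFiles and saveAsPickledFiles"

ssc.start()
ssc.awaitTermination()
```
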
3 changes: 3 additions & 0 deletions .gitignore
@@ -19,6 +19,7 @@ conf/spark-env.sh
conf/streaming-env.sh
conf/log4j.properties
conf/spark-defaults.conf
conf/hive-site.xml
docs/_site
docs/api
target/
@@ -50,9 +51,11 @@ unit-tests.log
rat-results.txt
scalastyle.txt
conf/*.conf
scalastyle-output.xml

# For Hive
metastore_db/
metastore/
warehouse/
TempStatsStore/
sql/hive-thriftserver/test_warehouses
2 changes: 2 additions & 0 deletions .rat-excludes
@@ -25,6 +25,7 @@ log4j-defaults.properties
bootstrap-tooltip.js
jquery-1.11.1.min.js
sorttable.js
.*avsc
.*txt
.*json
.*data
@@ -55,3 +56,4 @@ dist/*
.*ipr
.*iws
logs
.*scalastyle-output.xml
25 changes: 22 additions & 3 deletions LICENSE
@@ -272,7 +272,7 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


========================================================================
For Py4J (python/lib/py4j0.7.egg and files in assembly/lib/net/sf/py4j):
For Py4J (python/lib/py4j-0.8.2.1-src.zip)
========================================================================

Copyright (c) 2009-2011, Barthelemy Dagenais All rights reserved.
@@ -442,7 +442,7 @@ Written by Pavel Binko, Dino Ferrero Merlino, Wolfgang Hoschek, Tony Johnson, An…


========================================================================
Fo SnapTree:
For SnapTree:
========================================================================

SNAPTREE LICENSE
@@ -482,6 +482,24 @@ OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.


========================================================================
For Timsort (core/src/main/java/org/apache/spark/util/collection/Sorter.java):
========================================================================
Copyright (C) 2008 The Android Open Source Project

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


========================================================================
BSD-style licenses
========================================================================
@@ -514,7 +532,7 @@ The following components are provided under a BSD-style license. See project lin…
(New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf)
(The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
(The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net)
(The New BSD License) Py4J (net.sf.py4j:py4j:0.8.1 - http://py4j.sourceforge.net/)
(The New BSD License) Py4J (net.sf.py4j:py4j:0.8.2.1 - http://py4j.sourceforge.net/)
(Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
(ISC/BSD License) jbcrypt (org.mindrot:jbcrypt:0.3m - http://www.mindrot.org/)

@@ -531,3 +549,4 @@ The following components are provided under the MIT License. See project link fo…
(MIT License) pyrolite (org.spark-project:pyrolite:2.0.1 - http://pythonhosted.org/Pyro4/)
(MIT License) scopt (com.github.scopt:scopt_2.10:3.2.0 - https://github.com/scopt/scopt)
(The MIT License) Mockito (org.mockito:mockito-all:1.8.5 - http://www.mockito.org)
(MIT License) jquery (https://jquery.org/license/)
33 changes: 24 additions & 9 deletions README.md
@@ -1,6 +1,13 @@
# Apache Spark

Lightning-Fast Cluster Computing - <http://spark.apache.org/>
Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and structured
data processing, MLLib for machine learning, GraphX for graph processing,
and Spark Streaming.

<http://spark.apache.org/>


## Online Documentation
@@ -69,29 +76,28 @@ can be run using:
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
when building Spark.
You can change the version by setting `-Dhadoop.version` when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

# Apache Hadoop 1.2.1
$ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
$ sbt/sbt -Dhadoop.version=1.2.1 assembly

# Cloudera CDH 4.2.0 with MapReduce v1
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
$ sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.2.0 assembly

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:
with YARN, also set `-Pyarn`:

# Apache Hadoop 2.0.5-alpha
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
$ sbt/sbt -Dhadoop.version=2.0.5-alpha -Pyarn assembly

# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
$ sbt/sbt -Dhadoop.version=2.0.0-cdh4.2.0 -Pyarn assembly

# Apache Hadoop 2.2.X and newer
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
$ sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
@@ -109,6 +115,15 @@ If your project is built with Maven, add this to your POM file's `<dependencies>`
</dependency>


## A Note About Thrift JDBC server and CLI for Spark SQL

Spark SQL supports Thrift JDBC server and CLI.
See sql-programming-guide.md for more information about those features.
You can use those features by setting `-Phive-thriftserver` when building Spark as follows.

$ sbt/sbt -Phive-thriftserver assembly


## Configuration

Please refer to the [Configuration guide](http://spark.apache.org/docs/latest/configuration.html)
14 changes: 13 additions & 1 deletion assembly/pom.xml
@@ -32,12 +32,14 @@
<packaging>pom</packaging>

<properties>
<sbt.project.name>assembly</sbt.project.name>
<spark.jar.dir>scala-${scala.binary.version}</spark.jar.dir>
<spark.jar.basename>spark-assembly-${project.version}-hadoop${hadoop.version}.jar</spark.jar.basename>
<spark.jar>${project.build.directory}/${spark.jar.dir}/${spark.jar.basename}</spark.jar>
<deb.pkg.name>spark</deb.pkg.name>
<deb.install.path>/usr/share/spark</deb.install.path>
<deb.user>root</deb.user>
<deb.bin.filemode>744</deb.bin.filemode>
</properties>

<dependencies>
@@ -163,6 +165,16 @@
</dependency>
</dependencies>
</profile>
<profile>
<id>hive-thriftserver</id>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive-thriftserver_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
</profile>
<profile>
<id>spark-ganglia-lgpl</id>
<dependencies>
@@ -275,7 +287,7 @@
<user>${deb.user}</user>
<group>${deb.user}</group>
<prefix>${deb.install.path}/bin</prefix>
<filemode>744</filemode>
<filemode>${deb.bin.filemode}</filemode>
</mapper>
</data>
<data>
3 changes: 3 additions & 0 deletions bagel/pom.xml
@@ -27,6 +27,9 @@

<groupId>org.apache.spark</groupId>
<artifactId>spark-bagel_2.10</artifactId>
<properties>
<sbt.project.name>bagel</sbt.project.name>
</properties>
<packaging>jar</packaging>
<name>Spark Project Bagel</name>
<url>http://spark.apache.org/</url>
5 changes: 5 additions & 0 deletions bagel/src/main/scala/org/apache/spark/bagel/Bagel.scala
@@ -72,6 +72,7 @@ object Bagel extends Logging {
var verts = vertices
var msgs = messages
var noActivity = false
var lastRDD: RDD[(K, (V, Array[M]))] = null
do {
logInfo("Starting superstep " + superstep + ".")
val startTime = System.currentTimeMillis
@@ -83,6 +84,10 @@
val superstep_ = superstep // Create a read-only copy of superstep for capture in closure
val (processed, numMsgs, numActiveVerts) =
comp[K, V, M, C](sc, grouped, compute(_, _, aggregated, superstep_), storageLevel)
if (lastRDD != null) {
lastRDD.unpersist(false)
}
lastRDD = processed

val timeTaken = System.currentTimeMillis - startTime
logInfo("Superstep %d took %d s".format(superstep, timeTaken / 1000))
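
The Bagel change above keeps a handle on the previous superstep's RDD and unpersists it once the next superstep has been computed, so only one iteration stays cached at a time. A rough PySpark sketch of the same pattern for a generic iterative job, assuming a user-supplied step function (all names here are illustrative, not from the patch):

```python
# Illustrative sketch of the keep-one-iteration-cached pattern from Bagel.scala,
# written against plain PySpark RDDs. `step` is a hypothetical transformation.
from pyspark import SparkContext

def run_iterations(sc, data, step, num_iterations):
    rdd = sc.parallelize(data).cache()
    for _ in range(num_iterations):
        new_rdd = step(rdd).cache()
        new_rdd.count()       # force evaluation of the new iteration first
        rdd.unpersist()       # then drop the previous one, like lastRDD.unpersist(false)
        rdd = new_rdd
    return rdd
```
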
30 changes: 30 additions & 0 deletions bin/beeline
@@ -0,0 +1,30 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

#
# Shell script for starting BeeLine

# Enter posix mode for bash
set -o posix

# Figure out where Spark is installed
FWDIR="$(cd `dirname $0`/..; pwd)"

CLASS="org.apache.hive.beeline.BeeLine"
exec "$FWDIR/bin/spark-class" $CLASS "$@"
1 change: 1 addition & 0 deletions bin/compute-classpath.sh
@@ -52,6 +52,7 @@ if [ -n "$SPARK_PREPEND_CLASSES" ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive-thriftserver/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/yarn/stable/target/scala-$SCALA_VERSION/classes"
fi

20 changes: 15 additions & 5 deletions bin/pyspark
@@ -23,12 +23,18 @@ FWDIR="$(cd `dirname $0`/..; pwd)"
# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

source $FWDIR/bin/utils.sh

SCALA_VERSION=2.10

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
function usage() {
echo "Usage: ./bin/pyspark [options]" 1>&2
$FWDIR/bin/spark-submit --help 2>&1 | grep -v Usage 1>&2
exit 0
}

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
usage
fi

# Exit if the user hasn't compiled Spark
@@ -52,7 +58,7 @@ export PYSPARK_PYTHON

# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

# Load the PySpark shell.py script when ./pyspark is used interactively:
export OLD_PYTHONSTARTUP=$PYTHONSTARTUP
@@ -66,10 +72,11 @@ fi
# Build up arguments list manually to preserve quotes and backslashes.
# We export Spark submit arguments as an environment variable because shell.py must run as a
# PYTHONSTARTUP script, which does not take in arguments. This is required for IPython notebooks.

SUBMIT_USAGE_FUNCTION=usage
gatherSparkSubmitOpts "$@"
PYSPARK_SUBMIT_ARGS=""
whitespace="[[:space:]]"
for i in "$@"; do
for i in "${SUBMISSION_OPTS[@]}"; do
if [[ $i =~ \" ]]; then i=$(echo $i | sed 's/\"/\\\"/g'); fi
if [[ $i =~ $whitespace ]]; then i=\"$i\"; fi
PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS $i"
@@ -90,7 +97,10 @@ fi
if [[ "$1" =~ \.py$ ]]; then
echo -e "\nWARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0." 1>&2
echo -e "Use ./bin/spark-submit <python file>\n" 1>&2
exec $FWDIR/bin/spark-submit "$@"
primary=$1
shift
gatherSparkSubmitOpts "$@"
exec $FWDIR/bin/spark-submit "${SUBMISSION_OPTS[@]}" $primary "${APPLICATION_OPTS[@]}"
else
# Only use ipython if no command line arguments were provided [SPARK-1134]
if [[ "$IPYTHON" = "1" ]]; then
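
The branch above hands a .py argument off to spark-submit, keeping spark-submit's own options separate from the application's arguments. For reference, a minimal, hypothetical script of the kind that path runs (file name and logic are illustrative, not part of the patch):

```python
# wordcount.py -- hypothetical example, run as:
#   ./bin/spark-submit [submit options] wordcount.py <input path>
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile(sys.argv[1])
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.collect():
        print("%s: %d" % (word, count))
    sc.stop()
```
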
2 changes: 1 addition & 1 deletion bin/pyspark2.cmd
@@ -45,7 +45,7 @@ rem Figure out which Python to use.
if [%PYSPARK_PYTHON%] == [] set PYSPARK_PYTHON=python

set PYTHONPATH=%FWDIR%python;%PYTHONPATH%
set PYTHONPATH=%FWDIR%python\lib\py4j-0.8.1-src.zip;%PYTHONPATH%
set PYTHONPATH=%FWDIR%python\lib\py4j-0.8.2.1-src.zip;%PYTHONPATH%

set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%FWDIR%python\pyspark\shell.py
3 changes: 2 additions & 1 deletion bin/run-example
@@ -29,7 +29,8 @@ if [ -n "$1" ]; then
else
echo "Usage: ./bin/run-example <example-class> [example-args]" 1>&2
echo " - set MASTER=XX to use a specific master" 1>&2
echo " - can use abbreviated example class name (e.g. SparkPi, mllib.LinearRegression)" 1>&2
echo " - can use abbreviated example class name relative to com.apache.spark.examples" 1>&2
echo " (e.g. SparkPi, mllib.LinearRegression, streaming.KinesisWordCountASL)" 1>&2
exit 1
fi

3 changes: 2 additions & 1 deletion bin/run-example2.cmd
@@ -32,7 +32,8 @@ rem Test that an argument was given
if not "x%1"=="x" goto arg_given
echo Usage: run-example ^<example-class^> [example-args]
echo - set MASTER=XX to use a specific master
echo - can use abbreviated example class name (e.g. SparkPi, mllib.LinearRegression)
echo - can use abbreviated example class name relative to com.apache.spark.examples
echo (e.g. SparkPi, mllib.LinearRegression, streaming.KinesisWordCountASL)
goto exit
:arg_given

4 changes: 2 additions & 2 deletions bin/spark-class
@@ -110,9 +110,9 @@ export JAVA_OPTS

TOOLS_DIR="$FWDIR"/tools
SPARK_TOOLS_JAR=""
if [ -e "$TOOLS_DIR"/target/scala-$SCALA_VERSION/*assembly*[0-9Tg].jar ]; then
if [ -e "$TOOLS_DIR"/target/scala-$SCALA_VERSION/spark-tools*[0-9Tg].jar ]; then
# Use the JAR from the SBT build
export SPARK_TOOLS_JAR=`ls "$TOOLS_DIR"/target/scala-$SCALA_VERSION/*assembly*[0-9Tg].jar`
export SPARK_TOOLS_JAR=`ls "$TOOLS_DIR"/target/scala-$SCALA_VERSION/spark-tools*[0-9Tg].jar`
fi
if [ -e "$TOOLS_DIR"/target/spark-tools*[0-9Tg].jar ]; then
# Use the JAR from the Maven build