Spark 0.9.0 is a major release that adds significant new features. It updates Spark to Scala 2.10, simplifies high availability, and updates numerous components of the project. This release includes a first version of GraphX, a powerful new framework for graph processing that comes with a library of standard algorithms. In addition, Spark Streaming is now out of alpha, and includes significant optimizations and simplified high availability deployment.
You can download Spark 0.9.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.
Spark now runs on Scala 2.10, letting users benefit from the language and library improvements in this version.
The new SparkConf class is now the preferred way to configure advanced settings on your SparkContext, though the previous Java system property method still works. SparkConf is especially useful in tests to make sure properties don’t stay set across tests.
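A minimal sketch of the new SparkConf style (the master URL, app name, and memory value below are illustrative, not prescribed defaults):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build the configuration explicitly instead of relying on
// Java system properties (which still work as before).
val conf = new SparkConf()
  .setMaster("local[2]")               // illustrative master URL
  .setAppName("ConfExample")           // illustrative app name
  .set("spark.executor.memory", "1g")  // illustrative setting

val sc = new SparkContext(conf)
```

Because the configuration is an ordinary object rather than global system properties, each test can construct its own SparkConf without settings leaking between tests.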
Spark Streaming is now out of alpha, and comes with simplified high availability and several optimizations.
- DStream and PairDStream classes have been moved from org.apache.spark.streaming to org.apache.spark.streaming.dstream to keep them consistent with org.apache.spark.rdd.RDD.
- DStream.foreach has been renamed to foreachRDD to make it explicit that it works on every RDD, not every element.
- StreamingContext.awaitTermination() allows you to wait for context shutdown and catch any exception that occurs in the streaming computation.
- StreamingContext.stop() now allows stopping the StreamingContext without stopping the underlying SparkContext.

GraphX is a new framework for graph processing that uses recent advances in graph-parallel computation. It lets you build a graph within a Spark program using the standard Spark operators, then process it with new graph operators that are optimized for distributed computation. It includes basic transformations, a Pregel API for iterative computation, and a standard library of graph loaders and analytics algorithms. By offering these features within the Spark engine, GraphX can significantly speed up processing pipelines compared to workflows that use different engines.
GraphX features in this release include:
GraphX is still marked as alpha in this first release, but we recommend that new users use it instead of the more limited Bagel API.
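As an illustration of the GraphX workflow described above, the sketch below loads a graph and runs one of the bundled analytics algorithms ("edges.txt" is a hypothetical path; the tolerance value is illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// "edges.txt" is a hypothetical whitespace-separated edge-list file
// with one "srcId dstId" pair per line.
val sc = new SparkContext("local", "GraphXExample")
val graph = GraphLoader.edgeListFile(sc, "edges.txt")

// Run a bundled analytics algorithm, then inspect the per-vertex results.
val ranks = graph.pageRank(0.001).vertices
ranks.take(5).foreach(println)
```

The graph itself is built from ordinary Spark RDDs, so the same SparkContext can mix graph operators with standard Spark transformations in one pipeline.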
- spark-shell now supports the -i option to run a script on startup.
- histogram and countDistinctApprox operators have been added for working with numerical data.

This release is compatible with the previous APIs in stable components, but several language versions and script locations have changed.
- spark-shell and pyspark have been moved into the bin folder, while administrative scripts to start and stop standalone clusters have been moved into sbin.
- DStream and PairDStream have been moved to the package org.apache.spark.streaming.dstream, and DStream.foreach has been renamed to foreachRDD. We expect the current API to be stable now that Spark Streaming is out of alpha.

We expect all of the current APIs and script locations in Spark 0.9 to remain stable when we release Spark 1.0. We wanted to make these updates early to give users a chance to switch to the new API.
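Putting the Spark Streaming changes together, the renamed foreachRDD and the new awaitTermination/stop behavior might be used like this (a hedged sketch; the socket source address and batch interval are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "localhost:9999" is a hypothetical text source.
val ssc = new StreamingContext("local[2]", "StreamingExample", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// foreachRDD (formerly DStream.foreach) runs an action on each generated RDD.
lines.foreachRDD(rdd => println("batch size: " + rdd.count()))

ssc.start()
ssc.awaitTermination()  // blocks, rethrowing any error from the computation
// From another thread, ssc.stop(false) would stop the streaming computation
// while leaving the underlying SparkContext running for reuse.
```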
The following developers contributed to this release:
- Vector.random() method

Thanks to everyone who contributed!