Apache Spark 0.8.0 is a major release that includes many new capabilities and usability improvements. It’s also our first release in the Apache incubator. It is the largest Spark release yet, with contributions from 67 developers and 24 companies.
You can download Spark 0.8.0 as either a source package (4 MB tar.gz) or a prebuilt package for Hadoop 1 / CDH3 or CDH4 (125 MB tar.gz). Release signatures and checksums are available at the official Apache download site.
Spark now displays a variety of monitoring data in a web UI (by default at port 4040 on the driver node). A new job dashboard contains information about running, succeeded, and failed jobs, including percentile statistics covering task runtime, shuffled data, and garbage collection. The existing storage dashboard has been extended, and additional pages have been added to display total storage and task information per executor. Finally, a new metrics library exposes internal Spark metrics through several APIs, including JMX and Ganglia.
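As a flavor of how the metrics library is configured, here is a minimal sketch of a conf/metrics.properties file enabling the JMX and Ganglia sinks. The sink class names follow the org.apache.spark.metrics.sink package; the host, port, and period values are illustrative and should be checked against the monitoring documentation for your deployment:

    # Expose metrics from all Spark instances through JMX.
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
    # Report metrics to a Ganglia gmond (host/port/period are illustrative).
    *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
    *.sink.ganglia.host=239.2.11.71
    *.sink.ganglia.port=8649
    *.sink.ganglia.period=10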
This release introduces MLlib, a standard library of high-quality machine learning and optimization algorithms for Spark. MLlib was developed in collaboration with the UC Berkeley MLbase project. The current library contains seven algorithms, including support vector machines (SVMs), logistic regression, several regularized variants of linear regression, a clustering algorithm (KMeans), and alternating least squares collaborative filtering.
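To give a flavor of the API, here is a minimal Scala sketch of clustering with MLlib's KMeans. It assumes the 0.8-era interface, in which KMeans.train accepts an RDD of Array[Double]; the input file name is illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.KMeans

    object KMeansSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "KMeansSketch")
        // One point per line, whitespace-separated numeric features.
        val points = sc.textFile("kmeans_data.txt")
                       .map(_.split(' ').map(_.toDouble))
                       .cache()
        // Cluster the points into 2 groups, running at most 20 iterations.
        val model = KMeans.train(points, 2, 20)
        model.clusterCenters.foreach(c => println(c.mkString("(", ", ", ")")))
        sc.stop()
      }
    }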
The Python API has been extended with many previously missing features. This includes support for different storage levels, sampling, and various missing RDD operators. We’ve also added support for running Spark in IPython, including the IPython Notebook, and for running PySpark on Windows.
Spark 0.8 adds greatly improved support for running standalone Spark jobs on a YARN cluster. YARN support is no longer experimental and is now part of mainline Spark. Support for running against a secured YARN cluster has also been added.
Spark’s internal job scheduler has been refactored and extended to include more sophisticated scheduling policies. In particular, a fair scheduler implementation now allows multiple users to share an instance of Spark, which helps users running shorter jobs to achieve good performance, even when longer-running jobs are running in parallel. Support for topology-aware scheduling has been extended, including the ability to take into account rack locality and support for multiple executors on a single machine.
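Here is a minimal sketch of opting into the fair scheduler, assuming the 0.8-era configuration style in which the scheduling mode is set through a system property before the SparkContext is created and a per-thread pool is chosen with setLocalProperty; the pool name is illustrative:

    import org.apache.spark.SparkContext

    // Switch the scheduler from the default FIFO mode to FAIR mode.
    System.setProperty("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext("local", "FairSchedulingSketch")
    // Jobs submitted from this thread are placed in the "short-jobs" pool.
    sc.setLocalProperty("spark.scheduler.pool", "short-jobs")
    sc.parallelize(1 to 100).count()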
User programs can now link to Spark no matter which Hadoop version they need, without having to publish a version of spark-core specifically for that Hadoop version. An explanation of how to link against different Hadoop versions is provided here.
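For example, a sketch of an sbt build definition that depends on the published spark-core artifact together with the hadoop-client matching your cluster, rather than rebuilding spark-core; the version strings are illustrative:

    // In build.sbt: link against spark-core plus your cluster's Hadoop client.
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating",
      "org.apache.hadoop" % "hadoop-client" % "1.2.1"  // your Hadoop version
    )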
Spark’s EC2 scripts now support launching in any availability zone. Support has also been added for EC2 instance types which use the newer “HVM” architecture. This includes the cluster compute (cc1/cc2) family of instance types. We’ve also added support for running newer versions of HDFS alongside Spark. Finally, we’ve added the ability to launch clusters with maintenance releases of Spark in addition to launching the newest release.
This release adds documentation about cluster hardware provisioning and inter-operation with common Hadoop distributions. Docs are also included to cover the MLlib machine learning functions and new cluster monitoring features. Existing documentation has been updated to reflect changes in building and deploying Spark.
Other improvements in this release include the following:
- RDDs can now manually be dropped from memory with unpersist.
- The RDD class includes new operations: takeOrdered, zipPartitions, and top (several of these are exercised in the sketch after this list).
- A JobLogger class has been added to produce archivable logs of a Spark workload.
- The RDD.coalesce function now takes into account locality.
- The RDD.pipe function has been extended to support passing environment variables to child processes.
- The Hadoop save functions now support an optional compression codec.
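A short Scala sketch exercising several of these operations; it assumes an existing SparkContext named sc, and the paths and values are illustrative:

    import org.apache.hadoop.io.compress.GzipCodec

    val nums = sc.parallelize(1 to 1000)
    nums.cache()
    println(nums.top(5).mkString(", "))          // five largest elements
    println(nums.takeOrdered(5).mkString(", "))  // five smallest elements
    nums.unpersist()                             // manually drop the RDD from memory
    // Pass an environment variable through to the piped child process.
    val piped = nums.map(_.toString).pipe(Seq("cat"), Map("MY_VAR" -> "1"))
    // Save output compressed with an optional codec.
    nums.map(_.toString).saveAsTextFile("out", classOf[GzipCodec])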
This release also includes some compatibility changes:
- We’ve moved the RDD class to the org.apache.spark.rdd package (it was previously in the top-level package). The Spark artifacts published through Maven have also changed to the new package name.
- In the Java API, Scala’s Option class has been replaced with Optional from the Guava library.
- To link against arbitrary Hadoop versions, depend on hadoop-client instead of rebuilding spark-core against your version of Hadoop. See the documentation here for details.
- When building Spark with sbt, run sbt/sbt assembly instead of package.
Spark 0.8.0 was the result of the largest team of contributors yet; their contributions ranged from new features such as the takeOrdered function to bug fixes and build fixes. Thanks to everyone who contributed! We’d especially like to thank Patrick Wendell for acting as the release manager for this release.