How Spark Illuminates Deep Learning
Data scientists everywhere are delving more deeply into deep learning (DL). If you’re only skimming the surface of this trend, you might think that the Spark community, which focuses on broader applications of machine learning, is watching it all from the sidelines. Though Spark is certainly at the forefront of many innovations in machine intelligence, DL industry tools and frameworks—such as TensorFlow and Caffe—seem to be grabbing much of the limelight now.
Be that as it may, Spark is playing a significant, growing, and occasionally unsung role in the DL revolution. Developers of convolutional, recurrent, and other DL models use Spark in their projects for the following reasons:
- Available platforms, libraries, and tools: Spark lets DL developers quickly train and deploy multi-layered neural nets using libraries and compute clusters that are already at their disposal.
- Familiar computing model and development framework: Spark allows DL developers to get up to speed on distributed architectures without having to master an unfamiliar, lower-level computing model such as NVIDIA's CUDA.
- Flexible execution and deployment options: Spark facilitates developer experimentation with DL architectures that combine model training on Spark clusters (horizontally scalable, CPU-based, multi-node, in-memory) with DL-optimized accelerators that perform fast matrix manipulations, using DL-optimized algorithm libraries, on GPU-based single-node co-processors.
With all of that in mind, we can track Spark's growing adoption in the DL world through its incorporation into the following open-source industry initiatives, tools, frameworks, and approaches:
- Spark for DL optimization: Spark's machine learning tools are complementary to DL libraries such as TensorFlow, as Tim Hunter of Databricks discusses in this recent blog post. In particular, Spark ML is well suited for hyperparameter tuning and parallelized training of models that one might develop in TensorFlow. Hunter presents test results demonstrating that Spark ML can help optimize TensorFlow DL hyperparameters such as the number of neurons and the learning rate. In this way, Spark can help reduce DL model training time and boost predictive accuracy. Spark's distributed, multi-node, in-memory execution can significantly speed and scale the training of TensorFlow models compared with running them on a single node. The TensorFlow library can be installed on Spark clusters as a regular Python library, and TensorFlow models can be embedded directly in machine-learning pipelines alongside Spark ML jobs. Hunter notes that Databricks, the primary contributor to Spark, is committed to providing deeper integration between TensorFlow and the rest of the Spark framework. (For a sense of how distributing hyperparameter trials across a cluster looks in practice, see the first sketch after this list.)
- Spark for distributed DL training: SparkNet, as discussed by Matthew Mayo in this recent KDnuggets article, plugs into Spark's batch-oriented machine-learning pipeline to drive a distributed, parallelizable algorithm for fast, scalable training of DL models. Developed at UC Berkeley's AMPLab and described in greater detail in this paper, SparkNet includes an interface for reading from Spark Resilient Distributed Datasets (RDDs), a Scala wrapper for interacting with the C++-based Caffe DL framework, and a lightweight DL tensor library. SparkNet uses parallelized stochastic gradient descent (SGD) to minimize cross-node communication during DL training; the second sketch after this list illustrates the general pattern in miniature. Within this environment, Spark handles data cleansing, preprocessing, and other machine-learning pipeline tasks through a distributed architecture that keeps datasets in memory, thereby eliminating expensive disk writes.
- Spark for multi-language training of DL models: Deeplearning4j (DL4J) leverages Spark clusters for fast, distributed, in-memory training of DL models developed in Scala or Java. DL4J jobs execute in Java Virtual Machines. As discussed here, DL4J supports two Scala APIs: ScalNet and ND4S. The framework shards large datasets for parallelized multicore training of separate DL models, and uses a centralized model to iteratively average the parameters produced by the separate neural nets (the parameter-averaging sketch after this list illustrates the same idea).
- Spark and SystemML for modular DL development: As reported in this recent article, the IBM Spark Technology Center is developing a scalable deep learning library for the open-source, Spark-based Apache SystemML. Written in SystemML's Declarative Machine Learning (DML) language, the neural network library enables development of DL models that leverage multiple optimizers. These models run through SystemML, which provides automatic, in-memory, Spark-based parallelized training against large datasets. The library gives developers a simple API for swapping modular DL network building blocks into and out of any given model.
- Spark and HDFS for accelerated DL training through SGD: As discussed on this website, DeepDist is an open-source tool that leverages Spark and asynchronous SGD to accelerate DL training on HDFS data. It provides DL developers with a Python interface and uses a model server to compute distributed gradient updates on partitions of a Spark RDD. Constant cross-node synchronization of gradient updates within Spark clusters enables rapid convergence during DL model training, and adaptive adjustment of learning rates lets developers boost training speed further. (The third sketch after this list outlines the gradient/descent callback pattern.)
- Spark-native DL libraries and model import on horizontally scalable clusters: As discussed on this GitHub page, BigDL is an open-source distributed deep learning library for Apache Spark. It enables developers to write DL models as standard Spark programs that execute on Spark CPU clusters using data stored in Hadoop clusters (HDFS, HBase, or Hive). Implemented in Scala and modeled on the Torch scientific-computing framework, BigDL supports comprehensive DL modeling through an embedded tensor and neural-network library, and it can load pre-trained Caffe or Torch models into Spark programs. BigDL uses the Intel Math Kernel Library, multi-threaded programming in each Spark task, and synchronous mini-batch SGD on Spark.
- Spark framework for DL training library and platform: As discussed on its GitHub page, OpenDL, which follows Google's DistBelief approach, provides a DL-training library that runs on the Spark framework. The framework splits DL-training data into shards and trains each DL-model replica on a specific shard. OpenDL model replicas can train on the data in different ways, based on different gradient-update algorithms.
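To make some of these patterns more concrete, here are three short, hedged sketches. First, the hyperparameter-tuning approach described in the Databricks item above: distribute TensorFlow training runs across a Spark cluster, one hyperparameter combination per task. This is illustrative only; the toy dataset, the model, and the grid values are placeholders, it uses TensorFlow's bundled Keras API rather than the code from Hunter's blog, and it assumes TensorFlow is installed on every worker node.

```python
# Hypothetical sketch: parallel hyperparameter search for a small TensorFlow
# (Keras) model, with Spark farming each configuration out to a worker.
# Assumes TensorFlow is installed on all Spark workers as a regular Python library.
import itertools
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tf-hyperparam-search").getOrCreate()
sc = spark.sparkContext

# Toy training data, broadcast so every task can read it without reshuffling.
rng = np.random.RandomState(42)
X = rng.randn(2000, 10).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")
data_bc = sc.broadcast((X, y))

# The grid of hyperparameter combinations to evaluate in parallel (placeholder values).
grid = list(itertools.product([0.001, 0.01, 0.1],   # learning rate
                              [16, 32, 64]))        # hidden units

def train_one(params):
    """Train one TensorFlow model on a worker and return its validation accuracy."""
    import tensorflow as tf                          # imported on the worker
    learning_rate, hidden_units = params
    X_local, y_local = data_bc.value
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X_local, y_local, epochs=5, batch_size=64,
                        validation_split=0.2, verbose=0)
    return params, float(history.history["val_accuracy"][-1])

# One Spark task per hyperparameter combination; pick the best result on the driver.
results = sc.parallelize(grid, numSlices=len(grid)).map(train_one).collect()
best = max(results, key=lambda r: r[1])
print("best (learning_rate, hidden_units):", best)
```

Because each trial is an independent training run, Spark can schedule them as ordinary tasks; no changes to TensorFlow itself are required.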
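Second, the distributed-training loop that SparkNet, DL4J, and BigDL each implement in far more sophisticated form: partition the data, compute gradients locally on each shard, and combine the results centrally. The sketch below is purely conceptual, using nothing but PySpark and NumPy on a toy logistic-regression model; it does not touch any of those projects' actual APIs.

```python
# Conceptual sketch of synchronous, data-parallel training on Spark: broadcast the
# current weights, let each partition compute a local gradient on its shard,
# combine the gradients centrally, update, and repeat. Toy logistic regression only.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("param-averaging-sketch").getOrCreate()
sc = spark.sparkContext

# Toy dataset: (features, label) rows partitioned across the cluster.
rng = np.random.RandomState(0)
X = rng.randn(10000, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
data = sc.parallelize(list(zip(X, y)), numSlices=8).cache()
n = data.count()

def partition_gradient(rows, w):
    """Gradient of the logistic loss over one partition, given the broadcast weights w."""
    grad = np.zeros_like(w)
    for features, label in rows:
        pred = 1.0 / (1.0 + np.exp(-np.dot(w, features)))
        grad += (pred - label) * features
    return [grad]                                      # one partial gradient per partition

w = np.zeros(20)
lr = 0.5
for step in range(20):
    w_bc = sc.broadcast(w)                             # ship current weights to workers
    grad = (data.mapPartitions(lambda rows: partition_gradient(rows, w_bc.value))
                .reduce(lambda a, b: a + b)) / n       # sum partials, average over examples
    w -= lr * grad                                     # centralized update, then repeat
print("trained weights (first five):", w[:5])
```

The real libraries add mini-batching, optimized native or GPU kernels, and smarter communication, but the control flow of broadcasting weights out and aggregating updates back has the same basic shape.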
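Third, DeepDist's asynchronous-SGD style revolves around two callbacks: a gradient function that runs on workers against partitions of an RDD, and a descent function that applies the resulting updates to the central model held by the model server. The sketch below follows that documented pattern from memory, so treat the import path, class name, and train() signature as assumptions to verify against the project itself; the model object and HDFS path are hypothetical stand-ins.

```python
# Hedged sketch of DeepDist's gradient/descent callback pattern (asynchronous SGD
# over an HDFS-backed RDD). The import, class name, and train() signature follow
# the project's documentation as recalled here; verify against the actual library.
import numpy as np
from pyspark import SparkContext
from deepdist import DeepDist            # assumption: package exposes a DeepDist class

class ToyModel(object):
    """Stand-in model: a single weight vector for logistic regression."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

sc = SparkContext(appName="deepdist-sketch")

# Each record: "label,f1,f2,...,fN" read from HDFS (hypothetical path and format).
def parse(line):
    vals = [float(x) for x in line.split(",")]
    return vals[0], np.array(vals[1:])

data = sc.textFile("hdfs:///data/train.csv").map(parse)

def gradient(model, records):
    """Runs on workers: compute a gradient update from one partition of the RDD."""
    grad = np.zeros_like(model.w)
    for label, features in records:
        pred = 1.0 / (1.0 + np.exp(-np.dot(model.w, features)))
        grad += (pred - label) * features
    return grad

def descent(model, update):
    """Runs on the model server: fold a worker's update into the central model."""
    model.w -= 0.1 * update

DeepDist(ToyModel(dim=20)).train(data, gradient, descent)
```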
Clearly, there is no shortage of open-source DL options for Spark developers who want to roll up their sleeves and apply their existing tools, libraries, and clusters to this hot new technology. If you're a machine learning developer who wants to take the next step into DL for continuous intelligence, please join us on February 15 for the livestream of the IBM Machine Learning Launch Event.
About the author: James Kobielus is a Data Science Evangelist for IBM. James spearheads IBM's thought-leadership activities in data science. He has spoken at leading industry events such as IBM Insight, Strata + Hadoop World, and Hadoop Summit. He has published several business technology books and is a popular contributor of original commentary on blogs and social media.