It Takes a Spark to Fire up Big Data

The comparative newcomer Apache Spark is rapidly gaining ground in big data analytics. Data analysts and enterprises alike are talking about it, and small wonder: it is one of the liveliest open-source big data projects on the scene, and it is extremely versatile, with applications across a wide range of workloads.

Apache Spark is an open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The interface centers on the resilient distributed dataset (RDD), a collection of data items distributed over a cluster of machines. Spark was created to overcome the limitations of MapReduce, which forces a linear flow of data through map and reduce stages.
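To make the RDD idea concrete, here is a toy, single-machine sketch in plain Python: a dataset split into partitions, with chainable transformations that derive a new dataset from an old one. The class and method names are illustrative, not Spark's actual API, and unlike real Spark the transformations here run eagerly for brevity.

```python
# A toy sketch of the RDD idea: a dataset split into partitions,
# with transformations that build a new dataset from an old one.
# Names are illustrative, not Spark's API.

class ToyRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # list of lists of records

    def map(self, fn):
        # Transformation: derive a new dataset by applying fn to each record.
        return ToyRDD([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        # Transformation: derive a new dataset keeping records matching pred.
        return ToyRDD([[x for x in p if pred(x)] for p in self.partitions])

    def collect(self):
        # Action: gather all partitions back to the driver program.
        return [x for p in self.partitions for x in p]

data = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two "partitions"
result = data.map(lambda x: x * 10).filter(lambda x: x > 20).collect()
print(result)  # [30, 40, 50, 60]
```

Note how each transformation returns a new dataset rather than mutating the old one; that immutability is what lets Spark chain operations freely instead of forcing MapReduce's linear flow.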

Spark was created in 2009 by Matei Zaharia at UC Berkeley's AMPLab. It was open sourced in 2010, and in 2014 Databricks, the company Zaharia co-founded, used Spark to set a world record in large-scale sorting. By 2015, Spark had more than a thousand contributors, making it one of the most active open-source big data projects of all time.

One of the main features of this powerful big data engine is MLlib, a distributed machine learning framework. Thanks to Spark's distributed memory-based architecture, MLlib runs up to nine times as fast as the disk-based Apache Mahout and scales better than Vowpal Wabbit. MLlib ships with many widely used statistical and machine learning algorithms, which greatly simplifies building large-scale machine learning pipelines.
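The core pattern behind a distributed ML framework like MLlib can be sketched in a few lines: each partition of the data computes a partial statistic locally, and the driver combines the partials into a global result. The helper names below are illustrative, not MLlib's API.

```python
# Hedged sketch of distributed aggregation: each partition computes a
# partial (sum, count) locally; the driver combines them into a mean.
# Names are illustrative, not MLlib's API.
from functools import reduce

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]  # data split across workers

def partial_stats(part):
    return (sum(part), len(part))            # local "map" step per worker

def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])        # "reduce" step on the driver

total, count = reduce(combine, map(partial_stats, partitions))
print(total / count)  # 3.0
```

Many MLlib algorithms, such as gradient descent, follow this shape: only small partial results cross the network, while the bulk of the data stays in memory on the workers.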

Spark's API mimics Scala's collection API, both in shape and in the way it functions. A single stack handles SQL, streaming, and graph analytics. Analysts and developers alike love Spark for its ability to rapidly query, analyze, and manipulate large amounts of data. In essence, it is a reasonable alternative to Hadoop MapReduce, with its own pluses and minuses.

Spark operates in memory, quickly and elegantly, and can efficiently process terabytes of data at once, unlike Hadoop MapReduce. The differences between the two run deep, so let's run a brief side-by-side comparison.

As we've noted, Spark processes data in memory, while Hadoop MapReduce writes to disk after each map and reduce step. Spark's Scala-based API is generally easier to work with than MapReduce's more verbose Java API. Spark is also up to one hundred times faster for some workloads, leaving Hadoop MapReduce in the dust, and users will appreciate its lower latency. Spark supports iterative computation; Hadoop MapReduce allows only a single pass of batch computation per job. Finally, Spark schedules its own tasks, whereas Hadoop MapReduce depends on external schedulers.
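The iterative-computation difference is easy to demonstrate with a small sketch: a MapReduce-style loop goes back to storage on every iteration, while a Spark-style loop reads once and iterates over a cached in-memory copy. The `read_counted` helper below is a hypothetical stand-in for expensive disk I/O.

```python
# Sketch of why in-memory caching matters for iterative jobs.
# read_counted() stands in for an expensive disk read (hypothetical).

reads = 0

def read_counted():
    global reads
    reads += 1
    return list(range(1_000))   # pretend this came from disk

# MapReduce style: every iteration re-reads the input from storage.
for _ in range(5):
    data = read_counted()
    total = sum(data)
mapreduce_reads = reads

# Spark style: read once, keep the working set in memory, iterate freely.
reads = 0
cached = read_counted()
for _ in range(5):
    total = sum(cached)
spark_reads = reads

print(mapreduce_reads, spark_reads)  # 5 1
```

For iterative algorithms such as k-means or PageRank, which may loop dozens of times over the same dataset, that one-versus-many difference in I/O is where much of Spark's speedup comes from.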

One of Spark's strengths is that it can work with the storage layer Hadoop uses, such as the Hadoop Distributed File System, and it also integrates easily with Cassandra, HBase, MongoDB, and other big data frameworks. This makes it a superior choice for real-time machine learning on big data: you can run iterative queries on large datasets over and over.


If you are interested in Spark training, it doesn't take long to learn, perhaps as little as fifteen hours. A Spark training program will typically compare Spark with Hadoop and its Distributed File System (HDFS), demonstrate the limitations of MapReduce and how Spark overcomes them, introduce Spark's components, and cover common algorithms, graph analysis, machine learning, and running Spark on a cluster. Because Spark runs on many platforms in several languages, you should already have a working familiarity with Python, Java, or Scala, along with RDDs and their operations, before learning to write Spark applications. Understanding the fundamentals of the Scala programming language is particularly important, as is mastering SQL queries with Spark SQL.

A good course will also introduce you to Spark ML and GraphX programming, as well as the enterprise data centre, common Spark algorithms, and Spark Streaming.

And speaking of Spark Streaming, this component is particularly useful in forensic work such as detecting fraudulent financial transactions. IT crime detection and prevention is a growing field with no sign of slowing any time soon. For the most part, though, Spark appeals to big data analysts and architects, software developers, data scientists, and engineers.

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on them, which allows the same code written for batch analytics to be applied to streaming analytics and makes it easy to implement a lambda architecture. Spark Streaming has built-in support for consuming from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
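The mini-batch idea can be sketched in plain Python: chop the incoming stream into small chunks and apply the very same function you would use for batch analytics to each chunk. Both helpers below are illustrative, not Spark Streaming's API.

```python
# Micro-batching sketch: the same batch-analytics function is applied
# to each small chunk of the stream. Helper names are illustrative.

def word_count(batch):
    # The "batch analytics" function: count words in a list of lines.
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def micro_batches(stream, size):
    # Chop the stream into fixed-size chunks, much as a DStream
    # slices incoming data by time interval.
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

stream = ["spark streams data", "data flows", "spark scales", "streams scale"]
results = [word_count(batch) for batch in micro_batches(stream, 2)]
for r in results:
    print(r)
```

Because `word_count` knows nothing about streaming, the same code serves both the batch and the speed layer of a lambda architecture, which is exactly the reuse the mini-batch model enables.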

For cluster management, Spark requires a cluster manager and supports Hadoop YARN, Apache Mesos, or standalone mode using the native Spark cluster manager. It also requires a distributed storage system and can interact with HDFS, the MapR File System, Cassandra, OpenStack Swift, Amazon S3, Kudu, and a wide variety of other systems.

Spark can also support a pseudo-distributed local mode for development or testing. In such a case, it runs on one machine with one executor for each CPU core.
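The "one executor per CPU core" idea can be imitated with a worker pool sized to the machine's core count. This is purely illustrative of the scheduling idea, not how Spark itself launches executors; a thread pool stands in for executor processes here.

```python
# Sketch of local mode's "one worker per CPU core": a pool sized to the
# core count spreads independent tasks across workers. Illustrative only;
# Spark launches executor processes, not Python threads.
import os
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

cores = os.cpu_count() or 1             # one worker per available core
with ThreadPool(processes=cores) as pool:
    results = pool.map(square, range(8))  # tasks spread across workers
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```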

At the base of the project is Spark Core, which handles distributed task dispatching, scheduling, and I/O through an application programming interface for Java, Python, Scala, and R built on the RDD abstraction. This interface centers on a driver program that invokes parallel operations on the cluster. Fault tolerance is maintained by tracking the lineage of every RDD so that it can be reconstituted in case of data loss. (An RDD can hold any type of Scala, Python, or Java object.)
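Lineage-based fault tolerance can be sketched as follows: each derived dataset records how it was built from its source, so a lost partition can be recomputed by replaying that recipe rather than restored from a replica. The class below is a minimal illustration, not Spark's internals.

```python
# Sketch of lineage-based fault tolerance: a derived dataset records the
# transformations that produced it, so it can be recomputed from the
# source after data loss. Class names are illustrative, not Spark's.

class Lineage:
    def __init__(self, source, transforms=()):
        self.source = source          # the original input data
        self.transforms = transforms  # ordered recipe for rebuilding

    def map(self, fn):
        # Record the transformation instead of materializing the result.
        return Lineage(self.source, self.transforms + (fn,))

    def compute(self):
        # Replay the recorded transformations from the source; this is
        # exactly what recovery after a lost partition would do.
        data = list(self.source)
        for fn in self.transforms:
            data = [fn(x) for x in data]
        return data

ds = Lineage([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
print(ds.compute())  # [4, 6, 8]
```

Because the recipe is cheap to store while the data may be huge, recomputing from lineage is often far cheaper than replicating every intermediate dataset across the cluster.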