Spark has been around for a while now, and we have been actively piloting it and following its development. In this blog post we want to introduce you to Apache Spark, give you a brief retrospective and announce the key features of Spark's second generation. In the following blog posts we will go into more detail and give you concrete examples, but for now let's start from the beginning, i.e. the vision behind Apache Spark.
The Apache Spark concept
Spark has been deployed in production more than 1,000 times and runs on big clusters with more than 8,000 nodes. This technology is definitely ready for wide adoption. Why?
The answer is simple – because it works quickly, efficiently and with great stability. That alone is not a very satisfactory answer… why? One of the main reasons is that it was born at a university. UC Berkeley brought together a number of research papers on SQL in the MPP world, ran countless experiments and benchmarks and, finally, made a lot of mistakes and learned from them ("fail-fast collaborative culture"). So, things that come out of an academic community ought to make sense and work! Our curiosity started rising in early 2015, with Apache Spark version 1.3. Spark can supposedly be integrated into almost any existing system that needs to process data – yeah, right! OK, my technical head thinks: it is all just a sales strategy, staged benchmarks, just another Big Data technology and a typical buzzword. Then we decided to try it, and we challenged Spark with a user query that we had been optimizing for years, one constrained by the available technologies and the user's infrastructure. After just two days of exploring Spark and preparing the environment, the conclusion was that Spark's response speed is more than impressive. We were pleasantly surprised by our first PoC. That user query usually took about 6 or 7 hours to run, and sometimes it did not finish at all in a traditional and relatively strong production environment. Once we rewrote it on Spark running on a laptop, the query executed in under 45 seconds on average. Yes, you read that correctly: 45 seconds instead of 6 hours. It wasn't until then that I understood why Apache Spark stands out from the many buzzwords and why everybody is talking about it. This actually works!
Let's move on… Now, how does Apache Spark work? You are wondering why it is so fast and what the secret of its success is? Spark runs on a cluster, which means it needs a coordinator (the driver program) that orchestrates all the available executors in the cluster, running on what we call the worker nodes. So, we have the driver together with the cluster manager as the master, and numerous workers/executors. Some of you might say this is nothing special, just the usual master/slave architecture. The driver is one JVM process, communicating with the executors, which are separate JVM processes.
Spark core architecture
The Spark driver has two key responsibilities. The first is translating the user program, i.e. forming a physical execution plan and defining the Spark tasks according to the configuration. The second is coordinating the execution of the individual distributed tasks across the executors on the worker nodes. The important thing here is that the driver creates the physical plan and distributes the tasks to the executors where the relevant data already resides, with the primary goal of avoiding shuffling data over the network. Apart from that, the driver keeps track of where cached data lives in the cluster and places it in the optimal location for all the executors that need it.
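To make the driver/executor split more concrete, here is a minimal sketch of a driver program in Scala. The application name, the master URL and the toy computation are our own illustrative choices, not something prescribed by Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver process starts here; "local[*]" is just a placeholder master URL.
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // The driver turns these transformations into a physical plan of tasks
    // and ships the tasks to the executors that hold the data partitions.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
    val total   = numbers.map(_ * 2).reduce(_ + _)   // the action triggers execution

    println(s"Total: $total")
    sc.stop()
  }
}
```

Notice that nothing in the code mentions workers explicitly: the driver and the cluster manager decide where each of the eight partitions is processed.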
The Spark Core engine is based on the MapReduce framework. It took the good recipes for linear scalability and fault tolerance from the basic Hadoop concept and extended the existing MapReduce with three very important ingredients. First, where MapReduce has to write intermediate results to the distributed file system, Spark can pass them directly to the next step of the process (the next transformation). The disk I/O of the operation, measured in milliseconds, is therefore drastically reduced, i.e. everything runs in RAM, measured in nanoseconds: from one transformation to the next you get a sequence executed as one batch, 100% in memory. The second important ingredient is the rich set of ready-made transformations that the user (developer, analyst) understands naturally, together with a number of additional APIs that let you work at the SQL level. The third, no less important ingredient is Spark's in-memory processing model. The Resilient Distributed Dataset (RDD) is an abstraction that lets the programmer perform transformations in memory across the entire cluster, which means that a dataset, once loaded, stays available for as long as it is needed. This approach avoids reloading and recomputing data from the disk subsystem. The next question is whether we need machines with plenty of RAM. The answer is yes, but not necessarily. The approach pays off in non-clustered mode too, because Spark makes maximum use of the RAM on the machines where the data itself lives, with maximum parallelism, and thereby minimizes I/O operations and network shuffling inside the cluster, or on a single machine.
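As an illustration of the in-memory idea, the sketch below (reusing the `sc` from the previous example; the HDFS path and the "ERROR" filter are made-up placeholders) caches an RDD so that the second action is served from RAM instead of re-reading the file:

```scala
// The path and the filter condition are illustrative assumptions.
val events = sc.textFile("hdfs:///data/events.log")
  .filter(_.contains("ERROR"))
  .cache()                                   // keep the filtered partitions in RAM

val errorCount   = events.count()                             // first action: reads the file, fills the cache
val distinctDays = events.map(_.take(10)).distinct().count()  // second action: served from memory
```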
Spark Core & more
On top of the basic architecture described above, Spark adds numerous useful functional APIs that provide an additional layer of abstraction: on one hand this decreases the developer's workload (Spark can reduce the amount of code by up to 20 times), and on the other hand it gives advanced analytics and data scientists a simpler view of the data and its processing. One of the initiatives and guidelines for the development of Spark is to solve the problem of the data science environment in which a data scientist wastes up to 80% of his time preparing and processing data before the actual analysis. Based on that, additional APIs have been developed whose purpose is to make data preparation easier:
- Spark SQL – an SQL API which lets you use standard SQL in Spark and thereby process structured and semi-structured data without knowing a programming language (see the sketch after this list)
- Spark Streaming – a module (API) for processing data in near real time
- MLlib – a module (API) for machine learning
- GraphX – a module (API) for processing graph structures in the Spark environment
- SparkR – a module (API) for writing R programs in the Spark environment
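For example, the Spark SQL API lets an analyst stay entirely in SQL. Here is a minimal sketch, assuming the Spark 2.0 SparkSession entry point and a made-up CSV file and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").getOrCreate()

// The file path and column names are illustrative assumptions.
val orders = spark.read
  .option("header", "true")
  .csv("/data/orders.csv")                 // semi-structured input becomes a DataFrame

orders.createOrReplaceTempView("orders")   // expose it to SQL

spark.sql(
  """SELECT customer_id, COUNT(*) AS order_cnt
    |FROM orders
    |GROUP BY customer_id
    |ORDER BY order_cnt DESC""".stripMargin
).show(10)
```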
Spark 2.0 – the second generation
After version 1.6, Spark 2.0 was released this summer, bringing many novelties. The biggest changes are aimed at optimization, simplification and the unification of batch and streaming processing.
One of the greatest improvements concerns optimization: Spark can be up to 10x faster thanks to the second phase of project Tungsten, whose main purpose is to make Spark behave like a compiler. With this project Spark got native memory management and runtime code generation. In version 1.6 Spark generated code only partially; now it can generate it for whole stages. In addition, the elements that produced expensive iterator calls were removed, and I/O was optimized by adding a built-in Parquet cache.
Tungsten second phase
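If you want to see whether whole-stage code generation applies to your own query, `explain()` prints the physical plan; in Spark 2.0 the stages compiled by Tungsten are marked with an asterisk. The DataFrame below is just a toy example and assumes an existing SparkSession named `spark`:

```scala
// A throwaway DataFrame, only meant to produce a plan we can inspect.
val df = spark.range(0, 10000000L)
  .selectExpr("id", "id * 2 AS doubled")
  .filter("doubled % 3 = 0")

df.explain()   // look for the WholeStageCodegen / '*' markers in the output
```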
The second important thing is Structured Streaming. Structured Streaming is intended for applications that need both batch and streaming processing. It can be used, for example, when we want to select data from a live stream, feed the results into ML model training and update the data in near real time. So we can run batch processing and streaming analysis on a single DataFrame object without violating data consistency. Before, we had to rely on the Lambda architecture, which is complex enough by itself, because it was necessary to maintain two versions of the code, one for streaming (current data) and one for batch (historical data), and later combine them through a NoSQL data store. Spark takes a revolutionary approach to cases like this and lets the developer handle all the data through just one DataFrame.
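Here is a minimal Structured Streaming sketch, assuming an existing SparkSession named `spark` and a socket source purely for illustration (in practice this would typically be Kafka or files):

```scala
// Read a live stream as an unbounded DataFrame; host and port are placeholders.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations could run unchanged on a static (batch) DataFrame.
val wordCounts = lines
  .selectExpr("explode(split(value, ' ')) AS word")
  .groupBy("word")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")        // keep the running totals up to date
  .format("console")
  .start()

query.awaitTermination()
```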
The third important improvement is the unification of the Dataset and DataFrame objects. A DataFrame is a set of rows with a clearly defined schema, while a Dataset is statically typed:
DataFrame = Dataset[Row]
If you didn't fully understand what this is about, don't worry and keep reading our blog, because we will discuss it in one of our next posts. :) The key idea is to have a rich abstraction layer that hides the program code and optimizes data access and processing thanks to project Tungsten, while the RDD remains available for low-level programming. A further novelty, SparkSession, acts as the SparkContext for the DataFrame/Dataset API and enables work on all levels, from reading data and managing metadata to additional configuration and control at the cluster-manager level. Dataset and DataFrame now fully support correlated and non-correlated SQL subqueries, which means that Spark supports the SQL:2003 standard. From version 2.0, Spark can run all 99 TPC-DS* queries.
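A short sketch of the unified API; the JSON file and the `Person` fields are illustrative assumptions:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
import spark.implicits._                     // encoders for case classes

val df: Dataset[Row] = spark.read.json("/data/people.json")   // untyped: a DataFrame
val people: Dataset[Person] = df.as[Person]                   // the same data, statically typed

people.filter(_.age > 30).show()             // compile-time checked lambda instead of a string expression
```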
Apart from these conveniences, Spark 2.0 brings many other improvements, as well as newly added components: ML pipelines, streaming ML in SparkR, a debug UI, new data sources, a Kafka connector and many others.
Applications
Matei Zaharia, the creator of Spark, remarked spontaneously in one of his webinars that "one of the goals was to bring advanced analytics to all occasions". :) In other words, the idea is for all the users and existing applications that use traditional SQL to "come to life" in the big data world. In the data science context, Spark aims to be for data scientists what the spreadsheet is for business analysts, uniting all sources in one place and enabling simple queries with SQL, and complex queries/procedures/algorithms in the programming environment that suits the data scientist best.
The other important application of Spark is in the development of continuous application systems, such as recommendation engines or fraud detection systems. In those applications the focus is still on processing all historical data, combined with the need to feed in fresh data, which affects the real-time processes. Up until now, these problems were solved with the so-called Lambda architecture, which was extremely complex and hard to develop, and even harder to maintain and deploy. From the Spark 2.0 point of view, Lambda is just a hack. :)
Fraud prevention is an ideal case for Spark 2.0 because it unites batch and streaming without undermining data consistency; if the data is not consistent, Spark will raise an error at runtime (continuous applications). Such a fraud-prevention system can also improve itself by using machine learning jobs (ML pipelines, MLlib)… Imagine a system that can learn what fraud looks like :) … and all of it controlled through just one application/system (batch + streaming) in a fast and interactive way.
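To give a rough idea of the machine-learning side, here is a hedged sketch of an ML pipeline that could sit behind such a fraud-detection application; the input table, the column names and the choice of logistic regression are our own assumptions, not a prescribed recipe:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Historical transactions with a 0/1 "is_fraud" label; path and columns are made up.
val transactions = spark.read.parquet("/data/transactions.parquet")

val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "hour_of_day", "merchant_risk"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("is_fraud")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model    = pipeline.fit(transactions)    // batch training on the history

// model.transform(newTransactions) can then score fresh records as they stream in.
```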
To wrap up the retrospective, and before your new beginning
Spark is one of the first unified engines for data processing in a clustered, distributed system that integrates functional APIs. Spark will build new API interfaces on these foundations, and in the future more and more "computation semantics" will be enabled. Matei Zaharia announced that in the near future (so it is currently being developed) you will be able to combine MLlib and GraphX, or GraphX and TensorFlow, probably at the level of the unified DataFrame API. Inside the Spark "factory", streaming (engine latency) and ML are being worked on intensively to ensure 24/7 operation. We have a bright future. The DataFrame API is also going to be optimized with new algorithms, so that it can be optimal on all abstraction levels. Spark is a computation engine and is not primarily a memory/storage manager, so for in-memory storage a dedicated tool such as Apache Tachyon (an in-memory caching engine) is a better fit. On the other hand, combining Spark and Tachyon can bring further improvements, for example Tachyon improving the cache for Spark jobs on the nodes.
How to start using Spark?
Thanks to Apache Spark's versatility and unified design, it can be used for different purposes and in different environments. If you are an analyst who works only with SQL and you struggle with large amounts of data, Spark is a great solution for you because it lets you switch to near-real-time processing using SQL, a language you already know. If you work in a data science environment, Spark can be useful too: it can reduce the time you need for preparing data by up to 80%.
We hope these few paragraphs answered your questions about Spark and will motivate you to try it in your environment and with your own data. If you are interested in setting off on a Spark adventure, we recommend our Spark courses:
- Introduction to Apache Spark – a detailed, interactive introduction to Spark Core and Spark SQL concepts, by writing code and reading the API docs rather than sitting through static slides.
- Apache Spark – advanced usage – if you already have basic experience with Spark, this course extends it with Spark Streaming, MLlib for machine learning and the GraphX API.
If we still haven't convinced you to adopt Spark in your environment, we invite you to read our next blog posts about Spark. There will be plenty more, because this is only the beginning…
* TPC Benchmark™ DS (TPC-DS): the benchmark standard for decision-support solutions, including Big Data (http://www.tpc.org/tpcds/)