
Apache Spark educational workshops

27. 02. 2018

Apache Spark is a framework for processing large amounts of data of all kinds.

Apache Spark is widely used for advanced analytics, data science and modern big-data architectures, as well as for complex batch (ETL) processing and real-time data processing. Spark contains several key components: Spark SQL for working with structured data, Spark Streaming for processing large volumes of live data, Spark MLlib for machine learning, Spark GraphX for graph processing and SparkR for statistical processing in the R language. Spark can run standalone, on a YARN (Hadoop) cluster or on Mesos – essentially, in any environment. Spark is a polyglot framework: it abstracts its usage so that an organization can choose the programming language (Python, Java, Scala or R) that best fits its environment and business. The examples in this training will be written primarily in Python, but other languages, e.g. Scala, will also be used. Participants will work in standalone and cluster environments, depending on the assignment.

Target audience

This training is aimed at IT architects, development engineers and business analysts.

Workshop Modules

Introduction to Spark is a workshop intended for everyone who wants to learn basic programming in the Spark framework. The workshop is modular, meaning that participants can choose modules depending on their points of interest. For instance, a participant can choose modules 1, 2 and 5; 1, 2 and 3; or 1, 2 and 4, as well as combine them however they see fit.

Data Science is a topic suitable for BI experts, business analysts and predictive analysts.

Data Engineering is a topic suited to system administrators, development engineers and system architects.

Workshop descriptions by module


To take part in this training, you need a knowledge of OO programming, as well as a basic knowledge of SQL and of Python and/or Scala.

Introduction to Spark environments

This module covers an introduction to Spark and a basic explanation of how it works. All aspects of this technology will be explained by means of interactive examples. System architecture, Apache Hadoop, the basics of the MapReduce framework and the basic Spark APIs (RDD, DataFrame, Spark SQL) will all be covered in detail.

This module is intended for IT architects, development engineers and business analysts.

Spark analytics – Spark in practice

The second module deals with developing a Spark application using previously acquired knowledge. The application will deal with advanced analytics and will process targeted data – from loading and cleaning large data sets to the final visualization.

Advanced usage – Spark in practice

The third module focuses on deploying Spark applications to production. How to set up a development environment, how to optimize Spark apps and how to run them in a Hadoop environment will be explained in detail.

This module is aimed at IT architects and development engineers.

Advanced usage – Streaming and real-time data processing

The fourth module is aimed at participants that want to acquire advanced skills in Spark, such as Spark Streaming. The participants will be taught how to set up a streaming process for real-time data processing and will then upgrade it with streaming elements.

This module is intended for IT architects and development engineers.

Advanced usage – Data Science

The star of the fifth module is the MLlib machine-learning library. Participants will build a machine-learning model and walk through the process of training it. Several examples based on Spark GraphX will show how to use graph processing successfully in practice.

This module is aimed at business analysts and data science engineers.

Detailed description of the workshop, day by day

Day 1 – Introduction to Apache Spark

What is Apache Spark?

Using Spark (standalone, cluster, shell)

What is an RDD and how is it used?

  • Transformations
  • Actions
  • “Lazy evaluation”
  • Data persistence – caching
  • Functions (Python, Scala, Java)

Key/Value structures

  • Types of data – creating and maintaining
  • Transformation (aggregation, sorting, join)


DataFrames

  • What is a DataFrame?
  • Transformations and actions on DataFrames
  • Advanced usage

Loading and saving data

  • Working with different file formats (TXT, JSON, AVRO, Parquet, Seq File)
  • Working with different data repositories and file systems (Local, Amazon S3, HDFS)
  • Databases (RDBMS, Cassandra, HBase)
Spark SQL

  • Connecting to Spark SQL
  • SQL in applications (initialization, DataFrames, caching)
  • Functions

Day 2 – Spark Analytics – Spark in practice

Collecting data for the full-day task in the context of advanced analytics

  • Getting to know the data set
  • Job requirements and task goals

Data preparation and cleaning

  • Profiling data and existing structures
  • Cleaning data by removing anomalies and potential errors
Aggregating data

  • Methods of aggregating data
  • Finding the optimal case
  • Process optimization

Application debugging

Data visualization

Day 3 – Spark in production

Development environment:

  • Workspace (PyCharm, Anaconda, Zeppelin and Jupyter notebook)
  • Application build
  • Application deployment
  • Debugging
  • Testing (Unit test, integration test)

Running Spark applications on cluster environments (Hadoop YARN)

  • Executor and worker optimization
  • Hardware sizing
  • Monitoring
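Executor and worker settings are typically passed at submit time; an illustrative invocation (all counts and memory sizes are placeholder values to be tuned per cluster, not recommendations):

```shell
# Illustrative spark-submit on a YARN cluster; values are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  my_app.py
```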

Spark Memory Management

  • Persistence – Caching
  • Tungsten
  • Garbage collection

24/7 Streaming operations

  • Streaming process control mechanisms – Checkpointing
  • Streaming fault-tolerance (driver, worker, receiver, process guarantees)
  • Data supervision
  • Process supervision
  • UI Streaming

Performance management

  • System resizing (batch and window size)
  • Level of parallelism

Day 4 – Spark Streaming – real-time data processing in practice

Introduction to streaming processes

Data sets for a streaming process

  • Correcting existing data sets
  • Preparation for streaming

Spark Streaming

  • Introduction to Spark Streaming
  • Basic concepts and operations
  • Streaming operations (Transformations, window operations)

Spark Structured Streaming

  • Structured Streaming API concepts
  • Continuous application
  • Operations on streaming
  • Stream management and queries
  • Stream recovery and checkpointing

Operations for controlling the stream

  • Output operations (SaveDStream)
  • Data stream sources (basic sources, socket, Kafka, files)
  • Managing multiple sources
  • Resizing clusters for streaming operations

Day 5 – Data Science

Using Spark MLlib for machine learning

The basics of machine learning

Types of data – vectors

Using the most common algorithms with concrete Spark examples

  • Classifications and regression
  • “Feature Extraction”
  • Clustering
  • Dimension reduction

Tricks and best practices

Model evaluation

Spark GraphX – graph processing

The basics of graphs and why we use them

How to prepare a data set (GraphFrames)

Graph algorithms with Spark GraphX

  • ShortestPaths
  • PageRank
  • ConnectedComponents

For more information, feel free to contact us!
