APACHE SPARK educational workshops

27. 02. 2018

Overview

Apache Spark is a framework for processing large volumes of data of many kinds. It is a popular system, widely used for advanced analytics, data science and modern big data architectures, as well as for complex batch (ETL) processing and real-time data processing. Spark contains several key components: Spark SQL for working with structured data, Spark Streaming for processing large volumes of live data, Spark MLlib for machine learning, Spark GraphX for graph processing and SparkR for statistical data processing in the R language. Spark can run in standalone mode, on a YARN (Hadoop) cluster or on Mesos, so it fits into a wide range of environments. Spark is a polyglot framework: rather than imposing a single programming language, it lets an organization choose the language (Python, Java, Scala, R) and development environment that best fit its needs. The examples in this training will be written primarily in Python, but other programming languages, e.g. Scala, will also be used. Participants will work in standalone and cluster environments, depending on the assignment.

Target audience

This training is aimed at IT architects, development engineers and business analysts.

Workshop Modules

Introduction to Spark is a workshop intended for everyone who wants to learn basic programming in the Spark framework. The workshop is modular, so participants can choose modules according to their interests. For instance, a participant can choose modules 1, 2 and 5, or 1, 2 and 3, or 1, 2 and 4, or combine them in any other way they see fit.

Data Science is a topic suitable for BI experts, business analysts and predictive analysts.

Data Engineering is a topic suited to system administrators, development engineers and system architects.

Workshop description according to modules

WORKSHOP PARTICIPATION PREREQUISITES

To take part in this training, you need knowledge of object-oriented programming, as well as basic knowledge of SQL and of Python and/or Scala.

Introduction to Spark environments

This module introduces Spark and gives a basic explanation of how it works. All aspects of the technology will be explained through interactive examples. The system architecture, Apache Hadoop, the basics of the MapReduce framework and the basic Spark APIs (RDD, DataFrame, Spark SQL) will all be covered in detail.

This module is intended for IT architects, development engineers and business analysts.

Spark analytics – Spark in practice

The second module deals with developing a Spark application using the knowledge acquired in the first module. The application will perform advanced analytics on targeted data, covering every step from loading and cleaning large data sets to the final visualization.

Advanced usage – Spark in practice

The third module focuses on deploying Spark applications to production. It explains in detail how to set up a development environment, how to optimize Spark applications and how to run them in a Hadoop environment.

This module is aimed at IT architects and development engineers.

Advanced usage – Streaming and real-time data processing

The fourth module is aimed at participants who want to acquire advanced Spark skills, such as Spark Streaming. Participants will learn how to set up a process for real-time data processing and will then extend it with streaming elements.

This module is intended for IT architects and development engineers.

Advanced usage – Data Science

The fifth module centers on MLlib, the Spark library for machine learning. Participants will build a machine learning model and use it to walk through the process of training a model. Several examples based on Spark GraphX will show how graph processing can be applied successfully in practice.

This module is aimed at business analysts and data science engineers.

Detailed description of the workshop by day

Day 1 – Introduction to Apache Spark

What is Apache Spark?

Using Spark (standalone, cluster, shell)

What is an RDD and how is it used?

Transformations
Actions
“Lazy evaluation”
Data persistence – caching
Functions (Python, Scala, Java)
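
The difference between transformations, actions and lazy evaluation can be previewed without a Spark installation. The following is a plain-Python analogy built on generators, not the Spark API: `map` and `filter` stand in for transformations, and reading results with `islice` stands in for an action. The helper `traced_square` and the `log` list are hypothetical names used only to show when work actually happens.

```python
from itertools import islice

log = []  # records every element that is actually processed


def traced_square(x):
    """Square a number and remember that it was computed."""
    log.append(x)
    return x * x


numbers = range(1, 1_000_001)

# "Transformations": building the pipeline performs no computation yet.
squared = map(traced_square, numbers)
even_squares = filter(lambda v: v % 2 == 0, squared)
work_before_action = len(log)

# "Action": requesting results triggers the computation,
# and only as much of it as the result requires.
first_three = list(islice(even_squares, 3))
work_after_action = len(log)

print(first_three)
print(work_before_action, work_after_action)
```

Only six of the one million input elements are processed here, because the pipeline is evaluated on demand. Spark applies the same idea to RDDs, with the additional step of distributing the work across a cluster.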

Key/Value structures

Types of data – creating and maintaining
Transformation (aggregation, sorting, join)
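
As a rough preview of what key/value transformations do, here is a minimal plain-Python sketch, assuming small in-memory lists of `(key, value)` tuples. The functions `reduce_by_key` and `inner_join` and the sample data are hypothetical illustrations; in Spark the corresponding RDD operations are `reduceByKey`, `sortByKey` and `join`, which run distributed across the cluster.

```python
from collections import defaultdict

# Hypothetical sample data: lists of (key, value) pairs.
sales = [("apples", 3), ("pears", 5), ("apples", 4), ("plums", 1)]
prices = [("apples", 0.5), ("pears", 0.8)]


def reduce_by_key(pairs, func):
    """Aggregation: combine all values that share the same key."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())


def inner_join(left, right):
    """Join: pair up values for every key present in both datasets."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]


totals = reduce_by_key(sales, lambda a, b: a + b)  # aggregation
sorted_totals = sorted(totals)                     # sorting by key
joined = inner_join(sorted_totals, prices)         # join on key

print(sorted_totals)
print(joined)
```

Keys that appear in only one dataset, such as `plums` above, are dropped by the inner join.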

DataFrame

What is a DataFrame?
Transformations and actions on DataFrames
Advanced usage

Loading and saving data

Working with different file formats (text, JSON, Avro, Parquet, SequenceFile)
Working with different data repositories and file systems (Local, Amazon S3, HDFS)
Databases (RDBMS, Cassandra, HBase)

Spark SQL

Connecting to Spark SQL
SQL in applications (initialization, DataFrames, caching)
Functions

Day 2 – Spark Analytics – Spark in practice

Collecting data for the full-day task in the context of advanced analytics