Apache Spark is a framework for processing large amounts of data of virtually any kind. It is widely used for advanced analytics, data science, modern big data architectures, complex batch (ETL) processing and real-time data processing. Spark consists of several key components: Spark SQL for working with structured data, Spark Streaming for processing large volumes of live data, Spark MLlib for machine learning, Spark GraphX for graph processing and SparkR for statistical processing in the R language. Spark can run standalone, on a YARN (Hadoop) cluster or on Mesos – essentially, in any environment. Spark is a polyglot framework: it exposes its API in several programming languages (Python, Java, Scala, R), so teams can work in whichever language best fits the organization or business. The examples in this course will be written primarily in Python, though other languages, e.g. Scala, will also be used. Participants will work in standalone and cluster environments, depending on the assignment.
Target audience
This training is aimed at IT architects, development engineers and business analysts.
Workshop Modules
Introduction to Spark is a workshop intended for everyone who wants to learn the basics of programming in the Spark framework. The workshop is modular: participants can choose modules according to their points of interest. For instance, a participant can take modules 1, 2 and 5; modules 1, 2 and 3; or modules 1, 2 and 4 – or combine them however they see fit.
Data Science is a topic suitable for BI experts, business analysts and predictive analysts.
Data Engineering is a topic suited to system administrators, development engineers and system architects.
Workshop descriptions by module
WORKSHOP PARTICIPATION PREREQUISITES
To take part in this training, you need knowledge of object-oriented programming, as well as basic knowledge of SQL and of Python and/or Scala.
Introduction to Spark environments
This module covers an introduction to Spark and a basic explanation of how it works. All aspects of the technology are explained through interactive examples. System architecture, Apache Hadoop, the basics of the MapReduce framework and the core Spark APIs (RDD, DataFrame, Spark SQL) are all covered in detail.
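The MapReduce model covered in this module can be previewed with a minimal standard-library Python sketch – plain Python, not the actual Hadoop or Spark API: a map phase emits (word, 1) pairs and a reduce phase sums the counts per key.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark makes big data simple", "big data big results"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts["big"])  # the word "big" appears three times
```

In the workshop, the same word-count pattern is one of the first exercises written against the real Spark API.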
This module is intended for IT architects, development engineers and business analysts.
Spark analytics – Spark in practice
The second module deals with developing a Spark application using the previously acquired knowledge. The application performs advanced analytics on a targeted data set – from loading and cleaning large volumes of data to the final visualization.
Advanced usage – Spark in practice
The third module focuses on deploying Spark applications to production. It explains in detail how to set up a development environment, how to optimize Spark applications and how to run them in a Hadoop environment.
This module is aimed at IT architects and development engineers.
Advanced usage – Streaming and real-time data processing
The fourth module is aimed at participants who want to acquire advanced Spark skills such as Spark Streaming. Participants will learn how to set up a streaming process for real-time data processing and will then extend it with additional streaming elements.
This module is intended for IT architects and development engineers.
Advanced usage – Data Science
The fifth module centers on the MLlib machine learning library. Participants will build a machine learning model and walk through the process of training it. Several examples based on Spark GraphX will then show how to use graph processing successfully in practice.
This module is aimed at business analysts and data science engineers.
Detailed description of the workshop by day
Day 1 – Introduction to Apache Spark
What is Apache Spark?
Using Spark (standalone, cluster, shell)
What is an RDD and how is it used?
- Transformations
- Actions
- “Lazy evaluation”
- Data persistence – caching
- Functions (Python, Scala, Java)
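The distinction between transformations (lazy) and actions (eager) listed above can be illustrated with a standard-library Python analogy: like Spark transformations, Python generators only describe a computation, and nothing runs until a terminal operation – the "action" – consumes the result. This is a plain-Python sketch, not the RDD API itself.

```python
evaluated = []

def numbers():
    # "Transformation" stage: building this generator runs nothing yet.
    for n in [1, 2, 3, 4, 5]:
        evaluated.append(n)           # record when a value is actually computed
        yield n * 10

pipeline = (x + 1 for x in numbers())  # chain another lazy step
assert evaluated == []                 # still lazy: nothing has been evaluated

result = sum(pipeline)                 # the "action": forces evaluation
assert evaluated == [1, 2, 3, 4, 5]
print(result)                          # (10+1)+(20+1)+(30+1)+(40+1)+(50+1) = 155
```

Spark behaves the same way: `map` and `filter` build a plan, while `collect`, `count` or `reduce` trigger execution.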
Key/Value structures
- Types of data – creating and maintaining
- Transformation (aggregation, sorting, join)
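The pair (key/value) operations listed above – aggregation, sorting and joins – have direct analogues in plain Python. A hedged standard-library sketch of the same ideas (the workshop exercises use Spark's `reduceByKey`, `sortByKey` and `join` instead; the data here is illustrative):

```python
sales = [("apples", 3), ("pears", 2), ("apples", 5), ("plums", 1)]
prices = [("apples", 0.5), ("pears", 0.8)]

# Aggregation by key (reduceByKey analogue): sum the values per key.
totals = {}
for key, value in sales:
    totals[key] = totals.get(key, 0) + value

# Sorting by key (sortByKey analogue).
sorted_totals = sorted(totals.items())

# Inner join on key (join analogue): keep only keys present in both datasets.
price_map = dict(prices)
joined = [(k, (qty, price_map[k])) for k, qty in totals.items() if k in price_map]

print(sorted_totals)  # [('apples', 8), ('pears', 2), ('plums', 1)]
```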
DataFrame
- What is DataFrame?
- Transformations and actions on DataFrame
- Advanced usage
Loading and saving data
- Working with different file formats (TXT, JSON, Avro, Parquet, SequenceFile)
- Working with different data repositories and file systems (Local, Amazon S3, HDFS)
- Databases (RDBMS, Cassandra, HBase)
Spark SQL
- Connecting Spark SQL
- SQL in applications (initialization, DataFrames, Caching)
- Functions
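The pattern taught in this section – register structured data, then query it with SQL from application code – can be previewed with the standard-library sqlite3 module. In the workshop the same flow uses a DataFrame registered as a temporary view and queried through Spark SQL; this is a plain-Python analogy with an illustrative table, not the Spark API.

```python
import sqlite3

# In Spark this would be a DataFrame registered as a temporary view;
# here an in-memory SQLite table stands in for it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ana", 31), ("Ivan", 24), ("Maja", 45)])

# The application issues SQL and receives rows back – the same shape
# as collecting the result of a Spark SQL query in the exercises.
rows = conn.execute(
    "SELECT name FROM users WHERE age > 30 ORDER BY name").fetchall()
print(rows)  # [('Ana',), ('Maja',)]
conn.close()
```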
Day 2 – Spark Analytics – Spark in practice
Collecting data for the full-day task in the context of advanced analytics
- Getting to know the data set
- Job requirements and task goals
Data preparation and cleaning
- Profiling data and existing structures
- Cleaning data by removing anomalies and potential errors
- Aggregating data
- Methods of aggregating data
- Finding the optimal case
- Process optimization
Application debugging
Data visualization
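The Day 2 pipeline above – profile, clean out anomalies, then aggregate – can be sketched in plain Python. The field names, sentinel value and valid range below are illustrative assumptions, not part of the course data set; the day's exercises perform the same steps with DataFrame operations.

```python
from collections import defaultdict

records = [
    {"sensor": "a", "temp": 21.5},
    {"sensor": "a", "temp": -999.0},   # sentinel value: a known anomaly
    {"sensor": "b", "temp": 19.0},
    {"sensor": "b", "temp": 20.0},
]

# Cleaning: drop records whose reading falls outside a plausible range.
clean = [r for r in records if -50.0 <= r["temp"] <= 60.0]

# Aggregation: average temperature per sensor.
sums, counts = defaultdict(float), defaultdict(int)
for r in clean:
    sums[r["sensor"]] += r["temp"]
    counts[r["sensor"]] += 1
averages = {s: sums[s] / counts[s] for s in sums}
print(averages)  # {'a': 21.5, 'b': 19.5}
```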
Day 3 – Spark in production
Development environment:
- Workspace (PyCharm, Anaconda, Zeppelin and Jupyter notebooks)
- Application build
- Application deployment
- Debugging
- Testing (Unit test, integration test)
Running Spark applications on cluster environments (Hadoop YARN)
- Executor and worker optimization
- Hardware sizing
- Monitoring
Spark Memory Management
- Persistence – Caching
- Tungsten
- Garbage collection
24/7 Streaming operations
- Streaming process control mechanisms – Checkpointing
- Streaming fault-tolerance (driver, worker, receiver, process guarantees)
- Data supervision
- Process supervision
- Streaming UI
Performance management
- System resizing (batch and window size)
- Level of parallelism
Day 4 – Spark Streaming – real-time data processing in practice
Introduction to streaming processes
Data sets for a streaming process
- Correcting existing data sets
- Preparation for streaming
Spark Streaming
- Introduction to Spark Streaming
- Basic concepts and operations
- Streaming operations (Transformations, window operations)
Spark Structured Streaming
- Structured Streaming API concepts
- Continuous application
- Operations on streams
- Stream management and queries
- Stream recovery and checkpointing
Operations for controlling the stream
- Output operations (saving DStreams)
- Data stream sources (basic sources, socket, Kafka, files)
- Managing multiple sources
- Resizing clusters for streaming operations
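The window operations covered on Day 4 can be previewed with a plain-Python sketch: process a simulated stream in fixed-size micro-batches and keep a sliding window over the last N batches – the same shape as Spark Streaming's windowed aggregations. The batch and window sizes here are illustrative; this is an analogy, not the DStream API.

```python
from collections import deque

BATCH_SIZE = 3       # events per micro-batch
WINDOW_BATCHES = 2   # the window covers the last 2 batches

events = [1, 2, 3, 4, 5, 6, 7, 8, 9]
window = deque(maxlen=WINDOW_BATCHES)    # older batches fall out automatically
windowed_sums = []

# Cut the simulated stream into micro-batches, as Spark Streaming does.
for i in range(0, len(events), BATCH_SIZE):
    batch = events[i:i + BATCH_SIZE]
    window.append(batch)
    # Windowed aggregation: sum every event currently inside the window.
    windowed_sums.append(sum(x for b in window for x in b))

print(windowed_sums)  # [6, 21, 39]
```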
Day 5 – Data Science
Using Spark MLlib for machine learning
The basics of machine learning
Types of data – vectors
Using the most common algorithms with concrete Spark examples
- Classifications and regression
- “Feature Extraction”
- Clustering
- Dimensionality reduction
Tricks and best practices
Model evaluation
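Model evaluation, listed above, boils down to comparing a model's predictions against held-out labels. A minimal plain-Python sketch of computing accuracy follows – MLlib's evaluators compute the same kind of metric at scale; the label values here are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the held-out labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Held-out test labels vs. the model's predictions (illustrative values).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75
```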
Spark GraphX – graph processing
The basics of graphs and why we use them
How to prepare a data set (GraphFrames)
Graph algorithms in Spark GraphX
- ShortestPaths
- PageRank
- ConnectedComponents
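The ShortestPaths algorithm listed above can be sketched in plain Python as a breadth-first search over a small unweighted graph – conceptually what GraphX computes in parallel across a cluster. The graph below is an illustrative example, not course data.

```python
from collections import deque

# Adjacency list for a small undirected graph.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def shortest_paths(graph, source):
    """Breadth-first search: hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for nbr in graph[v]:
            if nbr not in dist:
                dist[nbr] = dist[v] + 1
                queue.append(nbr)
    return dist

print(shortest_paths(graph, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```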
For more information, feel free to contact us!