Apache Spark is a framework for processing large amounts of data of virtually any kind. It is widely used for advanced analytics, data science, modern big data architectures, complex batch (ETL) processing and real-time data processing. Spark consists of several key components: Spark SQL for working with structured data, Spark Streaming for processing large volumes of live data, Spark MLlib for machine learning, Spark GraphX for graph processing and SparkR for statistical processing in the R language. Spark can run standalone, on a YARN (Hadoop) cluster or on Mesos – essentially, in any environment. Spark is a polyglot framework: it exposes its API in several programming languages (Python, Java, Scala, R), so teams can work in whichever language best fits the organization or business. The examples in this course will be written primarily in Python, though other languages, e.g. Scala, will also be used. Participants will work in standalone and cluster environments, depending on the assignment.
Target audience
This training is aimed at IT architects, development engineers and business analysts.
Workshop Modules
Introduction to Spark is a workshop intended for everyone who wants to learn the basics of programming in the Spark framework. The workshop is modular: participants can choose modules according to their points of interest. For instance, a participant can take modules 1, 2 and 5; modules 1, 2 and 3; or modules 1, 2 and 4 – or combine them however they see fit.
Data Science is a topic suitable for BI experts, business analysts and predictive analysts.
Data Engineering is a topic suited to system administrators, development engineers and system architects.
Workshop descriptions by module
WORKSHOP PARTICIPATION PREREQUISITES
To take part in this training, you need knowledge of object-oriented programming, as well as basic knowledge of SQL and of Python and/or Scala.
Introduction to Spark environments
This module covers an introduction to Spark and a basic explanation of how it works. All aspects of the technology are explained through interactive examples. System architecture, Apache Hadoop, the basics of the MapReduce framework and the core Spark APIs (RDD, DataFrame, Spark SQL) are all covered in detail.
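The MapReduce model covered in this module can be previewed with a minimal standard-library Python sketch – plain Python, not the actual Hadoop or Spark API: a map phase emits (word, 1) pairs and a reduce phase sums the counts per key.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark makes big data simple", "big data big results"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts["big"])  # the word "big" appears three times
```

In the workshop, the same word-count pattern is one of the first exercises written against the real Spark API.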
This module is intended for IT architects, development engineers and business analysts.
Spark analytics – Spark in practice
The second module deals with developing a Spark application using the previously acquired knowledge. The application performs advanced analytics on a targeted data set – from loading and cleaning large volumes of data to the final visualization.
Advanced usage – Spark in practice
The third module focuses on deploying Spark applications to production. It explains in detail how to set up a development environment, how to optimize Spark applications and how to run them in a Hadoop environment.
This module is aimed at IT architects and development engineers.
Advanced usage – Streaming and real-time data processing
The fourth module is aimed at participants who want to acquire advanced Spark skills such as Spark Streaming. Participants will learn how to set up a streaming process for real-time data processing and will then extend it with additional streaming elements.
This module is intended for IT architects and development engineers.
Advanced usage – Data Science
The fifth module centers on the MLlib machine learning library. Participants will build a machine learning model and walk through the process of training it. Several examples based on Spark GraphX will then show how to use graph processing successfully in practice.
This module is aimed at business analysts and data science engineers.
Detailed description of the workshop by day
Day 1 – Introduction to Apache Spark
What is Apache Spark?
Using Spark (standalone, cluster, shell)
What is an RDD and how is it used?
- Transformations
- Actions
- “Lazy evaluation”
- Data persistence – caching
- Functions (Python, Scala, Java)
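The distinction between transformations (lazy) and actions (eager) listed above can be illustrated with a standard-library Python analogy: like Spark transformations, Python generators only describe a computation, and nothing runs until a terminal operation – the "action" – consumes the result. This is a plain-Python sketch, not the RDD API itself.

```python
evaluated = []

def numbers():
    # "Transformation" stage: building this generator runs nothing yet.
    for n in [1, 2, 3, 4, 5]:
        evaluated.append(n)           # record when a value is actually computed
        yield n * 10

pipeline = (x + 1 for x in numbers())  # chain another lazy step
assert evaluated == []                 # still lazy: nothing has been evaluated

result = sum(pipeline)                 # the "action": forces evaluation
assert evaluated == [1, 2, 3, 4, 5]
print(result)                          # (10+1)+(20+1)+(30+1)+(40+1)+(50+1) = 155
```

Spark behaves the same way: `map` and `filter` build a plan, while `collect`, `count` or `reduce` trigger execution.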
Key/Value structures
- Types of data – creating and maintaining
- Transformation (aggregation, sorting, join)
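The pair (key/value) operations listed above – aggregation, sorting and joins – have direct analogues in plain Python. A hedged standard-library sketch of the same ideas (the workshop exercises use Spark's `reduceByKey`, `sortByKey` and `join` instead; the data here is illustrative):

```python
sales = [("apples", 3), ("pears", 2), ("apples", 5), ("plums", 1)]
prices = [("apples", 0.5), ("pears", 0.8)]

# Aggregation by key (reduceByKey analogue): sum the values per key.
totals = {}
for key, value in sales:
    totals[key] = totals.get(key, 0) + value

# Sorting by key (sortByKey analogue).
sorted_totals = sorted(totals.items())

# Inner join on key (join analogue): keep only keys present in both datasets.
price_map = dict(prices)
joined = [(k, (qty, price_map[k])) for k, qty in totals.items() if k in price_map]

print(sorted_totals)  # [('apples', 8), ('pears', 2), ('plums', 1)]
```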
DataFrame
- What is DataFrame?
- Transformations and actions on DataFrame
- Advanced usage
Loading and saving data
- Working with different file formats (TXT, JSON, Avro, Parquet, SequenceFile)
- Working with different data repositories and file systems (Local, Amazon S3, HDFS)
- Databases (RDBMS, Cassandra, HBase)
Spark SQL
- Connecting Spark SQL
- SQL in applications (initialization, DataFrames, Caching)
- Functions
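The pattern taught in this section – register structured data, then query it with SQL from application code – can be previewed with the standard-library sqlite3 module. In the workshop the same flow uses a DataFrame registered as a temporary view and queried through Spark SQL; this is a plain-Python analogy with an illustrative table, not the Spark API.

```python
import sqlite3

# In Spark this would be a DataFrame registered as a temporary view;
# here an in-memory SQLite table stands in for it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ana", 31), ("Ivan", 24), ("Maja", 45)])

# The application issues SQL and receives rows back – the same shape
# as collecting the result of a Spark SQL query in the exercises.
rows = conn.execute(
    "SELECT name FROM users WHERE age > 30 ORDER BY name").fetchall()
print(rows)  # [('Ana',), ('Maja',)]
conn.close()
```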
Day 2 – Spark Analytics – Spark in practice
Collecting data for the full-day task in the context of advanced analytics
- Getting to know the data set
- Job requirements and task goals
Data preparation and cleaning
- Profiling data and existing structures
- Cleaning data by removing anomalies and potential errors
- Aggregating data
- Methods of aggregating data
- Finding the optimal case
- Process optimization
Application debugging
Data visualization
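The Day 2 pipeline above – profile, clean out anomalies, then aggregate – can be sketched in plain Python. The field names, sentinel value and valid range below are illustrative assumptions, not part of the course data set; the day's exercises perform the same steps with DataFrame operations.

```python
from collections import defaultdict

records = [
    {"sensor": "a", "temp": 21.5},
    {"sensor": "a", "temp": -999.0},   # sentinel value: a known anomaly
    {"sensor": "b", "temp": 19.0},
    {"sensor": "b", "temp": 20.0},
]

# Cleaning: drop records whose reading falls outside a plausible range.
clean = [r for r in records if -50.0 <= r["temp"] <= 60.0]

# Aggregation: average temperature per sensor.
sums, counts = defaultdict(float), defaultdict(int)
for r in clean:
    sums[r["sensor"]] += r["temp"]
    counts[r["sensor"]] += 1
averages = {s: sums[s] / counts[s] for s in sums}
print(averages)  # {'a': 21.5, 'b': 19.5}
```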
Day 3 – Spark in production
Development environment:
- Workspace (PyCharm, Anaconda, Zeppelin and Jupyter notebooks)
- Application build
- Application deployment
- Debugging
- Testing (Unit test, integration test)
Running Spark applications on cluster environments (Hadoop YARN)
- Executor and worker optimization
- Hardware sizing
- Monitoring
Spark Memory Management
- Persistence – Caching
- Tungsten
- Garbage collection
24/7 Streaming operations
- Streaming process control mechanisms – Checkpointing
- Streaming fault-tolerance (driver, worker, receiver, process guarantees)
- Data supervision
- Process supervision
- Streaming UI
Performance management
- System resizing (batch and window size)
- Level of parallelism
Day 4 – Spark Streaming – real-time data processing in practice
Introduction to streaming processes
Data sets for a streaming process
- Correcting existing data sets
- Preparation for streaming
Spark Streaming
- Introduction to Spark Streaming
- Basic concepts and operations
- Streaming operations (Transformations, window operations)
Spark Structured Streaming
- Structured Streaming API concepts
- Continuous application
- Operations on streams
- Stream management and queries
- Stream recovery and checkpointing
Operations for controlling the stream
- Output operations (saving DStreams)
- Data stream sources (basic sources, socket, Kafka, files)
- Managing multiple sources
- Resizing clusters for streaming operations
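The window operations covered on Day 4 can be previewed with a plain-Python sketch: process a simulated stream in fixed-size micro-batches and keep a sliding window over the last N batches – the same shape as Spark Streaming's windowed aggregations. The batch and window sizes here are illustrative; this is an analogy, not the DStream API.

```python
from collections import deque

BATCH_SIZE = 3       # events per micro-batch
WINDOW_BATCHES = 2   # the window covers the last 2 batches

events = [1, 2, 3, 4, 5, 6, 7, 8, 9]
window = deque(maxlen=WINDOW_BATCHES)    # older batches fall out automatically
windowed_sums = []

# Cut the simulated stream into micro-batches, as Spark Streaming does.
for i in range(0, len(events), BATCH_SIZE):
    batch = events[i:i + BATCH_SIZE]
    window.append(batch)
    # Windowed aggregation: sum every event currently inside the window.
    windowed_sums.append(sum(x for b in window for x in b))

print(windowed_sums)  # [6, 21, 39]
```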
Day 5 – Data Science
Using Spark MLlib for machine learning
The basics of machine learning
Types of data – vectors
Using the most common algorithms with concrete Spark examples
- Classifications and regression
- “Feature Extraction”
- Clustering
- Dimensionality reduction
Tricks and best practices
Model evaluation
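Model evaluation, listed above, boils down to comparing a model's predictions against held-out labels. A minimal plain-Python sketch of computing accuracy follows – MLlib's evaluators compute the same kind of metric at scale; the label values here are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the held-out labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Held-out test labels vs. the model's predictions (illustrative values).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75
```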
Spark GraphX – graph processing
The basics of graphs and why we use them
How to prepare a data set (GraphFrames)
Graph algorithms in Spark GraphX
- ShortestPaths
- PageRank
- ConnectedComponents
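The ShortestPaths algorithm listed above can be sketched in plain Python as a breadth-first search over a small unweighted graph – conceptually what GraphX computes in parallel across a cluster. The graph below is an illustrative example, not course data.

```python
from collections import deque

# Adjacency list for a small undirected graph.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def shortest_paths(graph, source):
    """Breadth-first search: hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for nbr in graph[v]:
            if nbr not in dist:
                dist[nbr] = dist[v] + 1
                queue.append(nbr)
    return dist

print(shortest_paths(graph, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```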
For more information, feel free to contact us!