
          APACHE SPARK educational workshops

          27.02.2018

Apache Spark is a framework for processing large amounts of data of all kinds. It is widely used for advanced analytics, data science, modern Big Data architectures, complex batch (ETL) processing, and real-time data processing. Spark consists of several key components: Spark SQL for working with structured data, Spark Streaming for processing large volumes of live data, Spark MLlib for machine learning, Spark GraphX for graph processing, and SparkR for statistical data processing in R. Spark can run standalone, on a YARN (Hadoop) cluster, or on Mesos – essentially, in any environment. Spark is also a polyglot framework: development can be done in whichever supported programming language (Python, Java, Scala, R) best fits the organization or business. The examples in this training will be written primarily in Python, but other languages, e.g. Scala, will be used as well. Participants will work in standalone and cluster environments, depending on the assignment.
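As a taste of what the workshops cover, here is a minimal PySpark sketch (the data and names are illustrative, not part of the course material) that starts a session and queries the same data through both the DataFrame API and Spark SQL:

from pyspark.sql import SparkSession

# Start a local Spark session.
spark = SparkSession.builder.appName("quickstart").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

# The same query expressed through the DataFrame API and through Spark SQL.
df.filter(df.age > 40).show()
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()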

          Target audience

This training is aimed at IT architects, development engineers and business analysts.

          Workshop Modules

Introduction to Spark is a workshop intended for everyone who wants to learn basic programming in the Spark framework. The workshop is modular, meaning that participants can choose modules depending on their points of interest. For instance, a participant can choose modules 1, 2 and 5; modules 1, 2 and 3; or modules 1, 2 and 4 – or combine them however they see fit.

          Data Science is a topic suitable for BI experts, business analysts and predictive analysts.

Data Engineering is a topic suited to system administrators, development engineers and system architects.

Workshop descriptions by module

          WORKSHOP PARTICIPATION PREREQUISITES

To participate in this training, you need knowledge of object-oriented programming, as well as basic knowledge of SQL and of Python and/or Scala.

          Introduction to Spark environments

This module covers an introduction to Spark and a basic explanation of how it works. All aspects of the technology will be explained through interactive examples. System architecture, Apache Hadoop, the basics of the MapReduce framework, and the core Spark APIs (RDD, DataFrame, Spark SQL) will all be covered in detail.

          This module is intended for IT architects, development engineers and business analysts.

          Spark analytics – Spark in practice

The second module deals with developing a Spark application using the previously acquired knowledge. The application performs advanced analytics on a targeted data set – from loading and cleaning large amounts of data to the final visualization.

          Advanced usage – Spark in practice

The third module focuses on deploying Spark applications to production. How to set up a development environment, how to optimize Spark applications, and how to run them in a Hadoop environment will be explained in detail.

          This module is aimed at IT architects and development engineers.

          Advanced usage – Streaming and real-time data processing

The fourth module is aimed at participants who want to acquire advanced Spark skills such as Spark Streaming. Participants will learn how to set up a streaming process for real-time data processing and will then extend it with additional streaming elements.

          This module is intended for IT architects and development engineers.

          Advanced usage – Data Science

The star of the fifth module is MLlib, Spark's machine learning library. Participants will build a machine learning model and walk through the process of training it. Several examples based on Spark GraphX will then show how to use graph processing successfully in practice.

          This module is aimed at business analysts and data science engineers.

Detailed workshop descriptions by day

          Day 1 – Introduction to Apache Spark

          What is Apache Spark?

Using Spark (standalone, cluster, shell)

What is an RDD and how is it used?

          • Transformations
          • Actions
          • “Lazy evaluation”
          • Data persistence – caching
          • Functions (Python, Scala, Java)
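A minimal sketch of these ideas (values are illustrative): transformations only build a lazy execution plan, an action triggers it, and caching keeps the computed result around:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

numbers = sc.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x * x)        # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)  # transformation (lazy)
evens.cache()                                 # persistence / caching

print(evens.collect())  # action: triggers evaluation -> [4, 16]
print(evens.count())    # action: served from the cache -> 2

sc.stop()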

          Key/Value structures

          • Types of data – creating and maintaining
          • Transformation (aggregation, sorting, join)
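A short sketch of key/value (pair) RDD operations, again with illustrative data:

from pyspark import SparkContext

sc = SparkContext("local[*]", "pairs-demo")

sales = sc.parallelize([("books", 10), ("toys", 5), ("books", 7)])
totals = sales.reduceByKey(lambda a, b: a + b)  # aggregation by key

print(totals.sortByKey().collect())             # sorting -> [('books', 17), ('toys', 5)]

managers = sc.parallelize([("books", "Ana"), ("toys", "Ivan")])
print(totals.join(managers).collect())          # join on the key

sc.stop()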

          DataFrame

• What is a DataFrame?
          • Transformations and actions on DataFrame
          • Advanced usage
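A small sketch of DataFrame transformations and actions (the data is illustrative): the grouped aggregation below is only a plan until show() runs it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("Ana", "HR", 4200), ("Ivan", "IT", 5100), ("Maja", "IT", 4800)],
    ["name", "dept", "salary"],
)

# Transformations describe the computation; show() is the action that runs it.
(df.groupBy("dept")
   .agg(F.avg("salary").alias("avg_salary"))
   .orderBy(F.desc("avg_salary"))
   .show())

spark.stop()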

          Loading and saving data

          • Working with different file formats (TXT, JSON, AVRO, Parquet, Seq File)
          • Working with different data repositories and file systems (Local, Amazon S3, HDFS)
          • Databases (RDBMS, Cassandra, HBase)
          • Spark SQL
          • Connecting Spark SQL
          • SQL in applications (initialization, DataFrames, Caching)
          • Functions
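A hedged sketch of the unified reader/writer API (the paths are hypothetical): the same calls cover the listed formats, and switching the URI scheme reaches other file systems:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

people = spark.read.json("people.json")                  # local JSON file
people.write.mode("overwrite").parquet("people.parquet") # save as Parquet

# The same API reaches S3 or HDFS by changing the URI scheme,
# e.g. spark.read.json("s3a://bucket/people.json")
#  or  spark.read.json("hdfs:///data/people.json")

spark.stop()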

          Day 2 – Spark Analytics – Spark in practice

          Collecting data for the full-day task in the context of advanced analytics

          • Getting to know the data set
          • Job requirements and task goals

          Data preparation and cleaning

• Profiling data and existing structures
• Cleaning data by removing anomalies and potential errors

Aggregating data

• Methods of aggregating data
• Finding the optimal case
• Process optimization
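A compact sketch of the cleaning-then-aggregating flow (the file name, columns, and thresholds are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()

raw = spark.read.csv("measurements.csv", header=True, inferSchema=True)

clean = (raw
         .dropna(subset=["value"])                 # drop missing readings
         .filter(F.col("value").between(0, 100)))  # drop anomalous outliers

(clean.groupBy("sensor")
      .agg(F.count("*").alias("readings"),
           F.avg("value").alias("avg_value"))
      .show())

spark.stop()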

          Application debugging

          Data visualization

          Day 3 – Spark in production

          Development environment:

          • Workspace (PyCharm, Anaconda, Zeppelin and Jupyter notebook)
          • Application build
          • Application deployment
          • Debugging
          • Testing (Unit test, integration test)
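As an example of the testing topic, a hedged sketch of a unit test for a Spark transformation (the function under test and the data are hypothetical; runnable with pytest):

from pyspark.sql import SparkSession

def keep_adults(df):
    return df.filter(df.age >= 18)

def test_keep_adults():
    spark = SparkSession.builder.master("local[1]").appName("test").getOrCreate()
    df = spark.createDataFrame([("Ana", 17), ("Ivan", 30)], ["name", "age"])
    assert keep_adults(df).count() == 1
    spark.stop()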

          Running Spark applications on cluster environments (Hadoop YARN)

          • Executor and worker optimization
• Hardware sizing
          • Monitoring
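A sketch of executor sizing set from code (the values are illustrative and assume a configured Hadoop client, i.e. HADOOP_CONF_DIR pointing at the cluster; the same settings can be passed to spark-submit instead):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("on-yarn")
         .master("yarn")
         .config("spark.executor.instances", "4")  # how many executors
         .config("spark.executor.cores", "2")      # cores per executor
         .config("spark.executor.memory", "4g")    # memory per executor
         .getOrCreate())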

          Spark Memory Management

          • Persistence – Caching
          • Tungsten
          • Garbage collection
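A brief sketch of choosing a storage level explicitly (the size is illustrative): MEMORY_AND_DISK spills cached partitions to disk instead of recomputing them:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk rather than recompute
df.count()                                # first action materializes the cache
df.unpersist()                            # release the memory when done

spark.stop()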

          24/7 Streaming operations

          • Streaming process control mechanisms – Checkpointing
          • Streaming fault-tolerance (driver, worker, receiver, process guarantees)
          • Data supervision
          • Process supervision
          • UI Streaming
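A hedged sketch of checkpointing in Structured Streaming (the paths are hypothetical, and the built-in rate source stands in for a real stream): offsets and state written to the checkpoint location let the query recover after driver or worker failures:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

events = spark.readStream.format("rate").load()  # test source, one row per second

query = (events.writeStream
         .format("parquet")
         .option("path", "out/")
         .option("checkpointLocation", "chk/")   # offsets and state survive restarts
         .start())

query.awaitTermination(30)  # run briefly for the demo
query.stop()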

          Performance management

          • System resizing (batch and window size)
• Level of parallelism

          Day 4 – Spark Streaming – real-time data processing in practice

          Introduction to streaming processes

          Data sets for a streaming process

          • Correcting existing data sets
          • Preparation for streaming

          Spark Streaming

          • Introduction to Spark Streaming
          • Basic concepts and operations
          • Streaming operations (Transformations, window operations)
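A minimal word-count sketch with the classic DStream API (it assumes a text source on localhost:9999, e.g. started with `nc -lk 9999`; newer Spark versions favor Structured Streaming, covered next):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts

ssc.start()
ssc.awaitTermination()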

          Spark Structured Streaming

• Structured Streaming API concepts
• Continuous applications
• Operations on streams
• Stream management and queries
• Stream recovery and checkpointing
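The same word count as a continuous Structured Streaming query (again assuming a socket source on localhost:9999; the checkpoint path is hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("structured-demo").getOrCreate()

words = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load()
         .select(F.explode(F.split("value", " ")).alias("word")))

query = (words.groupBy("word").count()
         .writeStream
         .outputMode("complete")                  # keep running totals
         .format("console")
         .option("checkpointLocation", "chk-wc/")
         .start())

query.awaitTermination()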

          Operations for controlling the stream

• Output operations (saving DStreams)
          • Data stream sources (basic sources, socket, Kafka, files)
          • Managing multiple sources
          • Resizing clusters for streaming operations
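A hedged sketch of reading a Kafka source (the broker and topic names are illustrative; it requires the separate spark-sql-kafka connector package on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))  # raw bytes -> text

query = stream.writeStream.format("console").start()
query.awaitTermination()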

          Day 5 – Data Science

Using Spark MLlib for machine learning

          The basics of machine learning

          Types of data – vectors

Using the most frequent algorithms, with concrete Spark examples

• Classification and regression
• “Feature extraction”
• Clustering
• Dimensionality reduction

          Tricks and best practices

          Model evaluation
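A compact sketch of the MLlib workflow on a tiny made-up data set: assemble features into a vector, train a classifier, then evaluate the model (evaluated on the training data only for brevity; a real workflow splits train and test sets):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.1, 0), (1.5, 0.3, 1), (0.2, 0.9, 0), (2.0, 0.1, 1)],
    ["f1", "f2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

auc = BinaryClassificationEvaluator().evaluate(model.transform(df))
print("AUC:", auc)

spark.stop()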

Spark GraphX – graph processing

          The basics of graphs and why we use them

          How to prepare a data set (GraphFrames)

Graph algorithms in Spark GraphX

          • ShortestPaths
          • PageRank
          • ConnectedComponents
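Since GraphX itself is a Scala API, the Python examples use GraphFrames, as noted above. A small sketch with an illustrative three-node graph (the graphframes package must be installed separately, e.g. via spark-submit --packages):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
g = GraphFrame(vertices, edges)

g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()  # PageRank scores
g.shortestPaths(landmarks=["a"]).show()                        # distances to "a"

spark.stop()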

          For more information, feel free to contact us!
