Apache Spark Syllabus

Introduction to Apache Spark

Apache Spark is a powerful distributed computing system designed for big data processing. This module introduces Apache Spark, covering its core features, architecture, and use cases in data analytics, machine learning, and stream processing.

Setting Up Apache Spark

Learn how to install and configure Apache Spark. This section covers system requirements, installation procedures, and initial setup. Explore how to configure Spark clusters and understand the basics of Spark’s web UI.

Spark Architecture and Components

Discover the architecture of Apache Spark, including its core data abstractions (RDDs and DataFrames) and its DAG-based execution model. Learn how Spark’s architecture supports distributed data processing and how to design efficient data processing pipelines.

Creating and Managing Data Pipelines

Gain insights into creating and managing data pipelines in Apache Spark. Learn how to design data processing workflows, configure transformations, and optimize data pipelines for performance. Explore how to handle data transformation, enrichment, and aggregation.

Monitoring and Troubleshooting

Learn how to monitor and troubleshoot Apache Spark. Explore Spark’s monitoring tools, logs, and performance metrics. Understand techniques for diagnosing issues, managing system health, and ensuring data processing reliability.

Integration with Other Systems

Discover how to integrate Apache Spark with other systems and technologies. Learn about Spark’s connectors and integrations with databases, message queues, cloud services, and big data platforms. Explore how to use Spark for end-to-end data integration and analytics.

Data Security and Access Control

Understand data security and access control in Apache Spark. Learn about authentication, authorization, and encryption. Explore how to secure data processing workflows, manage user access, and ensure compliance with security policies.

Performance Tuning and Optimization

Learn about performance tuning and optimization for Apache Spark. Explore techniques for improving data processing efficiency, managing system resources, and handling large volumes of data. Understand best practices for configuring and maintaining Spark clusters.

Advanced Features and Customization

Explore advanced features and customization options in Apache Spark. Learn how to extend Spark with custom functions, libraries, and tools. Understand how to adapt Spark to meet specific data processing needs and use cases.

Apache Spark Training Syllabus

Module 1: Introduction

  • Overview of Hadoop
  • Architecture of HDFS & YARN
  • Overview of Spark version 2.2.0
  • Spark Architecture
  • Spark Components
  • Comparison of Spark & Hadoop
  • Installation of Spark v2.2.0 on 64-bit Linux

Module 2: Spark Core

  • Exploring the Spark shell
  • Creating a SparkContext
  • Operations on Resilient Distributed Datasets (RDDs)
  • Transformations & Actions
  • Loading Data and Saving Data

Module 3: Spark SQL & Hive SQL

  • Introduction to SQL Operations
  • SQLContext
  • DataFrame
  • Working with Hive
  • Loading Partitioned Tables
  • Processing CSV, JSON, Parquet files

Module 4: Scala Programming

  • Introduction to Scala
  • Features of Scala
  • Scala vs Java Comparison
  • Data types
  • Data Structures
  • Arrays
  • Literals
  • Logical Operators
  • Mutable & Immutable variables
  • Type inference

Module 5: Scala Functions

  • OOP vs Functions
  • Anonymous functions
  • Recursive functions
  • Call-by-name
  • Currying
  • Conditional statements
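
As a rough illustration of the function topics listed above, the following minimal sketch shows an anonymous function, a recursive function, a call-by-name parameter, and a curried function in plain Scala. The object name `FunctionBasics` and the helper names (`square`, `factorial`, `unless`, `add`) are hypothetical and chosen only for this example; they are not taken from the course material.

```scala
object FunctionBasics {
  // Anonymous function (function literal) bound to a value.
  val square: Int => Int = x => x * x

  // Recursive function; Scala requires an explicit return type here.
  def factorial(n: Int): BigInt =
    if (n <= 1) BigInt(1) else BigInt(n) * factorial(n - 1)

  // Call-by-name parameter: `body` is evaluated only if it is actually used.
  def unless(condition: Boolean)(body: => Unit): Unit =
    if (!condition) body

  // Curried function: arguments are supplied one parameter list at a time.
  def add(a: Int)(b: Int): Int = a + b

  def main(args: Array[String]): Unit = {
    println(square(4))
    println(factorial(5))
    unless(false) { println("evaluated because the condition is false") }

    // Supplying only the first parameter list yields a new function.
    val addTen: Int => Int = add(10)
    println(addTen(5))
  }
}
```

Running `FunctionBasics.main` prints the results of each call in turn; the `unless` helper also shows how a conditional expression decides whether a by-name argument is evaluated at all.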

Module 6: Scala Collections

  • List
  • Map
  • Sets
  • Options
  • Tuples
  • Mutable collections
  • Immutable collections
  • Iterating
  • Filtering and counting
  • groupBy
  • flatMap
  • Word count
  • File Access
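
Several of the collection topics above (flat-mapping, filtering, grouping, counting, and iterating) come together in the classic word count exercise. The sketch below is one possible version using only the Scala standard library on an in-memory list; the object name `WordCount`, the method `countWords`, and the sample sentences are assumptions made for illustration, and reading the lines from a file is left out to keep the example self-contained.

```scala
object WordCount {
  // Splits each line into lowercase words, then counts occurrences per word.
  def countWords(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // flatten all lines into one list of words
      .filter(_.nonEmpty)                   // drop empty tokens from blank input
      .groupBy(identity)                    // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val lines = List("spark makes big data simple", "big data needs spark")
    val counts = countWords(lines)

    // Sort by word so the printed order is stable across runs.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

The same flatMap, group, and count shape carries over to the Spark version of word count, where the list is replaced by a distributed dataset.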

Module 7: Scala Object Oriented Programming

  • Classes, Objects & Properties
  • Inheritance

Module 8: Spark Submit

  • Maven build tool implementation
  • Build Libraries
  • Create Jar files
  • Spark-Submit

Module 9: Spark Streaming

  • Overview of Spark Streaming
  • Architecture of Spark Streaming
  • File streaming
  • Twitter Streaming

Module 10: Kafka Streaming

  • Overview of Kafka Streaming
  • Architecture of Kafka Streaming
  • Kafka Installation
  • Topic
  • Producer
  • Consumer
  • File streaming
  • Twitter Streaming

Module 11: Spark MLlib

  • Overview of Machine Learning Algorithms
  • Linear Regression
  • Logistic Regression

Module 12: Spark GraphX

  • GraphX overview
  • Vertices
  • Edges
  • Triplets
  • PageRank
  • Pregel

Module 13: Performance Tuning

  • On-heap and off-heap memory tuning
  • Kryo Serialization
  • Broadcast Variables
  • Accumulators
  • DAG Scheduler
  • Data Locality
  • Checkpointing
  • Speculative Execution
  • Garbage Collection

Module 14: Project Planning, Monitoring, and Troubleshooting

  • Master – Driver Node capacity
  • Slave – Worker Node capacity
  • Executor capacity
  • Executor core capacity
  • Project scenario and execution
  • Out-of-memory error handling
  • Master logs, Worker logs, Driver logs
  • Monitoring Web UI
  • Heap memory dump

Training

Basic Level Training

Duration: 1 Month

Advanced Level Training

Duration: 1 Month

Project Level Training

Duration: 1 Month

Total Training Period

Duration: 3 Months

Course Mode:

Available Online / Offline

Course Fees:

Please contact the office for details

Placement Benefit Services

Provide 100% job-oriented training
Develop multiple skill sets
Assist in project completion
Build ATS-friendly resumes
Add relevant experience to profiles
Build and enhance online profiles
Supply manpower to consultants
Supply manpower to companies
Prepare candidates for interviews
Add candidates to job groups
Send candidates to interviews
Provide job references
Assign candidates to contract jobs
Select candidates for internal projects

Note

100% Job Assurance Only
Daily online batches for employees
New course batches start every Monday