Apache Spark Syllabus
Introduction to Apache Spark
Apache Spark is a powerful distributed computing system designed for big data processing. This module introduces Apache Spark, covering its core features, architecture, and use cases in data analytics, machine learning, and stream processing.
Setting Up Apache Spark
Learn how to install and configure Apache Spark. This section covers system requirements, installation procedures, and initial setup. Explore how to configure Spark clusters and understand the basics of Spark's web UI.
Spark Architecture and Components
Discover the architecture of Apache Spark, including its key components such as RDDs, DataFrames, and DAGs. Learn how Spark’s architecture supports distributed data processing and how to design efficient data processing pipelines.
Creating and Managing Data Pipelines
Gain insights into creating and managing data pipelines in Apache Spark. Learn how to design data processing workflows, configure transformations, and optimize data pipelines for performance. Explore how to handle data transformation, enrichment, and aggregation.
Monitoring and Troubleshooting
Learn how to monitor and troubleshoot Apache Spark. Explore Spark’s monitoring tools, logs, and performance metrics. Understand techniques for diagnosing issues, managing system health, and ensuring data processing reliability.
Integration with Other Systems
Discover how to integrate Apache Spark with other systems and technologies. Learn about Spark’s connectors and integrations with databases, message queues, cloud services, and big data platforms. Explore how to use Spark for end-to-end data integration and analytics.
Data Security and Access Control
Understand data security and access control in Apache Spark. Learn about authentication, authorization, and encryption. Explore how to secure data processing workflows, manage user access, and ensure compliance with security policies.
Performance Tuning and Optimization
Learn about performance tuning and optimization for Apache Spark. Explore techniques for improving data processing efficiency, managing system resources, and handling large volumes of data. Understand best practices for configuring and maintaining Spark clusters.
Advanced Features and Customization
Explore advanced features and customization options in Apache Spark. Learn how to extend Spark with custom functions, libraries, and tools. Understand how to adapt Spark to meet specific data processing needs and use cases.
Apache Spark Training Syllabus
Module 1: Introduction
- Overview of Hadoop
- Architecture of HDFS & YARN
- Overview of Spark version 2.2.0
- Spark Architecture
- Spark Components
- Comparison of Spark & Hadoop
- Installation of Spark 2.2.0 on 64-bit Linux
Module 2: Spark Core
- Exploring the Spark shell
- Creating Spark Context
- Operations on Resilient Distributed Dataset – RDD
- Transformations & Actions
- Loading Data and Saving Data
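The topics above can be sketched in a few lines of spark-shell code. This is an illustrative snippet, assuming the shell's pre-created SparkContext `sc`; the file paths are hypothetical placeholders.

```scala
// Transformations are lazy; actions trigger execution.
val nums    = sc.parallelize(1 to 10)   // create an RDD from a local collection
val evens   = nums.filter(_ % 2 == 0)   // transformation (lazy)
val doubled = evens.map(_ * 2)          // transformation (lazy)
println(doubled.reduce(_ + _))          // action: runs the job, prints 60

// Loading and saving data (paths are placeholders):
val lines = sc.textFile("input.txt")    // load a text file as an RDD of lines
doubled.saveAsTextFile("output-dir")    // save as a directory of part files
```

Note that nothing executes until `reduce` (an action) is called; `filter` and `map` only build up the lineage.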
Module 3: Spark SQL & Hive SQL
- Introduction to SQL Operations
- SQL Context
- DataFrames
- Working with Hive
- Loading Partitioned Tables
- Processing CSV, JSON, Parquet files
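As a sketch of the file-format topics above: in Spark 2.x the `SparkSession` entry point (which subsumes the older SQLContext) reads CSV, JSON, and Parquet into DataFrames. The app name and file paths here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext/HiveContext in Spark 2.x
val spark = SparkSession.builder()
  .appName("SqlDemo")
  .enableHiveSupport()   // enables the Hive-backed topics in this module
  .getOrCreate()

// Paths are placeholders for illustration
val csvDf  = spark.read.option("header", "true").csv("people.csv")
val jsonDf = spark.read.json("people.json")
val pqDf   = spark.read.parquet("people.parquet")

// Register a temp view and query it with SQL
csvDf.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()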
Module 4: Scala Programming
- Introduction to Scala
- Features of Scala
- Scala vs Java Comparison
- Data types
- Data structures
- Arrays
- Literals
- Logical Operators
- Mutable & Immutable variables
- Type inference
Module 5: Scala Functions
- OOP vs functional programming
- Anonymous functions
- Recursive functions
- Call-by-name parameters
- Currying
- Conditional statement
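The function styles listed above can each be shown in a few lines. This is a small self-contained sketch; the names (`square`, `factorial`, `twice`, `add`) are illustrative, not from the course material.

```scala
object FunctionDemo {
  // Anonymous function bound to a value
  val square: Int => Int = x => x * x

  // Recursive function
  def factorial(n: Int): Int =
    if (n <= 1) 1 else n * factorial(n - 1)

  // Call-by-name parameter: `body` is re-evaluated on each use
  def twice(body: => Unit): Unit = { body; body }

  // Curried function: two parameter lists
  def add(a: Int)(b: Int): Int = a + b

  def main(args: Array[String]): Unit = {
    println(square(5))       // 25
    println(factorial(5))    // 120
    println(add(2)(3))       // 5
    twice(println("hi"))     // prints "hi" twice
  }
}
```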
Module 6: Scala Collections
- List
- Map
- Sets
- Options
- Tuples
- Mutable collection
- Immutable collection
- Iterating
- Filtering and counting
- Group By
- Flat Map
- Word count
- File Access
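Several of the topics above (flatMap, groupBy, counting, word count) come together in the classic collections word count. A minimal sketch with made-up input data:

```scala
// Word count over plain Scala collections (no Spark needed)
val text = List("spark makes big data simple", "big data big results")

val counts: Map[String, Int] =
  text
    .flatMap(_.split("\\s+"))            // flatten lines into words
    .groupBy(identity)                   // group equal words together
    .map { case (w, ws) => w -> ws.size } // count each group

println(counts("big"))   // 3
```

The same flatMap/map/reduce shape reappears almost verbatim in the Spark RDD word count, which is why this module precedes the Spark material.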
Module 7: Scala Object Oriented Programming
- Classes, Objects & Properties
- Inheritance
Module 8: Spark Submit
- Maven build tool implementation
- Build Libraries
- Create Jar files
- Spark-Submit
Module 9: Spark Streaming
- Overview of Spark Streaming
- Architecture of Spark Streaming
- File streaming
- Twitter Streaming
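A minimal file-streaming sketch with the DStream API of this Spark generation, assuming a local run; the watched directory and batch interval are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

// Word-count any new text files dropped into the watched directory
val lines  = ssc.textFileStream("/tmp/stream-input")  // placeholder path
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```

Only files created in the directory after the job starts are picked up, which is worth knowing when testing.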
Module 10: Kafka Streaming
- Overview of Kafka
- Kafka architecture
- Kafka Installation
- Topic
- Producer
- Consumer
- File streaming
- Twitter Streaming
Module 11: Spark MLlib
- Overview of machine learning algorithms
- Linear Regression
- Logistic Regression
Module 12: Spark GraphX
- GraphX overview
- Vertices
- Edges
- Triplets
- Page Rank
- Pregel
Module 13: Performance Tuning
- On-heap and off-heap memory tuning
- Kryo Serialization
- Broadcast Variable
- Accumulator Variable
- DAG Scheduler
- Data Locality
- Checkpointing
- Speculative Execution
- Garbage Collection
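Two of the tuning topics above, broadcast and accumulator variables, can be sketched in the spark-shell (with its pre-created SparkContext `sc`); the lookup data is made up for illustration.

```scala
// Broadcast: read-only data cached once per executor instead of
// being shipped with every task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: written from tasks, read back on the driver.
val errors = sc.longAccumulator("unknown keys")

val data = sc.parallelize(Seq("a", "b", "c"))
val resolved = data.map { key =>
  lookup.value.get(key) match {
    case Some(v) => v
    case None    => errors.add(1); 0   // tally misses via the accumulator
  }
}

println(resolved.sum())   // 3.0  (1 + 2 + 0)
println(errors.value)     // 1, reliable only after an action has run
```

Accumulator values are only dependable after an action (here `sum`) has executed, since transformations are lazy.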
Module 14: Project Planning, Monitoring, and Troubleshooting
- Master (driver) node capacity
- Slave (worker) node capacity
- Executor capacity
- Executor core capacity
- Project scenario and execution
- Out-of-memory error handling
- Master logs, Worker logs, Driver logs
- Monitoring Web UI
- Heap memory dump
Training
Basic Level Training
Duration: 1 month
Advanced Level Training
Duration: 1 month
Project Level Training
Duration: 1 month
Total Training Period
Duration: 3 months
Course Mode:
Available online / offline
Course Fees:
Please contact the office for details