Apache Spark Syllabus
Introduction to Apache Spark
Apache Spark is a powerful distributed computing system designed for big data processing. This module introduces Apache Spark, covering its core features, architecture, and use cases in data analytics, machine learning, and stream processing.
Setting Up Apache Spark
Learn how to install and configure Apache Spark. This section covers system requirements, installation procedures, and initial setup. Explore how to configure Spark clusters and understand the basics of Spark's web UI.
Spark Architecture and Components
Discover the architecture of Apache Spark, including its key components such as RDDs, DataFrames, and DAGs. Learn how Spark’s architecture supports distributed data processing and how to design efficient data processing pipelines.
Creating and Managing Data Pipelines
Gain insights into creating and managing data pipelines in Apache Spark. Learn how to design data processing workflows, configure transformations, and optimize data pipelines for performance. Explore how to handle data transformation, enrichment, and aggregation.
Monitoring and Troubleshooting
Learn how to monitor and troubleshoot Apache Spark. Explore Spark’s monitoring tools, logs, and performance metrics. Understand techniques for diagnosing issues, managing system health, and ensuring data processing reliability.
Integration with Other Systems
Discover how to integrate Apache Spark with other systems and technologies. Learn about Spark’s connectors and integrations with databases, message queues, cloud services, and big data platforms. Explore how to use Spark for end-to-end data integration and analytics.
Data Security and Access Control
Understand data security and access control in Apache Spark. Learn about authentication, authorization, and encryption. Explore how to secure data processing workflows, manage user access, and ensure compliance with security policies.
Performance Tuning and Optimization
Learn about performance tuning and optimization for Apache Spark. Explore techniques for improving data processing efficiency, managing system resources, and handling large volumes of data. Understand best practices for configuring and maintaining Spark clusters.
Advanced Features and Customization
Explore advanced features and customization options in Apache Spark. Learn how to extend Spark with custom functions, libraries, and tools. Understand how to adapt Spark to meet specific data processing needs and use cases.
Apache Spark Training Syllabus
Module 1: Introduction
- Overview of Hadoop
- Architecture of HDFS & YARN
- Overview of Spark version 2.2.0
- Spark Architecture
- Spark Components
- Comparison of Spark & Hadoop
- Installation of Spark 2.2.0 on 64-bit Linux
Module 2: Spark Core
- Exploring the Spark shell
- Creating Spark Context
- Operations on Resilient Distributed Dataset – RDD
- Transformations & Actions
- Loading Data and Saving Data
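The topics above can be sketched in a few lines of spark-shell code. This is an illustrative snippet, assuming the shell's pre-created SparkContext `sc`; the file paths are hypothetical placeholders.

```scala
// Transformations are lazy; actions trigger execution.
val nums    = sc.parallelize(1 to 10)   // create an RDD from a local collection
val evens   = nums.filter(_ % 2 == 0)   // transformation (lazy)
val doubled = evens.map(_ * 2)          // transformation (lazy)
println(doubled.reduce(_ + _))          // action: runs the job, prints 60

// Loading and saving data (paths are placeholders):
val lines = sc.textFile("input.txt")    // load a text file as an RDD of lines
doubled.saveAsTextFile("output-dir")    // save as a directory of part files
```

Note that nothing executes until `reduce` (an action) is called; `filter` and `map` only build up the lineage.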
Module 3: Spark SQL & Hive SQL
- Introduction to SQL Operations
- SQL Context
- DataFrames
- Working with Hive
- Loading Partitioned Tables
- Processing CSV, JSON, Parquet files
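As a sketch of the file-format topics above: in Spark 2.x the `SparkSession` entry point (which subsumes the older SQLContext) reads CSV, JSON, and Parquet into DataFrames. The app name and file paths here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext/HiveContext in Spark 2.x
val spark = SparkSession.builder()
  .appName("SqlDemo")
  .enableHiveSupport()   // enables the Hive-backed topics in this module
  .getOrCreate()

// Paths are placeholders for illustration
val csvDf  = spark.read.option("header", "true").csv("people.csv")
val jsonDf = spark.read.json("people.json")
val pqDf   = spark.read.parquet("people.parquet")

// Register a temp view and query it with SQL
csvDf.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()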
Module 4: Scala Programming
- Introduction to Scala
- Features of Scala
- Scala vs Java Comparison
- Data types
- Data structures
- Arrays
- Literals
- Logical Operators
- Mutable & Immutable variables
- Type inference
Module 5: Scala Functions
- OOP vs functional programming
- Anonymous functions
- Recursive functions
- Call-by-name parameters
- Currying
- Conditional statement
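The function styles listed above can each be shown in a few lines. This is a small self-contained sketch; the names (`square`, `factorial`, `twice`, `add`) are illustrative, not from the course material.

```scala
object FunctionDemo {
  // Anonymous function bound to a value
  val square: Int => Int = x => x * x

  // Recursive function
  def factorial(n: Int): Int =
    if (n <= 1) 1 else n * factorial(n - 1)

  // Call-by-name parameter: `body` is re-evaluated on each use
  def twice(body: => Unit): Unit = { body; body }

  // Curried function: two parameter lists
  def add(a: Int)(b: Int): Int = a + b

  def main(args: Array[String]): Unit = {
    println(square(5))       // 25
    println(factorial(5))    // 120
    println(add(2)(3))       // 5
    twice(println("hi"))     // prints "hi" twice
  }
}
```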
Module 6: Scala Collections
- List
- Map
- Sets
- Options
- Tuples
- Mutable collection
- Immutable collection
- Iterating
- Filtering and counting
- Group By
- Flat Map
- Word count
- File Access
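Several of the topics above (flatMap, groupBy, counting, word count) come together in the classic collections word count. A minimal sketch with made-up input data:

```scala
// Word count over plain Scala collections (no Spark needed)
val text = List("spark makes big data simple", "big data big results")

val counts: Map[String, Int] =
  text
    .flatMap(_.split("\\s+"))            // flatten lines into words
    .groupBy(identity)                   // group equal words together
    .map { case (w, ws) => w -> ws.size } // count each group

println(counts("big"))   // 3
```

The same flatMap/map/reduce shape reappears almost verbatim in the Spark RDD word count, which is why this module precedes the Spark material.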
Module 7: Scala Object Oriented Programming
- Classes, Objects & Properties
- Inheritance
Module 8: Spark Submit
- Maven build tool implementation
- Build Libraries
- Create Jar files
- Spark-Submit
Module 9: Spark Streaming
- Overview of Spark Streaming
- Architecture of Spark Streaming
- File streaming
- Twitter Streaming
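A minimal file-streaming sketch with the DStream API of this Spark generation, assuming a local run; the watched directory and batch interval are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

// Word-count any new text files dropped into the watched directory
val lines  = ssc.textFileStream("/tmp/stream-input")  // placeholder path
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```

Only files created in the directory after the job starts are picked up, which is worth knowing when testing.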
Module 10: Kafka Streaming
- Overview of Kafka
- Kafka architecture
- Kafka Installation
- Topic
- Producer
- Consumer
- File streaming
- Twitter Streaming
Module 11: Spark MLlib
- Overview of machine learning algorithms
- Linear Regression
- Logistic Regression
Module 12: Spark GraphX
- GraphX overview
- Vertices
- Edges
- Triplets
- Page Rank
- Pregel
Module 13: Performance Tuning
- On-heap and off-heap memory tuning
- Kryo Serialization
- Broadcast Variable
- Accumulator Variable
- DAG Scheduler
- Data Locality
- Checkpointing
- Speculative Execution
- Garbage Collection
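Two of the tuning topics above, broadcast and accumulator variables, can be sketched in the spark-shell (with its pre-created SparkContext `sc`); the lookup data is made up for illustration.

```scala
// Broadcast: read-only data cached once per executor instead of
// being shipped with every task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: written from tasks, read back on the driver.
val errors = sc.longAccumulator("unknown keys")

val data = sc.parallelize(Seq("a", "b", "c"))
val resolved = data.map { key =>
  lookup.value.get(key) match {
    case Some(v) => v
    case None    => errors.add(1); 0   // tally misses via the accumulator
  }
}

println(resolved.sum())   // 3.0  (1 + 2 + 0)
println(errors.value)     // 1, reliable only after an action has run
```

Accumulator values are only dependable after an action (here `sum`) has executed, since transformations are lazy.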
Module 14: Project Planning, Monitoring, and Troubleshooting
- Master (driver) node capacity
- Slave (worker) node capacity
- Executor capacity
- Executor core capacity
- Project scenario and execution
- Out-of-memory error handling
- Master logs, Worker logs, Driver logs
- Monitoring Web UI
- Heap memory dump
Training
Basic Level Training
Duration: 1 month
Advanced Level Training
Duration: 1 month
Project Level Training
Duration: 1 month
Total Training Period
Duration: 3 months
Course Mode:
Available online / offline
Course Fees:
Please contact the office for details