Hadoop Training
Introduction to Hadoop
Get an overview of Hadoop, an open-source framework for distributed storage and processing of large data sets. Learn about Hadoop’s architecture, core components, and its role in the big data ecosystem.
Hadoop Distributed File System (HDFS)
Explore the Hadoop Distributed File System (HDFS), which provides scalable and fault-tolerant storage. Learn about its architecture, how it stores large files across distributed nodes, and how to manage HDFS data.
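The block-and-replication idea behind HDFS can be sketched in a few lines of plain Python. This is a conceptual illustration only, not real HDFS code: the `place_blocks` function, the node names, and the round-robin placement are all simplifications made up for this sketch (real HDFS placement is rack-aware).

```python
# Conceptual sketch (not real HDFS code): a file is split into fixed-size
# blocks, and each block is replicated across several DataNodes.
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, datanodes,
                 block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a placement map: block index -> DataNodes holding a replica."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    nodes = cycle(datanodes)
    placement = {}
    for block in range(num_blocks):
        # Each block is stored on `replication` nodes (round-robin here;
        # real HDFS also considers racks and node load).
        placement[block] = [next(nodes) for _ in range(replication)]
    return placement

# A 300 MB file needs 3 blocks (128 + 128 + 44 MB) on a 5-node cluster.
plan = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4", "dn5"])
```

Because replicas live on different nodes, losing any single DataNode never loses data, which is the fault-tolerance property the module covers.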
YARN (Yet Another Resource Negotiator)
Understand YARN, the resource management layer of Hadoop. Learn how YARN manages cluster resources, schedules tasks, and provides resource allocation for various applications running on Hadoop.
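The core of what a scheduler does can be shown with a toy FIFO allocator. This is a loose sketch of the idea, not the YARN API: the function name, the memory-only resource model, and the request format are invented for illustration.

```python
# A toy FIFO resource allocator, illustrating (very loosely) what the YARN
# ResourceManager does: grant container requests in order while cluster
# capacity remains, and queue the rest.
from collections import deque

def fifo_allocate(cluster_memory_mb, requests):
    """requests: list of (app_id, container_mb). Returns (granted, pending)."""
    free = cluster_memory_mb
    granted, pending = [], deque(requests)
    while pending and pending[0][1] <= free:
        app, mem = pending.popleft()
        free -= mem
        granted.append((app, mem))
    return granted, list(pending)

granted, pending = fifo_allocate(
    8192, [("app1", 4096), ("app2", 3072), ("app3", 2048)])
# app1 and app2 fit (7168 MB); app3 stays queued until memory frees up.
```

Real YARN schedulers (Capacity, Fair) refine this basic loop with queues, weights, and preemption, which the module explores in detail.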
MapReduce Programming Model
Dive into the MapReduce programming model for processing large data sets. Learn about the Map and Reduce functions, how to write MapReduce jobs, and how to execute and optimize these jobs for efficient data processing.
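The model itself fits in a few lines of plain Python. The classic word-count example below is a single-process simulation: in a real Hadoop job the map and reduce functions run on different nodes, with the framework handling the shuffle and sort of intermediate (key, value) pairs.

```python
# The MapReduce model in miniature: pure-Python word count.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort groups the pairs by key; reduce sums counts per key.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["the quick brown fox",
                                      "the lazy dog"])))
# counts["the"] == 2
```

Because map emits independent pairs and reduce only sees one key's group at a time, both phases parallelize naturally across a cluster.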
Apache Hive
Discover Apache Hive, a data warehousing solution built on top of Hadoop. Learn how Hive provides a SQL-like interface for querying and managing data stored in Hadoop, and how to use HiveQL for data analysis.
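What a HiveQL aggregation computes can be sketched in plain Python. The query and the `emp` table below are hypothetical examples invented for this sketch; the point is that a query such as `SELECT dept, SUM(salary) FROM emp GROUP BY dept` boils down to grouping rows by a key and summing a column, which Hive compiles into distributed jobs over data in HDFS.

```python
# Plain-Python equivalent of a hypothetical HiveQL GROUP BY aggregation:
# SELECT dept, SUM(salary) FROM emp GROUP BY dept
from collections import defaultdict

rows = [("eng", 100), ("sales", 80), ("eng", 120)]  # hypothetical emp table

totals = defaultdict(int)
for dept, salary in rows:
    totals[dept] += salary  # group by dept, sum salary
# totals == {"eng": 220, "sales": 80}
```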
Apache Pig
Explore Apache Pig, a high-level scripting platform for processing large data sets. Understand how Pig Latin simplifies the development of data processing workflows and how to use Pig for complex data transformations.
Apache HBase
Learn about Apache HBase, a NoSQL database that provides real-time read/write access to large datasets. Understand its architecture, data model, and use cases for high-performance data access in Hadoop environments.
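The HBase data model is often described as a sparse, sorted, multi-dimensional map. The sketch below models that shape in plain Python (it is not the HBase client API; the `put`/`get_latest` helpers are invented for illustration): row key, column family, qualifier, and timestamped versions with the newest read back by default.

```python
# A sketch of the HBase data model (not the HBase API): a sparse map of
# row key -> "family:qualifier" -> {timestamp: value}, read newest-first.
from collections import defaultdict

table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, value, timestamp):
    table[row][f"{family}:{qualifier}"][timestamp] = value

def get_latest(row, family, qualifier):
    versions = table[row][f"{family}:{qualifier}"]
    # HBase keeps multiple timestamped versions; a plain get returns
    # the most recent one.
    return versions[max(versions)] if versions else None

put("user1", "info", "name", "Ada", timestamp=1)
put("user1", "info", "name", "Ada L.", timestamp=2)  # a newer version
# get_latest("user1", "info", "name") -> "Ada L."
```

Rows that never set a given column simply have no entry, which is why HBase handles sparse, wide tables efficiently.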
Apache ZooKeeper
Discover Apache ZooKeeper, a service for coordinating distributed applications. Learn how ZooKeeper provides centralized configuration management, synchronization, and naming for distributed systems within Hadoop.
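ZooKeeper's well-known leader-election recipe can be sketched without a running server. In the recipe, each participant creates an ephemeral sequential znode and the holder of the lowest sequence number leads; the `elect` function and node names below are invented for this standalone illustration.

```python
# Sketch of the ZooKeeper leader-election recipe (no server involved):
# lowest sequence number wins; each follower watches its predecessor.
def elect(znodes):
    """znodes: dict of participant -> sequence number.
    Returns (leader, watches) where watches maps follower -> watched node."""
    ordered = sorted(znodes, key=znodes.get)
    leader = ordered[0]
    # Each non-leader watches its immediate predecessor rather than the
    # leader, avoiding a "herd effect" when the leader dies.
    watches = {ordered[i]: ordered[i - 1] for i in range(1, len(ordered))}
    return leader, watches

leader, watches = elect({"nodeA": 3, "nodeB": 1, "nodeC": 2})
# leader == "nodeB"; nodeC watches nodeB, nodeA watches nodeC
```

When a watched znode disappears, only one successor re-checks its position, so failover stays cheap even with many participants.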
Apache Flume
Learn about Apache Flume, a distributed service for collecting and aggregating log data. Understand how Flume integrates with Hadoop to handle data ingestion from various sources and stream data to HDFS.
Apache Oozie
Explore Apache Oozie, a workflow scheduler system for managing Hadoop jobs. Learn how Oozie helps schedule, coordinate, and manage complex data processing workflows across various Hadoop components.
Apache Spark Integration
Understand how Apache Spark complements Hadoop by providing in-memory processing capabilities. Learn about Spark's integration with Hadoop, its advantages, and how to leverage Spark for advanced analytics and data processing.
Hands-On Labs and Projects
Engage in hands-on labs and projects to apply your knowledge of Hadoop. Work on real-world scenarios to develop practical skills in configuring, managing, and using Hadoop for various data processing and analysis tasks.
Hadoop Syllabus
1. Introduction to Hadoop
- High Availability
- Scaling
- Advantages and Challenges
2. Introduction to Big Data
- What is Big Data
- Big Data Opportunities and Challenges
- Characteristics of Big Data
3. Introduction to Hadoop
- Hadoop Distributed File System
- Comparing Hadoop & SQL
- Industries Using Hadoop
- Data Locality
- Hadoop Architecture
- MapReduce & HDFS
- Using the Hadoop Single Node Image (Clone)
4. Hadoop Distributed File System (HDFS)
- HDFS Design & Concepts
- Blocks, NameNodes, and DataNodes
- HDFS High-Availability and HDFS Federation
- Hadoop DFS Command-Line Interface
- Basic File System Operations
- Anatomy of File Read and Write
- Block Placement Policy and Modes
- Detailed Configuration Files Explanation
- Metadata, FsImage, Edit Log, Secondary NameNode, and Safe Mode
- Adding and Decommissioning Data Nodes Dynamically
- FSCK Utility (Block Report)
- Overriding Default Configuration at System and Programming Levels
- ZooKeeper Leader Election Algorithm
- Exercises and Small Use Cases on HDFS
5. MapReduce
- MapReduce Functional Programming Basics
- Map and Reduce Basics
- How MapReduce Works
- Anatomy of a MapReduce Job Run
- Legacy Architecture: Job Submission, Initialization, Task Assignment, Execution, Progress, and Status Updates
- Job Completion and Failures
- Shuffling and Sorting
- Splits, RecordReader, Partitioner, Types of Partitioners, and Combiner
- Optimization Techniques: Speculative Execution, JVM Reuse, Number of Slots
- Types of Schedulers and Counters
- Comparisons Between Old and New API at Code and Architecture Levels
- Getting Data from RDBMS into HDFS Using Custom Data Types
- Distributed Cache and Hadoop Streaming (Python, Ruby, and R)
- YARN
- Sequence Files and Map Files
- Enabling Compression Codecs
- Map Side Join with Distributed Cache
- Types of I/O Formats: Multiple Outputs, NLineInputFormat
- Handling Small Files Using CombineFileInputFormat
6. MapReduce Programming – Java
- Hands-on “Word Count” in MapReduce in Standalone and Pseudo Distribution Mode
- Sorting Files Using Hadoop Configuration API
- Emulating “grep” for Searching Inside a File
- DBInputFormat
- Job Dependency API Discussion
- Input Format API Discussion, Split API Discussion
- Custom Data Type Creation in Hadoop
7. NoSQL
- ACID in RDBMS vs. BASE in NoSQL
- CAP Theorem and Types of Consistency
- Types of NoSQL Databases in Detail
- Columnar Databases in Detail (HBase and Cassandra)
- TTL, Bloom Filters, and Compaction
8. HBase
- HBase Installation and Concepts
- HBase Data Model and Comparison Between RDBMS and NoSQL
- Master & Region Servers
- HBase Operations (DDL and DML) Through Shell and Programming
- Catalog Tables
- Block Cache and Sharding
- Region Splits
- Data Modeling (Sequential, Salted, Promoted, and Random Keys)
- Java APIs and REST Interface
- Client-Side Buffering and Processing 1 Million Records
- HBase Counters
- Enabling Replication and HBase RAW Scans
- HBase Filters
- Bulk Loading and Co-processors (Endpoints and Observers with Programs)
- Real-World Use Case Consisting of HDFS, MapReduce, and HBase
9. Hive
- Hive Installation, Introduction, and Architecture
- Hive Services, Hive Shell, Hive Server, and Hive Web Interface (HWI)
- Metastore and HiveQL
- OLTP vs. OLAP
- Working with Tables
- Primitive and Complex Data Types
- Working with Partitions
- User Defined Functions
- Hive Bucketed Tables and Sampling
- External Partitioned Tables
- Dynamic Partition
- ORDER BY vs. DISTRIBUTE BY vs. SORT BY
- Bucketing and Sorted Bucketing with Dynamic Partition
- RCFile
- Indexes and Views
- Map Side Joins
- Compression on Hive Tables and Migrating Hive Tables
- Dynamic Variable Substitution in Hive
- Log Analysis on Hive
- Accessing HBase Tables Using Hive
- Hands-on Exercises
10. Pig
- Pig Installation
- Execution Types
- Grunt Shell
- Pig Latin
- Data Processing
- Schema on Read
- Primitive and Complex Data Types
- Tuple, Bag, and Map Schemas
- Loading and Storing
- Filtering, Grouping, and Joining
- Debugging Commands (Illustration and Explanation)
- Validations and Type Casting in Pig
- Working with Functions
- User Defined Functions
- Types of Joins in Pig and Replicated Join
- SPLIT and Multi-Query Execution
- Error Handling, FLATTEN, and ORDER BY
- Parameter Substitution
- Nested FOREACH
- Dynamic Invokers and Macros
- Accessing HBase Using Pig, Loading and Writing JSON Data
- PiggyBank
- Hands-on Exercises
11. Sqoop
- Sqoop Installation
- Import Data (Full Table, Subset, Target Directory, Protecting Password, File Formats, Compressing, Control Parallelism)
- Incremental Import (New Data, Last Imported Data, Storing Password in Metastore)
- Free Form Query Import
- Export Data to RDBMS, Hive, and HBase
- Hands-on Exercises
12. HCatalog
- HCatalog Installation
- Introduction to HCatalog
- Interoperability with Hive and Pig
- Accessing Tables from Hive and Pig
- Hands-on Exercises
13. Oozie
- Oozie Installation
- Introduction to Oozie Workflow
- Oozie Coordinators
- Oozie Bundles
- Scheduling, Monitoring, and Troubleshooting Oozie
- Hands-on Exercises
14. Flume
- Flume Installation
- Introduction to Flume
- Flume Components (Source, Channel, Sink)
- Configuration and Monitoring Flume
- Hands-on Exercises
15. ZooKeeper
- ZooKeeper Installation
- Introduction to ZooKeeper
- ZooKeeper Architecture and Components
- ZooKeeper Operations and Commands
- Hands-on Exercises
16. YARN
- Introduction to YARN
- YARN Architecture and Components
- Resource Management and Scheduling
- YARN Resource Manager, Node Manager, and Application Master
- Hands-on Exercises
Training
Basic Level Training
Duration: 1 Month
Advanced Level Training
Duration: 1 Month
Project Level Training
Duration: 1 Month
Total Training Period
Duration: 3 Months
Course Mode:
Available Online / Offline
Course Fees:
Please contact the office for details