Hadoop Training
Introduction to Hadoop
Get an overview of Hadoop, an open-source framework for distributed storage and processing of large data sets. Learn about Hadoop’s architecture, core components, and its role in the big data ecosystem.
Hadoop Distributed File System (HDFS)
Explore the Hadoop Distributed File System (HDFS), which provides scalable and fault-tolerant storage. Learn about its architecture, how it stores large files across distributed nodes, and how to manage HDFS data.
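The block-and-replication idea behind HDFS can be sketched in a few lines of plain Python. This is a conceptual illustration only, not real HDFS code: the `place_blocks` function, the node names, and the round-robin placement are all simplifications made up for this sketch (real HDFS placement is rack-aware).

```python
# Conceptual sketch (not real HDFS code): a file is split into fixed-size
# blocks, and each block is replicated across several DataNodes.
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, datanodes,
                 block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a placement map: block index -> DataNodes holding a replica."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    nodes = cycle(datanodes)
    placement = {}
    for block in range(num_blocks):
        # Each block is stored on `replication` nodes (round-robin here;
        # real HDFS also considers racks and node load).
        placement[block] = [next(nodes) for _ in range(replication)]
    return placement

# A 300 MB file needs 3 blocks (128 + 128 + 44 MB) on a 5-node cluster.
plan = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4", "dn5"])
```

Because replicas live on different nodes, losing any single DataNode never loses data, which is the fault-tolerance property the module covers.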
YARN (Yet Another Resource Negotiator)
Understand YARN, the resource management layer of Hadoop. Learn how YARN manages cluster resources, schedules tasks, and provides resource allocation for various applications running on Hadoop.
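The core of what a scheduler does can be shown with a toy FIFO allocator. This is a loose sketch of the idea, not the YARN API: the function name, the memory-only resource model, and the request format are invented for illustration.

```python
# A toy FIFO resource allocator, illustrating (very loosely) what the YARN
# ResourceManager does: grant container requests in order while cluster
# capacity remains, and queue the rest.
from collections import deque

def fifo_allocate(cluster_memory_mb, requests):
    """requests: list of (app_id, container_mb). Returns (granted, pending)."""
    free = cluster_memory_mb
    granted, pending = [], deque(requests)
    while pending and pending[0][1] <= free:
        app, mem = pending.popleft()
        free -= mem
        granted.append((app, mem))
    return granted, list(pending)

granted, pending = fifo_allocate(
    8192, [("app1", 4096), ("app2", 3072), ("app3", 2048)])
# app1 and app2 fit (7168 MB); app3 stays queued until memory frees up.
```

Real YARN schedulers (Capacity, Fair) refine this basic loop with queues, weights, and preemption, which the module explores in detail.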
MapReduce Programming Model
Dive into the MapReduce programming model for processing large data sets. Learn about the Map and Reduce functions, how to write MapReduce jobs, and how to execute and optimize these jobs for efficient data processing.
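The model itself fits in a few lines of plain Python. The classic word-count example below is a single-process simulation: in a real Hadoop job the map and reduce functions run on different nodes, with the framework handling the shuffle and sort of intermediate (key, value) pairs.

```python
# The MapReduce model in miniature: pure-Python word count.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort groups the pairs by key; reduce sums counts per key.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["the quick brown fox",
                                      "the lazy dog"])))
# counts["the"] == 2
```

Because map emits independent pairs and reduce only sees one key's group at a time, both phases parallelize naturally across a cluster.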
Apache Hive
Discover Apache Hive, a data warehousing solution built on top of Hadoop. Learn how Hive provides a SQL-like interface for querying and managing data stored in Hadoop, and how to use HiveQL for data analysis.
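What a HiveQL aggregation computes can be sketched in plain Python. The query and the `emp` table below are hypothetical examples invented for this sketch; the point is that a query such as `SELECT dept, SUM(salary) FROM emp GROUP BY dept` boils down to grouping rows by a key and summing a column, which Hive compiles into distributed jobs over data in HDFS.

```python
# Plain-Python equivalent of a hypothetical HiveQL GROUP BY aggregation:
# SELECT dept, SUM(salary) FROM emp GROUP BY dept
from collections import defaultdict

rows = [("eng", 100), ("sales", 80), ("eng", 120)]  # hypothetical emp table

totals = defaultdict(int)
for dept, salary in rows:
    totals[dept] += salary  # group by dept, sum salary
# totals == {"eng": 220, "sales": 80}
```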
Apache Pig
Explore Apache Pig, a high-level scripting platform for processing large data sets. Understand how Pig Latin simplifies the development of data processing workflows and how to use Pig for complex data transformations.
Apache HBase
Learn about Apache HBase, a NoSQL database that provides real-time read/write access to large datasets. Understand its architecture, data model, and use cases for high-performance data access in Hadoop environments.
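The HBase data model is often described as a sparse, sorted, multi-dimensional map. The sketch below models that shape in plain Python (it is not the HBase client API; the `put`/`get_latest` helpers are invented for illustration): row key, column family, qualifier, and timestamped versions with the newest read back by default.

```python
# A sketch of the HBase data model (not the HBase API): a sparse map of
# row key -> "family:qualifier" -> {timestamp: value}, read newest-first.
from collections import defaultdict

table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, value, timestamp):
    table[row][f"{family}:{qualifier}"][timestamp] = value

def get_latest(row, family, qualifier):
    versions = table[row][f"{family}:{qualifier}"]
    # HBase keeps multiple timestamped versions; a plain get returns
    # the most recent one.
    return versions[max(versions)] if versions else None

put("user1", "info", "name", "Ada", timestamp=1)
put("user1", "info", "name", "Ada L.", timestamp=2)  # a newer version
# get_latest("user1", "info", "name") -> "Ada L."
```

Rows that never set a given column simply have no entry, which is why HBase handles sparse, wide tables efficiently.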
Apache ZooKeeper
Discover Apache ZooKeeper, a service for coordinating distributed applications. Learn how ZooKeeper provides centralized configuration management, synchronization, and naming for distributed systems within Hadoop.
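ZooKeeper's well-known leader-election recipe can be sketched without a running server. In the recipe, each participant creates an ephemeral sequential znode and the holder of the lowest sequence number leads; the `elect` function and node names below are invented for this standalone illustration.

```python
# Sketch of the ZooKeeper leader-election recipe (no server involved):
# lowest sequence number wins; each follower watches its predecessor.
def elect(znodes):
    """znodes: dict of participant -> sequence number.
    Returns (leader, watches) where watches maps follower -> watched node."""
    ordered = sorted(znodes, key=znodes.get)
    leader = ordered[0]
    # Each non-leader watches its immediate predecessor rather than the
    # leader, avoiding a "herd effect" when the leader dies.
    watches = {ordered[i]: ordered[i - 1] for i in range(1, len(ordered))}
    return leader, watches

leader, watches = elect({"nodeA": 3, "nodeB": 1, "nodeC": 2})
# leader == "nodeB"; nodeC watches nodeB, nodeA watches nodeC
```

When a watched znode disappears, only one successor re-checks its position, so failover stays cheap even with many participants.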
Apache Flume
Learn about Apache Flume, a distributed service for collecting and aggregating log data. Understand how Flume integrates with Hadoop to handle data ingestion from various sources and stream data to HDFS.
Apache Oozie
Explore Apache Oozie, a workflow scheduler system for managing Hadoop jobs. Learn how Oozie helps schedule, coordinate, and manage complex data processing workflows across various Hadoop components.
Apache Spark Integration
Understand how Apache Spark complements Hadoop by providing in-memory processing capabilities. Learn about Spark's integration with Hadoop, its advantages, and how to leverage Spark for advanced analytics and data processing.
Hands-On Labs and Projects
Engage in hands-on labs and projects to apply your knowledge of Hadoop. Work on real-world scenarios to develop practical skills in configuring, managing, and using Hadoop for various data processing and analysis tasks.
Hadoop Syllabus
1. Introduction to Hadoop
- High Availability
- Scaling
- Advantages and Challenges
2. Introduction to Big Data
- What is Big Data
- Big Data Opportunities and Challenges
- Characteristics of Big Data
3. Introduction to Hadoop
- Hadoop Distributed File System
- Comparing Hadoop & SQL
- Industries Using Hadoop
- Data Locality
- Hadoop Architecture
- MapReduce & HDFS
- Using the Hadoop Single Node Image (Clone)
4. Hadoop Distributed File System (HDFS)
- HDFS Design & Concepts
- Blocks, NameNodes, and DataNodes
- HDFS High-Availability and HDFS Federation
- Hadoop DFS Command-Line Interface
- Basic File System Operations
- Anatomy of File Read and Write
- Block Placement Policy and Modes
- Detailed Configuration Files Explanation
- Metadata, FsImage, Edit Log, Secondary NameNode, and Safe Mode
- Adding and Decommissioning Data Nodes Dynamically
- FSCK Utility (Block Report)
- Overriding Default Configuration at System and Programming Levels
- ZooKeeper Leader Election Algorithm
- Exercises and Small Use Cases on HDFS
5. MapReduce
- MapReduce Functional Programming Basics
- Map and Reduce Basics
- How MapReduce Works
- Anatomy of a MapReduce Job Run
- Legacy Architecture: Job Submission, Initialization, Task Assignment, Execution, Progress, and Status Updates
- Job Completion and Failures
- Shuffling and Sorting
- Splits, RecordReader, Partitioner, Types of Partitioners, and Combiner
- Optimization Techniques: Speculative Execution, JVM Reuse, Number of Slots
- Types of Schedulers and Counters
- Comparisons Between Old and New API at Code and Architecture Levels
- Getting Data from RDBMS into HDFS Using Custom Data Types
- Distributed Cache and Hadoop Streaming (Python, Ruby, and R)
- YARN
- Sequence Files and Map Files
- Enabling Compression Codecs
- Map Side Join with Distributed Cache
- Types of I/O Formats: Multiple Outputs, NLineInputFormat
- Handling Small Files Using CombineFileInputFormat
6. MapReduce Programming – Java
- Hands-on “Word Count” in MapReduce in Standalone and Pseudo Distribution Mode
- Sorting Files Using Hadoop Configuration API
- Emulating “grep” for Searching Inside a File
- DBInputFormat
- Job Dependency API Discussion
- Input Format API Discussion, Split API Discussion
- Custom Data Type Creation in Hadoop
7. NoSQL
- ACID in RDBMS vs. BASE in NoSQL
- CAP Theorem and Types of Consistency
- Types of NoSQL Databases in Detail
- Columnar Databases in Detail (HBase and Cassandra)
- TTL, Bloom Filters, and Compaction
8. HBase
- HBase Installation and Concepts
- HBase Data Model and Comparison Between RDBMS and NoSQL
- Master & Region Servers
- HBase Operations (DDL and DML) Through Shell and Programming
- Catalog Tables
- Block Cache and Sharding
- Region Splits
- Data Modeling (Sequential, Salted, Promoted, and Random Keys)
- Java APIs and REST Interface
- Client-Side Buffering and Processing 1 Million Records
- HBase Counters
- Enabling Replication and HBase RAW Scans
- HBase Filters
- Bulk Loading and Co-processors (Endpoints and Observers with Programs)
- Real-World Use Case Consisting of HDFS, MapReduce, and HBase
9. Hive
- Hive Installation, Introduction, and Architecture
- Hive Services, Hive Shell, Hive Server, and Hive Web Interface (HWI)
- Metastore and HiveQL
- OLTP vs. OLAP
- Working with Tables
- Primitive and Complex Data Types
- Working with Partitions
- User Defined Functions
- Hive Bucketed Tables and Sampling
- External Partitioned Tables
- Dynamic Partition
- ORDER BY vs. DISTRIBUTE BY vs. SORT BY
- Bucketing and Sorted Bucketing with Dynamic Partition
- RCFile
- Indexes and Views
- Map Side Joins
- Compression on Hive Tables and Migrating Hive Tables
- Dynamic Variable Substitution in Hive
- Log Analysis on Hive
- Accessing HBase Tables Using Hive
- Hands-on Exercises
10. Pig
- Pig Installation
- Execution Types
- Grunt Shell
- Pig Latin
- Data Processing
- Schema on Read
- Primitive and Complex Data Types
- Tuple, Bag, and Map Schemas
- Loading and Storing
- Filtering, Grouping, and Joining
- Debugging Commands (Illustration and Explanation)
- Validations and Type Casting in Pig
- Working with Functions
- User Defined Functions
- Types of Joins in Pig and Replicated Join
- SPLIT and Multi-Query Execution
- Error Handling, FLATTEN, and ORDER BY
- Parameter Substitution
- Nested FOREACH
- Dynamic Invokers and Macros
- Accessing HBase Using Pig, Loading and Writing JSON Data
- PiggyBank
- Hands-on Exercises
11. Sqoop
- Sqoop Installation
- Import Data (Full Table, Subset, Target Directory, Protecting Password, File Formats, Compressing, Control Parallelism)
- Incremental Import (New Data, Last Imported Data, Storing Password in Metastore)
- Free Form Query Import
- Export Data to RDBMS, Hive, and HBase
- Hands-on Exercises
12. HCatalog
- HCatalog Installation
- Introduction to HCatalog
- Interoperability with Hive and Pig
- Accessing Tables from Hive and Pig
- Hands-on Exercises
13. Oozie
- Oozie Installation
- Introduction to Oozie Workflow
- Oozie Coordinators
- Oozie Bundles
- Scheduling, Monitoring, and Troubleshooting Oozie
- Hands-on Exercises
14. Flume
- Flume Installation
- Introduction to Flume
- Flume Components (Source, Channel, Sink)
- Configuration and Monitoring Flume
- Hands-on Exercises
15. ZooKeeper
- ZooKeeper Installation
- Introduction to ZooKeeper
- ZooKeeper Architecture and Components
- ZooKeeper Operations and Commands
- Hands-on Exercises
16. YARN
- Introduction to YARN
- YARN Architecture and Components
- Resource Management and Scheduling
- YARN Resource Manager, Node Manager, and Application Master
- Hands-on Exercises
Training
Basic Level Training
Duration: 1 Month
Advanced Level Training
Duration: 1 Month
Project Level Training
Duration: 1 Month
Total Training Period
Duration: 3 Months
Course Mode:
Available Online / Offline
Course Fees:
Please contact the office for details