Hadoop Developer

Description
Developer Training for Apache Spark and Hadoop

Learn how to import and process data with key Hadoop ecosystem tools

This four-day hands-on training course delivers the key concepts and expertise participants need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. Employing Hadoop ecosystem projects such as Spark (including Spark Streaming and Spark SQL), Flume, Kafka, and Sqoop, this course is the best preparation for the real-world challenges faced by Hadoop developers. With Spark, developers can write sophisticated parallel applications that support faster, better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.
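
To give a feel for what that looks like in practice, below is a minimal word-count sketch in PySpark, one of the two languages used in the exercises. It is an illustrative sketch only; the HDFS paths are placeholders, and any text file on your cluster will do.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Read a text file as an RDD of lines, split it into words, and count
    # how often each word occurs -- all in parallel across the cluster.
    counts = (sc.textFile("hdfs:///user/training/input.txt")    # placeholder path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///user/training/wordcounts")   # placeholder path
    sc.stop()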

Prerequisites

This course is designed for developers and engineers who have programming experience, but prior knowledge of Hadoop is not required.

  • Apache Spark examples and hands-on exercises are presented in Scala and Python; the ability to program in one of these languages is required.
  • Basic familiarity with the Linux command line is assumed
  • Basic knowledge of SQL is helpful
Key Features
  • LIVE Instructor-led Classes
  • 24x7 on-demand technical support for assignments, queries, quizzes, projects, etc.
  • Flexibility to attend classes at a time convenient for you.
  • Server Access to Massive's Tech Management System until you get into your dream career.
  • A huge database of Interview Questions
  • Professional Resume Preparation
  • Earn a Skill Certificate
  • Enroll today and get the advantage.
Curriculum
  • Apache Hadoop Overview
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Hands-On Exercises
  • Problems with Traditional Large-Scale Systems
  • HDFS Architecture
  • Using HDFS
  • Apache Hadoop File Formats
  • YARN Architecture
  • Working With YARN
  • Apache Sqoop Overview
  • Importing Data
  • Importing File Options
  • Exporting Data
  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
  • Creating RDDs
  • Other General RDD Operations
  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations
  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging
  • Review: Apache Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-Based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks
  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence
  • Common Apache Spark Use Cases
  • Iterative Algorithms in Apache Spark
  • Machine Learning
  • Example: k-means
  • Apache Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Apache Spark SQL, Impala, and Hive-on-Spark
  • Apache Spark SQL in Spark 2.x
  • What is Apache Kafka?
  • Apache Kafka Overview
  • Scaling Apache Kafka
  • Apache Kafka Cluster Architecture
  • Apache Kafka Command Line Tools
  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
  • Overview
  • Use Cases
  • Configuration
  • Apache Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Streaming Applications
  • Apache Spark Streaming: Processing Multiple Batches
  • Multi-Batch Operations
  • Time Slicing
  • State Operations
  • Sliding Window Operations
  • Streaming Data Source Overview
  • Apache Flume and Apache Kafka Data Sources
  • Example: Using a Kafka Direct Data Source
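
As a concrete illustration of the streaming topics at the end of this outline, here is a minimal sketch of reading from a Kafka direct data source with Spark Streaming. It assumes the Spark 1.x/2.x DStream API with the spark-streaming-kafka-0-8 integration (removed in Spark 3); the topic name and broker address are placeholders.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    conf = SparkConf().setAppName("KafkaDirectExample")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)  # 10-second batch interval

    # Connect to Kafka directly (no receiver); Spark creates one RDD partition
    # per Kafka partition. Topic and broker list below are placeholders.
    stream = KafkaUtils.createDirectStream(
        ssc,
        ["weblogs"],
        {"metadata.broker.list": "broker1:9092"})

    # Each element is a (key, message) pair; print the message count per batch.
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()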

Have Any Questions?

We are happy to answer any questions, and we appreciate all feedback about our work!