Big Data Comprehensive Training (Practical)

This Big Data foundation course gives you an understanding of Big Data, the potential data sources that can be used to solve real business problems, and an overview of data mining and its tools.

4-Day Workshop

Course Outline

Day 1

  • Module 1: Big Data – History, Overview, and Characteristics

    1.1 History, Definition, and Characteristics

    • History
    • Big Data Definition
    • Big Data Benefits
    • Big Data Characteristics
    • Volume
    • Velocity
    • Variety

    1.2 Big Data Technologies – Overview

    • Big Data Success Stories

    1.3 Big Data – Privacy and Ethics

    • Privacy – Compliance
    • Privacy – Challenges
    • Privacy – Approach
    • Ethics

    1.4 Big Data Projects

    • Who Should Be Involved?
    • What Is Involved?

  • Module 2: Big Data Sources

    2.1 Enterprise Data Sources

    • Enterprise Systems
    • Oracle
    • SAP
    • Microsoft
    • Data Warehouses
    • Unstructured Data – Introduction
    • Unstructured Data – Metadata

    2.2 Social Media Data Sources

    • Introduction
    • Facebook – Introduction
    • Facebook – Public Feed API
    • Facebook – Keyword Insights API
    • Facebook – Graph API
    • Twitter – Introduction
    • Twitter – Streaming APIs
    • Twitter – REST APIs
    • Other Social Media
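
    The Twitter REST APIs above are typically exercised with small scripts. Below is a minimal Python sketch against Twitter's v2 recent-search endpoint; the bearer token is a placeholder you would obtain from a Twitter developer account, and the course materials may use a different API version or client library.

    ```python
    import requests

    # Placeholder credential, not a real token: obtain one from a
    # Twitter developer account before running this.
    BEARER_TOKEN = "YOUR_BEARER_TOKEN"

    def search_recent_tweets(query, max_results=10):
        """Query Twitter's v2 recent-search REST endpoint for matching tweets."""
        url = "https://api.twitter.com/2/tweets/search/recent"
        headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
        params = {"query": query, "max_results": max_results}
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        return response.json().get("data", [])

    for tweet in search_recent_tweets("big data"):
        print(tweet["id"], tweet["text"])
    ```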

    2.3 Public Data Sources

    • Introduction
    • Weather
    • Economics
    • Finance
    • Regulatory Bodies

Day 2

  • Module 3: Data Mining – Concepts and Tools

    3.1 Data Mining – Introduction

    • Introduction
    • Types of Data Mining – Overview
    • Types of Data Mining – Classification
    • Types of Data Mining – Association
    • Types of Data Mining – Clustering
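
    To make one of these mining types concrete, here is a minimal clustering sketch in Python using scikit-learn's k-means; the library choice is an assumption for illustration, since the module's hands-on work uses Weka, KNIME, and R.

    ```python
    from sklearn.cluster import KMeans
    import numpy as np

    # Toy 2-D dataset: two visually separable groups of points.
    points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                       [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

    # Ask k-means for two clusters; fit_predict returns one cluster label per row.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print(labels)  # e.g. [0 0 0 1 1 1] -- one label per input point
    ```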

    3.2 Data Mining – Tools

    • Introduction
    • Weka
    • Modules of Weka Applications
    • KNIME
    • KNIME – Example
    • R Language

Day 3

  • Module 4: Hadoop Fundamentals, HDFS, and MapReduce

    4.1 Hadoop Fundamentals

    • Introduction
    • Main Components of Hadoop
    • Additional Components of Hadoop

    4.2 The Hadoop Distributed File System (HDFS)

    • Overview of HDFS
    • Launching HDFS in Pseudo-Distributed Mode
    • Core HDFS Services
    • Installing and Configuring HDFS
    • HDFS Commands (see the sketch after this list)
    • HDFS Safe Mode
    • Checkpointing HDFS
    • Federated and High Availability HDFS
    • Running a Fully-Distributed HDFS Cluster with Docker
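
    The HDFS Commands topic centres on the `hdfs dfs` command-line tool. As a sketch, the Python wrapper below drives the same commands from a script; it assumes a working Hadoop installation with `hdfs` on the PATH, and the file and directory names are placeholders.

    ```python
    import subprocess

    # Assumes a local Hadoop installation (e.g. the pseudo-distributed
    # setup used in this module) with the `hdfs` binary on PATH.
    def hdfs(*args):
        """Run an `hdfs dfs` subcommand and return its stdout."""
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/user/trainee/input")      # create a directory tree
    hdfs("-put", "data.txt", "/user/trainee/input")  # copy a local file into HDFS
    print(hdfs("-ls", "/user/trainee/input"))        # list the directory contents
    ```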

    4.3 MapReduce with Hadoop

    • MapReduce from the Linux Command Line
    • Scaling MapReduce on a Cluster
    • Introducing Apache Hadoop
    • Overview of YARN
    • Launching YARN in Pseudo-Distributed Mode
    • Demonstration of the Hadoop Streaming API (sketched below)
    • Demonstration of MapReduce with Java
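
    A classic illustration of the Streaming API topic is word count: any executable that reads stdin and writes stdout can serve as mapper or reducer. The sketch below combines both roles in one Python script; the jar path and HDFS paths in the comment are placeholders.

    ```python
    #!/usr/bin/env python3
    # Word-count mapper and reducer for the Hadoop Streaming API.
    # The mapper emits "word<TAB>1" per word; Hadoop sorts by key,
    # then the reducer sums the counts for each word.
    # Invocation sketch (paths are placeholders):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /input -output /output \
    #     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    #     -file wordcount.py
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()
    ```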

  • Module 5: Apache Spark, Hive, HBase, and Storm

    5.1 Introduction to Apache Spark

    • Why Spark?
    • Spark Architecture
    • Spark Drivers and Executors
    • Spark on YARN
    • Spark and the Hive Metastore
    • Structured APIs, DataFrames, and Datasets
    • The Core API and Resilient Distributed Datasets (RDDs)
    • Overview of Functional Programming
    • MapReduce with Python
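
    The contrast between the Structured APIs and the core RDD API can be seen in a few lines of PySpark. A minimal sketch, assuming a local PySpark installation:

    ```python
    from pyspark.sql import SparkSession

    # local[*] runs Spark in-process, using all available cores.
    spark = SparkSession.builder.appName("intro").master("local[*]").getOrCreate()

    # Structured API: a small DataFrame with a schema, filtered declaratively.
    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.filter(df.age > 30).show()

    # Core API: the same data as an RDD, transformed with functional operators.
    rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
    print(rdd.map(lambda kv: kv[1]).reduce(lambda a, b: a + b))  # sum of ages

    spark.stop()
    ```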

    5.2 Apache Hive

    • Hive as a Data Warehouse
    • Hive Architecture
    • Understanding the Hive Metastore and HCatalog
    • Interacting with Hive Using the Beeline Interface
    • Creating Hive Tables
    • Loading Text Data Files into Hive
    • Exploring the Hive Query Language
    • Partitions and Buckets
    • Built-in and Aggregation Functions
    • Invoking MapReduce Scripts from Hive
    • Common File Formats for Big Data Processing
    • Creating Avro and Parquet Files with Hive
    • Creating Hive Tables from Pig
    • Accessing Hive Tables with the Spark SQL Shell (sketched below)
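
    Accessing Hive tables from Spark SQL looks roughly like the following PySpark sketch; it assumes Spark can reach a Hive metastore, and `logs` stands in for any table created earlier in the module.

    ```python
    from pyspark.sql import SparkSession

    # enableHiveSupport() wires the session to the Hive metastore,
    # so Hive tables become queryable through Spark SQL.
    spark = (SparkSession.builder
             .appName("hive-access")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW TABLES").show()                  # tables known to the metastore
    spark.sql("SELECT * FROM logs LIMIT 10").show()  # query a Hive table via Spark SQL
    ```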

    5.3 Persisting Data with Apache HBase

    • Features and Use Cases
    • HBase Architecture
    • The Data Model
    • Command Line Shell
    • Schema Creation
    • Considerations for Row Key Design
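
    As a sketch of the data model and shell-style operations, the snippet below uses happybase, a third-party Python client that talks to HBase's Thrift server; the `users` table, its `info` column family, and the row key are illustrative assumptions.

    ```python
    import happybase

    # Assumes a running HBase instance with its Thrift server on localhost:9090.
    connection = happybase.Connection("localhost", port=9090)

    # Column families are fixed at schema-creation time; here an 'info'
    # family is assumed to exist on a table named 'users'.
    table = connection.table("users")
    table.put(b"row-001", {b"info:name": b"alice", b"info:age": b"34"})
    print(table.row(b"row-001"))  # {b'info:name': b'alice', b'info:age': b'34'}
    ```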

    5.4 Apache Storm

    • Processing Real-Time Streaming Data
    • Storm Architecture: Nimbus, Supervisors, and ZooKeeper
    • Application Design: Topologies, Spouts, and Bolts
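
    Storm topologies are usually written in Java, so the plain-Python sketch below only mimics the spout/bolt dataflow to show how the pieces fit together; it is not the actual Storm API.

    ```python
    # A plain-Python imitation of Storm's spout/bolt model -- illustrative
    # only, not the real Storm API (topologies are typically Java).
    class SentenceSpout:
        """A spout emits a stream of tuples; here, canned sentences."""
        def emit(self):
            yield from ["big data is big", "storm processes streams"]

    class SplitBolt:
        """A bolt consumes tuples and emits derived tuples; here, single words."""
        def process(self, sentence):
            yield from sentence.split()

    class CountBolt:
        """A terminal bolt that aggregates state; here, running word counts."""
        def __init__(self):
            self.counts = {}
        def process(self, word):
            self.counts[word] = self.counts.get(word, 0) + 1

    # Wire the "topology" together: spout -> split bolt -> count bolt.
    spout, split, count = SentenceSpout(), SplitBolt(), CountBolt()
    for sentence in spout.emit():
        for word in split.process(sentence):
            count.process(word)
    print(count.counts)  # e.g. {'big': 2, 'data': 1, 'is': 1, ...}
    ```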

Day 4

  • Module 6: Data Modelling with Document Databases

    6.1 MongoDB Fundamentals

    • Introduction
    • Replication
    • Sharding
    • Sharding and Replication
    • MongoDB Ecosystem – Languages and Drivers
    • MongoDB Ecosystem – Hadoop Integration
    • MongoDB Ecosystem – Tools
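
    A minimal PyMongo sketch of the fundamentals, assuming a local mongod on the default port; the database and collection names are placeholders.

    ```python
    from pymongo import MongoClient

    # Assumes mongod is running locally on the default port 27017.
    client = MongoClient("mongodb://localhost:27017")
    db = client["training"]   # databases and collections are created lazily
    tweets = db["tweets"]

    tweets.insert_one({"user": "alice", "text": "learning mongodb", "likes": 3})
    doc = tweets.find_one({"user": "alice"})
    print(doc["text"])
    ```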

    6.2 Installing and Configuring MongoDB

    • Download
    • How to Install and Configure

    6.3 Document Databases

    • Introduction
    • Documents
    • Document Design Considerations
    • Fields
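
    A central design consideration is whether to embed related data inside one document or reference it from separate documents. The two illustrative documents below (plain Python dictionaries standing in for BSON) contrast the options:

    ```python
    # Embedding: comments live inside the post document, so one read
    # fetches everything needed to render the post.
    post_embedded = {
        "_id": "post-1",
        "title": "Intro to Big Data",
        "comments": [
            {"user": "alice", "text": "Great overview"},
            {"user": "bob", "text": "Very helpful"},
        ],
    }

    # Referencing: comments are separate documents pointing back at the
    # post -- better when the comment list is unbounded or shared.
    post_referenced = {"_id": "post-1", "title": "Intro to Big Data"}
    comment = {"_id": "c-1", "post_id": "post-1",
               "user": "alice", "text": "Great overview"}
    ```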

    6.4 Data Modelling with Document Databases

    • Introduction
    • Twitter Sentiment Analysis
    • Twitter Sentiment Analysis – Algorithm
    • Network Log Analysis
    • Network Log Analysis – Algorithm
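
    As a sketch of the kind of sentiment-analysis algorithm walked through here, the snippet below scores tweets against small positive and negative word lists; the lexicons are illustrative stand-ins for whatever the course materials use.

    ```python
    import string

    # Tiny illustrative lexicons; real ones run to thousands of entries.
    POSITIVE = {"good", "great", "love", "excellent", "happy"}
    NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

    def sentiment(tweet_text):
        """Score a tweet as positive, negative, or neutral by counting lexicon hits."""
        words = [w.strip(string.punctuation) for w in tweet_text.lower().split()]
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(sentiment("I love this great course"))    # positive
    print(sentiment("What a terrible, awful day"))  # negative
    ```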

FAQ

What are the prerequisites?

  • All trainees should have the following:

    i) Required Knowledge

    • Conversant with an imperative programming language such as C
    • Working knowledge of SQL queries

    ii) Hardware Requirements – Minimum Laptop Configuration

    • Memory/RAM: 8 GB
    • Free disk space: 30 GB
    • CPU: 4 cores

    iii) Software Requirements

    • Windows or macOS
    • Oracle VirtualBox (https://www.virtualbox.org/wiki/Downloads)
