GCP DATA ENGINEERING

Google Cloud Data Engineering Training

with Real-world Projects and Case Studies

GCP Cloud Basics

GCP Introduction
  • The need for cloud computing in modern businesses.
  • Key features and offerings of Google Cloud Platform (GCP).
  • Overview of core GCP services and products.
  • Benefits and advantages of using cloud infrastructure.
  • Step-by-step guide to creating a free-tier account on GCP.
GCP Interfaces
  • Console
  • Navigating the GCP Console
  • Configuring the GCP Console for Efficiency
  • Using the GCP Console for Service Management
  • Shell
  • Introduction to Cloud Shell
  • Command-line Interface (CLI) Basics
  • Cloud Shell Commands for Service Deployment and Management
  • SDK
  • Overview of GCP Software Development Kits (SDKs)
  • Installing and Configuring SDKs
  • Writing and Executing GCP SDK Commands
GCP Locations
  • Regions
  • Understanding GCP Regions
  • Selecting Regions for Service Deployment
  • Impact of Region on Service Performance
  • Zones
  • Exploring GCP Zones
  • Distributing Resources Across Zones
  • High Availability and Disaster Recovery Considerations
  • Importance
  • Significance of Choosing the Right Location
  • Global vs. Regional Resources
  • Factors Influencing Location Decisions
GCP IAM & Admin
  • Identities
  • Introduction to Identity and Access Management (IAM)
  • Users, Groups, and Service Accounts
  • Best Practices for Identity Management
  • Roles
  • GCP IAM Roles Overview
  • Defining Custom Roles
  • Role-Based Access Control (RBAC) Implementation
  • Policy
  • Resource-based Policies
  • Understanding and Implementing Organization Policies
  • Auditing and Monitoring Policies
  • Resource Hierarchy
  • GCP Resource Hierarchy Structure
  • Managing Resources in a Hierarchy
  • Organizational Structure Best Practices
Linux Basics on Cloud Shell
  • Getting started with Linux
  • Linux Installation
  • Basic Linux Commands
  • Cloud Shell tips
  • File and Directory Operations (ls, cd, pwd, mkdir, rmdir, cp, mv, touch, rm, nano)
  • File Content Manipulation (cat, less, head, tail, grep)
  • Text Processing (awk, sed, cut, sort, uniq)
  • User and Permission related (whoami, id, su, sudo, chmod, chown)
Python for Data Engineers
  • Data Types
  • Strings
  • Operators
  • Numbers (Int, Float)
  • Booleans
  • Data Structures
  • Lists
  • Tuples
  • Dictionaries
  • Sets
  • Python Programming Constructs
  • if, elif, else statements
  • for loops, while loops
  • Exception Handling
  • File I/O operations
  • Modular Programming in Python
  • Functions & Lambda Functions
  • Classes
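
A minimal sketch of how the Python topics above (data structures, exception handling, file-style I/O, functions, and lambdas) fit together in a small data task. The sample rows, field names, and helper functions are hypothetical, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical sample input; io.StringIO stands in for a real file handle.
RAW = """name,city,amount
asha,Pune,120.5
ravi,Delhi,not-a-number
meera,Pune,80
"""

def parse_rows(handle):
    """Return (records, errors): parsed dict records and the line numbers that failed."""
    records, errors = [], []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        try:
            row["amount"] = float(row["amount"])
            records.append(row)
        except ValueError:
            # Keep going on bad rows, but remember where they were.
            errors.append(line_no)
    return records, errors

def total_by_city(records):
    """Sum the amount field per city using a dictionary."""
    totals = {}
    for rec in records:
        totals[rec["city"]] = totals.get(rec["city"], 0.0) + rec["amount"]
    return totals

records, errors = parse_rows(io.StringIO(RAW))
totals = total_by_city(records)
cities = {rec["city"] for rec in records}  # set of distinct cities
ranked = sorted(records, key=lambda rec: rec["amount"], reverse=True)
print(totals, errors, cities, [rec["name"] for rec in ranked])
```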

GCP Data Engineering Tools

Google Cloud Storage
  • Overview of Cloud Storage as a scalable and durable object storage service
  • Understanding buckets and objects in Cloud Storage.
  • Use cases for Cloud Storage, such as data backup, multimedia storage, and website content
  • Creating and managing Cloud Storage buckets.
  • Uploading and downloading objects to and from Cloud Storage.
  • Setting access controls and permissions for buckets and objects.
  • Data Transfer and Lifecycle Management
  • Object Versioning
  • Integration with Other GCP Services
  • Implementing best practices for optimizing Cloud Storage performance.
  • Securing data in Cloud Storage with encryption and access controls
  • Monitoring and logging for Cloud Storage operations.
Cloud SQL
  • Introduction to Cloud SQL
  • Creating and Managing Cloud SQL Instances
  • Configuring database settings, users, and access controls
  • Connecting to Cloud SQL instances using Cloud SQL Studio, Shell, and Workbenches
  • Importing and exporting data in Cloud SQL.
  • Backups and High Availability
  • Integration with Other GCP Services
  • Managing database user roles and permissions.
  • Introduction to Database Migration Service (DMS)
  • End-to-End Database Migration Project
  • Offline: Export and Import method
  • Online: DMS method
BigQuery (SQL development)
  • Introduction to BigQuery
  • BigQuery Architecture
  • Use cases for BigQuery in business intelligence and analytics
  • Various methods of creating tables in BigQuery
  • BigQuery Data Sources and File Formats
  • Native Tables and External Tables
  • SQL Queries and Performance Optimization
  • Writing and optimizing SQL queries in BigQuery.
  • Understanding query execution plans and best practices
  • Partitioning and clustering tables for performance
  • Data Integration and Export
  • Loading data into BigQuery from Cloud Storage, Cloud SQL, and other sources
  • Exporting data from BigQuery to various formats.
  • Real-time data streaming into BigQuery.
  • Configuring access controls and permissions in BigQuery.
  • BigQuery Views:
  • Views
  • Materialized Views
  • Authorized Views
  • Integration with Other GCP Services
  • Integrating BigQuery with Dataflow for ETL processes
  • Building data pipelines with BigQuery and Composer
  • Case Study-1: Spotify
  • Case Study-2: Social Media
Dataproc (PySpark Development)
  • Introduction to Hadoop and Apache Spark
  • Understanding the difference between Spark and MapReduce
  • What are Spark and PySpark
  • Understanding Spark framework and its functionalities
  • Overview of Dataproc as a fully managed Apache Spark and Hadoop service.
  • Use cases for Dataproc in data processing and analytics.
  • Cluster Creation and Configuration
  • Creating and managing Dataproc clusters.
  • Configuring cluster properties for performance and scalability
  • Preemptible instances and cost optimization.
  • Running Jobs on Dataproc
  • Submitting and monitoring Spark and Hadoop jobs on Dataproc.
  • Use of initialization actions and custom scripts.
  • Job debugging and troubleshooting
  • Integration with Storage and BigQuery
  • Reading and writing data from/to Cloud Storage and BigQuery
  • Integrating Dataproc with other storage solutions.
  • Performance optimization for data access.
  • Automation and scheduling of recurring jobs.
  • Case Study-1: Data Cleaning of Employee Travel Records
  • End-to-End Batch PySpark Pipeline using Dataproc, BigQuery, and GCS
Databricks on GCP
  • What is the Databricks Lakehouse Platform
  • Databricks architecture and components
  • Setting up and Administering a Databricks workspace
  • Managing data with Delta Lake
  • Databricks Unity Catalog
  • Notebooks and clusters
  • ELT with Spark SQL and Python
  • Optimizing performance within Databricks
  • Incremental Data Processing
  • Delta Live Tables
  • Case Study: Creating End-to-End Workflows
Dataflow (Apache Beam Development)
  • Introduction to Dataflow
  • Use cases for Dataflow in real-time analytics and ETL
  • Understanding the difference between Apache Spark and Apache Beam
  • How Dataflow is different from Dataproc
  • Building Data Pipelines with Apache Beam
  • Writing Apache Beam pipelines for batch and stream processing
  • Custom Pipelines and Pre-defined pipelines
  • Transformations and windowing concepts.
  • Integration with Other GCP Services
  • Integrating Dataflow with BigQuery, Pub/Sub, and other GCP services.
  • Real-time analytics and visualization using Dataflow and BigQuery.
  • Workflow orchestration with Composer.
  • End-to-End Streaming Pipeline using Apache Beam with Dataflow, a Python app, Pub/Sub, BigQuery, and GCS
  • Template method of creating pipelines
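
To illustrate the windowing concept listed above, here is a minimal sketch in plain Python (not the Apache Beam API). It assigns each event to a fixed-size window by its event time and sums the values per window. The event data, the 60-second window size, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size

# Hypothetical (event_time_seconds, value) pairs, deliberately out of order.
events = [(5, 1), (62, 4), (59, 2), (130, 7), (61, 3)]

def fixed_window_sums(events, size):
    """Group events into [start, start + size) windows by event time and sum values."""
    sums = defaultdict(int)
    for ts, value in events:
        # Integer division snaps the timestamp to the start of its window.
        window_start = (ts // size) * size
        sums[window_start] += value
    return dict(sorted(sums.items()))

result = fixed_window_sums(events, WINDOW_SECONDS)
print(result)
```

Because grouping uses the event timestamp rather than arrival order, the out-of-order events still land in the correct window. A streaming runner additionally has to decide when a window is complete, which this sketch does not model.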
Cloud Pub/Sub
  • Introduction to Pub/Sub
  • Understanding the role of Pub/Sub in event-driven architectures.
  • Key Pub/Sub concepts: topics, subscriptions, messages, and acknowledgments.
  • Creating and Managing Topics and Subscriptions
  • Using the GCP Console to create Pub/Sub topics and subscriptions.
  • Configuring message retention policies and acknowledgment settings.
  • Publishing and Consuming Messages
  • Writing and deploying code to publish messages to a topic.
  • Implementing subscribers to consume and process messages from subscriptions.
  • Integration with Other GCP Services
  • Connecting Pub/Sub with Cloud Functions for serverless event-driven computing
  • Integrating Pub/Sub with Dataflow for real-time stream processing.
  • Streaming use-case using Dataflow
Cloud Composer (DAG Creation)
  • Introduction to Composer/Airflow
  • Overview of Airflow Architecture
  • Use cases for Composer in managing and scheduling workflows.
  • Creating and Managing Workflows
  • Creating and configuring Composer environments.
  • Defining and scheduling workflows using Apache Airflow.
  • Monitoring and managing workflow executions.
  • Integration with Data Engineering Services
  • Orchestrating workflows involving BigQuery, Dataflow, and other services.
  • Coordinating ETL processes with Composer.
  • Integrating with external systems and APIs.
  • Error Handling and Troubleshooting
  • Handling errors and retries in Composer workflows
  • Debugging and troubleshooting failed workflow executions.
  • Logging and monitoring for Composer workflows.
  • Level-1-DAG: Orchestrating the BigQuery pipelines
  • Level-2-DAG: Orchestrating the Dataproc pipelines
  • Level-3-DAG: Orchestrating the Dataflow pipelines
  • Implementing CI/CD in Composer Using Cloud Build and GitHub
Data Fusion
  • Introduction to Data Fusion
  • Overview of Data Fusion as a fully managed data integration service.
  • Use cases for Data Fusion in ETL and data migration.
  • Building Data Integration Pipelines
  • Creating ETL pipelines using the visual interface.
  • Configuring data sources, transformations, and sinks.
  • Using pre-built templates for common integration scenarios.
  • Integration with GCP and External Services
  • Integrating Data Fusion with BigQuery, Cloud Storage, and other GCP services.
  • End-to-End Pipeline using Data Fusion with Wrangler, GCS, and BigQuery
Cloud Functions
  • Cloud Functions Introduction
  • Setting up Cloud Functions in GCP
  • Event-driven architecture and use cases
  • Writing and deploying Cloud Functions
  • Triggering Cloud Functions:
  • HTTP triggers
  • Pub/Sub triggers
  • Cloud Storage triggers
  • Monitoring and logging Cloud Functions
  • Use Case-1: Loading files from GCS into BigQuery as soon as they are uploaded.
Terraform
  • Terraform Introduction
  • Installing and configuring Terraform.
  • Infrastructure Provisioning
  • Terraform basic commands
  • init, plan, apply, destroy
  • Create Resources in Google Cloud Platform
  • GCS buckets
  • Dataproc cluster
  • BigQuery Datasets and tables
  • And more resources as needed

What Students Can Expect by the End of the Course

Proficiency in SQL Development:

  • Mastering SQL for querying and manipulating data within Google BigQuery and Cloud SQL.
  • Writing complex queries and optimizing performance for large-scale datasets.
  • Understanding schema design and best practices for efficient data storage.

PySpark Development Skills:

  • Proficiency in using PySpark for large-scale data processing on Google Cloud.
  • Developing and optimizing Spark jobs for distributed data processing.
  • Understanding Spark’s RDDs, DataFrames, and transformations for data manipulation.

Apache Beam Development Mastery:

  • Creating data processing pipelines using Apache Beam
  • Understanding the concepts of parallel processing and data parallelism.
  • Implementing transformations and integrating with other GCP services.

DAG Creation with Cloud Composer:

  • Designing and implementing Directed Acyclic Graphs (DAGs) for orchestrating workflows.
  • Using Cloud Composer for workflow automation and managing dependencies.
  • Developing DAGs that integrate various GCP services for end-to-end data processing.
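
As a rough illustration of what a DAG means for workflow ordering, the sketch below uses Python's standard-library graphlib module (Python 3.9+) rather than Airflow itself. The task names are hypothetical; each task maps to the set of tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "load_to_bigquery": {"clean_with_dataproc"},
    "clean_with_dataproc": {"land_files_in_gcs"},
    "land_files_in_gcs": set(),
    "publish_report": {"load_to_bigquery"},
}

def is_acyclic(deps):
    """Return True when the dependency graph has no cycles, i.e. it is a valid DAG."""
    try:
        list(TopologicalSorter(deps).static_order())
        return True
    except CycleError:
        return False

# A topological order lists every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

An orchestrator needs this acyclic property to schedule tasks: with a cycle there is no order in which every dependency finishes before the task that needs it.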

Notebooks and Workflows with Databricks:

  • Building and managing data pipelines using Databricks and Delta Lake.
  • Efficiently querying and analyzing large datasets with Databricks SQL and Apache Spark.
  • Implementing scalable workflows and optimizing performance within Databricks.

Architecture Planning:

  • Architecting end-to-end data solutions on GCP.
  • Understanding the principles of designing scalable, reliable, and cost-effective data
    architectures.

Certification Readiness:

  • Preparing for the Google Cloud Professional Data Engineer (PDE) and Associate Cloud Engineer
    (ACE) certifications through a combination of theoretical knowledge and hands-on experience.

The course equips students with practical skills in SQL, PySpark, Apache Beam, DAG creation,
Databricks workflows, and architecture planning, so that they are well prepared to tackle
real-world data engineering challenges and to obtain GCP certifications.