Data Engineer Course Curriculum

Topic 3.1: NoSQL with MongoDB

Theory:

  • Document-oriented structure

  • Collections, documents, and BSON

  • CRUD operations in MongoDB

  • Aggregation pipeline and indexing

Lab:

  • Insert and update customer profiles in MongoDB

  • Query transaction logs stored as JSON

Scenarios:

  • Store dynamic KYC details

  • Query transaction history based on user/device ID

Tasks:

  • Design a collection schema for credit card transactions

  • Run an aggregation pipeline to detect spending spikes
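
For the aggregation task above, a minimal PyMongo sketch (the database/collection names, fields such as card_id, amount, and ts, and the 3x threshold are illustrative assumptions; $dateTrunc needs MongoDB 5.0+):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")          # assumed local instance
    txns = client["bank"]["transactions"]                      # hypothetical db/collection

    # Total spend per card per day, then flag cards whose peak day is
    # more than 3x their own average daily spend.
    pipeline = [
        {"$group": {
            "_id": {"card": "$card_id",
                    "day": {"$dateTrunc": {"date": "$ts", "unit": "day"}}},
            "daily_spend": {"$sum": "$amount"},
        }},
        {"$group": {
            "_id": "$_id.card",
            "avg_daily": {"$avg": "$daily_spend"},
            "max_daily": {"$max": "$daily_spend"},
        }},
        {"$match": {"$expr": {"$gt": ["$max_daily",
                                      {"$multiply": ["$avg_daily", 3]}]}}},
    ]

    for doc in txns.aggregate(pipeline):
        print(doc)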

Challenges:

  • Schema design issues in unstructured data

  • Aggregation performance on large nested data

Topic 3.2: Search with ElasticSearch (Optional)

Theory:

  • Indexes, documents, mappings

  • Full-text search and analyzers

  • Ingest pipelines and text preprocessing

Lab:

  • Index customer feedback data

  • Search transaction logs for keywords like “failed”, “suspicious”
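
A minimal sketch of that indexing and keyword search with the official Python client (8.x syntax; the txn-logs index name and message field are illustrative assumptions, and match behaviour depends on the analyzer configured at index time):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")                 # assumed local cluster

    # Index a sample log document (index name and fields are illustrative)
    es.index(index="txn-logs", document={
        "txn_id": "T1001",
        "status": "failed",
        "message": "Card declined, suspicious retry pattern",
    })

    # Full-text search for keywords such as "failed" or "suspicious"
    resp = es.search(index="txn-logs", query={
        "match": {"message": "failed suspicious"}
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])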

Scenarios:

  • Fast search of transaction logs during investigation

  • Extract complaints containing negative sentiment

Tasks:

  • Index JSON logs into ElasticSearch

  • Build a search query for all failed transactions

Challenges:

  • Text preprocessing errors causing false matches

  • High index size slowing down search

Topic 3.3: Graph Databases with Neo4j (Optional)

Theory:

  • Nodes, relationships, and properties

  • Cypher Query Language basics

  • Pattern matching in graphs

Lab:

  • Create a graph of customer-referral relationships

  • Query fraudulent clusters in payment networks

Scenarios:

  • Detect fraud rings through common devices or accounts

  • Track transaction paths in real time

Tasks:

  • Load customer-merchant transactions into a graph

  • Write a Cypher query to find highly connected nodes
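
For the highly-connected-nodes task above, a minimal sketch with the official Neo4j Python driver (node labels, the PAID relationship type, credentials, and the threshold of 50 merchants are illustrative assumptions):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))   # placeholder credentials

    # Customers connected to an unusually high number of distinct merchants
    query = """
    MATCH (c:Customer)-[:PAID]->(m:Merchant)
    WITH c, count(DISTINCT m) AS merchant_count
    WHERE merchant_count > 50
    RETURN c.customer_id AS customer, merchant_count
    ORDER BY merchant_count DESC
    LIMIT 20
    """

    with driver.session() as session:
        for record in session.run(query):
            print(record["customer"], record["merchant_count"])

    driver.close()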

Challenges:

  • Memory bottlenecks in graph traversal

  • Relationship duplication leading to noise

Topic 3.4: Column-Oriented Storage with HBase (Optional)

Theory:

  • HBase architecture (HDFS, HRegionServer)

  • Tables, column families, rows, cells

  • Random access and batch jobs

Lab:

  • Store historical credit data in HBase

  • Scan for specific user ID or date range
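
A minimal sketch of that scan with the HappyBase client (the table name, the cf column family, and a user_id#date row-key layout are illustrative assumptions):

    import happybase

    connection = happybase.Connection("localhost")              # assumed Thrift server on default port
    table = connection.table("credit_history")                  # hypothetical table

    # Row keys assumed to look like "<user_id>#<yyyymmdd>", so a range scan
    # returns one user's history for a given date range.
    for key, data in table.scan(row_start=b"U1001#20150101",
                                row_stop=b"U1001#20241231"):
        print(key, data.get(b"cf:credit_score"))

    connection.close()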

Scenarios:

  • Store 10 years of customer activity for compliance

  • Retrieve long-term account data on demand

Tasks:

  • Connect to HBase using Happybase client

  • Query a range of customer credit scores

Challenges:

  • Key design leading to hot regions

  • Complex query logic compared to SQL/NoSQL


Module 4: Batch & Streaming (PySpark, Kafka)

Topic 4.1: Distributed Data Processing with PySpark

Theory:

  • Spark architecture (Driver, Executor, Cluster Manager)

  • RDDs vs DataFrames

  • Transformations (map, filter, join) & Actions (collect, count)

  • Spark SQL

  • Partitioning and performance tuning

Lab:

  • Load and process transaction data with PySpark

  • Run joins between customer and account datasets

  • Aggregate daily and weekly balances
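
A minimal PySpark sketch for the join-and-aggregate labs above (file paths and column names such as customer_id, region, amount, and txn_ts are illustrative assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("txn-batch").getOrCreate()

    txns = spark.read.parquet("/data/transactions")             # assumed input locations
    customers = spark.read.parquet("/data/customers")

    # Join transactions to customer attributes, then aggregate daily totals per region
    enriched = txns.join(customers, on="customer_id", how="left")

    daily = (enriched
             .groupBy("region", F.to_date("txn_ts").alias("txn_date"))
             .agg(F.sum("amount").alias("daily_total"),
                  F.count("*").alias("txn_count")))

    daily.orderBy("region", "txn_date").show(20, truncate=False)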

Scenarios:

  • Analyze 10 million+ transactions in parallel

  • Detect anomalies in large datasets

Tasks:

  • Write PySpark code to summarize transactions by region

  • Perform ETL using DataFrame transformations

Challenges:

  • Data skew causing stage failures

  • Memory overflows on wide transformations

Topic 4.2: Real-Time Processing with Spark Streaming

Theory:

  • Structured Streaming architecture

  • Micro-batch model

  • Sources: Kafka, Files, Sockets

  • Watermarking and late data handling

Lab:

  • Simulate a streaming pipeline from transaction logs

  • Join real-time transactions with static customer data
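
A minimal Structured Streaming sketch for that lab, reading from a Kafka topic and joining against a static customer table (the topic name, schema, paths, checkpoint location, and the simple amount threshold are illustrative assumptions):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("txn-stream").getOrCreate()

    schema = (StructType()
              .add("customer_id", StringType())
              .add("amount", DoubleType())
              .add("txn_ts", TimestampType()))

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "transactions")                 # assumed topic
           .load())

    txns = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
               .select("t.*"))

    customers = spark.read.parquet("/data/customers")           # static lookup table

    flagged = (txns
               .withWatermark("txn_ts", "10 minutes")           # tolerate 10 min of late data
               .join(customers, "customer_id", "left")
               .where(F.col("amount") > 10000))                 # naive placeholder rule

    query = (flagged.writeStream
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/txn-stream")
             .outputMode("append")
             .start())
    query.awaitTermination()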

Scenarios:

  • Detect fraudulent transactions as they happen

  • Stream updates from ATM usage

Tasks:

  • Set up checkpointing and fault tolerance

  • Build a pipeline that flags suspicious activity

Challenges:

  • Handling late-arriving data correctly

  • Ensuring exactly-once processing guarantees

Topic 4.3: Kafka for Data Ingestion

Theory:

  • Kafka architecture: Producer, Broker, Topic, Consumer

  • Partitions and offsets

  • Kafka Streams vs Kafka Connect

  • Reliability (acks, retries)

Lab:

  • Create Kafka topics for real-time data (e.g., transactions, alerts)

  • Produce and consume messages using Python

Scenarios:

  • Ingest credit card transactions via Kafka

  • Send alerts to fraud detection module

Tasks:

  • Write a Python Kafka producer for simulated ATM logs (see the sketch after this list)

  • Consume Kafka messages into Spark Streaming job
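
A minimal kafka-python producer sketch for the simulated ATM logs task above (the atm-logs topic name and event fields are illustrative assumptions):

    import json
    import random
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",                                             # wait for in-sync replicas
        retries=3,
    )

    for _ in range(100):
        event = {
            "atm_id": f"ATM-{random.randint(1, 50)}",
            "card_id": f"CARD-{random.randint(1000, 9999)}",
            "amount": round(random.uniform(20, 500), 2),
            "ts": time.time(),
        }
        producer.send("atm-logs", value=event)

    producer.flush()
    producer.close()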

Challenges:

  • Managing offset lag and duplicates

  • Producer/Consumer crash recovery


Module 5: Practical Data Warehousing (Snowflake, DBT)

Topic 5.1: Snowflake Cloud Data Warehousing

Theory:

  • Introduction to Snowflake architecture (decoupled storage & compute)

  • Data loading using COPY INTO

  • Virtual warehouses and scaling

  • SQL-based analytics

  • Time Travel, Cloning, and Fail-Safe

  • Semi-structured data support (JSON, Avro, etc.)

Lab:

  • Load customer transaction files into Snowflake

  • Query historical balance snapshots using Time Travel

  • Use Snowflake’s VARIANT column type for semi-structured KYC data

Scenarios:

  • Store regulatory data for up to 7 years

  • Allow auditors to clone transaction data without duplicating storage

Tasks:

  • Set up a staging schema for raw data

  • Create optimized views for BI dashboards

  • Write SQL to track weekly balance changes
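
For the weekly-balance task above, a minimal sketch using the Snowflake Python connector (connection parameters, the balances table, and its columns are illustrative assumptions; MAX_BY requires a reasonably recent Snowflake release):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="change_me",   # placeholder credentials
        warehouse="ANALYTICS_WH", database="BANK", schema="STAGING",
    )

    weekly_change_sql = """
    WITH weekly AS (
        SELECT customer_id,
               DATE_TRUNC('week', balance_date)  AS week_start,
               MAX_BY(balance, balance_date)     AS week_end_balance
        FROM balances
        GROUP BY customer_id, week_start
    )
    SELECT customer_id,
           week_start,
           week_end_balance,
           week_end_balance
             - LAG(week_end_balance) OVER (PARTITION BY customer_id
                                           ORDER BY week_start) AS change_vs_prev_week
    FROM weekly
    ORDER BY customer_id, week_start
    """

    for row in conn.cursor().execute(weekly_change_sql):
        print(row)

    conn.close()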

Challenges:

  • Cost management with virtual warehouse auto-scaling

  • Optimizing semi-structured query performance

  • Detecting and fixing failed bulk loads

Topic 5.2: ELT in the Warehouse with DBT (Data Build Tool)

Theory:

  • DBT vs traditional ETL

  • SQL-based transformation logic

  • Materializations: table, view, incremental

  • Testing, documentation, and lineage tracking

  • Jinja templating in SQL

Lab:

  • Use DBT to transform raw transaction data into analytics-ready models

  • Implement incremental models for daily processing

  • Add documentation and column-level tests

Scenarios:

  • Build a customer360 model for analytics

  • Maintain history tables for fraud analysis

Tasks:

  • Create a DBT model to join multiple raw sources

  • Add custom tests to ensure no nulls in customer_id

  • Schedule DBT models via Airflow (see the DAG sketch at the end of this topic)

Challenges:

  • Managing dependencies between DBT models

  • Debugging Jinja errors in templated SQL

  • Handling schema changes in source tables
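
For the Airflow scheduling task above, a minimal DAG sketch (Airflow 2.x; the dbt project path, model selector, and nightly schedule are illustrative assumptions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dbt_daily_models",
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 2 * * *",                          # nightly at 02:00
        catchup=False,
    ) as dag:

        # Run the customer360 model and everything downstream of it, then test it
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt/analytics && dbt run --select customer360+",
        )

        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt/analytics && dbt test --select customer360+",
        )

        dbt_run >> dbt_test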