Data Engineer Course Curriculum
Topic 3.1: NoSQL with MongoDB
Theory:
Document-oriented structure
Collections, documents, and BSON
CRUD operations in MongoDB
Aggregation pipeline and indexing
Lab:
Insert and update customer profiles in MongoDB
Query transaction logs stored as JSON
Scenarios:
Store dynamic KYC details
Query transaction history based on user/device ID
Tasks:
Design a collection schema for credit card transactions
Run an aggregation pipeline to detect spending spikes (see sketch below)
Challenges:
Schema design issues in unstructured data
Aggregation performance on large nested data
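A minimal PyMongo sketch of the spending-spike task above. The database, collection, field names (customer_id, txn_ts, amount), and the 3x-average threshold are illustrative assumptions, not part of the lab data.

```python
# Hypothetical connection and collection; field names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
txns = client["bank"]["card_transactions"]

pipeline = [
    # 1. Total spend per customer per calendar day
    {"$group": {
        "_id": {"customer_id": "$customer_id",
                "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$txn_ts"}}},
        "daily_spend": {"$sum": "$amount"},
    }},
    # 2. Collect daily totals and the average per customer
    {"$group": {
        "_id": "$_id.customer_id",
        "days": {"$push": {"day": "$_id.day", "spend": "$daily_spend"}},
        "avg_spend": {"$avg": "$daily_spend"},
    }},
    # 3. Keep only days where spend exceeds 3x the customer's average (assumed threshold)
    {"$project": {
        "spikes": {"$filter": {
            "input": "$days", "as": "d",
            "cond": {"$gt": ["$$d.spend", {"$multiply": ["$avg_spend", 3]}]},
        }},
    }},
    {"$match": {"spikes": {"$ne": []}}},
]

for doc in txns.aggregate(pipeline):
    print(doc["_id"], doc["spikes"])
```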
Topic 3.2: Search with ElasticSearch (Optional)
Theory:
Indexes, documents, mappings
Full-text search and analyzers
Ingest pipelines and text preprocessing
Lab:
Index customer feedback data
Search transaction logs for keywords like “failed”, “suspicious”
Scenarios:
Fast search of transaction logs during investigation
Extract complaints containing negative sentiment
Tasks:
Index JSON logs into ElasticSearch
Build a search query for all failed transactions (see sketch below)
Challenges:
Text preprocessing errors causing false matches
High index size slowing down search
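A minimal sketch of indexing and searching with the elasticsearch Python client (8.x-style API). The local node URL, index name, and document fields are assumptions.

```python
# Assumed local node, index name, and field names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one JSON transaction log document
es.index(index="transaction_logs", document={
    "txn_id": "T1001",
    "status": "failed",
    "message": "Card declined: suspicious merchant category",
    "timestamp": "2024-05-01T10:15:00",
})

# Full-text search for failed or suspicious transactions
resp = es.search(index="transaction_logs", query={
    "bool": {
        "should": [
            {"match": {"status": "failed"}},
            {"match": {"message": "suspicious"}},
        ],
        "minimum_should_match": 1,
    },
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["txn_id"], hit["_score"])
```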
Topic 3.3: Graph Databases with Neo4j (Optional)
Theory:
Nodes, relationships, and properties
Cypher Query Language basics
Pattern matching in graphs
Lab:
Create a graph of customer-referral relationships
Query fraudulent clusters in payment networks
Scenarios:
Detect fraud rings through common devices or accounts
Track transaction paths in real time
Tasks:
Load customer-merchant transactions into a graph
Write a Cypher query to find highly connected nodes (see sketch below)
Challenges:
Memory bottlenecks in graph traversal
Relationship duplication leading to noise
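A minimal sketch with the official neo4j Python driver: one query loads a customer-merchant payment edge, another finds highly connected merchant nodes. Labels, the PAID relationship type, credentials, and the connection-count threshold are assumptions.

```python
# Assumed local instance, credentials, labels, and relationship type.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Load one customer-merchant payment into the graph
load_query = """
MERGE (c:Customer {id: $customer_id})
MERGE (m:Merchant {id: $merchant_id})
MERGE (c)-[t:PAID {txn_id: $txn_id}]->(m)
SET t.amount = $amount
"""

# Find merchants connected to an unusually high number of distinct customers
hub_query = """
MATCH (c:Customer)-[:PAID]->(m:Merchant)
WITH m, count(DISTINCT c) AS customers
WHERE customers > 100            // assumed threshold for a 'hub' node
RETURN m.id AS merchant, customers
ORDER BY customers DESC LIMIT 10
"""

with driver.session() as session:
    session.run(load_query, customer_id="C1", merchant_id="M9",
                txn_id="T1001", amount=250.0)
    for record in session.run(hub_query):
        print(record["merchant"], record["customers"])

driver.close()
```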
Topic 3.4: Column-Oriented Storage with HBase (Optional)
Theory:
HBase architecture (HDFS, HMaster, RegionServer)
Tables, column families, rows, cells
Random access and batch jobs
Lab:
Store historical credit data in HBase
Scan for specific user ID or date range
Scenarios:
Store 10 years of customer activity for compliance
Retrieve long-term account data on demand
Tasks:
Connect to HBase using Happybase client
Query a range of customer credit scores (see sketch below)
Challenges:
Key design leading to hot regions
Complex query logic compared to SQL/NoSQL
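A minimal sketch with the happybase client named in the tasks, assuming an HBase Thrift gateway on the default port; the table name, column family, and composite row-key scheme are illustrative.

```python
# Assumed Thrift gateway, table, and column family.
import happybase

conn = happybase.Connection("localhost", port=9090)
table = conn.table("customer_credit")

# Composite row keys like "CUST000123#2024-05-01" keep one customer's history contiguous
table.put(b"CUST000123#2024-05-01", {
    b"scores:credit_score": b"712",
    b"scores:bureau": b"CIBIL",
})

# Scan one customer's rows across a date range (stop key is exclusive)
for key, data in table.scan(row_start=b"CUST000123#2024-01-01",
                            row_stop=b"CUST000123#2024-12-31"):
    print(key, data[b"scores:credit_score"])

conn.close()
```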
Module 4: Batch & Streaming (PySpark, Kafka)
Topic 4.1: Distributed Data Processing with PySpark
Theory:
Spark architecture (Driver, Executor, Cluster Manager)
RDDs vs DataFrames
Transformations (map, filter, join) & Actions (collect, count)
Spark SQL
Partitioning and performance tuning
Lab:
Load and process transaction data with PySpark
Run joins between customer and account datasets
Aggregate daily and weekly balances
Scenarios:
Analyze 10 million+ transactions in parallel
Detect anomalies in large datasets
Tasks:
Write PySpark code to summarize transactions by region
Perform ETL using DataFrame transformations (see sketch below)
Challenges:
Data skew causing stage failures
Memory overflows on wide transformations
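A minimal PySpark sketch covering both tasks above: a small ETL flow (read CSV, derive a day column, write Parquet) and a per-region summary. File paths and column names are assumptions.

```python
# Paths and column names (region, amount, txn_ts) are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("txn_summary").getOrCreate()

# Extract: read raw transaction files
txns = (spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("/data/raw/transactions.csv"))

# Transform: derive a day column and summarize by region
summary = (txns
           .withColumn("txn_day", F.to_date("txn_ts"))
           .groupBy("region", "txn_day")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("txn_count")))

# Load: write the curated summary back out
summary.write.mode("overwrite").parquet("/data/curated/txn_summary")
```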
Topic 4.2: Real-Time Processing with Spark Streaming
Theory:
Structured Streaming architecture
Micro-batch model
Sources: Kafka, Files, Sockets
Watermarking and late data handling
Lab:
Simulate a streaming pipeline from transaction logs
Join real-time transactions with static customer data
Scenarios:
Detect fraudulent transactions as they happen
Stream updates from ATM usage
Tasks:
Set up checkpointing and fault tolerance
Build a pipeline that flags suspicious activity (see sketch below)
Challenges:
Handling late-arriving data correctly
Ensuring exactly-once processing guarantees
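A minimal Structured Streaming sketch of the suspicious-activity pipeline, assuming transactions arrive as JSON on a Kafka topic and that the spark-sql-kafka connector is on the classpath; the topic name, schema, paths, and the amount threshold are assumptions.

```python
# Topic, schema, paths, and the 10,000 amount threshold are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("txn_stream").getOrCreate()

schema = (StructType()
          .add("txn_id", StringType())
          .add("customer_id", StringType())
          .add("amount", DoubleType())
          .add("txn_ts", TimestampType()))

# Static customer dimension for the stream-static join
customers = spark.read.parquet("/data/curated/customers")

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "transactions")
       .load())

flagged = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
           .select("t.*")
           .withWatermark("txn_ts", "10 minutes")     # tolerate late-arriving events
           .join(customers, "customer_id")
           .filter(F.col("amount") > 10000))          # naive "suspicious" rule

query = (flagged.writeStream
         .format("console")
         .option("checkpointLocation", "/chk/txn_stream")   # enables recovery on restart
         .outputMode("append")
         .start())
query.awaitTermination()
```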
Topic 4.3: Kafka for Data Ingestion
Theory:
Kafka architecture: Producer, Broker, Topic, Consumer
Partitions and offsets
Kafka Streams vs Kafka Connect
Reliability (acks, retries)
Lab:
Create Kafka topics for real-time data (e.g., transactions, alerts)
Produce and consume messages using Python
Scenarios:
Ingest credit card transactions via Kafka
Send alerts to fraud detection module
Tasks:
Write a Python Kafka producer for simulated ATM logs (see sketch below)
Consume Kafka messages into Spark Streaming job
Challenges:
Managing offset lag and duplicates
Producer/Consumer crash recovery
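A minimal kafka-python sketch of the ATM-log producer task, assuming a local broker; the topic name and event fields are illustrative.

```python
# Broker address, topic name, and event fields are assumptions.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",        # wait for all in-sync replicas before acknowledging
    retries=3,
)

for _ in range(100):
    event = {
        "atm_id": f"ATM-{random.randint(1, 50):03d}",
        "card_id": f"CARD-{random.randint(1000, 9999)}",
        "amount": round(random.uniform(20, 500), 2),
        "ts": time.time(),
    }
    # Keying by ATM id keeps each machine's events ordered within one partition
    producer.send("atm_logs", key=event["atm_id"].encode(), value=event)

producer.flush()
```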
Module 5: Practical Data Warehousing (Snowflake, DBT)
Topic 5.1: Snowflake – Cloud Data Warehousing
Theory:
Introduction to Snowflake architecture (decoupled storage & compute)
Data loading using COPY INTO
Virtual warehouses and scaling
SQL-based analytics
Time Travel, Cloning, and Fail-Safe
Semi-structured data support (JSON, Avro, etc.)
Lab:
Load customer transaction files into Snowflake
Query historical balance snapshots using Time Travel
Use Snowflake’s variant column for semi-structured KYC data
Scenarios:
Store regulatory data for up to 7 years
Allow auditors to clone transaction data without duplicating storage
Tasks:
Set up a staging schema for raw data
Create optimized views for BI dashboards
Write SQL to track weekly balance changes (see sketch below)
Challenges:
Cost management with virtual warehouse auto-scaling
Optimizing semi-structured query performance
Detecting and fixing failed bulk loads
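A minimal sketch with snowflake-connector-python that runs a COPY INTO bulk load from a stage and the weekly balance-change query; account details, stage, table, and column names are placeholders.

```python
# Account details, stage, table, and column names are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH",
    database="BANK",
    schema="STAGING",
)
cur = conn.cursor()

# Bulk-load raw transaction files from an internal stage
cur.execute("""
    COPY INTO raw_transactions
    FROM @raw_stage/transactions/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Weekly balance change per account
cur.execute("""
    SELECT account_id,
           DATE_TRUNC('week', balance_date) AS week_start,
           MAX(balance) - MIN(balance)      AS weekly_change
    FROM raw_transactions
    GROUP BY account_id, week_start
""")
for row in cur.fetchmany(10):
    print(row)

cur.close()
conn.close()
```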
Topic 5.2: DBT (Data Build Tool) – ELT in the Warehouse
Theory:
DBT vs traditional ETL
SQL-based transformation logic
Materializations: table, view, incremental
Testing, documentation, and lineage tracking
Jinja templating in SQL
Lab:
Use DBT to transform raw transaction data into analytics-ready models
Implement incremental models for daily processing
Add documentation and column-level tests
Scenarios:
Build a customer360 model for analytics
Maintain history tables for fraud analysis
Tasks:
Create a DBT model to join multiple raw sources
Add custom tests to ensure no nulls in customer_id
Schedule DBT models via Airflow (see sketch below)
Challenges:
Managing dependencies between DBT models
Debugging Jinja errors in templated SQL
Handling schema changes in source tables
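A minimal Airflow 2.x sketch of the "Schedule DBT models via Airflow" task, using BashOperator to run the dbt CLI; the project path, model selector, and schedule are assumptions.

```python
# Project path, model selector, and schedule are assumptions (Airflow 2.x API).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_models",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",      # nightly, after raw loads finish
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/bank_analytics && dbt run --select customer360+",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/bank_analytics && dbt test",
    )
    dbt_run >> dbt_test    # run tests only after the models build successfully
```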