Data Engineer Course Curriculum

Topic 3.1: NoSQL with MongoDB

Theory:

  • Document-oriented structure

  • Collections, documents, and BSON

  • CRUD operations in MongoDB

  • Aggregation pipeline and indexing

Lab:

  • Insert and update customer profiles in MongoDB

  • Query transaction logs stored as JSON

Scenarios:

  • Store dynamic KYC details

  • Query transaction history based on user/device ID

Tasks:

  • Design a collection schema for credit card transactions

  • Run an aggregation pipeline to detect spending spikes
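
For the aggregation task above, a minimal PyMongo sketch (the database/collection names, fields such as card_id, amount, and ts, and the 3x threshold are illustrative assumptions; $dateTrunc needs MongoDB 5.0+):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")          # assumed local instance
    txns = client["bank"]["transactions"]                      # hypothetical db/collection

    # Total spend per card per day, then flag cards whose peak day is
    # more than 3x their own average daily spend.
    pipeline = [
        {"$group": {
            "_id": {"card": "$card_id",
                    "day": {"$dateTrunc": {"date": "$ts", "unit": "day"}}},
            "daily_spend": {"$sum": "$amount"},
        }},
        {"$group": {
            "_id": "$_id.card",
            "avg_daily": {"$avg": "$daily_spend"},
            "max_daily": {"$max": "$daily_spend"},
        }},
        {"$match": {"$expr": {"$gt": ["$max_daily",
                                      {"$multiply": ["$avg_daily", 3]}]}}},
    ]

    for doc in txns.aggregate(pipeline):
        print(doc)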

Challenges:

  • Schema design issues in unstructured data

  • Aggregation performance on large nested data

Topic 3.2: Search with ElasticSearch (Optional)

Theory:

  • Indexes, documents, mappings

  • Full-text search and analyzers

  • Ingest pipelines and text preprocessing

Lab:

  • Index customer feedback data

  • Search transaction logs for keywords like “failed”, “suspicious”
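
A minimal sketch of that indexing and keyword search with the official Python client (8.x syntax; the txn-logs index name and message field are illustrative assumptions, and match behaviour depends on the analyzer configured at index time):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")                 # assumed local cluster

    # Index a sample log document (index name and fields are illustrative)
    es.index(index="txn-logs", document={
        "txn_id": "T1001",
        "status": "failed",
        "message": "Card declined, suspicious retry pattern",
    })

    # Full-text search for keywords such as "failed" or "suspicious"
    resp = es.search(index="txn-logs", query={
        "match": {"message": "failed suspicious"}
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])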

Scenarios:

  • Fast search of transaction logs during investigation

  • Extract complaints containing negative sentiment

Tasks:

  • Index JSON logs into ElasticSearch

  • Build a search query for all failed transactions

Challenges:

  • Text preprocessing errors causing false matches

  • High index size slowing down search

Topic 3.3: Graph Databases with Neo4j (Optional)

Theory:

  • Nodes, relationships, and properties

  • Cypher Query Language basics

  • Pattern matching in graphs

Lab:

  • Create a graph of customer-referral relationships

  • Query fraudulent clusters in payment networks

Scenarios:

  • Detect fraud rings through common devices or accounts

  • Track transaction paths in real time

Tasks:

  • Load customer-merchant transactions into a graph

  • Write a Cypher query to find highly connected nodes
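
For the highly-connected-nodes task above, a minimal sketch with the official Neo4j Python driver (node labels, the PAID relationship type, credentials, and the threshold of 50 merchants are illustrative assumptions):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))   # placeholder credentials

    # Customers connected to an unusually high number of distinct merchants
    query = """
    MATCH (c:Customer)-[:PAID]->(m:Merchant)
    WITH c, count(DISTINCT m) AS merchant_count
    WHERE merchant_count > 50
    RETURN c.customer_id AS customer, merchant_count
    ORDER BY merchant_count DESC
    LIMIT 20
    """

    with driver.session() as session:
        for record in session.run(query):
            print(record["customer"], record["merchant_count"])

    driver.close()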

Challenges:

  • Memory bottlenecks in graph traversal

  • Relationship duplication leading to noise

Topic 3.4: Column-Oriented Storage with HBase (Optional)

Theory:

  • HBase architecture (HDFS, HRegionServer)

  • Tables, column families, rows, cells

  • Random access and batch jobs

Lab:

  • Store historical credit data in HBase

  • Scan for specific user ID or date range
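
A minimal sketch of that scan with the HappyBase client (the table name, the cf column family, and a user_id#date row-key layout are illustrative assumptions):

    import happybase

    connection = happybase.Connection("localhost")              # assumed Thrift server on default port
    table = connection.table("credit_history")                  # hypothetical table

    # Row keys assumed to look like "<user_id>#<yyyymmdd>", so a range scan
    # returns one user's history for a given date range.
    for key, data in table.scan(row_start=b"U1001#20150101",
                                row_stop=b"U1001#20241231"):
        print(key, data.get(b"cf:credit_score"))

    connection.close()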

Scenarios:

  • Store 10 years of customer activity for compliance

  • Retrieve long-term account data on demand

Tasks:

  • Connect to HBase using Happybase client

  • Query a range of customer credit scores

Challenges:

  • Key design leading to hot regions

  • Complex query logic compared to SQL/NoSQL


Module 4: Batch & Streaming (PySpark, Kafka)

Topic 4.1: Distributed Data Processing with PySpark

Theory:

  • Spark architecture (Driver, Executor, Cluster Manager)

  • RDDs vs DataFrames

  • Transformations (map, filter, join) & Actions (collect, count)

  • Spark SQL

  • Partitioning and performance tuning

Lab:

  • Load and process transaction data with PySpark

  • Run joins between customer and account datasets

  • Aggregate daily and weekly balances
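
A minimal PySpark sketch for the join-and-aggregate labs above (file paths and column names such as customer_id, region, amount, and txn_ts are illustrative assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("txn-batch").getOrCreate()

    txns = spark.read.parquet("/data/transactions")             # assumed input locations
    customers = spark.read.parquet("/data/customers")

    # Join transactions to customer attributes, then aggregate daily totals per region
    enriched = txns.join(customers, on="customer_id", how="left")

    daily = (enriched
             .groupBy("region", F.to_date("txn_ts").alias("txn_date"))
             .agg(F.sum("amount").alias("daily_total"),
                  F.count("*").alias("txn_count")))

    daily.orderBy("region", "txn_date").show(20, truncate=False)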

Scenarios:

  • Analyze 10 million+ transactions in parallel

  • Detect anomalies in large datasets

Tasks:

  • Write PySpark code to summarize transactions by region

  • Perform ETL using DataFrame transformations

Challenges:

  • Data skew causing stage failures

  • Memory overflows on wide transformations

Topic 4.2: Real-Time Processing with Spark Streaming

Theory:

  • Structured Streaming architecture

  • Micro-batch model

  • Sources: Kafka, Files, Sockets

  • Watermarking and late data handling

Lab:

  • Simulate a streaming pipeline from transaction logs

  • Join real-time transactions with static customer data
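
A minimal Structured Streaming sketch for that lab, reading from a Kafka topic and joining against a static customer table (the topic name, schema, paths, checkpoint location, and the simple amount threshold are illustrative assumptions):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("txn-stream").getOrCreate()

    schema = (StructType()
              .add("customer_id", StringType())
              .add("amount", DoubleType())
              .add("txn_ts", TimestampType()))

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "transactions")                 # assumed topic
           .load())

    txns = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
               .select("t.*"))

    customers = spark.read.parquet("/data/customers")           # static lookup table

    flagged = (txns
               .withWatermark("txn_ts", "10 minutes")           # tolerate 10 min of late data
               .join(customers, "customer_id", "left")
               .where(F.col("amount") > 10000))                 # naive placeholder rule

    query = (flagged.writeStream
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/txn-stream")
             .outputMode("append")
             .start())
    query.awaitTermination()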

Scenarios:

  • Detect fraudulent transactions as they happen

  • Stream updates from ATM usage

Tasks:

  • Set up checkpointing and fault tolerance

  • Build a pipeline that flags suspicious activity

Challenges:

  • Handling late-arriving data correctly

  • Ensuring exactly-once processing guarantees

Topic 4.3: Kafka for Data Ingestion

Theory:

  • Kafka architecture: Producer, Broker, Topic, Consumer

  • Partitions and offsets

  • Kafka Streams vs Kafka Connect

  • Reliability (acks, retries)

Lab:

  • Create Kafka topics for real-time data (e.g., transactions, alerts)

  • Produce and consume messages using Python

Scenarios:

  • Ingest credit card transactions via Kafka

  • Send alerts to fraud detection module

Tasks:

  • Write a Python Kafka producer for simulated ATM logs (see the sketch after this list)

  • Consume Kafka messages into Spark Streaming job
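
A minimal kafka-python producer sketch for the simulated ATM logs task above (the atm-logs topic name and event fields are illustrative assumptions):

    import json
    import random
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",                                             # wait for in-sync replicas
        retries=3,
    )

    for _ in range(100):
        event = {
            "atm_id": f"ATM-{random.randint(1, 50)}",
            "card_id": f"CARD-{random.randint(1000, 9999)}",
            "amount": round(random.uniform(20, 500), 2),
            "ts": time.time(),
        }
        producer.send("atm-logs", value=event)

    producer.flush()
    producer.close()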

Challenges:

  • Managing offset lag and duplicates

  • Producer/Consumer crash recovery


Module 5: Practical Data Warehousing (Snowflake, DBT)

Topic 5.1: Snowflake Cloud Data Warehousing

Theory:

  • Introduction to Snowflake architecture (decoupled storage & compute)

  • Data loading using COPY INTO

  • Virtual warehouses and scaling

  • SQL-based analytics

  • Time Travel, Cloning, and Fail-Safe

  • Semi-structured data support (JSON, Avro, etc.)

Lab:

  • Load customer transaction files into Snowflake

  • Query historical balance snapshots using Time Travel

  • Use Snowflake’s VARIANT column type for semi-structured KYC data

Scenarios:

  • Store regulatory data for up to 7 years

  • Allow auditors to clone transaction data without duplicating storage

Tasks:

  • Set up a staging schema for raw data

  • Create optimized views for BI dashboards

  • Write SQL to track weekly balance changes
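
For the weekly-balance task above, a minimal sketch using the Snowflake Python connector (connection parameters, the balances table, and its columns are illustrative assumptions; MAX_BY requires a reasonably recent Snowflake release):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="change_me",   # placeholder credentials
        warehouse="ANALYTICS_WH", database="BANK", schema="STAGING",
    )

    weekly_change_sql = """
    WITH weekly AS (
        SELECT customer_id,
               DATE_TRUNC('week', balance_date)  AS week_start,
               MAX_BY(balance, balance_date)     AS week_end_balance
        FROM balances
        GROUP BY customer_id, week_start
    )
    SELECT customer_id,
           week_start,
           week_end_balance,
           week_end_balance
             - LAG(week_end_balance) OVER (PARTITION BY customer_id
                                           ORDER BY week_start) AS change_vs_prev_week
    FROM weekly
    ORDER BY customer_id, week_start
    """

    for row in conn.cursor().execute(weekly_change_sql):
        print(row)

    conn.close()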

Challenges:

  • Cost management with virtual warehouse auto-scaling

  • Optimizing semi-structured query performance

  • Detecting and fixing failed bulk loads

Topic 5.2: ELT in the Warehouse with DBT (Data Build Tool)

Theory:

  • DBT vs traditional ETL

  • SQL-based transformation logic

  • Materializations: table, view, incremental

  • Testing, documentation, and lineage tracking

  • Jinja templating in SQL

Lab:

  • Use DBT to transform raw transaction data into analytics-ready models

  • Implement incremental models for daily processing

  • Add documentation and column-level tests

Scenarios:

  • Build a customer360 model for analytics

  • Maintain history tables for fraud analysis

Tasks:

  • Create a DBT model to join multiple raw sources

  • Add custom tests to ensure no nulls in customer_id

  • Schedule DBT models via Airflow (see the DAG sketch at the end of this topic)

Challenges:

  • Managing dependencies between DBT models

  • Debugging Jinja errors in templated SQL

  • Handling schema changes in source tables
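
For the Airflow scheduling task above, a minimal DAG sketch (Airflow 2.x; the dbt project path, model selector, and nightly schedule are illustrative assumptions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dbt_daily_models",
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 2 * * *",                          # nightly at 02:00
        catchup=False,
    ) as dag:

        # Run the customer360 model and everything downstream of it, then test it
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt/analytics && dbt run --select customer360+",
        )

        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt/analytics && dbt test --select customer360+",
        )

        dbt_run >> dbt_test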