Data Science Course Curriculum
Module 7: Machine Learning (Statistics, Visualization, ML, MLflow)
Topic 7.1: Exploratory Data Analysis (EDA) & Statistics
Theory:
Types of variables (numerical, categorical)
Descriptive statistics: mean, median, std dev, IQR
Correlation and covariance
Outlier detection
Probability distributions and central limit theorem
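The central limit theorem listed above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the exponential distribution, the sample size of 50, and the seed are illustrative choices, not part of the curriculum.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

sample_size = 50     # observations per sample (illustrative assumption)
n_samples = 10_000   # number of repeated samples

# Exponential distribution with scale 100: mean 100, std dev 100, strongly right-skewed.
draws = rng.exponential(scale=100.0, size=(n_samples, sample_size))

# One mean per sample; the CLT says these means are approximately normal.
sample_means = draws.mean(axis=1)

# Theory: std dev of the sample mean = population std dev / sqrt(sample size).
expected_se = 100.0 / np.sqrt(sample_size)

print("mean of sample means:", sample_means.mean())
print("std dev of sample means:", sample_means.std(ddof=1))
print("sigma / sqrt(n):", expected_se)
```

Although individual draws are skewed, the sample means cluster around the population mean with a spread close to sigma / sqrt(n).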
Lab:
Use Pandas to explore and summarize customer transaction data
Identify outliers in high-value transaction logs
Scenarios:
Detect suspicious activity based on statistical deviation
Compare user spending patterns by region
Tasks:
Compute z-scores for transaction amounts
Visualize missing values and fix them
Challenges:
Interpreting misleading correlations
Handling skewed data distributions
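The z-score task and the outlier lab above can be sketched with Pandas as follows. The transaction amounts are made-up values, and the |z| > 2 cutoff is an assumption chosen for this tiny sample: with only 8 observations, the largest possible z-score under the sample standard deviation is about 2.47, so the common cutoff of 3 could never fire.

```python
import pandas as pd

# Hypothetical transaction amounts; the last value is deliberately extreme.
transactions = pd.DataFrame(
    {"amount": [120.0, 95.5, 130.0, 110.0, 105.0, 99.0, 125.0, 5000.0]}
)
amounts = transactions["amount"]

# z-score: distance from the mean in units of the sample standard deviation.
transactions["z_score"] = (amounts - amounts.mean()) / amounts.std()

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
transactions["iqr_outlier"] = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# Assumed cutoff of 2 (see the note on small samples above).
z_outliers = transactions[transactions["z_score"].abs() > 2]
print(z_outliers)
```

The IQR rule relies on quartiles rather than the mean and standard deviation, so it is less affected by the extreme value it is trying to detect, which makes it a reasonable companion check on skewed data.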
Topic 7.2: Data Visualization
Theory:
Plot types: bar, histogram, scatter, box, pie
Visualizing distributions and relationships
Dashboards using Plotly & Dash
Effective storytelling with data
Lab:
Create Matplotlib plots for ATM usage across states
Build an interactive dashboard showing fraud trends
Scenarios:
Visualize time-series of withdrawals during weekends
Highlight branch performance by customer type
Tasks:
Build a scatter plot comparing income and loan default
Create a Dash app to filter customer data by risk score
Challenges:
Cluttered plots with large data
Dynamic filtering and responsiveness in dashboards
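The income versus loan default task above might look like the following Matplotlib sketch. The data is synthetic, the assumed link between income and default probability is invented for illustration, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: default probability falls as income rises (assumption).
income = rng.normal(60_000, 15_000, size=200)
default_prob = 1 / (1 + np.exp((income - 45_000) / 8_000))
defaulted = rng.random(200) < default_prob

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: jitter the binary outcome and use transparency to reduce overplotting.
jitter = rng.uniform(-0.1, 0.1, size=200)
ax_scatter.scatter(income, defaulted.astype(int) + jitter, alpha=0.4, s=12)
ax_scatter.set_yticks([0, 1])
ax_scatter.set_yticklabels(["repaid", "defaulted"])
ax_scatter.set_xlabel("Annual income")
ax_scatter.set_title("Income vs. loan outcome")

# Box plot: compare the income distribution of the two groups.
ax_box.boxplot([income[~defaulted], income[defaulted]])
ax_box.set_xticklabels(["repaid", "defaulted"])
ax_box.set_ylabel("Annual income")
ax_box.set_title("Income by loan outcome")

fig.tight_layout()
output_path = os.path.join(tempfile.gettempdir(), "income_vs_default.png")
fig.savefig(output_path)
```

Because loan default is a binary outcome, a plain scatter plot stacks points on two lines; the jitter, transparency, and companion box plot are one way to address the clutter challenge listed above.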
Topic 7.3: Machine Learning (Supervised + Unsupervised)
Theory:
Workflow: data → features → model → evaluation
Classification (Logistic Regression, Decision Trees)
Regression (Linear, Ridge, Lasso)
Clustering (KMeans, DBSCAN)
Model validation: train/test split, cross-validation
Evaluation metrics: accuracy, precision, recall, F1
Lab:
Train a model to detect fraudulent transactions
Predict loan default using classification
Segment customers using KMeans
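The customer segmentation lab above can be sketched with scikit-learn. The two features (monthly spend and transactions per month) and the three synthetic customer groups are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=7)

# Hypothetical customers: columns are [monthly_spend, transactions_per_month].
low = rng.normal([200, 5], [30, 1], size=(100, 2))
mid = rng.normal([1000, 20], [100, 3], size=(100, 2))
high = rng.normal([5000, 60], [400, 5], size=(100, 2))
customers = np.vstack([low, mid, high])

# Scale first: KMeans uses Euclidean distance, so unscaled spend would dominate.
scaled = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
segments = kmeans.fit_predict(scaled)

# Summarize each segment in the original (unscaled) units.
for segment in range(3):
    members = customers[segments == segment]
    print(segment, len(members), members.mean(axis=0).round(1))
```

On real data the number of clusters is not known in advance; the value 3 here simply matches how the synthetic data was generated.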
Scenarios:
Predict high-risk loan applicants
Identify customer clusters for personalized offers
Tasks:
Create a pipeline: preprocessing → model → predict
Tune hyperparameters using GridSearchCV
Challenges:
Imbalanced datasets causing missed fraud cases (false negatives) and misleading accuracy
Feature leakage during training
Overfitting and underfitting management
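The pipeline and GridSearchCV tasks above can be combined in one short sketch. The dataset is synthetic with an assumed 5% positive class standing in for fraud, and the parameter grid is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data: roughly 5% of rows belong to the positive class.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify so the train and test sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Preprocessing lives inside the pipeline, so the scaler is refit on each
# cross-validation training fold and never sees the validation fold.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Score on F1 rather than accuracy, which is misleading on imbalanced data.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

predictions = search.predict(X_test)
print("best params:", search.best_params_)
print(classification_report(y_test, predictions, digits=3))
```

Keeping the scaler inside the pipeline is one guard against the feature leakage challenge listed above, and comparing the cross-validated score with the test-set report gives a first indication of overfitting.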
Topic 7.4: MLflow for Model Lifecycle Management
Theory:
MLflow Tracking: log parameters, metrics, models
MLflow Projects: reproducible runs
MLflow Models: deployment formats
MLflow Model Registry: versioning and staging
Lab:
Log model training runs in MLflow
Register fraud detection model
Deploy latest version for real-time scoring
Scenarios:
Track accuracy across different fraud detection models
Roll back to the previous model version after a failure
Tasks:
Log model predictions and accuracy to MLflow
Serve the model using the mlflow models serve command
Challenges:
Conflicts in model registry versions
Tracking errors in distributed jobs
Consistency in experiments and tags