Data Science Course Curriculum

Module 7: Machine Learning (Statistics, Visualization, ML, MLflow)

Topic 7.1: Exploratory Data Analysis (EDA) & Statistics

Theory:

  • Types of variables (numerical, categorical)

  • Descriptive statistics: mean, median, std dev, IQR

  • Correlation and covariance

  • Outlier detection

  • Probability distributions and central limit theorem
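The central limit theorem item above can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library; the sample size, number of samples, and the choice of an exponential distribution are illustrative assumptions, not part of the course material.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50     # observations per sample (illustrative choice)
NUM_SAMPLES = 2000   # number of repeated samples (illustrative choice)

# Exponential(rate=1) is strongly right-skewed, with mean 1 and std dev 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)

# The CLT predicts the sample means cluster around the population mean
# with spread sigma / sqrt(n), even though the source data is skewed.
expected_std = 1.0 / SAMPLE_SIZE ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1.0)")
print(f"std of sample means:  {std_of_means:.3f} (CLT predicts {expected_std:.3f})")
```

Plotting a histogram of sample_means would show the roughly bell-shaped curve that the theorem describes.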

    Lab:

  • Use Pandas to explore and summarize customer transaction data

  • Identify outliers in high-value transaction logs

    Scenarios:

  • Detect suspicious activity based on statistical deviation

  • Compare user spending patterns by region

    Tasks:

  • Compute z-scores for transaction amounts

  • Visualize missing values and fix them

    Challenges:

  • Interpreting misleading correlations

  • Handling skewed data distributions
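The z-score task and the outlier-detection lab in this topic could be approached along the following lines. This is a sketch under stated assumptions: the transaction amounts are made up for illustration, the column names are hypothetical, and the cutoff of 3 standard deviations and the 1.5 × IQR fences are common conventions rather than values prescribed by the course.

```python
import pandas as pd

# Hypothetical transaction amounts; the last value is deliberately extreme.
transactions = pd.DataFrame(
    {
        "amount": [
            120.0, 95.5, 130.0, 110.0, 105.0, 99.0, 125.0, 115.0,
            102.0, 98.0, 140.0, 88.0, 117.5, 108.0, 122.0, 5000.0,
        ]
    }
)
amount = transactions["amount"]

# z-score: distance from the mean in units of (population) standard deviation.
Z_THRESHOLD = 3.0  # conventional cutoff, an assumption here
transactions["z_score"] = (amount - amount.mean()) / amount.std(ddof=0)
transactions["z_outlier"] = transactions["z_score"].abs() > Z_THRESHOLD

# IQR rule: more robust on skewed data because quartiles are not
# pulled toward the extreme value the way the mean and std dev are.
q1, q3 = amount.quantile(0.25), amount.quantile(0.75)
iqr = q3 - q1
transactions["iqr_outlier"] = (amount < q1 - 1.5 * iqr) | (amount > q3 + 1.5 * iqr)

print(transactions[transactions["z_outlier"] | transactions["iqr_outlier"]])
```

One caveat worth discussing in the lab: on very small samples a single extreme value inflates the standard deviation so much that its own z-score can never exceed the cutoff, which is one reason to compare the z-score result with the IQR rule.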

Topic 7.2: Data Visualization

Theory:

  • Plot types: bar, histogram, scatter, box, pie

  • Visualizing distributions and relationships

  • Dashboards using Plotly & Dash

  • Effective storytelling with data

    Lab:

  • Create matplotlib plots for ATM usage across states

  • Build an interactive dashboard showing fraud trends

    Scenarios:

  • Visualize time-series of withdrawals during weekends

  • Highlight branch performance by customer type

    Tasks:

  • Build scatter plot comparing income and loan default

  • Create Dash app to filter customer data by risk score

    Challenges:

  • Cluttered plots with large data

  • Dynamic filtering and responsiveness in dashboards
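For the matplotlib lab on ATM usage across states, one possible starting point is to aggregate before plotting, which also helps with the cluttered-plot challenge on large data. The data frame, column names, and output file name below are hypothetical, and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ATM withdrawal log (one row per withdrawal).
atm_log = pd.DataFrame(
    {
        "state": ["TX", "CA", "NY", "TX", "CA", "TX", "NY", "CA", "TX"],
        "withdrawal": [200, 150, 300, 100, 250, 400, 50, 100, 300],
    }
)

# Aggregate first: plotting one bar per state instead of one mark per row
# keeps the chart readable however large the raw log is.
usage_by_state = (
    atm_log.groupby("state")["withdrawal"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(usage_by_state.index, usage_by_state.values)
ax.set_xlabel("State")
ax.set_ylabel("Total withdrawals")
ax.set_title("ATM usage by state (illustrative data)")
fig.tight_layout()
fig.savefig("atm_usage_by_state.png", dpi=150)
```

The same aggregated series could feed a Plotly figure inside a Dash app; that step is left out here to keep the sketch short.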

Topic 7.3: Machine Learning (Supervised + Unsupervised)

Theory:

  • Workflow: data → features → model → evaluation

  • Classification (Logistic Regression, Decision Trees)

  • Regression (Linear, Ridge, Lasso)

  • Clustering (KMeans, DBSCAN)

  • Model validation: train/test split, cross-validation

  • Evaluation metrics: accuracy, precision, recall, F1
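The evaluation metrics listed above can be computed by hand from the four confusion-matrix counts. The sketch below uses only the standard library; the labels are made up for illustration (1 = fraud, 0 = legitimate) and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative imbalanced case: 4 fraud cases among 20 transactions.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 0, 0, 0] + [0] * 15 + [1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up case accuracy is 0.8 while recall is only 0.25, which is why accuracy alone is a weak metric for fraud detection.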

    Lab:

  • Train a model to detect fraudulent transactions

  • Predict loan default using classification

  • Segment customers using KMeans
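For the customer segmentation lab, a minimal KMeans sketch with scikit-learn might look like this. The two synthetic customer groups, the feature meanings, and the choice of two clusters are assumptions made for illustration; real data would need the number of clusters to be chosen, for example with an elbow plot or silhouette scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features per customer: [annual_spend, transactions_per_month].
low_spenders = rng.normal(loc=[500, 2], scale=[50, 0.5], size=(50, 2))
high_spenders = rng.normal(loc=[5000, 20], scale=[300, 2], size=(50, 2))
X = np.vstack([low_spenders, high_spenders])

# Scale first: KMeans uses Euclidean distance, so an unscaled spend column
# would dominate the transaction-count column.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

for cluster_id in range(2):
    members = X[labels == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} customers, "
          f"mean spend {members[:, 0].mean():.0f}")
```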

    Scenarios:

  • Predict high-risk loan applicants

  • Identify customer clusters for personalized offers

    Tasks:

  • Create a pipeline: preprocessing → model → predict

  • Tune hyperparameters using GridSearchCV

    Challenges:

  • Imbalanced datasets causing misleading accuracy, false positives, and missed fraud cases

  • Feature leakage during training

  • Overfitting and underfitting management
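The pipeline and GridSearchCV tasks, together with the imbalance and leakage challenges, can be sketched in one short scikit-learn example. The synthetic data set (about 5% positives), the parameter grid, and the choice of logistic regression with balanced class weights are illustrative assumptions, not the course's prescribed solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for loan-default data: roughly 5% positive class.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Stratify so train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Keeping the scaler inside the pipeline means it is re-fit on each training
# fold only, so no statistics leak from validation folds into training.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Step-name prefix "clf__" routes the parameter to the classifier step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# F1 rather than accuracy, because the classes are imbalanced.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print("best params:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Comparing the cross-validated score (search.best_score_) with the test-set report is a simple first check for overfitting.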

Topic 7.4: MLflow for Model Lifecycle Management

Theory:

  • MLflow Tracking: log parameters, metrics, models

  • MLflow Projects: reproducible runs

  • MLflow Models: deployment formats

  • MLflow Model Registry: versioning and staging

    Lab:

  • Log model training runs in MLflow

  • Register fraud detection model

  • Deploy latest version for real-time scoring

    Scenarios:

  • Track accuracy across different fraud detection models

  • Roll back to the previous model version after a failure

    Tasks:

  • Log model predictions and accuracy to MLflow

  • Serve the model using the mlflow models serve command

    Challenges:

  • Conflicts in model registry versions

  • Tracking errors in distributed jobs

  • Consistency in experiments and tags