A production‑style project that turns raw NYC taxi trips into predictions and insights. This page explains the architecture, tools, and choices behind the pipeline before you jump into the code.
An end‑to‑end MLOps workflow for NYC taxi trip data covering ingestion, validation, feature engineering, model training, experiment tracking, CI/CD, deployment, and monitoring.
Data Ingestion & Quality: Batch ingestion with Spark; data validation with Great Expectations; schema & drift checks (see the ingestion and validation sketches after this list).
Feature Engineering: Reproducible transformations; partitioning; feature parity across train/serve.
Model Training & Tracking: Scikit‑learn/XGBoost with MLflow tracking, metrics, and artifacts.
Packaging & CI/CD: Docker images; GitHub Actions for tests, linting, and pipeline runs.
Deployment & Serving: FastAPI service for prediction; environment‑based configs; IaC‑ready layout.
Monitoring: Data freshness & quality checks; model performance tracking; alert hooks.
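To make the ingestion and feature-engineering items concrete, here is a minimal PySpark sketch. The input path and the TLC yellow-taxi column names (tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance) are assumptions; the repo's actual job layout may differ.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nyc-taxi-ingest").getOrCreate()

# Batch-read one month of raw trips (path is illustrative).
raw = spark.read.parquet("data/raw/yellow_tripdata_2024-01.parquet")

features = (
    raw
    # Trip duration in minutes, derived once here so train and serve agree.
    .withColumn(
        "trip_duration_min",
        (F.col("tpep_dropoff_datetime").cast("long")
         - F.col("tpep_pickup_datetime").cast("long")) / 60.0,
    )
    # Simple calendar features.
    .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
    .withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
    # Drop rows that are obviously unusable (zero distance, non-positive duration).
    .filter((F.col("trip_distance") > 0) & (F.col("trip_duration_min") > 0))
)

# Partitioned write keeps downstream reads cheap and backfills idempotent.
features.write.mode("overwrite").partitionBy("pickup_date").parquet("data/features/")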
High‑level flow: Ingestion → Validation → Feature Engineering → Training/Tracking → Packaging/CI → Serving → Monitoring.
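Validation sits between ingestion and feature engineering in that flow. Below is a minimal sketch using Great Expectations' classic pandas-style API; the column names and bounds are assumptions, and the repo's suite (especially on a newer GE release) will organize this around expectation suites and checkpoints instead.

import pandas as pd
import great_expectations as ge

# Wrap the raw trips so expectations can be declared directly on the DataFrame.
trips = ge.from_pandas(pd.read_parquet("data/raw/yellow_tripdata_2024-01.parquet"))

trips.expect_column_values_to_not_be_null("tpep_pickup_datetime")
trips.expect_column_values_to_be_between("trip_distance", min_value=0, max_value=100)
trips.expect_column_values_to_be_between("fare_amount", min_value=0, max_value=500)

results = trips.validate()
if not results.success:
    raise ValueError("Data validation failed; inspect the expectation results")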
Tools chosen for reliability, reproducibility, and smooth hand‑off from experimentation to production.
MLflow: tracked runs, parameters, metrics, and artifacts for full reproducibility (a training sketch follows below).
Great Expectations: validation suite catching schema issues and outliers early.
GitHub Actions: tests, linting, image build, and deploy steps.
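As a sketch of what a tracked training run looks like, the snippet below fits a scikit-learn regressor and logs parameters, metrics, and the model to MLflow. The target (trip duration), feature columns, and hyperparameters are assumptions, not the repo's exact configuration.

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_parquet("data/features/")
X = df[["trip_distance", "passenger_count", "pickup_hour"]]
y = df["trip_duration_min"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("nyc-taxi-trip-duration")

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.1}
    model = GradientBoostingRegressor(**params).fit(X_train, y_train)

    mae = mean_absolute_error(y_val, model.predict(X_val))

    # Parameters, metrics, and the fitted model are all attached to the run.
    mlflow.log_params(params)
    mlflow.log_metric("val_mae", mae)
    mlflow.sklearn.log_model(model, artifact_path="model")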
Prereqs: Docker, Make, Python 3.10+
git clone https://github.com/airdmhund1/nyc-taxi-mlops
cd nyc-taxi-mlops
make setup # install deps / pre-commit
make train # run training with MLflow tracking
make serve # start FastAPI for predictions
make test # unit tests & data checks
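make test combines unit tests with lightweight data checks. A hypothetical pytest-style check over the feature table might look like the following (the file path and column names are assumptions):

# tests/test_features.py (hypothetical path)
import pandas as pd

def _features():
    return pd.read_parquet("data/features/")

def test_required_columns_present():
    required = {"trip_distance", "passenger_count", "pickup_hour", "trip_duration_min"}
    assert required.issubset(_features().columns)

def test_no_nonpositive_durations():
    assert (_features()["trip_duration_min"] > 0).all()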
Environment variables and config files are documented in the repo’s README.
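Once the service is up (make serve), predictions are served over HTTP by a FastAPI app. A minimal sketch of such an endpoint follows; the route, request fields, and the way the model is loaded are assumptions rather than the repo's exact implementation.

import mlflow.sklearn
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="nyc-taxi-mlops")

# Load the trained model once at startup; a registry URI such as
# "models:/nyc-taxi/Production" could be used instead of a local path.
model = mlflow.sklearn.load_model("models/latest")

class TripRequest(BaseModel):
    trip_distance: float
    passenger_count: int
    pickup_hour: int

@app.post("/predict")
def predict(trip: TripRequest) -> dict:
    # model_dump() is pydantic v2; use trip.dict() on v1.
    features = pd.DataFrame([trip.model_dump()])
    return {"predicted_trip_duration_min": float(model.predict(features)[0])}

With the app running, FastAPI also exposes interactive documentation at /docs, which is a convenient way to send a test request.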