Developed a regression model to predict shipping container ETA delays for CargoProbe, improving on their existing heuristics with a Random Forest.
Python | Scikit-learn | Random Forest | Data Analysis | Data Processing
This project for CargoProbe tackled supply chain disruptions by predicting shipping container delays. The provided data was messy: over 330k entries across 23 columns, with significant missing values (e.g., 98% of the port data was missing). After extensive data cleaning, feature engineering (creating 'time_left', 'time_spent', and 'delay'), and imputation analysis, we compared several regression models. The final Random Forest model significantly outperformed CargoProbe's existing heuristics, reducing the Mean Absolute Error from 3.58 days to 1.20 days and the Mean Squared Error from 54.75 to 7.92 (in squared days).
Analyzed and cleaned a large dataset of 331,956 entries, handling extensive missing values (NaNs), which made up almost half of the data.
Standardized timestamp data to UTC and converted string values to lowercase to resolve duplicates and misspellings.
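A minimal sketch of this normalization step with pandas. The sample rows and the offset-aware timestamp format are assumptions for illustration; only the column names 'origin_city' and 'update_datetime' come from the project description.

```python
import pandas as pd

# Hypothetical sample rows; values and timestamp format are assumptions.
raw = pd.DataFrame({
    "origin_city": ["Rotterdam", "ROTTERDAM", "Shanghai"],
    "update_datetime": [
        "2021-03-01T10:00:00+01:00",
        "2021-03-01T09:00:00+00:00",
        "2021-03-01T17:00:00+08:00",
    ],
})

# Parse the offset-aware strings and express every timestamp in UTC.
raw["update_datetime"] = pd.to_datetime(raw["update_datetime"], utc=True)

# Lowercase string values so case variants collapse into a single label.
raw["origin_city"] = raw["origin_city"].str.lower()
```

In this toy data the three timestamps describe the same instant, which only becomes visible once they share a timezone.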
Consolidated the data by merging rows to fill missing values and removing duplicate updates that would otherwise bias the model, leaving 18,772 clean rows for training.
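One way such a consolidation could look in pandas is sketched below. The 'trip_id' key, the grouping columns, and the duplicate criterion are all assumptions; the project description does not state how rows were matched.

```python
import pandas as pd

# Hypothetical updates: two partial rows describe the same trip and timestamp.
updates = pd.DataFrame({
    "trip_id": ["t1", "t1", "t1", "t2"],  # assumed key, not named in the source
    "update_datetime": pd.to_datetime(
        ["2021-03-01 09:00", "2021-03-01 09:00",
         "2021-03-02 09:00", "2021-03-01 12:00"],
        utc=True,
    ),
    "vesselname": ["vessel a", None, "vessel a", "vessel b"],
    "activity": [None, "departed", "departed", "arrived"],
})

# Merge rows that share a trip and timestamp: first() keeps the first
# non-null value in each column, so partial rows fill each other's gaps.
merged = updates.groupby(["trip_id", "update_datetime"], as_index=False).first()

# Drop later updates that repeat the same information for a trip.
deduped = merged.drop_duplicates(
    subset=["trip_id", "vesselname", "activity"], keep="first"
)
```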
Performed a chronological train-test split (80-20) instead of random shuffling to prevent data leakage from the same trip appearing in both sets.
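A chronological 80-20 split can be sketched as follows. Sorting on 'update_datetime' is an assumption; the description does not say which timestamp defined the order.

```python
import pandas as pd

# Hypothetical data, shuffled on purpose to show that sorting matters.
trips = pd.DataFrame({
    "update_datetime": pd.date_range("2021-01-01", periods=10, freq="D", tz="UTC"),
    "delay": range(10),
}).sample(frac=1.0, random_state=0)

# Order by time, then cut at the 80% mark: the model trains on the past
# and is evaluated only on later observations.
trips = trips.sort_values("update_datetime").reset_index(drop=True)
cutoff = int(len(trips) * 0.8)
train, test = trips.iloc[:cutoff], trips.iloc[cutoff:]
```

If updates from a single trip can straddle the cutoff, splitting on a per-trip timestamp instead would keep each trip entirely on one side.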
Engineered critical time-based features for regression, including ‘time_left’ (ETA - update_datetime) and ‘time_spent’ (update_datetime - date_added).
Created the target variable ‘delay’ (ATA - ETA, i.e. actual minus estimated time of arrival) to quantify the prediction error of CargoProbe’s existing system.
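The two features and the target reduce to simple timestamp differences. In this sketch the lowercase 'eta' and 'ata' column names and the choice of days as the unit are assumptions (days match the unit of the reported MAE).

```python
import pandas as pd

# One hypothetical update row; all timestamps are already in UTC.
df = pd.DataFrame({
    "date_added": pd.to_datetime(["2021-03-01 00:00"], utc=True),
    "update_datetime": pd.to_datetime(["2021-03-03 00:00"], utc=True),
    "eta": pd.to_datetime(["2021-03-10 00:00"], utc=True),
    "ata": pd.to_datetime(["2021-03-11 12:00"], utc=True),
})

DAY = pd.Timedelta(days=1)  # dividing a timedelta by this yields float days

df["time_left"] = (df["eta"] - df["update_datetime"]) / DAY          # 7.0
df["time_spent"] = (df["update_datetime"] - df["date_added"]) / DAY  # 2.0
df["delay"] = (df["ata"] - df["eta"]) / DAY                          # 1.5, positive means late
```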
Analyzed the cleaned data, finding that 84.53% of trips arrived late and 28.26% were delayed by 72 hours or more.
Applied One-Hot Encoding to categorical features like ‘origin_city’, ‘vesselname’, and ‘activity’ to prepare data for modeling.
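A minimal sketch of the encoding step with scikit-learn's OneHotEncoder. The sample values are hypothetical, and handle_unknown="ignore" is an assumption about how categories unseen during training (for example a new origin city in the test period) might be handled.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns from the training and test periods.
train = pd.DataFrame({
    "origin_city": ["rotterdam", "shanghai", "rotterdam"],
    "activity": ["departed", "arrived", "departed"],
})
test = pd.DataFrame({"origin_city": ["singapore"], "activity": ["arrived"]})

# Fit on training data only; unseen categories encode as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train).toarray()
X_test = encoder.transform(test).toarray()
```

Fitting the encoder on the training rows alone keeps the set of output columns fixed, so the test matrix has the same width even when it contains new labels.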
Evaluated multiple regression models, including Linear Regression (which underfit) and Polynomial Regression (which was too computationally expensive).
Selected, trained, and tuned a Random Forest regression model as the final, most robust solution.
Used RandomizedSearchCV to tune the Random Forest's hyperparameters, balancing model complexity against overfitting and reducing the squared error contributed by outlier trips.
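The search could be set up roughly as below. The synthetic data, the parameter grid, the number of iterations, and the use of TimeSeriesSplit (chosen here to stay consistent with a chronological split) are all assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for the engineered features and the 'delay' target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Hypothetical search space; the real grid is not given in the description.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # sample 5 of the 18 combinations
    scoring="neg_mean_squared_error",  # MSE penalizes large misses on outliers
    cv=TimeSeriesSplit(n_splits=3),    # validation folds always follow training folds
    random_state=0,
)
search.fit(X, y)
best_model = search.best_estimator_    # refit on all of X, y with the best settings
```

Scoring on negative MSE makes the search favor settings that shrink the largest errors, which is one plausible reading of how tuning reduced the MSE for outliers.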