As practice for my degree and future job opportunities, I have implemented several machine learning models from scratch using only Pandas and NumPy (no scikit-learn, TensorFlow, PyTorch, etc.).
The preprocessing/ directory contains several utilities for loading datasets from CSV files and transforming them to be fed into an ML model:
- Read dataset from CSV
- Split dataset into training and testing sets
- Discretize continuous features with buckets of equal width or equal frequency
- Encode categorical features with one-hot encoding
- Perform z-score normalization or min-max scaling on numerical features
- Impute missing values with feature mean or mode
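
To illustrate a few of the transformations listed above, here is a minimal NumPy sketch of discretization and scaling. The function names (`equal_width_bins`, `equal_frequency_bins`, `z_score`, `min_max`) are hypothetical and chosen for this example; they are not necessarily the names used in this repository.

```python
import numpy as np

# NOTE: all function names here are illustrative assumptions,
# not the repository's actual API.

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins buckets of equal width."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use only the interior edges so the maximum lands in the last bucket.
    return np.digitize(values, edges[1:-1])

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins buckets holding roughly equal counts."""
    values = np.asarray(values, dtype=float)
    # Interior quantiles serve as the bucket boundaries.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

def z_score(values):
    """Rescale to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    # A constant feature has no spread; map it to zeros instead of dividing by 0.
    if std == 0:
        return np.zeros_like(values)
    return (values - values.mean()) / std

def min_max(values):
    """Rescale linearly to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span
```

When a dataset is split into training and testing sets, the statistics (mean, standard deviation, minimum, maximum, bucket edges) would normally be computed on the training set and reused for the testing set; this sketch computes them on whatever array it is given to keep the example short.
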
The utilities/ directory also contains functions for performing k-fold or (k × 2)-fold cross-validation and computing several evaluation metrics.
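
As a rough sketch of how k-fold cross-validation can be done with NumPy alone, the generator below shuffles the sample indices, splits them into k folds, and yields each fold once as the test set. The names `k_fold_indices` and `accuracy` are assumptions made for this example, and accuracy stands in for whichever evaluation metrics the repository actually provides.

```python
import numpy as np

# NOTE: hypothetical helper names; the repository's functions may differ.

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split tolerates n_samples not being divisible by k.
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

Assuming (k × 2)-fold means k repetitions of 2-fold cross-validation, it could be built on the same helper by calling `k_fold_indices(n_samples, 2, seed=r)` for `r` in `range(k)`, so that each repetition uses a fresh shuffle.
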
The models/ directory contains the implementations of the following models:
- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Neural Network with Backpropagation and Adam optimizer
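
For reference, a single Adam parameter update can be written in a few lines of NumPy. This is a generic sketch of the standard Adam rule with its commonly used default hyperparameters, not the exact code in models/; the function name `adam_step` and its signature are assumptions made for this example.

```python
import numpy as np

# NOTE: adam_step is a hypothetical name used only for illustration.

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (param, m, v).

    t is the 1-based step count, needed for bias correction.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Correct the bias toward zero caused by initializing m and v at 0.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
```

In a neural network, the same update would be applied to every weight matrix and bias vector, each with its own `m` and `v` arrays of matching shape, using the gradients produced by backpropagation.
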