Multivariable Regression
1. Problem
The task is to model the performance of computer programs. The provided dataset describes a number of runs of SGDClassifier in Python. Its features describe both the configuration of the SGDClassifier and the settings used to generate the synthetic training data. The quantity to be predicted is the training time of the SGDClassifier.
`time` is the training time of the model. The other features follow the definitions in scikit-learn. Specifically, `n_samples`, `n_features`, `n_classes`, `n_clusters_per_class`, `n_informative`, `flip_y`, and `scale` describe how the synthetic training dataset is generated with sklearn.datasets.make_classification, while `penalty`, `l1_ratio`, `alpha`, `max_iter`, `random_state`, and `n_jobs` describe the setup of sklearn.linear_model.SGDClassifier.
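For concreteness, here is a minimal sketch of how a single row of such a dataset could be produced; the parameter values are illustrative assumptions, not the grid actually used to build the provided data:

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Illustrative parameter values; the real dataset covers many such combinations.
X, y = make_classification(
    n_samples=10_000,
    n_features=50,
    n_classes=3,
    n_clusters_per_class=2,
    n_informative=10,
    flip_y=0.01,
    scale=1.0,
    random_state=0,
)

clf = SGDClassifier(
    penalty="elasticnet",
    l1_ratio=0.15,
    alpha=0.0001,
    max_iter=1000,
    random_state=0,
    n_jobs=-1,
)

start = time.perf_counter()
clf.fit(X, y)
elapsed = time.perf_counter() - start  # this is the `time` target in the dataset
print(f"training time: {elapsed:.3f}s")
```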
2. Solution
I analyse the relations between the features in the training data and do feature engineering for better prediction. A feed-forward neural network is then used to predict the running time.
2.1 Data Analysis
I analyse the dataset with NumPy and Pandas, guided by the description of SGDClassifier in the scikit-learn documentation.
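A minimal sketch of the loading step (the file names train.csv and test.csv are assumptions; the actual code is in the Colab notebook linked at the end of this section):

```python
import pandas as pd

# Assumed file names for the provided dataset.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print("Shape of the train data with all features:", train.shape)
print("Shape of the test data with all features:", test.shape)
```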
Shape of the train data with all features: (400, 15)
Shape of the test data with all features: (100, 14)
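Continuing from the loading sketch above, the correlation of every numerical column with the target `time` can be inspected like this:

```python
# Pearson correlation of every numerical feature with the target `time`.
numeric_cols = train.select_dtypes(include="number")
print(numeric_cols.corr()["time"].sort_values(ascending=False))
```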
With the above code, we can see the correlation between each feature and `time`. The result shows only weak correlations between the individual features and `time`, which means I need to work on feature engineering (feature selection). According to the scikit-learn documentation, the complexity of SGD is:
> The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of O(k n p̄), where k is the number of iterations (epochs) and p̄ is the average number of non-zero attributes per sample.
>
> Recent theoretical results, however, show that the runtime to get some desired optimization accuracy does not increase as the training set size increases.
That means one of the critical factors affecting its performance is the product of the number of samples (n) and the number of iterations or epochs (k), rather than treating these quantities as independent of each other. Also, the feature (parameter) `penalty` has 4 categories (options): 'none', 'l1', 'l2', and 'elasticnet'. Unlike the other 14 numerical features, it should be handled as a categorical feature. Thus, I create a function that builds a new feature, `complexity`, as sketched below, and check the correlation between this new feature and `time`.
For the complexity, `max_iter` corresponds to k in the documentation, and `n_samples` * `n_features` * `n_classes` approximates the overall size of the dataset's matrix (n p̄), so I multiply them all together. On the other hand, `n_jobs` is the number of CPU cores and makes training faster, so the complexity should be divided by it. The maximum `n_jobs` in the training dataset is 16, so I replace -1, which means using all cores, with 16.
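A minimal sketch of this feature construction and of the per-penalty correlation check (the function name is an assumption; it builds directly on the dataframes loaded above):

```python
def add_complexity(df):
    """Add a `complexity` feature approximating k * n * p / n_jobs."""
    df = df.copy()
    # n_jobs == -1 means "use all cores"; replace it with 16,
    # the maximum n_jobs observed in the training data.
    n_jobs = df["n_jobs"].replace(-1, 16)
    df["complexity"] = (
        df["max_iter"] * df["n_samples"] * df["n_features"] * df["n_classes"]
    ) / n_jobs
    return df

train = add_complexity(train)
test = add_complexity(test)

# One dataframe per penalty option, and the correlation of `complexity` with `time`.
for penalty in ["none", "l1", "l2", "elasticnet"]:
    subset = train[train["penalty"] == penalty]
    print(penalty)
    print(subset[["id", "complexity", "time"]].corr())
```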
The above code creates a separate dataframe for each penalty option. Now I can see that my newly created feature `complexity` is correlated with `time` at more than 97% for every penalty option, as shown below.
| | id | complexity | time |
|---|---|---|---|
| id | 1.000000 | 0.032260 | 0.035770 |
| complexity | 0.032260 | 1.000000 | 0.980792 |
| time | 0.035770 | 0.980792 | 1.000000 |

| | id | complexity | time |
|---|---|---|---|
| id | 1.000000 | 0.057691 | 0.059486 |
| complexity | 0.057691 | 1.000000 | 0.988946 |
| time | 0.059486 | 0.988946 | 1.000000 |

| | id | complexity | time |
|---|---|---|---|
| id | 1.000000 | 0.064950 | 0.015912 |
| complexity | 0.064950 | 1.000000 | 0.982053 |
| time | 0.015912 | 0.982053 | 1.000000 |

| | id | complexity | time |
|---|---|---|---|
| id | 1.000000 | 0.011534 | 0.025684 |
| complexity | 0.011534 | 1.000000 | 0.976314 |
| time | 0.025684 | 0.976314 | 1.000000 |
You can check all the code here: Google Colab Data Analysis link (Click!)
2.2 Feature engineering and Prediction
I do feature engineering according to the above data analysis results. In this part, I use one-hot encoding to deal with the categorical feature `penalty`, and I build a simple feed-forward neural network with PyTorch for prediction.
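A minimal sketch of the one-hot encoding step using pandas.get_dummies (continuing with the dataframes from the previous section):

```python
import pandas as pd

# One-hot encode the categorical `penalty` column; the other features stay numeric.
train = pd.get_dummies(train, columns=["penalty"])
test = pd.get_dummies(test, columns=["penalty"])

# Make sure the test set ends up with the same dummy columns as the training set.
test = test.reindex(columns=[c for c in train.columns if c != "time"], fill_value=0)
```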
2.2.1. Feature Scaling
I try normalisation and standardisation for rescaling the data, i.e. __feature scaling__, because the features vary widely in magnitude, units and range.
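A minimal sketch of both options, with the rescaling statistics computed on the training data only (the column selection is an assumption):

```python
feature_cols = [c for c in train.columns if c not in ("id", "time")]

# Standardisation: zero mean and unit variance, statistics from the training set only.
mean = train[feature_cols].mean()
std = train[feature_cols].std().replace(0, 1.0)
train_std = (train[feature_cols] - mean) / std
test_std = (test[feature_cols] - mean) / std

# Normalisation: rescale every feature to the [0, 1] range.
fmin = train[feature_cols].min()
frange = (train[feature_cols].max() - fmin).replace(0, 1.0)
train_norm = (train[feature_cols] - fmin) / frange
test_norm = (test[feature_cols] - fmin) / frange
```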
2.2.2. Validation and Model
I split the training data into a training set and a validation set, and build a simple feed-forward NN like the one below.
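A minimal sketch of the split and the network; the layer sizes, learning rate, and number of epochs are illustrative assumptions rather than the exact values used in the linked notebook:

```python
import numpy as np
import torch
import torch.nn as nn

# Train/validation split by NumPy array slicing (80/20), as described above.
X = train_std.to_numpy(dtype=np.float32)
y = train["time"].to_numpy(dtype=np.float32).reshape(-1, 1)
split = int(0.8 * len(X))
X_train, X_val = torch.from_numpy(X[:split]), torch.from_numpy(X[split:])
y_train, y_val = torch.from_numpy(y[:split]), torch.from_numpy(y[split:])

# A simple feed-forward regression network.
model = nn.Sequential(
    nn.Linear(X.shape[1], 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    train_loss = criterion(model(X_train), y_train)
    train_loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val)
    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}  train {train_loss.item():.4f}  val {val_loss.item():.4f}")
```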
The following plot shows the training loss and the validation loss per epoch.
My prediction results on the validation data look like this:
You can check all the code here: Google Colab Prediction with PyTorch link (Click!)
3. Future Work
- Replacing n_jobs = -1 should be handled in a more sophisticated way than simply substituting 16, which requires another data analysis task.
- Use scikit-learn to split the training and validation datasets instead of NumPy array slicing.