Linear Regression: Building Predictive Models for Higher-Dimensional Data

Linear regression is a powerful tool for predicting outcomes based on input variables. In our previous lecture, we explored a simple linear regression example with one input variable and one outcome variable, where we recovered the slope of the model using the least-squares best fit. Now, let's dive into the exciting world of higher-dimensional data sets.

In higher-dimensional data sets, instead of working with a single input variable, we have multiple factors that can be used to build a prediction model for the outcome variable. Imagine measuring four, five, or even six different variables, such as demographic information like age, sex, weight, and diet. We want to use these variables to build a linear regression model that predicts, for example, the risk of a heart attack.

To achieve this, we collect the measurements in a matrix A, where each row represents an individual patient and each column one of the input factors, and we stack the outcomes in a column vector b. We then solve for the best-fit coefficients x of the model by applying the pseudoinverse of A to the outcome vector: x = pinv(A) b.
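Here is a minimal sketch of that setup in Python with NumPy. The arrays below are random placeholders standing in for real patient measurements, not actual clinical data:

```python
import numpy as np

# Hypothetical example: 100 patients, 4 measured factors
# (e.g., age, sex, weight, diet score). These values are
# random placeholders, not real clinical data.
rng = np.random.default_rng(0)
A = rng.random((100, 4))   # each row is a patient, each column an input factor
b = rng.random(100)        # outcome variable (e.g., a heart-attack risk score)

# The best-fit coefficients x minimize ||A x - b|| in the least-squares
# sense, and are given by the pseudoinverse: x = pinv(A) @ b
x = np.linalg.pinv(A) @ b
print(x)
```

Under the hood, np.linalg.pinv computes the pseudoinverse via the SVD, which is exactly the machinery used in the next example.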

For our next example, we will work with Portland cement data. Concrete generates heat as it hardens, and this data set consists of 13 experiments in which different ingredients were mixed to make cement. Each row represents an experiment, and each column represents the amount of one ingredient used. Our goal is to build a model that can predict the amount of heat generated by a new mixture of ingredients.

Using the pseudoinverse, computed via the singular value decomposition (SVD), we will find the best-fit coefficients that map the four ingredient amounts to the outcome variable, heat. This model will enable us to predict the heat generation of future mixtures that we haven't tested yet.
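Here is one way this fit might look, assuming the measurements have been placed in a file; the file name hald.csv and its column layout are assumptions for illustration, not part of the original lecture:

```python
import numpy as np

# Assumes a hypothetical file "hald.csv" whose first four columns are the
# ingredient amounts and whose last column is the heat generated.
data = np.loadtxt("hald.csv", delimiter=",")
A = data[:, :4]   # 13 experiments x 4 ingredients
b = data[:, 4]    # heat generated in each experiment

# Pseudoinverse via the economy SVD: A = U S V^T, so pinv(A) = V S^-1 U^T
U, S, Vt = np.linalg.svd(A, full_matrices=False)
x = Vt.T @ np.diag(1.0 / S) @ U.T @ b

# Predict the heat of a new, untested mixture (amounts are made up)
new_mix = np.array([8.0, 50.0, 10.0, 30.0])
print(new_mix @ x)
```

Working through the SVD explicitly, rather than calling np.linalg.pinv directly, makes it clear that the pseudoinverse simply inverts the singular values: pinv(A) = V S^-1 U^T.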


But wait, we need to be careful not to overfit the model. Overfitting happens when a model is trained on all of the available data and fits it so closely, noise included, that it fails to generalize to new measurements. To avoid this, it is crucial to split our data into a training section and a test section. We can use the first ten experiments to build the model and then test it on the remaining three. This way, we can validate the model's performance on a holdout data set and confirm that it generalizes rather than simply memorizing the training data, as sketched below.
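A minimal sketch of that split, continuing with the same hypothetical hald.csv file as above:

```python
import numpy as np

# Reload the same hypothetical "hald.csv" as in the previous sketch.
data = np.loadtxt("hald.csv", delimiter=",")
A, b = data[:, :4], data[:, 4]

# Train on the first ten experiments; hold out the last three for testing.
A_train, b_train = A[:10], b[:10]
A_test,  b_test  = A[10:], b[10:]

# Fit the model on the training data only
x = np.linalg.pinv(A_train) @ b_train

# Validate: compare predicted and measured heat on the held-out experiments
b_pred = A_test @ x
print("relative test error:",
      np.linalg.norm(b_pred - b_test) / np.linalg.norm(b_test))
```

If the test error is much larger than the training error, the model has likely overfit the ten training experiments.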

In reality, there is much more to explore and learn about machine learning and statistical learning. Our upcoming book, “Data-Driven Science and Engineering,” delves into the intricacies of validating and cross-validating models, as well as the importance of holding out data for testing purposes.

Building predictive models for higher-dimensional data sets opens up endless possibilities. By harnessing the power of linear regression and understanding the nuances of model validation, we can unlock valuable insights in various fields. So, let's dive into the world of predictive modeling and unleash the true potential of data-driven science and engineering.
