Feature engineering is often overlooked or simply forgotten. I know I have built numerous models without even attempting this crucial step. That said, oftentimes the data does not provide the opportunity for feature engineering. In this post I just want to provide a few examples of where one can look into feature engineering.
Time Variable: A very important variable; however, June 26 2014 will not be the same as June 26 2015. Huh? What does that mean? A raw timestamp never repeats, so a model can latch onto specific dates instead of patterns that recur. This is where feature engineering can prevent over-fitting. How about we remove the Time Variable and create a few new ones: Day Of Week, Season, Time Int, or Time Period (morning, day, night).
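Here is a minimal sketch of how those derived time features could look in Python. The function name `time_features`, the hour cutoffs for morning/day/night, and the month-based Northern Hemisphere seasons are all my own assumptions for illustration; pick boundaries that make sense for your data.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
# Assumption: seasons by calendar month, Northern Hemisphere.
SEASONS = ["winter", "spring", "summer", "fall"]


def time_features(ts: datetime) -> dict:
    """Replace a raw timestamp with features that repeat year to year."""
    # Dec/Jan/Feb -> 0, Mar-May -> 1, Jun-Aug -> 2, Sep-Nov -> 3
    season = SEASONS[(ts.month % 12) // 3]

    # Assumption: hypothetical cutoffs for the time period buckets.
    if 6 <= ts.hour < 12:
        period = "morning"
    elif 12 <= ts.hour < 18:
        period = "day"
    else:
        period = "night"

    return {
        "day_of_week": DAY_NAMES[ts.weekday()],  # weekday(): Monday is 0
        "season": season,
        "time_period": period,
    }


features_2014 = time_features(datetime(2014, 6, 26, 9, 30))
features_2015 = time_features(datetime(2015, 6, 26, 9, 30))
```

Notice that the two dates share a season and time period but fall on different days of the week, which is exactly the kind of thing the raw date hides.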
Name Variable: Well, what does this variable represent to a model? Essentially nothing. However, we can derive a few things from this variable, such as sex, approximate age by name, ethnicity, and race. Now, you must be very careful when using ethnicity and race; this is a legal grey area. For example, if you are using this for credit scoring it's probably not okay, but if you are using it for medical reasons then it is okay. An easy way to ensure you don't fall into the grey area is to simply ask yourself, "does this model benefit the livelihood or health of others?" If it does not, then do not derive race or ethnicity.
Temperature and Humidity Variable: Sometimes it is best to put these into buckets. However, this is all relative to the data set and target variable. Example: we are trying to predict ticket sales for an outdoor concert. One might want to try bucketing temperature into cold, enjoyable, and too hot!
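A rough sketch of that bucketing for the concert example might look like this. The cutoffs (55 and 85 degrees Fahrenheit) and the function name `bucket_temperature` are made up for illustration; in practice you would tune them against your own target variable.

```python
def bucket_temperature(temp_f: float,
                       cold_below: float = 55.0,
                       hot_above: float = 85.0) -> str:
    """Map a temperature in Fahrenheit to a coarse comfort bucket.

    The default thresholds are hypothetical, not taken from any data set.
    """
    if temp_f < cold_below:
        return "cold"
    if temp_f > hot_above:
        return "too_hot"
    return "enjoyable"


readings = [41.0, 72.5, 96.0]
buckets = [bucket_temperature(t) for t in readings]
```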
Zip Code Variable: Another variable that is useless on its own; however, it allows you to pull in much more information, such as wealth, property values, crime stats, flood plain, insurance info, and so much more!
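To make the idea concrete, here is a small sketch of enriching a record from a zip-code lookup table. Everything in it is a placeholder: the zip codes, the field names (`median_income`, `in_flood_plain`), and the values are invented for illustration, and in real life the lookup would come from whatever external data source you have access to.

```python
# Hypothetical lookup table keyed by zip code. The keys and values are
# placeholders, not real statistics.
ZIP_LOOKUP = {
    "00001": {"median_income": 50000, "in_flood_plain": False},
    "00002": {"median_income": 80000, "in_flood_plain": True},
}


def enrich_with_zip(record: dict, lookup: dict) -> dict:
    """Return a copy of the record with zip-level features merged in.

    The raw zip code is dropped; unknown zip codes get None for each feature.
    """
    enriched = {k: v for k, v in record.items() if k != "zip"}
    extra = lookup.get(record.get("zip"))
    if extra is None:
        # Collect every feature name seen in the lookup so that all
        # enriched records share the same columns.
        feature_names = {name for row in lookup.values() for name in row}
        extra = {name: None for name in feature_names}
    enriched.update(extra)
    return enriched


known = enrich_with_zip({"zip": "00002", "price": 10}, ZIP_LOOKUP)
unknown = enrich_with_zip({"zip": "99999", "price": 12}, ZIP_LOOKUP)
```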
Address Variable: This variable does not tell us much on its own; however, you can convert it into single family or multi/apt/condo based upon the address.
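One crude way to sketch this is to look for a unit designator in the address string. This is only a heuristic I am assuming for illustration (the pattern and the name `dwelling_type` are mine): plenty of condos have no unit number in the address, so treat the result as a rough signal rather than ground truth.

```python
import re

# Assumption: an address with a unit designator (Apt, Unit, Suite, Ste, or #)
# followed by an identifier points to a multi-unit building.
UNIT_PATTERN = re.compile(
    r"\b(?:apt|apartment|unit|suite|ste)\b\.?\s*\w+|#\s*\w+",
    re.IGNORECASE,
)


def dwelling_type(address: str) -> str:
    """Guess the dwelling type from the address text alone."""
    if UNIT_PATTERN.search(address):
        return "multi_unit"
    return "single_family"


examples = ["123 Main St Apt 4B", "500 Oak Ave #12", "123 Main St"]
guesses = [dwelling_type(a) for a in examples]
```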