How Far We've Come

The purpose of this book, as explained in Chapter 1, is to introduce non-experts and non-computer scientists to some of the methods and tools of data mining. A number of processes, tools, operators, and data manipulation techniques have been demonstrated in this book, but perhaps the most important lesson to take away from this broad treatment of data mining is that the field has become huge, complex, and dynamic. You have learned about the CRISP-DM process and seen it applied numerous times in data mining models that classified, predicted, or did both. You have seen a number of data processing tools and techniques, and along the way you have hopefully noticed the myriad other operators in RapidMiner, and packages and functions in R, that we did not use or discuss. Although you may be feeling like you're getting good at data mining (and we hope you are), please recognize that there is a world of data mining that this book has not touched on, so there is still much for you to learn.

This chapter and the next will discuss some precautions that should be taken before putting any real-world data mining results into practice. This chapter will demonstrate methods for validating data mining models in RapidMiner and R. Chapter 14 will discuss the choices you will make as a data miner and some ways to guide those choices in responsible directions. Remember from Chapter 1 that CRISP-DM is cyclical: you should always be learning from the work you are doing and feeding what you've learned back into your next data mining activity.

For example, suppose you used a Replace Missing Values operator in a data mining model to set all missing values in a data set to the average for each attribute. Suppose further that you used results from that data mining model in making decisions for your company and that those decisions turned out to be less than ideal. What if you traced those decisions back to your data mining activities and found that by using the average, you had made some general assumptions that weren't very realistic? Perhaps you don't need to throw out the data mining model entirely, but for the next run of that model you should either change it to remove observations with missing values or use a more appropriate replacement value based upon what you have learned. Even if you used your data mining results and had excellent outcomes, remember that your business is constantly moving, and through the day-to-day operations of your organization, you are gathering more data. Be sure to add this data to training data sets, compare actual outcomes to predictions, and tune your data mining models in accordance with your experience and the expertise you are developing.

Consider Sarah, our hypothetical sales manager from Chapters 4 and 8. Now that we've helped her predict heating oil usage by home through a linear regression model, Sarah can track these homes' actual heating oil orders to see how well their actual use matches our predictions. Once these customers have established several months or years of actual heating oil consumption, their data can be fed into the training data set for Sarah's model, helping it to be even more accurate in its predictions.
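The feedback loop in Sarah's scenario can be sketched in R. This is only a minimal illustration, not the model built in earlier chapters: the column names (Insulation, Home_Size, Heating_Oil) and every value below are invented for the example. The sketch fits a linear regression, compares its predictions with consumption observed later, and then refits the model with the new observations added to the training data.

```r
# Hypothetical training data; all column names and values are invented.
train <- data.frame(
  Insulation  = c(4, 6, 8, 3, 9, 5, 7, 2),
  Home_Size   = c(3, 6, 4, 2, 7, 5, 3, 4),
  Heating_Oil = c(190, 260, 270, 150, 310, 220, 240, 160)
)

# Fit the original linear regression model.
model <- lm(Heating_Oil ~ Insulation + Home_Size, data = train)

# Homes we predicted for earlier, now with their observed consumption.
actuals <- data.frame(
  Insulation  = c(5, 8, 3),
  Home_Size   = c(4, 5, 2),
  Heating_Oil = c(210, 275, 155)
)

# Compare actual outcomes to predictions.
predicted <- predict(model, newdata = actuals)
errors    <- actuals$Heating_Oil - predicted
rmse      <- sqrt(mean(errors^2))  # root mean squared prediction error

# Feed the observed outcomes back into the training data and refit.
train_updated <- rbind(train, actuals)
model_updated <- lm(Heating_Oil ~ Insulation + Home_Size, data = train_updated)
```

Inspecting errors and rmse shows how far the predictions were from what actually happened, while model_updated is the model retrained on the larger data set. Whether the refit model predicts better should itself be checked against fresh data rather than assumed.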

One of the benefits of connecting RapidMiner to a file, database, or data warehouse rather than importing data into a RapidMiner repository is that data can be added to the data sets in real time and fed straight into the RapidMiner models. If you were to acquire some new training data, as in Sarah's scenario proposed in the previous paragraph, it could be immediately incorporated into the RapidMiner model if the data were in a connected data set rather than imported to the repository. Since R creates and stores copies of data in R objects such as data frames, we do not run into the connect-or-import question when working in R, though it is important to remember that if new data is added to a model's underlying data set, the object in R may need to be repopulated in order to include the new data.
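The point about repopulating R objects can be seen in a short sketch. The file here is a temporary CSV created only for this illustration, and its columns (Home_ID, Heating_Oil) and values are invented. After new rows are appended to the file, the data frame still holds the old copy until the file is read again.

```r
# Create a small, hypothetical source file for the illustration.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Home_ID = 1:3, Heating_Oil = c(250, 180, 310)),
  csv_path,
  row.names = FALSE
)

# R copies the file's contents into a data frame.
homes <- read.csv(csv_path)

# New observations arrive and are appended to the underlying file.
write.table(
  data.frame(Home_ID = 4:5, Heating_Oil = c(205, 275)),
  csv_path,
  sep = ",", append = TRUE, col.names = FALSE, row.names = FALSE
)

# The data frame is unchanged: it is a copy, not a live connection.
rows_before_reload <- nrow(homes)  # 3

# Repopulate the object so that it includes the new data.
homes <- read.csv(csv_path)
rows_after_reload <- nrow(homes)   # 5
```

Any model that was fit on the old data frame would likewise need to be refit after the reload in order to reflect the new rows.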

As we tune and hone our models, they perform better for us. In addition to using our growing expertise and adding more training data, there are some built-in ways that we can check a model's performance in RapidMiner or in R.