IBM SPSS Modeler Cookbook

By: Keith McCormick, Abbott

Overview of this book

IBM SPSS Modeler is a data mining workbench that enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions on hard data, not hunches or guesswork. IBM SPSS Modeler Cookbook takes you beyond the basics and shares the tips, the timesavers, and the workarounds that experts use to increase productivity and extract maximum value from data. The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art. Follow the industry standard data mining process, gaining new skills at each stage, from loading data to integrating results into everyday business practices. Get a handle on the most efficient ways of extracting data from your own sources and preparing it for exploration and modeling. Master the best methods for building models that will perform well in the workplace. Go beyond the basics and get the full power of your data mining workbench with this practical guide.

Detecting potential model instability early using the Partition node and Feature Selection node

Model instability is usually noticed during the evaluation phase, where it shows up as substantially stronger performance on the Train data set than on the Test data set. This bodes ill for the performance of the model on new data and therefore for its practical application to any business problem. Veteran data miners see this coming well before the evaluation phase, or at least they hope they do. The trick is to spot one of the most common causes early: model instability is much more likely when several inputs compete for the same variance in the model, in other words, when the inputs are highly correlated with each other.

Data miners can also bring instability on themselves. Overfitting, discussed in the Introduction of Chapter 7, Modeling – Assessment, Evaluation, Deployment, and Monitoring, can also cause model instability. If the issue lies in the set of inputs, this recipe can help to identify which inputs are responsible. The correlation matrix recipe and other data reduction recipes can then assist in corrective action.
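Outside Modeler, the idea of inputs competing for the same variance can be checked with a simple pairwise correlation screen. The following is only a minimal sketch in Python on synthetic data; the column names, the 0.8 threshold, and the helper functions are assumptions made for illustration and are not part of the recipe or of any Modeler node.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every input pair with |r| >= threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Synthetic data (hypothetical): two near-duplicate inputs and one unrelated input.
rng = random.Random(0)
base = [rng.gauss(0, 1) for _ in range(500)]
columns = {
    "input_a": [b + rng.gauss(0, 0.2) for b in base],
    "input_b": [b + rng.gauss(0, 0.2) for b in base],
    "input_c": [rng.gauss(0, 1) for _ in range(500)],
}

pairs = correlated_pairs(columns)
print(pairs)  # only the near-duplicate pair should be flagged
```

A flagged pair is a hint that the two inputs may trade places in rankings and models, which is the kind of behavior this recipe looks for.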

This recipe also serves as a cautionary tale about giving the Feature Selection node a heavier burden than it can carry. This node looks at the bivariate relationships of inputs with the target. Bivariate simply means two variables, so Feature Selection is blind to what might happen when many inputs work together to predict the target. Bivariate analyses are not without value; they are critical to the Data Understanding phase. But the goal of the data miner is to recruit a team of variables, and the team's performance depends on a number of factors, only one of which is the ability of each input to predict the target variable.

Getting ready

We will start with the Stability.str stream.

How to do it...

To detect potential model instability using the Partition and Feature Selection nodes, perform the following steps:

  1. Open the stream, Stability.str.
  2. Edit the Partition node, click on the Generate seed button, and run it. (Since you will not get the same seed as the figure shown, your results will differ. This is not a concern. In fact, it helps illustrate the point behind the recipe.)
  3. Run the Feature Selection Modeling node and then edit the resulting generated model. Note the ranking of potential inputs may differ if the seed is different.
  4. Edit the Partition node, generate a new seed, and then run the Feature Selection again.
  5. Edit the Feature Selection generated model.
  6. For a third and final time, edit the Partition node, generate a new seed, and then run the Feature Selection. Edit the generated model.
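The steps above can be imitated outside Modeler to see why reseeding the partition matters. This is a rough sketch on synthetic data, not a reproduction of Stability.str: the field names (rfa_like_2 and so on), the 50/50 split, and the use of absolute Pearson correlation as the ranking statistic are all assumptions for illustration, and the correlation is a stand-in for, not a copy of, the Feature Selection node's own importance measure.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def make_data(n=2000, seed=1):
    """Synthetic rows: four similar inputs driven by one latent trait, plus noise."""
    rng = random.Random(seed)
    names = ["rfa_like_2", "rfa_like_3", "rfa_like_4", "rfa_like_6"]
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden "donor propensity" (hypothetical)
        row = {name: z + rng.gauss(0, 0.5) for name in names}
        row["noise"] = rng.gauss(0, 1)
        row["target"] = 1 if z + rng.gauss(0, 1) > 0 else 0
        rows.append(row)
    return rows

def rank_on_training_half(rows, partition_seed):
    """Split roughly 50/50 using the seed, then rank inputs on the training rows."""
    rng = random.Random(partition_seed)
    train = [r for r in rows if rng.random() < 0.5]
    inputs = [k for k in rows[0] if k != "target"]
    target = [r["target"] for r in train]
    strength = {k: abs(pearson([r[k] for r in train], target)) for k in inputs}
    return sorted(inputs, key=lambda k: strength[k], reverse=True)

rows = make_data()
rankings = {seed: rank_on_training_half(rows, seed) for seed in (11, 22, 33)}
for seed, ranking in rankings.items():
    print(seed, ranking)
```

In a setup like this, the four similar inputs have almost identical bivariate strength, so their order can change from seed to seed, while the unrelated noise field stays at the bottom.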

How it works...

At first glance, one might anticipate no major problems ahead. RFA_6, which is the donor status calculated six campaigns ago, is in first place twice and in third place once. Clearly it provides some value, so what is the danger in proceeding to the next phase? The change in ranking from seed to seed reveals something important about this set of variables: they behave like variables that are similar to each other. They are all descriptions of past donation behavior at different times; the larger the number after the underscore, the further back in time the variable reaches. Why isn't the most recent variable, RFA_2, shown as the most predictive? Frankly, there is a good chance that it is the most predictive, but these variables are fighting over top status in the small decimal places of this analysis. We can trust Feature Selection to alert us that they are potentially important, but it is dangerous to trust the ranking under these circumstances, and it certainly doesn't mean that restricting our inputs to the top ten would give us a good model.

The behavior revealed here is not a good indication of how these variables will behave in a classification tree or any other multiple-input technique. In a tree, once a branch is formed using RFA_6, the model will tend to seek a variable that sheds light on some other aspect of the data. The variable used to form the second branch would likely not be the second variable on the list, because the first and second variables are similar to each other. The implication is that, if RFA_4 were chosen for the first branch, RFA_6 might not be chosen at all.
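The gap between a bivariate ranking and what a multiple-input technique picks can be illustrated with a small experiment. The sketch below does not build a classification tree; it swaps in a simpler greedy procedure (pick the strongest input, remove what it explains with a one-input least-squares fit, then pick again on the residuals) as a loose analogy. The data are synthetic and every name in it is hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, x):
    """Residuals of y after a one-input least-squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def strength(values, y):
    return abs(pearson(values, y))

# Synthetic data: two near-duplicate inputs and one weaker but distinct input.
rng = random.Random(7)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
inputs = {
    "near_dup_1": [v + rng.gauss(0, 0.3) for v in z],
    "near_dup_2": [v + rng.gauss(0, 0.3) for v in z],
    "different_signal": [v + rng.gauss(0, 0.3) for v in u],
}
target = [a + 0.6 * b + rng.gauss(0, 0.5) for a, b in zip(z, u)]

# Bivariate ranking: each input is judged alone against the target.
bivariate = sorted(inputs, key=lambda k: strength(inputs[k], target), reverse=True)

# Greedy selection: each pick is judged against what is still unexplained.
greedy, remaining, current = [], list(inputs), target
while remaining:
    best = max(remaining, key=lambda k: strength(inputs[k], current))
    greedy.append(best)
    remaining.remove(best)
    current = residuals(current, inputs[best])

print("bivariate:", bivariate)
print("greedy:   ", greedy)
```

In this construction the two near-duplicates occupy the top two bivariate places, yet the greedy procedure takes the distinct input second, because the remaining near-duplicate has little left to explain once its twin has been used.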

Each situation is different, but perhaps the best option here is to identify what these related variables have in common and distill it into a smaller set of variables. To the extent that these variables each have a unique contribution to make, perhaps in how far back in the past they reach, that too could be brought into higher relief during data preparation.
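One possible way to distill what related variables have in common is to standardize them and average them into a single composite score. The sketch below assumes numeric inputs and synthetic data; the field names and the simple averaging approach are illustrative assumptions, not the method prescribed by the recipe, and the latent trait it compares against would not be observable in real data.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def composite(columns):
    """Average the standardized columns row by row into a single score."""
    standardized = [standardize(col) for col in columns]
    return [statistics.fmean(row) for row in zip(*standardized)]

# Synthetic data: four noisy measurements of one hidden trait (hypothetical names).
rng = random.Random(3)
latent = [rng.gauss(0, 1) for _ in range(1000)]
related = {
    f"related_{i}": [z + rng.gauss(0, 0.5) for z in latent]
    for i in range(1, 5)
}

score = composite(list(related.values()))
single = {name: pearson(col, latent) for name, col in related.items()}
combined = pearson(score, latent)
print(single)
print(combined)
```

Here the composite tracks the hidden trait more closely than any single input does, which is the sense in which a smaller set of derived variables can replace a group of near-duplicates.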

See also

  • The Selecting variables using the CHAID Modeling node recipe in Chapter 2, Data Preparation – Select
  • The Removing redundant variables using correlation matrices recipe in Chapter 2, Data Preparation – Select
