Wednesday, August 18, 2010

Business Modeling & Data Mining

Ref: Berry, Michael J. A.; Linoff, Gordon S. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management

The world, Knowledge, and Models:
Highlights:
  • The miner has to construct a model of a business situation before mining begins. The pre-mining model is used to define where uncertainties are in a situation, to determine where mining can offer the most value, and to discover what data needs to be appropriately mined to discover an answer
  • Other pre-models may define how the data needs to be enhanced or enriched, or they may help determine what features can be usefully extracted. In fact, mining takes place in a realm almost entirely populated by various types of models
  • Data mining and modeling are general-purpose techniques that can be applied to a wide range of problems. However, where data mining is principally being applied today is in discovering business opportunities and solving business problems
  • Data mining is currently used in business as a tactical tool. Clearly it has much value to offer at a tactical level; however, the core business processes take place at a strategic level, and it is the strategic use of data mining that promises the greatest return for a company
  • The problem is to explore the situation and circumstances so that the tools and techniques can be best applied to the data to derive results that can be used to improve the situation
The components of this problem are:
  1. Their present insight, understanding, and knowledge
  2. The situation and circumstances
  3. Tools to evaluate the situation and circumstances
  4. Data that might be relevant
  5. Tools to evaluate this data
  6. Techniques of applying the tools
  7. How to discover the right problem
Of course, until the right problem has been discovered, the first six parts cannot be used appropriately. The process of binding all of these parts together into a coherent whole is called modeling. Another thread is the process of solving problems. The term problem is meant only to convey the difficulty in finding an appropriate course, or courses, of potential action. In other words, the problem is to discover what we could do to achieve particular desired ends. In this sense, the term problem also covers business opportunity discovery, as in "The problem is to find the best opportunity to pursue."
  • For better or worse, the current state of the technology has made a division of labor between the discoverers of information and the decision makers. A growing number of people in business organizations, either internally or externally, have the task of applying data mining techniques to data, the results of which are going to be used by decision makers who do not themselves either mine or model
  • It is very difficult for the miner to formulate strategic problems in a way that allows mining to add insight, unless the miner is directly involved in the managerial chain
  • In order to make that leap from the tactical to the strategic, the miner and modeler, whether employee or contractor, has to act in the role of consultant
  • The modeler is always concerned with business issues - with structuring the business problem or opportunity, with the business processes, with the business issues of data, with the application of the model to the appropriate business process, with connecting to the stakeholders, with deriving business value, and with return on the resources invested
  • After the business framework has been discovered, the miner is concerned with mining data - with data quality, tool selection, appropriate technique, relationship discovery, levels of confidence, and model clarity
  • The essence of business modeling is to create a structure that, at lowest cost and with lowest risk, returns the most advantageous gains and engenders enthusiastic support from all the stakeholders
  • Patterns occur in social activity, in animal behavior, in the physics and chemistry of the world, in our mental and emotional life, in fact in every aspect of the world that humans perceive. Symbolizing these patterns with words and numbers enables us to describe these patterns and their behavior as symbolic objects. Associating these object symbols with each other in various ways represents our understanding of the behavior of the world
  • It is the structures of these interlocking symbols representing worldly patterns that are called models.
  • Much of the effort of humans through the course of history has been devoted to discovering useable patterns to construct various types of useful models. Data mining is simply the latest in a long line of tools for detecting meaningful patterns and, ultimately, improving control of the world. On a fundamental level, it is no more, nor less, than the automated search for patterns in data sets
  • New knowledge and insight changes the structure of existing models - that is the whole purpose of mining and modeling
  • All models are constructed from the events and objects. All of the types of models that data mining deals with - symbolic models - are created from and modified by these elements
  • It is the objective of data mining to carry the continuing exploration a little further
  • Business is an enormous and interlocking knowledge structure, itself of enormous complexity. It exists as a realm in itself, although tightly coupled to every other facet of human life and experience. It is a world of customers, development, marketing, inventory, complaints, profit, returns, and many other specialized events, objects, and relationships
Events:
  • A change in the state of the universe. This is the definition of an event. Every event has its effect. Everything in the universe is connected quite intimately
  • It is impossible to discuss or define an event without reference to something else. Things only happen when they change their relationship to other things. This means that every event can be expressed only relative to some surrounding framework. Very often, the framework is implicit or assumed. All events are relative
  • The relativity of events within their framework can be very important in both models and data mining. Many business events are represented as transactions - goods are exchanged for value received
  • In many models and mining applications, awareness of the surrounding framework is important
Objects:
  • If an event is something happening, an object is a thing to which the something happens. Any particular simplification of real world phenomena encompasses, or can be regarded as, a collection of features. These features form an appreciable matrix - a summation of impressions - that can be taken to indicate the presence of the specified object
  • For mining and modeling, the states of the features are taken as the defining characteristics of an object without any need to further consider the "true" underlying nature of the object itself
Perception:
  • Perception is regulated and enforced by the mental models of reality that we have created. The terms regulated and enforced may seem overly restrictive, but they are completely warranted
  • Perception is very much regulated and enforced by the existing knowledge that frames any situation. Such is the power of these frameworks that humans have to work very hard indeed to escape from the constraints of perception. Inability to escape from the constraints of preconceptions brought the world a monumental nuclear disaster, and doomed the engineers involved
  • Perception is the way that we view the world, whatever its limitations. Perceptions are built out of events that happen to the features of objects and are given meaning by reference to a framework of existing knowledge. The internal recording of these phenomena creates and adjusts the internal perceptual framework so that what we know constantly changes, and constantly changes what we can know
Data:
  • Data starts simply, with the recording of events or states of features of objects; however, not all of the events in the universe can be recorded, so some selectivity has to be applied 
  • Data records what the current knowledge framework defines as useful or potentially meaningful events
  • Recorded data is filtered through perceptions of what constitutes reality
  • Miners and modelers must be, and must remain, cognizant that the data to be mined or modeled is inextricably intertwined with the framework in place when it was collected
  • Business has many such framework models - such as the customer relationship management (CRM) model, the manufacturing efficiency model, and the product management model - that each views the world differently
  • Data collected under one view of the world will usually tend to support that view of the world (or framework model). If the framework model changes, data previously collected will unavoidably continue to contain traces of the model under which it was collected
Structures:
  • The technologies of storing information - language and writing - helped to create "permanent" knowledge structures
  • This phenomenon, known as the time binding of information, means that people confronting problems did not have to learn how to solve the problem from scratch on an individual basis. Instead they could draw on accumulated knowledge. It also means that continuous social structures could be perpetuated with the creation of a continuing common and shared framework of knowledge that we call culture
  • Data starts as the simple recording of events of features of objects. On this foundation, vast structures are built
  • Working within a corporate culture - understanding what can and can't be achieved, understanding what the structure defines as success, and understanding where change can best be effected - these important issues and many more are resolved by understanding the structure and the constraints it implies
  • Successful mining and modeling occur only when those involved understand the structure that exists and within which they work. To some extent, whether great or small, mining or modeling success depends absolutely on changing, modifying, or clarifying the existing structure
  • Structures are very important knowledge frameworks, but they are static representations of the inter-relationships among objects. The pinnacle of such structures is a dynamic system. In a system, all of the component parts discussed here - events, objects, features, and so on - have a dynamic and ever-changing relationship
  • Systems describe how structures maintain, modify, and adapt to internal and external changes over time
Systems:
  • Systems are knowledge structures that explicate the modes of interaction between their components as events occur. Although we commonly speak of systems as if they are objective things in themselves that exist in the world, systems only represent our abstraction or simplification of how the world works. It should by now be apparent that this is not merely a philosophical point without application to the real world
  • A system is a structure, and just as subject to revision as any other structure; thus any system imposes itself on the world and enforces its own particular world view
  • The systems of the world itself, if there are such things, do not change. Our explanations (knowledge structures) of the systems can and do change. "Rules of the game" is just another way of describing a knowledge structure. A recent business knowledge structure is CRM.
The structure of knowledge:
  • Knowledge has a very rich and complex structure, in some ways much like a multilayered onion. The reality that the modeler has to work with very much depends on where the project is located within the overlapping layers of the knowledge structure
  • At the core of our personal knowledge structures lies intimate personal knowledge. Wrapped around that is an implicit knowledge of the world (emotions, feelings, hunches), and around that, explicit knowledge (formal instruction, airline schedules)
  • Modelers must move into, and work within, knowledge structures that are not familiar, that engender assumptions that are not always apparent, and that preclude certain approaches unless the knowledge structures can be changed to accommodate them. As a result, modelers need objective techniques for discovering, working within, explicating, and accommodating these knowledge structures
The problem of knowing :
  • The fundamental problem of knowing is that of reification, defined as "to convert mentally into a thing"; in other words, to construct a mental representation of an object or typify a mode of relating
  • A miner has to work within existing frameworks of reified knowledge that exist at all levels from intimately personal to globally social. All knowledge is founded on reification
  • Our heads are filled with a vast amount of enormously complex equipment. The active equipment for discovering patterns seems to be a vast and intricate network of neurons
  • These are the neural networks that form our brains. In-depth investigation has revealed some information about how neural networks might work, eventually leading to the development of artificial neural networks (ANNs). So a slightly deeper look at the mechanism of reification involves considering how a neural network might learn a pattern
  • Another way of learning is to use what is called a self-organizing map (SOM), which very usefully discovers inherent patterns on its own
  • Neural networks, then, have the ability to organize themselves based on examples presented so that they learn the defining features of patterns
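A rough sense of how such self-organization works can be sketched in code. The following is a minimal, illustrative self-organizing map in Python using only NumPy; the grid size, learning schedule, and the two invented clusters of 2-D data are assumptions chosen purely to show the mechanism, not anything taken from the source text.

```python
import numpy as np

# Minimal self-organizing map (SOM) sketch: a small grid of "neurons"
# gradually organizes itself around clusters present in 2-D sample data.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.1, size=(50, 2))      # two invented clusters
                  for loc in ([0.2, 0.2], [0.8, 0.8])])

grid = rng.random((4, 4, 2))           # 4x4 map; each node holds a 2-D weight vector
learning_rate, radius = 0.5, 1.5

for epoch in range(20):
    for x in rng.permutation(data):
        # find the best-matching unit: the node whose weights are closest to the sample
        dists = np.linalg.norm(grid - x, axis=2)
        bi, bj = np.unravel_index(np.argmin(dists), dists.shape)
        # pull the winner and its grid neighbours toward the sample
        for i in range(4):
            for j in range(4):
                grid_dist = np.hypot(i - bi, j - bj)
                if grid_dist <= radius:
                    grid[i, j] += learning_rate * np.exp(-grid_dist) * (x - grid[i, j])
    learning_rate *= 0.9               # decay learning rate and neighbourhood over time
    radius *= 0.9

print(grid.reshape(-1, 2).round(2))    # node weights now sit near (0.2, 0.2) and (0.8, 0.8)
```

After training, each map node's weights settle near one of the two clusters, which is the sense in which such a network "learns the defining features of patterns" on its own.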
Paradigms, Archetypes, Patterns, and knowing:
  • Reification allows the construction of an enormously rich internal representation of knowledge
  • The paradigm is the overarching rule set by which we recognize, judge, and understand the world. It is, however, built only from our life experience and does not truly reflect the world
  • All paradigms are flawed. They do not represent the whole of reality, only a point of view
  • A modeler works with other people's views of the world, and may need to modify or change them. Modelers need to be aware of their internal paradigms, too, particularly where they are restrictive and prevent seeing. And of course, the purpose of these puzzles is to make you aware of your internal paradigms. Our paradigms are populated by archetypes
Frameworks for representing knowledge:
  • Reification, as discussed, in some sense creates the objects of knowledge. It is a process of converting simplifications of event sequences or impressions into mental objects
  • Objects do nothing - they are essentially inert simplifications - whereas knowledge is clearly dynamic
  • The mental objects that result from reification are of two fundamental types. One of these is the archetypal abstraction, a representation of a worldly simplification. The other fundamental type is the archetypal interaction, a representation of how worldly events interact. Add heat to water and it boils
  • Knowledge structures can be usefully represented as a network of interconnected reified abstractions
  • Knowledge structures can be represented as a network of interconnections among objects and the interactions that define the connections
Different types of knowledge:
  • Personal knowledge
  • Social knowledge
  • Recipe knowledge
  • Functional knowledge
  • Practical knowledge
  • Theoretical knowledge
Changing knowledge structures:
  • We live in a time when crashing and realignment of knowledge structures seems almost commonplace
  • Data miners and modelers set out to deliberately change and realign knowledge structures. These are generally corporate knowledge structures
  • Miners, and even more importantly modelers, need to be aware of how to make optimal changes that cause the least disruption and can be implemented with minimal effort and cost. Also modelers need to be aware of how to implement, monitor, and ensure the performance of the changes so that the process is controlled
Symbol and symbolic knowledge:
  • All of the models are founded on symbolic knowledge. A symbol is a regularity in one frame of reference that represents a regularity in another frame of reference. The regularity may be an object, a relationship, or something else
  • The symbolic structure par excellence is language, which uses characteristic sounds, or characteristic visual representations of sounds, to represent a vast variety of objects and relationships that exist in human experience of the real world
  • Language allows the transference of descriptions of objects and relationships in a form that the reader can interpret, or map back to, the phenomena of the experiential world
  • A modeler needs to be aware of the interfaces because currently, only a human can translate worldly, experiential phenomena into symbolic representation, and then translate such representation back into anticipated experiential phenomena. Once translated, such symbols can be manipulated by computers. Within limits, computers can also translate symbols from one domain of representation into another
  • Symbolic knowledge, then, is an abstract representation in symbols of objects and relationships that are perceived in some other frame of reference
  • In the symbolic knowledge representation system of language, the rules of grammar constrain the relationships between words (symbols of things in the perceptual world) to correspond more or less to the perceived relationships between the perceived objects
Knowledge as a network:
  • Language is the foundational symbolic model. Networks capture objects and their interrelationships; diagrammatically, these are represented as points and as lines joining the points, respectively, illustrating a knowledge schema. One such schema is called Part-Kind-Relation (PKR).
  • PKR networks have a place in advanced data mining applications where data mining is used to extract knowledge schema. PKR forms a multithreaded network of interactions of various types. Such networks can be created within computer systems and manipulated as symbolic representations of knowledge structures
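As a concrete (and deliberately toy) illustration of holding such a network of objects and typed relationships inside a computer system, the Python sketch below uses a plain adjacency structure. The object names and relation labels are invented for illustration and are not the book's own PKR notation.

```python
from collections import defaultdict

# A toy knowledge network: nodes are objects, labelled edges are the
# "part-of" / "kind-of" / generic relations that connect them.
relations = defaultdict(list)

def relate(subject, relation, obj):
    relations[subject].append((relation, obj))

# Hypothetical business objects and links, purely for illustration.
relate("order line", "part-of", "order")
relate("order", "part-of", "customer history")
relate("preferred customer", "kind-of", "customer")
relate("customer", "places", "order")

def neighbours(node):
    """Return every (relation, object) pair directly linked to a node."""
    return relations.get(node, [])

print(neighbours("order"))       # [('part-of', 'customer history')]
print(neighbours("customer"))    # [('places', 'order')]
```

A real knowledge-schema tool would add typed traversal and inference on top of a structure like this, but the essential idea is the same multithreaded network of interactions described above.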
Changing evidence, changing conclusions:
  • Both mining and modeling essentially refine, change, or alter knowledge structures. But entire knowledge structures do not change in simple or straightforward ways as their components change.
Summary: Models are where rubber meets the road for the business modeler and for the data miner seeking to use data to characterize relationships in a business model. Model relationships, sometimes even model objects, are what the miner is looking for, or at least, models are in part built from what the miner discovers while mining. Models come in a bewildering variety of types ranging far beyond just symbolic models. Fashion models, dressmaker's models, aircraft models, ship models, data models, power grid models, mathematical models, logical models, geographic models, cosmological models - the variety is almost endless. Yet as models, they all have certain features in common.
The essence of a model is that it forms some idealized representation of a real-world object. Being idealized, it strips away much of the complexity attached to the real object or situation, presenting features of the real world situation in some more convenient, comprehensible, useful, or usable form.
It is crucial to understand exactly what purpose a model serves so that the produced model is not oversimplified or simplified in the wrong direction. It is fine to cast a reduced scale steel or heavy plastic model of an aircraft if it is to be tested in a wind tunnel; it is useless to do so if it is intended as a radio-controlled flying model.
Any model, necessarily a simplification, must contain all of the necessary features of interest to the user and be complete enough to model the phenomena of interest accurately enough to be of use. It is very important for a miner to establish what is to be left out, what is to be kept in, and what is to be revealed in any model.

Symbolic models come in a wide variety of types, although data miners today typically work only with inferential or predictive models. Nonetheless, the range of symbolic model types covers a very broad spectrum. More and more, different types of models are being constructed using data mining, so a miner needs not only to know what environment a model is to work in, but also to clearly establish what type of model is needed. Just where do descriptive, interpretive, explanatory, predictive or prescriptive models fit?
How do they differ? What about active and passive models? What is the difference between, and appropriate use of, qualitative and quantitative models? In any case, how is a miner to know which to use when? There are, of course, answers to all of these questions.

As data mining becomes more capable and sophisticated, the miner needs to be aware of these issues, to understand the implications of different models, and to know which are the appropriate tools and techniques to produce them. The answer to the question of what constitutes a model has to be made operationally. It is quite impossible to say what anything is "really."
There is some quantum mechanical wave function corresponding probabilistically to an object (say, a car), and another, or many others, superpositionally, corresponding to the state of the world that interact in some fantastically complex manner. There is, at a wholly different level of abstraction, a mechanical interaction with the surface of the planet. Or again, there are ecological effects, economic effects, chemical and metallurgical effects, nuclear and atomic effects, sociological, cultural, and even semiotic effects. And the list can go on and on, depending only on what the appropriate frame of reference is for providing an answer. But what "really" happens can be described to human understanding only by stripping away almost all of the things that happen and leaving some poor, threadbare simulacrum of the original richness of reality. All that can be said for what "really" happens is that everything in the whole of the universe interacts in ways beyond understanding.
In spite of that, models of reality do empirically work, and are usable, useful abstractions of what happens. Somehow or other, humans have evolved equipment that allows them not only to abstract selected features from the world, but also to connect one level of abstraction to another, and then use the second level that is more easily manipulated to influence the world to selected ends.

The key to using models most effectively lies in the way they are defined. For practical application, this must be a purely operational definition. It is through an operational definition that theoretical models map onto the world. The question always must be: "What do you want to do with it?" No model is made, and no data mining is carried out, in a vacuum.
=================================================
Translating Experience:
Highlights:
  • Experience is a great teacher - it's a well-known aphorism. In fact, our experience of the world - our sense impressions - is our only teacher. Strategic thinking, at least, is a tool to implement our ideas, hopes, and desires in the world. These are crucial issues to a business modeler because it is the experience of business that has to be translated into a model (the model represents some business "system"), and ultimately the model has to be used to devise, support, or otherwise enable corporate strategic action, be it large or small
  • Data mining might be described as the recovery of something valuable from data. Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
  • Data mining is the search for valuable information in large volumes of data. Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns or rules. A new idea is nothing more or less than a new combination of old elements
  • A creative idea is the discovery of an unexpected regularity, which is one way the activity of data mining is described in many data mining texts today. But keep in mind that in essence, there is no difference between discovering a creative idea and achieving the desired results from data mining
  • The creative idea data set is in the human head, whereas the data mining data set is on a computer. But in the grand scheme of things, that is a trivial difference because the data in the head didn't start there - it started in the world, just as computer data does
  • Data mining is the search for ideas. The importance here for any miner is not the mining, but the discovery of the idea - hopefully a novel idea, a new insight
  • Data mining for business models should be done inside a structure that is carefully designed to reveal hidden assumptions, uncover needs, determine problems, discover data, establish costs, and in general explore the whole domain of the problem. Remember that the most difficult part of finding a solution is accurately finding and stating the problem
Taking Data Apart:
  • The automation-assisted exploration of data sets for the discovery of commercially relevant, usable, applicable, and viable insights
  • The original activity before that time was not graced by any particular label and was simply known descriptively as data analysis
  • Our primary experience of the world is qualitative. In whatever follows, it is important to remember that it is the qualitative experience that comes first; the secondary quantitative explanation is trying to explain the primary qualitative experience. Quantitative analysis is only a specialized form of language for expressing some particular ideas about the qualitative experience of the world. Translating the quality of an experience into a language of quantities does not change the essential nature of the experience, nor is one form of expression inherently any better than the other
  • For instance, in trying to forecast future sales, a quantitative analyst (of which a data miner is one) essentially says: "Ignoring all of the things actually happening in the world, and focusing entirely and exclusively on the measurements of the features at hand, then the predicted sales value is the number x." The number x is simply a symbol produced by a sophisticated application of a set of rules. To use it, we may assume that it represents some number of dollars. We may also assume that it has some degree of uncertainty and so will be x plus or minus some other amount, say y. But even then, does this number x ± y really represent sales? Sales represent more than just a dollar figure. Consider the sales for the organization that you work for - what does a sales forecast mean to you? Optimistic sales projections represent hopes for the future, anticipated future salary, vacations, health care, pay increases, continued employment, and a whole host of other qualitative feelings. If sales projections are bad, it may mean reduced income, finding alternative employment, or dissolution of the company, all of which produce a host of associated qualitative feelings. Projecting sales figures is a fairly common corporate activity, and the example introduces little that is not straightforward. The significance here is that the miner needs only to find an appropriate symbol or symbols (a rough sketch of producing such a forecast figure follows this list)
  • The modeler, on the other hand, absolutely must be cognizant of the qualitative framework into which the quantitative projection has to fit. Whereas the miner works with a quantitative representation of the world, the modeler has to include relevant qualitative experience in any successful model
  • So it is that data modeling works with qualitative symbols. It also works with data that is essentially analytic in nature because the act of defining features, measuring states of nature, and recording those measurements is an inherently analytic activity. Throughout all of data mining (not to mention any other quantitative analytic method), it is crucial to keep in mind that the qualitative experience of the world is primary, and the data mining, in trying to explain and inform that experience, cannot supersede it
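The sketch referred to above is a minimal, assumption-laden illustration of how a quantitative analyst might arrive at a figure of the form x ± y: a straight-line trend fitted to an invented monthly sales series, with the uncertainty band taken crudely from the residual scatter.

```python
import numpy as np

# Illustrative monthly sales figures (in thousands of dollars) - invented data.
sales = np.array([102, 108, 115, 111, 120, 126, 131, 128, 137, 142])
months = np.arange(len(sales))

# Fit a straight-line trend: sales ≈ a * month + b.
a, b = np.polyfit(months, sales, deg=1)

# Point forecast x for the next month, plus a crude uncertainty y
# based on the standard deviation of the in-sample residuals.
next_month = len(sales)
x = a * next_month + b
residuals = sales - (a * months + b)
y = 2 * residuals.std(ddof=2)        # roughly a 95% band under normal-error assumptions

print(f"Forecast: {x:.1f} ± {y:.1f} (thousand dollars)")
```

The point of the surrounding discussion stands: the number printed is only a symbol, and it is the modeler's job to place it back into the qualitative framework it came from.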
Data and Abstraction:
  • Turning qualitative experience into a quantitative description seems to come fairly naturally under the right circumstances. It's done through a process called abstraction, which is a mechanism for turning observations into numbers
  • Data miners frequently work with data that is abstracted to non-numerical symbols such as categorical labels - "blue" or "green," for instance. Once again, however, the measurement "blue" needs to be qualified before use. There has to be a blue something - a blue light, a blue car (a minimal encoding sketch appears after this list)
  • Notice that the process of abstraction is very similar to the reification process. The end result differs, however. Abstraction produces externalized symbols and rules for manipulating those symbols. Reification results in identification of conceptual objects and relationships, the behavior of which can then be represented through abstracted symbols and rules
  • A data miner works with abstraction and reification without questioning the process or method that created the results
  • The modeler, on the other hand, needs to be very aware of the processes of reification and abstraction because they are what inform the relevance and utility of the model
  • To be useful, all of the symbols have to be arranged in patterns, and patterns are very important
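The encoding sketch mentioned above shows the mechanical end of abstraction: turning qualitative, categorical observations into numbers a mining algorithm can use. The colour values and the simple one-hot scheme are illustrative assumptions only.

```python
# Turning qualitative observations into numbers: a minimal one-hot encoding.
observations = ["blue", "green", "blue", "red", "green"]   # e.g. car colours

categories = sorted(set(observations))                     # ['blue', 'green', 'red']
encoded = [[1 if value == cat else 0 for cat in categories]
           for value in observations]

for value, row in zip(observations, encoded):
    print(value, row)
# blue  [1, 0, 0]
# green [0, 1, 0]  ... and so on
```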
Recognizing Patterns:
  • Patterns, at least as far as a modeler is concerned, are regularities in the world that represent an abstraction
  • Recognizing patterns, then, requires either discovering template constancies that remain common across repetitions of events, or matching event relationships to existing, already discovered templates
  • Several different types of relationships in data describe useful patterns, and the relationship patterns are similar regardless of the type of data in which they are discovered
  • For instance, one relationship pattern is "logarithmic," in which the logarithm of one set of measurements has a clear relationship with another set of measurements. Such a relationship is common in many numerical data sets. However, what the data actually measures determines some very important qualities about what the patterns describe. Thus the underlying significance of any type of pattern can be determined only by considering what the data measures. One such pattern description, and one that is of great importance to the modeler, is the difference between static and dynamic relationship patterns
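To make the "logarithmic" relationship pattern concrete, here is a small sketch with invented data: one measurement is generated to vary with the logarithm of another, and the relationship is recovered by fitting a line to the log-transformed values. The variable names and coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: response grows with the logarithm of spend, plus noise.
spend = rng.uniform(1, 1000, size=200)
response = 3.0 * np.log(spend) + 5.0 + rng.normal(0, 0.5, size=200)

# The pattern is hidden in raw form but linear once spend is log-transformed.
slope, intercept = np.polyfit(np.log(spend), response, deg=1)
print(f"response ≈ {slope:.2f} * log(spend) + {intercept:.2f}")
```

What the recovered slope actually means, of course, depends entirely on what "spend" and "response" measure in the world, which is the point of the bullet above.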
Static patterns:
  • Patterns can be described in a wide variety of useful ways. One way is to look for the relationships between objects. Objects have features, and features can be measured
  • Sets of measurements are taken to represent the object. Static patterns occur with object features when other features of the same, or related, objects change their value. These sorts of patterns are called static patterns because essentially the measurements are about the static or unchanging features
Dynamic patterns:
  • The dynamic parts of the company are the flows. Even the words stocks and flows indicate that one is static and the other dynamic
  • The dynamic patterns give a clearer picture of what is going on inside the company
  • Dynamic patterns capture more information than static patterns, but miners and modelers frequently are able to access only static information
Novelty, Utility, Insight, and Interest:
  • There are several reasons for looking for patterns in data and for modeling those patterns in a business environment. One is to clarify the performance of a process within a company, a division within the company, or the company as a whole
  • Another is to determine the areas in which there are opportunities for improvement and the type of improvement that's appropriate
  • Discoveries that are insightful and interesting carry more information (in a technical as well as a colloquial sense) than those that do not
Mining and pattern seeking:
  • If data is available, as is often the case, data mining provides a tool for discovering patterns in data
  • The discovery of such patterns is no more than the discovery of ideas or inspiration. But even determining what data is appropriate to mine, and what can be discovered, requires a model of the situation
  • Data mining is not just the search for patterns. Patterns have to be placed in perspective within a model
  • Indeed, a model is needed before mining to determine how the mining should proceed, what data should be mined and how it should be mined
  • The patterns discovered, and their applicability to the situation at hand, depend very much on whether the data to be mined is of static or dynamic relationships
  • But in order to generate any mined insights, the mining must be informed by models, many of which have to be constructed before mining begins
  • There are many types of models, but one of singular importance in data mining, and the one that a modeler most works with when mining, is the system model
Systems of the world:
  • Businesses and companies are systems. It means something to be a part of - or not to be part of - a company. Companies have component systems, too. Employees are either in marketing or not in marketing, for instance
  • Systems form a large part of how we reify the world, especially the dynamic and interactive event patterns
  • A modeler has to pay particular attention to systems, mainly the systems of businesses and corporations. They are an important part of translating experience
  • Systems do not actually exist in the world. They are abstractions
  • What makes such a collection of objects a system is usually that the internal interactions remain consistent over a wide range of external changes in the environment. They are only abstractions of reality, particularly when the boundary is breached, moved, or inappropriately placed
  • In mining and business modeling, there is one particular type of system that is of interest - the dynamical system - although it comes in many forms
Open and Closed form systems and solutions:
  • The most common type of dynamical system, or at least the most common type studied, is what is called a differentiable system. All this means is that the different feature measurements in the system are related to each other so that as the values of features change, they all change together in smooth or continuous ways
  • The mathematical equations that describe the relationships among features can be written in the form of differential equations (thus the name differentiable systems)
  • For such systems, it is not necessary to actually run the system to discover what its state will be under other conditions. Such a solution is called a closed form solution, and such systems can be called closed form systems
  • In making a sales forecast, for instance, it may be that some formula is discovered or invented that allows some set of measurements of the current position to be plugged in, plus some specific distance in time in the future, and the output is a forecast value. This is a closed form system because forecasting next year's sales does not require predicting next month's forecast and then using that forecast to forecast the following month's, and so on. The final forecast can be reached in one step
  • Open form systems, on the other hand, have relationships that preclude the possibility of any such shortcut. The only way to get from one state to a discovery of some other state is to grind through all of the intermediate steps
  • The only way to discover the behavior of open form systems is a method of modeling called simulation
  • For instance, the weather is an open form system, and the only even marginally successful way to predict the weather is through the use of simulations
  • This is one great advantage of using a systems simulation software package for setting up mining problems. It can run a simulation of the system to explore assumptions, determine sensitivities, provide insights, validate and verify assumptions, clarify the problem statement, and much more that is of enormous use to both miner and modeler
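A minimal sketch of the closed form versus open form distinction, using invented numbers: compound growth has a one-step closed form answer, while a toy stock-and-flow system with a reorder rule has to be stepped through month by month, which is exactly what a simulation does.

```python
# Closed form: the balance after steady compound growth is a single formula.
start, monthly_growth, months = 100_000.0, 0.02, 12
closed_form = start * (1 + monthly_growth) ** months
print(f"closed-form result after {months} months: {closed_form:,.0f}")

# Open form: a toy stock-and-flow system with a reorder rule has no such shortcut;
# the only way to know the final state is to step through every month.
stock, demand = 500.0, 80.0
for month in range(months):
    stock -= demand                      # outflow: units shipped this month
    if stock < 200:                      # feedback rule: reorder when stock runs low
        stock += 400                     # inflow: replenishment arrives
    demand *= 1.03                       # demand itself drifts upward over time
print(f"simulated stock after {months} months: {stock:,.1f}")
```

Systems simulation packages do essentially this kind of stepping at much larger scale, which is why they are so useful for exploring assumptions and sensitivities before mining begins.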
The nature of systems:
  • All dynamical systems seem to have key features in common. They all have an "inside" and an "outside". Inputs come into the system from the outside, are processed by the system, and then outputs are returned to the outside
  • Mining needs to examine most particularly the dynamic flows. Modelers need to take as broad a view as possible to gain understanding
  • It is important to construct a model of a system before mining so that the mining is put into perspective. Second, when trying to change the nature of a system, if the change is attempted from outside, the system will resist - maybe strenuously
Coupling and Feedback:
  • In order to discuss the nature of parts of a system, it is necessary to talk about them as if they were separate from the rest of the system. Despite discussing them in this way, it is crucial to understand that everything within a system is interrelated
  • Objects in a system are coupled to each other. The coupling measurement ranges from uncoupled (where the objects have no direct effect on each other and are essentially disconnected) to tightly coupled (where changes in one object immediately and unavoidably produce changes in the other)
  • Coupling considers how objects affect each other under the total set of conditions prevailing in the system. Thus, sales may be coupled to the level of the incentive program in place. They will also be correlated with each other. However, if the shipping department goes on strike, sales will inevitably fall as people turn elsewhere to get what they need
  • Coupling generally manifests its effects on the flows of the system. But systems also manifest a level of behavior that is directly related to the flows. This is the level of information flow inside the system. Information about what is happening in one part of the system flows through the system to affect all of the events, flows, and stocks. This is often described as feedback (or feedforward) and refers to information about the behavior of some downstream part of the system being fed back to an upstream part to modify system behavior
  • Coupling, feedback or feedforward, and the relationships that exist inside systems are difficult concepts to grasp and understand, and even more difficult to keep track of. Tools that help with doing this are called systems thinking tools
Systems thinking:
  • Successful modeling, and the data mining that accompanies it, require the use of systems thinking. This involves being aware of the system framework in which the model is being constructed or the problem that is being solved
  • A systems thinker looks for the patterns that knit together the events that happen without asking which are causes and which are effects
  • A systems thinker looks for explanations and relationships. There are two characteristics of systems thinking: de-centered association and operational interaction
De-centered Association:
  • Putting some object at the center of attention so that it "radiates" other associations about the central object is centered association
  • That is exactly what a system is - all of the parts are interrelated. No one factor can be put at the center. Everything is very highly and inextricably interrelated
  • The key point is that the focus in systems thinking is on the fact that system behaviors are generated by the system as a whole, and it is not possible to isolate any one causal factor
  • If everything affects everything else more or less, there is no center. Thinking, and making associations in terms of mutual interconnections (no one of which forms the central focus), is called de-centered association - an important skill for the modeler to develop
Operational interaction:
Operational interactions form the core processes of a system; they are the flows.

Strategy and tactics:
  • There are business problems, albeit highly simplified ones, that business managers face every day. Some of the decisions are informed by corporate policies, whereas others have to be met fresh and decided from the ground up.
  • Some corporate policies are quite necessary because, as a practical matter, it is quite impossible to make every decision from scratch every time
  • These corporate policies are, in fact, rules for making routine decisions. In one form, that is exactly what a strategy is - a preset decision procedure. It is not a ready-made decision, but a framework of "if this happens, then do that" rules
Strategic versus Tactical decisions and actions:
  • A commonly accepted difference is that strategies represent sets of available activities that are possible in some particular set of circumstances, each of which leads to a different potential outcome; tactics, on the other hand, usually indicate concrete actions that actually put some strategy into effect by attempting to influence worldly events
  • Tactics, although several may be available to implement any particular strategy, essentially are sets of rules for putting a strategy into practice
  • The problem in business is to decide which rules are available and then which to choose - in other words, how to choose strategies and how to implement them with tactics
Dealing with problems:
  • A major practical use of experience is to solve problems. This means no more than determining how to intervene in a situation that presages a less than satisfactory outcome and convert the situation into one with a more desired outcome. This is the crux of the whole mining and modeling endeavor, at least as far as business modeling and mining are concerned: discovering new or better ways to influence events, to some extent discovering new strategies, and calibrating (estimating risk, resources, and likely result) selected strategies
  • The three components in any problem are: (1) the set of strategies available; (2) for each strategy, a set of possible outcomes, each with some associated probability that it will happen; and (3) the ultimate "value" of each of the possible outcomes
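Those three components map directly onto a simple expected-value calculation. The strategies, probabilities, and payoff values in the sketch below are entirely hypothetical; it only shows how the pieces combine.

```python
# Each strategy has a set of possible outcomes, each with a probability and a value.
# All numbers here are invented, purely to show how the three components combine.
strategies = {
    "discount campaign": [(0.6, 120_000), (0.4, -30_000)],
    "loyalty program":   [(0.5, 90_000), (0.3, 40_000), (0.2, -10_000)],
    "do nothing":        [(1.0, 0)],
}

for name, outcomes in strategies.items():
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9   # probabilities must sum to 1
    expected_value = sum(p * v for p, v in outcomes)
    print(f"{name:20s} expected value: {expected_value:>10,.0f}")
```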
Types of Uncertainty:
  • There are at least three types of uncertainty in dealing with problems: strategic uncertainty, outcome uncertainty, and world state uncertainty
  • Strategic uncertainty describes a situation in which all of the strategies available are not clear
  • Outcome uncertainty is the uncertainty of knowing the likelihood of any particular outcome, even when all of the relevant conditions are known
  • World state uncertainty is the uncertainty associated with being unsure of the state of the world
Costs of reducing Uncertainty:
  • It is always possible to reduce at least outcome and world state uncertainty. But, as with everything else in life, nothing comes without a cost. The universal law of "There Ain't No Such Thing as a Free Lunch" applies. The cost is in dollars and in time
  • It is usually possible to buy more data or research to reduce the level of uncertainty
  • It is possible to use statistical techniques, such as likelihood estimates, to at least quantify the uncertainties
  • There are other techniques that also can reduce the amount of variability in distributions, and that, too, reduces the levels of uncertainty. These also exact costs, perhaps in computational time and expertise. Sometimes it is possible to discover patterns that allow for reducing the level of uncertainty over time
  • Another way of reducing uncertainty is to "move the goalposts". If the strategies can be enumerated and the outcomes estimated, each of them represents a possible potential profit or loss
  • Game theory calls these payoffs, and by using this theory it is sometimes possible to adjust how often each of the strategies is used in a way that optimizes the payoffs better than using any single strategy
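A minimal game-theory flavoured sketch of this idea, with an invented payoff table: when we are unsure which world state we will face, mixing two strategies can give a better worst-case expected payoff than playing either strategy alone. The grid search below is only a sketch of the idea, not a general game solver.

```python
import numpy as np

# Rows: our strategies; columns: possible world states; entries: payoffs.
# All numbers are invented for illustration.
payoffs = np.array([[ 50, -20],     # strategy A: great in state 1, poor in state 2
                    [-10,  40]])    # strategy B: the reverse

best_mix, best_worst_case = None, -np.inf
for p in np.linspace(0, 1, 101):             # p = proportion of the time we play A
    mix = np.array([p, 1 - p])
    payoff_by_state = mix @ payoffs          # expected payoff in each world state
    worst_case = payoff_by_state.min()       # guard against the least favourable state
    if worst_case > best_worst_case:
        best_mix, best_worst_case = p, worst_case

print(f"play A about {best_mix:.0%} of the time; "
      f"worst-case expected payoff {best_worst_case:.1f}")
```

Here, always playing A risks a payoff of -20 and always playing B risks -10, while a mix of roughly 40% A and 60% B guarantees an expected payoff of about 15 whichever state occurs.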
Deciding with constrained options:
  • Almost always, a company has fewer resources at its disposal than it would need to pursue every option available to it. So the actual options that can be pursued are constrained by the resources available
  • Corporate management obviously wants to make the best use of the resources at its disposal. This is a very common problem, and there are some quite well known ways of setting up and solving such problems
  • Most MBA students will be very familiar with the techniques of linear, and maybe nonlinear, programming
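As a sketch of the linear programming setup such problems usually take, the example below uses scipy.optimize.linprog with invented numbers: two products compete for limited machine hours and marketing budget, and the objective is to maximize total profit.

```python
from scipy.optimize import linprog

# Invented example: choose quantities of two products to maximize profit
# subject to limited machine hours and marketing budget.
#   maximize  40*x1 + 30*x2
#   subject to 2*x1 + 1*x2 <= 100   (machine hours)
#              1*x1 + 2*x2 <= 80    (marketing budget, in arbitrary units)
#              x1, x2 >= 0
# linprog minimizes, so the profit coefficients are negated.
result = linprog(c=[-40, -30],
                 A_ub=[[2, 1], [1, 2]],
                 b_ub=[100, 80],
                 bounds=[(0, None), (0, None)])

x1, x2 = result.x
print(f"produce x1={x1:.1f}, x2={x2:.1f}, profit={-result.fun:.0f}")
```

Under these made-up constraints the optimum is x1 = 40, x2 = 20, for a profit of 2200; the same framework scales to the much larger resource-allocation problems that corporate management actually faces.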
Summary:
Translating experience isn't exactly straightforward. In fact it's downright paradoxical that at the same time it's so easy and so hard. It is easy because to a great extent it is a simple, intuitive process that happens without our noticing or paying attention to it. We are simply unaware that our experience of the world consists of sense impressions; we experience the world as an "out-there" independent reality made up of objects, some of which are dynamic and reactive systems, but objects nonetheless. On the other hand, because an intuitive impression of the world is so immediately present, it takes conscious and concerted effort to get beyond the world intuitively perceived. This is what makes translating experience hard.
Experience, for business purposes at least, has to be translated into an external representation. Implicit assumptions about relevant objects and systems need to be put into an externalized form so that the features and relationships that seem relevant and important can be examined, discussed, manipulated, experimented with, validated - basically, so that they can be used. It is the modeler's job to create or - if not actually starting from scratch - to define the features and details of these externalized representations. The purpose for all this activity is essentially to get some better idea of what to do. To get ideas is not easy or straightforward, and good ideas - that is, effective reassessments of object relationships in the real world - are at a premium.
Although systems representation is by no means the only way of translating experience into an externalization, it is by far the most common method that a business modeler uses. Very often, the externalized representations are not explicitly represented as a systems model, that is, the representations are not necessarily created using techniques explicitly known as systems thinking. Nonetheless, almost all representations of business situations (in other words, models of some part of the business situation) do recognize, at least implicitly, that they are both systems in themselves and part of some larger system. A sales forecast, for example, recognizes that there are interacting effects that affect the final forecast figure - in other words, a system. That forecast is based on assumptions about how the rest of the corporate enterprise performs, and assumptions about the rest of the world such as the economy and the business climate. Explicitly using systems thinking is not, by any means, the only way to build a useful representation of such business systems. It is, however, an enormously powerful and useful way, and one that works to generate sufficiently complete and adequate models to be usefully understood and applied, with many of the assumptions revealed. So useful and powerful is this method of thinking that all modelers will find it very worthwhile to master the techniques of systems thinking. Ultimately, for what purpose is all this activity - the search for ideas, the construction of external models of business systems? Why the effort, and indeed, why the concern for modeling whether or not supported by mining?
The purpose has to be to apply the insights discovered to gain some benefit. But benefit comes only by applying the insights, in other words, by acting to change the prevailing state of affairs so that the desired perceived benefit is more likely to come about. This is where strategy enters the picture.
Models and strategies have little to do with a business' objectives. If an executive decides to maximize shareholder value, as some say the proper concern of a company ought to be, so be it. Models, at least business models, have little to do with informing desires. What we desire to achieve, whether personally, in business, or as a corporate entity, is an emotionally determined decision. However, once determined, these objectives can be achieved, if at all, only through the use of strategies - alternative choices for deploying the resources available with the intent of gaining the objective. Strategies do fall into the modeler's province. An important purpose for some models is to evaluate the cost, risk, and probability of success for each of the possible strategies. This is part of the role of scenario evaluation - playing with "what-if."
The whole purpose for modeling and mining in a business setting is to discover new and effective ideas about the business-relevant systems of the world, and to find the best strategies to improve a company's lot.
==================================================
Modeling and Mining: Putting it together:
Highlights:
  • Modeling is a continuous process throughout all of mining. First, a framework model has to be discovered to frame the problem. As mining progresses, the framework model will be revised, and several different types of models are needed for understanding and using what is discovered in the data
  • Although models inform the direction and progress of mining, mining is the activity that clarifies and gives substance to the models. Models define areas of inquiry; mining reduces the uncertainty about, and explicates the relationships that exist in, important areas that the model identifies
  • Mining, in other words, clarifies the guesswork, hunches, and assumptions that initially established the framework for the problem. In short, mining reduces uncertainty
  • Data mining not only proposes hypotheses about data, but also makes possible rational estimates of the reliability of the hypotheses, and under what range of worldly circumstances they are likely to hold up
  • It is these hypotheses that clarify and inform the proposed interactions in any model. In other words, these are hypotheses about how the model works and not only can be developed by examining data, but also can be checked and confirmed
  • Many types of models are used to represent business situations, including the system model, the game theory model, and the linear (or nonlinear) programming model
  • But whatever type of model (or models) is used to represent a situation, all of them need the assumptions on which they are built, and the estimates from which their conclusions are drawn, to be established as accurately as possible. This is the work of data mining
Problems:
  • Although there are a huge variety of possible problems, fortunately the data miner has to deal with only a very limited range and type of them
  • Furthermore, by focusing on business problems that can be solved by data mining, the types of problems to be considered are even more restricted
  • Problems are always related to the inability to control a situation
  • The miner always is a consultant - not necessarily an outside-the-company consultant, but someone who comes to the problem as an outsider
Recognizing problems:
  • Surely everyone recognizes a problem if it walks up and bites them. But that's just the problem. The problems approached by mining and modeling almost certainly won't walk up and bite. Usually, they are someone else's problems, and they may not articulate them as problems per se
  • In truth, more than one data mining project has started with a business manager saying, in effect, "Take my data and tell me useful and interesting things about it that I don't know!" Well, the data miner has a problem all right - to discover what the business manager thinks is interesting and useful to know and why those things are interesting and useful to that manager
  • The manager is aware of a problem's symptoms and requests a fix to make the symptoms disappear; however, without being aware of the underlying reasons for the symptoms' appearance in the first place, discovering a permanent solution can be very hard
  • Recognizing the right problem that the miner and modeler need to work on is not always straightforward
  • The first order of business must be to obtain a useable description of the problem
Describing problems:
  • To describe a problem effectively, miners and modelers need to grasp the metaphorical territory that encompasses the problem, that is, to create a map that illustrates the manager's existing understanding of the situation
  • The best way to create such a map is to do so in an interactive, intuitive, and iterative way. Interactive in that all of the parties involved participate. Intuitive in that all of the parties can understand the map without needing any lengthy introduction to the meaning of the techniques used. Iterative in that it is easy to continually refine and update the map from session to session (Cognitive Map)
  • Creating such a map often serves to increase the apparent number of problems and decrease the apparent consensus understanding of the situation
Structuring problems:
  • Often the final expression of a cognitive map is a system diagram. It is not always easy to progress from cognitive map to system diagram because most of the ambiguity and conflict have to be resolved before the system structures can be diagrammed
  • Attempting that transference may reveal the very problem areas for which smaller scale submodels have to be constructed and data mined
  • Structuring problems requires identifying the inputs (the data being used for mining, assumptions, and so on) and outputs (strategic options, probabilities, relationships, and so on), and formulating the whole thing in a way that can actually be mined or modeled
  • The biggest difficulty in adequately structuring a problem for solution is discovering all of the relevant factors, including those infamous hidden assumptions
Hidden assumptions:
  • The question asked does have to be answered. However, it is equally important to the success of a modeling/mining project that the needed answers be found - not just the answers requested. For that to happen, it is up to the modeler to try hard to discover what the assumptions are before mining starts
  • The worst sorts of assumptions to deal with are hidden assumptions, because they are almost invisible
  • There are techniques for revealing hidden assumptions. Working through the system structure in the form of a cognitive map is one such technique. Locating the model in the appropriate decision map level is another
  • Another is to work through in detail how the produced strategic or tactical solution will be applied; how long it is expected to be effective; who is going to apply it; and what levels of knowledge, expertise, or training are needed to use it
  • The miner needs to have a clear problem statement to begin mining. With luck and hard work, the problem that the miner works on will be relevant and not beset by hidden assumptions. From that point, it's time to look at the data available
Data about the world:
  • As far as the miner and modeler are concerned, the world is made of data, which is no more than a reflection of qualitative and quantitative impressions in a recorded and stored form. Capturing those qualitative and quantitative impressions uses a technique called measurement. Collecting these measurements, and organizing them in some manner, produces the raw stuff of data mining - data
The nature of Data:
  • Data is very fickle stuff. At best, it is but a pale reflection of reality. At anything less than the best, it seems intent on leading the unwary astray. Invariably, the data that a data miner has to use seems particularly well constructed to promote frustration
  • In an ideal world, the miner could design and orchestrate the collection of data for mining. The measured features would be pertinent to the problem
  • Real world data sets are brimming with faults, failings, inadequacies, noise, dirt, and pollution. But a data miner is constrained to work with the real data that happens to be at hand and simply has to make the best of it
Measurement and description:
  • The problem with data starts with measuring features of the world
  • Measurements are invariably and unavoidably contaminated by the world-view of the measurer
  • Measurement is no more than determining and reporting (even if only in your head) the quantity or quality of some experience
  • This is true for all data collected. It may not be so intuitively obvious in many cases, but data is always collected for a purpose, and the collected data is filtered by assumptions, perceptions, interests, and motivations
  • The miner and modeler simply have to use the data at hand - at least to start
  • All facts unavoidably carry traces of the prevailing framework model, conscious or unconscious, that defined that these were the important facts
Error and Confidence:
  • The point is that all data is subject to caveats and uncertainties
  • All data represent a very simplified view of a very complex world
  • But any data set has many possible built-in sources of error, and just as many built-in assumptions
  • In data mining it is important, sometimes critically important, not only to uncover and state the assumptions made, but also to understand the possible and likely sources and nature of the errors, and to assess the level of confidence of any insight, prediction, or relationship
Hypotheses: Explaining data
  • Data is always approached with some hypotheses that may explain it. There should also be a hypothesis to explain why this particular data is thought to address the problem for which it is to be mined
  • Two hypotheses about data that turn up in data mining: the internal hypothesis, which explains the data in its own terms, and the external hypothesis, which explains why mining this particular data should help shed light on the problem
Data structures:
  • Every data set has some structure. Even a data set that includes completely random data has a pattern - it's just completely random
  • But if the data set is not completely random, then there is some nonrandom pattern enfolded within it. This pattern, whatever it may be and however it may be characterized, enfolds the useful information that may be mined
  • Structure comes in different forms. Some types of structure are intentionally and deliberately established externally to the data. For instance, recording sales to the nearest penny is a deliberate choice
  • There are the patterns in the data that reflect various states of the world, at least as modified through the other structures imposed on the data
  • It is usually, but by no means always, these structures that carry what is expected to be the information that is mined
Interaction and relationship:
  • Measured objects interact in some way in the real world. Of course, it is always possible that the interaction turns out to be that the objects are, to all intents and purposes, disconnected from each other
  • In data sets for modeling, and more particularly those for mining, it is always assumed (with or without justification) that the data on hand does have some meaningful interaction or relationship enfolded within it
  • Those associations and relations may appear to hold usable information for determining something about the state of one from the state of the other. But what isn't apparent is any explanation that connects the two sets of measurements together into a systemic relationship. Without such a systemic explanation, no level of confidence is justified that the connection is anything more than coincidence (a small sketch of this point follows this list)
  • It is in part the modeler's job to discover just such a systemic connection that can, at least in principle, be justified in any data sets to be mined
  • Some of the methods will discover associations and interactions that other methods will not notice at all. So how is the appropriate method of characterizing the enfolded relationship to be discovered? Should we just try them all? The answers to these two questions are determined by the hypothesis about the data that is brought to mining by the modeler. This is one reason that the model has to be constructed before mining the data
  • The hypothesis presupposes, with justifiable reason, that some particular method of characterizing the relationship will best reveal what is enfolded in the data set
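As a small illustration of the coincidence point (a sketch only; the two series and all numbers are invented), the Python snippet below builds two series that share nothing but a common upward trend. Their correlation is very high, yet neither drives the other, so the association alone justifies no confidence in a systemic relationship.

    # Minimal sketch (hypothetical data): a strong association with no systemic
    # explanation -- both series merely share a common upward trend.
    import random

    random.seed(42)
    n = 100
    trend = [0.5 * t for t in range(n)]                    # shared trend
    series_a = [t + random.gauss(0, 3) for t in trend]     # e.g. ice-cream sales (invented)
    series_b = [t + random.gauss(0, 3) for t in trend]     # e.g. sunglasses sold (invented)

    def pearson(x, y):
        """Plain Pearson correlation coefficient."""
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # The correlation is close to 1.0 even though neither series influences the
    # other; without a systemic explanation, the association may be coincidence.
    print(round(pearson(series_a, series_b), 3))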
Hypothesis and explanation:
  • Data mining and modeling can create any type of explanation of data. However, the explanations are not equally useful, depending, of course, on the need. That need is informed by the model, which is informed by the problem, and the problem is attacked only through a decision
Making decisions:
  • A classic example is the problem formulation known as the Prisoner's Dilemma: should the two prisoners cooperate with each other, or should each defect? Most considerations of the Prisoner's Dilemma lead into a discussion of how game theory can be used to look at the options and determine payoff strategies. However, there is more to this problem, in fact to any problem, than a payoff matrix
Framework for decision: representing choices
  • Separate from, but including the various strategies devised by statistical decision theory, game theory, or any other theory, there are essentially five ways of dealing with problems. These are simply the five options available in everyday life
  • First, ignore the problem. Second, absolve the problem. Third, dissolve the problem: when the values encompassed in the problem situation change, either through internal action or outside change, the change totally eclipses the original problem valuation. Fourth, resolve the problem: this modifies the situation, removing it as a problem. Fifth, solve the problem: this is where the discussion of game theory enters the picture. Solving a problem means dealing with it on its own terms - decide on your best payoffs, chances, and so on
  • Of the five basic ways of dealing with a problem, solving it may not be the best choice. Solving a problem means playing by the rules that the problem establishes. The alternatives allow for far more creativity, reaching outside the problem's framework and finding some way of changing the rules
  • Developing strategies for ignoring, absolving, dissolving, or resolving the problem is often no more difficult than developing a strategy for solving it - and sometimes it's easier. However, the only formal approaches for dealing with problems seem to fall exclusively into the area of solving problems
  • Two popular approaches are game theory (more properly the Theory of Games of strategy) and linear programming
  • Both modeling and data mining may be used to assist in setting up problems for solution using one of these techniques
Playing Games:
  • The key point that allows some problems to be solved using game theory is that there must be more than one "player," and the "winning" of the game requires there to be a "loser." In other words, there must be some conflict of interest
  • Essentially, the idea is to work out various moves that each player can make within the confines of the rules of the game. These optional actions are labeled strategies. Various strategies are available to each player, but are perhaps different for each player
  • For each strategy there is some outcome, in game theory called a payoff, that makes it possible to lay out in some formal way - say, in a table - all of the possible combinations of strategies for each player and the payoffs that eventuate if an opponent uses a responding strategy (a small sketch of such a payoff table follows this section)
  • Game theory is not simple or straightforward, but where it is applicable it can reveal some remarkable insights. One problem in applying game theory to real-world problems is that the strategies, and particularly the payoff values, are not always clear
  • The world is a place full of FUD (fear, uncertainty, and doubt), which can preclude setting up the problem clearly. That, of course, is where data mining and modeling can sometimes contribute not in the solution of the games, but in identifying strategies and their payoffs
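To make strategies and payoffs concrete, here is a minimal sketch of a two-player payoff table for the Prisoner's Dilemma mentioned earlier; the payoff numbers are the conventional textbook values and are assumptions for illustration, not taken from the text.

    # Sketch: a payoff table for the Prisoner's Dilemma and a check of the row
    # player's best responses. Payoff values are conventional textbook assumptions.
    strategies = ["cooperate", "defect"]

    # payoffs[(row_strategy, column_strategy)] = (row player's payoff, column player's payoff)
    payoffs = {
        ("cooperate", "cooperate"): (3, 3),
        ("cooperate", "defect"):    (0, 5),
        ("defect",    "cooperate"): (5, 0),
        ("defect",    "defect"):    (1, 1),
    }

    def best_response(opponent_strategy):
        """The row player's best reply to a fixed opponent strategy."""
        return max(strategies, key=lambda s: payoffs[(s, opponent_strategy)][0])

    # "defect" is the best reply to either opponent choice (a dominant strategy),
    # even though mutual cooperation pays both players more.
    for opponent in strategies:
        print(f"best response to {opponent}: {best_response(opponent)}")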
Linear Programming:
  • One of the key techniques developed for planning the best use of scarce resources is linear programming
  • Modeling and data mining today can play the same part in setting up problems for operations research-type solutions as it does for game theory problems. The more accurate the inputs to any system, the more accurate the outcomes. This, of course, is no more than the reverse of the well-known aphorism GIGO (garbage in, garbage out)
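Below is a minimal sketch of how a scarce-resource problem might be set up as a linear program, using SciPy's linprog routine; the two products, their profit figures, and the resource limits are hypothetical numbers chosen only for illustration.

    # Sketch: allocating scarce resources with linear programming (SciPy).
    # Products, profit figures, and resource limits are hypothetical.
    from scipy.optimize import linprog

    # Maximize profit = 20*x1 + 30*x2; linprog minimizes, so the signs are flipped.
    objective = [-20, -30]

    # Resource constraints:
    #   2*x1 + 4*x2 <= 100   (machine hours available)
    #   3*x1 + 2*x2 <= 90    (labour hours available)
    A_ub = [[2, 4],
            [3, 2]]
    b_ub = [100, 90]

    result = linprog(objective, A_ub=A_ub, b_ub=b_ub,
                     bounds=[(0, None), (0, None)], method="highs")

    # result.x holds the quantities of each product; -result.fun is the maximum profit.
    print(result.x, -result.fun)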
Deciding:
  • The whole purpose behind all of modeling and mining is to decide what to do
  • If the effort did nothing to help decide what to do, it would quickly be abandoned for something more useful
  • It is useful, and the modeler/miner's whole purpose is to provide some basis for deciding what to do. That, however, isn't straightforward
Normative Decisions: What should we do?
  • One possible way to decide what should be done is to appeal to some framework, or code of conduct, that prescribes and proscribes particular actions and courses of action under particular circumstances
  • The law is an example of such a code of conduct. A smaller scale example might be something like the rule of thumb that says, "If you want to get good grades, you will have to study hard." Such frameworks of rules are called normative in that they provide direction as to what should be done
  • There is a normative framework for deciding what should be done that is often appealed to as if it were a law of nature. This framework, which is well grounded in observations of the way of the world as it actually appears, is the normative framework that a modeler almost invariably seems to be working in. This is the normative theory of probability
  • For a modeler, it is very important to understand that, although it is often treated as if it were a law of nature, probability theory can be regarded by decision makers as optional
  • The problems are that mining is entirely based on models grounded in probability theory; the modeler has to explain the results; and when deciding what "should" be done, it's important to discover what normative model holds sway
  • If necessary, the modeler needs to be prepared to lay the groundwork that supports the conclusions, particularly if they fly in the face of accepted intuition
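As a tiny sketch of what the normative framework of probability actually prescribes, the snippet below compares two hypothetical courses of action by expected payoff; the option names, probabilities, and payoffs are invented for illustration.

    # Sketch: the normative recommendation of probability theory -- choose the
    # option with the highest expected payoff. All figures are hypothetical.
    options = {
        # option name: list of (probability, payoff) pairs
        "launch promotion": [(0.6, 120_000), (0.4, -30_000)],
        "do nothing":       [(1.0, 20_000)],
    }

    def expected_payoff(outcomes):
        return sum(p * v for p, v in outcomes)

    for name, outcomes in options.items():
        print(name, expected_payoff(outcomes))

    # The framework says "pick the larger expected payoff" -- a recommendation
    # that stakeholders may or may not accept as the norm that should apply.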
Finding possibilities: What could we do?
  • Decisions are motivated by problems, which in turn are motivated by a perceived need for change. Or in other words, decisions are called for to select among options when it seems that there is a need to change the course of events to bias those events toward a more beneficial outcome
  • Decision making is about managing, or even promoting, change. Change without change is such a well-recognized phenomenon that it is encapsulated in many common aphorisms
  • Any theory of change, even an outline sketch, needs to deal not only with change, but also with persistence. Persistence needs to be explained every bit as much as change
A sketch Theory of Persistence and change:
  • Many of the complex phenomena of the world appear to us to be related as systems - combinations of objects, events, and circumstances that are closely interconnected and interrelated with each other, and which somehow seem to interact and react to and with the world as a whole
  • One of the most important features of a system is that it has a boundary - an "inside" and an "outside." Thus, whatever else it may be, a system is a group of objects, relationships, and so on, tied together and contained within the boundary
  • One of the points that group theory encompasses is the way that the content of groups can change
Systems as Groups:
  • Some systems recognizable in the world are groups in the sense proposed by group theory. However, not all systems are groups, although many systems are built from component systems that are groups in the group-theoretical sense
  • In sketching a theory of persistence and change, these systems that are groups are very important
  • The first important property of a group is that it has members. It's not important that the members are alike in any way except that they must all share some common membership characteristics
  • In order to be a group, any members must be able to be combined such that any combination is also a member of the group. So, to take a very simple example, numbers form a group: combining numbers produces a result that is just another number - a member of the group
  • Another important property of a group is the presence of an identity member - a member of the group such that a combination of any other member with the identity member gives the other member as a result
  • The final property of groups that needs to be looked at is that every member has a reciprocal member, such that if the member and reciprocal member are combined, out comes the identity member
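The three properties just listed (combination closed within the group, an identity member, and a reciprocal for every member) can be checked mechanically. The sketch below does so for one concrete, illustrative group: the integers 0 through 4 combined by addition modulo 5.

    # Sketch: verifying the group properties described above for the integers
    # 0..4 under addition modulo 5 (an illustrative choice of group).
    members = range(5)
    combine = lambda a, b: (a + b) % 5

    # Closure: combining any two members yields another member.
    closed = all(combine(a, b) in members for a in members for b in members)

    # Identity: some member leaves every other member unchanged when combined.
    identity = next(e for e in members if all(combine(e, a) == a for a in members))

    # Reciprocals: every member has a partner that combines back to the identity.
    has_reciprocals = all(
        any(combine(a, b) == identity for b in members) for a in members
    )

    print(closed, identity, has_reciprocals)   # True 0 True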
Deck Chairs in business:
  • There are many times when organizational structures rearrange their components without changing anything apparently meaningful outside the system
  • In a business situation, this is exactly the situation that applies when one company competes with another on its own terms
Getting out of the Box:
  • "Thinking out of the box" is an aphorism, but one that truly recognizes the underlying nature of what needs to be done. It turns out that it's not exactly a box, but the boundary of a system that needs to be crossed
  • The aphorism refers to discovering a change that is not within a group-theoretic type change. What is sought is a change that is external to the group system and that has impact on the internal workings and relationships of the system
  • One of the fundamental requirements of this theory (the theory of logical types) is that anything that encompasses all of a class (group or system) must not itself be a member of that class. As an example, all of the customers of a company form a class, but the class of all customers is not itself a customer
  • Getting out of the box is stepping out of the system of endless "what goes around comes around," and changing the system from the outside. This requires moving to a meta level, getting to a higher dimension, or looking in from the outside. It involves not playing by the existing rules of the game
  • The foundational modeling tool is language. We can describe almost anything with language - except language. To do that we have to invent another language, a meta language, especially for talking about language
Modeling out of the Box:
  • Today, we prefer our theories and metaphors to have a quantitative foundation and (ideally) a qualitative interpretation
  • Perhaps group theory and the theory of logical types, as applied to systems theory, presages a more firmly grounded and comprehensive theory of change: one that has metaphors that we can apply directly to manipulating the world. But if we succeed in this, it's what we today call creativity, and we don't yet know how to bottle that! The best that can be said realistically is that perhaps these are pointers that one day, some part of what is an exclusively human capability will be automated
  • Business models are, perforce, of business systems. There are no tools, at least at present, for creating meta models. Perhaps a more rigorous theory of persistence and change will lead there eventually, but business modeling is pretty much limited to in-the-box modeling at the moment
  • But business modeling, even supported by data mining, is not yet ready to reach outside the box. This is not to say that a modeler will not be expected to take part in the analysis and discussions of deciding what can be done. But deciding what could be done outside the box pushes beyond the present limits of business modeling technology
Summary: 
What is best to do very much depends on the objective. If the objective is more profit, the best things to do might be very different than if the objective is to attract more customers, or reduce churn, or raise the stock price, or produce a better product, or any of a host of other business objectives. Whether in-the-box or out-of-the-box thinking is needed, understanding the objective is paramount in determining and selecting the best course of action from those possible and available.
With objectives set, business models can help evaluate possible outcomes and support choices between options. But this requires understanding the frame that surrounds the problem.
====================================================
What is a Model?
Business Modeling:
  • Models are utterly crucial to all areas of life, especially business. Every day business managers at all levels take actions to change the behavior of the business in ways intended to benefit both business and manager. Yet many models are empirical and experiential - not the sort that a business modeler can work with
  • A business modeler has to create models that capture insights into the way the world really works, identify and satisfy all of the conflicting requirements of the various stakeholders, create usable models that make sense, fit the model into existing business processes, justify the model based on the data available, and deliver on time and under budget. The on-time, under-budget requirement lies in the arena of project management
  • Models are designed backwards and built forwards. What this means is that everything has to be considered prior to building, particularly deployment issues that come at the end of building
  • As noted in the instructions for any complex device, "Before assembly, please read all the instructions!" Also keep in mind the adage, "Measure twice, cut once."
What is a Model?
  • It is impossible for any of us to function in our personal lives, let alone in business, without models
  • These models function as our mental representations of the world, and they are dynamic and flexible
  • Models live in a world dominated by data, information, and knowledge. As far as businesses are concerned, this is a world of business intelligence (BI) and knowledge management (KM). Now it's quite true that the results of modeling and mining do indeed reside in these specialties, but note that it's the results that live there, not the modeling or mining activities themselves
  • For the purpose of looking at models, it is enough to consider only three concepts: data, information, and knowledge. It is these concepts that need to be clearly explicated in terms of what they mean to the model - and the modeler
  • Understanding the power and utility of models requires coming to grips with the structures of data, information, and knowledge
Introduction to Data, Information, and Knowledge:
  • All models are based on data. All data originates in the world. Mental models are based on data that reaches us through our senses
  • In creating models that are formed from such externally stored data it's important to explicitly consider the process that turns data into information and information into knowledge because this is, in part, the process that the modeler has to practice as a technique - one of the most basic processes of modeling
Data:
  • Data, however, is often described as collected objective observations (or facts) about the world
  • Data, nonetheless, is formed from observations of the world. When such observations are collected, they are indeed data
  • Data is simply a collection of symbols representing that particular events occurred within particular contexts
  • The event symbols characterize the extension, duration, quality, quantity, and so on of the event
  • However, it is its wholly uninterpreted nature that makes the collection data, and nothing more
  • A collection of unanalyzed observations of worldly events. Data is what happened
Information:
  • Information forms much of the background to today's world. Information is interpreted data, or data "endowed with meaning and purpose"
  • Information is bicameral in nature; at the very least, it is always a summary of the data from which it is formed
  • For data to provide information, the summary of the data must be made in such a way that it addresses the particular ends, or purpose, of the informed
  • The important point so far is one that is generally crucial for models, and is no less crucial to gleaning information from data - it is that of context. For instance, the context within which temperature can be measured is the considerable amount of relevant physical science that bears on the subject. It is this context that gives the data its particular meaning
  • Information is, at least in part, summarized data placed in a context
  • For instance, if it was based on some data, the summary statement "fewer web shopping carts are abandoned by shoppers with coupons than by those without coupons" is information that can demonstrate every point in this section. What is a "Web shopping cart?" If you have the knowledge to understand the context of this piece of information, it is meaningful - if not, the statement is nonsense
  • One of the most fundamental is that information has to be communicated to inform. This means that the summary must be passed into, and incorporated in, a contextual framework
  • Information theory also points out that any informative communication unavoidably comprises three elements: information, noise, and redundancy
  • Information, in this information theoretic sense, is that part of the communication that comprises the valid features of the summary - those that do apply to the data. Noise implies that, to a greater or lesser extent, some part of the communication expresses conclusions that may seem to be about the data, but are in fact invalid. Redundancy implies duplication in that some part of the valid information turns up in the communication in more than one way
  • Information can be described as a communication of a summary of various similarities, differences, and relationships discovered in data, described within a particular context that includes valid characteristics, erroneous characteristics, and repetition
  • A summary and communication of the main components and relationships contained within the data and presented within a specific context. Information is how you know it happened
Knowledge:
  • Knowledge is intimately intertwined with information. In fact, as the discussion of information revealed, information cannot inform without the presence of existing knowledge
  • One marked difference between knowledge and information is that information is static in its transmission. Knowledge, on the other hand, is ever dynamic and changing, dependent for existence on the context in which we find ourselves. Knowledge is, in fact, a process, not a thing at all
  • Knowledge is explicated as operational depictions. Such an explication is, of course, information, in this case, intended to transfer knowledge from a "knower" to an "unknower"
  • For instance, do you know how to read a book? If so, the knowledge of how to do so comprises an enormously complex set of actions including holding the book, opening the book, and turning pages, every bit as much as it includes the recipe for scanning letters, absorbing words, and gleaning understanding from them
  • Knowledge, then, can be described as a set of operational recipes together with the contexts in which those recipes can become effective - in other words, what it's possible to do, and when it is appropriate to do it, to achieve particular results
  • An interrelated collection of procedures for acting toward particular results in the world with associated references for when each is applicable along with its range of effectiveness. Knowledge is what to do about it
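One way to keep the three concepts apart is to see them side by side on a toy example. The sketch below uses invented web-shopping-cart records (echoing the coupon example above): the raw records are data, the contextual summary is information, and the operational recipe is knowledge. All field names and values are assumptions.

    # Sketch: data vs. information vs. knowledge on invented shopping-cart records.

    # DATA: uninterpreted records of what happened (fields and values are hypothetical).
    carts = [
        {"cart_id": 1, "has_coupon": True,  "abandoned": False},
        {"cart_id": 2, "has_coupon": False, "abandoned": True},
        {"cart_id": 3, "has_coupon": True,  "abandoned": False},
        {"cart_id": 4, "has_coupon": False, "abandoned": True},
        {"cart_id": 5, "has_coupon": False, "abandoned": False},
    ]

    # INFORMATION: a summary of the data, placed in a context (coupon vs. no coupon).
    def abandonment_rate(with_coupon):
        group = [c for c in carts if c["has_coupon"] == with_coupon]
        return sum(c["abandoned"] for c in group) / len(group)

    print("abandonment with coupon:   ", abandonment_rate(True))
    print("abandonment without coupon:", abandonment_rate(False))

    # KNOWLEDGE: an operational recipe -- what to do about it, and when it applies.
    def next_action(cart):
        """If a cart has no coupon, offering one may reduce abandonment."""
        return "offer coupon" if not cart["has_coupon"] else "no action"

    print(next_action({"cart_id": 6, "has_coupon": False, "abandoned": None}))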
An observer's guide to models:
  • Models use a framework that encompasses data, information, and knowledge. In fact, in a sense, models are knowledge. They are intimately connected at one end with data, they to some degree provide interpretation of the data to some purpose, and they end with knowledge
  • Models encapsulate the information present in data within some particular framework
  • In business applications, models can usefully be described as falling along five descriptive dimensions: (1) Inferential/predictive, (2) Associative/systemic, (3) Static/dynamic, (4) Qualitative/quantitative, and (5) Comparative/interactive
Many problems of intellectual, economic, and business interest can be phrased in terms of the following six tasks: 
  • Classification
  • Estimation 
  • Prediction 
  • Affinity grouping 
  • Clustering 
  • Description and profiling
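Two of these tasks, classification and clustering, are illustrated in the minimal scikit-learn sketch below. The data is synthetic and the particular algorithms (a decision tree and k-means) are simply one common choice among many; nothing here is prescribed by the text.

    # Sketch: two of the six tasks -- classification (supervised) and clustering
    # (unsupervised) -- on synthetic data. Algorithm choices are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    # Synthetic data standing in for customer records.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Classification: learn to assign a known label (e.g. responder / non-responder).
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print("training accuracy:", clf.score(X, y))

    # Clustering: group similar records without using any label at all.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print("cluster sizes:", [(clusters == k).sum() for k in (0, 1)])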
Inferential models:
  • Inferential models are also known as explanatory models, and either description is equally appropriate. Being wholly based in data, such models necessarily explain (or explicate inferences about) the data
  • Explanatory or inferential models can be among the simplest, and most useful, kinds of models
  • An inferential model essentially relates objects to each other. The objects have to be represented in the data, and are usually represented as variables either singly or as groups of variables
  • This model, in fact, does no more than describe associations and relationships between objects (variables). No valid inference from the data alone is possible other than that certain features seem to be associated together. This is therefore termed an inferential (or explanatory) model because, inherently, there is not necessarily any predictive power in the model, but it does allow the interactions to be inferred
Predictive Models:
  • A predictive model may also be an associative model in which care was taken to separate cause and effect. This causal separation is usually made in time, such that associations are made in the temporal direction leading from what came first to what followed
  • Predictive models require an explanation of the data that is external to the data in some way and that describes which phenomena are causes and which are effects
  • Predictive models all attempt to determine a later outcome of a prior situation
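A minimal sketch of that point: the features describe a prior situation and the target a later outcome, so the direction from cause to effect is enforced by time. The field names, values, and the choice of logistic regression are hypothetical.

    # Sketch: a predictive model in which cause and effect are separated in time.
    # Features come from an earlier period, the outcome from a later one.
    # Data, field names, and the model choice are hypothetical.
    from sklearn.linear_model import LogisticRegression

    # Each row: (visits last quarter, spend last quarter) -> churned this quarter?
    prior_period_features = [[12, 340.0], [2, 15.0], [8, 120.0], [1, 9.0],
                             [15, 560.0], [3, 40.0], [9, 210.0], [0, 0.0]]
    later_outcome = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = churned in the following quarter

    model = LogisticRegression(max_iter=1000).fit(prior_period_features, later_outcome)

    # Predict the later outcome for a new prior situation.
    print(model.predict([[5, 75.0]]))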
Associative Models:
  • Also called a correlational model, an associative model depends on finding the association, or correlation, between the attributes of objects
  • An associative model is based on the assumption that to some greater or lesser extent these factors drive sales (if this is a predictive model), or are at least associated with sales (if an inferential model)
  • Such associative models can work well, but they have significant drawbacks. Mainly, they incorporate a significant number of unverifiable assumptions
Systems Models:
  • Systems models view the world as an interconnected and interrelated mesh of events
  • In an associative or correlational model, all of the correlations are symmetrical. But in a systems model it is exactly the noncorrelational, nonsymmetrical interactions - those in which the direction of the effect is crucial - that matter to understanding what is occurring
  • Whereas in an associative model it is the associations (correlations) that form the entire structure of the model, a systems model regards the data as an incomplete set of instances describing some larger phenomenon or set of phenomena
  • A system is an essentially dynamic structure. A systems model is of, but not in, the data. It is a model about the data, not a model of the data. Indeed, a perfect associative model could, given suitable inputs, re-create the data set from which it was made
  • A system model is more connected with accurately modeling the interactions that produced the data, and may well not exactly reproduce the data set
  • Systems models are not concerned with duplicating any specific set of data, but with representing the generating behavior of the features in the world that gave rise to the unique data set
  • Systems models recognize three types of phenomena in the world: stocks, flows, and information exchange. Stocks are represented in the world as things that accumulate or diminish
  • The second type of phenomenon that systems models look at is flows, which give rise to the accumulation or diminution of stocks. Just as a river flows into and accumulates in a reservoir
  • Flows in the world are regulated - some by chance events, as with rainfall, and some by planned interaction and intervention, as with inventory. Systems models recognize these as a third type of phenomenon, information exchange
  • In a systems model, the information is represented separately as information exchange
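The stock / flow / information-exchange view lends itself to a tiny discrete-time simulation. In the sketch below the stock is inventory, the flows are sales out and deliveries in, and the information exchange is a reorder rule that regulates the inbound flow; every quantity and rule is an invented illustration.

    # Sketch: a systems model with one stock (inventory), two flows (deliveries in,
    # sales out), and information exchange (a reorder rule regulating deliveries).
    # All quantities and rules are hypothetical.
    import random

    random.seed(1)

    inventory = 100          # the stock
    reorder_point = 60       # information used to regulate the inbound flow
    reorder_quantity = 50

    for week in range(1, 9):
        sales = random.randint(10, 30)            # outbound flow (chance-regulated)
        inventory = max(inventory - sales, 0)

        # Information exchange: the current stock level regulates the inbound flow.
        delivery = reorder_quantity if inventory < reorder_point else 0
        inventory += delivery

        print(f"week {week}: sales={sales:2d} delivery={delivery:2d} stock={inventory}")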
Static Models:
  • The static/dynamic depiction has two parts. One part refers to the internal structure of the model; the other part refers to the way the model deals with data to which it is exposed
  • A static model applies to one, and only one, data set. When internally static in form, the model, once created, is fixed. For example, a purely associative model has a form that is fixed by the data set from which it was created
  • Even a systems model can have an internal structure that is static
  • Once created from a data set, the structure of static models is unaffected by any other data. Of course, no such model is assumed to be permanently valid. At some time, or with some change in circumstances, the static model may well be discovered to be ineffective for the purpose for which it was created. Then a new model is developed and the old one discarded. That's the point - the old model is discarded, not in some way updated. A new data set would be discovered and a new model created, incorporating the changed circumstances. The new static model would be used in place of the old
  • In some modeling environments, a cycle of creating static models may be set up specifically to deal with changing conditions. Sometimes this cycle of re-creating static models is represented as dynamic modeling. Although it is a valid way of dealing with changing circumstances, and may work quite as well as a dynamic model in the same circumstances, it is not dynamic modeling
Dynamic Models:
  • A series of static models does not constitute a dynamic modeling approach because there is no element of incremental learning. The notion of incremental learning introduces a concept that is central to dynamic models, but is not present at all in static models
  • Incremental learning has to be embedded in the structure of a dynamic model
  • In some sense, a model is a summary representation of the information enfolded in a data set. The associations discovered or the systems proposed are, in a way, no more than a summary explication of the information that the data set enfolds within it
  • A model is itself a summary statement about the data that is endowed with meaning and purpose
  • In an operational sense, there always have to be two parts of a whole model: the information representation structure and an interpretation mechanism
  • For dynamic models, although the interpretive structure may well remain fixed, the actions that it takes, driven by the parameters that represent the changing information, do change as learning takes place
  • Incremental learning takes place when new information changes what is "known." In a dynamic model, new information does make a difference. Indeed, new information is constantly incorporated into the existing domain information. In some sense, the dynamic model continuously "learns" as it goes along
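To make incremental learning concrete, here is a sketch of a dynamic model in the simplest possible sense: a running estimate whose parameter is updated as each new observation arrives, rather than being re-fit on a fixed data set. The stream of values is invented.

    # Sketch: incremental (dynamic) learning -- the model's parameter is updated
    # as each new observation arrives, instead of being re-fit on a fixed data set.
    class RunningMean:
        """Keeps a mean that 'learns' incrementally from a stream of values."""
        def __init__(self):
            self.count = 0
            self.mean = 0.0

        def update(self, value):
            self.count += 1
            self.mean += (value - self.mean) / self.count   # incremental update
            return self.mean

    model = RunningMean()
    for observation in [10, 12, 9, 14, 11]:      # hypothetical stream of new data
        print(model.update(observation))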
Qualitative Models:
  • The qualitative/quantitative spectrum has much to do with the type of data from which the model is drawn. Certainly, the vast majority of a data miner's work is based on nonqualitative types of data, usually pretty exclusively numerical and categorical types of data. However, a business modeler will be quite concerned with qualitative models, at least during some stages
  • Qualitative models are based on summary data, but in this case, it is usually experiential and descriptive. A management description of a problem situation will also be presented using qualitative data. This only means that the problem situation will be described to the modeler in words
  • It is from the narrative, textual, and diagrammatic presentation that the modeler has to begin to create and structure a model of the business situation or problem
  • Qualitative models may be built from a vast range of data including images and film, sounds and speech, and narrative and dialogue
Quantitative Models:
  • These models are based on numerous measurements of recorded observations, so they are based on data that is different from that used for qualitative models. Quantitative measurements have an inherent structure that is not present in qualitative data. Quantities are all recorded against a prespecified scale, usually either numerical or categorical
  • Collected quantitative data is usually structured as a table, with columns representing the variables and rows as the instances of simultaneously recorded measurement values
  • Quantitative analysis techniques also include a wealth of statistical and summarization techniques, as well as the more recent computer-aided techniques of online analytical processing (OLAP)
  • One of the challenges facing data miners is how to incorporate external domain knowledge into quantitative models
  • As business modeling matures, models comprising a blend of qualitative and quantitative data will clearly become more prevalent
Comparative Models:
  • These models lie at one end of the comparative/interactive dimension. As the name implies, such models deal with comparisons within the data from which they are made
  • The implied comparison that the name focuses on is the comparison of instances in the data set
  • To be sure, qualitative models may well be built from narrative summaries of source data, and are thus built, perhaps, from information rather than data. But ultimately, lying behind all else, things happened, were noticed, and salient points were noted. In whatever form it arrived, this is data. Noted points associated with a single event are grouped so that the association remains, and such a group of associated noticed events can be described as an instance
  • Discovering that several instances seem identical requires comparing the instances with each other. It is only by making such between-instance comparisons that it is possible to note not only that some of the instances have all their features in common, but also that in other comparisons they have features that differ. It is such between-instance comparisons that form the basis of a comparative model
  • Comparative models tend to be concerned with which patterns occur and to make less - or even no - use of how often each pattern occurs
  • Look at the instances that do occur, and explain those by comparing them with others that do (or do not) occur
  • Comparative models are wholly concerned with similarities and differences between instances
Interactive models:
  • Whereas comparative models compare individual instances - one against the others - to discover similarities and differences among instances that have actually occurred, interactive models look at the interactions that take place between the measurements as their values change. Thus, interactive models focus on the variables and characterize the changes that take place between the variables
  • In terms of the table layout used predominantly for quantitative data, comparative models focus on data row by row to make comparisons between the rows, whereas interactive models focus on the data column by column and characterize the interactions between those
  • Interactive models are usually probability based, or, to put it another way, they tend to be sensitive to the frequencies with which the different patterns occur
  • Interactive models are particularly widely used in statistical and data-mined models of quantitative data (a short sketch contrasting the two views follows this list)
  • Prescriptive models provide a list of instructions for accomplishing some specific objective under a specific set of circumstances
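Returning to the comparative/interactive distinction, the row-wise versus column-wise difference can be shown directly on a tiny table: the comparative view compares instances (rows) for shared and differing features, while the interactive view characterizes how variables (columns) move together. The table values are invented.

    # Sketch: comparative (row-by-row) vs. interactive (column-by-column) views of
    # the same small table. Values are invented.
    rows = [
        [1, 0, 1],   # instance 1
        [1, 1, 1],   # instance 2
        [0, 0, 1],   # instance 3
    ]

    # Comparative view: compare instances, counting the features they share.
    def shared_features(a, b):
        return sum(x == y for x, y in zip(a, b))

    print("instances 1 & 2 share", shared_features(rows[0], rows[1]), "features")

    # Interactive view: look at variables (columns) and how their values co-vary.
    columns = list(zip(*rows))

    def covariance(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

    print("covariance of variables 1 & 2:", covariance(columns[0], columns[1]))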
Modeling as an Activity:
  • Although a modeler uses a framework for describing models just as much as the available data, what actually happens is an activity - modeling. The activity of modeling - the modeler's issues and concerns - is every bit as important as the descriptive framework and the data
Objectives:
  • Every model is built with some purpose in mind. This would be true even if models were only built "for fun." In that case, having fun would be the purpose
  • However, almost all business models are built to improve efficiency or profit, reduce costs, or fulfill some other stated objective. But having an objective in mind and clearly articulating it are not one and the same thing. That's one problem. Another key concern is trying to meet arbitrary objectives
  • When business objectives are not clearly defined, the modeler is often expected to make operational definitions. As an example, customer attrition is widely considered to be a serious problem, and suitable for management at least partly through modeling and data mining. However, in several attrition model business problems worked on by the author there has been no clear management definition of what constituted customer attrition
  • The major problem for the modeler is that, unless defined, it can't be measured (even qualitatively), and if not measured, it can't be managed
  • Qualitative measurements can be used for items such as morale that can't be measured in a quantitative way very easily or effectively, but are nonetheless observable phenomena
  • A modeler must always work toward a clear and defined objective
  • Even with no clear definition from management of the problem to solve, whatever measure the modeler devises must meet the qualitative expectation
Empirical Modeling:
  • Empirical modeling is one pole of yet another possible dimension for describing models. The other end of this continuum consists of theoretical models
  • When creating business models, and particularly when using data mining to do it, it is always empirical models that are created
  • The term empirical implies that the model needs to reflect some phenomena that exist in the world. Such models are built from direct experience of the world (data), rather than being based on aggregated information such as "basic principles." However, this doesn't mean that theory isn't an important consideration, even when empirically modeling
  • It is impossible to approach any data set without some theory. For example, even in simply looking at a data set to decide whether the data is reasonable to work with, modelers unavoidably have to have in mind some idea (or theory) of what "reasonable-to-work-with" data looks like
  • It is always possible to explain (or have a theory about) what is going on in a data set, even if it's spurious
  • It is very likely the case that no one single factor was the cause to the exclusion of all others. So it is quite possible that an interlocking set of modes of explanation can be found in any particular data set - even when some of the modes seem to be in conflict with each other
  • A multiple starting point approach is more likely to discover the path to the "best" model, if indeed one exists, than using one single "best" theory to make the exploration
  • Even though modeling works with empirical models, theory plays an unavoidable and necessary role
Explaining Data:
  • Much of the work of modeling involves creating and deploying predictive models. Such models are often described as "explaining" the data, although the method of explanation implied may be rather technical. However, explaining model operation and results in human terms is very often an important objective of a model, even when it seems that an explanation is not what was originally requested
  • Explaining data, which is tantamount to explaining some worldly phenomenon, has to be done in the simplest terms possible
  • Explanatory modeling requires providing the simplest possible explanation that meets the need
  • Interactions between variables and nonlinear effects of any complexity range from difficult to impossible to explain in a humanly meaningful way. Even in systems models, the objective is to remove as much complexity as possible, and to characterize the nature of the interactions of stocks, flows, and information links in the simplest possible way
  • Decision-support explanatory models assist in discovering some appropriate strategy in a particular situation; this type of explanatory model is therefore termed a strategic model
  • All this preamble about strategic models leads to an important point: a strategic model cannot be used for its intended purpose unless it explains the directly controllable, or at least the indirectly controllable, variables. In building an explanatory model, it may well be that some of the concomitant variables produce some technically "best" explanation
  • If the purpose is to support decision making, modeling must concentrate on what can be controlled
Modeling Assumptions:
  • Modeling carries some assumptions of its own, separate from the assumptions that may be embodied in the model itself. One assumption is that the data is somehow "generated" by the world
  • For simplicity, in fact, simply to make modeling possible in the first place, some sort of generating mechanisms are assumed to have produced the measured values of the variables
  • Looking at the actual measured values produces various summary and descriptive estimates, called statistics, about the data - mean, variance, and so on
Summary: 
The structure of models comes from data; models embody information and provide knowledge. There are many ways of describing models, even when the models being described are limited to business models. The descriptive dimensions help to describe and categorize business models in ways that are most connected back to the data on which they are to be built. However, whether mining or modeling, the models will ultimately address social and economic phenomena that are of interest to business managers as they attempt to gain and maintain competitive advantage for as long as possible in an ever-shifting landscape.
The modeler will have to provide models that have business value, even when problem definitions are unclear - even to the managers struggling with the problems. It is important that the modeler develops skills and techniques in discovering hidden assumptions and unarticulated expectations, and clarifying the actual objectives.
=================================================
Highlights:
  • Frameworks set ground rules. Frameworks also entail assumptions, approaches, and options - choosing some and discarding others
  • Every problem has to be framed, and every model has to be built within a framework. The frame has to come first. As an example, consider risk. This is a single consideration in the frame, and assessing and modeling risk is an integral part of many models
  • When setting the frame, the problem owner and the problem stakeholders have some idea of the particular risks that concern them. Any model has to include these in the frame - although, of course, the model may later reveal other risks as equally or more important. In the frame, various risks are being introduced into, and established in, the model.
  • When modeling, the risks are being clarified, evaluated, and assessed. Framing the risk part of a problem ensures that the appropriate risks, and the appropriate features of those risks, are present in the model
  • A frame is set in much the same way as a photographer frames a picture. A photographer has to decide what to include, what to exclude, what the style and subject matter should be, and what is to be emphasized and de-emphasized within the final picture. Choices of subject matter, lighting, focus, depth of field, and so on all play a very significant part in the final image
  • The framing of a model determines what the final model shows and means
Setting a frame:
  • You cannot decide which actions to take - that is, which strategies to adopt - until you have framed the problem
  • As a modeler you do not have to invent, devise, or select a suitable frame. All you have to do is determine what the problem stakeholders see as the appropriate frame for the problem
  • Of course, if you have responsibilities other than, or in addition to, being a modeler, you may be responsible for actually devising or selecting a suitable frame. However, the practitioner modeler's first job is to discover the existing frame
  • The modeler's job is to produce an objective representation of the frame in some form - in other words, a map. One good way to discover and map the frame, perhaps the best way, is to talk to people
Framing the decision process:
  • Defining the problem results in a decision to do something to change the current circumstances and resolve the problem. At the heart of the decision-making process lies the selection of the best option available under the circumstances
  • The frame maps all of the issues that go into recognizing the circumstances, the options, and the selection criteria
  • The process starts with the world state. Whatever happens, it happens because of a perceived need for change - a gap between where we are and where we would like to be. We all have a frame of preconceived notions, a model of the way the world works, that tells us what's important and allows us to interpret the world state. So too do companies: their framing models of preconceived notions determine what the company sees in the world state through business model frames such as CRM (Customer Relationship Management), JIT (Just In Time), and ERP (Enterprise Resource Planning), which inform the managers what features of the world state should be noted
  • Although the framing model of preconceived notions informs what features are important to see, what is perceived is a situation wherein those features take on specific values
  • The frame provides many simplifying assumptions about the situation. The nature of the assumptions, and the level of complexity remaining in the situation, has to be explicated (for modeling) and is mapped into the nine-level decision map
  • Recall that knowledge (gained from information that is summarized data) provides a list of potential actions that could be taken in a recognized situation. This knowledge, informed by the characterization provided by the nine-level decision map, is applied to select actions from the options available
  • The actions are in the form of strategies, each of which has an expected payoff associated with it, and a risk level associated with getting that payoff (each strategy is represented by a symbol, with "En" for the expected payoff and "Rn" for the risk characterization)
  • The execute phase actually takes the action by implementing all of the necessary steps to change the course of events
  • For a modeler, understanding every phase of this process is crucial. A successful model incorporates input from every stage of this process, and the modeler needs to clearly understand the assumptions, requirements, and expectations from the stakeholders of every step in this process - even if the stakeholders are not themselves immediately aware of all the issues. If that's the case, it's the modeler's role to develop the needed framing for the decision process with the stakeholders
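The strategies, expected payoffs ("En"), and risk characterizations ("Rn") described above can be held in a small structure and filtered against a stated risk tolerance before selecting on payoff. The strategy symbols, figures, and risk levels below are invented for illustration.

    # Sketch: representing strategies with an expected payoff (E) and a risk
    # characterization (R), then selecting within a stated risk tolerance.
    # Strategy symbols, payoffs, and risk levels are hypothetical.
    strategies = [
        {"symbol": "S1", "E": 80_000,  "R": "low"},
        {"symbol": "S2", "E": 150_000, "R": "high"},
        {"symbol": "S3", "E": 110_000, "R": "medium"},
    ]

    acceptable_risk = {"low", "medium"}      # the stakeholders' stated tolerance

    candidates = [s for s in strategies if s["R"] in acceptable_risk]
    selected = max(candidates, key=lambda s: s["E"])
    print(selected["symbol"], selected["E"], selected["R"])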
Objectives: Getting your bearings
  • The first order of business is to define the objectives. In many modeling applications the initial problem statement is made not in terms of the objectives, but in terms of what the problem owner thinks is a desired solution
  • You will have to explore the objectives, and be somewhat skeptical of them, too. Find out how the stated objective will help, or what it will achieve. Ask why this is thought to be an appropriate objective, and what else could be achieved. Ask what a perfect result would look like, and why such a perfect result cannot be accomplished
  • From time to time, try to express the objectives as a statement of the form: "The objectives of this model are to be expressed primarily in terms of . . . , and secondarily in terms of  . . . . " The expressions in the blanks might be "customers" and "profit", or "returns" and "production failures." It is this "framing in terms" technique that will help identify the primary objectives that frame the model
  • Fundamentally, framing is done in terms of what is required of the model. This is a simple extension of how we answer any question. Perhaps someone asks what you want out of your next job. You could answer in terms of salary, time off, job satisfaction, career advancement, or even recreational aspirations
  • A full and complete answer addressing all possible terms would take forever! Almost certainly the questioner has particular terms of an answer in mind. Perhaps: "In terms of salary, what are you looking for from your next job?" So too it is with discovering the framing terms for a model. In short, the frame into which a model is placed provides the meaning that it yields. More than that, it actually creates the meaning it yields
Problems and decisions:
  • Models are built only when there is an identified problem. (As a reminder, the "problem" could equally well be to discover an opportunity.)
  • In business, models are built only to help deal with problems. The fact that there is a problem implies some level of uncertainty in determining which course of action to pursue. Resolving that uncertainty, and determining a course of action, requires a decision. A decision chooses between alternative courses of action and, in that choice, deliberately selects a specific intervention (or sequence of interventions) in the course of events
  • Ultimately, the purpose of modeling, at least in business, is to inform decision making. Clearly, part of the framework requires identifying the type, nature, range, and scope of the decisions that the model is intended to address
  • Models serve three basic decision-making purposes in a specific situation: (1) to clarify risk, (2) to determine options, and (3) to evaluate possible outcomes
  • So-called "rational" decision making in a specific situation requires these three inputs - available options, possible outcomes, and associated risk. Because these are key inputs into decision making, these are the key outputs required of a model, at least if a rational decision is to be made as a result
Decision symbols:
  • There are several types of decisions such as intuitive, reactive, or preferential. These types of decisions are not primarily dealing with rationally solving a problem
  • The expression "making a decision" seems to imply some sort of situation analysis before deciding on a course of action
  • Business modeling and data mining support only one particular type of decision making: rational decision making. For a rational decision, the reasons come first: the decision is deliberate and follows analytic and (when using modeling to support it) synthetic reasoning. Other types of decisions are relevant, useful, even vital, in business
  • The nexus point in the center represents the place in the process where all the elements are pulled together, and at which a decision is made
The following describe the parts of a decision:
  • Situations: The current state of the world. This is a summation of all events up to the decision nexus, forever fixed and immutable. Relevant happenings form the situations that are to be addressed by the decision
  • Options: Situations present certain options - different sequences of actions that could be taken in the circumstances prevailing in the current situation. In any case, each optional sequence of actions will be expected to have a different outcome
  • Selection: Making a decision is actually selecting an option and acting on it. What is selected is one of the options. The choice of option is based on the expected outcome for that option compared with the expected outcomes for the other options. Implementing the decision requires acting so as to influence worldly events
  • Expectations: Each option has associated expectations about what will result. At the time of decision, it is impossible to know the actual outcome. A selection from the available options is needed based on evaluating expected outcomes
  • Actions: With the selection made, actions modify the flow of worldly events. However, the actual results of the decision, made on the basis of expected results, may not turn out as anticipated
  • Outcomes: In the real world, outcomes are the result of the decision. Outcomes have nothing to do with the making of a decision because they are totally unknowable at the time the decision is made. They are interspersed with the situations that arise in the world that are not results of the decision
  • Taken together, these are the components of a decision: situations, options, selection, expectations, actions, and outcomes
  • The decision process is viewed as a continuous process where situations yield options, options require selections, selections (and implicitly, actions) produce outcomes, outcomes produce situations, and so on. Here, the wheel rolls on forever as we continually make and remake the world we live in, interacting and responding continuously. The first map focuses on a single decision, the second on continuing interaction with the world that requires continuous and endless decisions
Decision maps:
  • There are a wide variety of decision types. In making a map of various relevant decision types, we have no need for a comprehensive map that covers all types of decisions. What is needed for framing the model is only a map of rationally made decisions
  • The decision map shows nine levels of decisions. Each of the levels increases in complexity from the bottom, level 1, to the top, level 9. The small circles represent discrete elements of, or issues in, the situation, options, or expectation segments of the problem. A single circle represents a single element or issue
  • Level 1: Shows a single situation that calls for a decision. There is only one option considered viable, and only a single expected outcome of that option is considered relevant
  • Level 2: Shows a single situation and option, but several separate outcomes are expected, the outcomes being simply connected
  • Level 3: Shows one situation, several simply connected options, and several simply connected outcomes
  • Level 4: Shows one situation and one option, but multiple outcomes that interact with each other in complex ways
  • Level 5: Shows one situation, several simply connected options, and complexly interacting outcomes
  • Level 6: Shows one situation, several complexly interacting options, and several complexly interacting outcomes
  • Level 7: Shows several complexly interacting situations, several simply connected options, and several simply connected outcomes
  • Level 8: Shows several complexly interacting situations, several simply connected options, and several complexly interacting outcomes
  • Level 9: Shows several complexly interacting situations, several complexly interacting options, and several complexly interacting outcomes
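The nine levels can also be captured in a small lookup structure that a modeler might use as a checklist when locating a presented problem on the map; the encoding below (single, simple = several simply connected, complex = complexly interacting) merely restates the list above.

    # Sketch: the nine-level decision map as a lookup table. Each level records the
    # complexity of its situations, options, and expected outcomes
    # ("single", "simple" = several simply connected, "complex" = complexly interacting).
    decision_map = {
        1: ("single",  "single",  "single"),
        2: ("single",  "single",  "simple"),
        3: ("single",  "simple",  "simple"),
        4: ("single",  "single",  "complex"),
        5: ("single",  "simple",  "complex"),
        6: ("single",  "complex", "complex"),
        7: ("complex", "simple",  "simple"),
        8: ("complex", "simple",  "complex"),
        9: ("complex", "complex", "complex"),
    }

    situations, options, outcomes = decision_map[5]
    print(f"level 5: situations={situations}, options={options}, outcomes={outcomes}")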
Setting the decision frame:
  • The decision map is a powerful tool for identifying the type of decision to be modeled. Identifying a decision with one of the decision levels on the map shows the modeler many of the issues that need to be dealt with and the assumptions made, albeit unconsciously. Remember that in identifying the problem level on the map, it is not the actual world state that is to be mapped
  • For any problem, the conditions in the world are enormously complex. The problem's stakeholders will already have made many simplifying assumptions, and it is the problem as presented to the modeler that is to be located on the map
  • The map shows a problem broken down into three component pieces: situation, options, and expectations. Each of these components requires separate consideration, and may even require separate models. Each of these components of the overall decision model has separate issues in framing that need to be explored separately
Modeling situations: connecting the decision to a worldview
  • The situation is the current state (at the time of the problem) of what is relevant in the real world. The difficulty here is that our view of the world is determined by whatever model is used to view it (recall Chernobyl). Such a model frames our view of the world
  • It is very unlikely that a modeler will ever have to construct such a worldview model from scratch. (Constructing and modeling a whole worldview from scratch would be a truly massive undertaking.)
  • Indeed, the modeler must simply adopt whatever worldview model the client is currently using, or perhaps one that the client wants to use in the future. This type of model, of course, has been called a framing model
  • A framing model identifies particular aspects of the world as being important and relevant. Perhaps a particular framing model says that "rate of inflation" is important. A situation model must then capture the current value, say "rate of inflation = 3.5%." Thus, when making a decision, the framing model only points out what is important
  • The situation model is then just a set of values for features that the framework model points to as important. A situation model is a single, particular representation of what the framing model points to as important and relevant
  • So far as business modeling is concerned, framing models abound. Such framing models are called business models, and each constitutes a particular overarching philosophical approach to business. Just in time (JIT) models point to one set of considerations as important to a business; enterprise resource planning (ERP) models point to another set, not entirely dissimilar in some aspects from JIT. Customer relationship management (CRM) has yet a different set of issues that the model indicates as relevant and important. These business models are large and frequently complex, even though they all simplify the complexity of the world, rendering it manageable
  • Business decision making starts by using whatever business model the problem owner selects to frame the situation. The business situation represents the current state of the world as looked at through the prevailing business model so that the problem holder and the modeler look at the world through, say, "CRM colored glasses." What they see is defined by the "color" of the glasses, and that view is the business situation. Thus framed, the situation model for the decision is set up
  • Remember that a situation model is, even at its most complete, initially only a single instantiation of values in a framework model
  • Before the decision is framed, the completed business (framework) model is assumed
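To make the distinction concrete, here is a minimal sketch: the framing model only names what matters, while the situation model is a single instantiation of current values. Apart from the "rate of inflation = 3.5%" example from the text, the feature names and values are hypothetical:

```python
# A framing model only names what is important and relevant.
framing_model = ["rate_of_inflation", "customer_churn_rate", "market_growth"]

# A situation model is a single instantiation of values for exactly those
# features the framing model points to as important.
situation_model = {
    "rate_of_inflation": 3.5,    # the "rate of inflation = 3.5%" example above
    "customer_churn_rate": 2.1,  # hypothetical value
    "market_growth": 4.0,        # hypothetical value
}

# Sanity check: the situation model covers the framing model, no more and no less.
assert set(situation_model) == set(framing_model)
```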
Options: Assessing the possible
  • In spite of the prevalence of the use of the term strategy in connection with business, it is surprisingly hard to find the term defined, and particularly to find a definition that separates the term strategy from tactics
  • So far as modeling is concerned, these terms are used in fairly precise ways 
Strategies:
  • Game theory is a traditional tool in the decision-making arsenal. It also serves as a good starting point for discovering how the term strategy is used in modeling
  • For game theory, a strategy is a plan of campaign that is in the control of the player, and that generates a payoff
  • For modeling, strategies are plans of action that can reasonably be expected to be completed as intended and that have a payoff
  • Real strategies do not always execute perfectly, but if one doesn't execute at all, it's no strategy
Tactics:  
  • When strategy is executed or put into action, the tactics actually implement the necessary sequence of real-world actions
  • For the selected strategy, these are the tactics: the plan of actions that actually realize the strategy. There is, however, no payoff for implementing each tactic - the payoffs occur only after the tactics are all executed, and the payoffs accrue to the strategy, not to individual tactics
  • Strategies and tactics are relative; there is no hard-and-fast or absolute distinction between the one and the other that can be based on the actions that each specifies
  • Tactics do not themselves have payoffs; they simply execute strategic decisions. However, tactics may otherwise be indistinguishable from strategies. For convenience, tactics can be designated as partial strategies that have no payoff
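One minimal way to capture that convention, that a tactic is indistinguishable from a strategy except that it carries no payoff of its own; the strategy names and payoff figure are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Strategy:
    """A plan of action expected to complete as intended; payoffs accrue here."""
    name: str
    actions: List[str]
    payoff: Optional[float] = None

def is_tactic(s: Strategy) -> bool:
    """Per the convention above, a tactic is a partial strategy with no payoff."""
    return s.payoff is None

campaign = Strategy("direct mail campaign", ["design mailer", "buy list", "mail"], payoff=120_000.0)
mail_step = Strategy("mail the pieces", ["print", "post"])   # no payoff of its own -> a tactic
assert not is_tactic(campaign) and is_tactic(mail_step)
```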
Linking strategy payoffs:
  • The only decision is about whether or not, or perhaps when, to execute all of the actions required, based on a consideration of the anticipated results. (Anticipated results are, of course, what the map shows as the expectation)
  • When options are linked, it indicates that the payoffs are significantly interrelated. That is to say that the payoffs are expected to vary depending on which strategies or set of strategies are executed
  • Take care not to confuse an expectation with an outcome. Decisions are based on expectations; the payoffs of executed ("fired") strategies accrue as the values of outcomes
  • At decision time, only the expectations are known - the outcomes are unknown and unknowable
  • The outcome matrix can be constructed only after the event; the expectation matrix is constructed before the event
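A minimal sketch of an expectation matrix built before the event, with strategies as rows and possible outcomes as columns. The strategy names, probabilities, and payoffs are hypothetical, and the tabular layout is only one convenient, assumed representation:

```python
# Expectation matrix: for each strategy, the outcomes expected before the event,
# each with a probability and a payoff. The corresponding outcome matrix (what
# actually happened) can only be filled in after the event.
expectation_matrix = {
    "expand product line": {"strong demand": (0.4, 250_000), "weak demand": (0.6, -50_000)},
    "hold current line":   {"strong demand": (0.4,  80_000), "weak demand": (0.6,  20_000)},
}

def expected_value(strategy: str) -> float:
    """Probability-weighted payoff for one strategy, taken from the expectation matrix."""
    return sum(p * payoff for p, payoff in expectation_matrix[strategy].values())

for name in expectation_matrix:
    print(name, round(expected_value(name)))   # expand: 70000, hold: 44000
```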
Threading strategies together:
  • When dependencies are linked, it indicates that strategy firing order is significantly interrelated. The payoffs of those strategies higher up the thread roll up the payoffs of those lower on the thread. At the bottom of the thread are the tactics. Having strategies "higher" and "lower" on the threads implies that they can be represented hierarchically
  • Strategies are numbered uniquely for identification by the first number, and carry a subscript that indicates the level of the strategy on its thread; a subscript of 2, for example, means the strategy sits at level 2 of its thread. Subscript 0 indicates a tactic because it is at the lowest level
  • Another useful representation of the threads is a strategy firing matrix. This shows the firing strategies along the top and the fired strategies down the side. Tactics are shown in italics and grayed out because they cannot fire other strategies and therefore will always have empty cells
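A minimal sketch of such a firing matrix as a data structure; the strategy and tactic names are hypothetical. Tactics appear only as rows, because as columns (firing strategies) their cells would always be empty:

```python
# Rows are the fired strategies, columns the firing strategies: a True cell means
# that when the column strategy fires, it also fires the row strategy.
firing_matrix = {
    # fired (row)               : {firing (column): fires?}
    "enter new market":          {"grow revenue": True,  "enter new market": False},
    "launch campaign":           {"grow revenue": False, "enter new market": True},
    "print brochures (tactic)":  {"grow revenue": False, "enter new market": True},
}

def fires(firing: str, fired: str) -> bool:
    """Does executing `firing` trigger `fired`?"""
    return firing_matrix[fired][firing]

assert fires("enter new market", "launch campaign")
```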
Mapping options as strategies:
  • Options can only be mapped after the problem has been mapped to an appropriate decision level
  • If the problem maps to Level 1, 2, or 4, the only remaining options-mapping activity is to characterize the option part of the single strategy
  • For a problem that maps to Level 3, 5, 7, or 8, the situation is only slightly different. The problem holder has defined the problem such that interactions among options are not construed as relevant to the problem
  • The strategies, although there may be several, are essentially considered independently. A modeler might well want to discuss the actual state of this independence with the stakeholders in terms of both payoff linkage and thread linkage. However, if they are to be considered independently from each other, a list of the option parts of the identified strategies suffices to map them
  • For decisions that map to Levels 6 or 9, the strategy descriptions have to additionally map the payoff linkages and thread linkages. Payoff linkages are mapped as payoff matrices
  • Thread linkages are mapped as hierarchies, or perhaps firing matrices
  • Strategies straddle options and expectations, and before the whole strategy can be mapped we must look at how to map expectations
Expectations: Assessing the future
  • As the well-known aphorism has it, "Prediction is very difficult, particularly when it has to do with the future." But all problems have to be faced, and all decisions made, in complete ignorance of what the future actually holds in store
  • In fact, by looking at what is going on now, and looking at how things have turned out in the past, our guesses about the future are right often enough that we can sometimes build a fairly good picture of the future
  • It is quite reasonable to give our expectations a concrete expression in terms of probabilities and risks. And it is in terms of these probabilities and risks that we express our expectations of particular outcomes. So in order to put expectations into the frame, we must look at probabilities and risks
Probably a risky business:
  • Probability has a venerable history. It has been earnestly studied over hundreds of years, and still the question of what probability is has yet to be settled
  • Probability is measured to address a problem, and so what it "really" is depends on how the problem is framed. It will simply measure how likely we believe some event is. When actually modeling risk, data mining is a good tool
  • The two traditional ways to determine probability in elementary statistical texts are to: (1) look at the historical record of similar events and use the proportion of occurrences among all similar events under a set of circumstances, or (2) count the total number of ways an event can come out and declare the probability to be the proportion of each individual outcome to the total number of possible outcomes (both approaches are illustrated in the sketch after this list)
  • Risk is far more than simply probability of occurrence of some event. Showing what goes into the frame about risk requires a brief discussion of how risk is modeled
  • The statistical use of the term risk measures how widely the possible values are spread around the expected value, using a measure called variance
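A minimal sketch of both probability approaches and of variance as a risk measure; the event counts and payoff figures are hypothetical:

```python
# (1) Empirical probability from the historical record: the proportion of similar
#     past events in which the outcome of interest occurred.
past_campaigns, successful_campaigns = 200, 36
p_success = successful_campaigns / past_campaigns          # 0.18

# (2) Classical probability by counting equally likely outcomes,
#     e.g. the chance of rolling a six with a fair die.
p_six = 1 / 6

# Risk as variance: how widely the possible payoffs spread around the expected
# value, weighted by their probabilities.
outcomes = [(p_success, 120_000), (1 - p_success, -15_000)]   # (probability, payoff)
expected = sum(p * v for p, v in outcomes)
variance = sum(p * (v - expected) ** 2 for p, v in outcomes)
print(round(expected), round(variance ** 0.5))   # expected payoff and its standard deviation
```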
Risk selection:
  • In the frame for the strategy, it is crucial to specify the risk target or targets. There could quite easily be more than one
  • It is clear that it is meaningless to ask how risky any strategy is unless the terms of the risk to be assessed are specified first
  • It is clear that some better method of characterizing risk is needed
Satisfying gains, regrettable losses:
  • Risk is clearly not just an expression of the chance of a loss. The clue to finding a better measure is in discussing the loss-risk and the gain-risk separately. When seeking new customers, the risk is that the most satisfactory result will not eventuate
  • Clearly, that is a regrettable situation. When seeking return, the most satisfying option is to get the largest return. The most regrettable outcome would be to get the greatest loss. Yet the strategy that offers one also offers the other
  • In general, the risk seems to be either that the strategy selected won't be the most satisfying or that it will entail the greatest cause for regret. Clearly, some measure of the potential for satisfaction and regret for each strategy would be very useful
  • "RAVE" which is an acronym for Risk Adjusted Value Expected. Raw risk is a number determined such that it varies its value between +1 and -1. When it registers +1, it indicates that the strategy offers pure satisfaction. When raw risk is -1, pure regret is indicated. A number in between the two indicates some intermediate balance between satisfaction and regret
  • For a purely satisfactory strategy (raw risk = +1), the actual amount of satisfaction available depends on the expected value of the strategy. Adjusting the expected value for the strategy by multiplying it by the raw risk yields the risk adjusted value expected (RAVE) for each strategy
  • In that case (raw risk = +1), the value of RAVE equals the expected value for the strategy
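A minimal sketch of the RAVE calculation just described; how raw risk itself is derived from satisfaction and regret is not reproduced here, so it is simply taken as an input, and the figures used are hypothetical:

```python
def risk_adjusted_value_expected(expected_value: float, raw_risk: float) -> float:
    """RAVE = expected value adjusted by raw risk, where raw risk runs from
    +1 (pure satisfaction) to -1 (pure regret)."""
    if not -1.0 <= raw_risk <= 1.0:
        raise ValueError("raw risk must lie between -1 and +1")
    return expected_value * raw_risk

# For a purely satisfactory strategy (raw risk = +1), RAVE equals the expected value.
assert risk_adjusted_value_expected(50_000, +1.0) == 50_000
print(round(risk_adjusted_value_expected(50_000, 0.4)))   # 20000
```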
Benchmarks:
  • RAVE offers a good indication of risk, but the RAVE number still needs to be modified to reflect prior expectations about performance. These prior expectations are the benchmark against which any performance is to be compared
  • Benchmarks can be anything convenient - interest rates, level of inflation, industry standard performance, even wishful thinking, although that is to be discouraged
  • Benchmarks modify the raw risk measure to give a benchmark-adjusted risk
  • It is important in setting the frame for risk evaluation to discover what benchmark performance is expected, and exactly what constitutes the relevant benchmark for the risk targeted. In the end, how much risk is involved in a strategy depends on how Benchmark Risk Adjusted Value Expected (BRAVE) it is
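The text does not spell out the benchmark adjustment itself, so the sketch below is only one plausible illustration, an assumption rather than the book's formula: measure the expected value in excess of the benchmark before applying the raw-risk adjustment.

```python
def benchmark_adjusted_rave(expected_value: float, benchmark_value: float,
                            raw_risk: float) -> float:
    """Illustrative BRAVE-style measure: adjust the excess over the benchmark by
    raw risk. This is an assumed formulation for illustration only."""
    excess_over_benchmark = expected_value - benchmark_value
    return excess_over_benchmark * raw_risk

# A strategy expected to return 50,000 against a benchmark of 40,000, with a
# raw risk of 0.4, scores 4,000 on this illustrative measure.
print(round(benchmark_adjusted_rave(50_000, 40_000, 0.4)))   # 4000
```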
Strategic Risk:
  • At last, after looking at how risk is addressed in a model, the actual components that go into the expectations part of a strategy are clear, as are the reasons why they are needed. The components of the completely assembled strategy are explained as follows:
  • Probabilities: Risk assessment requires specifying some way of determining the probabilities associated with each outcome. Obviously, these will not be exact, but the better and more justified the estimates, the better will be the risk assessment
  • Payoff: Only when the probabilities for each outcome are determined can the expected payoff be discovered
  • Targets: Each strategy can have one or more risk targets. There is no problem modeling multiple risk targets. Assessing overall risk for a multi-target strategy is not difficult. Defining a risk target requires determining how the risk for that target is to be measured
  • Benchmarks: Every target requires a benchmark. A usual default is that risk is 0 when return equals investment. In almost all cases, that is not true in the real world. To get back no more than what is invested is almost always a regrettable situation. Without adequate benchmarks, risk modeling is usually of small value. (Return and investment are used generically here and are not restricted to money measures. Investment is whatever is put in to get the return; return is whatever the strategy is measured to produce)
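Pulling the four components together, a completely assembled strategy might be recorded along the following lines. Only the list of components comes from the text; the field names and example figures are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RiskTarget:
    name: str           # what the risk is measured against, e.g. "cash return"
    benchmark: float    # expected benchmark performance for this target

@dataclass
class AssembledStrategy:
    name: str
    outcomes: List[Tuple[float, float]]   # (probability, payoff) pairs
    targets: List[RiskTarget]

    def expected_payoff(self) -> float:
        """The payoff can only be assessed once outcome probabilities are specified."""
        return sum(p * payoff for p, payoff in self.outcomes)

strategy = AssembledStrategy(
    name="open regional office",
    outcomes=[(0.5, 300_000), (0.3, 50_000), (0.2, -100_000)],
    targets=[RiskTarget("cash return", benchmark=75_000)],
)
print(round(strategy.expected_payoff()))   # 145000
```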
Final Alignment:
  • Business models are purpose-built structures. They have their own internal purposes, whether it is to predict fraud or model the world economy. But all models share external purposes that are part of their frame
  • Three of the purposes - to clarify risk, to determine options, and to evaluate outcomes - are not alone enough. Every modeler must keep firmly in mind that the three touchstones by which the final results will be judged are novelty, utility, and interest
  • Models support decisions in each of three areas. Models help both develop and explore new options. Given the options, they explore the likely outcomes for each so that an informed decision is possible
  • The risks of each action are then determined - at least as well as they can be known. Finally, options are explored in the situation
  • Each option is developed as a possible strategy for action for which outcomes and risk are evaluated. But although a modeler has to set a frame around the whole model, and the whole model has to be explored, not all features of the final result will be of equal interest to the problem holder and owner. For simple models, final alignment of the frame may seem obvious. However, using an alignment matrix can help to organize thoughts for more complex models
  • Such a matrix can serve as a sort of "sanity check" for the modeler and also serve to help set expectations among those who will be using the model
  • For each of the column and row junctions, ask how important it is in the overall model. Enter scores in a range of, say, 1 to 10. When such a matrix is completed with the participation of all involved, normalize it so that the most important box contains a 1 and the least important a 0 (a minimal sketch of this normalization follows this list). The result indicates where effort needs to be concentrated, thus providing a frame for the whole project
  • These three features of a model - novelty, utility, and interest - are part of the general expectations from modeling. As a modeler be aware of these, and in setting the frame, look for where the novelty is wanted, what is considered useful, and what is interesting. The easiest way to discover this is to ask! Although important, this part is easy - unless it is overlooked
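A minimal sketch of the normalization step described above: raw 1-10 importance scores are rescaled so the most important cell becomes 1 and the least important 0. The row/column labels and scores are hypothetical:

```python
# Raw importance scores (1-10) for each row/column junction of the alignment matrix.
raw_scores = {
    ("options",  "novelty"):  3,
    ("options",  "utility"):  9,
    ("outcomes", "utility"):  7,
    ("risk",     "interest"): 1,
}

lo, hi = min(raw_scores.values()), max(raw_scores.values())
normalized = {cell: (score - lo) / (hi - lo) for cell, score in raw_scores.items()}

# The most important cell is now 1.0 and the least important 0.0; the ordering
# shows where modeling effort needs to be concentrated.
for cell, value in sorted(normalized.items(), key=lambda kv: -kv[1]):
    print(cell, round(value, 2))
```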
Mapping the problem frame:
  • Mapping a problem using the techniques and tools developed is quite easy. Fitting the problem into the map is easy, even if getting to understand the map was not. The easiest way to show how to map a problem is to actually map a problem
  • Before beginning, note that just because all parts of a map exist, does not mean that they have to be used. If you were to drive from New York to Tampa, a road map of the United States would come in handy. However, on your journey, the part that covers California is not going to get much use. It is there in case you have a sudden change of plan, or need to go there another day. But just because the map covers California does not mean that you have to drive there. So too with the decision mapping techniques developed. All the parts are in place if needed. That does not mean that they all have to be used. It does, however, allow conscious knowledge of where the decision did not go, which might be just as important as where it did. For instance, just because the map has all the tools necessary to fully frame risk in a model does not mean that the stakeholders necessarily care much about risk, nor that the modeler actually has to model risk. This is a map that only lays out where you can go. There is no compulsion that says you must go!
  • As a modeler, you should be aware of the issue and discover the stakeholder's concerns
  • There are two steps in framing the actual mapping process - a prologue and a postlogue - and nine steps in mapping the problem. (The prologue and postlogue are the parts that come before and after the mapping process proper. The steps numbered 1 through 3 are required for every problem; the other six depend on circumstances.)
The steps are:  
  • Prologue: Find the terms of the problem
  • Step 1: Locate the decision on the decision map
  • Step 2: Discover the framing model or issue
  • Step 3: Build the needed strategies
  • Step 4: If several interrelated payoffs exist, map the relations
  • Step 5: If interdependent firing of strategies exist, determine the threads
  • Step 6: Determine the targets
  • Step 7: Determine the payoffs
  • Step 8: If modeling risk, determine the benchmarks
  • Step 9: If modeling risk, determine the probabilities
  • Postlogue: Final alignment
Mapping, Modeling, and Mining:
  • Models, as it will turn out, model business situations, and have their impact by modifying business processes
  • Creating a frame around the model so that it will meet the needs of the business, as expressed by the stakeholders
  • Data mining is an excellent tool for both discovering and representing relationships. Many relationships, as long as data is available to describe them, can be discovered and characterized using data mining techniques. For instance, recall that for strategies, when options are linked, the payoffs are interrelated. Almost always the characterization of the linked interrelationships will not be obvious, but will be discovered in, and can be characterized from, the appropriate data
  • Indeed, at any of the six levels of the decision map (4 through 9) for which issues are linked in complex ways, if data is available, data mining may be the most appropriate way to characterize the relationships
  • However, the main application of data mining comes when the problem is framed, and the main question then is to determine exactly what the expected outcomes are for each of the strategies to enable a rational selection among them
  • Data mining will also be used in continuously assessing the risks
Explanation of the decision map:
  • It is important to remember that the decision map is not intended to represent the true state of the world, but to represent the parameters of a decision for determination
  • A very complex world situation might be simplified so that the decision is to be made about only a single issue
  • For example, the whole problem "Should I have lunch now?" is enormously complex. The issues in the situation are staggeringly diverse, ranging from the world food situation to complex biochemical issues, not to mention all of the social and cultural implications. If the world situation is to be represented, the options are equally vast in scale and scope, as are the expectations. Yet normally, this is far more usefully represented as a problem requiring Level 1 representation
  • Regardless of the true complexity of the actual world situation, in using this map, the modeler has to locate the problem as presented, not the problem as it exists in the real world. Without this simplification, all problems are level 9 problems, and are simply intractable. Simplification makes them tractable
  • Single refers to a single issue, multiple to a simply connected set of issues where the interconnections are considered unimportant, and complex to multiple issues where the interconnections are to be considered
Summary: 
Getting the correct frame around a problem is absolutely essential to creating a usable problem model. And all models are built to deal with problems. The key to building a model that addresses the user's need is in discovering what that need really is. That is what framing a problem is about. All of the issues and explanations are aimed at keeping the modeler on track, and assisting a modeler in exploring the problem frame with the client.
==================================================
Getting the Right Model
Highlights:
  • Discovering the right model is without doubt an art
  • It is art at least in part because there is no one "right way" to discover the right model - although there is always a right model to be discovered. Only one rule exists for discovery: practice. All else consists of rules of thumb and helpful hints
  • As a first step, the modeler personally and interactively explores the territory to be modeled
  • Next, when the modeler has some idea of the business situation, it is important to map it as an objective statement
  • It is very important that the modeler's map of the proposed model space be both available and intuitively accessible to everyone who is interested
  • Creating the map dynamically with members of the management team, incorporating their feedback and insights, may well be a crucial element in the discovery process
  • A third stage of discovery, one that is not always needed but sometimes turns out to be crucial, is the creation of a simulation of the map that allows a dynamic and interactive exploration of the map's limitations, sensitivities, and implications
  • Such a simulation model allows a lot of assumptions to be checked, and a lot of false starts to be avoided, before the final model is built
  • These three stages are the steps that any modeler has to take during model discovery
Interactive exploration of the territory: 
  • Approaching a complex situation is never easy, and most business situations are nothing if not complex. The first task faced by any modeler is to understand enough about the business situation to be able to ask intelligent questions. To a great extent, that is what setting the problem frame is all about
  • Framing a problem, and understanding the framework within which any solution has to be discovered, is the activity that is designed to ensure that all of the important issues are addressed, that assumptions about the situation are at least revealed, and that issues to be excluded are at least knowingly excluded for a reason
  • All humans are equipped for social interaction - it's part of our built-in capabilities. Interactively exploring the area that needs to be modeled requires no more than the skills of basic communication, somewhat sharpened and focused toward the needs of the modeler
  • It is just as important to identify the players - the stakeholders - and their needs and motivations
Stakeholders: 
  • One big question the modeler needs to answer is who to talk to. The general answer is: everyone who needs to be involved
  • More specifically, there are five groups of people, or stakeholders, who need to be involved in any modeling project: Need stakeholders - Money stakeholders - Decision stakeholders - Beneficiary stakeholders - Kudos stakeholders
Need Stakeholders: 
  • The need stakeholders are those who actually experience the business problem on a day-to-day basis
  • A modeler is well advised to take a somewhat skeptical approach to the need, at least as initially expressed, since there may be a larger problem to explore
  • The need may be expressed as an expected solution, not as a description of the problem
Money Stakeholders: 
  • These stakeholders hold the purse strings - they will commit the resources that allow the project to move forward. The business case document written to support modeling is largely addressed to the money stakeholders, and it has to be expressed in terms that appeal to them
  • The money stakeholder, as a money stakeholder, is only a "gatekeeper" for the project. It is usually not possible for this stakeholder to say yes to a project - that is the purview of the decision stakeholder - but they can easily say no if the numbers aren't convincing
Decision Stakeholders: 
  • Decision stakeholders make the decision of whether to execute the project. This is another person or group that the modeling case has to address directly
  • The clearest way to identify the decision stakeholder is to determine whose budget is going to be tapped to pay for the modeling and mining
Beneficiary Stakeholders: 
  • This is a crucially important group to include at all stages of the modeling project. These are the people in the corporation who will get the "benefit" of the results of modeling - the folk who will be directly affected
  • The beneficiaries are the grass roots whose support is so crucial 
  • The point is that success is in implementation, not just in modeling. Without the support, involvement, and input of the beneficiary stakeholders, implementation won't happen
Kudos Stakeholders: 
  • These stakeholders are often the people who have "sold" the project internally. Credit for a success will accrue to them - but more importantly, so will the negative impact of a less than successful project
  • These stakeholders do feel a personal sense of involvement with, and responsibility for, the success of the project, even if they are not otherwise directly involved
Talking and listening: 
  • Once the stakeholders are identified, meeting with them to determine their needs is the next step. It's really important to speak with the problem stakeholders with a specific intent and purpose
Initial Questions: 
  • In any situation, start by getting the situation discussed openly and by having all of the participants talk about the issues in broad terms. Doing this means asking broad questions such as, "What are the general objectives for the project?" or "What do you see as the major problems in this situation?" In these initial questions, it is important not to make assumptions
  • Throughout the entire interview process, it's important to be a mirror: "What I'm hearing you say is ...". Reflect back whatever points or understanding are appropriate
  • The old standby "Who, What, When, Where, Why, and How" is a good place to start exploring
Getting a Helping Hand: 
  • Clearly, as a modeler, unless you happen to be a subject matter expert about the situation at hand - and very often even then - you need help in understanding the ramifications of the situation. That's good, and the best way to get the needed help is to ask for it
  • Asking participants to change their perspective and look from the outside of the situation inward is often useful 
  • During exploration it's very helpful, especially when a knotty issue arises, to pass the ball back and wait for suggestions about resolving or solving it. Very often this technique stimulates a lot of insight and thought about the issues. The point during all early discussions is not to try to solve problems, but to explore issues, and this leads to the need to ask open-ended questions
Answers and Opinions: 
  • The modeler's role may be blurred at first, even ambiguous. After all, the modeler is there to help the team (and although the term team is used, it could consist of just one person)
  • The modeler may seem to be cast in the role of "problem solver" or "expert."
  • The modeler's role is to help the team come to a conclusion or a resolution, or to find a solution as appropriate, through modeling
  • As a modeler, you are there only to interact, listen, and facilitate the exploration process
Purposes: 
  • One of the modeler's crucial objectives is to understand the true, underlying, or real purpose for the whole project
  • Questions such as "What will you be doing differently when the project is successfully?" or "What will you know on the completion of a successful project that you don't know now?" are very useful in discovering the important purposes of the project
  • Building concrete objectives for the project is much easier with answers to such questions
  • Another technique useful for understanding the purposes of the model, and its place in the overall corporate strategy, is to create a goal map
What if you get what you asked for? 
  • Sometimes projects fail for lack of complete exploration early in the process. Perhaps no one seriously considered what they would do if they actually got what they requested
  • The point is to consider the overall effect the project may have. In other words, what if you do get what you asked for? Will it make a difference? If so, where and to whom will it make the difference and what else will be affected?
Expectations: 
  • Discovering appropriate frames of reference for the problem, creating a map of the problem, discovering assumptions, determining the role of risk, and so on, are all techniques designed to, in part, align modeler and client/stakeholder expectations 
  • Expectations are closely allied to assumptions, and the unarticulated, unstated expectations are so close to hidden assumptions that they will serve as indistinguishable "potholes" along the road to success
What? So What? Now What? Three key questions encapsulating most of what is in this section:
  • What?: What is the situation, what's wanted, what's going on, what's expected? The question at this stage reminds a modeler to focus on understanding the basic issues, the context, and the situation
  • So What?: For each of the "whats," why do they matter? What impact does each have on the business situation, on the modeling requirements, on the modeling project, and on the modeler? This stage reminds the modeler to focus on understanding the relationship between the basic "what" issues and solving the business problem, creating the model, and deploying the solution 
  • Now What?: What actually needs to be done next as a result of each of the "so whats"? This stage reminds the modeler to concentrate on the practical needs of getting the necessary data and creating a model in the form needed that serves the needs of all stakeholders, that addresses the business problem, and that can be deployed 
These three questions, and what they imply, provide goals for each stage of the interview and discovery process.

Modeling the business situation using metaphors:
  • Models do not reflect reality, only perceptions of reality. This is an important concept because it means that models are symbolic representations of perceptions of events as filtered through a human mind (or several)
  • Not only is what is "really" happening impossible to grasp, but it also isn't of any value to attempt to describe it - at least, not to a modeler
  • Models are designed to strip away the complexity that is inherent in the real world and to represent some important features that humans can appreciate, react to, and use to influence the course of events
  • A business modeler has to create models to address the purposes and expectations of the stakeholders in terms familiar to them
  • All descriptions are metaphors, likening one set of experiences to another, more familiar, set in another domain. The roots of all our metaphors convert experiences into analogous depictions that reflect our immediate physical sense impressions of the world that arise by virtue of the fact that our minds inhabit human bodies
  • Do you grasp what I am saying? No, of course not - at least, not physically. It's a metaphor that analogizes our ability to hold physical objects in our grasp to our metaphorical mental ability to "grasp" (a physical action) "ideas" (a totally nonphysical and ephemeral structure)
  • Modeling business situations requires the use of metaphors that illuminate business processes
  • One very useful technique for finding alternative, novel, or unfamiliar metaphors is to find a magazine in a field that you have never heard of or thought about, and that at first blush seems totally irrelevant to your field. Buy a copy. Read through the magazine. However, don't just read it - read it with the assumption that it contains at least one, and maybe many, useful approaches to problems, issues, and situations that are directly relevant to you. Your job is to find them
The systems metaphor:
  • It is persuasive to regard the world as consisting of many systems
  • One key idea that lies at the core of systems is feedback, which is no more than interaction between processes
  • All system components may potentially interact with each other with either positive or negative feedback
Balance, Cause, and Effect:
  • Most systems do not fly apart or collapse. There are, of course, many systems in which those things happen, but by virtue of the fact that they do, they don't form sustainable systems. They may be systems, albeit unstable and temporary ones, but the systems that mainly interest the business modeler are stable systems
  • The relationship between components in a system has to include what might be called a propagation delay - how long, relative to other changes, it takes for the effects to propagate through their related connections
  • Relationships and reaction time are the two crucial elements in systems. Unfortunately, the traditional ideas of cause and effect simply don't serve to describe relationships in any system. Which parts of the system cause what effect? The answers in the simple, two-part systems sketched so far have to be that either element causes the other, or both cause each other
  • Where the system describes important business behaviors, and data is available, data mining is an excellent tool for characterizing these relationships
  • Another important feature of systems is that systems unaffected by outside disturbance tend, by virtue of their relationships, to settle into stable states
  • Learning is embedded in the company in the form of business practices that, after a delay, change "established business processes," which in turn affect what "decisions" a company can and will take
In creating a system diagram to represent a situation, try the following:
  • Identify the business objects as elements in the diagram
  • Include each object's causes and effects in order of importance
  • Include the least possible number of objects, causes, and effects that result in a system that describes the phenomenon of interest 
  • Use nouns and noun phrases to represent objects. Avoid the use of verbs and relative or directional qualifiers (greater, more, decrease, etc.)
  • Identify the key linkages between elements
  • Use arrows so that loop connectors point from cause to effect
  • If a change in one element leads to a change in a connected element in the same direction, indicate the relationship with a "+."
  • If a change in one element leads to a change in a connected element in the opposite direction, indicate the relationship with a "-."
  • Indicate the character of each system loop. A positive reinforcing loop is shown by a "+" in a circle near the center of the loop; a negatively reinforcing loop is shown by a "-" in a circle near the center of the loop. A balanced loop (that will, if undisturbed, approach an attractor) is shown by an "=" in a circle near the center of the loop
  • Indicate where time delays are important
  • When the system diagram is complete, look at it in total and make sure that it seems to make sense
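The checklist above can also be captured as a simple data structure before (or instead of) drawing the diagram. The element names, link signs, and loop below are hypothetical and follow the "+" / "-" / "=" conventions just described:

```python
# Elements are nouns or noun phrases; each link records cause, effect, the sign of
# the relationship ("+" same direction, "-" opposite direction), and whether a
# significant time delay applies.
elements = ["marketing spend", "new customers", "revenue"]

links = [
    {"cause": "marketing spend", "effect": "new customers",   "sign": "+", "delay": True},
    {"cause": "new customers",   "effect": "revenue",         "sign": "+", "delay": False},
    {"cause": "revenue",         "effect": "marketing spend", "sign": "+", "delay": True},
]

loops = [
    {"members": ["marketing spend", "new customers", "revenue"],
     "character": "+"},   # a positively reinforcing loop under the convention above
]

# Sanity check, mirroring the last step of the checklist: every link connects known elements.
assert all(link["cause"] in elements and link["effect"] in elements for link in links)
```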
System diagrams are one way to represent a business situation and are quite intuitive; however, system diagrams aren't the only way to represent systemic structures. Other ways include influence diagrams and systems models; these two alternative representations are also available as software that can be used to extend the passive diagrammatic representation into a dynamic simulation.

Influence Diagrams:
  • Naturally, all of the relationships in a system can be expressed as influences, and one way of describing systems is by using an influence diagram
Stocks, Flows, and Relation Connections:
  • One founding conception of systems representation, and one that has turned out to be remarkably powerful, is based on stocks, flows, and relation connections. High Performance Systems (HPS) has created software that implements this concept
  • The accumulations or stocks are one of the foundation concepts of this systems representation
  • Morale and interest can accumulate just as much as manufactured units and cash, and equally diminish, too
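A minimal sketch of the stock-and-flow idea itself (not of the HPS software): a stock accumulates whatever flows in and is drained by whatever flows out, period by period. The flow rates below are hypothetical:

```python
# Simulate a single stock (say, "prospects in the pipeline") over 12 periods.
stock = 100.0                     # starting accumulation (hypothetical)
history = [stock]

for _ in range(12):
    inflow = 30.0                 # prospects added per period (hypothetical)
    outflow = 0.2 * stock         # 20% of the stock converts or lapses each period
    stock += inflow - outflow     # the stock accumulates the net flow
    history.append(stock)

print([round(level, 1) for level in history])
```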
Using systems representations:
  • Although it is very important to keep systems as simple as possible, modeling real-world systems still produces fairly detailed diagrams
  • Most software systems representation tools have some way of diagrammatically simplifying the full detail of a system by encapsulating subsystems into a single symbol: a powerful tool for representing important business objects and relationships intuitively and making presentation and explanation easy
  • These are the relationships that have a high level of influence on the system outcomes of interest
  • A great advantage of an automated systems representation is that the tools allow the whole system to be simulated, and its performance explored
  • Systems, useful and powerful as they are, are not the only metaphors useful in constructing business models
Physical system metaphors:
  • Our immediate impressions of the world are that it exists as a separate entity from our "selves" and that it is a primarily physical place. For instance, the term objects used in the expression business objects is derived analogously from our experience of physical objects such as trees, doors, houses, and cars
  • Extending our immediate appreciation of the world by using metaphorical constructs allows us to use all of the associations that accompany the metaphor
  • In constructing business models using physical analogs as inspiration, and as thought and imagination support tools, it is useful to somewhat circumscribe the metaphors and concentrate on generalized physical models
  • The most useful physical analogs are those describing fixed properties, such as density, weight, and so on, and variable energetic concepts, such as pressure, flow, work, and power
  • Imagining business situations analogously in such terms can be very useful in structuring them and pointing to useful relationships and data that are needed to model them
Variables: 
A vast number of physical variables are measured by the scientific community, and some are useful in modeling business situations - specifically, what things change, why things change, and what is needed to bring about particular changes. Change is of primary concern, so the variables chosen are those that bear on the what, why, and how of change. Changes of state require energy, so most variables measure results from expending energy. Useful variables in business modeling include measures of:
  • Energy: The capacity to do an amount of work. Heat is an energy form able to raise the temperature of water. Analogously, "marketing budget" is the energy form able to raise the likelihood that entities in the market will become prospects. One customer acquisition unit might be the amount of resources required to convert one prospect into one initial-purchase customer
  • Flow: The amount of motion produced by expending energy measured as the total distance moved by all the objects affected. Analogously, two prospects moved to make an initial purchase each day results in two prospect conversions per day
  • Power: The rate at which work is performed. Companies can be regarded as converting energy in the form of various resources - human, financial, and physical - into other forms of energy, such as customers, profits, pollution, and so on
  • Pressure: The amount of work that must be done to move an object from one place to another. Dollars flow through corporate "pipes" from places of high pressure to places of low pressure
  • Work: The amount of energy expended in a specific time. One measure of work in business is person-days - the number of people working at a task multiplied by the number of days they worked
  • Friction: The amount of work required to overcome the energy dissipation of an object. Many business processes experience friction. Expense claim reporting, say, or inventory control processes not only take effort (energy) to set up, but once up and running require a constant input of effort (energy) to maintain. "Natural attrition" is one human resource equivalent of friction
  • Inertia: The amount of energy required to produce a given change. Business processes and organizations sometimes certainly seem to have inertia, marching on long after all need for them has passed and requiring great effort to halt or change. Some products, such as the VW Beetle, seem to have inertia. (Production of the Beetle rolled on much unchanged for over 50 years, and even when forcibly stopped, burst out again in a new incarnation)
  • Mass: It expresses the famous Einstein insight that the amount of energy locked up in a physical object is equivalent to its mass multiplied by the square of the speed of light. In business terms, the conversion may produce more noticeable results. Inertia is proportional to mass - the more the mass, the higher the inertia
It is useful to think about variables in three ways: across, through, and against variables, which correspond respectively to pressure, flow, and friction:
  • Across variables: deal with pressure where force has to be applied across (or to) a situation to engender motion
  • Through variables: measure the motion that results from the pressure of the across variables
  • Against variables: measure resistance to motion, and unintended effects that are more or less unavoidable
Ref: Business Modeling and Data Mining by Dorian Pyle,
=====================================================
Getting the Model Right - Part 1:

Getting the structure and requirements that determine the right model forms only half the battle. Once that is discovered, there still remains the business of determining how to create the model so that it accurately represents the business situation - in other words, how to get the model right.

Finding Data to mine:
  • There are only three sources of data: Data obtained from outside the organization - Data on hand - Data developed specifically for the project
  • "Look at the evidence!" Evidence is, of course, data, and the valid evaluation of data is the only way to confirm (or deny) opinion as a representation of the world
External data: 
  • Numerous companies offer many kinds of data for purchase. Acxiom, for instance, offers a large data set containing geographic and demographic information, consisting of many hundreds of variables that include all kinds of lifestyle attributes
  • Credit agencies offer data on personal credit status and credit usage. Census information is available from the federal government 
  • Although such data is publicly available, the fact that it is publicly available makes it of limited competitive value
  • Nonpublic data developed by a company from its own resources is its most valuable resource. Adding external data to, and combining it with, internally generated data may well make both far more valuable for mining purposes
Existing data:
  • Company-generated data can be a blessing and a curse. It has almost certainly been collected at great expense, and was collected to address particular business needs
  • So far as the miner is concerned, this data should be evaluated exactly as if it were external data; that is, it needs to be assessed for relevance, not simply used because it is available
  • This data contains at least the seeds of the relationships needed for a company to gain a competitive advantage
  • The data may appear to bear on the issue, but may in fact turn out to be historical data. "Surely," the astute reader may ask at this point, "since all data is necessarily from the past, isn't it all historical?" In that sense, yes. However, the term historical as applied to data has a slightly different connotation
  • Imagine, for instance, an assembly line that is producing hard disk drives. Suppose that the disk drives for some reason have a higher failure rate than is expected or acceptable. Data from the manufacturing process - from the assembly line - can be used to determine whether there is a problem. But having discovered the problem, the assembly line is then modified to correct the problem. What of the state of the data? Any data that describes the process before the process changed is now historical data - it is no longer relevant to the current (changed) process
  • If the process changes, the data has to change as well, since that is the only way to recognize that the process has changed
  • The nature of any data is reflected in its distribution - literally a description of the way that the values occur. The so-called normal distribution, with its well-known bell curve, is one such description
  • Changing data-generating processes that produce changing distributions of the values that they generate are said to have non-stationary distributions. Processes generating non-stationary distributions are one reason that current data turns into historical data over time
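A minimal sketch of one way to check whether data has become historical: compare the distribution of a measurement before and after a known process change, here with SciPy's two-sample Kolmogorov-Smirnov test. The sample data is simulated purely for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
before = rng.normal(loc=2.0, scale=0.5, size=500)   # e.g. a failure metric from the old process
after  = rng.normal(loc=1.6, scale=0.5, size=500)   # the same metric after the line was modified

statistic, p_value = ks_2samp(before, after)
if p_value < 0.01:
    print("Distributions differ: the 'before' data is now historical for this process.")
else:
    print("No strong evidence of a distribution shift.")
```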
Purpose-developed data: 
  • Data developed to specifically support model creation is far and away the most useful. On the other hand, it's always the most expensive to produce, since procedures and processes to create and capture the data have to be created, often from scratch 
  • Two very useful tools can be used to help determine what data needs to be collected for specific modeling purposes: business process mapping and cause and effect mapping
  • Everything that a business does can be represented as a business process - everything
  • Most business processes, and certainly a company's core activity processes, execute frequently and are a great place to look for purpose-developed data
  • A business process map is just a special type of systems diagram
Mapping business processes: 
  • Business processes are what a company is built from, and everything that a company does can be represented as a business process
  • Finding new markets is a business process, as are developing new products, servicing customer complaints, acquiring other companies, opening a new division, managing the supply chain, creating marketing support materials, creating corporate communications, performing market research, and every other activity that any company engages in
  • The architecture of a business process is built out of the same components familiar from the earlier systems discussion, but here the system elements are made more specific to business. They are: Inputs and outputs - Flows (in flow units) - Networks of interconnected activities - Resources - Information flow structure
  • The accepted technique for mapping business processes starts with developing a high-level structure and working down toward the details
  • To view a business organization in terms of its processes, first identify the inputs, the outputs, and the entry and exit points that define the process boundaries
  • Perhaps the most important feature of process diagrams is that the processes' customers are clearly identified. Indeed, in many ways it is the customers of a process - those who get the value from it - who define the boundaries of the process
  • The information flow structure referred to earlier is the information that is separate from the process itself, but is required to manage it. Often the information flow structure is in the opposite direction to the process flow
  • The key places to collect data about a business process are where the information flow crosses functional boundaries. These are the important points to measure because they represent the important transition stages in the process, and it is their performance that reveals and characterizes the important relationships in the process
  • Each event includes a date and time stamp (to determine response time, flow rates, and so on), as well as context information measuring the product's value, variety, and quality at each transition
  • As a general rule, businesses try to design and manage processes so that they yield the greatest performance for the investment
  • The review of business processes is to discover how they help to indicate what data is needed and where it should be measured to create the business model
  • One key feature is how well the process is organized to match the demands placed upon it
  • In finding data to mine, the process map provides a good place to start. Business processes serve their customers, and the result of a process is to deliver a product to the customer
A product is a business object, of course, and has attributes that include:
  • Product value: Includes the product cost and comprises the total cost to the customer for owning and using the product
  • Product delivery response time: The time it takes to deliver a product from the initiation of the process
  • Product variety: The range and limits to the product that the process can deliver
  • Product quality: Quality is recognized by a non-thinking process, and therefore cannot be defined. However, in this case, "quality" indicates the fitness for purpose of the product
Data from process flows: 
  • The managerial levers, knobs, dials, and switches connect to the process flows, and management control is a part of the information flow structure
  • There are three crucial pieces of data to capture: time, event, and context. These associated items of data are the molecules or droplets that form the process flows
  • The first measurement definition needed is to determine what it is that flows, and to establish appropriate units. This could be number of cars, tons of steel, emails received, customers served, complaints closed, calls made, candidates delivered, prospects converted, or whatever is the appropriate unit of interest. But flow happens in time - cars per minute, tons per hour, emails per second, and so on
  • Measures start with what is flowing, and the time at which one unit flowed
  • The context is represented by the relevant information flow structure attributes 
  • Capturing the data leads to quickly discovering the basic relationships for this flow point: Average, maximum, and minimum number of units per unit of time - Average, maximum, and minimum dwell period (how long a unit spends in a sub-process) - Average, maximum, and minimum inventory per period (how many units are within the process boundaries at any one time)
  • This basic data of a time-stamped, identified flow unit event, along with the necessary contextual information, is the crucial business process information to collect. From this can be inferred stimulus, context, and event relationships. With this data in hand, a miner can conquer the world! Well, perhaps not - but the miner at least can create powerful and relevant business models
  • Each function consumes resources - financial, human, and physical. These also flow into the process and are consumed as "energy" that drives the process to produce its flow
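From time-stamped flow-unit events like those described, the basic flow relationships (dwell period, units per unit of time) can be derived in a few lines. The event records below are hypothetical:

```python
from datetime import datetime
from statistics import mean

# Each record: a flow unit's identifier, when it entered and left the sub-process,
# plus context from the information flow structure (here just "channel").
events = [
    {"unit": "order-1001", "entered": datetime(2010, 8, 2, 9, 5),
     "left": datetime(2010, 8, 2, 11, 20), "channel": "web"},
    {"unit": "order-1002", "entered": datetime(2010, 8, 2, 9, 40),
     "left": datetime(2010, 8, 2, 10, 15), "channel": "phone"},
    {"unit": "order-1003", "entered": datetime(2010, 8, 2, 10, 0),
     "left": datetime(2010, 8, 2, 13, 45), "channel": "web"},
]

# Dwell period: how long each unit spent inside the sub-process, in hours.
dwell_hours = [(e["left"] - e["entered"]).total_seconds() / 3600 for e in events]
print("average / max / min dwell (hours):",
      round(mean(dwell_hours), 2), round(max(dwell_hours), 2), round(min(dwell_hours), 2))

# Units per unit of time: throughput over the observed window.
window_hours = ((max(e["left"] for e in events)
                 - min(e["entered"] for e in events)).total_seconds() / 3600)
print("units per hour:", round(len(events) / window_hours, 2))
```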
The basic questions that need answering to characterize a business process are: 
  • Who are the customers for the process?
  • What is the value that the customers receive from the process?
  • What is the input to the process?
  • What is the output from the process?
  • What are the sub-processes?
  • What starts each sub-process?
  • What flows into each subprocess?
  • Where does the flow to a subprocess originate?
  • Where does the flow from a subprocess go?
  • What information does the subprocess need to operate?
  • Where does the needed information for the subprocess originate?
  • What information does the subprocess generate?
  • Where does the subprocess send its information?
  • What resources does the subprocess consume?
Answering these questions characterizes the nature of the important flows: the product flow, the information flow, and the resource flow.

Data about what? 
  • Managing a company requires managers to use five different types of process manipulation. They aren't necessarily all used at the same time, but they represent between them all of the knobs, levers, and switches that managers use
  • The five types of levers are for managing: Flow time - Flow capacity - Waiting time - Process variability - Process efficiency 
  • Every business activity can be characterized as a business process, so managing these five process attributes is common to all business endeavors
  • If there is no data available, refer back to the process map for the project and identify, at the inter-function boundaries, which data will characterize the relationship represented by the management lever
  • For instance, a restaurant may need more customers. It has business processes for attracting customers (such as sales, marketing, and advertising), which are not performing as desired. This might be a problem related to process efficiency - not converting enough of the resources used into customers. Say that the main resource used for generating customers is cash. Lever 2 applies to "decrease financial resource requirements." How can this be done? A context model will discover the various factors that are important in that they have a strong relationship with financial efficiency. A relationship model will characterize the relationship of input (cash) and output (customers) for all of the various customer conversion mechanisms (perhaps yellow pages, fliers, newspapers, etc). A forecast model will point to the optimum mix of investments
  • The tools for mining are the tools to make any business process as efficient and effective as it can be
  • Data mining has a very broad  role to play in improving business processes and in contributing to efficient corporate performance
Characterizing Business Processes:
  • It is important to note that not all business processes are created equal. Some processes are of primary importance to the company; taken together, these can be characterized as the primary process flow of a company. These processes that form the primary process flow of a company, typically, are related to the P3TQ relationships
  • There are many other processes and flows in a company, but they are all of secondary importance and can be characterized as the secondary process flows
  • The primary process flow is the one that all other flows support. This is where a company's attention is focused
  • All data collection, aggregation, and storage processes, as well as analysis and business information processes, are secondary processes, not primary processes. In the main, the data that a company collects focuses on its primary process flow
  • Data that has already been collected was not necessarily collected to reveal problems; by design, the way it is collected can hide important problems. The collection system can't tell the difference, and often neither can the collected data say what is wrong
  • Business processes don't always do what the designers intended, don't always work as expected, and don't always generate data that means what it says! In mapping the business processes, get all of the people in the process involved
  • For data generation, it is important to map reality as closely as possible, not some theoretical vision of what should happen. This requires the involvement of all stakeholders - and may deliver some surprises
Context: 
  • Context is an important issue, because it is context that gives any event its meaning. Colloquially, to "put an event into context" means to explain its relevance and import - in fact, to explain the meaning of the event
Context affects data about an event in four ways: Event data falls into two broad groupings. Some data is sensitive to the context in which it is generated; some is insensitive. This breaks further into four categories of interest here:
  • Context-specific data: has meaning only within a restricted contextual range, and any change in the context can totally invalidate the data. A conductor's baton movements mean something in the context of an orchestral performance, but nothing in the context of repairing a car
  • Context-general data: is implicitly sensitive to the context in which it occurs. In the context of medical diagnosis for males, pregnancy is not a valid option. In the context of eating a vegetarian meal, ordering steak is not a valid option
  • Context-generic data: is independent of any context. The number of customers a company has, or the capacity of a shipping container, is generic data
  • Context-free data: simply ignores context sensitivity. Frequently, forecasts such as market growth or sales projections are made context-free: "If things continue as before, then . . . " About the only thing that humans can be completely certain, excluding perhaps death and taxes, is that things will definitely not continue as before, and that later will be different than now
With context-sensitive data, two important features of the situation can change: options and probabilities. For instance, tactically, the contextual concerns of price sensitivity and urgency can make a huge difference in choosing a shipping method. Overnight shipping may not be an option if price sensitivity is paramount.
  • At a strategic level, suppose management contemplates expansion into a new corporate facility, or a new market, or a new product area. Context has to include the economic climate, availability of funds, competitive position, management objectives, and so on
  • Capturing the context of a business event is just as important as capturing the time and event itself
  • Cause and effect maps are a good way to discover contextually important data that has to be captured about an event
Mapping cause and effect: 
  • Context models are used to discover relevant variables and features that help to more fully characterize events. This is all well and good if the miner happens to have a selection of candidate variables and features for the context model to explore. But what if there aren't any? How does the miner discover appropriate context data to include in the business process model? The answer to that calls for another useful tool, cause and effect mapping
  • There are several useful ways of mapping cause and effect, but the one that has proved to be most useful is what is known as a "fishbone" or an "Ishikawa diagram"
  • The map is useful because the process of creating it promotes a structured exploration of possible causes and effects, and points directly to the data that needs to be collected to determine the relationships that the map proposes
  • In making cause and effect diagrams, it's important to keep representing more detailed causes until you reach something that can be measured. The undivided branches (using a tree metaphor) are leaves, and it is the leaves that point directly and unambiguously to the data that has to be measured
  • Process maps point to where data can be captured, and causal maps point to what data needs to be captured. Of course, process maps also point back to other important data that needs to be captured too, the data that describes process flow
  • Taken together, process maps and cause and effect maps are powerful tools to use in discovering data for mining that addresses business problems
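A cause and effect map translates naturally into a simple data structure whose leaves name the measurements to collect. The sketch below is only an illustration of that idea; the branch names are hypothetical examples, not taken from any particular project.

```python
# Minimal sketch: a fishbone / Ishikawa map held as a nested dict.
# Branch names are hypothetical examples, not from any real project.
fishbone = {
    "Late deliveries": {                       # the effect being explained
        "Carrier": {"Pickup missed": {}, "Transit time": {}},
        "Warehouse": {"Pick errors": {}, "Staffing level": {}},
        "Order entry": {"Address errors": {}},
    }
}

def leaves(node, path=()):
    """Return the undivided branches (leaves); these point to the data to measure."""
    if not node:                               # no sub-causes: this branch is a leaf
        return [" > ".join(path)]
    found = []
    for cause, sub in node.items():
        found.extend(leaves(sub, path + (cause,)))
    return found

for item in leaves(fishbone):
    print(item)   # each line names a measurable cause to collect data about
```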
 Using data:
  • How data is actually used greatly impacts the quality and performance of any model - not just which data is gathered, but how it is characterized and incorporated into a model or used to reveal relationships
  • Two important issues in the way that any available data is actually used: The types of variables available - Effectively fusing data sets of different domain coverage to create a single, common, domain-mineable representation
  • The data that is used in data mining, as applied to business models any way, is always intended to have an impact. The outcome of the mining is expected to make a difference to a business situation
  • The data, in the form of a variety of variables, needs to be carefully considered, and different types of variables are of different use and significance in the model
  • Understanding the role and use of these different types of variables is very important in determining the way that they are incorporated into different models
  • Very often, the data that is available for modeling comes from disparate sources. These various data sets have to be assembled into one single data set for modeling - but this is not always either easy or straightforward
  • Take a company that has, say, a million customers. If they decide to conduct a detailed market survey of a thousand customers, how is this detailed information about one tenth of one percent of the total number of customers to be meaningfully fused with the original customer data set so that the result can be mined? Can it even be done? (Actually, yes, but only very carefully!)
Types of variables: 
  • There are, of course, numerous ways to characterize variables. However, in getting the model right, one characterization is particularly important: the way that the variables will be used
  • First, there are the variables that go into a model and variables that come out of a model. Such variables are described as clusters called batteries, and they are labeled as the input battery, and the output battery, as appropriate
  • Whether the variables are in input or output batteries, however, there is another important way of looking at them: as control, environmental, or intermediate variables
  • Control variables represent objects, or features of objects, that can be controlled, or controlled for, directly by the organization. They appear, for instance, as the leaves in cause-and-effect diagrams
  • Environmental variables represent features of the environment that are beyond any form of control by the organization, such as hours of sunshine, gross domestic product, inflation rate, and unemployment rate. These are sometimes called nuisance variables by statisticians. These variables most definitely have an impact on the customer
  • Intermediate variables lie between control variables and environmental variables. They are, therefore, influenceable to some greater or lesser degree. These are often found in the output battery, and there they are no problem; but sometimes when they are in the input battery they can cause problems
  • The problem comes when they look like, and are treated like, environmental variables that are independent of each other
  • It is often important to make sure that the model has control variables in the input battery
  • It would be of no value to present a forecast based on environmental variables alone, however accurate it might be. The sales manager has no control over the mortgage rate or discount rate, for example, even if they are material to the forecast
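One simple way to keep this discipline is to tag each variable with its battery and role and check that the input battery actually contains control variables before modeling. A minimal sketch, with hypothetical variable names:

```python
# Hypothetical variable roster: battery ("input"/"output") and role
# ("control", "intermediate", "environmental") are assigned by the modeler.
variables = {
    "discount_offered":  ("input",  "control"),
    "mailing_frequency": ("input",  "control"),
    "competitor_price":  ("input",  "environmental"),
    "unemployment_rate": ("input",  "environmental"),
    "customer_interest": ("input",  "intermediate"),
    "units_sold":        ("output", "intermediate"),
}

input_controls = [name for name, (battery, role) in variables.items()
                  if battery == "input" and role == "control"]

# A forecast built only on environmental inputs gives managers no lever to pull.
if not input_controls:
    print("Warning: no control variables in the input battery")
else:
    print("Control variables available to act on:", input_controls)
```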
Fusing data sets: 
  • Business data comes from a wide variety of sources, and often with great differences in data domain coverage. For instance, basic customer information may be fairly complete for all customers. A market research survey may be very comprehensive in its coverage for a minute fraction of the total number of customers, since it may cover many prospects who aren't customers as well as some that are. How can the two data sets be combined in such a way as to support modeling and mining? How are data sets derived from such disparate sources to be used for modeling and mining? The answer is through the use of a process called data set fusion
  • The data sets to be fused are individually in the form of tables containing numerical or categorical variables
  • The problem is how to combine these data sets that have some (usually small) overlap
  • One way to start the data set fusion is by creating a model on only the variables that the two data sets have in common. That is, it models which records in the donor data set are most similar to each record in the recipient data set. Nearest neighbor methods are well suited to this
  • The objective is to achieve the best possible match with the data sets on hand
  • A second method, depending on the density of information in the common variables, may produce a more realistic distribution of values for the estimated unique variables, but it can also produce estimates that are impossible values in the real world
  • In making data set fusions, it is worth considering using rule extraction as a way of generating predictions 
  • The problem with fusion is that there is no way to definitively answer the question of what the distribution should look like, except by comparing it to real-world data - and if that had been available the fusion wouldn't have been needed
Here is the method:
  • Locate all the records in the donor data set whose common variables match records in the recipient data set. (So, for instance, if the records refer to people, select records for those people in the donor data set who also have records in the recipient data set.)
  • Select a subset of these records that have a similar distribution in their common variables to the distribution in all of the recipient records in their common variables
  • From this second subset, determine the distribution (measure the first four moments) of the non-common donor variables
  • Compare the distribution just found to the distribution of the non-common donor variables in the whole fused data set. (Look at the differences between the first four moments of both data sets.)
The moments of a data set are, more fully, the "moments about the mean." The k-th moment is the sum of the k-th powers of the differences between each value and the mean, divided by the total number of records. The mean, of course, is just the average found by summing all the values and dividing by the number of records.
If the distributions are very different, it is a strong indication that the fusion should not be relied on without further testing against the world. If no distribution checking is possible, testing is all that is left.

It is worth noting that there are standard measures that are based on the first four moments; these are the variance, standard deviation, skew, and kurtosis. Using these measures is usually a lot easier than using the "raw" moments as defined here because there are many software packages that calculate these standard measures.
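The matching step and the moment check can be roughed out in code. The sketch below assumes two pandas DataFrames named donor and recipient (hypothetical names and values); it matches each recipient record to its nearest donor on the common variables, copies the donor-only variable across, and then compares the standard moment-based measures. It is only an illustration of the idea, not the book's procedure verbatim.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.neighbors import NearestNeighbors

# Hypothetical data sets: 'donor' carries a survey-only variable,
# 'recipient' is the large customer file; both share the common variables.
common = ["age", "income"]          # variables present in both data sets
donor_only = ["satisfaction"]       # variables present only in the donor

rng = np.random.default_rng(0)
donor = pd.DataFrame({"age": rng.integers(20, 70, 200),
                      "income": rng.normal(50000, 12000, 200),
                      "satisfaction": rng.integers(1, 11, 200)})
recipient = pd.DataFrame({"age": rng.integers(20, 70, 5000),
                          "income": rng.normal(50000, 12000, 5000)})

# Match each recipient record to its nearest donor on the common variables
# (standardize first so one variable's scale doesn't dominate the distance).
mu, sd = donor[common].mean(), donor[common].std()
nn = NearestNeighbors(n_neighbors=1).fit((donor[common] - mu) / sd)
_, idx = nn.kneighbors((recipient[common] - mu) / sd)

fused = recipient.copy()
fused[donor_only] = donor[donor_only].iloc[idx.ravel()].to_numpy()

# Compare moment-based measures for the donated variable in both data sets.
def moment_summary(x):
    return {"mean": np.mean(x), "variance": np.var(x),
            "skew": stats.skew(x), "kurtosis": stats.kurtosis(x)}

print("donor :", moment_summary(donor["satisfaction"]))
print("fused :", moment_summary(fused["satisfaction"]))
# Large differences suggest the fusion shouldn't be relied on without further testing.
```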

Fusing data sets that are comparable in record count may improve the models enormously. If two or more disjoint data sets have to be mined together, fusing them is the only option, and the results are very often worthwhile.


Summary: 
Data comes from one of three sources: internal, external, or purpose-developed. The most useful is purpose-developed data enriched with external data, if it is possible to do so. Appropriate data is generated at the business processes, which ultimately consume the results of modeling by means of managers pulling controlling "levers" to modify business processes. The characterizations of variables that are important to modelers include control, intermediate, and environmental. These characterizations are important to recognize so that the models appropriately address the business process changes needed.
Getting the model right requires no more and no less than calibrating an appropriately constructed model with appropriate data.


Link II - M II Technique Boxes 
=================================================
Getting the Right Model (Part 2)
Exploration Tools:
  • One-on-one exploration of problems is very useful and frequently productive. On other occasions, group sessions are necessary, especially with multiple stakeholders simultaneously contributing
  • A very useful skill for a modeler to develop is a set of exploration tools that can be used individually and with groups, and for presenting interim results
Mind Maps: 
  • Maps show features and objects of interest, not the whole territory. They attempt to present visually an individual or group understanding of a situation. Such maps may or may not match reality very closely, but matching reality is not the issue; matching the mental image is the primary object
  • An easy way to begin is with the techniques of mind mapping - a quick way to sketch mental maps that are very simple and easy to create, yet which communicate a huge amount of information very succinctly
  • They are useful for many applications, and especially for describing situations
  • Mind maps are not limited to hand-drawn sketches. Various mind mapping software products are available, such as Visual Mind
Cognitive Maps: 
  • It's a short, but significant, step from mind maps to cognitive maps. A cognitive map is similar in many respects to a system diagram. Rather than depicting the associations that are easy and intuitive in mind maps, cognitive maps have to identify objects, features of objects, and significant interconnections between them
  • Cognitive map objects have two important characteristics: The objects exist only inasmuch as we interact with them - Properties of objects exist only inasmuch as the objects interact with the world
Cognitive Models: 
  • A cognitive model takes a step further from a cognitive map, in that its distinguishing feature is in the specification of the relationships
  • A cognitive map indicates that an important relationship exists, and the general direction of its influence
  • A cognitive model precisely characterizes the indicated relationships, which requires determining units of measure for the objects, ranges of values, the "shape" of the relationship, and the strength of the interaction
  • With data available, of course, this is where data mining comes strongly into the picture, since data mining is an extremely powerful tool for characterizing all of the features of a relationship, whether the measured values are numerical or categorical, and whether the relationship is linear or highly nonlinear
Simulation: 
  • We are pretty hopeless at guessing what complex, nonlinear systems are going to do. But it turns out that we are smart enough to make devices that can do the guessing for us - simulators
  • To run a cognitive model as a simulation, simply plug in actual values for some of the system's variable values
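A cognitive model only becomes a simulator once the relationships are given concrete shape and actual values are plugged in. The toy sketch below uses hypothetical relationships and numbers, chosen only for illustration, and runs a tiny price-to-demand-to-revenue model forward over simulated months:

```python
# Toy cognitive model: price influences demand (negatively); price and demand
# determine revenue. Shapes and strengths are hypothetical illustrations.
def demand(price, base=1000.0, sensitivity=8.0):
    # Linear "shape" for the price -> demand relationship, floored at zero.
    return max(base - sensitivity * price, 0.0)

def simulate(months, price, monthly_price_increase=1.0):
    history = []
    for month in range(months):
        d = demand(price)
        history.append({"month": month, "price": round(price, 2),
                        "demand": round(d), "revenue": round(price * d, 2)})
        price += monthly_price_increase   # plug in a policy and watch the system
    return history

for row in simulate(months=6, price=50.0):
    print(row)
# Simulated months tick by in microseconds - far faster than real time.
```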
Calibrating Cognitive models: 
  • Simulation time almost always ticks by far faster than real time. Hours, weeks, or months of simulated time may pass in seconds of elapsed time
  • For business simulations the point is to try simulating things that are unwise to try in practice until better understood, so faster than real time is pretty much the rule
  • However, cognitive models can be very usefully calibrated against what really happens in the world, if data is available
  • If a cognitive model "works," it should describe in its limited terms the system in the world of which it is an abstraction
  • Calibrating the model ideally requires matching the system performance back to the world's performance and having them agree
  • A valid system model will always include the performance that the world actually exhibited
Knobs, Dials, Levers, and Switches: 
  • Managers have to manage: practicing the art of recognizing situations and taking appropriate actions that are effective
  • The manager's core responsibility is to make appropriate selections because managerial performance is judged almost entirely by selection appropriateness 
  • The ideal information that a manager wants presents clues to recognizing situations unambiguously. This is exactly what an aircraft does for a pilot. Pilots have various dials that present relevant information representing the situation, and alarms that present alerts when action is needed, like stall warnings, low fuel warnings, and so on
  • Systems simulations can be fitted with visualizations of control panels, which provide images of knobs, dials, levers, and switches that show the state of the business situation simulation
The Business Case: 
  • A well-presented business case might have to include details of who, what, when, where, why, and how for the project
  • But the focus, and the reason for existence of the business case, is entirely concerned with only one of these: why is this project to be undertaken? The business case explains in management terms - and note the phrase "in management terms" - why, and how much, the project will contribute to the business
  • The essential job of the business case is to quantify the return expected for the invested resources
What is a business case? 
  • Very simply, a business case is the material that is presented to decision makers to persuade them that the idea proposed should be pursued
  • In order to gain support and to be successful, the project must engage senior management and executives from the very beginning
  • No improvement or change stands alone, but exists only in comparison to some existing baseline. The business plan has to explicitly include the baseline benchmark, for without it there is no way to measure the improvement or increased return from the project
  • Changing corporate behaviors is not easy. There are several barriers to change that are common in most organizations. The business plan has to address how the barriers are to be overcome. These barriers are the "friction" that impedes "movement" toward the goal
Aligning the business case with corporate needs: 
  • The business case has to be aligned with the needs of the business. The business case itself not only has to present a plan that is aligned with the business, but also it has to explicitly show how it aligns on at least two dimensions
  • It can also usefully include explicit references to alignment on several other dimensions. The two dimensions that must be included are: Business goals - Timetable
  • The hidden criteria to at least consider, and also to explicitly address if possible in the business case, include: Corporate culture - Corporate investments - Process capabilities
Preparing the business case: 
  • A business case is intended to convince someone to make a decision. Decisions aren't made in vacuums, but are made relative to alternative choices
  • The business case for a large project will present several to many different issues that require management decisions
  • Discussion of alternatives will almost always be most effective when the discussion is in terms of money
  • When the business case and the expected return are discussed, it is the discounted amount that is important to the manager
  • The business case has to be designed to support business decisions only, and must reflect business values and support business judgments
Return on Investment (ROI): 
  • As far as presenting the business case goes, numbers come first, which is fine - but how are those numbers to be developed? The simple answer is that the most important number in the business case is the estimated return on investment (ROI)
  • In any analysis that leads to an ROI calculation, two separate investments (costs) and two separate returns (gains) have to be quantified. Investments: One-time investments - Ongoing investments. Returns: Tangible returns - Intangible returns
Describing the returns and investments considered in an ROI analysis is fairly straightforward. However, there are many roads that can be traveled to reach a justification of an estimated project ROI. Here is a very brief look at several of the most popular, listed alphabetically:
  • Breakeven analysis
  • Cost/Benefit analysis
  • Investment opportunity analysis
  • Pareto analysis
  • Sensitivity analysis
  • Trend analysis
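Since it is the discounted amount that matters to the manager, the arithmetic behind an ROI figure is worth making explicit. The following is a minimal sketch with hypothetical cash flows (one-time plus ongoing investments set against tangible yearly returns) discounted at an assumed rate; it is an illustration of the calculation, not a template from the book.

```python
# Hypothetical figures for illustration only.
one_time_investment = 150_000           # tool licences, initial build
ongoing_investment_per_year = 30_000    # maintenance, staff time
tangible_return_per_year = 120_000      # extra margin attributed to the model
years = 3
discount_rate = 0.10                    # assumed cost of capital

npv = -one_time_investment
for year in range(1, years + 1):
    net_cash = tangible_return_per_year - ongoing_investment_per_year
    npv += net_cash / (1 + discount_rate) ** year   # discount each year's net gain

total_invested = one_time_investment + ongoing_investment_per_year * years
simple_roi = (tangible_return_per_year * years - total_invested) / total_invested

print(f"Simple ROI over {years} years: {simple_roi:.1%}")
print(f"Discounted (NPV) return: ${npv:,.0f}")
```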
Assembling and presenting the business case: 
  • The business case, recall, is defined here as the material that is presented to decision makers to persuade them that the idea proposed should be pursued
  • The material should include at least a written report and an in-person presentation. The report should comprise seven sections: the management summary and six subsequent sections
  • The management summary (This must be short)
  • The opportunity (the business problem)
  • Current situation
  • Alternatives
  • The solution
  • Resources
  • Proposed action
This is a brief description of what goes into the business case. Keep the whole thing as short as possible, but cover all of the main points. A business case document must be complete, but it shouldn't try to be comprehensive. Focus and concentrate on business management needs, not on technical needs.

The reality: "What can you do with my data?" 
  • The purpose of using data mining in business modeling is to apply it as a tool that will support informed decision making and the tactical deployment of solutions in a company
  • Data mining and business modeling are not solutions in search of problems - the problem always has to come first
  • Data mining and business modeling are tools just like any other tools. No company buys tools first and decides what to do with them afterward
  • Even in geological mining, companies don't dig first and see what comes up afterward. Inviting a mining company to "come and mine my backyard to see if there is anything of value there" won't work. Even if there is oil in your backyard, a diamond or gold mining company won't be interested - and in any case, the question is whether the oil is present in economic quantities. Analogously, so it is with data mining
  • It isn't appropriate to mine first and see what is discovered afterwards
  • It is valid to ask what types of tasks particular tools can undertake, and what they are particularly useful for before actually committing time and resources to purchasing some and learning to use them. The biggest problem still comes when the data comes first. And all too often the data does come first and that is the business problem the modeler must face
Hunting for a problem:
  • Until a problem is identified, there isn't anything that data mining or modeling the "business situation" can achieve, because there isn't a business situation 
  • Remember that data mining is useful only when the business situation contains relationships that are described by data
Problem opportunity: The corporate value chain
  • The place that any search for problems has to begin is with the company's value chain. Corporate value chains are immensely complex in detail, but simple in principle
  • All companies face one fundamental problem: to have the right product in the right place at the right time, in the right quantity and for the right price. All of a company's core business processes focus on this multi-part objective
  • The objectives form a system that the author calls the P3TQ (Product, Place, Price, Time, Quantity) system. These key components mutually interact
  • The tools that any company can deploy to understand and manage the intricacies and vagaries of its unique P3TQ system are data, forecasting, and resource deployment
  • Relationships embedded in data are exactly what data mining tools excel at revealing
  • The unique P3TQ system of any company is constructed from the corporate resources available to it formed into business processes
  • Some processes are dedicated to determining the appropriate product, place, price, time, or quantity, whereas other processes attempt to get products to the right place, create products for the right time, have the right quantity at the right time, and so on
  • When starting a mining project with data, the first step is to audit the data using the P3TQ relationships. For which relationships is the data most timely, accurate, and available? The data collected will probably already be used to understand the relationship it was originally intended to address
  • Every manager's core responsibilities map onto one of three areas: the P3TQ relationships, maintenance of the business processes, or management of corporate resources. However, many senior management responsibilities, and most high visibility projects, all map onto the P3TQ relationships, and this is where to seek a business situation that, when improved, produces a result that gets attention
  • Forecasts are not predictions; they are possible future outcomes based on some set of assumptions
  • Whatever else a forecast does, it provides a linkage between one circumstance (x) and another circumstance (y). What links x and y is an (assumed) connecting relationship. Where there is accurate, relevant, and available data, data mining is a good candidate for producing accurate and timely forecasts. The forecasts that will get the most attention are those that address the P3TQ relationships
  • If the available data doesn't support forecasts that directly address the P3TQ relationships, look for projects that are as close to these core relationships as possible
  • For example, the supply chain of a company focuses on four of the five core relationships: primarily product, time, and quantity, with a secondary focus on price. Since this touches on so many of the core P3TQ relationships, this is a good secondary place to look for a valuable first project. Or again, Customer Relationship Management (CRM) touches all five core P3TQ relationships, and again offers excellent opportunities
Initial project size:
  • The final question to address is the scale of the initial project. Your objective is to have a successful project. Success requires that the results are noticeable, that the project uses data already available, that it is of a suitable scale, and that it isn't too innovative
  • Success requires noticeability so that the results can be used as a reference within the company to garner further support. It also has to be noticeable so that other managers not initially involved become aware that the project has produced useful results. This will, hopefully, motivate them to discover how business modeling and data mining can help them, too
  • The best place to look is in current data streams, because these are often more easily diverted for the project than attempting to modify the existing procedures for maintaining a large corporate warehouse
  • The scale of the project is crucial. It has to be large enough to make a difference, but not too much of a difference. Deployment is going to require changing a business process in the company. This means, ultimately, that people will have to behave differently somewhere in the company. Since this is a first project of an untried technology, the less change the project requires for deployment, the better. The less change required, the more likely it is to be implemented - and without implementation, and cooperative implementation at that, there will be no success
Summary: 
Discovering the right model to fit a business situation is neither straightforward nor easy. It is a difficult process that requires experience, insight, and judgment. Discovering the problem to address involves talking to the right people (the stakeholders), in the right way (by asking open-ended questions).
Using an appropriate metaphor, one that seems intuitive and relevant to the modeler and the stakeholders, create a cognitive map, and perhaps a cognitive model of the situation. Look for the key relationships, perhaps using cognitive model simulation as a tool to illuminate the situation. With the problem defined, create a business case for the project that appeals to the business managers whose support for any project is vital.
This is an appropriate process for discovering the right model when business modeling and data mining are used as tools for solving a problem, for improving a business process, or for supporting decision making in a corporate setting. Unfortunately, it's no help if the business problem starts with data already collected. Then the business problem is to find an opportunity to successfully apply business modeling and, most particularly, data mining in an effective way. Finding the initial problem that the data addresses requires a close focus on the core P3TQ relationships that are represented and how the data represents them, especially those relationships that are represented in the data but that have not been explored as deeply as the data suggests they might be.

Getting the Model Right 
=================================================
Deploying the Model:
  • Regardless of how much effort the modeler puts into the modeling project, and equally regardless of how technically successful the project turns out to be, unless the model is successfully deployed and remains productively engaged with the necessary business processes, the project must fail and deliver no return for the investment made
  • The one unambiguous fact that has to be faced by every modeler is that a less than perfect model that is deployed is of infinitely more value to a company than any better model that isn't deployed
  • Many of the deployment considerations have to be made at the start of the project
Modifying business processes:
  • What does it mean to deploy a model? In detail, the answer to that question depends a great deal on what type of model is to be deployed. However, in general, the answer to the question is very straightforward. Deployment requires using the results of the modeling project to modify existing business processes
  • The practice of any business process is always maintained by people. It is reinforced by written procedures, corporate culture, tradition, explicit incentives, emotional motivation, implicit and explicit expectation, and familiarity and is enmeshed in a formal and informal web of internal interactions that serve to maintain and modify it
  • Recruiting and maintaining support from all of the stakeholders from the beginning of the project through deployment is key to success
  • The project is not complete until the deployed model has actually changed corporate behaviors and the changes have been monitored to determine if the actual outcomes matched expectations 
  • The connection between modeler and all the stakeholders needs to be structured, and two-way information flow and engagement maintained
  • The modeler, at least as a modeler, can do nothing about any of these problems unless the appropriate stakeholders are informed, engaged, and committed
Motivation for success:
  • A model is a component of a business process and must itself have inputs, outputs, flow rates, and all of the concomitant parts
  • At least one way to judge the effectiveness of any model is by the way that it impacts the business process into which it is embedded. However, since almost all business processes incorporate people, ultimately the deployed model has to change people's behaviors - or at least, enable them to change their behaviors in ways that are intended to affect a business process and effect an improvement in some way
  • The most important consideration is making sure that the users of the model are indeed motivated to use it. Regardless of how much easier it might be for them, or how much improvement the company might glean, the users have to invest time and effort in changing a behavior that they already know works into a different behavior that uses the results of the model
  • Failing to provide the necessary motivation is the single largest cause of failure to change a system's behaviors because without an incentive to change - well, there's no incentive to change, so why make the effort? Successful deployment lies in large part on having a good answer to the question, "What's in it for me?"
Impact of model types:
  • Models can be characterized in terms of five dimensions: inferential/predictive, associative/systemic, static/dynamic, qualitative/quantitative, and comparative/interactive. All data-based business models will fall somewhere on each of these five dimensions, and the precise characterization along these dimensions makes a difference to how the models are deployed. However, by far the biggest practical implications for deployment come with the distinction of the inferential/predictive dimension
  • An inferential model is expected to deliver explanatory insights that allow the users of the model to go and do something differently and more effectively as a result of the explanation - in other words, to make better predictions about the outcomes of their actions
Inferential models: Delivering explanations
  • Almost all models have to deliver explanations. Even if the model isn't used to explain relationships in the world, the modeler will at least have to explain what the model is doing. If the developed model is primarily an inferential model, then in this case, model deployment is actually delivering an explanation about relationships in the world
  • Whatever has been discovered, deployment requires a justified (by the data) and convincing narrative
  • A clear distinction must be drawn between explaining the modeling process, which is not a part of the inferential model at all, and explaining the relationships among the business objects in the real world, which is what an explanatory model is all about
  • The explanatory model only describes the world (at least as revealed by the data), quite regardless of how the relationships were discovered and explicated
  • The key to any explanatory model is to tell a story - the story that the data reveals. The story is told in business terms: explicitly not in terms of the data, nor in terms of the modeling tools used, but entirely in terms that the stakeholders will understand
  • This is the purpose of framing the model in which the crucial question is very simply, "In what terms is the answer needed?" In the appropriate terms for the audience, the model is a story with a beginning, a middle, and an end
  • If the opening lines of the story don't grab attention and interest, the deployment fails right there
  • Explanatory modeling takes place in two stages - discovery and verification - and the modeler performs different activities in each stage
Discovery: Discovery comprises 11 basic themes. Approximately in the order that they are used to explore the data, they are:
  • Noting patterns and themes
  • Discovering plausible explanations
  • Clustering
  • Counting 
  • Contrasting and comparing
  • Partitioning variables
  • Deriving generalities from particularities
  • Proposing plausible explicit and implicit (latent) factors
  • Identifying and explicating relationships among variables (or variables groups)
  • Creating a logical explanatory chain
  • Creating conceptual coherence
These 11 stages are what a modeler does in the discovery stage of explanatory modeling, and they need to be convincingly summarized at model deployment.

Verification: Verification comes at the point that discovery stops. Then it's time to try to confirm, or deny, the various explanations that were discovered. Verification can be summarized in eight themes:
  • Checking for representativeness 
  • Checking for bias
  • Triangulation: Using different data sources - Using different modeling methods - Using different theories
  • Accounting for outliers
  • Explaining surprises
  • Incorporating negative evidence
  • Incorporating external empirical and experiential evidence
  • Corroboration of discovered insights, objects, and relationships from feedback
Here is where the story that has been developed during the discovery process is put to the test.
  • Triangulation can be a very powerful justification for accepting the results as valid
  • Explanatory models require strong theoretical underpinnings - that's their purpose after all
  • Triangulation calls for the development of multiple theoretical underpinnings that all go to support the explanation
  • Validation of an explanatory model always has to account for outliers
  • For instance, in the daily weather patterns, tornadoes and hurricanes are outliers, but crucial to include. Huge insurance claims occur very infrequently, are outliers, but are a focus of interest. Fraud is outlying behavior most of the time, but of primary interest
  • One purpose of an explanatory model is to make discoveries, and the discovery of meaningful outliers may be the most important result of modeling
  • An explanatory model that discovers and explains the obvious gains credibility; if it doesn't, there is something suspicious going on somewhere - bias, lack of representativeness . . . something
 Predictive models:
  • At one extreme, a predictive model will deliver one, or at most a few, predictions and never be used again. Such models are used, for instance, in scenario evaluation in which the object is to determine possible future outcomes
  • Perhaps a company, in planning future strategies, needs to project economic conditions with high, medium, and low inflation assumptions. Or again, city planners may need to predict service needs and traffic congestion under several different hypothetical emergency scenarios. These predictive models, although truly predictive, can be treated more like explanatory models in their deployment phase
  • At the other extreme are fully automated predictive models that are embedded in dynamic systems and are intended to modify system behavior and response in real time
  • An example familiar to many millions of people is embedded in the amazon.com and bn.com Web sites. Their "recommendation engines" are predictive models that are designed to provide unique recommendations based on predictions of which products will be most likely to elicit additional purchases from browsers on the site
  • Such models are also common in many other applications, such as in industrial process control where the predictive models are used to predict the system's most likely future response to changes in conditions, allowing process correction and optimization
Dynamic data modification:
  • The important point is that during modeling, the data has necessarily been adjusted in many ways. It may have been cleaned, had missing values replaced, been recoded, had features added in the form of additional variables, and had any number of transformations and modifications made to it before actually constructing the model
  • The model won't work on raw data, only on data that has been transformed and adjusted - so all of the transformations and adjustments that are made at modeling time have to be re-created and duplicated at predict time
  • For instance, suppose that the feature "total purchases this year" turns out to be an important variable for the model. Is this available at predict time? The answer depends. It's certainly feasible to create this feature as a variable at modeling time, but if the main database is updated only monthly, then any purchases in the current month will not be included in that database
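One common way to guarantee that modeling-time transformations are re-created at predict time is to bundle them with the model so that they travel together. A hedged sketch using scikit-learn's Pipeline follows; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: the same cleaning steps must run at predict time.
train = pd.DataFrame({"total_purchases": [120.0, None, 300.0, 80.0],
                      "tenure_months":   [12,    5,    48,    9],
                      "churned":         [0,     1,    0,     1]})

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # missing-value replacement
    ("scale",  StandardScaler()),                   # numeric transformation
    ("clf",    LogisticRegression()),
])
model.fit(train[["total_purchases", "tenure_months"]], train["churned"])

# At predict time the raw record passes through the *same* imputation and
# scaling that were fitted at modeling time - nothing has to be re-invented.
new_customer = pd.DataFrame({"total_purchases": [None], "tenure_months": [30]})
print(model.predict(new_customer))
```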
Missing "Now": Skip Windows
  • Data about "now" always presents an issue that needs to be thought about for models. The issue, at least for predictive models, is that data describing "now" is very likely not present in the execution time data stream
  • The duration of "now" depends very much on the system involved and may stretch from micro- or milliseconds to days, weeks, or years
  • For the U.S. census, for instance, "now" takes 10 years since that is the sampling rate for the U.S. population
  • All systems that a modeler deals with have a sampling rate that essentially determines how recent the latest possible data available can be. Frequently, different systems have different sampling rates that have to be merged into the data set to be modeled
  • Although challenging, this isn't usually a problem. The difficulty is that for some period, defined here as "now," the data isn't actually available at execution time because it has not yet been assembled. These "now" periods form skip windows in the modeling data
  • In general, although skip windows (as the periods of unavailable data are known) may not be required in a model, depending on exactly what has to be predicted and when, a modeler has to pay close attention to what data will actually be available to create the needed prediction at execution time, and not rely on what is available in the training data sets
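The skip-window idea can be made concrete by building features only from data that would genuinely be available at execution time. In the hedged sketch below (hypothetical transaction data and window length), features for a prediction made "now" are computed only up to now minus the skip window:

```python
import pandas as pd

# Hypothetical transaction log; in practice this comes from the warehouse.
tx = pd.DataFrame({
    "customer": ["A", "A", "B", "A", "B"],
    "amount":   [20.0, 35.0, 50.0, 15.0, 60.0],
    "when": pd.to_datetime(["2010-06-01", "2010-07-15", "2010-07-20",
                            "2010-08-10", "2010-08-14"]),
})

prediction_time = pd.Timestamp("2010-08-15")
skip_window = pd.Timedelta(days=30)   # data from the last 30 days isn't assembled yet

# Build "total purchases" only from data old enough to exist at execution time.
available = tx[tx["when"] <= prediction_time - skip_window]
features = available.groupby("customer")["amount"].sum().rename("total_purchases")
print(features)
# Training on data inside the skip window would make the model rely on
# information it will never actually have when the prediction is needed.
```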
Summary: Probably the single biggest failing of data-based business models when they are deployed is a failure to close the loop - in other words, to put in place a mechanism to determine exactly what the effect and impact of the deployed model turns out to be. Success or failure should be determined by measuring the actual outcomes and comparing them with the expected results, so that expectations can be set better in the future.
Without such feedback, and correction if necessary, deployment is not complete. The process for measuring actual outcomes, like so much else about deployment, has to be thought through at the beginning of the project, supported by the stakeholders, and incorporated into the project from the start.
==================================================
Getting started (Data Mining):
Highlights:
  • The path that leads from raw data to a deployable, mined model looks straightforward enough, but is in fact replete with many backtracks and detours
  • Mining data is not magic, and it is not something that computer software will do for you. Essentially, data mining is a structured way of playing with data, of finding out what potential information it contains and how it applies to solving the business problem
  • Most of the tools currently in use were developed from three main areas: statistics, artificial intelligence, and machine learning
  • Despite apparently different roots, these tools essentially do only one thing: discover a relationship that more or less maps measurements in one part of a data set to measurements in another, linked part of the data set
  • Data mining is a human activity, and it is the miner that produces the results of data mining, not the tools. The results come from the insight and understanding applied by human intelligence
  • The first three stages of mining data are: the assay, feature extraction, and the data survey. These three stages taken together comprise most of what constitutes data preparation
Looking at Data: 
  • The sort of data used in mining can be imagined as being arranged in rows and columns. The columns are usually referred to as variables; each column holds the different values that its variable can assume
  • Each row represents a collection of measured values that are associated together in some way
  • This column/row representation of data is a fairly common representation, but by no means the only one. Databases are created to hold data in other types of representations. One quite common representation, for instance, is called a star schema. A number of column/row tables are related to each other in a way that can be represented as a sort of star-like "hub and spoke" relationship
  • Such databases are, in general, anywhere from poor to disastrous when heavy-duty, column-oriented operations are demanded of them - yet column-oriented operations are exactly what data mining demands
  • Most mining tools require access to data in the form of a single table. Sometimes columns (variables) are also called fields. Rows (instances) are also known as vectors
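When the data lives in a star schema, the usual first move is to join the spokes back onto the hub so that the mining tool sees one table. Below is a minimal sketch with hypothetical fact and dimension tables, using pandas merges:

```python
import pandas as pd

# Hypothetical star schema: a fact table (hub) plus two dimension tables (spokes).
sales = pd.DataFrame({"customer_id": [1, 2, 1], "product_id": [10, 11, 11],
                      "quantity": [2, 1, 5]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})
products = pd.DataFrame({"product_id": [10, 11], "category": ["Staple", "Luxury"]})

# Flatten the star into the single row/column table that most mining tools expect.
flat = (sales
        .merge(customers, on="customer_id", how="left")
        .merge(products, on="product_id", how="left"))
print(flat)
```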
First steps in preparation: The Assay 
  • Preparation starts with the assay, which is no more than assessing the data's fitness or worth for mining - just as ore is assayed to determine its worth, data miners assay data for a very similar purpose
  • Assaying data starts with a simple but crucial step: simply look at the data and check that it is indeed the data as represented
  • The data assay requires looking at, not changing, the data. At this stage it's looking and understanding that are important, so take notes if necessary
  • The best plan of action to use when assaying data is to start with the variables as individuals and progress to considering the data set as a whole
"Eyeballing" Variables: 
  • One very fundamental characteristic of data is that there are different types of variables. In mining business data, three qualitatively different types usually jump right out: numbers, dates, and some that are neither dates nor numbers. Variables that contain numbers are called numeric variables, those containing dates are called date variables, and the others are usually known as categorical variables
Basic checks on numeric variables: 
  • The most basic check to perform on a numeric variable is its range - in other words, identify the maximum and minimum values in the data set. Compare these against the expected maximum and minimum that the data set is supposed to contain
In addition, and if relevant, there are several other criteria that are worth comparing against what should be expected:
  • Averages: Look at the mean, median, and mode
  • Missing values: What constitutes a missing value in each numeric field? Many databases deliver null values (meaning "no entry made here") where there is no numeric value. Count the number of nulls present. Also check carefully for surrogate nulls. A surrogate null, at least in a numeric field, is a value that is actually numeric but is entered when there is no known numeric value. Sometimes, zero is entered when no number is known. If surrogate nulls are present, they will cause a problem when mining the data. Removing surrogate nulls, if you can find them, can improve a model tremendously. Replace them with nulls, or treat them as missing values
  • Distribution estimates: Measures of variance - the way that the values distribute themselves within the range - are very useful for understanding a "snapshot" view of a variable. Variance may be reported directly, or it may be expressed in terms of standard deviation. Useful distribution measures also include skew and kurtosis
  • Distribution histograms: A histogram is a graph that depicts how many values are present in each part of the range of a variable
  • Error values: Oftentimes, a non-numeric value will sneak into a variable. Or, sometimes an error made while creating the data set may move a whole section of values from one variable to another, putting a chunk of categorical values in the middle of a numeric variable. Such error values can destroy the miner's best efforts
  • Outliers: These values are located far from the majority of values. There may be nothing amiss when single or value-groups of outliers are discovered, but they are at least more likely to be erroneous than non-outlying values. It is certainly worth checking. However, they shouldn't be removed or altered just because they are outliers. Outliers may have to receive special treatment during preparation
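Most of the numeric checks above can be run in a few lines. A hedged sketch with pandas follows, using a hypothetical column; it covers range, averages, true nulls, a guess at surrogate nulls, distribution measures, and histogram counts.

```python
import pandas as pd

# Hypothetical numeric variable with a suspicious zero acting as a surrogate null.
col = pd.Series([12.5, 14.0, 0.0, 13.2, None, 14.0, 250.0], name="order_value")

print("range          :", col.min(), "to", col.max())   # compare to expectations
print("mean/median    :", col.mean(), col.median())
print("mode           :", col.mode().tolist())
print("true nulls     :", col.isna().sum())
print("zeros (possible surrogate nulls):", (col == 0).sum())
print("std / skew / kurtosis:", col.std(), col.skew(), col.kurt())
print("histogram counts:")
print(pd.cut(col.dropna(), bins=5).value_counts().sort_index())
# The 250.0 shows up as an outlier; worth checking, not automatically removing.
```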
Basic checks on date variables: 
  • The main problem with time and date variables is that they appear in a wide variety of formats. Some mining tools can handle date variables, or at least recognize them as dates, if they are in the appropriate format; others can't recognize dates at all
Basic checks on categorical variables: 
  • Categorical variables may have few values (such as gender) or many (such as ZIP codes or personal names) in a variable. When there are many values, it's usually impossible to check them all individually
When it is impossible to inspect all of the values, use a histogram type of graphical check. One difficulty with categorical values is that there is very often no rationale for ordering the categories, so there is no metric by which to determine how to arrange the category labels on the histogram's axis. Even so, much can be learned by quickly inspecting such a chart. Look for patterns such as the following:
  • Modal distribution: Some categorical variables have a very high proportion of instances falling into relatively few categories, and only relatively few instances falling into the many remaining categories. Retail grocery items purchased are a good example of this. In any single shopping basket, many people include selections from relatively few staples (bread, milk) in their purchases and relatively few people include any one of the non-staple items (toothpaste, organic cereal)
  • Uniform distributions:  Some categorical variables show their categories fairly evenly represented by count in a data set. A histogram-type display (in which the height of a bar represents the number of instances in a data set with a particular value) will show each category with relatively uniform height
  • Monotonic distribution: In this case, every categorical value is unique, so every category has exactly one entry. For instance, serial numbers, personal names, and Social Security numbers are all monotonic and categorical in spite of the term "number" in the categorical description
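A quick way to see which of these patterns a categorical variable follows is to look at its value counts relative to the number of rows. The sketch below uses hypothetical values and rough heuristic thresholds, chosen only for illustration:

```python
import pandas as pd

# Hypothetical categorical variable.
items = pd.Series(["bread", "milk", "bread", "milk", "toothpaste",
                   "bread", "cereal", "milk", "bread"], name="item")

counts = items.value_counts()
print(counts)                       # a histogram-style view, most common first

n, k = len(items), items.nunique()
if k == n:
    print("monotonic: every category is unique (e.g., serial numbers)")
elif counts.iloc[0] / n > 0.5 or counts.head(3).sum() / n > 0.8:
    print("modal: a few categories account for most instances")
else:
    print("roughly uniform: categories are fairly evenly represented")
```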
Repairing basic problems with variables: 
  • It is conceivable that one day a data miner somewhere will have a data set to model that has no problems and in which nothing needs to be repaired. But the author very strongly suspects that it hasn't happened yet, and isn't likely to any time soon! Strictly speaking, the assay is only a look at a data set, not adjustment to and modeling of data
  • The problem is that as the assay proceeds from examining variables individually to looking at the data set as a whole, it is best practice to fix at least the worst problems with each variable before proceeding to work with the entire data set
  • Correcting these problems is really the first step in preparing the data for mining rather than just assaying since from here on, after the problems have been fixed to at least some degree, the data is no longer in its "raw" state. So fixing problems requires making changes to data
  • The miner should keep a copy of (back up) this untouched data set, at least as a reference 
  • Even more important than backing up the data set is to document and "carry" the transformations forward from stage to stage
  • Data preparation is a process of modifying data so that it "works" better for mining
Basic adjustments to numeric variables: There are essentially three basic adjustments to make to numeric variables at this stage: 
  • Constant values: Checking the basic statistics for a variable may show that although the variable has a numeric value, it has only a single value and no variance. It's quite easy for such values to sneak into a data set, especially if they have been extracted from a larger data set. The variable may vary in the larger data set, but not in the sample. For instance, if the sample is being used to model credit card data of the most creditworthy individuals, the variable "Credit Limit" may have some uniform value since all highly creditworthy customers get the maximum. Whatever the reason, if, after checking, the constant value is one that is expected, then this variable should be removed
  • Empty variables: Sometimes variables turn up that are totally without any values - all nulls. Naturally these too carry no information since, as with all nulls, this is identical to not having the variable in the data set at all
  • Sparsely populated variables: These are variables that are mostly populated with nulls - say 80% to 99.999% nulls - but that do have a few values present. There are advanced preparation techniques for handling such variables, but an easy way to deal with them is to try to model what happens with them in the data set, and then try modeling again without them in the data set. If they cause a problem, remove them
Basic adjustments to date variables: Date variables can be considered as a special type of numeric variable. However, at this basic stage, and for purposes of continuing the assay, it is enough to think of date variables in a way similar to the numeric variables just discussed:
  • Constant dates: These are not often encountered, but if a date variable does in fact contain only one date, remove the variable
  • Empty dates: Entirely empty variables are of no value as far as mining goes. Remove the variable
  • Sparsely populated dates: Use the same method as for sparsely populated numeric variables
Basic adjustments to categorical variables: To quite a large degree, considerations for categorical variables are similar to those already mentioned. Constant, empty, and sparsely populated categorical variables can be treated in the same way as numeric and date variables. However, even at this basic stage of adjustment, categorical variables begin to warrant additional consideration (a brief sketch of these adjustments follows this list): 
  • Pseudo-numeric categorical variables: Remember that some categorical variables, such as ZIP codes, may masquerade as numeric variables. Most data mining tools that handle categorical variables have some specific mechanism for flagging a variable as categorical, regardless of the fact that it appears to be numeric. It's important to identify and flag such variables at this point
  • High category count variables: Categorical variables that have very high numbers of categories are going to cause any mining tool some form of indigestion. Many tools have some form of internal transformation that allows the tool to cope with the problems that these variables would otherwise cause; the problem for the miner is that although whatever adjustment the tool suggests or adopts will allow modeling to continue, the adjustment almost certainly will not be the one that best suits the business problem
  • Ordinal categorical variables: Some categoricals have a natural ordering or ranking - at least within some domain. Whenever something is ordered but expressed as a categorical - say, "order of importance" or "sales rank" - it may be worth numerating the categorical. This consists of simply assigning a number to a category based on its rank position or ordering
  • Monotonic categorical variables: Recall that these are variables in which every category is unique. There are a huge number of such variables: account numbers, serial numbers, order numbers - their variety is endless. It may well be possible to transform them in some useful way - Social Security numbers contain information about date and place of birth or naturalization, for instance; car license plates provide information about state, county, and date of registration
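The basic adjustments above translate into a short cleaning pass. Here is a hedged sketch with hypothetical columns that drops constant, empty, and very sparse variables and flags a pseudo-numeric categorical such as a ZIP code; thresholds and column names are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical raw data set.
n = 10
df = pd.DataFrame({
    "credit_limit": [50000] * n,                              # constant - carries no information
    "legacy_code":  [None] * n,                               # empty - all nulls
    "survey_score": [7, 9] + [None] * (n - 2),                # sparsely populated (80% null)
    "zip":          [10001, 94103, 60601, 30301, 10001] * 2,  # categorical masquerading as numeric
    "spend":        [120.0, 80.5, 310.0, 55.0, 99.0] * 2,
})

drop = [c for c in df.columns
        if df[c].nunique(dropna=True) <= 1      # constant or empty
        or df[c].isna().mean() >= 0.8]          # mostly null; the text suggests trying
                                                # models with and without such variables
cleaned = df.drop(columns=drop)

# Flag the pseudo-numeric categorical so mining tools treat it as a category.
cleaned["zip"] = cleaned["zip"].astype(str).astype("category")

print("dropped:", drop)
print(cleaned.dtypes)
```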
Basic checks on the data set:
  • So far, the assay process has involved evaluating only one variable at a time. Next, the miner uses the assay to examine the data as part of an integrated whole, using a basic data mining tool - a form of decision tree
  • The tree used here is a CHAID tree (an acronym for CHi-squared Automatic Interaction Detection)
  • The next step in the assay is single-variable analysis (comparing each input variable, one at a time, against the output)
Single variable CHAID Analysis: 
  • As a data miner involved in a data assay, expect to spend some time looking at data with single-variable CHAID (or similar) analysis; a rough stand-in for this kind of check is sketched below
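CHAID itself is not available in the common Python libraries, but the spirit of a single-variable check - splitting one input at a time against the output and asking whether the association looks real - can be approximated with a chi-squared test on a cross-tabulation. The following is a hedged sketch with hypothetical data, not the book's tool:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: does each input, taken alone, relate to the output?
df = pd.DataFrame({
    "region":  ["E", "W", "E", "W", "E", "W", "E", "W"] * 5,
    "segment": ["A", "A", "B", "B", "A", "B", "A", "B"] * 5,
    "churned": [1,   0,   1,   0,   1,   0,   0,   0] * 5,
})

for variable in ["region", "segment"]:
    table = pd.crosstab(df[variable], df["churned"])   # single variable vs. the output
    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"{variable}: chi2={chi2:.1f}, p={p_value:.4f}")
# Small p-values hint at a single-variable relationship worth a closer look.
```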
Basic adjustments to missing values: 
  • Missing values in any type of variable present special, and very tricky, problems, because values missing in a data set are very likely not missing at random
  • These missing values may be associated with each other, so that when "age" is missing, for instance, perhaps "date of birth" has a far higher chance of being missing than when "age" is present
  • Discovering and understanding these missing value patterns may be an important part of the discoveries that a miner makes
  • The fact that some values are missing or not missing, and the patterns with which they are or are not missing, may themselves carry useful predictive or inferential information
  • The problem the miner now faces is that missing values may contain useful patterns that are important to keep in a data set, yet most tools cannot mine this missing value information
  • A major problem is that whatever method is used has to be applied to run-time data, not just the data that a miner is using to make a model. Worse, the most commonly recommended methods of replacing missing values - using the mean, median, or mode as replacement values - are damaging choices
  • One of the miner's tasks is to discover whether values are missing, and if so, to discover clues suggesting what to do about it
  • One way to discover whether the missing value information is important starts with making a temporary copy of the whole data set. For this exploration, use the temporary copy instead of the original data set. Except for the target or output variable, replace all of the values that are present with, say, "1" and all of the values that are missing with, say, "0." The modified data set now contains no null, or missing, values. Use the tree tool to find which, if any, of the modified variables in this data set has a relationship with the target. If you find such a variable, exclude it and try again. Keep going until there is no strong association between the target and any individual remaining variable. Having done that, apply the tree, using all of the associated variables to predict the output
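That exploration can be sketched directly: copy the data, turn present/missing into 1/0, and see whether a tree can predict the target from the missingness pattern alone. The example below is a minimal, hedged illustration with hypothetical columns and synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in which missingness itself is informative.
rng = np.random.default_rng(1)
n = 500
target = rng.integers(0, 2, n)
age = np.where(rng.random(n) < 0.4 * target + 0.1, np.nan, rng.normal(40, 10, n))
income = np.where(rng.random(n) < 0.2, np.nan, rng.normal(50000, 15000, n))
df = pd.DataFrame({"age": age, "income": income, "target": target})

# Temporary copy: present -> 1, missing -> 0 (target left untouched).
indicators = df.drop(columns="target").notna().astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(indicators, df["target"])
print("accuracy from missingness alone:", round(tree.score(indicators, df["target"]), 3))
# A score well above the base rate suggests the missing-value pattern carries information.
```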
Anachronistic Variables: 
  • These are nasty wee beasties! An anachronism is something that is out of place in time. In this case, anachronistic variables carry information that is displaced in time. Specifically, the variable tells of something that cannot be known at prediction or inferencing time
  • Relative to some "now" in the data set, it tells of some future outcome, and is embedded in the "input" variable side of the data set. Such variables carry information that is useless to the miner in building a predictive model since, if used, the model will need information that cannot be known at the time that the prediction is needed
  • There are really only two ways to look for anachronistic variables. One is to look at and think about what is in the data. The other is to build single variable and simple, multiple variable models and look for results that are too good to be true. "Too good to be true" very often is exactly that. When faced with known or suspected anachronistic variables, remove them from the model
  • The bad news is that any model built to be deployed on current data that actually calls for future data is a waste of time and resources, and can never return the results expected
    Basic data adequacy assessment 1: How much is enough? 
    • At this point in the assay, the miner begins to have a feel for the data. However, the miner still does not know yet whether the data is sufficient for its intended purpose. Essentially, the miner needs some assessment of whether there is enough information about the real world reflected in the data to make a model that will actually "work" - that is, whether it will make decent predictions if a predictive model is needed, or allow decent inferences if an inferential model is needed, and so on
    • This comprehensive data set, usually impossible to collect, represents what is called the population. Roughly speaking, the population consists of all of the things that exist to be measured in the real world. Since it is almost always impossible to collect data that represents the population, a miner usually works with some lesser collection, called a sample. Intuitively, the larger a sample, the better it will reflect the population at large, so long as the sample instances are drawn at random from the population
    • To assay data adequacy, the miner needs to determine whether there is enough data in the sample that it does indeed accurately reflect the relative frequencies in the population
    •  There are two measures of model quality that are common enough to be mentioned. One is a statistically based measure called R, sometimes also called correlation. Another is called a confusion matrix. R will appear as a number between 1 and -1
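    As an illustration of these two measures, here is a minimal sketch using numpy's correlation coefficient and scikit-learn's confusion matrix on made-up values (every number and label below is invented for illustration):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # R (correlation) for a continuous prediction: close to 1 or -1 is strong, near 0 is weak
    actual    = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
    predicted = np.array([11.0, 11.5, 16.0, 19.0, 23.0])
    print("R:", round(np.corrcoef(actual, predicted)[0, 1], 3))

    # Confusion matrix for a class prediction: rows are the true classes, columns the predicted ones
    true_class      = ["buyer", "buyer", "non-buyer", "non-buyer", "non-buyer"]
    predicted_class = ["buyer", "non-buyer", "non-buyer", "non-buyer", "buyer"]
    print(confusion_matrix(true_class, predicted_class, labels=["buyer", "non-buyer"]))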
    Basic data adequacy assessment 2: Negative examples 
    • A mining tool will learn what is present in the data set used for mining - it won't learn anything about the nature of the world at all, only about the data. If the model is to reflect the world, so too must the data used for mining reflect the world. And most specifically, the data must contain both examples of the phenomena of interest and counterexamples
    • As a modeling technique, it is sometimes necessary to structure the data so that it is easier to model
    • A mining tool won't - can't - learn about what it can't see in the data. Be sure that you have answered these two assay questions: (1). Does this data represent the full range of outcomes of interest, or the full range of behaviors of interest? (2). How do the frequencies of occurrence of the outcomes/behaviors of interest in the data set compare to the outcome/behavior frequencies in the real world?
     Sampling Bias:
    • Data is always said to be "sampled," and a data set is usually called, in statistical terms, a sample. It's a sample because it usually represents only a small portion of all the data that could possibly be collected. All the possible data is called the population, and it's from this that the sample comes
    • Sampling bias turns up when the method of selecting instances of data from the population results in the sample distribution not representing the population distribution. The key here is that it's the way that the instances are selected, the sampling method itself, that introduces bias
    • Taking a sample of the American population, for instance, is often done by random selection from home telephone numbers in the United States. What's wrong with this as a sampling method? It depends. If the sample is to be used, say, to estimate what percentage of the population has telephone service, it's immediately obvious that this sampling method is a complete bust! The sample is biased in favor of instances (people) who have phones
    • As far as sampling bias is concerned, the miner's job is to infer, detect, and correct for it if it exists, and if it's possible at all
    Bias and Distribution: 
    • Fairly obviously, as the output battery variable changes value across its range, so too the input battery variables change values across their ranges
    • Although the values may change, very often the distributions of values in the variables at different points in their ranges don't change. Here is where a clue to sampling bias may lie. It's by no means conclusive, but if the distributions among the variables change as the value of the output battery variable changes, it's a hint that sampling bias may lurk in the data set
    • The best that a miner can do is to compare distributions for the input variables with sub-ranges of values of the output variables, and be suspicious! Even with such clues in hand, if sampling bias exists, it will still have to be justified based on external evidence
    Completing the basic assay: 
    • Although the assay is intended to help a miner discover the important parameters and limitations of a data set, finding out this information alone isn't enough. Of course, this is information that will be used in later stages, or used to justify a decision to go forward or change the direction of the mining project. Keep notes about what is happening in the data and how it was discovered 
    Basic Feature Extraction: 
    • A feature, in this case, means no more than "something notable in the data." Now, of course, something notable must depend on the defined mining purpose of the data set
    • Also, sometimes the variables of a data set are called features of a data set. In data mining, the term feature has a context-dependent usage, which isn't hard to understand, once you get the hang of it
    • The term extraction is used here because the features will originate from the data already assayed, but also will add to that data
        Representing Time, Distance, and Differential Relationships:
        • Remembering that the object is to provide a transformation of data into the most accessible form for a mining tool, a very good first question to ask yourself is, "Could I easily use the data the way it's expressed now to make a good analysis?"
        • Another way to think about variables is as absolute, linear, or cyclic. An intuitive place to start is with variables that express time. Often, time is expressed in a data set as some sort of date/time label such as "July 4th 2001 at 17:27:45," which expresses a specific date and time to the nearest second. This is an absolute expression of time
        • Often in mining, absolute time is not much help. Relative time expresses the relation of that label to some other label of the same type
        • Many time events turn out to be cyclic, very often based on an annual cycle. In the United States, peaks and troughs in various types of business and commercial activity are very regular, influenced by annual events such as Thanksgiving and Christmas
        • With a generic representation of a cycle present in the data set, the mining tool can "notice" that some phenomenon relates to the cycle in some way. Without any representation of the cycle, no tool could make such an inference
        • Cycles, almost by definition, go around in circles, and circles can't be represented by only one variable or dimension - it takes two (see the sketch after this list)
        • Differential representation expresses the difference between two measurements from here and there, or from now and then, which is what was occurring in the change from absolute to linear feature extraction
        • So comparing the difference in admission rates between surgical and medical admissions, between household claims and auto claims in insurance, between millinery and hosiery in sales conversion rates, are all differential representations that extend the notion of absolute, linear, and cyclic representations into other domains
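        For example, here is a minimal sketch of turning a cyclic feature - month of year, a hypothetical example - into the two variables a circle needs, using sine and cosine:

        import math

        def encode_month(month):
            # Map month 1-12 onto a circle so that December and January end up adjacent
            angle = 2 * math.pi * (month - 1) / 12
            return math.sin(angle), math.cos(angle)

        for month in (1, 6, 12):
            print(month, encode_month(month))

        With both coordinates present in the data set, a tool can "notice" that late December resembles early January - something a single linear month number hides.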
        Recoding: 
        • Coding is the term applied to the results of the process of categorizing information. A simple example is using city names. In an address, the name of a city is simply a category. However, the label alone isn't enough to express the full richness of what the city address implies
        • How might information about cities be usefully recoded? There are a couple of ways. One is called 1-of-n recoding, the other is called m-of-n recoding. It will turn out that 1-of-n is really a special case of m-of-n, but 1-of-n is the easier place to begin the explanation
        1-of-n and m-of-n Recoding: 
        • Some modeling tools may have difficulty in using city names in a single list. In such a form, the algorithm on which the tool is based cannot correctly characterize the information that the city name implies. One fairly common practice is to break out the individual categories, each represented as a separate variable, and then represent category membership by a binary indication (1-0) in the variable
        • The maximum number of possible categories is represented by n. Since only one category is flagged out of the possible categories, this method of recoding is called 1-of-n
        • By using an m-of-n recoding, the number of category variables can be significantly reduced. Instead of having only one category membership flag on in any instance, m-of-n has m flags on, where m is a number 1 or higher, but not more than n
        • For an m-of-n recoding, it is quite possible to have different categories share an encoding
        • The one important caveat for the miner when making an m-of-n recoding is this: if you're unsure of exactly how to make the recoding, look it up or dig it up - don't make it up!
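        A minimal sketch of both recodings with pandas: 1-of-n uses one flag per city, while the m-of-n version uses a smaller set of shared descriptive flags (the flag values shown are illustrative assumptions only - in practice they should be looked up, not made up):

        import pandas as pd

        cities = pd.DataFrame({"city": ["Boston", "Denver", "Miami"]})

        # 1-of-n: one binary variable per category, exactly one flag on per instance
        one_of_n = pd.get_dummies(cities["city"], prefix="city")
        print(one_of_n)

        # m-of-n: fewer flags, and categories may share an encoding
        m_of_n = pd.DataFrame(
            {"is_coastal": [1, 0, 1], "is_large": [1, 1, 0]},   # illustrative values, not looked-up facts
            index=cities["city"],
        )
        print(m_of_n)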
        Representing Objects:
        • Data, at least for business modeling, is always about objects. They may be customers, branches, departments, divisions, chart of accounts categories, fraudulent transactions, salespeople, profit, promotions, and a vast number of other similar objects
        • Data may represent the relationships between object features, such as promotional activity and customer spending, but still the data represents something about these objects
        • It's very important that the data adequately represents the objects of interest in order for the mining tool to infer the appropriate relationships between the features of the objects
        • It's important to make sure that the data actually supports the problem to be modeled - if it doesn't, then it's important to reconfigure the whole data set, if necessary, to make sure that it does
        • Much of the reconfiguration may be in the form of extracting the necessary features
        • One very common problem for miners is that the available data to begin with is transaction data, which represents a transaction as its primary object
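        For instance, here is a minimal sketch (hypothetical column names) of reconfiguring transaction-level data so that the customer, rather than the transaction, becomes the object the data represents:

        import pandas as pd

        transactions = pd.DataFrame({
            "customer_id": [1, 1, 2, 2, 2],
            "amount":      [20.0, 35.0, 5.0, 12.0, 8.0],
            "channel":     ["web", "store", "web", "web", "store"],
        })

        # One row per customer: the extracted features summarize that customer's transactions
        customers = transactions.groupby("customer_id").agg(
            n_transactions=("amount", "size"),
            total_spend=("amount", "sum"),
            avg_spend=("amount", "mean"),
            web_share=("channel", lambda s: (s == "web").mean()),
        )
        print(customers)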
        Surveying Data: 
        • The assay starts with raw data and ends with an assembled data set that addresses the business problem. Although presented as a continuous process, in fact, any miner will have to loop back through the various parts of the assay many times
        • Each new feature extracted will need a return to an earlier part of the assay. Nonetheless, the assay has almost entirely focused on the data as a collection of variables, and has looked at the variables as individuals
        • The data survey can begin only when the "final" data set is ready for modeling. ("Final" is a relative term since something discovered in the survey - or later, something discovered during mining - may always indicate that a change or addition is needed, which means returning to earlier parts of the process. Data preparation is the most time consuming part of mining, and it's not a linear process)
        • The survey looks to answer questions about the data set as a whole, and about expectations for modeling based on that data. Surveying data is an advanced topic, and the tools for making a comprehensive data survey are not readily available and require advanced mining knowledge
        • To some extent, the data survey is accomplished by using mining tools and then carefully exploring the limits of the mined model. In advanced mining, the survey can yield a tremendous amount of pre-mining information that guides an experienced miner during additional data preparation and mining
        Summary:  By far the most difficult task, and one that will take the data miner the greatest amount of time, is actually preparing and getting comfortable and familiar with the data. For what it's worth, the author, when mining, spends a lot of time simply "slopping" data about - cutting off samples, building small models, trying different features, and looking for insight and understanding.
        The essential idea for a miner is to get comfortable with the data. Feel that you intuitively know what's in it, what models it will support, what its limitations are. It is absolutely essential to connect the data back to the business problem, and to make quite sure that the data addresses that problem in the needed terms of reference.

        It's probably worth noting that miners need other preparation techniques when mining data in other domains, such as biomedical data, industrial automation data, telemetry data, geophysical data, time domain data, and so on. With data prepared and assayed, it's time to look at building the needed model or models. However, before launching into modeling practices, it's time to introduce mining tools themselves and the principles on which they work.
        =================================================== 

        What Mining Tools Do:
        Highlights:
        • Algorithms are important - for researchers in data mining who are attempting to develop new ways of mining data, and who are attempting to mine data that hasn't yet been mineable. Much leading edge work is going into developing techniques (including new algorithms and variations of existing algorithms) for mining Web data, pictures, text, spoken words, and many other types of data
        • However, for an analyst who needs to use the techniques of data mining to solve business problems, reasonably good algorithms have already been developed, taken out of the laboratory, wrapped in robust and reliable commercial packaging, tested for usability, and delivered with help screens, training manuals, tutorials, and instruction
        • When so wrapped, these are tools, not algorithms. Buried inside the tool are one or more algorithms, but the miner is interested more in the usability and relevance of the tool for the business problem than in the details of what algorithms it contain
        • Skill, experience, practice, and familiarity with the data and the domain are the keys to success, not the algorithms involved
        • Data mining algorithms are the core mathematical and logical structures that direct and determine a specific computational approach to examining data. Data mining tools are the commercially wrapped data mining algorithms ready for use in a business setting
        Data Mining Algorithms: 
        • A truth about data mining that's not widely discussed is that the relationships in data the miner seeks are either very easy to characterize or very, very hard
        • Most of the relationships in mining for business models consist of the very easy to characterize type. However, just because they are easy to characterize doesn't mean that they will be easy to discover, or obvious when discovered. What it does mean for the miner is that tools wrapping even the simplest algorithms can deliver great results
        Variable types and their impact on algorithms: Variables come in a variety of types that can be distinguished by the amount of information that they encode. They are briefly reviewed here starting with the "simplest" (those that carry the least information) to those that carry the most information:
        • Nominal variables: Essentially, these are no more than labels identifying unique entities. Personal names are nominal labels identifying unique individuals
        • Categorical variables: These are group labels identifying groups of entities sharing some set of characteristics implied by the category. In addition to personal names, all readers of a book belong to the category of humans
        • Ordinal variables: These are categories that can be rationally listed in some order. Examples of such categories might include small, medium, and large or hot, warm, tepid, cool, and cold
        • Interval variables: These are ordinal variables in which it is possible to determine a distance between the ordered categories. However, their intervals may well be arbitrary, as in a temperature scale
        • Ratio variables: These are interval variables in which ratios are valid,  and which have a true zero point
        As far as mining algorithms are concerned, they don't distinguish between nominal and categorical variable types, although the miner may well have to. Mining algorithms are also insensitive to the interval/ratio distinction, although once again a miner may need to be sensitive to the distinction. Algorithm sensitivity can be described as nominal, ordinal, or numeric, respectively.

        Characterizing Neighborhoods: Nearest neighbors
        • The term nearest neighbors is suggestive, but before addressing what it means for data mining, some preamble is needed
        • For any point, the row that defines where it is to be plotted is usually called a vector
        • Data mined models, regardless of which of the types of business situation models they will be applied to, are usually used for inference or prediction
        • Inferences are essentially descriptions of what is going on in the variables in particular neighborhoods
        • Predictions answer the question, "Given one or more, but not all, of the values in a vector, what are the most reasonable values for the missing entry(s)?"
        • Prediction with nearest neighbor requires only finding the nearest neighbor for the vectoral values that are present, and using the nearest neighbor's values as a reasonable estimate for the values that are missing
        • A slight modification, at least as far as prediction goes, is to use more than one neighbor along with some weighting mechanism to average the values to be predicted across several neighbors. This is called k-nearest neighbor, where k stands for the number of neighbors to use (see the sketch after this list)
        • The alternative is to find some shorthand way to characterize neighborhoods that doesn't involve retaining all of the data for comparison, and doesn't need so many look-ups and comparisons to characterize an instance. There are several ways to achieve this, and some of them are briefly reviewed in the next subsections: decision trees, rule extraction, clustering, self-organizing maps (SOMs), and support vector machines
        • Regressions and neural networks rely on representing data as a function (a mathematical expression that translates a unique input value, or set of values, into a unique output value)
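        A minimal sketch of k-nearest neighbor prediction with scikit-learn on a tiny, made-up vector set (all values hypothetical):

        import numpy as np
        from sklearn.neighbors import KNeighborsRegressor

        # Each row is a vector in state space (age, income); "spend" is the value to predict.
        # In practice the variables should be scaled so that income doesn't dominate the distance.
        X = np.array([[25, 30_000], [30, 42_000], [45, 80_000], [50, 95_000]])
        y = np.array([200.0, 310.0, 640.0, 720.0])

        model = KNeighborsRegressor(n_neighbors=2, weights="distance")   # k = 2, distance-weighted average
        model.fit(X, y)
        print(model.predict(np.array([[40, 70_000]])))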
        Decision trees:
        • Nearest neighbors methods do indeed characterize neighborhoods, but really only the immediate point, or area closely around a point, demarked by a vector - so it takes a lot of points to do a good job of characterizing the whole space. This space, by the way, is technically called state space, and is a type of mathematical analog of real space
        • A different way of characterizing the space is to chop out areas that are to some degree similar to each other; this is what a decision tree tries to do
        • Note that decision trees decide how to split a data set one variable at a time. It's the "one variable at a time" part that is significant here. The tree algorithms pay no attention to interaction between variables, and modeling interaction between variables can be utterly crucial in crafting the needed model. However, if interactions are important, and a decision tree is the modeling tool chosen, then interactions have to be explicitly included in the input battery (see the sketch after this list)
        • Note that a decision tree is a tool designed to characterize interactions between the input battery variables, considered one by one, and the output battery variable
        • A decision tree does not model the interactions between the input variables, nor does it characterize how any between-input-variable interaction relates to the output battery variable
        • When creating models using decision trees, it's very important to discover if between-input-battery-variable interactions are important
        • The decision tree needs only to store the rules that were discovered to be able to characterize any future data vector - that is to say, to identify the area of state space it falls into, and the properties of that area
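        A minimal sketch of explicitly adding an interaction term to the input battery before fitting a tree, since the tree itself splits one variable at a time (all data and column names are hypothetical):

        import pandas as pd
        from sklearn.tree import DecisionTreeClassifier, export_text

        data = pd.DataFrame({
            "tenure":  [1, 2, 8, 9, 1, 9, 2, 8],
            "spend":   [5, 50, 6, 60, 55, 7, 8, 52],
            "churned": [0, 0, 1, 0, 0, 1, 0, 1],
        })

        # The tree cannot construct this on its own, so the interaction is added explicitly
        data["tenure_x_spend"] = data["tenure"] * data["spend"]

        inputs = data[["tenure", "spend", "tenure_x_spend"]]
        tree = DecisionTreeClassifier(max_depth=2, random_state=0)
        tree.fit(inputs, data["churned"])
        print(export_text(tree, feature_names=list(inputs.columns)))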
        Rule Extraction:
        • Decision trees work by dividing the whole of state space into chunks, so that the data in each chunk characterize the whole chunk in some particular way
        • Rule extractors typically are not concerned with state space as such but search for common features among the vectors
        • Rule extraction works by generating covering rules. Not much mystery here: these rules cover a certain number of instances
        • The basic form of an extracted rule is "If condition, then outcome." Rules can also be lengthy and complex, with multiple conditions joined by "and": "If condition and condition and . . . then outcome"
        • But the additional power of "If, and only if, . . ." rules makes them a much rarer discovery, and highly explanatory when they do occur
        Clusters:
        • Clustering is another method of characterizing areas of state space. It also is an algorithm that doesn't necessarily characterize the whole of state space. A major difference from the previous algorithms is that the boundaries of the clusters are definitely not parallel to the axes
        • It is also important to note that clustering can be implemented as either a supervised or an unsupervised algorithm, and is the most popular of the unsupervised data mining algorithms
        • As an unsupervised algorithm, it simply tries to find ways of segregating the data, based on distance, with all of the variables in a vector equally weighted
        • When used as a supervised method, the clustering is made based on the values of one, or more, nominated variables - usually those about which a prediction is wanted
        • However, what they all have in common is the idea of distance in state space, finding neighbors, and then characterizing the boundary points
        • One interesting method of clustering is the self-organizing map (SOM). SOMs will serve as a simplified example of how clustering can work
        Self-Organizing Maps:
        • A self-organizing map (SOM) is created from elements that are called neurons, since a SOM is usually regarded as a type of neural network. Each neuron has a set of weights
        • Each neuron has one weight for every variable in the data set. Initially the SOM algorithm sets the weights for all neurons to random values so that every neuron has a different set of starting weights from every other neuron
        • Training a SOM occurs when an instance of data is presented to the SOM. One of the neurons will have weights that are the closest match to the instance, although at first they may not be very similar; nonetheless, one of the neurons will still be closest. This closest neuron raises a flag and "captures" the instance
        • Having captured the instance by already being the closest neuron, it then adjusts its weights to be even closer to the actual values of the instance. However, the neuron only goes part way with the adjustment, so although it is closer, it isn't identical in values to the instance
        • The SOM algorithm adjusts the neighbors of the capturing neuron, so that they are more similar to the capturing neuron, although they too are adjusted only part way, not to become identical to the capturing neuron
        • At this point the SOM is ready for another instance, and the training continues. As training continues, neighborhoods are adjusted to become more similar, and different instances are "attracted" to different neighborhoods
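        A minimal, heavily simplified sketch of this training loop in numpy - a toy one-dimensional map on random data, not a production SOM:

        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.random((200, 3))            # 200 instances, 3 variables, values already in [0, 1]

        n_neurons = 10
        weights = rng.random((n_neurons, 3))   # one weight per variable per neuron, random starting values
        learning_rate = 0.3

        for epoch in range(20):
            for instance in data:
                # The neuron whose weights are closest "captures" the instance...
                winner = int(np.argmin(np.linalg.norm(weights - instance, axis=1)))
                # ...and it and its immediate neighbors move part way toward the instance
                for j in (winner - 1, winner, winner + 1):
                    if 0 <= j < n_neurons:
                        step = learning_rate if j == winner else learning_rate / 2
                        weights[j] += step * (instance - weights[j])

        print(weights.round(2))

        A real SOM arranges its neurons on a two-dimensional grid and shrinks both the learning rate and the neighborhood as training proceeds; this toy keeps a single row of neurons to stay short.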
        Support vector machines:
        • Support vector machines (SVMs) recognize that the vast majority of the instances play no part in deciding how to place the separators in state space. All that really matters are the vectors at the "edge" of the "cloud" of vectors that represents the cluster
        • It is these edge vectors that support a separator; so long as the separator passes between these edge vectors, the rest of the instances fall on the correct side and play no part in placing it
        • SVMs result in clusters that are delineated by separators that, again, aren't necessarily parallel to the axes of state space, and may indeed wiggle about a fair amount. It's a relatively easy matter to store the vectors and use them to discover where any other instance vector falls when the model is to be used. Explanations are not always intuitively accessible
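        A small sketch using scikit-learn's SVC on made-up data, showing that only the edge vectors (the support vectors) are retained by the fitted model:

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])   # two made-up clouds of points
        y = np.array([0] * 50 + [1] * 50)

        model = SVC(kernel="linear").fit(X, y)
        print("instances:", len(X), "support vectors kept:", len(model.support_vectors_))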
        Smooth Representations:
        • The algorithms in the previous subsections described various ways of chopping state space into discrete areas. There may be many different areas generated by any one of the techniques, but each area has its own distinct value and characterization. Crossing one of the separators results in stepping from one characterization to another - no gradual change, just a discontinuous step. This isn't the only way to characterize state space; a number of techniques result in smoothly changing output values rather than the step changes in values so far generated
        • Many of these techniques are a result of what are called regressions for reasons that aren't important to the explanation of algorithms here
        • Many types of regression are thought of as statistical algorithms, and indeed they are. However, just because they are statistical algorithms doesn't mean they aren't also useful as data mining algorithms
        • Sometimes the data spreads through state space in a way that is far more analogous to a line or a curve in two dimensions, although it may wiggle about a good bit. When this is the case, rather than chopping the line into patches as a way of grouping similar points, another way to characterize them is to find how to best draw the higher order analog of a line through the points. This is the approach taken by smooth representations, so called because the value represented by the line changes smoothly and continuously as the line is traversed between state space areas
        Linear Regression:
        • Linear regression is an archetypal statistical technique, one that is taught in basic statistics courses. It is, nonetheless, one of the most powerful and useful data mining algorithms, and not to be excluded on the basis of its statistical heritage
        • It's linear regression that enables the miner to make valuable and insightful discoveries in data, as well as easy explanations of what's happening inside the discovered relationship
        • Linear regression is a tried and tested way of fitting a single straight line through state space so that the line is as close as possible to all of the points in the space. Now, to be sure, when state space has more than two dimensions, it isn't exactly a line
        • Invariably in data mining, there is more than one input variable involved - often hugely more. Nonetheless, the linear regression method can be easily extended to deal with multiple input variables - often called independent variables in regression analysis - so long as there is only one output (dependent) variable to predict. This extension is called multiple linear regression
        • In spite of the apparent inherent non-linearity of most real-world phenomena, it is well worth noting that the vast majority of relationships that a miner encounters in business data turn out to be either linear, partially linear, semi-linear, or linearizable
        • No data miner should underestimate the importance and utility of linear and multiple-linear regression as a data mining tool
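        A minimal sketch of multiple linear regression with scikit-learn on hypothetical variables (age and income as independent variables, spend as the dependent variable; every value is invented):

        import numpy as np
        from sklearn.linear_model import LinearRegression

        X = np.array([[25, 30_000], [30, 42_000], [38, 55_000], [45, 80_000], [50, 95_000]])
        y = np.array([210.0, 300.0, 420.0, 610.0, 730.0])

        model = LinearRegression().fit(X, y)
        print("coefficients:", model.coef_, "intercept:", round(model.intercept_, 2))
        print("R squared on the training data:", round(model.score(X, y), 3))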
        Curvilinear Regression:
        • Methods range from quite easy (read "quick") to algorithmically easy but computationally intensive (read "slow"), and can discover almost any curve that's needed to fit what's called curvilinear data
        • When curvilinear relationships exist in data, generally speaking, it's not hard for data mining tools to discover them. Very often, what is hard is to explain them
        • Curvilinear regressions are powerful algorithms for discovering relationships when incorporated into mining tools. They are great for making predictions, but intuitive explanations may have to be discovered elsewhere since what nonlinear techniques reveal is generally not intuitively easy to represent 
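        As one simple example of fitting a curve rather than a straight line, here is a sketch using numpy's polynomial fit on made-up data:

        import numpy as np

        rng = np.random.default_rng(2)
        x = np.linspace(0, 10, 50)
        y = 3.0 + 2.0 * x - 0.25 * x**2 + rng.normal(0, 0.5, 50)   # a noisy quadratic, invented for illustration

        # Fit a second-degree curve to the points
        coefficients = np.polyfit(x, y, deg=2)
        print("fitted coefficients (highest power first):", coefficients.round(2))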
        Neural Networks:
        • Neural networks are no more and no less than algorithms for building nonlinear regressions. They don't actually perform regressions in the same way that statistically developed regression techniques work, and they do have the potential, when properly configured, to characterize highly complex and convoluted nonlinear relationships that are really tough to discover with other nonlinear regression techniques
        • Correctly configuring a neural network is a very difficult art, and art it is, not science. Neural networks are highly complex algorithms, and the tools that encapsulate them have good rules of thumb, in general, for setting them up
        • It's very hard to learn to tune a neural network, and there's no practical way to determine if more tuning would work better except to keep trying 
        • The best way to gain experience is to build models using a variety of tools - say trees, rules, and neural networks - and carefully study differences in performance and results. It's also worthwhile to consider characterizing a relationship discovered with a neural network using some other technique that is easier to explain
        Bayes and probability:
        • All of statistical analysis is based on probabilities. In general, it is based on a particular type of probability - that represented by a simple frequency of occurrence of events
        • Bayesian methods are a way of starting with one set of evidence (in the form of multi-variable data) and arriving at an assessment of a justifiable estimate of the outcome probabilities given the evidence
        • Bayes' formula also specifies how the probabilities should be revised in the light of new evidence
        • Theoretically, for instance, in order to discover the actual probability of an outcome given some multivariate evidence, a Bayesian method called naive Bayes requires all of the variables to be "independent" of each other. That means that, theoretically, "age" and "income" can't be used together since as age increases, income also tends to increase in most data sets
        • Bayesian-based probability models very often work remarkably well in practice, even when many of the theoretical constraints are obviously breached
        • Using the most probable chains of variable interaction and given a large set of multivariate evidence (data), various types of Bayesian networks can be constructed either manually or automatically
        • The automatic construction of Bayesian networks produces models that can be used for either explanation or prediction. Ultimately, all such networks are based on multivariate joint frequencies of occurrence in data sets combined using Bayes' Theorem, and are useful tools in mining data
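        A minimal numeric sketch of Bayes' formula revising an outcome probability in the light of new evidence (all of the probabilities below are invented for illustration):

        # P(outcome | evidence) = P(evidence | outcome) * P(outcome) / P(evidence)
        p_outcome = 0.01                     # prior: 1% of customers respond (invented)
        p_evidence_given_outcome = 0.60      # 60% of responders opened last month's e-mail (invented)
        p_evidence_given_no_outcome = 0.10   # 10% of non-responders opened it (invented)

        p_evidence = (p_evidence_given_outcome * p_outcome
                      + p_evidence_given_no_outcome * (1 - p_outcome))
        p_outcome_given_evidence = p_evidence_given_outcome * p_outcome / p_evidence
        print(round(p_outcome_given_evidence, 3))   # the 1% prior is revised upward to about 5.7%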
        Discontinuous and Nonfunctional Representations:
        • The truth is that in almost all cases in data mining where the discovered relationships are not linear, they are either curvilinear or clusters
        • A function is a mathematical construction that produces a single unique output value for every input value
        • All of the linear and nonlinear methods, including neural networks, are essentially methods for discovering a function that relates inputs to outputs
        Evolution Programming:
        • Evolution programming is potentially the most powerful relationship discovering algorithm in the data miner's toolkit
        • What it intends to convey is that the algorithm noted can theoretically discover and represent any possible mathematical function
        • First, mathematical functions only represent a subset of relationships that exist in data - and only the simple ones at that. And second, just because it's theoretically possible for an algorithm to get even that far doesn't mean that it actually will in practice
        • Evolution programming is not, even theoretically, a universal function approximating tool. It may find a functional representation if it's there to be found, but this tool is perfectly happy to find any other relationship that can be characterized. It is quite capable of discovering "If . . . then . . . " types of rules, and then of characterizing what's going on in the neighborhood with a function, if appropriate - or, in fact, of discovering any other type of relationship
        • Essentially, evolution programming evolves computer programs that computationally represent the interrelationship between the two data sets, input and output. This means that, in theory anyway, if the relationship is computable, evolution programming will get there. This is a stronger claim than universal function approximation. The practice turns out to be not so very different most - but not all - of the time
        • Essentially, evolution programming works on the individual symbols and groups of symbols that form components of simple computer programs
        • These are typically the mathematical parts, like "+" and "-" and other mathematical operators such as "sine" and "logarithm," logical operators like "and" and "or," program flow like "If, then, else," and program variables, plus many others
        • The discovery of relationships using these methods is called symbolic regression. Of course, it's a computational regression rather than a mathematical one, and what is produced is often a combination of continuous functions and discontinuous rules
                          Tools and Toolsets: This part aims to give a practical introduction to mining data, not with algorithms in the raw, but with commercial data mining tools.
                          • Megaputer Intelligence (www.megaputer.com): PolyAnalyst, the data mining tool kit from Megaputer Intelligence, comprises a well-integrated array of mining tools that can all be accessed from a common interface. The interface is project-oriented, but organized around a data set referred to as "world." The suite includes tools for manipulating data sets, splitting data sets, and creating a limited number of extracted features. This is a client/server tool in which the server mines the data based on requests initiated by the client. Results are also returned to the client
                           The menu shows 11 different exploration tools, each of which embodies a different algorithm for exploring data sets:
                          • Summary Statistics: A place to start with a data assay
                          • Find Dependencies: Similar to the single variable CHAID analysis. The algorithm used isn't CHAID, but the intent and purpose are the same - given a variable, quickly discover what associates with it in the data set. This is an exploration tool rather than a modeling tool
                          • Linear Regression: This is a very useful tool. This implementation is more useful than straightforward linear regression in that it automatically selects the most important variables for the regression. Plenty of graphical and text reporting from this tool makes it a great place to start an exploration of a data set
                          • Cluster: An unsupervised version of the algorithm that is useful for exploring a data set
                          • Decision Tree: Offers a fast and visual way to examine data. The tree is presented as a sort of hierarchical structure from left to right, rather than as the more traditional tree. It is a binary split tree, which is to say, unlike the CHAID tree, this tree splits each variable into two sections
                          • PolyNet Predictor: A neural network with automatic setting of the internal network parameters. This version will only build predictive models and doesn't reveal anything explanatory about the data
                          • Nearest Neighbor: The MBR tool. It is slow and powerful, but the resulting model can be used only to make predictions, as there is no way of examining the neighborhoods
                          • Find Laws: Somewhat similar to symbolic regression. This produces expressions that can include "If . . .  then" structures, as well as more familiar function-type mathematical expressions
                          In addition to these exploratory and modeling algorithms, PolyAnalyst includes some special purpose tools:
                          • Basket Analysis: Pretty much a free-standing tool that looks for associations in voluminous transaction files
                          • Classify: Used to separate a data set into two classes based on relationships discovered. Classify finds the optimum way to actually separate the data set
                          • Discriminate: Uses two data sets. One data set is the "world" data set that is modeled. The other data set is compared to it and is essentially assessed instance by instance to determine if the new data set is similar to the data set modeled, or different from it
                          It should also be noted that Megaputer Intelligence has two other tools: TextAnalyst, which is used for analyzing text, and WebAnalyst, a custom tool set for Web data analysis.
                          • Angoss Knowledge Studio (www.angoss.com): KnowledgeSEEKER and KnowledgeSTUDIO are the flagship products of Angoss and have an emphasis on gleaning knowledge from, or discovery of knowledge in, data. In fact, its aim with the original tools was to facilitate data exploration rather than to produce predictive modeling. That philosophical underpinning is still present in its current suite; the tools and products mainly focus on discovering knowledge in data. KnowledgeSTUDIO incorporates KnowledgeSEEKER as a main exploration tool.
                          • WizWhy (www.wizsoft.com): WizWhy is a single algorithm-based rule extraction tool. The actual rule induction algorithm is proprietary but contrives to finesse the combinatorial explosion handily. This rule extraction tool is capable of extracting necessary and sufficient rules. The author's web site contains a paper titled "Auditing Data WizRule and WizWhy - What Do I Do with All These Rules?" describing the author's approach to using the rules
                          • Bayesware Discoverer (www.bayesware.com): Bayesware Discoverer is a tool for both building and discovering Bayesian nets. Bayesian nets consist of networks of nodes that are connected together. All of the nodes interact with each other in accordance with the principles of Bayesian probability calculations. This tool will actually discover the relationships in a data set and express them in the form of a network
                          • e (www.sdi-inc.com): The tool "e" performs symbolic regression using evolution programming
                          • Microsoft SQL Server 2000 (www.microsoft.com): SQL Server 2000 from Microsoft is a strong enough entrant that it needs to be mentioned. As a significant data management system, it is gaining strength, features, and market share, and is becoming ever more widely used as a business solution. This version has embedded data mining capabilities in the database, and so places data mining as a no-cost capability into any database or data warehouse built on the platform. The product offers two algorithms, a decision tree and a neural network. One of the great strengths of this approach is that the mining suite is integrated into the data management system along with the data itself. This encourages data exploration and modeling as an easy activity. One of Microsoft's gifts to all users of personal computers is a standard way of using and accessing the PC's power. This gift is embedded in the operating system known generically as Windows
                          Summary: Algorithms are theoretical and mathematical constructions that outline methods for exploring data. When the algorithms are wrapped in a user interface, and tried and tested against the vagaries of the real world and still found to perform - when supported with documentation, technical support, bug fixes, training, help screens, and all the other facets that go to make a robust product - they are tools. No tool supports an algorithm in its raw form - there wouldn't be any point since it can't be used in that way. However, when suitably modified, algorithms do form the core of all mining tools. Different algorithms imply different capabilities and different constraints, and it's the capabilities and constraints that the underlying algorithms imply that become important to a miner when constructing mined models.
                          Tools and software suites come in a variety of guises. Some support multiple tools within integrated suites; others support single algorithms. 
                          ==================================================
                          Getting the Initial Model (Basic Practices of Data Mining) - Part I
                          Getting the Initial Model:
                          • Data mining is not simply concerned with applying either algorithms or mining tools to a data set, and then standing back and waiting for magic to happen. First and foremost, data mining helps to solve those types of business problems that require insight and action based on collected data
                          • Acquiring the required insight and taking action does require applying algorithms to data in the form of mining tools. But crucial to the whole endeavor is that it takes the intelligent and knowledgeable application of those tools to data to achieve meaningful results
                          Preparing to stay Honest: 
                          • One of the biggest problems in mining data, particularly for an inexperienced miner, is that it's all too easy to inadvertently lead yourself astray. To avoid this pitfall, take precautions to make sure that the results achieved from mining are actually worth having
                          • This means that the results, in whatever form they appear, should be applicable to the business problem, which in turn generally means that they should apply to the world at large
                          Since the data comes from the world, there shouldn't be any problem with the model applying to the world. However, there are three separate "gotchas" to contend with: bias, garbage, and oversearching. 
                          • Bias: Data might not actually represent the part of the world that the model (through the business problem) is meant to address. Bias means that the data is somehow skewed away from addressing the issue expected, or that the relationships in the data on hand will be different from those in a data set collected about the actual population to whom the results are to be applied
                          • Garbage: The second gotcha is that every real-world data set contains more information than just the information about the relationships of interest. Some of the information is truly meaningful, but for any number of reasons, all real-world data sets that a miner in a business situation will see also will contain what is best characterized as garbage
                          • Oversearching: If you look long enough and hard enough for any particular pattern in a data set, the more you look, the more likely you are to find it - whether it's meaningful or not. So it is with data mining, which essentially looks and looks and looks again for patterns in the data; the fact that any particular pattern is discovered isn't necessarily as meaningful as it might at first appear
                          So these are three gotchas - bias, garbage, and oversearching. The key to taming these problems is to use at least three data sets, not one. Three is the minimum. How do you get the three needed data sets? Take the original data set developed through the assay and chop it into three parts. Call one the training data set, one the test data set, and the other the evaluation data set.
                          Ideally, each of these three data sets needs to be large enough to pass the tests for having enough data, which means that each of the three data sets must be large enough to be representative of the pool from which it was drawn. 

                          However, it turns out that it takes more data for a mining tool to learn a relationship than it does to confirm a discovered relationship. This can be seen intuitively by realizing that it takes several to many examples or repetitions for us to learn a relationship, but it takes only one brief error to show if we didn't learn it. That's why in school it takes a semester to learn a topic, but that learning can be assessed in a test that lasts only about an hour. So the test and evaluation data sets can have relaxed representativeness requirements. In other words, the training data set must be as representative as possible of the full range of variations that the world is likely to show, whereas if data is not plentiful, it's possible to get away with less representative test and evaluation data sets.
                          As a rule of thumb, and it is only a rule of thumb, with plenty of data on hand, divide the data into 60/20/20% for train, test, and evaluation, respectively.
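                          A minimal sketch of the 60/20/20 split using scikit-learn's train_test_split applied twice (the file name is a hypothetical stand-in for the assembled, assayed data set):

                          import pandas as pd
                          from sklearn.model_selection import train_test_split

                          df = pd.read_csv("prepared_data.csv")   # assumption: the prepared data set from the assay

                          # Peel off 60% for training, then split the remaining 40% in half for test and evaluation
                          train, rest = train_test_split(df, train_size=0.6, random_state=0)
                          test, evaluation = train_test_split(rest, train_size=0.5, random_state=0)
                          print(len(train), len(test), len(evaluation))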

                          Addressing the Data: 
                          • Actual data mining activities address data sets, and although the whole mining endeavor is set up to solve a business problem, at some point the miner must address the data directly. At this "directly addressing the data" level, the actual process of data mining becomes remarkably simple
                          • In essence, all the data mining tools do is to discover a relationship between two parts of a data set - the input variables and the output variables - such that specific configurations of values in the input variables map, more or less well, to specific configurations in the output variables
                          • Apart from data preparation issues, a data miner addresses data directly by building a purposeful model. In general, there are only three types of purposes for constructing a model: for understanding, for classification, and for prediction
                          • These three ways of modeling data are partly related to the nature of the business problem, partly to the nature of the input and output data sets, and partly to the nature of the relationship sought
                          Input and output data set configuration:
                          • All business data sets for mining consist of many variables - usually tens, often hundreds, and sometimes thousands. In rare cases, some data sets may have tens of thousands of variables. The variables are grouped into what are called batteries, so the group of input variables is often called the input battery, and the group of output variables is called the output battery
                          • The prediction variable might have been a category indicating if someone was predominantly a beer, wine, or spirits drinker. Or again, it could have been a continuous variable, say a measure of average beer consumption over some period
                          • A miner would explore how batteries A and B relate to C, A and C to B, and B and C to A. What's important in this case is understanding the relationships, not classification or prediction
                          • Variables, whether input or output, can individually be binary, categorical, or continuous. Variables in a data set are grouped into batteries and each battery may contain a mixture of variable types. One way of modeling a data set relates the variables to each other; for instance, determining how beer consumption varies with income and occupation. Another way of modeling data relates instances to each other; perhaps determining the number of typical types of beer drinkers. Basic ways of modeling data include modeling for understanding, modeling to classify, and modeling to predict
                          • Tool Selection Matrix and Algorithm Selection
                            Missing Value Check Model:
                            • With the data set assembled and partitioned (into training, test, and evaluation data sets), the very first model to build should be a missing value check model (MVCM)
                            • Start by making a copy of all three data sets. In each data set copy, replace all of the values in the variables of the input battery that are not missing with a "1" and all of the values of the input battery variables that are missing with a "0." Notice that only the input battery variables are modified, not those of the output battery. You end up with an all-binary input battery of variables, and the output battery remains untouched. Now, regardless of what type of model the business problem ultimately calls for, create a predictive model that attempts to predict the value of the output variables
                            • Now the only information that remains in the input data set is whether or not there is something entered for a variable's values. If after this transformation any of these variables show any pattern that relates to the output variable, then it has to be a pattern based entirely on the presence or absence of data
                            • The MVCM often reveals a lot of interesting and applicable information exactly of this type. Working with the MVCM is always worthwhile, and each variable of significance can usually be examined
                            • However, before leaving the MVCM data set, it's worth using it to illustrate how the training and test data sets work as a check to help confirm that what is discovered is actually a real phenomenon rather than simply a meaningless aberration
                            • MVCM is used to discover the presence of bias in the data set and to note and explain the effect that missing values have in the data set. The technique also discovers when, where, and how missing value information has to be explicitly included in a data set to improve the model
                            Applied Honesty: Using Training and test data sets
                            • There's a lot of noise ("garbage") in this data set, and what the model learned as predictive patterns were only present in part of the data set. That's the whole point of using multiple data sets. Intuitively,  any real underlying relationship will exist in all of the data sets, but spurious noise won't remain constant, and will differ from data set to data set
                            • So these two data sets reveal that most likely there isn't enough data here to make a reliable model, and at any rate, this model isn't to be relied on since its accuracy in any new data set may vary wildly
                            • The training data set is used continuously in creating the model. A miner will almost certainly make several, perhaps many models over the course of a project. Each will be checked against the test data set to determine the model's level of calibration accuracy - in other words, how well it fits the world
                            • The whole objective is to build a model exclusively in the training data set that fits the test data set as well as possible
                            Modeling to Understand:
                            • Whenever the underlying question about a data set is "why?", the miner needs to provide answers that will help explain what's happening in the world. Of course, a data set is limited, and any answer can be determined only in terms of the data contained in the data set
                            • The very first question a miner must address is: can the data be transformed into more meaningful features? Next, what tools are appropriate for this data?
                            • It's essential that the explanation can be communicated clearly and succinctly and in terms that business users can understand
                            • Regardless of the underlying algorithm, the interface and display format of mined results go a long way toward building effective presentations - and graphical presentations can be very helpful indeed in presenting results
                            • It's also worth keeping in mind that explanations of a data set work best when they are crafted in one of three ways: either explain one variable at a time, or explain linear relationships, or refer to labeled aggregates as a whole
                            • Three tools that are good starting points for gaining an understanding of data are decision trees, self-organizing maps (SOMs), and linear regression
                            Modeling for Understanding using Decision Trees:
                            • Explanatory models communicate relationships of business relevance in a data set in easily understood words and illustrations
                            • A more important consideration than the number of splits in selecting a tree tool for exploring and explaining data is that the tree should be highly controllable 
                            • In explaining data, it is very important that the tree tool does not steer the miner, but allows the miner to steer the tool! This is very important. Without steerability, the miner is limited to explaining what the tree algorithm wants to explain, not what the miner wants to explain
                            • For explanatory modeling purposes, it is crucial to be able to step through the split variables and examine the relationships
                            • A second important criterion for tool selection is that it is valuable to be able to look at both absolute and relative weights among the leaves at each split
                            • So any explanatory tree tool must make it easy for the miner to look at the relationships presented in as many different ways as possible, but certainly in both relative and absolute representations
                            Once you are comfortable with the decision tree tool you have chosen, turn now to using it for explaining data in the three ways mentioned: one variable at a time, linear relationships, and labeled aggregates.

                            Modeling for Understanding using SOMs: 
                            • Inherently, a SOM is not a supervised learning tool. This means that, unlike a tree, which requires a specific objective to build a relationship, the SOM tool doesn't create a map about anything in particular, simply about the data in general
                            • Essentially, after mapping, the SOM presents all of the relationships together and at once, rather than through sequential "prodding" at relationships one at a time as with a tree
                            • The SOM tool creates not only an overall map of the data, but also a series of other maps that includes one map for each of the variables in the data set
                            • Color plays a large part in interpreting SOMs since almost all SOM tools use color variations and shades to indicate features of the data
                            • Exploration of a SOM results in what is very much a qualitative understanding of the data and the relationships that it enfolds
                            • The strength of each relationship can be easily seen by the congruence of colors and patterns across and between the variables and the main map
                            Modeling for Understanding using Linear Regression 
                            • It is a tool for understanding how combinations of variables relate to the output
                            • It is, as the name implies, not going to find any nonlinear relationships - although that may not be all that important since many of the relationships a miner will work with are essentially linear anyway, or approximately linear. So, as a quick look at data, linear regression works very well
                            • Linear regression is very much affected by which variables turn out to be the most significant - those variables can have a dramatic effect on the weightings of the other variables
                            • It's always worthwhile removing the most important variable or two from consideration and rebuilding the regression to look for other relationships that the strongly interacting variables may mask
                            Summary of Understanding Data Sets: 
                            • All data exploration, in the end, has to shed some light on how to address the business problem - preferably illuminating previously unknown and, best of all, non-obvious relationships in a data set. But most of all, explanations have to be delivered in a way that seems intuitive to the business manager - that is, in business terms, not in data terms
                            • One major problem for any miner is to determine what is the most appropriate way to represent the problem in data - in other words, to define the objective
                            • These are the basic techniques of explanation - single variables, linear relationships, and identified segments. Presenting the results, however, requires both qualitative explanation and quantitative explication of the size of the result to be expected
                            • Here, too, is where a miner needs to be creative - and always remember that a picture is worth the proverbial thousand words. Illustrations - relevant and simple illustrations - present powerful results. Illustrations that need more than a thousand words to explain them are probably of little help! The key to explaining data is practice and familiarity 
                            ===================================================
                            Getting the Initial Model (Basic Practices of Data Mining) - Part II
                            Modeling to Classify: 
• Statistical classification is a knotty problem. It is a very difficult problem indeed to decide how a data set should best be divided into classes (or classified). The technical issues are formidable, but fortunately the data mining practice is much easier than the statistical theory
• Classification, to a data miner, looks very much like a special case of what is colloquially called "prediction." In other words, classification is often expressed as a problem in trying to predict to which class an instance belongs
                            • The predictive or classification model will produce a score that is a continuous number between 0 and 1, rather than a binary score equal to either 0 or 1 
                            Balancing Data Sets: 
                            • The essential difficulty in modeling to classify is that the world often does not cooperate in helping solve the miner's problems
                            • For many tools, a 1% response rate simply isn't good enough to build a decent model. All that a mined model has to do to get a 99% hit rate in such a data set is to predict "0" all the time. After all, 99 times out of 100, with a 1% response rate, the model will be spot on. This is what is sometimes called the naive prediction rate (or naive error rate), and in order to be effective, any model has to do better than this
                            • In classification and predictive models, to have a 99% accuracy rate (or any other number other than exactly 0% or 100%) doesn't mean much at all. What is important is the lift, or how much better a model does than the naive prediction rate
                            • In order to get the needed information exposed to the tool, the data set has to be adjusted
• Succinctly put, to create an effective model we need to have a higher ratio of buyers to non-buyers in the training, test, and evaluation data sets than the real world produces
• To make the adjustment, first construct the data representation models. These models, constructed on the whole data set, are used to verify that there is sufficient data to represent the whole population, so long as it was appropriately sampled in the first place. With these models already constructed, remove non-effect data at random until the desired balance is reached (a minimal sketch of this rebalancing follows the list)
                            • At this point, use the data representation models to check that they still apply to the adjusted data set. The adjusted data set should (by the performance of the check models) still represent the population - at least, insofar as it's possible to confirm this
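The rebalancing step just described can be sketched in a few lines of Python. This is a minimal illustration only, assuming a pandas DataFrame with a binary response column; the column name, the 50/50 target ratio, and the file name in the comment are hypothetical choices, not prescriptions from the text.

import pandas as pd

def balance_by_downsampling(df, target_col="response", ratio=1.0, seed=42):
    """Randomly remove non-responders ("non-effect" data) until the data set
    holds roughly `ratio` non-responders per responder."""
    responders = df[df[target_col] == 1]
    non_responders = df[df[target_col] == 0]
    n_keep = min(len(non_responders), int(len(responders) * ratio))
    kept = non_responders.sample(n=n_keep, random_state=seed)
    # Shuffle so responders and non-responders are interleaved again
    return pd.concat([responders, kept]).sample(frac=1.0, random_state=seed)

# Example (hypothetical data): a 1% response rate rebalanced to roughly 50/50
# df = pd.read_csv("solicitation_results.csv")
# balanced = balance_by_downsampling(df, "response", ratio=1.0)

After rebalancing, the data representation check models mentioned above would still need to be applied to confirm that the adjusted data set remains representative.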
                            Building a Dichotomous Classification Model: 
• The idea of a classification is to put each instance in a data set unambiguously into the relevant class. However many classes apply to the data set as a whole, each individual instance can belong to only one class
                            • As far as a model trying to classify it goes, either it is correctly classified, or it isn't. In a way, as far as any individual instance is concerned, there are only two classes that are relevant - the right one, and all the other classes lumped together which make a wrong one. This becomes important when measuring how well a classification model works, and in deciding how to adjust the model to improve the classification
                            • It's easy to determine the performance of a classification model, but only by analyzing the model's performance in detail
                            Classification Errors: 
• A two-class classification model can produce two classifications leading to four results, taking "1" to indicate class membership and "0" to indicate class non-membership
                            • The model can classify as 1 (class 1) when the actual result is 1 (Is 1), or it can classify as 1 (Class 1) when the actual result is 0 (Is 0). Similarly, it can indicate 0 (Class 0) when the actual result is 0 (Is 0), or indicate 0 (Class 0) when the actual result is 1 (Is 1). This forms the basis of what is called a confusion matrix because it allows an easy indication of where the model is confused (classifies 1 for 0 and 0 for 1) and where it isn't confused
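The four cells just described are simple to tabulate. The sketch below is a bare-bones illustration, assuming two equal-length sequences of 0/1 actual values and 0/1 classifications.

def confusion_matrix(actual, classified):
    """Count the four outcomes of a two-class classifier:
    (Class 1, Is 1), (Class 1, Is 0), (Class 0, Is 1), (Class 0, Is 0)."""
    cells = {("Class 1", "Is 1"): 0, ("Class 1", "Is 0"): 0,
             ("Class 0", "Is 1"): 0, ("Class 0", "Is 0"): 0}
    for a, c in zip(actual, classified):
        cells[("Class %d" % c, "Is %d" % a)] += 1
    return cells

# The model is "confused" wherever the (Class 1, Is 0) and (Class 0, Is 1)
# counts are non-zero.
print(confusion_matrix([1, 0, 1, 0, 0], [1, 0, 0, 0, 1]))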
                            Classification by Score: 
                                  • Often, the nature of the data, or even the limitations of the tools available to a miner, requires a tool that produces a continuous numeric output - in other words, a number
                                  • This number is a score, indicating the proper class to assign to a given instance. This continuous variable output can offer more power and flexibility than a binary categorical assignment, depending on the actual business problem, but the problem for the miner is in deciding  exactly how to use a continuous score
                                  • With a continuous score, the model response rate will clearly depend on where the score cutoff is set for assigning instances to the different classes. For instance, if the 10 highest scored instances were all in one class, say "1," and the 11th wasn't, if the miner set the cutoff value between instances 10 and 11, the model may produce a perfect predictive result - but identify so few responders as to be of little practical use. Sliding the cutoff to some lower value allows in more instances, but also reduces the discriminatory power of the model
                                  • The miner's problem involves finding the optimal place to make the cutoff, or, in other words, how to adjust the cutoff value to get the best possible confusion matrix given the needs of the business problem
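One way to explore the cutoff question is simply to sweep candidate cutoffs and inspect the confusion matrix each one produces. The sketch below assumes arrays of continuous scores and 0/1 actual outcomes; which row counts as "best" is deliberately left open, since it depends on the needs of the business problem.

import numpy as np

def cutoff_table(scores, actual, cutoffs):
    """For each candidate cutoff, classify instances scoring at or above the
    cutoff as class 1 and tabulate the resulting confusion matrix counts."""
    scores, actual = np.asarray(scores), np.asarray(actual)
    rows = []
    for c in cutoffs:
        classified = (scores >= c).astype(int)
        tp = int(((classified == 1) & (actual == 1)).sum())
        fp = int(((classified == 1) & (actual == 0)).sum())
        fn = int(((classified == 0) & (actual == 1)).sum())
        tn = int(((classified == 0) & (actual == 0)).sum())
        rows.append({"cutoff": c, "class1_is1": tp, "class1_is0": fp,
                     "class0_is1": fn, "class0_is0": tn})
    return rows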
                                  Building a continuous classification model: 
• The neural network is trained on the training data set, and only after training is complete and the model created is it applied to the test data set. Neural network algorithms are often described as needing training and test data sets. For instance, the PolyAnalyst tool automatically, and invisibly to the miner, creates the network and performs its own internal testing, so it only uses the training data set. The test data set is not used at all in constructing the network, only for applying the resulting model
                                  • If the model is successful at classifying the instances, the actual chance of discovering a class 1 instance will be higher than this in the highly scored sections, and lower than this in the lower-scored sections. The actual frequency of Class 1 instances discovered in the data set when it is ordered by the value of the prediction can be plotted against the accumulated average expectation. This is what is known as a cumulative response curve, since it literally shows the accumulated response on the data set ordered by the score
                                  • The scale along the bottom shows the number of instances to that point. The diagonal straight line running from lower left to upper right shows how many Class 1 instances would be expected up to that point if the data set were actually ordered randomly. The curve that rises above the line is the actual cumulative response curve, showing how many Class 1 instances have actually been discovered up to that point. If the number of instances is more than the random ordering expectation, the cumulative response curve appears above the random expectation. Of course, if the number were less, the cumulative response curve would appear below the random expectation
                                  • This cumulative response curve (often erroneously called a lift curve) is a very common way of displaying the quality of a continuous classification model. The problem is that it isn't apparent how to use this curve to decide the location of the best cutoff point. The answer is to plot the actual lift - that is, the amount by which the cumulative response is greater than the random expectation
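A cumulative response curve and the corresponding lift can be computed directly from the scores and the actual class labels. This is a minimal sketch assuming plain numeric arrays; a plotting library would normally be used to draw the curves.

import numpy as np

def cumulative_response_and_lift(scores, actual):
    """Order the data set by descending score, accumulate class-1 counts,
    and compare with the random-ordering expectation to obtain lift."""
    scores = np.asarray(scores)
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(-scores)                 # highest scores first
    cum_hits = np.cumsum(actual[order])         # cumulative response curve
    n = np.arange(1, len(actual) + 1)
    random_expectation = n * actual.mean()      # the diagonal straight line
    lift = cum_hits / np.maximum(random_expectation, 1e-12)
    return cum_hits, random_expectation, lift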
                                  Building a Multiple classification model: 
• The techniques developed to create classification models have focused on the situation in which the data set should be partitioned into only two classes. Very many workaday data mining models do, in fact, require no more than a single, two-class model. However, many problems crop up that require the classification of far more classes. Every time the problem entails selecting one action from several choices, one product from many products, one offer from several possibilities, one page from several pages, or a single selection from several choices, the miner faces a multiple classification problem
                                  • This calls for building several classification models since a model separating, say, Class A from Class B and C may well not do a very good job separating Class B from A and C
• The basic techniques - developing a naive confusion matrix, a model confusion matrix, a cumulative response curve, and the lift plot - form the basis of modeling multiple classes
                                  • A classification problem never calls for using two models. With two classes one model is enough. With three classes, however, three models are needed, one per class
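The several-models idea amounts to a one-versus-rest scheme: one two-class model per class, each separating that class from everything else. The sketch below uses scikit-learn's logistic regression purely as a stand-in for whatever mining tool is actually chosen; the function names are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_model_per_class(X, y, classes):
    """Train one two-class model per class: class k versus all other classes."""
    return {k: LogisticRegression(max_iter=1000).fit(
                X, (np.asarray(y) == k).astype(int))
            for k in classes}

def score_per_class(models, X):
    """For each class, return the modeled probability of membership in it."""
    return {k: m.predict_proba(X)[:, 1] for k, m in models.items()}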
                                  The perfect classification model:
                                  • Unfortunately, perfect models are rare, and often suspect when they do turn up because of possible leakage from anachronistic variables and the like. What does a real-world model produce?
                                  Using a single multiple - classification model: 
• The actual lift curve for a single model classifying the classes is used. The model is trained on the classes in the training data set, and is then used to create a single score in the test data set. The test data set is then ranked from lowest score to highest
• The actual lift curves are at best only a very rough approximation of anything shown by the perfect model lift curves. However, as far as using this information for classification goes, the same principle applies. To make the optimal classification, select the curve that has the steepest upward trend, moving from left to right, as the class in which to make the assignment
                                  Combining multiple classification models: 
                                  • An alternative approach to the single model classification approach is to create several separate models, one predicting each class
• The combined outputs of these models can be assembled into a multi-variable intermediate data set, with one variable for each separate model's predictions. A neural network model is then built on this intermediate data set to combine the individual predictions into a single score that can be used for ranking the test data set exactly as before with the single model (a sketch of this combination follows the list)
                                  • By comparing the tables for naive, single, and combined models with each other, it's easy to see that the combined model does produce better predictions than either the single model or the naive model. The ranking produced by combining multiple models, and discovering the associated cutoffs by inspecting the smoothed lift curves, does improve performance
• The answer to how "good" these models are can only - and that is only - be given in terms of the business objective
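As a rough sketch of the combination step (the assembly of an intermediate data set and a combining network mentioned above), the code below assumes a dictionary of fitted per-class models such as the one-versus-rest models in the previous sketch, and uses a small scikit-learn neural network as the combining model. It is illustrative only; in practice the combined model would be evaluated on the test data set as described.

import numpy as np
from sklearn.neural_network import MLPClassifier

def build_combining_model(models, X_train, y_train):
    """Assemble each model's predictions into an intermediate data set
    (one column per model) and train a network that combines them
    into a single score."""
    intermediate = np.column_stack(
        [m.predict_proba(X_train)[:, 1] for m in models.values()])
    combiner = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)
    combiner.fit(intermediate, y_train)
    return combiner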
                                  Summary of modeling to classify: 
                                  • Very often, a miner works with classification models when the required business result is described in colloquial business terms as a "prediction." Scoring a data set is a classification problem, not a prediction problem. Selecting one object or action from a choice based on data is a classification problem
                                  • Splitting data sets into classes - who will respond to a solicitation and who won't, and who is in one age group rather than another is the bread-and-butter work of data mining
                                  • Much of the purpose behind understanding data sets is to build better classification models. Sometimes the number of classes is small - very often, only two. Responders either do or do not respond, and those are the only two classes of interest. As the number of classes increases, the problem looks more and more like a continuous estimation rather than a classification problem
                                  Modeling to Predict: 
                                  • There seems to be an impression that prediction is just like classification, except about the future. It's not. Prediction is about intelligently forecasting states that have not yet been encountered in existing data
                                  • Misappropriation of the term prediction has created a lot of misapprehension and misunderstanding about the nature of the activity in data mining
• Classification is a technique, supported by tools, for dividing a data set into two or more parts based on characteristic features in the data: dividing the credit data set into two classes of buyers and non-buyers, for instance
                                  • Prediction is concerned with causes and effects, with dynamic relationships that interconnect the objects represented in the data. Prediction explores rational expectations of what will be encountered in situations that haven't happened before
• The essence of the difference is this: classification characterizes and associates patterns in input and output batteries; prediction characterizes system behaviors and uses the characterization to estimate outcomes under novel circumstances where no outcome pattern is known
                                  • Data represents measurements about a selected system that exists in the world. Classification addresses the problem of how to use some of the features of that system, as represented in the data, to determine other features of the same system. In other words, classification uses data that describes both the patterns and the outcome from those patterns and characterizes which outcomes belong to which patterns
                                  • Prediction addresses the problem of how changes in one set of features in a system will affect other features of the same system
                                  • It is the relationship among the objects in the system that are important. In prediction, there is no outcome pattern available in data anywhere to associate with the input battery patterns
• For example, a manager may want to know how changes in commission rates will affect the level of sales. Or what effect changes in the level of research investment will have on competitive position. Or how introducing a new product will affect sales of existing products. Or which skill sets have the least impact on profitability and market position. Or what products should be placed on order now to meet anticipated future need
                                  • Essentially, businesses use prediction for a couple of purposes. One is to discover which possible business scenarios are most likely to occur based on the present circumstances. The other main purpose is to examine different possible business scenarios, however the scenarios are devised, and to explore the likely outcomes in each scenario
                                  • Another view is that classification is about modeling the state of stocks in a system, whereas prediction is about modeling the information connections and flows. The answer to the question "Where do we need to most effectively invest resources to reduce backlog?" is not a classification problem. It's a predictive problem
                                  • Predictive modeling requires the miner to use inferential and classification modeling tools plus particular methods and skills in applying those tools. The key lies entirely in miner skill sets, not in automated tools or technologies
                                  • In building predictive models, one major task that a miner faces is simply building the minable data set
                                  • The three main problems that face a miner (that is, after defining both the business problem and the system that needs modeling) are: (1). Gathering data - (2). Looking for causes - (3). Reporting outcomes
                                  Gathering data for prediction:
                                  • Data used for building classification models is different from that used for building predictive models. A glance at the data in two data sets, one designed for classification and the other for prediction, might not show any apparent difference - both would have variables containing dates, categorical values, numerical values, and so on. The input batteries would still consist of several to many variables and the output batteries few or, most likely, one variable only
                                  • The difference is real nonetheless, and it is indeed a marked one. Understanding the types of data needed for each model goes a long way to explaining the difference between classificatory and predictive models
• Classification models look for structure in a data set. The classification model discovers the input battery structures and, it can be imagined, takes a "snapshot" of each pattern, including the outcome associated with each
                                  • Predictive models, as distinct from classificatory models, are inherently asked to estimate system behavior for states of the system that have not previously been seen, and thus, that cannot be represented in data. But, you may object, "If there is no data, how can it be possible to make a model of it?" If you ask that question, you immediately understand the difference - and the difference in challenge - between creating and using classification models and creating and using predictive models
                                  • What is to be done is to try to create a representation of the behaviors of important parts of the system using data describing individual system component interactions. The system model to be created describes how each component relates to the other components
                                  • This is all a more detailed way of saying that predictive models play the "What if . . ." game. So the question would be posed as, "What if everything stayed the same except that we changed thus-and-so?" It's the change to thus-and-so that makes the prediction necessary. The way forward is to construct a model of the system in which thus-and-so exists that is as complete as possible and duplicate known results. With such a model in hand, change thus-and-so and see what the system produces. Whatever it does produce, along with the necessary caveats of confidence and probable variance, is the prediction
                                  • For example, to continue with credit card acquisition as a theme, one classification model explored earlier seeks only to answer the question of who will respond to a solicitation used in the marketing test. A predictive approach might ask who would respond to some different offer, one that has never been tried before. Or a predictive approach might ask what should be put into a new offer so that it would both appeal to the largest number of target market prospects and return the highest ROI to the company making the offer
                                  • In the first case, there is no data about how people would respond to the different offer since this particular offer has never been made. In the second case, there is no data since the components of the offer - interest rate, payoff conditions, length of time to pay, associated benefits, and so on - have not only never been tried, but they also haven't ever been thought of since it is the discovery of some optimal set of components that is the object of the prediction
                                  • Prediction poses questions about how a system will behave when as a whole it is in some novel condition
                                  • In the case of the first question - who will respond to a newly created credit card solicitation - the specific components of the offer as a whole have never been assembled and offered to prospects, but it is very likely that the individual components of terms and conditions, interest rates, introductory offers, features, and benefits have all been individually made in previous offers. If so, the problem is to construct a model that tries to use the existing data, allow for all of the changed conditions, and extrapolate a response to the hypothetical offer
                                  • The key objective is to identify the relevant system components and to ensure that as much data as possible is found, created, or collected that describes the subsystem interactions as fully as possible
                                  Causality:
                                  • The predictions from a predictive model do not foretell the future. Predictions are estimates of how a system might behave under specified circumstances that the system has not previously experienced. Only sometimes does the circumstance specify that the time is somewhere in the future 
                                  • Many predictions are concerned more with how the system of interest will perform under conditions that, to all intents and purposes, are in the present
                                  • Predictions are not about foretelling the future, but about rationally exploring various business scenarios under a range of circumstances, only some of which involve displacement into the future
                                  • The essential problem that a miner faces in creating predictive models is that most situations are looked at from the "outside" rather than the "inside." 
                                  • The classification model is attempting to create a model that can be looked at as saying, "Given these inputs, the observed system will produce this output."
                                  • Predictive models always, to some extent, have to be explanatory models. In every case, along with a depiction of likely outcomes and the confidence and limits associated with each, it is very important to be able to explain why the outcomes are likely, and what the limits are and why
• Most importantly, the model must provide a type of explanation known as a causal explanation, which is different from the kind of explanation the explanatory models provide
                                  • The explanatory models present their explanations in terms of associations
                                  • Predictive models require explanations in the rather different terms of causality 
• Causality is a tough concept and is fraught with a vast array of philosophical problems. One way or another, it looks very much as if every event or action that is the cause of some phenomenon always itself has a cause
                                  • In everyday usage, it's not ultimate causes that are needed, but the leverage points that get the maximum effects for the minimum intervention. Predictive models, especially ones well calibrated against real-world data so that the actual quality of the predictions is well established, do very well at pointing out causes
                                  Summary of modeling to predict:
                                  • Predictive modeling is an essentially different mining task from classification modeling. Predictive modeling requires constructing a network of interacting models, each expressing the relationships that exist among components of some system of interest. The key question to be answered is how the system will behave under some set of conditions that has not previously occurred, and for which there is no descriptive data
                                  • A predictive model is the most difficult of the three model types to set up, but potentially the most valuable in the insight it delivers and the power that it offers. It is, nonetheless, an advanced data mining technique at the present state of the art, which a miner will approach only when thoroughly familiar with the basic modeling techniques of explanatory and classificatory modeling
                                  • In the practice of modeling data to create a predictive model, at the actual addressing-the-data level, what the miner produces is actually a set of interlocking associative models that emulate the system for which predictions are needed
                                  • Despite the potential power of predictive data mining to directly address significant business problems, particularly important strategic business issues such as scenario exploration, the bread-and-butter work of data mining today is explanatory and classificatory modeling
                                  Summary: 
                                  • The central message is very simple. First, explore the data set with whatever tools you favor or simply have on hand. Generate the best explanation of the data in business terms that you can get. Second, use the explanation to reconfigure the data and to generate business relevant features. Third, build the best model you can get
                                  • Of course, the detail of actually performing those steps may be quite complex and will very likely be time consuming, but the outline is straightforward. However, even with the model in hand, the process of modeling is only just begun. The next stage is to "tune" the model
                                  ===================================================
                                  Improving the Model (Part I)
                                  Highlights:
                                  • Just because a model exists doesn't mean that it has no problems, or that it is the best or most appropriate model that the data permits
                                  • Although most of the processes apply as much to improving an explanatory model as to improving a classificatory model, most of the issues are addressed as if the model to be improved is classificatory
                                  • The process of improving a model can be broken into two broad categories: discovering where the model has problems, and fixing the discovered problems
                                  • The two activities - diagnosing model problems and applying remedies are both applied in order to refine the initial model
                                  • The ultimate purpose underlying all of the issues and processes is to deliver a model that represents the business-relevant, meaningful relationships in the data set as perfectly as the data permits, and as simply as possible
                                  • It is in this stage of refining the initial mined model that a data miner must expect to start revisiting earlier parts at least of the mining process, and perhaps of the whole modeling process
                                  • None of the processes is carried out in isolation from any other part. Mining is an interactive whole, and all of the processes interact - hopefully to improve the model
                                  • Models are created on the data in the training data set, but all of the checking for problems and improvements happens in the test data set
                                  • The cycle works as follows: rebuild the model in the training data set, and look for any change in results when applying the new model in the test data set
                                  • If the data needs adjusting, remember to make the necessary adjustments in all three data sets, but don't change the instances (records) that are included in the data sets
                                  • The mixing of training and test data would pretty much invalidate the purpose of the separate data sets, and wholly undermine the purpose of the evaluation data set altogether
                                  Learning from errors: 
                                  • We are all encouraged to learn from our mistakes, and it's no different when mined models make the mistakes. There is a lot to be learned from a close examination of the errors made by a classification model
                                  • These errors represent the difference between what the model predicts and what the actual outcome turns out to be in the real world. Whenever a model turns out to be worth considering for application, the next step is to look at the errors that it makes in the test data set - and very often, actually looking is a useful thing to do, not merely a metaphorical looking
                                  • For a binary classification model predicting a binary outcome, the confusion matrix reveals the most about the model's performance
                                  • The residual value, or simply the residual, is the name given to the difference between the predicted and the actual values. In this case, actually looking at the residuals from the continuous score as well as looking at the confusion matrix begins to be helpful
• Residual values are determined by subtracting the predicted value from the actual value. Symbolically, this might be represented as r = a - p, where r represents the residual value, a represents the actual value, and p represents the predicted value
                                  Looking at errors: 
                                  • It is important to get a feel for (and later to quantify) the differences between the model's predictions of the values and the actual values
                                  • Sometimes the pattern of the residuals can be used to improve the model, and it is this sort of pattern that a miner must seek
                                  Predicted versus residual diagnostic plot:
• It is an XY plot. Each point is plotted in a position on the graph to represent its values on two measurements, the predicted value and the residual value
                                  • For a miner, by far the best practice is to eyeball the residuals and to become familiar with the look and feel of these plots
                                  Predicted versus actual diagnostic plot:
                                  • The other useful XY plot that a miner needs to become familiar with shows predicted versus actual values
                                  • This is similar to the previous plot except that the values form vertical columns
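Both diagnostic plots just described (predicted versus residual, and predicted versus actual) are easy to produce for eyeballing. The sketch below assumes matplotlib and arrays of predicted and actual values taken from the test data set.

import numpy as np
import matplotlib.pyplot as plt

def residual_diagnostic_plots(predicted, actual):
    """Plot predicted vs. residual (r = a - p) and predicted vs. actual."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    residual = actual - predicted

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(predicted, residual, s=8)
    ax1.axhline(0.0)                      # residuals should straddle zero
    ax1.set_xlabel("Predicted")
    ax1.set_ylabel("Residual")

    ax2.scatter(predicted, actual, s=8)
    lo, hi = float(actual.min()), float(actual.max())
    ax2.plot([lo, hi], [lo, hi])          # the ideal diagonal
    ax2.set_xlabel("Predicted")
    ax2.set_ylabel("Actual")
    plt.show()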
                                  Predicting Errors:
                                  • Looking at the errors in the form of residuals provides a fair amount of information. However, if the original data set could somehow be used to predict what the errors were going to be, that prediction of the errors could be used to improve the prediction
                                  • Data mining tools are, or should be, very good at characterizing relationships, whether linear or non-linear. The resulting relationship between the actual and predicted values, however rough and imprecise, should at least be linear, so a linear comparison is quite a reasonable way to check on the actual relationship
• However, as a sort of "sanity check," it's worth building a model that attempts to predict the value of the residual. Again, this model will be built using the training data set input battery and predicting the residual value in the training data set as the output battery
                                  To make the initial model and residual model:
                                  1. Build an initial model
                                  2. Apply the initial model to the training data set, creating a set of predictions
                                  3. Calculate the residuals using the predicted values in the training data set
                                  4. Add a variable to the training data set input battery containing the value of the residual
                                  5. Build a second model to predict residuals using all of the training data except the original output variable and the predicted values
                                  Thus, the residual test model must not include any actual values or predicted values from the original model. If using a multiple-algorithm mining tool, it's worth building the second model with a different algorithm than the original model. Next, build the residual test model:
                                  1. Include the prediction and predicted residuals in the input battery
                                  2. Build a model to predict the original output battery
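The procedure above can be sketched end to end. The scikit-learn estimators below are stand-ins for whatever mining tool and algorithms a miner actually uses (the text only asks that the residual model use a different algorithm from the initial one); all names are illustrative.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def residual_test_models(X_train, y_train):
    """Sketch of the residual-model procedure described above."""
    initial = LinearRegression().fit(X_train, y_train)            # step 1
    pred = initial.predict(X_train)                               # step 2
    residual = y_train - pred                                     # step 3
    # Steps 4-5: a second model, built with a different algorithm, predicts
    # the residual from the input battery only (no actuals, no predictions).
    residual_model = GradientBoostingRegressor().fit(X_train, residual)
    # Residual test model: input battery plus the prediction and the
    # predicted residual, predicting the original output battery.
    augmented = np.column_stack(
        [X_train, pred, residual_model.predict(X_train)])
    final = LinearRegression().fit(augmented, y_train)
    return initial, residual_model, final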
                                  Continuous classifier residuals:
                                  • With only two classes, the residual plot and predicted/actual plot can produce only limited additional insight over that offered by the confusion matrix. In fact, for a two-class output classification, the confusion matrix pretty much offers the best insight into the workings of the model
                                  • When the output variable is to all intents and purposes a continuous variable, confusion matrices become totally impractical, and the only way to understand model performance is by using these plots
                                  • Start with an XY plot of the residual values versus the predicted values. When the output variable is continuous, it is necessary to order the residuals by the prediction value
                                  • Most algorithms that fit functions, curves, and other characterizations to data use one of a relatively few methods to determine how good the fit is, and the algorithms adjust their parameters until the fit, according to the criterion chosen, is as good as possible
• There are, in fact, relatively few metrics for determining the level of fitness, but the most popular for continuous variables is Mean Least Squares (MLS). This involves minimizing the sum of the weighted squares of the residuals
                                  • When the miner looks at the residuals in the test data set, they may not have a mean of 0. Generally speaking, the divergence from 0 represents a problem of some sort - insufficient data, poor model, problems with the data, inappropriate modeling tool, or some other problem
                                  • However, if the divergence from 0 in the test data set is large, it may be worth checking the mean of the residuals in the training  data set. If it isn't 0 there, either the tool or algorithm is somehow "broken" or the tool is using some other best-fit metric
                                  • In general then, the mean of residuals in the test data set should be 0. In addition, a straight line fitted across the range of the prediction with linear regression should fit through the center of the residual distribution, and should be flat along the zero point
                                  • Recall that whatever the distribution of the input data, and however nonlinear the relationships between input battery and output battery, a mining tool should, if effective, characterize the fit between input and output to include any peculiarities of distribution and to accommodate any non-linearity present
                                  • However, dealing with other types of residual distribution when modeling with continuous input and output battery variables is an advanced modeling topic
• In effect, if the residuals' distribution is far from normal, it almost certainly indicates a potential problem with the model
                                  • If the histogram of residuals in the test data set is far from normal, compare it with the histogram of residuals in the training data set. If the distributions are dissimilar between training and test data sets, the problem is most likely with the data. If the distributions are similar and both are far from normal, the culprit may well be the modeling tool. If possible, try a different algorithm and check again
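The checks described above (the mean of the test-set residuals near zero, and training versus test residual histograms compared for shape) can be sketched as follows, assuming plain numeric arrays of residuals.

import numpy as np

def residual_checks(train_residuals, test_residuals, bins=20):
    """Report residual means and return comparable histograms so the
    training and test distributions can be inspected side by side."""
    train_residuals = np.asarray(train_residuals, dtype=float)
    test_residuals = np.asarray(test_residuals, dtype=float)
    print("train residual mean:", train_residuals.mean())
    print("test residual mean: ", test_residuals.mean())
    lo = min(train_residuals.min(), test_residuals.min())
    hi = max(train_residuals.max(), test_residuals.max())
    train_hist, edges = np.histogram(train_residuals, bins=bins, range=(lo, hi))
    test_hist, _ = np.histogram(test_residuals, bins=bins, range=(lo, hi))
    return train_hist, test_hist, edges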
                                  Continuous classifier residuals versus actual values plot:
                                  • Recall that at run time, the best available estimate of the actual value is the prediction produced by the model. We know for sure that there will be errors, and using training and test data sets (which have actual values for the output variable available to train the model), it's possible to know the actual residual error
                                  Continuous classifier actual versus predicted values plot: 
                                  • Another plot that a miner should routinely examine is an XY plot of actual values versus predicted values 
                                  • Data mining tools should model non-linearity very well, so the predicted/actual values relationship should be pretty much linear, with all of the non-linearity accounted for in the model. If there is an evident curve that clearly fits the data better than the diagonal, it is an indication that the model is underspecified, which means not complex enough to capture the non-linearity present. (An over-specified model captures too much complexity, so it characterizes noise)
                                  Continuous classifier variance plot:
                                  • Variance is a very straightforward measurement. It simply expresses how much the value of a group of values varies from the mean value of the group. In this case, the measurement is of how much the residual, or error, varies from the predicted value
                                  • A spreadsheet program such as Excel can be easily used to create a variance plot
                                  • Variance can be extremely useful in understanding the model's performance 
                                  • Since the distribution of the error term is very nearly normal, it is relatively easy to explain the "reliability" of the prediction from the properties of the standard deviation
                                  • Similarly, for any point on the curve, the reliability of the prediction can be easily described in terms of how many of the actual values (as a percentage) can be expected to be within what distance of the predicted value
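One way to produce a variance-style reliability picture, whether in a spreadsheet or in code, is to order the residuals by the predicted value and compute a moving spread. The sketch below uses a simple moving standard deviation; the window size is an arbitrary illustrative choice.

import numpy as np

def rolling_residual_spread(predicted, residual, window=200):
    """Order residuals by the predicted value and compute a moving standard
    deviation - a rough "reliability band" around the predictions."""
    predicted = np.asarray(predicted)
    residual = np.asarray(residual)
    order = np.argsort(predicted)
    p, r = predicted[order], residual[order]
    spread = np.array([r[max(0, i - window): i + 1].std()
                       for i in range(len(r))])
    return p, spread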
                                  Perfect Models:
                                  • Perfect models rarely, if ever, occur. Even very good models that are close to perfect are highly suspicious. Genuinely, justifiably perfect models are only likely when either the problem is utterly trivial and the relationship and predictions are obvious, or when leakage from anachronistic variables feeds information back from future to past
                                  • If any suspiciously good model turns up (that is, one that is far better than expected), it is worth checking very, very carefully to discover the nature and source of the error
                                  Summary of classification model residual checking:
• Looking at residuals and comparing them with actual and predicted values in the structured way described here is important. This is a diagnostic technique used to determine whether there are problems with the data, with the modeling tool, with the model, or with some combination of these
                                  • The miner uses residual error versus prediction plots to look for possible problems with the data, the model, or the modeling tool
• In brief, looking at plots of residual values across the range of the prediction can be very illuminating to a miner and very helpful in getting the model right
                                  Improving explanatory models:
                                  • Diagnosing problems with an explanatory model is, in a sense, much easier and less technically exact than with classification models
                                  • Essentially, either an explanatory model does provide a convincing, relevant, applicable explanation that serves to address the business problem, or it doesn't! If it does, no further diagnosis is needed - the model works. If it doesn't, that in itself is pretty much the diagnosis
• The ability of a model to capture detail - its complexity, and its power to extract detail from the data, set either too low or too high - is called its specificity. When the model does not capture enough detail, it is called underspecified; when it captures only the froth of detail, the model is over-specified. Setting an appropriate level of specification is as important in an explanatory model as it is in a classificatory model
                                  • Sometimes it is convenient, or easier to understand, if categories are used to explain a relationship, even when the underlying variable is a continuous number. A process called binning turns continuous variables into categories
                                  ==================================================
                                  Improving the Model (Part II)
                                  Improving model quality, solving problems: 
                                  • Improving the quality of a model means understanding what "quality" means in terms of a model. There are lots of different ways to characterize the quality of a model. Partly, of course, it depends on the type of model. As far as evaluating the quality of explanatory models goes, it's the quality of the explanation that counts
                                  • Based on the needs of the business problem, of course, judging the quality of an explanatory model is pretty much a subjective exercise
• Classificatory models, whose ability to address the business problem is still pretty much a qualitative issue, can nonetheless be judged against each other on technical criteria. In addition to the diagnostic tests, it's useful to become familiar with understanding and interpreting any other quality measures provided by a mining tool. Even so, the fundamental diagnostic tests of a model's quality - interpreting confusion matrices, XY plots of residuals, predicted values, and actual values, and residual histograms - remain the crucial determinants of model quality
                                  Problem: The data doesn't support the model 
                                  • A miner might find that the input battery doesn't relate to the output battery - in other words, the data doesn't support the model needed
                                  • This is a perennial data mining problem. The data available to fill the input battery simply doesn't have any very useful relationship to the output battery. Given the input and output batteries, no data miner is going to get a very useful model if this is genuinely the situation
                                  • The best approach is to find more or different data - that is, data that hopefully holds the relationships of interest
                                  • Discovering that a data set does not contain any very useful relationships to the object of interest is a useful contribution of knowledge to the search for appropriate data
                                  • The problem is that secondary data, such as that from a data warehouse or from a standardized database, may well have had many of the interesting relationships removed inadvertently
                                  Problem: The data partially doesn't support the model 
                                  • Another possible problem is that the input battery doesn't sufficiently define the relationship to the output battery over all or part of the output range - in other words, the data doesn't support the model needed
                                  • The issue here is that over some part of the output battery's range, the prediction is simply not accurate enough to provide the necessary level of confidence to use the model - at least, not when it makes predictions in the problematic part of the range
                                  • What is really needed is to improve the accuracy over the problematic part of the range
                                  • It is worthwhile to work to discover additional data that better defines the relationship over the problematic part of its range. First, look in the existing data set itself. Careful explanatory modeling may reveal features in the input battery that, when introduced into the data set as dummy variables, do elucidate the relationship more clearly
                                  Problem: Reformatting data 
                                  • A miner might find that the tool (algorithm) selected to make the model cannot deal with the data in the format provided
                                  • Fortunately, the problem is relatively easy to address since the miner can reformat the data before applying the modeling tool, so that the input battery presents the variable formats in a way that is appropriate for the chosen underlying algorithm. Generally speaking, tools only reformat data that the automated transformation method recognizes as needing transformation
                                  • There is another issue with the variable's data format - that of missing values. Some algorithms cannot deal with missing values at all; others deal with missing values very poorly; and yet others apparently have no particular problem with missing values
                                  • Thus, it is a good practice to replace all missing values using well-founded imputation methods
                                  • There are three basic techniques for reformatting data, plus the not-exactly-reformatting technique of replacing missing values: Binning - Normalizing range - Normalizing distribution
                                  • Apply these techniques with care, as they are not all equally applicable under all circumstances
                                  Reformatting data: Binning 
                                  • Binning is a very simple and straightforward technique for turning continuous variables into ordinal or categorical variables
                                  • It should be noted that ordinal and categorical variables could also sometimes be usefully binned. However, binning ordinal or categorical variables requires advanced binning tools and techniques, such as information-based binning
                                  • Binning is so called, perhaps, because when binning a variable, various sub-ranges of values of the variable are all put together into a bin
                                  • Binning can actually remove more noise than useful information, especially if the binning is optimally done. This can sometimes result in better models, even for algorithms that could use the variables in their unbinned form
                                  • The problem of deciding how to bin a variable is twofold: 1) discover how many bins to use, and 2) determine how best to assign values to each bin
                                  Assigning Bin boundaries:
                                  • One simple way to assign bin boundaries is to divide the range of the variable into a number of bins, and let each bin cover its appropriate fraction of the range
                                  • Another bin is sometimes used for instances with missing values. The bins are the ordered range of the variable with, say, least values on the left and greatest on the right. Each bin covers the same amount of the range of values of the variable as any other bin, and this binning arrangement is called equal range binning
                                  • This arrangement might work if the variables' values were distributed fairly evenly across the range of the variable. However, most variables don't have particularly uniform distributions
• More commonly, a variable has a nonuniform distribution that approximates a normal distribution, with most of the values clustering around the mean value
                                  • The bin boundaries are adjusted to evenly balance the bin contents. This arrangement, not surprisingly, is called equal frequency binning since the bin boundaries are arranged so that, as much as possible, all the bins have a similar number of instances in them
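The two simple strategies, equal range and equal frequency binning, can be sketched with numpy. The skewed example series is made up purely for illustration.

import numpy as np

def equal_range_bins(values, n_bins):
    """Each bin covers the same slice of the variable's range."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

def equal_frequency_bins(values, n_bins):
    """Bin boundaries sit at quantiles, so each bin holds roughly the same
    number of instances."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(equal_range_bins(skewed, 4))      # almost everything lands in bin 0
print(equal_frequency_bins(skewed, 4))  # instances spread across the bins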
                                  Information-Based binning: 
                                  • The simple binning strategies are unsupervised strategies. That is to say, any binning of a variable is made without any reference to any other variables at all, including those in the output battery
                                  • Since a classificatory model has an output battery, it is possible to use the output battery to direct the binning of the input battery so that the binning reveals the maximum amount of information about the output battery
                                  • This, of course, would be a supervised binning strategy. Potentially, it can do a better job than the simple binning strategies since it uses information from both the input battery variable and the output battery
                                  • Information content in variables can be measured according to the underlying theory called information theory. It is possible to create a binning strategy using information theory that retains in one variable the maximum amount of information about another variable. There are several ways of implementing an information-based binning strategy, but two are particularly useful
                                  • Least information loss binning, as the name implies, creates bin boundaries that optimally retain information in the input battery variable that describes the output battery variable
                                  • Maximum information gain binning is potentially the most powerful information-based binning strategy. This is another supervised binning strategy, but in this case, the ordering of the input variable is not necessarily maintained, and input battery variable values are mapped into bins so that maximum information is gained about the output battery 
                                  • Optimal binning can be crucial in deriving a good model from a data set
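As a rough illustration of supervised, information-based binning, the sketch below greedily chooses split points on an input variable that most reduce the weighted entropy of (i.e., gain the most information about) a binary output. It is a simplified stand-in for the least-information-loss and maximum-information-gain strategies named above, not a reproduction of any particular tool's algorithm.

import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a 0/1 vector."""
    p = float(np.mean(y))
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def greedy_information_bins(x, y, n_bins=4):
    """Greedily add split points on x that most reduce the weighted entropy
    of the binary output y - a simple supervised binning scheme."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    boundaries = []
    candidates = np.unique(x)[:-1]          # potential split points
    base = entropy(y)
    for _ in range(n_bins - 1):
        best_gain, best_split = 0.0, None
        for c in candidates:
            if c in boundaries:
                continue
            edges = np.sort(np.array(boundaries + [c]))
            bins = np.digitize(x, edges, right=True)
            weighted = sum((bins == b).mean() * entropy(y[bins == b])
                           for b in np.unique(bins))
            gain = base - weighted
            if gain > best_gain:
                best_gain, best_split = gain, c
        if best_split is None:              # no split adds information
            break
        boundaries.append(best_split)
    return sorted(boundaries)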
                                  Reformatting data: Normalizing ranges 
                                  • Some algorithms, most notably neural networks, are highly restricted in the range of values to which they are sensitive. Many of the most popular types of neural networks require a numeric input range, including values either from -1 to +1 or from 0 to 1. For any tool implementing one of these algorithms, there's no problem whatsoever in modifying the input range of a numeric variable to match the needs of the algorithm
• The tool simply scans the input battery, determines the maximum and minimum values present for all numeric variables, and rescales the input values appropriately
• The first point for any miner to note is that whenever there is a need to convert categorical and ordinal - particularly ordinal - variables to numerical representations, it is very important to discover if there is a naturally occurring order or sequence for the categories, and to distribute the categories in their appropriate locations in the range of the numeric representation. As is often the case when dealing with data, the rule is: "whenever possible, look it up or dig it up rather than make it up"
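Rescaling a numeric variable into the range a network expects is a one-liner in most tools; a minimal sketch of the underlying min-max calculation follows, with the target range passed as parameters.

import numpy as np

def normalize_range(values, new_min=0.0, new_max=1.0):
    """Rescale a numeric variable so its values fall between new_min and
    new_max (e.g. 0 to 1, or -1 to +1, for a typical neural network input)."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:                  # constant variable: nothing to scale
        return np.full_like(values, new_min)
    scaled = (values - old_min) / (old_max - old_min)
    return scaled * (new_max - new_min) + new_min

print(normalize_range([10, 20, 25, 40], -1.0, 1.0))   # [-1., -0.33..., 0., 1.]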
                                    Reformatting data: Normalizing distribution
                                    • The distribution of a variable describes the way that the variable's values spread themselves through the range of the variable. Some distributions are fairly familiar, such as what is known as the normal distribution
                                    • In a normal distribution, the greatest number of values occurs clustered around the mean (or average) of the distribution, with far fewer values falling at the extremes
                                    • It's not only numeric variables that have a distribution. Ordinal and categorical variables also have values, although they aren't numeric values, and the values usually occur with frequencies different from each other. An easy way to represent such a distribution is with a histogram, each column representing the number of instances in each class
                                    • Normalizing a distribution isn't necessarily the process of making the distribution more like a normal distribution. Rather, the term means regularizing or standardizing a distribution in some way
• In practice, it is only numeric variables that have their distributions normalized
                                    • One strategy for normalizing a distribution that can work quite well is to use equal frequency binning with a high bin count (say 101 bins), and assign each bin a value a uniform increment apart. If the chosen bin value ranges from 0 to 1, the bins would be assigned values of 0, 0.01, 0.02, 0.03 . . . 0.98, 0.99, 1.00.
                                    • To normalize just the distribution, assign each bin the mean value of the instances in the bin
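The equal-frequency strategy just described can be sketched directly: map each value to its equal-frequency bin and give the bins uniformly spaced values between 0 and 1. The 101-bin default follows the text; the small demonstration series is made up.

import numpy as np

def normalize_distribution(values, n_bins=101):
    """Equal-frequency bin the variable and assign the bins uniformly spaced
    values between 0 and 1 (0, 0.01, 0.02 ... 1.00 for 101 bins)."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    bin_index = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    return bin_index / (n_bins - 1)

print(normalize_distribution([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], n_bins=11))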
                                    Why does distribution normalization work?
                                    • Consider an extreme case of skew in a distribution. Think of the series of numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000. This series ranges from 1 to 1000. However, almost all of the values fall in only 1% of this range. The value 1000 is called an outlier since it lies far from the bulk of other values in the series
                                    • For this example, it is a rather extreme outlier, but it may be quite impossible to say that it is an error of any sort or even an erroneous value. Quite justifiably, this might be a perfectly valid entry. For an example of a real-world situation in which such extremes occur, consider insurance claims, where most are for very small amounts but a few are huge
                                    • Without binning or some other redistribution strategy, almost all numerically sensitive algorithms, when presented with this range of values, would have to scale their inputs such that this actual distribution would be indistinguishable from a data set containing only two values
                                    • For such distributions, either high-bin count, equal-frequency unsupervised binning, continuous remapping, or supervised binning handily finesses the problem by producing a variable with a distribution from which any mining tool can extract the maximum information
                                    Distribution normalization in explanatory models:
                                    • Distribution normalization can play a very important role in improving the performance of classificatory models. However, it can play an even more important role in building explanatory models, particularly when using clustering, and especially when using visually based clustering tools, such as the SOM tool
                                    • Redistributing the values spreads the values that are present across the displayable range, and makes any patterns present far easier to see
                                    Reformatting data: Replacing missing values 
                                    • Replacing missing values doesn't change the format of the data. However, it's necessary in some cases, and always worthwhile, for a number of reasons
                                    • For those algorithms that cannot deal with missing values, something has to be done - the miner has no choice. Some tools automatically ignore the whole instance, and all of the values it contains, if one of them is missing
                                    • Empirical evidence suggests that replacing missing values with well-founded imputed values turns out to improve the quality of the resulting model
                                     • An MVCM is a model built on a data set whose input battery contains only a characterization of which values are missing and which values are present
                                    • If it is indeed true that values are not missing at random, they must be missing with some regularity, or pattern. Replacing missing values with any constant value, exactly as was done in the MVCM, will reveal that pattern. Now this isn't in itself a bad thing since it may very well be that those patterns need to be explicated, which is why the MVCM technique calls for adding a variable describing any useful discovered relationship
                                    • It is, regrettably, a very common practice to replace missing values with some constant value, such as the mean of a numeric variable or the most frequent category of a categorical
                                    • Replacement of missing values is most easily accomplished by automated tools as it is impractical to manually make the necessary calculations to build the replacement algorithm
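                                     • As an illustration of the points above, the following sketch (assuming Python with pandas; the handling shown is illustrative only) keeps a missing/present flag per variable, in the spirit of the MVCM, and then imputes the gaps so that tools that cannot handle missing values can run:

                                     import pandas as pd

                                     def replace_missing(df: pd.DataFrame) -> pd.DataFrame:
                                         """Replace missing values and keep a record of where they were."""
                                         out = df.copy()
                                         for col in df.columns:
                                             if df[col].isna().any():
                                                 # flag column preserves the missing/present pattern
                                                 out[col + "_was_missing"] = df[col].isna().astype(int)
                                                 if pd.api.types.is_numeric_dtype(df[col]):
                                                     # median is only a placeholder; the text argues for
                                                     # well-founded imputed values rather than constants
                                                     out[col] = df[col].fillna(df[col].median())
                                                 else:
                                                     out[col] = df[col].fillna(df[col].mode().iloc[0])
                                         return out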
                                    ==================================================
                                    Improving the Model (Part III)
                                    Problem: Respecifying the algorithm
                                    • One of the possible problems a miner might face is that the model wasn't able to characterize the relationships from the input battery to the output battery adequately
                                    • The viewpoint here is that the difficulty in improving model performance may not necessarily lie with the data, but in the capabilities of the mining algorithm and the way it has been specified
                                    • Recall that an underspecified model is one in which the constraints on the algorithm were such that it didn't have enough flexibility to properly characterize the relationships in the data. An over-specified model is one that has so much flexibility that it captured not only the underlying relationships, but a lot of junk too
                                     • The answer to determining when the model is well-specified - not too much, not too little, but "just right" - is to keep improving the model until it is just over-specified. In other words, keep building more specific models on the training data set for as long as the improved models continue to do better on the test data set (a sketch of this stopping rule follows this list)
                                    • Regardless of performance in the training data set, as soon as a model returns worse results in the test data set than a previous iteration of the model, use the previous iteration as the final specification level for the model
                                    • Any change in any of the model parameters requires a total recalibration of all the others, thus leading to an almost endless improvement process
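                                     • A sketch of the stopping rule described above (assuming Python with scikit-learn; a decision tree's depth stands in for whatever specificity knob the chosen algorithm exposes):

                                     from sklearn.tree import DecisionTreeRegressor

                                     def fit_just_right(X_train, y_train, X_test, y_test, max_knob=20):
                                         """Increase specificity step by step; stop when the test score worsens."""
                                         best_model, best_score = None, float("-inf")
                                         for depth in range(1, max_knob + 1):        # depth is the 'specificity knob'
                                             model = DecisionTreeRegressor(max_depth=depth, random_state=0)
                                             model.fit(X_train, y_train)
                                             score = model.score(X_test, y_test)     # performance on the test data set
                                             if score <= best_score:                 # first drop => just over-specified
                                                 break                               # keep the previous iteration
                                             best_model, best_score = model, score
                                         return best_model, best_score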
                                    Algorithm adjustment: Nearest neighbor or memory-based reasoning 
                                    • Nearest neighbor algorithms offer only two types of basic adjustment: the number of neighbors to be considered, and the method of determining the estimated value
                                    • Adjusting the number of neighbors is fairly straightforward. The algorithm is sometimes known as K-nearest neighbor, where k stands for the number of neighbors
                                    • The estimated value is determined by looking at the output battery values for each of the k neighbors and taking an average of them all as the estimate
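                                     • A minimal illustration of the two knobs (assuming Python with scikit-learn; X_train, y_train, and X_test are placeholders):

                                     from sklearn.neighbors import KNeighborsRegressor

                                     # k (n_neighbors) is one knob; 'weights' changes how the neighbors' output
                                     # values are combined into the estimate (plain average vs. distance-weighted)
                                     knn = KNeighborsRegressor(n_neighbors=15, weights="distance")
                                     knn.fit(X_train, y_train)
                                     estimates = knn.predict(X_test)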
                                    Algorithm adjustment: Decision trees 
                                     • Decision trees split the data into leaves based on individual variables. At each leaf, the tree selects, from all of the available variables, the best variable on which to split that leaf further. The root covers the whole data set
                                    • Thus, one way to prevent trees from learning noise is to set some minimum amount of instances that a leaf must contain. If that limit is set too high, the tree will be underspecified. If the limit is set too low, the tree will be overspecified
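                                     • A sketch of the minimum-instances-per-leaf knob (assuming Python with scikit-learn; the thresholds shown are arbitrary):

                                     from sklearn.tree import DecisionTreeClassifier

                                     # min_samples_leaf is the 'minimum instances per leaf' knob discussed above:
                                     # a large value under-specifies the tree, a tiny value over-specifies it
                                     general_tree  = DecisionTreeClassifier(min_samples_leaf=500)  # stiffer, more general
                                     specific_tree = DecisionTreeClassifier(min_samples_leaf=5)    # more flexible, risks noise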
                                    Selecting root splits: 
                                    • Decision trees choose to split each leaf on the variable that the tree algorithm determines as providing the best split. This applies to the root just as much as to the other leaves
                                    • Simply removing the variable on which the initial tree split the root from the input battery is not suggested here - just change the variable that is allowed to split the root
                                    • Empirically, the reason that better-specified trees result from not using the "best" split of the root is that it rearranges the tree so that the later leaves are more appropriately split and are more resistant to learning noise
                                      Algorithm adjustment: Rule extraction
                                      Among the features of rules, there are three that are important to specification: 
                                      1. They cover some number of instances
                                      2. They have some probability of being true
                                      3. Each rule has a level of complexity depending on how many conditions can be included in each rule
                                      All three of these features can be adjusted to change the specificity of the model.
                                      •  The number of instances in which the rule is correct, divided by the number of instances to which the rule applies, correct or not, gives the accuracy (also sometimes called probability level or confidence level)
                                      • Requiring a higher minimum level of accuracy produces more general models; lowering the required minimum accuracy increases specificity
                                      • Rules can be constructed from multiple conditions. The conditions are the "if" part of the rule. Each additional condition can be joined by logical connections such as "If . . . and . . . and . . . then . . . " This rule has three conditions. Some rule extractors - by no means all - can incorporate other logical connectors such as "or" and "not." The more conditions allowed in a rule, the more specific the rule becomes
                                      Algorithm adjustment: Clustering
                                      • There are many different algorithms that perform unsupervised clustering; there are also many that perform supervised clustering. They do not work in the same manner, and so each particular algorithm will almost certainly produce a very different set of clusters from the other clustering algorithms
                                      • Clustering algorithms offer essentially two adjustments. One is the number of clusters specified by the miner. It's very common that the algorithm requires the miner to select some number of clusters for the algorithm to use to cluster the data
                                      • The second adjustment works with a set of slightly different clustering algorithms that try to find some appropriate number of clusters, rather than having the miner choose some arbitrary number
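                                      • A sketch of both adjustments (assuming Python with scikit-learn; X is a placeholder for the input battery, and the silhouette criterion is just one reasonable way to let the data suggest a cluster count):

                                      from sklearn.cluster import KMeans
                                      from sklearn.metrics import silhouette_score

                                      # Adjustment 1: the miner simply chooses the number of clusters
                                      fixed = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

                                      # Adjustment 2: let a criterion suggest an appropriate number of clusters
                                      scores = {}
                                      for k in range(2, 12):
                                          labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
                                          scores[k] = silhouette_score(X, labels)
                                      best_k = max(scores, key=scores.get)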
                                      Algorithm adjustment: Self-organizing maps
                                      • The specificity of a SOM requires a fairly straightforward adjustment. More neurons, more specificity; fewer neurons, more generality. With very few neurons, the map will be extremely general
                                      • Adding neurons makes the map more specific in that it reveals more detailed relationships
                                      • Although not directly related to appropriate specification of the SOM, a very useful technique for improving the explanatory insight from a SOM can be normalizing distributions
                                      • Normalizing distributions often produces more insight than trying for more specification
                                      Algorithm adjustment: Support Vector Machines
                                      • Support vector machines are another form of clustering and have many similar issues. The main specificity issues that are particular to support vector machines concern how the overlapping clusters are to be separated - and almost all clusters in real-world data sets overlap
                                      • Essentially, the more flexible the boundary, the more easily it is able to modify each boundary to surround each cluster; but too much flexibility, and it will be cutting out relationships that do exist in the training data, but not in the test data
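                                      • A sketch of the flexibility trade-off (assuming Python with scikit-learn; the parameter values are arbitrary illustrations):

                                      from sklearn.svm import SVC

                                      # Small C / small gamma -> stiffer boundary, more general model;
                                      # large C / large gamma -> very flexible boundary, risks learning noise
                                      general_svm  = SVC(kernel="rbf", C=0.5, gamma="scale")
                                      specific_svm = SVC(kernel="rbf", C=100.0, gamma=1.0)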
                                      Algorithm adjustment: Linear Regression
                                      • The basic linear regression algorithm is a masterpiece of mathematical simplicity and elegance. In its basic form, it has no "knobs" at all. However, no tool applies linear regression in its basic form - usually multiple linear regression at the very least
                                      • Linear regression inherently resists over-specification. After all, it can only represent linear relationships, and in this sense is the ultimate "stiff" fit to any data set
                                      • The main problem that concerns specification is the presence of outliers
                                      • Over-sensitivity to a few, possibly only one, data points is an example of over-specification. The more robust a linear regression, the more general it is; the less robust the regression, the more specific
                                      • However, an astute reader will note that normalizing the distribution in part removes the inordinate effect of outliers in any case
                                      Algorithm adjustment: Curvilinear Regression 
                                      • When it comes to data points, the purpose is to find a flexible line that best characterizes any curvature that exists in the data set. To do that, it has to pass as close as possible to all the points that represent the true curvature present in the data without being too flexible
                                      • Too much flexibility and the curve represents noise. The "knob" in nonlinear regression is the amount of curvature allowed in the regression curve. It may be called "degrees of freedom" or "magnitude of exponent" or "stiffness" or quite a lot of other things according to the tool-maker's whim
                                      • The knob simply adjusts the algorithm to allow more (or less) kinks, twists, bends, and curves
                                      • The more flexible the curve is allowed to be, the more specific the model; the less flexible the curvature, the less specific or more general the model
                                      Algorithm adjustment: Neural networks 
                                      • Neural networks offer the ultimate in flexibility of fitting a regression curve to a data set. Unlike curvilinear regression, if properly set, they can induce greater stiffness on some parts of the curve than other parts
                                      • Exactly as with curvilinear regressions, specificity of the models produced using neural networks is accomplished by controlling the amount of flexibility allowed the curve
                                      • Neural networks are built from artificial neurons. Conceptually, each of the input battery variables is assigned to an input neuron, and each output battery variable is assigned to an output neuron. Between the input and output neurons there may be - and almost always are - what are called hidden neurons. They are hidden in the sense that they are sandwiched between the input and output neurons, and like the cheese in a cheese sandwich where the slices of bread hide the cheese, so the input and output neurons hide the hidden neurons
                                      • Hidden neurons are often arrayed in layers. A network containing one hidden layer connects all of the input neurons to one side, and all of the output neurons are connected to the other side
                                      • The number of input and output neurons is fixed - one per variable - and can't be altered without changing the data set. What varies is the number of hidden neurons. The number of hidden layers can alter too, of course, but that has less effect on curve flexibility and more on learning speed. 
                                      • Rule of thumb: Start with three layers (so one must be a hidden layer). Usually, the output is a single neuron corresponding to the single-variable output battery. Structure the hidden layer so that it has half the number of neurons as the input layer. It's a rule of thumb, and a starting point only, but in manually set networks, it often proves to be a good place to start. Many tools use automated procedures to estimate an appropriate beginning network architecture
                                      • In general, more neurons make for a more specific model; fewer neurons make for a less specific model
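                                      • A sketch of the rule of thumb above (assuming Python with scikit-learn; X_train and y_train are placeholders):

                                      from sklearn.neural_network import MLPRegressor

                                      n_inputs = X_train.shape[1]              # one input neuron per variable
                                      hidden = max(1, n_inputs // 2)           # rule of thumb: half the input layer

                                      # one hidden layer, single output neuron; more hidden neurons -> a more
                                      # specific model, fewer hidden neurons -> a more general one
                                      net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000, random_state=0)
                                      net.fit(X_train, y_train)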
                                      Algorithm adjustment: Bayesian Nets
                                      • Naive Bayesian networks look, if their architecture is drawn out, rather like neural networks. However, these networks are built of nodes, not neurons
                                      • The internal complexity of each node is very different from that of neurons, but the architecture in which the nodes are arranged appears similar. Naive Bayesian networks may have no hidden layer, so the inputs connect straight to the output. More complex Bayesian networks still may not have layers, but separate clusters of nodes cross-connected in complex ways
                                      • As a rule of thumb, the complexity in Bayesian networks derives from both the number of nodes and the number of interconnections between the nodes
                                      • Rule of thumb: More complex networks are more likely to be over-specified than less complex networks. So, more nodes, more connections, or both means more specificity. Fewer nodes, fewer connections, or both means less specificity, thus more generality
                                      Algorithm adjustment: Evolution programming 
                                      • Evolution programming produces program fragments that can be embedded into complete programs for execution. The fragments are usually more or less complex logical and/or mathematical statements that express the relationship between input battery and output battery
                                      • The commercially available tools implementing evolution programming, however, do expose knobs for adjusting model specification and, quite separately, controls for adjusting the learning process
                                      • The longer the program, or the more the variety of functions, the greater the complexity and the higher the degree of specification
                                      Algorithm adjustment: Some other algorithm 
                                      • All algorithms pretty much separate the learning process from the specification process. There are almost always some controls on raw algorithms - and certainly on any of the more complex raw algorithms - that tune the training or learning process, and a pretty much separate set of knobs that tune the specification process
                                      • When complexity is exposed, it is usually for knobs to tune the specification process
                                      • All mining algorithms available in tools today can be viewed as working in one of two fundamental ways. Either they chop instances up into discrete chunks, as in decision tree leaves or clustering algorithm clusters, or they find continuous estimates, as with regressions or neural networks
                                      • With any algorithm that chops instances into discrete chunks, adjusting a knob that decreases the minimum permitted size of the chunks always increases specificity of the resulting model. Increasing the minimum permitted chunk size decreases model specificity
                                      • Algorithms that assemble a continuous estimate always seem to require some amount of internal structure. For neural networks, it's neurons; for Bayesian networks, it's nodes and interconnections; for self-organizing maps, it's neurons; for evolution programs, it's program steps; for nonlinear regression, it's degrees of freedom, and so the list goes on
                                      ===============================================
                                      Improving the Model (Part IV)
                                      Problem: Insufficient data 
                                      • One of the possible problems a miner might find is that the test data set isn't representative of the same relationships that are in the training data set
                                      • Having different underlying relationships in different data sets is normally a problem with shortage of data. Without sufficient data, it often happens that noise predominates since, when split into the three required data sets, there isn't enough data to truly represent the underlying relationships adequately in the separate data sets
                                      • Assume that the checks have been done, and that the data set as a whole, and the separate training, test, and evaluation data sets, passed those tests for consistency. If those tests have not been done, or the data set did not pass the tests, then the problem is almost certainly insufficient data - and consistently insufficient data, too
                                      • If the tests have not yet been done, or if the tests indicated insufficient data, find more data, or accept the model as the best that can be had under the circumstances
                                      • Dividing a source data set into training, test, and evaluation data sets requires that each instance in the source has a proportional chance of being assigned to one of the data sets. Thus, with a 60/20/20 division, any instance has to have a 60% chance of being assigned to the training data set, a 20% chance of being assigned to the test data set, and a 20% chance of being assigned to the evaluation data set (a sketch of such a split follows this list)
                                      • The only real answer is to check that the relationship between input and output batteries is similar in all three data sets - but that isn't accomplished by a rule of thumb. That calls for full-scale modeling of any relationships, which is what the miner is trying to do in constructing the model in the first place. There is no shortcut, only good modeling practice
                                      • The rule of thumb is a guide, not a certainty, and only modeling finally discovers whether the data is in fact sufficient to define a satisfactory model. If, despite all efforts, this turns out to be the case, the only answer is to get more, or better, or more and better data
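                                      • A sketch of the 60/20/20 division described above (assuming Python with numpy and a pandas DataFrame; the seed is arbitrary):

                                      import numpy as np

                                      def split_60_20_20(df, seed=0):
                                          """Randomly assign each instance to training/test/evaluation with 60/20/20 odds."""
                                          rng = np.random.default_rng(seed)
                                          draw = rng.random(len(df))                    # one uniform draw per instance
                                          train = df[draw < 0.60]
                                          test  = df[(draw >= 0.60) & (draw < 0.80)]
                                          eval_ = df[draw >= 0.80]
                                          return train, test, eval_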
                                      Problem: Uneven data
                                      • Another problem a miner might find is that the training and test data represent some relationships better than others. This may be indicated by some residual values that are more common than others
                                      • One problem is that the correlation of the patterns in the input battery may change depending on various factors, such as the actual values of the input battery variables
                                      • There may be nothing to be done about the problem. Some things in life come in patches or clusters, and the data simply reflects this as a fact of life. Data miners don't always accept this as a valid excuse and do things like balancing data sets to account for it, which will work in this case, too
                                      • However, before balancing a data set, it's worth ensuring that the data as collected for mining does, in fact, reflect the full range of behaviors that the world offers, and that the data set hasn't been selectively truncated in some way
                                      • Selectively including, or not including, some instances introduces bias into a data set, something that needs to be very carefully monitored 
                                      Problem: Estimation bias when mining a model
                                      • A miner might find that the model is "biased" into preferentially producing certain predicted values
                                      • Bias here is used in exactly its colloquial meaning - to lean toward, to be predisposed toward, or to favor something. In mining, therefore, a mined model is said to be biased when it has a tendency to produce one, or several, particular classifications
                                      • A mined model is also biased when it produces estimates that are all offset by a fixed amount, or by an amount that varies in fixed relationship to the magnitude or class of the estimate
                                      • Bias in the estimates of mined models is relatively easy to address, and may be produced by underspecification 
                                      • It's worth noting that bias is a crucial issue when constructing data sets and during deployment, but these issues don't arise during mining and refining a model
                                      Problem: Noise reduction
                                      • A miner might discover that in order to avoid learning noise, the tool was too restricted in the flexibility it was allowed when learning the relationships
                                      • Noise is a problem, and using an underspecified model is one of the techniques that makes the model more noise resistant. However, finding an appropriate specification level is better than underspecifying a model just to enable it to resist learning noise
                                      • Any noise-removing techniques that are applied to the source data set have to be duplicated on any run-time data during deployment, so the transformations to reduce noise have to be carried forward
                                      • One important noise reduction technique is missing value replacement. Replacing missing values may allow a more specific model to be created on a data set than before they were replaced
                                      • Another noise reduction technique is binning; the supervised binning techniques for input battery variables are particularly useful
                                      • Another, perhaps preferable, alternative to binning is normalizing the distribution of the input battery 
                                      • Manual aggregation of variable detail also sometimes works well, especially if a data set has several aggregation levels. It's often beneficial to have variable aggregations along some common metric. For instance, aggregating hourly sales to daily, daily to weekly, or weekly to monthly (depending on the needs of the business problem) may produce better estimates
                                      • Another noise reduction technique that works when there are a fair number of variables in the input battery is to "bundle" groups of variables together. (This is a data miner's version of what statisticians think of as principal curves and surfaces analysis.) Many mining tools provide information about how well the variables correlate with each other. If the tool available doesn't provide such information, Excel will do the job, although it's more time consuming
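                                      • A sketch of that correlation check (assuming Python with pandas; input_battery and the 0.9 threshold are placeholders):

                                      import pandas as pd

                                      corr = input_battery.corr()        # input_battery: DataFrame of numeric inputs (assumed)

                                      # list variable pairs carrying very similar information (candidates to bundle)
                                      threshold = 0.9
                                      pairs = [
                                          (a, b, corr.loc[a, b])
                                          for i, a in enumerate(corr.columns)
                                          for b in corr.columns[i + 1:]
                                          if abs(corr.loc[a, b]) >= threshold
                                      ]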
                                       Problem: Categorical correlation
                                      • A miner might find that there are many categorical values in the input battery that all represent a similar phenomenon or phenomena 
                                      • Categorical variables, just as with other types of variables, can carry information that is sufficiently similar to each other so that the variables seem effectively identical to the modeling tool. Age and income are typical examples of this phenomenon in many data sets. Age and income can be represented as categories, and if this were done, both age and income, on average, might very well increase together
                                      • Independent variables do not have similar relationships to each other. Similarity of information content makes variables "dependent" on each other in the sense that what value one variable takes on depends on the value taken by another variable
                                      • Bundling highly correlated variables serves, in part, to overcome this tendency toward bias
                                      Problem: Partial colinearities 
                                      • Another possible problem is that a large number of variables in the input battery may carry similar information over parts of their range
                                      • Variables, quite naturally, vary their values over their range. However, the variance is usually not uniform over the range of the variable. This non-uniformity shows up in the variable's distribution, which in at least some sense is a description of the non-uniformity of variance of a variable 
                                      • The central "hump" of the normal distribution, for instance, represents a clustering of values about the variable's mean. It's also possible for variables to have actual, or relative, gaps in their distributions. When many variables show such behaviors, it still causes little problem - unless the variables are partially dependent on each other. Perhaps for some parts of their range, the variables are somehow linked, whereas over the rest they aren't 
                                      • Distribution normalization, or possibly binning, may ameliorate this problem
                                      Problem: Data not representative of the business problem
                                      • A miner might face the problem of an input battery that, although it checks as representative of the population, has some parts of the output battery range represented by very few instances (records) and other parts represented by very many instances
                                      • Initially, it is important to both modeler and miner that the data set be as unbiased as possible. Thus, the data set should represent as true a state of the world as is possible
                                      • Adjustment is needed to make the data set representative of the business problem as well as of the world
                                      Problem: Output limiting 
                                      • A miner might discover that the tool may be clipping the output predictions. This may be indicated when the regression line fitted to the actual value versus residual XY plot is not horizontal
                                      • As far as checking the residuals for systematic error is concerned, output limiting is not a problem
                                      Problem: Variance bias 
                                      • One of the problems a miner might find is that there may be a bias that affects all of the input battery variables. This may be indicated when the regression line fitted to the actual value versus residual XY plot is not horizontal
                                      • Sometimes when bias is present, it may leave traces of its presence by changing distribution as magnitude changes
                                      • Another possible result of bias that affects the output variance is that, while retaining a mean of 0, the variance is nonetheless correlated to prediction magnitude
                                      • This suggests not only that better data would improve the model, but also that such data might be available - or at least, it is a clue that the phenomenon might be measurable. If it affects all, or a very significant fraction, of the input battery variables so significantly, it should be possible to discover what is actually producing this effect
                                      Problem: The modeling tool is broken 
                                      • A broken mining tool or algorithm may be indicated by almost any unwanted situation or circumstances. Mining tools or algorithms do break - but not very often! Data mining tools are computer programs - pieces of software - and as with all other software, they are subject to the normal array of "features" (also known as bugs and glitches)
                                      • If a workaround won't fix it, if it isn't an intermittent or transient problem, and if it isn't dependent on the data that the miner chooses to model, then the tool is broken and requires repair before it is again usable
                                      • The corrective action is simply to use another tool. Do, of course, report the problem to the tool vendor
                                      Problem: Anachronistic variables 
                                      • Leakage from anachronistic variables may be indicated by a model making perfect, or near perfect, classifications that are not explicable as either trivial or obvious
                                      • Anachronistic variables are a pernicious mining problem. However, they aren't a problem at all at deployment time - unless someone expects the model to work! Anachronistic variables are out of place in time. Specifically, at data modeling time, they carry information back from the future to the past
                                      • Since the data set necessarily contains information about events that occur later than the "now" point, it's crucial to take scrupulous care that no later information leaks back
                                      • If any outcome classification model seems far better than reasonably expected, check carefully for anachronistic variables. Build single variable models to discover any variable that individually seem too good to be true. Think carefully about how or why they might be anachronistic
                                      • Eventually, deployment will certainly prove whether any temporal leakage occurred, but that is not the best time to discover the problem
                                      Problem: Noisy or irrelevant variables
                                      • Another possible problem is that the input battery contains one or more very noisy or completely irrelevant variables 
                                      • Noisy or totally irrelevant variables may be a problem, and certainly are for some algorithms. The problem is not with the noise, nor the irrelevancy, but with the fact that they interfere with the algorithm's ability to discover relationships from the other variables
                                      • In fact, random variables are, perhaps surprisingly, likely to contain patterns. The longer the random sequence, the more patterns the random variable will contain
                                      • Many mining tools do rate variable importance for the created model. After trying several iterations of refining a model, if it turns out that some selection of variables are consistently rated as unimportant, remove them
                                      • If an importance measure isn't available, try building several models with small but different selections of variables. Discard any variables that are commonly present in the worst models
                                      Problem: Interaction effects 
                                      • One of the possible problems a miner might find is that the tool (algorithm) selected does not inherently explore interaction effects when important interaction effects are present
                                      • Interaction effects can be crucial, and they are very easy to understand. If you want to carpet a room, it's not enough to know just the length of the room, nor is it enough to know just the width of the room. Interaction between length and width gives the number of square feet of a room, and that is what is needed to buy carpet. In this example, multiplication produces the interaction effect
                                      • Several data mining algorithms do not incorporate interaction effects into their modeling. Several other algorithms do
                                      • Even when an algorithm can learn interaction effects, it speeds learning and better resists noise if the interaction effect is explicitly included
                                      • An easy way of representing interacting variables is to multiply them and add another variable to the input battery with the result (see the sketch after this list)
                                      • It's always worth at least checking performance using a tool with an algorithm that does incorporate interaction effects, such as neural networks
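                                      • A minimal sketch of adding a multiplicative interaction variable (assuming Python with pandas; df and the variable names are hypothetical):

                                      # df is assumed to be the input battery as a pandas DataFrame;
                                      # "length" and "width" are hypothetical interacting variables
                                      df["length_x_width"] = df["length"] * df["width"]   # new variable joins the input battery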
                                      Variable interactions:
                                      • Interactions between variables very simply means that the effect that one input variable has on the output variable changes depending on the value of some other variable. So, as an example, with "Y" as the output variable and "A" and "B" as input variables, the effect that "A" has on "Y" depends on the value of "B."
                                      • An easy way to determine whether the data contains important interactions, and a good rule of thumb for indicating how to characterize them, is through the use of interaction indicator plots (IIPs)
                                      Interaction indicator plots: 
                                      • Using IIPs allows the miner to put the important interactions into the input data set instead of into model complexity, even for those algorithms that could, if so configured, characterize the interactions
                                      • The principle underlying IIPs is straightforward. It requires all the variables to be numeric, or recoded numerically using a principled recoding method if they are categorical
                                      • There are noncommercial tools available that help, but short of such aid a miner has to fall back on other methods. First, go back to the business problem frame, the problem map, the cognitive map, the business process map, and the cause and effect map. In all these places, there are clues as to where variables are expected to interact, and which variables they are. At least create IIPs for these variables
                                      • The main point is that incorporating interaction variables is essential in creating the best quality models for some algorithms, and extremely useful for others
                                      • Including necessary interactions almost always improves a model, and never harms it
                                      Problem: Insufficient data
                                      • Sometimes the data available is very limited in quantity, or instances describing specific outcomes of particular business interest represent a very low proportion of the total number of instances and are insufficient to build a useful model, even with a balanced data set, and no further data is to be had
                                      • The essence of this problem is that the existing instances somehow have to be increased in number to a total number large enough for modeling
                                      • The data has to somehow be increased in quantity in such a way that the expanded sample remains at least as representative of the population as the original sample
                                      Expanding the data: 
                                      • The most straightforward method of increasing the apparent amount of data is simply to duplicate the instances. Copy the original data set and append the copy to the original. Do this several times and the result appears to be a sizable data set
                                      • Expanding the data in this way only results in creating duplicates of the existing instances
                                      Multiplying the data: 
                                      • A second approach to increasing the apparent amount of data, data multiplication, requires the availability of appropriate and technically more complex tools than those used for data expansion
                                      • The idea is to determine the joint distribution of the data set, and to create random values that have the same joint distribution characteristics as the original data set
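                                      • A highly simplified sketch of the idea (assuming Python with numpy; a multivariate normal is used here only as a stand-in for whatever joint-distribution model is appropriate, and "original" is a placeholder for the numeric source data set):

                                      import numpy as np

                                      X = original.to_numpy()                       # original: numeric data set (assumed DataFrame)
                                      mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)

                                      rng = np.random.default_rng(0)
                                      # new instances drawn from the same (approximate) joint distribution
                                      synthetic = rng.multivariate_normal(mean, cov, size=5 * len(X))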
                                      Summary: Refining the model is a crucial piece of the data mining process. Refining requires methods for checking a model's performance, insight for understanding what the checks reveal, knowledge of what applicable techniques are relevant to improve model performance, and methods for applying the techniques to modeling data, or business problems, as appropriate. 
                                      No matter how technically effective the model appears, and no matter how well tested, the model has no value unless effectively deployed. Deployment is where the technical effort blends into meeting the business needs. Deployment is the final, utterly crucial step in effective data mining.
                                      ==============================================
                                      Deploying the Mined Model
                                      Deploying the Mined Model:
                                      • The road that leads from raw data to deployed mined model is not a straight one. After considerable effort to create a simple, reliable, and robust model - or one as simple, reliable, and robust as the business problem and data permit - it's time to take the result of the hard work and use it to get some business value from all the effort. It is, in other words, time to deploy it
                                      • The technical deployment of a mined model and the business deployment of a data-based business model are two parts of a single whole
                                      • The mined model must solve, or at least address, business issues, and the business issues have to be so framed that they lead to a solution through mining data
                                      Deploying explanatory models: 
                                      • An explanatory model has to tell a story. Plain and simple, deploying an explanatory model is no more than delivering a narrative explanation of the world supported by the facts (data). It's a special sort of story, and it has to tell about particular features, but it is a story nonetheless
                                      • A summary - an explanation - has to interpret the facts. "Just the facts" is no good to anyone. It is the thread of explanation - the narration of interconnections and relationships - that is important. What they really want is a story that provides "just the meaningful summary of the facts." Here is where a story is worth a thousand pictures
                                       Novelty, and keeping the model working:
                                      • The world constantly creates new events, circumstances, ideas, and experiences. Always there is something new under the sun. For no two instants is the world ever the same
                                      • Although some things remain broadly the same over even quite long periods of time, everything is also simultaneously in a constant flux, and that makes problems of many sorts for reliably deploying models
                                      Regression to the mean: 
                                      • The constant novelty of the world is not without patterns. To be sure, nothing, not even the patterns, remain quite the same; yet still the pattern is there, and one of the patterns that is of great importance for many models is labeled with the formidable title of regression to the mean
                                      • The mean intended is simply an average value. The term regression was originally used in a sense of "return," so the pattern that this label applies to could be, perhaps, a little more intuitively termed "return to the average." The essential idea is simple: anything that is at some point extreme tends later to be less extreme
                                      • Any deployed model, or at least, monitoring the results of any deployed model, will be impacted in various ways by this regression-to-the-mean phenomenon
                                      • More generally, of course, the point is to create a baseline estimate of the probable course of uninfluenced events to compare against the influenced (by the model) course of events
                                      • Systems try to maintain themselves against changes, for example. Regression to the mean is just another,  although quantifiable, example of this same phenomenon
                                      It turns out that very often business models are trying to either:
                                      • Move extreme events back toward the mean (say, under-performing customers)
                                      • Move mean events out toward the extremes (say, reduce plant industrial accidents toward zero)
                                      • Move the mean of the whole population of events (say, reduce overall energy consumption of the company)
                                      • Change the shape of the distribution of events (say, increase the ratio of satisfied customers to dissatisfied customers)
                                          Quantifying regression to the mean:
                                          • The essential point is to get a grasp on the way that the world would have changed had the model not been employed. Regression to the mean (this will be abbreviated "R2M") takes place between any two variables that are imperfectly correlated
                                          • This is true regardless of the nature of the relationship between the two variables; specifically, it doesn't matter if the relationship is linear or nonlinear, nor whether the distribution of the two variables is normal or something other than normal
                                          • It turns out that the size of the R2M effect depends on the strength of the correlation between the two variables - the weaker the correlation, the larger the effect - and it can be determined (when nothing else changes and the distributions are normal) from the correlation coefficient
                                          • If the correlation coefficient r is known, the expected R2M effect can be approximated from the following expression:          R2M% = 100 x (1 - r). This simple expression provides an approximation of the percentage change due to R2M effects
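                                          • A minimal illustration of the expression above (the correlation value used is an arbitrary example):

                                          def r2m_percent(r: float) -> float:
                                              """Approximate percentage of an extreme result expected to regress to the mean."""
                                              return 100.0 * (1.0 - r)

                                          # e.g. if the two measurements correlate at r = 0.6, roughly 40% of the
                                          # apparent extremity can be expected to disappear on its own
                                          print(r2m_percent(0.6))   # 40.0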
                                          Explaining regression to the mean: 
                                          • R2M turns out to be difficult to convincingly explain to many stakeholders, so it's worth discussing the point a little more here since the miner's (and the modeler's) biggest problem, after working out the expected R2M effects, will certainly be to convincingly explain what's going on
                                          • The problem is that "common sense" does not lead to an expectation of R2M effects, so people, specifically stakeholders and corporate managers, don't understand it
                                          • Any expression or expectation that "things will get back to normal" expresses R2M most explicitly, as does an expression or expectation that "things will even out over the long haul," and usually a colloquial appeal to the "law of averages" is also an appeal to R2M
                                          • The truth is that extreme events regress - return - to the mean. Randomness reigns supreme and that's the way life works, like it or not. The message from R2M is watch for it - or it will catch you out!
                                          Distributions: 
                                          • The most familiar distribution is the one known variously as "the bell curve," "the normal distribution," or "the Gaussian distribution." 
                                          • The point about distributions is that they are really only a way of expressing the relative frequencies with which particular values turn up. For the normal distribution, for instance, values that are nearest to the mean of the distribution (the point in the center of the bell shape) are the most likely, whereas those far away (in the tails) are far less likely
                                          • The key point is that all modeling techniques used in data mining are essentially frequency-based. Information about a distribution simply summarizes the characteristic frequency of occurrence of any value
                                          • A key concern is making sure that the distributions in the model-creating data set are the same as the distributions in the world
                                          • Although the distribution may remain normal, perhaps the mean or the standard deviation may change over time either quickly or slowly. (Mean and standard deviations are both measures of the exact shape and position of a normal distribution.)
                                          • If the distribution of values for one or more variables changes over time, the model will become less relevant; the more the distribution changes, the less relevant the model becomes
                                          • What is important to know in a deployed model is how reliable it is "now," not how it was performing yesterday, last week, last month, or last year
                                          • One way to check model performance for run-time data distributional nonstationarity (drift in the distribution) is to compare actual against predicted output
                                          • The key is to compare distribution characteristics of the evaluation data set residuals (actual versus predicted) with the real-world data set residual distribution
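                                          • One reasonable way to make that comparison (not prescribed by the text) is a two-sample test on the residuals; a sketch assuming Python with scipy, where model, X_eval/y_eval, and X_live/y_live are placeholders:

                                          from scipy.stats import ks_2samp

                                          eval_residuals = y_eval - model.predict(X_eval)    # evaluation data set residuals
                                          live_residuals = y_live - model.predict(X_live)    # recent run-time residuals

                                          stat, p_value = ks_2samp(eval_residuals, live_residuals)
                                          if p_value < 0.01:
                                              print("Residual distribution has drifted; the model may be losing relevance")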
                                          Distributions that aren't: 
                                          • Recall that distributions have a "shape," among other characteristics, that represents the relative frequency of occurrence of specific values (or categories) in specific parts of the variable's range. For numeric variables such distributions have various features that describe the distribution, such as the mean, median, or mode
                                          • Categorical distributions have analogous descriptive features. Yet there are some collections of numbers or categories that have no known "shape," nor any of the other descriptive measures used for distributions; these are collections of values that have no distribution. Unfortunately, they aren't at all rare, and even in business modeling, the data miner is very likely to have to model such apparently strange beasts as collections of data that have no distribution
                                          • If a miner can identify a nonstationary distribution, and if the distribution is changing quickly enough to affect the performance of the model, it's perfectly possible and legitimate to try to create a predictive model that estimates the changed distributional parameters
                                          • For a little further digging by anyone interested, many of these non-distributional series can be described using a phenomenon called self-organized criticality [SOC]
                                          Detecting Novelty: 
                                          • The problem of keeping models working in the face of changing (non-stationary) distributions - and indeed series that have no theoretical distribution, stationary or otherwise - seems difficult to manage until the model has failed anyway, and that may be known only long after the fact. This is not a lot of help in trying to determine how well the model is working right now, and whether it's likely to work acceptably in any specific instance
                                          • It turns out that outlying behaviors are very often fraud, or precursors to major changes that are about to hit the model. These behaviors are novel, which is to say that they haven't been seen before in the training, test, and evaluation data sets - and not even in the run-time data - not seen, that is, until they do indeed turn up. Detecting novel single instances of data is tough, but not always impossible, and for critical applications, well worth attempting
                                          • The key insight for constructing a "novelty detector" is simply to realize that novel instances of data are unlike instances previously encountered - in other words, outliers. In principle, all that's needed is a device (or model) that indicates how likely the multivariate instance is to be an outlier compared to the data set used to train the model
                                          • For a normally distributed single variable, the measurement is fairly easy using the standard deviation. Any instance whose value is more than, say, three standard deviations distant from the mean may be declared an outlier, or a novel instance
                                          • When detecting how "novel" an instance is in a variable that is normally distributed, the measurement needed is some indication of just how unlikely it is for the instance value to turn up using the known "shape" of the distribution
                                          • For the univariate normal distribution, this is a measure of how far the value falls from the bulk of the values at the mean; the measure used as a surrogate for novelty was the number of standard deviations from the mean, a quantity sometimes denoted by the symbol z, which is why it's sometimes called a z-score
                                          • The actual degree of novelty is impossible to measure, but the distance from cluster centers at least serves as one measure of how "likely" or "unlikely" any given instance is. Thus, distance from the cluster centers serves well as a surrogate for novelty
                                          • Creating a novelty detector requires using the minimum distance that an instance is located away from any of the cluster centers. If it's close to one of them, it's not likely to be novel
                                          • To create the pseudo z-score, call it pZ, first determine the standard deviation for all the minimum distances to cluster centers for all the data in the data set. The explanation actually takes longer to work through and understand than it takes to create the novelty detector. This is a quick and easy device to create to estimate novelty instance-by-instance (or, in other words, record-by-record)
                                          A novelty detector depends on clustering, and most forms of clustering require the miner to select the appropriate number of clusters, and the best that can be done there is a rule-of-thumb estimate. (Read "guess.") However, a workable rule of thumb seems to be:
                                          • Use no less than three clusters
                                          • Then: use the number of clusters equal to the square root of the number of variables
                                          • Adjust the number of clusters, if necessary, until a pZ = 3 captures about 90% of the instances
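                                          • A sketch of the pZ construction and rule of thumb above (assuming Python with numpy and scikit-learn; X_train and X_live are placeholders):

                                          import numpy as np
                                          from sklearn.cluster import KMeans

                                          def build_novelty_detector(X_train, n_clusters=None, seed=0):
                                              """Fit clusters on the training data and calibrate the pZ scale."""
                                              if n_clusters is None:
                                                  # rule of thumb: at least 3, about sqrt(number of variables);
                                                  # adjust until pZ = 3 captures about 90% of the instances
                                                  n_clusters = max(3, int(round(np.sqrt(X_train.shape[1]))))
                                              km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_train)
                                              min_dist = km.transform(X_train).min(axis=1)   # distance to nearest center
                                              return km, min_dist.std()

                                          def pZ(km, scale, X_new):
                                              """Pseudo z-score: minimum distance to any cluster center, in 'standard deviations'."""
                                              return km.transform(X_new).min(axis=1) / scale

                                          # usage: anything with pZ > 3 is flagged as likely novel
                                          km, scale = build_novelty_detector(X_train)
                                          novel_flags = pZ(km, scale, X_live) > 3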
                                          Using the novelty detector:
                                          • A novelty detector is simply a device that, on a case-by-case basis, produces a score that goes some way toward indicating whether an individual instance (record) is from a distribution that is similar to, or dissimilar from, the distribution of the data set used for training (and test and evaluation data sets too, of course)
                                          • It's devised so that any pZ<=3 indicates that the instance is likely to come from a similar multivariate distribution, and any pZ > 3 points to a likelihood that the instance is novel
                                          • The algorithm used to produce the pZ estimate is very easy to implement and fast to execute. Any data set can be quickly scored as a whole, appending the pZ score to each record, or, in a real-time setting where instances arrive asynchronously, it's still possible to make case-by case judgments, keep moving averages, and so on
                                          • In short, pZ and pZc scores are quick and easy to generate, intuitive to understand, and easy to use and deploy: a powerful front-line check on current model relevance. Additionally, they can provide a powerful mechanism for potentially improving overall model performance by splitting the data set for multiple models
                                              Deployed model form: 
                                              • The form of the delivered model has to be considered at some point in model creation. Some data mining tools require the modeler to deploy the model by calling some run-time version (or perhaps even the full version) of the modeling tool. This requirement is imposed by making a generic form of the model either hard or impossible to retrieve from the modeling environment
                                              • It's all very well to have a neural network or decision tree or clusters as a model that works fine, but if a set of equations, rules, or whatever cannot be produced that express the model, it can't then be ported into a different environment 
                                              • Some models are created in one environment (usually some form of MS Windows) and have to be deployed elsewhere - perhaps on a mainframe or running under some version of Unix, or as a distributed application on multiple systems
                                              • Whatever the final needs for the form of the deployed model, checking how the tool allows the model to be deployed can save some nasty surprises when it comes time to put it into a production environment
                                              • Deployment issues blend into the general application development and deployment issues; these need to be addressed by the miner, probably with the stakeholders and the IT folk responsible for getting the model into production 

                                              ============================================
                                              Ref: Business Modeling and Data Mining by Dorian Pyle



Data mining itself is the process of finding useful patterns and rules in large volumes of data. Data mining is an important component of analytic customer relationship management. The goal of analytic customer relationship management is to recreate, to the extent possible, the intimate, learning relationship that a well-run small business enjoys with its customers. A company’s interactions with its customers generate large volumes of data. This data is initially captured in transaction processing systems such as automatic teller machines, telephone switch records, and supermarket scanner files. The data can then be collected, cleaned, and summarized for inclusion in a customer data warehouse. A well-designed customer data warehouse contains a historical record of customer interactions that becomes the memory of the corporation. Data mining tools can be applied to this historical record to learn things about customers that will allow the company to serve them better in the future.

                                              Many problems of intellectual, economic, and business interest can be phrased in terms of the following six tasks: 
                                              • Classification
                                              • Estimation
                                              • Prediction
                                              • Affinity grouping
                                              • Clustering
                                              • Description and profiling 
                                              The first three are all examples of directed data mining, where the goal is to find the value of a particular target variable. Affinity grouping and clustering are undirected tasks where the goal is to uncover structure in data without respect to a particular target variable. Profiling is a descriptive task that may be either directed or undirected. In directed data mining there is always a target variable— something to be classified, estimated, or predicted.
                                              In undirected data mining, there is no target variable. The data mining task is to find overall patterns that are not tied to any one variable. The most common form of undirected data mining is clustering, which finds groups of similar records without any instructions about which variables should be considered as most important. Undirected data mining is descriptive by nature, so undirected data mining techniques are often used for profiling, but directed techniques such as decision trees are also very useful for building profiles.

                                              In the machine learning literature, directed data mining is called supervised learning and undirected data mining is called unsupervised learning.
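As a quick illustration of the distinction, the sketch below fits a directed (supervised) decision tree against a target variable and an undirected (unsupervised) clustering on the same inputs; the table and its column names are hypothetical.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Hypothetical customer table; columns and values are made up for illustration
customers = pd.DataFrame({
    "tenure_months": [3, 48, 12, 60, 7, 24],
    "monthly_spend": [20, 85, 35, 90, 25, 55],
    "responded":     [0, 1, 0, 1, 0, 1],   # the target variable
})

# Directed (supervised): a target variable guides the learning
tree = DecisionTreeClassifier(max_depth=2).fit(
    customers[["tenure_months", "monthly_spend"]], customers["responded"])

# Undirected (unsupervised): no target; clustering looks for structure on its own
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    customers[["tenure_months", "monthly_spend"]])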

                                              For mass-market products, data about customer interactions is the new water-power; knowledge drives the turbines of the service economy and, since the line between service and manufacturing is getting blurry, much of the manufacturing economy as well. Information from data focuses marketing efforts by segmenting customers, improves product designs by addressing real customer needs, and improves allocation of resources by understanding and predicting customer preferences.

The four stages of the virtuous cycle of data mining are:
1. Identifying the business problem (business opportunities where analyzing data can provide value)
2. Mining data to transform it into actionable information
3. Acting on the information
4. Measuring the results of the efforts, to complete the learning cycle
                                              Highlights:
                                              • Customer patterns become evident over time. Data warehouses need to support accurate historical data so that data mining can pick up these critical trends
                                              • When doing market research on existing customers, it is a good idea to use data mining to take into account what is already known about them
                                              • When talking to business users about data mining opportunities, make sure they focus on the business problems and not technology and algorithms. Let the technical experts focus on the technology and the business experts focus on the business
                                              • The results of data mining need to feed into business processes that touch customers and affect the customer relationship
                                              • Data mining is about connecting the past— through learning— to future actions
                                              • A classic application of data mining is to find the most cost-effective way to reach the desired number of responders
                                              • For data mining to succeed, there must be some relationship between the input variables and the target
                                              • It is impossible to do a good job of selecting input variables without knowledge of the business problem being addressed
                                              • Experts from several different functional areas including marketing, sales, and customer support usually meet together with outside data mining consultants to brainstorm about the best way to make use of available data to define inputs
                                              • Patterns often do reflect some underlying truth about the way the world works
                                              • The challenge for data miners is to figure out which patterns are predictive and which are not
• The technical term for finding patterns that fail to generalize is overfitting. Overfitting leads to unstable models that work one day, but not the next. Building stable models is the primary goal of the data mining methodology
                                              • The right way to decide if a rule is stable and predictive is to compare its performance on multiple samples selected at random from the same population
                                              • The model set is the collection of historical data that is used to develop data mining models
                                              • A sample that does not properly reflect its parent population is biased 
                                              • Using a biased sample as a model set is a recipe for learning things that are not true
                                              • Careful attention to selecting and sampling data for the model set is crucial to successful data mining
• Sometimes it is only a failure of imagination that makes new information appear useless. A study of customer attrition is likely to show that the strongest predictor of customers leaving is the way they were acquired. It is too late to go back and change that for existing customers, but that does not make the information useless. Future attrition can be reduced by changing the mix of acquisition channels to favor those that bring in longer-lasting customers
                                              • The data mining methodology is designed to steer clear of the Scylla of learning things that aren’t true and the Charybdis of not learning anything useful
                                              • In a more positive light, the methodology is designed to ensure that the data mining effort leads to a stable model that successfully addresses the business problem it is designed to solve
                                              • Hypothesis testing is the simplest approach to integrating data into a company’s decision-making processes. The purpose of hypothesis testing is to substantiate or disprove preconceived ideas, and it is a part of almost all data mining endeavors
                                              • A hypothesis is a proposed explanation whose validity can be tested by analyzing data. Such data may simply be collected by observation or generated through an experiment, such as a test mailing. There are some identifiable steps to the process, the first and most important of which is generating good ideas to test
                                              • Each time a company solicits a response from its customers, whether through advertising or a more direct form of communication, it has an opportunity to gather information. Slight changes in the design of the communication, such as including a way to identify the channel when a prospect responds, can greatly increase the value of the data collected
                                              • The data mining techniques are all designed for learning new things by creating models based on data

                                                • In the most general sense, a model is an explanation or description of how something works that reflects reality well enough that it can be used to make inferences about the real world
                                                • Data mining is all about creating models. Models take a set of inputs and produce an output.  
                                                • Profiling uses data from the past to describe what happened in the past. Prediction goes one step further. Prediction uses data from the past to predict what is likely to happen in the future
                                                • The role of the data miner is to ensure that the final statement of the business problem is one that can be translated into a data mining problem. Otherwise, the best data mining efforts in the world may be addressing the wrong business problem
                                                • Data mining is often presented as a technical problem of finding a model that explains the relationship of a target variable to a group of input variables
                                                • The first place to look for data is in the corporate data warehouse. Data in the warehouse has already been cleaned and verified and brought together from multiple sources
• Once the preclassified data has been obtained from the appropriate timeframes, the methodology calls for dividing it into three parts. The first part, the training set, is used to build the initial model. The second part, the validation set, is used to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set. The third part, the test set, is used to gauge the likely effectiveness of the model when applied to unseen data
• Lift is a very handy tool for comparing the performance of two models applied to the same or comparable data. Note that the performance of two models can only be compared using lift when the test sets have the same density of the outcome
                                                • Deploying a model means moving it from the data mining environment to the scoring environment
• The challenge in deploying data mining models is that they are often used to score very large datasets. In some environments, every one of millions of customer records is updated with a new behavior score every day. A score is simply an additional field in a database table. Scores often represent a probability or likelihood so they are typically numeric values between 0 and 1
                                                • A score might also be a class label provided by a clustering model, for instance, or a class label with a probability
                                                The data used to create the model is called a model set. When models are applied to new data, this is called the score set. The model set has three components: 
                                                • The training set is used to build a set of models
                                                • The validation set is used to choose the best model of these
                                                • The test set is used to determine how the model performs on unseen data
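A minimal sketch of that three-way partition, assuming the model set is held in a pandas DataFrame and using an illustrative 60/20/20 split:

import pandas as pd
from sklearn.model_selection import train_test_split

def split_model_set(model_set: pd.DataFrame, seed: int = 0):
    """Partition a model set roughly 60/20/20 into training, validation, and test sets."""
    training, rest = train_test_split(model_set, test_size=0.4, random_state=seed)
    validation, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return training, validation, test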
                                                The virtuous cycle of data mining is about harnessing the power of data and transforming it into actionable business results. Just as water once turned the wheels that drove machines throughout a mill, data needs to be gathered and disseminated throughout an organization to provide value. 
                                                If data is water in this analogy, then data mining is the wheel, and the virtuous cycle spreads the power of the data to all the business processes. The virtuous cycle of data mining is a learning process based on customer data. It starts by identifying the right business opportunities for data mining. The best business opportunities are those that will be acted upon. Without action, there is little or no value to be gained from learning about customers. Also very important is measuring the results of the action. This completes the loop of the virtuous cycle, and often suggests further data mining opportunities.

                                                The data mining methodology has 11 steps. 
                                                1. Translate the business problem into a data mining problem
                                                2. Select appropriate data
                                                3. Get to know the data
                                                4. Create a model set
                                                5. Fix problems with the data
                                                6. Transform data to bring information to the surface
                                                7. Build models
8. Assess models
                                                9. Deploy models
                                                10. Assess results
                                                11. Begin again
                                                Data mining comes in two forms. Directed data mining involves searching through historical records to find patterns that explain a particular outcome. Directed data mining includes the tasks of classification, estimation, prediction, and profiling. Undirected data mining searches through the same records for interesting patterns. It includes the tasks of clustering, finding association rules, and description. 
                                                Data mining brings the business closer to data. As such, hypothesis testing is a very important part of the process. Data mining is full of traps for the unwary and following a methodology based on experience can help avoid them. 
                                                The first hurdle is translating the business problem into one of the six tasks that can be solved by data mining: classification, estimation, prediction, affinity grouping, clustering, and profiling. 
                                                The next challenge is to locate appropriate data that can be transformed into actionable information. Once the data has been located, it should be thoroughly explored. The exploration process is likely to reveal problems with the data. It will also help build up the data miner’s intuitive understanding of the data. The next step is to create a model set and partition it into training, validation, and test sets. 
                                                Data transformations are necessary for two purposes: to fix problems with the data such as missing values and categorical variables that take on too many values, and to bring information to the surface by creating new variables to represent trends and other ratios and combinations. 
Once the data has been prepared, building models is a relatively easy process. Each type of model has its own metrics by which it can be assessed, but there are also assessment tools that are independent of the type of model. Some of the most important of these are the lift chart, which shows how the model has increased the concentration of the desired value of the target variable, and the confusion matrix, which shows the misclassification error rate for each of the target classes.
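As a rough illustration of those two model-independent assessments, the sketch below computes lift on the top-scoring decile and a confusion matrix for hard classifications; the top_decile_lift helper is an illustrative simplification of a full lift chart, not the book's code.

import numpy as np
from sklearn.metrics import confusion_matrix

def top_decile_lift(actual, scores):
    """Response rate among the top-scoring 10% divided by the overall response rate."""
    actual, scores = np.asarray(actual), np.asarray(scores)
    cutoff = max(1, len(scores) // 10)
    top = np.argsort(scores)[::-1][:cutoff]
    return actual[top].mean() / actual.mean()

# Confusion matrix for hard class predictions (rows = actual, columns = predicted)
# cm = confusion_matrix(actual, predicted)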
                                                Ref: Berry, Michael J. A.; Linoff, Gordon S.. Data Mining Techniques : For Marketing, Sales, and Customer Relationship Management
                                                =========================================
In marketing, a prospect is someone who might reasonably be expected to become a customer if approached in the right way. Both the noun and the verb "prospect" resonate with the idea of using data mining to achieve the business goal of locating people who will be valuable customers in the future.

                                                Data mining can play many roles in prospecting. The most important of these are:
                                                • Identifying good prospects
                                                • Choosing a communication channel for reaching prospects 
                                                • Picking appropriate messages for different groups of prospects
                                                Truly good prospects are not only interested in becoming customers; they can afford to become customers, they will be profitable to have as customers, they are unlikely to defraud the company and likely to pay their bills, and, if treated well, they will be loyal customers and recommend others. No matter how simple or sophisticated the definition of a prospect, the first task is to target them.
                                                Targeting is important whether the message is to be conveyed through advertising or through more direct channels such as mailings, telephone calls, or email.

                                                Highlights:
                                                • For many companies, the first step toward using data mining to identify good prospects is building a response model
                                                • Advertising targets groups of people based on common traits; however, advertising does not make it possible to customize messages to individuals
                                                • One way of targeting prospects is to look for people who resemble current customers
• By targeting prospects who match the profile, a company can increase the response rate to its own promotional efforts
• The data mining challenge is to come up with a good definition of what it means to match the profile
                                                • One way of determining whether a customer fits a profile is to measure the similarity— which we also call distance— between the customer and the profile
• Several data mining techniques use the idea of measuring similarity as a distance. Memory-based reasoning is a technique for classifying records based on the classifications of known records that are “in the same neighborhood.” Automatic cluster detection is another data mining technique that depends on the ability to calculate a distance between two records in order to find clusters of similar records close to each other (see the sketch after this list)
                                                • When comparing customer profiles, it is important to keep in mind the profile of the population as a whole. For this reason, using indexes is often better than using raw values
                                                • One philosophy of marketing is based on the old proverb “birds of a feather flock together.” That is, people with similar interests and tastes live in similar areas (whether voluntarily or because of historical patterns of discrimination)
                                                • Advertising can be used to reach prospects about whom nothing is known as individuals
                                                • At the most basic level, data mining can be used to improve targeting by selecting which people to contact. Actually, the first level of targeting does not require data mining, only data
                                                • A principal application of data mining to prospects is targeting— finding the prospects most likely to actually respond to an offer
                                                • Direct marketing campaigns typically have response rates measured in the single digits. Response models are used to improve response rates by identifying prospects who are more likely to respond to a direct solicitation
                                                • The most useful response models provide an actual estimate of the likelihood of response
                                                • A smaller, better-targeted campaign can be more profitable than a larger and more expensive one. Lift increases as the list gets smaller, so is smaller always better? The answer is no because the absolute revenue decreases as the number of responders decreases
                                                • The profitability of a campaign depends on so many factors that can only be estimated in advance that the only reliable way to do it is to use an actual market test
                                                • The goal of a marketing campaign is to change behavior. In this regard, reaching a prospect who is going to purchase anyway is little more effective than reaching a prospect who will not purchase despite having received the offer
                                                • By recording everything that was known about a customer at the time of acquisition and then tracking customers over time, businesses can use data mining to relate acquisition-time variables to future outcomes such as customer longevity, customer value, and default risk
                                                • Customer relationship management naturally focuses on established customers. Happily, established customers are the richest source of data for mining. Best of all, the data generated by established customers reflects their actual individual behavior
                                                • Does the customer pay bills on time? Check or credit card? When was the last purchase? What product was purchased? How much did it cost? How many times has the customer called customer service? How many times have we called the customer? What shipping method does the customer use most often? How many times has the customer returned a purchase? This kind of behavioral data can be used to evaluate customers’ potential value, assess the risk that they will end the relationship, assess the risk that they will stop paying their bills, and anticipate their future needs
                                                • Customer segmentation is a popular application of data mining with established customers
                                                • With existing customers, a major focus of customer relationship management is increasing customer profitability through cross-selling and up-selling. Data mining is used for figuring out what to offer to whom and when to offer it
                                                • Cross-selling is defined as "the action or practice of selling among or between established clients, markets, traders, etc." or "that of selling an additional product or service to an existing customer". ...
                                                • Up-selling is a sales technique whereby a salesperson induces the customer to purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. ...
                                                • Churn (or, to look on the bright side, retention) is a major application of data mining. Churn is generally used in the telephone industry to refer to all types of customer attrition whether voluntary or involuntary
                                                • Churn is easiest to define in subscription-based businesses, and partly for that reason, churn modeling is most popular in these businesses
                                                • Churn is important because lost customers must be replaced by new customers, and new customers are expensive to acquire and generally generate less revenue in the near term than established customers
                                                • The motivation for building churn models is to figure out who is most at risk for attrition so as to make the retention offers to high-value customers who might leave without the extra incentive
                                                • Churn is voluntary. Customers, of their own free will, decide to take their business elsewhere. This type of attrition, known as voluntary churn, is actually only one of three possibilities. The other two are involuntary churn and expected churn
                                                • Involuntary churn, also known as forced attrition, occurs when the company, rather than the customer, terminates the relationship— most commonly due to unpaid bills
                                                • Expected churn occurs when the customer is no longer in the target market for a product
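To make the similarity-as-distance idea from the highlights above concrete, here is a minimal memory-based-reasoning sketch that classifies a new prospect from the classifications of the nearest known customers; the variables and values are made up for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Known customers described by two made-up variables (age, income) and their class;
# in practice the variables would be standardized so income does not dominate the distance
known = np.array([[35, 52_000], [62, 110_000], [28, 43_000], [55, 98_000]])
labels = ["responder", "non-responder", "responder", "non-responder"]

# Classify a new prospect from the classifications of its nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(known, labels)
prediction = knn.predict([[40, 60_000]])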
The data mining techniques have applications in fields as diverse as biotechnology research and manufacturing process control. Data mining is used in support of both advertising and direct marketing to identify the right audience, choose the best communications channels, and pick the most appropriate messages. Prospective customers can be compared to a profile of the intended audience and given a fitness score. Should information on individual prospects not be available, the same method can be used to assign fitness scores to geographic neighborhoods using data of the type available from the U.S. Census Bureau, Statistics Canada, and similar official sources in many countries.
A common application of data mining in direct marketing is response modeling. A response model scores prospects on their likelihood to respond to a direct marketing campaign. This information can be used to improve the response rate of a campaign, but is not, by itself, enough to determine campaign profitability. Estimating campaign profitability requires reliance on estimates of the underlying response rate to a future campaign, estimates of average order sizes associated with the response, and cost estimates for fulfillment and for the campaign itself. A more customer-centric use of response scores is to choose the best campaign for each customer from among a number of competing campaigns. This approach avoids the usual problem of independent, score-based campaigns, which tend to pick the same people every time.
                                                It is important to distinguish between the ability of a model to recognize people who are interested in a product or service and its ability to recognize people who are moved to make a purchase based on a particular campaign or offer. Differential response analysis offers a way to identify the market segments where a campaign will have the greatest impact. Differential response models seek to maximize the difference in response between a treated group and a control group rather than trying to maximize the response itself.
                                                Information about current customers can be used to identify likely prospects by finding predictors of desired outcomes in the information that was known about current customers before they became customers. This sort of analysis is valuable for selecting acquisition channels and contact strategies as well as for screening prospect lists. Companies can increase the value of their customer data by beginning to track customers from their first response, even before they become customers, and gathering and storing additional information when customers are acquired.
                                                Once customers have been acquired, the focus shifts to customer relationship management. The data available for active customers is richer than that available for prospects and, because it is behavioral in nature rather than simply geographic and demographic, it is more predictive. Data mining is used to identify additional products and services that should be offered to customers based on their current usage patterns. It can also suggest the best time to make a cross-sell or up-sell offer.
                                                One of the goals of a customer relationship management program is to retain valuable customers. Data mining can help identify which customers are the most valuable and evaluate the risk of voluntary or involuntary churn associated with each customer. Armed with this information, companies can target retention offers at customers who are both valuable and at risk, and take steps to protect themselves from customers who are likely to default.
                                                From a data mining perspective, churn modeling can be approached as either a binary-outcome prediction problem or through survival analysis. There are advantages and disadvantages to both approaches. The binary outcome approach works well for a short horizon, while the survival analysis approach can be used to make forecasts far into the future and provides insight into customer loyalty and customer value as well.
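For the binary-outcome approach, a minimal sketch might look like the following, assuming a customer table with behavioral columns and a 0/1 churn label observed over a fixed horizon; the column names are hypothetical, logistic regression is just one reasonable choice of classifier, and the model is scored on its own training data purely for brevity.

import pandas as pd
from sklearn.linear_model import LogisticRegression

def score_churn_risk(customers: pd.DataFrame, features, label="churned_in_horizon"):
    """Fit a binary churn model and return a churn probability for each customer."""
    model = LogisticRegression(max_iter=1000).fit(customers[features], customers[label])
    return model.predict_proba(customers[features])[:, 1]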

                                                Ref: Berry, Michael J. A.; Linoff, Gordon S.. Data Mining Techniques : For Marketing, Sales, and Customer Relationship Management
                                                ============================================
                                                Highlights: 
                                                • The simplest explanation is usually the best one— even (or especially) if it does not prove the hypothesis you want to prove
                                                • One difference between data miners and statisticians is that data miners are often working with sufficiently large amounts of data that make it unnecessary to worry about the mechanics of calculating the probability of something being due to chance
• The p-value is the probability of seeing a result at least as extreme as the one observed when the null hypothesis is true. Remember, when the null hypothesis is true, nothing is really happening, because differences are due to chance. Much of statistics is devoted to determining bounds for the p-value
                                                • Confidence, sometimes called the q-value, is the flip side of the p-value. Generally, the goal is to aim for a confidence level of at least 90 percent, if not 95 percent or more (meaning that the corresponding p-value is less than 10 percent, or 5 percent, respectively)
                                                • Null hypothesis, p-value, and confidence— are three basic ideas in statistics
                                                • A statistic refers to a measure taken on a sample of data. Statistics is the study of these measures and the samples they are measured on. A good place to start, then, is with such useful measures, and how to look at data
                                                • Much of the data used in data mining is discrete by nature, rather than continuous. Discrete data shows up in the form of products, channels, regions, and descriptive information about businesses
                                                • The most basic descriptive statistic about discrete fields is the number of times different values occur
                                                • Histograms are quite useful and easily made with Excel or any statistics package. However, histograms describe a single moment. Data mining is often concerned with what is happening over time. A key question is whether the frequency of values is constant over time
• Time series analysis requires choosing an appropriate time frame for the data; this includes not only the units of time, but also the point from which counting starts
                                                • When looking at field values over time, look at the data by day to get a feel for the data at the most granular level
                                                • A time series chart has a wealth of information. For example, fitting a line to the data makes it possible to see and quantify long term trends
                                                • There is a basic theorem in statistics, called the Central Limit Theorem, which says the following: As more and more samples are taken from a population, the distribution of the averages of the samples (or a similar statistic) follows the normal distribution. The average (what statisticians call the mean) of the samples comes arbitrarily close to the average of the entire population
                                                • The purpose of standardizing the values is to test the null hypothesis. When true, the standardized values should follow the normal distribution (with an average of 0 and a standard deviation of 1), exhibiting several useful properties
                                                • The z-value is useful for other reasons as well. For instance, it is one way of taking several variables and converting them to similar ranges. This can be useful for several data mining techniques, such as clustering and neural networks
                                                • One very important idea in statistics is the idea of a distribution. For a discrete variable, a distribution is a lot like a histogram— it tells how often a given value occurs as a probability between 0 and 1
                                                • The normal distribution, which plays a very special role in statistics, is an example of a distribution for a continuous variable
                                                • Statistics originated to understand the data collected by scientists, most of which took the form of continuous measurements. In data mining, we encounter continuous data less often, because there is a wealth of descriptive data as well
                                                • Standard deviation, the square root of the variance, is the most frequently used measure of dispersion
                                                • Correlation is a measure of the extent to which a change in one variable is related to a change in another. Correlation ranges from –1 to 1. A correlation of 0 means that the two variables are not related. A correlation of 1 means that as the first variable changes, the second is guaranteed to change in the same direction, though not necessarily by the same amount
• Another measure of correlation is the R² value, which is the correlation squared and goes from 0 (no relationship) to 1 (complete relationship). For instance, the radius and the circumference of a circle are perfectly correlated, although the latter grows faster than the former. A negative correlation means that the two variables move in opposite directions. For example, altitude is negatively correlated to air pressure
• Regression is the process of using the value of one of a pair of correlated variables in order to predict the value of the second. The most common form of regression is linear regression, so called because it attempts to fit a straight line through the observed X and Y pairs in a sample. Once the line has been established, it can be used to predict a value for Y given any X and for X given any Y (see the sketch after this list)
                                                • Confidence intervals only measure the likelihood that sampling affected the result. There may be many other factors that we need to take into consideration to determine if two offers are significantly different. Each group must be selected entirely randomly from the whole population for the difference of proportions method to work
                                                • The confidence interval is a measure of only one thing, the statistical dispersion of the result
                                                • The champion-challenger model is an example of a two-way test, where a new method (the challenger) is compared to business-as-usual activity (the champion)
                                                • Before running a marketing test, determine the acuity of the test by calculating the difference in response rates that can be measured with a high confidence (such as 95 percent)
• The chi-square test is designed specifically for the situation when there are multiple tests and at least two discrete outcomes (such as response and non-response)
                                                • The appeal of the chi-square test is that it readily adapts to multiple test groups and multiple outcomes, so long as the different groups are distinct from each other. This, in fact, is about the only important rule when using this test
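The correlation and regression sketch referenced in the bullets above, using a small made-up X/Y sample:

import numpy as np

# Made-up X/Y sample for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]        # correlation, between -1 and 1
r_squared = r ** 2                 # proportion of variation explained

slope, intercept = np.polyfit(x, y, 1)   # fit a straight line y = slope*x + intercept
predicted_y = slope * 6.0 + intercept    # predict Y for a new X value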
                                                The data mining approach differs from the standard statistical approach in several areas:
                                                • Data miners tend to ignore measurement error in raw data
                                                • Data miners assume that there is more than enough data and processing power
                                                • Data mining assumes dependency on time everywhere
                                                • It can be hard to design experiments in the business world
                                                • Data is truncated and censored
                                                One major difference between business data and scientific data is that the latter has many continuous values and the former has many discrete values. Even monetary amounts are discrete— two values can differ only by multiples of pennies (or some similar amount)—even though the values might be represented by real numbers.
Almost all data used in data mining has a time dependency associated with it. Customers’ reactions to marketing efforts change over time. Prospects’ reactions to competitive offers change over time. Comparing results from a marketing campaign one year to the previous year is rarely going to yield exactly the same result, nor do we expect it to.
Data mining, unlike much of classical statistics, must often consider the time component of the data.
                                                Data mining has to work within the constraints of existing business practices. This can make it difficult to set up experiments.
The data used for data mining is often incomplete, in one of two special ways. Censored values are incomplete because whatever is being measured is not yet finished; the tenure of customers who are still active is one example.
                                                Truncated data poses another problem in terms of biasing samples. Truncated data is not included in databases, often because it is too old.

                                                When looking at data, it is useful to look at histograms and cumulative histograms to see what values are most common. More important, though, is looking at values over time.
One of the big questions addressed by statistics is whether observed values are expected or not. For this, the number of standard deviations from the mean (z-score) can be used to calculate the probability of the value being due to chance (the p-value). High p-values mean that the observed results are consistent with the null hypothesis; that is, nothing interesting is happening. Low p-values suggest that other factors may be influencing the results. Converting z-scores to p-values depends on the normal distribution.
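A minimal sketch of that conversion, using scipy's normal distribution (the two-sided convention here is an editorial choice):

from scipy.stats import norm

def z_and_p(observed, mean, std_dev):
    """Standardize an observed value and convert the z-score to a two-sided p-value."""
    z = (observed - mean) / std_dev
    p_value = 2 * norm.sf(abs(z))   # tail probability under the normal distribution
    return z, p_value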
                                                Business problems often require analyzing data expressed as proportions. Fortunately, these behave similarly to normal distributions. The formula for the standard error for proportions (SEP) makes it possible to define a confidence interval on a proportion such as a response rate. The standard error for the difference of proportions (SEDP) makes it possible to determine whether two values are similar. This works by defining a confidence interval for the difference between two values.
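The SEP and SEDP formulas themselves are short enough to show directly; the response rate and mailing size below are made up for illustration:

import math

def sep(p, n):
    """Standard error of a proportion p measured on a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

def sedp(p1, n1, p2, n2):
    """Standard error of the difference of two proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Example: a 4.5% response rate on a mailing of 10,000 pieces
p, n = 0.045, 10_000
ci_95 = (p - 1.96 * sep(p, n), p + 1.96 * sep(p, n))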
                                                When designing marketing tests, the SEP and SEDP can be used for sizing test and control groups. In particular, these groups should be large enough to measure differences in response with a high enough confidence. Tests that have more than two groups need to take into account an adjustment, called Bonferroni’s correction, when setting the group sizes.
The chi-square test is another statistical method that is often useful. This method directly calculates the expected values for data laid out in rows and columns. Based on these expected values, the chi-square test can determine whether the observed results are likely or unlikely. The chi-square test and SEDP methods produce similar results.
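A minimal sketch of the chi-square test on a two-group, two-outcome table, using scipy; the counts are invented for illustration:

from scipy.stats import chi2_contingency

# Rows are test groups, columns are responders and non-responders
observed = [[ 58,  942],    # challenger offer
            [ 40,  960]]    # champion offer
chi2, p_value, dof, expected = chi2_contingency(observed)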
                                                Statisticians and data miners solve similar problems. However, because of historical differences and differences in the nature of the problems, there are some differences in approaches. Data miners generally have lots and lots of data with few measurement errors. This data changes over time, and values are sometimes incomplete. The data miner has to be particularly suspicious about bias introduced into the data by business processes.

                                                Ref: Berry, Michael J. A.; Linoff, Gordon S.. Data Mining Techniques : For Marketing, Sales, and Customer Relationship Management 
                                                ============================================ 
