Table of contents

Most chapters introduce a set of fundamental concepts and exemplary techniques. Starred sections are optional and introduce advanced technical material.


Preface

    Our Conceptual Approach to Data Science

    To the Instructor

    Other Skills and Concepts

    Sections and Notation

    Using Examples

    Safari® Books Online

    How to Contact Us


1. Introduction: Data-Analytic Thinking

    The Ubiquity of Data Opportunities

    Example: Hurricane Frances

    Example: Predicting Customer Churn

    Data Science, Engineering, and Data-Driven Decision Making

    Data Processing and “Big Data”

    From Big Data 1.0 to Big Data 2.0

    Data and Data Science Capability as a Strategic Asset

    Data-Analytic Thinking

    This Book

    Data Mining and Data Science, Revisited

    Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist


2. Business Problems and Data Science Solutions

Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.

    From Business Problems to Data Mining Tasks

    Supervised Versus Unsupervised Methods

    Data Mining and Its Results

    The Data Mining Process

        Business Understanding

        Data Understanding

        Data Preparation

        Modeling

        Evaluation

        Deployment

    Implications for Managing the Data Science Team

    Other Analytics Techniques and Technologies

        Statistics

        Database Querying

        Data Warehousing

        Regression Analysis

        Machine Learning and Data Mining

        Answering Business Questions with These Techniques


3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.

Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.

    Models, Induction, and Prediction

    Supervised Segmentation

        Selecting Informative Attributes

        Example: Attribute Selection with Information Gain

        Supervised Segmentation with Tree-Structured Models

    Visualizing Segmentations

    Trees as Sets of Rules

    Probability Estimation

    Example: Addressing the Churn Problem with Tree Induction


4. Fitting a Model to Data

Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.

Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.

    Classification via Mathematical Functions

        Linear Discriminant Functions

        Optimizing an Objective Function

        An Example of Mining a Linear Discriminant from Data

        Linear Discriminant Functions for Scoring and Ranking Instances

        Support Vector Machines, Briefly

    Regression via Mathematical Functions

    Class Probability Estimation and Logistic “Regression”

        * Logistic Regression: Some Technical Details

    Example: Logistic Regression Versus Tree Induction

    Nonlinear Functions, Support Vector Machines, and Neural Networks


5. Overfitting and Its Avoidance

Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.

Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization.

    Generalization

    Overfitting

    Overfitting Examined

        Holdout Data and Fitting Graphs

        Overfitting in Tree Induction

        Overfitting in Mathematical Functions

    Example: Overfitting Linear Functions

    * Example: Why Is Overfitting Bad?

    From Holdout Evaluation to Cross-Validation

    The Churn Dataset Revisited

    Learning Curves

    Overfitting Avoidance and Complexity Control

        Avoiding Overfitting with Tree Induction

        A General Method for Avoiding Overfitting

        * Avoiding Overfitting for Parameter Optimization


6. Similarity, Neighbors, and Clusters

Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.

Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.

    Similarity and Distance

    Nearest-Neighbor Reasoning

        Example: Whiskey Analytics

        Nearest Neighbors for Predictive Modeling

            Classification

            Probability Estimation

            Regression

        How Many Neighbors and How Much Influence?

        Geometric Interpretation, Overfitting, and Complexity Control

        Issues with Nearest-Neighbor Methods

            Intelligibility

            Dimensionality and domain knowledge

            Computational efficiency

    Some Important Technical Details Relating to Similarities and Neighbors

        Heterogeneous Attributes

        * Other Distance Functions

        * Combining Functions: Calculating Scores from Neighbors

    Clustering

        Example: Whiskey Analytics Revisited

        Hierarchical Clustering

        Nearest Neighbors Revisited: Clustering Around Centroids

        Example: Clustering Business News Stories

            Data preparation

            The news story clusters

        Understanding the Results of Clustering

        * Using Supervised Learning to Generate Cluster Descriptions

    Stepping Back: Solving a Business Problem Versus Data Exploration


7. Decision Analytic Thinking I: What Is a Good Model?

Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines.

Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison.

    Evaluating Classifiers

        Plain Accuracy and Its Problems

        The Confusion Matrix

        Problems with Unbalanced Classes

        Problems with Unequal Costs and Benefits

    Generalizing Beyond Classification

    A Key Analytical Framework: Expected Value

        Using Expected Value to Frame Classifier Use

        Using Expected Value to Frame Classifier Evaluation

            Error rates

            Costs and benefits

    Evaluation, Baseline Performance, and Implications for Investments in Data


8. Visualizing Model Performance

Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results.

Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves.

    Ranking Instead of Classifying

    Profit Curves

    ROC Graphs and Curves

    The Area Under the ROC Curve (AUC)

    Cumulative Response and Lift Curves

    Example: Performance Analytics for Churn Modeling


9. Evidence and Probabilities

Fundamental concepts: Explicit evidence combination with Bayes’ Rule; Probabilistic reasoning via assumptions of conditional independence.

Exemplary techniques: Naive Bayes classification; Evidence lift.

    Example: Targeting Online Consumers with Advertisements

    Combining Evidence Probabilistically

        Joint Probability and Independence

        Bayes’ Rule

    Applying Bayes’ Rule to Data Science

        Conditional Independence and Naive Bayes

        Advantages and Disadvantages of Naive Bayes

    A Model of Evidence “Lift”

    Example: Evidence Lifts from Facebook “Likes”

        Evidence in Action: Targeting Consumers with Ads


10. Representing and Mining Text

Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.

Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models.

    Why Text Is Important

    Why Text Is Difficult

    Representation

        Bag of Words

        Term Frequency

        Measuring Sparseness: Inverse Document Frequency

        Combining Them: TFIDF

    Example: Jazz Musicians

    * The Relationship of IDF to Entropy

    Beyond Bag of Words

        N-gram Sequences

        Named Entity Extraction

        Topic Models

    Example: Mining News Stories to Predict Stock Price Movement

        The Task

        The Data

        Data Preprocessing

        Results


11. Decision Analytic Thinking II: Toward Analytical Engineering

Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available.

Exemplary technique: Expected value as a framework for data science solution design.

    Targeting the Best Prospects for a Charity Mailing

        The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces

        A Brief Digression on Selection Bias

    Our Churn Example Revisited with Even More Sophistication

        The Expected Value Framework: Structuring a More Complicated Business Problem

        Assessing the Influence of the Incentive

        From an Expected Value Decomposition to a Data Science Solution


12. Other Data Science Tasks and Techniques

Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science.

Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.

    Co-occurrences and Associations: Finding Items That Go Together

        Measuring Surprise: Lift and Leverage

        Example: Beer and Lottery Tickets

        Associations Among Facebook Likes

    Profiling: Finding Typical Behavior

    Link Prediction and Social Recommendation

    Data Reduction, Latent Information, and Movie Recommendation

    Bias, Variance, and Ensemble Methods

    Data-Driven Causal Explanation and a Viral Marketing Example


13. Data Science and Business Strategy

Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability.

    Thinking Data-Analytically, Redux

    Achieving Competitive Advantage with Data Science

    Sustaining Competitive Advantage with Data Science

        Formidable Historical Advantage

        Unique Intellectual Property

        Unique Intangible Collateral Assets

        Superior Data Scientists

        Superior Data Science Management

    Attracting and Nurturing Data Scientists and Their Teams

    Examine Data Science Case Studies

    Be Ready to Accept Creative Ideas from Any Source

    Be Ready to Evaluate Proposals for Data Science Projects

        Example Data Mining Proposal

        Flaws in the Big Red Proposal

    A Firm's Data Science Maturity

14. Conclusion

Fundamental concept: (Recursive) Understanding the fundamental concepts of data science facilitates understanding data science quite broadly and interacting competently about business analytics.

    The Fundamental Concepts of Data Science

        Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data

        Changing the Way We Think about Solutions to Business Problems

    What Data Can't Do: Humans in the Loop, Revisited

    Privacy, Ethics, and Mining Data About Individuals

    Is There More to Data Science?

    Final Example: From Crowd-Sourcing to Cloud-Sourcing

    Final Words



Appendix A: Proposal Review Guide

    Business and Data Understanding

    Data Preparation

    Modeling

    Evaluation and Deployment