Table of Contents

Most chapters introduce a set of fundamental concepts and exemplary techniques. Starred sections are optional and introduce advanced technical material.

Preface

  • Our Conceptual Approach to Data Science
  • To the Instructor
  • Other Skills and Concepts
  • Sections and Notation
  • Some Notes on Grammar and Usage
  • Using Examples
  • Safari® Books Online
  • How to Contact Us
  • Acknowledgments

1. Introduction: Data-Analytic Thinking

  • The Ubiquity of Data Opportunities
  • Example: Hurricane Frances
  • Example: Predicting Customer Churn
  • Data Science, Engineering, and Data-Driven Decision Making
  • Data Processing and “Big Data”
  • From Big Data 1.0 to Big Data 2.0
  • Data and Data Science Capability as a Strategic Asset
  • Data-Analytic Thinking
  • This Book
  • Data Mining and Data Science, Revisited
  • Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
  • Summary

2. Business Problems and Data Science Solutions

  • Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.
  • From Business Problems to Data Mining Tasks
  • Supervised Versus Unsupervised Methods
  • Data Mining and Its Results
  • The Data Mining Process
  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment
  • Implications for Managing the Data Science Team
  • Other Analytics Techniques and Technologies
  • Statistics
  • Database Querying
  • Data Warehousing
  • Regression Analysis
  • Machine Learning and Data Mining
  • Answering Business Questions with These Techniques
  • Summary

3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

  • Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.
  • Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.
  • Models, Induction, and Prediction
  • Supervised Segmentation
  • Selecting Informative Attributes
  • Example: Attribute Selection with Information Gain
  • Supervised Segmentation with Tree-Structured Models
  • Visualizing Segmentations
  • Trees as Sets of Rules
  • Probability Estimation
  • Example: Addressing the Churn Problem with Tree Induction
  • Summary

4. Fitting a Model to Data

  • Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.
  • Exemplary techniques: Linear regression; Logistic regression; Support vector machines.
  • Classification via Mathematical Functions
  • Linear Discriminant Functions
  • Optimizing an Objective Function
  • An Example of Mining a Linear Discriminant from Data
  • Linear Discriminant Functions for Scoring and Ranking Instances
  • Support Vector Machines, Briefly
  • Regression via Mathematical Functions
  • Class Probability Estimation and Logistic “Regression”
  • * Logistic Regression: Some Technical Details
  • Example: Logistic Regression Versus Tree Induction
  • Nonlinear Functions, Support Vector Machines, and Neural Networks
  • Summary

5. Overfitting and Complexity Control

  • Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.
  • Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization.
  • Generalization
  • Overfitting
  • Overfitting Examined
  • Holdout Data and Fitting Graphs
  • Overfitting in Tree Induction
  • Overfitting in Mathematical Functions
  • Example: Overfitting Linear Functions
  • * Example: Why Is Overfitting Bad?
  • From Holdout Evaluation to Cross-Validation
  • The Churn Dataset Revisited
  • Learning Curves
  • Overfitting Avoidance and Complexity Control
  • Avoiding Overfitting with Tree Induction
  • A General Method for Avoiding Overfitting
  • * Avoiding Overfitting for Parameter Optimization
  • Summary

6. Similarity, Neighbors, and Clusters

  • Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.
  • Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.
  • Similarity and Distance
  • Nearest-Neighbor Reasoning
  • Example: Whiskey Analytics
  • Nearest Neighbors for Predictive Modeling
  • Classification
  • Probability Estimation
  • Regression
  • How Many Neighbors and How Much Influence?
  • Geometric Interpretation, Overfitting, and Complexity Control
  • Issues with Nearest-Neighbor Methods
  • Intelligibility
  • Dimensionality and domain knowledge
  • Computational efficiency
  • Some Important Technical Details Relating to Similarities and Neighbors
  • Heterogeneous Attributes
  • * Other Distance Functions
  • * Combining Functions: Calculating Scores from Neighbors
  • Clustering
  • Example: Whiskey Analytics Revisited
  • Hierarchical Clustering
  • Nearest Neighbors Revisited: Clustering Around Centroids
  • Example: Clustering Business News Stories
  • Data preparation
  • The news story clusters
  • Understanding the Results of Clustering
  • * Using Supervised Learning to Generate Cluster Descriptions
  • Stepping Back: Solving a Business Problem Versus Data Exploration
  • Summary

7. Decision Analytic Thinking I: What Is a Good Model?

  • Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines.
  • Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison.
  • Evaluating Classifiers
  • Plain Accuracy and Its Problems
  • The Confusion Matrix
  • Problems with Unbalanced Classes
  • Problems with Unequal Costs and Benefits
  • Generalizing Beyond Classification
  • A Key Analytical Framework: Expected Value
  • Using Expected Value to Frame Classifier Use
  • Using Expected Value to Frame Classifier Evaluation
  • Error rates
  • Costs and benefits
  • Evaluation, Baseline Performance, and Implications for Investments in Data
  • Summary

8. Visualizing Model Performance

  • Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results.
  • Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves.
  • Ranking Instead of Classifying
  • Profit Curves
  • ROC Graphs and Curves
  • The Area Under the ROC Curve (AUC)
  • Cumulative Response and Lift Curves
  • Example: Performance Analytics for Churn Modeling
  • Summary

9. Evidence and Probabilities

  • Fundamental concepts: Explicit evidence combination with Bayes’ Rule; Probabilistic reasoning via assumptions of conditional independence.
  • Exemplary techniques: Naive Bayes classification; Evidence lift.
  • Example: Targeting Online Consumers with Advertisements
  • Combining Evidence Probabilistically
  • Joint Probability and Independence
  • Bayes’ Rule
  • Applying Bayes’ Rule to Data Science
  • Conditional Independence and Naive Bayes
  • Advantages and Disadvantages of Naive Bayes
  • A Model of Evidence “Lift”
  • Example: Evidence Lifts from Facebook “Likes”
  • Evidence in Action: Targeting Consumers with Ads
  • Summary

10. Representing and Mining Text

  • Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.
  • Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models.
  • Why Text Is Important
  • Why Text Is Difficult
  • Representation
  • Bag of Words
  • Term Frequency
  • Measuring Sparseness: Inverse Document Frequency
  • Combining Them: TFIDF
  • Example: Jazz Musicians
  • * The Relationship of IDF to Entropy
  • Summary
  • Beyond Bag of Words
  • N-gram Sequences
  • Named Entity Extraction
  • Topic Models
  • Example: Mining News Stories to Predict Stock Price Movement
  • The Task
  • The Data
  • Data Preprocessing
  • Results
  • Summary

11. Decision Analytic Thinking II: Toward Analytical Engineering

  • Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available.
  • Exemplary technique: Expected value as a framework for data science solution design.
  • Targeting the Best Prospects for a Charity Mailing
  • The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces
  • A Brief Digression on Selection Bias
  • Our Churn Example Revisited with Even More Sophistication
  • The Expected Value Framework: Structuring a More Complicated Business Problem
  • Assessing the Influence of the Incentive
  • From an Expected Value Decomposition to a Data Science Solution
  • Summary

12. Other Data Science Tasks and Techniques

  • Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science.
  • Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.
  • Co-occurrences and Associations: Finding Items That Go Together
  • Measuring Surprise: Lift and Leverage
  • Example: Beer and Lottery Tickets
  • Associations Among Facebook Likes
  • Profiling: Finding Typical Behavior
  • Link Prediction and Social Recommendation
  • Data Reduction, Latent Information, and Movie Recommendation
  • Bias, Variance, and Ensemble Methods
  • Data-Driven Causal Explanation and a Viral Marketing Example
  • Summary

13. Data Science and Business Strategy

  • Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability.
  • Thinking Data-Analytically, Redux
  • Achieving Competitive Advantage with Data Science
  • Sustaining Competitive Advantage with Data Science
  • Formidable Historical Advantage
  • Unique Intellectual Property
  • Unique Intangible Collateral Assets
  • Superior Data Scientists
  • Superior Data Science Management
  • Attracting and Nurturing Data Scientists and Their Teams
  • Examine Data Science Case Studies
  • Be Ready to Accept Creative Ideas from Any Source
  • Be Ready to Evaluate Proposals for Data Science Projects
  • Example Data Mining Proposal
  • Flaws in the Big Red Proposal
  • A Firm’s Data Science Maturity

14. Conclusion

  • The Fundamental Concepts of Data Science
  • Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
  • Changing the Way We Think about Solutions to Business Problems
  • What Data Can’t Do: Humans in the Loop, Revisited
  • Privacy, Ethics, and Mining Data About Individuals
  • Is There More to Data Science?
  • Final Example: From Crowd-Sourcing to Cloud-Sourcing
  • Final Words

Appendix A. Proposal Review Guide

  • Business and Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation and Deployment

Appendix B. Another Sample Proposal

  • Business and Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation and Deployment

Glossary

Bibliography

Index

About the Authors