Table of contents

Preface

Our Conceptual Approach to Data Science

To the Instructor

Other Skills and Concepts

Sections and Notation

Using Examples

Safari® Books Online

How to Contact Us

Acknowledgments

1. Introduction: Data-Analytic Thinking

The Ubiquity of Data Opportunities

Example: Hurricane Frances

Example: Predicting Customer Churn

Data Science, Engineering, and Data-Driven Decision Making

Data Processing and “Big Data”

From Big Data 1.0 to Big Data 2.0

Data and Data Science Capability as a Strategic Asset

Data-Analytic Thinking

This Book

Data Mining and Data Science, Revisited

Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist

Summary

2. Business Problems and Data Science Solutions

Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.

From Business Problems to Data Mining Tasks

Supervised Versus Unsupervised Methods

Data Mining and Its Results

The Data Mining Process

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

Implications for Managing the Data Science Team

Other Analytics Techniques and Technologies

Statistics

Database Querying

Data Warehousing

Regression Analysis

Machine Learning and Data Mining

Answering Business Questions with These Techniques

Summary

3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.

Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.

Models, Induction, and Prediction

Supervised Segmentation

Selecting Informative Attributes

Example: Attribute Selection with Information Gain

Supervised Segmentation with Tree-Structured Models

Visualizing Segmentations

Trees as Sets of Rules

Probability Estimation

Example: Addressing the Churn Problem with Tree Induction

Summary

4. Fitting a Model to Data

Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.

Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.

Classification via Mathematical Functions

Linear Discriminant Functions

Optimizing an Objective Function

An Example of Mining a Linear Discriminant from Data

Linear Discriminant Functions for Scoring and Ranking Instances

Support Vector Machines, Briefly

Regression via Mathematical Functions

Class Probability Estimation and Logistic “Regression”

* Logistic Regression: Some Technical Details

Example: Logistic Regression Versus Tree Induction

Nonlinear Functions, Support Vector Machines, and Neural Networks

Summary

5. Overfitting and Its Avoidance

Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.

Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization.

Generalization

Overfitting

Overfitting Examined

Holdout Data and Fitting Graphs

Overfitting in Tree Induction

Overfitting in Mathematical Functions

Example: Overfitting Linear Functions

* Example: Why Is Overfitting Bad?

From Holdout Evaluation to Cross-Validation

The Churn Dataset Revisited

Learning Curves

Overfitting Avoidance and Complexity Control

Avoiding Overfitting with Tree Induction

A General Method for Avoiding Overfitting

* Avoiding Overfitting for Parameter Optimization

Summary

6. Similarity, Neighbors, and Clusters

Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.

Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.

Similarity and Distance

Nearest-Neighbor Reasoning

Example: Whiskey Analytics

Nearest Neighbors for Predictive Modeling

Classification

Probability Estimation

Regression

How Many Neighbors and How Much Influence?

Geometric Interpretation, Overfitting, and Complexity Control

Issues with Nearest-Neighbor Methods

Intelligibility

Dimensionality and domain knowledge

Computational efficiency

Some Important Technical Details Relating to Similarities and Neighbors

Heterogeneous Attributes

* Other Distance Functions

* Combining Functions: Calculating Scores from Neighbors

Clustering

Example: Whiskey Analytics Revisited

Hierarchical Clustering

Nearest Neighbors Revisited: Clustering Around Centroids

Example: Clustering Business News Stories

Data preparation

The news story clusters

Understanding the Results of Clustering

* Using Supervised Learning to Generate Cluster Descriptions

Stepping Back: Solving a Business Problem Versus Data Exploration

Summary

7. Decision Analytic Thinking I: What Is a Good Model?

Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines.

Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison.

Evaluating Classifiers

Plain Accuracy and Its Problems

The Confusion Matrix

Problems with Unbalanced Classes

Problems with Unequal Costs and Benefits

Generalizing Beyond Classification

A Key Analytical Framework: Expected Value

Using Expected Value to Frame Classifier Use

Using Expected Value to Frame Classifier Evaluation

Error rates

Costs and benefits

Evaluation, Baseline Performance, and Implications for Investments in Data

Summary

8. Visualizing Model Performance

Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results.

Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves.

Ranking Instead of Classifying

Profit Curves

ROC Graphs and Curves

The Area Under the ROC Curve (AUC)

Cumulative Response and Lift Curves

Example: Performance Analytics for Churn Modeling

Summary

9. Evidence and Probabilities

Fundamental concepts: Explicit evidence combination with Bayes’ Rule; Probabilistic reasoning via assumptions of conditional independence.

Exemplary techniques: Naive Bayes classification; Evidence lift.

Example: Targeting Online Consumers with Advertisements

Combining Evidence Probabilistically

Joint Probability and Independence

Bayes’ Rule

Applying Bayes’ Rule to Data Science

Conditional Independence and Naive Bayes

Advantages and Disadvantages of Naive Bayes

A Model of Evidence “Lift”

Example: Evidence Lifts from Facebook “Likes”

Evidence in Action: Targeting Consumers with Ads

Summary

10. Representing and Mining Text

Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.

Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models.

Why Text Is Important

Why Text Is Difficult

Representation

Bag of Words

Term Frequency

Measuring Sparseness: Inverse Document Frequency

Combining Them: TFIDF

Example: Jazz Musicians

* The Relationship of IDF to Entropy

Beyond Bag of Words

N-gram Sequences

Named Entity Extraction

Topic Models

Example: Mining News Stories to Predict Stock Price Movement

The Task

The Data

Data Preprocessing

Results

Summary

11. Decision Analytic Thinking II: Toward Analytical Engineering

Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available.

Exemplary technique: Expected value as a framework for data science solution design.

Targeting the Best Prospects for a Charity Mailing

The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces

A Brief Digression on Selection Bias

Our Churn Example Revisited with Even More Sophistication

The Expected Value Framework: Structuring a More Complicated Business Problem

Assessing the Influence of the Incentive

From an Expected Value Decomposition to a Data Science Solution

Summary

12. Other Data Science Tasks and Techniques

Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science.

Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.

Co-occurrences and Associations: Finding Items That Go Together

Measuring Surprise: Lift and Leverage

Example: Beer and Lottery Tickets

Associations Among Facebook Likes

Profiling: Finding Typical Behavior

Link Prediction and Social Recommendation

Data Reduction, Latent Information, and Movie Recommendation

Bias, Variance, and Ensemble Methods

Data-Driven Causal Explanation and a Viral Marketing Example

Summary

13. Data Science and Business Strategy

Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability.

Thinking Data-Analytically, Redux

Achieving Competitive Advantage with Data Science

Sustaining Competitive Advantage with Data Science

Formidable Historical Advantage

Unique Intellectual Property

Unique Intangible Collateral Assets

Superior Data Scientists

Superior Data Science Management

Attracting and Nurturing Data Scientists and Their Teams

Examine Data Science Case Studies

Be Ready to Accept Creative Ideas from Any Source

Be Ready to Evaluate Proposals for Data Science Projects

Example Data Mining Proposal

Flaws in the Big Red Proposal

A Firm's Data Science Maturity

14. Conclusion

Fundamental concept: (Recursive) Understanding the fundamental concepts of data science facilitates understanding data science quite broadly and interacting competently about business analytics.

The Fundamental Concepts of Data Science

Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data

Changing the Way We Think about Solutions to Business Problems

What Data Can't Do: Humans in the Loop, Revisited

Privacy, Ethics, and Mining Data About Individuals

Is There More to Data Science?

Final Example: From Crowd-Sourcing to Cloud-Sourcing

Final Words

Glossary

Bibliography

Appendix A: Proposal Review Guide

Business and Data Understanding

Data Preparation

Modeling

Evaluation and Deployment