# Table of Contents

Most chapters introduce a set of fundamental concepts and exemplary techniques. Starred sections are optional and introduce advanced technical material.

- Our Conceptual Approach to Data Science
- To the Instructor
- Other Skills and Concepts
- Sections and Notation
- Some Notes on Grammar and Usage
- Using Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments

- The Ubiquity of Data Opportunities
- Example: Hurricane Frances
- Example: Predicting Customer Churn
- Data Science, Engineering, and Data-Driven Decision Making
- Data Processing and “Big Data”
- From Big Data 1.0 to Big Data 2.0
- Data and Data Science Capability as a Strategic Asset
- Data-Analytic Thinking
- This Book
- Data Mining and Data Science, Revisited
- Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
- Summary

*Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.*

- From Business Problems to Data Mining Tasks
- Supervised Versus Unsupervised Methods
- Data Mining and Its Results
- The Data Mining Process
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
- Implications for Managing the Data Science Team
- Other Analytics Techniques and Technologies
- Statistics
- Database Querying
- Data Warehousing
- Regression Analysis
- Machine Learning and Data Mining
- Answering Business Questions with These Techniques
- Summary

*Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.*

*Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.*

- Models, Induction, and Prediction
- Supervised Segmentation
- Selecting Informative Attributes
- Example: Attribute Selection with Information Gain
- Supervised Segmentation with Tree-Structured Models
- Visualizing Segmentations
- Trees as Sets of Rules
- Probability Estimation
- Example: Addressing the Churn Problem with Tree Induction
- Summary

*Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.*

*Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.*

- Classification via Mathematical Functions
- Linear Discriminant Functions
- Optimizing an Objective Function
- An Example of Mining a Linear Discriminant from Data
- Linear Discriminant Functions for Scoring and Ranking Instances
- Support Vector Machines, Briefly
- Regression via Mathematical Functions
- Class Probability Estimation and Logistic “Regression”
- * Logistic Regression: Some Technical Details
- Example: Logistic Regression versus Tree Induction
- Nonlinear Functions, Support Vector Machines, and Neural Networks
- Summary

*Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.*

*Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization.*

- Generalization
- Overfitting
- Overfitting Examined
- Holdout Data and Fitting Graphs
- Overfitting in Tree Induction
- Overfitting in Mathematical Functions
- Example: Overfitting Linear Functions
- * Example: Why Is Overfitting Bad?
- From Holdout Evaluation to Cross-Validation
- The Churn Dataset Revisited
- Learning Curves
- Overfitting Avoidance and Complexity Control
- Avoiding Overfitting with Tree Induction
- A General Method for Avoiding Overfitting
- * Avoiding Overfitting for Parameter Optimization
- Summary

- Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.
- Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.
- Similarity and Distance
- Nearest-Neighbor Reasoning
- Example: Whiskey Analytics
- Nearest Neighbors for Predictive Modeling
- Classification
- Probability Estimation
- Regression
- How Many Neighbors and How Much Influence?
- Geometric Interpretation, Overfitting, and Complexity Control
- Issues with Nearest-Neighbor Methods
- Intelligibility
- Dimensionality and domain knowledge
- Computational efficiency
- Some Important Technical Details Relating to Similarities and Neighbors
- Heterogeneous Attributes
- * Other Distance Functions
- * Combining Functions: Calculating Scores from Neighbors
- Clustering
- Example: Whiskey Analytics Revisited
- Hierarchical Clustering
- Nearest Neighbors Revisited: Clustering Around Centroids
- Example: Clustering Business News Stories
- Data preparation
- The news story clusters
- Understanding the Results of Clustering
- * Using Supervised Learning to Generate Cluster Descriptions
- Stepping Back: Solving a Business Problem Versus Data Exploration
- Summary

- Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines.
- Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison.
- Evaluating Classifiers
- Plain Accuracy and Its Problems
- The Confusion Matrix
- Problems with Unbalanced Classes
- Problems with Unequal Costs and Benefits
- Generalizing Beyond Classification
- A Key Analytical Framework: Expected Value
- Using Expected Value to Frame Classifier Use
- Using Expected Value to Frame Classifier Evaluation
- Error rates
- Costs and benefits
- Evaluation, Baseline Performance, and Implications for Investments in Data
- Summary

- Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results.
- Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves.
- Ranking Instead of Classifying
- Profit Curves
- ROC Graphs and Curves
- The Area Under the ROC Curve (AUC)
- Cumulative Response and Lift Curves
- Example: Performance Analytics for Churn Modeling
- Summary

- Fundamental concepts: Explicit evidence combination with Bayes’ Rule; Probabilistic reasoning via assumptions of conditional independence.
- Exemplary techniques: Naive Bayes classification; Evidence lift.
- Example: Targeting Online Consumers With Advertisements
- Combining Evidence Probabilistically
- Joint Probability and Independence
- Bayes’ Rule
- Applying Bayes’ Rule to Data Science
- Conditional Independence and Naive Bayes
- Advantages and Disadvantages of Naive Bayes
- A Model of Evidence “Lift”
- Example: Evidence Lifts from Facebook “Likes”
- Evidence in Action: Targeting Consumers with Ads
- Summary

- Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.
- Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models.
- Why Text Is Important
- Why Text Is Difficult
- Representation
- Bag of Words
- Term Frequency
- Measuring Sparseness: Inverse Document Frequency
- Combining Them: TFIDF
- Example: Jazz Musicians
- * The Relationship of IDF to Entropy
- Summary
- Beyond Bag of Words
- N-gram Sequences
- Named Entity Extraction
- Topic Models
- Example: Mining News Stories to Predict Stock Price Movement
- The Task
- The Data
- Data Preprocessing
- Results
- Summary

- Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available.
- Exemplary technique: Expected value as a framework for data science solution design.
- Targeting the Best Prospects for a Charity Mailing
- The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces
- A Brief Digression on Selection Bias
- Our Churn Example Revisited with Even More Sophistication
- The Expected Value Framework: Structuring a More Complicated Business Problem
- Assessing the Influence of the Incentive
- From an Expected Value Decomposition to a Data Science Solution
- Summary

- Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science.
- Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.
- Co-occurrences and Associations: Finding Items That Go Together
- Measuring Surprise: Lift and Leverage
- Example: Beer and Lottery Tickets
- Associations Among Facebook Likes
- Profiling: Finding Typical Behavior
- Link Prediction and Social Recommendation
- Data Reduction, Latent Information, and Movie Recommendation
- Bias, Variance, and Ensemble Methods
- Data-Driven Causal Explanation and a Viral Marketing Example
- Summary

- Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability.
- Thinking Data-Analytically, Redux
- Achieving Competitive Advantage with Data Science
- Sustaining Competitive Advantage with Data Science
- Formidable Historical Advantage
- Unique Intellectual Property
- Unique Intangible Collateral Assets
- Superior Data Scientists
- Superior Data Science Management
- Attracting and Nurturing Data Scientists and Their Teams
- Examine Data Science Case Studies
- Be Ready to Accept Creative Ideas from Any Source
- Be Ready to Evaluate Proposals for Data Science Projects
- Example Data Mining Proposal
- Flaws in the Big Red Proposal
- A Firm’s Data Science Maturity

- The Fundamental Concepts of Data Science
- Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
- Changing the Way We Think about Solutions to Business Problems
- What Data Can’t Do: Humans in the Loop, Revisited
- Privacy, Ethics, and Mining Data About Individuals
- Is There More to Data Science?
- Final Example: From Crowd-Sourcing to Cloud-Sourcing
- Final Words

- Business and Data Understanding
- Data Preparation
- Modeling
- Evaluation and Deployment

- Business and Data Understanding
- Data Preparation
- Modeling
- Evaluation and Deployment