Python Data Science Handbook

Recorded: Dec. 3, 2025, 3:04 a.m.

Jake VanderPlas

This website contains the full text of the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub in the form of Jupyter notebooks.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.
If you find this content useful, please consider supporting the work by buying the book!

Table of Contents

Preface

1. IPython: Beyond Normal Python
Help and Documentation in IPython
Keyboard Shortcuts in the IPython Shell
IPython Magic Commands
Input and Output History
IPython and Shell Commands
Errors and Debugging
Profiling and Timing Code
More IPython Resources

2. Introduction to NumPy
Understanding Data Types in Python
The Basics of NumPy Arrays
Computation on NumPy Arrays: Universal Functions
Aggregations: Min, Max, and Everything In Between
Computation on Arrays: Broadcasting
Comparisons, Masks, and Boolean Logic
Fancy Indexing
Sorting Arrays
Structured Data: NumPy's Structured Arrays

3. Data Manipulation with Pandas
Introducing Pandas Objects
Data Indexing and Selection
Operating on Data in Pandas
Handling Missing Data
Hierarchical Indexing
Combining Datasets: Concat and Append
Combining Datasets: Merge and Join
Aggregation and Grouping
Pivot Tables
Vectorized String Operations
Working with Time Series
High-Performance Pandas: eval() and query()
Further Resources

4. Visualization with Matplotlib
Simple Line Plots
Simple Scatter Plots
Visualizing Errors
Density and Contour Plots
Histograms, Binnings, and Density
Customizing Plot Legends
Customizing Colorbars
Multiple Subplots
Text and Annotation
Customizing Ticks
Customizing Matplotlib: Configurations and Stylesheets
Three-Dimensional Plotting in Matplotlib
Geographic Data with Basemap
Visualization with Seaborn
Further Resources

5. Machine Learning
What Is Machine Learning?
Introducing Scikit-Learn
Hyperparameters and Model Validation
Feature Engineering
In Depth: Naive Bayes Classification
In Depth: Linear Regression
In-Depth: Support Vector Machines
In-Depth: Decision Trees and Random Forests
In Depth: Principal Component Analysis
In-Depth: Manifold Learning
In Depth: k-Means Clustering
In Depth: Gaussian Mixture Models
In-Depth: Kernel Density Estimation
Application: A Face Detection Pipeline
Further Machine Learning Resources

Appendix: Figure Code

The Python Data Science Handbook, by Jake VanderPlas, introduces the core tools of the Python data science ecosystem. It follows a structured path that begins with IPython, a more powerful and interactive environment than the standard Python shell. VanderPlas covers IPython's help and documentation features, keyboard shortcuts, magic commands, input and output history, shell command integration, and tools for debugging errors. The chapter closes with profiling and timing code as the basis for optimizing performance.
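
IPython's `%timeit` magic works only inside an IPython session. As a rough sketch of the same idea in plain Python, the standard-library `timeit` module can compare two implementations of one task. The function names below are hypothetical and chosen only for illustration; they do not come from the handbook.

```python
import timeit


def sum_of_squares_loop(n):
    """Sum the squares of 0..n-1 with an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_builtin(n):
    """Same result using a generator expression and the built-in sum()."""
    return sum(i * i for i in range(n))


# Both implementations must agree before a timing comparison means anything.
assert sum_of_squares_loop(1000) == sum_of_squares_builtin(1000)

# Each trial calls the function `number` times; taking the minimum over
# `repeat` trials reduces the influence of background noise.
loop_time = min(timeit.repeat(lambda: sum_of_squares_loop(1000), number=200, repeat=3))
builtin_time = min(timeit.repeat(lambda: sum_of_squares_builtin(1000), number=200, repeat=3))

print(f"loop:    {loop_time:.4f} s for 200 calls")
print(f"builtin: {builtin_time:.4f} s for 200 calls")
```

The absolute numbers depend on the machine, so only the relative difference between the two timings is worth reading.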

Subsequent chapters cover the core libraries for data manipulation and analysis. The NumPy chapter covers data types, the creation and manipulation of arrays, universal functions (ufuncs) for fast element-wise computation, aggregations such as min and max, and broadcasting, which lets arrays of different shapes be combined in a single operation. It then covers comparisons, Boolean masks, fancy indexing, and sorting. The chapter ends with NumPy's structured arrays, which store heterogeneous records with named fields.
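
The NumPy features listed above can be sketched in a few lines. This is a minimal illustration that assumes NumPy is installed; the sample values and the field names `name` and `age` are made up for the example.

```python
import numpy as np

# Universal function (ufunc): element-wise computation without a Python loop.
x = np.arange(1, 6)              # array([1, 2, 3, 4, 5])
squares = np.square(x)           # array([ 1,  4,  9, 16, 25])

# Aggregations reduce an array to summary values.
total = squares.sum()
smallest = squares.min()
largest = squares.max()

# Broadcasting: a (3, 1) column plus a (3,) row produces a (3, 3) grid.
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
grid = col + row

# Boolean mask: keep only the elements that satisfy a condition.
evens = x[x % 2 == 0]

# Sorting: argsort returns the indices that would sort the array.
data = np.array([3, 1, 2])
order = np.argsort(data)

# Structured array: records with named, typed fields.
people = np.array(
    [("Alice", 25), ("Bob", 45)],
    dtype=[("name", "U10"), ("age", "i4")],
)
older = people[people["age"] > 30]["name"]

print(total, smallest, largest)
print(grid)
print(evens, order, older)
```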

Pandas, the cornerstone of data analysis in Python, receives significant attention. The handbook introduces Pandas objects and their structure, then covers data indexing and selection and operations on data. It gives particular attention to handling missing data, a common problem in real-world datasets. Hierarchical indexing is explained as a way to select data across multiple index levels. The book covers the methods for combining datasets (concat, append, merge, and join), followed by aggregation and grouping and pivot tables. Vectorized string operations and time series are also covered, and the chapter closes with the high-performance `eval()` and `query()` functions.
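
A small sketch of several of these Pandas operations follows, assuming pandas and NumPy are installed. The `sales` and `managers` tables and their column names are invented for illustration. Note that `DataFrame.append` was removed in pandas 2.0, so the sketch uses `pd.concat` for stacking rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sales table with one missing revenue value.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, np.nan, 80.0, 120.0],
})

# Handling missing data: replace NaN in the revenue column with 0.
filled = sales.fillna({"revenue": 0.0})

# Combining datasets: stack rows with concat, join columns with merge.
stacked = pd.concat([sales, sales], ignore_index=True)
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ana", "Ben"]})
merged = filled.merge(managers, on="region", how="left")

# Aggregation and grouping: total revenue per region.
by_region = merged.groupby("region")["revenue"].sum()

# Pivot table: regions as rows, quarters as columns.
pivot = merged.pivot_table(
    values="revenue", index="region", columns="quarter", aggfunc="sum"
)

# query(): filter rows with a string expression.
big = merged.query("revenue > 90")

print(by_region)
print(pivot)
print(big)
```

Filling missing revenue with 0 is only one possible policy; dropping the rows with `dropna()` or interpolating may suit other datasets better.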

The handbook then turns to visualization with Matplotlib. It starts with simple line and scatter plots, then covers visualizing errors, density and contour plots, and histograms, binnings, and density. Customization topics include plot legends, colorbars, ticks, and Matplotlib configurations and stylesheets. The chapter also covers placing multiple subplots in one figure and adding text and annotation. It ends with three-dimensional plotting, geographic data with Basemap, and visualization with Seaborn, a higher-level library built on Matplotlib.
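
As a rough sketch of a few of these plotting topics (a line plot with a legend, error bars, an annotation, and two subplots in one figure), assuming Matplotlib and NumPy are installed. The data are synthetic, and the non-interactive `Agg` backend is an assumption made so the example runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)

# Multiple subplots: one row, two columns.
fig, (ax_line, ax_err) = plt.subplots(1, 2, figsize=(8, 3))

# Simple line plot with a legend.
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_title("Line plot")
ax_line.legend()

# Points with a constant error bar of 0.2 on each y value.
ax_err.errorbar(x, np.cos(x), yerr=0.2, fmt="o", markersize=3)
ax_err.set_title("Error bars")

# Text and annotation: label the first point with an arrow.
ax_err.annotate(
    "start", xy=(0, 1), xytext=(2, 1.3), arrowprops={"arrowstyle": "->"}
)

fig.tight_layout()

# Save to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

In a notebook or interactive session, the backend line can be dropped and `plt.show()` used in place of saving to a buffer.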

The final chapter introduces machine learning, primarily through the Scikit-Learn library. VanderPlas first defines machine learning and introduces Scikit-Learn, then covers hyperparameters and model validation and feature engineering. In-depth sections follow on Naive Bayes classification, linear regression, support vector machines, decision trees and random forests, principal component analysis, manifold learning, k-means clustering, Gaussian mixture models, and kernel density estimation. The chapter ends with an application example, a face detection pipeline. The handbook serves as an introductory text for readers seeking a foundational understanding of the core practices and tools in the Python data science ecosystem.
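
The fit, predict, and validate workflow that Scikit-Learn estimators share can be sketched as below. This assumes scikit-learn is installed; the choice of the bundled Iris dataset, Gaussian Naive Bayes, a 25% holdout, and 5 folds are illustrative assumptions, not the handbook's own worked example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Load a small bundled dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Holdout validation: keep 25% of the data unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the model on the training split, then score it on the held-out split.
model = GaussianNB()
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Cross-validation: 5 folds, each used once as the validation set.
cv_scores = cross_val_score(GaussianNB(), X, y, cv=5)

print(f"holdout accuracy: {test_accuracy:.3f}")
print(f"cross-validation mean accuracy: {cv_scores.mean():.3f}")
```

Swapping `GaussianNB` for another estimator leaves the rest of the code unchanged, since estimators expose the same `fit` and `score` methods.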