16 Statistical Process Control in Python | System Reliability and Six Sigma in R and Python
Introduction 1 About this Book
1.1 What You’ll Learn 1.2 Learning Approach 1.3 Prerequisites 1.4 About the Author 1.5 Acknowledgments 1.6 How to Use This Book
I Failure Mode and Effects Analysis 2 Coding in R
Getting Started
Making a Posit Cloud account Using R for the First Time
2.1 Introduction to R 2.2 Basic Calculations in R Learning Check 1 2.3 Types of Values in R 2.4 Types of Data in R
2.4.1 Values 2.4.2 Vectors 2.4.3 Dataframes
Learning Check 2 2.5 Common Functions in R
2.5.1 Measures of Central Tendency 2.5.2 Measures of Dispersion 2.5.3 Other Good Functions
2.6 Missing Data Learning Check 3 2.7 Packages
2.7.1 Using Packages 2.7.2 Installing Packages 2.7.3 Loading Packages
The Pipeline 2.8 Visualizing Data with Histograms
2.8.1 hist()
Learning Check 4
2.8.2 geom_histogram() in ggplot2
Learning Check 5 Conclusion
Next Steps Advice Troubleshooting
3 Coding in Python
Getting Started
Making a Posit Cloud account Using Python for the First Time
3.1 Introduction to Python 3.2 Basic Calculations in Python Learning Check 1 3.3 Types of Values in Python 3.4 Types of Data in Python
3.4.1 Values and Variables 3.4.2 Lists (like R vectors) 3.4.3 DataFrames with pandas
Learning Check 2 3.5 Common Functions in Python
3.5.1 Measures of Central Tendency 3.5.2 Measures of Dispersion 3.5.3 Other Good Functions
3.6 Missing Data Learning Check 3 3.7 Packages
3.7.1 Installing packages 3.7.2 Importing packages
The Pipeline 3.8 Visualizing Data with Histograms
3.8.1 pandas/matplotlib 3.8.2 geom_histogram() in plotnine
Learning Check 4 Learning Check 5 Conclusion
Next Steps Advice Troubleshooting
4 FMEA in R
Getting Started
Load Packages
4.1 Example: Ben and Jerry’s Ice Cream
4.1.1 Scope & Resolution 4.1.2 Measuring Criticality 4.1.3 Block Diagram 4.1.4 Failure Modes
4.2 Calculating Criticality
4.2.1 Estimate Severity 4.2.2 Estimate Occurrence 4.2.3 Estimate Detection 4.2.4 Estimate Criticality (RPN)
Learning Check 1 Conclusion
5 Visualization with ggplot in R
Getting Started
Loading Packages Gapminder data
5.1 Your first scatterplot Learning Check 1 Learning Check 2 Learning Check 3 5.2 Improving our Visualizations Learning Check 4 Learning Check 5 5.3 Visualizing diamonds data Learning Check 6 5.4 Visualizing Distributions Learning Check 9 Conclusion
6 Visualization with plotnine in Python
Getting Started
Loading Packages Gapminder data
6.1 Your first scatterplot Learning Check 1 6.2 Transparency (alpha) Learning Check 2 6.3 Color: constant vs mapped Learning Check 3 6.4 Improving our visualizations 6.5 Visualizing diamonds data
6.5.1 Boxplots by cut
Learning Check 4 6.6 Visualizing Distributions
6.6.1 Try adjusting binwidth and theme
Learning Check 5 Conclusion
II Statistics & Reliability 7 Distributions and Descriptive Statistics in R
Getting Started
Load Packages
7.1 Distributions 7.2 Descriptive Statistics
7.2.1 What’s a statistic?
7.3 Our Data 7.4 Size
7.4.1 Length
7.5 Location
7.5.1 Mean 7.5.2 Median 7.5.3 Mode
7.6 Spread (1)
7.6.1 Percentile
Learning Check 1 7.7 Spread (2)
7.7.1 Standard Deviation 7.7.2 Variance 7.7.3 Coefficient of Variation (CV) 7.7.4 Standard Error (SE)
Learning Check 2 7.8 Shape
7.8.1 Skewness 7.8.2 Kurtosis
Learning Check 3 7.9 Simulating Distributions 7.10 Finding Parameters for your Distribution 7.11 Common Distributions
7.11.1 Normal Distribution 7.11.2 Poisson Distribution 7.11.3 Exponential Distribution 7.11.4 Gamma Distribution 7.11.5 Weibull Distribution
7.12 Special Distributions
7.12.1 Binomial Distribution 7.12.2 Uniform Distribution
7.13 Comparing Distributions Learning Check 4 Conclusion
8 Distributions and Descriptive Statistics in Python
Getting Started
Install and Import Packages
8.1 Our Data 8.2 Size
8.2.1 Length
8.3 Location
8.3.1 Mean and Median 8.3.2 Mode
8.4 Spread (1)
8.4.1 Percentiles
8.5 Spread (2)
8.5.1 Standard Deviation, Variance, CV, SE
8.6 Shape
8.6.1 Skewness and Kurtosis
8.7 Finding Parameters for Your Distributions 8.8 Common Distributions
8.8.1 Normal 8.8.2 Poisson 8.8.3 Exponential 8.8.4 Gamma 8.8.5 Weibull
8.9 Comparing Distributions Learning Check 1 Conclusion
9 Functions in R
Getting Started
Packages
9.1 Coding your own function! 9.2 Functions with Default Inputs Conclusion
10 Functions in Python
Getting Started
Packages
10.1 Coding your own function! 10.2 Functions with Default Inputs Conclusion
11 Probability in R
Getting Started
Load Packages
11.1 Key Functions
11.1.1 bind_rows() 11.1.2 mutate() 11.1.3 summarize()
11.2 Probability 11.3 Conditional Probability
11.3.1 Example: Twizzlers
11.4 Total Probabilities
11.4.1 Example: Marbles!
11.5 Bayes Rule
11.5.1 Example: Coffee Shop (Incomplete Information) 11.5.2 Example: Coffee Shop (Complete Information) 11.5.3 Example: Movie Theatre Popularity
Conclusion
12 Probability Functions in R
Getting Started
Load Packages
12.1 Probability Functions 12.2 Hypothetical Probability Functions
Example: Farmers Market Unobserved Distributions Using Hypothetical Probability Functions 12.2.1 Density (PDF) 12.2.2 Cumulative Probabilities (CDF) 12.2.3 Quantiles
Learning Check 1 12.3 Observed Probability Functions
12.3.1 Example: Observed Distributions 12.3.2 Observed PDFs (Probability Density Functions) 12.3.3 density() 12.3.4 Using PDFs (Probability Density Functions) 12.3.5 Observed CDFs (Cumulative Distribution Functions) 12.3.6 Using Calculus!
Learning Check 2 Conclusion
13 System Reliability
Getting Started
Packages
13.1 Concepts
13.1.1 Life Distributions 13.1.2 Example: Airplane Propeller Failure!
Learning Check 1 13.2 Joint Probabilities
13.2.1 Multiplication Rule 13.2.2 Complement Rule
13.3 Table of Failure-Related Functions 13.4 Hazard Rate Function 13.5 Cumulative Hazard Function 13.6 Average Failure Rate 13.7 Units
13.7.1 Failure Functions 13.7.2 Conversion Functions 13.7.3 Converting Estimates
Learning Check 2 13.8 System Reliability
13.8.1 Series Systems 13.8.2 Parallel Systems 13.8.3 Combined Systems 13.8.4 Example: Business Reliability
13.9 Renewal Rates via Bayes Rule
14 System Reliability in Python
Getting Started
Packages Custom Functions
14.1 Concepts
14.1.1 Life Distributions 14.1.2 Example: Airplane Propeller Failure! 14.1.3 Joint Probabilities
14.2 Hazard Rate Function 14.3 Cumulative Hazard Function 14.4 Average Failure Rate 14.5 Units and Conversions
14.5.1 Failure Functions 14.5.2 Conversion Functions 14.5.3 Converting Estimates
Learning Check 1
📱💥Exploding Phones!
14.6 Units and Conversions Learning Check 2
🍜 Does Instant Ramen ever go bad?
14.7 System Reliability
14.7.1 Series Systems 14.7.2 Parallel Systems 14.7.3 Combined Systems
Conclusion
III Statistical Process Control 15 Statistical Process Control in R
Getting Started
Packages Our Case Our Data
15.1 Visualizing Quality Control
15.1.1 theme_set() 15.1.2 Process Descriptive Statistics 15.1.3 Process Overview Visual
Learning Check 1 15.2 Average and Standard Deviation Graphs
15.2.1 Key Statistics 15.2.2 Subgroup (Within-Group) Statistics 15.2.3 Total Statistics (Between Groups) 15.2.4 Average and Standard Deviation Charts
Learning Check 2 15.3 Moving Range Charts
15.3.1 Individual and Moving Range Charts 15.3.2 Factors \(d_{2}\) and Friends 15.3.3 Estimating \(\sigma_{short}\) for Moving Range Statistics
15.4 Constants
15.4.1 Find any \(d_x\) Factor 15.4.2 Using \(d_x\) factors 15.4.3 Finding any \(b_x\) Factor.
Learning Check 3
16 Statistical Process Control in Python
Getting Started
Packages Custom Functions Our Case Our Data
16.1 Process Descriptive Statistics 16.2 Process Overview Visual 16.3 Subgroup (Within-Group) Statistics
16.3.1 Total Statistics (Between Groups) 16.3.2 Average and Standard Deviation Charts
Learning Check 1 16.4 Moving Range Charts (n=1) Conclusion
17 Indices and Confidence Intervals for Statistical Process Control in R
Getting Started
Packages Our Data
17.1 Process Capability vs. Stability
17.1.1 Definitions 17.1.2 Table of Indices
17.2 Index Functions
17.2.1 Capacity Index \(C_{p}\) 17.2.2 Process Performance Index \(P_{p}\) 17.2.3 Capacity Index \(C_{pk}\) 17.2.4 Process Performance Index \(P_{pk}\) 17.2.5 Equality
Learning Check 1 17.3 Confidence Intervals
17.3.1 Confidence Intervals show us Sampling Distributions 17.3.2 Bootstrapped or Theoretical Sampling Distributions? 17.3.3 Confidence Intervals with Theoretical Sampling Distributions 17.3.4 Visualizing Confidence Intervals 17.3.5 Bootstrapping \(C_{p}\)
17.4 CIs for any Index!
17.4.1 CIs for \(C_p\) 17.4.2 CIs for \(C_{pk}\) 17.4.3 CIs for \(P_p\) 17.4.4 CIs for \(P_{pk}\)
Conclusion
18 Indices and Confidence Intervals for Statistical Process Control in Python
Getting Started
Packages Custom Functions Our Data 18.0.1 Process Overview
18.1 Process Capability vs. Stability
18.1.1 Definitions 18.1.2 Table of Indices 18.1.3 Index Functions 18.1.4 Ingredients 18.1.5 Cp and Pp 18.1.6 Cpk and Ppk 18.1.7 Equality
18.2 Confidence Intervals (Normal Approximation)
18.2.1 Cp Confidence Interval 18.2.2 Cpk Confidence Interval 18.2.3 Pp and Ppk Confidence Intervals 18.2.4 Visualizing Confidence Intervals
18.3 Bootstrapping Cp (Example)
18.3.1 Bootstrap Sampling Distributions Visualization 18.3.2 Bootstrap Confidence Intervals
Conclusion
19 Useful Life Distributions (Exponential)
19.1 Getting Started
19.1.1 Load Packages 19.1.2 Key Concepts 19.1.3 Our Data
19.2 Quantities of Interest
19.2.1 Lack of Memory 19.2.2 Mean Time to Fail 19.2.3 Mean Time to Fail via Integration 19.2.4 Median Time to Fail \(T_{50}\) 19.2.5 Modal Time to Fail
Learning Check 1 Learning Check 2 19.3 Quantities of Interest (continued)
19.3.1 Conditional Reliability (Survival) Function 19.3.2 \(\mu(t)\): Mean Residual Life
Learning Check 3 19.4 System Failure with Independent Failure Rates
19.4.1 Reliability Functions with Multiple Inputs 19.4.2 Phase-Type Distributions
19.5 Conclusion
20 Statistical Techniques for Exponential Distributions
20.1 Getting Started
20.1.1 Load Packages 20.1.2 Our Data
20.2 Factors in R 20.3 Crosstabulation Learning Check 1 20.4 Estimating Lambda
20.4.1 Lambda from Complete Sample 20.4.2 \(\hat{\lambda}\) from Cross-Tabulation
Learning Check 2
20.4.3 Confidence Intervals for \(\hat{\lambda}\)
Learning Check 3 20.5 Planning Experiments
20.5.1 Example: Minimum Sample Size
20.6 Chi-squared
20.6.1 Compute Chi-squared from scratch 20.6.2 Building a Chi-squared function
Learning Check 4 20.7 Conclusion
21 Useful Life Distributions (Weibull, Gamma, & Lognormal)
Getting Started
Load Packages Load Data
21.1 Maximum Likelihood Estimation (MLE)
21.1.1 Likelihood Function 21.1.2 Maximizing with optim() 21.1.3 Under the Hood 21.1.4 Multi-parameter optimization 21.1.5 MLE with Censored Data
Learning Check 1 21.2 Gamma Distribution
21.2.1 \(\Gamma\) function gamma() 21.2.2 \(f(t)\) or d(t)(Probability Density Function, aka PDF) 21.2.3 \(F(t)\) Failure Function and \(R(t)\) Reliability Function 21.2.4 MTTF and Variance
Learning Check 2 21.3 Weibull Distribution
21.3.1 \(F(t)\) Failure Function and \(R(t)\) Reliability Function 21.3.2 \(f(t)\) or d(t) (PDF) 21.3.3 \(z(t)\) Failure Rate 21.3.4 \(m\) and \(c\) 21.3.5 Weibull Series Systems 21.3.6 MTTF and Variance
Learning Check 3 Learning Check 4 21.4 Lognormal Distribution
21.4.1 \(f(t)\) or d(t) (PDF) 21.4.2 \(\Phi\) 21.4.3 Key Functions 21.4.4 MTTF and Variance 21.4.5 Using Other Parameters
Learning Check 5 Learning Check 6 21.5 Conclusion
22 Fault Tree Analysis in R
Getting Started
Prerequisites Load Packages
22.1 Simulating Fault Trees 22.2 Example: Superman Fault Tree 22.3 Using Raw Probabilities Learning Check 1 22.4 Using Failure Rates at Time t Learning Check 2 22.5 Simulating Uncertainty in Probabilities 22.6 Simulating Uncertainty in Failure Rates 22.7 Example: Contagion Learning Check 3
23 Physical Acceleration Models
Getting Started
Load Packages Helpful Functions
23.1 Acceleration Factor
23.1.1 Stress vs. Usage Conditions 23.1.2 Acceleration as a Function \(\frac{f_{s}(t)}{f_{u}(t)}\) 23.1.3 Linear Acceleration
Learning Check 1 Learning Check 2 23.2 Modeling Normal Use Lifespan
23.2.1 Models and Prediction 23.2.2 Arrhenius Model (Temperature) 23.2.3 Visualizing the Arrhenius Model 23.2.4 Estimating the Arrhenius Model 23.2.5 Prediction
23.3 Eyring Model (Multiple Stressors) 23.4 Degradation Model (Time Trends) 23.5 Burn-in Periods 23.6 Maximum Likelihood Estimation (MLE) for Physical Acceleration Models
23.6.1 MLE for an Example Arrhenius Model 23.6.2 Estimating \(\Delta H\) with MLE 23.6.3 Visualizing Probabilities
23.7 Conclusion
IV Life Distributions V Physical Acceleration Models VI Regression 24 Bivariate Regression: Modeling Diamond Pricing
Getting Started
Load Data View Data Codebook
24.1 Review
24.1.1 Scatterplots 24.1.2 Correlation 24.1.3 Example 24.1.4 cor.test() and tidy()
24.2 Regression and the Line of Best Fit
24.2.1 Model Equation 24.2.2 Coding Regression Models 24.2.3 lm()
Learning Check 1 24.3 Statistical Significance Learning Check 2 24.4 Visualization Learning Check 3 24.5 Finding the Line of Best Fit
24.5.1 Predicted Values 24.5.2 Residuals 24.5.3 R-squared
Learning Check 4 24.6 F-statistic
24.6.1 Interpreting an F-statistic
24.7 summary()
24.7.1 Get all stats, all at once with summary() 24.7.2 4 Key Statistics
Learning Check 5
25 Multivariate Regression: Modeling Effects of Disaster on Social Capital
Getting Started
Load Data & Packages View Data Codebook
25.1 Multiple Regression
25.1.1 Beta coefficients 25.1.2 Controls 25.1.3 Planes
Learning Check 1 25.2 Effect Sizes Learning Check 2 25.3 Multiple Models Learning Check 3 25.4 Great Tables Learning Check 4
VII Comparing Groups and ANOVA 26 Design of Experiments in R
Getting Started
The Experiment Import Data
26.1 Difference of Means (t-tests)
26.1.1 Calculating dbar 26.1.2 Visualizing the difference of means
26.2 Definitions: Null vs. Sampling Distributions 26.3 Null Distributions
26.3.1 Permutation Test (Computational Approach) 26.3.2 t-statistics (Theoretical Approach)
26.4 Sampling Distribution
26.4.1 Bootstrapped CIs for dbar 26.4.2 Assume Normal CIs for dbar
26.5 Speedy t-tests 26.6 ANOVA
26.6.1 Visualizing ANOVA 26.6.2 ANOVA Explained 26.6.3 Functions for ANOVA 26.6.4 lm() - the easy way 26.6.5 aov() - the normal way 26.6.6 oneway.test() - if you know your group variances are unequal 26.6.7 Multiple Variables
Conclusion
VIII Factorial Design & RSM 27 Factorial Design and Interaction Effects in R
Getting Started
Factorial Design Data & Packages
27.1 Estimating Direct Effects
27.1.1 Difference of Grand Means \(\bar{\bar{d}}\) 27.1.2 Slower-but-Clearer Method of Getting the Difference of Grand Means \(\bar{\bar{d}}\) 27.1.3 Faster-but-Messier Method of Getting the Difference of Grand Means \(\bar{\bar{d}}\) 27.1.4 Estimating Many Difference of Grand Means \(\bar{\bar{d}}\)
27.2 Standard Errors for \(\bar{\bar{d}}\)
27.2.1 Estimating Standard Error when \(s^2_i\) is known 27.2.2 Using Pooled Variance to Estimate Standard Errors
27.3 Pivoting Data with pivot_longer() & pivot_wider()
27.3.1 contains() 27.3.2 pivot_longer() 27.3.3 pivot_wider()
27.4 Visualizing & Reporting Treatment Effects
27.4.1 Building a Table of Direct Effects 27.4.2 Visualization Strategies 27.4.3 An Excellent Visual
27.5 Interaction Effects
27.5.1 Estimating a Pooled Standard Error 27.5.2 Estimating 1-way interactions 27.5.3 Estimating 2-way interactions 27.5.4 Estimating 3-way interactions
27.6 Estimating Interactions in lm()
27.6.1 Modeling Interactions 27.6.2 Using predict() to visualize interaction effects 27.6.3 Finding the best model with anova()
Conclusion
28 Response Surface Methodology in R
Getting Started
Packages Our Data Import Data
28.1 Models for RSM
28.1.1 Specifying an Interaction Model with Polynomials 28.1.2 Modeling with rsm() 28.1.3 Transforming Variables
Learning Check 1 28.2 Contour Plots
28.2.1 Simple contour() plots 28.2.2 geom_tile() plots 28.2.3 Pretty geom_contour() plots! 28.2.4 One-step RSM in ggplot 28.2.5 More Realistic Plots
Learning Check 2 28.3 Iterate!
28.3.1 Modeling many Interactions 28.3.2 Contours with Multiple Variables 28.3.3 ggplot contour plots!
Learning Check 3 28.4 Quantities of Interest in RSM
28.4.1 Percent Change in Bins 28.4.2 Study Range
28.5 Extra Concepts of Interest: Canonical Form
28.5.1 Finding the Canonical Form 28.5.2 Canonical Form Coefficients 28.5.3 Stationary Points 28.5.4 Shift Points
Conclusion IX Appendices 29 Appendix: ggplot tips
Getting Started
29.0.1 Resources 29.0.2 Our Packages 29.0.3 Our Data
29.1 Flipping Axes
Learning Check 1
29.2 Legend Positioning
Learning Check 8
29.3 Breaking Up a Visual
Learning Check 10
30 Appendix: mermaid Block Diagrams (Flowcharts) in R
Getting Started
Block Diagrams Load Packages
30.1 Mermaid
30.1.1 Example Block Diagram with mermaid Learning Check 1
30.2 Diagram of Ice Cream Shipment 30.3 Diagram Failures
31 Appendix: Using fitdistr to Fit Distribution Parameters
Getting Started
Load Packages 31.0.1 Our Data
31.1 Example: Exponential Distribution 31.2 Method of Moments 31.3 MLE with fitdistr() 31.4 MLE with optim() 31.5 Applications
31.5.1 Normal Distribution 31.5.2 Poisson Distribution 31.5.3 Gamma Distribution 31.5.4 Weibull Distribution
Learning Check 1 31.6 Conclusion
32 Appendix: Statistical Tables and Common Equations
32.1 Table 1: Chi-Squared 32.2 Table 2: k-factor rules 32.3 Table 3: k-factor (upper) 32.4 Table 4: k-factor (lower) 32.5 Table 5: Control Constants (d) 32.6 Table 6: Control Constants (b) 32.7 Formula
16 Statistical Process Control in Python
Figure 16.1: Statistical Process Control!
In this workshop, we will learn how to perform statistical process control in Python, using statistical tools and plotnine visualizations! Statistical Process Control refers to using statistics to (1) measure variation in product quality over time and (2) identify benchmarks to know when intervention is needed. Let’s get started!
Getting Started
Packages

Remember to install these packages using a terminal, if you haven't already:

```python
!pip install pandas plotnine scipy
```

We'll be using pandas for data manipulation, plotnine for visualization, and scipy for statistical functions.

```python
import pandas as pd
from plotnine import *
```
Custom Functions

This workshop uses custom functions from the functions/ directory. You may need both:

- functions_distributions.py - for reliability and distribution functions
- functions_process_control.py - for statistical process control functions

To use these functions, you need to acquire them from the repository at github.com/timothyfraser/sigma/tree/main/functions. Then add the functions directory to your Python path:

```python
import sys
import os

# Add the functions directory to the Python path
sys.path.append('functions')  # or the path to wherever you placed the functions folder
```

Once you have the functions available, you can import them:

```python
from functions_distributions import density, tidy_density, approxfun
# from functions_process_control import ggprocess, ggsubgroup, ggmoving, ggcapability  # if needed
```
Our Case
Figure 16.2: Obanazawa City, Yamagata Prefecture - A Hot Springs Economy. Photo credit and more here.
For today’s workshop, we’re going to think about why quality control matters in a local economy by examining the case of the Japanese hot springs bath economy! Hot springs, or onsen, are a major source of tourism and recreation for families in Japan, bringing visitors from across the country every year to the often rural communities where the right geological conditions have produced naturally occurring hot springs. Restaurants, taxi and bus companies, and many other service sector firms rely on their local onsen to bring a steady stream (pun intended) of tourists into the local economy. So it’s often in onsen operators’ best interest to keep an eye on the temperature, minerals, and other aspects of their hot spring baths to ensure quality control, keeping up their firm’s (and town’s!) reputation for quality rest and relaxation. Onsen-goers often seek out specific types of hot springs, so it’s important for an onsen to actually provide what it advertises! Serbulea and Payyappallimana (2012) describe some of these benchmarks.
- Temperature: Onsen are divided into “Extra Hot Springs” (>42°C), “Hot Springs” (34~41°C), and “Warm Springs” (25~33°C).
- pH: Onsen are classified as “Acidic” (pH < 3), “Mildly Acidic” (pH 3~6), “Neutral” (pH 6~7.5), “Mildly Alkaline” (pH 7.5~8.5), and “Alkaline” (pH > 8.5).
- Sulfur: Sulfur onsen typically have about 2 mg of sulfur per 1 kg of hot spring water; sulfur levels must exceed 1 mg to count as a sulfur onsen. (It smells like rotten eggs!)
These are decent examples of quality control metrics that onsen operators might want to keep tabs on!
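These thresholds translate directly into code. Below is a minimal sketch (a hypothetical helper, not part of the workshop's functions) that classifies a spring by the temperature benchmarks above; how to handle the gap between 41°C and 42°C is an assumption, since the published bands leave it unspecified:

```python
def classify_onsen_temp(temp_c):
    """Classify a hot spring by its temperature in Celsius,
    using the benchmark bands described above."""
    if temp_c > 42:
        return 'Extra Hot Spring'
    elif temp_c >= 34:       # assumption: 41~42 Celsius counts as a Hot Spring
        return 'Hot Spring'
    elif temp_c >= 25:
        return 'Warm Spring'
    else:
        return 'Below Warm Spring range'

print(classify_onsen_temp(44.8))  # Extra Hot Spring
print(classify_onsen_temp(36.0))  # Hot Spring
```

A similar helper could be written for the pH and sulfur bands.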
Figure 16.3: Monkeys are even fans of onsen! Read more here!
Our Data

You’ve been hired to evaluate quality control at a local onsen in sunny Kagoshima prefecture! Every month, for 15 months, you systematically took 20 random samples of hot spring water and recorded their temperature, pH, and sulfur levels. How might you determine whether this onsen is at risk of slipping out of one sector of the market (e.g., Extra Hot!) and into another (just a normal Hot Spring?)? Let’s read in our data from workshops/onsen.csv!

```python
# Add the functions directory to the path if it is not already there
import sys
if 'functions' not in sys.path:
    sys.path.append('functions')

from functions_distributions import density, tidy_density, approxfun
```

```python
water = pd.read_csv('workshops/onsen.csv')
water.head(3)
##    id  time  temp   ph  sulfur
## 0   1     1  43.2  5.1     0.0
## 1   2     1  45.3  4.8     0.4
## 2   3     1  45.5  6.2     0.9
```
16.1 Process Descriptive Statistics

First, let’s get a sense of our process by calculating some basic descriptive statistics. We’ll create a simple function to calculate the mean and standard deviation, which are fundamental to evaluating process variation.

```python
from pandas import Series

def describe(x: Series):
    x = Series(x)
    out = pd.DataFrame({
        'mean': [x.mean()],
        'sd': [x.std()],
    })
    out['caption'] = ("Process Mean: " + out['mean'].round(2).astype(str) +
                      " | SD: " + out['sd'].round(2).astype(str))
    return out
```

Now let’s apply this to our temperature data to see the overall process mean and variation.

```python
tab = describe(water['temp'])
tab
##     mean        sd                         caption
## 0  44.85  1.989501  Process Mean: 44.85 | SD: 1.99
```
16.2 Process Overview Visual

The process overview chart is one of the most important tools in SPC. It shows us how our process behaves over time, helping us identify patterns, trends, and potential issues. We’ll create a visualization that shows individual measurements, subgroup means, and the overall process average.

```python
g1 = (ggplot(water, aes(x='time', y='temp', group='time')) +
      geom_hline(aes(yintercept=water['temp'].mean()), color='lightgrey', size=3) +
      geom_jitter(height=0, width=0.25) +
      geom_boxplot() +
      labs(x='Time (Subgroup)', y='Temperature (Celsius)',
           subtitle='Process Overview',
           caption=tab['caption'][0]))

# Save the plot
g1.save('images/05_process_overview.png', width=8, height=6, dpi=100)
```

```python
g2 = (ggplot(water, aes(x='temp')) +
      geom_histogram(bins=15, color='white', fill='grey') +
      theme_void() +
      coord_flip())

# Save the plot
g2.save('images/05_process_histogram.png', width=8, height=6, dpi=100)
```
The histogram shows us the distribution of all temperature measurements, giving us insight into the overall process variation. This helps us understand if our process is centered and how much variation we’re seeing.
16.3 Subgroup (Within-Group) Statistics

In SPC, we often work with subgroups: small samples taken at regular intervals. This allows us to distinguish between common cause variation (inherent to the process) and special cause variation (due to specific events). Let’s calculate statistics for each subgroup to see how the process behaves over time.

```python
stat_s = (water.groupby('time')
          .apply(lambda d: pd.Series({
              'xbar': d['temp'].mean(),
              'r': d['temp'].max() - d['temp'].min(),
              'sd': d['temp'].std(),
              'nw': len(d)
          }))
          .reset_index())
stat_s['df'] = stat_s['nw'] - 1
stat_s['sigma_s'] = ((stat_s['df'] * (stat_s['sd']**2)).sum() / stat_s['df'].sum())**0.5
stat_s['se'] = stat_s['sigma_s'] / (stat_s['nw']**0.5)
stat_s['upper'] = stat_s['xbar'].mean() + 3*stat_s['se']
stat_s['lower'] = stat_s['xbar'].mean() - 3*stat_s['se']
stat_s.head(3)
##    time    xbar    r        sd    nw    df   sigma_s        se      upper      lower
## 0     1  44.635  4.2  1.342533  20.0  19.0  1.986174  0.444122  46.182366  43.517634
## 1     3  45.305  7.9  2.001440  20.0  19.0  1.986174  0.444122  46.182366  43.517634
## 2     5  44.765  5.9  1.628133  20.0  19.0  1.986174  0.444122  46.182366  43.517634
```

Here we’ve calculated key statistics for each subgroup:

- xbar: the mean of each subgroup
- r: the range (max - min) within each subgroup
- sd: the standard deviation within each subgroup
- sigma_s: the pooled within-subgroup standard deviation
- se: the standard error for each subgroup mean
16.3.1 Total Statistics (Between Groups)

Now let’s calculate the overall process statistics that summarize behavior across all subgroups:

```python
stat_t = pd.DataFrame({
    'xbbar': [stat_s['xbar'].mean()],
    'rbar': [stat_s['r'].mean()],
    'sdbar': [stat_s['sd'].mean()],
    'sigma_s': [(stat_s['sd']**2).mean()**0.5],
    'sigma_t': [water['temp'].std()]
})
stat_t
##    xbbar    rbar    sdbar   sigma_s   sigma_t
## 0  44.85  7.2625  1.93619  1.986174  1.989501
```

These statistics give us:

- xbbar: the grand mean (average of all subgroup means)
- rbar: the average range across subgroups
- sdbar: the average standard deviation across subgroups
- sigma_s: the pooled within-subgroup standard deviation
- sigma_t: the total process standard deviation
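One practical use of these two estimates: for a stable process, sigma_s (within-subgroup) and sigma_t (total) should be close, because subgroup means barely move; when the process drifts between subgroups, sigma_t grows while sigma_s does not. A quick sketch with simulated data (illustrative values, not the onsen data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 15 subgroups of 20 observations each, drawn from one stable process
stable = pd.DataFrame({
    'time': np.repeat(range(15), 20),
    'temp': rng.normal(45, 2, 300)
})

# Same data, but with a trend added between subgroups (process drift)
drifting = stable.copy()
drifting['temp'] = drifting['temp'] + drifting['time'] * 0.5

def sigma_ratio(df):
    """Ratio of total SD to pooled within-subgroup SD (equal subgroup sizes)."""
    sds = df.groupby('time')['temp'].std()
    sigma_s = (sds**2).mean()**0.5   # pooled within-subgroup SD
    sigma_t = df['temp'].std()       # total SD
    return sigma_t / sigma_s

print(round(sigma_ratio(stable), 2))    # near 1: stable process
print(round(sigma_ratio(drifting), 2))  # well above 1: between-group drift
```

A ratio well above 1 is a hint that the subgroup means are moving around, which the control charts below make visible.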
16.3.2 Average and Standard Deviation Charts

Control charts are the heart of SPC. They help us monitor process stability over time and detect when the process is out of control. We’ll create charts for both the subgroup means (X-bar chart) and standard deviations (S chart).

```python
labels = pd.DataFrame({
    'time': [stat_s['time'].max()]*3,
    'type': ['xbbar', 'upper', 'lower'],
    'name': ['mean', '+3 s', '-3 s'],
    'value': [stat_s['xbar'].mean(), stat_s['upper'].iloc[0], stat_s['lower'].iloc[0]]
})

control_chart = (ggplot(stat_s, aes(x='time', y='xbar')) +
                 geom_hline(aes(yintercept=stat_s['xbar'].mean()), color='lightgrey', size=3) +
                 geom_ribbon(aes(ymin='lower', ymax='upper'), fill='steelblue', alpha=0.2) +
                 geom_line(size=1) +
                 geom_point(size=5) +
                 geom_label(data=labels, mapping=aes(x='time', y='value', label='name'), ha='right') +
                 labs(x='Time (Subgroups)', y='Average',
                      subtitle='Average and Standard Deviation Chart'))

# Save the plot
control_chart.save('images/05_control_chart.png', width=8, height=6, dpi=100)
```
This control chart shows:

- Center line: the grand mean (xbbar)
- Control limits: upper and lower 3-sigma limits based on the standard error
- Individual points: each subgroup mean plotted over time
- Shaded area: the control limits region
Points outside the control limits or showing non-random patterns indicate the process may be out of control and requires investigation.
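That decision rule is easy to automate. Here is a small sketch (a hypothetical helper, not one of the workshop's custom functions) that flags subgroups whose mean falls outside the limits computed above:

```python
import pandas as pd

def flag_out_of_control(stat_s):
    """Return the subgroups whose mean falls outside the control limits.
    Expects columns 'time', 'xbar', 'lower', 'upper'."""
    mask = (stat_s['xbar'] > stat_s['upper']) | (stat_s['xbar'] < stat_s['lower'])
    return stat_s.loc[mask, ['time', 'xbar', 'lower', 'upper']]

# Tiny illustrative example (made-up numbers, not the onsen data)
demo = pd.DataFrame({
    'time': [1, 2, 3],
    'xbar': [44.6, 47.0, 44.8],
    'lower': [43.5]*3,
    'upper': [46.2]*3,
})
print(flag_out_of_control(demo))  # only subgroup 2 is flagged
```

Non-random patterns inside the limits (runs, trends) still require a human look; this only catches limit violations.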
Learning Check 1

Question: Produce the same process overview chart for pH.

[View Answer!]

```python
def ggprocess(x, y, xlab='Subgroup', ylab='Metric'):
    import pandas as pd
    from plotnine import ggplot, aes, geom_hline, geom_jitter, geom_boxplot, labs
    d = pd.DataFrame({'x': x, 'y': y})
    g = (ggplot(d, aes(x='x', y='y', group='x')) +
         geom_hline(aes(yintercept=d['y'].mean()), color='lightgrey', size=3) +
         geom_jitter(height=0, width=0.25) +
         geom_boxplot() +
         labs(x=xlab, y=ylab, subtitle='Process Overview'))
    return g

ph_chart = ggprocess(water['time'], water['ph'])

# Save the plot
ph_chart.save('images/05_ph_chart.png', width=8, height=6, dpi=100)
```
16.4 Moving Range Charts (n=1)

When we have individual measurements rather than subgroups, we use moving range charts. The moving range is the absolute difference between consecutive measurements, which helps us estimate process variation when we can’t calculate within-subgroup statistics.

```python
import numpy as np

# Take one individual measurement per subgroup
indiv = water.iloc[[0, 20, 40, 60, 80, 100, 120, 140]]
mr = indiv['temp'].diff().abs().dropna()
mrbar = mr.mean()

# Estimate the d2 constant by simulation
d2 = np.mean(np.abs(np.diff(np.random.normal(0, 1, 10000))))
sigma_s = mrbar / d2
se = sigma_s / (1**0.5)
upper = mrbar + 3*se
lower = 0

istat = pd.DataFrame({'time': indiv['time'].iloc[1:], 'mr': mr,
                      'mrbar': mrbar, 'upper': upper, 'lower': lower})

mr_chart = (ggplot(istat, aes(x='time', y='mr')) +
            geom_ribbon(aes(ymin='lower', ymax='upper'), fill='steelblue', alpha=0.25) +
            geom_hline(aes(yintercept=mr.mean()), size=3, color='darkgrey') +
            geom_line(size=1) +
            geom_point(size=5) +
            labs(x='Time (Subgroup)', y='Moving Range', subtitle='Moving Range Chart'))

# Save the plot
mr_chart.save('images/05_moving_range_chart.png', width=8, height=6, dpi=100)
```
The moving range chart shows:

- Center line: the average moving range (mrbar)
- Upper control limit: based on the estimated process standard deviation
- Lower control limit: set to 0 (moving ranges can’t be negative)
- Individual points: each moving range value
This chart helps us monitor process variation when we have individual measurements rather than subgroups.
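Incidentally, the simulated d2 used above approximates a known constant: for moving ranges (pairs of consecutive values, i.e. subgroup size n = 2), d2 equals E|X1 - X2| for independent standard normals, which works out to 2/√π ≈ 1.128. A quick check that the simulation agrees:

```python
import numpy as np

# Theoretical d2 for subgroup size n = 2:
# E|X1 - X2| for independent standard normals is 2 / sqrt(pi)
d2_theory = 2 / np.sqrt(np.pi)
print(round(d2_theory, 4))  # 1.1284

# Simulated version, as in the moving range code above
rng = np.random.default_rng(0)
d2_sim = np.mean(np.abs(np.diff(rng.normal(0, 1, 100000))))
print(round(d2_sim, 3))  # close to 1.128
```

Using the closed-form constant avoids simulation noise in the control limits; published d2 tables (like the control constants in the appendix) list this same value for n = 2.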
Conclusion

You’ve successfully produced SPC visuals and statistics in Python: process overviews, subgroup statistics, and moving range logic. These tools help us understand process behavior, identify when processes are in or out of control, and make data-driven decisions about process improvement.