16 Statistical Process Control in Python | System Reliability and Six Sigma in R and Python
Introduction 1 About this Book
1.1 What You’ll Learn 1.2 Learning Approach 1.3 Prerequisites 1.4 About the Author 1.5 Acknowledgments 1.6 How to Use This Book
I Failure Mode and Effects Analysis 2 Coding in R
Getting Started
Making a Posit Cloud account Using R for the First Time
2.1 Introduction to R 2.2 Basic Calculations in R Learning Check 1 2.3 Types of Values in R 2.4 Types of Data in R
2.4.1 Values 2.4.2 Vectors 2.4.3 Dataframes
Learning Check 2 2.5 Common Functions in R
2.5.1 Measures of Central Tendency 2.5.2 Measures of Dispersion 2.5.3 Other Good Functions
2.6 Missing Data Learning Check 3 2.7 Packages
2.7.1 Using Packages 2.7.2 Installing Packages 2.7.3 Loading Packages
The Pipeline 2.8 Visualizing Data with Histograms
2.8.1 hist()
Learning Check 4
2.8.2 geom_histogram() in ggplot2
Learning Check 5 Conclusion
Next Steps Advice Troubleshooting
3 Coding in Python
Getting Started
Making a Posit Cloud account Using Python for the First Time
3.1 Introduction to Python 3.2 Basic Calculations in Python Learning Check 1 3.3 Types of Values in Python 3.4 Types of Data in Python
3.4.1 Values and Variables 3.4.2 Lists (like R vectors) 3.4.3 DataFrames with pandas
Learning Check 2 3.5 Common Functions in Python
3.5.1 Measures of Central Tendency 3.5.2 Measures of Dispersion 3.5.3 Other Good Functions
3.6 Missing Data Learning Check 3 3.7 Packages
3.7.1 Installing packages 3.7.2 Importing packages
The Pipeline 3.8 Visualizing Data with Histograms
3.8.1 pandas/matplotlib 3.8.2 geom_histogram() in plotnine
Learning Check 4 Learning Check 5 Conclusion
Next Steps Advice Troubleshooting
4 FMEA in R
Getting Started
Load Packages
4.1 Example: Ben and Jerry’s Ice Cream
4.1.1 Scope & Resolution 4.1.2 Measuring Criticality 4.1.3 Block Diagram 4.1.4 Failure Modes
4.2 Calculating Criticality
4.2.1 Estimate Severity 4.2.2 Estimate Occurrence 4.2.3 Estimate Detection 4.2.4 Estimate Criticality (RPN)
Learning Check 1 Conclusion
5 Visualization with ggplot in R
Getting Started
Loading Packages Gapminder data
5.1 Your first scatterplot Learning Check 1 Learning Check 2 Learning Check 3 5.2 Improving our Visualizations Learning Check 4 Learning Check 5 5.3 Visualizing diamonds data Learning Check 6 5.4 Visualizing Distributions Learning Check 9 Conclusion
6 Visualization with plotnine in Python
Getting Started
Loading Packages Gapminder data
6.1 Your first scatterplot Learning Check 1 6.2 Transparency (alpha) Learning Check 2 6.3 Color: constant vs mapped Learning Check 3 6.4 Improving our visualizations 6.5 Visualizing diamonds data
6.5.1 Boxplots by cut
Learning Check 4 6.6 Visualizing Distributions
6.6.1 Try adjusting binwidth and theme
Learning Check 5 Conclusion
II Statistics & Reliability 7 Distributions and Descriptive Statistics in R
Getting Started
Load Packages
7.1 Distributions 7.2 Descriptive Statistics
7.2.1 What’s a statistic?
7.3 Our Data 7.4 Size
7.4.1 Length
7.5 Location
7.5.1 Mean 7.5.2 Median 7.5.3 Mode
7.6 Spread (1)
7.6.1 Percentile
Learning Check 1 7.7 Spread (2)
7.7.1 Standard Deviation 7.7.2 Variance 7.7.3 Coefficient of Variation (CV) 7.7.4 Standard Error (SE)
Learning Check 2 7.8 Shape
7.8.1 Skewness 7.8.2 Kurtosis
Learning Check 3 7.9 Simulating Distributions 7.10 Finding Parameters for your Distribution 7.11 Common Distributions
7.11.1 Normal Distribution 7.11.2 Poisson Distribution 7.11.3 Exponential Distribution 7.11.4 Gamma Distribution 7.11.5 Weibull Distribution
7.12 Special Distributions
7.12.1 Binomial Distribution 7.12.2 Uniform Distribution
7.13 Comparing Distributions Learning Check 4 Conclusion
8 Distributions and Descriptive Statistics in Python
Getting Started
Install and Import Packages
8.1 Our Data 8.2 Size
8.2.1 Length
8.3 Location
8.3.1 Mean and Median 8.3.2 Mode
8.4 Spread (1)
8.4.1 Percentiles
8.5 Spread (2)
8.5.1 Standard Deviation, Variance, CV, SE
8.6 Shape
8.6.1 Skewness and Kurtosis
8.7 Finding Parameters for Your Distributions 8.8 Common Distributions
8.8.1 Normal 8.8.2 Poisson 8.8.3 Exponential 8.8.4 Gamma 8.8.5 Weibull
8.9 Comparing Distributions Learning Check 1 Conclusion
9 Functions in R
Getting Started
Packages
9.1 Coding your own function! 9.2 Functions with Default Inputs Conclusion
10 Functions in Python
Getting Started
Packages
10.1 Coding your own function! 10.2 Functions with Default Inputs Conclusion
11 Probability in R
Getting Started
Load Packages
11.1 Key Functions
11.1.1 bind_rows() 11.1.2 mutate() 11.1.3 summarize()
11.2 Probability 11.3 Conditional Probability
11.3.1 Example: Twizzlers
11.4 Total Probabilities
11.4.1 Example: Marbles!
11.5 Bayes Rule
11.5.1 Example: Coffee Shop (Incomplete Information) 11.5.2 Example: Coffee Shop (Complete Information) 11.5.3 Example: Movie Theatre Popularity
Conclusion
12 Probability Functions in R
Getting Started
Load Packages
12.1 Probability Functions 12.2 Hypothetical Probability Functions
Example: Farmers Market Unobserved Distributions Using Hypothetical Probability Functions 12.2.1 Density (PDF) 12.2.2 Cumulative Probabilities (CDF) 12.2.3 Quantiles
Learning Check 1 12.3 Observed Probability Functions
12.3.1 Example: Observed Distributions 12.3.2 Observed PDFs (Probability Density Functions) 12.3.3 density() 12.3.4 Using PDFs (Probability Density Functions) 12.3.5 Observed CDFs (Cumulative Distribution Functions) 12.3.6 Using Calculus!
Learning Check 2 Conclusion
13 System Reliability
Getting Started
Packages
13.1 Concepts
13.1.1 Life Distributions 13.1.2 Example: Airplane Propeller Failure!
Learning Check 1 13.2 Joint Probabilities
13.2.1 Multiplication Rule 13.2.2 Complement Rule
13.3 Table of Failure-Related Functions 13.4 Hazard Rate Function 13.5 Cumulative Hazard Function 13.6 Average Failure Rate 13.7 Units
13.7.1 Failure Functions 13.7.2 Conversion Functions 13.7.3 Converting Estimates
Learning Check 2 13.8 System Reliability
13.8.1 Series Systems 13.8.2 Parallel Systems 13.8.3 Combined Systems 13.8.4 Example: Business Reliability
13.9 Renewal Rates via Bayes Rule
14 System Reliability in Python
Getting Started
Packages Custom Functions
14.1 Concepts
14.1.1 Life Distributions 14.1.2 Example: Airplane Propeller Failure! 14.1.3 Joint Probabilities
14.2 Hazard Rate Function 14.3 Cumulative Hazard Function 14.4 Average Failure Rate 14.5 Units and Conversions
14.5.1 Failure Functions 14.5.2 Conversion Functions 14.5.3 Converting Estimates
Learning Check 1
📱💥Exploding Phones!
14.6 Units and Conversions Learning Check 2
🍜 Does Instant Ramen ever go bad?
14.7 System Reliability
14.7.1 Series Systems 14.7.2 Parallel Systems 14.7.3 Combined Systems
Conclusion
III Statistical Process Control 15 Statistical Process Control in R
Getting Started
Packages Our Case Our Data
15.1 Visualizing Quality Control
15.1.1 theme_set() 15.1.2 Process Descriptive Statistics 15.1.3 Process Overview Visual
Learning Check 1 15.2 Average and Standard Deviation Graphs
15.2.1 Key Statistics 15.2.2 Subgroup (Within-Group) Statistics 15.2.3 Total Statistics (Between Groups) 15.2.4 Average and Standard Deviation Charts
Learning Check 2 15.3 Moving Range Charts
15.3.1 Individual and Moving Range Charts 15.3.2 Factors \(d_{2}\) and Friends 15.3.3 Estimating \(\sigma_{short}\) for Moving Range Statistics
15.4 Constants
15.4.1 Find any \(d_x\) Factor 15.4.2 Using \(d_x\) factors 15.4.3 Finding any \(b_x\) Factor.
Learning Check 3
16 Statistical Process Control in Python
Getting Started
Packages Custom Functions Our Case Our Data
16.1 Process Descriptive Statistics 16.2 Process Overview Visual 16.3 Subgroup (Within-Group) Statistics
16.3.1 Total Statistics (Between Groups) 16.3.2 Average and Standard Deviation Charts
Learning Check 1 16.4 Moving Range Charts (n=1) Conclusion
17 Indices and Confidence Intervals for Statistical Process Control in R
Getting Started
Packages Our Data
17.1 Process Capability vs. Stability
17.1.1 Definitions 17.1.2 Table of Indices
17.2 Index Functions
17.2.1 Capacity Index \(C_{p}\) 17.2.2 Process Performance Index \(P_{p}\) 17.2.3 Capacity Index \(C_{pk}\) 17.2.4 Process Performance Index \(P_{pk}\) 17.2.5 Equality
Learning Check 1 17.3 Confidence Intervals
17.3.1 Confidence Intervals show us Sampling Distributions 17.3.2 Bootstrapped or Theoretical Sampling Distributions? 17.3.3 Confidence Intervals with Theoretical Sampling Distributions 17.3.4 Visualizing Confidence Intervals 17.3.5 Bootstrapping \(C_{p}\)
17.4 CIs for any Index!
17.4.1 CIs for \(C_p\) 17.4.2 CIs for \(C_{pk}\) 17.4.3 CIs for \(P_p\) 17.4.4 CIs for \(P_{pk}\)
Conclusion
18 Indices and Confidence Intervals for Statistical Process Control in Python
Getting Started
Packages Custom Functions Our Data 18.0.1 Process Overview
18.1 Process Capability vs. Stability
18.1.1 Definitions 18.1.2 Table of Indices 18.1.3 Index Functions 18.1.4 Ingredients 18.1.5 Cp and Pp 18.1.6 Cpk and Ppk 18.1.7 Equality
18.2 Confidence Intervals (Normal Approximation)
18.2.1 Cp Confidence Interval 18.2.2 Cpk Confidence Interval 18.2.3 Pp and Ppk Confidence Intervals 18.2.4 Visualizing Confidence Intervals
18.3 Bootstrapping Cp (Example)
18.3.1 Bootstrap Sampling Distributions Visualization 18.3.2 Bootstrap Confidence Intervals
Conclusion
19 Useful Life Distributions (Exponential)
19.1 Getting Started
19.1.1 Load Packages 19.1.2 Key Concepts 19.1.3 Our Data
19.2 Quantities of Interest
19.2.1 Lack of Memory 19.2.2 Mean Time to Fail 19.2.3 Mean Time to Fail via Integration 19.2.4 Median Time to Fail \(T_{50}\) 19.2.5 Modal Time to Fail
Learning Check 1 Learning Check 2 19.3 Quantities of Interest (continued)
19.3.1 Conditional Reliability (Survival) Function 19.3.2 \(\mu(t)\): Mean Residual Life
Learning Check 3 19.4 System Failure with Independent Failure Rates
19.4.1 Reliability Functions with Multiple Inputs 19.4.2 Phase-Type Distributions
19.5 Conclusion
20 Statistical Techniques for Exponential Distributions
20.1 Getting Started
20.1.1 Load Packages 20.1.2 Our Data
20.2 Factors in R 20.3 Crosstabulation Learning Check 1 20.4 Estimating Lambda
20.4.1 Lambda from Complete Sample 20.4.2 \(\hat{\lambda}\) from Cross-Tabulation
Learning Check 2
20.4.3 Confidence Intervals for \(\hat{\lambda}\)
Learning Check 3 20.5 Planning Experiments
20.5.1 Example: Minimum Sample Size
20.6 Chi-squared
20.6.1 Compute Chi-squared from scratch 20.6.2 Building a Chi-squared function
Learning Check 4 20.7 Conclusion
21 Useful Life Distributions (Weibull, Gamma, & Lognormal)
Getting Started
Load Packages Load Data
21.1 Maximum Likelihood Estimation (MLE)
21.1.1 Likelihood Function 21.1.2 Maximizing with optim() 21.1.3 Under the Hood 21.1.4 Multi-parameter optimization 21.1.5 MLE with Censored Data
Learning Check 1 21.2 Gamma Distribution
21.2.1 \(\Gamma\) function gamma() 21.2.2 \(f(t)\) or d(t)(Probability Density Function, aka PDF) 21.2.3 \(F(t)\) Failure Function and \(R(t)\) Reliability Function 21.2.4 MTTF and Variance
Learning Check 2 21.3 Weibull Distribution
21.3.1 \(F(t)\) Failure Function and \(R(t)\) Reliability Function 21.3.2 \(f(t)\) or d(t) (PDF) 21.3.3 \(z(t)\) Failure Rate 21.3.4 \(m\) and \(c\) 21.3.5 Weibull Series Systems 21.3.6 MTTF and Variance
Learning Check 3 Learning Check 4 21.4 Lognormal Distribution
21.4.1 \(f(t)\) or d(t) (PDF) 21.4.2 \(\Phi\) 21.4.3 Key Functions 21.4.4 MTTF and Variance 21.4.5 Using Other Parameters
Learning Check 5 Learning Check 6 21.5 Conclusion
22 Fault Tree Analysis in R
Getting Started
Prerequisites Load Packages
22.1 Simulating Fault Trees 22.2 Example: Superman Fault Tree 22.3 Using Raw Probabilities Learning Check 1 22.4 Using Failure Rates at Time t Learning Check 2 22.5 Simulating Uncertainty in Probabilities 22.6 Simulating Uncertainty in Failure Rates 22.7 Example: Contagion Learning Check 3
23 Physical Acceleration Models
Getting Started
Load Packages Helpful Functions
23.1 Acceleration Factor
23.1.1 Stress vs. Usage Conditions 23.1.2 Acceleration as a Function \(\frac{f_{s}(t)}{f_{u}(t)}\) 23.1.3 Linear Acceleration
Learning Check 1 Learning Check 2 23.2 Modeling Normal Use Lifespan
23.2.1 Models and Prediction 23.2.2 Arrhenius Model (Temperature) 23.2.3 Visualizing the Arrhenius Model 23.2.4 Estimating the Arrhenius Model 23.2.5 Prediction
23.3 Eyring Model (Multiple Stressors) 23.4 Degradation Model (Time Trends) 23.5 Burn-in Periods 23.6 Maximum Likelihood Estimation (MLE) for Physical Acceleration Models
23.6.1 MLE for an Example Arrhenius Model 23.6.2 Estimating \(\Delta H\) with MLE 23.6.3 Visualizing Probabilities
23.7 Conclusion
IV Life Distributions V Physical Acceleration Models VI Regression 24 Bivariate Regression: Modeling Diamond Pricing
Getting Started
Load Data View Data Codebook
24.1 Review
24.1.1 Scatterplots 24.1.2 Correlation 24.1.3 Example 24.1.4 cor.test() and tidy()
24.2 Regression and the Line of Best Fit
24.2.1 Model Equation 24.2.2 Coding Regression Models 24.2.3 lm()
Learning Check 1 24.3 Statistical Significance Learning Check 2 24.4 Visualization Learning Check 3 24.5 Finding the Line of Best Fit
24.5.1 Predicted Values 24.5.2 Residuals 24.5.3 R-squared
Learning Check 4 24.6 F-statistic
24.6.1 Interpreting an F-statistic
24.7 summary()
24.7.1 Get all stats, all at once with summary() 24.7.2 4 Key Statistics
Learning Check 5
25 Multivariate Regression: Modeling Effects of Disaster on Social Capital
Getting Started
Load Data & Packages View Data Codebook
25.1 Multiple Regression
25.1.1 Beta coefficients 25.1.2 Controls 25.1.3 Planes
Learning Check 1 25.2 Effect Sizes Learning Check 2 25.3 Multiple Models Learning Check 3 25.4 Great Tables Learning Check 4
VII Comparing Groups and ANOVA 26 Design of Experiments in R
Getting Started
The Experiment Import Data
26.1 Difference of Means (t-tests)
26.1.1 Calculating dbar 26.1.2 Visualizing the difference of means
26.2 Definitions: Null vs. Sampling Distributions 26.3 Null Distributions
26.3.1 Permutation Test (Computational Approach) 26.3.2 t-statistics (Theoretical Approach)
26.4 Sampling Distribution
26.4.1 Bootstrapped CIs for dbar 26.4.2 Assume Normal CIs for dbar
26.5 Speedy t-tests 26.6 ANOVA
26.6.1 Visualizing ANOVA 26.6.2 ANOVA Explained 26.6.3 Functions for ANOVA 26.6.4 lm() - the easy way 26.6.5 aov() - the normal way 26.6.6 oneway.test() - if you know your group variances are unequal 26.6.7 Multiple Variables
Conclusion
VIII Factorial Design & RSM 27 Factorial Design and Interaction Effects in R
Getting Started
Factorial Design Data & Packages
27.1 Estimating Direct Effects
27.1.1 Difference of Grand Means \(\bar{\bar{d}}\) 27.1.2 Slower-but-Clearer Method of Getting the Difference of Grand Means \(\bar{\bar{d}}\) 27.1.3 Faster-but-Messier Method of Getting the Difference of Grand Means \(\bar{\bar{d}}\) 27.1.4 Estimating Many Difference of Grand Means \(\bar{\bar{d}}\)
27.2 Standard Errors for \(\bar{\bar{d}}\)
27.2.1 Estimating Standard Error when \(s^2_i\) is known 27.2.2 Using Pooled Variance to Estimate Standard Errors
27.3 Pivoting Data with pivot_longer() & pivot_wider()
27.3.1 contains() 27.3.2 pivot_longer() 27.3.3 pivot_wider()
27.4 Visualizing & Reporting Treatment Effects
27.4.1 Building a Table of Direct Effects 27.4.2 Visualization Strategies 27.4.3 An Excellent Visual
27.5 Interaction Effects
27.5.1 Estimating a Pooled Standard Error 27.5.2 Estimating 1-way interactions 27.5.3 Estimating 2-way interactions 27.5.4 Estimating 3-way interactions
27.6 Estimating Interactions in lm()
27.6.1 Modeling Interactions 27.6.2 Using predict() to visualize interaction effects 27.6.3 Finding the best model with anova()
Conclusion
28 Response Surface Methodology in R
Getting Started
Packages Our Data Import Data
28.1 Models for RSM
28.1.1 Specifying an Interaction Model with Polynomials 28.1.2 Modeling with rsm() 28.1.3 Transforming Variables
Learning Check 1 28.2 Contour Plots
28.2.1 Simple contour() plots 28.2.2 geom_tile() plots 28.2.3 Pretty geom_contour() plots! 28.2.4 One-step RSM in ggplot 28.2.5 More Realistic Plots
Learning Check 2 28.3 Iterate!
28.3.1 Modeling many Interactions 28.3.2 Contours with Multiple Variables 28.3.3 ggplot contour plots!
Learning Check 3 28.4 Quantities of Interest in RSM
28.4.1 Percent Change in Bins 28.4.2 Study Range
28.5 Extra Concepts of Interest: Canonical Form
28.5.1 Finding the Canonical Form 28.5.2 Canonical Form Coefficients 28.5.3 Stationary Points 28.5.4 Shift Points
Conclusion IX Appendices 29 Appendix: ggplot tips
Getting Started
29.0.1 Resources 29.0.2 Our Packages 29.0.3 Our Data
29.1 Flipping Axes
Learning Check 1
29.2 Legend Positioning
Learning Check 8
29.3 Breaking Up a Visual
Learning Check 10
30 Appendix: mermaid Block Diagrams (Flowcharts) in R
Getting Started
Block Diagrams Load Packages
30.1 Mermaid
30.1.1 Example Block Diagram with mermaid Learning Check 1
30.2 Diagram of Ice Cream Shipment 30.3 Diagram Failures
31 Appendix: Using fitdistr to Fit Distribution Parameters
Getting Started
Load Packages 31.0.1 Our Data
31.1 Example: Exponential Distribution 31.2 Method of Moments 31.3 MLE with fitdistr() 31.4 MLE with optim() 31.5 Applications
31.5.1 Normal Distribution 31.5.2 Poisson Distribution 31.5.3 Gamma Distribution 31.5.4 Weibull Distribution
Learning Check 1 31.6 Conclusion
32 Appendix: Statistical Tables and Common Equations
32.1 Table 1: Chi-Squared 32.2 Table 2: k-factor rules 32.3 Table 3: k-factor (upper) 32.4 Table 4: k-factor (lower) 32.5 Table 5: Control Constants (d) 32.6 Table 6: Control Constants (b) 32.7 Formula
16 Statistical Process Control in Python
Figure 16.1: Statistical Process Control!
In this workshop, we will learn how to perform statistical process control in Python, using statistical tools and plotnine visualizations! Statistical Process Control refers to using statistics to (1) measure variation in product quality over time and (2) identify benchmarks to know when intervention is needed. Let’s get started!
Getting Started
Packages

Remember to install these packages using a terminal, if you haven't already:

```python
!pip install pandas plotnine scipy
```

We'll be using pandas for data manipulation, plotnine for visualization, and scipy for statistical functions.

```python
import pandas as pd
from plotnine import *
```
Custom Functions

This workshop uses custom functions from the functions/ directory. You may need both:

- functions_distributions.py - for reliability and distribution functions
- functions_process_control.py - for statistical process control functions

To use these functions, you need to acquire them from the repository at github.com/timothyfraser/sigma/tree/main/functions. Then add the functions directory to your Python path:

```python
import sys
import os

# Add the functions directory to the Python path
sys.path.append('functions')  # or the path to wherever you placed the functions folder
```

Once you have the functions available, you can import them:

```python
from functions_distributions import density, tidy_density, approxfun
# from functions_process_control import ggprocess, ggsubgroup, ggmoving, ggcapability  # if needed
```
Our Case
Figure 16.2: Obanazawa City, Yamagata Prefecture - A Hot Springs Economy. Photo credit and more here.
For today’s workshop, we’re going to think about why quality control matters in a local economy by examining the case of the Japanese hot springs bath economy! Hot springs, or onsen, are a major source of tourism and recreation for families in Japan, bringing visitors from across the country every year to the often rural communities where the right geological conditions have produced naturally occurring hot springs. Restaurants, taxi and bus companies, and many other service sector firms rely on their local onsen to bring a steady stream (pun intended) of tourists into the local economy. So it’s often in onsen operators’ best interest to keep an eye on the temperature, minerals, and other aspects of their hot spring baths to ensure quality control, keeping up their firm’s (and town’s!) reputation for quality rest and relaxation. Onsen-goers often seek out specific types of hot springs, so it’s important for an onsen to actually provide what it advertises! Serbulea and Payyappallimana (2012) describe some of these benchmarks.
- Temperature: Onsen are divided into “Extra Hot Springs” (>42°C), “Hot Springs” (34~41°C), and “Warm Springs” (25~33°C).
- pH: Onsen are classified as “Acidic” (pH < 3), “Mildly Acidic” (pH 3~6), “Neutral” (pH 6~7.5), “Mildly Alkaline” (pH 7.5~8.5), and “Alkaline” (pH > 8.5).
- Sulfur: Sulfur onsen typically have about 2 mg of sulfur per 1 kg of hot spring water; sulfur levels must exceed 1 mg to count as a sulfur onsen. (It smells like rotten eggs!)
These are decent examples of quality control metrics that onsen operators might want to keep tabs on!
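These thresholds translate directly into code. Below is a minimal sketch (a hypothetical helper, not part of the workshop's functions) that classifies a spring by the temperature benchmarks above; how to handle the gap between 41°C and 42°C is an assumption, since the published bands leave it unspecified:

```python
def classify_onsen_temp(temp_c):
    """Classify a hot spring by its temperature in Celsius,
    using the benchmark bands described above."""
    if temp_c > 42:
        return 'Extra Hot Spring'
    elif temp_c >= 34:       # assumption: 41~42 Celsius counts as a Hot Spring
        return 'Hot Spring'
    elif temp_c >= 25:
        return 'Warm Spring'
    else:
        return 'Below Warm Spring range'

print(classify_onsen_temp(44.8))  # Extra Hot Spring
print(classify_onsen_temp(36.0))  # Hot Spring
```

A similar helper could be written for the pH and sulfur bands.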
Figure 16.3: Monkeys are even fans of onsen! Read more here!
Our Data

You’ve been hired to evaluate quality control at a local onsen in sunny Kagoshima prefecture! Every month, for 15 months, you systematically took 20 random samples of hot spring water and recorded their temperature, pH, and sulfur levels. How might you determine whether this onsen is at risk of slipping out of one sector of the market (e.g., Extra Hot!) and into another (just a normal Hot Spring?)? Let’s read in our data from workshops/onsen.csv!

```python
# Add the functions directory to the path if it is not already there
import sys
if 'functions' not in sys.path:
    sys.path.append('functions')

from functions_distributions import density, tidy_density, approxfun
```

```python
water = pd.read_csv('workshops/onsen.csv')
water.head(3)
##    id  time  temp   ph  sulfur
## 0   1     1  43.2  5.1     0.0
## 1   2     1  45.3  4.8     0.4
## 2   3     1  45.5  6.2     0.9
```
16.1 Process Descriptive Statistics

First, let’s get a sense of our process by calculating some basic descriptive statistics. We’ll create a simple function to calculate the mean and standard deviation, which are fundamental to evaluating process variation.

```python
from pandas import Series

def describe(x: Series):
    x = Series(x)
    out = pd.DataFrame({
        'mean': [x.mean()],
        'sd': [x.std()],
    })
    out['caption'] = ("Process Mean: " + out['mean'].round(2).astype(str) +
                      " | SD: " + out['sd'].round(2).astype(str))
    return out
```

Now let’s apply this to our temperature data to see the overall process mean and variation.

```python
tab = describe(water['temp'])
tab
##     mean        sd                         caption
## 0  44.85  1.989501  Process Mean: 44.85 | SD: 1.99
```
16.2 Process Overview Visual

The process overview chart is one of the most important tools in SPC. It shows us how our process behaves over time, helping us identify patterns, trends, and potential issues. We’ll create a visualization that shows individual measurements, subgroup means, and the overall process average.

```python
g1 = (ggplot(water, aes(x='time', y='temp', group='time')) +
      geom_hline(aes(yintercept=water['temp'].mean()), color='lightgrey', size=3) +
      geom_jitter(height=0, width=0.25) +
      geom_boxplot() +
      labs(x='Time (Subgroup)', y='Temperature (Celsius)',
           subtitle='Process Overview',
           caption=tab['caption'][0]))

# Save the plot
g1.save('images/05_process_overview.png', width=8, height=6, dpi=100)
```

```python
g2 = (ggplot(water, aes(x='temp')) +
      geom_histogram(bins=15, color='white', fill='grey') +
      theme_void() +
      coord_flip())

# Save the plot
g2.save('images/05_process_histogram.png', width=8, height=6, dpi=100)
```
The histogram shows us the distribution of all temperature measurements, giving us insight into the overall process variation. This helps us understand if our process is centered and how much variation we’re seeing.
16.3 Subgroup (Within-Group) Statistics

In SPC, we often work with subgroups: small samples taken at regular intervals. This allows us to distinguish between common cause variation (inherent to the process) and special cause variation (due to specific events). Let’s calculate statistics for each subgroup to see how the process behaves over time.

```python
stat_s = (water.groupby('time')
          .apply(lambda d: pd.Series({
              'xbar': d['temp'].mean(),
              'r': d['temp'].max() - d['temp'].min(),
              'sd': d['temp'].std(),
              'nw': len(d)
          }))
          .reset_index())
stat_s['df'] = stat_s['nw'] - 1
stat_s['sigma_s'] = ((stat_s['df'] * (stat_s['sd']**2)).sum() / stat_s['df'].sum())**0.5
stat_s['se'] = stat_s['sigma_s'] / (stat_s['nw']**0.5)
stat_s['upper'] = stat_s['xbar'].mean() + 3*stat_s['se']
stat_s['lower'] = stat_s['xbar'].mean() - 3*stat_s['se']
stat_s.head(3)
##    time    xbar    r        sd    nw    df   sigma_s        se      upper      lower
## 0     1  44.635  4.2  1.342533  20.0  19.0  1.986174  0.444122  46.182366  43.517634
## 1     3  45.305  7.9  2.001440  20.0  19.0  1.986174  0.444122  46.182366  43.517634
## 2     5  44.765  5.9  1.628133  20.0  19.0  1.986174  0.444122  46.182366  43.517634
```

Here we’ve calculated key statistics for each subgroup:

- xbar: the mean of each subgroup
- r: the range (max - min) within each subgroup
- sd: the standard deviation within each subgroup
- sigma_s: the pooled within-subgroup standard deviation
- se: the standard error for each subgroup mean
16.3.1 Total Statistics (Between Groups)

Now let’s calculate the overall process statistics that summarize behavior across all subgroups:

```python
stat_t = pd.DataFrame({
    'xbbar': [stat_s['xbar'].mean()],
    'rbar': [stat_s['r'].mean()],
    'sdbar': [stat_s['sd'].mean()],
    'sigma_s': [(stat_s['sd']**2).mean()**0.5],
    'sigma_t': [water['temp'].std()]
})
stat_t
##    xbbar    rbar    sdbar   sigma_s   sigma_t
## 0  44.85  7.2625  1.93619  1.986174  1.989501
```

These statistics give us:

- xbbar: the grand mean (average of all subgroup means)
- rbar: the average range across subgroups
- sdbar: the average standard deviation across subgroups
- sigma_s: the pooled within-subgroup standard deviation
- sigma_t: the total process standard deviation
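One practical use of these two estimates: for a stable process, sigma_s (within-subgroup) and sigma_t (total) should be close, because subgroup means barely move; when the process drifts between subgroups, sigma_t grows while sigma_s does not. A quick sketch with simulated data (illustrative values, not the onsen data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 15 subgroups of 20 observations each, drawn from one stable process
stable = pd.DataFrame({
    'time': np.repeat(range(15), 20),
    'temp': rng.normal(45, 2, 300)
})

# Same data, but with a trend added between subgroups (process drift)
drifting = stable.copy()
drifting['temp'] = drifting['temp'] + drifting['time'] * 0.5

def sigma_ratio(df):
    """Ratio of total SD to pooled within-subgroup SD (equal subgroup sizes)."""
    sds = df.groupby('time')['temp'].std()
    sigma_s = (sds**2).mean()**0.5   # pooled within-subgroup SD
    sigma_t = df['temp'].std()       # total SD
    return sigma_t / sigma_s

print(round(sigma_ratio(stable), 2))    # near 1: stable process
print(round(sigma_ratio(drifting), 2))  # well above 1: between-group drift
```

A ratio well above 1 is a hint that the subgroup means are moving around, which the control charts below make visible.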
16.3.2 Average and Standard Deviation Charts

Control charts are the heart of SPC. They help us monitor process stability over time and detect when the process is out of control. We’ll create charts for both the subgroup means (X-bar chart) and standard deviations (S chart).

```python
labels = pd.DataFrame({
    'time': [stat_s['time'].max()]*3,
    'type': ['xbbar', 'upper', 'lower'],
    'name': ['mean', '+3 s', '-3 s'],
    'value': [stat_s['xbar'].mean(), stat_s['upper'].iloc[0], stat_s['lower'].iloc[0]]
})

control_chart = (ggplot(stat_s, aes(x='time', y='xbar')) +
                 geom_hline(aes(yintercept=stat_s['xbar'].mean()), color='lightgrey', size=3) +
                 geom_ribbon(aes(ymin='lower', ymax='upper'), fill='steelblue', alpha=0.2) +
                 geom_line(size=1) +
                 geom_point(size=5) +
                 geom_label(data=labels, mapping=aes(x='time', y='value', label='name'), ha='right') +
                 labs(x='Time (Subgroups)', y='Average',
                      subtitle='Average and Standard Deviation Chart'))

# Save the plot
control_chart.save('images/05_control_chart.png', width=8, height=6, dpi=100)
```
This control chart shows:

- Center line: the grand mean (xbbar)
- Control limits: upper and lower 3-sigma limits based on the standard error
- Individual points: each subgroup mean plotted over time
- Shaded area: the control limits region
Points outside the control limits or showing non-random patterns indicate the process may be out of control and requires investigation.
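That decision rule is easy to automate. Here is a small sketch (a hypothetical helper, not one of the workshop's custom functions) that flags subgroups whose mean falls outside the limits computed above:

```python
import pandas as pd

def flag_out_of_control(stat_s):
    """Return the subgroups whose mean falls outside the control limits.
    Expects columns 'time', 'xbar', 'lower', 'upper'."""
    mask = (stat_s['xbar'] > stat_s['upper']) | (stat_s['xbar'] < stat_s['lower'])
    return stat_s.loc[mask, ['time', 'xbar', 'lower', 'upper']]

# Tiny illustrative example (made-up numbers, not the onsen data)
demo = pd.DataFrame({
    'time': [1, 2, 3],
    'xbar': [44.6, 47.0, 44.8],
    'lower': [43.5]*3,
    'upper': [46.2]*3,
})
print(flag_out_of_control(demo))  # only subgroup 2 is flagged
```

Non-random patterns inside the limits (runs, trends) still require a human look; this only catches limit violations.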
Learning Check 1

Question: Produce the same process overview chart for pH.

[View Answer!]

```python
def ggprocess(x, y, xlab='Subgroup', ylab='Metric'):
    import pandas as pd
    from plotnine import ggplot, aes, geom_hline, geom_jitter, geom_boxplot, labs
    d = pd.DataFrame({'x': x, 'y': y})
    g = (ggplot(d, aes(x='x', y='y', group='x')) +
         geom_hline(aes(yintercept=d['y'].mean()), color='lightgrey', size=3) +
         geom_jitter(height=0, width=0.25) +
         geom_boxplot() +
         labs(x=xlab, y=ylab, subtitle='Process Overview'))
    return g

ph_chart = ggprocess(water['time'], water['ph'])

# Save the plot
ph_chart.save('images/05_ph_chart.png', width=8, height=6, dpi=100)
```
16.4 Moving Range Charts (n=1)

When we have individual measurements rather than subgroups, we use moving range charts. The moving range is the absolute difference between consecutive measurements, which helps us estimate process variation when we can’t calculate within-subgroup statistics.

```python
import numpy as np

# Take one individual measurement per subgroup
indiv = water.iloc[[0, 20, 40, 60, 80, 100, 120, 140]]
mr = indiv['temp'].diff().abs().dropna()
mrbar = mr.mean()

# Estimate the d2 constant by simulation
d2 = np.mean(np.abs(np.diff(np.random.normal(0, 1, 10000))))
sigma_s = mrbar / d2
se = sigma_s / (1**0.5)
upper = mrbar + 3*se
lower = 0

istat = pd.DataFrame({'time': indiv['time'].iloc[1:], 'mr': mr,
                      'mrbar': mrbar, 'upper': upper, 'lower': lower})

mr_chart = (ggplot(istat, aes(x='time', y='mr')) +
            geom_ribbon(aes(ymin='lower', ymax='upper'), fill='steelblue', alpha=0.25) +
            geom_hline(aes(yintercept=mr.mean()), size=3, color='darkgrey') +
            geom_line(size=1) +
            geom_point(size=5) +
            labs(x='Time (Subgroup)', y='Moving Range', subtitle='Moving Range Chart'))

# Save the plot
mr_chart.save('images/05_moving_range_chart.png', width=8, height=6, dpi=100)
```
The moving range chart shows:

- Center line: the average moving range (mrbar)
- Upper control limit: based on the estimated process standard deviation
- Lower control limit: set to 0 (moving ranges can’t be negative)
- Individual points: each moving range value
This chart helps us monitor process variation when we have individual measurements rather than subgroups.
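Incidentally, the simulated d2 used above approximates a known constant: for moving ranges (pairs of consecutive values, i.e. subgroup size n = 2), d2 equals E|X1 - X2| for independent standard normals, which works out to 2/√π ≈ 1.128. A quick check that the simulation agrees:

```python
import numpy as np

# Theoretical d2 for subgroup size n = 2:
# E|X1 - X2| for independent standard normals is 2 / sqrt(pi)
d2_theory = 2 / np.sqrt(np.pi)
print(round(d2_theory, 4))  # 1.1284

# Simulated version, as in the moving range code above
rng = np.random.default_rng(0)
d2_sim = np.mean(np.abs(np.diff(rng.normal(0, 1, 100000))))
print(round(d2_sim, 3))  # close to 1.128
```

Using the closed-form constant avoids simulation noise in the control limits; published d2 tables (like the control constants in the appendix) list this same value for n = 2.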
Conclusion

You’ve successfully produced SPC visuals and statistics in Python: process overviews, subgroup statistics, and moving range logic. These tools help us understand process behavior, identify when processes are in or out of control, and make data-driven decisions about process improvement.