When analyzing data, it is often useful to fit a regression line to model the relationship between two variables. It is equally important to understand the uncertainty in that line of best fit, and one way to show it is to plot the confidence interval about the regression line. In this document, we discuss two methods for plotting that confidence interval, one in R and one in Python, and close with a note on when to use each.
Method 1: R + ggplot2
R is a popular open-source programming language for statistical computing and graphics. To plot the confidence interval about a best fit regression line in R, we can use the ggplot2 package. Here are the steps to do so:
Load the necessary libraries:
library(ggplot2)
Generate some data (the values will differ on each run because rnorm() draws random noise):
> a=c(1:10)
> b=5*a+5*rnorm(10)
> df=data.frame(a,b)
> df
a b
1 1 5.253065
2 2 18.189419
3 3 15.137868
4 4 20.399989
5 5 27.297348
6 6 27.935176
7 7 29.603539
8 8 34.692199
9 9 38.631428
10 10 57.167884
Create a scatter plot with ggplot() and specify the data and variables. The mapping is necessary to let ggplot know that we want to plot the column “a” along the x-axis and the column “b” along the y-axis.
ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18)

Add the regression line with geom_smooth(method="lm"):
ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm")

The confidence interval is added automatically: geom_smooth(method="lm") draws a 95% confidence band by default (se=TRUE, level=0.95). To compute the band explicitly instead, fit the model with lm(), get the bounds from predict(), and draw them with geom_ribbon(). The whole code looks like this:
lm.fit=lm(b~a, data=df)
ci=predict(lm.fit, newdata = df['a'], interval = "confidence")
ci
fit lwr upr
1 7.348585 0.5342818 14.16289
2 11.811297 6.0319880 17.59061
3 16.274010 11.4134868 21.13453
4 20.736723 16.6005974 24.87285
5 25.199435 21.4780169 28.92085
6 29.662148 25.9407294 33.38357
7 34.124860 29.9887351 38.26099
8 38.587573 33.7270495 43.44810
9 43.050285 37.2709758 48.82959
10 47.512998 40.6986947 54.32730
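ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm", se=FALSE) + geom_ribbon(aes(ymin=ci[,"lwr"], ymax=ci[,"upr"]), alpha=0.2)
ymin and ymax are the lower and upper bounds of the confidence interval, taken from the lwr and upr columns of ci. The alpha parameter adjusts the transparency of the ribbon. Setting se=FALSE turns off the band that geom_smooth() would otherwise draw, so the interval is not drawn twice.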
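For reference, the lwr and upr columns returned by predict(..., interval = "confidence") correspond to the standard textbook confidence interval for the mean response in simple linear regression. The symbols below are introduced here for illustration and do not appear in the original: x_0 is the point where the band is evaluated, n is the number of observations, s is the residual standard error, and t is the Student t quantile with n-2 degrees of freedom.

```latex
\hat{y}(x_0) \;\pm\; t_{1-\alpha/2,\;n-2}\; s\,
\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\qquad
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \bigl(y_i - \hat{y}(x_i)\bigr)^2
```

The interval is narrowest at x_0 = \bar{x} and widens toward the ends of the data. The table above shows this: the interval is about 7.44 wide at a=5 and a=6 but about 13.63 wide at a=1 and a=10.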
Method 2: Python + seaborn
Python is another popular programming language for data analysis and visualization. To plot the confidence interval about a best fit regression line in Python, we can use the seaborn package. Here are the steps to do so:
Load the necessary libraries:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")
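Generate some data:
a = np.arange(1, 11)  # 1..10, matching 1:10 in the R example
b = 5*a + 5*np.random.randn(10)  # Gaussian noise, matching rnorm() in the R example
df = pd.DataFrame({'a':a, 'b':b})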
Create a scatter plot with sns.scatterplot() and specify the data and variables:
_ = sns.scatterplot(data=df, x="a", y="b")

Add the regression line with sns.regplot():
_ = sns.regplot(data=df, x="a", y="b")

sns.regplot() already draws a 95% confidence band by default. To set the level explicitly, pass the ci parameter:
_ = sns.regplot(data=df, x="a", y="b", ci=95)

The ci parameter specifies the confidence level as a percentage (the default is 95). seaborn estimates the band by bootstrap resampling, and ci=None turns it off.
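To make the meaning of ci more concrete, here is a minimal sketch of a percentile bootstrap band for a straight-line fit, using only NumPy. It illustrates the general technique and is not seaborn's actual implementation. The function name bootstrap_band and its parameters are hypothetical, and the data are the ten (a, b) pairs printed in the R example above.

```python
import numpy as np

def bootstrap_band(x, y, grid, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap band for a straight-line fit (illustrative helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    preds = np.empty((n_boot, len(grid)))
    for i in range(n_boot):
        # Resample (x, y) pairs with replacement and refit the line.
        idx = rng.integers(0, n, size=n)
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        preds[i] = intercept + slope * grid
    # Keep the central `ci` percent of the refitted lines at each grid point.
    tail = (100 - ci) / 2
    lower, upper = np.percentile(preds, [tail, 100 - tail], axis=0)
    return lower, upper

# Data copied from the R example's printed data frame.
a = np.arange(1, 11)
b = np.array([5.253065, 18.189419, 15.137868, 20.399989, 27.297348,
              27.935176, 29.603539, 34.692199, 38.631428, 57.167884])

grid = np.linspace(a.min(), a.max(), 50)
lower, upper = bootstrap_band(a, b, grid)
```

The resulting lower and upper arrays could be passed to matplotlib's fill_between(grid, lower, upper, alpha=0.2) to draw a band comparable to the one regplot() shows. Because the bootstrap is random, the band changes slightly with the seed and will not match the t-based interval from R exactly.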
Verdict
We used the ggplot2 package in R and the seaborn package in Python to plot the confidence interval about a regression line. In my view the ggplot2 result looks more polished, while the seaborn version was much faster to code. Choose the method that fits your needs: for graphs destined for journal publication, ggplot2 is often (though not always) the better choice; for a quick presentation, I prefer seaborn.



