Two ways to plot the confidence interval of a best-fit regression line using R and Python

When analyzing data, it is often useful to fit a regression line to model the relationship between two variables. However, it is also important to understand the uncertainty associated with the line of best fit. One way to display this uncertainty is by plotting the confidence interval about the regression line. In this document, we discuss two methods for plotting the confidence interval about a best-fit regression line, one in R and one in Python, and then decide when to use which.

Method 1: Using R + ggplot2

R is a popular open-source programming language for statistical computing and graphics. To plot the confidence interval about a best fit regression line in R, we can use the ggplot2 package. Here are the steps to do so:

Load the necessary libraries:

library(ggplot2)

Generate some data

> a=c(1:10)
> b=5*a+5*rnorm(10)
> df=data.frame(a,b)
> df
    a         b
1   1  5.253065
2   2 18.189419
3   3 15.137868
4   4 20.399989
5   5 27.297348
6   6 27.935176
7   7 29.603539
8   8 34.692199
9   9 38.631428
10 10 57.167884

Create a scatter plot with ggplot() and specify the data and variables. The mapping is necessary to let ggplot know that we want to plot the column “a” along the x-axis and the column “b” along the y-axis.

ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18)

Add the regression line with geom_smooth(method="lm"):

ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm")

The confidence interval band is added automatically by geom_smooth(method="lm"). If it isn't, or if you want to draw it explicitly, compute the interval with predict() on a fitted linear model and add it with geom_ribbon(aes(ymin=ci[,2], ymax=ci[,3]), alpha=0.2). The whole code looks like this:

lm.fit <- lm(b ~ a, data = df)
ci <- predict(lm.fit, newdata = df['a'], interval = "confidence")
ci
         fit        lwr      upr
1   7.348585  0.5342818 14.16289
2  11.811297  6.0319880 17.59061
3  16.274010 11.4134868 21.13453
4  20.736723 16.6005974 24.87285
5  25.199435 21.4780169 28.92085
6  29.662148 25.9407294 33.38357
7  34.124860 29.9887351 38.26099
8  38.587573 33.7270495 43.44810
9  43.050285 37.2709758 48.82959
10 47.512998 40.6986947 54.32730
ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm") + geom_ribbon(aes(ymin=ci[,2], ymax=ci[,3]), alpha=0.2)

ymin and ymax are the lower and upper bounds of the confidence interval. The alpha parameter adjusts the transparency of the ribbon.

Method 2: Python + seaborn

Python is another popular programming language for data analysis and visualization. To plot the confidence interval about a best fit regression line in Python, we can use the seaborn package. Here are the steps to do so:

Load the necessary libraries:

import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")

Generate data

a = np.arange(10)
b = 5*a + 5*np.random.randn(10)   # normal noise, mirroring the R example
df = pd.DataFrame({'a':a, 'b':b})

Create a scatter plot with sns.scatterplot() and specify the data and variables:

_ = sns.scatterplot(data=df, x="a", y="b")

Add the regression line with sns.regplot():

_ = sns.regplot(data=df, x="a", y="b")

Finally, the confidence interval is drawn by sns.regplot() by default; the ci parameter makes the confidence level explicit (or changes it):

_ = sns.regplot(data=df, x="a", y="b", ci=95)

The ci parameter specifies the confidence level as a percentage (95 is the default); setting ci=None removes the band.
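If you need the interval values themselves rather than just the plot (as we did with predict() in R), they can be computed with statsmodels. Below is a minimal sketch using toy data mirroring the example above; get_prediction() returns, among other things, the confidence interval of the fitted mean.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data, mirroring the example above
a = np.arange(10)
b = 5*a + 5*np.random.randn(10)
df = pd.DataFrame({'a': a, 'b': b})

# Fit b ~ a with an intercept
X = sm.add_constant(df['a'])
model = sm.OLS(df['b'], X).fit()

# 95% confidence interval for the fitted regression line
pred = model.get_prediction(X)
ci = pred.summary_frame(alpha=0.05)[['mean', 'mean_ci_lower', 'mean_ci_upper']]
print(ci)

These bounds could then be passed to matplotlib's fill_between() to reproduce the band by hand.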

Verdict

We used the ggplot2 package in R and the seaborn package in Python to generate the confidence interval plots. The ggplot2 result looks more polished, while the seaborn version was much faster to code. Choose the method that fits your needs: for graphs destined for journals, ggplot2 is often (though not always) the better choice; for a quick presentation, I prefer seaborn.

RStudio installation issue workaround for Ubuntu 22

If you are facing issues installing RStudio on Ubuntu 22 and getting this error:

Depends: libicu70 (>= 70.1-1~) but it is not installable

Here is a simple workaround that you can follow.

  1. First, follow the installation steps outlined on the official RStudio website:

https://cran.rstudio.com/

  2. Then download ICU v70 from the official Ubuntu package archive:

https://packages.ubuntu.com/jammy/libicu70

  3. Install the downloaded package using dpkg:

sudo dpkg -i libicu70_70.1-2_amd64.deb

  4. After the installation is complete, run RStudio from the applications menu or by typing rstudio in the terminal.

This workaround should help you install RStudio on Ubuntu 22 without any issues. If you face any problems, consult the official RStudio documentation or reach out to the community for help.

The Role of p-Value in Hypothesis Testing and Machine Learning

Introduction

In statistical hypothesis testing, the p-value is a widely-used tool to evaluate the significance of results obtained from a sample as they relate to a population. In machine learning, hypothesis testing is employed to validate models and determine whether the results obtained from the models are statistically significant.

Example using the Normal Distribution

Suppose we have a dataset of the heights of 1000 people, and we want to test whether the average height of the population is greater than 170 cm. We can use the normal distribution to model the distribution of heights in the population.

We can use the null hypothesis that the average height is 170 cm, and the alternative hypothesis that the average height is greater than 170 cm.

We take a sample of 50 people from the population and calculate the sample mean and standard deviation. We then calculate the test statistic:

z = (sample_mean - hypothesized_mean) / (standard_error)

where the standard error is given by:

standard_error = standard_deviation / sqrt(sample_size)

If the null hypothesis is true, then the test statistic follows a standard normal distribution. We can calculate the p-value as the area under the upper tail of the distribution beyond the observed test statistic (the upper tail, because the alternative hypothesis is one-sided: "greater than").

If the p-value is less than the significance level (usually 0.05), then we reject the null hypothesis and conclude that the average height is greater than 170 cm. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the alternative hypothesis.
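As a quick sketch of the calculation in Python, with hypothetical sample numbers (a sample mean of 172 cm and a standard deviation of 8 cm for the 50 people), scipy gives the upper-tail area directly:

import numpy as np
from scipy import stats

# Hypothetical sample statistics, for illustration only
sample_mean = 172.0        # cm
sample_sd = 8.0            # cm
sample_size = 50
hypothesized_mean = 170.0  # cm

standard_error = sample_sd / np.sqrt(sample_size)
z = (sample_mean - hypothesized_mean) / standard_error

# One-sided (right-tailed) p-value: area under the upper tail beyond z
p_value = stats.norm.sf(z)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# With these numbers z ≈ 1.77 and p ≈ 0.038, so we would reject H0 at the 5% level.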

Uses in Machine Learning

In machine learning, the p-value is used to evaluate the performance of a model. For instance, a machine learning algorithm could be used to predict whether a customer will buy a product or not. The p-value would then be used to determine whether the results obtained from the algorithm are statistically significant, i.e., whether the algorithm is better than random guessing.

The p-value is also used in feature selection, which is the process of selecting relevant features that contribute to the prediction of the target variable. Features with a low p-value are considered to be statistically significant and are usually selected for use in the model.
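As an illustration of p-value-based feature screening (a sketch on synthetic data, not tied to any particular dataset), scikit-learn's f_regression returns a p-value per feature:

import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: the target depends only on the first of three features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)

# Univariate F-test for each feature; small p-values flag relevant features
F, p_values = f_regression(X, y)
print(p_values)  # the first p-value should be far below 0.05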

Example of p-value in ML

In R, the lm() function is used to fit linear regression models. After fitting a model using lm(), the summary() function can be used to obtain a summary of the model, which includes the p-values for the coefficients.

For example, suppose we have a dataset of the weights and heights of 100 people. We want to fit a linear regression model to predict weight from height. We can use the following R code to fit the model:

model <- lm(weight ~ height, data = mydata)
summary(model)

The output of the summary() function would be:

Call:
lm(formula = weight ~ height, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-6.2500 -1.4375  0.2812  1.6250  5.8125

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -103.285      5.181 -19.924   <2e-16 ***
height         3.451      0.079  43.545   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.424 on 98 degrees of freedom
Multiple R-squared:  0.8676,	Adjusted R-squared:  0.8663
F-statistic:  1895 on 1 and 98 DF,  p-value: < 2.2e-16

The table of coefficients includes the estimated coefficients for the intercept and the height variable, as well as their standard errors, t-values, and p-values. The p-value for the height variable is less than 0.05, indicating that the height variable is statistically significant and contributes to the prediction of the target variable (weight).

Limitations of p-value

It is important to note that the p-value is not without its limitations. One of the main limitations is that it can be influenced by the sample size and the significance level chosen by the researcher. Additionally, the p-value does not provide information about the effect size or the practical significance of the results obtained from a study. Therefore, it is important to interpret the p-value in conjunction with other statistical measures, such as effect size and confidence intervals, to obtain a more complete understanding of the results obtained from a study.

One common issue is that we need to know, based on the context, whether a one-tailed or two-tailed test is appropriate. The lm() function always reports the two-tailed p-value for each t value. However, some cases, like the one described below, really call for a one-tailed test.

For example, let’s say we are interested in studying the effect of a new drug on reducing anxiety levels in patients. We have a hypothesis that the new drug will decrease anxiety levels in patients, and we are not interested in the possibility that the drug may increase anxiety levels. In this case, we would use a one-tailed test with the alternative hypothesis stating that the true mean difference in anxiety levels between the treatment and control groups is less than zero (i.e., the drug reduces anxiety levels).

mydata <- data.frame(anxiety_reduction = c(2, 4, 3, 5, 1, 6, 7, 
                        8, 9, 10, 12, 13, 14, 15, 11),
                      new_drug = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0))

model <- lm(anxiety_reduction ~ new_drug, data = mydata)
summary(model)

Call:
lm(formula = anxiety_reduction ~ new_drug, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max
-2.041667 -0.916667 -0.041667  0.958333  2.458333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5000     0.8021   3.114    0.009 **
new_drug      1.4167     0.8021   1.766    0.099 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.386 on 13 degrees of freedom
Multiple R-squared:  0.1792,	Adjusted R-squared:  0.09458
F-statistic: 3.131 on 1 and 13 DF,  p-value: 0.09852

It looks like the result is not statistically significant since the p-value > 0.05. However, the test should be one-tailed, since we only care about whether the drug reduces anxiety (i.e., whether anxiety_reduction is higher in the treatment group). The p-value in the output is two-tailed, so the one-tailed p-value is half of it: 0.09852/2 = 0.049 < 0.05, making the effect of the drug statistically significant.
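Equivalently (a quick sketch using the t value and residual degrees of freedom reported by summary() above), the one-tailed p-value is the upper-tail area of the t distribution:

from scipy import stats

# Values taken from the summary() output above
t_value = 1.766
df_residual = 13

# Upper-tail (one-sided) p-value; same as halving the two-sided p-value
p_one_tailed = stats.t.sf(t_value, df_residual)
print(round(p_one_tailed, 3))  # ≈ 0.049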

Conclusion

In conclusion, the p-value is a valuable tool in hypothesis testing and machine learning. It is used to evaluate the statistical significance of results obtained from models and experiments. However, it is important to interpret the p-value in conjunction with other statistical measures and to be aware of its limitations to obtain a complete understanding of the results obtained from a study.

Deciding between AR and MA model for time-series analysis

When it comes to time series analysis, it is important to choose the right model to make accurate predictions. Two of the most commonly used models are autoregressive (AR) and moving average (MA). The question is, how do you decide which one to use? One useful tool to help make this decision is the ACF plot.

The ACF (autocorrelation function) plot is a graph that shows the correlation between a time series and its lagged values. It is a measure of how similar a time series is to itself at different points in time. ACF plots can be used to determine whether an AR or MA model is more appropriate for a given time series.

To use ACF plots to decide on AR vs MA models, follow these steps:

Step 1: Determine the order of differencing

The first step is to determine the order of differencing needed to make the time series stationary. Stationarity is important for time series analysis because it ensures that the statistical properties of the time series remain constant over time. To determine the order of differencing, look at the p-value of the ADF test.

import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import statsmodels.api as sm
import pmdarima as pm
import yfinance as yf
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf

msft = yf.Ticker('MSFT')
df = msft.history(period='5y')

sns.light_palette("seagreen", as_cmap=True)
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
sns.lineplot(df['Close'])
plt.title('MSFT')

ADF Test

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a time series has a unit root, which indicates non-stationarity. Non-stationarity refers to the property of a time series where the statistical properties, such as mean and variance, change over time. The ADF test is commonly used in econometrics and finance to analyze the stationarity of economic and financial data, such as stock prices, interest rates, and exchange rates. The test is named after the econometricians David Dickey and Wayne Fuller, who extended the original Dickey-Fuller test to include additional explanatory variables to improve its statistical power. The ADF test is a popular tool for analyzing time series data and is widely used in academic research and practical applications.

adf_result = sm.tsa.stattools.adfuller(df['Close'])
print('ADF Statistic:', adf_result[0])
print('p-value:', adf_result[1])
print('Critical Values:', adf_result[4])

ADF Statistic: -1.1569117974100418
p-value: 0.6918512889859891
Critical Values: {'1%': -3.4355964295197743, '5%': -2.863856825923603, '10%': -2.5680035060041626}

If the p-value of the ADF test is greater than 0.05, we need to keep differencing the series until it becomes stationary. The concept of stationarity will be explained in a separate post. For a discrete time series, differentiation amounts to differencing, and the pandas diff() function does the trick. We keep differencing until the p-value of the ADF test falls below the 0.05 threshold. The number of diff operations needed is the order of differencing (d) for the time series.

p = 0.5
df1 = df['Close']
d = 0
while p > 0.05:
    p = adfuller(df1)[1]
    df1 = df1.diff().fillna(0)
    d = d + 1
d = d - 1
print("Difference value (d) for the time-series: ", d)

Difference value (d) for the time-series:  1

Since we are looking at a stock price, d=1: differencing the prices once gives us a stationary series.

The difference of stock prices does not produce martingales, since a difference is not the same as a return. The difference of the log of the stock prices produces a time series that is approximately normally distributed, which is why stock prices are said to follow a log-normal distribution. We will not concern ourselves with the distribution here and will focus on the time-series analysis. However, in real-world applications we would need to worry about the distribution of the underlying process (Lévy processes, etc.).
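As an aside, the log returns can be computed in one line from the Close prices downloaded above (a sketch; it is not needed for the ACF analysis that follows):

import numpy as np

# Log returns: difference of the log prices (drop the leading NaN)
log_returns = np.log(df['Close']).diff().dropna()
print(log_returns.head())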

Step 2: Plot the ACF

The second step is to plot the ACF for the time series data. The ACF plot will show the correlation between the time series and its lagged values. The plot will have lag on the x-axis and the correlation coefficient on the y-axis.

plot_acf(df['Close'].diff().fillna(0), lags=20)

Step 3: Look for significant spikes in the ACF plot

After determining the order of differencing, look for significant spikes in the ACF plot. A significant spike is one that falls outside the confidence band, i.e., the range within which the autocorrelation of pure noise would be expected to fall. (The spike at lag 0 is always 1 and carries no information.) If the ACF decays gradually over many lags, an AR model is more appropriate; if the ACF cuts off sharply after a small number of lags, an MA model is more appropriate; if it shows a mixture of both behaviours, an ARMA model may be needed.

Step 4: Determine the order of the AR or MA model

The final step is to determine the order of the AR or MA model, which can again be read from the ACF plot. For an MA model, the order is the lag after which the ACF cuts off: a single significant spike at lag 1 suggests MA(1), significant spikes at lags 1 and 2 suggest MA(2), and so on (for an AR model the order is usually read from the partial ACF instead). In our case there is a significant spike at lag 1 and no further significant spikes (the spurious spike around lag 9 can be ignored), so an MA(1) model is the best candidate for this time series. We can verify this conclusion using the auto_arima function as well.

AUTO_ARIMA

We can use auto_arima to find an appropriate ARIMA model automatically. The function searches the model space and picks the model that minimizes the AIC.

model = pm.auto_arima(df['Close'], seasonal=True, m=12)
print(model.order)
print(model.seasonal_order)

(0, 1, 1)
(0, 0, 0, 12)

In conclusion, the ACF plot is a useful tool for deciding on an AR vs MA model for time series analysis. By following the steps outlined above, you can determine whether an AR or MA model is appropriate and the order of the model needed to make accurate predictions.

There are beautiful flowcharts in the book by Peixeiro [2] which I found very useful in identifying the time-series model. I am including a variation below.

References

  1. MSFT Stock Price Data: https://finance.yahoo.com/quote/MSFT/history/
  2. Peixeiro, M. (2020). Time Series Forecasting Using Python: An introduction to traditional and deep learning models for time series forecasting. Apress.

Changing the Linux terminal colors

If directory names in the terminal window blend into the background, changing the terminal's color scheme will not help.

Try this instead.

Add the following line to ~/.bashrc (and then run source ~/.bashrc).

export LS_COLORS="di=4;36:ex=3;32"

di is for directory
ex is for executable
di="[attribute];[foreground color]" where the attribute is 0: no effect, 1: bold, 3: italic, 4: underline, 5: blinking

Experiment with the foreground color: 31 = red, 32 = green, 34 = blue, 36 = cyan, etc.

If anyone has any other Linux terminal tips, add in the comments below.

Calculating checksum on streaming data

In today’s fast-paced world, where data is generated at a massive scale, it is essential to process it efficiently and in real-time. This is where the concept of streaming comes into play. Streaming refers to the continuous flow of data, and it is a crucial component of many modern applications and services.

Streaming is required because traditional batch processing techniques are not suitable for handling large volumes of data that need to be processed in real-time. Streaming allows us to process data as it is generated, providing near-instantaneous results.

One example of a service that heavily relies on streaming is Amazon Web Services (AWS). AWS is the go-to cloud storage platform for most businesses, although Azure and GCP are also strong contenders. The basic idea of streaming is the same across these services, so it is instructive to understand the process independently of the platform (AWS, GCP).

In this blog post, we will focus on calculating a checksum on streaming data using Python. We will explore how to convert a pandas DataFrame to a text stream and calculate a checksum on it. This approach can be useful for verifying the integrity of data in real-time applications such as data pipelines or streaming APIs.

import csv
import io
import logging

def data_to_txt_stream(df, sep="\t", header=True, index=False):
    # Log a preview of the data being streamed
    logging.info(f"\n{df.head(5)}")
    output = io.BytesIO()
    # Write the DataFrame as a (tab-)separated byte stream with no quoting;
    # the separator is escaped if it ever appears inside a field
    df.to_csv(output, sep=sep, header=header, index=index,
              quoting=csv.QUOTE_NONE, escapechar="\\")
    data = output.getvalue()
    return data

The data_to_txt_stream function shown above takes a pandas DataFrame df and converts it into a text stream, which is returned as a bytes object. This function is useful when dealing with streaming data because it allows us to process the data as it is generated.

The to_csv method of the pandas DataFrame writes the DataFrame in CSV format directly into an io.BytesIO() buffer, which acts as an in-memory byte stream. The sep parameter specifies the separator to be used (in this case, a tab character). The header and index parameters specify whether or not to include the header row and the index, respectively. The quoting parameter disables quoting of fields (csv.QUOTE_NONE), and the escapechar parameter specifies the character used to escape the separator if it appears inside a field.

The text stream returned by the function can then be used to calculate a checksum on the data. A checksum is a value that is computed from a block of data and is used to verify the integrity of the data. In the context of streaming data, a checksum can be used to ensure that the data has not been corrupted during transmission.

import io
import hashlib
import boto3

s3 = boto3.client('s3')
# Create an in-memory buffer
buf = io.BytesIO()
# Download the object to compute a checksum on it
# (bucket_name and object_name are placeholders for your own bucket and key)
s3.download_fileobj(bucket_name, object_name, buf)
# Calculate the MD5 checksum of the downloaded bytes
file_checksum = hashlib.md5(buf.getvalue()).hexdigest()

To calculate a checksum on the text stream, we can use the Python hashlib library. The hashlib library provides various hash functions, such as SHA-256 and MD5, that can be used to compute a checksum on the data. Once the checksum has been computed, it can be compared to the expected checksum to verify the integrity of the data.
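Putting the two pieces together (a sketch with a hypothetical DataFrame), we can checksum the streamed bytes before upload and later compare that value against the checksum of the object downloaded from S3:

import hashlib
import pandas as pd

# Hypothetical data to stream
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Checksum of the bytes produced by data_to_txt_stream()
stream_bytes = data_to_txt_stream(df)
stream_checksum = hashlib.md5(stream_bytes).hexdigest()
print(stream_checksum)

# file_checksum comes from the S3 snippet above; the two should match
# if the data was not corrupted in transit:
# assert stream_checksum == file_checksum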

In conclusion, the ability to process streaming data efficiently and in real-time is essential in many modern applications and services. The data_to_txt_stream function presented in this blog post provides a way to convert a pandas DataFrame to a text stream, which can be useful when dealing with streaming data. Additionally, computing a checksum on the data can help verify the integrity of the data, which is important in real-time applications such as data pipelines or streaming APIs.

Setup python virtual environment with tensorflow-gpu

The main issue with a GPU-accelerated TensorFlow installation is the myriad compatibility problems. The easiest approach proposed online is to use a Docker image. However, the Docker image didn't work for me and took up too much space, so I discarded that idea for now and will return to it during the production phase. The core constraint is that the TensorFlow version must be compatible with the installed CUDA version.

TensorFlow 2.3.1 needs CUDA 10 or above and NVIDIA driver 450 or above, preferably nvidia-455.

These are the steps to get a working GPU accelerated tensorflow environment (Debian based system).

1. Purge nvidia drivers

sudo apt remove --purge "*nvidia*"

2. Install latest Nvidia drivers

sudo apt install nvidia-driver-455

Check your GPU and CUDA version

nvidia-smi

Or you can skip this step if you plan to install the older nvidia-450 drivers in step 4 below.

3. Create a virtual environment to contain the tensorflow

pip install virtualenv
cd ~
python3 -m venv tf-env
source tf-env/bin/activate

Replace tf-env with a name of your choice. This creates a directory tree that will contain all the Python packages, so it is best to create it on a drive with plenty of free space, although it is easy to move later.

4. Install CUDA following the recommendations from tensorflow website

Trying to install CUDA independently from the NVIDIA website will break things in all possible ways. I have tried all possible combinations – CUDA 11.1 with the TensorFlow nightly build, CUDA 10.1 with the TensorFlow stable release – and something always breaks. The best method is to follow the install instructions on the TensorFlow website to the letter.

https://www.tensorflow.org/install/gpu

The only exception is that I didn’t install the older nvidia-450 drivers. I kept the newer nvidia-455 driver.

5. Make sure all links are working

Make sure there’s a link from cuda to the actual CUDA installation in /usr/local

$ ls -l /usr/local/
lrwxrwxrwx  1 root root 9 Oct  9 17:21 cuda -> cuda-11.1
drwxr-xr-x 14 root root 4096 Oct  9 17:21 cuda-11.1

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

6. Install tensorflow

Start virtualenv if not in it already

$ source tf-env/bin/activate

And then install tensorflow

(tf-env) $ pip install tensorflow

If you already have installed the nightly (unstable) version from #4 above then it is better to uninstall it first with

(tf-env) $ pip uninstall tf-nightly

7. Test tensorflow

(tf-env) $ python
>>> import tensorflow as tf
2020-10-09 18:24:57.371340: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__version__
'2.3.1'
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

All seems to be running OK

8. Set up the virtualenv kernel in Jupyter

While in the virtual environment install ipykernel

(tf-env) $ pip install ipykernel

Add current virtual environment to Jupyter 

(tf-env) $ python -m ipykernel install --user --name=tf-env

tf-env will show up in the list of Jupyter kernels. The name for the Jupyter kernel can be anything. I kept it the same for consistency.

You can find the Jupyter kernels in ~/.local/share/jupyter/kernels

Test tensorflow gpu support in jupyter

(tf-env) $ jupyter notebook

import tensorflow as tf
tf.config.experimental.list_physical_devices()
tf.config.list_physical_devices()
tf.test.gpu_device_name()

Note: The tensorflow GPU detection in Jupyter will only work when Jupyter is run from within the virtual environment. Running Jupyter outside the virtualenv will not work even if the virtualenv kernel (tf-env) is chosen over regular system python kernel.

Quick intro to MySQL

Installation

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install mysql-server
sudo mysql_secure_installation
sudo mysql_install_db

Adding databases

Login as root

$ mysql -u root -p
mysql> CREATE DATABASE TEST;

Adding user

mysql> GRANT ALL ON TEST.* TO lion@localhost IDENTIFIED BY 'temppass1';
mysql> FLUSH PRIVILEGES;

Display specific columns from a table

mysql> SELECT Continent FROM Country;

+---------------+
| Continent     |
+---------------+
| North America |
| Asia          |
| Africa        |
| North America |
| Europe        |

Display specific columns from a table but make rows unique

mysql> SELECT DISTINCT Continent FROM Country;

Display COLUMN names

How would you know which columns to ask for if the list is long and scrolls by too fast to read the column names? Every column is first defined in SQL; only then is data added to it. You can list the columns with:
mysql> SHOW COLUMNS FROM Country;

Select specific number of rows

1. Select first 10 rows: LIMIT <no of rows>

SELECT NAME FROM country limit 10 ;

2. Select rows 4-13: LIMIT <offset>, <no of rows> (the offset is the number of rows to skip)

SELECT NAME FROM country limit 3,10 ;

Table data command list

Open table

DATABASES are like Excel files
Tables are like Sheets in each file

mysql> SHOW DATABASES;
mysql> USE JOURNAL;
mysql> SHOW TABLES;
mysql> SELECT * FROM FITNESS;

Create a table

mysql> CREATE TABLE DAILY_JOURNAL (Date DATE NOT NULL, 
Entry VARCHAR(200), Feeling VARCHAR(20), needsimprovement VARCHAR(40) ) ;

Add primary key

mysql> ALTER TABLE DAILY_JOURNAL ADD PRIMARY KEY (Date);

Delete data

mysql> DELETE FROM DAILY_JOURNAL WHERE needsimprovement = "goal setting";

Insert data

mysql> INSERT INTO DAILY_JOURNAL (Date, Entry, needsimprovement) VALUES("2016-12-14","1. ordered washing machine", "goal setting");

Add a new column

mysql> ALTER TABLE DAILY_JOURNAL ADD Learnt VARCHAR(40);

Rename column


mysql> ALTER TABLE DAILY_JOURNAL 
CHANGE gratefulfor GratefulFor VARCHAR(40);

Giving just the new column name is not enough; the column type has to be specified again.

Edit row data

mysql> UPDATE DAILY_JOURNAL SET Learnt='1. getopts 2. REGEX 3. MYSQL' 
WHERE Date='2016-12-15';

How to add (embed) Jupyter notebooks to WordPress blog posts

Jupyter notebooks are an effective way to share research, ideas, steps, and other information with others. Sharing single notebooks or entire folders with collaborators is straightforward via GitHub, but sharing notebooks with the public through a blog post needs a few more steps (most people read blogs, not GitHub, unless they are software developers). In the following we outline the steps to embed Jupyter notebooks in a blog post.

Sharing non-interactive Jupyter notebook as static HTML

There are a few ways a non-interactive Jupyter notebook can be shared as static HTML, as long as it does not rely on interactive elements such as ipywidgets. Notebooks containing such elements will still export to HTML, but the interactive parts will not work. We will discuss embedding interactive Jupyter notebooks in a later section.

Method I (simplest method)

  1. Upload file to Google Colab
  2. File -> Save a copy as GitHub Gist. It will ask for GitHub login permissions the first time it is run.
  3. Go to GitHub. Then go to your Gists by clicking your profile pic at the upper right corner and selecting Your gists from the menu.
  4. Locate the Google Colab file just shared with Gist and open the Gist.
  5. Copy the Gist ID. For example, if the address bar shows https://gist.github.com/saugatach/100a28eb7dc353feb1ed3bf18f251443 then the Gist ID is 100a28eb7dc353feb1ed3bf18f251443.
  6. Go back to WordPress editor.
  7. In the WordPress post start a new paragraph.
  8. Edit as HTML.
  9. Add the line {gist]<Gist ID>[/gist]. For the Gist ID example above the line would be {gist]100a28eb7dc353feb1ed3bf18f251443[/gist]. Change the { before gist to a [. It is not a typo. WordPress will attempt to interpret the command if [ is used instead of {.
  10. Revert back to Edit visually. Wait for the Jupyter notebook to load with the Google Colab badge on it. If it doesn’t load within 30 seconds hit refresh.

Method II (slightly more hands-on)

  1. Export the Jupyter notebook as an IPYNB file directly from JupyterLab (or the classic Jupyter interface).
  2. Open the IPYNB file in a text editor and copy the code.
  3. Go to GitHub and start a new Gist.
  4. Paste the IPYNB code in the code section.
  5. Manually name the file in Gist ending with .ipynb, for example, test.ipynb.
  6. Click edit at the top and after the page loads, change the privacy setting at the lower right corner of the code snippet to Public.
  7. Copy the Gist ID. For example, if the address bar shows https://gist.github.com/saugatach/100a28eb7dc353feb1ed3bf18f251443 then the Gist ID is 100a28eb7dc353feb1ed3bf18f251443
  8. Go back to WordPress editor.
  9. In the WordPress post start a new paragraph.
  10. Edit as HTML.
  11. Add the line {gist]<Gist ID>[/gist]. For the Gist ID example above the line would be {gist]100a28eb7dc353feb1ed3bf18f251443[/gist]. Change the { before gist to a [. It is not a typo. WordPress will attempt to interpret the command if [ is used instead of {.
  12. Revert back to Edit visually. Wait for the Jupyter notebook to load with the Google Colab badge on it. If it doesn’t load within 30 seconds hit refresh.

Method III (no extra frills, confusing steps)

  1. Export the Jupyter notebook as an HTML file directly from JupyterLab (or the classic Jupyter interface).
  2. Open the HTML file in a text editor and copy the code.
  3. Go back to WordPress editor.
  4. In the WordPress post start a new paragraph.
  5. Edit as HTML.
  6. Paste the HTML code.
  7. Revert back to Edit visually. Wait for the Jupyter notebook to load. If it doesn’t load within 30 seconds hit refresh.

To access the HTML editor in WordPress

For the older WordPress editor, there is an HTML editor tab. For the new WordPress editor: start a new paragraph, type a few random letters, then hover the mouse over the text; a horizontal menu will appear. Click on the three horizontal dots and select Edit as HTML. Clear the random letters, paste the HTML code from the Jupyter file, click on the three dots again, and select Edit visually.

Auto-update episode names of shows by Google scraping

After backing up DVDs we are sometimes faced with the mundane task of renaming the files with the episode names. A small Python script, sketched below, can automate this laborious task.
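The original post embedded the full script as a GitHub Gist, which is not reproduced here. Below is a minimal sketch of the renaming step, assuming the episode titles have already been collected (by scraping or otherwise) into a list; the directory path, file pattern, and title list are hypothetical.

import re
from pathlib import Path

# Hypothetical inputs: the season directory and the scraped episode titles, in order
season_dir = Path("/media/htd/TV Shows/Buffy the Vampire Slayer/buffyS01")
episode_titles = ["Welcome to the Hellmouth (1)", "The Harvest", "Witch", "Teacher's Pet"]

# Match names like BuffytheVampireSlayer_s01_e01.mp4 and extract season/episode numbers
pattern = re.compile(r"_s(\d{2})_e(\d{2})\.mp4$", re.IGNORECASE)

for old_path in sorted(season_dir.glob("*.mp4")):
    match = pattern.search(old_path.name)
    if not match or int(match.group(2)) > len(episode_titles):
        continue
    season, episode = int(match.group(1)), int(match.group(2))
    new_name = f"S{season:02d} E{episode:02d} · {episode_titles[episode - 1]}.mp4"
    print(f"{old_path.name} -> {new_name}")
    old_path.rename(old_path.with_name(new_name))

The renamed files then look like the examples below.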


/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e01.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E01 · Welcome to the Hellmouth (1).mp4

/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e02.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E02 · The Harvest.mp4

/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e03.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E03 · Witch.mp4

/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e04.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E04 · Teacher’s Pet.mp4