A Comparison of Classifiers for Mammogram Data

In this post, we will explore the mammographic mass dataset from the UCI Machine Learning Repository and learn how to predict malignant tumors using different ML classifiers. The point is to compare and contrast the classifiers and pick the best one. This is a typical exploratory analysis: we work on a small dataset to decide which ML algorithm to use before deploying it at scale.

We will be using the mammographic mass dataset from the UCI Machine Learning Repository. The dataset contains 961 instances and 6 attributes, including the target variable. The goal is to predict whether a mass seen on a mammogram is benign or malignant, a binary classification task.

There are several classifiers we can use. However, we will be comparing the results of three classifiers: Deep Neural Network (DNN), Random Forest, and XGBoost. We will also be using Stratified K-Fold cross-validation to ensure a fair comparison of the classifiers.

The reason for choosing these classifiers is that they are widely used and have distinct characteristics. The DNN is a deep learning model that can capture complex, non-linear relationships. Random Forest is an ensemble of decision trees known for its stability against outliers, resistance to overfitting, ability to generalize well, and ability to handle categorical features. XGBoost is a gradient boosting algorithm that excels in performance and speed, particularly on large datasets with many categories. However, if it performs poorly on a smaller dataset, there is no reason to choose XGBoost over Random Forest: higher speed should not come at the cost of accuracy.

To begin our analysis, we make sure we ingest the data correctly. The dataset ships as a ZIP file containing the data and a summary file, as is common for ML datasets, so we have to extract and locate the actual data inside the ZIP. You can skip the next section if you are only interested in the classifiers; for MLEs and data scientists, the data ingestion pipeline is worth a look.

Handling ZIP Files and Data Extraction

import pandas as pd
from io import BytesIO
from zipfile import ZipFile
import requests

# Load and extract data from the ZIP file
url = 'https://archive.ics.uci.edu/static/public/161/mammographic+mass.zip'
response = requests.get(url)
with ZipFile(BytesIO(response.content), 'r') as zip_file:
    with zip_file.open('mammographic_masses.data') as file:
        # The file has no header row and uses '?' for missing values
        masses_data = pd.read_csv(file, na_values=['?'],
                                  names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])

In the above code, we use the requests library to download the ZIP file from the UCI Machine Learning Repository and the ZipFile class from the zipfile module to extract the data file from it. The data is then read into a pandas DataFrame with read_csv, supplying the column names explicitly (the file has no header row) and treating '?' as a missing value. The data is now ready for further processing and analysis.

Classifiers at a Glance

We will be testing three distinct classifiers:

1. Deep Neural Network (DNN)

We will create a DNN model using the Keras API from TensorFlow. The model takes the 4 input features, passes them through a hidden layer of 6 neurons with a ReLU activation function, and ends in an output layer with 1 neuron and a sigmoid activation function. The model is compiled with the Adam optimizer, the binary crossentropy loss function, and accuracy as the evaluation metric.

Choice of the Activation Function

The first Dense layer has 6 neurons connected to the 4 inputs, which are the 4 features used for prediction (age, shape, margin, and density). Whatever the size of the hidden layers, the input dimension must equal the number of features in the dataset.

The 6 neurons need an activation. ReLU zeroes out negative pre-activations and passes positive values through unchanged. It is a popular choice for deep learning models due to its simplicity and effectiveness: it helps mitigate the vanishing gradient problem and accelerates convergence.

The output layer has 1 neuron connected to the 6 neurons of the previous layer, because we want a single output: positive (1) if the features suggest a possible malignancy, and negative (0) if they suggest a benign mass. The sigmoid function, a logistic function, squashes the neuron's real-valued output into the range (0, 1), which can be read as a probability and thresholded (typically at 0.5) to produce the final 0/1 label. That makes it the natural choice for binary classification tasks.
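
To make the two activations concrete, here is a small NumPy sketch (separate from the model code) showing how ReLU zeroes out negative pre-activations while the sigmoid maps any real-valued output into (0, 1) so it can be thresholded at 0.5:

import numpy as np

def relu(z):
    # keep positive values, zero out negative ones
    return np.maximum(0, z)

def sigmoid(z):
    # map any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))        # [0.  0.  0.  1.5 3. ]
print(sigmoid(3.0))   # ~0.95 -> classified as malignant (1) with a 0.5 threshold
print(sigmoid(-2.0))  # ~0.12 -> classified as benign (0)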

Choice of the Loss Function

The loss function measures how well the model is performing, and its choice is crucial because it guides the model in the right direction during training; the right choice depends on the nature of the problem. Binary crossentropy is a popular choice for binary classification tasks: it measures the difference between the predicted probability and the true label, which makes it well-suited for our mammogram dataset and an ideal choice for our DNN model.
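
As a quick illustration (a hand-rolled sketch, not the Keras implementation), binary crossentropy penalizes confident wrong predictions far more heavily than confident correct ones:

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# a confident, correct prediction has a low loss...
print(binary_crossentropy(np.array([1.0]), np.array([0.95])))  # ~0.05
# ...while a confident, wrong prediction is heavily penalized
print(binary_crossentropy(np.array([1.0]), np.array([0.05])))  # ~3.0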

Choice of the Optimizer

The Adam optimizer is a popular choice for training deep learning models. It is an extension of the stochastic gradient descent algorithm that computes adaptive learning rates for each parameter. Adam is known for its speed and performance, making it a compelling choice for training deep neural networks. This choice is conventional, and it works well for most deep learning models.
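
If more control is needed than the 'adam' string shorthand provides, the compile call inside create_original_model (defined just below) can instead take an explicitly configured optimizer; the learning rate shown here is simply Adam's default, used for illustration:

from tensorflow.keras.optimizers import Adam

# equivalent to optimizer='adam', but with the learning rate spelled out
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])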

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Model for Deep Neural Network (DNN)
def create_original_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(metrics=['accuracy'], optimizer='adam', loss='binary_crossentropy')
    return model

The resulting architecture is small: the 4 input features feed a 6-neuron ReLU hidden layer, which feeds a single sigmoid output neuron.

2. Random Forest

Random Forest is a powerful ensemble of decision trees recognized for its robust predictions. Its strength lies in stability against outliers, effective handling of categorical features, and resilience to overfitting, making it a dependable choice, especially for smaller datasets. Setting the random state ensures reproducibility, and 100 estimators is a reasonable default that balances accuracy against training time.

from sklearn.ensemble import RandomForestClassifier

# Model for Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

3. XGBoost

XGBoost is a gradient boosting algorithm designed for speed and strong predictive performance. Its configuration exposes hyperparameters such as learning_rate and n_estimators, which control the trade-off between training speed and accuracy. Setting a random state guarantees reproducibility, making XGBoost a compelling choice, particularly for large datasets.

from xgboost import XGBClassifier

# Model for XGBoost
xgb_classifier = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)

When to Use Each Classifier

When the dataset is small, Random Forest is a reliable choice. It’s robust, less prone to overfitting, and can handle categorical features with ease. XGBoost, on the other hand, shines with large datasets, offering superior performance and speed. As for DNN, it’s a versatile choice, particularly for complex, non-linear relationships. Its deep architecture can capture intricate patterns, making it a compelling option for image and text data.

Stratified K-Fold and Imbalanced Datasets

The dataset’s imbalances demand careful handling. We will use Stratified K-Fold, a technique that ensures class distribution integrity in each fold. This guarantees a fair evaluation, particularly crucial when dealing with imbalanced datasets.

from sklearn.model_selection import StratifiedKFold

cv_method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

Cross-Validation Scores and Comparison

from sklearn.model_selection import cross_val_score
from scikeras.wrappers import KerasClassifier
import seaborn as sns
import matplotlib.pyplot as plt

# Wrap the Keras model so scikit-learn's cross_val_score can use it
dnn = KerasClassifier(model=create_original_model, epochs=100, verbose=0)

# Run cross-validation for each model (X and y are the scaled features and labels)
dnn_scores = cross_val_score(dnn, X, y, cv=cv_method)
rf_scores = cross_val_score(rf_classifier, X, y, cv=cv_method)
xgb_scores = cross_val_score(xgb_classifier, X, y, cv=cv_method)

# Compile results
results = pd.DataFrame({
    'DNN': dnn_scores,
    'RandomForest': rf_scores,
    'XGBoost': xgb_scores
})

# Plot results using seaborn boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=results)
plt.title('Cross-Validation Scores Comparison')
plt.ylabel('Accuracy')
plt.show()

And we get a boxplot that visually compares the cross-validation scores of our classifiers and highlights the strengths and variations in their performance. A boxplot was chosen to show the spread of the CV scores as well as the median.
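
Beyond the plot, a quick numeric summary of the same results DataFrame makes the comparison easy to quote:

# mean and standard deviation of the 10-fold accuracies for each classifier
print(results.agg(['mean', 'std']).round(3))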

Conclusion

In this study, we analyzed mammogram data with three classification algorithms, comparing and contrasting their performance on a simple classification task. We also took a closer look at the Deep Neural Network (DNN), emphasizing the architectural considerations relevant to this task.

Happy coding and exploring the realms of machine learning!

The full code

import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from scikeras.wrappers import KerasClassifier
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO


def load_data():
    print("Downloading and extracting the ZIP file...")
    # Download and extract the ZIP file
    zip_url = 'https://archive.ics.uci.edu/static/public/161/mammographic+mass.zip'
    response = urlopen(zip_url)
    zip_data = BytesIO(response.read())

    with ZipFile(zip_data, 'r') as zip_ref:
        # The ZIP contains a .data file and a .names summary; pick the .data file
        csv_file_name = [name for name in zip_ref.namelist() if name.endswith('.data')][0]
        print(f"Extracting data from {csv_file_name}...")
        with zip_ref.open(csv_file_name) as file:
            # Read the CSV file into a DataFrame
            masses_data = pd.read_csv(file, na_values=['?'],
                                      names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])

    print("ETL: Dropping missing values...")
    # Assuming masses_data is your DataFrame
    # ETL and handling missing values
    masses_data = masses_data.dropna()

    all_features = masses_data[['age', 'shape', 'margin', 'density']].values
    all_classes = masses_data['severity'].values

    # Scaling features
    scaler = StandardScaler()
    all_features_scaled = scaler.fit_transform(all_features)
    return all_features_scaled, all_classes


# Model for Neural Network (DNN)
def create_original_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(metrics=['accuracy'], optimizer='adam', loss='binary_crossentropy')
    return model


# Function to run cross-validation and return scores
def run_cv(model, X, y, cv_method):
    print(f"Running {model.__class__.__name__} model...")
    return cross_val_score(model, X, y, cv=cv_method)


# Main method to compile and plot results
def run():
    X, y = load_data()

    # Model for Random Forest
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Model for XGBoost
    xgb_classifier = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)

    # Model for a Deep Neural Network (wrapped for scikit-learn compatibility)
    dnn = KerasClassifier(model=create_original_model, epochs=100, verbose=0)

    cv_method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    # Run cross-validation for each model
    dnn_scores = run_cv(dnn, X, y, cv_method)
    rf_scores = run_cv(rf_classifier, X, y, cv_method)
    xgb_scores = run_cv(xgb_classifier, X, y, cv_method)

    # Compile results
    results = pd.DataFrame({
        'DNN': dnn_scores,
        'RandomForest': rf_scores,
        'XGBoost': xgb_scores
    })

    # Plot results using seaborn boxplot
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=results)
    plt.title('Cross-Validation Scores Comparison')
    plt.ylabel('Accuracy')
    # save the plot before showing it (plt.show() clears the current figure)
    plt.savefig('cross_val_scores.png')
    plt.show()


# Run the main method
run()

Heteroscedasticity Unmasked: Taming Increasing Variance with Transformations!

  1. What is Heteroscedasticity? Heteroscedasticity refers to the uneven spread or varying levels of dispersion in data points, meaning that the variability of data isn’t consistent across the range.
  2. Why is it Bad for Modeling and Prediction? Heteroscedasticity can wreak havoc on modeling and prediction because it violates the assumption of constant variance in many statistical techniques, leading to biased results and unreliable predictions.
  3. How to Handle It? To tackle heteroscedasticity, one effective approach is data transformation: altering the data with mathematical functions to stabilize the variance and make your models more robust.

The Data: Let’s start with some code and synthetic data for house prices, where the variability increases as house sizes grow. This mirrors real-world scenarios where larger properties often exhibit more price variability.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial import Polynomial

# Step 1: Generate Synthetic Data with Heteroscedasticity
np.random.seed(0)  # For reproducibility
house_sizes = np.linspace(1000, 5000, 100)  # House sizes
true_prices = 50000 + 30 * house_sizes  # True house prices

# Introduce heteroscedasticity - variance increases with house size
error_term = np.random.normal(np.zeros(100), 7 * house_sizes)
noisy_prices = true_prices + error_term

# Split error_term into positive and negative components
positive_errors = np.where(error_term > 0, error_term, 0)
negative_errors = np.where(error_term < 0, error_term, 0)

Step 2: Now, let’s plot the bounding lines, illustrating the increasing variance.

# Fit polynomials to the positive and negative errors
positive_poly = Polynomial.fit(house_sizes, positive_errors, deg=1)
negative_poly = Polynomial.fit(house_sizes, negative_errors, deg=1)

# Calculate values for the bounding lines
positive_bounds = positive_poly(house_sizes)
negative_bounds = negative_poly(house_sizes)
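
The plotting itself could look like the following sketch (variable names follow the snippets above; the bounding lines are drawn around the true price trend, and the styling choices are arbitrary):

# Plot the noisy prices together with the fitted bounding lines
plt.figure(figsize=(10, 6))
plt.scatter(house_sizes, noisy_prices, alpha=0.5, label='Noisy prices')
plt.plot(house_sizes, true_prices + positive_bounds, 'r--', label='Upper bound')
plt.plot(house_sizes, true_prices + negative_bounds, 'g--', label='Lower bound')
plt.xlabel('House size (sq ft)')
plt.ylabel('Price')
plt.title('Heteroscedasticity: variance grows with house size')
plt.legend()
plt.show()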

Step 3: To stabilize the data, we’ll apply a logarithmic transformation to the noisy prices. This transformation makes the data more suitable for analysis.

# Step 3: Apply a Logarithmic Transformation
# Collect the synthetic data into a DataFrame first
data = pd.DataFrame({'Size': house_sizes, 'Price': noisy_prices})
data['Log_Price'] = np.log(data['Price'])

Result: The transformation effectively reduces the increasing variance. We can visualize the change through comparison plots, showing the data’s improved stability.
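
A minimal side-by-side comparison, reusing the data DataFrame built in Step 3, might look like this:

# Before/after comparison: raw prices vs. log-transformed prices
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(data['Size'], data['Price'], alpha=0.5)
ax1.set_title('Original prices (increasing spread)')
ax2.scatter(data['Size'], data['Log_Price'], alpha=0.5)
ax2.set_title('Log-transformed prices (more even spread)')
for ax in (ax1, ax2):
    ax.set_xlabel('House size (sq ft)')
plt.tight_layout()
plt.show()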

Conclusion: Addressing increasing variance with transformations is a valuable tool in data analysis. Whether you’re dealing with house prices or any other dataset, understanding and mitigating heteroscedasticity enhances the reliability of your analysis and decision-making.

Fitting noisy data with a spline to see trends

Are you struggling with visualizing or analyzing data that is noisy or contains irregularities? Do you need to identify trends that may not be immediately apparent to the naked eye or require an analytical approach? One way to address these challenges is by using a spline, a mathematical function that smoothly connects data points. Splines can be used to “smooth out” data, making it easier to analyze and visualize.

In this blog post, we will explore the use of the UnivariateSpline function in Python’s scipy.interpolate library to fit a spline to noisy data. Spline interpolation is a method of interpolating data points by fitting a piecewise-defined polynomial function to the data.

In Python, the scipy.interpolate module provides several functions for fitting splines to data. One of these functions is UnivariateSpline, which fits a spline to a set of one-dimensional data points.

Let’s take a closer look at how UnivariateSpline works and how it can be used to smooth out noisy data in Python. We’ll use an example code snippet to illustrate the process.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 100)
y1 = 0.2*x + 0.5*np.sin(x) + 2*np.random.normal(0, 0.1, size=100)
xs = np.linspace(np.min(x), np.max(x), 1000)

In this code, we first import the necessary modules: numpy, matplotlib.pyplot, and scipy.interpolate. We then create a set of x values using numpy.linspace(), which generates a linearly spaced array of 100 values between 0 and 10, and generate the noisy y1 data points on top of a gentle trend using numpy's random functions.

Next, we define the range over which we will evaluate the spline by creating a new array, xs, with 1000 evenly spaced points between the minimum and maximum x values.

Now, we use the UnivariateSpline function to fit a spline to the noisy data:

spl = UnivariateSpline(x, y1)

We then plot the original data points and the fitted spline with all default parameters:

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=AUTO)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

The resulting plot shows that the spline fits the noisy data reasonably well, but there is some room for improvement.

Next, we try fitting the spline with a different degree, k, and a smoothing factor, s:

spl = UnivariateSpline(x, y1, k=4)
spl.set_smoothing_factor(20)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=4, smooth=20)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

This time, we use a degree of 4 for the spline and a smoothing factor of 20. The resulting plot shows that the spline fits the data even better than before. Finally, we try fitting the spline with a different smoothing factor:

spl = UnivariateSpline(x, y1)
spl.set_smoothing_factor(1)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=1)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

This time, we set the smoothing factor to 1. The resulting plot shows that the spline now fits the data too closely, and has likely overfit the data.

If the smoothing factor is set to 0, UnivariateSpline interpolates through every data point, so the curve follows the original data exactly, even if the data are noisy or irregular. Very small smoothing factors therefore lead to overfitting, where the curve follows the noise rather than the underlying trend in the data. (When no smoothing factor is supplied, scipy picks a default based on the number of data points.)
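
The extreme case is easy to check with the same x and y1 arrays: with a smoothing factor of 0, the spline passes through every noisy point exactly.

# s=0 forces the spline to interpolate every data point
spl_interp = UnivariateSpline(x, y1, s=0)
print(np.allclose(spl_interp(x), y1))  # expected: True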

In conclusion, the UnivariateSpline function in Python's scipy.interpolate library is a powerful tool for fitting a spline to noisy data. By adjusting the degree of the spline and the smoothing factor, we can achieve a good balance between fitting the data closely and avoiding overfitting. This method can also help us draw envelopes around Monte Carlo chains or construct a Snell envelope. The possibilities are endless.

Full code

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 100)
y1 = 0.2*x + 0.5*np.sin(x) + 2*np.random.normal(0, 0.1, size=100)
xs = np.linspace(np.min(x), np.max(x), 1000)

spl = UnivariateSpline(x, y1)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=AUTO)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

spl = UnivariateSpline(x, y1, k=4)
spl.set_smoothing_factor(20)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=4, smooth=20)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

spl = UnivariateSpline(x, y1)
spl.set_smoothing_factor(1)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=1)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

2 ways to plot the confidence interval of a best fit regression line using R and python

When analyzing data, it is often useful to fit a regression line to model the relationship between two variables. However, it is also important to understand the uncertainty associated with the line of best fit. One way to display this uncertainty is by plotting the confidence interval about the regression line. In this document, we will discuss two methods for plotting the confidence interval about a best fit regression line using R and Python. Finally, we decide on when to use which one.

Method 1: Using R + ggplot2

R is a popular open-source programming language for statistical computing and graphics. To plot the confidence interval about a best fit regression line in R, we can use the ggplot2 package. Here are the steps to do so:

Load the necessary libraries:

library(ggplot2)

Generate some data

> a=c(1:10)
> b=5*a+5*rnorm(10)
> df=data.frame(a,b)
> df
    a         b
1   1  5.253065
2   2 18.189419
3   3 15.137868
4   4 20.399989
5   5 27.297348
6   6 27.935176
7   7 29.603539
8   8 34.692199
9   9 38.631428
10 10 57.167884

Create a scatter plot with ggplot() and specify the data and variables. The mapping is necessary to let ggplot know that we want to plot the column “a” along the x-axis and the column “b” along the y-axis.

ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18)

Add the regression line with geom_smooth(method="lm"):

ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm")

The confidence interval is automatically added. In case it isn't, add a geom_ribbon() layer: ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm") + geom_ribbon(aes(ymin=ci[,2], ymax=ci[,3]), alpha=0.2). Here ci holds the fitted values and confidence bounds from predict(). The whole code looks like this:

lm.fit = lm(b ~ a, data = df)
ci = predict(lm.fit, newdata = df['a'], interval = "confidence")
         fit        lwr      upr
1   7.348585  0.5342818 14.16289
2  11.811297  6.0319880 17.59061
3  16.274010 11.4134868 21.13453
4  20.736723 16.6005974 24.87285
5  25.199435 21.4780169 28.92085
6  29.662148 25.9407294 33.38357
7  34.124860 29.9887351 38.26099
8  38.587573 33.7270495 43.44810
9  43.050285 37.2709758 48.82959
10 47.512998 40.6986947 54.32730
ggplot(df, mapping=aes(x=a,y=b)) + geom_point(shape=18) + geom_smooth(method="lm") + geom_ribbon(aes(ymin=ci[,2], ymax=ci[,3]), alpha=0.2)

ymin and ymax are the lower and upper bounds of the confidence interval. The alpha parameter adjusts the transparency of the ribbon.

Method 2: Python + seaborn

Python is another popular programming language for data analysis and visualization. To plot the confidence interval about a best fit regression line in Python, we can use the seaborn package. Here are the steps to do so:

Load the necessary libraries:

import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")

Generate data

a = np.arange(10)
b = 5*a + 5*np.random.rand(10)
df = pd.DataFrame({'a':a, 'b':b})

Create a scatter plot with sns.scatterplot() and specify the data and variables:

_ = sns.scatterplot(data=df, x="a", y="b")

Add the regression line with sns.regplot():

_ = sns.regplot(data=df, x="a", y="b")

regplot() draws a 95% confidence band by default; the level can be set explicitly with the ci parameter:

_ = sns.regplot(data=df, x="a", y="b", ci=95)

The ci parameter specifies the confidence interval level in percentage.
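
Two related options worth knowing: passing ci=None removes the band entirely, and a lower level such as ci=68 draws a narrower band (both are standard regplot parameters):

# no confidence band at all
_ = sns.regplot(data=df, x="a", y="b", ci=None)

# a narrower, 68% confidence band
_ = sns.regplot(data=df, x="a", y="b", ci=68)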

Verdict

We used the ggplot2 package in R and the seaborn package in Python to generate the confidence interval plots. The ggplot2 result looks more polished, while the seaborn version was much faster to code. We can choose the method that fits our needs: if we want to publish graphs in journals, ggplot2 might be the better choice (though not always); for a quick presentation, I prefer seaborn.

Calculating checksum on streaming data

In today’s fast-paced world, where data is generated at a massive scale, it is essential to process it efficiently and in real-time. This is where the concept of streaming comes into play. Streaming refers to the continuous flow of data, and it is a crucial component of many modern applications and services.

Streaming is required because traditional batch processing techniques are not suitable for handling large volumes of data that need to be processed in real-time. Streaming allows us to process data as it is generated, providing near-instantaneous results.

One example of a service that heavily relies on streaming is Amazon Web Services (AWS). AWS is the go-to cloud storage platform for most businesses, although Azure and GCP are also strong contenders. The basic idea of streaming is the same across all these services, so it is instructive to understand the process independently of the platform (AWS, GCP).

In this blog post, we will focus on calculating a checksum on streaming data using Python. We will explore how to convert a pandas DataFrame to a text stream and calculate a checksum on it. This approach can be useful for verifying the integrity of data in real-time applications such as data pipelines or streaming APIs.

import csv
import io
import logging

def data_to_txt_stream(df, sep="\t", header=True, index=False):
    logging.info(f"\n{df.head(5)}")
    output = io.BytesIO()
    df.to_csv(output, sep=sep, header=header,
              index=index, quoting=csv.QUOTE_NONE,
              quotechar='', escapechar='')
    data = output.getvalue()
    return data

The data_to_txt_stream function shown above takes in a pandas DataFrame df and converts it into a text stream, which is returned as a bytes object. This function is useful when dealing with streaming data because it allows us to process the data as it is generated.

The to_csv method of the pandas DataFrame writes the DataFrame, in CSV format, directly into an in-memory io.BytesIO buffer. The sep parameter specifies the separator to use (in this case, a tab character). The header and index parameters specify whether or not to include the header and index, respectively. The quoting parameter specifies the quoting behavior for fields that contain special characters, and the quotechar and escapechar parameters specify the quote and escape characters to use.

The text stream returned by the function can then be used to calculate a checksum on the data. A checksum is a value that is computed from a block of data and is used to verify the integrity of the data. In the context of streaming data, a checksum can be used to ensure that the data has not been corrupted during transmission.

import io
import hashlib
import boto3

s3 = boto3.client('s3')
# Create an in-memory buffer
buf = io.BytesIO()
# Download the object to calculate a checksum on it
# (bucket_name and object_name identify the uploaded object)
s3.download_fileobj(bucket_name, object_name, buf)
# Calculate an MD5 checksum of the downloaded bytes
file_checksum = hashlib.md5(buf.getvalue()).hexdigest()

To calculate a checksum on the text stream, we can use the Python hashlib library. The hashlib library provides various hash functions, such as SHA-256 and MD5, that can be used to compute a checksum on the data. Once the checksum has been computed, it can be compared to the expected checksum to verify the integrity of the data.
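
As a sketch of how the two ends fit together (assuming the S3 object was previously uploaded from the same DataFrame df that data_to_txt_stream was called on), the local checksum can be compared with the one computed on the downloaded object:

import hashlib

# checksum of the locally generated stream...
local_checksum = hashlib.md5(data_to_txt_stream(df)).hexdigest()

# ...compared with the checksum of the object downloaded from S3
if local_checksum == file_checksum:
    print("Checksums match: the data arrived intact.")
else:
    print("Checksum mismatch: the data was corrupted in transit.")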

In conclusion, the ability to process streaming data efficiently and in real-time is essential in many modern applications and services. The data_to_txt_stream function presented in this blog post provides a way to convert a pandas DataFrame to a text stream, which can be useful when dealing with streaming data. Additionally, computing a checksum on the data can help verify the integrity of the data, which is important in real-time applications such as data pipelines or streaming APIs.

Setup python virtual environment with tensorflow-gpu

The main issue with a GPU-accelerated TensorFlow installation is the myriad compatibility problems. The easiest way proposed online is to use a Docker image. However, the Docker image didn't work for me and took up too much space, so I discarded that idea, mostly because of space constraints; I will return to it later during the production phase. The core constraint is that the TensorFlow version must be compatible with the installed CUDA version.

TensorFlow 2.3.1 needs CUDA 10 or above and an NVIDIA driver of version 450 or above, preferably nvidia-455.

These are the steps to get a working GPU accelerated tensorflow environment (Debian based system).

1. Purge nvidia drivers

sudo apt remove --purge "*nvidia*"

2. Install latest Nvidia drivers

sudo apt install nvidia-driver-455

Check your GPU and CUDA version

nvidia-smi

Or you can skip this step if you are installing the older nvidia-450 drivers in step #4 below.

3. Create a virtual environment to contain the tensorflow

pip install virtualenv
cd ~
python3 -m venv tf-env
source tf-env/bin/activate

Replace tf-env with the name of your choice. This creates a directory structure that will contain all the Python packages, so it is best to create it on a drive with plenty of free space, although it is easy to move later.

4. Install CUDA following the recommendations from tensorflow website

Trying to install CUDA independently from the NVIDIA website will break things in all possible ways. I have tried all possible combinations: CUDA 11.1 with TensorFlow nightly, CUDA 10.1 with TensorFlow stable. Something always breaks. The best method is to follow the install instructions on the TensorFlow website to the letter.

https://www.tensorflow.org/install/gpu

The only exception is that I didn’t install the older nvidia-450 drivers. I kept the newer nvidia-455 driver.

5. Make sure all links are working

Make sure there’s a link from cuda to the actual CUDA installation in /usr/local

$ ls -l /usr/local/
lrwxrwxrwx  1 root root 9 Oct  9 17:21 cuda -> cuda-11.1
drwxr-xr-x 14 root root 4096 Oct  9 17:21 cuda-11.1

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

6. Install tensorflow

Start virtualenv if not in it already

$ source tf-env/bin/activate

And then install tensorflow

(tf-env) $ pip install tensorflow

If you already have installed the nightly (unstable) version from #4 above then it is better to uninstall it first with

(tf-env) $ pip uninstall tf-nightly

7. Test tensorflow

(tf-env) $ python
>>> import tensorflow as tf
2020-10-09 18:24:57.371340: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__version__
'2.3.1'
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

All seems to be running OK

8. Setup virtualenv kernel to Jupyter 

While in the virtual environment install ipykernel

(tf-env) $ pip install ipykernel

Add current virtual environment to Jupyter 

(tf-env) $ python -m ipykernel install --user --name=tf-env

tf-env will show up in the list of Jupyter kernels. The name for the Jupyter kernel can be anything. I kept it the same for consistency.

You can find the Jupyter kernels in ~/.local/share/jupyter/kernels

Test tensorflow gpu support in jupyter

(tf-env) $ jupyter notebook

import tensorflow as tf
tf.config.experimental.list_physical_devices()
tf.config.list_physical_devices()
tf.test.gpu_device_name()

Note: The tensorflow GPU detection in Jupyter will only work when Jupyter is run from within the virtual environment. Running Jupyter outside the virtualenv will not work even if the virtualenv kernel (tf-env) is chosen over regular system python kernel.

Plot a grid of plots in python by iterating over the subplots

In this article, we will make a grid of plots in python by iterating over the subplot axes and columns of a pandas dataframe.

Python has a versatile plotting framework in Matplotlib but the documentation seems extremely poor (or I was not able to find the right docs). It took me a fair amount of time to figure out how to send plots of columns of dataframe to individual subplots while rotating the xlabels for each subplot.

Usage

Plotting subplots in Matplotlib begins by using the plt.subplots() statement.

import pandas as pd
import matplotlib.pyplot as plt


fig, axs = plt.subplots(nrows=2, ncols=2)

We can omit the nrows and ncols args, but I kept them for effect. This statement generates a grid of 2×2 subplots and returns the overall figure (the object which contains all the plots) and the individual subplots as a 2×2 NumPy array of Axes. The subplots can be accessed using axs[0,0], axs[0,1], axs[1,0], and axs[1,1]. Or they can be unpacked during the assignment as follows.

import pandas as pd
import matplotlib.pyplot as plt


fig, ((ax1, ax2),(ax3, ax4)) = plt.subplots(nrows=2, ncols=2)

When we have 1 row and 4 columns instead of 2 rows and 2 columns it has to be unpacked as follows.

import pandas as pd
import matplotlib.pyplot as plt


fig, ((ax1, ax2, ax3, ax4)) = plt.subplots(nrows=1, ncols=4)

Flattening the grid of subplots

We, however, do not want to unpack each subplot individually. Instead, we would like to flatten the array of subplots and iterate over it rather than assigning each subplot to a variable. The array is flattened with the flatten() method.

axs.flatten()

We identify 4 columns of the dataframe we want to plot and save the column names in a list that we can iterate over. We then flatten the subplots and zip the flattened axes with the column names.

import pandas as pd
import matplotlib.pyplot as plt


profiles_file = 'data.csv'
df = pd.read_csv(profiles_file)

cols_to_plot = ['age', 'drinking', 'exercise', 'smoking']

fig, axs = plt.subplots(nrows=2, ncols=2)
fig.set_size_inches(20, 10)
fig.subplots_adjust(wspace=0.2)
fig.subplots_adjust(hspace=0.5)

for col, ax in zip(cols_to_plot, axs.flatten()):
    dftemp = df[col].value_counts()
    ax.bar(dftemp.index, list(dftemp))
    ax.set_title(col)
    ax.tick_params(axis='x', labelrotation=30)

plt.show()

As we iterate over each subplot Axes and the column name zipped with it, we draw each subplot with ax.bar(), supplying the x and y values manually. I tried plotting with pandas (df.plot.bar()) and assigning the returned object to the ax; that doesn't work. The x values for ax.bar() are the value_counts index (dftemp.index) and the y values are the counts themselves, converted to a list since I had trouble passing the pd.Series directly.
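
For completeness, pandas plotting functions also accept an ax keyword, so the same loop could be written by handing each subplot Axes to pandas directly (a sketch of the alternative, not tested against the original data):

for col, ax in zip(cols_to_plot, axs.flatten()):
    # value_counts() returns a Series; pandas draws it straight onto the given Axes
    df[col].value_counts().plot.bar(ax=ax)
    ax.set_title(col)
    ax.tick_params(axis='x', labelrotation=30)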

Rotate x-axis of subplots

The x-axis for each subplot is rotated using

ax.tick_params(axis='x', labelrotation=30)

 

Use pandas to convert a date to datetime format

Importing dates from a CSV file is always a hassle. With myriad DateTime formats possible, we would need to write extensive amounts of code to accommodate all possible formats, or put restrictions on the contents of the CSV file. We don't want to do either. Instead of hard-coding commands like

map(lambda s: datetime.strptime(s, "%m/%d/%Y"), date_strings)

into our code, we can use pandas to convert the dates for us. Pandas can convert an entire column of dates in string format to DateTime format. We just need to be careful when importing plain dates rather than full DateTime values: pandas usually converts to DateTime (Timestamp) objects, so if we only want dates, the time component is undesirable and we need to strip it off using .date() at the end. So instead of

pd.to_datetime(date)

we will need to use

pd.to_datetime(date).date()

An example script illustrates this procedure.

# assumes: import datetime as dt, import pandas as pd
def dateformat(self, date):
    # use pandas to convert a date string to datetime format
    # extract just the date since pandas returns the date as a Timestamp object
    # repack the date as a datetime using datetime.datetime.combine() with time = 00:00

    date = dt.datetime.combine(pd.to_datetime(date).date(),
                               dt.datetime.min.time())
    return date

 

Timestamp error

TypeError: an integer is required (got type Timestamp)


This error is common when a date is extracted from a pandas dataframe datetime column or index. We need to extract the date from the Timestamp object by using the .date() method:

startdate
Timestamp('2019-10-28 05:00:00')

startdate.date()
datetime.date(2019, 10, 28)

If we need to convert to a datetime format then we need to combine the date and the time portion like this

startdate = dt.datetime.combine(startdate.date(), startdate.time())


dt.datetime.combine() combines the date and the time part together to form a datetime object.
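
Putting the pieces together, a minimal reproduction of the fix (using the Timestamp value from the example above) looks like this:

import datetime as dt
import pandas as pd

startdate = pd.Timestamp('2019-10-28 05:00:00')

# extract the plain date and time, then rebuild a standard-library datetime
fixed = dt.datetime.combine(startdate.date(), startdate.time())
print(type(fixed))  # <class 'datetime.datetime'>
print(fixed)        # 2019-10-28 05:00:00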

Method overloading in python with no default parameters

Since default parameter values are evaluated at function definition time and not at run time, they cannot be set dynamically (for example, from instance attributes). This also means we cannot overload a function by signature the way we can in C++, such as:

int func()
int func(int c)

However, the workaround is to set the argument parameters to None, which allows the method to be called without any arguments. For example, the method

  def getdata(self, startdate, enddate):

when called as

s = obj.getdata()

Will raise an exception

TypeError: getdata() missing 2 required positional arguments: ‘startdate’ and ‘enddate’

We need to modify the declaration to

  def getdata(self, startdate=None, enddate=None):
       if startdate is None:
           startdate = self.stockdata.index[0]
       if enddate is None:
           enddate = self.stockdata.index[-1]

That defines the method as taking zero, one, or two arguments. When called without arguments, the if statements inject the parameter values dynamically at run time.
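
A self-contained sketch of the pattern (with a made-up stockdata series standing in for real data):

import pandas as pd


class Stock:
    def __init__(self):
        # hypothetical price series indexed by date
        self.stockdata = pd.DataFrame(
            {'price': [10, 11, 12]},
            index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

    def getdata(self, startdate=None, enddate=None):
        # defaults are injected at call time, not at declaration time
        if startdate is None:
            startdate = self.stockdata.index[0]
        if enddate is None:
            enddate = self.stockdata.index[-1]
        return self.stockdata.loc[startdate:enddate]


obj = Stock()
print(obj.getdata())                         # no arguments: full range
print(obj.getdata(startdate='2020-01-02'))   # one argument: from that date onward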

Reference:
Using self.xxxx as a default parameter in a class method