A Comparison of Classifiers for Mammogram Data

In this post, we explore the mammographic mass dataset from the UCI Machine Learning Repository and learn how to predict malignant tumors using different ML classifiers. The goal is to compare and contrast the classifiers and pick the best one. This is a common workflow: perform exploratory analysis on a small dataset (or a small portion of a larger one) to decide which ML algorithm to use before deploying it at scale.

We will be using the mammographic mass dataset from the UCI Machine Learning Repository. The dataset contains 961 instances and 6 attributes, including the target variable. The goal is to predict whether a mass found in a mammogram is benign or malignant, a binary classification task.

There are several classifiers we can use. However, we will be comparing the results of three classifiers: Deep Neural Network (DNN), Random Forest, and XGBoost. We will also be using Stratified K-Fold cross-validation to ensure a fair comparison of the classifiers.

The reason for choosing these classifiers is that they are widely used and have distinct characteristics. The DNN is a deep learning model that can capture complex, non-linear relationships. Random Forest is an ensemble of decision trees known for its stability against outliers, resistance to overfitting, good generalization, and ability to handle categorical features. XGBoost is a gradient boosting algorithm that excels in performance and speed, particularly on large datasets with many categorical levels. However, if it performs no better than Random Forest on a small dataset, there is no need to choose XGBoost: higher speed should not come at the cost of accuracy.

To begin our analysis, we make sure we ingest the data correctly. The dataset ships as a zip file containing the data and a summary file, which is common for ML datasets, so we have to extract and locate the actual data inside the zip. You can skip the next section if you are only interested in the classifiers; for MLEs and data scientists, the data ingestion pipeline is worth a look.

Handling ZIP Files and Data Extraction

import pandas as pd
from io import BytesIO
from zipfile import ZipFile
import requests

# Load and extract data from ZIP file
url = 'https://archive.ics.uci.edu/static/public/161/mammographic+mass.zip'
response = requests.get(url)
with ZipFile(BytesIO(response.content), 'r') as zip_file:
    with zip_file.open('mammographic_masses.data') as file:
        # The .data file has no header row and uses '?' for missing values
        masses_data = pd.read_csv(file, na_values=['?'],
                                  names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])

In the code above, we use the requests library to download the zip file from the UCI Machine Learning Repository and the ZipFile class from the zipfile module to open it. The data file is read directly into a pandas DataFrame with read_csv, supplying the column names and treating '?' as a missing value. The data is now ready for further processing and analysis.

Classifiers at a Glance

We will be testing three distinctive classifiers:

1. Deep Neural Network (DNN)

We will create a DNN model using the Keras API from TensorFlow. The model has a hidden layer of 6 neurons with a ReLU activation function, fed by the 4 input features, and an output layer with 1 neuron and a sigmoid activation function. The model is compiled with the Adam optimizer, the binary crossentropy loss function, and accuracy as the evaluation metric.

Choice of the Activation Function

The first layer has 6 neurons connected to an input of 4 values, which are the 4 features used for prediction (age, shape, margin, and density). Whatever the number of neurons in the first hidden layer, the input dimension must equal the number of features in the dataset.

The 6 neurons need an activation function. ReLU zeroes out negative pre-activations and passes positive values through unchanged. It is a popular choice for deep learning models due to its simplicity and effectiveness: it helps mitigate the vanishing gradient problem and accelerates convergence.

The output layer has 1 neuron connected to the 6 neurons of the previous layer, because we want a single output: a positive (1) or negative (0) classification. Positive means the features suggest a possible malignancy; negative means they suggest a benign mass. The raw output of the neuron is a continuous value, so we need to map it to a probability between 0 and 1. The sigmoid (logistic) function does exactly that, which is why it is the standard choice for the output layer in binary classification.
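
For reference, the sigmoid maps any real-valued pre-activation z to the open interval (0, 1), so the output can be read directly as the predicted probability of malignancy:

\sigma(z) = \frac{1}{1 + e^{-z}}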

Choice of the Loss Function

The loss function measures how well the model is performing, and its choice is crucial because it guides the model during training. The choice depends on the nature of the problem. Binary crossentropy is the standard loss for binary classification: it measures the difference between the predicted probability and the true label. That makes it a natural fit for our mammogram dataset and our DNN model.
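
For a single example with true label y \in \{0, 1\} and predicted probability \hat{y}, the binary crossentropy loss is:

L(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]

Averaging this quantity over a batch gives the value the optimizer minimizes during training.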

Choice of the Optimizer

The Adam optimizer is a popular choice for training deep learning models. It extends stochastic gradient descent by computing adaptive learning rates for each parameter, and it is known for its speed and performance. It is a conventional default that works well for most deep learning models.
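
The string 'adam' uses Keras' default learning rate of 0.001. If finer control is wanted, an optimizer instance can be passed to compile instead; a minimal sketch (the learning rate value here is only illustrative):

from tensorflow.keras.optimizers import Adam

# Equivalent to optimizer='adam', but with the learning rate spelled out
adam_optimizer = Adam(learning_rate=0.001)
# ...then pass it to compile:
# model.compile(optimizer=adam_optimizer, loss='binary_crossentropy', metrics=['accuracy'])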

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Model for Deep Neural Network (DNN)
def create_original_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(metrics=['accuracy'], optimizer='adam', loss='binary_crossentropy')
    return model

The resulting architecture is simply a Dense(6, ReLU) hidden layer feeding a Dense(1, sigmoid) output layer.

2. Random Forest

Random Forest is a powerful ensemble of decision trees recognized for robust predictions. Its strength lies in stability against outliers, effective handling of categorical features, and resilience to overfitting, making it a dependable choice, especially for smaller datasets. Setting the random state ensures reproducibility, and 100 estimators is a reasonable default that balances accuracy against training time.

from sklearn.ensemble import RandomForestClassifier

# Model for Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

3. XGBoost

XGBoost is a gradient boosting algorithm designed for speed and strong predictive performance. Its key hyperparameters here are learning_rate and n_estimators, and setting a random state guarantees reproducibility. XGBoost is a compelling choice, particularly for handling large datasets.

from xgboost import XGBClassifier

# Model for XGBoost
xgb_classifier = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)

When to Use Each Classifier

When the dataset is small, Random Forest is a reliable choice. It’s robust, less prone to overfitting, and can handle categorical features with ease. XGBoost, on the other hand, shines with large datasets, offering superior performance and speed. As for DNN, it’s a versatile choice, particularly for complex, non-linear relationships. Its deep architecture can capture intricate patterns, making it a compelling option for image and text data.

Stratified K-Fold and Imbalanced Datasets

The dataset’s imbalances demand careful handling. We will use Stratified K-Fold, a technique that ensures class distribution integrity in each fold. This guarantees a fair evaluation, particularly crucial when dealing with imbalanced datasets.

from sklearn.model_selection import StratifiedKFold

cv_method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

Cross-Validation Scores and Comparison

from sklearn.model_selection import cross_val_score
from scikeras.wrappers import KerasClassifier
import seaborn as sns
import matplotlib.pyplot as plt

# Wrap the Keras model so it exposes the scikit-learn estimator API
dnn = KerasClassifier(model=create_original_model, epochs=100, verbose=0)

# Run cross-validation for each model (X, y are the scaled features and labels from the loading step)
dnn_scores = cross_val_score(dnn, X, y, cv=cv_method)
rf_scores = cross_val_score(rf_classifier, X, y, cv=cv_method)
xgb_scores = cross_val_score(xgb_classifier, X, y, cv=cv_method)

# Compile results
results = pd.DataFrame({
    'DNN': dnn_scores,
    'RandomForest': rf_scores,
    'XGBoost': xgb_scores
})

# Plot results using seaborn boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=results)
plt.title('Cross-Validation Scores Comparison')
plt.ylabel('Accuracy')
plt.show()

And we get a boxplot that visually compares the cross-validation scores of our classifiers, illuminating the strengths and variations in their performance. The boxplot was chosen to highlight the spread of the CV scores as well as their central tendency (the box shows the median and interquartile range).
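
If a numeric summary is wanted alongside the plot, pandas can aggregate the scores directly; a small addition, not part of the original script:

# Per-classifier summary of the cross-validation scores
print(results.agg(['mean', 'median', 'std']))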

Conclusion

In this post, we analyzed mammogram data and compared three classification algorithms on a simple classification task. We also explored the Deep Neural Network (DNN) in detail, emphasizing the architectural considerations relevant to this task.

Happy coding and exploring the realms of machine learning!

The full code

import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from scikeras.wrappers import KerasClassifier
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO


def load_data():
    print("Downloading and extracting the ZIP file...")
    # Download and extract the ZIP file
    zip_url = 'https://archive.ics.uci.edu/static/public/161/mammographic+mass.zip'
    response = urlopen(zip_url)
    zip_data = BytesIO(response.read())

    with ZipFile(zip_data, 'r') as zip_ref:
        # Locate the .data file inside the ZIP (it also contains a .names summary file)
        csv_file_name = [name for name in zip_ref.namelist() if name.endswith('.data')][0]
        print(f"Extracting data from {csv_file_name}...")
        with zip_ref.open(csv_file_name) as file:
            # Read the CSV file into a DataFrame
            masses_data = pd.read_csv(file, na_values=['?'],
                                      names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])

    print("ETL: Dropping missing values...")
    # ETL: drop rows with missing values
    masses_data = masses_data.dropna()

    all_features = masses_data[['age', 'shape', 'margin', 'density']].values
    all_classes = masses_data['severity'].values

    # Scaling features
    scaler = StandardScaler()
    all_features_scaled = scaler.fit_transform(all_features)
    return all_features_scaled, all_classes


# Model for Neural Network (DNN)
def create_original_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(metrics=['accuracy'], optimizer='adam', loss='binary_crossentropy')
    return model


# Function to run cross-validation and return scores
def run_cv(model, X, y, cv_method):
    print(f"Running {model.__class__.__name__} model...")
    return cross_val_score(model, X, y, cv=cv_method)


# Main method to compile and plot results
def run():
    X, y = load_data()

    # Model for Random Forest
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Model for XGBoost
    xgb_classifier = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)

    # Model for a Deep Neural Network
    dnn = KerasClassifier(model=create_original_model, epochs=100, verbose=0)

    cv_method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    # Run cross-validation for each model
    dnn_scores = run_cv(dnn, X, y, cv_method)
    rf_scores = run_cv(rf_classifier, X, y, cv_method)
    xgb_scores = run_cv(xgb_classifier, X, y, cv_method)

    # Compile results
    results = pd.DataFrame({
        'DNN': dnn_scores,
        'RandomForest': rf_scores,
        'XGBoost': xgb_scores
    })

    # Plot results using seaborn boxplot
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=results)
    plt.title('Cross-Validation Scores Comparison')
    plt.ylabel('Accuracy')
    # Save the plot before displaying it (show() clears the current figure)
    plt.savefig('cross_val_scores.png')
    plt.show()


# Run the main method
run()

Case Study: Transforming Higher Ed with Predictive Analytics

Statement of Problem

In the dynamic realm of higher education, many colleges grapple with a common hurdle—the soaring dropout rates typically observed within the initial two semesters of student enrollment. This not only translates to a direct loss of revenue but also triggers additional costs related to marketing endeavors and sales efforts, often yielding meager returns.

One such college, choosing to remain anonymous, boasted an excellent sales team with an impressive throughput of almost 20%. However, the institution was grappling with significant enrollment losses for diverse reasons. Determined to pinpoint the root causes and intervene strategically, the college aimed to minimize its enrollment dropouts.

This anonymous college recognized that while its sales team demonstrated commendable performance, a substantial portion of enrollments was slipping away. The institution sought to delve into the heart of the matter, identifying key issues and implementing targeted interventions to curtail enrollment dropouts effectively. The focus was on leveraging available data during enrollment, encompassing essential demographic details, academic performance metrics, and financial aid information. The overarching goal was to deploy predictive analytics, facilitating timely and personalized support to mitigate academic and financial challenges, and optimizing resource allocation for sustainable student retention.

Exploration and Decision Trees

The initial phase involved an exploration of predictive models, and decision trees emerged as a valuable tool. These trees effectively identified key metrics and features contributing to the risk of student dropout. This preliminary analysis was pivotal and laid the groundwork for the subsequent stages of the project.

Planning, Budgeting, and Resource Allocation

With key decision-makers, including the CEO, on board, the project moved into a comprehensive planning phase. Decision trees’ success in identifying critical risk factors facilitated a smooth transition. A detailed budget was crafted, aiming to utilize 80% of the initial budget while accommodating potential cost overruns. Resource allocation became a strategic task, with the data scientist diligently identifying the necessary resources to ensure project completion within the allocated budget.

Project Initiation and Model Development

The final model was trained on enrollment data, encompassing demographic information, credit history, and a host of relevant features. Manual data cleaning and exploratory data analysis were employed initially and later automated through an ETL pipeline. Random Forest was chosen over single decision trees to improve accuracy and provide a more robust, lower-variance model. The pipeline was built on open-source algorithms, which kept costs under control. Notably, the landscape has evolved, and processes such as ETL and exploratory data analysis can now be automated, reducing costs significantly.

Model Efficacy and Cost Analysis

The developed predictive model showcased an impressive 82% accuracy, successfully reducing the dropout rate to 8% within the identified segments during the trial run. With an average total enrollment of 823 students for a full year, the initial 18% dropout rate saw a substantial reduction to 11%, indicating a remarkable 39% improvement in student retention.

Considering an acquisition cost of $5,000 per student and a calculated lifetime value of $40,000 for each student (derived from the total tuition of $80,000 for the 2-year accelerated program), the total savings came to roughly $2.63 million (approximately 823 students × 8% × $40,000). Against an initial deployment cost of around $300,000, this represents a remarkable 8.81x return.

This case study underscores the transformative impact of predictive analytics in enhancing student retention, achieving substantial cost savings, and demonstrating the continuous evolution of data science methodologies.

For more details on our approach and methodologies, please contact our team.

Comparing Pandas Vector Operations to Multithreading

In this analysis, we compare the performance of Pandas vector operations against multithreading for Natural Language Processing (NLP) preprocessing tasks.

# Import necessary libraries and modules
import time
import spacy
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import utils.data_utils  # Assuming you have a module for loading data

# Load Spacy model
nlp = spacy.load("en_core_web_sm")

# Define NLP preprocessing function
def ner_preprocessing(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

# Function for parallel processing using ThreadPoolExecutor
def process_transaction_data_parallel(df):
    with ThreadPoolExecutor() as executor:
        processed_data = list(executor.map(ner_preprocessing, df['Description']))
    return processed_data

def main():
    # Load data
    df = utils.data_utils.load_data()

    # Time Pandas vector operation
    start = time.time()
    df['ner_preprocessing_map'] = df['Description'].apply(ner_preprocessing)
    print(f"Time taken for Pandas vector operation: {time.time() - start}")

    # Time multithreading with ThreadPoolExecutor
    start = time.time()
    processed_data = process_transaction_data_parallel(df)
    df['ner_preprocessing_threaded'] = processed_data
    print(f"Time taken for multithreading: {time.time() - start}")

    # Print DataFrame
    print(df.head())

if __name__ == "__main__":
    main()

The NER preprocessing function extracts entities from the ‘Description’ column of a Pandas DataFrame. The comparison pits Pandas’ apply method, which runs the function row by row over the column, against a ThreadPoolExecutor that runs the same function across multiple threads.

Results:

  • Pandas vector operation: 38.60 seconds
  • Multithreading: 62.06 seconds

It’s evident that the Pandas vector operation outperforms multithreading in this scenario.

Additionally, the multithreading approach resulted in issues related to the order of direct assignment. The ‘ner_preprocessing_threaded’ column shows discrepancies in the order of entities assigned to each row. This could potentially lead to data integrity issues.

It’s crucial to consider the nature of the task and the size of the dataset when choosing between Pandas vector operations and multithreading. spaCy’s NER is CPU-bound, and Python threads contend for the Global Interpreter Lock, which likely explains why multithreading is slower here. While multithreading can provide parallelization benefits for I/O-bound workloads, it is not always the most efficient solution, as seen in this NLP preprocessing example.

Decoding Textual Secrets: Navigating NLP Pitfalls with Precision and SpaCy Brilliance!

In the ever-evolving landscape of Natural Language Processing (NLP), text preprocessing is a critical step to transform raw text into a format suitable for analysis. However, this seemingly routine task can become a double-edged sword, where oversimplified preprocessing might inadvertently discard valuable information. In this blog post, we’ll explore the journey through text preprocessing, learning from the pitfalls and incorporating improved techniques for optimal results.

Understanding the Problem: Oversimplified Preprocessing

The Challenge

Consider a dataset with diverse descriptions, including grocery store details and financial transactions. Traditional preprocessing methods might oversimplify the data: in the example below, ‘99-CENTS-ONLY #0133’ is reduced to ‘centsonly’, with the digits stripped out and the remaining words fused together. The challenge lies in retaining contextually relevant information while ensuring cleanliness.

# Previous oversimplified preprocessing
import re
import string

text = '99-CENTS-ONLY #0133'
text = text.lower()
text = re.sub(r'\d+', '', text)                                   # strip all digits
text = text.translate(str.maketrans("", "", string.punctuation))  # strip all punctuation
print(text)
# Output: 'centsonly '

Improved Preprocessing Techniques

1. Customized Preprocessing for Specific Domains

Recognizing the domain-specific nature of the data, we can tailor preprocessing steps. In this example, we selectively remove noise while retaining domain-specific terms and identifiers.

def custom_preprocessing(text):
    text = text.lower()
    text = re.sub(r'\b\d+\b', '', text)  # Remove standalone numbers
    text = text.replace('-', ' ')  # Preserve hyphenated terms
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

2. Named Entity Recognition (NER)

Incorporate Named Entity Recognition (NER) to identify and preserve specific entities in the text. This helps maintain critical information, such as recognizing ’99 cents’ as a monetary value associated with a store.

import spacy

# Load the spaCy model for NER
nlp = spacy.load('en_core_web_sm')

def ner_preprocessing(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

Applying Improved Techniques to the Dataset

Now, let’s apply these improved techniques to the dataset containing diverse descriptions:

import pandas as pd

data = ['99-CENTS-ONLY #0133',
        'ESC DISB - PMI INS',
        'Vanguard Total Bond Market ETF Market Buy']

df = pd.DataFrame({'description': data})
df['custom_preprocessing'] = df['description'].apply(custom_preprocessing)
df['ner_preprocessing'] = df['description'].apply(ner_preprocessing)

print(df)

Insights: Balancing Cleanliness and Context

As we inspect the processed data, a clear evolution emerges. The improved preprocessing techniques now strike a balance between cleaning data and retaining relevant information:

Description                               | Customized Preprocessing                      | NER Preprocessing
99-CENTS-ONLY #0133                       | ['99 cents']                                  | ['99-CENTS-ONLY #0133']
ESC DISB - PMI INS                        | []                                            | []
Vanguard Total Bond Market ETF Market Buy | ['vanguard total bond market etf market buy'] | ['Vanguard Total Bond Market']

Table: Output of preprocessing pipeline

Clearly, Named Entity Recognition is superior at preserving useful information. We still have to weigh our options when choosing between the two techniques, and this is where humans come in: not everything can be automated.

Conclusion: A Refined Approach to Text Preprocessing

In navigating the challenges of text preprocessing, it’s evident that a refined approach is essential. By customizing preprocessing steps, incorporating advanced techniques like Named Entity Recognition (NER), and adopting hybrid approaches, we enhance the accuracy and contextual understanding of NLP models. It’s not just about cleaning; it’s about preserving valuable information, ensuring that our models are equipped to handle the intricacies of real-world text data. As you embark on your NLP endeavors, embrace the power of thoughtful preprocessing for a richer and more meaningful analysis. Happy preprocessing!

Heteroscedasticity Unmasked: Taming Increasing Variance with Transformations!

  1. What is Heteroscedasticity? Heteroscedasticity refers to the uneven spread or varying levels of dispersion in data points, meaning that the variability of data isn’t consistent across the range.
  2. Why is it Bad for Modeling and Prediction? Heteroscedasticity can wreak havoc on modeling and prediction because it violates the assumption of constant variance in many statistical techniques, leading to biased results and unreliable predictions.
  3. How to Handle It? To tackle heteroscedasticity, one effective approach is data transformation. This involves altering the data using mathematical functions to stabilize the variance and make your models more robust.

The Data: Let’s start with some code and synthetic data for house prices, where the variability increases as house sizes grow. This mirrors real-world scenarios where larger properties often exhibit more price variability.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial import Polynomial

# Step 1: Generate Synthetic Data with Heteroscedasticity
np.random.seed(0)  # For reproducibility
house_sizes = np.linspace(1000, 5000, 100)  # House sizes
true_prices = 50000 + 30 * house_sizes  # True house prices

# Introduce heteroscedasticity - variance increases with house size
error_term = np.random.normal(np.zeros(100), 7 * house_sizes)
noisy_prices = true_prices + error_term

# Split error_term into positive and negative components
positive_errors = np.where(error_term > 0, error_term, 0)
negative_errors = np.where(error_term < 0, error_term, 0)

Step 2: Now, let’s fit straight lines that bound the positive and negative errors, which illustrate the increasing variance; a plotting sketch follows the code below.

# Fit polynomials to the positive and negative errors
positive_poly = Polynomial.fit(house_sizes, positive_errors, deg=1)
negative_poly = Polynomial.fit(house_sizes, negative_errors, deg=1)

# Calculate values for the bounding lines
positive_bounds = positive_poly(house_sizes)
negative_bounds = negative_poly(house_sizes)
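
The original post does not include the plotting code for this step; a minimal sketch using the arrays defined above:

# Plot the noisy prices together with the fitted bounding lines
plt.figure(figsize=(10, 6))
plt.scatter(house_sizes, noisy_prices, s=10, alpha=0.5, label='Noisy prices')
plt.plot(house_sizes, true_prices + positive_bounds, 'r--', label='Upper bound')
plt.plot(house_sizes, true_prices + negative_bounds, 'r--', label='Lower bound')
plt.xlabel('House size (sq ft)')
plt.ylabel('Price')
plt.title('Increasing variance in house prices')
plt.legend()
plt.show()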

Step 3: To stabilize the data, we’ll apply a logarithmic transformation to the noisy prices. This transformation makes the data more suitable for analysis.

# Step 3: Apply a Logarithmic Transformation
# Build a DataFrame from the synthetic series first (assumed here; the original snippet uses `data` without defining it)
data = pd.DataFrame({'Size': house_sizes, 'Price': noisy_prices})
data['Log_Price'] = np.log(data['Price'])

Result: The transformation effectively reduces the increasing variance. We can visualize the change through comparison plots, such as the sketch below, showing the data’s improved stability.
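
A side-by-side comparison can be sketched as follows, assuming the data DataFrame built in Step 3:

# Compare the raw and log-transformed prices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(data['Size'], data['Price'], s=10, alpha=0.5)
axes[0].set_title('Raw prices (variance grows with size)')
axes[1].scatter(data['Size'], data['Log_Price'], s=10, alpha=0.5)
axes[1].set_title('Log-transformed prices (variance stabilized)')
for ax in axes:
    ax.set_xlabel('House size (sq ft)')
plt.tight_layout()
plt.show()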

Conclusion: Addressing increasing variance with transformations is a valuable tool in data analysis. Whether you’re dealing with house prices or any other dataset, understanding and mitigating heteroscedasticity enhances the reliability of your analysis and decision-making.

The Importance of the Else Clause in Exception Handling: A Practical Use Case

Introduction

Exception handling is a crucial aspect of writing robust and reliable code. It allows programmers to gracefully handle unexpected errors and ensure that the program can recover from unforeseen situations. One often overlooked aspect of exception handling is the else clause, which can provide elegant solutions in scenarios where different code paths are needed based on whether an exception occurred. In this blog post, we’ll explore a real-world example where the else clause proved to be a game-changer and helped streamline a data handling process.

The Problem

Consider a scenario where data needs to be fetched, processed, and then saved to a MySQL database. However, there might be instances when the database insertion fails due to network issues, permissions, or other reasons. In such cases, it’s crucial to have a backup plan to prevent data loss. One initial approach could be to stash the data in a CSV file every time there’s an exception during database insertion. However, this approach might lead to unexpected results and unnecessary stashing of data.

The Issue with the Original Approach

Here’s the code that was initially implemented to handle the scenario:

try:
    # Save data to MySQL database
    database.save_data(df, verbose=verbose)
    if os.path.exists('stashed_data.csv'):
        # Delete stashed data
        if verbose:
            print('main: Deleting stashed data CSV file.')
        os.remove('stashed_data.csv')
except Exception as e:
    print('main: Error saving data to the database.')
    print(e)
    if verbose:
        print('main: Stashing data to CSV file.')
    # Save data to CSV file
    df.to_csv('stashed_data.csv', index=False)

The issue with this approach is that the success-path cleanup (checking for and deleting the stashed CSV file) sits inside the try block alongside the database write. If the cleanup itself raises an exception, the except block runs, prints a misleading database error, and stashes the DataFrame even though the insertion actually succeeded. Mixing the two operations in one block makes it hard to tell which one failed, which can lead to unnecessary stashing and complications during troubleshooting.

The Solution

To address the issue and ensure more controlled data handling, the else clause can be utilized. By structuring the exception handling with an else clause, the previously stashed CSV file is deleted only when the database insertion succeeds, and the data is stashed only when the insertion raises an exception. Since the else block never runs on failure, a problem in the cleanup code can no longer be mistaken for a failed insertion.

The Revised Code:

try:
    # Save data to MySQL database
    database.save_data(df, verbose=verbose)
except Exception as e:
    print('main: Error saving data to the database.')
    print(e)
    if verbose:
        print('main: Stashing data to CSV file.')
    # Save data to CSV file
    df.to_csv('stashed_data.csv', index=False)
else:
    if os.path.exists('stashed_data.csv'):
        # Delete stashed data
        if verbose:
            print('main: Deleting stashed data CSV file.')
        os.remove('stashed_data.csv')

Conclusion

In software development, making the right decisions in exception handling can greatly impact the reliability and maintainability of your code. The else clause provides an elegant way to differentiate between code paths where an exception occurred and where it didn’t. By utilizing the else clause, as demonstrated in our practical use case, you can avoid unnecessary data stashing and ensure a more efficient and robust data handling process. Remember, sometimes, less is more, and a single keyword like else can make a world of difference in the outcome of your code.

Automating MP3 ID3 Tag Updates with Python

Introduction:

Managing and organizing a music collection often involves keeping track of the artist and song information associated with each MP3 file. Manually updating ID3 tags can be a time-consuming task, especially when dealing with a large number of files. However, with the power of Python and the Mutagen library, we can automate this process and save valuable time. In this article, we will explore a Python script that updates ID3 tags for MP3 files based on the filename structure.

Prerequisites:

To follow along with this tutorial, make sure you have Python and the Mutagen library installed on your system. You can install Mutagen by running pip install mutagen in your terminal.

Understanding the Code:

The Python script we’ll be using leverages the Mutagen library to handle ID3 tag manipulation. The core logic resides in the update_id3_tags function, which updates the ID3 tags of an MP3 file based on the filename structure.

The script accepts command-line arguments using the argparse module, allowing you to specify the folder containing your MP3 files, along with options to ignore files with existing ID3 tags and print verbose output. This provides flexibility and customization to suit your specific requirements.

The getargs function parses the command-line arguments and returns the parsed arguments as an object. The folder_path, ignore_existing, and verbose variables are then extracted from the parsed arguments.

The script retrieves a list of MP3 files in the specified folder and iterates over each file. For each file, the update_id3_tags function is called. It extracts the artist and song name from the filename using the specified structure. The ID3 tags are then updated with the extracted information using the Mutagen library.

Code:

#!/usr/bin/env python
import os
import argparse
from mutagen.id3 import ID3, TIT2, TPE1

def update_id3_tags(filename, ignore_existing, verbose):
    # Extract artist and song name from filename
    basename = os.path.basename(filename)
    print(f"processing --[{basename}]--")
    # Expect filenames of the form "artist - song.mp3"
    if " - " in basename:
        artist = basename[:-4].split(" - ")[0].strip()
        song = " - ".join(basename[:-4].split(" - ")[1:]).strip()
    else:
        print("Cannot split file not in format [artist] - [song].mp3")
        return -1

    # Load the ID3 tags from the file
    audio = ID3(filename)

    # Check if ID3 tags already exist
    if not ignore_existing or not audio.tags:
        # Update the TIT2 (song title) and TPE1 (artist) tags if they are empty
        if not audio.get("TIT2"):
            audio["TIT2"] = TIT2(encoding=3, text=song)
            if verbose:
                print(f"Updated TIT2 tag for file: {filename} with value: {song}")
        elif verbose:
            print(f"Skipping existing ID3 tag for title: {audio.get('TIT2')}")

        if not audio.get("TPE1"):
            audio["TPE1"] = TPE1(encoding=3, text=artist)
            if verbose:
                print(f"Updated TPE1 tag for file: {filename} with value: {artist}")
        elif verbose:
            print(f"Skipping existing ID3 tag for track: {audio.get('TPE1')}")           
    print('-'*10)

    # Save the updated ID3 tags back to the file
    audio.save()    


def getargs():
    # parse command-line arguments using argparse()
    parser = argparse.ArgumentParser(description='Update ID3 tags for MP3 files.')
    parser.add_argument("folder", nargs='?', default='.', help="Folder containing MP3 files (default: current directory)")
    parser.add_argument('-i', "--ignore", action="store_true", help="Ignore files with existing ID3 tags")
    parser.add_argument('-v', "--verbose", action="store_true", help="Print verbose output")
    return parser.parse_args()


if __name__ == '__main__':
    args = getargs()
    folder_path = args.folder
    ignore_existing = args.ignore
    verbose = args.verbose

    # Get a list of MP3 files in the folder
    mp3_files = [file for file in os.listdir(folder_path) if file.endswith(".mp3")]

    # Process each MP3 file
    for mp3_file in mp3_files:
        mp3_path = os.path.join(folder_path, mp3_file)
        update_id3_tags(mp3_path, ignore_existing, verbose)

Example:

Let’s assume you have a folder called “Music” that contains several MP3 files with filenames in the format “artist - song.mp3”. We want to update the ID3 tags for these files based on the filename structure.

Here’s how you can use the Python script:

python script.py Music --ignore --verbose

In this example, we’re running the script with the following arguments:

  • Music: The folder containing the MP3 files. Replace this with the actual path to your folder.
  • --ignore: This flag tells the script to ignore files that already have existing ID3 tags.
  • --verbose: This flag enables verbose output, providing details about the files being processed and the updates made.

By running the script with these arguments, it will update the ID3 tags for the MP3 files in the “Music” folder, ignoring files that already have existing ID3 tags, and provide verbose output to the console.

Once the script finishes running, you can check the updated ID3 tags using any media player or music library software that displays the ID3 tag information.

This example demonstrates how the Python script automates the process of updating MP3 ID3 tags based on the filename structure, making it convenient and efficient to manage your music collection.

Conclusion:

Automating the process of updating MP3 ID3 tags can save you valuable time and effort. With the Python script we’ve discussed in this article, you can easily update the ID3 tags of your MP3 files based on the filename structure. The flexibility offered by command-line arguments allows you to tailor the script to your specific needs. Give it a try and simplify your music collection management!

Level up your Python game: Unleashing Pyenv in Ubuntu!

If you’re a Python developer who finds themselves juggling multiple Python versions on an Ubuntu system, you may have noticed that the deadsnakes ppa doesn’t provide install candidates for kinetic (22.10). This can be a frustrating roadblock, especially if you’re trying to work with Tensorflow extended (TFX), which doesn’t cooperate smoothly with Python 3.10.

Fortunately, there’s a solution: pyenv. Pyenv is a remarkable tool that simplifies the installation, management, and switching between different Python versions effortlessly. In this blog post, we’ll walk you through the process of installing pyenv on Ubuntu, empowering you to effortlessly manage multiple Python versions on your system.

Step 1: Update System Packages

Before installing pyenv, we need to update the system packages to their latest version. Run the following command to update the system packages:

sudo apt update && sudo apt upgrade -y

Step 2: Install Dependencies

pyenv requires some dependencies to be installed on your system. Run the following command to install the dependencies:

sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev \
libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

Step 3: Install pyenv

Once the dependencies are installed, we can proceed with the installation of pyenv. Run the following command to install pyenv [1]:

curl https://pyenv.run | bash

This command will download and install the latest version of pyenv on your system. Once the installation is complete, add the following lines to your ~/.bashrc file to set up pyenv [2]:

export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Then, run the following command to reload your ~/.bashrc file:

source ~/.bashrc

Depending on your setup, you may have one or more of the following files: ~/.profile, ~/.bash_profile or ~/.bash_login. If any of these files exist, it is recommended to add the commands there. However, if none of these files are present, you can simply add the commands to ~/.profile for seamless integration.

Step 4: Verify Installation

To verify that pyenv is installed correctly, run the following command:

pyenv --version

This command should output the version number of pyenv installed on your system.

Step 5: Install Python

Now that pyenv is installed, you can use it to install any Python version you need. To install Python 3.9.0, for example, run the following command:

pyenv install 3.9.0

This will download and install Python 3.9.0 on your system. Once the installation is complete, you can set this version of Python as the default by running the following command:

pyenv global 3.9.0

Step 6: Set Up a Virtual Environment

Now that you have installed pyenv, you can use it to create and manage virtual environments for your Python projects. To create a virtual environment for your project, run the following command:

pyenv virtualenv 3.9.0 tfx_venv

This command will create a new virtual environment named tfx_venv, based on Python 3.9.0. The virtual environment will be stored at $HOME/.pyenv/versions/3.9.0/tfx_venv .

Pyenv’s standout feature lies in its ability to set a specific environment per folder. Imagine a scenario where you require Python 3.9 for your trading-gcp folder, while the rest of your projects call for Python 3.10.

├── courses
│   ├── coursera
│   │   ├── trading-gcp
│   │   └── udemy - Algorithmic Trading with Machine Learning in Python

With pyenv, achieving this is a breeze. Simply navigate to the trading-gcp folder and run the command

pyenv local tfx_venv

Pyenv will automatically configure the environment to match your folder’s requirements. This is accomplished by creating a .python-version file within the folder, which references the tfx_venv environment.

(tfx_venv) trading-gcp$ cat .python-version 
tfx_venv

If we want to reset the Python environment associated with this folder, we simply remove this file or run another pyenv local command.

Managing virtual environments becomes hassle-free with pyenv. You no longer need to worry about manual activation or deactivation. As you navigate to different folders, pyenv seamlessly adjusts the Python environment to suit the code contained within. When you leave the folder, the environment switches back automatically.

Conclusion

In this tutorial, we have gone through the steps to install pyenv on Ubuntu. pyenv is a useful tool that can help you manage multiple Python versions on your system with ease. With pyenv, you can switch between different Python versions and install the required packages for each version without any conflicts.

References

[1] pyenv installer, https://github.com/pyenv/pyenv-installer
[2] pyenv: Simple Python Version Management. https://github.com/pyenv/pyenv
[3] “Introduction to pyenv” by Real Python. https://realpython.com/intro-to-pyenv

Fitting noisy data with a spline to see trends

Are you struggling with visualizing or analyzing data that is noisy or contains irregularities? Do you need to identify trends that may not be immediately apparent to the naked eye or require an analytical approach? One way to address these challenges is by using a spline, a mathematical function that smoothly connects data points. Splines can be used to “smooth out” data, making it easier to analyze and visualize.

In this blog post, we will explore the use of the UnivariateSpline function in Python’s scipy.interpolate library to fit a spline to noisy data. Spline interpolation is a method of interpolating data points by fitting a piecewise-defined polynomial function to the data.

In Python, the scipy.interpolate module provides several functions for fitting splines to data. One of these functions is UnivariateSpline, which fits a spline to a set of one-dimensional data points.

Let’s take a closer look at how UnivariateSpline works and how it can be used to smooth out noisy data in Python. We’ll use an example code snippet to illustrate the process.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 100)
y1 = 0.2*x + 0.5*np.sin(x) + 2*np.random.normal(0, 0.1, size=100)
xs = np.linspace(np.min(x), np.max(x), 1000)

In this code, we first import the necessary modules: numpy, matplotlib.pyplot, and scipy.interpolate. We then create a set of x values using numpy.linspace(), which generates a linearly spaced array of 100 values between 0 and 10, and generate noisy y values by adding a sine component and Gaussian noise to a linear trend.

Next, we define the range for plotting by creating a new array, xs, with 1000 evenly spaced points between the minimum and maximum x values.

Now, we use the UnivariateSpline function to fit a spline to the noisy data:

spl = UnivariateSpline(x, y1)

We then plot the original data points and the fitted spline with all default parameters:

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=AUTO)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

The resulting plot shows that the spline fits the noisy data reasonably well, but there is some room for improvement.

Next, we try fitting the spline with a different degree, k, and a smoothing factor, s:

spl = UnivariateSpline(x, y1, k=4)
spl.set_smoothing_factor(20)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=4, smooth=20)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

This time, we use a degree of 4 for the spline and a smoothing factor of 20. The resulting plot shows that the spline fits the data even better than before. Finally, we try fitting the spline with a different smoothing factor:

spl = UnivariateSpline(x, y1)
spl.set_smoothing_factor(1)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=1)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

This time, we set the smoothing factor to 1. The resulting plot shows that the spline now fits the data too closely, and has likely overfit the data.

By default, UnivariateSpline picks a smoothing factor based on the number of data points (s defaults to the number of weighted points), which generally produces a smoothed curve. Setting the smoothing factor very low, such as s=0 or s=1, forces the curve to follow the original data points closely, even if they are noisy or irregular. This leads to overfitting, where the curve follows the noise rather than the underlying trend in the data.
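
To make the distinction concrete, an interpolating spline can be requested explicitly by forcing s=0; a short sketch, not part of the original example:

# s=0 forces the spline through every data point, which overfits noisy data
spl_interp = UnivariateSpline(x, y1, s=0)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl_interp(xs), 'r', lw=1)
plt.title("Interpolating spline (s=0) follows the noise")
plt.show()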

In conclusion, the UnivariateSpline function in Python’s scipy.interpolate library is a powerful tool for fitting a spline to noisy data. By adjusting the degree of the spline and the smoothing factor, we can achieve a good balance between fitting the data closely and avoiding overfitting. This method can also be used, for example, to draw envelopes around Monte Carlo chains or to approximate a Snell envelope. The possibilities are endless.

Full code

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 100)
y1 = 0.2*x + 0.5*np.sin(x) + 2*np.random.normal(0, 0.1, size=100)
xs = np.linspace(np.min(x), np.max(x), 1000)

spl = UnivariateSpline(x, y1)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=AUTO)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

spl = UnivariateSpline(x, y1, k=4)
spl.set_smoothing_factor(20)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=4, smooth=20)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

spl = UnivariateSpline(x, y1)
spl.set_smoothing_factor(1)

fig, ax = plt.subplots()
ax.plot(x, y1, 'k.', alpha=0.5)
plt.plot(xs, spl(xs), 'b', lw=1)
plt.title("Fitting a spline to noisy data (k=AUTO, smooth=1)")
plt.xlabel('time')
plt.ylabel('Generated Stock prices')
plt.show()

Predicting Stock Price Volatility with GARCH Model (Part 1)

In time series analysis, it is essential to model the volatility of a stock. One way to achieve this is through the use of the EGARCH (Exponential Generalized Autoregressive Conditional Heteroskedasticity) model. In this article, we will perform an analysis of the MSFT stock price using EGARCH to model its volatility.

GARCH Model and When to Use It

GARCH (Generalized Autoregressive Conditional Heteroskedasticity) is a statistical model used to analyze financial time series data. It is a type of ARCH (Autoregressive Conditional Heteroskedasticity) model that takes into account the volatility clustering often observed in financial data. The GARCH model assumes that the variance of the error term in a time series is a function of both past error terms and past variances.

The GARCH model is commonly used in finance to model and forecast the volatility of asset returns. In particular, it is useful for predicting the likelihood of extreme events, such as a sudden stock market crash or a sharp increase in volatility.

When deciding whether to use a GARCH model, it is important to consider the characteristics of the financial time series data being analyzed. If the data exhibits volatility clustering or other patterns of heteroskedasticity, a GARCH model may be appropriate. Additionally, GARCH models are often used when the goal is to forecast future volatility or to estimate the risk associated with an investment. The GARCH(p,q) model can be represented by the following equation:

\begin{aligned} r_t &= \mu_t + \epsilon_t \\ \epsilon_t &= \sigma_t z_t \\ \sigma_t^2 &= \omega + \sum_{i=1}^p \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^q \beta_j \sigma_{t-j}^2 \end{aligned}

where r_t is the log return at time t, \mu_t is the conditional mean at time t, \epsilon_t is the residual (innovation) at time t, \sigma_t is the conditional standard deviation at time t, z_t is a standard normal random variable, \omega is the constant, \alpha_i are the ARCH coefficients on the lagged squared residuals, \beta_j are the GARCH coefficients on the lagged variances, and p and q are the numbers of ARCH and GARCH terms, respectively.

In a GARCH(p,q) model, the dependence on the error term and the volatility term at the same time reflects the notion of volatility clustering, which is a characteristic of financial time series data. The error term represents the current shock or innovation to the return series, while the volatility term captures the past history of the shocks. The dependence on the error term and the volatility term at the same time implies that the model recognizes that a current shock to the return series can have a persistent effect on future volatility. In other words, large shocks tend to be followed by large subsequent changes in volatility, and vice versa. This feature of GARCH models has important implications for risk management and financial decision-making. By accounting for the clustering of volatility, GARCH models can provide more accurate estimates of risk measures, such as Value-at-Risk (VaR) and Expected Shortfall (ES), which are used to assess the potential losses in financial portfolios. GARCH models can also be used to forecast future volatility, which can be useful for developing trading strategies and hedging positions in financial markets. We will explore these concepts in the future parts of this ongoing series.

One specific form of the GARCH model is the EGARCH model, which stands for Exponential Generalized Autoregressive Conditional Heteroskedasticity. The EGARCH model allows for both asymmetry and leverage effects in the volatility of the data. The EGARCH model can be represented by the following equation:

\begin{aligned} r_t &= \mu_t + \epsilon_t \\ \epsilon_t &= \sigma_t z_t \\ \log(\sigma_t^2) &= \omega + \sum_{i=1}^p \alpha_i \left( \frac{\left| \epsilon_{t-i} \right|}{\sigma_{t-i}} - \sqrt{\frac{2}{\pi}} \right) + \sum_{i=1}^q \beta_i \log \sigma_{t-i}^2 \end{aligned}

where r_t is the log return at time t, \mu_t is the conditional mean at time t, \epsilon_t is the residual (innovation) at time t, \sigma_t is the conditional standard deviation at time t, z_t is a standard normal random variable, \omega is the constant, \alpha_i are the coefficients on the standardized absolute residuals, \beta_i are the coefficients on the lagged log-variances, and p and q are the numbers of those terms, respectively.

Exploratory Data Analysis

Before modeling, it is essential to explore the data to understand its characteristics. The plot below shows the time series plot of the MSFT stock price.

import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import statsmodels.api as sm
import pmdarima as pm
import yfinance as yf
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

msft = yf.Ticker('MSFT')
df = msft.history(period='5y')

sns.light_palette("seagreen", as_cmap=True)
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
sns.lineplot(df['Close'])
plt.title('MSFT')

We can see that the stock price exhibits a clear upward trend over the period, with some fluctuations. The plots below show the ACF and PACF of the log returns (the differenced log of the closing price).

import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Log returns (assumed definition; the original post uses msft_log without showing it)
msft_log = np.log(df['Close']).diff().dropna()

# ACF and PACF plots of the log returns
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(msft_log, ax=axes[0])
plot_pacf(msft_log, ax=axes[1])
plt.tight_layout()
plt.show()

From the ACF and PACF plots, we can observe that there is no clear pattern in the data, indicating that it may be a white noise process. However, there is some significant autocorrelation at lag 1 in the PACF plot, suggesting that we may need to include an AR term in our model.

Model Selection

To model the volatility of the MSFT stock price, we will use the EGARCH model. We will begin by fitting a baseline EGARCH(1,1) model and compare it with other models.

from arch import arch_model

# Fit EGARCH(1,1) model
egarch11_model = arch_model(msft_log, vol='EGARCH',
                               p=1, o=0, q=1, dist='Normal')
egarch11_fit = egarch11_model.fit()
print(egarch11_fit.summary())


Constant Mean - GARCH Model Results                      
===============================================================
Dep. Variable:                  Close   R-squared:                       0.000
Mean Model:             Constant Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:                3350.57
Distribution:                  Normal   AIC:                          -6693.15
Method:            Maximum Likelihood   BIC:                          -6672.60
                                        No. Observations:                 1258
Date:                Tue, May 09 2023   Df Residuals:                     1257
Time:                        10:42:42   Df Model:                            1
                                 Mean Model                                 
===============================================================
                 coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
mu         1.5045e-03  9.713e-08  1.549e+04      0.000 [1.504e-03,1.505e-03]
                              Volatility Model                              
===============================================================
                 coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
omega      7.6495e-06  1.791e-12  4.272e+06      0.000 [7.650e-06,7.650e-06]
alpha[1]       0.1000  1.805e-02      5.541  3.004e-08   [6.463e-02,  0.135]
beta[1]        0.8800  1.551e-02     56.729      0.000     [  0.850,  0.910]
===============================================================

The following table shows the results of fitting various EGARCH models to the MSFT stock price data.

Overall, the models indicate that the volatility of stock returns is persistent, with all models showing significant positive values for the alpha parameters. The models also suggest that volatility responds asymmetrically to changes in returns, with negative shocks having a larger impact than positive shocks; this is reflected in the negative omega values across the fitted models. In an EGARCH specification, omega is the constant term of the log-variance equation: it represents the baseline level of volatility that is unrelated to past shocks or past volatility, i.e., the inherent uncertainty in the system that the model's past information cannot explain.

Model       | Log Likelihood | AIC      | BIC
EGARCH(1,1) | 3355.44        | -6702.88 | -6682.33
EGARCH(1,2) | 3356.18        | -6702.36 | -6676.67
EGARCH(2,1) | 3356.67        | -6703.34 | -6677.66
EGARCH(2,2) | 3356.67        | -6701.34 | -6670.52

The log-likelihoods and information criteria are very close across the candidate models. We proceed with the EGARCH(2,2) specification as the final model for the remainder of the analysis.

Model Diagnostics

After selecting the final model, we need to perform diagnostic checks to ensure that the model is appropriate. The following plots show the diagnostic checks for the EGARCH(2,2) model.
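
The EGARCH(2,2) fit itself is not shown in the post; a minimal sketch mirroring the EGARCH(1,1) call above:

# Fit EGARCH(2,2) model (call assumed to mirror the EGARCH(1,1) fit)
egarch22_model = arch_model(msft_log, vol='EGARCH', p=2, o=0, q=2, dist='Normal')
egarch22_fit = egarch22_model.fit(disp='off')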

# Residuals plot
plt.plot(egarch22_fit.resid)
plt.title("EGARCH(2,2) Residuals")
plt.show()
# ACF/PACF of residuals
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(egarch22_fit.resid, ax=axes[0])
plot_pacf(egarch22_fit.resid, ax=axes[1])
plt.tight_layout()
plt.show()

From the residual plot, we can see that the residuals of the model are approximately normally distributed and have constant variance over time. Additionally, the ACF and PACF plots of the residuals show no significant autocorrelation, indicating that the model has captured all the relevant information in the data.
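
A formal complement to the visual check, not in the original post, is the Ljung-Box test on the standardized residuals; large p-values are consistent with no remaining autocorrelation:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the standardized residuals of the EGARCH(2,2) fit
std_resid = egarch22_fit.resid / egarch22_fit.conditional_volatility
print(acorr_ljungbox(std_resid.dropna(), lags=[10]))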

Forecasting

The EGARCH(2,2) model provides a volatility fit for the MSFT stock price. Notably, there were spikes in volatility around the start of COVID in 2020 and during the Fed’s interest rate increase in 2022.

# Plot conditional volatility
plt.plot(egarch22_fit.conditional_volatility)
plt.xlabel("time")
plt.title("Conditional Volatility")
plt.show()

Finally, let’s use the EGARCH(2,2) model to forecast the volatility of the MSFT stock price for the next day.

# Last 5 days of volatility
egarch22_fit.conditional_volatility[-5:]
Date
2023-05-08 00:00:00-04:00    0.017646
2023-05-09 00:00:00-04:00    0.016606
2023-05-10 00:00:00-04:00    0.015939
2023-05-11 00:00:00-04:00    0.016860
2023-05-12 00:00:00-04:00    0.016005
# Forecast next day
forecasts = egarch22_fit.forecast(reindex=False)
print("Forecasting Mean variance")
print(forecasts.mean.iloc[-3:])
print("Forecasting Residual variance")
print(forecasts.residual_variance.iloc[-3:])
Forecasting Mean variance
                                h.1
Date                               
2023-05-09 00:00:00-04:00  0.001541

Based on the model, the one-step-ahead forecast of the mean shown above is 0.001541, and the corresponding residual-variance forecast gives the expected volatility for the next day; the model suggests that volatility will decrease compared to the last five days. However, the accuracy of this prediction remains uncertain. To assess the model's accuracy, a rolling prediction approach can be used and compared against actual values using a measure like RMSE. Further analysis will be explored in the subsequent parts of this series.

References