A Comparison of Classifiers for Mammogram Data

In this post, we will explore the mammographic mass dataset from the UCI Machine Learning Repository and learn how to predict malignant tumors using different ML classifiers. The point is to compare and contrast the classifiers to find the best one. This mirrors a common workflow: during EDA we analyze a small dataset, or a small portion of a larger one, to decide on the best ML algorithm before deploying at scale.

The dataset contains 961 instances and 6 attributes, including the target variable. The goal is to predict whether a mammographic mass is benign or malignant (the severity attribute), which makes this a binary classification task.

There are several classifiers we can use. However, we will be comparing the results of three classifiers: Deep Neural Network (DNN), Random Forest, and XGBoost. We will also be using Stratified K-Fold cross-validation to ensure a fair comparison of the classifiers.

The reason for choosing these classifiers is that they are widely used and have distinct characteristics. The DNN is a deep learning model that can capture complex, non-linear relationships. Random Forest is an ensemble of decision trees, known for its stability against outliers, resistance to overfitting, ability to generalize well, and ability to handle categorical features. XGBoost is a gradient boosting algorithm that excels in performance and speed, particularly with large datasets. When the dataset is large or has many categorical levels, XGBoost is a good choice. However, if it performs poorly on a smaller dataset, there is no need to choose XGBoost over Random Forest: higher speed should not come at the cost of accuracy.

To begin our analysis, we make sure we ingest the data correctly. The dataset ships as a zip file containing the data and a summary file, as is common for UCI datasets, so we have to extract and locate the actual data inside the zip. You can skip the next section if you are only interested in the classifiers; for MLEs and data scientists, the data ingestion pipeline is worth a look.

Handling ZIP Files and Data Extraction

import pandas as pd
from io import BytesIO
from zipfile import ZipFile
import requests

# Load and extract data from ZIP file
url = 'https://archive.ics.uci.edu/static/public/161/mammographic+mass.zip'
response = requests.get(url)
with ZipFile(BytesIO(response.content), 'r') as zip_file:
    with zip_file.open('mammographic_masses.data') as file:
        # The .data file has no header row and uses '?' for missing values
        masses_data = pd.read_csv(file, na_values=['?'],
                                   names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])

In the above code, we use the requests library to download the zip file from the UCI Machine Learning Repository and the ZipFile class from the zipfile module to pull the data file out of the archive. The data is then read into a pandas DataFrame with read_csv, supplying the column names and treating '?' as a missing-value marker since the file has no header row. The data is now ready for further processing and analysis.
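
A quick sanity check on what we just loaded (the column names and missing-value handling follow the full code at the end of this post):

print(masses_data.shape)   # expect (961, 6): 961 instances, 6 attributes including the target
print(masses_data.head())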

Classifiers at a Glance

We will be testing three distinctive classifiers:

1. Deep Neural Network (DNN)

We will create a DNN model using the Keras API from TensorFlow. The model will have a first hidden layer with 6 neurons and a ReLU activation function that takes the 4 input features, and an output layer with 1 neuron and a sigmoid activation function. The model will be compiled with the Adam optimizer, binary crossentropy loss function, and accuracy as the evaluation metric.

Choice of the Activation Function

The first layer has 6 neurons connected to the 4 inputs, which are the 4 features used for prediction (age, shape, margin, and density). Whatever the number of neurons in the first layer, its input dimension must equal the number of features in the dataset.

The 6 neurons need an activation function. We do not want negative activations, and ReLU handles this by zeroing out negative values while passing positive values through unchanged. The ReLU activation function is a popular choice for deep learning models due to its simplicity and effectiveness: it helps mitigate the vanishing gradient problem and speeds up convergence.

The output layer has 1 neuron connected to the 6 neurons of the previous layer, because we want a single output: positive (1) or negative (0). Positive if the features suggest a possible malignancy, negative if they suggest a benign mass. The raw output of this neuron is an unbounded real number, so we squash it into the range (0, 1) with the sigmoid function, a logistic function whose output we can read as the probability of malignancy and then threshold (typically at 0.5) to obtain a 0/1 prediction. The sigmoid is the standard choice for the output layer in binary classification.
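
As a quick illustration (a standalone sketch, separate from the model code), here is what that squashing looks like numerically:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The last neuron's raw output can be any real number; sigmoid maps it into (0, 1),
# which we read as the probability of malignancy and threshold at 0.5
print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0))   # ~0.12, 0.5, ~0.88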

Choice of the Loss Function

The loss function measures how well the model is performing, and its choice is crucial because it guides the model in the right direction during training; the right choice depends on the nature of the problem. Binary crossentropy is the standard loss for binary classification tasks: it measures the difference between the predicted probabilities and the true labels, which fits our benign-versus-malignant prediction well.
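
To make this concrete, here is a small NumPy sketch of the quantity binary crossentropy computes (Keras evaluates the equivalent expression for us during training):

import numpy as np

def binary_crossentropy(y_true, y_pred):
    # Average negative log-likelihood of the true labels under the predicted probabilities
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident correct predictions give a small loss; confident wrong ones a large loss
print(binary_crossentropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105
print(binary_crossentropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~2.303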

Choice of the Optimizer

The Adam optimizer is a popular choice for training deep learning models. It is an extension of stochastic gradient descent that computes adaptive learning rates for each parameter, and it is known for its speed and performance, making it a compelling choice for training deep neural networks. This choice is conventional, and you can use it for most deep learning models.
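
The 'adam' string alias uses Keras's default settings; if you want to control the learning rate explicitly, you can pass an optimizer instance instead when compiling the model defined below (a minor variation, shown here only as a sketch):

from tensorflow.keras.optimizers import Adam

# Same compilation step, but with an explicit learning rate instead of the 'adam' alias
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])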

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Model for Deep Neural Network (DNN)
def create_original_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(metrics=['accuracy'], optimizer='adam', loss='binary_crossentropy')
    return model

The model architecture is as follows:
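
To sanity-check the architecture, we can instantiate the model and print a summary; the parameter counts follow directly from the layer sizes above:

model = create_original_model()
model.summary()
# Dense(6) hidden layer: 4*6 weights + 6 biases = 30 parameters
# Dense(1) output layer: 6*1 weights + 1 bias  =  7 parameters
# Total: 37 trainable parameters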

2. Random Forest

Random Forest is a powerful ensemble of decision trees recognized for its robust predictions. Its strength lies in stability against outliers, effective handling of categorical features, and resilience to overfitting, making it a dependable choice, especially for smaller datasets. Setting the random state ensures reproducibility, and 100 estimators is a sensible default that balances accuracy against training time.

from sklearn.ensemble import RandomForestClassifier

# Model for Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

3. XGBoost

XGBoost is a gradient boosting algorithm designed for speed and performance. Here its configuration is set explicitly through 'learning_rate' and 'n_estimators', and a fixed random state guarantees reproducibility, making XGBoost a compelling choice, particularly for handling large datasets.

from xgboost import XGBClassifier

# Model for XGBoost
xgb_classifier = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)

When to Use Each Classifier

When the dataset is small, Random Forest is a reliable choice. It’s robust, less prone to overfitting, and can handle categorical features with ease. XGBoost, on the other hand, shines with large datasets, offering superior performance and speed. As for DNN, it’s a versatile choice, particularly for complex, non-linear relationships. Its deep architecture can capture intricate patterns, making it a compelling option for image and text data.

Stratified K-Fold and Imbalanced Datasets

The dataset’s imbalances demand careful handling. We will use Stratified K-Fold, a technique that ensures class distribution integrity in each fold. This guarantees a fair evaluation, particularly crucial when dealing with imbalanced datasets.

from sklearn.model_selection import StratifiedKFold

cv_method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
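
As a quick sanity check (a sketch assuming X and y are the scaled features and labels produced by load_data in the full code below), each test fold should contain roughly the same fraction of malignant cases as the dataset as a whole:

import numpy as np

for fold, (train_idx, test_idx) in enumerate(cv_method.split(X, y), start=1):
    print(f"Fold {fold}: malignant fraction in test split = {np.mean(y[test_idx]):.3f}")
print(f"Malignant fraction overall: {np.mean(y):.3f}")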

Cross-Validation Scores and Comparison

from sklearn.model_selection import cross_val_score
from scikeras.wrappers import KerasClassifier
import seaborn as sns
import matplotlib.pyplot as plt

# Wrap the Keras model so it exposes the scikit-learn estimator interface
dnn = KerasClassifier(build_fn=create_original_model, epochs=100, verbose=0)

# Run cross-validation for each model (X and y are the scaled features and labels from the full code below)
dnn_scores = cross_val_score(dnn, X, y, cv=cv_method)
rf_scores = cross_val_score(rf_classifier, X, y, cv=cv_method)
xgb_scores = cross_val_score(xgb_classifier, X, y, cv=cv_method)

# Compile results
results = pd.DataFrame({
    'DNN': dnn_scores,
    'RandomForest': rf_scores,
    'XGBoost': xgb_scores
})

# Plot results using seaborn boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=results)
plt.title('Cross-Validation Scores Comparison')
plt.ylabel('Accuracy')
plt.show()

And we get a boxplot that visually compares the cross-validation scores of our classifiers, illuminating the strengths and variations in their performance. A boxplot was chosen to highlight the spread of the CV scores as well as the median.

Conclusion

In this study, we conducted an analysis of mammogram data, employing a comprehensive examination of three classification algorithms. The focus extended to a thorough comparison and contrast of these algorithms, evaluating their performance in a simple classification task. Additionally, a detailed exploration of a Deep Neural Network (DNN) was undertaken, emphasizing the architectural considerations relevant to the specified classification task.

Happy coding and exploring the realms of machine learning!

The full code

import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from scikeras.wrappers import KerasClassifier
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO


def load_data():
    print("Downloading and extracting the ZIP file...")
    # Download and extract the ZIP file
    zip_url = 'https://archive.ics.uci.edu/static/public/161/mammographic+mass.zip'
    response = urlopen(zip_url)
    zip_data = BytesIO(response.read())

    with ZipFile(zip_data, 'r') as zip_ref:
        # The archive contains the data file plus a summary file; select the .data file explicitly
        csv_file_name = next(name for name in zip_ref.namelist() if name.endswith('.data'))
        print(f"Extracting data from {csv_file_name}...")
        with zip_ref.open(csv_file_name) as file:
            # Read the CSV file into a DataFrame
            masses_data = pd.read_csv(file, na_values=['?'],
                                      names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])

    print("ETL: Dropping missing values...")
    # ETL: drop rows with any missing values
    masses_data = masses_data.dropna()

    all_features = masses_data[['age', 'shape', 'margin', 'density']].values
    all_classes = masses_data['severity'].values

    # Scaling features
    scaler = StandardScaler()
    all_features_scaled = scaler.fit_transform(all_features)
    return all_features_scaled, all_classes


# Model for Neural Network (DNN)
def create_original_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(metrics=['accuracy'], optimizer='adam', loss='binary_crossentropy')
    return model


# Function to run cross-validation and return scores
def run_cv(model, X, y, cv_method):
    print(f"Running {model.__class__.__name__} model...")
    return cross_val_score(model, X, y, cv=cv_method)


# Main method to compile and plot results
def run():
    X, y = load_data()

    # Model for Random Forest
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Model for XGBoost
    xgb_classifier = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)

    # Model for a Deep Neural Network
    dnn = KerasClassifier(build_fn=create_original_model, epochs=100, verbose=0)

    cv_method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    # Run cross-validation for each model
    dnn_scores = run_cv(dnn, X, y, cv_method)
    rf_scores = run_cv(rf_classifier, X, y, cv_method)
    xgb_scores = run_cv(xgb_classifier, X, y, cv_method)

    # Compile results
    results = pd.DataFrame({
        'DNN': dnn_scores,
        'RandomForest': rf_scores,
        'XGBoost': xgb_scores
    })

    # Plot results using seaborn boxplot
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=results)
    plt.title('Cross-Validation Scores Comparison')
    plt.ylabel('Accuracy')
    # Save the plot before showing it, otherwise the saved image may be blank
    plt.savefig('cross_val_scores.png')
    plt.show()


# Run the main method
run()

Case Study: Transforming Higher Ed with Predictive Analytics

Statement of Problem

In the dynamic realm of higher education, many colleges grapple with a common hurdle—the soaring dropout rates typically observed within the initial two semesters of student enrollment. This not only translates to a direct loss of revenue but also triggers additional costs related to marketing endeavors and sales efforts, often yielding meager returns.

One such college, choosing to remain anonymous, boasted an excellent sales team with an impressive throughput of almost 20%. However, the institution was grappling with significant enrollment losses for diverse reasons. Determined to pinpoint the root causes and intervene strategically, the college aimed to minimize its enrollment dropouts.

This anonymous college recognized that while its sales team demonstrated commendable performance, a substantial portion of enrollments was slipping away. The institution sought to delve into the heart of the matter, identifying key issues and implementing targeted interventions to curtail enrollment dropouts effectively. The focus was on leveraging available data during enrollment, encompassing essential demographic details, academic performance metrics, and financial aid information. The overarching goal was to deploy predictive analytics, facilitating timely and personalized support to mitigate academic and financial challenges, and optimizing resource allocation for sustainable student retention.

Exploration and Decision Trees

The initial phase involved an exploration of predictive models, and decision trees emerged as a valuable tool. These trees effectively identified key metrics and features contributing to the risk of student dropout. This preliminary analysis was pivotal and laid the groundwork for the subsequent stages of the project.

Planning, Budgeting, and Resource Allocation

With key decision-makers, including the CEO, on board, the project moved into a comprehensive planning phase. Decision trees’ success in identifying critical risk factors facilitated a smooth transition. A detailed budget was crafted, aiming to utilize 80% of the initial budget while accommodating potential cost overruns. Resource allocation became a strategic task, with the data scientist diligently identifying the necessary resources to ensure project completion within the allocated budget.

Project Initiation and Model Development

The final model was trained on enrollment data, encompassing demographic information, credit history, and a host of relevant features. Manual data cleaning and exploratory data analysis were initially employed and later automated through an ETL pipeline. Random Forest was chosen to enhance accuracy over decision trees, providing a more robust model with low variance. The pipeline was built on open-source algorithms, effectively managing costs. Notably, the landscape has evolved, and processes such as ETL and exploratory data analysis can now be automated, significantly reducing costs.

Model Efficacy and Cost Analysis

The developed predictive model showcased an impressive 82% accuracy, successfully reducing the dropout rate to 8% within the identified segments during the trial run. With an average total enrollment of 823 students for a full year, the initial 18% dropout rate saw a substantial reduction to 11%, indicating a remarkable 39% improvement in student retention.

Considering an acquisition cost of $5,000 per student and a calculated lifetime value of $40,000 for each student (derived from the total tuition of $80,000 for the 2-year accelerated program), the total savings amounted to $2,633,600. This surpasses the initial deployment cost of around $300,000, showcasing a remarkable 8.81x savings rate.

This case study underscores the transformative impact of predictive analytics in enhancing student retention, achieving substantial cost savings, and demonstrating the continuous evolution of data science methodologies.

For more details on our approach and methodologies, please contact our team.

Comparing Pandas Vector Operations to Multithreading

In this analysis, we compare the performance of Pandas vector operations against multithreading for Natural Language Processing (NLP) preprocessing tasks.

# Import necessary libraries and modules
import time
import spacy
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import utils.data_utils  # Assuming you have a module for loading data

# Load Spacy model
nlp = spacy.load("en_core_web_sm")

# Define NLP preprocessing function
def ner_preprocessing(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

# Function for parallel processing using ThreadPoolExecutor
def process_transaction_data_parallel(df):
    with ThreadPoolExecutor() as executor:
        processed_data = list(executor.map(ner_preprocessing, df['Description']))
    return processed_data

def main():
    # Load data
    df = utils.data_utils.load_data()

    # Time Pandas vector operation
    start = time.time()
    df['ner_preprocessing_map'] = df['Description'].apply(ner_preprocessing)
    print(f"Time taken for Pandas vector operation: {time.time() - start}")

    # Time multithreading with ThreadPoolExecutor
    start = time.time()
    processed_data = process_transaction_data_parallel(df)
    df['ner_preprocessing_threaded'] = processed_data
    print(f"Time taken for multithreading: {time.time() - start}")

    # Print DataFrame
    print(df.head())

if __name__ == "__main__":
    main()

The NER preprocessing function extracts entities from the ‘Description’ column in a Pandas DataFrame. The comparison involves using Pandas’ apply method for vectorized operations and multithreading for parallelized operations.

Results:

The time taken for the Pandas vector operation was 38.60 seconds, while multithreading took 62.06 seconds.

It’s evident that the Pandas vector operation outperforms multithreading in this scenario, most likely because the spaCy NER work is CPU-bound: Python’s Global Interpreter Lock prevents threads from running it in parallel, so the thread pool mostly adds scheduling overhead.

Additionally, the multithreading approach resulted in issues related to the order of direct assignment. The ‘ner_preprocessing_threaded’ column shows discrepancies in the order of entities assigned to each row. This could potentially lead to data integrity issues.

It’s crucial to consider the nature of the task and the size of the dataset when choosing between Pandas vector operations and multithreading. While multithreading can provide parallelization benefits, it may not always be the most efficient solution, as seen in this NLP preprocessing example.

Decoding Textual Secrets: Navigating NLP Pitfalls with Precision and SpaCy Brilliance!

In the ever-evolving landscape of Natural Language Processing (NLP), text preprocessing is a critical step to transform raw text into a format suitable for analysis. However, this seemingly routine task can become a double-edged sword, where oversimplified preprocessing might inadvertently discard valuable information. In this blog post, we’ll explore the journey through text preprocessing, learning from the pitfalls and incorporating improved techniques for optimal results.

Understanding the Problem: Oversimplified Preprocessing

The Challenge

Consider a dataset with diverse descriptions, including grocery store details and financial transactions. Traditional preprocessing methods might oversimplify the data, as shown in the example where '99-CENTS-ONLY #0133' becomes 'cents only' after processing. The challenge lies in retaining contextually relevant information while ensuring cleanliness.

import re
import string

# Previous oversimplified preprocessing
text = '99-CENTS-ONLY #0133'
text = text.lower()
text = re.sub(r'\d+', '', text)  # drops the '99' entirely
text = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
text = ' '.join(text.split())    # collapse leftover whitespace
print(text)
# Output: 'cents only'

Improved Preprocessing Techniques

1. Customized Preprocessing for Specific Domains

Recognizing the domain-specific nature of the data, we can tailor preprocessing steps. In this example, we selectively remove noise while retaining domain-specific terms and identifiers.

import re
import string

def custom_preprocessing(text):
    text = text.lower()
    text = re.sub(r'\b\d+\b', '', text)  # Remove standalone numbers
    text = text.replace('-', ' ')  # Preserve hyphenated terms by splitting on the hyphen
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

2. Named Entity Recognition (NER)

Incorporate Named Entity Recognition (NER) to identify and preserve specific entities in the text. This helps maintain critical information, such as recognizing ’99 cents’ as a monetary value associated with a store.

import spacy

# Load the spaCy model for NER
nlp = spacy.load('en_core_web_sm')

def ner_preprocessing(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

Applying Improved Techniques to the Dataset

Now, let’s apply these improved techniques to the dataset containing diverse descriptions:

import pandas as pd

data = ['99-CENTS-ONLY #0133',
        'ESC DISB - PMI INS',
        'Vanguard Total Bond Market ETF Market Buy']

df = pd.DataFrame({'description': data})
df['custom_preprocessing'] = df['description'].apply(custom_preprocessing)
df['ner_preprocessing'] = df['description'].apply(ner_preprocessing)

print(df)

Insights: Balancing Cleanliness and Context

As we inspect the processed data, a clear evolution emerges. The improved preprocessing techniques now strike a balance between cleaning data and retaining relevant information:

Description                                  | Customized Preprocessing                      | NER Preprocessing
99-CENTS-ONLY #0133                          | ['99 cents']                                  | ['99-CENTS-ONLY #0133']
ESC DISB - PMI INS                           | []                                            | []
Vanguard Total Bond Market ETF Market Buy    | ['vanguard total bond market etf market buy'] | ['Vanguard Total Bond Market']

Table: Output of preprocessing pipeline

Clearly, Named Entity Recognition is superior at preserving useful information. We have to weigh our options when choosing between the two techniques. This is where humans come in: not everything can be automated!

Conclusion: A Refined Approach to Text Preprocessing

In navigating the challenges of text preprocessing, it’s evident that a refined approach is essential. By customizing preprocessing steps, incorporating advanced techniques like Named Entity Recognition (NER), and adopting hybrid approaches, we enhance the accuracy and contextual understanding of NLP models. It’s not just about cleaning; it’s about preserving valuable information, ensuring that our models are equipped to handle the intricacies of real-world text data. As you embark on your NLP endeavors, embrace the power of thoughtful preprocessing for a richer and more meaningful analysis. Happy preprocessing!

Heteroscedasticity Unmasked: Taming Increasing Variance with Transformations!

  1. What is Heteroscedasticity? Heteroscedasticity refers to the uneven spread or varying levels of dispersion in data points, meaning that the variability of data isn’t consistent across the range.
  2. Why is it Bad for Modeling and Prediction? Heteroscedasticity can wreak havoc on modeling and prediction because it violates the assumption of constant variance in many statistical techniques, leading to biased results and unreliable predictions.
  3. How to Handle It? To tackle heteroscedasticity, one effective approach is data transformation. This involves altering the data using mathematical functions to stabilize variance and make your models more robust.

The Data: Let’s start with some code and synthetic data for house prices, where the variability increases as house sizes grow. This mirrors real-world scenarios where larger properties often exhibit more price variability.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial import Polynomial

# Step 1: Generate Synthetic Data with Heteroscedasticity
np.random.seed(0)  # For reproducibility
house_sizes = np.linspace(1000, 5000, 100)  # House sizes
true_prices = 50000 + 30 * house_sizes  # True house prices

# Introduce heteroscedasticity - variance increases with house size
error_term = np.random.normal(np.zeros(100), 7 * house_sizes)
noisy_prices = true_prices + error_term

# Split error_term into positive and negative components
positive_errors = np.where(error_term > 0, error_term, 0)
negative_errors = np.where(error_term < 0, error_term, 0)

Step 2: Now, let’s plot the bounding lines, illustrating the increasing variance.

# Fit polynomials to the positive and negative errors
positive_poly = Polynomial.fit(house_sizes, positive_errors, deg=1)
negative_poly = Polynomial.fit(house_sizes, negative_errors, deg=1)

# Calculate values for the bounding lines
positive_bounds = positive_poly(house_sizes)
negative_bounds = negative_poly(house_sizes)
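
A plotting sketch for this step (it reuses the arrays defined above; the exact styling is up to you):

# Scatter the noisy prices and overlay the fitted bounding lines around the true trend
plt.figure(figsize=(10, 6))
plt.scatter(house_sizes, noisy_prices, alpha=0.6, label='Observed prices')
plt.plot(house_sizes, true_prices, color='black', label='True prices')
plt.plot(house_sizes, true_prices + positive_bounds, 'r--', label='Upper bound')
plt.plot(house_sizes, true_prices + negative_bounds, 'r--', label='Lower bound')
plt.xlabel('House size (sq ft)')
plt.ylabel('Price')
plt.legend()
plt.show()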

Step 3: To stabilize the data, we’ll apply a logarithmic transformation to the noisy prices. This transformation makes the data more suitable for analysis.

# Step 3: Apply a Logarithmic Transformation
data = pd.DataFrame({'Size': house_sizes, 'Price': noisy_prices})
data['Log_Price'] = np.log(data['Price'])

Result: The transformation effectively reduces the increasing variance. We can visualize the change through comparison plots, showing the data’s improved stability.
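
One possible way to produce those comparison plots (a sketch, assuming the data DataFrame built in Step 3):

# Side-by-side view: raw prices vs. log-transformed prices
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(house_sizes, data['Price'], alpha=0.6)
ax1.set_title('Original prices: spread grows with size')
ax1.set_xlabel('House size (sq ft)')
ax1.set_ylabel('Price')
ax2.scatter(house_sizes, data['Log_Price'], alpha=0.6)
ax2.set_title('Log prices: spread is much more uniform')
ax2.set_xlabel('House size (sq ft)')
ax2.set_ylabel('log(Price)')
plt.tight_layout()
plt.show()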

Conclusion: Addressing increasing variance with transformations is a valuable tool in data analysis. Whether you’re dealing with house prices or any other dataset, understanding and mitigating heteroscedasticity enhances the reliability of your analysis and decision-making.

The Role of p-Value in Hypothesis Testing and Machine Learning

Introduction

In statistical hypothesis testing, the p-value is a widely-used tool to evaluate the significance of results obtained from a sample as they relate to a population. In machine learning, hypothesis testing is employed to validate models and determine whether the results obtained from the models are statistically significant.

Example using the Normal Distribution

Suppose we have a dataset of the heights of 1000 people, and we want to test whether the average height of the population is greater than 170 cm. We can use the normal distribution to model the distribution of heights in the population.

We can use the null hypothesis that the average height is 170 cm, and the alternative hypothesis that the average height is greater than 170 cm.

We take a sample of 50 people from the population and calculate the sample mean and standard deviation. We then calculate the test statistic:

z = (sample_mean - hypothesized_mean) / (standard_error)

where the standard error is given by:

standard_error = standard_deviation / sqrt(sample_size)

If the null hypothesis is true, then the test statistic follows a standard normal distribution. We can calculate the p-value as the area under the tail of the distribution corresponding to the observed test statistic.

If the p-value is less than the significance level (usually 0.05), then we reject the null hypothesis and conclude that the average height is greater than 170 cm. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the alternative hypothesis.
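
Here is a small Python sketch of this procedure using SciPy (the sample values are synthetic and purely illustrative):

import numpy as np
from scipy import stats

np.random.seed(0)
heights = np.random.normal(loc=172, scale=8, size=50)   # hypothetical sample of 50 heights (cm)

hypothesized_mean = 170
sample_mean = heights.mean()
standard_error = heights.std(ddof=1) / np.sqrt(len(heights))

# z statistic and one-tailed p-value (area in the upper tail, since H1 is "greater than")
z = (sample_mean - hypothesized_mean) / standard_error
p_value = stats.norm.sf(z)

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
# Reject H0 at the 5% level if p_value < 0.05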

Uses in Machine Learning

In machine learning, the p-value is used to evaluate the performance of a model. For instance, a machine learning algorithm could be used to predict whether a customer will buy a product or not. The p-value would then be used to determine whether the results obtained from the algorithm are statistically significant, i.e., whether the algorithm is better than random guessing.

The p-value is also used in feature selection, which is the process of selecting relevant features that contribute to the prediction of the target variable. Features with a low p-value are considered to be statistically significant and are usually selected for use in the model.
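
As a sketch of p-value-based feature selection in Python (synthetic data; scikit-learn's f_regression returns an F-statistic and p-value per feature):

from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic data: 100 samples, 5 features, only 2 of which are actually informative
X, y = make_regression(n_samples=100, n_features=5, n_informative=2,
                       noise=10.0, random_state=42)

f_values, p_values = f_regression(X, y)
selected = [i for i, p in enumerate(p_values) if p < 0.05]
print("p-values:", p_values)
print("Features kept (p < 0.05):", selected)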

Example of p-value in ML

In R, the lm() function is used to fit linear regression models. After fitting a model using lm(), the summary() function can be used to obtain a summary of the model, which includes the p-values for the coefficients.

For example, suppose we have a dataset of the weights and heights of 100 people. We want to fit a linear regression model to predict weight from height. We can use the following R code to fit the model:


model <- lm(weight ~ height, data = mydata)
summary(model)

The output of the summary() function would be:

Call:
lm(formula = weight ~ height, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-6.2500 -1.4375  0.2812  1.6250  5.8125

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -103.285      5.181 -19.924   <2e-16 ***
height         3.451      0.079  43.545   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.424 on 98 degrees of freedom
Multiple R-squared:  0.8676,	Adjusted R-squared:  0.8663
F-statistic:  1895 on 1 and 98 DF,  p-value: < 2.2e-16

The table of coefficients includes the estimated coefficients for the intercept and the height variable, as well as their standard errors, t-values, and p-values. The p-value for the height variable is less than 0.05, indicating that the height variable is statistically significant and contributes to the prediction of the target variable (weight).

Limitations of p-value

It is important to note that the p-value is not without its limitations. One of the main limitations is that it can be influenced by the sample size and the significance level chosen by the researcher. Additionally, the p-value does not provide information about the effect size or the practical significance of the results obtained from a study. Therefore, it is important to interpret the p-value in conjunction with other statistical measures, such as effect size and confidence intervals, to obtain a more complete understanding of the results obtained from a study.

One common issue is that we need to know which tailed test to use based on the context. The lm() function always reports the two-tailed p-value for each t-statistic. However, some cases, as described below, really call for a one-tailed test.

For example, let’s say we are interested in studying the effect of a new drug on reducing anxiety levels in patients. We have a hypothesis that the new drug will decrease anxiety levels in patients, and we are not interested in the possibility that the drug may increase anxiety levels. In this case, we would use a one-tailed test with the alternative hypothesis stating that the true mean difference in anxiety levels between the treatment and control groups is less than zero (i.e., the drug reduces anxiety levels).

mydata <- data.frame(anxiety_reduction = c(2, 4, 3, 5, 1, 6, 7, 
                        8, 9, 10, 12, 13, 14, 15, 11),
                      new_drug = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0))

model <- lm(anxiety_reduction ~ new_drug, data = mydata)
summary(model)

Call:
lm(formula = anxiety_reduction ~ new_drug, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max
-2.041667 -0.916667 -0.041667  0.958333  2.458333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5000     0.8021   3.114    0.009 **
new_drug      1.4167     0.8021   1.766    0.099 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.386 on 13 degrees of freedom
Multiple R-squared:  0.1792,	Adjusted R-squared:  0.09458
F-statistic: 3.131 on 1 and 13 DF,  p-value: 0.09852

It looks like the above is not statistically significant, since the p-value > 0.05. However, the test should be one-tailed: we only care whether the drug decreases anxiety, which in this data corresponds to a positive coefficient on new_drug (a larger anxiety_reduction). The p-value in the output is two-tailed, and since the estimated effect is in the hypothesized direction, the one-tailed p-value is half of it: 0.09852/2 = 0.049 < 0.05, making the effect of the drug statistically significant.

Conclusion

In conclusion, the p-value is a valuable tool in hypothesis testing and machine learning. It is used to evaluate the statistical significance of results obtained from models and experiments. However, it is important to interpret the p-value in conjunction with other statistical measures and to be aware of its limitations to obtain a complete understanding of the results obtained from a study.

Auto-update episode names of shows by Google scraping

After backing up DVDs we are sometimes faced with the mundane task of renaming files with the episode names. A small Python script can automate this laborious task.
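
A minimal sketch of such a script follows. The get_episode_titles() helper is hypothetical and stands in for whatever lookup you use (scraping search results, an episode database, etc.); the file-name pattern matches the examples below.

import os
import re

def get_episode_titles(show, season):
    """Hypothetical lookup: return {episode_number: title} for a season,
    e.g. by scraping search results or querying an episode database."""
    raise NotImplementedError

def rename_episodes(folder, show):
    for filename in sorted(os.listdir(folder)):
        match = re.search(r's(\d+)_e(\d+)', filename, re.IGNORECASE)
        if not match:
            continue
        season, episode = int(match.group(1)), int(match.group(2))
        titles = get_episode_titles(show, season)
        ext = os.path.splitext(filename)[1]
        new_name = f"S{season:02d} E{episode:02d} · {titles[episode]}{ext}"
        os.rename(os.path.join(folder, filename), os.path.join(folder, new_name))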


/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e01.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E01 · Welcome to the Hellmouth (1).mp4

/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e02.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E02 · The Harvest.mp4

/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e03.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E03 · Witch.mp4

/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/BuffytheVampireSlayer_s01_e04.mp4
[renamed to]
/media/htd/Seagate Backup Plus Drive/TV Shows/Buffy the Vampire Slayer/buffyS01/S01 E04 · Teacher’s Pet.mp4