Add columns of a CSV file with a single line of bash commands

Analysing and extracting results from data stored in CSV files is made simple with spreadsheet programs which can easily import them and support a wide variety of formats. For example the exported data from my HSA contribution file.

12/01/201911/30/2019InterestNone$0.09
11/12/201911/12/2019CIGNA HEALTHCARE 68Debit Card-$25.00
11/04/201911/04/2019Fee DistributionEFT-$1.75
11/01/201910/31/2019InterestNone$0.09
10/02/201910/02/2019Fee DistributionEFT-$1.75
10/01/201909/30/2019InterestNone$0.09
09/04/201909/04/2019Fee DistributionEFT-$1.75
09/01/201908/31/2019InterestNone$0.09
08/03/201908/02/2019Fee DistributionEFT-$1.75
08/01/201907/31/2019InterestNone$0.09
07/05/201907/05/2019CIGNA HEALTHCARE OF ARDebit Card-$25.00
07/02/201907/02/2019Fee DistributionEFT-$1.75
07/01/201906/30/2019InterestNone$0.09
06/04/201906/04/2019Fee DistributionEFT-$1.75
06/03/201906/03/2019CIGNA HEALTHCARE OF ARDebit Card-$25.00
06/01/201905/31/2019InterestNone$0.09
05/31/201905/31/2019CVS/PHARMACY #07078Debit Card-$15.00
05/13/201905/13/2019CIGNA HLTHCRE AZ#HCFSDebit Card-$65.80
05/02/201905/02/2019Fee DistributionEFT-$1.75
05/01/201904/30/2019InterestNone$0.09
04/02/201904/02/2019Fee DistributionEFT-$1.75
04/01/201903/31/2019InterestNone$0.10
03/21/201903/21/2019Adjustment ContributionEFT$1.75
03/04/201903/04/2019Fee DistributionEFT-$1.75
03/01/201902/28/2019InterestNone$0.18
02/25/201902/25/2019CVS/PHARMACY #07078Debit Card-$15.00
HSA contribution CSV file called hsa.csv

To get the 2019 activity only in a spreadsheet program we will need to set up filters and filter out all other years except 2019. That is a lot of mouse events. In command-line we can use just use a single command – grep

Filter for specific column values and then adding a separate column

Now, let’s say we want to find what is the total expenditure in PHARMACY. In spreadsheet we have to filter again by PHARMACY and then sum those columns. While this is not too difficult there can be issues with autosum summing hidden rows too, leading to erroneous results. IN command-line we simply do grep again.

$ cat hsa.csv | grep 2019 | grep PHARMACY 

To sum the last column involves extracting that column. A single column or a list of columns can be extracted using the cut command. The -d switch defines the delimiter (separator of fields), which in the case of CSV files in the comma “,”. So the cut command cuts up the file vertically along the delimiter. -f lists the columns to be extracted – as a single number for a single column or as a list 2-5, to extract columns 2 to 5 (for more info see the man page of cut).

$ cat hsa.csv | grep 2019 | grep PHARMACY | cut -d"," -f5 

And then we extract the numbers in that column only. While the numbers can be extracted in a variety of ways including sed and awk (which many people prefer) I will use tr (translate) command. tr (translate) command primarily is use to replace characters but it can also remove characters.

$  cat hsa.csv | grep 2019 | grep PHARMACY | \
> cut -d"," -f5 | tr -cd '[:digit:].\n'
15.00
15.00
15.00

tr -c chooses the complement of :digits: (0-9), period (.), and newline (\n) i.e. everything else except real numbers and newline. -d deletes those characters.

Now just have to sum those numbers.

$  cat hsa.csv | grep 2019 | grep PHARMACY | \ 
> cut -d"," -f5 | tr -cd '[:digit:].\n' | paste -sd+ | bc
45.00

paste accepts each line as input and joins them using the separator + which then is fed onto the terminal calculator bc which then performs the addition.

Now, this is good for learning. If we do not feel like going through the steps over and over again I have written a script which auto detects the column that needs to be summed and sums that column based on a given filter. The script below can be downloaded from my GitHub page.

#!/bin/bash
# sumcolumn.sh v1.5 by Saugata copyright 07/07/2020
# autosum price column of a csv after filtering for a specific column value

usage="
Usage: ./sumcolumn.sh [filename] [pattern] [-h help]
  
Example: ./sumcolumn.sh hsa.csv PHARMACY
"

# ---- GETOPTS ----
# no args. print usage and exit
if [[ $# -eq 0 ]]; then
 echo "$usage"
 exit
fi

while getopts h option
do
case "${option}"  in
 h) echo "$usage" 
 exit ;;
esac
done

# check if first arg is a file which exists
if [[ -z $1 ]]; then
  echo "File doesn't exist"
  exit
fi
echo "Filename: " $1

# check if file is a text file
txtfile=`file "$1" | grep "text"`

if [[ $txtfile == "" ]]; then
  echo "File not a text file. Choose a text file."
  exit
fi

# check if pattern exists in file
ptn=`grep "$2" "$1"`
if [[ $ptn == "" ]]; then
  echo "Pattern does not exist in file. No sum."
  exit
fi

# identify column containing price (elements have $ symbol)
sumcol="$(head -n 2 "$1" | tail -n 1 | tr "," "\n" | nl \
 | grep "\\$" | sed 's/ //g' | head -c1 )"

if [[ $sumcol == "" ]]; then
  echo "No columns to sum"
  exit
fi

echo "Summing column: " `head -n 1 "$1" | cut -d"," -f$sumcol`

echo "Total: " total="$(grep "$2" "$1" | cut -d"," -f$sumcol \
 | tr -cd '[:digit:].\n' | paste -sd+ | bc)"

Leave a comment