Convert filelist to an Excel database (importable ebook list)

Let us say we have a collection of ebooks or papers/articles sorted into various folders, and we want to create a database (or spreadsheet) of those papers or books so that we can add comments or notes next to them. For example, let us say we have a file structure like this (output of find . -type f):

./entanglement-entropy-holography/1006.1263.pdf
./entanglement-entropy-holography/0912.1877.pdf
./entanglement-entropy-holography/0911.3160v2.pdf
./entanglement-entropy-holography/0912.1877v2.pdf
./entanglement-entropy-holography/1010.1682.pdf
./graviton-propagator/zee-1979-PhysRevLett.42.417.pdf
./graviton-propagator/dewitt-3-PhysRev.162.1239.pdf
./graviton-propagator/dewitt-2-PhysRev.162.1195.pdf
./graviton-propagator/dewitt-1-PhysRev.160.1113.pdf
./SUSY/Piguet-9710095v1.pdf
./SUSY/Olive_susy_9911307v1.pdf
./SUSY/sohnius-introducing-susy-1985.pdf
./SUSY/khare-cooper-susy-qm-phys.rept-1995.pdf
./SUSY/Instantons Versus Supersymmetry9902018v2.pdf
and we want this list to be converted to a database format.


Article Type Notes
1006.1263.pdf entanglement-entropy-holography
0912.1877.pdf entanglement-entropy-holography
0911.3160v2.pdf entanglement-entropy-holography
0912.1877v2.pdf entanglement-entropy-holography
1010.1682.pdf entanglement-entropy-holography
zee-1979-PhysRevLett.42.417.pdf graviton-propagator
dewitt-3-PhysRev.162.1239.pdf graviton-propagator Difficult
dewitt-2-PhysRev.162.1195.pdf graviton-propagator Difficult
dewitt-1-PhysRev.160.1113.pdf graviton-propagator Difficult
Piguet-9710095v1.pdf SUSY
Olive_susy_9911307v1.pdf SUSY
sohnius-introducing-susy-1985.pdf SUSY
khare-cooper-susy-qm-phys.rept-1995.pdf SUSY
Instantons Versus Supersymmetry9902018v2.pdf SUSY Random comment
The last column is added by the user after the data is imported. To import the data in the above format, we need the directory name (TYPE) and the FILENAME to be swapped and printed as columns separated by a TAB. Any other delimiter would work, but with TAB as the column delimiter a spreadsheet program will automatically split the imported data into two columns.

$ find . -type f -print | sed -r 's|(.*)\/|\1+|'  | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'

The find command lists all files and pipes them to sed, which replaces the last forward slash (/) with a +. This replacement allows awk to operate on the + and split the string in two: the first part is the TYPE and the second part is the FILENAME. awk then swaps the order of the TYPE and FILENAME fields and puts a TAB between them. Now a simple copy-paste of the output into a spreadsheet program will automatically sort the two fields into two different columns.
Detailed explanation:
find . -type f 
selects only files recursively from all sub-directories
sed -r 's|(.*)\/|\1+|'

-r enables extended regular expressions (ERE) in pattern matching

The | delimiter is used instead of the conventional / to avoid having to escape the / characters in the paths.
(.*)\/ matches everything up to and including the last forward slash (/) (sed's pattern matching is greedy).

The match of (.*) is stored in \1 and put back, while the last forward slash (/) is replaced by +.
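A quick way to see the greedy match in action on a single sample path:

```shell
# Only the LAST slash becomes +, because (.*) greedily consumes
# everything up to it.
echo './SUSY/Piguet-9710095v1.pdf' | sed -r 's|(.*)/|\1+|'
# → ./SUSY+Piguet-9710095v1.pdf
```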

awk -F"+" '{print $2"\t"$1}'

-F sets the input field separator to + so that awk can split the input string at the location of the +, which was conveniently inserted at the location of the last forward slash (/) by the previous sed operation.
'{print $2"\t"$1}' prints column 2, a TAB, and column 1 in that order, effectively interchanging the columns and inserting a TAB between them.
The output will look like this:

$ find . -type f -print | sed -r 's|(.*)\/|\1+|'  | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'

1006.1263.pdf entanglement-entropy-holography 
0912.1877.pdf entanglement-entropy-holography 
0911.3160v2.pdf entanglement-entropy-holography 
0912.1877v2.pdf entanglement-entropy-holography 
1010.1682.pdf entanglement-entropy-holography 
zee-1979-PhysRevLett.42.417.pdf graviton-propagator 
dewitt-3-PhysRev.162.1239.pdf graviton-propagator
dewitt-2-PhysRev.162.1195.pdf graviton-propagator
dewitt-1-PhysRev.160.1113.pdf graviton-propagator
Piguet-9710095v1.pdf SUSY 
Olive_susy_9911307v1.pdf SUSY 
sohnius-introducing-susy-1985.pdf SUSY 
khare-cooper-susy-qm-phys.rept-1995.pdf SUSY 
Instantons Versus Supersymmetry9902018v2.pdf SUSY
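Since the directory tree here is only one level deep, the same two-column output can also be produced with awk alone, splitting each path on the slashes and skipping sed entirely (a sketch, assuming every file sits directly inside its TYPE folder):

```shell
# $NF is the last field (FILENAME); $(NF-1) is the field before it (TYPE)
find . -type f | awk -F'/' '{print $NF "\t" $(NF-1)}'
```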


Identifying the delimiter of a CSV file

The following one-liner can be used to detect the delimiter of a CSV file. It does not work on TAB-separated files; it only works on delimited files whose field separator is not whitespace.

$ head -n1 bookmerged.csv  | tr -d 'a-zA-Z0-9' | \
tr -d '"' | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | \
sort -nr | tr -s " " | cut -d" " -f3 | head -n1

This command generates a list of special characters and from that list selects the character with the highest frequency of occurrence. This character is most likely the delimiter of the file, unless some other special character is used heavily; the code will fail when another special character occurs more frequently than the delimiter. An explanation of the code follows.

After head grabs the header line, the two translate commands (tr) remove all letters, numbers, and quotes. This leaves a bunch of special characters, among which the character with the highest frequency of occurrence is most likely the field delimiter.

,,,,,   , ,, , , ,,, ,, , ,/ , , , 

The sed command introduces a newline after every character, effectively putting each character on its own line. .\{1\} matches exactly one character at a time (\{ and \} escape the braces of the interval expression), and & puts the matched character back, followed by the newline. We can also use \0 instead of &. sort -r | uniq -c | sort -nr generates the list of characters in descending order of prevalence.

     20 ,
     14  
      1 /
      1 

The most prevalent character appears at the top of this list. tr -s " " squeezes the multiple spaces into one, and the cut command splits the list on the spaces and selects the third column, which is the delimiter.
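The same idea can also be written with awk doing the counting, which avoids the sed/sort/cut post-processing (a sketch; it assumes GNU awk or mawk, where split with an empty separator splits a string into individual characters, and it makes the same assumption as above that the delimiter is the most frequent special character in the header):

```shell
# Count non-alphanumeric characters in the header line and print the
# most frequent one (assumed to be the delimiter).
head -n1 bookmerged.csv | awk '
{
    n = split($0, chars, "")
    for (i = 1; i <= n; i++)
        if (chars[i] !~ /[[:alnum:]" ]/)
            count[chars[i]]++
    for (c in count)
        if (count[c] > best) { best = count[c]; delim = c }
    print delim
}'
```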

Batch rotate video files

I like to capture videos of scenery while I am driving, with my phone mounted on a car phone holder. The phone keeps recording, and later I edit out the boring parts. This works very well when the phone is already mounted in the horizontal position when the recording starts and ends. However, if for any reason I pick up the phone during recording, the video gets saved in a vertical format, so I have to rotate those videos a lot. They can also be rotated in video editing software like Kdenlive by applying the "Rotate" effect, but then the video gets chopped off at the sides. I find it easier to rotate the videos before importing them into the editing software. Since many videos can end up vertical when they should be horizontal, I wrote a batch script to rotate a bunch of videos at once. The script runs on every mp4 file in the directory and needs to be edited to run on other video formats. Since it runs on all video files, it is best to move the video files that need rotating into a separate folder and copy the script to that folder; running the script in a folder containing all video files might have the undesirable effect of rotating files which do not need to be rotated.

The script is very simple. It is a for loop wrapped around an ffmpeg command.

# Loop over the glob directly (handles filenames with spaces)
for i in *.mp4; do
    echo "$i ====================================================="
    j="rotated-$i"
    ffmpeg -i "$i" -vf "transpose=2" "$j"
done


transpose takes the following values:
0 = 90 degrees CounterClockwise and Vertical Flip (default)
1 = 90 degrees Clockwise
2 = 90 degrees CounterClockwise
3 = 90 degrees Clockwise and Vertical Flip

Setting up a virtual environment for Python

Many specialized tools are written for a specific version of Python, such as Python 2.7, and depend on specific versions of packages, such as pandas 0.7.3. Installing these older versions system-wide would remove the newer versions and create conflicts with existing code. A better option is to create a virtual environment containing only the specific package versions. For example, QSTK does not work with Python 3 or pandas 0.21; it only works with Python 2.7 and pandas 0.7.3. So we create a virtual environment and install these versions there.

virtualenv --python=/usr/bin/python2.7 ~/python2.7-virtual-env


This will create the ~/python2.7-virtual-env directory if it doesn't exist, along with directories inside it containing a copy of the Python interpreter, the standard library, and various supporting files. Now we source the activate script to start the new environment (similar to a chroot environment).

source ~/python2.7-virtual-env/bin/activate


This will start a new environment. To test that we are really in the environment:

$ which python

~/python2.7-virtual-env/bin/python

$ python --version
Python 2.7.12


The environment is now using the local version of Python, which is Python 2.7.
Now we can install the older versions of the required packages.

pip install pandas==0.7.3


==0.7.3 forces installation of version 0.7.3 of pandas, replacing any newer version already installed inside the environment.
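On newer systems the same workflow is available through Python 3's built-in venv module, so a separate virtualenv install is not strictly required (a sketch; the directory name is arbitrary):

```shell
# Create an isolated environment with the stdlib venv module
python3 -m venv ~/python3-virtual-env
source ~/python3-virtual-env/bin/activate

# The environment's own interpreter is now first on the PATH
which python
python --version

# Leave the environment
deactivate
```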

Setting up a virtual environment in Anaconda

Now, Anaconda itself is a virtual environment with the latest versions of scientific and statistical tools. However, there will be instances where certain older code will not run with newer versions of the packages. For example, the pandas-datareader library, which pulls data from Yahoo and Morningstar, is broken in version 0.6.0 (see my GitHub page github.com/saugatach/stockanalysis). Let us say we are trying to work around this issue and want to go back to pandas-datareader v0.5.0 while also keeping the latest v0.6.0. So we create a separate virtual environment within Anaconda called "stocks". The process is very well detailed in the conda docs: conda.io/docs/user-guide/tasks/manage-environments.html.

conda create --name stocks python=3.6 pandas-datareader==0.5.0

This creates a virtual environment called stocks with Python 3.6 and the older pandas-datareader. We can activate the environment with
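The same environment can also be captured declaratively in an environment.yml file and created with conda env create -f environment.yml (a sketch; the default channels are assumed):

```yaml
name: stocks
dependencies:
  - python=3.6
  - pandas-datareader=0.5.0
```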

$ source activate stocks

The CLI prompt should now be prefixed with the environment name. We can check that the correct version of our package is installed.

(stocks) $ pip3 list
........
numpy (1.15.1)
pandas (0.23.4)
pandas-datareader (0.5.0)
pandocfilters (1.4.2)
parso (0.3.1)


Download manager for Chrome

uGet is a download manager that integrates with Chrome and uses aria2, a command-line download client, to speed up downloads. uGet supports segmented downloads, which can increase download speed massively; in fact, we can reach the maximum available bandwidth using uGet, so it can also serve as a practical test of bandwidth capacity. The best feature of uGet is that it can be integrated into Chrome (and Firefox) using a plugin from the Chrome Web Store. However, the Chrome Web Store plugin will not be able to find uGet unless uget-integrator is installed first. The steps to integrate uGet into Chrome as a download manager are as follows. These commands work for Debian-based systems; for other Linux systems the steps remain the same, only the commands differ.

1. Install uGet and aria2

sudo add-apt-repository ppa:plushuang-tw/uget-stable
sudo apt update
sudo apt install uget
sudo apt install aria2


2. Install uGet-integrator

sudo add-apt-repository ppa:uget-team/ppa
sudo apt update
sudo apt install uget-integrator


3. Install uGet Integration from Chrome Web Store

https://chrome.google.com/webstore/detail/uget-integration/efjgjleilhflffpbnkaofpmdnajdpepi/

A little housekeeping might be required. We need to set the default download client to aria2:

Settings -> Plug-In -> aria2

Also, I would set the default download location and the total number of active downloads.

Category -> Properties -> Category Settings -> Active Downloads -> 1 

Category -> Properties -> Default for new download -> Folder: