Convert file listing to database format

Let us say we have a collection of ebooks or papers/articles sorted in various folders and we want to create a database (or spreadsheet) of those papers or books so that we can add comments or notes next to them.

For example, let us say we have a file structure like this (the output of find . -type f):

./entanglement-entropy-holography/1006.1263.pdf
./entanglement-entropy-holography/0912.1877.pdf
./entanglement-entropy-holography/0911.3160v2.pdf
./entanglement-entropy-holography/0912.1877v2.pdf
./entanglement-entropy-holography/1010.1682.pdf
./graviton-propagator/zee-1979-PhysRevLett.42.417.pdf
./graviton-propagator/dewitt-3-PhysRev.162.1239.pdf
./graviton-propagator/dewitt-2-PhysRev.162.1195.pdf
./graviton-propagator/dewitt-1-PhysRev.160.1113.pdf
./SUSY/Piguet-9710095v1.pdf
./SUSY/Olive_susy_9911307v1.pdf
./SUSY/sohnius-introducing-susy-1985.pdf
./SUSY/khare-cooper-susy-qm-phys.rept-1995.pdf
./SUSY/Instantons Versus Supersymmetry9902018v2.pdf

and we want to convert this list to a tab-separated database format like this:

Article	Type	Notes
1006.1263.pdf	entanglement-entropy-holography
0912.1877.pdf	entanglement-entropy-holography
0911.3160v2.pdf	entanglement-entropy-holography
0912.1877v2.pdf	entanglement-entropy-holography
1010.1682.pdf	entanglement-entropy-holography
zee-1979-PhysRevLett.42.417.pdf	graviton-propagator
dewitt-3-PhysRev.162.1239.pdf	graviton-propagator	Difficult
dewitt-2-PhysRev.162.1195.pdf	graviton-propagator	Difficult
dewitt-1-PhysRev.160.1113.pdf	graviton-propagator	Difficult
Piguet-9710095v1.pdf	SUSY
Olive_susy_9911307v1.pdf	SUSY
sohnius-introducing-susy-1985.pdf	SUSY
khare-cooper-susy-qm-phys.rept-1995.pdf	SUSY
Instantons Versus Supersymmetry9902018v2.pdf	SUSY	Random comment

The last column is added by the user after the data is imported. To import the data in the above format, the directory name (TYPE) and the FILENAME must be swapped and printed as columns separated by a TAB. Any other delimiter would work, but with TAB as the column delimiter a spreadsheet program automatically splits each pasted line into two columns.

***$ find . -type f -print | sed -r 's|(.*)\/|\1+|' | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'***

The find command lists all files and pipes the list to sed, which replaces the last forward slash (/) in each path with a +. This marker lets awk split each line at that location into two fields: the first is the TYPE (the directory) and the second is the FILENAME. awk then prints the two fields in swapped order with a TAB between them, and the final sed strips the leading ./ from the directory. Now a simple copy-paste of the output into a spreadsheet program will automatically split the two fields into two different columns.
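To try the pipeline without touching a real collection, here is a minimal sketch that builds a throwaway directory and runs the same command on it. The temporary directory, the variable names, and the trailing sort (added only to make the order predictable) are assumptions for this demo, not part of the original command. It also uses -E, the portable spelling of GNU sed's -r, and drops the backslash before /, which is not needed when | is the delimiter.

```shell
# Throwaway tree with two file names borrowed from the listing above.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/SUSY" "$demo_dir/graviton-propagator"
touch "$demo_dir/SUSY/Piguet-9710095v1.pdf" \
      "$demo_dir/graviton-propagator/zee-1979-PhysRevLett.42.417.pdf"

# Same pipeline as in the post, run from inside the demo directory.
listing=$(cd "$demo_dir" && find . -type f -print \
  | sed -E 's|(.*)/|\1+|' \
  | awk -F"+" '{print $2"\t"$1}' \
  | sed 's|\./||' \
  | sort)

printf '%s\n' "$listing"   # FILENAME, a TAB, then TYPE on each line
rm -rf "$demo_dir"
```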

Detailed explanation:

find . -type f

lists only regular files (not directories), descending recursively into all sub-directories

sed -r 's|(.*)\/|\1+|'

-r turns on extended regular expressions in GNU sed (-E is the portable spelling), so that the grouping parentheses ( ) can be written without backslashes.

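| is used as the delimiter of the s command instead of the conventional /, to avoid confusion with the / that is being replaced in the strings.

(.*)\/ matches everything up to and including the last forward slash (/). The .* is greedy: it matches as much of the line as it can, so the / that follows it is the last one.

The text matched by (.*) is stored in \1 and put back, while the forward slash (/) is replaced by +.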
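The greedy match is easiest to see on a path with more than one directory level. The sketch below feeds a single made-up nested path (the physics/ level is hypothetical, not from the listing above) through the same substitution, written with -E and an unescaped /.

```shell
# Only the LAST slash becomes +, because .* grabs as much as it can.
nested=$(printf '%s\n' './physics/SUSY/Piguet-9710095v1.pdf' \
  | sed -E 's|(.*)/|\1+|')

printf '%s\n' "$nested"   # ./physics/SUSY+Piguet-9710095v1.pdf
```

For nested folders the TYPE column therefore ends up holding the whole relative directory path (physics/SUSY here), not just the innermost folder name.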

awk -F"+" '{print $2"\t"$1}'

-F sets the input field separator to + so that awk splits the input string at the +, which the previous sed operation conveniently inserted at the location of the last forward slash (/).

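'{print $2"\t"$1}' prints column 2, a TAB, and column 1 in that order, effectively interchanging the columns and inserting a TAB between them.

sed 's|\.\/||'

removes the leading ./ that find puts in front of every path, so that the TYPE column holds just the directory name.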
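One caveat of using + as a temporary marker: a file name that itself contains a + (say, a hypothetical C++ primer.pdf) gets split at its own + signs as well, so awk prints only part of the name. If that matters for your collection, one possible workaround is to let awk find the last slash itself. The function name swap_on_last_slash below is made up for this sketch; it is an alternative, not the command used in this post.

```shell
# Read paths like ./dir/file on stdin, print "file<TAB>dir".
# No marker character is inserted, so + in names is harmless.
swap_on_last_slash() {
  awk '{
    path = $0
    sub(/^\.\//, "", path)            # drop the leading ./ added by find
    n = split(path, parts, "/")       # the last piece is the file name
    file = parts[n]
    # everything before the last slash is the directory ("." if none)
    dir = (n > 1) ? substr(path, 1, length(path) - length(file) - 1) : "."
    print file "\t" dir
  }'
}

printf '%s\n' './notes/C++ primer.pdf' | swap_on_last_slash
```

It would be used as find . -type f -print | swap_on_last_slash, in place of the sed and awk stages.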

The output will look like this

$ find . -type f -print | sed -r 's|(.*)\/|\1+|'  | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'

1006.1263.pdf	entanglement-entropy-holography
0912.1877.pdf	entanglement-entropy-holography
0911.3160v2.pdf	entanglement-entropy-holography
0912.1877v2.pdf	entanglement-entropy-holography
1010.1682.pdf	entanglement-entropy-holography
zee-1979-PhysRevLett.42.417.pdf	graviton-propagator
dewitt-3-PhysRev.162.1239.pdf	graviton-propagator
dewitt-2-PhysRev.162.1195.pdf	graviton-propagator
dewitt-1-PhysRev.160.1113.pdf	graviton-propagator
Piguet-9710095v1.pdf	SUSY
Olive_susy_9911307v1.pdf	SUSY
sohnius-introducing-susy-1985.pdf	SUSY
khare-cooper-susy-qm-phys.rept-1995.pdf	SUSY
Instantons Versus Supersymmetry9902018v2.pdf	SUSY
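If you would rather open a file in the spreadsheet program than copy-paste from the terminal, the same output can be saved with a header row matching the Article, Type, Notes columns shown earlier. This is only a sketch: the function name write_listing_tsv and both of its arguments are made up here, and -E is used as the portable spelling of -r.

```shell
# Usage: write_listing_tsv SOURCE_DIR OUTPUT_FILE
write_listing_tsv() {
  src_dir=$1
  out_file=$2
  {
    # Header row; the Notes column stays empty for the user to fill in.
    printf 'Article\tType\tNotes\n'
    (cd "$src_dir" && find . -type f -print) \
      | sed -E 's|(.*)/|\1+|' \
      | awk -F"+" '{print $2"\t"$1}' \
      | sed 's|\./||'
  } > "$out_file"
}
```

For example, write_listing_tsv . /tmp/articles.tsv would scan the current folder. Keeping the output file outside the scanned folder avoids it showing up in its own listing.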
