Let us say we have a collection of ebooks or papers/articles sorted in various folders and we want to create a database (or spreadsheet) of those papers or books so that we can add comments or notes next to them.
For example, let us say we have a file structure like this (the output of find . -type f):
./entanglement-entropy-holography/1006.1263.pdf
./entanglement-entropy-holography/0912.1877.pdf
./entanglement-entropy-holography/0911.3160v2.pdf
./entanglement-entropy-holography/0912.1877v2.pdf
./entanglement-entropy-holography/1010.1682.pdf
./graviton-propagator/zee-1979-PhysRevLett.42.417.pdf
./graviton-propagator/dewitt-3-PhysRev.162.1239.pdf
./graviton-propagator/dewitt-2-PhysRev.162.1195.pdf
./graviton-propagator/dewitt-1-PhysRev.160.1113.pdf
./SUSY/Piguet-9710095v1.pdf
./SUSY/Olive_susy_9911307v1.pdf
./SUSY/sohnius-introducing-susy-1985.pdf
./SUSY/khare-cooper-susy-qm-phys.rept-1995.pdf
./SUSY/Instantons Versus Supersymmetry9902018v2.pdf
and we want this list to be converted to a database format with three columns: FILENAME, TYPE, and a column for comments.
The last column is added by the user after the data is imported. In order to import the data in the above format, we need the directory name (TYPE) and the FILENAME to be reversed and printed as columns separated by TAB. We could use any other delimiter, but with TAB as the column delimiter a spreadsheet program will automatically split each imported line into two columns.
$ find . -type f -print | sed -r 's|(.*)\/|\1+|' | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'
The find command lists all files and pipes them to sed, which replaces the last forward slash (/) in each path with a +. This replacement gives awk a marker (+) at which to split the string into two parts: the first part is the TYPE and the second part is the FILENAME. awk then switches the order of the fields TYPE and FILENAME and puts a TAB between them. The final sed removes the leading ./ from each line. Now a simple copy-paste of the output into a spreadsheet program will automatically separate the two fields into two different columns.
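To try the pipeline without touching a real collection, it can be run against a small scratch tree. The following is only a sketch: the temporary directory, the two sample files and the variable names demo_dir and out_file are assumptions made for illustration, and sed -r assumes GNU sed.

```shell
#!/bin/sh
# Build a throwaway directory tree with two sample papers (illustrative names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/SUSY" "$demo_dir/graviton-propagator"
touch "$demo_dir/SUSY/Piguet-9710095v1.pdf" \
      "$demo_dir/graviton-propagator/zee-1979-PhysRevLett.42.417.pdf"

# The result is written outside the tree so that find does not list it.
out_file=$(mktemp)

# Same pipeline as in the article, run from inside the tree.
# sort only makes the line order predictable; it is not part of the original command.
(
  cd "$demo_dir" &&
  find . -type f -print | sed -r 's|(.*)\/|\1+|' | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'
) | sort > "$out_file"

# Each line is FILENAME, TAB, TYPE.
cat "$out_file"
```

Saving the result to a file instead of copy-pasting it lets the spreadsheet program import it directly as tab-separated text.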
Detailed explanation:
find . -type f
selects only files recursively from all sub-directories
sed -r 's|(.*)\/|\1+|'
-r enables extended regular expressions, so that the grouping parentheses ( and ) do not need to be escaped.
| is used as the delimiter instead of the conventional / to avoid confusion with the / being replaced in the strings.
(.*)\/ matches everything up to and including the last forward slash (/), because .* is greedy and matches as much as possible.
The part matched by (.*) is stored in \1 and put back, while the forward slash (/) is replaced by +.
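The effect of the greedy match is easiest to see on a path with more than one directory level. In this small sketch the nested path and the variable names sample, marked and first are made up for illustration; the second command is shown only for contrast and is not part of the pipeline.

```shell
# A hypothetical path with two directory levels.
sample='./SUSY/reviews/sohnius-introducing-susy-1985.pdf'

# Greedy (.*) runs up to the LAST slash, so only that slash becomes +.
marked=$(printf '%s\n' "$sample" | sed -r 's|(.*)\/|\1+|')
printf '%s\n' "$marked"   # ./SUSY/reviews+sohnius-introducing-susy-1985.pdf

# Without the (.*) group, sed replaces the FIRST slash instead.
first=$(printf '%s\n' "$sample" | sed 's|/|+|')
printf '%s\n' "$first"    # .+SUSY/reviews/sohnius-introducing-susy-1985.pdf
```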
awk -F"+" '{print $2"\t"$1}'
-F sets the input field separator to + so that awk can split the input string at the location of the +, which was conveniently inserted in place of the last forward slash (/) by the previous sed operation.
'{print $2"\t"$1}' prints column 2, TAB, and column 1 in that order, effectively interchanging the columns and inserting a TAB between them.
sed 's|\.\/||'
removes the ./ that find prints at the start of every path, so that the TYPE column contains only the directory name.
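The awk step can be checked on its own by feeding it a single line in the form that the first sed produces. The variable names line and swapped are assumptions for this sketch.

```shell
# One line as it looks after the first sed: directory, +, filename.
line='./SUSY+Piguet-9710095v1.pdf'

# Split at +, then print field 2 (filename), a TAB, and field 1 (directory).
swapped=$(printf '%s\n' "$line" | awk -F"+" '{print $2"\t"$1}')
printf '%s\n' "$swapped"   # Piguet-9710095v1.pdf<TAB>./SUSY
```

Because awk splits at every +, a directory or file name that itself contains a + would be cut short, so this approach assumes that no path in the collection contains that character.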
The output will look like this:
$ find . -type f -print | sed -r 's|(.*)\/|\1+|' | awk -F"+" '{print $2"\t"$1}' | sed 's|\.\/||'
1006.1263.pdf	entanglement-entropy-holography
0912.1877.pdf	entanglement-entropy-holography
0911.3160v2.pdf	entanglement-entropy-holography
0912.1877v2.pdf	entanglement-entropy-holography
1010.1682.pdf	entanglement-entropy-holography
zee-1979-PhysRevLett.42.417.pdf	graviton-propagator
dewitt-3-PhysRev.162.1239.pdf	graviton-propagator	Difficult
dewitt-2-PhysRev.162.1195.pdf	graviton-propagator	Difficult
dewitt-1-PhysRev.160.1113.pdf	graviton-propagator	Difficult
Piguet-9710095v1.pdf	SUSY
Olive_susy_9911307v1.pdf	SUSY
sohnius-introducing-susy-1985.pdf	SUSY
khare-cooper-susy-qm-phys.rept-1995.pdf	SUSY
Instantons Versus Supersymmetry9902018v2.pdf	SUSY
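As a possible alternative, and not the method described above, GNU find can print the two columns itself with its -printf action, where %f is the file name without leading directories and %h is the directory part. This avoids the + marker and so also copes with names that contain a +. The sketch assumes GNU find and GNU sed (BSD and macOS find have no -printf); the scratch directory, the sample file name and the variables demo_dir and listing are made up for illustration.

```shell
# Scratch tree with a file name that contains + and spaces (illustrative).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/SUSY"
touch "$demo_dir/SUSY/C++ and SUSY.pdf"

# %f = file name, %h = directory; the sed strips the ./ in front of the directory.
listing=$(cd "$demo_dir" && find . -type f -printf '%f\t%h\n' | sed 's|\t\./|\t|')
printf '%s\n' "$listing"   # C++ and SUSY.pdf<TAB>SUSY
```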