
Posts

Showing posts from June, 2018

Identifying delimiter of a CSV file

The following one-liner extracts the delimiter of a CSV file. It does not work on TAB-separated files; it only works on delimited files whose field separator is not a whitespace character.

$ head -n1 bookmerged.csv | tr -d '[a-z][A-Z][0-9]' | \
  tr -d '"' | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | \
  sort -nr | tr -s " " | cut -d " " -f3 | head -n1

This command generates a list of the special characters in the header line and selects the one with the highest frequency of occurrence. That character is taken to be the delimiter, so the command fails whenever some other special character occurs in the header more often than the delimiter does. An explanation of this code is as follows. After head grabs the column headers, the first two translate commands (tr) remove all letters, digits, and double quotes. This leaves a bunch of s

Swap columns of CSV file from Linux terminal

Swapping columns is an integral part of data analysis. With GUI spreadsheet programs it is simply a four-step process. Suppose ColumnA and ColumnB need to be swapped. Then the following sequence does the job.

1. Create a new column before ColumnA
2. Cut ColumnB into this new column
3. Cut ColumnA to the location of ColumnB
4. Delete the empty column

However, for massive databases, the spreadsheet program is neither adequate nor recommended. The software will take a long time to load the file, and may even stall while loading the large database. A simpler solution is to use AWK to swap the columns of the database. AWK streams the file line by line, so this method is fast and memory-efficient. A typical AWK command to rearrange the columns of a database looks like

awk -F ',' 'BEGIN{OFS=",";} {print $1, $5, $3, $4, $2}' test.csv

This command swaps column 2 with column 5 of a five-column file. This command is simple and elegant, but it has its drawbacks. The user needs to type all the c
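One way to avoid listing every column is to pass the two column numbers as AWK variables and exchange the fields in place. The sketch below is an illustration, not necessarily the approach the post goes on to describe; the function name swap_cols is hypothetical, and it assumes plain comma-separated data with no quoted fields that contain commas.

```shell
# swap_cols A B [FILE...]  (hypothetical helper name)
# Swaps columns A and B of comma-separated input; reads stdin if no FILE is given.
swap_cols() {
    _a=$1
    _b=$2
    shift 2
    awk -F ',' -v a="$_a" -v b="$_b" '
        BEGIN { OFS = "," }
        {
            t = $a; $a = $b; $b = t   # exchange the two fields
            print                     # the record is rebuilt using OFS
        }' "$@"
}
```

For example, swap_cols 2 5 test.csv produces the same output as the explicit print statement above on a five-column file, and it keeps working unchanged when the file has more columns.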

Testing Central Limit Theorem with R

In this article, we will verify the Central Limit Theorem, which says that the distribution of the means of samples drawn from the distribution of a random variable approaches a normal distribution as the sample size increases. Put simply, if multiple samples are taken from a distribution (normal or otherwise, as long as its variance is finite) and the mean of each sample is computed, then the collection of sample means so generated will itself form a distribution, and that distribution will be approximately the Normal Distribution (provided the sample size is large). A closely related result, the law of large numbers, says that the sample mean will approach the population mean as the sample size goes to infinity (or the population limit). One way to verify this statement is to do the sampling using random variables generated by R and then calculate the sample mean for each set of random numbers. Using R we will generate a sample of N normal random numbers and repeat that sampling 20 times each time finding the mean of the sample of the
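The sampling experiment can be sketched from the terminal as well. The article itself uses R; the version below swaps in AWK so that it runs in the same shell as the other posts on this page, and it draws standard normal numbers with the Box-Muller transform. The function name sample_means, the default of 20 repetitions, and the seed argument are assumptions made for illustration.

```shell
# sample_means N [REPS] [SEED]  (hypothetical helper name)
# Prints REPS sample means, each computed from N standard-normal draws.
sample_means() {
    awk -v n="$1" -v reps="${2:-20}" -v seed="${3:-1}" 'BEGIN {
        srand(seed)
        pi = atan2(0, -1)
        for (r = 1; r <= reps; r++) {
            sum = 0
            for (i = 1; i <= n; i++) {
                # Box-Muller: two uniform draws give one standard normal draw;
                # 1 - rand() lies in (0, 1], so log() never sees zero
                sum += sqrt(-2 * log(1 - rand())) * cos(2 * pi * rand())
            }
            printf "%.5f\n", sum / n
        }
    }'
}
```

Running sample_means 10 and then sample_means 10000 should show the 20 means clustering more tightly around the population mean of 0 as N grows, since the standard deviation of the sample mean is 1/sqrt(N) for these draws.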

Convert file listing to database format

Let us say we have a collection of ebooks or papers/articles sorted into various folders, and we want to create a database (or spreadsheet) of those papers or books so that we can add comments or notes next to them. For example, let us say we have a file structure like the following (output of find . -type f):

./entanglement-entropy-holography/1006.1263.pdf
./entanglement-entropy-holography/0912.1877.pdf
./entanglement-entropy-holography/0911.3160v2.pdf
./entanglement-entropy-holography/0912.1877v2.pdf
./entanglement-entropy-holography/1010.1682.pdf
./graviton-propagator/zee-1979-PhysRevLett.42.417.pdf
./graviton-propagator/dewitt-3-PhysRev.162.1239.pdf
./graviton-propagator/dewitt-2-PhysRev.162.1195.pdf
./graviton-propagator/dewitt-1-PhysRev.160.1113.pdf
./SUSY/Piguet-9710095v1.pdf
./SUSY/Olive_susy_9911307v1.pdf
./SUSY/sohnius-introducing-susy-1985.pdf
./SUSY/khare-cooper-susy-qm-phys.rept-1995.pdf
./SUSY/Instantons Versus Supersymmetry9902018v2.pdf

and we want this list to b
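As a rough illustration of turning such a listing into spreadsheet-ready rows, the sketch below reads paths on standard input and writes one quoted CSV row per file. The target layout (a folder column and a file column) and the function name paths_to_csv are assumptions, since the excerpt does not show the format the post arrives at; file names containing double quotes are not handled.

```shell
# paths_to_csv  (hypothetical helper name)
# Reads one path per line on stdin and prints "folder","file" CSV rows.
paths_to_csv() {
    printf '"folder","file"\n'
    while IFS= read -r path; do
        path=${path#./}               # drop the leading ./ that find prints
        dir=$(dirname "$path")        # folder part of the path
        file=$(basename "$path")      # file name, spaces preserved
        printf '"%s","%s"\n' "$dir" "$file"
    done
}
```

A typical call would be find . -type f | paths_to_csv > papers.csv, where papers.csv is a placeholder name; further columns for comments or notes can then be added in the spreadsheet.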

List files with absolute pathname in linux

ls -d $PWD/*

The shell expands $PWD/* to the absolute path of the present working directory followed by each entry that * matches in it. ls displays that list, while -d makes ls show each directory itself instead of descending into it and listing its contents. We can also print the list of files in all sub-directories, relative to the current directory:

find . -type f
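The two commands can be combined to get absolute paths for the files in all sub-directories, because find prefixes every result with the starting point it was given. A minimal sketch, with the function name list_abs_files being hypothetical:

```shell
# list_abs_files  (hypothetical helper name)
# Prints the absolute path of every regular file below the current directory.
list_abs_files() {
    # "$PWD" is quoted so that directory names containing spaces survive
    find "$PWD" -type f
}
```

Quoting in the same way, ls -d "$PWD"/* also works when the working directory path contains spaces.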