
Identifying the delimiter of a CSV file

The following one-liner extracts the delimiter of a CSV file. It does not work on tab-separated files; it only works when the field separator is not a whitespace character.

$ head -n1 bookmerged.csv  | tr -d '[a-z][A-Z][0-9]' | \
tr -d '"' | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | \
sort -nr | tr -s " " | cut -d" " -f3 | head -n1

This command builds a list of the special characters in the header line and selects the one that occurs most often. That character is the delimiter unless some other special character occurs more often in the header, in which case the command reports the wrong character. An explanation of the code follows.

After head grabs the column headers, the two tr (translate) commands remove all letters, digits, and double quotes. This leaves a string of special characters, among which the most frequent one is most likely the field delimiter.
,,,,,   , ,, , , ,,, ,, , ,/ , , , 
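The filtering step can be tried on its own. The sketch below runs it on a made-up header line (not taken from bookmerged.csv) and spells the character set as the POSIX class [:alnum:] instead of the three bracketed ranges; strip_header is a hypothetical helper name used only for illustration.

```shell
# Hypothetical header line, used only for illustration.
header='"id","first name","last name","e-mail"'

# Delete letters, digits, and double quotes.
# What remains is punctuation and spaces.
strip_header() {
    printf '%s\n' "$1" | tr -d '[:alnum:]' | tr -d '"'
}

strip_header "$header"    # prints: , , ,-
```

One difference worth knowing: to tr, the square brackets in '[a-z][A-Z][0-9]' are ordinary characters, so the original spelling also deletes any [ or ] in the header, while [:alnum:] leaves them in place.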
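The sed command inserts a newline after every character, effectively putting each character on its own line. The . matches any single character, and \{1\} is an interval expression meaning exactly one occurrence (in basic regular expressions the braces of an interval are written with backslashes; the interval is redundant here but harmless). The & in the replacement stands for the matched text, so each character is replaced by itself followed by a newline. In GNU sed, \0 can be used instead of &. sort -r | uniq -c | sort -nr then lists the characters in descending order of frequency.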
     20 ,
     14  
      1 /
      1 
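The split-and-count step can also be sketched in isolation. The version below swaps in fold -w1 for the sed expression, on the assumption that a portable alternative is wanted: some sed implementations (BSD sed is a common example) do not turn \n in the replacement into a newline. The function name char_counts and the input string are made up for illustration.

```shell
# Count how often each character occurs, most frequent first.
# fold -w1 stands in for the sed step: it wraps the input
# to one character per line.
char_counts() {
    printf '%s\n' "$1" | fold -w1 | sort | uniq -c | sort -nr
}

char_counts ', , ,-'
```

For this input the comma ends up on the first line with a count of 3, followed by the space and the hyphen. The exact amount of padding before each count depends on the uniq implementation.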
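The most prevalent character appears at the top of the sorted list. tr -s " " squeezes each run of spaces into a single space, and cut splits every line on spaces and selects the third field, which is the character itself: the first field is empty because each line starts with a space, and the second field is the count. The final head -n1 keeps only the top line. This is also why a space delimiter cannot be detected: after squeezing, the third field of its line is empty.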
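For repeated use, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: guess_delim is a hypothetical name, fold -w1 replaces the sed step, [:alnum:] replaces the three ranges, and awk picks the second whitespace-separated field in place of tr -s and cut. It shares the limitations of the one-liner: whitespace delimiters are not detected, and a heavily used special character in the header can win over the real delimiter.

```shell
# guess_delim FILE: print the most frequent character in the first line
# of FILE that is not a letter, digit, or double quote.
guess_delim() {
    head -n1 "$1" | tr -d '[:alnum:]"' | fold -w1 |
        sort | uniq -c | sort -nr | awk 'NR == 1 { print $2 }'
}

# Demo on a throwaway file with made-up contents.
sample=$(mktemp)
printf '%s\n' 'id;name;e-mail' '1;Ann;ann@example.com' > "$sample"
guess_delim "$sample"    # prints: ;
rm -f "$sample"
```

Using awk for the last step also avoids closing the pipe early, since it reads the whole list and prints only from the first line.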

