The one-liner below extracts the delimiter of a CSV file. It does not work on tab-separated files; it only works on delimited files whose field separator is not a whitespace character.
The command builds a list of the special characters in the header line and selects the one that occurs most often. That character is taken to be the delimiter, so the command gives the wrong answer whenever some other special character occurs more often in the header than the delimiter does. The steps are explained below.
After head grabs the header line, the first two translate commands (tr) remove all letters, digits, and double quotes. This leaves only the special characters, and the one that occurs most often is most likely the field delimiter.
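As a small illustration of this stripping step, the sketch below runs the same two tr commands on a made-up header line. The header text and the variable names are assumptions for the example and are not taken from bookmerged.csv.

```shell
# Hypothetical header line (made up for this sketch, not from bookmerged.csv)
header='"Title","Author Name","Pub/Year",ISBN13'

# Same two tr stages as the one-liner: drop letters and digits, then drop double quotes
stripped=$(printf '%s\n' "$header" | tr -d '[a-z][A-Z][0-9]' | tr -d '"')

# Only separators and other symbols remain
printf '%s\n' "$stripped"   # prints: , ,/,
```

Three commas, one space, and one slash survive, so the comma is already the obvious candidate.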
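The sed command inserts a newline after every character, effectively putting each character on its own line. In the pattern, . matches any single character and \{1\} is a repetition count of one (in a basic regular expression the braces must be written as \{ and \} to act as a count rather than as literal braces). In the replacement, & stands for the matched text, so each character is replaced by itself followed by a newline. With GNU sed, \0 can be used instead of &. Next, sort -r groups identical characters together, uniq -c prefixes each distinct character with its count, and sort -nr orders the result by count, giving the list of characters in descending order of prevalence.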
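The counting stage can be tried on its own. This is a minimal sketch that assumes GNU sed (which accepts \n in the replacement) and uses a made-up string of leftover characters; the variable names are hypothetical.

```shell
# Hypothetical leftover characters after the tr stage: three commas, one space, one slash
leftover=', ,/,'

# One character per line, then count each distinct character and sort by count
counts=$(printf '%s\n' "$leftover" | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | sort -nr)
printf '%s\n' "$counts"

# The first line holds the most frequent character, preceded by its count
top=$(printf '%s\n' "$counts" | head -n1)
printf '%s\n' "$top"
```

The list also contains one count for an empty line, because sed leaves a trailing newline after the last character. The order of the entries that share a count of 1 depends on the locale, but the top entry here is always the comma with a count of 3.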
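The most prevalent character appears at the top of this list. Each line produced by uniq -c consists of leading spaces, the count, a space, and the character. tr -s " " squeezes each run of spaces into a single space, and the cut command splits each line on spaces and selects the third field: the first field is empty because of the leading space, the second is the count, and the third is the character itself. The final head -n1 keeps only the top line, which is the delimiter.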
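The field selection can be checked in isolation. The sketch below feeds two hand-written lines in the shape that uniq -c produces through the same tr and cut commands; the sample lines and variable names are assumptions for illustration.

```shell
# A line shaped like uniq -c output: padding, count, one space, the character
line='     20 ,'
delim=$(printf '%s\n' "$line" | tr -s " " | cut -d" " -f3)
printf '%s\n' "$delim"   # prints: ,

# The same steps when the counted character is itself a space:
# squeezing merges it with the separator, so the third field comes out empty
space_line='     14  '
space_case=$(printf '%s\n' "$space_line" | tr -s " " | cut -d" " -f3)
printf '[%s]\n' "$space_case"   # prints: []
```

The second case shows why a space-delimited file cannot be detected with this approach.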
$ head -n1 bookmerged.csv | tr -d '[a-z][A-Z][0-9]' | \
tr -d '"' | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | \
sort -nr | tr -s " " | cut -d" " -f3 | head -n1
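For repeated use, the pipeline can be wrapped in a shell function. The following is a sketch under stated assumptions: detect_delim is a hypothetical name, the sample data is made up, and GNU sed is assumed for the \n in the replacement. The second call reproduces the failure mode described earlier, where another special character outnumbers the real delimiter.

```shell
# detect_delim is a hypothetical helper name; it runs the article's pipeline on file $1
detect_delim() {
    head -n1 "$1" | tr -d '[a-z][A-Z][0-9]' | tr -d '"' |
        sed 's/.\{1\}/&\n/g' | sort -r | uniq -c |
        sort -nr | tr -s " " | cut -d" " -f3 | head -n1
}

# Made-up semicolon-delimited sample file
sample=$(mktemp)
printf '%s\n' 'id;name;pub/year;isbn' '1;Dune;1965;9780441172719' > "$sample"
result=$(detect_delim "$sample")
printf '%s\n' "$result"   # prints: ;

# Failure case: the header has three slashes but only one comma
printf '%s\n' 'pub/year/month/day,isbn' > "$sample"
wrong=$(detect_delim "$sample")
printf '%s\n' "$wrong"    # prints: / (not the real delimiter)

rm -f "$sample"
```

Only the header line is inspected, so the data rows do not influence the result.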
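For bookmerged.csv, the characters left after the two tr commands are:
,,,,, , ,, , , ,,, ,, , ,/ , , ,
The frequency list produced by sort -r | uniq -c | sort -nr is:
20 ,
14
1 /
1
The comma occurs 20 times, the space 14 times, and the slash once; the last count belongs to the empty line that sed leaves at the end of its output.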