Swapping columns is an integral part of data analysis, and in GUI spreadsheet programs it is a simple four-step process. Suppose ColumnA and ColumnB need to be swapped; the following sequence does the job.
- Create a new column before ColumnA
- Cut ColumnB into this new column
- Cut ColumnA to the location of ColumnB
- Delete the empty column
awk -F ',' 'BEGIN{OFS=",";} {print $1, $5, $3, $4, $2}' test.csv
This command swaps column 2 with column 5 of a five-column CSV file. It is simple and elegant, but it has its drawbacks. The user needs to type every column number by hand, which becomes tedious as the number of columns increases; a huge database might have more than 50 columns. A mistake in the hand-typed ordering is also easy to make and hard to spot. The solution is a shell script that achieves the same result with an AWK command but is more user-friendly.
#!/bin/bash
# swapcolumns v2.1 - this script swaps two columns of a csv file
# Usage: ./swapcolumns [inputfilename.csv] [outputfilename.csv] [delimiter]

# Input file test
if [ -z "$1" ]; then
    echo "Input filename absent"
    read -p 'Enter input filename: ' inputfilename
else
    inputfilename=$1
fi
# check if file exists
if [ ! -f "$inputfilename" ]; then
    echo "File $inputfilename does not exist. Exiting."
    exit 1
fi

# Output file test
if [ -z "$2" ]; then
    echo "Output filename absent"
    read -p 'Enter output filename: ' outputfilename
else
    outputfilename=$2
fi

# Find column delimiter
detecteddelimiter=$(head -n1 "$inputfilename" | tr -d '[a-z][A-Z][0-9]' \
    | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | sort -nr \
    | tr -s " " | cut -d" " -f3 | head -n1)
if [ -z "$3" ]; then
    echo "Delimiter detected: $detecteddelimiter"
    read -p 'Press enter to accept this as the delimiter or enter one: ' delimiter
    if [ -z "$delimiter" ]; then
        delimiter=$detecteddelimiter
    fi
else
    delimiter=$3
fi

# Number the column headers so the user can pick columns by number
head -n1 "$inputfilename" | tr "$delimiter" '\n' | nl
echo
read -p 'Enter column 1: ' col1
read -p 'Enter column 2: ' col2
columnlength=$(head -n1 "$inputfilename" | tr "$delimiter" '\n' | wc -l)

# Build the string of column numbers in the swapped order
awkstr=""
for i in $(seq 1 "$columnlength"); do
    if (( i == columnlength )); then
        # last column: no trailing comma
        if (( i == col1 )); then
            awkstr=$awkstr" $"$col2
        elif (( i == col2 )); then
            awkstr=$awkstr" $"$col1
        else
            awkstr=$awkstr" $"$i
        fi
    else
        if (( i == col1 )); then
            awkstr=$awkstr" $"$col2","
        elif (( i == col2 )); then
            awkstr=$awkstr" $"$col1","
        else
            awkstr=$awkstr" $"$i","
        fi
    fi
done

# '"$variable"' - the double quotes expand the variable to its value while
# the single quotes let AWK interpret it as an argument to print
# "'"$delimiter"'" requires an extra "" around it so that it is interpreted
# as OFS="," and not as OFS=,
awk -F "$delimiter" 'BEGIN{OFS = "'"$delimiter"'" ;} {print '"$awkstr"' }' "$inputfilename" > "$outputfilename"
echo "Output written to" "$outputfilename"
After the usual checks for missing filenames, the script extracts the header line using head (which outputs the first part of a file) and replaces the column delimiter with a newline using tr, producing a list of column headers. An nl command numbers the lines, making it easier for the user to choose the columns. The script then reads the column numbers from the input stream using the read command. Once the user input is loaded, a similar one-liner counts the columns, and a loop generates the string of column numbers in the order specified by the user (stored in the variable awkstr). The loop checks for the special case where one of the swapped columns is the last column of the database and avoids appending a separating character (comma) in that case.
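The loop's effect is easiest to see on a small example. The sketch below replays the awkstr-building logic for a hypothetical 3-column file with columns 1 and 2 swapped; col1, col2, and columnlength are hard-coded stand-ins for the values the script reads from the user.

```shell
#!/bin/bash
# Replay of the awkstr-building loop for a 3-column file, swapping
# columns 1 and 2 (these values would normally come from read).
col1=1; col2=2; columnlength=3
awkstr=""
for i in $(seq 1 $columnlength); do
    if (( i == columnlength )); then
        # last column: no trailing comma
        if   (( i == col1 )); then awkstr=$awkstr" $"$col2
        elif (( i == col2 )); then awkstr=$awkstr" $"$col1
        else                       awkstr=$awkstr" $"$i
        fi
    else
        if   (( i == col1 )); then awkstr=$awkstr" $"$col2","
        elif (( i == col2 )); then awkstr=$awkstr" $"$col1","
        else                       awkstr=$awkstr" $"$i","
        fi
    fi
done
echo "$awkstr"    # the generated field list: $2, $1, $3
```

The generated string is exactly what a hand-typed print argument list would look like, which is why it can be spliced straight into the AWK command.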
The AWK command needs some explanation. The '"$awkstr"' variable, which holds the string that reorders the columns, must be expanded inside the AWK command and at the same time interpreted as an argument to print. The double quotes around the bash variable expand it to its value, while the single quotes close and reopen the single-quoted AWK program, so the expanded string reaches AWK as a literal part of its print statement. The quotes around the delimiter variable work in the same manner, except that an extra pair of double quotes is needed around it so that it is interpreted as OFS="," and not as OFS=, (the absence of quotes generates an AWK syntax error).
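A minimal illustration of this quote splicing, with hypothetical values standing in for the script's user input:

```shell
#!/bin/bash
# The single-quoted AWK program is closed, the bash variable is expanded
# inside double quotes, and the single quotes then reopen; AWK never sees
# the shell quotes, only the spliced-in text.
delimiter=","
awkstr=' $2, $1, $3'
printf 'a,b,c\n' | awk -F "$delimiter" 'BEGIN{OFS = "'"$delimiter"'" ;} {print '"$awkstr"' }'
# prints: b,a,c
```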
To see the script in action, suppose we want to swap columns 1 and 2 of the test file displayed below. The test CSV file has three columns - Test, NumA, and NumB.
$ cat test.csv
Test,NumA,NumB
A,2,9
B,2,9
C,6,17
D,1,7
E,3,11
F,3,11
G,6,17
H,5,15
When the script is run on this test data, it automatically detects the delimiter and displays a numbered list of the columns, then asks the user for the column numbers that need to be swapped. The output of the script, stored in the file out.csv, is appended below.
$ ./swapcolumns.sh test.csv out.csv
Delimiter detected: ,
Press enter to accept this as the delimiter or enter one:
     1  Test
     2  NumA
     3  NumB

Enter column 1: 1
Enter column 2: 2
$ cat out.csv
NumA,Test,NumB
2,A,9
2,B,9
6,C,17
1,D,7
3,E,11
3,F,11
6,G,17
5,H,15
The ordering of column numbers is immaterial.
$ ./swapcolumns.sh test.csv out.csv
Delimiter detected: ,
Press enter to accept this as the delimiter or enter one:
     1  Test
     2  NumA
     3  NumB

Enter column 1: 2
Enter column 2: 1
$ cat out.csv
NumA,Test,NumB
2,A,9
2,B,9
6,C,17
1,D,7
3,E,11
3,F,11
6,G,17
5,H,15
The automatic detection of the delimiter (see the variable $detecteddelimiter in the code above) is a very useful, if lengthy, piece of code, and is discussed in a subsequent post on identifying the delimiter of a CSV file.
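As a rough sketch of how that pipeline behaves on this post's test header: it deletes letters and digits, splits the leftover characters one per line, counts occurrences, and keeps the most frequent character. (Note this sketch assumes GNU sed, which interprets \n in the replacement text as a newline; BSD sed behaves differently.)

```shell
#!/bin/bash
# Delete letters and digits from the header, split what remains into one
# character per line, count occurrences, and keep the most frequent
# character - here, the comma.
header='Test,NumA,NumB'
echo "$header" | tr -d '[a-z][A-Z][0-9]' | sed 's/.\{1\}/&\n/g' \
    | sort -r | uniq -c | sort -nr | tr -s " " | cut -d" " -f3 | head -n1
# prints: ,
```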
When one of the swapped columns is the last column of the database, no trailing separator may be appended; this special case is handled in the script by the first if statement inside the for loop.
$ ./swapcolumns.sh test.csv test.csv
     1  Test
     2  NumA
     3  NumB

Enter column 1: 2
Enter column 2: 3
$ cat test.csv
Test,NumB,NumA
A,9,2
B,9,2
C,17,6
D,7,1
E,11,3
F,11,3
G,17,6
H,15,5
The script's error handlers catch wrong, misspelled, and non-existent filenames.
$ ./swapcolumns.sh
Input filename absent
Enter input filename: junk.csv
File junk.csv does not exist. Exiting.
The script can be easily modified to remove columns instead of swapping them.
#!/bin/bash
# removecolumns v1.0 - this script removes columns from a csv file
# Usage: ./removecolumns [inputfilename.csv] [outputfilename.csv] [delimiter]

# Input file test
if [ -z "$1" ]; then
    echo "Input filename absent"
    read -p 'Enter input filename: ' inputfilename
else
    inputfilename=$1
fi
# check if file exists
if [ ! -f "$inputfilename" ]; then
    echo "File $inputfilename does not exist. Exiting."
    exit 1
fi

# Output file test
if [ -z "$2" ]; then
    echo "Output filename absent"
    read -p 'Enter output filename: ' outputfilename
else
    outputfilename=$2
fi

# Find column delimiter
detecteddelimiter=$(head -n1 "$inputfilename" | tr -d '[a-z][A-Z][0-9]' \
    | sed 's/.\{1\}/&\n/g' | sort -r | uniq -c | sort -nr \
    | tr -s " " | cut -d" " -f3 | head -n1)
if [ -z "$3" ]; then
    echo "Delimiter detected: $detecteddelimiter"
    read -p 'Press enter to accept this as the delimiter or enter one: ' delimiter
    if [ -z "$delimiter" ]; then
        delimiter=$detecteddelimiter
    fi
else
    delimiter=$3
fi

# Number the column headers so the user can pick a column by number
head -n1 "$inputfilename" | tr "$delimiter" '\n' | nl
echo
read -p 'Enter column to be removed: ' col1
columnlength=$(head -n1 "$inputfilename" | tr "$delimiter" '\n' | wc -l)

# Build the string of column numbers with the chosen column left out
awkstr=""
for i in $(seq 1 "$columnlength"); do
    if (( i == columnlength )); then
        if (( i != col1 )); then
            awkstr=$awkstr" $"$i
        fi
    else
        if (( i != col1 )); then
            awkstr=$awkstr" $"$i","
        fi
    fi
done
# strip the trailing comma left behind when the removed column is the last one
awkstr=${awkstr%,}

# '"$variable"' - the double quotes expand the variable to its value while
# the single quotes let AWK interpret it as an argument to print
# "'"$delimiter"'" requires an extra "" around it so that it is interpreted
# as OFS="," and not as OFS=,
awk -F "$delimiter" 'BEGIN{OFS = "'"$delimiter"'" ;} {print '"$awkstr"' }' "$inputfilename" > "$outputfilename"
echo "Output written to" "$outputfilename"