This shell script reads a tab-separated values (TSV) file and converts it to CSV (comma-separated values) format. It handles fields in the TSV file that contain commas by quoting them and escaping any double quotes inside them, so the output remains valid CSV. Let's break it down step by step:
The shebang line at the top of the script specifies that it should be run using the `sh` shell.
The main body of the script is an awk command. awk is a powerful text-processing tool that works line by line on input text.
`BEGIN { ... }`: This block is executed before any lines of input are processed.
- `FS="\t"`: This sets the Field Separator (`FS`) to a tab character (`\t`). This means that `awk` will split each line of input into fields based on tab characters.
- `OFS=","`: This sets the Output Field Separator (`OFS`) to a comma (`,`), meaning that when `awk` prints the output, the fields will be separated by commas.
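The effect of `FS` and `OFS` can be seen in isolation with a minimal sketch. The helper name `split_and_join` and the sample line are assumptions made for illustration; they are not part of the script being explained.

```shell
# Hypothetical helper: split each input line on tabs, join three fields with commas.
split_and_join() {
  # The commas in `print $1, $2, $3` tell awk to separate the values with OFS.
  awk 'BEGIN { FS="\t"; OFS="," } { print $1, $2, $3 }'
}

printf 'name\tage\tcity\n' | split_and_join
# prints: name,age,city
```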
The main block, which follows the `BEGIN` block, is executed once for each line of the input data. It performs these steps:
- `rebuilt=0`: This initializes a flag `rebuilt` to 0. The flag is used to track whether any field is modified during processing.
- `for(i=1; i<=NF; ++i)`: This starts a `for` loop that iterates over each field of the current line. `NF` is a built-in `awk` variable that holds the number of fields in the current record (line).
- `if ($i ~ /,/ && $i !~ /^".*"$/)`: This checks whether the current field `$i` contains a comma (`$i ~ /,/`) and is not already enclosed in double quotes (`$i !~ /^".*"$/`). A field that contains a comma and is not quoted must be quoted in CSV format so that the comma is not mistaken for a field delimiter.
- `gsub("\"", "\"\"", $i)`: This replaces every double quote in the current field `$i` with two double quotes (`""`). This is the standard way to escape double quotes in CSV format. For example, if a field contains the value `hello,"world"`, it is converted to `hello,""world""`.
- `$i = "\"" $i "\""`: This wraps the current field `$i` in double quotes. After this step, the field is quoted in CSV format (e.g., `hello,world` becomes `"hello,world"`).
- `rebuilt=1`: This sets the `rebuilt` flag to 1, indicating that a modification has been made to the current field (i.e., it has been quoted).
- `if (!rebuilt) { $1=$1 }`: This checks whether the `rebuilt` flag is still 0 (meaning no fields were modified). If so, the assignment `$1=$1` leaves the first field's value unchanged but forces `awk` to rebuild the whole record (`$0`) from its fields, joined with `OFS`. This might seem redundant, but without it an unmodified line would be printed exactly as it was read, with tabs instead of commas. Lines in which a field was quoted do not need this step, because assigning to any field already triggers the rebuild.
- `print`: This prints the entire line after the field modifications, with fields separated by commas (as defined by `OFS`).
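The role of the `$1=$1` assignment is easiest to see by comparing a plain `print` with one that follows the assignment. This is a minimal sketch; the helper names `print_untouched` and `print_rebuilt` are hypothetical and exist only for the comparison.

```shell
# Hypothetical helper: no field is assigned, so awk prints the record as read.
print_untouched() {
  awk 'BEGIN { FS="\t"; OFS="," } { print }'
}

# Hypothetical helper: $1=$1 makes awk rebuild the record with OFS before printing.
print_rebuilt() {
  awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }'
}

printf 'Alice\t25\n' | print_untouched
# prints: Alice<TAB>25   (the tab is still there)

printf 'Alice\t25\n' | print_rebuilt
# prints: Alice,25
```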
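Outside the `awk` program, `$1` refers to the first command-line argument passed to the script, which should be a file name (this is the shell's `$1`, not the `$1` inside the `awk` program, which means the first field). It is passed to `awk`, so `awk` processes the file specified by `$1`.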
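The two meanings of `$1` can be shown side by side in a small sketch. The function `show_first_field` is a hypothetical stand-in for the script, and quoting the argument as `"$1"` is an assumption (it keeps file names containing spaces intact); the original script's exact quoting is not shown here.

```shell
# Hypothetical stand-in for the script.
# Inside the single-quoted awk program, $1 is awk's first field.
# Outside the quotes, "$1" is expanded by the shell to the first argument (a file name).
show_first_field() {
  awk 'BEGIN { FS="\t" } { print $1 }' "$1"
}

tmp=$(mktemp)                      # temporary input file for the demonstration
printf 'Alice\t25\n' > "$tmp"
show_first_field "$tmp"
# prints: Alice
rm -f "$tmp"
```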
The script:
- Reads a tab-separated file (the file specified as the first command-line argument).
- Converts it to a CSV format.
- Quotes any field that contains a comma and is not already quoted, doubling any double quotes inside that field to escape them.
- Outputs the result to standard output.
For example:
Input (TSV file; the columns are separated by tab characters):
name age address
John, Doe 30 "123, Elm St"
Alice 25 456 Oak Rd
Output (CSV):
name,age,address
"John, Doe",30,"123, Elm St"
Alice,25,456 Oak Rd
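The example above can be reproduced with a sketch of the script reconstructed from the steps described earlier. This is an approximation under stated assumptions, not necessarily the original script verbatim: the function name `tsv_to_csv` is hypothetical, and `"$@"` is used in place of `"$1"` so that the function also reads standard input when no file is given.

```shell
# Hypothetical wrapper around the awk program described above.
tsv_to_csv() {
  awk 'BEGIN { FS="\t"; OFS="," }
  {
    rebuilt = 0
    for (i = 1; i <= NF; ++i) {
      # Quote only fields that contain a comma and are not already quoted.
      if ($i ~ /,/ && $i !~ /^".*"$/) {
        gsub("\"", "\"\"", $i)   # double any embedded double quotes
        $i = "\"" $i "\""        # wrap the field in double quotes
        rebuilt = 1
      }
    }
    # If no field was changed, force awk to rebuild the record with OFS.
    if (!rebuilt) { $1 = $1 }
    print
  }' "$@"
}

# Feed the sample input (tab-separated) through the function.
printf 'name\tage\taddress\nJohn, Doe\t30\t"123, Elm St"\nAlice\t25\t456 Oak Rd\n' |
  tsv_to_csv
# prints:
# name,age,address
# "John, Doe",30,"123, Elm St"
# Alice,25,456 Oak Rd
```

Only fields that contain a comma are touched, so the already quoted address is passed through unchanged while `John, Doe` gains quotes.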
This script ensures the integrity of the CSV format by quoting fields that contain commas and escaping double quotes within those fields.