This shell script processes a tab-separated values (TSV) file and converts it into a CSV (Comma-Separated Values) format. The script handles cases where fields in the TSV file contain commas, ensuring proper CSV formatting by escaping and quoting such fields. Let's break it down step by step:
The first line (the shebang) specifies that the script should be run using the sh shell.
The main body of the script is an awk command. awk is a powerful text-processing tool that works line by line on input text.
BEGIN { ... }: This block is executed before any lines of input are processed.
- FS="\t": This sets the Field Separator (FS) to a tab character (\t). This means that awk will split each line of input into fields based on tab characters.
- OFS=",": This sets the Output Field Separator (OFS) to a comma (,), meaning that when awk prints the output, the fields will be separated by commas.
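To see the two separators in isolation, here is a minimal sketch. The sample input a, b, c and the variable name out are illustrative assumptions, not part of the original script.

```shell
# Split each input line on tabs (FS) and join printed fields with commas (OFS).
# The comma in "print $1, $3" is what inserts OFS between the two values.
out=$(printf 'a\tb\tc\n' | awk 'BEGIN { FS="\t"; OFS="," } { print $1, $3 }')
echo "$out"   # prints: a,c
```

Only the first and third fields are printed here, which shows that awk really did split the line on tabs before joining with commas.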
The main { ... } block is executed for each line of the input data.
- rebuilt=0
  - This initializes a flag rebuilt to 0. The flag will be used to track whether any modifications are made to the fields during processing.
- for(i=1; i<=NF; ++i)
  - This starts a for loop that iterates over each field of the current line. NF is a built-in awk variable that holds the number of fields in the current record (line).
- if ($i ~ /,/ && $i !~ /^".*"$/)
  - This checks if the current field $i contains a comma ($i ~ /,/) and is not already enclosed in double quotes ($i !~ /^".*"$/).
  - If the field contains a comma and is not quoted, it must be quoted in CSV format to avoid confusion with field delimiters.
- gsub("\"", "\"\"", $i)
  - This replaces every double quote in the current field $i with two double quotes ("", written \"\" inside the awk string). This is the standard way to escape double quotes in CSV format. For example, if a field contains the value hello"world, it will be converted to hello""world.
- $i = "\"" $i "\""
  - This wraps the current field $i in double quotes. After this step, the field will be quoted in CSV format (e.g., hello,world becomes "hello,world").
- rebuilt=1
  - This sets the rebuilt flag to 1, indicating that a modification has been made to the current field (i.e., it has been quoted).
- if (!rebuilt) { $1=$1 }
  - This checks if the rebuilt flag is still 0 (meaning no fields were modified).
  - If no fields were modified, assigning $1=$1 forces awk to rebuild the record $0 from its fields, joining them with OFS. The assignment looks redundant, but without it print would output the line unchanged, with the original tabs instead of commas.
- print
  - This prints the entire line after the field modifications, with fields separated by commas (as defined by OFS).
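Two of the steps above are easy to try on their own: the escape-and-wrap step, and the $1=$1 rebuild. The following is a small sketch; the sample values and the shell variable names quoted, plain, and joined are assumptions chosen for illustration, and the first snippet operates on the whole record $0 rather than on a single field $i to keep it short.

```shell
# 1) Escape embedded quotes, then wrap the value in quotes.
quoted=$(printf '%s\n' 'say "hi", Bob' | awk '{
  gsub("\"", "\"\"", $0)   # double every embedded double quote
  $0 = "\"" $0 "\""        # wrap the whole value in double quotes
  print
}')
echo "$quoted"   # prints: "say ""hi"", Bob"

# 2) With no field assignment, print emits the record as read (tab kept).
plain=$(printf 'a\tb\n' | awk 'BEGIN { FS="\t"; OFS="," } { print }')

# Assigning any field, even to itself, makes awk rebuild $0 using OFS.
joined=$(printf 'a\tb\n' | awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }')
echo "$joined"   # prints: a,b
```

The second pair shows why the script needs the flag at all: lines where a field was quoted are already rebuilt by that assignment, while untouched lines need the explicit $1=$1.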
After the awk program, $1 refers to the first command-line argument passed to the shell script (not awk's first field), which should be a file name. It is passed to awk, so awk processes the file that it names.
The script:
- Reads a tab-separated file (the file specified as the first command-line argument).
- Converts it to a CSV format.
- Quotes any field that contains a comma and is not already quoted, doubling any double quotes inside that field to escape them.
- Outputs the result to standard output.
For example:
Input (TSV file, fields separated by tabs):
name age address
John, Doe 30 "123, Elm St"
Alice 25 456 Oak Rd
Output (CSV):
name,age,address
"John, Doe",30,"123, Elm St"
Alice,25,456 Oak Rd
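The example can be reproduced with a script assembled from the pieces described above. This is a reconstruction under assumptions, not necessarily the original script verbatim: the function name tsv2csv and the temporary file are hypothetical, and mktemp is assumed to be available (it is common but not part of POSIX).

```shell
# tsv2csv is a hypothetical wrapper; its awk body follows the steps described.
tsv2csv() {
  awk 'BEGIN { FS="\t"; OFS="," }
  {
    rebuilt = 0
    for (i = 1; i <= NF; ++i) {
      # Quote only fields that contain a comma and are not already quoted.
      if ($i ~ /,/ && $i !~ /^".*"$/) {
        gsub("\"", "\"\"", $i)   # escape embedded double quotes
        $i = "\"" $i "\""        # wrap the field in double quotes
        rebuilt = 1
      }
    }
    if (!rebuilt) { $1 = $1 }    # force $0 to be rebuilt with OFS
    print
  }' "$1"
}

# Build the sample TSV input with real tab characters.
tmp=$(mktemp)
printf 'name\tage\taddress\n' > "$tmp"
printf 'John, Doe\t30\t"123, Elm St"\n' >> "$tmp"
printf 'Alice\t25\t456 Oak Rd\n' >> "$tmp"

result=$(tsv2csv "$tmp")
rm -f "$tmp"
echo "$result"
```

The address field on the second line already starts and ends with a double quote, so the condition skips it and it passes through as is.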
The script keeps the CSV output well formed by quoting fields that contain commas and escaping double quotes within those fields; fields without commas are passed through unchanged.