@romiras
Created November 28, 2024 22:12
Convert TSV to CSV file

This shell script converts a tab-separated values (TSV) file into comma-separated values (CSV). Any field that contains a comma and is not already wrapped in double quotes is quoted, and double quotes inside such a field are escaped by doubling them. Let's break it down step by step:

1. Shebang (#!/bin/sh)

This line specifies that the script should be run using the sh shell.

2. awk command

The main body of the script is an awk command. awk is a powerful text-processing tool that works line by line on input text.

BEGIN { FS="\t"; OFS="," }

  • BEGIN { ... }: This block is executed before any lines of input are processed.
  • FS="\t": This sets the Field Separator (FS) to a tab character (\t). This means that awk will split each line of input into fields based on tab characters.
  • OFS=",": This sets the Output Field Separator (OFS) to a comma (,). awk inserts it between the arguments of print, and between fields whenever it rebuilds the line after a field has been assigned.
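To see both separators at work, here is a minimal sketch. The helper name show_fields is hypothetical and is not part of the script; it prints the field count plus the first and third fields of each line.

```shell
# Hypothetical helper: show how FS and OFS act on tab-separated input.
show_fields() {
    # FS splits on single tabs, so an empty field between two tabs is kept.
    # OFS joins the arguments given to print.
    awk 'BEGIN { FS="\t"; OFS="," } { print NF, $1, $3 }'
}

printf 'a\t\tc\n' | show_fields   # prints 3,a,c
```

The empty second field still counts toward NF, because a single-character FS such as a tab does not collapse repeated separators.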

{ ... }

This block is executed for each line of the input data.

  1. rebuilt=0

    • This resets the flag rebuilt to 0 at the start of each line. The flag records whether any field of the current line is assigned during processing.
  2. for(i=1; i<=NF; ++i)

    • This starts a for loop that iterates over each field of the current line.
    • NF is a built-in awk variable that holds the number of fields in the current record (line).
  3. if ($i ~ /,/ && $i !~ /^".*"$/)

    • This checks if the current field $i contains a comma ($i ~ /,/) and is not already enclosed in double quotes ($i !~ /^".*"$/).
    • If the field contains a comma and is not quoted, it must be quoted in CSV format to avoid confusion with field delimiters.
  4. gsub("\"", "\"\"", $i)

    • Inside the if block, gsub("\"", "\"\"", $i) replaces every double quote in the current field $i with two double quotes (""). Doubling is the standard way to escape a double quote inside a quoted CSV field. For example, the field say "hi", ok becomes say ""hi"", ok. A field without a comma never reaches this step, so its quotes are left as they are.
  5. $i = "\"" $i "\""

    • This wraps the current field $i in double quotes. After this step, the field will be quoted in CSV format (e.g., hello,world becomes "hello,world").
  6. rebuilt=1

    • This sets the rebuilt flag to 1, indicating that a field has been assigned (i.e., it has been quoted). That assignment already makes awk rebuild the line using OFS.
  7. if (!rebuilt) { $1=$1 }

    • This checks if the rebuilt flag is still 0 (meaning no field was assigned).
    • awk only rejoins the fields with OFS when a field is assigned; otherwise print outputs the original line, tabs included. Assigning $1 to itself changes no data, but it forces awk to rebuild the line so that the tabs are replaced with commas.
  8. print

    • This prints the entire line after the field modifications, with fields separated by commas (as defined by OFS).
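Two of the steps above can be tried in isolation. The sketch below uses three hypothetical helpers (print_untouched, print_rebuilt, and quote_field are illustration names, not part of the script). The first two show why $1=$1 is needed; the third applies the quoting rule from steps 3 to 5 to a whole input line instead of a single field.

```shell
# No field is assigned, so $0 is printed unchanged and the tabs survive.
print_untouched() {
    awk 'BEGIN { FS="\t"; OFS="," } { print }'
}

# Assigning any field, even to itself, makes awk rebuild $0 with OFS.
print_rebuilt() {
    awk 'BEGIN { FS="\t"; OFS="," } { $1 = $1; print }'
}

# Quoting rule from steps 3 to 5, applied to the whole line for simplicity.
quote_field() {
    awk '{
        if ($0 ~ /,/ && $0 !~ /^".*"$/) {
            gsub("\"", "\"\"", $0)   # double every embedded quote
            $0 = "\"" $0 "\""        # wrap the result in quotes
        }
        print
    }'
}

printf 'a\tb\n' | print_untouched             # tab is still there
printf 'a\tb\n' | print_rebuilt               # prints a,b
printf '%s\n' 'say "hi", ok' | quote_field    # prints "say ""hi"", ok"
```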

3. $1

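  • $1 refers to the first command-line argument passed to the script, which should be a file name. The shell expands it before awk starts, so awk processes the file it names. It should be written as "$1" (in double quotes) so that a file name containing spaces reaches awk as a single argument.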
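In sh, an unquoted $1 undergoes word splitting, so a path such as my data.tsv would reach awk as two separate arguments. A minimal sketch of the quoted form follows; the helper name convert_file and the temporary file are assumptions made for the demonstration only.

```shell
# Hypothetical helper: pass the first argument to awk as one word.
convert_file() {
    awk 'BEGIN { FS="\t"; OFS="," } { $1 = $1; print }' "$1"
}

# Demo with a file name that contains a space.
demo_dir=$(mktemp -d)
printf 'a\tb\nc\td\n' > "$demo_dir/my data.tsv"
convert_file "$demo_dir/my data.tsv"   # prints a,b then c,d
rm -rf "$demo_dir"
```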

Summary of Functionality:

The script:

  • Reads a tab-separated file (the file specified as the first command-line argument).
  • Converts it to a CSV format.
  • Quotes any field that contains a comma and is not already enclosed in double quotes, doubling the double quotes inside it. Fields without a comma are copied unchanged, even if they contain double quotes.
  • Outputs the result to standard output.

For example, given this input (TSV file):

name	age	address
John, Doe	30	"123, Elm St"
Alice	25	456 Oak Rd

Output (CSV):

name,age,address
"John, Doe",30,"123, Elm St"
Alice,25,456 Oak Rd

In this way the script keeps commas inside a field from being mistaken for delimiters: it quotes those fields and escapes the double quotes within them.
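The example can be reproduced without creating a file. The sketch below wraps the same awk program in a hypothetical function, tsv_to_csv, that reads standard input; reading from standard input rather than from a file argument is an assumption made only for this demonstration.

```shell
# Hypothetical wrapper around the same awk program, reading standard input.
tsv_to_csv() {
    awk 'BEGIN { FS="\t"; OFS="," } {
        rebuilt = 0
        for (i = 1; i <= NF; ++i) {
            if ($i ~ /,/ && $i !~ /^".*"$/) {
                gsub("\"", "\"\"", $i)
                $i = "\"" $i "\""
                rebuilt = 1
            }
        }
        if (!rebuilt) { $1 = $1 }
        print
    }'
}

# printf turns \t into a tab and \n into a newline.
printf 'name\tage\taddress\nJohn, Doe\t30\t"123, Elm St"\nAlice\t25\t456 Oak Rd\n' | tsv_to_csv
```

The third field of the second line is already wrapped in double quotes, so it passes through unchanged, while John, Doe gains its quotes.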

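#!/bin/sh
# Convert a TSV file (first argument) to CSV on standard output.
awk 'BEGIN { FS="\t"; OFS="," } {
    rebuilt = 0
    for (i = 1; i <= NF; ++i) {
        # Quote fields that contain a comma and are not already quoted.
        if ($i ~ /,/ && $i !~ /^".*"$/) {
            gsub("\"", "\"\"", $i)   # escape embedded quotes by doubling them
            $i = "\"" $i "\""        # assigning a field rebuilds $0 with OFS
            rebuilt = 1
        }
    }
    # No field was assigned: force a rebuild so tabs become commas.
    if (!rebuilt) { $1 = $1 }
    print
}' "$1"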
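
The script only quotes fields that contain a comma, so a field such as say "hi" is copied with its quotes unescaped. RFC 4180 also calls for quoting fields that contain double quotes. One possible variant, offered as a sketch and not as the author's method, widens the test to match either character. The function name tsv_to_csv_strict is hypothetical, and fields that already start and end with a double quote are still assumed to be correctly quoted.

```shell
# Hypothetical variant: also quote fields that contain a double quote.
# Reads the files given as arguments, or standard input when there are none.
tsv_to_csv_strict() {
    awk 'BEGIN { FS="\t"; OFS="," } {
        rebuilt = 0
        for (i = 1; i <= NF; ++i) {
            # [",] matches a double quote or a comma.
            if ($i ~ /[",]/ && $i !~ /^".*"$/) {
                gsub("\"", "\"\"", $i)
                $i = "\"" $i "\""
                rebuilt = 1
            }
        }
        if (!rebuilt) { $1 = $1 }
        print
    }' "$@"
}

printf 'say "hi"\tx,y\tplain\n' | tsv_to_csv_strict
# prints "say ""hi""","x,y",plain
```

Fields with embedded line breaks are not covered by either version, since awk reads the input one line at a time.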