[awk tutorial] Maybe I won't remember after today, but I still want to note it down

Pre-defined and automatic variables

RS: Record Separator

  • AWK processes data 1 record at a time; records are separated from the whole input data stream by RS
  • By default, RS=\n

NR: Record Number

  • If you're using RS=\n by default, NR will be the current input line number.

FS/OFS: Field Separator/Output Field Separator

  • AWK splits 1 record into multiple fields based on the value of FS
  • AWK prints 1 record by rejoining the fields using the OFS.
  • By default, FS=OFS=' ' (a whitespace).
  • They don't have to be the same

NF: Number of fields in the current record

  • If you're using FS=OFS=' ', NF = the number of words in the current record.
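A quick sanity check of these variables on a made-up 2-line input (none of this data comes from the notes):

printf 'alpha beta\ngamma delta epsilon\n' | awk '{ print NR, NF, $1 }'
# 1 2 alpha
# 2 3 gamma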

There are other standard AWK variables, but they are not necessarily needed here.

A. Basic usage of AWK command

1. Print all lines

awk '1 { print }' <file_name>

Syntax: pattern { action }

If, for a given record/line inside <file_name>, the pattern evaluates to a non-zero value (which means TRUE in AWK), the commands in the corresponding action block are executed.

{ print } is the default action block if you don't specify one. E.g. awk 1 <file_name> is equivalent to awk '1 { print }' <file_name>
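For example, with a pattern other than 1 (made-up input, just to show the pattern being evaluated once per record):

printf '1\n2\n3\n' | awk 'NR % 2 { print "odd line:", $0 }'
# odd line: 1
# odd line: 3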

2. Remove a file header

awk 'NR > 1' <file_name>

Remember: this is the equivalent of writing awk 'NR > 1 { print }' <file_name>

3. Print lines in a range

awk 'NR > 1 && NR < 4' <file_name>

4. Removing whitespace-only lines

awk 'NF' <file_name> (the pattern NF is true only when NF != 0, i.e. the line has at least 1 field)

Recall: AWK splits 1 record into fields using FS=' ', which actually means 1-or-several whitespace characters (multiple spaces or tabs). So a line with at least 1 non-whitespace character has at least 1 field (NF >= 1, true), while a whitespace-only line has no fields (NF = 0, false) and is skipped.
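A tiny made-up check: the whitespace-only middle line has NF = 0, so it is dropped:

printf 'one\n   \ntwo\n' | awk 'NF'
# one
# two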

5. Removing all blank lines

awk '1' RS='' <file_name>

This is based on a POSIX rule: if RS is set to the empty string, records are separated by sequences consisting of a "\n" plus 1 or more blank lines

A blank line is a completely empty line; a line containing only whitespace does not count as a blank line.
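A made-up check of this "paragraph mode": the empty lines between records disappear from the output:

printf 'a\n\n\nb\nc\n' | awk '1' RS=''
# a
# b
# c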

6. Extracting fields

awk '{ print $1, $2, $3 }' FS=, OFS=, <file_name>

When AWK splits a record into fields, it stores the 1st field in $1, the 2nd field in $2, and so on. Worth mentioning: $0 is the entire record, not the entire file. The example above doesn't use a pattern; in order to filter some data, add a condition:

awk 'NF { print $1, $2, $3 }' FS=, OFS=, <file_name>

You could also specify the pre-defined variables INSIDE the AWK program:

awk 'BEGIN { FS=OFS="," } NF { print $1, $2, $3 } END { }' <file_name>

BEGIN will run before the first record is read. END will run after the last record has been read.
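A small run on an invented CSV, setting the separators in a BEGIN block and skipping the header with NR > 1 (the data here is made up for the demo):

printf 'CREDITS,USER\n10,alice\n5,bob\n' | awk 'BEGIN { FS=OFS="," } NR > 1 { print $2, $1 }'
# alice,10
# bob,5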

7. Performing calculations column-wise

awk '{ SUM+=$1 } END { print SUM }' FS=, OFS=, <file_name>

Note: AWK variables do not need to be declared before usage. An undefined variable is assumed to hold an empty string. AWK's type conversion rules specify that, in a numeric context, an empty string (or any string that doesn't look like a number) evaluates to 0. So there is no need to bother about type conversion if you are doing addition (multiplication is another story).
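Made-up check: the non-numeric header converts to 0, so it doesn't disturb the sum:

printf 'CREDITS\n10\n5\n' | awk '{ SUM += $1 } END { print SUM }'
# 15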

8. Counting the number of non-empty lines

Ignore POSIX blank line

awk '/./ { COUNT += 1 } END { print COUNT }' <file_name>

A pattern can also be a regex like /./, which means 'each line containing at least 1 character'.

Ignore Whitespace-only line

awk 'NF { COUNT += 1 } END { print COUNT }' <file_name>

Ignore non-numeric-data-from-the-1st-field line

awk '+$1 { COUNT += 1 } END { print COUNT }' <file_name>

Explanation: the unary plus in the pattern +$1 forces the evaluation of $1 in a numeric context. In this specific example:

  • Each data record contains a number in their 1st field.
  • Each non-data record (heading, blank lines, whitespace-only lines) contains either text or nothing, all of which evaluate to 0 in a numeric context.
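Made-up check of the +$1 pattern: only the two numeric lines are counted; the header and the blank line are not:

printf 'CREDITS\n10\n\n5\n' | awk '+$1 { COUNT += 1 } END { print COUNT }'
# 2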

Worth mentioning: by convention, AWK variables are UPPERCASE, just like the pre-defined ones.

B. Using Arrays in AWK

All arrays in AWK are 'associative arrays', i.e. hash maps.

9. A simple example of AWK array

I want to know the total credits for all users. I can store an entry for each user in an associative array, and each time I encounter a record for that user, I increment the corresponding value stored in the array.

awk '+$1 { CREDITS[$3] += $1 }
		 END { for (NAME in CREDITS) print NAME, CREDITS[NAME] }' FS=, <file_name>

A for loop can iterate through an associative array by its keys (NAME here); the key is then used to look up the value CREDITS[NAME]
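A run on a tiny invented CSV (credits, date, user) shows the per-user totals; note that the iteration order of for (... in ...) is unspecified:

printf '10,01 jun 2018,alice\n5,02 jun 2018,bob\n7,03 jun 2018,alice\n' |
awk '+$1 { CREDITS[$3] += $1 } END { for (NAME in CREDITS) print NAME, CREDITS[NAME] }' FS=,
# alice 17
# bob 5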

10. Identifying duplicated lines using AWK

Refresh memory: variables can be used in both the pattern and the action block, and so can associative arrays

awk 'a[$0]++' <file_name>

Refresh memory #2: $0 is the entire record

The first time a record is read, a[$0] is undefined, and thus equivalent to zero for AWK. So that first record is not written to the output, and the entry is incremented from zero to one. The 2nd time the same record is read, a[$0] is now 1, so it is printed, and the entry is incremented from 1 to 2. And so on.
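Made-up check: only the 2nd and later occurrences of a line are printed:

printf 'a\nb\na\na\n' | awk 'a[$0]++'
# a
# a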

11. Removing duplicated lines

awk '!a[$0]++' <file_name>

The not operator ! reverses the logic of the previous command. Note that the ++ post-increment still happens; it just has no influence on the value that ! negates (the value of a[$0] before the increment).
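Same made-up input with the logic reversed, keeping only the first occurrence of each line:

printf 'a\nb\na\na\n' | awk '!a[$0]++'
# a
# b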

C. Field and record separator magic

12. Changing the Field Separator (FS/OFS)

If, by any chance, you specify OFS=<something> but it doesn't seem to work: AWK will not rebuild the output record as long as you did not change a field. Solution? Trick it.

awk '$1 = $1' FS=, OFS=';' <file_name>

$1 = $1 forces AWK to break the record apart and reassemble it using the OFS

Refresh memory: { print } is the default action block, so this is no different from awk '$1=$1 { print }' FS=, OFS=';' <file_name>

Worth mentioning: empty lines are also removed, since AWK's conversion rules make the empty string "false" (and most other strings "true").

The expression $1=$1 does 2 things:

  • 1, it assigns $1 to itself; the content doesn't change, but the record is marked as modified, so AWK rebuilds $0 using the OFS
  • 2, the expression also acts as a pattern, and a pattern needs a value, so it evaluates to the value of $1, which is "false" for an empty string.
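Made-up check of the trick: every field stays the same, but the record is rebuilt with the new OFS:

printf 'a,b\nc,d\n' | awk '$1=$1' FS=, OFS=';'
# a;b
# c;d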

How to keep empty lines? Put the expression into the action block instead of the pattern:

awk '{ $1=$1; print }' FS=, OFS=';' <file_name>

13. Remove multiple spaces

awk '$1=$1' <file_name>

By default, runs of whitespace (spaces or tabs) are treated as a single separator in the input FS, but only 1 space is used as the output OFS.
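Made-up check: leading, trailing and repeated spaces all collapse to a single space:

echo '  hello    world  ' | awk '$1=$1'
# hello world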

14. Joining lines using AWK

Using the output record separator ORS:

awk '{ print $1 }' FS=, ORS=' ' <file_name>; echo

(the echo adds the trailing newline that is missing because ORS is no longer \n)

This method has some drawbacks

It does not discard whitespace-only lines

Solution: use a regex pattern

awk '/[^[:space:]]/ { print $3 }' FS=, ORS='+' <file_name>; echo

Trailing separator still exists

awk '/[^[:space:]]/ { print SEP $3; SEP="+" }' FS=, ORS='' <file_name>; echo

Take care of adding the separator yourself; you just need to set the output record separator to the empty string.
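A minimal made-up demo of the SEP technique on a single column:

printf 'a\nb\nc\n' | awk '{ print SEP $0; SEP="+" }' ORS=''; echo
# a+b+c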

D. Field Formatting

AWK inherits the printf function from C, allowing great control over the formatting of the text sent to the output.

awk '+$1 { printf("%s ", $3) }' FS=, <file_name>; echo

The print statement uses OFS and ORS; the printf statement doesn't - a small price to pay for the extra control.

15. Introducing tabular results

Using the printf function, you can also produce fixed-width tabular output, since each "format specifier" in a printf statement can accept an optional width parameter.

awk '+$1 { printf("%10s | %4d\n", $3, $1) }' FS=, <file_name>

The widths 10 and 4 are > 0: the field is padded on the left with spaces (right-aligned). To pad on the right with spaces (left-aligned), use negative numbers; you can also pad numeric fields with zeros instead of spaces:

awk '+$1 { printf("%-10s | %04d\n", $3, $1) }' FS=, <file_name>

16. Dealing with floating point numbers

awk '+$1 { SUM+=$1; NUM+=1 } END { printf("AVG=%.1f", SUM/NUM) } ' FS=, <file_name>

%.1f means display the number with 1 digit after the decimal point.
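Quick made-up check of the rounding:

awk 'BEGIN { printf("AVG=%.1f\n", 10/3) }'
# AVG=3.3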

E. Using string functions in AWK

17. Converting text to uppercase

awk '$3 { print toupper($0) }' <file_name>

This is probably the best and most portable solution to convert text to uppercase from the shell.

18. Changing part of a string
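As a rough sketch (made-up input, and assuming the standard substr() function is the tool meant here), one way to take or change part of a string is:

echo '01 jun 2018' | awk '{ print substr($0, 1, 2), toupper(substr($0, 4, 3)), substr($0, 8) }'
# 01 JUN 2018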

19. Splitting fields into sub-fields

awk '+$1 { split($2, DATE, " "); print $1, $3, DATE[1], DATE[2], DATE[3] }' FS=, OFS=, data.txt

awk '+$1 { split($4, GRP, /:+/); print $3, GRP[1], GRP[2] }' FS=, file

split function arguments: 1, the variable containing the text; 2, an associative array that will store the resulting sub-fields; 3, the regex identifying the separator
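Made-up check of the 2nd form: the regex /:+/ treats any run of colons as one separator, and split() returns the number of sub-fields:

echo 'team:::admin' | awk '{ n = split($0, GRP, /:+/); print n, GRP[1], GRP[2] }'
# 2 team admin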

20. Searching and replacing with AWK commands

Sometimes you want to perform a substitution like the sed s///g command, but only on 1 field. The gsub function is what you need in that case:

awk '+$1 {gsub(/ +/, "-", $2); print }' FS=, data.txt

gsub function arguments: 1, a regex to search for; 2, a replacement string; and 3, the variable containing the text (if the last argument is missing, $0 is assumed)
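Made-up check: gsub() with no 3rd argument works on $0 and replaces every run of spaces:

echo '01    dec   2018' | awk '{ gsub(/ +/, "-"); print }'
# 01-dec-2018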

F. Working with external commands in AWK

21. Adding the date on top of a file

awk 'BEGIN { printf("UPDATED: "); system("date"); } /^UPDATED/ { next } 1' <file_name>

Notice the next command: it tells AWK to stop processing the current record and move on to the next one - a standard way of ignoring some records from the input file.

22. Modifying a field externally

...
