RS: Record Separator
- AWK processes data 1 record at a time; records are separated by RS within the whole input data stream
- By default, RS='\n' (a newline)
NR: Record Number
- If you keep the default RS='\n', NR is the current input line number.
FS/OFS: Field Separator / Output Field Separator
- AWK splits 1 record into multiple fields based on the value of FS
- AWK prints 1 record by rejoining the fields using OFS.
- By default, FS=OFS=' ' (a single space). They don't have to be the same.
NF: Number of fields in the current record
- With the default FS=OFS=' ', NF = the number of words in the current record.
There are other standard AWK variables, but these are the ones you'll need most often.
awk '1 { print }' <file_name>
Syntax: pattern { action }
If, for a given record/line inside <file_name>, the pattern evaluates to a non-zero number or a non-empty string (which means TRUE in AWK), the commands in the corresponding action block are executed.
{ print } is the default action block if you don't specify one.
E.g. awk 1 <file_name> == awk '1 { print }' <file_name>
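The equivalence above is easy to check with throwaway input (any two lines will do):

```shell
# 'awk 1' copies input to output: the pattern 1 is always true,
# and the omitted action defaults to { print }.
out=$(printf 'first line\nsecond line\n' | awk 1)
echo "$out"
```

This makes awk 1 a portable stand-in for cat.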
awk 'NR > 1' <file_name>
Remember: this is the equivalent of writing awk 'NR > 1 { print }' <file_name>. Conditions can also be combined; the next one prints only lines 2 and 3:
awk 'NR > 1 && NR < 4' <file_name>
awk 'NF' <file_name>
(the pattern is just NF, i.e., NF != 0)
Recall: AWK splits 1 record into fields using FS=' ', which actually means 1-or-more whitespace characters (spaces or tabs). So a line with at least 1 non-whitespace character has at least 1 field (NF >= 1), while a line containing only whitespace, or nothing at all, has NF = 0.
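A quick sketch with made-up input, including an empty line and a spaces-only line, shows the filtering:

```shell
# Lines that are empty or whitespace-only have NF == 0, so the
# bare pattern NF drops them; all other lines pass through.
out=$(printf 'alpha\n\n   \nbeta gamma\n' | awk 'NF')
echo "$out"
```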
awk '1' RS='' <file_name>
This is based on a POSIX rule: if RS is set to the empty string, records are separated by sequences consisting of a "\n" plus 1 or more blank lines (a.k.a. paragraph mode)
A blank line is a completely empty line; a line containing only whitespace does not count as blank.
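Paragraph mode can be seen with two blocks of hypothetical input separated by a blank line; setting RS inside a BEGIN block is equivalent to the RS='' operand used above:

```shell
# With RS set to the empty string, each paragraph becomes one
# record, so $0 here contains embedded newlines; NR counts the
# paragraphs, not the lines.
out=$(printf 'a\nb\n\nc\n' | awk 'BEGIN { RS="" } { print NR ": " $0 }')
echo "$out"
```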
awk '{ print $1, $2, $3 }' FS=, OFS=, <file_name>
When AWK splits a record into fields, it stores the 1st field in $1, the 2nd field in $2, ... Worth mentioning: $0 is the entire record, not the entire file
The examples above don't use a pattern; in order to filter the data, add a condition:
awk 'NF { print $1, $2, $3 }' FS=, OFS=, <file_name>
You can also set the pre-defined variables INSIDE the AWK program
awk 'BEGIN { FS=OFS="," } NF { print $1, $2, $3 } END { }' <file_name>
BEGIN runs before the first record is read.
END runs after the last record has been read.
awk '{ SUM+=$1 } END { print SUM }' FS=, OFS=, <file_name>
Note: AWK variables do not need to be declared before use. An undefined variable is assumed to hold an empty string, and AWK's type conversion rules turn the empty string into 0 in a numeric context. So there is no need to bother with initialization or type conversion when doing addition (multiplication is another story)
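A minimal demonstration on three made-up numbers, with SUM never initialized:

```shell
# SUM starts out undefined (empty string), which converts to 0
# in a numeric context, so SUM += $1 works on the first record.
out=$(printf '1\n2\n3\n' | awk '{ SUM += $1 } END { print SUM }')
echo "$out"
```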
awk '/./ { COUNT += 1 } END { print COUNT }' <file_name>
A pattern can also be a regex like /./, which matches 'each line containing at least 1 character'.
awk 'NF { COUNT += 1 } END { print COUNT }' <file_name>
awk '+$1 { COUNT += 1 } END { print COUNT }' <file_name>
Explanation: the unary plus in the pattern +$1 forces the evaluation of $1 in a numeric context. In this specific example:
- Each data record contains a number in its 1st field.
- Each non-data record (heading, blank line, whitespace-only line) contains either text or nothing, both of which evaluate to 0 as numbers.
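Sketching this with a hypothetical CSV in the same shape as the data file (credits, date, user), including a header and a blank line:

```shell
# +$1 converts the first field to a number: "CREDITS" and the
# empty line both become 0, so only the two data records count.
data='CREDITS,EXPDATE,USER
10,12jan2022,alice

5,01dec2021,bob'
out=$(printf '%s\n' "$data" | awk '+$1 { COUNT += 1 } END { print COUNT }' FS=,)
echo "$out"
```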
Worth mentioning: by convention, AWK variables are UPPERCASE, just like the pre-defined ones.
All arrays in AWK are 'associative arrays', i.e., hash maps.
I want to know the total credits for all users. I can store an entry for each user in an associative array, and each time I encounter a record for that user, I increment the corresponding value stored in the array.
awk '+$1 { CREDITS[$3] += $1 }
END { for (NAME in CREDITS) print NAME, CREDITS[NAME] }' FS=, <file_name>
The for loop iterates through an associative array by its keys (NAME here); each key is then used to look up the value CREDITS[NAME]
Refresh memory: variables can be used in both the pattern and the action block, and so can associative arrays
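The aggregation can be sketched with a few hypothetical records (credits, date, user); note that for-in iteration order over an associative array is unspecified:

```shell
# Totals accumulate in CREDITS, keyed by the user name in field 3.
# The order of the output lines depends on the awk implementation.
data='10,12jan,alice
5,14jan,bob
7,20jan,alice'
out=$(printf '%s\n' "$data" |
  awk '+$1 { CREDITS[$3] += $1 }
       END { for (NAME in CREDITS) print NAME, CREDITS[NAME] }' FS=,)
echo "$out"
```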
awk 'a[$0]++' <file_name>
Refresh memory #2: $0 is entire record
The first time a record is read, a[$0] is undefined, and thus equivalent to zero for AWK: the pattern is false, so that first record is not written to the output; then the entry is incremented from zero to one. The 2nd time the same record is read, a[$0] is now 1, so the record is printed, and the entry is updated from 1 to 2. And so on
awk '!a[$0]++' <file_name>
The not operator ! reverses the logic of the previous code: only the first occurrence of each record is printed. Note that the ++ post-increment has no influence on the result of the ! operator, since it updates a[$0] only after the old value has been tested.
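A quick sketch with repeated throwaway lines shows the dedup behavior:

```shell
# '!a[$0]++' prints each distinct line only the first time it is
# seen, like 'uniq' but without requiring sorted input.
out=$(printf 'red\nblue\nred\ngreen\nblue\n' | awk '!a[$0]++')
echo "$out"
```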
What if, by any chance, you specify OFS=<something> but it doesn't work?
=> AWK WILL NOT change the output record as long as you did not change a field.
=> Solution? Trick it
awk '$1 = $1' FS=, OFS=';' <file_name>
$1 = $1 forces AWK to break the record apart and reassemble it using the OFS
Refresh memory: { print } is the default action block, so this is no different from awk '$1=$1 { print }' FS=, OFS=';' <file_name>
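The trick is easy to verify side by side on a single made-up record:

```shell
# Without touching a field, OFS is ignored and the record is
# output verbatim; the self-assignment $1 = $1 forces a rebuild
# with the new OFS.
before=$(printf 'a,b,c\n' | awk 1 FS=, OFS=';')
after=$(printf 'a,b,c\n' | awk '$1 = $1' FS=, OFS=';')
echo "$before"
echo "$after"
```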
Worth mentioning: empty lines are also removed, since AWK conversion rules specify that an empty string is "false" while all other strings are "true".
The expression $1=$1 does 2 things:
- 1, it assigns $1 to itself, which doesn't actually change the field's value (but does force the record to be rebuilt)
- 2, the expression acts as a pattern; an assignment evaluates to the assigned value, here $1, which is "false" for an empty line.
How to keep empty lines? Move the assignment from the pattern into the action block: awk '{ $1=$1; print }' FS=, OFS=';' <file_name>
awk '$1=$1' <file_name>
By default, runs of whitespace separate the input fields (FS),
but only 1 space is used between output fields (OFS), so this one-liner squeezes whitespace.
Using the output record separator ORS
awk '{ print $1 }' FS=, ORS=' ' <file_name>; echo
(the trailing echo adds the final newline, which is missing because ORS is no longer "\n")
This method has some drawbacks: whitespace-only lines still produce (empty) output records, and a separator trails after the last record.
Partial solution: filter out the whitespace-only lines with a regex:
awk '/[^[:space:]]/ { print $3 }' FS=, ORS='+' <file_name>; echo
awk '/[^[:space:]]/ { print SEP $3; SEP="+" }' FS=, ORS='' <file_name>; echo
Here you take care of adding the separator yourself, which avoids the trailing one, but you need to set the output record separator to the empty string.
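The hand-made separator technique, sketched on three throwaway lines:

```shell
# SEP is empty for the first record, then '+' before every
# following one, so no trailing separator appears. ORS is the
# empty string; echo restores the final newline.
out=$(printf 'a\nb\nc\n' | awk '{ print SEP $0; SEP="+" }' ORS=''; echo)
echo "$out"
```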
AWK inherits the printf function from C, allowing great control over the formatting of the text sent to the output.
awk '+$1 { printf("%s ", $3) }' FS=, <file_name>; echo
The print statement uses OFS and ORS; the printf statement doesn't. That's a small price to pay for the extra control.
Using the printf function, you can also produce fixed-width tabular output, since each "format specifier" in a printf statement can accept an optional width parameter.
awk '+$1 { printf("%10s | %4d\n", $3, $1) }' FS=, <file_name>
The widths 10 and 4 are positive: the fields are padded on the left with spaces (right-aligned). To pad on the right (left-align), use negative widths; you can also pad numbers with zeros instead of spaces:
awk '+$1 { printf("%-10s | %04d\n", $3, $1) }' FS=, <file_name>
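Running that format on a single hypothetical record (25 credits, user alice) shows the effect of the width flags:

```shell
# %-10s left-aligns the name in a 10-column field; %04d zero-pads
# the credits to 4 digits.
out=$(printf '25,x,alice\n' | awk '+$1 { printf("%-10s | %04d\n", $3, $1) }' FS=,)
echo "$out"
```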
awk '+$1 { SUM+=$1; NUM+=1 } END { printf("AVG=%.1f", SUM/NUM) } ' FS=, <file_name>
%.1f means display the number with 1 digit after the decimal point.
awk '$3 { print toupper($0) }' <file_name>
This is probably the best and most portable solution to convert text to uppercase from the shell.
awk '+$1 { split($2, DATE, " "); print $1, $3, DATE[1], DATE[2], DATE[3] }' FS=, OFS=, data.txt
awk '+$1 { split($4, GRP, /:+/); print $3, GRP[1], GRP[2] }' FS=, file
split function arguments: 1, the variable containing the text; 2, an associative array that receives the split values; 3, the regex identifying the separator
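A small sketch on a made-up group string like the one in field 4 above; split also returns the number of pieces:

```shell
# /:+/ treats runs of colons (including the empty '::' slot) as a
# single separator; n receives the number of resulting pieces.
out=$(echo 'team:admin::dev' |
  awk '{ n = split($0, GRP, /:+/); print n, GRP[1], GRP[2], GRP[3] }')
echo "$out"
```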
Sometimes you want to perform a substitution like the sed s///g command, but only on 1 field. The gsub function is what you need in that case:
awk '+$1 {gsub(/ +/, "-", $2); print }' FS=, data.txt
gsub function arguments: 1, a regex to search for; 2, a replacement string; and 3, the variable containing the text to modify (if the last argument is missing, $0 is assumed)
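A sketch on one hypothetical record; OFS=, is added here so the rebuilt record keeps its comma separators:

```shell
# gsub replaces every run of spaces in field 2 with a dash and
# leaves the other fields untouched; modifying $2 rebuilds $0
# using OFS.
out=$(printf '1,02 jan 2018,alice\n' |
  awk '{ gsub(/ +/, "-", $2); print }' FS=, OFS=,)
echo "$out"
```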
awk 'BEGIN { printf("UPDATED: "); system("date"); } /^UPDATED/ { next } 1' <file_name>
Notice the next statement: it's a standard way of ignoring some records from the input file (here, any pre-existing UPDATED line, before the final pattern 1 prints everything else).
...