awk Commands, Examples & Meaning


awk is an interpreted language used to process and validate data, generate reports, and experiment with algorithms that can be ported to other languages.

awk name & History
The name awk comes from the initials of its creators: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.
The original version of awk was written in 1977 at AT&T Bell Laboratories. Paul Rubin wrote gawk (GNU awk) in 1986.

Using awk
awk can be used directly on the command line, or run from a program file that is either passed to awk with the -f option or made executable itself.
awk Command Line
On the command line, awk is a tool to process and format data from one or more input files or from the output of another program.

Syntax to use a data file as input and run an awk command to process the data:
awk '<awk command>' <file>
or
use the output of another command as the data input through a pipe:
<command> | awk '<awk command>'

awk variable assignments :
awk processes its input one line (record) at a time, splits each line into columns (fields), and assigns them to variables:
$0 = Entire line
$1 = First Column
$2 = Second Column
$3 = Third Column
and so on
By default, a column is a run of characters delimited by spaces or tabs. Common Linux/Unix commands such as df, ls and ps produce columnar output, which makes awk handy for listing and processing column data. The print statement is used to print variables.
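
As a minimal sketch of the field variables, the snippet below feeds one made-up line to awk and prints the first field, the third field and NF (the number of fields, covered later). The sample text and the shell variable names are assumptions for illustration only.

```shell
# One made-up input line; note the repeated spaces after "alpha".
line='alpha   beta gamma'

# Runs of spaces count as a single separator, so there are three fields.
fields=$(printf '%s\n' "$line" | awk '{ print $1; print $3; print NF }')
echo "$fields"
```

This should print alpha, gamma and 3 on separate lines.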

Working with awk commands
awk commands are enclosed in single quotes so that the shell passes them to awk unchanged: the first single quote after the awk options starts the awk command, and the matching single quote ends it.

awk Examples
1. Extract the used space column for each mount point from df output

localhost ~]$ df | awk '{print $3 }'
Used
10831280
0
252
1020
0
176
123767
51118256
66006

A similar operation extracts the 1st and 4th columns from a file called testfile containing the following lines
column1 column2 column3 column4
1111 2222 3333 4444
1111 2222 3333 4444
1111 2222 3333 4444

localhost ~]$ awk '{print $1,$4}' testfile
column1 column4
1111 4444
1111 4444
1111 4444

Separating fields with a comma in a print statement puts the Output Field Separator, OFS, between them in the output. OFS is a special awk variable whose default is a single space; it can be assigned any other value, such as the pipe symbol, |, in the example below.

localhost ~]$ awk '{OFS="|" ; print $3,$4}' testfile
column3|column4
3333|4444
3333|4444
3333|4444
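
A hedged variation on the same idea: OFS only needs to be set once, so it can be assigned in a BEGIN block (described in the next section) rather than on every line. The input below is a made-up two-line stand-in for testfile.

```shell
# Two sample lines in the same layout as testfile, piped in instead of read from disk.
out=$(printf '%s\n' 'column1 column2 column3 column4' '1111 2222 3333 4444' |
  awk 'BEGIN { OFS = "|" }      # set the separator once, before any input
       { print $3, $4 }')
echo "$out"
```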

awk BEGIN and END statement
A multi-line program uses BEGIN and END blocks to execute statements once at the beginning, before any input is read, and once at the end, after all input has been processed.
The basic construction is:

BEGIN { <statements> }
<pattern> { <processing statements> }
END { <statements> }

Example:
localhost ~]$ awk 'BEGIN { print "Count Records " }
/4444/ { ++num }
END { print "Recs " num }' testfile
Count Records
Recs 3
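
The same pattern can be sketched with inline data. Here the counter is initialised in BEGIN, and END also uses NR, the number of lines read. The variable name matched and the sample lines are illustrative assumptions.

```shell
report=$(printf '%s\n' 'column1 column2 column3 column4' \
                       '1111 2222 3333 4444' \
                       '1111 2222 3333 4444' |
  awk 'BEGIN  { matched = 0 }      # runs once, before the first line
       /4444/ { matched++ }        # runs for every line containing 4444
       END    { print matched " of " NR " lines matched" }')
echo "$report"
```

With the three sample lines this should report 2 of 3 lines matched.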

awk program File
awk programs can be written in a file and invoked from it. To make the file directly executable, give the awk interpreter location on the first line.

Syntax :
awk -f <program file> <datafile>

Create an awk program file, chkrec, as below. The interpreter path may differ on your system (for example /usr/bin/awk).
#!/bin/awk -f
BEGIN { print "Count Records " }
/4444/ { ++num }
END { print "Recs " num }
Execute the file with the -f option
localhost ~]$ awk -f chkrec testfile
Count Records
Recs 3
or make it executable and execute it directly with the data file as an argument
localhost ~]$ chmod 755 chkrec
localhost ~]$ ./chkrec testfile
Count Records
Recs 3

Awk Example programs
1. Compare values
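Print the Available, Use% and Mounted columns if the used percentage is more than 60%. Adding 0 to $5 forces a numeric comparison (a value such as 92% becomes 92); comparing against the string "60" would compare character by character and misjudge values such as 7% or 100%. NR == 1 keeps the header line.

localhost ~]$ df | awk 'NR == 1 || $5+0 > 60 { print $4,$5,$6 }'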
Available Use% Mounted
4522188 92% /home
32298 68% /boot/efi
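
Because Use% values end in a % sign, awk treats them as strings unless they are forced to numbers. The sketch below, using made-up usage figures, contrasts a string comparison with a numeric one. The paths /a, /b and /c are placeholders.

```shell
# Made-up data: 7% and 100% are the cases a string comparison gets wrong.
data='7% /a
92% /b
100% /c'

# String comparison: characters are compared one by one ("7" sorts after "6").
as_string=$(printf '%s\n' "$data" | awk '$1 > "60" { print $2 }')

# Numeric comparison: $1+0 converts the leading digits to a number.
as_number=$(printf '%s\n' "$data" | awk '$1+0 > 60 { print $2 }')

echo "string comparison selects:";  echo "$as_string"
echo "numeric comparison selects:"; echo "$as_number"
```

The string form selects /a and /b, while the numeric form selects /b and /c, which is the intended result.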

2. Sum operations
Add up the sizes of selected files, /var/log/yum*. The fifth column of each ls -l line (the size in bytes) is added to the variable n, and the total is printed by the END statement.

localhost ~]$ ls -l /var/log/yum* | awk '{ n += $5 }
END { print "Total bytes =", n }'
Total bytes = 63665
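
A small sketch of the same accumulation on made-up name and size pairs, so it can be run without any log files. Writing n + 0 makes awk print 0 rather than an empty string when there is no input.

```shell
total=$(printf '%s\n' 'a.log 100' 'b.log 250' 'c.log 50' |
  awk '{ n += $2 }                                   # add the size column
       END { print "Total bytes =", n + 0, "in", NR, "files" }')
echo "$total"
```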

3. if else conditions
Check the Use% column and print ok in front of each line if usage is 60% or less, and Problem if it is more than 60%. The header line is printed unchanged.

localhost ~]$ df | awk 'NR == 1 { print; next }
{ if ($5+0 > 60) print "Problem", $0
else
print "ok", $0
}'

Filesystem 1K-blocks Used Available Use% Mounted on
ok /dev/mapper/fedora-root 51475068 10831316 38005928 23% /
ok devtmpfs 1956180 0 1956180 0% /dev
ok tmpfs 1966388 252 1966136 1% /dev/shm
ok tmpfs 1966388 992 1965396 1% /run
ok tmpfs 1966388 0 1966388 0% /sys/fs/cgroup
ok tmpfs 1966388 176 1966212 1% /tmp
ok /dev/sda9 487652 123767 334189 28% /boot
Problem /dev/mapper/fedora-home 58642620 51118476 4522188 92% /home
Problem /dev/sda2 98304 66006 32298 68% /boot/efi

4. For loop
Print the numbers 1 to 5 using a for loop by providing an initial value, a final value and an increment.

localhost ~]$ awk 'BEGIN { for (i = 1; i <= 5; ++i) print i }'
1
2
3
4
5
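
A for loop is often combined with NF to visit every field of a line. The following sketch sums the fields of each made-up input line.

```shell
sums=$(printf '%s\n' '1 2 3' '10 20' |
  awk '{ total = 0
         for (i = 1; i <= NF; i++) total += $i   # $i is the i-th field
         print total }')
echo "$sums"
```

This should print 6 and 30.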

5. awk Arrays , creating and sorting

Create an array by assigning values to array indexes :
A["ZZ"] = "Last"
A["DD"] = "Middle"
A["AA"] = "First"
Sorting arrays (both functions are gawk extensions)
asort - Array Sort by value
asorti - Array Sort by Indices
Both replace the original indexes with the integers 1 to n and return n; pass a second array as the destination to keep the original intact.

asort(A)
A[1] = "First"
A[2] = "Last"
A[3] = "Middle"

asorti(A)
A[1] = "AA"
A[2] = "DD"
A[3] = "ZZ"
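
On awk versions without these gawk functions, one common workaround is to print the array with a for (key in array) loop and pipe the result to the sort command. The sketch below counts made-up architecture names. The array name seen is an assumption for illustration.

```shell
counts=$(printf '%s\n' x86_64 noarch x86_64 i686 x86_64 |
  awk '{ seen[$1]++ }                          # index the array by the word itself
       END { for (k in seen) print k, seen[k] }' |
  LC_ALL=C sort)                               # for-in order is unspecified, so sort outside awk
echo "$counts"
```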

awk regular expressions

gsub
Global substitution: replaces every match of regexp in target (by default $0) with replacement and returns the number of substitutions made.
gsub(regexp, replacement [, target])
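
A minimal sketch of gsub on one made-up rpm name: every dash in the line is replaced with a space, and the return value gives the number of replacements.

```shell
result=$(printf '%s\n' 'glx-utils-8.1.0-4.fc20.x86_64' |
  awk '{ n = gsub(/-/, " ")     # no target given, so $0 is modified in place
         print n, $0 }')
echo "$result"
```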

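gensub(regexp, replacement, how [, target])
A gawk-specific general substitution function that provides more features than the standard sub() and gsub() functions: it returns the modified string instead of changing the target, the third argument selects which match to replace ("g" for all), and the replacement text can refer to parenthesized components of the regexp as \1, \2 and so on (written "\\1" inside a string constant).

localhost ~]$ df  | awk '{ print gensub(/%/, " Percent", 1) }'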
Filesystem              1K-blocks     Used Available Use Percent Mounted on
/dev/mapper/fedora-root  51475068 10831316  38005928  23 Percent /
devtmpfs                  1956180        0   1956180   0 Percent /dev
/dev/sda9                  487652   123767    334189  28 Percent /boot
/dev/mapper/fedora-home  58642620 51118476   4522188  92 Percent /home
/dev/sda2                   98304    66006     32298  68 Percent /boot/efi

index(in, find)
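Find the position of a substring: returns the 1-based position of the first occurrence of find in the string in, or 0 if it is not present.

localhost ~]$ awk 'BEGIN { print index("SomeLongString", "tr") }'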
10

length([string])
Find the length of a string; with no argument, the length of $0. The example below prints the length of each line.

localhost ~]$ awk '{ print length($0) }' testfile
31
29
29
29

match(string, regexp [, array])
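Test a string against a regular expression; returns the position of the match, or 0 if there is none. The example matches lowercase letters in the file and prints each matching line.

localhost ~]$ awk 'match($0, /[a-z]/) { print $0 }' testfile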
column1 column2 column3 column4
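
match also sets the built-in variables RSTART (where the match begins) and RLENGTH (how long it is), which pair naturally with substr. The sketch below pulls the first dotted number out of one made-up rpm name.

```shell
version=$(printf '%s\n' 'libplist-1.11-2.fc20.x86_64' |
  awk 'match($0, /[0-9]+\.[0-9]+/) {
         # RSTART and RLENGTH describe the matched text
         print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
       }')
echo "$version"
```

For this input the match starts at position 10, is 4 characters long, and is 1.11.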

split(string, array [, fieldsep [, seps ] ])
Split a list of rpm names at dashes.

Contents of the file rpms:
libhbalinux-1.0.16-2.fc20.x86_64
gucharmap-3.10.1-1.fc20.x86_64
libplist-1.11-2.fc20.x86_64
libgcc-4.8.3-7.fc20.i686
glx-utils-8.1.0-4.fc20.x86_64
vlgothic-fonts-20140801-1.fc20.noarch

Split each line at the dashes, store the pieces in the array ary and print selected index values. The separators are kept in an array called seps (the fourth argument is a gawk extension).

localhost ~]$ cat rpms | awk '{split($0, ary, "-", seps) ; print ary[1],ary[2],ary[3]}'
libhbalinux 1.0.16 2.fc20.x86_64
gucharmap 3.10.1 1.fc20.x86_64
libplist 1.11 2.fc20.x86_64
libgcc 4.8.3 7.fc20.i686
glx utils 8.1.0
vlgothic fonts 20140801

Print both arrays, ary and seps, to show the contents of the separator array

localhost ~]$ cat rpms | awk '{split($0, ary, "-", seps) ; print ary[1],ary[2],ary[3],seps[1],seps[2]}'
libhbalinux 1.0.16 2.fc20.x86_64 - -
gucharmap 3.10.1 1.fc20.x86_64 - -
libplist 1.11 2.fc20.x86_64 - -
libgcc 4.8.3 7.fc20.i686 - -
glx utils 8.1.0 - -
vlgothic fonts 20140801 - -
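
split returns the number of pieces it produced, which makes it easy to reach the last element however many dashes a name contains. A small sketch using only the portable three-argument form, on one made-up rpm name:

```shell
parts=$(printf '%s\n' 'glx-utils-8.1.0-4.fc20.x86_64' |
  awk '{ n = split($0, ary, "-")     # n is the number of elements in ary
         print n, ary[1], ary[n] }')
echo "$parts"
```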

sub(regexp, replacement [, target])
Substitute the first match of a pattern with a string. The example below replaces the first dash followed by a digit with -->

localhost ~]$ cat rpms | awk 'sub(/-[0-9]/, " --> " )'
libhbalinux --> .0.16-2.fc20.x86_64
gucharmap --> .10.1-1.fc20.x86_64
libplist --> .11-2.fc20.x86_64
libgcc --> .8.3-7.fc20.i686
glx-utils --> .1.0-4.fc20.x86_64
vlgothic-fonts --> 0140801-1.fc20.noarch

substr(string, start [, length ])
Get a substring of a defined length starting from a given position (positions start at 1)

Let's use this file, which has two fields
localhost ~]$ cat nums
123456789 abcdef

Start at the 3rd position of the first field and print two characters.
localhost ~]$ awk '{print substr($1,3,2) }' nums
34
Start at the 3rd position of the second field and print two characters.
localhost ~]$ awk '{print substr($2,3,2) }' nums
cd

tolower(string)
Convert the letters of a string to lower case

tolower("MiXeD cAsE 123") returns "mixed case 123".

The example below converts an entire file to lowercase

localhost ~]$ cat letters
This is Just Some Random Text Here ..
localhost ~]$ awk '{ print tolower($0)}' letters
this is just some random text here ..

toupper(string)
Convert the letters of a string to upper case

localhost ~]$ awk '{ print toupper($0)}' letters
THIS IS JUST SOME RANDOM TEXT HERE ..

The operation can be applied to selected fields, for example to convert only the first field to upper case:

localhost ~]$ awk '{ print toupper($1)}' letters
THIS

Built in Operational Variables

IGNORECASE <digit>
A gawk-specific variable. If IGNORECASE is nonzero or non-null, then all string comparisons and all regular expression matching are case-independent.

OFS
The Output Field Separator. It is output between the fields printed by a print statement. Its default value is " ", a string consisting of a single space.

localhost ~]$ awk '{ OFS="|" ; print $1,$2,$3,$4}' testfile
column1|column2|column3|column4
1111|2222|3333|4444
1111|2222|3333|4444
1111|2222|3333|4444
The input field separator, FS, is a separate variable and can also be defined with the -F option. The following example defines the input field separator as : and prints the first field.
localhost ~]$ awk -F: '{ print $1}' /etc/passwd
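
FS and OFS can be combined to convert one delimiter to another. In the sketch below, which uses one sample passwd-style line, the assignment $1 = $1 changes nothing in the field itself but makes awk rebuild $0 with OFS between the fields.

```shell
converted=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' |
  awk 'BEGIN { FS = ":"; OFS = "," }   # read colon-separated, write comma-separated
       { $1 = $1; print }')            # touching a field forces $0 to be rebuilt
echo "$converted"
```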

ORS
The Output Record Separator determines how records (lines) are separated in the output. Its default value is "\n", the newline character.
Let's use the rpms file from earlier to print lines separated by ||.
localhost ~]$ awk '{ ORS="||" ; print $0}' rpms
libhbalinux-1.0.16-2.fc20.x86_64||gucharmap-3.10.1-1.fc20.x86_64||libplist-1.11-2.fc20.x86_64||libgcc-4.8.3-7.fc20.i686||glx-utils-8.1.0-4.fc20.x86_64||vlgothic-fonts-20140801-1.fc20.noarch||
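
ORS is written after every record, including the last one, which is why the output above ends with ||. If the trailing separator is unwanted, one possible approach is to print the separator before each record except the first, as in this sketch on made-up input:

```shell
joined=$(printf '%s\n' a b c |
  awk '{ printf "%s%s", (NR > 1 ? "||" : ""), $0 }   # separator goes in front, not behind
       END { print "" }')                            # finish with a single newline
echo "$joined"
```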

NF
The Number of Fields in the current line, separated by whitespace or by the value of FS.

Count the number of fields separated by :
localhost ~]$ awk -F: '{ print $0,NF}' /etc/passwd
root:x:0:0:root:/root:/bin/bash 7
bin:x:1:1:bin:/bin:/sbin/nologin 7
daemon:x:2:2:daemon:/sbin:/sbin/nologin 7
adm:x:3:4:adm:/var/adm:/sbin/nologin 7
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin 7
sync:x:5:0:sync:/sbin:/bin/sync 7
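
Since NF holds the number of fields, $NF refers to the last field of the line. A brief sketch on two sample passwd-style lines, printing the user name and the login shell:

```shell
shells=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
                       'sync:x:5:0:sync:/sbin:/bin/sync' |
  awk -F: '{ print $1, $NF }')     # $NF is the last field, here the shell
echo "$shells"
```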

RS

The input record separator. The default is a newline, but it can be changed to other values depending on the input file.
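
One useful setting is the empty string, which makes awk treat blocks of lines separated by blank lines as single records. The sketch below uses made-up data, and also sets FS to a newline so that each line of a block becomes one field.

```shell
records=$(printf 'name one\nsize 10\n\nname two\nsize 20\n' |
  awk 'BEGIN { RS = ""; FS = "\n" }           # blank lines separate records
       { print NR ": " $1 " / " $2 }')        # $1 and $2 are the lines of the block
echo "$records"
```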

ARGC, ARGV

The command-line arguments available to awk programs are stored in an array called ARGV.
ARGC is the number of command-line arguments.
ARGV holds the argument values and is indexed from 0 to ARGC - 1. The awk options and the program text are not included.
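
A minimal sketch that lists the arguments from a BEGIN block. The words first and second are placeholder arguments; because the program has only a BEGIN block, awk does not try to open them as files. Index 0 is skipped because it holds the program name, which varies between systems.

```shell
args=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' first second)
echo "$args"
```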

AWKPATH
gawk gets its search path for program files from the AWKPATH environment variable. If that variable does not exist, or if it has an empty value, gawk uses a default path, '.:/usr/local/share/awk'.
