Oct 18 2013

A Crash Course In AWK

A while back, MarkD wrote a great series of posts on DTrace. I’d never been exposed to DTrace—I assumed it was similar to strace. It’s a whole other animal, though—an event-based engine suitable for everything from debugging to systems scripting.

The coolest thing was that the most powerful aspects of DTrace come from its wholesale copying of the programming model of one of my favorite UNIX tools: AWK. If I’m hacking together something on the command line, chances are good that I’m using AWK for some part of it. AWK is much simpler than DTrace—it’s a general-purpose tool built around one big idea.

There are plenty of great resources on how to use AWK. Rather than write another one, this short post will show you the basics of what AWK is, and what it’s good for. You’ll need to know some command line basics, as well as what a regular expression is. All the examples in this post were written assuming an OS X environment.

Not An Operation—A Programming Model

They say that the UNIX way is to compose together small tools that do one thing well. AWK definitely does that, but not in the same way as head or tail do.

Let me show you what I mean. Let’s say that I have a little text file that is an inventory of all my worldly possessions:

bash-3.2$ cat inventory
beans and celery
beans and oatmeal
beans and beans
quinoa

Even if you’ve never seen the head command before, the following example will probably make sense:

bash-3.2$ cat inventory | head -1
beans and celery

AWK is different. If you saw this next example in a shell script, you’d have a hard time knowing what it meant without reading up on AWK:

bash-3.2$ cat inventory | awk '/oatmeal/ { print $1 ": featuring " $3 }'
beans: featuring oatmeal

That’s because AWK’s job isn’t to do one small thing. It’s to allow you to use one small idea: event-based programming.

Event-based Programming

In a normal procedural shell script or command line session, you’re telling the computer to do a sequence of things in a specific order. That’s not how AWK works. In AWK, you tell the computer how to look for events, and then tell it what to do when it finds an event you’re interested in.

Let’s take another look at that AWK program. This time, I’ll format it a bit more nicely:

/oatmeal/ {
    print $1 ": featuring " $3;
}

The first part of this program — /oatmeal/ — is the event that you’re looking for. Events can be specified in a few different ways: you can use a C-style conditional expression, or a special event like BEGIN that is triggered before the first line is read. However, the most common kind of event to see is a regular expression event, which is what /oatmeal/ is. If “oatmeal” appears in a line of text, then our event will be triggered.
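As a rough sketch of those three kinds of event side by side, here is one small program run against the inventory file from earlier. The printed labels are made up for illustration, and the printf line only recreates the file so the snippet stands on its own.

```shell
# Recreate the inventory file from earlier in the post.
printf 'beans and celery\nbeans and oatmeal\nbeans and beans\nquinoa\n' > inventory

awk '
BEGIN          { print "inventory report" }      # special event: fires before any input is read
/oatmeal/      { print "regex: " $0 }            # regular expression event
$1 == "quinoa" { print "conditional: " $0 }      # C-style conditional expression
' inventory
# prints:
#   inventory report
#   regex: beans and oatmeal
#   conditional: quinoa
```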

The action is the second part of this program, the part between the braces. This part is a procedural set of instructions to perform when your event occurs. Here, you have a small C-like programming language at your disposal, with for/while loops, if statements, and global variables.
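To give a feel for that action language, here is a minimal sketch that counts how often the word "beans" appears in the inventory file. It leans on two standard AWK features this post doesn't otherwise cover: NF, the number of words on the current line, and END, a special event that fires after the last line. The variable name beans is just an illustration.

```shell
# Recreate the inventory file from earlier in the post.
printf 'beans and celery\nbeans and oatmeal\nbeans and beans\nquinoa\n' > inventory

awk '
{
    # No event in front of the braces, so this action runs for every line.
    # Loop over each word; $i is the i-th word on the line.
    for (i = 1; i <= NF; i++) {
        if ($i == "beans") {
            beans++    # global variable; unset variables start out as zero
        }
    }
}
END {
    print "beans mentioned " beans " times"
}
' inventory
# prints: beans mentioned 4 times
```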

When AWK runs your program, it will read each line of input in, one after the other. Each time it reads in a line, it will see if your event occurred. If it has, then it performs your event’s action. You can define as many events as you like. If more than one event occurs, each event’s action is performed in the order they appear in your program.
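Here is a small sketch of that ordering rule, again using the inventory file. The "beans and oatmeal" line triggers both events, so both actions run for it, in the order they are written. The "first" and "second" labels are only there to make the order visible.

```shell
# Recreate the inventory file from earlier in the post.
printf 'beans and celery\nbeans and oatmeal\nbeans and beans\nquinoa\n' > inventory

awk '
/beans/   { print "first: " $0 }
/oatmeal/ { print "second: " $0 }
' inventory
# prints:
#   first: beans and celery
#   first: beans and oatmeal
#   second: beans and oatmeal
#   first: beans and beans
```

The quinoa line triggers neither event, so nothing is printed for it.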

Here’s a slightly more complicated example: an implementation of FizzBuzz on the command line, using seq and awk: (updated: now correct! I should know what FizzBuzz is before I write it. -Bill)

bash-3.2$ seq 1 100 | awk '
> ($1 % 3 == 0) {
>     printf("Fizz");
> }
> ($1 % 5 == 0) {
>     printf("Buzz");
> }
> ($1 % 3 != 0 && $1 % 5 != 0) {
>     printf("%s", $1);
> }
> {
>     printf("\n");
> }'

Simple String Processing

Our first AWK program didn’t use any loops or conditionals, but it did use a couple of other features specific to AWK. Here’s our first action again:

print $1 ": featuring " $3;

Since AWK is mainly used for wrangling text, it automatically does a bit of that work for you. It splits each line of text up into whitespace-separated words and stashes them in variables named $N, where N is the index of the word starting from 1 ($0 gives you the entire line of text).
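A quick way to see the splitting for yourself is to feed AWK a single line and print a few of the variables. This is only a sketch; the labels are made up.

```shell
echo "beans and oatmeal" | awk '{
    print "line: " $0     # the entire line
    print "first: " $1    # word number 1
    print "third: " $3    # word number 3
}'
# prints:
#   line: beans and oatmeal
#   first: beans
#   third: oatmeal
```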

AWK also makes it easy to paste two strings together. All you have to do is put them next to one another. So the line of code above pastes together $1 (“beans”), “: featuring ”, and $3 (“oatmeal”).
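One thing that can trip people up is the difference between putting strings next to each other and separating them with a comma in a print statement. The sketch below shows both. The variable name label is hypothetical.

```shell
echo "beans and oatmeal" | awk '{
    label = $1 ": featuring " $3    # strings next to one another are pasted together
    print label
    print $1 $3                     # nothing in between, so nothing is inserted
    print $1, $3                    # a comma makes print insert a space by default
}'
# prints:
#   beans: featuring oatmeal
#   beansoatmeal
#   beans oatmeal
```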

Getting Fancy: Multiple Events And Variables

Lots of AWK scripts do little more than look for a particular line in a file and print out a specific field, but you can use it to do simple parsing of structured text, too. For example, as an Android developer, I’m often working with XML layout files that look like this:

<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
  android:layout_width="match_parent"
  android:layout_height="match_parent"
  >

  <android.support.v4.view.ViewPager
    android:id="@+id/fragment_pager_viewPager"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:padding="24dp" />

</FrameLayout>

In my Java code, nine times out of ten I’m going to want to pull out a reference to the ViewPager I defined above by writing a line of code like this:

final ViewPager viewPager = (ViewPager)v.findViewById(R.id.fragment_pager_viewPager);

Now, if I were writing production code that translated that XML into that line of Java code, I’d want to use a real programming language with a real XML parser to avoid any parsing pitfalls. If I’m writing a tool for myself, though, that doesn’t sound like a lot of fun. It’d also be nice to be able to parse sloppy input, like a small fragment of the XML file containing just a few views. That won’t work with a beefier XML parser, which will yell at me if it doesn’t receive perfect input.

In its own slapdash way, AWK is handy with this kind of thing. I’ve got an AWK script I use for just this task. It uses the gensub function, which is specific to gawk. (You can install gawk with Homebrew or MacPorts if you’re on a Mac.) Here’s the script:

#!/usr/bin/env gawk -f

BEGIN {
    # appropriate for an onCreateView
    spacing = "        ";
}

/<[a-zA-Z.]*/ {
    tagName = gensub(/^.*<([a-zA-Z0-9.]*\.)?([a-zA-Z0-9]*).*/, "\\2", 1, $0);
}

/android:id=\"@\+id\// {
    rawId = gensub(/^.*:id=\"@\+id\/([a-zA-Z0-9_]*)\".*/, "\\1", 1, $0);
    fieldName = gensub(/^.*_/, "", 1, rawId);
    if (tagName == "include") {
        tagName = "View";
    }
    print spacing "final " tagName " " fieldName " = (" tagName ")v.findViewById(R.id." rawId ");"
}

This script has three events. The first one, BEGIN, happens before processing any text. It defines the amount of leading whitespace, which we’ll need later on.

The second event looks for opening XML tags. Whenever it finds one, it uses the gensub function to pull out the last part of the class name with regex matching. It then stashes that class name in the tagName variable. So tagName will always store the last class name we read in.

The last event looks for the android:id attribute we’re interested in. When this happens, we should spit out a line of Java code. We can do that by using gensub again, first to pull out the id, then to strip out the underscored portion to get our variable name.

This script isn’t perfect—it’s easy to create an XML file that will break it. As long as the XML looks like the kind of XML my team writes, though, it’s great.
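If gawk isn’t available, the same idea can be roughed out with sub(), which plain POSIX awk provides. What follows is a sketch under a few assumptions, not a drop-in replacement: the file names layout.xml and findviews.awk are hypothetical, the character classes include digits so that a package segment like v4 survives, and the tag event requires at least one character after the < so that closing tags are skipped.

```shell
# Sample layout, copied from the post.
cat > layout.xml <<'EOF'
<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
  android:layout_width="match_parent"
  android:layout_height="match_parent"
  >

  <android.support.v4.view.ViewPager
    android:id="@+id/fragment_pager_viewPager"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:padding="24dp" />

</FrameLayout>
EOF

# findviews.awk is a hypothetical file name for this sketch.
cat > findviews.awk <<'EOF'
BEGIN {
    # appropriate for an onCreateView
    spacing = "        "
}

/<[a-zA-Z.]+/ {
    tagName = $0
    sub(/^[^<]*</, "", tagName)             # drop everything up to and including "<"
    sub(/[^a-zA-Z0-9.].*$/, "", tagName)    # drop everything after the class name
    sub(/^.*\./, "", tagName)               # drop the package prefix, if any
}

/android:id="@[+]id\// {
    rawId = $0
    sub(/^.*:id="@[+]id\//, "", rawId)      # drop everything before the id
    sub(/".*$/, "", rawId)                  # drop the closing quote and the rest
    fieldName = rawId
    sub(/^.*_/, "", fieldName)              # keep only the part after the last underscore
    if (tagName == "include") {
        tagName = "View"
    }
    print spacing "final " tagName " " fieldName " = (" tagName ")v.findViewById(R.id." rawId ");"
}
EOF

awk -f findviews.awk layout.xml
```

Because sub() edits a variable in place instead of returning a new string, each extraction takes a few more lines than the gensub version, but the events themselves are the same.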

I’m Sold, Bill. Where Can I Buy An Awk?

There’s a little bit more to AWK than I’ve covered here, but those are the basics. If you’re interested in more, check out Bruce Barnett’s tutorial and short reference here.

18 Comments

  1. Ed

    awk takes a file name as an argument on the command line, so in your first example, the cat is redundant; just use awk '/oatmeal/ { print $1 ": featuring " $3 }' inventory

    Nice intro to awk, thanks.

  2. Ryan Waldron

    Never had thought about awk that way. That’s very enlightening.

    There’s a bit of a typo in the paragraph beginning, “The second event looks for opening XML tags.” It says it stashes the class name in the fieldName variable, but I believe you mean the tagName variable.

    Thanks!

  3. Shane

    Great article! I’d never considered the “event-related” programming metaphor before, but structuring the explanation that way really snaps things into focus.

  4. Andrew Phillips

    Bill, I think you’ve spelled your surname wrong!

  5. camel

    Perl one-liner is far better.
    Use Perl one-liner.

  6. Nigel Trewartha

    A very nice article indeed !
    I wonder if AWK and other linux tools are available for Win 8.1.
    may make Windows worth using.

    • There are some distributions of unix toolsets out there for Windows, like cygwin, but if you’re really committed to the Windows development platform I would recommend shying away from them. I used cygwin all the time back when I worked on Windows, but no matter how cool the stuff I wrote with it was, my coworkers would never get on board.

  7. Michael Mauch

    Does that

    #!/usr/bin/env gawk -f

    line work on Mac OS X? It doesn’t work on Linux, I get

    /usr/bin/env: gawk -f: No such file or directory

    See also here.

    • I’ve just got an OS X machine, so I can’t be of help for you there. On OS X that example will work if saved to a file and run as an executable.

    • Øsse

      The reason it doesn’t work is due to the shebang mechanism (the #! thing). Using that you can only give the program one argument. Here env is given two arguments: ‘gawk’ and ‘-f’.

      Unless you are the kind of person who compiles/installs your own gawk in various locations you can use this without problems: #!/bin/gawk -f.

      The reason for using env is that env will take your PATH into account etc. Most of the time that’s not really necessary.

    • Aaron Toponce

      This is a hard lesson to learn, but one that everyone must learn eventually. Just because it works on your machine, doesn’t mean that the code is portable, and will run on every platform.

      gawk(1) is GNU’s implementation of the AWK programming language. Even though it can be POSIX compliant, that doesn’t mean the code you write is. Further, gawk(1) isn’t installed on all Unix variants by default, or even available in some cases.

      When writing shell code, especially when putting it up on your blog, be very careful that the code you write is portable, and you’re not getting caught in the traps of “bashisms” and the like.

      • Thanks for the feedback. I mainly use awk for local scripting these days, so portability has not been an issue I’ve needed to deal with. Do you have any suggestions on how the example might be made more portable? (Note that I’ll be sticking with gawk – you’re right that it’s not as portable, but it’s much more pleasant to use.)

        For the time being, I’ll add a disclaimer about the environment all the examples use.

  8. jarmod

    This is a misleading use of the term ‘event’, if you ask me. They are ‘matches’, not ‘events’.
