
Geoff Papilion has made a living running infrastructure for the past 15 years. He is currently employed at Wikia.com, scaling the infrastructure to 1.5 billion requests per day. Geoffrey is a DZone MVB, is not an employee of DZone, and has posted 26 posts at DZone.

Getting Unique Counts from a Log File

06.24.2013

Two colleagues of mine ask a very similar question in interviews. The question is not particularly hard, nor does it require a lot of thought to solve, but it's something that you, as a developer or as an ops guy, might find yourself needing to do. The question is: given a log file of a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log, and the number of times each IP has visited this system.

It's amazing how many people don’t know what to do with this. One of my peers asks people to do this using the command line; the other tells candidates they can do it any way they want. I like this question because it's VERY practical; I do tasks like this every day, and I expect the people I work with to be able to do them too.

A More Concrete Example

I like the shell solution, because it's basically a one-liner. So let's walk through it using access logs as an example.

Here is a very basic sample of a common access_log I threw together for this:

127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.168.0.1 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.2 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.5 - - [10/Oct/2000:13:56:42 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.6 - - [10/Oct/2000:13:57:05 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:59:28 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.3 - - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.7 - - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
192.168.0.7 - - [10/Oct/2000:14:20:54 -0700] "GET /exitst.html HTTP/1.0" 200 1506
192.168.0.8 - - [10/Oct/2000:13:22:42 -0700] "GET /exitst.html HTTP/1.0" 200 1506

Let's say you want to count the number of times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq you can find the answer. What you’ll want to do is pull the first field with awk, pipe that through sort, and then through uniq -c, which prefixes each distinct line with its count. The sort matters because uniq only collapses adjacent duplicate lines. This isn’t fancy, but it returns the result very quickly without a whole lot of fuss.

Like so:

~/Projects/access_logs$ awk '{print $1}' < access_logs  |sort | uniq -c
      1 127.0.0.1
      3 192.168.0.1
      1 192.168.0.2
      1 192.168.0.3
      1 192.168.0.5
      1 192.168.0.6
      2 192.168.0.7
      1 192.168.0.8
~/Projects/access_logs$ 

This gives you each hostname or IP, and the number of times they’ve contacted this server.
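The interview question also asks for the number of unique IP addresses itself, which the same tools can answer. The following is a minimal sketch rather than part of the original walkthrough: the temporary file, the shortened sample log, and the variable names unique_ips and top_ip are assumptions made so the snippet runs on its own.

```shell
# Hypothetical, shortened sample log written to a temp file (an assumption for illustration).
log=$(mktemp)
cat > "$log" <<'EOF'
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
192.168.0.1 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.168.0.7 - - [10/Oct/2000:14:16:32 -0700] "GET /missing3.html HTTP/1.0" 404 506
EOF

# Number of distinct client IPs: sort -u drops duplicate lines, wc -l counts what is left.
# tr strips the padding some wc implementations add.
unique_ips=$(awk '{print $1}' "$log" | sort -u | wc -l | tr -d ' ')
echo "unique IPs: $unique_ips"

# Busiest client: a second, reverse numeric sort on the uniq -c counts, then keep the top line.
top_ip=$(awk '{print $1}' "$log" | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "busiest IP: $top_ip"

rm -f "$log"
```

Against the real file you would drop the heredoc and point awk at access_logs directly.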

Upping the Complexity


Now for something more complex: let's say you want to get the most commonly requested document that returns a 404. Again, we can do this all in a shell one-liner. We still need awk, sort, and uniq, but this time we’ll also use tail. We use awk to examine the status field (9) and print the URL field (7) if the status returned was 404. We then use sort, uniq -c, and sort -n to order the results by count. Finally, we use tail to print only the last line, which has the highest count, and awk to print just the requested document.

So here is what this looks like:

~/Projects/access_logs$ awk '{if($9=="404"){print $7}}'  access_logs  |sort |uniq -c |sort -n |tail -1 |awk '{print $2}'
/missing.html

Of course, there are many other ways to do this. This one is simple, and the best part is that you can count on these tools being on almost every *nix system.
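As one illustration of those other ways, the counting can be done inside awk alone with an associative array, skipping sort and uniq entirely. This is a sketch under stated assumptions, not the approach used above: the temporary file, the shortened sample log, and the names hits, best, and top_404 are made up for the example.

```shell
# Hypothetical, shortened sample log written to a temp file (an assumption for illustration).
log=$(mktemp)
cat > "$log" <<'EOF'
192.168.0.1 - - [10/Oct/2000:13:55:41 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.2 - - [10/Oct/2000:13:55:48 -0700] "GET /missing.html HTTP/1.0" 404 506
192.168.0.1 - - [10/Oct/2000:13:58:36 -0700] "GET /missing2.html HTTP/1.0" 404 506
192.168.0.3 - - [10/Oct/2000:14:15:20 -0700] "GET /exitst.html HTTP/1.0" 200 1506
EOF

# Tally 404s per URL in the hits array, then scan it once in END for the largest count.
# If two URLs tie for the top spot, which one is printed is not defined.
top_404=$(awk '$9 == 404 { hits[$7]++ }
  END {
    for (url in hits)
      if (hits[url] > max) { max = hits[url]; best = url }
    print best
  }' "$log")
echo "$top_404"

rm -f "$log"
```

The trade-off is that awk holds one counter per distinct URL in memory, while the sort-based pipeline leaves that bookkeeping to sort.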

Published at DZone with permission of Geoffrey Papilion, author and DZone MVB. (source)
