Simple WebSite Log file analysis with bash, grep and awk

By neokrates, written on April 17, 2008


Suppose that you run a web site on a server or a cluster. You write logs, but actually have little idea of how your server performs. Or it may be your new job, or part of it, to assess how well the server side performs. Plugging in a commercial solution can also be a problem, maybe because of the price, maybe because your environment is just way too custom. So you have to come up with a simple, hand-coded solution.

Questions:

  • How many users log in per period of time (1 min, 1 hour, etc.)?
  • How many users perform a particular operation (reading data, posting messages, whatever else)?
  • What is the regular “user behavior”? When does your site get under stress, and when doesn’t it, during the day or the month?

 
Precondition: You log inside the code, and your logging system allows tracking of particular function calls. If that is not the case, it is easy enough to create such a logging system using log4j, log4cpp, Apache Commons Logging or some similar logging framework.

Alternative: There are also many ways to understand your site’s performance by using pluggable statistics tools, but that approach isn’t discussed here.

1

Get the log files

Where are your log files, actually? By convention, log files are stored on the Unix (or other server OS) machine itself. In an enterprise environment they sit on some server, typically reachable via ssh.

Log file names should identify the date they were written (contain a time-stamp). If that is not the case, it can be much harder to map your data to the actual day it was created.

In my case I have files on the server myhost in the format ${DATE}-${NAME}-source.log. The ${NAME} refers to the particular software module running on that server.

Create a local directory on your work PC; you can name it statistics, for example. Also create the subdirectory logs. That is useful because you don’t want a pile of data sitting in the project root. You may also want to organize your log data later, for example if you track the performance over a long time period.
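
For example (the directory names are just the ones suggested above; nothing later depends on them):

# Create the working directory with a logs subdirectory for the raw files.
mkdir -p statistics/logs
cd statistics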

Now, suppose I am interested in the login count on my site and thus take the module LOGINS. I just download the log files to my local directory:

scp myhost:${PATH_TO_LOGS}/*${NAME}-source.log .

REMARK: Getting the files may take slight to moderate effort. There is a chance you must contact the administrator for access information and .ssh keys.

The good thing is, you will need to log in only once, provided your regular expression matches all the log file names correctly.
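
If you do get shell access, setting up key-based authentication saves re-typing the password on every download. A minimal sketch, assuming OpenSSH on both sides (myhost as above):

# Generate a key pair once (accept the defaults or choose a passphrase).
ssh-keygen -t rsa
# Copy the public key to the log server; after this, scp no longer asks for a password.
ssh-copy-id myhost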

2

Bash: select the files you want to analyze

Using bash and a shell glob, we create a loop which picks up the names of the log files one after another.
I just change into the logs directory and select all files which have the .log extension:

#!/bin/bash
cd logs
for file in *.log; do
: # the processing steps are filled in below
done

3

Bash: filter out the requests you need

I use grep to do it. It is fast and you can search for almost anything. You grep the file your outer for loop has selected, of course. You can save the result in a unique file which simply gets the additional extension ‘.gr’.

#!/bin/bash
cd logs
for file in *.log; do

grep "userLoginSuccess INFO" $file>$file.gr

done

4

Bash: extract the time of each request

That is important. Your logging mechanism must support time-stamps, but as that is an industry standard, I guess it does. My input log data looks something like this:

2008-01-24 00:00:00,342 [thread-ED] my.long.package.name.userLoginSuccess INFO  param1=val1 param2=val2
2008-01-24 00:00:00,343 [thread-ED] my.long.package.name.userLoginSuccess INFO  param1=val1 param2=val2

 
It is normally easy to parse such a file with awk.

Here is how I do it (you can combine all the separate awk commands into one if you want to; I used several separate calls to keep it easy to read):

cat $file.gr|awk -F, '{print $1}'|awk '{print $2}' | awk -F: '{print "A"$1*60+$2"A"}' >  $file.txt

 
cat $file.gr – pipe the file;
awk -F, ‘{print $1}’ – only interested in the first column, where the separator is ‘,’
awk ‘{print $2}’ – only need the time, not the date; the separator is ‘ ’
awk -F: ‘{print "A"$1*60+$2"A"}’ – from the hour and minute I calculate the time. Time for me is a value between 0 and 1439: the exact minute of the day at which the call took place. I also put two ‘A’s around the output, which will allow me to parse for that time tag later.
> $file.txt – output the result into a second unique file.

Now the file contains all the requests you are interested in, identified only by the time tag (in minutes of the day) at which each one took place. Again, the two ‘A’s are a simple way to grep for 1 and not also match 11 or 111. It will look something like:

A0A
A0A
A0A
A0A
A1A
A1A
A1A
...
A1439A
A1439A
A1439A
A1439A
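
A quick throwaway check (not part of the script) shows why the tags matter; without them, a grep for minute 1 would also hit every line that merely contains the digit 1:

# Hypothetical three-line sample covering the minutes 1, 11 and 111.
printf 'A1A\nA11A\nA111A\n' | grep -c 1      # prints 3 – every line contains a "1"
printf 'A1A\nA11A\nA111A\n' | grep -c A1A    # prints 1 – only the exact minute 1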

 
We have now:

#!/bin/bash

cd logs
for file in *.log; do

grep "userLoginSuccess INFO" $file>$file.gr
cat $file.gr|awk -F, '{print $1}'|awk '{print $2}' | awk -F: '{print "A"$1*60+$2"A"}' >  $file.txt

done

5

Bash: Count call number per minute

It is simple. The number of lines with the same tag is the number of calls, and the Unix wc command can count those lines for you. That must be done for every minute of the day.

for i in $(seq 0 1439); do

res=`grep A${i}A ${file}.txt|wc -l`
echo $i , $res >> ops_per_min_$file.csv
done

 
for i in $(seq 0 1439); do – for every minute of the day, do…
res=`grep A${i}A ${file}.txt|wc -l` – count the number of calls…
echo $i , $res >> ops_per_min_$file.csv – save the total into a comma-separated file (.csv) as

time(min),  # of calls

 
You will have something like:

0 , 1209
1 , 525
2 , 611
..
23 , 2359

 
And the script would now be:

#!/bin/bash

cd logs
for file in *.log; do

grep "userLoginSuccess INFO" $file>$file.gr
cat $file.gr|awk -F, '{print $1}'|awk '{print $2}' | awk -F: '{print "A"$1*60+$2"A"}' >  $file.txt

for i in $(seq 0 1439); do

res=`grep A${i}A ${file}.txt|wc -l`
echo $i , $res >> ops_per_min_$file.csv
done

done
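
If the 1,440 grep passes per file feel slow on large logs, the same per-minute counts can be produced in a single pass. A sketch of one way to do it (same $file.txt as above; note that the output has the count first and the minute second, and minutes with no calls simply do not appear):

# Strip the A tags, sort the minutes numerically and let uniq count the duplicates.
awk -F'A' '{print $2}' $file.txt | sort -n | uniq -c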

6

Bash: count calls per hour

The same task, slightly more complex. We go through two loops: the first over the hours, the second over the minutes of each hour.

for i in $(seq 0 23); do
sum=0
for j in $(seq 0 59); do
((time = $i*60+$j))
res=`grep A${time}A ${file}.txt|wc -l`
((sum=$sum+$res))
done
echo $i , $sum>> ops_per_hour_$file.csv
done

 
for i in $(seq 0 23); do – go through the hours
sum=0 – here we accumulate the number of calls for the hour
for j in $(seq 0 59); do – go through the minutes of the hour
((time = $i*60+$j)) – the current minute, recalculated as a minute of the day
res=`grep A${time}A ${file}.txt|wc -l` – count the calls for this minute
((sum=$sum+$res)) – add them to the sum for the current hour
echo $i , $sum>> ops_per_hour_$file.csv – store the result

Here is the final version:

#!/bin/bash

cd logs
for file in *.log; do

grep "userLoginSuccess INFO" $file>$file.gr
cat $file.gr|awk -F, '{print $1}'|awk '{print $2}' | awk -F: '{print "A"$1*60+$2"A"}' >  $file.txt

for i in $(seq 0 1439); do

res=`grep A${i}A ${file}.txt|wc -l`
echo $i , $res >> ops_per_min_$file.csv
done

for i in $(seq 0 23); do
sum=0
for j in $(seq 0 59); do
((time = $i*60+$j))
res=`grep A${time}A ${file}.txt|wc -l`
((sum=$sum+$res))
done
echo $i , $sum>> ops_per_hour_$file.csv
done

done
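
As with the per-minute counts, the hourly totals can also be computed in one pass per file if the nested grep loop becomes a bottleneck. A sketch, again using the $file.txt produced above (hours with no calls are simply absent from the output):

# Divide each minute tag by 60 to get the hour, then count how often each hour occurs.
awk -F'A' '{print int($2/60)}' $file.txt | sort -n | uniq -c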

7

Bash: clean up

You can add rm *.gr and rm *.txt at the end of the script if you don’t need the temporary files.
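
For example, right after the outer loop finishes (make sure nothing else in the logs directory matches these patterns before you do this):

# Remove the intermediate grep results and the extracted time tags.
rm -f *.gr *.txt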

8

Combine the results to see the big picture

If you are interested in the monthly load, or even in whole years, you can just concatenate the daily statistics:

cat ops_per_hour*.csv > total_ops_per_hour.csv
cat ops_per_min*.csv > total_ops_per_min.csv

9

Render the results

Use any office software, like Microsoft Excel or OpenOffice. Of course, any other mathematical or statistical tool will do; most of them understand the .csv format we have produced. Load the .csv you are interested in, for a single day or for the overall period…

You have 2 columns:

time(min),  # of calls

 
Select the draw-chart option of your tool of choice, and then the bar chart type. Take the first column as the data range and the second as the data series. Render. Rescale the x axis so that each bar is clearly visible.
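
If you prefer to stay on the command line, a plotting tool such as gnuplot can draw the same bar chart; a minimal sketch, assuming gnuplot is installed (the file names here are just examples of the output produced by the script above):

# The hourly CSV has the form "hour , calls"; strip the blanks so gnuplot sees plain "hour,calls".
tr -d ' ' < ops_per_hour_2008-01-24-LOGINS-source.log.csv > hourly.csv
# Draw one box per hour: column 1 on the x axis, column 2 as the bar height.
gnuplot -persist -e "set datafile separator ','; set boxwidth 0.8; set style fill solid; plot 'hourly.csv' using 1:2 with boxes title 'calls per hour'"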

You can also create an additional column and store the average number of calls per minute during each hour; that is useful when you render the hourly statistics. Here is how it looked for me (daily load: requests per hour above, average request count per hour below):

10

Analysis

What can you read from the charts?

  • Watch for the regular “high load” times
  • The “daily cycle” of your site
  • Search for critical periods (peak load), where your system gets the most hits

 
In my case, I discovered that our corporate video-on-demand site gets critical load during “live events” on TV. And that might be just critical enough to add new servers or a load balancer, and not lose any customers.
