Bash scripting to compare the Ossasepia logs

The task is to devise a bash script to compare the logs of ossasepia on different servers (here in particular logs.nosuchlabs.com and logs.ossasepia.com).

Preliminary notes:

  • As advised, the raw knob can be used to extract the text of the logs. The raw mechanism will spit out a maximum of 500 lines per request.
  • I.e. if a user provides a large range of ids, it will have to be split into batches of 500 lines.
  • W.r.t. diff, the focus will be on ids < 1,000,000.
  • My initial idea to use R and connect to the db snapshot was an example of an unnecessarily bloated solution when readily available bash + curl + diff can do the job. As the thread points out, I was not considering that the db snapshots are synced 1x/day at different times (depending on time zone config of the box).
  • FWIW: even beyond common sense and using the right tool for the job – I have always wanted to improve my bash scripting skills.

Plan

  1. Create a simple case:
     1. Use curl on the raw knob links from each box and write the output to a text file.
     2. Use diff to compare the text files.
  • Include variables to substitute the start id and end id.
  • Devise a strategy for an id range above 500.
  • Enable passing arguments (url(s), startid and endid) to the bash script so it can be invoked easily from the command line.

Simple case

Beginning with manually using curl.

#!/bin/bash

# Fetch the same id range from each box via the raw knob.
curl "http://logs.ossasepia.com/log-raw/ossasepia?istart=999600&iend=999700" > ~/temp/log-test.txt
curl "http://logs.nosuchlabs.com/log-raw/ossasepia?istart=999600&iend=999700" > ~/temp/log2-test.txt

# Compare the two dumps.
diff -uNr ~/temp/log-test.txt ~/temp/log2-test.txt > ~/temp/hololo.txt

A quick test of diffing ids past 1,000,000.

#!/bin/bash

curl "http://logs.ossasepia.com/log-raw/ossasepia?istart=1000000&iend=1000400" > ~/temp/log-test.txt
curl "http://logs.nosuchlabs.com/log-raw/ossasepia?istart=1000000&iend=1000400" > ~/temp/log2-test.txt

diff -uNr ~/temp/log-test.txt ~/temp/log2-test.txt > ~/temp/hololo.txt

Including variables for url prefix, start id and end id

  • lulz note: after a few hours of head-banging using istart= 995000 and iend= 995500 – I realised that these do not exist in the ossasepia log, and I had the syntax right in my first attempt.

#!/bin/bash
urlPrefix1="logs.ossasepia.com/log-raw/ossasepia"
urlPrefix2="logs.nosuchlabs.com/log-raw/ossasepia"
startid=1001700
endid=1001900

curl "${urlPrefix1}?istart=${startid}&iend=${endid}" > ~/temp/log-test.txt
curl "${urlPrefix2}?istart=${startid}&iend=${endid}" > ~/temp/log2-test.txt

diff -uNr ~/temp/log-test.txt ~/temp/log2-test.txt > ~/temp/log-diff.txt

So far, so good. Now comes the relatively tricky part: extending the above to cover more than 500 lines. This will need some conditionals and a for loop thrown in for dealing with a large range.

Figuring out a larger range

Strategy:

  1. Obtain a startid and endid (i.e. istart and iend).
  2. If endid - startid <= 500, go ahead and use curl and diff directly.
  3. If endid - startid > 500:
     1. Divide the number of lines (endid - startid) by 500 to obtain the quotient and remainder.
     2. Use the quotient in a for loop as the number of batches; on each pass the internal start id (startidi) advances by 500.
     3. Set the internal end id (endidi) to startidi + 500 - 1, so that consecutive batches do not duplicate a line.
     4. Subtract the remainder from the original endid to get the start of the last portion, which runs up to endid.
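
Before wiring curl into it, the batch arithmetic of this strategy can be tried on its own. The following is only a sketch: print_batches is a hypothetical helper name (it does not appear in the scripts of this post), and it prints the istart/iend pairs instead of fetching anything.

```shell
#!/bin/bash
# print_batches STARTID ENDID [RANGELIMIT]   (hypothetical helper, for illustration)
# Prints one "istart iend" pair per batch, following the quotient/remainder strategy.
# RANGELIMIT defaults to 500, the maximum the raw knob returns per request.
print_batches() {
    local startid=$1 endid=$2 rangelimit=${3:-500}
    local span=$(( endid - startid ))
    local quotient=$(( span / rangelimit ))
    local remainder=$(( span % rangelimit ))
    local c startidi endidi
    for (( c = 0; c < quotient; c++ )); do
        startidi=$(( startid + c * rangelimit ))
        # -1 so that the next batch does not repeat this batch's last line
        endidi=$(( startidi + rangelimit - 1 ))
        echo "$startidi $endidi"
    done
    # Last portion: whatever is left after the full batches.
    echo "$(( endid - remainder )) $endid"
}

print_batches 1001900 1003700
```

For the range 1001900 to 1003700 this prints three full batches (1001900 1002399, 1002400 1002899, 1002900 1003399) followed by the last portion 1003400 1003700. When the range is smaller than the limit the quotient is 0, so only the last portion is printed and it spans the whole range.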

Implementing a simple conditional statement

#!/bin/bash
urlPrefix1="logs.ossasepia.com/log-raw/ossasepia"
urlPrefix2="logs.nosuchlabs.com/log-raw/ossasepia"
startid=999700
endid=999900
rangelimit=500

let subtrid=endid-startid

if [ "$subtrid" -le "$rangelimit" ]
then
    echo "Lines <= 500. Proceeding to curl and diff."
    curl "${urlPrefix1}?istart=${startid}&iend=${endid}" > ~/temp/log-test.txt
    curl "${urlPrefix2}?istart=${startid}&iend=${endid}" > ~/temp/log2-test.txt
    diff ~/temp/log-test.txt ~/temp/log2-test.txt > ~/temp/log-diff.txt
else
    echo "Lines > 500. Additional calcs required."
fi

Implementing the for loop for a range > 500

#!/bin/bash
urlPrefix1="logs.ossasepia.com/log-raw/ossasepia"
urlPrefix2="logs.nosuchlabs.com/log-raw/ossasepia"
startid=1001900
endid=1002900
rangelimit=500

let subtrid=endid-startid

if [ "$subtrid" -le "$rangelimit" ]
then
    echo "Lines <= 500. Proceeding to curl and diff."
    curl "${urlPrefix1}?istart=${startid}&iend=${endid}" > ~/temp/log-test.txt
    curl "${urlPrefix2}?istart=${startid}&iend=${endid}" > ~/temp/log2-test.txt
    diff ~/temp/log-test.txt ~/temp/log2-test.txt > ~/temp/log-diff.txt
else
    echo "Lines > 500. Entering Loop to split the range into batches of 500 lines."
    # The batches below are appended (>>), so start from empty files;
    # otherwise output left over from an earlier run ends up in the diff.
    : > ~/temp/log-test.txt
    : > ~/temp/log2-test.txt
    let quotient=$subtrid/$rangelimit
    let remainder=$subtrid%$rangelimit
    echo $quotient
    echo $remainder
    for (( c=0; c < $quotient; c++ ))
    do
	let "startidi=$startid + $c * $rangelimit"
	let "endidi=$startidi + $rangelimit - 1"
	echo $startidi
	echo $endidi
	curl "${urlPrefix1}?istart=${startidi}&iend=${endidi}" >> ~/temp/log-test.txt
	curl "${urlPrefix2}?istart=${startidi}&iend=${endidi}" >> ~/temp/log2-test.txt
    done
    # Last portion: from the end of the final full batch up to endid.
    let "portionstartid=$endid - $remainder"
    echo $portionstartid
    curl "${urlPrefix1}?istart=${portionstartid}&iend=${endid}" >> ~/temp/log-test.txt
    curl "${urlPrefix2}?istart=${portionstartid}&iend=${endid}" >> ~/temp/log2-test.txt
    diff ~/temp/log-test.txt ~/temp/log2-test.txt > ~/temp/log-diff.txt
fi

The above has been tested to work across a range of start and end IDs.

Adding some functions and other minor streamlining

  • A function to check whether the curl and diff outputs are empty.
  • The curl operations are put into a function since they are repeated.
  • The echo outputs are streamlined to be neater.

#!/bin/bash
urlPrefix1="logs.ossasepia.com/log-raw/ossasepia"
urlPrefix2="logs.nosuchlabs.com/log-raw/ossasepia"
startid="1001900"
endid="1003700"
log1_file=$(mktemp -t "$(date +"%Y_%H-%M-%S").log1")
log2_file=$(mktemp -t "$(date +"%Y_%H-%M-%S").log2")
diff_file=$(mktemp -t "$(date +"%Y_%H-%M-%S").difflog")
rangelimit=500

let subtrid=endid-startid

# Report where the outputs live and whether any of them is empty.
function check_output {
    echo "Log1 curl output is at $log1_file"
    echo "Log2 curl output is at $log2_file"
    echo "diff output is at $diff_file"

    if [ ! -s "$1" ] || [ ! -s "$2" ]
    then
	echo "At least one curl output returned nothing."
    fi

    if [ -s "$3" ]
    then
	echo "Diff file is not empty. Logs not equal"
    else
	echo "Diff file is empty."
    fi
}

# Fetch the id range $3..$4 from both url prefixes ($1 and $2).
function curler {
    curl "${1}?istart=${3}&iend=${4}" >> "$log1_file"
    curl "${2}?istart=${3}&iend=${4}" >> "$log2_file"
}

if [ "$subtrid" -le "$rangelimit" ]
then
    echo "Lines <= $rangelimit. Proceeding to curl and diff."
    curler "$urlPrefix1" "$urlPrefix2" "$startid" "$endid"
    diff -uNr "$log1_file" "$log2_file" > "$diff_file"
    check_output "$log1_file" "$log2_file" "$diff_file"
else
    echo "Lines > $rangelimit. Looping to split the range into batches."
    let quotient=$subtrid/$rangelimit
    let remainder=$subtrid%$rangelimit
    echo "Batches of $rangelimit lines = $quotient. Remaining lines = $remainder"
    for (( c=0; c < $quotient; c++ ))
    do
	let "startidi=$startid + $c * $rangelimit"
	let "endidi=$startidi + $rangelimit - 1"
	echo "istart is $startidi and iend is $endidi"
	curler "$urlPrefix1" "$urlPrefix2" "$startidi" "$endidi"
    done
    let "portionstartid=$endid - $remainder"
    echo "Last portion istart is $portionstartid"
    curler "$urlPrefix1" "$urlPrefix2" "$portionstartid" "$endid"
    diff -uNr "$log1_file" "$log2_file" > "$diff_file"
    check_output "$log1_file" "$log2_file" "$diff_file"
fi
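
As an aside, instead of testing whether the diff file is empty, the exit status of diff itself could be used: diff exits with 0 when the inputs are identical, 1 when they differ, and 2 when something went wrong (for example a missing file). A minimal sketch, where compare_logs is a hypothetical helper and not part of the script above:

```shell
#!/bin/bash
# compare_logs FILE1 FILE2 DIFF_FILE   (hypothetical helper, for illustration)
# Writes a unified diff to DIFF_FILE and reports based on diff's exit status:
# 0 = no differences, 1 = differences found, anything else = diff had trouble.
compare_logs() {
    local status=0
    diff -u "$1" "$2" > "$3" || status=$?
    case "$status" in
        0) echo "Logs are equal." ;;
        1) echo "Logs differ. See $3" ;;
        *) echo "diff failed with status $status" >&2 ;;
    esac
    return "$status"
}
```

It would be called as compare_logs "$log1_file" "$log2_file" "$diff_file". This separates "the logs differ" from "diff could not run", a distinction the emptiness check cannot make.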

Enabling the script to be called with parameters

#!/bin/bash
# Arguments: url prefix 1, url prefix 2, start id, end id.
urlPrefix1=$1
urlPrefix2=$2
startid=$3
endid=$4
log1_file=$(mktemp -t "$(date +"%Y_%H-%M-%S").log1")
log2_file=$(mktemp -t "$(date +"%Y_%H-%M-%S").log2")
diff_file=$(mktemp -t "$(date +"%Y_%H-%M-%S").difflog")
rangelimit=500

let subtrid=endid-startid

# Report where the outputs live and whether any of them is empty.
function check_output {
    echo "Log1 curl output is at $log1_file"
    echo "Log2 curl output is at $log2_file"
    echo "diff output is at $diff_file"

    if [ ! -s "$1" ] || [ ! -s "$2" ]
    then
	echo "At least one curl output returned nothing."
    fi

    if [ -s "$3" ]
    then
	echo "Diff file is not empty. Logs not equal"
    else
	echo "Diff file is empty."
    fi
}

# Fetch the id range $3..$4 from both url prefixes ($1 and $2).
function curler {
    curl "${1}?istart=${3}&iend=${4}" >> "$log1_file"
    curl "${2}?istart=${3}&iend=${4}" >> "$log2_file"
}

if [ "$subtrid" -le "$rangelimit" ]
then
    echo "Lines <= $rangelimit. Proceeding to curl and diff."
    curler "$urlPrefix1" "$urlPrefix2" "$startid" "$endid"
    diff -uNr "$log1_file" "$log2_file" > "$diff_file"
    check_output "$log1_file" "$log2_file" "$diff_file"
else
    echo "Lines > $rangelimit. Looping to split the range into batches."
    let quotient=$subtrid/$rangelimit
    let remainder=$subtrid%$rangelimit
    echo "Batches of $rangelimit lines = $quotient. Remaining lines = $remainder"

    for (( c=0; c < $quotient; c++ ))
    do
	let "startidi=$startid + $c * $rangelimit"
	let "endidi=$startidi + $rangelimit - 1"
	echo "istart is $startidi and iend is $endidi"
	curler "$urlPrefix1" "$urlPrefix2" "$startidi" "$endidi"
    done

    let "portionstartid=$endid - $remainder"
    echo "Last portion istart is $portionstartid"
    curler "$urlPrefix1" "$urlPrefix2" "$portionstartid" "$endid"
    diff -uNr "$log1_file" "$log2_file" > "$diff_file"
    check_output "$log1_file" "$log2_file" "$diff_file"
fi

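The above script, if saved as ~/temp/log-bash-curl-diff.sh, can be called as shown here. It is invoked with bash rather than sh, since it relies on bash constructs (let, function, (( )) loops) that a strictly POSIX sh such as dash does not accept.

bash ~/temp/log-bash-curl-diff.sh "logs.ossasepia.com/log-raw/ossasepia" "logs.nosuchlabs.com/log-raw/ossasepia" 1001900 1003700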

Lines > 500. Looping to split the range into batches.
Batches of 500 lines = 3. Remaining lines = 300
istart is 1001900 and iend is 1002399
istart is 1002400 and iend is 1002899
istart is 1002900 and iend is 1003399
Last portion istart is 1003400
Log1 curl output is at /var/folders/39/l1557gl175s593l7zjj9kd640000gn/T/2019_08-31-36.log1.GATDGr6j
Log2 curl output is at /var/folders/39/l1557gl175s593l7zjj9kd640000gn/T/2019_08-31-36.log2.tiPldwjW
diff output is at /var/folders/39/l1557gl175s593l7zjj9kd640000gn/T/2019_08-31-36.difflog.5WfHEFEL
Diff file is not empty. Logs not equal

Comparing logs for range 998683 to 1000000

  • 998683 is the beginning of the ossasepia log.

bash ~/temp/log-bash-curl-diff.sh "logs.ossasepia.com/log-raw/ossasepia" "logs.nosuchlabs.com/log-raw/ossasepia" "998683" "1000000"

Lines > 500. Looping to split the range into batches.
Batches of 500 lines = 2. Remaining lines = 317
istart is 998683 and iend is 999182
istart is 999183 and iend is 999682
Last portion istart is 999683
Log1 curl output is at /var/folders/39/l1557gl175s593l7zjj9kd640000gn/T/2019_07-47-56.log1.LDXQXheQ
Log2 curl output is at /var/folders/39/l1557gl175s593l7zjj9kd640000gn/T/2019_07-47-56.log2.4nLSpAfS
diff output is at /var/folders/39/l1557gl175s593l7zjj9kd640000gn/T/2019_07-47-56.difflog.TNbRHuwG
Diff file is empty.

Concluding remarks

  • A neat little bash script was constructed that retrieves content from 2 specified URLs and diffs the output. In particular, the script was built to compare the logs on logs.ossasepia.com and logs.nosuchlabs.com.
  • Functions, conditionals and loops in bash were learned and deployed, along with curl and diff.
  • Interim progress and general observations were discussed in .
  • Retrieving a large number of lines will take some time and is also dependent on the internet speed. The curl/diff files will be empty if the lines are non-existent.
  • Diff results of the logs from line 998683 to 1000000 indicate there are no missing lines.
  • The check_output function only checks whether the files are empty. It does not account for curl retrieving error messages.
  • In a batch retrieval, only the final curl output is checked for emptiness. Empty retrievals for a particular batch are not detected.
  • Overflow/underflow is not accounted for in this script. Refer to the thread for a short discussion.
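
One possible way to close the two check_output gaps noted above (error pages and empty individual batches) is to fetch each batch into a scratch file and inspect it before appending. This is a sketch only: fetch_batch is a hypothetical replacement for one half of curler, and it assumes the server answers a bad request with an HTTP error status, which has not been verified against either log box.

```shell
#!/bin/bash
# fetch_batch URL_PREFIX ISTART IEND OUTFILE   (hypothetical helper, for illustration)
# Downloads one batch into a scratch file first and only appends it to OUTFILE
# if curl succeeded and actually returned something.
# Returns 1 if curl failed, 2 if the batch came back empty.
fetch_batch() {
    local prefix=$1 istart=$2 iend=$3 outfile=$4
    local tmp
    tmp=$(mktemp) || return 1
    # --fail: exit non-zero on HTTP errors (>= 400) instead of saving the error page
    # --silent --show-error: hide the progress meter but still print error messages
    if ! curl --fail --silent --show-error \
            "${prefix}?istart=${istart}&iend=${iend}" > "$tmp"; then
        echo "curl failed for ${prefix} (${istart}..${iend})" >&2
        rm -f "$tmp"
        return 1
    fi
    if [ ! -s "$tmp" ]; then
        echo "empty batch from ${prefix} (${istart}..${iend})" >&2
        rm -f "$tmp"
        return 2
    fi
    cat "$tmp" >> "$outfile"
    rm -f "$tmp"
}
```

Inside the loop it would be used as fetch_batch "$urlPrefix1" "$startidi" "$endidi" "$log1_file" || exit 1, so that a bad batch stops the run instead of silently producing a misleading diff.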

3 responses on “Bash scripting to compare the Ossasepia logs”

  1. Not bad, quite well done for starters and nicely structured explanation of it all.

    A few things to look out for at all times (and with direct application here too):

    1. How does your code behave on unexpected/incorrect inputs? In this particular case: what happens for instance if startid > lastid or one of them is negative? This is a sort of basic awareness whenever you write some code: is my code well behaved when *outside things don’t go as they should*? Because in all real life situations, those pesky outside things will certainly NOT go as they should at one point or another.

    2. Whenever you have to repeat some operation (as here, the curl from A, curl from B, run a diff) there are 2 things you need to figure out: 1. what does the operation itself consist of 2. what is the stopping point ie “when is this done”. Note that this is more generic than “how many times should I do this”, precisely because there are several cases that are at core still the same old “repeat some operation”: a. with fixed, known upfront number of repetitions (usually this is implemented with a for loop but note that for is quite flexible in general so it can easily be used in other cases too) b. with initial check aka something that may be done 0 or more times (usually a while condition do) c. with final check aka something that may be done 1 or more times (usually a repeat until or do while).

    All the above cases (a,b,c) have an if in them even if not obvious – the “check” is precisely an if (and for works with a check at every step too, after the change to the counter var). As such, your initial check of the range is just the base case and there’s no need for it to be separate really (if you look at your code you’ll notice you have the same lines repeated outside of the for loop and in the for loop and that’s why) – your repetition there is of type b (since you want to make sure you don’t proceed if the given indices are bogus) aka initial check: as long as you still have *some* lines to check, you take the first 500 (or all remaining ones if less than 500) and check; advance and repeat.
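
The "type b" repetition described in this reply (check first, take up to 500 lines, advance, repeat), together with the input checks from point 1, could be sketched roughly as follows. walk_range is a hypothetical name; it only prints the istart/iend pairs where a real script would curl and diff, and it does not handle ids written with leading zeros.

```shell
#!/bin/bash
# walk_range STARTID ENDID [BATCH]   (hypothetical helper, for illustration)
# Prints "istart iend" for each batch of at most BATCH lines (default 500).
walk_range() {
    local startid=$1 endid=$2 batch=${3:-500}
    local v
    # Reject anything that is not a plain non-negative integer.
    for v in "$startid" "$endid"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "not a non-negative integer: '$v'" >&2
                return 1 ;;
        esac
    done
    if [ "$startid" -gt "$endid" ]; then
        echo "startid ($startid) is greater than endid ($endid)" >&2
        return 1
    fi
    # Initial-check loop: runs as long as there are still lines left to fetch.
    local istart=$startid iend
    while [ "$istart" -le "$endid" ]; do
        iend=$(( istart + batch - 1 ))
        # Last batch: take all remaining lines if fewer than a full batch.
        if [ "$iend" -gt "$endid" ]; then
            iend=$endid
        fi
        echo "$istart $iend"
        istart=$(( iend + 1 ))
    done
}

walk_range 998683 1000000
```

For 998683 to 1000000 this prints 998683 999182, 999183 999682 and 999683 1000000. There is no separate branch for small ranges: a range that fits in one batch simply goes through the loop once.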
