One of the most important tools we use in our performance analysis engagements is MR Skew. It’s been an important part of our software tools catalog for a little over a year. Today, I’ll explain why we built it and show you what it can do for you.

The "skew" in "MR Skew" is there because finding skew is the keystone of performance analysis. That sounds like a riddle, but it's literally true. Skew is a non-uniformity in a list. If you can't find those non-uniformities, then you can’t fix your performance problems. I’ll show you.

Imagine that it takes 10 seconds for your database application to return a result to you after you click. You want the answer in 1 second or less. Why did it take so long? One might naively assume, “My computer has CPU, a disk, memory, and a network; since there are 4 such resources, it must be spending 25% of its time using each of these 4 resources.” Probably not: there’s almost always non-uniformity in how your click spends time among those components. One might even more dangerously assume, “Database response time is almost always dominated by disk I/O latency; it must be spending 80% of its time reading from disk.” This is not the kind of thing you should guess about. You should know:

Resource typeResponse time
Total10.0100%
CPU8.080%
Disk1.010%
Memory.77%
Network.33%

The non-uniformity here—the skew—is the most important thing you need to know if you’re going to fix the performance of your click. From this table, you can see now, for example, that upgrading your disk subsystem isn’t going to improve performance by 80%, or even 25%. It can improve performance at most by 10%.

Next step: What’s consuming so much CPU? On a database application, it’s your database calls that consume your CPU, so the next thing you need to know is whether your time is uniformly distributed among database call types:

Call typeResponse time
Total8.0100%
PARSE6.075%
FETCH1.620%
EXEC.45%

Now you know it’s a PARSE call problem. That’s vital information, because a PARSE call problem requires a completely different kind of solution than a FETCH call problem. If you had assumed a uniform distribution of response time contribution across database call types, you’d have missed your big leverage opportunity, which is to focus on PARSE calls.

The next step is yet another skew analysis. There’s possible non-uniformity in the list of PARSE call durations as well. Is PARSE call time uniformly distributed across SQL statements?

SQL hash valueResponse time
Total6.0100%
91428148495.490%
1403776981.12%
All other.58%

Yep, one SQL statement accounts for 90% of the parsing. The next skew analysis: are all PARSE calls for this statement of equal duration? Or is there skew there, too?

PARSE call idResponse time
Total5.4100%
15.399%
All other.11%

So, it’s one PARSE call for this statement (the first one) that’s causing all the ruckus: one PARSE call is responsible for 53% of your response time. From there, you might drill into the PARSE call code path to see if there’s skew in how its individual function calls or lines of code path contribute to response time.

These are the steps we execute when we analyze performance. And that’s why we built MR Skew. MR Skew makes it easy for us to perform all of these skew analysis steps upon an Oracle trace file.

Here’s a real-life Oracle analysis using MR Skew. It begins with a properly scoped Oracle trace file that represents a performance problem. My first MR Skew call (mrskew) shows all the database calls and system calls that Oracle reports into its trace data (using --name='.*', which means match all call names). I’ll group by the name itself (--group='$name'). The result is a profile by Oracle subroutine name of where my time has gone.

$ mrskew --name='.*' --group='$name' asv_ora_1492.trc
'name' VALUE                                 DURATION       CALLS         MEAN          MIN          MAX
---------------------------  ------------------------  ----------  -----------  -----------  -----------
FETCH                                22.480000  59.6%         199     0.112965     0.000000    21.010000
db file scattered read               14.100000  37.4%        1262     0.011173     0.000000     0.050000
direct path write                     0.410000   1.1%         318     0.001289     0.000000     0.040000
direct path read                      0.360000   1.0%         594     0.000606     0.000000     0.040000
PARSE                                 0.140000   0.4%          26     0.005385     0.000000     0.120000
5 others                              0.200000   0.5%         332     0.000602     0.000000     0.030000
---------------------------  ------------------------  ----------  -----------  -----------  -----------
TOTAL (10)                           37.690000 100.0%        2731     0.013801     0.000000    21.010000

Now that I know that FETCHing is consuming most of my time, I can drill into the FETCH calls alone (--name='FETCH'), and I can group by the SQL hash value (--group='$hv'). The result is a profile by SQL statement of where my FETCH time has gone.

$ mrskew --name='FETCH' --group='$hv' asv_ora_1492.trc
'hv' VALUE                  DURATION       CALLS         MEAN          MIN          MAX
----------  ------------------------  ----------  -----------  -----------  -----------
1923179528          21.470000  95.5%           4     5.367500     0.060000    21.010000
2856808142           0.550000   2.4%          18     0.030556     0.000000     0.090000
1848910272           0.200000   0.9%          18     0.011111     0.010000     0.020000
 612219287           0.180000   0.8%          36     0.005000     0.000000     0.010000
  99414984           0.050000   0.2%          12     0.004167     0.000000     0.010000
  8 others           0.030000   0.1%         111     0.000270     0.000000     0.010000
----------  ------------------------  ----------  -----------  -----------  -----------
TOTAL (13)          22.480000 100.0%         199     0.112965     0.000000    21.010000

Now I know exactly which SQL statement I need to go hunt down in my trace file. Furthermore, I know exactly how much of my response time I stand to save if I can shoot this thing down: it’s 21.47 seconds of my 37.69-second response time.

Next, I need to find out whether there's skew in the FETCH call durations. I drill into just the FETCH calls (--name='FETCH'), but this time, only for calls made by the one important SQL statement (--where='$hv==1923179528'), and I'll group by the line number in the trace file where the calls occur (--group='$line').

$ mrskew --name='FETCH' --where='$hv==1923179528' --group='$line' asv_ora_1492.trc
'line' VALUE                  DURATION       CALLS         MEAN          MIN          MAX
------------  ------------------------  ----------  -----------  -----------  -----------
        4190          21.010000  97.9%           1    21.010000    21.010000    21.010000
        4544           0.230000   1.1%           1     0.230000     0.230000     0.230000
        1238           0.170000   0.8%           1     0.170000     0.170000     0.170000
         687           0.060000   0.3%           1     0.060000     0.060000     0.060000
------------  ------------------------  ----------  -----------  -----------  -----------
   TOTAL (4)          21.470000 100.0%           4     5.367500     0.060000    21.010000

Now I know that 97.9% of this statement’s FETCH call time was consumed by the single fetch call described on line 4190 of my trace file. I am now zeroed in on the one FETCH call that accounts for 21.01 seconds of my 37.69-second response time. Here’s that line (and the 5 before and after it), produced using a tool called mrnl, which I’ll tell you about later.

$ mrnl --line=4190:5 --nouaf --nodelta asv_ora_1492.trc
 LINE#   BEGIN-TIME     END-TIME       DURATION TRACE-LINE
  4185            .            .              .  bind 2: dty=2 mxl=22(21) mal=00 scl=00 pre=00 oacflg=03 oacfl2=1 size=0 offset=48
  4186            .            .              .    bfp=025a17c8 bln=22 avl=04 flg=01
  4187            .            .              .    value=102381
  4188 15:43:30.390 15:43:30.390              . EXEC #21:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=4079725939
  4189 15:43:30.390 15:43:30.390              . FETCH #21:c=0,e=0,p=0,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=4079725939
  4190 15:42:55.070 15:43:30.390        35.320s FETCH #1:c=2101,e=3532,p=70923,cr=93424,cu=132,mis=0,r=15,dep=0,og=3,tim=4079725939
  4191            .            .              . BINDS #21:
  4192            .            .              .  bind 0: dty=2 mxl=22(21) mal=00 scl=00 pre=00 oacflg=03 oacfl2=1 size=72 offset=0
  4193            .            .              .    bfp=025a1798 bln=22 avl=03 flg=05
  4194            .            .              .    value=515
  4195            .            .              .  bind 1: dty=2 mxl=22(21) mal=00 scl=00 pre=00 oacflg=03 oacfl2=1 size=0 offset=24

From here, my exercise shifts to finding a way to optimize the SQL that motivated this 15-row (r=15) FETCH.

MR Skew can do things no other tool can do, and it’s fun to use because its flexibility rewards your creativity. MR Skew makes simple work of answering questions that are difficult to answer any other way. Here are some examples of common MR Skew queries:

  • Which database files provide the worst read call latencies? (Your SAN administrator will want to know how you figured that out.)
  • Which of my trace files shows the most time spent competing for latches?
  • Which SQL statements are consuming the most PARSE time?
  • Which non-PARSE calls have misses on the library cache?
  • Which array fetch sizes are consuming the most FETCH time?

How would you write these MR Skew queries? Have a look at our MR Skew web page and our mrskew manual page to find out.

Comments   

# A. Coskan Gundogar 2010-09-14 07:57
As a proud owner of MR Tools I think it rocks I can honestly recommend it to everyone. Easy to use, time saving and also helps to catch to problems easily. Thanks for writing it
Reply | Reply with quote | Quote | Report to administrator
# TonySleight 2012-05-02 23:32
Cary,

I attended your 'Mastering Oracle Trace Data' one day seminar in London last year. I have last week, successfully reduced the User I/O Wait times by over 64%, and for one query, reduced the PIO from 15,000,000 per day to 4000 per day. All this was achieved by analysis using Oracle Extended Trace files and MRSKEW.

MRSKEW made the analysis so much easier, processing gigabytes of trace file data in a matter of minutes. This is without doubt, the best £400 my company has spent.
Reply | Reply with quote | Quote | Report to administrator