| Performance is All About [MR] Skew | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Written by Cary Millsap | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Monday, 13 September 2010 15:04 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
One of the most important tools we use in our performance analysis engagements is MR Skew. It’s been an important part of our software tools catalog for a little over a year. Today, I’ll explain why we built it and show you what it can do for you. The "skew" in "MR Skew" is there because finding skew is the keystone of performance analysis. That sounds like a riddle, but it's literally true. Skew is a non-uniformity in a list. If you can't find those non-uniformities, then you can’t fix your performance problems. I’ll show you. Imagine that it takes 10 seconds for your database application to return a result to you after you click. You want the answer in 1 second or less. Why did it take so long? One might naively assume, “My computer has CPU, a disk, memory, and a network; since there are 4 such resources, it must be spending 25% of its time using each of these 4 resources.” Probably not: there’s almost always non-uniformity in how your click spends time among those components. One might even more dangerously assume, “Database response time is almost always dominated by disk I/O latency; it must be spending 80% of its time reading from disk.” This is not the kind of thing you should guess about. You should know:
The non-uniformity here—the skew—is the most important thing you need to know if you’re going to fix the performance of your click. From this table, you can see now, for example, that upgrading your disk subsystem isn’t going to improve performance by 80%, or even 25%. It can improve performance at most by 10%. Next step: What’s consuming so much CPU? On a database application, it’s your database calls that consume your CPU, so the next thing you need to know is whether your time is uniformly distributed among database call types:
Now you know it’s a PARSE call problem. That’s vital information, because a PARSE call problem requires a completely different kind of solution than a FETCH call problem. If you had assumed a uniform distribution of response time contribution across database call types, you’d have missed your big leverage opportunity, which is to focus on PARSE calls. The next step is yet another skew analysis. There’s possible non-uniformity in the list of PARSE call durations as well. Is PARSE call time uniformly distributed across SQL statements?
Yep, one SQL statement accounts for 90% of the parsing. The next skew analysis: are all PARSE calls for this statement of equal duration? Or is there skew there, too?
So, it’s one PARSE call for this statement (the first one) that’s causing all the ruckus: one PARSE call is responsible for 53% of your response time. From there, you might drill into the PARSE call code path to see if there’s skew in how its individual function calls or lines of code path contribute to response time. These are the steps we execute when we analyze performance. And that’s why we built MR Skew. MR Skew makes it easy for us to perform all of these skew analysis steps upon an Oracle trace file. Here’s a real-life Oracle analysis using MR Skew. It begins with a properly scoped Oracle trace file that represents a performance problem. My first MR Skew call (mrskew) shows all the database calls and system calls that Oracle reports into its trace data (using $ mrskew --name='.*' --group='$name' asv_ora_1492.trc 'name' VALUE DURATION CALLS MEAN MIN MAX --------------------------- ------------------------ ---------- ----------- ----------- ----------- FETCH 22.480000 59.6% 199 0.112965 0.000000 21.010000 db file scattered read 14.100000 37.4% 1262 0.011173 0.000000 0.050000 direct path write 0.410000 1.1% 318 0.001289 0.000000 0.040000 direct path read 0.360000 1.0% 594 0.000606 0.000000 0.040000 PARSE 0.140000 0.4% 26 0.005385 0.000000 0.120000 5 others 0.200000 0.5% 332 0.000602 0.000000 0.030000 --------------------------- ------------------------ ---------- ----------- ----------- ----------- TOTAL (10) 37.690000 100.0% 2731 0.013801 0.000000 21.010000 Now that I know that FETCHing is consuming most of my time, I can drill into the FETCH calls alone ( $ mrskew --name='FETCH' --group='$hv' asv_ora_1492.trc 'hv' VALUE DURATION CALLS MEAN MIN MAX ---------- ------------------------ ---------- ----------- ----------- ----------- 1923179528 21.470000 95.5% 4 5.367500 0.060000 21.010000 2856808142 0.550000 2.4% 18 0.030556 0.000000 0.090000 1848910272 0.200000 0.9% 18 0.011111 0.010000 0.020000 612219287 0.180000 0.8% 36 0.005000 0.000000 0.010000 99414984 0.050000 0.2% 12 0.004167 0.000000 0.010000 8 others 0.030000 0.1% 111 0.000270 0.000000 0.010000 ---------- ------------------------ ---------- ----------- ----------- ----------- TOTAL (13) 22.480000 100.0% 199 0.112965 0.000000 21.010000 Now I know exactly which SQL statement I need to go hunt down in my trace file. Furthermore, I know exactly how much of my response time I stand to save if I can shoot this thing down: it’s 21.47 seconds of my 37.69-second response time. Next, I need to find out whether there's skew in the FETCH call durations. I drill into just the FETCH calls (
$ mrskew --name='FETCH' --where='$hv==1923179528' --group='$line' asv_ora_1492.trc
'line' VALUE DURATION CALLS MEAN MIN MAX
------------ ------------------------ ---------- ----------- ----------- -----------
4190 21.010000 97.9% 1 21.010000 21.010000 21.010000
4544 0.230000 1.1% 1 0.230000 0.230000 0.230000
1238 0.170000 0.8% 1 0.170000 0.170000 0.170000
687 0.060000 0.3% 1 0.060000 0.060000 0.060000
------------ ------------------------ ---------- ----------- ----------- -----------
TOTAL (4) 21.470000 100.0% 4 5.367500 0.060000 21.010000
Now I know that 97.9% of this statement’s FETCH call time was consumed by the single fetch call described on line 4190 of my trace file. I am now zeroed in on the one FETCH call that accounts for 21.01 seconds of my 37.69-second response time. Here’s that line (and the 5 before and after it), produced using a tool called mrnl, which I’ll tell you about later. $ mrnl --line=4190:5 --nouaf --nodelta asv_ora_1492.trc LINE# BEGIN-TIME END-TIME DURATION TRACE-LINE 4185 . . . bind 2: dty=2 mxl=22(21) mal=00 scl=00 pre=00 oacflg=03 oacfl2=1 size=0 offset=48 4186 . . . bfp=025a17c8 bln=22 avl=04 flg=01 4187 . . . value=102381 4188 15:43:30.390 15:43:30.390 . EXEC #21:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=4079725939 4189 15:43:30.390 15:43:30.390 . FETCH #21:c=0,e=0,p=0,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=4079725939 4190 15:42:55.070 15:43:30.390 35.320s FETCH #1:c=2101,e=3532,p=70923,cr=93424,cu=132,mis=0,r=15,dep=0,og=3,tim=4079725939 4191 . . . BINDS #21: 4192 . . . bind 0: dty=2 mxl=22(21) mal=00 scl=00 pre=00 oacflg=03 oacfl2=1 size=72 offset=0 4193 . . . bfp=025a1798 bln=22 avl=03 flg=05 4194 . . . value=515 4195 . . . bind 1: dty=2 mxl=22(21) mal=00 scl=00 pre=00 oacflg=03 oacfl2=1 size=0 offset=24 From here, my exercise shifts to finding a way to optimize the SQL that motivated this 15-row ( MR Skew can do things no other tool can do, and it’s fun to use because its flexibility rewards your creativity. MR Skew makes simple work of answering questions that are difficult to answer any other way. Here are some examples of common MR Skew queries:
How would you write these MR Skew queries? Have a look at our MR Skew web page and our mrskew manual page to find out. Comments (2)
![]() Write comment
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


