Connection pools help solve a big performance problem, but they also make using trace data more difficult. Method R Tools, part of the Method R Workbench software package, makes it easier to measure individual user response time experiences on connection pooling systems. Now you can look at performance problems the way you’ve always wanted to see them.

Life Before Connection Pooling

Do you remember life before connection pooling? For a user to get some work done in his Oracle Database, he would have to connect, do the work, and then disconnect. On systems with just a few users, it worked just fine. However, on systems with tens of thousands of users, the connection management overhead would dominate the workload. All the connect/disconnect requests and all the database prepare (parse) calls would drive CPU loads and coherency delays (like waiting for latch acquisitions) through the roof.

Connection pooling helps you solve those problems. Now, instead of creating a new connection to Oracle and then parsing your own SQL statement (which probably looks just like the statement that all the other users are using, too), you merely grab an existing connection handle for the kind of SQL statement you want to execute, pass in the specific values that you want bound into the prepared statement’s placeholder variables, and get your work done without creating a connection or making even a single parse call. Systems built like this can scale to millions of users.
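For illustration only, here is a minimal sketch of that pattern in Python, using the python-oracledb driver. The credentials, DSN, pool sizing, and the clicks table are all hypothetical placeholders, not anything prescribed here.

# A minimal connection pooling sketch; every identifier below is made up.
import oracledb

pool = oracledb.create_pool(user="app", password="secret",
                            dsn="dbhost/orclpdb1",
                            min=4, max=50, increment=2)

def record_click(user_id: int, page: str) -> None:
    # Borrow an existing connection; no new Oracle session is created.
    with pool.acquire() as conn:
        with conn.cursor() as cur:
            # The statement text is constant, so the prepared statement can
            # be reused; only the bind values change from call to call.
            cur.execute("insert into clicks (user_id, page) values (:1, :2)",
                        [user_id, page])
        conn.commit()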

But Connection Pools Complicate Diagnosis

But, as “everyone knows,” connection pooling systems are much more difficult to diagnose for performance problems. The reason: thousands of different people can use a given connection over the course of a day, and a single person may use dozens of different connections throughout that same day, so on a connection pooling system it’s difficult to track down how the system has responded to a given individual user experience. For example, try pushing a trace file created on a connection pooling system through the Method R Profiler. Even though the Method R Profiler is unquestionably the World’s Greatest Software Tool for explaining how an Oracle Database has consumed the response time for a single individual user experience, the Profiler can’t make sense of a connection pooling trace file, because that file does not describe a single individual user experience. Such a file, in fact, describes many user experiences, and to most people, those experiences are too difficult to distinguish from one another.

Don’t Give Up

At this point, a lot of people just give in to the Dark Side and use statistics that are aggregated across countless user experiences. They give up on even trying to explain exactly how the time for a specific, given experience was consumed. Well, you don’t need to give up. The information you need is in your trace files. You just need to understand how to get it out. The mrskew utility in the Method R Tools package can do it for you.

What’s in a Connection Pool Trace File

To understand how to get response times out of your trace files, you need to understand what’s in a trace file generated on a connection pooling system. It’s usually a sequence that looks something like this:

WAIT ... nam='SQL*Net message from client' ela= 1202689 ...
A sequence of trace lines explaining time consumption for a single user experience; call it Experience A
WAIT ... nam='SQL*Net message from client' ela= 4260917 ...
A sequence of trace lines explaining time consumption for Experience B
WAIT ... nam='SQL*Net message from client' ela= 5213365 ...
A sequence of trace lines explaining time consumption for Experience C
WAIT ... nam='SQL*Net message from client' ela= 2044420 ...

Oceans and Islands

I think of each trace file emitted on a connection pooling system as an ocean of idle time speckled with islands of at-work time spent servicing database calls. Each experience “island” is delimited by a pair of long-latency SQL*Net message from client “oceans.” Each SQL*Net message from client line that the Oracle Database writes into a trace file is just a record of the kernel’s having blocked on an OS read() call for ela microseconds, waiting for its client to ask it to do something.
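To make the ocean-and-island idea concrete, here is a minimal sketch in Python (this is not how Method R Tools does it) that yields one chunk of trace lines per island, treating any SQL*Net message from client wait longer than an illustrative 1.0-second cutoff as an ocean. The file name is hypothetical.

# Split a trace file into per-experience chunks at long client waits.
import re

OCEAN_THRESHOLD = 1.0  # seconds of client idle time that ends an experience
SNMFC = re.compile(r"nam='SQL\*Net message from client' ela=\s*(\d+)")

def experiences(path):
    """Yield lists of trace lines, one list per island of at-work time."""
    chunk = []
    with open(path) as f:
        for line in f:
            m = SNMFC.search(line)
            if m and int(m.group(1)) / 1e6 > OCEAN_THRESHOLD:
                if chunk:
                    yield chunk   # the ocean closes the current island
                chunk = []
            else:
                chunk.append(line)
    if chunk:
        yield chunk

for i, exp in enumerate(experiences("v11203_ora_26827.trc"), start=1):
    print(f"experience {i}: {len(exp)} trace lines")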

Enter Method R Tools

The key, then, is to split the trace file into pieces that can each be measured—if you like, even profiled. That’s where mrskew comes in. The new txnz RC file shipped with Method R Tools version 3.1 makes it simple. Here is an example of using mrskew with txnz:

$ mrskew v11203_ora_26827.trc --rc=txnz 
   EXP-ID   DURATION       %   CALLS      MEAN       MIN       MAX
---------  ---------  ------  ------  --------  --------  --------
       24  88.238004  100.0%  17,359  0.005083  0.000000  0.283979
---------  ---------  ------  ------  --------  --------  --------
TOTAL (1)  88.238004  100.0%  17,359  0.005083  0.000000  0.283979

Now, this is not a very auspicious first attempt, because it makes my file look like it contains only one experience, when I am pretty sure that it contains lots of much shorter-duration experiences.

Oceans and Islands—oh, and Rivers

Here’s the problem. Of course, not every SQL*Net message from client call is an “ocean.” Some SQL*Net message from client calls are “rivers”—they’re short-duration calls representing network round-trips between the client and the database server that are part of the user’s experience. Each such river is thus part of some island. The txnz RC file defines an “ocean” as any SQL*Net message from client call that lasts longer than 1.0 seconds. For my trace file, which is from a really fast and busy system, I’m pretty sure that I need to change the definition of an “ocean.”

I can use mrskew to see what the SQL*Net message from client latencies look like in my trace file:

$ mrskew v11203_ora_26827.trc --name="message from client" --group='$line'
      '$line'   DURATION       %  CALLS      MEAN       MIN       MAX
-------------  ---------  ------  -----  --------  --------  --------
        21312   0.283979    0.3%      1  0.283979  0.283979  0.283979
        37077   0.174480    0.2%      1  0.174480  0.174480  0.174480
        21396   0.168399    0.2%      1  0.168399  0.168399  0.168399
         8810   0.167489    0.2%      1  0.167489  0.167489  0.167489
        36148   0.166965    0.2%      1  0.166965  0.166965  0.166965
        21410   0.165523    0.2%      1  0.165523  0.165523  0.165523
        29360   0.164519    0.2%      1  0.164519  0.164519  0.164519
        21036   0.163736    0.2%      1  0.163736  0.163736  0.163736
        29388   0.163088    0.2%      1  0.163088  0.163088  0.163088
        36316   0.161805    0.2%      1  0.161805  0.161805  0.161805
 5,031 others  86.231484   98.0%  5,031  0.017140  0.001152  0.161570
-------------  ---------  ------  -----  --------  --------  --------
TOTAL (5,041)  88.011467  100.0%  5,041  0.017459  0.001152  0.283979

Aha! My biggest SQL*Net message from client latency in the whole file is just 0.283979 seconds. No wonder txnz thought that all my SQL*Net message from client calls were rivers. With a little more mrskew work (specifically, adding the arguments --top=4000 --sort=2na to see the 4,000 smallest latencies), I was able to determine that a good criterion for distinguishing between rivers of busy round-trip activity and oceans of idle time was a threshold latency of .05 seconds.
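If you’re curious how you might eyeball that distribution outside mrskew, here is a rough Python sketch that sorts every SQL*Net message from client latency ascending, mimicking the --top=4000 --sort=2na inspection just described; the file name is hypothetical, and mrskew itself is the easier way.

# Print the smallest SQL*Net message from client latencies, ascending.
import re

SNMFC = re.compile(r"nam='SQL\*Net message from client' ela=\s*(\d+)")

with open("v11203_ora_26827.trc") as f:
    latencies = sorted(int(m.group(1)) / 1e6
                       for m in map(SNMFC.search, f) if m)

for v in latencies[:4000]:  # the 4,000 smallest latencies
    print(f"{v:.6f}")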

Seeing Real User Response Times

Now I can adjust txnz’s behavior. First, I copied my txnz RC file to my current working directory (the distributed file is in the directory `which mrskew`/../../rc), and then I edited its line number 6 as follows:

OLD:  if ($name eq "SQL*Net message from client" and $af > 1) {
NEW:  if ($name eq "SQL*Net message from client" and $af > .05) {

With this new RC file (which I called txnz.05), I obtained the following new report on my trace file:

$ mrskew v11203_ora_26827.trc --rc=txnz.05
     EXP-ID   DURATION       %   CALLS      MEAN       MIN       MAX
-----------  ---------  ------  ------  --------  --------  --------
          0  24.236626   27.5%     327  0.074118  0.050007  0.283979
      19547   2.212251    2.5%     807  0.002741  0.000000  0.049582
      27247   2.112561    2.4%     791  0.002671  0.000000  0.048360
      24221   1.927336    2.2%     267  0.007218  0.000000  0.048210
      16129   1.450686    1.6%     683  0.002124  0.000000  0.049147
      22289   0.997744    1.1%     643  0.001552  0.000000  0.045547
      29620   0.982700    1.1%     562  0.001749  0.000000  0.049281
       2843   0.967385    1.1%     655  0.001477  0.000000  0.048986
      33239   0.920264    1.0%     139  0.006621  0.000000  0.047733
      23031   0.917492    1.0%     647  0.001418  0.000000  0.049615
      17091   0.899165    1.0%     579  0.001553  0.000000  0.045020
      14701   0.864747    1.0%     123  0.007030  0.000000  0.049502
       6509   0.805075    0.9%     437  0.001842  0.000000  0.043662
        653   0.780152    0.9%     403  0.001936  0.000000  0.048553
      36583   0.773713    0.9%     484  0.001599  0.000000  0.030175
      26287   0.767064    0.9%     619  0.001239  0.000000  0.038591
       2333   0.750920    0.9%     103  0.007290  0.000000  0.045808
       9685   0.720571    0.8%     479  0.001504  0.000000  0.047614
      25107   0.718329    0.8%     115  0.006246  0.000000  0.043572
      28487   0.715467    0.8%     107  0.006687  0.000000  0.048749
 309 others  43.717756   49.5%   8,389  0.005211  0.000000  0.049996
-----------  ---------  ------  ------  --------  --------  --------
TOTAL (329)  88.238004  100.0%  17,359  0.005083  0.000000  0.283979

This is a list of transactions (the txn part of the RC file’s name) delimited by think times longer than .05 seconds (think time is the quantity that mathematicians abbreviate Z, which is, of course, the z part of the RC file’s name).

It is important to understand what you can see here:

This is a list of how long people waited for their clicks on this connection pooling system!

And it’s available to you in your plain old Oracle trace data, even if the developers of the application didn’t put any extra instrumentation into their code at all. (For example, if they had just set the Oracle client id, you could do even more, like see exactly who was making each click.)

How txnz Works: A Quick Tour

The following example should make it a bit clearer how txnz works. In the output, you can see the trace file line number, the eid value that txnz assigns to that line, and the whole trace file line itself:

$ mrskew 5184.trc --rc=txnz --group='sprintf("%5d %5d %s", $line, eid(), $text)' --name=all --rc=all
EXP-ID                                                                                                                           
---------------------------------------------------------------------------------------------------------------------------------
    1     1 Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production                                         
    2     1                                                                                                                      
    3     0 WAIT #0: nam='SQL*Net message from client' ela= 1000001 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360456862 
    4     4                                                                                                                      
    5     4 PARSE #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576                                
    6     4 EXEC #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576                                 
    7     4 WAIT #46: nam='SQL*Net message from client' ela= 1 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360347577      
    8     4 FETCH #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=1546270724,tim=9267360347577                                
    9     4 STAT #46 id=1 cnt=1 pid=0 pos=1 obj=0 op='FAST DUAL  (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)'                
   10     4 CLOSE #46:c=0,e=0,dep=0,type=3,tim=9267360347577                                                                     
   11     4                                                                                                                      
   12     0 WAIT #0: nam='SQL*Net message from client' ela= 1000001 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360456862 
   13    13                                                                                                                      
   14    13 PARSE #46:c=10,e=10,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576                              
   15    13 EXEC #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576                                 
   16    13 WAIT #46: nam='SQL*Net message from client' ela= 1000000 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267361347576
   17    13 FETCH #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=1546270724,tim=9267361347576                                
   18    13 STAT #46 id=1 cnt=1 pid=0 pos=1 obj=0 op='FAST DUAL  (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)'                
   19    13 CLOSE #46:c=0,e=0,dep=0,type=3,tim=9267361347576                                                                     
   20    13                                                                                                                      
   21     0 WAIT #0: nam='SQL*Net message from client' ela= 1000001 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360456862 
   22    22                                                                                                                      
---------------------------------------------------------------------------------------------------------------------------------
TOTAL (22)

The SQL*Net message from client call on line 3 creates a new experience id for line 4, and lines 5 through 11 use that experience id (eid=4). Notice that the SQL*Net message from client call on line 7 does not create a new experience id: its short latency means that it is only a river; it is part of experience 4. The experience id changes again on line 12, and lines 14 through 20 carry experience id 13. On line 16, there is a SQL*Net message from client call that’s almost long-latency enough to be an ocean, but not quite (remember, txnz defines an ocean as a SQL*Net message from client call with $af > 1).

Notice that each ocean line itself carries an eid value of 0 instead of its own line number. There’s a good reason for this: an ocean is not part of any island. So all of the ocean water goes into a separate bucket, with eid=0. Now you can understand why the report on my big real trace file earlier put 24.236626 seconds toward eid=0: that’s the combined size of all the oceans in that file.
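If seeing the mechanism in code helps, here is a minimal re-implementation sketch of the eid idea in Python. It is an illustration of the rule described above, not txnz itself (txnz is an mrskew RC file); the 1.0-second cutoff mirrors the shipped $af > 1 test.

# Assign each trace line an experience id: oceans get 0, and every other
# line inherits the line number of the first line after the last ocean.
import re

SNMFC = re.compile(r"nam='SQL\*Net message from client' ela=\s*(\d+)")
THRESHOLD = 1.0  # seconds; mirrors txnz's $af > 1

def assign_eids(lines, threshold=THRESHOLD):
    """Yield (line_number, eid, text) triples, numbering lines from 1."""
    eid = None
    for n, text in enumerate(lines, start=1):
        m = SNMFC.search(text)
        if m and int(m.group(1)) / 1e6 > threshold:
            yield n, 0, text   # ocean: idle time, part of no island
            eid = None         # the next line starts a fresh experience
        else:
            if eid is None:
                eid = n        # experience id = first line of the island
            yield n, eid, text

with open("5184.trc") as f:
    for n, eid, text in assign_eids(f):
        print(f"{n:5d} {eid:5d} {text}", end="")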

There’s Lots More You Can Do

Perhaps now it’s not that difficult for you to imagine using mrskew to drill in on the response time for an individual experience (e.g., with --rc=txnz --where="eid() == 42"). Or chop up a connection pooling trace file into the appropriate pieces that would be required to create individual profiles for all the individual experiences with the Method R Profiler….
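As a hypothetical illustration of that chopping idea, the splitting logic sketched earlier extends naturally. Note that a piece the Profiler could fully digest would also need the original trace file header; this sketch demonstrates only the split, and the file names are made up.

# Write each experience's trace lines to its own file.
import re
from collections import defaultdict

SNMFC = re.compile(r"nam='SQL\*Net message from client' ela=\s*(\d+)")
chunks, eid = defaultdict(list), None

with open("5184.trc") as f:
    for n, text in enumerate(f, start=1):
        m = SNMFC.search(text)
        if m and int(m.group(1)) / 1e6 > 1.0:  # ocean: close this island
            eid = None
            continue
        if eid is None:                        # a new island begins here
            eid = n
        chunks[eid].append(text)

for eid, lines in chunks.items():
    with open(f"5184_exp{eid}.trc", "w") as out:
        out.writelines(lines)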

It used to be that when I knew my trace data was going to be “connection pooled,” I’d get very nervous. Not so much anymore.