
Sasha Goldshtein is a Senior Consultant for Sela Group, an Israeli company specializing in training, consulting and outsourcing to local and international customers. Sasha's work is divided across three primary disciplines: he consults for clients on architecture, development, debugging and performance issues; he actively develops code using the latest bits of technology from Microsoft; and he conducts training classes on a variety of topics, from Windows Internals to .NET Performance. You can read more about Sasha's work and his latest ventures at his blog: http://blogs.microsoft.co.il/blogs/sasha. Sasha writes from Herzliya, Israel.

Doing Micro-Benchmarking Wrong for the Wrong Reasons

06.22.2012

This is a short excerpt from Chapter 2 (Performance Measurement) of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out.

The purpose is to determine “which is faster”: using the is keyword and then casting to the desired type, or using the as keyword and then checking the result for null. Here’s a test class and the two code fragments I’ve seen used for this comparison lately:

//Test class
class Employee {
  public void Work() {}
}

//Fragment 1 – casting safely and then checking for null
static void Fragment1(object obj) {
  Employee emp = obj as Employee;
  if (emp != null) {
    emp.Work();
  }
}

//Fragment 2 – first checking the type and then casting
static void Fragment2(object obj) {
  if (obj is Employee) {
    Employee emp = obj as Employee;
    emp.Work();
  }
}

A rudimentary benchmarking framework might go along the following lines:

static void Main() {
  object obj = new Employee();
  Stopwatch sw = Stopwatch.StartNew();
  for (int i = 0; i < 500; i++) {
    Fragment1(obj);
  }
  Console.WriteLine(sw.ElapsedTicks);
  sw = Stopwatch.StartNew();
  for (int i = 0; i < 500; i++) {
    Fragment2(obj);
  }
  Console.WriteLine(sw.ElapsedTicks);
}

This is not a convincing microbenchmark, although the results are fairly reproducible. More often than not, the output is 4 ticks for the first loop and 200-400 ticks for the second loop. This might lead to the conclusion that the first fragment is 50-100 times faster. However, there are significant errors in this measurement and the conclusion that stems from it:

  1. The loop runs only once, and 500 iterations are not enough to reach any meaningful conclusions: the whole benchmark takes a negligible amount of time to run, so it can be affected by many environmental factors. (A short sketch after this list puts the tick counts in perspective.)
  2. No effort was made to prevent optimization, so the JIT compiler may have inlined and discarded both measurement loops completely.
  3. The Fragment1 and Fragment2 methods measure not only the cost of the is and as keywords, but also the cost of a method invocation (of the FragmentN method itself!). It may be the case that invoking the method is significantly more expensive than the rest of the work.
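
To put “4 ticks” in perspective, it helps to check what a Stopwatch tick actually is on the machine in question. Stopwatch ticks are units of 1/Stopwatch.Frequency seconds, not the 100-nanosecond ticks used by TimeSpan; the following short sketch (not part of the original excerpt) prints the tick length on your system:

using System;
using System.Diagnostics;

class TickInfo {
  static void Main() {
    //Stopwatch counts ticks at Stopwatch.Frequency ticks per second
    Console.WriteLine("High resolution timer: {0}", Stopwatch.IsHighResolution);
    Console.WriteLine("Frequency: {0} ticks/second", Stopwatch.Frequency);
    Console.WriteLine("One tick = {0:F1} ns", 1e9 / Stopwatch.Frequency);
  }
}

On most machines, 4 ticks amounts to no more than a few microseconds, which says more about timer and call overhead than about the code being measured.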

Improving upon these problems, the following microbenchmark more closely depicts the actual costs of both operations:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Employee {
  //Prevent inlining this method
  [MethodImpl(MethodImplOptions.NoInlining)]
  public void Work() {}
}

static void Measure(object obj) {
  const int OUTER_ITERATIONS = 10;
  const int INNER_ITERATIONS = 100000000;
  for (int i = 0; i < OUTER_ITERATIONS; ++i) {
    Stopwatch sw = Stopwatch.StartNew();
    for (int j = 0; j < INNER_ITERATIONS; ++j) {
      Employee emp = obj as Employee;
      if (emp != null)
        emp.Work();
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
  }
  for (int i = 0; i < OUTER_ITERATIONS; ++i) {
    Stopwatch sw = Stopwatch.StartNew();
    for (int j = 0; j < INNER_ITERATIONS; ++j) {
      if (obj is Employee) {
        Employee emp = obj as Employee;
        emp.Work();
      }
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
  }
}
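
For completeness, here is a minimal entry point that drives Measure; it is not part of the original excerpt, and the comment about the first iteration simply mirrors how the numbers below were collected.

//Hypothetical driver for the Measure method above (not in the original excerpt)
static void Main() {
  object obj = new Employee();
  //The first timing of each block is typically noisier (JIT compilation, cold caches),
  //so treat it as a warm-up iteration and discard it, as in the results quoted below.
  Measure(obj);
}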

The results on a desktop machine (after discarding the first iteration) were around 410ms for the first loop and 440ms for the second loop, a reliable and reproducible difference of roughly 7%, which might have you convinced that it is indeed more efficient to use just the as keyword for casts and checks.

However, the riddles aren’t over yet. If we add the virtual modifier to the Work method, the performance difference disappears completely, even if we increase the number of iterations. This cannot be explained by the virtues or maladies of our micro-benchmarking framework; it comes from the problem domain itself. There is no way to understand this behavior without going down to the assembly language level and inspecting the loop generated by the JIT compiler in both cases. First, here is the code before adding the virtual modifier:

; Disassembled loop body – the first loop
mov edx,ebx
mov ecx,163780h ;MT of Employee
call clr!JIT_IsInstanceOfClass
test eax,eax
je WRONG_TYPE
mov ecx,eax
call dword ptr ds:[163774h] ;Employee.Work()
WRONG_TYPE:

; Disassembled loop body – the second loop
mov edx,ebx
mov ecx,163780h ;MT of Employee
call clr!JIT_IsInstanceOfClass
test eax,eax
je WRONG_TYPE
mov ecx,ebx
cmp dword ptr [ecx],ecx
call dword ptr ds:[163774h] ;Employee.Work()
WRONG_TYPE:

The instruction sequence emitted by the JIT compiler to call a non-virtual method and to call a virtual method was discussed in one of my posts more than five years ago. When calling a non-virtual method, the JIT compiler must emit an instruction that makes sure we are not making a method call on a null reference. The CMP instruction in the second loop serves that task. In the first loop, the JIT compiler is smart enough to optimize this check away, because immediately prior to the call, there is a null reference check on the result of the cast (if (emp != null) ...). In the second loop, the JIT compiler’s optimization heuristics are not sufficient to optimize the check away (although it would have been just as safe), and this extra instruction is responsible for the extra 7-8% of performance overhead.
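
The rule that the CMP instruction enforces is easy to demonstrate from C#; the two lines below are an illustration of that rule, not code from the original article:

//Even though Work() never touches any instance state, the CLR still requires
//this call to throw a NullReferenceException, which is why the JIT must emit
//a null probe (the cmp dword ptr [ecx],ecx above) before a non-virtual call.
Employee emp = null;
emp.Work(); //throws NullReferenceException at the call site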

After adding the virtual modifier, however, the JIT compiler generates exactly the same code in both loop bodies:

; Disassembled loop body – both cases
mov edx,ebx
mov ecx,1A3794h ;MT of Employee
call clr!JIT_IsInstanceOfClass
test eax,eax
je WRONG_TYPE
mov ecx,eax
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+28h]
call dword ptr [eax+10h]
WRONG_TYPE:

The reason is that when invoking a virtual method, there’s no need to perform a null reference check explicitly—it is inherent in the method dispatch sequence. When the loop bodies are identical, so are the timing results.
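
For reference, the only source change behind the identical loop bodies above is the virtual modifier on Work, keeping the NoInlining attribute from the improved benchmark:

class Employee {
  //Prevent inlining this method
  [MethodImpl(MethodImplOptions.NoInlining)]
  public virtual void Work() {} //virtual dispatch loads the method table from the object,
                                //so a null reference faults without an explicit check
}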

This small example should not dissuade you from ever writing your own micro-benchmarks. However, the worst performance optimization is one that is based on incorrect measurements, and unfortunately, manual benchmarking often leads straight into this trap.

Published at DZone with permission of Sasha Goldshtein, author and DZone MVB.


Comments

dieter von holten replied on Sat, 2012/06/23 - 3:14am

The micro-benchmarkers should also consider this post from a few days ago, Top 10 Causes of Java EE Enterprise Performance Problems, to put their results into a bigger picture.

Annie Calvert replied on Fri, 2012/08/10 - 3:49am

Important things for benchmarks are: warm up the JIT first by running the code several times before timing it, and make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds. For more, visit http://www.dapfor.com/en/net-suite/net-grid/features/performance.
