In my column last month, "WPP Tracing in Visual C++ 2010 Projects," I showed you how you could unleash the power of Windows software trace preprocessor (WPP)–style Event Tracing for Windows (ETW) tracing in your user-mode applications within Visual Studio 2010. In this column, I'll focus on how you can use the OS's built-in ETW support to perform sample-based profiling.

If you're unfamiliar with ETW, I highly recommend that you read over the high-level documentation on MSDN (see "Event Tracing" to get an overview of how it works). In a nutshell, the application acts as the provider of various events. A provider is identified by a unique GUID, and an application may implement multiple providers eafiguch identified by a GUID. On the other side is a consumer or collector of events, which typically buffers up these events and often flushes them out to disk as necessary. And finally, a controller is an application that orchestrates the entire operation.

I will be focusing on Xperf, which is a component of Windows Performance Analyzer (WPA), part of the Windows Performance Toolkit (WPT). The .msi for installing WPT is included in the Win32 Development Tools portion of the Windows SDK, which you can install via the web installer for the Windows SDK if you don't already have it installed. (For more information about installing Xperf, visit the Windows Performance Analysis Developer Center.)

Xperf is an ETW controller and a super powerful visualizer for analyzing Event Trace Log (ETL) files. The Windows OS, specifically the kernel, implements a provider known as the NT Kernel Logger, which is sometimes referred to as the default provider within Xperf. Even though you can use Xperf to log ETL from any provider, the bulk of Xperf's power centers on the kernel logger—which is what will be most relevant to us in this article.

I don't have the space to dig into the full capabilities of Xperf, as it is a very feature-rich application albeit with some functionality discoverability issues, which gives it a steep learning curve. But you can find plenty of documentation about it on the web as well as some nice blogs detailing various scenarios—for instance, blogs.msdn.com/ntdebugging. Note that Xperf cannot capture stack traces on OSs prior to Windows Vista because of a lack of support in those older kernels.

Stack Walking with Xperf
From a developer's troubleshooting perspective, the holy grail of Xperf's functionality is that of capturing stack traces. In reality, this relies upon OS support and, as mentioned, is unavailable prior to Vista and continues to evolve in Windows 7. You may be familiar with Kernrate, which is an older sample profiler that ships with the Windows Driver Kit (WDK). However, although Kernrate can tell you within which function the most time is spent, it cannot tell you how it came to that conclusion; for that, a stack trace is needed.

Another important note is that if you are using a 64-bit OS, you must set the Disable Paging Executive flag in order to capture stack traces. You can very easily do this by executing the command in Figure 1 from an elevated command prompt, then restarting.

When you start tracing from the kernel logger, you can set various flags to tell it what you want to capture. You can see those flags by executing the command shown in Figure 2 (which also displays the command's output).



You can string a list of these flags together using the plus (+) operator when starting Xperf, or you could also use one of the kernel groups, shown in Figure 3. In this case, the best kernel group to use is Latency. Therefore, to start a trace in Xperf that also captures stacks via sample profiling, execute the following command:

xperf -on latency -stackwalk profile

Once you are finished reproducing your scenario, you can then stop the trace and merge all the data into an ETL file by executing the following command (which places the results in example.etl):

xperf -d example.etl

You should consider that these traces could be quite large if you let them run for very long. Additionally, depending on the configuration of the system, you may lose some events due to latency, in which case you can usually resolve that problem by manually configuring Xperf's in-memory buffer sizes.

An Example Scenario
Imagine that you have a Microsoft Foundation Class (MFC)–based application under development, and an engineer in your testing group has found a situation where after performing a specific operation, the application goes unresponsive for 30 seconds and hammers the CPU at the same time. You asked additional questions and even watched the test engineer make it happen, but you cannot surmise what is going on. What should you do?

You could attach a debugger to the application at the time it's misbehaving and attempt to step through the code. But what if this happens only on this particular test engineer's machine? It is complicated to install the debugging tools on a test machine, and if you don't have a symbol server set up with source indexed symbols, then you'll also have to put the source and the symbols on the test machine.

Alternatively, you could produce an instrumented build of the application for the test engineer. But what if your instrumentation alters the timing such that the problem does not occur?

Situations like this are where Xperf really shines. Using Xperf, you can capture a trace on the test engineer's machine, then take the resulting ETL file back to your desk and analyze it. Best of all, you are modifying the environment as little as possible and using the ETW tracing mechanisms already built into the OS.

To simulate a buggy application like this, I simply created a basic MFC application using the wizard in Visual Studio 2010. Then, I wired up the following code to the File, New menu item:

void CMainFrame::OnFileNew()
\\{
    DWORD time = GetTickCount();

    while( GetTickCount() - time < 30000 );
\\}

As you can see, this function will hog the processor for 30 seconds before it falls through. So, to simulate the buggy behavior, one can choose File, New from the menu. For now, let's ignore that this particular problem could be spotted easily with a code review.



Collecting and Examining the Trace
After starting the buggy application, but before reproducing the problem, go ahead and launch an elevated command prompt. This is required because the ETW controller must be elevated. Once you select File, New from our buggy application, start the ETW collection using the following command:

xperf -on latency -stackwalk profile

Let the collection proceed for a few seconds, then stop the event collection with the following command:

xperf -d example.etl

Now, from a non-elevated command line, execute Xperf again using the following command:

xperf example.etl

The last command actually launches the Xperf viewer, which is where we begin to realize the full power of Xperf. The first thing you will notice after Xperf has parsed the ETL is that it presents you with a series of graphs representing the activity on the system. Typically, you will see graphs such as CPU Sampling by CPU, which shows the percentage utilization of each CPU. For my trace, it looks like the graph shown in Figure 4.

You will also see the graph CPU Sampling by Process, and you should notice on that graph that our buggy application is flat-lined across the top, as shown in Figure 5. When you hover the mouse over one of the lines on the CPU Sampling by Process graph, it will show you the name and process identifier (PID) of the process. There are quite a few more graphs presented, some of which are not enabled by default. On the left side of the main Xperf window is a button for a flyout that you can activate to turn on or off some of the other available graphs. You can also do so by using the Graphs menu item.

Note: If you expected the buggy application to show 100 percent utilization on one of the processors in the CPU Sampling by CPU graph, the reason it does not is that the thread for the buggy application is being scheduled to multiple processors each time it is given a time slice. You can use Process Explorer to modify the process's processor affinity to achieve that result. For example, if you suspect a process is utilizing 100 percent of a processor's time, you should set its affinity to a single processor to verify. On my machine, I have eight logical cores and therefore, a runaway process is harder to spot since it only shows up in Task Manager as consuming 12 to 13 percent of the total CPU utilization.

The next thing you want to do is point Xperf to your symbols, and you will also want to include the path to the Microsoft public symbol server. My symbol path looks like the following (with path_to_buggy_app_symbols replaced with the actual path to the PDB files for the buggy application): srv*c:\symbols*http://msdl.microsoft.com/download/symbols;C:\path_to_buggy_app_symbols.

Once you have symbols in place, be sure to enable Trace, Load Symbols from the menu; then you can start to examine stacks. The way to do this is to select a time span of interest in one of the sampling graphs, such as CPU Sampling by CPU, then right-click and select Summary Table. The first time you do this, it could take some time as the required symbols are downloaded. Once you do this, a window similar to that in Figure 6 will be displayed (I selected a small portion of the CPU Sampling by CPU graph during the time the application I titled CpuHog.exe was misbehaving).


 
As you can see, this view shows you all the processes running during the selected interval as well as their relative weight. If you don't see the same columns as shown in Figure 6, you can bring out the flyout to easily turn particular columns on and off. Not surprisingly, CpuHog.exe is right near the top just under the system Idle process.

The gold horizontal bar is the aggregation bar and arguably the most powerful feature of the Xperf summary table view. It serves as a grouping operator, and anything on the left side serves as a grouping selector. For example, if you expand the Stack node for the CpuHog.exe process shown in Figure 6, you'll see that it starts showing you more details of the stacks sampled for CpuHog.exe. As an experiment, to see the aggregation bar in action, drag the Module column to a point between the Process and Stack column. Now you get an expandable node in the Process column that expands the modules for the process relevant to the range selected. You can also see that the sampled stacks have been partitioned by Module, as shown in Figure 7.
 
In this view, you can see that KernelBase.dll has the highest weight within the CpuHog.exe process. This should not be a surprise since I am calling GetTickCount() as frequently as possible. If you expand the Stack node just to the right of KernelBase.dll, you can verify this, and you'll see a stack similar to the one that Figure 8 shows.

You may have noted that the stack is upside down when compared to the way we're used to looking at stacks in the debugger. In more complicated applications, you'll often see branches in the stack trace where a certain function, or node, along the way is expandable. This generally means that the function was sampled calling into multiple leaf functions. In such cases, you can see which path it spent the most time in by examining the weight. In this case, since there is no branching in the stack trace, the weight is uniform all the way down the stack since this same stack was the one sampled during the interval I selected for the summary view.

If you have a complicated stack trace and you just want to know who calls a particular function within the range you selected, you can right-click on the function and choose View Callers, Innermost, which will not only answer your question but will also show you the stack in the order your accustomed to in the debugger.

So for the purpose of this example, the summary view clearly shows us that most of the time is spent in GetTickCount(), which is called by CMainFrame::OnFileNew. Armed with this information, we can go straight to that point in the code and examine the problem in detail. Of course, in real life you will be presented with much more complicated scenarios.

A Versatile Tool
Although Xperf is not suitable for fine-grained performance tuning of code, it is certainly very powerful in pointing out performance bottlenecks in applications in a non-invasive manner. The fact that Xperf is non-invasive and relies on capabilities already built into the Windows 7 and Vista OSs makes it an attractive and ideal tool that every developer and tester should have in their toolbox. But Xperf does not just apply to user-mode applications. You can also use it to identify problems in driver code. And finally, what I have shown you here merely just scratches the surface of Xperf's capabilities. I highly recommend that you read the MSDN documentation as well as the various blog postings available on the Internet. Happy troubleshooting!


Trey Nash is a senior escalation engineer at Microsoft working on the Windows OSs and various other products. When he is not working feverishly within the bowels of the OS, he is delivering training on .NET platform and kernel mode debugging or playing ice hockey.