By now, you've likely heard of the "manycore shift." Hardware manufacturers that previously enabled performance improvements through increases in clock speed have hit against laws of physics that prevent such scaling from continuing. Instead, the silicon that otherwise would have been used for such endeavors has been applied to increasing the number of cores and processors available in commodity hardware. It's now uncommon in the United States to find single-core desktop machines for sale, with the norm being dual-core and quad-core machines, and with that norm increasing to eight-core, sixteen-core, and higher in the not too distant future.

Back in 2005, Herb Sutter wrote about this phenomenon in his now-classic essay, "The Free Lunch Is Over". Herb's thesis is simple: Developers need to start coding their applications to take advantage of parallelism and therefore enable these programs to automatically scale as more and more logical processors are introduced into mainstream hardware. Historically, this has been a difficult task, and the increased concept count, boilerplate code, and complexity necessary to develop parallel implementations has relegated the art to only a select few people specialized in the practice. To truly enable the manycore shift, new support is necessary to make developing for manycore easier.

The Microsoft .NET Framework 4 and Visual Studio 2010 strive to do just that. In the core of the .NET Framework 4 and built in to Visual Studio 2010 are new features that help developers express, efficiently execute, debug, and tune parallelism in their applications. This article provides an overview of that support, starting with work done in the depths of the CLR and moving up the stack into programming models and tooling capabilities.

Minimizing the Overhead of Parallelism
The improvements to support parallel computing within the .NET Framework 4 begin within the bowels of the Framework. To parallelize a single operation, you must partition that operation into multiple pieces, each of which may then be executed asynchronously so that you can execute the pieces concurrently with one another. The more logical processors in a machine, the more partitions are necessary because if there are fewer partitions than there are processors, at least one processor will remain unused in the operation. Of course, executing a piece of work asynchronously has some overhead, even if it's as small as putting a delegate into a data structure that enables background threads to find and execute that work.

Typically, the overhead associated with executing work asynchronously is not related to the amount of real work associated with that work item. As such, the more processors you need to be able to target, the more partitions you'll need to break a piece of work into, the greater the ratio of the overhead for each partition will be to the work associated with that partition, and the more time you'll spend in overhead instead of in processing real work. Thus to efficiently support parallelism, you need to minimize the overhead associated with executing a work item. Toward this end, the ThreadPool in .NET 4 has been modified from earlier versions in several key ways.

The first improvement has to do with that data structure referred to earlier for handing off work to background processing threads. In .NET 3.5, that data structure (referred to as the ThreadPool's "global queue") is a linked list-based queue protected by a Monitor. The Monitor ensures that the queue remains consistent even when multiple threads attempt to mutate it (add and remove work items) concurrently. In .NET 4, that data structure has been replaced with a thread-safe queue implemented in a "lock-free" manner. This data structure has less overhead and scales better than its .NET 3.5 counterpart. Also, the general design and algorithms used for this data structure have been encapsulated into the new, public ConcurrentQueue<T> collection, available in the System.Collections.Concurrent namespace, along with several other new thread-safe and highly scalable collections.

A second improvement relates to the cost of operations involved in synchronizing between multiple threads and in accessing isolated data. Multiple such operations, including methods on the Interlocked class and on the Thread class, have been turned into just-in-time (JIT) intrinsics, which means the JIT compiler in the CLR knows about these operations and can output optimized machine code for them, rather than requiring the application to call the relevant methods and take a slow path through the CLR. Other implementations, such as the one used for ThreadStaticAttribute, have been overhauled as well. As an example of the benefits these improvements yield, on the machine on which I'm writing this article, the Interlocked.Exchange method is approximately 50 percent faster with .NET 4 than it is with .NET 3.5. On the same machine, reading from a ThreadStatic variable is more than 10 times faster. These kinds of operations are used frequently in the implementation of the ThreadPool, as well as in other parallelized code. Thus their improvements indirectly result in significantly decreased overheads.

A third and vital improvement also has to do with the data structures used for handing off work from threads queuing work items to threads consuming and processing those work items. Unlike in .NET 3.5, where the global queue was the only queue, in .NET 4 the ThreadPool potentially has many queues. The global queue is still used for work items that are queued from non-ThreadPool threads (such as the main thread of your application) and for work items that arrive through older APIs, such as ThreadPool.QueueUserWorkItem. However, the .NET 4 ThreadPool also maintains a special work-stealing queue for each of its threads. These data structures enable very low-overhead pushing and popping of work items from the thread's local queue, while enabling other threads in the pool to "steal," or dequeue, work items from other threads' local queues. This design is critical to enabling the ThreadPool to scale to higher core counts, providing several benefits including improved locality of data access and decreased contention (for more information, see the video "Erika Parsons and Eric Eilebrecht: CLR 4 - Inside the Thread Pool").

These special queues are available only to code that uses the new System.Threading.Tasks.Task type in the .NET Framework 4. Any tasks scheduled from another task or code running on one of the ThreadPool's threads will, by default, take advantage of the performance of these local queues. Although the new Task type is engineered for high-performance parallel computing, its design also breaks new ground in providing a rich set of APIs for constructing parallel, highly concurrent, and asynchronous applications.

Task Is "the New Black"
Since .NET Framework 1.0, ThreadPool.QueueUserWorkItem has been the preeminent way for creating work items to run in the ThreadPool:

// C#<br>public static bool QueueUserWorkItem(WaitCallback callBack)<br>' Visual Basic<br>Public Shared Function QueueUserWorkItem(<br>    ByVal callBack As WaitCallback) As Boolean



Unfortunately, although QueueUserWorkItem is great for fire-and-forget operations (ones where you have no need to refer to a created work item after it's been queued), any additional functionality must be custom coded. The new System.Threading.Tasks.Task type addresses this deficiency in a serious way.

Tasks provide built-in support for waiting, for continuations with dataflow, for cooperative cancelation, for exception handling, for representation of arbitrary asynchronous I/O, for custom scheduling, and much more. They've also been elevated to a first-class citizen in Visual Studio 2010, with two new debugger windows, Parallel Tasks and Parallel Stacks, which provide deep insight into Tasks' state and execution, as Figure 1 shows.

 

Figure 1: Parallel tasks in Visual Studio 2010
Figure 1: Parallel tasks in Visual Studio 2010

 


As an example of how Tasks can be useful for coordinating parallel activities, consider the scenario of a build tool operating on a solution that contains eight projects, some of which reference each other in the following manner:
•    Project 4 references project 1.
•    Project 5 references both projects 2 and 3.
•    Project 6 references both projects 3 and 4.
•    Project 7 references both projects 5 and 6.
•    Project 8 references project 5.

These references are strict dependencies, such that a project can't be built until all of its referenced projects have been built. We could just build each project sequentially, starting with project 1, but that doesn't afford efficiencies to be gained by building projects in parallel. To handle the parallelism, you can use a Task to represent building each of the projects, and you can use continuations to express the dependencies. For example, you create a Task to build project 4 as a continuation from the Task to build project 1; as soon as the project 1 build completes, you can start building project 4. Prior to Task's existence, expressing such a graph of processing to be done with as much parallelism as possible would be a complicated endeavor. With Task, your requirements can be translated into code with almost one-to-one line correspondence.



Loopy for Parallelism
Task is not only usable directly for coordinating multiple asynchronous activities, it's also a fundamental building block on top of which higher-level libraries can be built. One such component is the new System.Threading.Tasks.Parallel class, which provides iteration constructs for enabling loop parallelism and which is built using Task as its underlying unit of parallelism.

The Parallel class in .NET 4 provides three static methods, each with a multitude of overloads: For, ForEach, and Invoke. These methods provide parallelized alternatives to serial for loops, for each loops, and sets of statements. These constructs support partitioning the input data source (to spread the processing across the machine's logical processors), generating tasks to do parallel processing, breaking out of loops early, task-local state, and more. Although the developer still needs to understand and code for the fact that multiple threads are being used to process the iteration constructs, these methods eliminate a significant amount of boilerplate code that would otherwise be necessary to parallelize typical loops.

For example, let's say you have an application to which you've added the code in Figure 2, which is meant to calculate all driving routes from a multitude of source locations to a single destination.

 

Figure 2: A serial calculation with a for loop
Figure 2: A serial calculation with a for loop

 

All iterations of this loop are independent, operating on distinct locations and routes and storing the results back into distinct slots in the output array. As such, and given that route calculation is a computationally intensive process, you can easily handle this processing in parallel by using the Parallel.For method, which Figure 3 shows.



Figure 3: Parallelizing a for loop
Figure 3: Parallelizing a for loop

 

Just as Visual Studio enables Tasks as first-class citizens in the debugger, Visual Studio also provides performance-tuning functionality that recognizes parallel loops as first-class citizens. The new Concurrency Visualizer built into the Visual Studio profiler enables developers to visualize the behavior of their multithreaded applications. The visualizer provides three views: one to understand CPU utilization over the lifetime of the application, one to understand what all the application's threads were doing, and one to understand how threads were mapped to logical processors for execution. In the latter two views, constructs such as Parallel.For are rendered as overlay regions on the timeline so that the developer can understand the impact of this parallelization on their application. Figure 4 shows the visualizer displaying markers for a Parallel.For region.

 

Figure 4: Concurrency Visualizer in Visual Studio 2010
Figure 4: Concurrency Visualizer in Visual Studio 2010


Querying in Parallel
Consider our route calculation example from earlier. After designing this routine for your application, you realize that the source locations provided to the method might contain duplicates and might even contain sources that are the same as the destination. You don't want to have to calculate the same route twice, nor do you want to try to calculate the distance from a location to itself. Hence you want to filter out duplicates and any sources that match the destination. Sequentially, this could be done very simply with Language Integrated Query (LINQ), as Figure 5 shows.

 

 

Figure 5: Using LINQ to Objects
Figure 5: Using LINQ to Objects


Given the expressiveness of the LINQ query, which lets you specify what you want done rather than how you want it done, it would be unfortunate if you then had to stop using LINQ to parallelize this same operation. One of the benefits of expressing semantics in this manner, however, is that you leave it up to the Framework how to implement the various data operations specified in the query. To this end, the .NET Framework 4 includes a parallelized implementation of this LINQ to Objects support known as Parallel LINQ (PLINQ). As with the Parallel class, PLINQ is built on top of Tasks and is also highlighted in the Concurrency Visualizer. To enable PLINQ, one simply opts into its usage by calling the new AsParallel operator with the query's data source, as Figure 6 shows.

 



Figure 6: Enabling parallel processing of LINQ queries
Figure 6: Enabling parallel processing of LINQ queries

The parallel query processor will partition the source dataset, process it in parallel, and merge the results, yielding them back to the consumer. PLINQ's power lies in its simplicity, however it does provide optional knobs that give developers additional control over how the query is parallelized.

Coordination and Synchronization Primitives
When discussing the ThreadPool earlier in this article, I mentioned the new ConcurrentQueue<T> type. This is just one of a plethora of new types introduced to help developers with synchronization and coordination. Here are some others.



Collections. Whereas ConcurrentQueue<T> provides a thread-safe first-in-first-out (FIFO) collection, ConcurrentStack<T> provides a thread-safe last-in-first-out (LIFO) collection, and ConcurrentBag<T> provides a thread-safe collection without any ordering guarantees (which enables it to better optimize for certain use cases). .NET 4 also includes the new ConcurrentDictionary<TKey,TValue> type, which is an IDictionary<TKey,TValue> that safely supports multiple concurrent readers and writers.

Work Exchange. The new BlockingCollection<T> enables producer threads to hand off data in a potentially throttled manner to be received by waiting consumer threads. This provides bounding and blocking behavior on top of collections such as ConcurrentQueue<T>. The new Barrier class enables multiple threads to cooperatively work on one algorithm in parallel through multiple phases. And the new CountdownEvent type provides a simple mechanism for joining with multiple forked asynchronous operations.

Locks. .NET 3.5 SP1 includes multiple synchronization primitives for synchronizing between threads. .NET 4 adds several more in the form of ManualResetEventSlim, SemaphoreSlim, and SpinLock. It also provides the new SpinWait value type, which is used by low-level lock-free algorithms to provide correct and efficient spinning behavior.

Initialization. Lazy initialization is a common pattern in both serial and parallel applications. The new Lazy<T> type encapsulates several lazy initialization patterns, including thread-safe implementations. In addition, for scenarios in which thread-local state is required, the ThreadLocal<T> type supports lazy initialization on a per-thread, per-instance basis, as well as very efficient subsequent getting and setting of the data.

Cancellation. When dealing with large amounts of parallel and asynchronous work, your ability to cancel all outstanding operations is crucial. .NET 4 provides the new CancellationToken and CancellationTokenSource types, which serve as a unified mechanism for canceling large swaths of asynchronous operations. The Parallel class, Parallel LINQ, Tasks, and many of the new coordination and synchronization primitives all support cancellation using CancellationToken.

For More Information
I've given you a fast tour through new support for parallel computing in the .NET Framework 4 and Visual Studio 2010. For more information, check out the Parallel Computing developer center on MSDN at msdn.com/concurrency. There you'll find articles, videos, podcasts, and more to help you get up to speed on this new technology. You can also check out the parallel computing forums at social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing to provide feedback and ask questions. In addition, the team that developed this support in .NET 4 blogs about it regularly at blogs.msdn.com/pfxteam and welcomes any feedback. Finally, if you use this support in your application, please let us know. We'd love to hear about your experiences. Happy parallelizing!


Stephen Toub (stoub@microsoft.com) is a program manager lead on the Parallel Computing Platform team at Microsoft.