The Parallel Patterns Library (PPL) provides algorithms that concurrently perform work on collections of data. These algorithms resemble those provided by the Standard Template Library (STL).
The parallel algorithms are composed from existing functionality in the Concurrency Runtime. For example, the concurrency::parallel_for algorithm uses a concurrency::structured_task_group object to perform the parallel loop iterations. The parallel_for
algorithm partitions work in an optimal way given the available number of computing resources.
The concurrency::parallel_for algorithm repeatedly performs the same task in parallel. Each of these tasks is parameterized by an iteration value. This algorithm is useful when you have a loop body that does not share resources among iterations of that loop.
The parallel_for
algorithm partitions tasks in an optimum way for parallel execution. It uses a workstealing algorithm and range stealing to balance these partitions when workloads are unbalanced. When one loop iteration blocks cooperatively, the runtime redistributes the range of iterations that is assigned to the current thread to other threads or processors. Similarly, when a thread completes a range of iterations, the runtime redistributes work from other threads to that thread. The parallel_for
algorithm also supports nested parallelism. When one parallel loop contains another parallel loop, the runtime coordinates processing resources between the loop bodies in an efficient way for parallel execution.
The parallel_for
algorithm has several overloaded versions. The first version takes a start value, an end value, and a work function (a lambda expression, function object, or function pointer). The second version takes a start value, an end value, a value by which to step, and a work function. The first version of this function uses 1 as the step value. The remaining versions take partitioner objects, which enable you to specify how parallel_for
should partition ranges among threads. Partitioners are explained in greater detail in the section Partitioning Work in this document.
You can convert many for
loops to use parallel_for
. However, the parallel_for
algorithm differs from the for
statement in the following ways:
The
parallel_for
algorithmparallel_for
does not execute the tasks in a predetermined order.The
parallel_for
algorithm does not support arbitrary termination conditions. Theparallel_for
algorithm stops when the current value of the iteration variable is one less thanlast
.The
_Index_type
type parameter must be an integral type. This integral type can be signed or unsigned.The loop iteration must be forward. The
parallel_for
algorithm throws an exception of type std::invalid_argument if the_Step
parameter is less than 1.The exceptionhandling mechanism for the
parallel_for
algorithm differs from that of afor
loop. If multiple exceptions occur simultaneously in a parallel loop body, the runtime propagates only one of the exceptions to the thread that calledparallel_for
. In addition, when one loop iteration throws an exception, the runtime does not immediately stop the overall loop. Instead, the loop is placed in the cancelled state and the runtime discards any tasks that have not yet started. For more information about exceptionhandling and parallel algorithms, see Exception Handling.
Although the parallel_for
algorithm does not support arbitrary termination conditions, you can use cancellation to stop all tasks. For more information about cancellation, see Cancellation.


The scheduling cost that results from load balancing and support for features such as cancellation might not overcome the benefits of executing the loop body in parallel, especially when the loop body is relatively small. You can minimize this overhead by using a partitioner in your parallel loop. For more information, see Partitioning Work later in this document. 
Example
The following example shows the basic structure of the parallel_for
algorithm. This example prints to the console each value in the range [1, 5] in parallel.
This example produces the following sample output:
Because the parallel_for
algorithm acts on each item in parallel, the order in which the values are printed to the console will vary.
For a complete example that uses the parallel_for
algorithm, see How to: Write a parallel_for Loop.
[Top]
The concurrency::parallel_for_each algorithm performs tasks on an iterative container, such as those provided by the STL, in parallel. It uses the same partitioning logic that the parallel_for
algorithm uses.
The parallel_for_each
algorithm resembles the STL std::for_each algorithm, except that the parallel_for_each
algorithm executes the tasks concurrently. Like other parallel algorithms, parallel_for_each
does not execute the tasks in a specific order.
Although the parallel_for_each
algorithm works on both forward iterators and random access iterators, it performs better with random access iterators.
Example
The following example shows the basic structure of the parallel_for_each
algorithm. This example prints to the console each value in a std::array object in parallel.
This example produces the following sample output:
Because the parallel_for_each
algorithm acts on each item in parallel, the order in which the values are printed to the console will vary.
For a complete example that uses the parallel_for_each
algorithm, see How to: Write a parallel_for_each Loop.
[Top]
The concurrency::parallel_invoke algorithm executes a set of tasks in parallel. It does not return until each task finishes. This algorithm is useful when you have several independent tasks that you want to execute at the same time.
The parallel_invoke
algorithm takes as its parameters a series of work functions (lambda functions, function objects, or function pointers). The parallel_invoke
algorithm is overloaded to take between two and ten parameters. Every function that you pass to parallel_invoke
must take zero parameters.
Like other parallel algorithms, parallel_invoke
does not execute the tasks in a specific order. The topic Task Parallelism explains how the parallel_invoke
algorithm relates to tasks and task groups.
Example
The following example shows the basic structure of the parallel_invoke
algorithm. This example concurrently calls the twice
function on three local variables and prints the result to the console.
This example produces the following output:
For complete examples that use the parallel_invoke
algorithm, see How to: Use parallel_invoke to Write a Parallel Sort Routine and How to: Use parallel_invoke to Execute Parallel Operations.
[Top]
The concurrency::parallel_transform and concurrency::parallel_reduce algorithms are parallel versions of the STL algorithms std::transform and std::accumulate, respectively. The Concurrency Runtime versions behave like the STL versions except that the operation order is not determined because they execute in parallel. Use these algorithms when you work with a set that is large enough to get performance and scalability benefits from being processed in parallel.


The 
The parallel_transform Algorithm
You can use the parallel transform
algorithm to perform many data parallelization operations. For example, you can:
Adjust the brightness of an image, and perform other image processing operations.
Sum or take the dot product between two vectors, and perform other numeric calculations on vectors.
Perform 3D ray tracing, where each iteration refers to one pixel that must be rendered.
The following example shows the basic structure that is used to call the parallel_transform
algorithm. This example negates each element of a std::vector object in two ways. The first way uses a lambda expression. The second way uses std::negate, which derives from std::unary_function.


This example demonstrates the basic use of 
The parallel_transform
algorithm has two overloads. The first overload takes one input range and a unary function. The unary function can be a lambda expression that takes one argument, a function object, or a type that derives from unary_function
. The second overload takes two input ranges and a binary function. The binary function can be a lambda expression that takes two arguments, a function object, or a type that derives from std::binary_function. The following example illustrates these differences.


The iterator that you supply for the output of 
The parallel_reduce Algorithm
The parallel_reduce
algorithm is useful when you have a sequence of operations that satisfy the associative property. (This algorithm does not require the commutative property.) Here are some of the operations that you can perform with parallel_reduce
:
Multiply sequences of matrices to produce a matrix.
Multiply a vector by a sequence of matrices to produce a vector.
Compute the length of a sequence of strings.
Combine a list of elements, such as strings, into one element.
The following basic example shows how to use the parallel_reduce
algorithm to combine a sequence of strings into one string. As with the examples for parallel_transform
, performance gains are not expected in this basic example.
In many cases, you can think of parallel_reduce
as shorthand for the use of the parallel_for_each
algorithm together with the concurrency::combinable class.
Example: Performing Map and Reduce in Parallel
A map operation applies a function to each value in a sequence. A reduce operation combines the elements of a sequence into one value. You can use the Standard Template Library (STL) std::transformstd::accumulate classes to perform map and reduce operations. However, for many problems, you can use the parallel_transform
algorithm to perform the map operation in parallel and the parallel_reduce
algorithm perform the reduce operation in parallel.
The following example compares the time that it takes to compute the sum of prime numbers serially and in parallel. The map phase transforms nonprime values to 0 and the reduce phase sums the values.
For another example that performs a map and reduce operation in parallel, see How to: Perform Map and Reduce Operations in Parallel.
[Top]
To parallelize an operation on a data source, an essential step is to partition the source into multiple sections that can be accessed concurrently by multiple threads. A partitioner specifies how a parallel algorithm should partition ranges among threads. As explained previously in this document, the PPL uses a default partitioning mechanism that creates an initial workload and then uses a workstealing algorithm and range stealing to balance these partitions when workloads are unbalanced. For example, when one loop iteration completes a range of iterations, the runtime redistributes work from other threads to that thread. However, for some scenarios, you might want to specify a different partitioning mechanism that is better suited to your problem.
The parallel_for
, parallel_for_each
, and parallel_transform
algorithms provide overloaded versions that take an additional parameter, _Partitioner
. This parameter defines the partitioner type that divides work. Here are the kinds of partitioners that the PPL defines:
concurrency::affinity_partitioner
Divides work into a fixed number of ranges (typically the number of worker threads that are available to work on the loop). This partitioner type resembles static_partitioner
, but improves cache affinity by the way it maps ranges to worker threads. This partitioner type can improve performance when a loop is executed over the same data set multiple times (such as a loop within a loop) and the data fits in cache. This partitioner does not fully participate in cancellation. It also does not use cooperative blocking semantics and therefore cannot be used with parallel loops that have a forward dependency.
concurrency::auto_partitioner
Divides work into an initial number of ranges (typically the number of worker threads that are available to work on the loop). The runtime uses this type by default when you do not call an overloaded parallel algorithm that takes a _Partitioner
parameter. Each range can be divided into subranges, and thereby enables load balancing to occur. When a range of work completes, the runtime redistributes subranges of work from other threads to that thread. Use this partitioner if your workload does not fall under one of the other categories or you require full support for cancellation or cooperative blocking.
concurrency::simple_partitioner
Divides work into ranges such that each range has at least the number of iterations that are specified by the given chunk size. This partitioner type participates in load balancing; however, the runtime does not divide ranges into subranges. For each worker, the runtime checks for cancellation and performs loadbalancing after _Chunk_size
iterations complete.
concurrency::static_partitioner
Divides work into a fixed number of ranges (typically the number of worker threads that are available to work on the loop). This partitioner type can improve performance because it does not use workstealing and therefore has less overhead. Use this partitioner type when each iteration of a parallel loop performs a fixed and uniform amount of work and you do not require support for cancellation or forward cooperative blocking.


The 
Typically, these partitioners are used in the same way, except for affinity_partitioner
. Most partitioner types do not maintain state and are not modified by the runtime. Therefore you can create these partitioner objects at the call site, as shown in the following example.
However, you must pass an affinity_partitioner
object as a nonconst
, lvalue reference so that the algorithm can store state for future loops to reuse. The following example shows a basic application that performs the same operation on a data set in parallel multiple times. The use of affinity_partitioner
can improve performance because the array is likely to fit in cache.


Use caution when you modify existing code that relies on cooperative blocking semantics to use 
The best way to determine whether to use a partitioner in any given scenario is to experiment and measure how long it takes operations to complete under representative loads and computer configurations. For example, static partitioning might provide significant speedup on a multicore computer that has only a few cores, but it might result in slowdowns on computers that have relatively many cores.
[Top]
The PPL provides three sorting algorithms: concurrency::parallel_sort, concurrency::parallel_buffered_sort, and concurrency::parallel_radixsort. These sorting algorithms are useful when you have a data set that can benefit from being sorted in parallel. In particular, sorting in parallel is useful when you have a large dataset or when you use a computationallyexpensive compare operation to sort your data. Each of these algorithms sorts elements in place.
The parallel_sort
and parallel_buffered_sort
algorithms are both comparebased algorithms. That is, they compare elements by value. The parallel_sort
algorithm has no additional memory requirements, and is suitable for generalpurpose sorting. The parallel_buffered_sort
algorithm can perform better than parallel_sort
, but it requires O(N) space.
The parallel_radixsort
algorithm is hashbased. That is, it uses integer keys to sort elements. By using keys, this algorithm can directly compute the destination of an element instead of using comparisons. Like parallel_buffered_sort
, this algorithm requires O(N) space.
The following table summarizes the important properties of the three parallel sorting algorithms.
Algorithm  Description  Sorting mechanism  Sort Stability  Memory requirements  Time Complexity  Iterator access 

parallel_sort  Generalpurpose comparebased sort.  Comparebased (ascending)  Unstable  None  O((N/P)log(N/P) + 2N((P1)/P))  Random 
parallel_buffered_sort  Faster generalpurpose comparebased sort that requires O(N) space.  Comparebased (ascending)  Unstable  Requires additional O(N) space  O((N/P)log(N))  Random 
parallel_radixsort  Integer keybased sort that requires O(N) space.  Hashbased  Stable  Requires additional O(N) space  O(N/P)  Random 
The following illustration shows the important properties of the three parallel sorting algorithms more graphically.
These parallel sorting algorithms follow the rules of cancellation and exception handling. For more information about cancellation and exception handling in the Concurrency Runtime, see Canceling Parallel Algorithms and Exception Handling.


These parallel sorting algorithms support move semantics. You can define a move assignment operator to enable swap operations to occur more efficiently. For more information about move semantics and the move assignment operator, see Rvalue Reference Declarator: &&, and Move Constructors and Move Assignment Operators (C++). If you do not provide a move assignment operator or swap function, the sorting algorithms use the copy constructor. 
The following basic example shows how to use parallel_sort
to sort a vector
of int
values. By default, parallel_sort
uses std::less to compare values.
This example shows how to provide a custom compare function. It uses the std::complex::real method to sort std::complex<double> values in ascending order.
This example shows how to provide a hash function to the parallel_radixsort
algorithm. This example sorts 3D points. The points are sorted based on their distance from a reference point.
For illustration, this example uses a relatively small data set. You can increase the initial size of the vector to experiment with performance improvements over larger sets of data.
This example uses a lambda expression as the hash function. You can also use one of the builtin implementations of the std::hash class or define your own specialization. You can also use a custom hash function object, as shown in this example:
The hash function must return an integral type (std::is_integral::value must be true
). This integral type must be convertible to type size_t
.
Choosing a Sorting Algorithm
In many cases, parallel_sort
provides the best balance of speed and memory performance. However, as you increase the size of your data set, the number of available processors, or the complexity of your compare function, parallel_buffered_sort
or parallel_radixsort
can perform better. The best way to determine which sorting algorithm to use in any given scenario is to experiment and measure how long it takes to sort typical data under representative computer configurations. Keep the following guidelines in mind when you choose a sorting strategy.
The size of your data set. In this document, a small dataset contains fewer than 1,000 elements, a medium dataset contains between 10,000 and 100,000 elements, and a large dataset contains more than 100,000 elements.
The amount of work that your compare function or hash function performs.
The amount of available computing resources.
The characteristics of your data set. For example, one algorithm might perform well for data that is already nearly sorted, but not as well for data that is completely unsorted.
The chunk size. The optional
_Chunk_size
argument specifies when the algorithm switches from a parallel to a serial sort implementation as it subdivides the overall sort into smaller units of work. For example, if you provide 512, the algorithm switches to serial implementation when a unit of work contains 512 or fewer elements. A serial implementation can improve overall performance because it eliminates the overhead that is required to process data in parallel.
It might not be worthwhile to sort a small dataset in parallel, even when you have a large number of available computing resources or your compare function or hash function performs a relatively large amount of work. You can use std::sort function to sort small datasets. (parallel_sort
and parallel_buffered_sort
call sort
when you specify a chunk size that is larger than the dataset; however, parallel_buffered_sort
would have to allocate O(N) space, which could take additional time due to lock contention or memory allocation.)
If you must conserve memory or your memory allocator is subject to lock contention, use parallel_sort
to sort a mediumsized dataset. parallel_sort
requires no additional space; the other algorithms require O(N) space.
Use parallel_buffered_sort
to sort mediumsized datasets and when your application meets the additional O(N) space requirement. parallel_buffered_sort
can be especially useful when you have a large number of computing resources or an expensive compare function or hash function.
Use parallel_radixsort
to sort large datasets and when your application meets the additional O(N) space requirement. parallel_radixsort
can be especially useful when the equivalent compare operation is more expensive or when both operations are expensive.


Implementing a good hash function requires that you know the dataset range and how each element in the dataset is transformed to a corresponding unsigned value. Because the hash operation works on unsigned values, consider a different sorting strategy if unsigned hash values cannot be produced. 
The following example compares the performance of sort
, parallel_sort
, parallel_buffered_sort
, and parallel_radixsort
against the same large set of random data.
In this example, which assumes that it is acceptable to allocate O(N) space during the sort, parallel_radixsort
performs the best on this dataset on this computer configuration.
[Top]
Title  Description 

How to: Write a parallel_for Loop  Shows how to use the parallel_for algorithm to perform matrix multiplication. 
How to: Write a parallel_for_each Loop  Shows how to use the parallel_for_each algorithm to compute the count of prime numbers in a std::array object in parallel. 
How to: Use parallel_invoke to Write a Parallel Sort Routine  Shows how to use the parallel_invoke algorithm to improve the performance of the bitonic sort algorithm. 
How to: Use parallel_invoke to Execute Parallel Operations  Shows how to use the parallel_invoke algorithm to improve the performance of a program that performs multiple operations on a shared data source. 
How to: Perform Map and Reduce Operations in Parallel  Shows how to use the parallel_transform and parallel_reduce algorithms to perform a map and reduce operation that counts the occurrences of words in files. 
Parallel Patterns Library (PPL)  Describes the PPL, which provides an imperative programming model that promotes scalability and easeofuse for developing concurrent applications. 
Cancellation  Explains the role of cancellation in the PPL, how to cancel parallel work, and how to determine when a task group is canceled. 
Exception Handling  Explains the role of exception handling in the Concurrency Runtime. 