Parallel Algorithms
Introduction
PRAM Model
PRAM is a model for designing algorithms.
In the Parallel Random Access Machine (PRAM) model, there are \(n\) processors sharing the same memory. Each processor can interact with the memory (read/write) and perform computation.
How to resolve access conflicts?
- EREW (Exclusive-Read Exclusive-Write): No two processors may read from or write to the same memory location.
- CREW (Concurrent-Read Exclusive-Write): Multiple processors may read the same memory simultaneously, but only one processor may write.
- CRCW (Concurrent-Read Concurrent-Write): Multiple processors may read from and write to the same memory location simultaneously. Write conflicts are resolved by an arbitrary rule (any single write succeeds), a priority rule (the highest-priority writer wins), or a common rule (all concurrent writers must write the same value).
WD Model
WD is a model for analyzing complexity.
In the Work–Depth (WD) model, a parallel algorithm is characterized by two measures: work \(W\) (the total number of operations performed), and depth \(D\) (the length of the longest chain of dependent operations).
In a parallel algorithm, all operations form a DAG (directed acyclic graph). Nodes with no path between them can run in parallel, while connected nodes must execute in the direction of the edges.
Work \(W\) is the total number of nodes in the DAG, and depth \(D\) is the length of the longest directed path. WD does not care about the exact number of working processors in each layer; instead, it emphasizes the overall \(W\) and \(D\). (There is no explicit rule about how much work each individual layer may contain.)
How to use the WD model to measure performance?
Two parameters:
- Work load \(W(n)\): total number of operations;
- Worst-case running time \(T(n)\): i.e. the depth in WD model.
Number of processors \(p\) vs. total time:
- If \(p=W(n)/T(n)\), then in the PRAM model the \(W(n)\) operations can be finished in \(T(n)\) time.
- If \(p\le W(n)/T(n)\), the total time is \(O(W(n)/p)\).
In general, Brent's Theorem: an algorithm with work \(W(n)\) and depth \(T(n)\) can be executed by \(p\) processors in time
\[T_p(n)\le \frac{W(n)}{p}+T(n).\]
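Brent's bound can be checked with a small simulation: schedule each DAG layer greedily on \(p\) processors, so a layer with \(w\) independent operations takes \(\lceil w/p\rceil\) steps. The helper below is an illustration of my own, not part of the notes.

```python
import math

def schedule_time(level_work, p):
    """Greedy level-by-level schedule: a level with w operations
    takes ceil(w / p) steps on p processors."""
    return sum(math.ceil(w / p) for w in level_work)

# Example DAG: a binary summation tree on n = 8 inputs.
# The layers contain 4, 2, 1 operations respectively.
levels = [4, 2, 1]
W = sum(levels)          # total work = 7
T = len(levels)          # depth = 3

for p in (1, 2, 4):
    t = schedule_time(levels, p)
    assert t <= W / p + T    # Brent's bound holds
```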
Examples
Summation Problem
Input \(A(1)\) to \(A(n)\), and output \(A(1)+\cdots+A(n)\).
Summation Problem
1. PRAM model:
Summing in the form of a binary tree. Each node sums its two children.
Disadvantage: more and more processors become idle in the higher layers.
2. WD model:
Seemingly no different from the PRAM version; the two models just care about different aspects.
- \(T(n)=\log n+2\) (the extra 2 steps are for initialization and output);
- \(W(n)=n+n/2+n/2^2+\cdots+n/2^k+1=2n\).
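The layer-by-layer behaviour can be simulated sequentially; each list comprehension below is one parallel step of the binary tree. (A sketch with names of my choosing, not real parallel code.)

```python
def parallel_sum(A):
    """Tree summation: each round pairs up adjacent values, halving
    the array; one round corresponds to one parallel layer."""
    vals = list(A)
    rounds = 0
    while len(vals) > 1:
        if len(vals) % 2:            # pad with the identity element
            vals.append(0)
        # all pairs below are independent -> one parallel step
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds

parallel_sum(range(1, 9))   # → (36, 3): sum 1+...+8, depth log2(8)
```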
Prefix-Sums
Input \(A(1)\) to \(A(n)\), and output \(\sum_{i=1}^1 A(i), \sum_{i=1}^2 A(i),\cdots,\sum_{i=1}^n A(i)\).
Prefix-Sums
1. PRAM Model:
Def. \(C(h,i)=\sum_{k=1}^{\alpha}A(k)\), where \((0,\alpha)\) is the rightmost descendant leaf of node \((h, i)\).
Generate \(C\) from \(B\) (where \(B(h,j)\) is the value held by processor \(j\) at step \(h\)), from top to bottom:
- First node in each step: \(C(h,1)=B(h,1)\);
- Even nodes in each step: \(C(h,2k)=C(h+1,k)\);
- Odd nodes except the first one in each step: \(C(h,2k+1)=C(h+1,k)+B(h,2k+1)\).
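The two passes (a bottom-up summation tree for \(B\), then the top-down rules above for \(C\)) can be sketched sequentially; each layer loop would run concurrently on a PRAM. Function and variable names are mine.

```python
def prefix_sums(A):
    """Balanced-tree prefix sums following the notes' scheme.
    B[h][i]: value of node i at height h (h = 0 are the leaves).
    C[h][i]: prefix sum ending at the rightmost leaf under (h, i)."""
    n = len(A)
    assert n and n & (n - 1) == 0, "n must be a power of two"
    logn = n.bit_length() - 1

    # Bottom-up: B(h, i) = B(h-1, 2i-1) + B(h-1, 2i)   (1-indexed)
    B = [list(A)]
    for h in range(1, logn + 1):
        prev = B[h - 1]
        B.append([prev[2 * i] + prev[2 * i + 1]
                  for i in range(len(prev) // 2)])

    # Top-down: fill C layer by layer using the three rules.
    C = [None] * (logn + 1)
    C[logn] = [B[logn][0]]                       # root
    for h in range(logn - 1, -1, -1):
        layer = []
        for i in range(1, len(B[h]) + 1):        # 1-indexed node i
            if i == 1:
                layer.append(B[h][0])            # C(h,1) = B(h,1)
            elif i % 2 == 0:
                layer.append(C[h + 1][i // 2 - 1])           # even
            else:                                            # odd > 1
                layer.append(C[h + 1][(i - 1) // 2 - 1] + B[h][i - 1])
        C[h] = layer
    return C[0]

prefix_sums([3, 1, 4, 1, 5, 9, 2, 6])   # → [3, 4, 8, 9, 14, 23, 25, 31]
```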
2. WD Model:
- \(T(n)=O(\log n)\);
- \(W(n)=O(n)\).
Merging
Merge two non-decreasing arrays \(A(1), A(2), \cdots , A(n)\) and \(B(1), B(2), \cdots , B(m)\) into another non-decreasing array \(C(1), C(2), \cdots , C(n+m)\).
To simplify, assume:
- the elements of A and B are pairwise distinct;
- \(n = m\);
- both \(\log n\) and \(n/\log n\) are integers.
Partition
Partition the input into \(p\) (a large number) independent small jobs, so that the size of the largest small job is roughly \(n/p\). Do the small jobs concurrently, using a separate (possibly serial) algorithm for each.
Specifically, in the merging problem, the task is reduced to the ranking problem: the rank of the \(j\)-th element of array \(B\) in array \(A\) is
\[\mathrm{RANK}(j,A)=|\{i : A(i)<B(j)\}|,\]
i.e. the number of elements of \(A\) smaller than \(B(j)\); \(\mathrm{RANK}(i,B)\) is defined symmetrically.
Computation of \(RANK(i,A)\) and \(RANK(i,B)\) can be concurrent.
The index of each element in the final array is its own index plus its rank in the other array. Therefore the result \(C\) can be represented as:
\[C(i+\mathrm{RANK}(i,B))=A(i),\qquad C(j+\mathrm{RANK}(j,A))=B(j).\]
Given a solution to the ranking problem, the merging problem can be solved in \(O(1)\) time and \(O(n+m)\) work.
How to do the ranking?
Binary Search
- \(T(n)=O(\log n)\);
- \(W(n)=O(n\log n)\).
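Combining binary-search ranking with the \(O(1)\) placement rule gives a compact sequential sketch of the merge; on a PRAM, every loop iteration below would run on its own processor. (`merge_by_ranks` is a name of my choosing.)

```python
import bisect

def merge_by_ranks(A, B):
    """Merge two sorted arrays of distinct elements via ranking.
    Each element's final position is its own index plus its rank
    in the other array; all placements are independent, so on a
    PRAM they all happen in one parallel step."""
    C = [None] * (len(A) + len(B))
    for i, a in enumerate(A):                    # concurrent on a PRAM
        C[i + bisect.bisect_left(B, a)] = a      # rank of a in B
    for j, b in enumerate(B):
        C[j + bisect.bisect_left(A, b)] = b      # rank of b in A
    return C

merge_by_ranks([1, 4, 6], [2, 3, 5, 7])   # → [1, 2, 3, 4, 5, 6, 7]
```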
Serial Ranking
- \(T(n)=W(n)=O(n+m)\).
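A minimal sketch of serial ranking, assuming distinct elements: one two-pointer pass computes the rank of every element of \(B\) in \(A\) in \(O(n+m)\) total. (The function name is mine.)

```python
def serial_ranks(A, B):
    """Two-pointer pass computing RANK(j, A) for every element of B,
    i.e. the number of elements of A smaller than B[j]."""
    ranks = []
    i = 0
    for b in B:                     # B is sorted, so i never moves back
        while i < len(A) and A[i] < b:
            i += 1
        ranks.append(i)
    return ranks

serial_ranks([1, 4, 6], [2, 3, 5, 7])   # → [1, 1, 2, 3]
```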
Parallel Ranking
Assume \(n=m\). Parallel ranking operates as below:
Let \(p=n/\log n\) and divide into \(p\) subproblems (if \(p\) is too small, the subproblems become too large; if \(p\) is too big, the cost of sampling becomes too large).
- Select one sample out of every \(\log n\) elements.
- Compute RANK for each selected sample (using binary search).
- The rank of every remaining element is bracketed by the ranks of the nearby samples in both arrays; solve these small subproblems and combine them.
The first two steps split the problem into at most \(2p\) subproblems of size \(O(\log n)\) each.
- For step 1, \(T_1=O(\log n)\), \(W_1=pT_1=O(n)\);
- For step 2, \(T_2=T_1=O(\log n)\), \(W_2=2pT_2=O(n)\).
Therefore, total depth and work is:
- \(T=O(\log n)\);
- \(W=O(n)\).
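The three steps can be sketched sequentially as below: sample every \((\log n)\)-th element of \(B\), rank each sample in \(A\) by binary search (the parallel step), then finish each \(O(\log n)\)-sized gap with a serial scan. Names are mine; the per-sample work is what would run concurrently on a PRAM.

```python
import bisect, math

def parallel_ranks(A, B):
    """Sampling scheme from the notes, simulated sequentially.
    Ranks within a gap are non-decreasing, so each subproblem only
    scans forward from its sample's rank."""
    n = len(B)
    h = max(1, int(math.log2(n)))                # block size log n
    ranks = [0] * n
    for start in range(0, n, h):                 # one subproblem per sample
        i = bisect.bisect_left(A, B[start])      # rank of the sample in A
        for j in range(start, min(start + h, n)):
            while i < len(A) and A[i] < B[j]:    # serial finish, O(log n)
                i += 1
            ranks[j] = i
    return ranks

parallel_ranks([1, 4, 6], [2, 3, 5, 7])   # → [1, 1, 2, 3]
```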
Maximum Finding
Input \(A(1)\) to \(A(n)\), and output \(\max\{A(1),\cdots ,A(n)\}\).
Similar to Summation Problem
- \(T=O(\log n)\);
- \(W=O(n)\).
Compare all pairs
- \(T=O(1)\);
- \(W=O(n^2)\).
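A sequential simulation of the all-pairs scheme on a common-CRCW PRAM: all \(n^2\) comparisons are independent (one parallel step), an element "loses" if anything beats it, and the non-loser is the maximum. (The function name is my own.)

```python
def crcw_max(A):
    """O(1)-time, O(n^2)-work CRCW maximum, simulated sequentially."""
    n = len(A)
    loser = [False] * n
    for i in range(n):               # all n^2 comparisons in parallel
        for j in range(n):
            if A[j] > A[i]:
                loser[i] = True      # concurrent common writes agree
    return A[loser.index(False)]     # the non-loser is the maximum

crcw_max([3, 1, 4, 1, 5])   # → 5
```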
Doubly-logarithmic Paradigm
Operate in the following three steps:
- Divide the array into \(\sqrt{n}\) blocks, each of size \(\sqrt{n}\).
- Concurrently compute the maximum of each block (recursively, with the same method).
- Concurrently compute the maximum of the \(\sqrt{n}\) local maxima (e.g. by comparing all pairs in \(O(1)\) time).
- \(T(n)=O(\log\log n)\);
- \(W(n)=O(n\log\log n)\).
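A sequential sketch of the three steps, where recursion plays the role of the concurrent per-block computations (names are mine):

```python
import math

def doubly_log_max(A):
    """Doubly-logarithmic maximum: split into blocks of size sqrt(n),
    find each block's maximum recursively, then combine the local
    maxima with an all-pairs comparison (O(1) parallel time)."""
    n = len(A)
    if n <= 2:
        return max(A)
    s = math.isqrt(n)
    blocks = [A[i:i + s] for i in range(0, n, s)]
    local = [doubly_log_max(b) for b in blocks]  # concurrent on a PRAM
    # all-pairs comparison: m is the answer iff it loses to nobody
    return next(m for m in local if all(m >= x for x in local))

doubly_log_max([3, 1, 4, 1, 5, 9, 2, 6])   # → 9
```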
The \(W\) is not optimal.
Improvement: Let \(h=\log\log n\). Divide the array into \(n/h\) blocks, each of size \(h\).
- \(T(n)=O(\log\log n)\);
- \(W(n)=O(n)\).
Random Sampling
(Why are there so many examples... I don't feel like typing this one up.)
Homework
E.g. T/F
When we solve the ranking problem with a parallel algorithm based on binary search, we shorten the worst-case running time but incur a larger work load compared with the sequential algorithm.
(T)
In general, for a 3-way merge we need 6 input buffers and 2 output buffers for parallel operations.
(T)