CPS222 Lecture: Algorithm Design Strategies Last modified 4/29/2015 Materials 1. Projectable of brute force solution to max sum problem 2. Huffman algorithm Powerpoint 3. Dynamic Programming Fibonacci program example 4. Projectable of partially filled in LCS table 5. Projectable of figure 12.2 p. 563 in Goodrich, Tamassia, Mount 6. Projectable of same figure, but showing derivation of GTTTAA 7. Projectable of example use of optimal BST algorithm on a tree of 4 keys 8. obst program to demo and project I. Introduction - ------------ A. At this point in the course, we are going to shift our focus somewhat. 1. Up until now, our focus has been on learning "standard" algorithms and their associated data structures. 2. When confronted with a problem to solve, you should always ask "can this problem be viewed as an instance of a problem for which there exists a known algorithm?" If the answer is "yes", then you don't have to "reinvent the wheel". Example: We saw earlier that problems as diverse as scheduling tasks with prerequisites, analyzing electrical circuits, and designing robust communication or transportation networks can be solved by known graph algorithms. 3. Sometimes, though, one is confronted with a problem which does not correspond to any previously-solved problem. In this case, it may be necessary to develop an algorithm to solve the problem from scratch. 4. Or, the problem may be a familiar problem for which no good algorithm is known - e.g. it may be an instance of an NP-complete problem. In this case, we may need to develop an algorithm that develops an acceptable, though perhaps not optimal solution. Example: If we have a problem to solve that is equivalent to the traveling salesman problem, we will not be able to find a practical algorithm that gives us a guaranteed optimal solution; but we may be able to develop an algorithm that gives us a solution that is close enough to optimal for the cases we are interested in. B. We now consider a number of strategies that can be used to tackle a problem which does not already have a known algorithmic solution. 1. These are not solutions to a problem, but strategies to explore when trying to find a solution. 2. Many of the "standard" algorithms that we have learned were first discovered by someone who applied one of the strategies to the problem in the first place! C. For each strategy, we will consider one or more examples of algorithms that utilize that strategy. We will see that algorithms we have already learned exemplify the strategy we are considering, and we will also consider some new algorithm. In all cases, though, the goal here is to understand the design strategy behind the algorithm, not just the algorithm itself. II. Brute Force -- ----- ----- A. Given the sheer speed of a computer, it is tempting to try to solve a problem by brute force - e.g. trying all the possibilities. 1. We saw an example of this when we first looked at algorithm anaylsis. Remember the maximum sum problem? Our first attempt at a solution was the brute force solution. int naiveMaxSum(int a[], int n) /* Naive solution to the maximum subvector sum problem */ { int maxSum = 0; for (int i = 0; i < n; i ++) for (int j = 0; j < n; j ++) { int thisSum = 0; for (int k = i; k <= j; k ++) thisSum += a[k]; if (thisSum > maxSum) maxSum = thisSum; } return maxSum; } PROJECT 2. Complexity of this solution? ASK theta(N^3) 3. 
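(For reference, here is a minimal sketch of the kind of theta(N) solution recalled below - a single pass that keeps a running sum and resets it when it goes negative. This is a reconstruction for illustration, not necessarily the exact code we developed in class; the name linearMaxSum is ours:)

    int linearMaxSum(int a[], int n)
    /* theta(N) solution to the maximum subvector sum problem */
    {
        int maxSum = 0;
        int thisSum = 0;
        for (int i = 0; i < n; i ++)
        {
            thisSum += a[i];
            if (thisSum < 0)
                thisSum = 0;    // a prefix with a negative sum can never help
                                // a later subvector, so drop it
            if (thisSum > maxSum)
                maxSum = thisSum;
        }
        return maxSum;
    }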
As you recall, we developed a series of better solutions, culminating in an theta(N) solution - which we argued is inherently the lower limit because any solution must look at each element of the vector at least once. B. Of course, often we will be able to find a better solution to the problem than sheer brute force - as was the case with the max sum problem. C. However, this won't always be the case. There will be some problems for which brute force is the only option. Examples? ASK 1. Searching an unordered list, or one that is ordered on some basis other than the order of the search key. The only option is the brute force one of looking at every item. 2. Problems like the traveling salesman - if we must have the absolute best solution. III. Greedy Algorithms --- ------ ---------- A. Many problems have the general form "for a given set of data, what is the best way to ____?". 1. Examples we have considered thus far: ASK a. Shortest path problem in a graph b. Minimal cost spanning tree of a graph 2. For such a problem, what we are seeking is a GLOBAL OPTIMUM - i.e. the best overall way to solve the problem. E.g. for a minimum cost spanning tree, we want to find the spanning tree having the lowest overall cost, though it may include some "expensive" individual edges. 3. One way to solve such a problem would be exhaustive search - create all the possible solutions, and then choose the cheapest one. Unfortunately, such an approach generally has exponential cost. B. The greedy strategy goes like this: build up an overall solution one step at a time, by making a series of LOCALLY OPTIMAL choices. 1. Example: Djikstra's shortest path algorithm builds up the list of shortest paths one node at a time by, at each step, choosing the not yet known node that has the shortest known path to the starting vertex. 2. Example: Kruskal's minimum cost spanning tree algorithm builds up the tree one edge at a time by, at each step, adding to the tree the lowest cost edge which does not introduce a cycle. C. A good - and historically important - example of a greedy algorithm is the Huffman algorithm. We will now look at it, both as an algorithm that is interesting in its own right, and as an example of the greedy strategy. 1. One area of considerable interest in many applications is DATA COMPRESSION - reducing the number of bits required to store a given body of data. We consider one approach here, based on weight-balanced binary trees, and utilizing a greedy algorithm that produces an optimal solution. 2. Suppose you were given the task of storing messages comprised of the 7 letters A-G plus space (just to keep things simple.) In the absence of any information about their relative frequency of use, the best you could do would be to use a three bit code - e.g. 000 = space 001 .. 111 = A .. G 3. However, suppose you were given the following frequency of usage data. Out of every 100 characters, it is expected that: 10 are A's Note: these data are contrived! 10 are B's 5 are C's 5 are D's 30 are E's 5 are F's 5 are G's 30 are spaces a. Using the three bit code we just considered, a typical message of length 100 would use 300 bits. b. Suppose, however, we used the following variable-length code instead: A = 000 NOTE: No shorter code can be a prefix of B = 001 any longer code. Thus, we cannot C = 0100 use codes like 00 or 01 - if we saw D = 0101 these bits, we wouldn't know if they E = 10 were a character in their own right or F = 0110 part of the code for A/B or C/D. 
G = 0111 space = 11 A message of length 100 with typical distribution would now need: (10 * 3) + (10 * 3) + (5 * 4) + (5 * 4) + (2 * 30) + (5 * 4) + (5 * 4) + (30 * 2) = 260 bits = a savings of about 13% 4. A variable length code can be represented by a decode tree, with external nodes representing characters and internal nodes representing a decision point at a single bit of the message - e.g. ( first bit) / 0 \ 1 (2nd bit) (2nd bit) / 0 \ 1 / 0 \ 1 (3rd bit) (3rd bit) [E] [space] / 0 \ 1 / 0 \ 1 [A] [B] (4th bit) (4th bit) / 0 \ 1 / 0 \ 1 [C] [D] [F] [G] The optimum such tree is the one having the smallest weighted external path length - e.g. sum of the levels of the leaves times their weights. 5. An algorithm for computing such a weight-balanced code tree is the Huffman algorithm, discussed in the book. a. Basic method: we work with a list of partial trees. i. Initially, the list contains one partial tree for each character. ii. At each iteration, we choose the two partial trees of least weight and construct a new tree consisting of an internal node plus these two as its children. We put this new tree back on the list, with weight equal to the sum of its children's weights. iii. Since each step reduces the length of the list by 1 (two partial trees removed and one put back on), after n-1 iterations we have a list consisting of a single node, which is our decode tree. b. Example: For the above data. Initial list: A B C D E F G space .10 .10 .05 .05 .30 .05 .05 .30 / \ / \ / \ / \ / \ / \ / \ / \ Step 1 - remove C, D - and add new node: () A B E F G space .10 .10 .10 .30 .05 .05 .30 / \ / \ / \ / \ / \ / \ / \ C D Step 2 - remove F, G - and add new node: () () A B E space .10 .10 .10 .10 .30 .30 / \ / \ / \ / \ / \ / \ F G C D Step 3 - remove A, B - and add new node: () () () E space .20 .10 .10 .30 .30 / \ / \ / \ / \ / \ A B F G C D Step 4 - remove two partial trees - and add new node: () () E space .20 .20 .30 .30 / \ / \ / \ / \ () () A B / \ / \ C D F G Step 5 - remove two partial trees - and add new node: () E space .40 .30 .30 / \ / \ / \ () () / \ / \ A B () () / \ / \ C D F G Step 6 - remove E, space - and add new node: () () .60 .40 / \ / \ E space () () / \ / \ A B () () / \ / \ C D F G Step 7 - construct final tree: () 1.00 / \ () () / \ / \ () () E space / \ / \ A B () () / \ / \ C D F G c. Analysis: i. Constructing the initial list is theta(n). ii. Transforming to a tree involves n-1 (= theta(n)) iterations. On each iteration, we scan the entire list to find the two partial trees of least weight = theta(n) - so this process, using the simplest mechanism for storing the list of partial trees is theta(n^2). iii. Printing the tree is theta(n). iv. Overall is therefore theta(n^2). However, we could reduce time to theta(n log n) by using a more sophisticated data structure for the "list" of partial trees - e.g. a heap based on weight. (But given the small size of a typical alphabet, the theta(n^2) algorithm may actually be faster.) 6. We have applied this technique to individual characters in an alphabet. It could also be profitably applied to larger units - e.g. we might choose to have a single code for frequently occurring words (such as "the") or sequences of letters within words (such as "th" or "ing"). 7. The Huffman algorithm exemplifies the greedy algorithm strategy, because at each step we choose the two lowest weight subtrees to combine into a new subtree, thus increasing the code length for each of the characters in the subtrees by one. 
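To make the greedy step concrete, here is a minimal sketch of the construction using a priority queue ordered by weight (this corresponds to the theta(n log n) variant mentioned in the analysis above). The names HuffNode and buildHuffmanTree are ours, not the book's, and error handling (e.g. an empty alphabet) is omitted:

    #include <queue>
    #include <vector>

    struct HuffNode
    {
        double weight;      // frequency / probability of this partial tree
        char symbol;        // meaningful only for leaves
        HuffNode * left;    // children (null for leaves)
        HuffNode * right;
    };

    // Comparator so the priority_queue yields the LOWEST weight tree first
    struct HeavierThan
    {
        bool operator()(const HuffNode * a, const HuffNode * b) const
        { return a->weight > b->weight; }
    };

    // Build the decode tree from parallel lists of symbols and weights.
    // Assumes at least one symbol and equal-length lists.
    HuffNode * buildHuffmanTree(const std::vector<char> & symbols,
                                const std::vector<double> & weights)
    {
        std::priority_queue<HuffNode *, std::vector<HuffNode *>, HeavierThan> pq;
        for (size_t i = 0; i < symbols.size(); i ++)
            pq.push(new HuffNode { weights[i], symbols[i], nullptr, nullptr });

        // Greedy step: repeatedly combine the two lowest-weight partial trees
        while (pq.size() > 1)
        {
            HuffNode * a = pq.top(); pq.pop();
            HuffNode * b = pq.top(); pq.pop();
            pq.push(new HuffNode { a->weight + b->weight, '\0', a, b });
        }
        return pq.top();    // the single remaining tree is the decode tree
    }

Each pass through the loop performs exactly the greedy choice described above: the two lightest partial trees are removed and combined into one.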
We keep our cost down by increasing the code length of the lowest frequency subtrees. D. A significant limitation of the greedy strategy is that, for some problems, a greedy algorithm fails to deliver a globally optimal solution. 1. For the examples we have looked at this far (shortest path, minimum cost spanning tree, shortest job first scheduling, and the Huffman algorithm), the greedy algorithm actually produces a result that can be shown to be globally optimal - i.e. it finds the best possible solution. 2. For other problems, however, finding the globally optimal solution may require a step that is not locally optimal. A simple example of this is finding one's way through a maze. a. A greedy algorithm for finding one's way through a maze is as follows: never go back to a square you've already visited unless you have no other choice; where two or non-backtracking moves are possible, choose the one that moves you closer to the goal. b. An example where this greedy algorithm finds the best path: (S = start, G = goal) +-----------------------+ | | | +---+-----------+ | | |///| | | | +---+ | | | | | S | G | | | |-----------+---+ | |///////////////| | +-----------+---+-------+ c. An example where this greedy algorithm fails to find the best path, because a move away from the goal (not locally optimal) is needed to find the best (globally optimal) path. +-----------------------+ | | | +---+---+-------+ | | | | | |-------+ | | | | S | G | | | | --------+ | | | | | | +---------------+-------+ E. As it turns out, it is frequently the case that a problem for which a greedy algorithm fails to find the best solution may be one for which finding the best solution inherently requires exponential effort. In such cases, a greedy algorithm may still be be a useful approach to finding a solution that is generally close enough - given that an algorithm for finding the optimal solution may not be practical (e.g. it may be NP-complete) or an algorithmic solution may not exist at all. 1. A good example of such a problem is the bin packing problem. a. The problem originates in the way the post office handles packages: i. The post office uses large cloth bins which are filled with packages and then loaded on a plane or a truck. (Perhaps you've seen one at a PO.) ' ii. The problem is this: given a supply of bins of some fixed capacity, and packages of varying sizes, find a way to put the packages in the bins in such a way as to use the fewest possible bins. iii. To simplify our discussion, we will simplify the problem in two ways: - We will assume that the size of each package can be represented by a single number (i.e. we will not consider issues of shape - only overall volume). - We will normalize the sizes to the capacity of the bin, so that a bin will be considered to have a capacity of 1, and the size of each package wil be represented as some fraction of the bin capacity (e.g. 0.3). We will assume that the bin can hold any number of packages for which (sum of size) <= 1. iv. Although we couch the problem in terms of packing bins with packages, similar problems arise in other areas - e.g. allocating memory using operator new (which satisfies requests by carving off smaller pieces from large blocks allocated by the operating system, or allocating space for files on disk, when holes are created by the deletion of other files. b. The problem actually comes in two versions: the online version and the offline version. i. 
In the online version, a decision about where to place each package must be made before the next package is seen. This would correspond to a situation like the following: Wall with small window in it | O | | --+-- _ / \ | | Customers hand packages Clerk and bins | to clerk one at a time The clerk must place each package in a bin as it is handed through the window, before getting to see the next package. ii. In the offline version, it is possible to look at the entire list of packages before making a decision about where to place each one. c. It is easy to show that there cannot be an algorithm that always finds the optimal packing for the online version of the problem. Suppose such an algorithm exists, and is asked to pack a total of four packages, using the minimum possible number of bins. Suppose the first two packages have sizes 0.45 and 0.45. Into which bin should the algorithm place the second package? It turns out that the answer depends on the size of the next two packages, which the online version is not allowed to know until a decision has been made about the second package. i. If the next two packages are size 0.55 and 0.55, then the optimal choice would be to place the second package in an empty bin. This would yield a final packing using just two bins Bin 1: First package (0.45) + Third package (0.55) Bin 2: Second package (0.45) + Fourth package (0.55) However if the second package is placed in the same bin as the first, the final packing would require three bins: Bin 1: First package (0.45) + Second package (0.45) Bin 2: Third package (0.55) Bin 3: Fourth package (0.55) ii. If the next two packages are size 0.60 and 0.60, then the optimal choice would be to place the second package in the same bin as the first. This would yield a final packing using three bins: Bin 1: First package (0.45) + Second package (0.45) Bin 2: Third package (0.60) Bin 3: Fourth package (0.60) However, if the second package is placed in an empty bin, the final packing would require four bins: Bin 1: First package (0.45) Bin 2: Second package (0.45) Bin 3: Third package (0.60) Bin4 : Fourth package (0.60) Since either choice made by the algorithm for the second package could turn out to be wrong in some case, there cannot be an algorithm that always makes the right choice. d. For the offline version of the bin packing problem, it is possible to find an optimal packing. (Consider all possibilities and pick the best, which takes time exponential in the number of packages.) It turns out that offline bin-packing has been proved to be NP-complete. Thus, if the commonly held view of the relationship between P and NP is true, then ANY offline algorithm that always discovers an optimal solution to the bin-packing problem must take exponential time. e. Since there is not a practical algorithmic solution to either form of the bin-packing problem, it is worth considering whether a greedy algorithm might yield a solution that is close enough to optimal. 2. We consider first the online version of the problem a. There are three greedy strategies we might consider. i. One greedy strategy, called NEXT-FIT, goes like this: if the package we are dealing with would fit in the same bin as the previously packed package, then put it there - else start a new bin. (Note that, once we start packing a new bin, we never go back and put any packages in previous bins. This might be advantageous in some applications, because once a bin is declared packed, it can be moved out the door and loaded on the truck or whatever.) ii. 
A second greedy strategy, called FIRST-FIT, goes like this: as we pack each package, look at each of the bins in turn, and place it in the first bin we find where it fits. Start a new bin only if we cannot fit the package in any of the others. iii. A third greedy strategy, called BEST-FIT, goes like this: as we pack each package, look at each of the bins in turn, and place it it the bin where it fits best - i.e. leaves the least unused space. Start a new bin only if we cannot fit the package in any of the others. b. To see the difference between these strategies, suppose we are trying to pack a package of size 0.2 under the following scenario (where the last package packed was placed in bin 3) Bin 1: Currently contains 0.7 Bin 2: Currently contains 0.8 Bin 3: Currently contains 0.3 Next fit would put the package in bin 3 First fit would put the package in bin 1 Best fit would put the package in bin 2 c. Which strategy is best? i. Next fit will never yield an overall result that is better than first fit or best fit. However, it is the simplest to implement, and is the fastest running. (Each choice is theta(1), since only the most recently used bin has to be examined, as opposed to theta(n) for the other two.) Also, once next fit declares a bin full, it can never be considered again, whereas with the other two algorithms no bins can be "shipped" until all the packages have been placed. ii. It turns out that there are sets of data for which first fit gives the optimal result, and the others don't; and there are other sets of data where best fit gives the optimal result, and the others don't. Example: sequence of sizes 0.3 0.8 0.1 0.6 0.2 NF: Bin 1: 0.3 Bin 2: 0.8 0.1 Bin 3: 0.6 0.2 FF: Bin 1: 0.3 0.1 0.6 Bin 2: 0.8 0.2 BF: Bin 1: 0.3 0.6 Bin 2: 0.8 0.1 Bin 3: 0.2 Example: sequence of sizes 0.3 0.8 0.2 0.7 NF: Bin 1: 0.3 Bin 2: 0.8 0.2 Bin 3: 0.7 FF: Bin 1: 0.3 0.2 Bin 2: 0.8 Bin 3: 0.7 BF: Bin 1: 0.3 0.7 Bin 2: 0.8 0.2 d. It is possible to analyze the behavior of each of these strategies, and to show that: i. Next fit is guaranteed to find a result that requires no more than twice the optimal number of bins (and there is some data that will force it to use very close to this number.) ii. First fit is guaranteed to find a result that requires no more than 17/10 times the optimal number of bins (and again there is some data that will force it to use very close to this number.) iii. Best fit is also guaranteed to find a result that requires no more than 17/10 times the optimal number of bins (and again there is some data that will force it to use very close to this number.) 3. For the offline version of the problem, a greedy algorithm is still of interest, even though it cannot guarantee optimal results, since the problem is NP-complete. a. The offline versions of the greedy algorithms are derived from the online versions based on the observation that we will generally get better results by packing the bigger items first, and then fitting the smaller items into the remaining spaces. b. An offline version of the first fit algorithm is called FIRST FIT DECREASING. It considers packages in decreasing order of size, beginning with the largest. Each is placed using first-fit. Example: earlier we showed that the sequence 0.3 0.8 0.2 0.7 requires three bins if packed using an online first fit algorithm. 
If we use first fit decreasing offline, we consider the packages in the order 0.8, 0.7, 0.3, 0.2, and pack them as follows: Bin 1: 0.8 0.2 Bin 2: 0.7 0.3 It is possible to prove that if M is the optimal number of bins needed to pack some list of items, then first fit decreasing never uses more than 11/9 M + 4 bins to pack the same items. c. It is also possible to derive offline versions of next fit and best fit, which we won't discuss. IV. Divide-And-Conquer Algorithms -- ------------------ ---------- A. An algorithm-design strategy behind several of the algorithms we have seen is divide and conquer. 1. The basic strategy is this: partition the initial problem into two or more smaller subproblems solve each subproblem (recursively) stitch the solutions to the subproblems together to yield a solution to the original problem 2. Examples we have seen? ASK a. One of the solutions to the maximal vector subsequence sum problem we discuss when we introduced algorithm analysis b. Fibonacci Numbers c. Towers of Hanoi d. Traversal of a binary tree e. Quick Sort f. Merge Sort B. Divide and conquer is often a useful strategy for finding good algorithms. Let's look at another example: 1. As you know, standard integer representations are limited by the number of bits used to represent an integer (64 on modern machines). What happens if we need to represent integers larger than this? a. The typical solution is to use an array of int (32-bit integers), treated as digits base 2^32. Example: a 100 decimal digit integer a might be represented by an array of 10 32-bit binary integers as 288 256 224 192 160 128 96 64 32 0 a * 2 + a * 2 + a * 2 + a * 2 + a * 2 + a * 2 + a * 2 + a * 2 + a * 2 + a * 2 9 8 7 6 6 4 3 2 1 0 In general, we can measure the size of such a representation by the size of the array - e.g. we would consider the size of the above example to be 10. b. Now suppose we had two large integers (a and b) each represented using an array of n 32-bit integers Let's consider the complexity of various arithmetic operations. i. Addition: We will require n additions - i.e. sum = a + b; 0 0 0 sum = a + b + carry from sum, etc. 1 1 1 0 - so the operation is theta(n) ii. Subtraction is similar, and is also theta(n). iii. However, for multiplication, it looks like we will require theta(n^2) multiplications, since 288 256 224 288 256 224 (a * 2 + a * 2 + a * 2 + ... ) * (b * 2 + a * 2 + a * 2 + ...) = 9 8 7 9 8 7 576 544 512 a * b * 2 + (a * b + a * b ) * 2 + (a * b + a * b + a * b ) * 2 + ... 9 9 9 8 8 9 9 7 8 8 7 9 - so each of the n coefficients in a are multiplied by each of the n coefficients in b. 2. We could consider a divide and conquer approach a. divide the arrays representing each number in half (which we call A , A and B , B below). Then the product becomes 1 0 1 0 16n 16n (A 2 + A )(B 2 + B ) = 1 0 1 0 32n 16n 0 A B * 2 + (A B + A B) * 2 + (A B ) * 2 1 1 1 0 0 1 0 0 b. Then we can continue by calculating each of the A's and B's by dividing the arrays in two until we get to arrays having a single element, at which point ordinary multiplication works. c. However, this hasn't reduced the total effort = each of the products after the first division only requires n^2/4 multiplications, but since there are 4 of them the overall computation is still theta(n^2). d. At this point, though, we could take advantage of an observation first made by Gauss in a different context. 
Observe that

          A1 B0 + A0 B1 = (A1 + A0)(B1 + B0) - A1 B1 - A0 B0

      Since we need to calculate A1 B1 and A0 B0 anyway, we can use this to
      replace the original four products by three products (plus a few extra
      additions and subtractions).

   e. That means, at each stage in the divide and conquer, we only need to
      create 3 subproblems, each with 1/4 the effort, rather than 4. And that
      benefit compounds itself at each stage. (We will look at the effect
      quantitatively in a bit.)

C. An algorithmic pattern that is very similar to divide and conquer is
   decrease and conquer.

   1. In this pattern, we partition a problem into some number of subproblems,
      but then discard all but one of these subproblems and solve the original
      problem by solving this one. (Note that the term "divide and conquer" is
      usually not used for algorithms that discard all but one of the
      subproblems.)

   2. It turns out that many search strategies are actually examples of this
      pattern.

      a. Example: binary search of an ordered array - we compare the search
         target to the middle key of the array. Based on the outcome of this
         comparison, we continue our search in either the first or last half
         of the array, ignoring the other half.

      b. Example: search in any sort of m-way search tree (binary, 2-3-4, or
         B-Tree) - we compare the search target to the keys stored in a node,
         and then continue our search in one of its children, ignoring the
         others.

   3. Moreover, maintenance of an m-way search tree is a form of decrease and
      conquer.

      a. For example, when we insert into a binary search tree, at each level
         we use comparison of the key we are inserting with the key at the
         current node to decide whether to insert into its left or right
         subtree.

      b. Deletion is similar.

   4. Let's look at another example. Suppose we have an unordered list of n
      numbers, and want to find the k-th smallest member.

      a. If we wanted the smallest (or the nth smallest - which would be the
         largest), there is a straightforward theta(n) algorithm.

      b. For arbitrary k, it would be possible to sort the list and then take
         the element in position k of the result. However, this would require
         theta(n log n) time because of the sort.

      c. Can we do this for arbitrary k in just theta(n) time? It turns out
         the answer is yes. (A code sketch appears below.)

         i. Choose a partitioning element (perhaps at random, or using some
            arbitrary scheme such as the first element). Partition the
            original list into two sublists, one containing all the elements
            less than or equal to the partitioning element and one containing
            all the elements greater. While doing this, keep track of the
            count of elements (c) in the list containing the smaller elements.

         ii. Now, if c >= k, the element we want is also the kth smallest
             element in the first sublist. If k > c + 1, the element we want
             is the (k - c - 1)th smallest element in the second sublist.
             (Of course, if k = c + 1, the partitioning element itself is the
             one we want, but this would be rare.)

         iii. What is the complexity of this process? Since, on the average,
              partitioning with a random pivot like this produces sublists of
              roughly equal length, the first partitioning would require
              looking at all elements, but the second would look at only n/2,
              the third only n/4 ...

         iv. Therefore, the total number of elements examined is, on average,
             n + n/2 + n/4 + ... + 1 = 2n. So we now have a theta(n)
             algorithm!

D. Analysis of Divide and Conquer Algorithms

   1. Recursive algorithms of the sort that arise in connection with divide
      (or decrease) and conquer can be hard to analyze.
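Here is a minimal sketch of the k-selection procedure just described, using the first element of the current sublist as the partitioning element for simplicity. The name kthSmallest and the use of a vector passed by value (so the caller's data is not rearranged) are ours:

    #include <utility>
    #include <vector>

    // Return the kth smallest element (k = 1 .. v.size()) of an unordered
    // vector. Average time is theta(n); consistently bad pivots give theta(n^2).
    int kthSmallest(std::vector<int> v, size_t k)
    {
        size_t lo = 0, hi = v.size();       // current sublist is [lo, hi)
        while (true)
        {
            int pivot = v[lo];
            size_t mid = lo;                // v[lo+1 .. mid] holds elements <= pivot
            for (size_t i = lo + 1; i < hi; i ++)
                if (v[i] <= pivot)
                {
                    mid ++;
                    std::swap(v[mid], v[i]);
                }
            std::swap(v[lo], v[mid]);       // put the pivot between the two sublists
            size_t c = mid - lo;            // count of elements in the smaller sublist
            if (k <= c)                     // answer lies in the first sublist
                hi = mid;
            else if (k == c + 1)            // the pivot itself is the answer
                return v[mid];
            else                            // (k - c - 1)th smallest of second sublist
            {
                k -= c + 1;
                lo = mid + 1;
            }
        }
    }

Choosing the partitioning element at random instead (see the discussion of randomized algorithms later) guards against consistently lopsided splits.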
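The three-multiplication trick for big integers can also be seen in miniature on ordinary machine words. The sketch below multiplies two 32-bit values using three 16 x 16 bit products plus shifts and adds; the name karatsuba32 is ours, and a real big-integer version would apply the same identity recursively to the halves of the digit arrays (we analyze the cost of doing so in example 2.b below):

    #include <cstdint>

    // Multiply two 32-bit unsigned values using three 16x16-bit products
    // instead of four, via the identity
    //    A1*B0 + A0*B1 = (A1 + A0)*(B1 + B0) - A1*B1 - A0*B0
    uint64_t karatsuba32(uint32_t a, uint32_t b)
    {
        uint32_t a1 = a >> 16, a0 = a & 0xFFFF;     // high and low 16-bit halves
        uint32_t b1 = b >> 16, b0 = b & 0xFFFF;
        uint64_t high = (uint64_t) a1 * b1;                             // product 1
        uint64_t low  = (uint64_t) a0 * b0;                             // product 2
        uint64_t mid  = (uint64_t) (a1 + a0) * (b1 + b0) - high - low;  // product 3
        return (high << 32) + (mid << 16) + low;
    }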
In the case of these algorithms, there is a general approach that works
      for many (but not all) divide/decrease and conquer algorithms.

      a. Let T(n) = the time it takes the algorithm to solve a problem of
         size n. (Assume T(n) = O(1) for sufficiently small n.)

      b. Assume that, for the recursive case, the algorithm solves a problem
         of size n by partitioning it into a subproblems of size n/b, where a
         and b are integer constants.

         E.g. for the average case of Quick Sort, a is 2 and b is 2 - we
         partition a problem of size n into two subproblems of size n/2.
         The same is true for Merge Sort.

      c. Suppose, further, that the time for partitioning a problem of size n
         into subproblems is f(n), and the time for stitching the solutions
         together after the subproblems have been solved is g(n).

         E.g. for Quick Sort, f(n) is O(n) and g(n) is O(1). For Merge Sort,
         f(n) is O(1) and g(n) is O(n).

      d. Then the time to solve a problem of size n is given by the recurrence

            T(n) = time to partition + time to solve subproblems + time to stitch
                 = f(n) + aT(n/b) + g(n)
                 = aT(n/b) + (f(n) + g(n))

      e. There is a general rule for solving recurrences of this form (which
         we state here without proof).

         If a recurrence is of the form

            T(n) = aT(ceiling(n/b)) + theta(n^k)

         - where a, b, and k are constants, with a > 0, b > 1, and k >= 0 -

         then

            T(N) = theta(N^(log_b a))   if a > b^k
                   theta(N^k log N)     if a = b^k
                   theta(N^k)           if a < b^k

      f. This formula is known as the "master theorem".

   2. Examples of applying this:

      a. Traversal of a binary tree:

         - We visit the root (which we'll assume is O(1)), and traverse each
           of the subtrees in some order.
         - On the average, each subtree has almost N/2 nodes.

         Recurrence is T(N) = 2 T(N/2) + O(1), so a = 2, b = 2, k = 0.

         First case applies: T(N) = theta(N) - which is, of course, what we
         would expect, since we visit each node exactly once.

      b. Multiplication of big integers as discussed above - here's a case
         where the formula really helps.

         - At each step, we split each number into two halves of length N/2
           and perform three multiplications. Splitting takes O(1) time, but
           stitching the results together requires O(N) additions to handle
           carries.

         Recurrence is T(N) = 3 T(N/2) + O(N), so a = 3, b = 2, k = 1.

         First case applies: T(N) = theta(N^(log_2 3)) = theta(N^1.58) - a
         significant improvement over the theta(n^2) algorithm we considered
         at first.

      c. Merge Sort:

         - We split into two sublists of length N/2, which takes O(1) time,
           sort them, then merge them together (which takes O(N) time).

         Recurrence is T(N) = 2 T(N/2) + O(N), so a = 2, b = 2, k = 1.

         Second case applies: T(N) = theta(N log N).

         (Quick Sort is similar, except the split is O(N) and the stitch is
         O(1), but the recurrence equation and hence the solution is the
         same.)

      d. Binary search:

         At each step, we create two subproblems, but only need to solve one.
         Since both splitting and stitching are O(1), we get the recurrence

            T(N) = T(N/2) + O(1), so a = 1, b = 2, and k = 0.

         Second case applies: T(N) = theta(log N).

      e. k-selection:

         At each step, we create two subproblems in O(N) time, but only need
         to solve one, so the recurrence is

            T(N) = T(N/2) + O(N), so a = 1, b = 2, and k = 1.

         Third case applies: T(N) = theta(N).

   3. Note that the master theorem does not apply to all divide and conquer
      algorithms, because it requires a, b, and k to be constants. For
      example, it does not apply to the recursive computation of the
      Fibonacci numbers using the definition Fib(n) = Fib(n-1) + Fib(n-2)
      [with base cases n = 1 and n = 2].

      a.
The recurrence is T(n) = T(n-1) + T(n-2) (Note that, by inspection, T(n) is O(Fib(n))) b. Here, if we wished to attempt to apply the master theorem, we could argue that a = 2 and k = 0 (the partition/stitch time is constant.) However, b is n / (n - 1), which while always greater than 1 becomes increasingly close to 1 as n increases, so the master theorem does not apply. c. In fact, the recursive divide and conquer algorithm to calculate Fibonacci numbers is impractical for n of any significant size, so the analysis is not useful in any case. Fortunately, there is a linear time algorithm, as we shall see when we talk about dynamic programming! V. Dynamic Programming - ------- ----------- A. In the last section, we were reminded that sometimes a recursive divide and conquer algorithm can have very poor performance. 1. A good example of this is Fibonacci numbers. To see why, consider the tree generated by the computation of Fib(6): Fib(6) / \ Fib(5) Fib(4) / \ / \ Fib(4) Fib(3) Fib(3) Fib(2) / \ / \ / \ Fib(3) Fib(2) Fib(2) Fib(1) Fib(2) Fib(1) / \ Fib(2) Fib(1) Observe that we do certain computations many times - e.g. we compute Fib(5) once Fib(4) twice Fib(3) thrice Fib(2) 5 times Fib(1) 3 times 2. A much more efficient approach is to save previously computed results and re-use them when needed, instead of repeating the computation. This would yield the following tree for Fib(6), which would require linear time. (Cases marked with an asterisk re-use previously computed results instead of re-doing them - note that each Fibonacci number value from 1 to 6 is computed just once.) Fib(6) / \ Fib(5) Fib(4) * / \ Fib(4) Fib(3) * / \ Fib(3) Fib(2) * / \ Fib(2) Fib(1) 3. The following linear time algorithm incorporates this insight: int fibAux(int n, int saved []) { if (saved[n-1] == -1) saved[n-1] = fibAux(n-1, saved) + fibAux(n-2, saved); return saved[n-1]; } int fib(int n) { // Use an array to save previously computed values. An // initial value of -1 indicates we have not yet computed // the value and so need to do so. int saved[n]; for (int i = 0; i < n; i ++) saved[i] = -1; saved[0] = saved[1] = 1; // By definition return fibAux(n, saved); } 4. A simpler algorithm that builds up the solution from small values is the following. int fib(int n) { if (n <= 2) return 1; int last = 1; int nextToLast = 1; int current = 1; for (int i = 3; i <= n; i ++) { current = nextToLast + last; nextToLast = last; last = current; } return current; } B. The strategy we just used to improve the calculation of the Fibonacci numbers is an illustration of a general algorithm design technique called Dynamic Programming. In Dynamic Programming, we use a table of previously calculated results to assist us in deriving new results, rather than calculating them from scratch. C. An example developed in the book: Longest Common Subsequence (LCS). 1. Recall the following from the book discussion: a. A subsequence of sequence is a sequence of elements that occur in the same order somewhere in the sequence - not necessarily without gaps between elements. Example: For the string ABC, the subsequences are , A, B, C, AB, AC, BC, and ABC b. A common subsequence of two sequences is a subsequence of both sequences Example: For the strings ABC and DADCD, the common subsequences are , A, C, and AC - since they are also subsequences of DADCD, while the subsequences of ABC that contain B are not subsequences of DADCD c. The longest common subsequence (LCS) is the subsequence that is the longest i. 
In the example we have been using, the LCS is AC ii. It may be for some pairs of strings that the LCS is of length 0 - e.g. the LCS of ABC and DEF is iii. It may be that the LCS of two strings is not unique - i.e. there may be two or more different subsequences that both have the same maximal length. Example: both AB and AC are LCSs of ABC and ACB d. As the text notes, LCS if useful in genetics for comparing DNA strings (sequences of the bases A, C, G, and T) and in other areas as well. 2. A brute force algorithm would compute all the subsequences of the shorter string and then test each to see if it is a subsequence of the longer - an approach that is more than exponential in the length of the shorter string, and hence usually not practical. 3. The book discusses how dynamic programming might be used to develop an algorithm whose complexity is proportional to the product of the lengths of the two strings - i.e. theta(n^2) if the two strings have the same length. The basic idea is to make use of a table with rows corresponding to the characters of one string, and columns corresponding to characters of the second string (and with an extra row and column at the start). The entries in the table represent the length of the LCS ending with that position in each of the two strings, with the bottom rightmost entry representing the length of the overall LCS. For the example in the book: LCS of the DNA sequences GTTCCTAATA and CGATAATTGAGA the initial table would look like this (dummy rows and columns filled in with 0) PROJECT C G A T A A T T G A G A -1 0 1 2 3 4 5 6 7 8 9 10 11 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 0 T 1 0 T 2 0 C 3 0 C 4 0 T 5 0 A 6 0 A 7 0 T 8 0 A 9 9 4. The table is filled in row by row from top to bottom. a. If an entry corresponds to a place where the two strings agree, the value is 1 more than the entry diagonally above it. b. If an entry corresponds to a place where the two strings disagree, the value is the maximum of the value just to the left and just above it c. Example - the entry in row 0, column 0 (G, C) is filled in with 0. d. Example - the entry in row 0, column 1 (G, G) is filled in with 1. e. The last entry to be filled in - the bottom right one - represents the length of the LCS. PROJECT: Figure 12.2 - page 563 5. The table gives the _length_ of the LCS. To get the LCS itself, one works backwards from the bottom right corner, finding entries in the LCS from last to first. a. If an entry corresponds to a place where the two strings agree, the character in question is part of the LCS, and one moves diagonally up and to the left. b. If an entry corresponds to a place where the two strings don't agree, one moves either left or up, choosing the bigger of the two or choosing one direction arbitrarily if the two are the same. (Of course, in this case, a character is not included in the LCS). PROJECT same figure - note trace of finding CTAATA c. Sometimes, a pair of sequences will have two or more LCSs of the same length. This will be reflected in a situation in the table where the choice of moving up or left is arbitrary because of a tie. Example: the example in the book actually has two LCSs of length 6. The second can be found by making the choice to go up rather than left in row 8, column 10. PROJECT - same figure, but showing trace of finding GTTTAA ASK - are there others? (Yes - first derivation, but go up rather than left at row 4 column 2, yielding GTAATA D. Another Example: Weight-Balanced Binary Search Trees 1. 
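(Before developing this next example, here is a minimal sketch of the LCS table fill just described. It computes only the length of the LCS; recovering an actual LCS would follow the backward trace described above. The function name lcsLength is ours, and row 0 / column 0 play the role of the dummy row and column of zeroes:)

    #include <algorithm>
    #include <string>
    #include <vector>

    // Length of the longest common subsequence of x and y.
    // Entry table[i][j] is the LCS length of the first i characters of x
    // and the first j characters of y.
    int lcsLength(const std::string & x, const std::string & y)
    {
        std::vector< std::vector<int> > table(x.length() + 1,
                                              std::vector<int>(y.length() + 1, 0));
        for (size_t i = 1; i <= x.length(); i ++)
            for (size_t j = 1; j <= y.length(); j ++)
                if (x[i-1] == y[j-1])       // the two strings agree here
                    table[i][j] = table[i-1][j-1] + 1;
                else                        // they disagree: take the larger neighbor
                    table[i][j] = std::max(table[i-1][j], table[i][j-1]);
        return table[x.length()][y.length()];   // bottom right entry
    }

For the book's example, lcsLength("GTTCCTAATA", "CGATAATTGAGA") returns 6, matching the length-6 subsequences (such as CTAATA and GTTTAA) traced above.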
Earlier, we talked about strategies for maintaining height-balanced binary search trees. Where the set of keys to be stored in a tree is fixed, and we know the relative probabilities of accessing the different keys, it is possible to build a WEIGHT-BALANCED tree in which the average cost of tree accesses is minimized. a. For example, suppose we had to build a binary search tree consisting of the following C/C++ reserved words. Suppose further that we had data available to us as to the relative frequency of usage of each (expressed as a percentage of all uses of words in the group), as shown: break 55% Note: The numbers are contrived to make a point! case 25% In no way do they represent actual frequencies for 11% for typical C/C++ code! if 5% int 2% switch 1% while 1% b. Suppose we constructed a height-balanced tree, as shown: if / \ case switch / \ / \ break for int while - 5% of the lookups would access just 1 node (if) - 25% + 1% = 26% would access 2 nodes (case, switch) - 55% + 11% + 2% + 1% = 69% would access 3 nodes (the rest) Therefore, the average number of accesses would be (.05 * 1) + (.26 * 2) + (.69 * 3) = 2.64 nodes accessed per lookup c. Now suppose, instead, we constructed the following search tree break \ case \ for \ if \ int \ switch \ while The average number of nodes visited by lookup is now - 55% access 1 node (break) - 25% access 2 nodes (case) - 11% access 3 nodes (for) - 5% access 4 nodes (if) - 2% access 5 nodes (int) - 1% access 6 nodes (switch) - 1% access 7 nodes (while) (.55 * 1) + (.25 * 2) + (.11 * 3) + (.05 * 4) + (.02 * 5) + (.01 * 6) + (.01 * 7) = average 1.81 nodes accessed This represents over a 30% savings in average lookup time d. Interestingly, for the particular distribution of probability values we have used, this tree is actually optimal. To see that, consider what would happen if we rotated the tree about one of the nodes - e.g. around the root: case / \ break for \ if \ int \ switch \ while We have now reduced the number of nodes accessed for lookups in every case, save 1. But since break is accessed 55% of the the time, the net change in average number of accesses is (.55 * +1) + ((1 - .55) * - 1) = .55 - .45 = +.10. Thus, this change makes the performance worse. The same phenomenon would arise with other potential improvements. e. In general, weight balancing is an appropriate optimization only for static trees - i.e. trees in which the only operations performed after initial construction are lookups (no inserts, deletes.) Such search trees are common, though, since programming languages, command interpreters and the like have lists of reserved or predefined words that need to be searched regularly. Of course, weight balancing also requires advance knowledge of probability distributions for the accesses. (For a compiler for a given programming language, this might be discovered by analyzing frequency of reserved word usage in a sample of "typical" programs.) 2. We could consider a greedy approach to discovering the optimal binary search tree. a. The basic idea would be to make the key of highest probability the root of the tree. The keys of next highest probability would be its children, etc. - subject to the constraints of the tree being a binary search tree (e.g. only a key smaller than the root of the overall tree could be the root of the left subtree.) b. Applying this approach to the example we just considered would yield an optimal tree. c. However, the greedy strategy will not always find the optimal tree. d. 
However, unlike previous cases where the greedy strategy fails to find the optimal tree, finding the optimal tree does not require exponential time. 3. We now consider a method for finding the optimal binary search tree for a given static set of keys, given an advance knowledge of the probabilities of various values being sought, which finds the optimal tree in theta(n^2) time. a. The basic idea i. For an optimal tree containing n keys, if key k is the root, then the two subtrees are optimal trees made up of the first k-1 keys and the last (n-k)-1 keys. ii. We build up a table in with rows describing optimal trees with 1 key, 2 keys ... n keys, and columns corresponding possible starting positions of the subtree (e.g. the first column corresponds to subtrees that start with the first key). (a) If there are n keys, there will be n rows, with the last describing the optimal tree that contains all n keys - which is what we want. (b) While the first row has n columns, the second row (describing subtrees containing two keys) has only n-1 columns, since a subtree that contains two keys cannot start with the last key. This pattern continues until the last row has only one column, since the subtree it describes must start with the first key. b. Filling in the first row of the table is trivial, since there is only one possibility in each case for a tree containing only one key. c. We then fill in the rest of the table row-by-row, using information from the previous row. Example: when filling in the entry for an optimal tree containing the first four keys, we consider four possibilities: key 1 key 2 key 3 keys 4 / \ / \ / \ / \ empty optimal optimal optimal optima1 optimal optimal empty subtree subtree subtree subtree subtree subtree subtree subtree containing containing containing containing containing containing containing keys 2-4 key 1 keys 3-4 keys 1-2 key 4 keys 1-3 Since the costs of the different subtrees have already been calculated, we choose the least expensive root, and continue working across, then d. Of course, we must also allow for the possibility of unsuccessful search. To handle this, we convert our search tree into an EXTENDED TREE by adding FAILURE NODES (by convention, drawn as square boxes.) Example: our a balanced tree for the seven C++ keywords: if / \ case switch / \ / \ break for int while / \ / \ / \ / \ [] [] [] [] [] [] [] [] Each failure node represents a group of keys for which the search would fail - e.g. the leftmost one represents all keys less than break [e.g. a, apple, boolean]; the second all keys between break and case [c, class] etc. To discover the optimal tree, we need to consider both the probabilities of the keys and the probabilities of the various failure nodes - i.e. the probability that we will be searching for something that is not in the tree and will end up at that node. 4. To find an optimal tree, we need to define some terms and measures: a. We will number the keys 1 .. n b. Probabilities connected with the various keys i. Let p be the probability of searching for key (1 <= i <= n) i i ii. Let q be the probability of searching for a non-existent key i lying between key and key . (Of course q represents i i+1 0 all values less than key , and q all values greater than key .) 1 n n iii. Clearly, since we are working with probabilities, the sum of all the p's and q's must be 1. c. T is the optimal binary search tree containing key through key . ij i+1 j d. T , then, is an empty tree, consisting only of the failure node ii lying between key and key . 
(That is, the failure node lying between key i and key i+1. Since stacked
      subscripts get hard to read, from here on we will write T(i,j) for this
      tree, and similarly w(i,j), c(i,j), and r(i,j) for the weight, cost, and
      root defined below.)

   e. We will denote the weight of T(i,j) by w(i,j). Clearly,

         w(i,j) = p(i+1) + p(i+2) + ... + p(j) + q(i) + q(i+1) + ... + q(j)

      which is the probability that a search will end up in T(i,j). The weight
      of the empty tree T(i,i), then, is q(i) - the probability of the failure
      node lying between key i and key i+1. Note that, for a non-empty tree,
      the weight is simply the probability of the root plus the sum of the
      weights of the subtrees.

   f. We will denote the cost of T(i,j) - i.e. the average number of
      comparisons needed by a search that ends in T(i,j) - by c(i,j).

      c(i,j) is calculated as follows:

      - If T(i,j) is empty (consists only of a failure node), then its cost is
        zero - i.e. once we get to it, we need do no further comparisons.

      - Otherwise, its cost is the probability of its root, plus the sum of
        the weights of its subtrees, plus the sum of the costs of its
        subtrees.

        - The first term represents the fact that a search for the key at the
          root costs one comparison.

        - The rationale for including the costs of the subtrees in the overall
          cost should be clear. To this, we add the WEIGHTS of the subtrees to
          reflect the fact that we must do one comparison at the root BEFORE
          deciding which subtree to go into, and the probability that that
          comparison will lead into a given subtree is equal to the weight of
          that subtree.

   g. Clearly, an optimal binary search tree is one whose cost is minimal.

   h. We will denote the root of T(i,j) by r(i,j).

   i. Example - the balanced tree we considered earlier would be optimal if
      the probabilities of all keys and failures were equal (i.e. each p and
      q = 1/15):

                               if
                            /      \
                       case          switch
                      /    \         /     \
                  break    for     int     while
                  /  \     /  \    /  \     /  \
                 []  []   []  []  []  []   []  []

      i. The cost of each external node = 0, and the weight of each external
         node = 1/15. So

            c(0,0) = c(1,1) = c(2,2) = c(3,3) = c(4,4) = c(5,5) = c(6,6) = c(7,7) = 0
            w(0,0) = w(1,1) = w(2,2) = w(3,3) = w(4,4) = w(5,5) = w(6,6) = w(7,7) = 1/15

      ii. Cost of each tree rooted at a level 3 node (break, for, int, while)
          = probability of root (1/15) + sum of costs of subtrees (0) + sum of
          weights of subtrees (2/15) = 3/15. The weight of each such tree is
          also 3/15. So

            c(0,1) = c(2,3) = c(4,5) = c(6,7) = 3/15
            w(0,1) = w(2,3) = w(4,5) = w(6,7) = 3/15

      iii. Cost of each tree rooted at a level 2 node (case, switch) is 1/15
           (probability of root) + 2 x 3/15 (costs of two subtrees) + 2 x 3/15
           (weights of two subtrees) = 13/15, and weight is 1/15 + 2 x 3/15 =
           7/15. So

            c(0,3) = c(4,7) = 13/15
            w(0,3) = w(4,7) = 7/15

      iv. Cost of the overall tree, c(0,7) =

            probability of root (key 4)          p(4)   =  1/15
          + weight of left subtree (T(0,3))      w(0,3) =  7/15
          + weight of right subtree (T(4,7))     w(4,7) =  7/15
          + cost of left subtree (T(0,3))        c(0,3) = 13/15
          + cost of right subtree (T(4,7))       c(4,7) = 13/15

          so the total cost is 41/15.

      v. Weight of the overall tree, w(0,7) =

            probability of root         1/15
          + weight of left subtree      7/15
          + weight of right subtree     7/15
          = 1 (as expected)

5. Dynamic programming is used in an algorithm for finding an optimal tree,
   given a set of values for the p's and q's.

   a. T(i,j) is the OPTIMAL tree including keys i+1 .. j. Therefore, T(0,n) is
      the optimal tree for the whole set of keys, and is what we want to find.

   b. w(i,j) is the WEIGHT of T(i,j).

      - For i = j,  w(i,j) = q(i).
      - For i < j,  w(i,j) = p(r(i,j)) + w(i, r(i,j)-1) + w(r(i,j), j)

   c. c(i,j) is the COST of T(i,j).

      - For i = j,  c(i,j) = 0.
      - For i < j,  c(i,j) = p(r(i,j)) + w(i, r(i,j)-1) + w(r(i,j), j)
                                       + c(i, r(i,j)-1) + c(r(i,j), j)
                           = w(i,j) + c(i, r(i,j)-1) + c(r(i,j), j)

   d. r(i,j) is the ROOT of T(i,j).

      - Obviously, r(i,j) is undefined if i = j. (We will record the value as
        0 in this case.)

      - If i < j, then the subtrees of T(i,j) are T(i, r(i,j)-1) and
        T(r(i,j), j). (Clearly, if T(i,j) is optimal then its subtrees must be
        also.)

      - We consider each possible value for r(i,j) and then pick the one that
        yields the lowest value for c(i,j). Because we build the trees up by
        first considering trees containing 0 keys, then 1, then 2 ..., we have
        already calculated the w and c values we need to perform this
        comparison.

      - It turns out that, in exploring possible values for r(i,j), we don't
        need to consider values less than r(i,j-1) or greater than r(i+1,j),
        which greatly reduces the effort.

6. As an example, the operation of the algorithm for four keys looks like
   this, if the probabilities are p = (3/16, 3/16, 1/16, 1/16) and
   q = (2/16, 3/16, 1/16, 1/16, 1/16).

   PROJECT - For convenience the probabilities are multiplied by 16, which
   doesn't affect the correct operation of the algorithm but eliminates a lot
   of "/16"s.

   a. The first row represents empty trees, whose weights are simply the
      appropriate "q" value, whose costs are 0, and whose roots are undefined.

   b. The second row represents trees containing just one key. In each case,
      the weight is the probability of the one key plus the weights of the two
      adjacent failure nodes, and the cost equals the weight of the tree
      (since the costs of the failure nodes are zero). The root, of course, is
      the one key.

   c. The third row represents the optimal choice for constructing trees of
      two keys.

      i. For example, the first entry represents a tree including keys 1 and 2
         - i.e. T(0,2). The two options would have been to let key 1 be the
         root or key 2 be the root. Calculating the costs:

         - if key 1 is the root, then the cost is
             p(1) + w(0,0) + w(1,2) + c(0,0) + c(1,2) = 3 + 2 + 7 + 0 + 7 = 19

         - if key 2 is the root, then the cost is
             p(2) + w(0,1) + w(2,2) + c(0,1) + c(2,2) = 3 + 8 + 1 + 8 + 0 = 20

         Thus, 1 is chosen as r(0,2) and the cost of 19 is recorded.

      ii. The remaining entries in the row are calculated in the same way.
          Note that the weights and costs needed to compare root choices are
          always available from previous rows.

   d. Subsequent rows represent optimal trees with 3 and then 4 keys. The
      latter is, of course, the final answer. Note that, in each case, we
      consider all viable possibilities for the root using information already
      recorded in the table, and then choose the one with the lowest cost.

7. This algorithm is implemented by the following program:

   PROJECT CODE

8. Time complexity? (ASK CLASS)

   a. At first it may appear to be theta(n^3) [three nested loops].

   b. The code incorporates an improvement suggested by Donald Knuth that
      makes this theta(n^2), by limiting the range of possible roots
      considered when searching for the optimal root - again taking advantage
      of previously computed values. We won't pursue this.

VI. Randomized Algorithms
--  ---------- ----------

A. A final category of algorithm design approaches we want to consider is
   randomized algorithms.

   1. One variant on this approach is to use randomization to deal with the
      possibility of worst case data sets.

   2. A second variant arises when exhaustively testing all the data we need
      to test to get a guaranteed answer is computationally infeasible. In
      such a case, it may be possible to test a random sample and get an
      answer that is sufficiently reliable.

B. As an example of the first category of uses of randomization, consider
   quick sort.

   1.
We know that if we choose the first element in the unsorted data as the
      pivot element, the algorithm degenerates to O(n^2) performance in the
      case where the data is already sorted in either forward or reverse
      order.

   2. Now consider what would happen if we chose a RANDOM element as the
      pivot element.

      a. Obviously, it could still be the case that we happen to make a bad
         choice - indeed, we could end up with a bad choice even if the data
         itself is random, if we happened to choose the smallest (or largest)
         element.

      b. However, the probability of making a bad choice is small, and the
         probability of making bad choices over and over again on successive
         iterations becomes increasingly small.

      c. Further, the pathological case of already sorted data now poses no
         more problem than any other data set. If there is a significant
         probability that our data will contain significant pre-existing
         order, randomly choosing the pivot element may greatly reduce the
         likelihood of pathological behavior (though it cannot eliminate it,
         of course.)

C. As an example of the second category of uses of randomization, consider
   testing an integer to see if it is prime.

   1. This is an important problem in connection with cryptography, since the
      most widely used encryption scheme generates its key from two large
      prime numbers (potentially hundreds of bits each.)

   2. To exhaustively test an integer n to see if it is prime, we would have
      to try dividing it by all possible factors less than or equal to
      sqrt(n). This would seem to be an O(n^(1/2)) operation, which is
      certainly not bad. However, when dealing with cryptographic algorithms,
      we tend to use the NUMBER OF BITS as the measure of problem size. For a
      b-bit number, the maximum value is 2^b - 1, and we need to test possible
      factors in the range 2 .. 2^(b/2). This means exhaustively testing an
      integer to see if it is prime takes time exponential in the number of
      bits.

   3. There are various results from number theory that allow us to test the
      number against a small, randomly-chosen set of values ("witnesses")
      instead. If any of these tests declares the number to be non-prime, it
      is definitely non-prime. If the number passes all the tests, we can say
      with very high probability that it is prime. (Since I don't claim any
      expertise in the relevant number theory, I leave the details to someone
      like Dr. Crisman.)

D. One further issue with using a randomized algorithm, of course, is how do
   we get random numbers on a deterministic machine?

   1. Absent very specialized hardware, the answer is that we settle for
      PSEUDORANDOM SEQUENCES that behave, statistically, like random numbers.

   2. One good way to generate such a sequence is by using a linear
      congruential generator, which generates each new element of the
      sequence, x(i+1), from the previous member of the sequence, x(i), by
      using the congruence

         x(i+1) = A * x(i) mod M

      for appropriately chosen values of A and M.

   3. It is important to choose appropriate values of A and M, and also to
      deal appropriately with the possibility of overflow in the computation.
      (Multiplying two 32-bit integers can yield a product as big as 64
      bits.) Some widely-used "random number" functions actually have some
      very bad characteristics.

   4. As a practical matter, when writing randomized algorithms on a Unix
      system, use the newer random number function random() instead of the
      older rand(), whose lower bits cycle through the same pattern over and
      over. (On Linux systems, rand() is actually random() - the old rand()
      is not used.)
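To illustrate, here is a minimal linear congruential generator using one commonly cited choice of constants, the Park-Miller "minimal standard" A = 16807 and M = 2^31 - 1; the 64-bit intermediate product avoids the overflow problem noted above. (This is a sketch for illustration - in practice, as just noted, one would normally call random() or a modern library generator; the class name LinearCongruential is ours.)

    #include <cstdint>

    // Linear congruential generator:  x(i+1) = A * x(i) mod M
    // Constants are the Park-Miller "minimal standard"; the seed must be
    // in the range 1 .. M-1 (a seed of 0 would get stuck at 0).
    class LinearCongruential
    {
    public:
        explicit LinearCongruential(uint32_t seed) : x(seed) { }

        uint32_t next()
        {
            const uint64_t A = 16807;          // 7^5
            const uint64_t M = 2147483647;     // 2^31 - 1 (a prime)
            x = (uint32_t) ((A * x) % M);      // 64-bit product avoids overflow
            return x;
        }

    private:
        uint32_t x;
    };

A randomized quick sort, for example, could use next() % (number of elements in the current sublist) to pick the pivot position.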