CPS222 Lecture: Sorting

Last Revised 3/13/2015

Objectives:

1. To introduce basic concepts (what do we mean by sorting? internal and
   external sorts; stability)
2. To introduce common internal sorting algorithms.
3. To introduce the basic external merge sort algorithm.
4. To prove that sorting by comparison is omega(n log n)

Materials:

1. Copy of Knuth volume 3 to show
2. Projectable of various internal sorting algorithms
3. Handout with above code
4. Projectable of bubble sort tree for three items that are permutations
   of ABC

I. Introduction
-  ------------

   A. The topic of sorting is a very important one in the area of data
      structures and algorithms, because many computer applications make use
      of sorting in some form or fashion.  As a result, this area has been
      studied extensively, and numerous algorithms with varying performance
      strengths have been developed.

   B. The basic goal of sorting is fairly intuitive - arrange a group of
      items in ascending (or descending) order.  But there are a few nuances
      we need to consider.

      1. Sometimes, we are sorting items such as numbers or names, where the
         entire entity we are sorting also serves as the basis for the sort.
         At other times, we are sorting complex structures based on one piece
         of information - traditionally called the SORT KEY - or just the key
         for short.

         a. Sometimes, the same list of items may even be sorted using
            different sort keys at different times.

            Example: Suppose we create a class Student with instance
            variables like the following:

               id (an integer)
               last name (a string)
               first name (a string)
               major (a string)
               class year (an integer)
               gpa (a real)

            If we have a list of Student objects, it is easy to imagine
            different circumstances under which we would want to use any of
            these instance variables as a sort key - or perhaps even use last
            name and first name together as a COMPOSITE KEY.  (Sort based on
            last name, use first name to break ties.)

         b. For simplicity, we will discuss algorithms for sorting a list
            that is "all key" - but the same principles could be used for
            sorting a list of objects where one field is the sort key.

      2. It turns out that sorting algorithms are generic in the sense that
         (in almost every case) a given algorithm will work with any type of
         sort key that is comparable.

         a. Basically, a C++ type or class is comparable if it defines an
            operator <.  (This includes numeric types, strings, and any class
            for which the class author defines operator <).

         b. Java has a similar notion with an interface called Comparable.
            To implement Comparable, a class must have a method called
            compareTo which, when applied to another object of the same
            class, returns a negative value if it is less than the other
            object, 0 if it is equal, and a positive value if it is greater
            than the other object.

            Note that in Java sorting algorithms are implemented slightly
            differently for sorting objects of primitive type (like ints) -
            where the built-in operator < is used - and for sorting objects
            of class type (including String) - where the class must implement
            Comparable and the compareTo() method is used in the algorithm.

         c. As we develop our algorithms, we won't worry about how the sort
            key is actually defined for the objects we are sorting - we only
            require that the objects to which our sorting algorithms are
            applied somehow define a comparison operation (<) that is
            meaningful for objects of that type.  We'll use < as the
            comparison operator without regard to language-specific nuances.
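            As a concrete illustration of point a, here is a minimal sketch
            (not from the handout) of what "comparable" looks like in C++.
            The choice of id as the sort key is arbitrary:

               #include <string>

               // A class is "comparable" once it defines operator < on its
               // sort key; that is the only comparison the algorithms below
               // will rely on.
               struct Student {
                   int id;
                   std::string lastName;
                   std::string firstName;

                   bool operator<(const Student& other) const {
                       return id < other.id;   // compare on the sort key only
                   }
               };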
         d. Note, too, that we will discuss algorithms in terms of sorting
            into ascending order.  The same algorithms can be used for
            sorting into descending order, except that we reverse the order
            of comparison.

      3. In general, we can consider implementing a sorting operation for
         almost any sort of collection: an array, a vector, a linked list,
         etc.  Traditionally, sort algorithms have been formulated on the
         assumption that the items being sorted are in an array whose
         elements are accessible by operator [].  We'll use array
         terminology, recognizing that the collection we are sorting is not
         necessarily an array.

      4. We are now prepared to define what we mean when we say that an array
         is sorted: we say that an array of entities x[0 .. n-1] is sorted if

            x[0] <= x[1] <= x[2] ... <= x[n-1]

         or, more simply, x[i] <= x[j] whenever 0 <= i <= j <= n-1.

      5. We can also define what we mean by sorting an array: we sort an
         array by permuting it in such a way as to produce a sorted array.

         Note that we explicitly require that the sorted array be a
         permutation of the original array.  This precludes the use of a
         simplistic "sort" algorithm like the following:

            for (int i = 0; i < n; i ++)
               x[i] = x[0];

         This results in a sorted array by our definition, but we don't
         consider this a proper sorting algorithm because the result is not -
         in general - a permutation of what we started with!

   C. We will study a variety of sorting algorithms, because there is no one
      best algorithm for all cases.

      1. Different algorithms work best for different SIZE arrays.

         a. For all but very large arrays, we will typically use an INTERNAL
            SORT, in which the items to be sorted are kept in main memory.
            For very large arrays, we will have to use an EXTERNAL SORT, in
            which the items to be sorted reside in secondary storage (disk or
            tape) and are brought into main memory to be sorted a few at a
            time.

            i. Internal sorts are much faster than external sorts, because
               the access time of external storage devices is orders of
               magnitude greater than that for main memory.

            ii. However, internal sorts are limited as to the amount of data
                they can handle by available main memory, while external
                sorts are limited by available external storage (which is
                generally orders of magnitude bigger.)

                (Note, too, that, with virtual memory, main memory appears
                almost boundless; but if the amount of memory in use becomes
                too great, then paging begins to occur, and the performance
                of the internal sort begins to deteriorate.)

                As memory sizes have grown, external sorting has become
                unnecessary for many applications, but it is still important
                for algorithms that deal with big data.

            iii. Often, an external sorting algorithm will make use of an
                 internal sort, done on a portion of the data at one time, to
                 give it a "head start".

            (We will focus on internal sorts first, and then will talk about
            external sorts.)

         b. Among internal sorts, there are several algorithms with
            theta(n^2) behavior, and several with theta(n log n) behavior.
            Interestingly, for sufficiently small arrays, a theta(n^2)
            algorithm may be faster than a theta(n log n) algorithm, due to a
            smaller constant of proportionality.  Moreover, when implementing
            a recursive "divide and conquer" theta(n log n) algorithm, it is
            common to switch to using a theta(n^2) algorithm when the pieces
            become sufficiently small.

      2. Some algorithms are quite sensitive to the presence of some initial
         order in the items being sorted.  Some algorithms do better when
         this is the case; others actually do worse (they work best on
         totally random data.)
      3. Some algorithms require significant additional space beyond that
         needed to store the actual data to be sorted; others require very
         little additional space.  Extra space required can range from
         theta(1) to theta(n).

      4. In some cases, STABILITY of the algorithm is an important
         consideration.

         a. The issue of stability arises if we are sorting an array where
            duplicate keys are allowed - i.e. two (or more) entries may
            legally have the same key.  (Example: sorting a list of people by
            last name.)

         b. A sort is said to be STABLE if two records having the same key
            value are guaranteed to be in the same relative order in the
            output as they were in the input.

         c. Example: Suppose we were sorting bank transactions, each
            consisting of an account id, transaction code, and amount - e.g.

               5437 D 100.00
               1234 D  50.00
               5437 W  50.00
               1234 W  20.00

            (where the sort key is just the account number.)

            i. A stable sorting algorithm would be guaranteed to produce:

                  1234 D  50.00
                  1234 W  20.00
                  5437 D 100.00
                  5437 W  50.00

            ii. While an unstable one might produce the above, or any of the
                following instead:

                  1234 W  20.00
                  1234 D  50.00
                  5437 D 100.00
                  5437 W  50.00
               or
                  1234 D  50.00
                  1234 W  20.00
                  5437 W  50.00
                  5437 D 100.00
               or
                  1234 W  20.00
                  1234 D  50.00
                  5437 W  50.00
                  5437 D 100.00

            Here, the stable sort might be necessary to ensure correctness if
            one of the withdrawal transactions, in fact, represents a
            withdrawal against the funds deposited earlier - i.e. there was
            not enough money in the account to cover the withdrawal before
            the deposit was made.

         d. Stability is never an issue if the sort keys are guaranteed to be
            unique - i.e. no two items can have the same value of the key.

   D. A classic work on sorting is Donald Knuth: The Art of Computer
      Programming volume 3: Sorting and Searching.

II. Approaches to Internal Sorting
--  ---------- -- -------- -------

   We will begin by considering sorting algorithms that are primarily used
   for internal sorts.  There are a number of basic approaches to sorting,
   including the following (classification from Knuth volume 3).

   (For consistency, I will illustrate each with sample code that sorts an
   array of strings - but the algorithm is the same regardless of what one is
   sorting)

   DISTRIBUTE HANDOUT

   A. Sorting by insertion:

         for (int i = 1; i < n; i ++)
            insert the ith entry from the original array into a sorted
               subtable composed of entries 0..i-1

      1. Demonstrate with class

      2. Many texts have an algorithm for a straight insertion sort.

         a. Example Code: PROJECT/HANDOUT (a sketch also appears below)

         b. Analysis?

            ASK

            theta(n^2)

         c. What will happen if used on already sorted data?

            ASK

            Time becomes theta(n), because the inner loop terminates
            immediately on each time through the outer loop.  This is a
            peculiar characteristic of this algorithm which makes it
            advantageous to use in cases where there is a significant
            probability that the data will already be in order.
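            Since the handout itself is not reproduced in these notes, here
            is one possible sketch of the straight insertion sort just
            described, sorting an array of strings; the function name and
            signature are my own:

               #include <string>

               // Straight insertion sort: sorts x[0..n-1] into ascending
               // order.  Stable, theta(n^2) in general, theta(n) on already
               // sorted data.
               void insertionSort(std::string x[], int n) {
                   for (int i = 1; i < n; i++) {
                       std::string toInsert = x[i];   // next item to insert
                       int j = i - 1;
                       // shift larger items right to open a hole for toInsert
                       while (j >= 0 && toInsert < x[j]) {
                           x[j + 1] = x[j];
                           j--;
                       }
                       x[j + 1] = toInsert;           // drop it into the hole
                   }
               }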
      3. The Shell sort is an insertion sort with behavior approximately
         O(n^1.26) - closed form analysis being very difficult.  (We won't
         discuss)

      4. Another variant of insertion sort is address calculation sort.

         a. This builds on the idea that if we are manually sorting a pile of
            papers, and we see a paper with a last name beginning with 'B',
            we automatically start looking for its place near the beginning
            of the pile; if it begins with 'T', we look near the end; and if
            it begins with 'M' we look near the middle.

         b. One approach is to conduct insertion sort with several lists,
            instead of one, each corresponding to a certain range of key
            values (e.g. A-C, D-F ...).  An item is inserted using the
            methods of insertion sort into the appropriate list, and then
            they are all combined at the end.

         c. We won't develop this further

      5. Simple insertion sort and address calculation sort are stable, but
         Shell sort is not.

   B. Sorting by exchanging: scan the table repeatedly (by some scheme),
      looking for items that are not in the correct sequence vis-a-vis each
      other, and exchange them.

      1. Almost every intro computer science text discusses the bubble sort,
         which is an exchange sort.

         a. Example Code: PROJECT / HANDOUT

         b. Analysis?

            ASK

            theta(n^2) - but with a larger constant of proportionality than
            insertion sort, because multiple exchanges can be done on each
            pass through the outer loop, while insertion sort does simple
            data movements rather than exchanges.

         c. What will happen if used on already sorted data?

            ASK

            In this case, there is no asymptotic gain (though no exchanges
            are done, so the overall time is better by a constant factor.)
            There are improvements to the algorithm that terminate early if
            no exchanges are done on some pass, yielding potentially theta(n)
            behavior on sorted data.

         d. The chief reason for this sort being so widely known is that the
            code is so simple.

      2. Quicksort

         a. The basic idea is this:

            i. Choose an arbitrary element of the list as the pivot element.

            ii. Rearrange the list as follows:

                   keys <= pivot    pivot    keys >= pivot

                (Note that a key that is equal to the pivot can end up in
                either half)

            iii. Sort the two partitions recursively

         b. CODE - PROJECT Goodrich/Tamassia Code Fragment 11.5
            (a sketch also appears below)

            i. This version makes the arbitrary choice of using the last
               element as the pivot.

            ii. Note that we consider this sort to be an exchange sort
                because of the method used to do the partitioning.

         c. Analysis: We consider average case and worst case separately:

            i. Average case - we expect each partition to divide the list
               roughly in half.  We can thus picture the partitioning process
               by a tree like the following:

                                   n items
                             n/2            n/2
                          n/4    n/4     n/4    n/4
                          ..........................
                          1   1   1   ...   1   1   1

               - At each "level", we must examine all n items to create the
                 next level of partitions.  There are log n levels -
                 therefore QuickSort is O(n log n), average case.

            ii. In the worst case, QuickSort is not so good, however.
                Consider the behavior for a list that is exactly backward.

                - The first partition produces sublists of 0 and n-1 items.
                - The second produces sublists of 0 and n-2 ...
                - Therefore, there are n levels of partitioning, each
                  examining theta(n) items - therefore the worst case for
                  QuickSort is theta(n^2).

                What about the case where the list is already sorted to begin
                with?  Paradoxically, this too turns out to be theta(n^2).
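            Here is one possible sketch of the partitioning approach just
            described.  It is in the spirit of, but not identical to, the
            Goodrich/Tamassia fragment; the function name is my own:

               #include <string>
               #include <utility>   // std::swap

               // In-place quicksort sketch: sorts x[lo..hi] inclusive,
               // using the last element as the pivot.
               void quickSort(std::string x[], int lo, int hi) {
                   if (lo >= hi) return;             // 0 or 1 items: done
                   std::string pivot = x[hi];
                   int l = lo, r = hi - 1;
                   while (l <= r) {
                       // scan right past items <= pivot, left past items
                       // >= pivot, using only operator <
                       while (l <= r && !(pivot < x[l])) l++;
                       while (l <= r && !(x[r] < pivot)) r--;
                       if (l < r) std::swap(x[l], x[r]);
                   }
                   std::swap(x[l], x[hi]);           // put pivot in its place
                   quickSort(x, lo, l - 1);          // sort the two partitions
                   quickSort(x, l + 1, hi);
               }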
         d. We can reduce the likelihood of worst case behavior by improving
            the way we select the pivot element.

            i. Ideally, the key we use as the pivot should be the median of
               the items in the list.  In practice, this involves either
               sorting the list, or using a rather complex theta(n) algorithm
               which we won't discuss.

            ii. One simple improvement is as follows: instead of choosing the
                first item in the list as the basis for partitioning, choose
                the median of the (physically) first, (physically) middle,
                and (physically) last.  (Worst case behavior can still occur,
                but not with the case of a backward or an already ordered
                list.)

            iii. If our major concern is with avoiding the worst-case
                 behavior that comes when the data is already sorted or
                 reverse-sorted, we can also select a pivot randomly from
                 among all the items being worked on - which may be somewhat
                 simpler to implement.

         e. In practice, QuickSort is often improved by switching to another
            method (e.g. insertion sort) when the size of the sublist to be
            sorted falls below some threshold.  That is, the recursive calls
            might be coded as follows:

            Present code:

               if (size <= 1)
                  ;  // Do nothing
               else
               {  ... Quick sort code

            Modified code:

               if (size <= 1)
                  ;  // Do nothing
               else if (size < threshold)
               {  ... Insertion sort code
               }
               else
               {  ... Quick sort code

         f. One other point to note about quicksort is that, due to the
            recursion, it does require additional memory for the stack.

            i. The amount of additional memory needed will vary from O(log n)
               [if each partitioning roughly divides the list in two] to O(n)
               [in the pathological cases where each partitioning produces
               one sublist that is smaller by just 1 item than the list that
               was partitioned.]

            ii. The stack growth can be kept to O(log n) in all cases as
                follows: always sort the smaller of the two sublists first,
                and use tail recursion optimization on the second call in
                each case.

      3. The bubble sort is stable, but quicksort is not.

   C. Sorting by selection:

         for (i = 0; i < n; i ++)
            select the smallest (largest) item from those still under
               consideration, put it in the right place, and remove it from
               consideration on further passes

      1. Demonstrate with class

      2. Many texts give an algorithm for a straight selection sort.

         a. Example Code: PROJECT / HANDOUT

         b. Analysis?

            ASK

            theta(n^2).  Constant of proportionality tends to be better than
            insertion sort, because there is only one data movement done per
            pass through the outer loop.

         c. What will happen if used on already sorted data?

            ASK

            - Nothing is gained or lost

      3. Heapsort is a selection sort method

         a. The text discussed heapsort in conjunction with its discussion of
            heaps, though I postponed the reading of this material until now

            i. We have already seen that it is possible to convert an array
               to a heap en masse in theta(n) time.  Suppose we were to build
               a maxheap (largest item is on top of the heap.)  Clearly that
               item belongs at the _end_ of a sorted version of the original
               array.

            ii. We have also seen that it is possible to remove the top item
                from a heap and replace it by its appropriate successor in
                theta(log n) time.

         b. This leads to the following approach to sorting:

               Convert the array into a maxheap
               for (i = 0; i < n; i ++)
                  remove the top item from the heap and put it i slots from
                     the end of the sorted array; then readjust the heap

         c. Example code: PROJECT / HANDOUT (a sketch also appears below)

            Demonstrate phase 2 (after heap built) using student names

         d. Analysis

            ASK

            Since the first step takes theta(n) time and the loop does a
            theta(log n) operation n times, the total time will be
            theta(n) + n * theta(log n) = theta(n log n)

      4. Neither simple selection sort nor heapsort is stable, though simple
         selection can be made stable at the cost of both extra time and
         space.
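      Here is one possible sketch of the heapsort approach in C.3.b - not the
      handout code, and with function names of my own choosing:

         #include <string>
         #include <utility>   // std::swap

         // Restore the maxheap property for the subtree rooted at i within
         // x[0..size-1]; children of node i are at 2i+1 and 2i+2.
         void siftDown(std::string x[], int size, int i) {
             while (true) {
                 int largest = i;
                 int left = 2 * i + 1, right = 2 * i + 2;
                 if (left  < size && x[largest] < x[left])  largest = left;
                 if (right < size && x[largest] < x[right]) largest = right;
                 if (largest == i) return;      // heap property holds here
                 std::swap(x[i], x[largest]);
                 i = largest;                   // continue sifting down
             }
         }

         void heapSort(std::string x[], int n) {
             // Phase 1: convert the array into a maxheap en masse - theta(n)
             for (int i = n / 2 - 1; i >= 0; i--)
                 siftDown(x, n, i);
             // Phase 2: repeatedly move the top (largest remaining) item to
             // the end of the unsorted region and readjust the heap -
             // n * theta(log n)
             for (int last = n - 1; last > 0; last--) {
                 std::swap(x[0], x[last]);
                 siftDown(x, last, 0);
             }
         }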
   D. Sorting by merging

      1. Suppose we have two sorted lists.  It is easy to merge them into a
         single sorted list in theta(n) time

            for (i = 0; i < n; i ++)
               choose the smaller item from the fronts of the two lists, and
                  add it to the sorted list.  (If one list is empty, always
                  take from the other list)

         a. Demonstration: merge two sorted lists of student names

         b. This leads to a recursive sorting strategy:

            - Split the data in half
            - Sort each half recursively
            - Merge the two sorted halves

         c. Example code: PROJECT / HANDOUT (a sketch also appears below)

         d. Analysis:

            ASK

            Guaranteed theta(n log n) - by similar reasoning used to show
            that quick sort is theta(n log n) - but this time, we can
            guarantee perfect partitioning, so this asymptotic bound holds
            for all cases

         e. Moreover, if we break ties by always choosing from the list that
            came from nearer the start of the original list, we can guarantee
            that merge sort is stable.

         f. Unfortunately, we require theta(n) extra space to store the
            merged list - or we can use linked lists, which require theta(n)
            extra space for the links!
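         Here is one possible sketch of this recursive strategy (not the
         handout code).  It uses a temporary vector for the merge step, which
         is the theta(n) extra space noted in f:

            #include <string>
            #include <vector>

            // Recursive merge sort: sorts x[lo..hi] inclusive.
            void mergeSort(std::string x[], int lo, int hi) {
                if (lo >= hi) return;                  // 0 or 1 items: done
                int mid = (lo + hi) / 2;
                mergeSort(x, lo, mid);                 // sort each half ...
                mergeSort(x, mid + 1, hi);
                std::vector<std::string> merged;       // ... then merge them
                merged.reserve(hi - lo + 1);
                int i = lo, j = mid + 1;
                while (i <= mid && j <= hi) {
                    // breaking ties in favor of the left (earlier) half
                    // keeps the sort stable
                    if (!(x[j] < x[i])) merged.push_back(x[i++]);
                    else                merged.push_back(x[j++]);
                }
                while (i <= mid) merged.push_back(x[i++]);
                while (j <= hi)  merged.push_back(x[j++]);
                for (int k = 0; k <= hi - lo; k++) x[lo + k] = merged[k];
            }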
      2. We will see shortly that merge sorting is the basis for all external
         sorting strategies - though sometimes we sacrifice stability for
         extra speed.

   E. Sorting by distribution:

      1. This works with a key of m "digits", using d "pockets" where d is
         the number of possible values a key digit may assume (e.g. 10 for a
         decimal key; 26 for an alphabetic key etc.)

            for (i = 0; i < m; i ++)
               distribute the file into d pockets based on the ith key digit
                  from the right
               reconstruct the file by appending the pockets to one another

         Example: Assume we are sorting strings of three letters, drawn from
         the alphabet ABCDE [so we need 5 pockets]

            Initial data:  CBD ADE CAD ADA BAD ACE BEE BED

            First distribution - on rightmost character:

                  A        B        C        D       E

                 ADA    (empty)  (empty)    CBD     ADE
                                            CAD     ACE
                                            BAD     BEE
                                            BED

            Pickup left-to-right:  ADA CBD CAD BAD BED ADE ACE BEE

            Second distribution:

                  A        B        C        D       E

                 CAD      CBD      ACE      ADA     BED
                 BAD                        ADE     BEE

            Pick up:  CAD BAD CBD ACE ADA ADE BED BEE

            Third distribution:

                  A        B        C        D       E

                 ACE      BAD      CAD
                 ADA      BED      CBD
                 ADE      BEE

            Final pickup:  ACE ADA ADE BAD BED BEE CAD CBD

      2. Time complexity appears to be order (n*m) - but note that for n
         distinct keys we have a minimum value for m of log_d n - therefore,
         it is in fact theta(n log n), since log_2 n and log_d n differ only
         by a constant ratio.

      3. Unfortunately, distribution sorting requires extra space; though the
         extra space requirements can be kept down by careful coding.

         a. If the "pockets" were represented by arrays, then we would need
            one array for each possible value of a digit - e.g. 26 pockets if
            sorting based on letters of the alphabet.  Further, each pocket
            would need to be big enough to possibly hold all the data if, in
            fact, all the keys had the same value in one position.  Thus, we
            would need O(n) extra space, where the constant of
            proportionality would be huge.

         b. The extra space can be greatly reduced - though it is still O(n)
            - by representing the "pockets" by linked lists, using a table of
            links as in the previous example.  This is really the only
            practical way to go.

      4. Distribution sorting is always stable; in fact, it relies on the
         stability of later passes to preserve the work done on earlier ones.

      5. (No demo code for this one - but the book discusses briefly under
         the name "bucket sort")

   F. Sorting by enumeration: For each record, determine how many records
      have keys less than its key.  We will call this value for the ith
      record count[i].  Clearly, the record currently in position i actually
      belongs in position count[i] + 1, so as a final step we put it there.

      1. Observe that this strategy is theta(n^2), and is stable [because if
         two records have equal keys, we increase the count of the one
         occurring physically later.]

      2. An interesting variant is possible if the set of possible keys is
         small (i.e. many items have the same key.)

         a. Example: Sort the students by academic class - using two arrays:
            count[i] and position[i] (1 <= i <= 4)

            i. We make one pass through all the students to calculate
               count[].

            ii. position[1] is set to 0

            iii. position[i] (2 <= i <= 4) is set to
                 position[i-1] + count[i-1]

            iv. Make a second pass through all the students and place each
                according to the current value of position[] for his/her
                class, then increment position.

         b. Analysis:

            ASK

            O(n) - but special case!

      3. (No demo code for this one - though a sketch of the variant in 2
         follows below)
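      Here is one possible sketch of the variant in F.2 (no such code is on
      the handout; the Student type and field names here are only
      illustrative):

         #include <string>
         #include <vector>

         struct Student {
             std::string name;
             int classYear;          // 1..4 - the (small-range) sort key
         };

         // Sort students by class year in theta(n) time, stably, using
         // theta(n) extra space for the output array.
         void sortByClassYear(std::vector<Student>& students) {
             int count[5] = {0}, position[5] = {0};
             // Pass 1: count how many students are in each class
             for (const Student& s : students)
                 count[s.classYear]++;
             // Starting slot for each class = total size of earlier classes
             position[1] = 0;
             for (int i = 2; i <= 4; i++)
                 position[i] = position[i - 1] + count[i - 1];
             // Pass 2: place each student at the next open slot for its
             // class, then advance that slot
             std::vector<Student> sorted(students.size());
             for (const Student& s : students)
                 sorted[position[s.classYear]++] = s;
             students = sorted;
         }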
   G. Summary

      We can compare the internal sorting strategies we have looked at thus
      far by considering several attributes:

      1. Asymptotic complexity
      2. Behavior with already sorted data.
      3. Need for additional storage.
      4. Stability

      Algorithm      Asymptotic        Impact of         Extra            Stable?
                     Complexity        Sorted Data       Storage
      ---------      ----------        -----------       -------          -------
      Simple         theta(n^2)        becomes           minimal          yes
      Insertion                        theta(n)

      Bubble         theta(n^2)        can become        minimal          yes
                                       theta(n) w/
                                       suitable coding

      Quicksort      theta(n log n)    can degenerate    theta(log n)     no
                     average - can     to theta(n^2)     stack for
                     degenerate to     unless avoided    recursion
                     theta(n^2)        by coding

      Simple         theta(n^2)        little change     minimal          no
      Selection

      Heapsort       theta(n log n)    little change     minimal          no
                     always

      Merge Sort     theta(n log n)    little change     theta(n) for     yes
                                                         extra array or
                                                         at least links

      Distribution   If keys are       little change     theta(n)         yes
      Sort           unique, ends up
                     theta(n log n)

      Enumeration    theta(n^2) --     little change     theta(n) for     yes
      Sort           can be theta(n)                     counts
                     for special case
                     where potential
                     key values form
                     a small set

      The result of this analysis shows that there is no one algorithm that's
      best on all counts.  In particular, there is no known sorting algorithm
      that has all of the following characteristics: theta(n log n)
      asymptotic complexity, theta(1) extra space, and stability.  We can get
      any two of the three, but not all three!

III. How Fast Can We Sort?
---  --- ---- --- -- ----

   A. One observation one can make from the table we just considered is that
      the best general-purpose sorting algorithms have asymptotic complexity
      theta(n log n).  Is this as good as we can do, or is it possible to
      find an algorithm whose average case asymptotic complexity is less than
      n log n?

   B. In the case of sorts based on binary comparisons, the answer to our
      question is no.  We will now prove the following theorem:

      Theorem: Any sort BASED ON BINARY COMPARISONS must have complexity
      omega(n log n) - i.e. it must do at least on the order of n log n
      comparisons.

   C. Proof:

      1. Any sorting algorithm for sorting n items must be prepared to deal
         with all n! possible permutations of those items, and must deal with
         each permutation differently.

      2. Each comparison in the sort serves to partition the permutations
         into two classes - those passing the test, and those failing the
         test.

         Example: bubble sort of three items must deal with 6 permutations:

            ABC ACB BAC BCA CAB CBA

         The first comparison checks to see if item[0] is <= item[1].  Three
         permutations pass this test: ABC, ACB, and BCA.  The other three
         (BAC, CAB, CBA) fail the test, necessitating an exchange.

      3. Each subsequent comparison partitions each of these classes further.

         Example: the second comparison checks to see if item[1] <= item[2].
         Of the three permutations passing the first test, one passes the
         second (ABC) and the other two do not.  Of the three permutations
         failing the first test - and after the exchange - only one passes
         the second test (BAC altered to ABC).

      4. After c comparisons, then, we have 2^c classes - some of which may
         be empty.

      5. At the completion of the sort, we must have at least n! classes -
         since each original permutation must be handled differently.

         Example: complete classification tree for bubble sort of 3 items:

            item[0] <= item[1] ?
            |
            +- no:  (BAC, CAB, CBA) - become (ABC, ACB, BCA) after exchange
            |       item[1] <= item[2] ?
            |       +- no:  (ACB, BCA) - become (ABC, BAC) after exchange
            |       |       item[0] <= item[1] ?
            |       |       +- no:  (BAC) - becomes (ABC) after exchange
            |       |       +- yes: (ABC)
            |       +- yes: (ABC)
            |               item[0] <= item[1] ?
            |               +- no:  (empty)
            |               +- yes: (ABC)
            |
            +- yes: (ABC, ACB, BCA)
                    item[1] <= item[2] ?
                    +- no:  (ACB, BCA) - become (ABC, BAC) after exchange
                    |       item[0] <= item[1] ?
                    |       +- no:  (BAC) - becomes (ABC) after exchange
                    |       +- yes: (ABC)
                    +- yes: (ABC)
                            item[0] <= item[1] ?
                            +- no:  (empty)
                            +- yes: (ABC)

         PROJECT

         After 3 comparisons, we have eight classes - 6 of which contain one
         item (corresponding to each of the 3! original permutations) and 2
         of which are empty.

      6. Thus, we must have 2^c >= n!, or c >= log(n!)

      7. However, by Stirling's approximation,

            n! ~ sqrt(2 pi n) * (n/e)^n

         so

            log(n!) ~ 0.5 * (1 + log(pi) + log(n)) + n * (log(n) - log(e))
                    = n log n + O(n) + O(log n) + O(1)

         which is theta(n log n); so c >= log(n!) means c is omega(n log n) -
         QED

      8. Note: our text argues that log(n!) is omega(n log n) in a different
         way - same conclusion, just a different way of getting there.

IV. External Sorting
--  -------- -------

   A. We have seen that the algorithms we use for searching tables stored on
      disk are quite different from those used for searching tables stored in
      main memory, because the disk access time dominates the processing
      time.

   B. For much the same reason, we use different algorithms for sorting
      information stored on disk than for sorting information in main memory.

      1. We call an algorithm that sorts data contained in main memory an
         INTERNAL SORTING algorithm, while one that sorts data on disk is
         called an EXTERNAL SORTING algorithm.

      2. In the simplest case - if all the data fits in main memory - we can
         simply read the data from disk into main memory, sort it using an
         internal sort, and then write it back out.

      3. The more interesting case - and the one we consider here - arises
         when the file to be sorted does not all fit in main memory.

      4. Historically, external sorting algorithms were developed in the
         context of systems that used magnetic tapes for file storage, and
         the literature still uses the term "tape", even though files are
         most often kept on some form of disk.  It turns out, though, that
         the storage medium being used doesn't really matter, because the
         algorithms we will consider all read/write data sequentially.

   C. Most external sorting algorithms are variants of a basic algorithm
      known as EXTERNAL MERGE sort.  Note that there is also an internal
      version of merge sort that we have considered.  External merging reads
      data one record at a time from each of two or more files, and writes
      records to one or more output files.  As was the case with internal
      merging, external merging is theta(n log n) for time, but theta(n) for
      extra space, and (if done carefully) it is stable.

   D. First, though, we need to review some definitions:

      1. A RUN is a sequence of records that are in the correct relative
         order.

      2. A STEPDOWN normally occurs at the boundary between runs.  Instead of
         the key value increasing from one record to the next, it decreases.

         Example: In the following file:

            B D E C F A G H

         - we have three runs (B D E, C F, A G H)
         - we have two stepdowns (E C, F A)

      3. Observe that an unsorted file can have up to n runs, and up to n-1
         stepdowns.  In general (unless the file is exactly backwards) there
         will be a lesser number than this of runs and stepdowns, due to
         pre-existing order in the file.

      4. Observe that a sorted file consists of one run, and has no
         stepdowns.
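      As a small illustration of these definitions (not something from the
      notes), the following sketch counts the stepdowns in a sequence of
      keys:

         #include <cstddef>
         #include <string>
         #include <vector>

         // Count the stepdowns in a sequence of keys: a stepdown occurs
         // wherever a key is smaller than the key just before it.
         int countStepdowns(const std::vector<std::string>& keys) {
             int stepdowns = 0;
             for (std::size_t i = 1; i < keys.size(); i++)
                 if (keys[i] < keys[i - 1])
                     stepdowns++;
             // For a nonempty file, the number of runs is stepdowns + 1.
             return stepdowns;
         }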
   E. We begin with a variant of external merge sort that one would not use
      directly, but which serves as the foundation on which all the other
      variants build.

      1. In the simplest merge sort algorithm, we start out by regarding the
         file as composed of n runs, each of length 1.  (We ignore any runs
         which may already be present in the file.)  On each pass, we merge
         pairs of runs to produce runs of double length.

         a. After pass 1, we have n/2 runs of length 2.

         b. After pass 2, we have n/4 runs of length 4.

         c. The total number of passes will be ceil(log n).  [Where ceil is
            the ceiling function - the smallest integer greater than or equal
            to its argument.]  After the last pass, we have 1 run of length
            n, as desired.

         d. Of course, unless our original file length is a power of 2, there
            will be some irregularities in this pattern.  In particular, we
            let the last run in the file be smaller than all the rest -
            possibly even of length zero.

            Example: To sort a file of 6 records:

               Initially:     6 runs of length 1
               After pass 1:  3 runs of length 2 + 1 "dummy" run of length 0
               After pass 2:  1 run of length 4 + 1 run of length 2
               After pass 3:  1 run of length 6

      2. We will use a total of three scratch files to accomplish the sort.

         a. Initially, we distribute the input data over two files, so that
            half the runs go on each.  We do this alternately - i.e. first we
            write a run to one file, then to the other - in order to ensure
            stability.

         b. After the initial distribution, each pass entails merging runs
            from two of the scratch files and writing the generated runs on
            the third.  At the end of the pass, if we are not finished, we
            redistribute the runs from the third file alternately back to the
            first two.

            Example:

               original file:         B D E C F A G H

               initial distribution:  B E F G       (File SCRATCH1)
                                      D C A H       (File SCRATCH2)

               (remember we ignore runs existing in the raw data)
               --------------------------------------------------
               after first merge:     BD CE AF GH   (File SCRATCH3)   PASS 1
               redistribution:        BD AF         (File SCRATCH1)
                                      CE GH         (File SCRATCH2)
               --------------------------------------------------
               after second merge:    BCDE AFGH     (File SCRATCH3)   PASS 2
               redistribution:        BCDE          (File SCRATCH1)
                                      AFGH          (File SCRATCH2)
               --------------------------------------------------
               after third merge:     ABCDEFGH      (File SCRATCH3)   PASS 3
               (no redistribution)

      3. Analysis of the basic merge sort

         a. Space: three files, one of length n and two of length n/2.  We
            can use the output file as one of the scratch files, so the total
            additional space is two files of length n/2 - i.e. total scratch
            space for n records.

            In addition, we need internal memory for three buffers - one for
            each of the three files.  In general, each buffer needs to be big
            enough to hold an entire block of data (based on the blocksize of
            the device), rather than a single record.

         b. Time:

            - Initial distribution involves n reads
            - Each pass except the last involves 2n reads due to merging
              followed by redistribution.  The last pass involves just n
              reads.
            - Total reads = 2n ceil(log n), so total IO operations =
              4n ceil(log n)
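      To make the merging step concrete, here is one possible sketch of a
      single merge pass, with in-memory vectors standing in for the scratch
      files.  (A real implementation would read and write sequential disk
      files, and would need the block buffering just described; all names
      here are my own.)

         #include <algorithm>   // std::min
         #include <cstddef>     // std::size_t
         #include <string>
         #include <vector>

         // One merge pass: runs of length runLen from scratch1 and scratch2
         // are merged pairwise into runs of length 2*runLen on output.
         void mergePass(const std::vector<std::string>& scratch1,
                        const std::vector<std::string>& scratch2,
                        std::vector<std::string>& output,
                        std::size_t runLen) {
             std::size_t i = 0, j = 0;
             output.clear();
             while (i < scratch1.size() || j < scratch2.size()) {
                 // boundaries of the current pair of runs
                 std::size_t endI = std::min(i + runLen, scratch1.size());
                 std::size_t endJ = std::min(j + runLen, scratch2.size());
                 while (i < endI && j < endJ) {
                     // take from scratch1 on ties, for stability
                     if (!(scratch2[j] < scratch1[i]))
                         output.push_back(scratch1[i++]);
                     else
                         output.push_back(scratch2[j++]);
                 }
                 while (i < endI) output.push_back(scratch1[i++]);
                 while (j < endJ) output.push_back(scratch2[j++]);
             }
         }

      A driver would call this with runLen = 1, 2, 4, ..., redistributing the
      runs of the output alternately back onto the two scratch files between
      passes.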
   F. A significant improvement arises from the observation that our original
      algorithm started out assuming that the input file consists of n runs
      of length 1 - the worst possible case (a totally backward file.)  In
      general, the file will contain many runs longer than one just as a
      consequence of the randomness of the data, and we can use these to
      reduce the number of passes.

      1. Example: The sample file we have been using contains 3 runs, so we
         could do our initial distribution as follows:

            initial distribution:  BDE AGH     (File SCRATCH1)
                                   CF          (File SCRATCH2)

            after first pass:      BCDEF       (File SCRATCH3)
                                   AGH         (File SCRATCH4)

            after second pass:     ABCDEFGH    (File SCRATCH1)

         (Note: we have assumed the use of a balanced merge; but a
         non-balanced merge could also have been used.)

      2. This algorithm is called a NATURAL MERGE.  The term "natural"
         reflects the fact that it relies on runs naturally occurring in the
         data.

      3. However, this algorithm has a quirk we need to consider.

         a. Since we merge one run at a time, we need to know where one run
            ends and another run begins.  In the case of the previous
            algorithms, this was not a problem, since we knew the size of
            each run.  Here, though, the size will vary from run to run.

            In the code we just looked at, the solution to this problem
            involved recognizing that the boundary between runs is marked by
            a stepdown.  Thus, each time we read a new record from an input
            file, we will keep track of the last key processed from that
            file; and if our newly read key is smaller than that key, then we
            know that we have finished processing one run from that file.

            Example: in the initial distribution above, we placed two runs in
            the first scratch file.  The space between them would not be
            present in the file; what we would have is actually BDEAGH.  But
            the run boundary would be apparent because of the stepdown from E
            to A.

         b. However, if stability is important to us, we need to be very
            careful at this point.  In some cases, the stepdown between two
            runs could disappear, and an unstable sort could result.

            Consider the following file:

               F E D C B A M1 Z N M2

            (where records M1 and M2 have identical keys.)

               Initial distribution:   F | D | B | N
                                       E | C | A M1 Z | M2

                  But there is no stepdown at the boundary between B and N,
                  so those 2 runs look like one:

                                       F | D | B N
                                       E | C | A M1 Z | M2

               First pass:             E F | A B M1 N Z
                                       C D | M2

                  Again, there is no stepdown at the boundary between D and
                  M2, so two runs look like one:

                                       E F | A B M1 N Z
                                       C D M2

               Second pass:            C D E F M2
                                       A B M1 N Z

               Third pass:             A B C D E F M2 M1 N Z

                  M2 has ended up ahead of M1 - an unstable result.  In the
                  case of equal keys, we take the record from the first
                  scratch file before the record from the second, since the
                  first scratch file should contain records from earlier in
                  the original file.

         c. If stability is a concern, we can prevent this from occurring by
            writing a special run-separator record between runs in our
            scratch files.  This might, for example, be a record whose key is
            some impossibly big value like maxint or '~~~~~'.

            Of course, processing these records takes extra overhead that
            reduces the advantage gained by using the natural runs.

         d. Analysis:

            i. Space is the same as an ordinary merge if no run separator
               records are used.  However, in the worst case of a totally
               backward input file, we would need n run separator records on
               our initial distribution, thus potentially doubling the
               scratch space needed.

            ii. The time will be some fraction of the time needed by an
                ordinary merge, and will depend on the average length of the
                naturally occurring runs.

                - If the naturally occurring runs are of average length 2,
                  then we save 1 pass - in effect we start where we would be
                  on the second pass of ordinary merge.

                - In general, if the naturally occurring runs are of average
                  length m, we save at least floor(log m) passes.
                  Thus, if we use a balanced 2-way merge, our time will be

                       n (1 + ceil(log n - log m)) reads
                     = n (1 + ceil(log n/m)) reads

                  or 2n (1 + ceil(log n/m)) IO operations

                - Of course, if run separator records are used, then we
                  actually process more than n records on each pass.  This
                  costs additional time for

                     n/m  reads on the first pass
                     n/2m reads on the second pass
                     n/4m reads on the third pass
                     ...

                  = (2n/m - 1) additional reads, or about 4n/m extra IO
                  operations

                - Obviously, a lot depends on the average run length in the
                  original data (m).  It can be shown that, in totally random
                  data, the average run length is 2 - which translates into a
                  savings of 1 merge pass, or 2n IO operations.  However, if
                  we use separator records, we would need 2n extra IO
                  operations to process them - so we gain nothing!  (We could
                  still gain a little bit by omitting separator records if
                  stability were not an issue, though.)

                - In many cases, though, the raw data does contain
                  considerable natural order, beyond what is expected
                  randomly.  In this case, natural merging can help us a lot.

   G. Another improvement builds on the idea of the natural merge by using an
      internal sort during the distribution phase to CREATE runs of some
      size.

      1. The initial distribution pass now looks like this - assuming we have
         room to sort s records at a time internally:

            while not eof(infile) do
               read up to s records into main memory
               sort them
               write them to one of the scratch files

      2. Clearly, the effect of this is to reduce the merge time by a factor
         of (log (n/s)) / (log n) - i.e. the number of merge passes drops
         from about log n to about log (n/s).  For example, if s = sqrt(n),
         we reduce the merge time by a factor of 2.  The overall time is not
         reduced as much, of course, because

         a. The distribution pass still involves the same number of reads.

         b. We must now add time for the internal sorting!

         c. Nonetheless, the IO time saved makes internal run generation
            almost always worthwhile.

            Example: suppose we need to sort 65536 records, and have room to
            internally sort 1024 at a time.

            - The time for a simple merge sort is

                 65536 * (1 + log 65536) reads + the same number of writes
               = 65536 * 17 * 2
               = 2,228,224 IO operations

            - The time with internal run generation is

                 65536 * (1 + log (65536/1024)) reads
                    + the same number of writes + internal sort time
               = 65536 * 7 * 2
               = 917,504 IO operations + 64 1024-record sorts

      3. This process is stable iff the internal sort used is stable.  If
         stability is not a concern, it is common to use an internal sort
         like quicksort.  (Note that a stable internal sort is either O(n^2),
         or it requires O(n) extra space, which cuts down on the size of the
         initial runs that can be created by internal sorting!)

V. Sorting with multiple keys
-  ------- ---- -------- ----

   A. Thus far, we have assumed that each record in the file to be sorted
      contains one key field.  What if the record contains multiple keys -
      e.g. a last name, first name, and middle initial?

      1. We wish the records to be ordered first by the primary key (last
         name).

      2. In the case of duplicate primary keys, we wish ordering on the
         secondary key (first name).

      3. In the case of ties on both keys, we wish ordering on the tertiary
         key (middle initial).

      etc - to any number of keys.

   B. The approach we will discuss here applies to BOTH INTERNAL AND EXTERNAL
      SORTS.

   C. There are two techniques that can be used for cases like this:

      1. We can modify an existing algorithm to consider multiple keys when
         it does comparisons - e.g.

         a. Original algorithm says:

               if (item[i].key < item[j].key)

         b. Revised algorithm says:

               if ((item[i].primary_key < item[j].primary_key) ||
                   ((item[i].primary_key == item[j].primary_key) &&
                    (item[i].secondary_key < item[j].secondary_key)) ||
                   ((item[i].primary_key == item[j].primary_key) &&
                    (item[i].secondary_key == item[j].secondary_key) &&
                    (item[i].tertiary_key < item[j].tertiary_key)))

      2. We can sort the same file several times, USING A STABLE SORT.

         a. First sort is on the least significant key.

         b. Second sort is on the second least significant key.

         c. Etc.

         d. Final sort is on the primary key.
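         For instance, using the C++ standard library (whose std::stable_sort
         is guaranteed to be stable), technique 2 might look like the
         following sketch - the record type and field names are only
         illustrative:

            #include <algorithm>   // std::stable_sort
            #include <string>
            #include <vector>

            struct Name {
                std::string last, first, middleInitial;
            };

            void sortByAllKeys(std::vector<Name>& names) {
                // Sort on the least significant key first ...
                std::stable_sort(names.begin(), names.end(),
                    [](const Name& a, const Name& b)
                        { return a.middleInitial < b.middleInitial; });
                // ... then the next key ...
                std::stable_sort(names.begin(), names.end(),
                    [](const Name& a, const Name& b)
                        { return a.first < b.first; });
                // ... and finally the primary key.  Because each pass is
                // stable, the earlier orderings survive as tie-breakers.
                std::stable_sort(names.begin(), names.end(),
                    [](const Name& a, const Name& b)
                        { return a.last < b.last; });
            }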
      3. The first approach is usable when we are embedding a sort in a
         specific application package; the second is more viable when we are
         building a utility sorting routine for general use [but note that we
         are now forced to a stable algorithm.]

VI. Pointer Based Sorting
--  ------- ----- -------

   A. When the items being sorted are large records (perhaps hundreds of
      bytes each), it may be desirable to use a pointer-based approach to
      reduce the time spent moving data.  The following are some variants on
      this theme.

   B. ADDRESS TABLE SORTING: we use an array of pointers P[1]..P[N].  Instead
      of physically rearranging the records (which is costly in terms of data
      movement time), we leave the records in their original place and sort
      the array of pointers so that:

         P[i]->key <= P[j]->key  for all i <= j.

   C. KEY SORTING: if the key is short relative to the whole record, then we
      sort an array consisting of keys plus pointers to the rest of the
      record, so that we only move keys and pointers, not whole records.  At
      the very end of the sort, we may physically rearrange the records
      themselves.

   D. LIST SORTING: we keep the records on a linked list, and rearrange links
      rather than moving records.  (We will use this in several of the
      algorithms below.)  Again, at the very end of the sort, we may
      physically rearrange the records themselves.
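   As a closing illustration of the address table idea in B, here is one
   possible sketch that sorts an array of indices rather than the records
   themselves (pointers would work equally well); the Record type is only
   illustrative:

      #include <algorithm>   // std::sort
      #include <numeric>     // std::iota
      #include <string>
      #include <vector>

      struct Record {
          std::string key;
          char otherData[500];        // large payload we'd rather not move
      };

      // Returns an index table such that records[index[0]],
      // records[index[1]], ... are in key order; no record is moved.
      std::vector<int> addressTableSort(const std::vector<Record>& records) {
          std::vector<int> index(records.size());
          std::iota(index.begin(), index.end(), 0);   // 0, 1, 2, ...
          std::sort(index.begin(), index.end(),
              [&records](int i, int j)
                  { return records[i].key < records[j].key; });
          return index;
      }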