Note: This document outlines my personal understanding of and approach to the Quicksort algorithm. It differs from the original text-book algorithm and example at some point.

What is Quicksort?

Quicksort is a sorting algorithm that uses the divide and conquer algorithmic paradigm. It takes an array and breaks it down into two parts using a pre-defined partition scheme. This scheme ensures that a reference element is selected and the array is partitioned accordingly. This process continues until there are no more regions to divide.

Lomuto Partition Scheme

Lomuto designates the latest element as the pivot (reference element) and positions two pointers at the beginning.

The first pointer moves forward and compares its value with the pivot. If the value is less than the pivot, the first pointer swaps values with the second pointer, and the second pointer moves to the next element after the swap.

When the first pointer points to the pivot (index), the first and second pointers swap values. Finally it returns the second pointer’s index for the next partition.

After each Lomuto partition, the elements on the left side of the second pointer (returned pointer index) are less than the pivot, and the rest are equal to or greater than the pivot.

Hoare Partition Scheme

With Hoare, designating an element as the pivot is more flexible. Usually, the first element is the pivot.

There are two pointers: one positioned at the start of the array and the other positioned at the end. These two pointers move toward each other. If the first pointer, positioned at the start, encounters an element equal to or greater than the pivot, it stops. And If the second pointer encounters an element that is equal to or less than the pivot, it also stops.

The values are swapped unless the two pointers cross each other, and after each swap, both of the pointers move toward to each other once more.

After the partition, one of the pointers must return as split index.

Implementation in Zig

We can benefit from generic functions if we want to work with multiple numerical types, such as integers and floating numbers and schemes. Zig mainly uses unsigned integers internally. For this reason, I’ve modified the schemes to work with unsigned integers. Zig 0.16.0-dev.1890+2bd02883c has been used to implement Quicksort.

sorting.zig

const std = @import("std");

fn schemeLomuto(comptime T: type, array: []T) usize {
    const pivot_index: usize = array.len - 1;
    var mark: usize = 0;
    var pick: usize = 0;
    while (pick < pivot_index) {
        if (array[pick] < array[pivot_index]) {
            std.mem.swap(T, &array[mark], &array[pick]);
            mark += 1;
        }
        pick += 1;
    }
    std.mem.swap(T, &array[mark], &array[pivot_index]);
    return mark;
}

fn schemeHoare(comptime T: type, comptime pivot_designation: PivotDesignationHoare, array: []T) usize {
    const pivot: T = switch (pivot_designation) {
        .first => array[0],
        .middle => array[(array.len - 1) / 2],
    };
    var low: usize = 0;
    var high: usize = array.len - 1;
    while (true) {
        while (low < array.len and array[low] < pivot) low += 1;
        while (high > 0 and array[high] > pivot) high -= 1;
        if (low >= high) break;
        std.mem.swap(T, &array[low], &array[high]);
        low += 1;
        high -= 1;
    }
    return high;
}

const PivotDesignationHoare = enum { first, middle };

pub const Scheme = union(enum) {
    lomuto,
    hoare: PivotDesignationHoare,
};

pub fn quicksort(comptime T: type, comptime scheme: Scheme, array: []T) void {
    if (array.len < 2) return;
    switch (@typeInfo(T)) {
        .int, .float => switch (scheme) {
            .lomuto => {
                const p = schemeLomuto(T, array);
                if (p > 0) quicksort(T, scheme, array[0..p]);
                quicksort(T, scheme, array[p + 1 ..]);
            },
            .hoare => {
                const p = schemeHoare(T, scheme.hoare, array);
                quicksort(T, scheme, array[0 .. p + 1]);
                quicksort(T, scheme, array[p + 1 ..]);
            },
        },
        else => @compileError("Cannot run on non-numerical array."),
    }
}

main.zig

const std = @import("std");
const sorting = @import("sorting.zig");

pub fn main() !void {
    var page = std.heap.page_allocator;

    const seed = 123456789; // Keep the seed same for all builds.
    var xos = std.Random.Xoshiro256.init(@intCast(seed));
    var rand = xos.random();

    const TYPE_TO_USE = u8;
    const array = try page.alloc(TYPE_TO_USE, 1_000_000);
    defer page.free(array);

    for (array) |*value| {
        switch (@typeInfo(TYPE_TO_USE)) {
            .int => {
                value.* = rand.int(TYPE_TO_USE);
            },
            .float => {
                value.* = rand.float(TYPE_TO_USE);
            },
            else => @compileError("Cannot run the test on non-numerical type."),
        }
    }

    // Pick one of the below for each build and comment out the rest.
    sorting.quicksort(TYPE_TO_USE, .lomuto, array);
    //sorting.quicksort(TYPE_TO_USE, .{ .hoare = .first }, array);
    //sorting.quicksort(TYPE_TO_USE, .{ .hoare = .middle }, array);
}

Benchmarking Attempts

Let’s build and run the binary with perf, the Linux kernel performance auditing tool, to see each scheme’s stats.

Set sort scheme to Lomuto:
1. $ zig build -Doptimize=ReleaseFast
2. $ perf stat -r 30 ./zig-out/bin/{binary_name}
Set sort scheme to Hoare.first:
1. $ zig build -Doptimize=ReleaseFast
2. $ perf stat -r 30 ./zig-out/bin/{binary_name}
Set sort scheme to Hoare.middle:
1. $ zig build -Doptimize=ReleaseFast
2. $ perf stat -r 30 ./zig-out/bin/{binary_name}

Note: This benchmark includes memory allocation and element randomization. This is about 1,000,000 u8 of elements. Therefore, it does not show statistics for the Quicksort implementation alone.

Performance counter stats for 30 runs of Quicksort with Lomuto Scheme that uses last element pivot designation:

  353,024,943      task-clock:u                     #    0.999 CPUs utilized               ( +-  0.08% )
            0      context-switches:u               #    0.000 /sec                      
            0      cpu-migrations:u                 #    0.000 /sec                      
          253      page-faults:u                    #  716.663 /sec                        ( +-  0.03% )
7,423,909,544      instructions:u                   #    4.53  insn per cycle            
                                             #    0.00  stalled cycles per insn     ( +-  0.00% )
1,637,252,480      cycles:u                         #    4.638 GHz                         ( +-  0.05% )
   22,560,428      stalled-cycles-frontend:u        #    1.38% frontend cycles idle        ( +-  0.11% )
2,461,942,601      branches:u                       #    6.974 G/sec                       ( +-  0.00% )
    5,413,542      branch-misses:u                  #    0.22% of all branches             ( +-  0.04% )

     0.353457 +- 0.000283 seconds time elapsed  ( +-  0.08% )

Performance counter stats for 30 runs of Quicksort with Hoare Scheme that uses first element pivot designation:

   29,572,237      task-clock:u                     #    0.990 CPUs utilized               ( +-  0.74% )
            0      context-switches:u               #    0.000 /sec                      
            0      cpu-migrations:u                 #    0.000 /sec                      
          252      page-faults:u                    #    8.522 K/sec                       ( +-  0.03% )
  238,310,634      instructions:u                   #    1.80  insn per cycle            
                                             #    0.06  stalled cycles per insn     ( +-  0.00% )
  132,178,999      cycles:u                         #    4.470 GHz                         ( +-  0.10% )
   14,853,774      stalled-cycles-frontend:u        #   11.24% frontend cycles idle        ( +-  0.45% )
   71,530,527      branches:u                       #    2.419 G/sec                       ( +-  0.00% )
    3,492,931      branch-misses:u                  #    4.88% of all branches             ( +-  0.03% )

     0.029871 +- 0.000222 seconds time elapsed  ( +-  0.74% )

Performance counter stats for 30 runs of Quicksort with Hoare Scheme that uses middle element pivot designation:

   30,549,272      task-clock:u                     #    0.991 CPUs utilized               ( +-  0.65% )
            0      context-switches:u               #    0.000 /sec                      
            0      cpu-migrations:u                 #    0.000 /sec                      
          252      page-faults:u                    #    8.249 K/sec                       ( +-  0.03% )
  208,795,131      instructions:u                   #    1.52  insn per cycle            
                                             #    0.08  stalled cycles per insn     ( +-  0.00% )
  137,141,927      cycles:u                         #    4.489 GHz                         ( +-  0.04% )
   16,043,666      stalled-cycles-frontend:u        #   11.70% frontend cycles idle        ( +-  0.29% )
   60,693,501      branches:u                       #    1.987 G/sec                       ( +-  0.00% )
    3,766,177      branch-misses:u                  #    6.21% of all branches             ( +-  0.03% )

     0.030838 +- 0.000202 seconds time elapsed  ( +-  0.65% )

Conclusion

Based on the perf stats, Hoare is faster than Lomuto when sorting 1,000,000 u8 elements in my implementation. Hoare executes dramatically fewer instructions and fewer branches overall. Although its branch-miss rate is higher, total work is much lower, so total cycles and elapsed time are lower.

This is consistent with Hoare performing fewer swaps and not comparing every element against the pivot in the same way Lomuto does. Comparing the two Hoare pivot choices (first vs. middle), execution time is similar in these runs, but IPC and branch behavior differ.