57005 or alive

Selecting a random element from a linked list - 3 approaches in F#

Nov 16, 2013 F#

You have a linked list (i.e. no random access) with an unknown number of elements and a standard random number generator. Write an algorithm that selects a random element from the list, going for best O(N) performance and memory usage.

Beginning with some setup code, let’s look at 3 different approaches. All code below can be run in the browser at TryFSharp.

open System
open System.Diagnostics

// simple timing function
let time f x y = Stopwatch.StartNew() |> (fun sw -> (f x y, sw.Elapsed))

// test data
let intData = List.init (int 2E7) id
let strData = List.init (int 2E7) string

Method 1

Perhaps the most obvious and easiest-to-understand approach is to find the length \(L\) of the list, generate a random number \(0 \le i \lt L\) , then take the element at index \(i\) .

let method1 lst (rng : Random)  =
    List.nth lst (rng.Next(List.length lst))

time method1 intData (Random(100)) |> snd |> printfn "ints %A"
time method1 strData (Random(100)) |> snd |> printfn "strs %A"
// ints 00:00:00.0460722
// strs 00:00:00.1079891

Method 2

We can improve the algorithmic efficiency and select an element in exactly 1 pass by  relying on a mathematical trick.  We want each element to have a selection probability of \(1/N\) , but how can we do that without first obtaining \(N\) ?  By noticing that

$$\frac{1}{N} = \left( \frac{1}{i} \right) \left( \frac{i}{i+1} \right) \left( \frac{i+1}{i+2} \right) \cdots \left( \frac{N-2}{N-1} \right) \left( \frac{N-1}{N} \right) = \frac{1}{i} \prod_{k=i+1}^N\frac{k-1}{k}$$

This means we can select a random element in 1 pass using the following simple procedure - pick the element at the current index \(i\) with probability \(1/i\) , otherwise use the previously picked element. At the end of the scan, whichever element has “survived” the picking process is selected. You can see that the probability of any particular element being picked matches the far RHS of the expression above: \(1/i\) chance that it was picked, times the probability that all remaining elements were not picked.

let method2 lst (rng : Random) =
    snd <| List.fold (fun (i, pick) elem ->
               if rng.Next(i) = 0 then (i + 1, elem)
               else (i + 1, pick)
           ) (0, List.head lst) lst

time method2 intData (Random(100)) |> snd |> printfn "ints %A"
time method2 strData (Random(100)) |> snd |> printfn "strs %A"
// ints 00:00:00.5678366
// strs 00:00:00.8276745

// some perf is lost due to need for
//    tuples to track indices and picks
// if we define an indexed fold, "foldi," we can do better

module List =
    let foldi f s lst = 
        match lst with 
        | [] -> s
        | _ -> 
            let rec loop i s xs = 
                match xs with 
                | [] -> s
                | h::t -> loop (i + 1) (f i s h) t
            loop 0 s lst

let method2a lst (rng : Random) =
    List.foldi (fun i pick elem ->
        if rng.Next(i) = 0 then elem
        else pick
    ) (List.head lst) lst

time method2a intData (Random(100)) |> snd |> printfn "ints %A"
time method2a strData (Random(100)) |> snd |> printfn "strs %A"
// ints 00:00:00.4098307
// strs 00:00:00.4885459

Method 3

We can select a random element from an arbitrary sized list (i.e. no need to ever store the list length) in just slightly more than 1 pass (~1.0000000005 passes) by using the following method - For each element in the list, generate a random number \(r\) . If \(r\) is greater than the greatest random number yet seen, abandon any current picks and pick the current element. If \(r\) ties the largest random number already seen, then add the current element to the picks. Otherwise, ignore the current element and move on. At the end of the scan, if only a single element is picked, select it. Otherwise recursively select a random element from the list of picks.

let method3 lst (rng : Random) =
    let rec step remaining picks top =
        match (remaining, picks) with
        | ([], []) -> failwith "Don't pass empty list"
        //  if only 1 element is picked, this is the result
        | ([], [p]) -> p
        // if multiple elements are picked, select randomly from them
        | ([], ps) -> step ps [] -1
        | (h :: t, ps) ->
            match rng.Next() with
            // if RNG makes new top number, picks list is reset
            | n when n > top -> step t [h] n
            // if RNG ties top number, add current element to picks list
            | n when n = top -> step t (h::ps) top
            // otherwise ignore and move to next element
            | _ -> step t ps top
    step lst [] -1

time method3 intData (Random(100)) |> snd |> printfn "ints %A"
time method3 strData (Random(100)) |> snd |> printfn "strs %A"
// ints 00:00:00.2223620
// strs 00:00:00.2549300

A note on benchmarks

So method 1 is the clear winner, right? Although it’s the worst approach in terms of algorithmic efficiency, this method can actually be quite fast in practice when all memory accesses are sequential and there are few cache misses. Changing the data type to something larger, or using a list that’s spread out more in memory, will have significant impact on the runtime. We see this a bit in the (contrived, limited) benchmark presented - method 1 sees a > 2x slowdown for ints vs strings. Methods 2 and 3 are affected much less by this. As always, you need to test…

Timings are from F# 3.1, .NET 4.5.1, Intel Core i7 2.7 GHz, Windows 8.1, release mode build, targeting x86 (results were significantly different when targeting x64, try it!).