Curious behavior when de-duplicating a collection in PowerShell
Aug 2, 2016
This is a bug/curiosity in PowerShell that I stumbled upon a few years ago, but never wrote up. The behavior hasn’t changed significantly in the intervening versions, so now I’m finally getting around to a quick blog post.
Here’s a chart detailing the runtime of 4 different PowerShell approaches to de-duplicate a collection - i.e. filter an input collection to just its unique elements. Code used for the benchmark can be found here.
Same chart, with log scale so it’s a little easier to distinguish the trend lines:
The solutions being compared are:
$data | select -unique
- Built-in, does nothing but de-duplicate the input collection.
$data | group |% Name
- Built-in Group-Object; piping to % Name emits one entry (each group’s name) per distinct value.
$data | sort -unique
- Built-in, sorts the input collection and additionally de-duplicates it.
$data | hashunique
- Custom cmdlet using simple hashset-based approach. 12 lines, available alongside benchmark code.
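For context, the hashset-based idea is tiny: emit each item the first time a hash set accepts it. The real hashunique is a compiled C# cmdlet that lives alongside the benchmark code; the function below is only a rough script-level sketch of the same approach (the name Select-HashUnique is illustrative, not part of the benchmark):

function Select-HashUnique {
    # Emit each pipeline object the first time it is seen; the hash set
    # makes the whole pass O(n) rather than O(n^2).
    begin   { $seen = [System.Collections.Generic.HashSet[object]]::new() }
    process { if ($seen.Add($_)) { $_ } }
}

# e.g.  3,1,3,2,1 | Select-HashUnique   =>   3 1 2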
I was surprised to see how poorly select -unique performs. It’s part of the core PowerShell library (presumably authored by experts), implemented as compiled C# code (not script), and its sole purpose is to solve this exact problem. So why is it the slowest option?!
Specific questions I’m still wondering about -
Why do “select” and “group” have quadratic runtime?
Filtering out duplicates from a collection, or grouping together equivalent items, can be done in \(O(n)\) time by utilizing a hash set or hash table, respectively. The trendline above hints, and a quick ILSpy session confirms, that select and group are instead relying on an \(O(n^2)\) algorithm that doesn’t use hashing at all.
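To make the \(O(n)\) claim concrete, here is a sketch of hash-based grouping in script form (Group-ByHash is a made-up name for illustration; it buckets items into a hashtable keyed by the item itself, in a single pass over the input):

function Group-ByHash {
    # Sketch only: each lookup into the hashtable is amortized constant time,
    # so grouping the whole input is O(n) instead of comparing every pair.
    begin   { $groups = @{} }
    process {
        if (-not $groups.ContainsKey($_)) {
            $groups[$_] = [System.Collections.Generic.List[object]]::new()
        }
        $groups[$_].Add($_)
    }
    end     { $groups.GetEnumerator() }
}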
My best guess is that somehow PowerShell’s object and equality system precludes efficient, reliable hashing for all scriptable objects, so they decided not to hash for any objects. Perhaps everything just falls back to pairwise equality comparisons.
But this doesn’t quite hold up. Here’s a simple example where two items are considered non-unique by select, sort, and group, yet aren’t considered equal:
PS> 'dude' > sweet.txt
PS> $f1 = dir .\sweet.txt
PS> $f2 = dir .\sweet.txt
PS> ($f1, $f2 | select -Unique).Length
1
PS> ($f1, $f2 | sort -Unique).Length
1
PS> ($f1, $f2 | group).Length
1
PS> $f1 -eq $f2
False
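One plausible reading (I haven’t verified this against the cmdlet source) is that these cmdlets fall back to comparing string representations rather than calling Equals/GetHashCode, and the two FileInfo objects above do stringify identically:

PS> $f1.ToString() -eq $f2.ToString()
True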
Practically speaking, when I’m at the terminal and need to de-duplicate a collection, 99% of the time it’s just going to be strings or numbers. It’s a shame this core use case is left completely unoptimized. Going forward I will just keep hashunique in my profile and use that.
Why is “group” faster than “select -unique”?
Setting aside asymptotic complexity, how can select -unique be slower by some constant factor than piping together group |% Name? The latter has to carry around a ton of extra data, and also incurs the cost of an additional pipeline command. This makes no sense to me.
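A quick, unscientific way to poke at that constant factor from the prompt (timings will vary by machine and input; the linked benchmark code does this properly):

PS> $data = 1..20000 |% { Get-Random }          # mostly-distinct integers
PS> (Measure-Command { $data | select -Unique }).TotalSeconds
PS> (Measure-Command { $data | group |% Name  }).TotalSeconds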