Slow performance of Set.intersect when comparing two sets of different sizes #19139

@akselkvitberg

Description

I have found that Set.intersect has greatly different performance depending on the order of the arguments passed.

Here is a reproduction sample:

open System
open BenchmarkDotNet.Attributes

type SetIntersectBenchmark() =
    // One set with up to 1,000,000 random elements and one with 10 elements.
    let longSet = [ for _ in 1 .. 1_000_000 -> Random.Shared.Next() ] |> Set.ofList
    let shortSet = Set.ofList [ 1; 2; 3; 4; 5; 6; 7; 8; 9; 10 ]

    [<Benchmark>]
    member _.intersect_long_set_with_short_set() = Set.intersect longSet shortSet

    [<Benchmark(Baseline = true)>]
    member _.intersect_short_set_with_long_set() = Set.intersect shortSet longSet
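
For reference, the benchmark can be run with a standard BenchmarkDotNet entry point along these lines (this assumes a console project with the BenchmarkDotNet package referenced):

open BenchmarkDotNet.Running

[<EntryPoint>]
let main _ =
    // Runs both benchmark methods and prints the summary table.
    BenchmarkRunner.Run<SetIntersectBenchmark>() |> ignore
    0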

Expected behavior

Set.intersect a b should produce the same result as Set.intersect b a, and I would expect the performance to be roughly the same regardless of argument order.

Actual behavior

Performance differs by several orders of magnitude:

| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Mean | Error | StdDev | Median | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| intersect_long_set_with_short_set | Job-BBQTFN | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 113,804,493.5 ns | 3,993,161.34 ns | 11,584,885.09 ns | 113,180,500.0 ns | 210,173.97 | 29,869.00 | - | 0.00 |
| intersect_short_set_with_long_set | Job-BBQTFN | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 547.0 ns | 18.29 ns | 52.78 ns | 525.3 ns | 1.00 | 0.00 | 40 B | 1.00 |
| intersect_long_set_with_short_set | Job-XIHKCK | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 107,596,475.3 ns | 6,650,312.63 ns | 19,504,215.74 ns | 100,039,800.0 ns | 199,488.09 | 41,392.25 | - | 0.00 |
| intersect_short_set_with_long_set | Job-XIHKCK | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 602.0 ns | 21.11 ns | 61.25 ns | 577.7 ns | 1.11 | 0.16 | 40 B | 1.00 |
| intersect_long_set_with_short_set | Job-MMDXPR | Default | 2 | 2 | 1 | 104,123,106.2 ns | 132,585,779.86 ns | 20,517,796.20 ns | 104,504,000.0 ns | 199,902.65 | 30,687.64 | - | 0.00 |
| intersect_short_set_with_long_set | Job-MMDXPR | Default | 2 | 2 | 1 | 566.6 ns | 268.16 ns | 41.50 ns | 569.6 ns | 1.10 | 0.15 | 40 B | 1.00 |

Proposed change

Due to how intersection of two sets is implemented (loop through all elements in the first set, check whether each is present in the second set, and add it to the intersection set if so), a huge first set means we have to loop through every one of its items, performing potentially hundreds of thousands of membership checks. On the other hand, looping 10 times and checking for membership in a set of 1,000,000 elements is trivial.
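
Conceptually, the current behaviour corresponds to something like the sketch below (an illustration only, not the actual FSharp.Core code, which operates on the internal tree representation):

// Every element of the first set is tested for membership in the second set,
// so the cost is driven by the size of the first argument.
let intersectSketch (a: Set<'T>) (b: Set<'T>) =
    a |> Set.filter (fun x -> Set.contains x b)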

I believe the complexity of intersect is O(n_a log n_b), where n_a is the size of the first argument and n_b the size of the second. That means the number of operations for a small first set and a large second set is around 135, while for a large first set and a small second set it is around 2.25 million.
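
For concreteness, plugging the benchmark sizes into that estimate (assuming the natural logarithm, which is what the figures above appear to use):

    small first set, large second set:  10 * ln(1_000_000)  membership checks
    large first set, small second set:  1_000_000 * ln(10)  membership checks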

To improve the performance, simply compare the sizes of the two sets passed in and flip the arguments when that is the cheaper direction, so that the iteration runs over the smaller set and the membership lookups run against the larger one. A minimal sketch of the idea is shown below.
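
A possible caller-side sketch of that idea against the public Set API (the name intersectBySmaller is made up for illustration; a real fix would live inside Set.intersect itself):

// Iterate the smaller set and look each element up in the larger one.
// Note: Set.count may itself walk the whole tree, so a fix inside
// FSharp.Core would want a cheaper way to pick the smaller set.
let intersectBySmaller (a: Set<'T>) (b: Set<'T>) =
    let small, large = if Set.count a <= Set.count b then a, b else b, a
    small |> Set.filter (fun x -> Set.contains x large)

// With this, both intersectBySmaller longSet shortSet and
// intersectBySmaller shortSet longSet stay in the fast case.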
