`using` killed the `FUN` on GCC 5

An anecdote from benchmarking the compile time of mpark/variant, an implementation of std::variant for C++11/14/17.

Introduction

std::variant has a converting constructor that looks like this:

template <typename... Ts>
class variant {
  template <typename T>
  constexpr variant(T &&);
};

For this constructor, we need to determine which of the Ts... will be initialized with T. For std::variant, the behavior is defined as follows:

template <class T> constexpr variant(T&& t) noexcept(see below);

Let Tj be a type that is determined as follows: build an imaginary function FUN(Ti) for each alternative type Ti. The overload FUN(Tj) selected by overload resolution for the expression FUN(std::forward<T>(t)) defines the alternative Tj which is the type of the contained value after construction.

For example, given variant<int, std::string> v("hello");, we want to create an overload set FUN with int and std::string then manually perform overload resolution with an argument of type const char (&)[6].

The following is the initial approach I took for FUN:

template <typename Head, typename... Tail>
struct FUN;

// Base case.
template <typename Head>
struct FUN<Head> {
  Head operator()(Head) const;
};

// Recursive case.
template <typename Head, typename... Tail>
struct FUN : FUN<Tail...> {
  using FUN<Tail...>::operator();
  Head operator()(Head) const;
};

Then the winner of overload resolution can be determined using FUN like this:

using best_overload = std::result_of_t<FUN<int, std::string>(const char (&)[6])>;

The compile-time benchmark of variant’s converting constructor using this implementation looked like this on GCC 5:

A minimal example on Wandbox shows (via -ftime-report) that the cost of using this particular implementation of FUN at N = 350 is ~20s!

What’s so expensive…?

Determining the Source

The first thing I found is that commenting out the using declaration from the above example takes the compile time down to ~0.15s. So the recursive inheritance seems fine, and the using declaration could be the issue, but by not pulling in the parent operator() we’re also no longer performing overload resolution. Could it be the cost of overload resolution?

Well, the non-recursive, hand-rolled version of FUN which does perform overload resolution takes ~0.1s. Okay, it must be the using declaration.

Now what…?

Surrogate Call Function

When considering the overload set for a call to object of a class type, it’s not just the operator()s that get involved in the action.

For example, an implicit conversion operator to R (*)(P1, ..., PN) is also considered. They participate in overload resolution as surrogate call function with the unique name call-function and has the form:

using F = R (*)(P1, ..., PN);

R call-function(F f, P1 a1, ..., PN aN) { return f(a1, ..., aN); }

This is an example taken basically verbatim from [over.call.object]:

int f1(int);
int f2(float);

typedef int (*fp1)(int);
typedef int (*fp2)(float);

struct A {
  operator fp1() { return f1; }
  operator fp2() { return f2; }
} a;

int i = a(1);  // calls f1 via pointer returned from conversion function

How does this help us?

What’s in a Name?

Why did we need the using declaration in the initial approach, anyway? It’s due to the fact that the name of the overloaded function in that case was operator(), and FUN<T, Ts...>::operator() shadowed FUN<Ts...>::operator(). So we needed to pull the FUN<Ts...>::operator() name explicitly into FUN<T, Ts...>’s scope.

With surrogate call functions, the “pulling in” happens implicitly:

Similarly, surrogate call functions are added to the set of candidate functions for each non-explicit conversion function declared in a base class of T provided the function is not hidden within T by another intervening declaration.

As such, the “name” is essentially the type of the pointer to function.

Now, we can implement FUN like this:

template <typename Head, typename... Tail>
struct FUN;

// Base case.
template <typename Head>
struct FUN<Head> {
  using F = Head (*)(Head);
  operator F() const;
};

// Recursive case.
template <typename Head, typename... Tail>
struct FUN : FUN<Tail...> {
  using F = Head (*)(Head);
  operator F() const;
};

With this implementation, at N = 350 we get ~0.6s! MUCH better than ~20s.

Final Touch with Multiple Inheritance

Since the surrogate call functions are implicitly pulled in from the base classes, we don’t even need to keep the recursive inheritance anymore.

We can flatten, simplify, and improve the implementation with multiple inheritance:

template <typename T>
struct FUN_leaf {
  using F = T (*)(T);
  operator F() const;
};

template <typename... Ts>
struct FUN : FUN_leaf<Ts>... {};

Now, with this implementation, at N = 350 we get about ~0.2s! Not as drastic, but still a ~3x improvement.

Final Remarks

This issue was discovered while I was setting up MPark.Variant/Benchmark to compare the many variant implementations out there.

My hope is that this website will help to improve the community-wide implementation quality of variant by comparing existing implementations.

Take a look at the latest GCC 5 results to see the improvements that have been made to mpark/variant!

Thanks to Agustín “K-ballo” Bergé for introducing / explaining surrogate call functions to me.

Thanks to Louis Dionne and Bruno Dutra for developing the metabench framework and setting a precedent with metaben.ch!

Michael Park

from Shower Thoughts to Ruminations