After a discussion about the performance implications of dynamic_cast, I had some time on my hands and wanted to check it myself (at least to some degree). That prompted the measurements below, where dynamic_cast is compared against an implementation that uses class member that cary an unique identifier to check. The functionality of the implementation is not the same as with dynamic_cast, but it satisfies the application that was discussed:
class Base {
public:
virtual ~Base() {}
volatile int m_value;
const types m_type;
protected:
Base(types type) : m_type(type), m_value(0) {}
};
class Derived : public Base {
public:
Derived() : Base(OurType) {}
void increment() { m_value++; }
};
class OtherDerived : public Base {
public:
OtherDerived() : Base(OtherType) {}
void doSomeThing() { m_value--; }
};
The testcase was to pass a Base* to a function, decide in the function if the object is an instance of Derived or OtherDerived and then call increment() or exit straight away:
void doDynamic(Base* base) {
Derived* derived = dynamic_cast<Derived*>(base);
if(derived)
derived->increment();
}
void doStatic(Base* base) {
if(base->m_type == OurType)
static_cast<Derived*>(base)->increment();
}
The design with type IDs itself was chosen because it seemed to me much more optimized, would be sufficient for the discussed application and is simple to implement and test. As it stands I would not be too proud if I would write an application this way (but I still might...).
The complete benchmark code can be found at the bottom and is separated into two source files, the first with the benchmark code itself, the second with some helper functions for time measurement, output etc. Since I was lazy and did not want to have to write more than necessary, I included the one source file in the other (but somehow I have to think about http://xkcd.com/292/ all the time now...)
All code was compiled with
g++ bench.cpp -o bench -O3
gcc was
gcc (Gentoo 4.7.3-r1 p1.4, pie-0.5.5) 4.7.3
host system for the tests was a
Linux xxx 3.12.13-gentoo #5 SMP Mon May 12 23:50:28 CEST 2014 i686 Genuine Intel(R) CPU T2400 @ 1.83GHz GenuineIntel GNU/Linux
Then the built program was simply executed. As can be seen in the code, no corrections where made for spurious outliers, the test was simply repeated until there were no (or at least very few) outliers.
Disassembly was generated using radare2:
> radare2 bench
e asm.syntax=pseudo
aa
pdf@sym._Z9doDynamicP4Base
pdf@sym._Z8doStaticP4Base
with additional manual comments after ';;'.
The first try was to run the inner loop within the tested functions:
void doDynamic(Base* base) {
for(int cnt = 0; cnt < INNER_LOOP; cnt++) {
Derived* derived = dynamic_cast<Derived*>(base);
if(derived)
derived->increment();
}
}
void doStatic(Base* base) {
for( int cnt = 0; cnt < INNER_LOOP; cnt++) {
if(base->m_type == OurType)
static_cast<Derived*>(base)->increment();
}
}
In the dynamic_cast case this works out nicely. In the case with typeid it is not so easy, as the compiler is smart enough to pull the type check out of the loop...
The next test looked more promising, but the enum used led to problems:
enum types {
OurType,
OtherType
};
In this case 'OurType' will be assigned the integer 0, which makes the test in doStatic simpler than it would be for the general case (a TEST instruction can be used instead of a CMP with a constant).
The core code that is being compared looks like this:
void __attribute__((noinline)) doDynamic(Base* base) {
Derived* derived = dynamic_cast<Derived*>(base);
if(derived)
derived->increment();
}
void __attribute__((noinline)) doStatic(Base* base) {
if(base->m_type == OurType)
static_cast<Derived*>(base)->increment();
}
void __attribute__((noinline)) doDerived(Derived* derived) {
derived->increment();
}
For the dynamic case we do a dynamic cast, check if the result is 0, and if not we invoke the method of the casted object. In the static case we make a lookup into the objects type and use that to decide if we have the correct object or not.
The noinline attribute is neccesary because gcc can not be trusted not to inline the function call (and then do the same optimization as above). I guess thats a good thing though...
With this we can do a simple measurement, where the output look something like (minus the raw values, for those see the end of the file):
Measurement resolution: 1ns
Results for dynamic:
min: 39980011ns
max: 45938170ns
avg: 4.02109e+07ns
stddev: 7.77441e+06ns
range: 5958159ns
range %: 0.148173%
Results for static:
min: 4375487ns
max: 4460343ns
avg: 4.38304e+06ns
stddev: 147143ns
range: 84856ns
range %: 0.0193601%
Results for derived:
min: 3281912ns
max: 3338762ns
avg: 3.28731e+06ns
stddev: 117995ns
range: 56850ns
range %: 0.0172938%
Adjusted for loop, operation & function call overhead:
dynamic avg: 3.69236e+07ns
static avg : 1.09573e+06ns
derived avg: 0ns
Taking the average, there is a factor of ~9.1 between the two codes. If we look at just the times that decision & casting take, we look at a factor of >30.
Let us have a look at the assembly code:
/ (fcn) sym._Z9doDynamicP4Base 60
| 0x080491b0 83ec1c sub esp, 0x1c
| 0x080491b3 8b442420 mov eax, [esp+0x20]
| 0x080491b7 85c0 test eax, eax
| ,=< 0x080491b9 742d je 0x80491e8
| | 0x080491bb c744240c000. mov dword [esp+0xc], 0x0
| | 0x080491c3 c7442408689. mov dword [esp+0x8], sym._ZTI7Derived ; 0x08049d68
| | 0x080491cb c7442404609. mov dword [esp+0x4], sym._ZTI4Base ; 0x08049d60
| | 0x080491d3 890424 mov [esp], eax
| | ; CODE (CALL) XREF from 0x08048ba0 (fcn.08048b96)
| | 0x080491d6 e8c5f9ffff call sym.imp.__dynamic_cast
| | sym.imp.__dynamic_cast()
| | 0x080491db 85c0 test eax, eax
| ,==< 0x080491dd 7409 je 0x80491e8
| || 0x080491df 8b5004 mov edx, [eax+0x4]
| || 0x080491e2 83c201 add edx, 0x1
| || 0x080491e5 895004 mov [eax+0x4], edx
| ``-> 0x080491e8 83c41c add esp, 0x1c
\ 0x080491eb c3 ret
/ (fcn) sym._Z8doStaticP4Base 26
| 0x080491f0 8b442404 mov eax, [esp+0x4]
| 0x080491f4 83780801 cmp dword [eax+0x8], 0x1
| ,=< 0x080491f8 7406 je 0x8049200
| | 0x080491fa f3c3 repe ret
| | 0x080491fc 8d742600 lea esi, [esi]
| `-> 0x08049200 8b5004 mov edx, [eax+0x4]
| 0x08049203 83c201 add edx, 0x1
| 0x08049206 895004 mov [eax+0x4], edx
\ 0x08049209 c3 ret
We see that the case with type id is a simple test, jump and then the inlined increment() call. In the case with the dynamic cast we have to set up the registers with the values to call the implementation function of dynamic_cast, and the call the function itself. The call can not be inlined, since dynamic_cast is in a shared library and will be linked at runtime.
Even if dynamic_cast could be inlined, it would still be much slower, reason being that we know much more about our class structure than dynamic_cast could ever assume. It basically has to deal with (from C++ Standard Draft N2857, Chapter 5.2.7 [expr.dynamic.cast]):
- multiple inheritance
- virtual inheritance
- errors like multiple non-virtual appearance of the same base class (therefore it has to walk the complete object ever dynamic_cast)
- access control
For simple cases gcc has a few dynamic_cast implementations that do not cover the full spectrum (I fear one of those is used here...).
Changes to test this case:
...
void __attribute__((noinline)) doDerived(Derived* derived) {
//derived->increment();
asm volatile("");
}
int main(int argc, char* argv[]) {
//Base* derived = new Derived();
Base* derived = new OtherDerived();
...
An interesting thing is comparing the case where we call the function with a class that can not be casted to the expected class. Lets have a look at the numbers here:
Results for dynamic:
min: 61891106ns
max: 69293025ns
avg: 6.19973e+07ns
stddev: 8.63447e+06ns
range: 7401919ns
range %: 0.119391%
Results for static:
min: 3281842ns
max: 3343511ns
avg: 3.28736e+06ns
stddev: 128454ns
range: 61669ns
range %: 0.0187594%
Results for derived:
min: 3281772ns
max: 3323956ns
avg: 3.28706e+06ns
stddev: 108004ns
range: 42184ns
range %: 0.0128333%
We see an even bigger difference in the two implementations, the averages have about a factor 19 between them. In this case it is not useful to compare the corrected values since there is so little overhead from the noop function to the one check the numbers are not realistic.
- Is the implementation with type IDs viable or just plain ugly and evil?
- how does performance change with different class hierarchies (wide, deep, both)?
- how does performance look on other platformat/compilers/systems?
- does branch prediction influence our result?
If you can avoid using dynamic_cast, then do so. On cast not compared here is the qobject_cast of QT, which is implemented without dynamic_cast, but needs the QT metaobject system to get type information and therefore only works on classes derived from QObject (and needs QT...). But since it does not have to cover all the special cases, it should be fast than dynamic_cast, though not as fast as no type-guessing at all. There are also a host of other alternatives, some listed below. Whether those are viable will highly depend on your use-case, you might be ok with dynamic_cast event though it is slower (but hey, it checks more stuff...).
Interesting stuff on the internet to look at:
- Fast dynamic casts (http://onlinelibrary.wiley.com/doi/10.1002/spe.686/abstract). Paper by Michael Gibbs and Bjarne Stroustrup proposing to use a scheme based on prime numbers to optimize dynamic_cast.
- Practical and Verifiable C++ Dynamic Cast for Hard Real-Time Systems (http://www.stroustrup.com/fdc_jcse.pdf), Paper by Damian Dechev, Rabi Mahapatra, Bjarne Stroustrup, follow up on the paper above
- Priori - A Fast dynamic_cast<> Alternative (http://www.codeproject.com/Articles/609598/Priori-A-Fast-dynamic-cast-Alternative) unoptimized implementation of the scheme described in the papers above
- the standard library locale header with std::use_facet, which contains a way to automatically assign Object IDs
- much more that I did not look at/find yet
Measurement resolution: 1ns
Results for dynamic:
min: 39980011ns
max: 45938170ns
avg: 4.02109e+07ns
stddev: 7.77441e+06ns
range: 5958159ns
range %: 0.148173%
raw values:
45938170 40699028 40878939 40973364 41518266 40757345 40648113 40617104 40778926 40834519
40728779 40608165 40683941 40717954 40645668 40742329 40742120 40580018 39994259 40096926
40104329 40078139 40128564 40207834 44474089 40023663 40119973 39991745 40076742 40109846
40110475 40043846 40032253 40119764 40055371 40096856 40075554 40152660 40092805 40112081
40542793 40181434 40004386 40072621 40075066 40051739 40127517 40054253 40137504 40080094
40094551 40072970 40103840 40089592 40119555 40080793 40073041 40150284 40039447 40118088
40086380 40092945 40028831 40052297 40060888 40218240 40445713 40112431 40106424 40065498
40039656 40107961 40102164 40054463 40117459 40078278 40097275 40119345 40130868 40142462
40103491 40126818 40130171 40093154 40094481 40141066 40045663 40094342 40074437 40029110
40560044 40099859 40007180 40065637 40117878 40141554 40051598 40127516 40025129 40113618
40025478 40097206 40041821 40110685 40095040 40080792 40142741 40084214 40081421 40085682
40109707 40107263 40118298 40100488 40106354 40151960 40068849 40037910 40156152 40110754
40086241 40094621 40128983 40074158 40086869 40081770 40086938 40075205 40070107 40159015
40089941 40081631 40092177 40107472 40053834 40118087 40099300 40033091 40100208 40071434
40548520 40142811 40069897 40112221 40068640 40052577 40125561 40081771 40135478 40157897
40069757 40007319 39985739 40140717 40098951 40078069 40093992 40090152 40037211 40069338
40067453 40076462 40117040 40093993 40064030 40189466 40145885 40059980 40031834 40105098
40026526 40103771 40137015 40104469 40084564 40119065 40128564 40076811 40118088 40111313
40069967 40117739 40022824 40074228 40147631 40096926 40091339 40048735 40074298 40103351
40530361 40117738 39980011 40144348 40099300 40141973 40090222 40125840 40053485 40061167
Results for static:
min: 4375487ns
max: 4460343ns
avg: 4.38304e+06ns
stddev: 147143ns
range: 84856ns
range %: 0.0193601%
raw values:
4393086 4375766 4389245 4375766 4375906 4389734 4375766 4389245 4375766 4390921
4375696 4375766 4399301 4375765 4397975 4375766 4391619 4375765 4395251 4375766
4375766 4397277 4375696 4389594 4375766 4391829 4375766 4375765 4389175 4375766
4389175 4375766 4389245 4375766 4419695 4375766 4375766 4392946 4375766 4391619
4375696 4389733 4375766 4389245 4375766 4375696 4390781 4375836 4389664 4375766
4389454 4375627 4375766 4389384 4375766 4389594 4375766 4389175 4375835 4389175
4375766 4375766 4389384 4375766 4389455 4375766 4389454 4375696 4375766 4390781
4375696 4389385 4375766 4389245 4375836 4389105 4375835 4375696 4389384 4375836
4389315 4375696 4389454 4375766 4375766 4390432 4375766 4389943 4375836 4389105
4375766 4419416 4375766 4375766 4389943 4375766 4389733 4375766 4389315 4375766
4389803 4375766 4375696 4389873 4375766 4389524 4375836 4389384 4375765 4375696
4389315 4375836 4389035 4375766 4390432 4375696 4389664 4375766 4375696 4389035
4375766 4389175 4375836 4389455 4375766 4375766 4389244 4375766 4389035 4375766
4390502 4375766 4389943 4375835 4375836 4390572 4375766 4389175 4375976 4390292
4375696 4375766 4460343 4375766 4392247 4375766 4389594 4375766 4429124 4375766
4375487 4389873 4375766 4389454 4375835 4393574 4375766 4375766 4393505 4375766
4398464 4375766 4389803 4375696 4389174 4375696 4375696 4392247 4375697 4389524
4375765 4389524 4375696 4389245 4375766 4375766 4390781 4375766 4389803 4375696
4389663 4375766 4375766 4391200 4375697 4389803 4375765 4388686 4375766 4389384
4375696 4375696 4390711 4375766 4389105 4375766 4389664 4375696 4375766 4389385
Results for derived:
min: 3281912ns
max: 3338762ns
avg: 3.28731e+06ns
stddev: 117995ns
range: 56850ns
range %: 0.0172938%
raw values:
3282121 3295810 3282191 3282191 3295880 3282121 3282191 3338762 3282261 3282122
3297206 3282261 3282191 3295879 3282121 3282191 3295810 3282121 3282192 3282261
3295810 3282261 3282122 3295670 3282121 3282121 3295880 3282121 3282191 3297067
3282191 3282191 3296089 3282122 3282121 3295530 3282261 3282192 3296857 3282191
3282191 3295530 3282122 3282261 3295880 3282260 3281912 3295670 3282261 3282191
3300419 3282191 3282121 3295879 3282192 3282261 3295740 3282471 3282191 3295530
3282121 3282261 3295879 3282191 3282191 3296159 3282191 3282261 3296857 3282192
3282121 3295879 3282191 3282330 3295530 3282121 3282121 3295739 3282331 3282191
3299232 3282401 3282191 3324304 3282121 3282122 3297206 3282191 3282261 3297765
3282191 3282051 3282191 3296228 3282261 3282191 3295810 3282122 3282331 3297975
3282470 3282121 3296088 3282121 3282261 3295600 3282191 3282261 3295669 3282121
3282191 3295739 3282121 3282122 3295740 3282121 3282260 3295879 3282052 3281982
3295600 3282191 3282331 3295810 3282680 3282330 3296438 3282261 3282122 3295949
3282261 3282191 3295810 3282121 3282121 3295600 3282121 3282191 3295809 3282122
3282191 3295879 3282122 3282121 3295460 3282191 3282191 3295810 3282261 3282121
3296996 3282192 3282051 3296228 3282261 3282192 3297695 3282191 3282122 3324374
3282331 3282121 3296369 3282191 3282122 3282191 3296228 3282052 3282401 3295810
3282261 3282261 3295879 3282121 3282192 3296019 3282122 3282331 3295740 3282191
3282121 3295530 3282192 3282261 3295880 3282191 3282192 3296019 3282121 3282191
3296857 3282122 3282191 3296299 3282261 3282820 3295879 3282121 3282121 3295879
Adjusted for loop, operation & function call overhead:
dynamic avg: 3.69236e+07ns
static avg : 1.09573e+06ns
derived avg: 0ns
Factors:
normal: 9.17421
corr : 33.6977
#include <iostream>
#include <string>
const int INNER_LOOP = 1000000;
const int OUTER_LOOP = 200;
enum types {
OurType = 1,
OtherType
};
class Base {
public:
virtual ~Base() {}
volatile int m_value;
const types m_type;
protected:
Base(types type) : m_type(type), m_value(0) {}
};
class Derived : public Base {
public:
Derived() : Base(OurType) {}
void increment() { m_value++; }
};
class OtherDerived : public Base {
public:
OtherDerived() : Base(OtherType) {}
void doSomeThing() { m_value--; }
};
void __attribute__((noinline)) doDynamic(Base* base) {
Derived* derived = dynamic_cast<Derived*>(base);
if(derived)
derived->increment();
}
void __attribute__((noinline)) doStatic(Base* base) {
if(base->m_type == OurType)
static_cast<Derived*>(base)->increment();
}
void __attribute__((noinline)) doDerived(Derived* derived) {
derived->increment();
//asm volatile("");
}
#include "stuff.cpp" // I know... Raptors will come and eat me for this
int main(int argc, char* argv[]) {
Base* derived = new Derived();
//Base* derived = new OtherDerived();
Clock::TimePoint startDyn[OUTER_LOOP], startStat[OUTER_LOOP], startDer[OUTER_LOOP];
Clock::TimePoint endDyn[OUTER_LOOP], endStat[OUTER_LOOP], endDer[OUTER_LOOP];
std::cout << "Measurement resolution: " << Clock::Resolution() << "ns" << std::endl << std::endl;
for( int i = 0; i < 10; ++i)
doDynamic(derived);
for( int i = 0; i < OUTER_LOOP; ++i) {
startDyn[i] = Clock::Now();
for( int cnt = 0; cnt < INNER_LOOP; cnt++)
doDynamic(derived);
endDyn[i] = Clock::Now();
}
for( int i = 0; i < 10; ++i)
doStatic(derived);
for( int i = 0; i < OUTER_LOOP; ++i) {
startStat[i] = Clock::Now();
for( int cnt = 0; cnt < INNER_LOOP; cnt++)
doStatic(derived);
endStat[i] = Clock::Now();
}
for( int i = 0; i < 10; ++i)
doDerived((Derived*)derived);
for( int i = 0; i < OUTER_LOOP; ++i) {
startDer[i] = Clock::Now();
for( int cnt = 0; cnt < INNER_LOOP; cnt++)
doDerived((Derived*)derived);
endDer[i] = Clock::Now();
}
double dy = printResult( "dynamic", startDyn, endDyn);
double st = printResult( "static", startStat, endStat);
double de = printResult( "derived", startDer, endDer);
std::cout << "Adjusted for loop, operation & function call overhead:" << std::endl;
std::cout << "\tdynamic avg: " << dy-de << "ns" << std::endl;
std::cout << "\tstatic avg : " << st-de << "ns" << std::endl;
std::cout << "\tderived avg: 0ns" << std::endl;
std::cout << std::endl;
std::cout << "Factors:" << std::endl;
std::cout << "\tnormal: " << dy / st << std::endl;
std::cout << "\tcorr : " << (dy-de)/(st-de) << std::endl;
}
#include <time.h>
#include <stdint.h>
#include <limits>
#include <cmath>
std::ostream& operator<<( std::ostream& stream, const Base& base) {
stream << "Type: " << base.m_type << " Value: " << base.m_value << std::endl;
return stream;
}
/////////////////////////////////////////////////////////////////////////////////////
// taken from the hayai library (https://github.com/nickbruun/hayai) by Nick Bruun //
/////////////////////////////////////////////////////////////////////////////////////
class Clock
{
public:
/// Time point.
/// Opaque representation of a point in time.
typedef struct timespec TimePoint;
/// Get the current time as a time point.
/// @returns the current time point.
static TimePoint Now()
{
TimePoint result;
# if defined(CLOCK_MONOTONIC_RAW)
clock_gettime(CLOCK_MONOTONIC_RAW, &result);
# elif defined(CLOCK_MONOTONIC)
clock_gettime(CLOCK_MONOTONIC, &result);
# elif defined(CLOCK_REALTIME)
clock_gettime(CLOCK_REALTIME, &result);
# else
clock_gettime((clockid_t)-1, &result);
# endif
return result;
}
/// Get the duration between two time points.
/// @param startTime Start time point.
/// @param endTime End time point.
/// @returns the number of nanoseconds elapsed between the two time
/// points.
static int64_t Duration(const TimePoint& startTime,
const TimePoint& endTime)
{
TimePoint timeDiff;
timeDiff.tv_sec = endTime.tv_sec - startTime.tv_sec;
if (endTime.tv_nsec < startTime.tv_nsec)
{
timeDiff.tv_nsec = endTime.tv_nsec + 1000000000L -
startTime.tv_nsec;
timeDiff.tv_sec--;
}
else
timeDiff.tv_nsec = endTime.tv_nsec - startTime.tv_nsec;
return timeDiff.tv_sec * 1000000000L + timeDiff.tv_nsec;
}
static int64_t Resolution()
{
TimePoint result;
# if defined(CLOCK_MONOTONIC_RAW)
clock_getres(CLOCK_MONOTONIC_RAW, &result);
# elif defined(CLOCK_MONOTONIC)
clock_getres(CLOCK_MONOTONIC, &result);
# elif defined(CLOCK_REALTIME)
clock_getres(CLOCK_REALTIME, &result);
# else
clock_getres((clockid_t)-1, &result);
# endif
return result.tv_sec * 1000000000L + result.tv_nsec;
}
};
double printResult( std::string test, Clock::TimePoint start[OUTER_LOOP], Clock::TimePoint end[OUTER_LOOP]) {
double avg = 0;
double variance = 0;
double stddev = 0;
int64_t duration[OUTER_LOOP];
int64_t min = std::numeric_limits<int64_t>::max();
int64_t max = std::numeric_limits<int64_t>::min();
int64_t range;
for( int i = 0; i < OUTER_LOOP; ++i) {
duration[i] = Clock::Duration(start[i], end[i]);
avg += duration[i];
if( min > duration[i])
min = duration[i];
if( max < duration[i])
max = duration[i];
}
avg /= OUTER_LOOP;
for( int i = 0; i < OUTER_LOOP; ++i)
variance += std::pow(duration[i] - avg, 2);
stddev = std::sqrt(variance);
std::cout << "Results for " << test << ":" << std::endl;
std::cout << std::endl;
std::cout << "min: " << min << "ns" << std::endl;
std::cout << "max: " << max << "ns" << std::endl;
std::cout << "avg: " << avg << "ns" << std::endl;
std::cout << "stddev: " << stddev << "ns" << std::endl;
std::cout << "range: " << max - min << "ns" << std::endl;
std::cout << "range %: " << (max - min)/avg << "%" << std::endl;
std::cout << std::endl;
std::cout << "raw values: ";
for( int i = 0; i < OUTER_LOOP; ++i) {
if( (i % 10) == 0)
std::cout << std::endl << "\t";
std::cout << duration[i] << " ";
}
std::cout << std::endl;
std::cout << std::endl;
return avg;
}