I'm just tired of constant optimization and I wrote this note to show some tricks of optimization. Note: no multithread/coroutine
While keeping the structure of std::vector::resize(reserve, push_back/emplace_back)
and getopt_long
, instead of using dynamic std::string
for storing mode, just using (char*)optarg
and strcmp
and return integer sign to indicate the mode. (It's simple switch (only 3 modes and 2 options), some techniques like place your case index in order and Duff's device are useless.)
push_back
receives rvalue reference after C++11, so for prvalue no need to call emplace_back
Lab1 is surely a IO-heavy program. C and Cpp have a lots of IO functions and there are various methods to optimize them.
// Stop using endl, using << nl;
template <class CharT, class Traits>
std::basic_ostream<CharT, Traits>& nl(std::basic_ostream<CharT, Traits>& os) {
os.put('\n');
// os.flush() endl does this!
return os;
}
std::ios_base::sync_with_stdio(false); // disable synchronization of stdio and iostream (This will allocae ~120000 bytes streambuf, unable to stop that for wide-char stream since encoded in gcc/confdefs.h while compiling)
std::cin.tie(nullptr); // untie cin and cout to disable flush between switching cin and cout
// This can be wrote as std::cout << std::nounitbuf;
// unitbuf is just mean make the buffer size 1. AKA. turn off
std::cout.unsetf(std::ios_base::unit_buf); // disable the automatic output flush, then you should flush it manually. Or by dtor
Just using them. glibc writes them in hundreds of macros (vfprintf). However, C lacks of compile-time parsing of format string. That's bad because we need to parse the format in runtime (unlike rust and dlang!).
Sometimes (in Lab1), we just need simple positive integer reader/writer.
inline bool read_int_eof(size_t& x) {
char c = getchar(); x = 0;
if (c == EOF) return false;
while (c < '0' || c > '9') {
c = getchar();
if (c == EOF) return false;
}
while (c <= '9' && c >= '0') {
x = (x << 3) + (x << 1) + c - 48;
c = getchar();
}
return true;
}
inline void write_szt(size_t x) {
int cnt = 0;
char c[15];
while (x) {
++cnt;
c[cnt] = (x % 10) + 48;
x /= 10;
}
while (cnt) {
putchar(c[cnt]);
--cnt;
}
}
That's fast. However, there is still some space to add. getchar()/putchar()
is multithread safe (MT-safe). There are MT-unsafe version called getchar_unlocked()/putchar_unlocked()
. getchar_unlocked()
is much faster than getchar()
!. Notably, _unlocked()
version are POSIX-specfic (just like getopt
!)
Note: in MSVC #define _CRT_DISABLE_PERFCRIT_LOCKS
will do the same things.
Use it and its _unlocked
version simply. It's convenient and fast to read/write a line.
No gets
For large non-integer input and reading them totally, using global char array for buffering (Competitive Programming players just like that!) instead of default buffer using setbuf/setvbuf
(glibc-x64 using 64k). Also, fread()
and feof()
have their _unlocked()
version (MT-safe, slightly improvement).
For std::stringstream
as buffer
ostringstream oss{"Some Text"};
cout << oss.str();
stringstream ss{"Some Text"};
cout << ss.rdbuf();
Note: I don't test performance of read()
but fread_unlocked
with global char array is the best among all of those.
Update: fread
actually mmap the file
freopen
#if defined(FREOPEN_FILE_IN) && defined(FREOPEN_FILE_OUT)
freopen(FREOPEN_FILE_IN, "r", stdin);
freopen(FREOPEN_FILE_OUT, "w", stdout);
#endif
#if defined(FREOPEN_FILE_IN) && defined(FREOPEN_FILE_OUT)
fclose(FREOPEN_FILE_IN);
fclose(FREOPEN_FILE_OUT);
#endif
// If you want it back
#ifdef _WIN32
#define NULL_DEVICE "NUL:"
#else
#define NULL_DEVICE "/dev/null"
#endif
#ifdef _WIN32
#define STANDARD_OUT_DEVICE "CON"
#else
#define STANDARD_OUT_DEVICE "/dev/tty" // Or use dup to store fileno of stdio
#endif
read/write/mmap
(Lab1 will cause MLE)
int fd = fileno(stdin);
struct stat sb;
fstat(fd, &sb);
size_t file_sz = sb.st_size;
char* buf = reinterpret_cast<char*>(mmap(0, file_sz, PROT_READ, MAP_PRIVATE, fd, 0)); // I do not really understand this, just ref your linux/glibc manual. Anyway, though crazy like me, that's too much crazy.
char* p = buf; // read from it;
char buffer[LARGE_SZ];
while ((int sz = read(fileno(stdin), buffer, LARGE_SZ)) > 0) {
write(fileno(stdout), buffer, sz)
}
Note: for small input, too much optimization may be overhead.
Because output needs to be dealt with, so it may have much calculation and difference for output data.
Basically, using printf()
, then putchar/puts
and it's unlocked version.
Some advanced options can be (and _unlocked
of course)
std::fill_n(std::ostream_iterator<const char*>(std::cout, '\n'), count, cstring);
std::cout.write(cppstr.data(), cppstr.size());
fwrite(cstring, 1, length, stdout);
write(fileno(stdout), cstring, length);
Though little use in Lab1, -Ofast
enables some non-standard-compliant options, eg. -ffast-math
(enables unsafe floating-number operation)
Also, little use in Lab1 (IO simply cannot use SIMD to speed up). -march=native
select current chipset instruction set (eg. SSE AVX MMX) and likely break compatibility of this program on older chipsets
Zero use in Lab1, since Lab1 is a single file program and don't apply to LTO (link time optimization (eg. inline functions in different transition units).
It's a bit like cheating, but the autograder accepts it.
-fprofile-generate
generates gcda files and then if you pack your gcda file, and -fprofile-use -Wno-coverage-mismatch
then compiler will profile the usage for you.
You can see it below at cache section.
-fomit-frame-pointer
-falign-functions=16
/-falign-loops=16
-fno-rtti
switch can derive 2 kinds of code. Naive if and jump table. If you place your label orderly, it's easy to form a jump table.
cmov
vs. cmp + jcc
actually that's a argument about whether it would be quicker to write it into ternary form. linus-on-cmov
register
hints the compiler to the value on register instead of stack. However, this is deprecated in Cpp17.
restrict
is one qualifier which C has and Cpp not. It indicates that no other access through this pointer in this function body is valid, so the compiler can do more aggressive optimiziation. However, Cpp DON'T have it. (Rust always does it, only one mutable borrow exists)
Some optimization tricks on CPU architectures including using __builtin_expect
(optimize branch prediction), __builtin_prefetch
and adjusting the array size, expanding the loop, using things like Duff's device to speed up. I have little knowledge about that (Maybe after I finish my EECS370, ha). Also, It's possible, but hard to write better assembler code than compiler does.
...But always you can add -fprefetch-loop-arrays
to let compiler does it. reference
Update: after profiling your code, you can add __builtin_expect
for time-consuming branch condition codes. (eg. gperftools)
madvise/posix_fadvise
madvise(p, size, MADV_WILLNEED || MADV_SEQUENTIAL)
<emmintrin.h>
and __m128
_mm_xxxxx
A live example I found here computers-are-fast
Acutally I have something like this and my program have twice time
Warning: Your program used more system time than user time. This may be due to excessive I/O, overly frequent time measurement (via getrusage for example), or unnecessary system calls.
In my project1 with same program, R5S can have a chance to be 1/3 normal time.
So luck may be important. Wish you a good luck :)
It's EECS281: Data structure and Alg. So, FOCUS ON YOUR DS & ALG!
tnlbxtt