Time Optimization in EECS281 Lab1

I'm just tired of constant optimization and I wrote this note to show some tricks of optimization. Note: no multithread/coroutine

1. Program

Avoid dynamic string

While keeping the structure of std::vector::resize(reserve, push_back/emplace_back) and getopt_long, instead of using dynamic std::string for storing mode, just using (char*)optarg and strcmp and return integer sign to indicate the mode. (It's simple switch (only 3 modes and 2 options), some techniques like place your case index in order and Duff's device are useless.)

Avoid extra copy and using rvalue reference version functions

push_back receives rvalue reference after C++11, so for prvalue no need to call emplace_back

2. IO, especially Input

Lab1 is surely a IO-heavy program. C and Cpp have a lots of IO functions and there are various methods to optimize them.

iostream (cin, cout)

// Stop using endl, using << nl;
template <class CharT, class Traits>
std::basic_ostream<CharT, Traits>& nl(std::basic_ostream<CharT, Traits>& os) {
    os.put('\n');
    // os.flush() endl does this!
    return os;
}

std::ios_base::sync_with_stdio(false); // disable synchronization of stdio and iostream  (This will allocae ~120000 bytes streambuf, unable to stop that for wide-char stream since encoded in gcc/confdefs.h while compiling)
std::cin.tie(nullptr); // untie cin and cout to disable flush between switching cin and cout
// This can be wrote as std::cout << std::nounitbuf;
// unitbuf is just mean make the buffer size 1. AKA. turn off
std::cout.unsetf(std::ios_base::unit_buf); // disable the automatic output flush, then you should flush it manually. Or by dtor

scanf/printf

Just using them. glibc writes them in hundreds of macros (vfprintf). However, C lacks of compile-time parsing of format string. That's bad because we need to parse the format in runtime (unlike rust and dlang!).

DIY using getchar/putchar

Sometimes (in Lab1), we just need simple positive integer reader/writer.

inline bool read_int_eof(size_t& x) {
    char c = getchar(); x = 0;
    if (c == EOF) return false;
    while (c < '0' || c > '9') {
        c = getchar();
        if (c == EOF) return false;        
    }
    while (c <= '9' && c >= '0') {
        x = (x << 3) + (x << 1) + c - 48;
        c = getchar();
    }
    return true;
}

inline void write_szt(size_t x) {
    int cnt = 0;
    char c[15];
    while (x) {
        ++cnt;
        c[cnt] = (x % 10) + 48;
        x /= 10;
    }
    while (cnt) {
        putchar(c[cnt]);
        --cnt;
    }
}

That's fast. However, there is still some space to add. getchar()/putchar() is multithread safe (MT-safe). There are MT-unsafe version called getchar_unlocked()/putchar_unlocked(). getchar_unlocked() is much faster than getchar()!. Notably, _unlocked() version are POSIX-specfic (just like getopt!)

Note: in MSVC #define _CRT_DISABLE_PERFCRIT_LOCKS will do the same things.

fgets/fputs/gets/puts

Use it and its _unlocked version simply. It's convenient and fast to read/write a line.

No gets

Buffering and fread/setvbuf

For large non-integer input and reading them totally, using global char array for buffering (Competitive Programming players just like that!) instead of default buffer using setbuf/setvbuf (glibc-x64 using 64k). Also, fread() and feof() have their _unlocked() version (MT-safe, slightly improvement).

For std::stringstream as buffer

ostringstream oss{"Some Text"};
cout << oss.str();
stringstream ss{"Some Text"};
cout << ss.rdbuf();

Note: I don't test performance of read() but fread_unlocked with global char array is the best among all of those.

Update: fread actually mmap the file

Specific file?

freopen

#if defined(FREOPEN_FILE_IN) && defined(FREOPEN_FILE_OUT)
    freopen(FREOPEN_FILE_IN, "r", stdin);
    freopen(FREOPEN_FILE_OUT, "w", stdout);
#endif

#if defined(FREOPEN_FILE_IN) && defined(FREOPEN_FILE_OUT)
    fclose(FREOPEN_FILE_IN);
    fclose(FREOPEN_FILE_OUT);
#endif

// If you want it back
#ifdef _WIN32
#define NULL_DEVICE "NUL:"
#else
#define NULL_DEVICE "/dev/null"
#endif

#ifdef _WIN32
#define STANDARD_OUT_DEVICE "CON"
#else
#define STANDARD_OUT_DEVICE "/dev/tty"  // Or use dup to store fileno of stdio
#endif

Too too large?

read/write/mmap (Lab1 will cause MLE)

int fd = fileno(stdin);
struct stat sb;
fstat(fd, &sb);
size_t file_sz = sb.st_size;
char* buf = reinterpret_cast<char*>(mmap(0, file_sz, PROT_READ, MAP_PRIVATE, fd, 0)); // I do not really understand this, just ref your linux/glibc manual. Anyway, though crazy like me, that's too much crazy.
char* p = buf; // read from it;

char buffer[LARGE_SZ];
while ((int sz = read(fileno(stdin), buffer, LARGE_SZ)) > 0) {
    write(fileno(stdout), buffer, sz)
}

Note: for small input, too much optimization may be overhead.

Why Output is a problem

Because output needs to be dealt with, so it may have much calculation and difference for output data.

Basically, using printf(), then putchar/puts and it's unlocked version.

Some advanced options can be (and _unlocked of course)

std::fill_n(std::ostream_iterator<const char*>(std::cout, '\n'), count, cstring);
std::cout.write(cppstr.data(), cppstr.size());
fwrite(cstring, 1, length, stdout);
write(fileno(stdout), cstring, length);

Others

Makefile or GCC options

`-O3` vs. `-Ofast`

Though little use in Lab1, -Ofast enables some non-standard-compliant options, eg. -ffast-math (enables unsafe floating-number operation)

`-march=native` (`-mtune=native`)

Also, little use in Lab1 (IO simply cannot use SIMD to speed up). -march=native select current chipset instruction set (eg. SSE AVX MMX) and likely break compatibility of this program on older chipsets

`-fwhole-program` `-flto`

Zero use in Lab1, since Lab1 is a single file program and don't apply to LTO (link time optimization (eg. inline functions in different transition units).

`-fprofile`

It's a bit like cheating, but the autograder accepts it.

-fprofile-generate generates gcda files and then if you pack your gcda file, and -fprofile-use -Wno-coverage-mismatch then compiler will profile the usage for you.

`-fprefetch-loop-arrays`

You can see it below at cache section.

Some others

-fomit-frame-pointer
-falign-functions=16 / -falign-loops=16
-fno-rtti

Arrange your switch

switch can derive 2 kinds of code. Naive if and jump table. If you place your label orderly, it's easy to form a jump table.

Using ternary operator?

cmov vs. cmp + jcc actually that's a argument about whether it would be quicker to write it into ternary form. linus-on-cmov

keyword: register and restrict

register hints the compiler to the value on register instead of stack. However, this is deprecated in Cpp17.

restrict is one qualifier which C has and Cpp not. It indicates that no other access through this pointer in this function body is valid, so the compiler can do more aggressive optimiziation. However, Cpp DON'T have it. (Rust always does it, only one mutable borrow exists)

cache, pipeline and inline asm?

Some optimization tricks on CPU architectures including using __builtin_expect (optimize branch prediction), __builtin_prefetch and adjusting the array size, expanding the loop, using things like Duff's device to speed up. I have little knowledge about that (Maybe after I finish my EECS370, ha). Also, It's possible, but hard to write better assembler code than compiler does.

...But always you can add -fprefetch-loop-arrays to let compiler does it. reference

Update: after profiling your code, you can add __builtin_expect for time-consuming branch condition codes. (eg. gperftools)

syscall (together with mmap)?

madvise/posix_fadvise

madvise(p, size, MADV_WILLNEED || MADV_SEQUENTIAL)

vector intrinistics

<emmintrin.h> and __m128 _mm_xxxxx

A live example I found here computers-are-fast

Luck on Autograder

Acutally I have something like this and my program have twice time

Warning: Your program used more system time than user time. This may be due to excessive I/O, overly frequent time measurement (via getrusage for example), or unnecessary system calls.

In my project1 with same program, R5S can have a chance to be 1/3 normal time.

So luck may be important. Wish you a good luck :)

Epilogue

It's EECS281: Data structure and Alg. So, FOCUS ON YOUR DS & ALG!

Airtnp/Time Optimization Tips.md

Time Optimization in EECS281 Lab1

1. Program

Avoid dynamic string

Avoid extra copy and using rvalue reference version functions

2. IO, especially Input

iostream (cin, cout)

scanf/printf

DIY using getchar/putchar

fgets/fputs/gets/puts

Buffering and fread/setvbuf

Specific file?

Too too large?

Why Output is a problem

Others

Makefile or GCC options

`-O3` vs. `-Ofast`

`-march=native` (`-mtune=native`)

`-fwhole-program` `-flto`

`-fprofile`

`-fprefetch-loop-arrays`

Some others

Arrange your switch

Using ternary operator?

keyword: register and restrict

cache, pipeline and inline asm?

syscall (together with mmap)?

vector intrinistics

Luck on Autograder

Epilogue

JasonQSY commented Sep 20, 2017

Airtnp commented Sep 20, 2017

Airtnp/Time Optimization Tips.md

Time Optimization in EECS281 Lab1

1. Program

Avoid dynamic string

Avoid extra copy and using rvalue reference version functions

2. IO, especially Input

iostream (cin, cout)

scanf/printf

DIY using getchar/putchar

fgets/fputs/gets/puts

Buffering and fread/setvbuf

Specific file?

Too too large?

Why Output is a problem

Others

Makefile or GCC options

-O3 vs. -Ofast

-march=native (-mtune=native)

-fwhole-program -flto

-fprofile

-fprefetch-loop-arrays

Some others

Arrange your switch

Using ternary operator?

keyword: register and restrict

cache, pipeline and inline asm?

syscall (together with mmap)?

vector intrinistics

Luck on Autograder

Epilogue

JasonQSY commented Sep 20, 2017

Airtnp commented Sep 20, 2017

`-O3` vs. `-Ofast`

`-march=native` (`-mtune=native`)

`-fwhole-program` `-flto`

`-fprofile`

`-fprefetch-loop-arrays`