@raysan5
Last active October 5, 2024 16:17
10x Optimizations! An installer creation adventure!

Background: creating an installer tool

Lately I've been working on rInstallFriendly v2.0, my simple and easy-to-use tool to create fancy software installers.

rInstallFriendly v2.0, some of its multiple visual styles available. Style is fully customizable!

Unlike other installer-creation tools, rInstallFriendly is designed to make the installation process enjoyable for users and useful for developers, so it supports a special banner with multiple unique features. The banner can be used as an advertising panel to showcase the product, and it can also display a playable game or just show some shiny interactive visuals.

Many users still keep staring at the installer progress bar while waiting for the software to be installed. rInstallFriendly is intended for those users, to make their first impression of the software an unforgettable experience.

One of the main features of rInstallFriendly is the interactive banner displayed while the software is being installed. This banner can be a still image or an animated GIF, but also an interactive shader or even a small game.

The first version of the tool was single-threaded so, to keep the game running while the files were decompressed, the solution implemented was splitting the decompression process across frames: decompressing just a number of files every frame and hopefully leaving enough frame time to execute the game logic and draw the frame. Although this approach could seem inappropriate, it actually worked really well, running at 60 frames per second.
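As a rough sketch of that per-frame approach: the function below extracts a bounded number of entries and then returns, so the rest of the frame is left for game logic and drawing. All names here (InstallState, ExtractFileFn, InstallStep) are hypothetical and only illustrate the idea; this is not the actual rInstallFriendly code.

```c
#include <stddef.h>

// Hypothetical installer state (illustrative names, not rInstallFriendly's real data)
typedef struct {
    int fileCount;      // Total number of entries in the package
    int nextFile;       // Index of the next entry to extract
} InstallState;

// Hypothetical per-file callback: decompress entry `index` and write it to disk
typedef void (*ExtractFileFn)(int index, void *userData);

// Extract at most `filesPerFrame` entries, then return so the frame can continue
// Returns 1 when the whole package has been processed, 0 otherwise
// NOTE: `extract` can be NULL for a dry run that only advances the counter
int InstallStep(InstallState *state, int filesPerFrame, ExtractFileFn extract, void *userData)
{
    for (int i = 0; (i < filesPerFrame) && (state->nextFile < state->fileCount); i++)
    {
        if (extract != NULL) extract(state->nextFile, userData);
        state->nextFile++;
    }

    return (state->nextFile >= state->fileCount);
}
```

The main loop would call InstallStep() once per frame, before updating and drawing the banner. Note the built-in ceiling: at 60 frames per second and `filesPerFrame` entries per frame, installing N files cannot take less than N/(60*filesPerFrame) seconds, no matter how fast the disk is.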

But there were some problems with this approach:

  • Game stuttering: If the files to decompress were fairly big (>50MB), some stuttering could be noticed in the playable game.
  • Installation time: Installation time was actually bounded by framerate and the number of files to be installed.

So, for the next version, rInstallFriendly v2.0, I decided to address those issues.

The first solution to address the framerate drop was multi-threading: moving the installation process to a second thread while the main thread keeps drawing the game banner at a stable 60 fps.

I used the excellent thread.h library for that task, and it worked like a charm.
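A minimal sketch of that split, under some loud assumptions: it uses POSIX threads and C11 atomics as a stand-in (not thread.h, whose API is not reproduced here), and InstallJob, StartInstall and GetInstallProgress are hypothetical names. The worker thread does the extraction and publishes a counter; the main thread only reads that counter to draw the progress bar.

```c
#include <pthread.h>
#include <stdatomic.h>

// Hypothetical job shared between the main thread and the install thread
typedef struct {
    int fileCount;          // Total entries to install (set before starting)
    atomic_int filesDone;   // Written by the worker, read by the main thread
    atomic_int cancel;      // Set by the main thread to request an early stop
} InstallJob;

// Worker thread: processes every entry unless cancellation is requested
static void *InstallWorker(void *arg)
{
    InstallJob *job = (InstallJob *)arg;

    for (int i = 0; (i < job->fileCount) && !atomic_load(&job->cancel); i++)
    {
        // Placeholder: decompress entry i and write it to disk here
        atomic_fetch_add(&job->filesDone, 1);
    }

    return NULL;
}

// Start the installation on a second thread, returns 0 on success
int StartInstall(InstallJob *job, pthread_t *thread)
{
    atomic_store(&job->filesDone, 0);
    atomic_store(&job->cancel, 0);

    return pthread_create(thread, NULL, InstallWorker, job);
}

// Called from the main thread every frame, returns progress in range [0..1]
float GetInstallProgress(InstallJob *job)
{
    if (job->fileCount <= 0) return 1.0f;

    return (float)atomic_load(&job->filesDone)/(float)job->fileCount;
}
```

The main loop never blocks on the worker: it polls GetInstallProgress() each frame and joins the thread only once the progress reaches 1.0 (or after setting `cancel`).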

But after moving the decompression of files to a second thread, I got a surprise:

Installation time was not reduced as I was expecting; it was actually almost the same as with the previous implementation! 😕

Process investigation

My testbed is the raylib Windows Installer package, including w64devkit, Notepad++ and raylib (library with sources and examples), a total of ~7000 files, mostly small files, packaged into a ~120MB zip file, using level 8 deflate compression.

My test hardware: MSI Laptop, CPU: [email protected] (8 logical processors), 16GB RAM, NVMe SSD 1TB

rInstallFriendly basically decompresses the provided .zip file, using the miniz library, so, I started trying other decompression options for comparison:

  • Windows extractor: 4min : 02sec.
  • rInstallFriendly: 1min : 31sec.
  • 7-Zip: 6sec. !!!

I was not surprised by the Windows extractor but I was very surprised by the 7-Zip results. WHY SO FAST?!?! I needed to understand the reason because it looked like magic to me and I like to know magic tricks!

I tried a quick Google search but unfortunately I couldn't find a clear answer, so I asked on Twitter/X; I know many great programmers follow me, so I expected someone could provide some answers.

Multiple answers pointed to multi-threading, to using multiple threads for the decompression process, but I had my doubts about it, simply because I had already moved decompression to a second thread and the numbers were almost the same as with the original implementation. Still, I tried to find out if 7-Zip was actually using multi-threading for decompression and if I could force the process to run on a single thread to verify the times. Unfortunately I couldn't find that info, and the 7z.exe command line doesn't seem to support a multi-core decompression config parameter either (only for compression).

Other replies on Twitter mentioned the cost of writing files to disk, so I did a quick test: decompressing the files in memory but not writing them to disk... and VOILÀ! Decompressing the 7000+ files in memory on a single thread only required 3.6 seconds! So, the bottleneck was clearly the disk file writing!

After further replies and some investigation I found that Windows provides WriteFile() and WriteFileEx() that, as per my understanding, operate at kernel level and are faster than the libc-provided fwrite(), so I decided to try that route. WriteFile() is intended for synchronous writes while WriteFileEx() is intended for asynchronous file writing! That seemed to be the solution!

As usual with the Win32 API, the documentation is quite dense and sometimes confusing, and it's difficult to find specific examples for specific use cases. Still, I managed to code a quick implementation with WriteFileEx() for my use case, despite not clearly understanding some of the provided parameters... BIG ERROR! Never do that: all parameters provided to a function should be clearly understood!

This was the solution I quickly implemented:

// Write file result callback?
VOID WINAPI WinWriteFileCallback(DWORD dwErrorCode, DWORD dwBytesTransferred, LPOVERLAPPED lpOverlapped)
{
    if (dwErrorCode != 0) printf("WARNING: CompletionRoutine: Unable to write to file! Error: %u, AddrOverlapped: %p\n", dwErrorCode, lpOverlapped);
    else printf("CompletionRoutine: Transferred: %u Bytes, AddrOverlapped: %p\n", dwBytesTransferred, lpOverlapped);
}

// Write file to disk
int WinWriteFile(char *filePath, char *buffer, int bufferSize)
{
    BOOL errorFlag = FALSE;
    OVERLAPPED overlap = { 0 };
    
    HANDLE fileHandle = CreateFileA(filePath, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);
    if (fileHandle == INVALID_HANDLE_VALUE)
    {
        printf("WARNING: Could not create file: %s! ERROR: %u\n", filePath, GetLastError());
        return -1;
    }

    errorFlag = WriteFileEx(fileHandle, buffer, bufferSize, &overlap, (LPOVERLAPPED_COMPLETION_ROUTINE)WinWriteFileCallback);
    if (errorFlag == 0) printf("WARNING: Unable to write to file! Error %u\n", GetLastError());

    CloseHandle(fileHandle);
    return 0;
}

"Amazing" results...

On my first test, installation time was reduced to 5.8 seconds! Wow! That was fast! And running on a single thread (although the file writing could be happening on multiple underlying threads, due to its asynchronous nature)!

After the initial excitement for the huge time improvement I was about to close Visual Studio and call it a day... when I noticed a "small" issue... I compared the original package with the installed one and saw some files (1-2 files) were not correctly installed, they were created with a size of 0 bytes.

My first reaction was fast: it must be related to re-using the same memory buffer for the decompression of every file. Considering the async nature of WriteFileEx(), I was modifying the data to be written for every file decompressed before the actual writing was happening (or that was my thought)! So, I just did a quick test to verify it: using one separate buffer for every file to decompress... and at that point the program started randomly crashing!

I hate these situations where you seem to be so close to a great solution but some apparently "simple" thing stops working as expected and everything crashes... Long story short, after several hours of investigation and debugging I realized that some files in the .zip could actually be 0 bytes in size; I needed to consider that case when allocating multiple buffers, and not only the possibility of file entries that are actually directories. (Bonus info: the creation of directory entries depends on the .zip creator software; different software treats them in different ways!)

So, after addressing those issues I tested again and there were no more random crashes... but it was still failing randomly, with some files created as 0 bytes (obviously, files that were not supposed to be 0 bytes). And again, some hours trying to find out the problem...

I was convinced that the issue had to be in the async process but, after some hours trying to find a solution (and being a bit fixated on it), I randomly tried replacing WriteFileEx() with the synchronous WriteFile() alternative. To my surprise, installation took 5.6 seconds!!! And all the files were installed successfully!
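A minimal sketch of what such a synchronous variant could look like, assuming a helper similar to WinWriteFile() above. WinWriteFileSync is a hypothetical name and this is not the actual rInstallFriendly code; the whole block is guarded with `_WIN32` because it only builds on Windows.

```c
#ifdef _WIN32
#include <windows.h>
#include <stdio.h>

// Write a whole buffer to a new file, synchronously
// No FILE_FLAG_OVERLAPPED, no OVERLAPPED struct, no completion callback
// Returns 0 on success, -1 on failure
int WinWriteFileSync(const char *filePath, const char *buffer, DWORD bufferSize)
{
    int result = 0;

    HANDLE fileHandle = CreateFileA(filePath, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (fileHandle == INVALID_HANDLE_VALUE)
    {
        printf("WARNING: Could not create file: %s! ERROR: %lu\n", filePath, GetLastError());
        return -1;
    }

    // Zero-byte entries are valid: the file is created and nothing is written
    if (bufferSize > 0)
    {
        DWORD bytesWritten = 0;
        BOOL success = WriteFile(fileHandle, buffer, bufferSize, &bytesWritten, NULL);

        if (!success || (bytesWritten != bufferSize))
        {
            printf("WARNING: Unable to write to file: %s! ERROR: %lu\n", filePath, GetLastError());
            result = -1;
        }
    }

    // The call above has already returned, so buffer and handle are safe to release
    CloseHandle(fileHandle);
    return result;
}
#endif
```

Because WriteFile() only returns once the data has been handed to the system, the caller can reuse the same decompression buffer for the next file right away.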

After multiple tests, everything worked, but I noticed the installation time could vary between 5 and 8 seconds depending on multiple factors (restarted computer, open programs, build mode...). It's still quite impressive for single-threaded decompression and synchronous disk file writing!


Comparison of installation speed for rInstallFriendly v1.0 vs rInstallFriendly v2.0. 10x Optimization! Notice the stuttering happening on the first image while playing the game. No time to play on the second image!


I've been further investigating the possible issues with WriteFileEx() and I found this remark in the docs:

A common mistake is to reuse an OVERLAPPED structure before the previous asynchronous operation has been completed. You should use a separate structure for each request. You should also create an event object for each thread that processes data. If you store the event handles in an array, you could easily wait for all events to be signaled using the WaitForMultipleObjects function.

That was probably my issue with WriteFileEx(): I was creating an OVERLAPPED variable per file, but local to my custom WinWriteFile() function, so it was discarded when going out of scope, probably before the async operation completed.
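One possible way to respect that remark, as a hedged sketch only: allocate a small context per write on the heap, keep the OVERLAPPED inside it, and release everything from the completion routine. AsyncWrite, QueueFileWrite and WaitForFileWrites are hypothetical names; the sketch assumes all writes are issued and waited on from the same thread, and it is guarded with `_WIN32` because it is Windows-only.

```c
#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

// Per-request context: it lives on the heap until the completion routine runs
typedef struct {
    OVERLAPPED overlapped;  // First member, so LPOVERLAPPED can be cast back to AsyncWrite
    HANDLE fileHandle;      // Closed only after the write has completed
    void *data;             // Buffer owned by the request, freed on completion
} AsyncWrite;

// Completion routines run on the issuing thread, so plain ints are enough here
static int pendingWrites = 0;
static int failedWrites = 0;

static VOID CALLBACK OnWriteDone(DWORD errorCode, DWORD bytesTransferred, LPOVERLAPPED overlapped)
{
    AsyncWrite *request = (AsyncWrite *)overlapped;
    (void)bytesTransferred;

    if (errorCode != 0) failedWrites++;

    CloseHandle(request->fileHandle);
    free(request->data);
    free(request);
    pendingWrites--;
}

// Queue an async write, taking ownership of `data` (must come from malloc())
// Returns 0 if the write was queued (or the file is empty), -1 on failure
int QueueFileWrite(const char *filePath, void *data, DWORD size)
{
    AsyncWrite *request = calloc(1, sizeof(AsyncWrite));
    if (request == NULL) { free(data); return -1; }

    request->data = data;
    request->fileHandle = CreateFileA(filePath, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);
    if (request->fileHandle == INVALID_HANDLE_VALUE) { free(data); free(request); return -1; }

    // Zero-byte entries: the empty file already exists, nothing to queue
    int queued = (size > 0) && WriteFileEx(request->fileHandle, data, size, &request->overlapped, OnWriteDone);
    if (queued) { pendingWrites++; return 0; }

    CloseHandle(request->fileHandle);
    free(data);
    free(request);
    return (size == 0)? 0 : -1;
}

// Enter alertable waits until every queued completion routine has run
// Returns the number of writes that reported an error
int WaitForFileWrites(void)
{
    while (pendingWrites > 0) SleepEx(INFINITE, TRUE);

    return failedWrites;
}
#endif
```

Two details matter here. The completion routine is only called while the issuing thread is in an alertable wait, which is what SleepEx(INFINITE, TRUE) provides. Also, Windows may still complete some overlapped writes synchronously (writes that extend a file are a documented case), so the real gain over plain WriteFile() would need to be measured.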

Possible improvements

  • Use multiple threads for decompression: For my test cases, decompression was not a bottleneck; actually, decompression is really fast! Still, when processing so many files, the work can probably be divided between several threads. My concern is the .zip file access and the thread synchronization, especially for the rollback case (when the user cancels the installation and the installed files are removed).
  • Use WriteFileEx() properly: Every call to the function should use its own OVERLAPPED structure, but the async processes should be carefully synchronized, detecting when a file-writing process has ended and only at that moment freeing the file memory. It seems to require a more complex implementation than the current one.
  • Use memory mapped files: I got that recommendation from a highly experienced developer: instead of using WriteFileEx(), use CreateFileMapping(). Knowing the required file size, just create the memory mapping and write to memory as usual; the OS should automatically take care of writing that memory to disk asynchronously, usually in a very efficient way. Undoubtedly, it's worth a try!
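For the memory mapped files option, a rough sketch could look like the following. WinWriteFileMapped is a hypothetical name, error reporting is reduced to a return code, and the block is guarded with `_WIN32` because it is Windows-only. It copies an existing buffer into the mapping to keep the example short; in a real installer the decompressor could write straight into the mapped view.

```c
#ifdef _WIN32
#include <windows.h>
#include <string.h>

// Create a file of `size` bytes and fill it through a memory mapping
// Returns 0 on success, -1 on failure
int WinWriteFileMapped(const char *filePath, const void *data, DWORD size)
{
    int result = -1;

    // A PAGE_READWRITE mapping requires both read and write access on the file
    HANDLE file = CreateFileA(filePath, GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return -1;

    if (size == 0) result = 0;  // A zero-length file can not be mapped, and it already exists
    else
    {
        // Creating the mapping with an explicit size extends the file to `size` bytes
        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, size, NULL);
        if (mapping != NULL)
        {
            void *view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, size);
            if (view != NULL)
            {
                memcpy(view, data, size);   // The OS flushes dirty pages to disk later
                UnmapViewOfFile(view);
                result = 0;
            }
            CloseHandle(mapping);
        }
    }

    CloseHandle(file);
    return result;
}
#endif
```

Note that this sketch has no way to report a failed disk write after the memcpy(), which is the trade-off of this route: I/O errors on a mapped view do not come back as return codes.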

Plot twist: minimum installation time

One of the issues I detected with rInstallFriendly v1.0 was that, in the case of small software packages, installation was too fast and the banner/ad/game was not displayed long enough for the users to notice/enjoy it.

Solution: adding a configurable minimum install time, so developers can set a minimum time to display the banner or game, independently of the installation speed.
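One simple way such an option could work, as a sketch with hypothetical names (not the actual rInstallFriendly logic): the progress shown to the user is the real progress capped by the fraction of the minimum time that has elapsed, and the installer only reports completion when both have reached 100%.

```c
// Progress shown to the user, in range [0..1]: the real installation progress,
// capped by how much of the minimum install time has already elapsed
float GetDisplayedProgress(float realProgress, float elapsedSeconds, float minInstallSeconds)
{
    float timeProgress = 1.0f;  // No minimum time configured: time never holds progress back

    if (minInstallSeconds > 0.0f) timeProgress = elapsedSeconds/minInstallSeconds;
    if (timeProgress > 1.0f) timeProgress = 1.0f;

    return (realProgress < timeProgress)? realProgress : timeProgress;
}

// Installation is reported as finished only when the files are installed
// AND the minimum display time has passed
int IsInstallFinished(float realProgress, float elapsedSeconds, float minInstallSeconds)
{
    return (GetDisplayedProgress(realProgress, elapsedSeconds, minInstallSeconds) >= 1.0f);
}
```

With a 20 second minimum, a package whose files are all installed after 5 seconds would show 25% at that moment and keep advancing with the clock until the 20 seconds have passed.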

How ironic! I improved installation time by an order of magnitude but I also added an option to slow down the installation when it is too fast... the fun of programming!

rInstallFriendly v2.0 minimum installation time option and result installing the package that previously took ~6 seconds. Now users can enjoy the installer game... and without stuttering!


Thanks for reading! Feel free to comment or ask me in this gist thread!

@JodiTheTigger

Beware using CreateFileMapping for memory mapped file IO. The issue is when there are errors. These will be raised as SEH exceptions for the entire thread. Mapped files are great when they work, but it's an all-or-nothing approach if there is any file IO issue.

(if I'm wrong, I would be v.happy to find out how to manage errors when doing memory mapped file IO)

@mmozeiko
mmozeiko commented Sep 13, 2024

My comments:

  1. WriteFile[Ex] is not faster than fwrite. In fact it can be slower if you call with many tiny amounts of data. Because WriteFile[Ex] will do syscall each time, but fwrite will do buffering internally. In such cases fwrite will be faster! But as long as you write decent sized buffers, both - fwrite and WriteFile - will work with same speed. Any CRT overhead in fwrite is tiny compared to actual I/O cost. Same for reads with fread/ReadFile.

  2. WriteFileEx is not needed for async writes. You can use just regular WriteFile for that - all you need to do is pass OVERLAPPED structure to it. With same caveats of course - each write needs its own OVERLAPPED struct, and keeping buffer alive during write operation. Typical way is to combine it with I/O Completion Port handle. This way you can dequeue finished writes off the completion port without querying each OVERLAPPED event handle manually.

  3. Instead of guessing where time is spent it is better to use profiler to see what code is running. Visual Studio comes with decent sampling profiler: Debug -> Performance Profiler -> CPU Usage. It will show in what functions code is spending most of the time, allowing you to figure out what exactly your code is doing.
