Skip to content

Instantly share code, notes, and snippets.

@aheadley
Created March 10, 2015 22:32
Show Gist options
  • Save aheadley/aea63fac00e02ce222fd to your computer and use it in GitHub Desktop.
Save aheadley/aea63fac00e02ce222fd to your computer and use it in GitHub Desktop.
-----------------------------------------------------------------------------
Originally hosted at: http://mods.relicnews.com/misc/BIGSpec.shtml
-----------------------------------------------------------------------------
BIG FILE FORMAT SPECIFICATION -- RBF1.23 -- v0.8
(incomplete)
01/04/2000
by:
_!Lachesis_atatata!
pronounced:
shaka >yell< lachesis shaka atatata >yell<
(you may yell or scream at your capability. what you yell is irrelevant)
member of:
ICGOMA (International Cooperative Governance Organization of Meaningless Acronyms)
in conjunction with:
UCFEDD (Uraguayan Consortium of Ferrets with Erectile Dysfunction Disorder)
APDOFF (Agrarian Paramilitary Defense Organization of Feminist Farmers)
MOOOOO (MOOOOO)
-----------------------------------------------------------------------------
A. INTRODUCTION
-----------------------------------------------------------------------------
This specification describes the format of .BIG archives used in
Homeworld by Relic Entertainment Inc. a BIG archive is a container for
other files. instead of distributing hundreds of files separately, files are
"packed" together into one larger file.
at this time, it is not a complete specification. enough information has
been learned to extract all files from a BIG archive, but not enough to
be able to create one.
if information from this document is used for implementations or in
other documents, please credit me. it's just the nice thing to do.
i'll be haunting the message boards, if you have a question, post it. if
i don't reply i:
1. don't know the answer
2. am busy with something at some point on the globe
3. am dead
if there are errors on this document ... oops :)
-----------------------------------------------------------------------------
B. OVERALL LAYOUT
-----------------------------------------------------------------------------
there are three main areas to the BIG file format arranged
sequentially one after the other:
1. header
2. table of contents
3. name/data blobs
each is described in the sections below.
-----------------------------------------------------------------------------
B1. HEADER AREA
-----------------------------------------------------------------------------
the header area comes at the beginning of the file. it has the normal
elements you would expect to be there. there is only one header.
-------------------------------------------------------------------------
typedef unsigned long ulong_t;
typedef unsigned char ubyte_t;
const char BIG_FILE_ID[] = { 0x52, 0x42, 0x46, 0x31, 0x2E, 0x32, 0x33 };
// note: "RBF1.23"
struct header {
char magic_cookie[7];
ulong_t toc_size;
ulong_t header_unknown;
};
-------------------------------------------------------------------------
magic_cookie
identifier for the file. The only currently known value for this is
contained in BIG_FILE_ID;
toc_size
number of toc_entry structures (*see B2) that are contained in the
file. (note: toc stands for Table of Contents)
header_unknown
this field's purpose is unknown. it's value is always 1. does not seem
to play a role in anything yet.
-----------------------------------------------------------------------------
B2. TABLE OF CONTENTS AREA
-----------------------------------------------------------------------------
the table of contents area holds a list of toc_entry structures which
comes directly after the header. header.toc_size tells you how many
there are. for each file that is in the BIG archive, there is a toc_entry
structure.
-------------------------------------------------------------------------
struct toc_entry {
ulong_t crc_msb;
ulong_t crc_lsb;
ulong_t name_size;
ulong_t data_compressed_size;
ulong_t data_uncompressed_size;
ulong_t file_offset;
time_t timestamp;
ubyte_t toc_unknown[4];
};
-------------------------------------------------------------------------
crc_msb/crc_lsb
this is some sort of crc value. crc_msb is the most significant byte and
crc_lsb is the least significant byte of a larger 8-byte(64-bit) crc
number. exactly how it's generated and how to use it is unknown.
(*see crc note below)
name_size
the size(length) of the name in the name_data_blob (*see B3) that
this toc_entry refers to.
data_compressed_size
the size of the data in the name_data_blob (*see B3) in compressed
form.
data_uncompressed_size
the size of the data in the name_data_blob (*see B3) in uncompressed
form.
file_offset
the offset in bytes from the beginning of the BIG file to the
name_data_blob (*see B3) that this toc_entry refers to.
timestamp
the date/time of the entry.
toc_unknown
what part this element plays is unknown. what is known is that
toc_unknown[0] will be 1 if the data has been compressed and 0 if it
has not. (*see toc_unknown note below) toc_unknown[1..3] always
seems to equal {0xC9, 0xCA, 0xCB)
-------------------------------------------------------------------------
toc_unknown note:
sometimes the data is compressed, sometimes it's not. compression
can be determined by doing either:
1. data_compressed_size < data_uncompressed_size or ...
2. toc_unknown[0] == 1
I would personally suggest sticking to option 1 until toc_unknown is
fully understood.
-------------------------------------------------------------------------
crc note:
the entire table of contents seems to be sorted by the 8-byte crc
value. why is a complete mystery. the crc itself is most likely used as
a data validation mechanism. but to sort on it? it may be that the crc
is also being used as a mechanism to uniquely identify a file. i *think*
that there are no duplicate crc's.
-----------------------------------------------------------------------------
B3. NAME/DATA BLOB AREA
-----------------------------------------------------------------------------
the name/data blob area holds a list of name_data_blob structures
which comes directly after the table of contents area. a
name_data_blob holds the name and data of a file that exists in the
BIG archive. each name_data_blob is referred to by a toc_entry
structure which gives the sizes of the name and data fields.
-------------------------------------------------------------------------
struct name_data_blob {
char name[];
char data[];
};
quick note:
the structure above is not legal C. the fields are not fixed length and
i've used this notation because it's useful. use the standard techniques
of dynamic memory when implementing.
-------------------------------------------------------------------------
name
this is the file name of the original file. it's length is determined from
toc_entry.name_size + 1 (note: there IS a terminating null byte)
which this name_data_blob is being referred by. it also happens to be
encrypted. (*see encryption note below)
data
this is the actual file data. it's length is determined from the
toc_entry.data_compressed_size field which this name_data_blob is
being referred by. if the data is compressed, the algorithm used is
LZSS. (*see compression note below)
-------------------------------------------------------------------------
encryption note:
as if the world isn't evil enough as it is, some sick sick sick bastard >
my kinda programmer actually :) < decided to encrypt the file names
so that those lurking hacker types would get all confused and bleary-
eyed from trying to hack the format. luckily, i knew this silly
encryption trick ... i've used it on others myself. ;)
what's used is an XOR run. it's really simple, really fast, and has no
redeeming value (ie. compression, secure encryption) other than to
screw with people. here's the code for doing it.
void xor_run(char* buffer, ulong_t buffer_size)
{
char last_char;
ulong_t i;
last_char = (char)0xD5;
for (i = 0; i < buffer_size; i++)
{
last_char ^= buffer[i];
buffer[i] = last_char;
}
}
for those of you who don't catch on, that's both the encryption AND
decryption routine. this particular version de/encrypts "in-place" and
writes over the buffer you pass in.
as an implementation note, don't touch the terminating null when
decrypting. pass in toc_entry.name_size, not toc_entry.name_size + 1.
-------------------------------------------------------------------------
compression note:
compression (decompression) of data is done using the LZSS
algorithm. for this particular implementation:
> very basic LZSS (ie. no huffman a la ZIP)
> marker bit of 1 signals passthrough character
> marker bit of 0 signals dictionary entry
> dictionary entry is composed of 12-bit index and 4-bit length fields
don't know the LZSS algorithm? go to:
http://dogma.net/DataCompression/
if you look carefully, you will even find a clean LZSS implementation
in C up amongst the links that works perfectly. of course, since life
isn't fair in the least, i found it about an hour AFTER implementing it
myself. ;(
note: i'm not associated with the link above, or any link off of it. so
don't waste your time speculating.
-----------------------------------------------------------------------------
C. >gratzi<
-----------------------------------------------------------------------------
>gratzis< go out to Relic for coming up with Homeworld. very nice.
artsy >gratzi< goes to the cutscenes which, though simple, were very
effective. special >gratzi< for the music and the voice acting. though
not enough music :(
>gratzi< to the person who eventually creates me this ship:
Light Carrier
(to support guerilla tactics)
as compared to the standard carrier:
no space consuming construction facilities
a bit smaller
a bit lighter
a bit faster
a bit cheaper
more docking ports for faster docking of large wings
capacity for more fighters/corvettes
a bit faster fighter/corvette repair cycle
enough small guns to give a scout/interceptor wing a hard time
>!bye!<
-----------------------------------------------------------------------------
Originally provided in Relic's source: BIGaddendum.doc
-----------------------------------------------------------------------------
.BIG file specification addendum
By B1FF ( HYPERLINK "mailto:[email protected]" [email protected])
The article listed on RelicNews is pretty complete WRT the .BIG file format. It was really neat to download the program for viewing and extracting the contents of a bigfile. We will probably release our bigfile creation program but you will note that our version of the ‘extract’ command was never finished. Oh the pains of finalling!
The only thing that was not pick up on was the CRC’s of the bigfile. The CRC is an 8-byte CRC actually made up of 2 standard 32-bit CRC’s. Included is some sample code to create these CRC’s. I think I originally copied this code from Graphics Gem’s several games ago. It’s pretty standard. Make note of this algorithm. It is also used in the .CRC format.
udword CRCTable[] =
{
0x00000000,0x77073096,0xEE0E612C,0x990951BA,
0x076DC419,0x706AF48F,0xE963A535,0x9E6495A3,
0x0EDB8832,0x79DCB8A4,0xE0D5E91E,0x97D2D988,
0x09B64C2B,0x7EB17CBD,0xE7B82D07,0x90BF1D91,
0x1DB71064,0x6AB020F2,0xF3B97148,0x84BE41DE,
0x1ADAD47D,0x6DDDE4EB,0xF4D4B551,0x83D385C7,
0x136C9856,0x646BA8C0,0xFD62F97A,0x8A65C9EC,
0x14015C4F,0x63066CD9,0xFA0F3D63,0x8D080DF5,
0x3B6E20C8,0x4C69105E,0xD56041E4,0xA2677172,
0x3C03E4D1,0x4B04D447,0xD20D85FD,0xA50AB56B,
0x35B5A8FA,0x42B2986C,0xDBBBC9D6,0xACBCF940,
0x32D86CE3,0x45DF5C75,0xDCD60DCF,0xABD13D59,
0x26D930AC,0x51DE003A,0xC8D75180,0xBFD06116,
0x21B4F4B5,0x56B3C423,0xCFBA9599,0xB8BDA50F,
0x2802B89E,0x5F058808,0xC60CD9B2,0xB10BE924,
0x2F6F7C87,0x58684C11,0xC1611DAB,0xB6662D3D,
0x76DC4190,0x01DB7106,0x98D220BC,0xEFD5102A,
0x71B18589,0x06B6B51F,0x9FBFE4A5,0xE8B8D433,
0x7807C9A2,0x0F00F934,0x9609A88E,0xE10E9818,
0x7F6A0DBB,0x086D3D2D,0x91646C97,0xE6635C01,
0x6B6B51F4,0x1C6C6162,0x856530D8,0xF262004E,
0x6C0695ED,0x1B01A57B,0x8208F4C1,0xF50FC457,
0x65B0D9C6,0x12B7E950,0x8BBEB8EA,0xFCB9887C,
0x62DD1DDF,0x15DA2D49,0x8CD37CF3,0xFBD44C65,
0x4DB26158,0x3AB551CE,0xA3BC0074,0xD4BB30E2,
0x4ADFA541,0x3DD895D7,0xA4D1C46D,0xD3D6F4FB,
0x4369E96A,0x346ED9FC,0xAD678846,0xDA60B8D0,
0x44042D73,0x33031DE5,0xAA0A4C5F,0xDD0D7CC9,
0x5005713C,0x270241AA,0xBE0B1010,0xC90C2086,
0x5768B525,0x206F85B3,0xB966D409,0xCE61E49F,
0x5EDEF90E,0x29D9C998,0xB0D09822,0xC7D7A8B4,
0x59B33D17,0x2EB40D81,0xB7BD5C3B,0xC0BA6CAD,
0xEDB88320,0x9ABFB3B6,0x03B6E20C,0x74B1D29A,
0xEAD54739,0x9DD277AF,0x04DB2615,0x73DC1683,
0xE3630B12,0x94643B84,0x0D6D6A3E,0x7A6A5AA8,
0xE40ECF0B,0x9309FF9D,0x0A00AE27,0x7D079EB1,
0xF00F9344,0x8708A3D2,0x1E01F268,0x6906C2FE,
0xF762575D,0x806567CB,0x196C3671,0x6E6B06E7,
0xFED41B76,0x89D32BE0,0x10DA7A5A,0x67DD4ACC,
0xF9B9DF6F,0x8EBEEFF9,0x17B7BE43,0x60B08ED5,
0xD6D6A3E8,0xA1D1937E,0x38D8C2C4,0x4FDFF252,
0xD1BB67F1,0xA6BC5767,0x3FB506DD,0x48B2364B,
0xD80D2BDA,0xAF0A1B4C,0x36034AF6,0x41047A60,
0xDF60EFC3,0xA867DF55,0x316E8EEF,0x4669BE79,
0xCB61B38C,0xBC66831A,0x256FD2A0,0x5268E236,
0xCC0C7795,0xBB0B4703,0x220216B9,0x5505262F,
0xC5BA3BBE,0xB2BD0B28,0x2BB45A92,0x5CB36A04,
0xC2D7FFA7,0xB5D0CF31,0x2CD99E8B,0x5BDEAE1D,
0x9B64C2B0,0xEC63F226,0x756AA39C,0x026D930A,
0x9C0906A9,0xEB0E363F,0x72076785,0x05005713,
0x95BF4A82,0xE2B87A14,0x7BB12BAE,0x0CB61B38,
0x92D28E9B,0xE5D5BE0D,0x7CDCEFB7,0x0BDBDF21,
0x86D3D2D4,0xF1D4E242,0x68DDB3F8,0x1FDA836E,
0x81BE16CD,0xF6B9265B,0x6FB077E1,0x18B74777,
0x88085AE6,0xFF0F6A70,0x66063BCA,0x11010B5C,
0x8F659EFF,0xF862AE69,0x616BFFD3,0x166CCF45,
0xA00AE278,0xD70DD2EE,0x4E048354,0x3903B3C2,
0xA7672661,0xD06016F7,0x4969474D,0x3E6E77DB,
0xAED16A4A,0xD9D65ADC,0x40DF0B66,0x37D83BF0,
0xA9BCAE53,0xDEBB9EC5,0x47B2CF7F,0x30B5FFE9,
0xBDBDF21C,0xCABAC28A,0x53B39330,0x24B4A3A6,
0xBAD03605,0xCDD70693,0x54DE5729,0x23D967BF,
0xB3667A2E,0xC4614AB8,0x5D681B02,0x2A6F2B94,
0xB40BBE37,0xC30C8EA1,0x5A05DF1B,0x2D02EF8D,
};
/*=============================================================================
Functions:
=============================================================================*/
/*-----------------------------------------------------------------------------
Name : crc32Compute
Description : Compute a 32-bit CRC
Inputs :
Outputs :
Return :
----------------------------------------------------------------------------*/
crc32 crc32Compute(ubyte *packet, udword length)
{
udword index, tableIndex;
crc32 crc;
crc = 0xffffffff;
for (index = 0; index < length; index++)
{
tableIndex = (crc ^ *(packet++)) & 0x000000FF;
crc = ((crc >> 8) & 0x00FFFFFF) ^ CRCTable[tableIndex];
}
return(~crc);
}
The first CRC is the first half of the file name and the second CRC is the second half of the CRC. Why do such a silly scheme? It makes it easy to sort the TOC by CRC and do a binary search for a filename. This makes for faster lookups. All file requests in our file layer are resolved from the text name to an 8-byte CRC.
As for some unknown data members, the header_unknown member you refer to is always 1. A bit redundant? Yes. The toc_unknown[1..3] can be ignored. They’re padding that is cleared to something by the compiler.
All indexes, offsets, and counts are little-endian and require
conversion for the Mac PowerPC architecture, but are ok as-is on the
Intel platform. Items that are numeric and described as 4 bytes are of
type uint32_t. Items that are numeric and described as 2 bytes are of
type uint16_t.
Overall format is:
Archive Header
Section Header describing the four sections immediately
following the Archive Header (TOC List,
Folder List, File Info List, and File Name List)
TOC (Table of Contents) List
Folder List
File Info List
File Name List
File Data for all the files (including the 264 byte header
preceeding the file data of each file)
The format of each of the above is:
180 byte archive header
8 bytes of "_ARCHIVE"
4 bytes version
16 bytes for MD5 tool signature of archive (MD5 of tool security
key and full file data excluding the archive header)
128 bytes for 64 utf16 chars for archive name
16 bytes for MD5 signature of archive (MD5 of HW2 Root Security Key
and archive header data)
4 bytes section header size
4 bytes exact file data offset
24 byte section header consisting of four 6 byte sections.
Each 6 byte section has:
4 byte offset relative to archive header
2 byte count
The four sections are:
TOC List (describes each TOC entry, that is, each folder hierarchy)
Folder List (describes the folder hierarchy for each TOC)
File Info List (describes each file)
File Name List (the list of file names, including folder names)
TOC list entry (138 bytes)
64 character alias name
64 character name
2 byte first folder index
2 byte last folder index
2 byte first filename index
2 byte last filename index
2 byte start folder index for hierarchy
Folder list entry (12 bytes)
4 bytes file name offset (relative to file name list offset)
2 bytes first subfolder index
2 bytes last subfolder index
2 bytes first filename index
2 bytes last filename index
File info list entry (17 bytes)
4 bytes file name offset (relative to file name list offset)
1 byte flags (0x00 if uncompressed
0x10 to decompress during read -- used for large files
0x20 to decompress all at once -- used for small files, like .lua files)
4 bytes file data offset (relative to overall file data offset)
4 bytes compressed length
4 bytes decompressed length
File header preceding file data for each file (264 bytes)
256 chars for file name
4 bytes file modification date
4 bytes CRC of uncompressed file data.
Note that the file data offset in the file info list entry indicates the
location of the file data. In order to access the file header
preceeding the file data you must subtract 264 from the offset.
The HW2 Root Security Key is an ASCII string that is passed first to
the MD5 algorithm followed by the archive header data to create the
archive's 128 bit (16 byte) MD5 signature. The MD5 algorithm used is
standard. The Root Security Key is embedded in the HW2 application
and also in Relic's archive tool.
The tool security key is an ASCII string that is passed first to the MD5
algorithm followed by the full data in the archive excluding the archive
header to create the archive's 128 bit (16 byte) MD5 tool signature.
The MD5 algorithm is standard. The tool security key is embedded in
Relic's archive tool.
The file modification date appears to be the number of seconds
since UTC 00:00:00 January 1st, 1970. This date is the Unix epoch,
although it is unknown to the author of this document if that is also
the Windows epoch.
The CRC algorithm used to calculate the uncompressed file data CRC
is the exact same algorithm used for Homeworld. Apparently the algorithm
and table are taken from the 32-Bit CRC International Standard,
which is based on a particular mathematical formula. Thus there shouldn't
be any concerns over copyright in this case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment