Created
December 5, 2011 18:33
-
-
Save valyala/1434692 to your computer and use it in GitHub Desktop.
Word Chains solution
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// This is a solution for the Word Chains problem described | |
// at http://socialcam.com/jobs/problems . | |
// | |
// The problem. | |
// | |
// Two words are connected by a word chain if it is possible to change one into | |
// the other by making a series of single-character changes, such that every | |
// intermediate form is also a word. For example, CAT and DOG are connected with | |
// a word chain because CAT, COT, COG and DOG are all words. DEMONIC and | |
// UMBRELLA are not. | |
// | |
// Write a program that takes a list of words (for example /usr/share/dict/words | |
// on a unix system) and then reads pairs of words on stdin and prints 'YES' if | |
// the words are connected by a chain, and 'NO' if they are not. The program | |
// should take the path to the word list as a command-line argument, and should | |
// then loop, reading pairs of whitespace-delimited words on stdin and printing | |
// 'YES' or 'NO.' | |
// | |
// Only one operation is allowed between words in the chain. That operation may | |
// consist of changing any single character, deleting any single character | |
// or inserting any single character at any position. All comparisons should be | |
// case insensitive. | |
// | |
// | |
// The solution. | |
// | |
// The core concept of the solution is subword - a tuple of a word's substring | |
// and a character position. Word's substring is created from a word by deleting | |
// a character at the given position. It is possible creating L distinct | |
// subwords from a word with length L. Word's substring is always one character | |
// shorter than the corresponding word. Distinct subwords can have equal | |
// substrings if they differ by character position. | |
// For example, ('oo', 0), ('fo', 1), ('fo', 2) are all possible subwords | |
// for the word 'foo'. | |
// | |
// If two words have a common subword, then one word can be converted into | |
// another word by character substitution at the given position. | |
// If a word matches a subword's substring, then this word can be converted | |
// into subword's word by character deletion at the given position. Character | |
// deletion is equivalent to character insertion for the matching word. | |
// Obviously subwords can be used for building all possible one-character | |
// transformations for dictionary words irregardless of alphabet used in words. | |
// | |
// The solution is split into two phases: | |
// | |
// 1. Dictionary words preprocessing. | |
// 2. Input words lookup. | |
// | |
// The first phase has the following steps: | |
// | |
// 1. Read all dictionary words into memory. | |
// This step has O(W) computational complexity and O(S) space complexity, | |
// where: | |
// - W is the total number of dictionary words; | |
// - S is the total number of chars in all dictionary words. | |
// | |
// 2. For each word from the dictionary create all possible subwords. If word's | |
// subword matches already existing subword, then connect the corresponding | |
// words. | |
// This step has O(S) computational complexity and O(S*E1) space coplexity, | |
// where: | |
// - E1 is the average number of connections found per each subword. | |
// | |
// 3. For each word from the dictionary find all subwords with substrings | |
// matching the given word. Connect the word to found subwords' words. | |
// This step has O(W) computational complexity and O(W*E2) space complexity, | |
// where: | |
// - E2 is the average number of connections found per each word. | |
// | |
// 4. Step 3 creates disconnected set of connected graphs from words. | |
// Each connected graph in this set represents all the words, which can be | |
// transformed from each other. If words belong to distinct connected graphs, | |
// then the transformation is impossible. Let's assign distinct colors | |
// for each connected graph and brush each word in each graph with | |
// the corresponding color. | |
// This step has O(W) computational complexity and O(W) space complexity. | |
// | |
// 5. Create a lookup table, which maps from word to color. | |
// This step has O(W) computational complexity and O(S) space complexity. | |
// Subwords can be deleted after this step. | |
// | |
// The second phase consists of one step: | |
// | |
// 1. For each word pair verify whether two given words belong to the same | |
// graph by comparing their colors obtained from the lookup table. | |
// This step has O(P) computational complexity and O(1) space complexity, | |
// where: | |
// - P is the total number of word pairs. | |
// | |
// For words from natural languages E1 and E2 can be considered constants. | |
// In this case computational complexity of the first phase is O(S). | |
// Storage required per each word from a natural language can be considered | |
// constant, so the solution requires O(S) space. | |
// | |
// It is worth noting algorithm's computational complexity doesn't depend | |
// on the size of alphabet used in dictionary words. I.e. words containing | |
// arbitrary unicode characters will be processed with the same speed as words | |
// containing only Latin chars. (Though uncode support is missing | |
// in the current implementation). | |
// | |
// This program reads 99K words from /usr/share/dict/words in 0.24 seconds | |
// on my laptop with the speed of ~413K words/s. Steps 1-4 require 16Mb of RAM | |
// (162 bytes per word) on 64-bit build. 32-bit build consumes 15Mb of RAM | |
// (152 bytes per word). | |
// Step 5 requires only a few Mb of RAM for storing a mapping between dictionary | |
// words and colors. | |
// After the dictionary load complete, it processes 50K word pairs from the same | |
// /usr/share/dict/words file in 0.09 seconds (i.e. ~556K word pairs/s). | |
// | |
// The program uses the following optimization tricks, which help increasing | |
// performance and decreasing memory usage: | |
// | |
// - Custom optimized containers are used instead of generic STL containers. | |
// - Memory for already known quantity of structures is allocated in bulk. | |
// - Small integer types are used in structs if possible instead of generic wide | |
// types. | |
// - Pointers are substituted by offsets from base address. Offsets have smaller | |
// sizes comparing to pointers (especially on x64 builds). | |
// - Access patterns to memory in loops is close to linear. | |
// - Unions are used for packing mutually exclusive objects into the same | |
// location. | |
// - Objects are stored by value in data containers. This saves memory | |
// by eliminating a pointer to value from data container and eliminates | |
// a memory dereference during access to such values. | |
// - Unneeded memory copies are avoided where possible. | |
// - Pragma pack() directive is used for tight structures' packing. | |
// | |
// Author: Aliaksandr Valialkin <[email protected]>. | |
#include <cassert> // for assert() | |
#include <cctype> // for tolower(), isspace() | |
#include <cstddef> // for size_t | |
#include <cstdlib> // for exit(), EXIT_SUCCESS/EXIT_FAILURE | |
#include <fstream> // for ifstream | |
#include <iostream> // for cin/cout/cerr | |
#include <istream> | |
#include <ostream> | |
#include <stdint.h> // for uint*_t | |
// Denies copy constructor and assignment operator for the given class. | |
// These operations are implicitly implemented by the compiler unless | |
// explicilty declared. Sometimes implicit implementation of these operations | |
// won't match your needs. For example, if a class contains pointers to objects | |
// it owns, implicit copying will lead to use after free errors. | |
// | |
// This macro MUST be placed only at the end of class definition! | |
#define DENY_COPYING(classname) \ | |
private: \ | |
classname(const classname &); \ | |
void operator = (const classname &); | |
// It's unclear how to obtain SIZE_MAX in C++. So define it here. | |
#ifndef SIZE_MAX | |
# define SIZE_MAX ((size_t)-1) | |
#endif | |
using namespace std; | |
namespace | |
{ | |
// Debug stream. | |
// | |
// This stream is completely eliminated if NDEBUG macro is present, so feel free | |
// scattering debug messages in the code wherever you want. This won't hurt | |
// performance in optimized builds. | |
class debug_ostream | |
{ | |
private: | |
ostream &_output; | |
public: | |
explicit debug_ostream(ostream &output) : _output(output) {} | |
debug_ostream &operator << (ostream &(*const manipulator)(ostream &)) | |
{ | |
(void)manipulator; | |
#ifndef NDEBUG | |
manipulator(_output); | |
#endif | |
return *this; | |
} | |
template <class T> | |
debug_ostream &operator << (const T &v) | |
{ | |
(void)v; | |
#ifndef NDEBUG | |
_output << v; | |
#endif | |
return *this; | |
} | |
}; | |
// Debug stream initialization. | |
debug_ostream cdbg(cerr); | |
// Calculates the number of bytes required for storing the value N. | |
template <size_t N> struct bytes | |
{ | |
static const size_t n = 1 + bytes<(N/256)>::n; | |
}; | |
template<> struct bytes<0> | |
{ | |
static const size_t n = 0; | |
}; | |
// Determines a type suitable for holding the given number of bytes. | |
template <size_t Bytes> struct size_type; | |
template<> struct size_type<1> { typedef uint8_t t; }; | |
template<> struct size_type<2> { typedef uint16_t t; }; | |
// Moves object's contents from *src to *dst. | |
// It differs from assignment operator in two aspects: | |
// - dst contents is undefined before the move, so there is no need in freeing | |
// up dst resources during the move. | |
// - src contents can be modified during the move, so it can be left in | |
// the state, which consumes minimal CPU at destruction time. | |
// | |
// These aspects allow writing fast move() specializations. | |
template <class ValueType> | |
void move(ValueType *const dst, ValueType *const src) | |
{ | |
// This move() implementation is suitable only for POD types. | |
// Non-POD types must provide own move() specialization. | |
*dst = *src; | |
} | |
// Stack implementation optimized for performance and memory usage. | |
// It packs multiple items in chunks up to ChunkSize items per chunk. | |
// This improves memory locality and shaves off memory required for 'next' | |
// pointers comparing to canonical list implementations. | |
// | |
// See http://en.wikipedia.org/wiki/Unrolled_linked_list for details. | |
// | |
// The stack owns items pushed into it. It uses move() template function | |
// for moving items from/to the stack. | |
#pragma pack(push, 1) | |
template <class ValueType, size_t ChunkSize> | |
class fast_stack | |
{ | |
private: | |
typedef typename size_type<bytes<ChunkSize>::n>::t chunk_size_t; | |
static const chunk_size_t _MAX_CHUNK_SIZE = ((chunk_size_t)-1); | |
// Pointer to the next stack chunk. | |
fast_stack *_next; | |
// Space for items in the current chunk. | |
char _v[sizeof(ValueType) * ChunkSize]; | |
// The number of items in the current chunk. | |
chunk_size_t _chunk_size; | |
ValueType *_get(const size_t n) const | |
{ | |
assert(_chunk_size <= ChunkSize); | |
assert(n < _chunk_size); | |
return ((ValueType *)_v) + n; | |
} | |
fast_stack(fast_stack *const next, ValueType *const v) : | |
_next(next), _chunk_size(1) | |
{ | |
assert(ChunkSize > 0); | |
assert(ChunkSize <= _MAX_CHUNK_SIZE); | |
cdbg << "Allocated stack chunk for " << ChunkSize << " items" << endl; | |
move(_get(0), v); | |
} | |
~fast_stack() | |
{ | |
for (size_t i = 0; i < _chunk_size; ++i) { | |
_get(i)->~ValueType(); | |
} | |
cdbg << "Released stack chunk for " << ChunkSize << " items" << endl; | |
} | |
public: | |
// Moves the given item onto the stack using move() template function. | |
// May update a pointer to the stack top. | |
static void push(fast_stack **const top_p, ValueType *const v) | |
{ | |
assert(top_p != 0); | |
fast_stack *top = *top_p; | |
if (top == 0 || top->_chunk_size == ChunkSize) { | |
*top_p = new fast_stack(*top_p, v); | |
} | |
else { | |
assert(top->_chunk_size < ChunkSize); | |
++(top->_chunk_size); | |
move(top->_get(top->_chunk_size - 1), v); | |
} | |
} | |
// Pops the item from the top of the stack to *to_v using move() template | |
// function. May update a pointer to the stack top. | |
static void pop(fast_stack **const top_p, ValueType *const to_v) | |
{ | |
assert(top_p != 0); | |
fast_stack *top = *top_p; | |
assert(top != 0); | |
assert(top->_chunk_size > 0); | |
move(to_v, top->_get(top->_chunk_size - 1)); | |
--(top->_chunk_size); | |
if (top->_chunk_size == 0) { | |
*top_p = top->_next; | |
delete top; | |
} | |
} | |
// Moves items from the array to the stack using move() template function. | |
// Returns the top of the created stack. | |
static fast_stack *move_from_array(ValueType *begin, const size_t n) | |
{ | |
assert(begin != 0); | |
const ValueType *const end = begin + n; | |
fast_stack *top = 0; | |
while (begin != end) { | |
push(&top, begin); | |
++begin; | |
} | |
return top; | |
} | |
// Destroys the given stack. | |
static void destroy(fast_stack *top) | |
{ | |
while (top != 0) { | |
fast_stack *next = top->_next; | |
delete top; | |
top = next; | |
} | |
} | |
// Calls visitor for each item on the stack. | |
// The order of visited items is unspecified. | |
template <class Visitor> | |
static void for_each(const fast_stack *top, Visitor *const visitor) | |
{ | |
while (top != 0) { | |
assert(top->_chunk_size > 0); | |
for (size_t i = 0; i < top->_chunk_size; ++i) { | |
ValueType *const v = top->_get(i); | |
(*visitor)(v); | |
} | |
top = top->_next; | |
} | |
} | |
template <class Predicate> | |
static ValueType *find_if(const fast_stack *top, Predicate *const predicate) | |
{ | |
while (top != 0) { | |
assert(top->_chunk_size > 0); | |
for (size_t i = 0; i < top->_chunk_size; ++i) { | |
ValueType *const v = top->_get(i); | |
if ((*predicate)(const_cast<const ValueType *const>(v))) { | |
return v; | |
} | |
} | |
top = top->_next; | |
}; | |
return 0; | |
} | |
DENY_COPYING(fast_stack) | |
}; | |
#pragma pack(pop) | |
// A list optimized for storage space. | |
// It stores up to InlineChunkSize values in the object memory itself, | |
// then falls back to dynamically allocated storage. | |
// | |
// The list owns added items. It uses move() template function for moving | |
// items into the list. | |
#pragma pack(push, 1) | |
template <class ValueType, size_t InlineChunkSize, size_t ExternalChunkSize> | |
class compact_list | |
{ | |
private: | |
// This type must be able handling up to InlineChunkSize + 1 distinct integer | |
// values. The (InlineChunkSize + 1) value is used as a flag indicating | |
// external storage is used instead of inline storage. | |
typedef typename size_type<bytes<InlineChunkSize+1>::n>::t | |
inline_storage_size_t; | |
typedef fast_stack<ValueType, ExternalChunkSize> external_storage_t; | |
static const size_t _MAX_INLINE_STORAGE_SIZE = ((inline_storage_size_t)-1); | |
union { | |
char inline_storage[sizeof(ValueType) * InlineChunkSize]; | |
external_storage_t *external_storage; | |
} _u; | |
inline_storage_size_t _inline_storage_size; | |
ValueType *_get_from_inline_storage(const size_t n) const | |
{ | |
assert(n < _inline_storage_size); | |
return ((ValueType *)_u.inline_storage) + n; | |
} | |
public: | |
compact_list() : _inline_storage_size(0) | |
{ | |
assert(InlineChunkSize > 0); | |
assert(InlineChunkSize < _MAX_INLINE_STORAGE_SIZE); | |
} | |
~compact_list() | |
{ | |
if (_inline_storage_size <= InlineChunkSize) { | |
for (size_t i = 0; i < _inline_storage_size; i++) { | |
_get_from_inline_storage(i)->~ValueType(); | |
} | |
} | |
else { | |
assert(_u.external_storage != 0); | |
external_storage_t::destroy(_u.external_storage); | |
} | |
} | |
// Moves the given item into the list using move() template function. | |
void add(ValueType *const v) | |
{ | |
if (_inline_storage_size < InlineChunkSize) { | |
assert(_inline_storage_size < _MAX_INLINE_STORAGE_SIZE); | |
++_inline_storage_size; | |
move(_get_from_inline_storage(_inline_storage_size - 1), v); | |
} | |
else { | |
if (_inline_storage_size == InlineChunkSize) { | |
cdbg << "Switching from inline to external storage, because list " << | |
"size is greater than " << InlineChunkSize << endl; | |
assert(_inline_storage_size < _MAX_INLINE_STORAGE_SIZE); | |
++_inline_storage_size; | |
_u.external_storage = external_storage_t::move_from_array( | |
_get_from_inline_storage(0), InlineChunkSize); | |
} | |
assert(_u.external_storage != 0); | |
external_storage_t::push(&_u.external_storage, v); | |
} | |
} | |
// Calls visitor for each item on the list. | |
// The order of visited items is unspecified. | |
template <class Visitor> | |
void for_each(Visitor *const visitor) const | |
{ | |
if (_inline_storage_size <= InlineChunkSize) { | |
for (size_t i = 0; i < _inline_storage_size; ++i) { | |
ValueType *const v = _get_from_inline_storage(i); | |
(*visitor)(v); | |
} | |
} | |
else { | |
assert(_u.external_storage != 0); | |
external_storage_t::for_each(_u.external_storage, visitor); | |
} | |
} | |
template <class Predicate> | |
ValueType *find_if(Predicate *const predicate) const | |
{ | |
if (_inline_storage_size <= InlineChunkSize) { | |
for (size_t i = 0; i < _inline_storage_size; ++i) { | |
ValueType *const v = _get_from_inline_storage(i); | |
if ((*predicate)(const_cast<const ValueType *const>(v))) { | |
return v; | |
} | |
} | |
return 0; | |
} | |
assert(_u.external_storage != 0); | |
return external_storage_t::find_if(_u.external_storage, predicate); | |
} | |
DENY_COPYING(compact_list) | |
}; | |
#pragma pack(pop) | |
// Simple hash table, which uses less memory comparing to sets and maps | |
// provided by STL. | |
// | |
// ValueType must implement equal_to() and hash() methods. | |
// Hash table owns objects put into it. It uses move() template function | |
// for object's movement. Objects put into hash table shouldn't | |
// modify hash() result. | |
template <class ValueType, size_t OptimalBucketSize> | |
class simple_hashtable | |
{ | |
public: | |
typedef compact_list<ValueType, OptimalBucketSize, OptimalBucketSize * 2> | |
bucket_t; | |
private: | |
const size_t _buckets_count; | |
bucket_t *const _buckets; | |
template <class Visitor> | |
class _value_visitor | |
{ | |
private: | |
Visitor &_visitor; | |
public: | |
explicit _value_visitor(Visitor &visitor) : _visitor(visitor) {} | |
void operator () (ValueType *const v) const | |
{ | |
_visitor(v); | |
} | |
}; | |
class _value_finder | |
{ | |
private: | |
const ValueType *const _v; | |
const simple_hashtable *const _htb; | |
public: | |
_value_finder(const ValueType *const v, | |
const simple_hashtable *const htb) : _v(v), _htb(htb) {} | |
bool operator () (const ValueType *const v) const | |
{ | |
assert(_htb->get_bucket(_v) == _htb->get_bucket(v)); | |
return _v->equal_to(v); | |
} | |
}; | |
ValueType *_find(const bucket_t *const bucket, const ValueType *const v) const | |
{ | |
assert(bucket == get_bucket(v)); | |
const _value_finder vf(v, this); | |
return bucket->find_if(&vf); | |
} | |
public: | |
explicit simple_hashtable(const size_t buckets_count) : | |
_buckets_count(buckets_count), _buckets(new bucket_t[buckets_count]) | |
{ | |
assert(_buckets_count > 0); | |
cdbg << "*** Allocated " << (_buckets_count * sizeof(_buckets[0])) << | |
" bytes for " << _buckets_count << " hashtable buckets" << endl; | |
} | |
~simple_hashtable() | |
{ | |
assert(_buckets != 0); | |
delete[] _buckets; | |
cdbg << "*** Deleted " << (_buckets_count * sizeof(_buckets[0])) << | |
" bytes allocated for " << _buckets_count << " hashtable buckets" << | |
endl; | |
} | |
bucket_t *get_bucket(const ValueType *const v) const | |
{ | |
assert(_buckets_count > 0); | |
assert(_buckets != 0); | |
// TODO: use something faster than modulo operation? | |
size_t n = v->hash() % _buckets_count; | |
return _buckets + n; | |
} | |
// Moves the given item into hashtable using move() template function. | |
// Returns 0 on success or a pointer to already existing equivalent | |
// item on failure. | |
ValueType *insert(ValueType *const v) | |
{ | |
bucket_t *const bucket = get_bucket(v); | |
ValueType *const tmp = _find(bucket, v); | |
if (tmp == 0) { | |
bucket->add(v); | |
} | |
else { | |
assert(v->equal_to(tmp)); | |
} | |
return tmp; | |
} | |
// Returns a pointer to an item with value equal to the given value. | |
// Returns 0 if there is no item with the given value. | |
ValueType *find(const ValueType *const v) const | |
{ | |
return _find(get_bucket(v), v); | |
} | |
// Call visitor for each item in the hashtable. | |
// The order of visited items is unspecified. | |
template <class Visitor> | |
void for_each(Visitor *const visitor) const | |
{ | |
assert(_buckets != 0); | |
const _value_visitor<Visitor> vv(visitor); | |
for (size_t i = 0; i < _buckets_count; ++i) { | |
_buckets[i].for_each(vv); | |
} | |
} | |
DENY_COPYING(simple_hashtable) | |
}; | |
// Jenkin's hash parts. | |
// | |
// See http://en.wikipedia.org/wiki/Jenkins_hash_function . | |
// C++ doesn't provide decent hash implementations, so we have to implement | |
// it here. | |
struct jenkins_hash | |
{ | |
static uint32_t mix(uint32_t h, const char c) | |
{ | |
h += c; | |
h += (h << 10); | |
h ^= (h >> 6); | |
return h; | |
} | |
static uint32_t final(uint32_t h) | |
{ | |
h += (h << 3); | |
h ^= (h >> 11); | |
h += (h << 15); | |
return h; | |
} | |
}; | |
// Tiny and fast string implementation. | |
// It builds strings on top of the provided buffer. | |
// It doesn't allocate additional memory. | |
#pragma pack(push, 1) | |
class tiny_string | |
{ | |
public: | |
// Maximum string size value must fit this type. | |
// This type must be unsigned. | |
typedef uint8_t ts_size_t; | |
private: | |
// Buffer size value must fit this type. | |
// This type must be unsigned. | |
typedef uint32_t offset_t; | |
static const offset_t _MAX_STRING_SIZE = ((ts_size_t)-1); | |
static const offset_t _MAX_INLINE_STRING_SIZE = sizeof(offset_t); | |
static const offset_t _MAX_BUF_SIZE = ((offset_t)-2); | |
static const offset_t _TMP_STRING_OFFSET = ((offset_t)-1); | |
union { | |
// Offset of the string in the buffer. | |
offset_t offset; | |
// Inline string allows eliminating additional memory dereference. | |
char inline_string[_MAX_INLINE_STRING_SIZE]; | |
} _u; | |
// The size of the string. | |
// Strings shorter than (_MAX_INLINE_STRING_SIZE + 1) are stored | |
// in _u.inline_string. | |
ts_size_t _size; | |
static char *_buf; | |
static offset_t _buf_size; | |
static offset_t _next_offset; | |
// Temporary buffer is used for holding a string read from input stream. | |
static char _tmp_buf[_MAX_STRING_SIZE]; | |
void _verify() const | |
{ | |
assert(_buf != 0); | |
assert(_next_offset <= _buf_size); | |
if (_size > _MAX_INLINE_STRING_SIZE) { | |
if (_u.offset != _TMP_STRING_OFFSET) { | |
assert(_u.offset <= _next_offset - _size); | |
} | |
} | |
} | |
static void _set_string(tiny_string *const ts, const offset_t offset, | |
const ts_size_t size) | |
{ | |
if (size <= _MAX_INLINE_STRING_SIZE) { | |
const char *const src = (offset != _TMP_STRING_OFFSET) ? (_buf + offset) : _tmp_buf; | |
char *const dst = ts->_u.inline_string; | |
// This loop is faster than memcpy(). | |
for (size_t i = 0; i < size; ++i) { | |
dst[i] = src[i]; | |
} | |
} | |
else { | |
ts->_u.offset = offset; | |
} | |
ts->_size = size; | |
} | |
public: | |
// Sets the buffer, which will be used for strings building. | |
// The buffer must contain whitespace-delimited strings. | |
static void set_buffer(char *const buf, const size_t buf_size) | |
{ | |
assert(_buf == 0); | |
assert(_MAX_BUF_SIZE <= SIZE_MAX); | |
if (buf_size > _MAX_BUF_SIZE) { | |
cerr << "Cannot reserve " << buf_size << " chars for tiny strings." << | |
"Widen tiny_string::offset_t for handling more chars" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
_buf = buf; | |
assert((offset_t) buf_size == buf_size); | |
_buf_size = buf_size; | |
_next_offset = 0; | |
cdbg << "*** Set a buffer for holding " << _buf_size << | |
" tiny string chars" << endl; | |
} | |
// Builds the next string from the underlying buffer. | |
// Builds the string with zero length if the end of the buffer reached. | |
static void get_next_string(tiny_string *const ts) | |
{ | |
assert(_buf != 0); | |
assert(_next_offset <= _buf_size); | |
// skip leading whitespace | |
char c = 0; | |
while (_next_offset < _buf_size) { | |
c = _buf[_next_offset++]; | |
if (!::isspace(c)) { | |
break; | |
} | |
} | |
if (_next_offset == _buf_size) { | |
// end of buffer reached | |
_set_string(ts, _TMP_STRING_OFFSET, 0); | |
return; | |
} | |
// lowercase string contents. | |
const offset_t s_offset = _next_offset - 1; | |
assert(!::isspace(c)); | |
do { | |
_buf[_next_offset - 1] = ::tolower(c); | |
c = _buf[_next_offset++]; | |
} while (_next_offset < _buf_size && !::isspace(c)); | |
// put trailing character into string. | |
if (!::isspace(c)) { | |
_buf[_next_offset - 1] = ::tolower(c); | |
} | |
// calculate string size | |
assert(s_offset + 1 < _next_offset); | |
const offset_t s_size = _next_offset - s_offset - 1; | |
if (s_size > _MAX_STRING_SIZE) { | |
cerr << "Input string ["; | |
cerr.write(_buf + s_offset, s_size); | |
cerr << "] containing " << s_size << " chars is longer than " << | |
"the allowed maximum " << _MAX_STRING_SIZE << " chars. Widen " << | |
"tiny_string::ts_size_t for handling longer strings" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
assert((ts_size_t) s_size == s_size); | |
_set_string(ts, s_offset, s_size); | |
} | |
// Reads the string from input stream. | |
// The string is read into a static temporary buffer, so the next call | |
// to read() overwrites the previous string in the temporary buffer. | |
static void read(istream &input, tiny_string *const ts) | |
{ | |
// skip leading whitespace | |
int c; | |
do { | |
c = input.get(); | |
} while (::isspace(c)); | |
if (!input.good()) { | |
// end of input stream | |
_set_string(ts, _TMP_STRING_OFFSET, 0); | |
return; | |
} | |
// read and lowercase string contents. | |
offset_t s_size = 0; | |
assert(!::isspace(c)); | |
do { | |
_tmp_buf[s_size++] = ::tolower(c); | |
c = input.get(); | |
} while (s_size < _MAX_STRING_SIZE && !::isspace(c) && input.good()); | |
if (!::isspace(c) && input.good()) { | |
assert(s_size == _MAX_STRING_SIZE); | |
cerr << "Tiny string ["; | |
cerr.write(_tmp_buf, _MAX_STRING_SIZE); | |
cerr << "]... cannot be longer than " << _MAX_STRING_SIZE << " chars" << | |
endl; | |
::exit(EXIT_FAILURE); | |
} | |
_set_string(ts, _TMP_STRING_OFFSET, s_size); | |
} | |
size_t size() const | |
{ | |
_verify(); | |
return _size; | |
} | |
const char *data() const | |
{ | |
_verify(); | |
if (_size <= _MAX_INLINE_STRING_SIZE) { | |
return _u.inline_string; | |
} | |
return (_u.offset != _TMP_STRING_OFFSET) ? (_buf + _u.offset) : _tmp_buf; | |
} | |
bool equal_to(const tiny_string *const ts) const | |
{ | |
if (_size != ts->_size) { | |
return false; | |
} | |
const char *const s1 = data(); | |
const char *const s2 = ts->data(); | |
for (ts_size_t i = 0; i < _size; ++i) { | |
if (s1[i] != s2[i]) { | |
return false; | |
} | |
} | |
return true; | |
} | |
uint32_t hash() const | |
{ | |
uint32_t h = 0; | |
const char *const s = data(); | |
for (ts_size_t i = 0; i < _size; ++i) { | |
h = jenkins_hash::mix(h, s[i]); | |
} | |
return jenkins_hash::final(h); | |
} | |
#ifndef NDEBUG | |
void debug_print(ostream &output) const | |
{ | |
output << "["; | |
output.write(data(), size()); | |
output << "]"; | |
} | |
#endif | |
}; | |
#pragma pack(pop) | |
char *tiny_string::_buf; | |
tiny_string::offset_t tiny_string::_buf_size; | |
tiny_string::offset_t tiny_string::_next_offset; | |
char tiny_string::_tmp_buf[tiny_string::_MAX_STRING_SIZE]; | |
#ifndef NDEBUG | |
ostream &operator << (ostream &output, const tiny_string &ts) | |
{ | |
ts.debug_print(output); | |
return output; | |
} | |
#endif | |
// This class holds information related to a dictionary word: | |
// - A list of words, which can be transformed into the word via a single | |
// character substitution, character insertion or character deletion. | |
// - A color of the word. Initially all the words have no color (zero value). | |
// Words are colorified via graph traversing after all dictionary words | |
// are populated and all word connections are built. | |
#pragma pack(push, 1) | |
class word | |
{ | |
public: | |
// The type of word index. | |
// Widen it in order to handle more words. | |
// Must be unsigned. | |
typedef uint32_t idx_t; | |
// The type of word color. | |
// Widen it in order to handle more colors. | |
// Must be unsigned. | |
typedef uint16_t color_t; | |
// Null index is used solely for performance optimization purposes. | |
static const idx_t NULL_IDX; | |
// A stack of words. It is used for traversing graphs of connected words | |
// during words brushing. | |
// Use large stack chunks for performance reasons. | |
typedef fast_stack<idx_t, 4000> stack_t; | |
private: | |
static word *_words; | |
static idx_t _words_count; | |
static idx_t _next_idx; | |
// ((idx_t)-1) is already occupied by NULL_IDX. | |
static const idx_t _MAX_WORDS_COUNT = ((idx_t)-2); | |
// It is observed that words from /usr/share/dict/words have ~3 connections | |
// per word, so store up to 4 connections inline for performance reasons :) | |
typedef compact_list<idx_t, 4, 8> connected_words_t; | |
// List of connected words. | |
connected_words_t _connected_words; | |
// String associated with the word. | |
const tiny_string _ts; | |
// Word's color. | |
color_t _color; | |
static idx_t _get_idx(const word *const w) | |
{ | |
assert(w >= _words); | |
assert(w - _words < _next_idx); | |
return w - _words; | |
} | |
class _words_brusher | |
{ | |
private: | |
stack_t **const _ws_p; | |
const color_t _color; | |
public: | |
_words_brusher(stack_t **const ws_p, const color_t color) : | |
_ws_p(ws_p), _color(color) | |
{ | |
assert(_ws_p != 0); | |
assert(_color != 0); | |
} | |
void operator () (const idx_t *const idx) const | |
{ | |
word *const w = word::get_by_idx(*idx); | |
if (w->get_color() == _color) { | |
return; | |
} | |
w->set_color(_color); | |
idx_t tmp = *idx; | |
stack_t::push(_ws_p, &tmp); | |
} | |
}; | |
class _idx_finder | |
{ | |
private: | |
const word::idx_t _idx; | |
public: | |
_idx_finder(const word::idx_t idx) : _idx(idx) {} | |
bool operator () (const word::idx_t *const idx) const | |
{ | |
return (_idx == *idx); | |
} | |
}; | |
explicit word(const tiny_string *const ts) : | |
_connected_words(), _ts(*ts), _color(0) {} | |
public: | |
static word *get_by_idx(const idx_t idx) | |
{ | |
assert(idx < _next_idx); | |
return _words + idx; | |
} | |
// Reserves space for the given number of words. | |
static void reserve_words(const size_t words_count) | |
{ | |
assert(_words == 0); | |
if (words_count > _MAX_WORDS_COUNT) { | |
cerr << "Cannot allocate space for " << words_count << " words. Maximum " << | |
"number of words is " << _MAX_WORDS_COUNT << | |
"Widen word::idx_t for handling more words" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
if (words_count > SIZE_MAX / sizeof(_words[0])) { | |
cerr << "Cannot allocate space for " << words_count << " words" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
const size_t buf_size = words_count * sizeof(_words[0]); | |
_words = (word *) operator new (buf_size); | |
assert((idx_t) words_count == words_count); | |
_words_count = words_count; | |
_next_idx = 0; | |
cdbg << "*** Allocated " << buf_size << " bytes for " << words_count << | |
" words" << endl; | |
} | |
// Releases space allocated by reserve_words(). | |
static void release_words() | |
{ | |
assert(_words != 0); | |
assert(_words_count <= _MAX_WORDS_COUNT); | |
assert(_next_idx <= _words_count); | |
for (idx_t i = 0; i < _next_idx; ++i) { | |
get_by_idx(i)->~word(); | |
} | |
operator delete (_words); | |
_words = 0; | |
cdbg << "*** Freed " << (_words_count * sizeof(_words[0])) << | |
" bytes allocated for words" << endl; | |
} | |
// Adds new word into words storage. | |
static void add_word(const tiny_string *const ts) | |
{ | |
assert(_words != 0); | |
assert(_words_count <= _MAX_WORDS_COUNT); | |
assert(_next_idx < _words_count); | |
++_next_idx; | |
word *const w = get_by_idx(_next_idx - 1); | |
new (w) word(ts); | |
cdbg << "Read " << w << endl; | |
} | |
bool is_connected_to(const word *const w) const | |
{ | |
const idx_t idx = _get_idx(w); | |
const _idx_finder idf(idx); | |
const idx_t *const tmp = _connected_words.find_if(&idf); | |
if (tmp != 0) { | |
assert(*tmp == idx); | |
return true; | |
} | |
return false; | |
} | |
void connect_to(const word *const w) | |
{ | |
assert(w->_ts.size() == _ts.size() || | |
w->_ts.size() + 1 == _ts.size() || | |
w->_ts.size() == _ts.size() + 1); | |
assert(!is_connected_to(w)); | |
idx_t idx = _get_idx(w); | |
_connected_words.add(&idx); | |
} | |
void brush_connected_words(stack_t **const ws_p, const color_t color) | |
{ | |
assert(color == _color); | |
cdbg << "Brushing words connected to " << this << " into color=" << | |
color << endl; | |
const _words_brusher wb(ws_p, color); | |
_connected_words.for_each(&wb); | |
} | |
color_t get_color() const | |
{ | |
return _color; | |
} | |
void set_color(const color_t color) | |
{ | |
cdbg << "Brushing " << this << " with color=" << color << endl; | |
assert(_color == 0); | |
_color = color; | |
} | |
const tiny_string *get_string() const | |
{ | |
return &_ts; | |
} | |
// Calls word_visitor for each word in the dictionary. | |
// The order of visited words is unspecified. | |
template <class WordVisitor> | |
static void for_each(WordVisitor *const word_visitor) | |
{ | |
assert(_words != 0); | |
assert(_next_idx <= _words_count); | |
for (idx_t i = 0; i < _next_idx; ++i) { | |
const idx_t idx = i; | |
(*word_visitor)(_words + idx, idx); | |
} | |
} | |
#ifndef NDEBUG | |
void debug_print(ostream &output) const | |
{ | |
output << "word(" << _ts << ")"; | |
} | |
#endif | |
DENY_COPYING(word) | |
}; | |
#pragma pack(pop) | |
const word::idx_t word::NULL_IDX = ((word::idx_t)-1); | |
word *word::_words; | |
word::idx_t word::_words_count; | |
word::idx_t word::_next_idx; | |
#ifndef NDEBUG | |
ostream &operator << (ostream &output, const word *const w) | |
{ | |
w->debug_print(output); | |
return output; | |
} | |
#endif | |
// This class holds information related to a subword. | |
// Subword is a tuple (word, position), where subword's string can | |
// be created by removing a character from the word at the given position. | |
#pragma pack(push, 1) | |
class subword | |
{ | |
private: | |
// Word's index. | |
word::idx_t _idx; | |
// Position of the character to delete in the word. | |
tiny_string::ts_size_t _pos; | |
word *_word() const | |
{ | |
return word::get_by_idx(_idx); | |
} | |
const tiny_string *_get_string() const | |
{ | |
return _word()->get_string(); | |
} | |
size_t _substring_size() const | |
{ | |
const tiny_string *const ts = _get_string(); | |
assert(ts->size() > 0); | |
assert(_pos <= ts->size()); | |
const size_t ts_size = ts->size(); | |
return _pos == ts_size ? ts_size : ts_size - 1; | |
} | |
static bool _substring_equal_to(const subword *const sw1, | |
const subword *const sw2) | |
{ | |
assert(sw1->_substring_size() == sw2->_substring_size()); | |
const tiny_string *const ts1 = sw1->_get_string(); | |
const tiny_string *const ts2 = sw2->_get_string(); | |
const tiny_string::ts_size_t pos1 = sw1->_pos; | |
const tiny_string::ts_size_t pos2 = sw2->_pos; | |
assert(pos1 <= pos2); | |
assert(pos1 <= ts1->size()); | |
assert(pos2 <= ts2->size()); | |
const char *const s1 = ts1->data(); | |
const char *const s2 = ts2->data(); | |
for (size_t i = 0; i < pos1; ++i) { | |
if (s1[i] != s2[i]) { | |
return false; | |
} | |
} | |
for (size_t i = pos1 + 1; i <= pos2; ++i) { | |
assert(i < ts1->size()); | |
assert(i <= ts2->size()); | |
if (s1[i] != s2[i - 1]) { | |
return false; | |
} | |
} | |
for (size_t i = pos2 + 1; i < ts2->size(); ++i) { | |
assert(i < ts1->size()); | |
if (s1[i] != s2[i]) { | |
return false; | |
} | |
} | |
return true; | |
} | |
public: | |
subword(const word::idx_t idx, const tiny_string::ts_size_t pos) : | |
_idx(idx), _pos(pos) | |
{ | |
assert(_pos <= _get_string()->size()); | |
} | |
void connect_to(const subword *const sw) | |
{ | |
word *const w1 = _word(); | |
word *const w2 = sw->_word(); | |
if (w1->is_connected_to(w2)) { | |
cdbg << w1 << " is already connected to " << w2 << endl; | |
return; | |
} | |
cdbg << "Connecting " << w1 << " to " << w2 << " via char "; | |
if (_get_string()->size() == sw->_get_string()->size()) { | |
cdbg << "substitution"; | |
} | |
else { | |
assert(_get_string()->size() + 1 == sw->_get_string()->size()); | |
cdbg << "insertion"; | |
} | |
cdbg << " at position " << (size_t)sw->_pos << endl; | |
w1->connect_to(w2); | |
w2->connect_to(w1); | |
} | |
// Compares only subwords' substrings, but not character deletion positions. | |
bool substring_equal_to(const subword *const sw) const | |
{ | |
if (_substring_size() != sw->_substring_size()) { | |
return false; | |
} | |
return (_pos <= sw->_pos) ? _substring_equal_to(this, sw) : | |
_substring_equal_to(sw, this); | |
} | |
bool equal_to(const subword *const sw) const | |
{ | |
return (_pos == sw->_pos && substring_equal_to(sw)); | |
} | |
uint32_t hash() const | |
{ | |
// Subwords with the same substring must have identical hash value, | |
// so they will trap into the same hash table bucket. | |
// So don't mix _pos into hash value. | |
// This property is exploited when building links between words. | |
const tiny_string *const ts = _get_string(); | |
const char *const s = ts->data(); | |
uint32_t h = 0; | |
for (size_t i = 0; i < _pos; ++i) { | |
h = jenkins_hash::mix(h, s[i]); | |
} | |
for (size_t i = _pos + 1; i < ts->size(); ++i) { | |
h = jenkins_hash::mix(h, s[i]); | |
} | |
return jenkins_hash::final(h); | |
} | |
}; | |
#pragma pack(pop) | |
// Mapping between a word string and a color. | |
#pragma pack(push, 1) | |
struct word_to_color | |
{ | |
tiny_string ts; | |
word::color_t color; | |
word_to_color(const tiny_string *const ts_, const word::color_t color_) : | |
ts(*ts_), color(color_) | |
{ | |
assert(ts.size() > 0); | |
} | |
bool equal_to(const word_to_color *const wc) const | |
{ | |
return ts.equal_to(&wc->ts); | |
} | |
uint32_t hash() const | |
{ | |
return ts.hash(); | |
} | |
}; | |
#pragma pack(pop) | |
// Subwords hashtable is used for building connections between words and | |
// for words' brushing. | |
// The hashtable is deleted after words are brushed. | |
typedef simple_hashtable<subword, 2> subwords_t; | |
subwords_t *subwords; | |
// Lookup table is used for mapping between dictionary words and | |
// the corresponding colors. | |
typedef simple_hashtable<word_to_color, 2> lookup_table_t; | |
lookup_table_t *lookup_table; | |
void get_stats_for_dict_words(const char *const string_buffer, | |
const size_t string_buffer_size, size_t *const words_count_p, | |
size_t *const total_words_size_p) | |
{ | |
cdbg << "*** Populating dict word stats" << endl; | |
size_t words_count = 0; | |
size_t total_words_size = 0; | |
size_t i = 0; | |
while (i < string_buffer_size) { | |
char c = string_buffer[i++]; | |
if (::isspace(c)) { | |
continue; | |
} | |
++words_count; | |
do { | |
++total_words_size; | |
if (i == string_buffer_size) { | |
break; | |
} | |
c = string_buffer[i++]; | |
} while (!::isspace(c)); | |
} | |
cdbg << "*** Total words count: " << words_count << endl; | |
cdbg << "*** Total chars in all words: " << total_words_size << endl; | |
*words_count_p = words_count; | |
*total_words_size_p = total_words_size; | |
} | |
void read_dict_words(char *const string_buffer, const size_t string_buffer_size) | |
{ | |
cdbg << "*** Reading dict words from the input" << endl; | |
tiny_string::set_buffer(string_buffer, string_buffer_size); | |
tiny_string ts; | |
while (true) { | |
tiny_string::get_next_string(&ts); | |
size_t s_size = ts.size(); | |
if (s_size == 0) { | |
cdbg << "*** End of input stream reached" << endl; | |
break; | |
} | |
word::add_word(&ts); | |
} | |
} | |
void build_subwords_for_word(const word *const w, const word::idx_t idx) | |
{ | |
cdbg << "Building subwords for " << w << endl; | |
const tiny_string *const ts = w->get_string(); | |
for (size_t pos = 0; pos < ts->size(); ++pos) { | |
subword sw(idx, pos); | |
subword *const tmp = subwords->insert(&sw); | |
if (tmp != 0) { | |
assert(sw.equal_to(tmp)); | |
sw.connect_to(tmp); | |
} | |
} | |
} | |
void build_subwords_for_dict_words() | |
{ | |
cdbg << "*** Building subwords for dict words" << endl; | |
word::for_each(&build_subwords_for_word); | |
} | |
class word_connections_builder | |
{ | |
private: | |
subword *const _sw; | |
public: | |
word_connections_builder(subword *const sw) : _sw(sw) {} | |
void operator () (subword *const sw) const | |
{ | |
assert(subwords->get_bucket(_sw) == subwords->get_bucket(sw)); | |
if (sw->substring_equal_to(_sw)) { | |
_sw->connect_to(sw); | |
} | |
} | |
}; | |
void build_connections_for_word(word *const w, const word::idx_t idx) | |
{ | |
cdbg << "Building connections for " << w << endl; | |
const tiny_string *const ts = w->get_string(); | |
subword sw(idx, ts->size()); | |
const word_connections_builder wcb(&sw); | |
subwords->get_bucket(&sw)->for_each(&wcb); | |
} | |
void build_connections_for_dict_words() | |
{ | |
cdbg << "*** Building connections for dict words" << endl; | |
word::for_each(&build_connections_for_word); | |
} | |
void brush_connected_graph(word::stack_t **const ws_p, word::idx_t idx, | |
const word::color_t color) | |
{ | |
word *w = word::get_by_idx(idx); | |
assert(w->get_color() == 0); | |
assert(color > 0); | |
word::idx_t null_idx = word::NULL_IDX; | |
word::stack_t::push(ws_p, &null_idx); | |
w->set_color(color); | |
while (idx != word::NULL_IDX) { | |
w = word::get_by_idx(idx); | |
w->brush_connected_words(ws_p, color); | |
word::stack_t::pop(ws_p, &idx); | |
} | |
} | |
class dict_word_colorifier | |
{ | |
private: | |
word::stack_t *_ws; | |
word::color_t _next_color; | |
public: | |
dict_word_colorifier() : _ws(0), _next_color(0) | |
{ | |
// Pre-allocate words' stack here in order to decrease the number of silly | |
// new()/delete() calls for each brush_connected_graph() call. | |
word::idx_t null_idx = word::NULL_IDX; | |
word::stack_t::push(&_ws, &null_idx); | |
assert(_ws != 0); | |
} | |
~dict_word_colorifier() | |
{ | |
assert(_ws != 0); | |
word::idx_t idx; | |
word::stack_t::pop(&_ws, &idx); | |
assert(idx == word::NULL_IDX); | |
assert(_ws == 0); | |
} | |
void operator () (word *const w, const word::idx_t idx) | |
{ | |
if (w->get_color() != 0) { | |
assert(w->get_color() <= _next_color); | |
return; | |
} | |
++_next_color; | |
if (_next_color <= 0) { | |
cerr << "Color space exhausted. Widen word::color_t for handling " << | |
"more colors" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
brush_connected_graph(&_ws, idx, _next_color); | |
} | |
word::color_t get_next_color() const | |
{ | |
return _next_color; | |
} | |
DENY_COPYING(dict_word_colorifier) | |
}; | |
void colorify_dict_words() | |
{ | |
cdbg << "*** Colorifying dict words" << endl; | |
dict_word_colorifier dwc; | |
word::for_each(&dwc); | |
cdbg << "*** Total colors=" << dwc.get_next_color() << endl; | |
} | |
void add_word_to_lookup_table(const word *const w, const word::idx_t idx) | |
{ | |
(void)idx; | |
word_to_color wtc(w->get_string(), w->get_color()); | |
word_to_color *const tmp = lookup_table->insert(&wtc); | |
if (tmp != 0) { | |
assert(tmp->equal_to(&wtc)); | |
assert(tmp->color == wtc.color); | |
cdbg << "Skipping duplicate " << w << endl; | |
} | |
} | |
void build_lookup_table() | |
{ | |
cdbg << "*** Building a lookup table for dict words" << endl; | |
word::for_each(&add_word_to_lookup_table); | |
} | |
bool get_word_color(istream &input, word::color_t *const color) | |
{ | |
tiny_string ts; | |
tiny_string::read(input, &ts); | |
if (ts.size() == 0) { | |
cdbg << "End of input stream reached" << endl; | |
return false; | |
} | |
assert(ts.size() > 0); | |
cdbg << "Obtaining the color for word " << ts << endl; | |
const word_to_color dummy_wtc(&ts, 0); | |
const word_to_color *wtc = lookup_table->find(&dummy_wtc); | |
if (wtc == 0) { | |
cdbg << "Cannot find word " << ts << " in the dict" << endl; | |
*color = 0; | |
} | |
else { | |
assert(wtc->equal_to(&dummy_wtc)); | |
*color = wtc->color; | |
assert(*color > 0); | |
cdbg << "The word " << ts << " has color=" << *color << endl; | |
} | |
return true; | |
} | |
void read_word_pairs(istream &input) | |
{ | |
cdbg << "*** Started reading word pairs" << endl; | |
while (input.good()) { | |
word::color_t color1, color2; | |
if (!get_word_color(input, &color1) || !get_word_color(input, &color2)) { | |
break; | |
} | |
if (color1 == color2 && color1 != 0) { | |
cout << "YES" << endl; | |
} | |
else { | |
cout << "NO" << endl; | |
} | |
} | |
cdbg << "*** Finished reading word pairs" << endl; | |
} | |
void show_usage(const char *const program_name) | |
{ | |
assert(program_name != 0); | |
cerr << "Usage: " << program_name << " path/to/dict/file" << endl; | |
cerr << "\tFeed word pairs to the input and the program will answer" << endl; | |
cerr << "\tYES or NO for each pair depending on special rules" << endl; | |
cerr << "\tdescribed at http://socialcam.com/jobs/problems " << endl; | |
cerr << "\t(Word Chains problem)" << endl; | |
} | |
void read_dict_to_string_buffer(const char *const dict_filename, | |
char **const string_buffer, size_t *const string_buffer_size) | |
{ | |
ifstream *const dict_file = new ifstream(dict_filename); | |
if (!(*dict_file)) { | |
cerr << "Cannot open the file [" << dict_filename << "]" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
// Try reading the whole file into memory in one read() syscall. | |
// mmap() could be faster, but it isn't portable. | |
filebuf *const fbuf = dict_file->rdbuf(); | |
const size_t file_size = fbuf->pubseekoff(0, dict_file->end, dict_file->in); | |
*string_buffer = new char[file_size]; | |
cdbg << "*** Allocated " << file_size << " bytes for string buffer" << endl; | |
fbuf->pubseekpos(0, dict_file->in); | |
const size_t bytes_read = fbuf->sgetn(*string_buffer, file_size); | |
if (bytes_read != file_size) { | |
cerr << "Expected to read " << file_size << " bytes from dict file, " << | |
"but read only " << bytes_read << " bytes" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
dict_file->close(); | |
delete dict_file; | |
*string_buffer_size = file_size; | |
} | |
void destroy_string_buffer(char *const string_buffer, | |
const size_t string_buffer_size) | |
{ | |
delete[] string_buffer; | |
cdbg << "*** Released " << string_buffer_size << " bytes allocated " << | |
"for string buffer" << endl; | |
(void)string_buffer_size; | |
} | |
} // end of anonymous namespace | |
int main(const int argc, const char *const argv[]) { | |
if (argc != 2) { | |
assert(argc >= 1); | |
const char *const program_name = argv[0]; | |
assert(program_name != 0); | |
show_usage(program_name); | |
::exit(EXIT_FAILURE); | |
} | |
const char *const dict_filename = argv[1]; | |
assert(dict_filename != 0); | |
char *string_buffer; | |
size_t string_buffer_size; | |
read_dict_to_string_buffer(dict_filename, &string_buffer, | |
&string_buffer_size); | |
size_t words_count, total_words_size; | |
get_stats_for_dict_words(string_buffer, string_buffer_size, &words_count, | |
&total_words_size); | |
assert(total_words_size >= words_count); | |
if (words_count == 0) { | |
cerr << "File " << dict_filename << " contains zero dict words" << endl; | |
::exit(EXIT_FAILURE); | |
} | |
word::reserve_words(words_count); | |
read_dict_words(string_buffer, string_buffer_size); | |
assert(subwords == 0); | |
subwords = new subwords_t(total_words_size); | |
build_subwords_for_dict_words(); | |
build_connections_for_dict_words(); | |
colorify_dict_words(); | |
assert(lookup_table == 0); | |
lookup_table = new lookup_table_t(words_count); | |
build_lookup_table(); | |
delete subwords; | |
subwords = 0; | |
word::release_words(); | |
read_word_pairs(cin); | |
delete lookup_table; | |
lookup_table = 0; | |
destroy_string_buffer(string_buffer, string_buffer_size); | |
cdbg << "*** Bye!" << endl; | |
return EXIT_SUCCESS; | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment