Skip to content

Instantly share code, notes, and snippets.

@iamarkdev
Last active December 4, 2019 16:33
Show Gist options
  • Save iamarkdev/f7a9ce7ce7ad7c5fb11ec511887789bc to your computer and use it in GitHub Desktop.
Save iamarkdev/f7a9ce7ce7ad7c5fb11ec511887789bc to your computer and use it in GitHub Desktop.
/*
Preface:
I'm issuing this challenge as I've hit a major roadblock in the development process
of one of my personal side projects. I've spent the last 2 weeks exhaustively exploring
solutions to the following problem and I'm at my wits end. I've taken a number of approaches
using various tree data structures and clever algorithms with no success.
Problem:
Given an input, find all possible byte substrings for the input and count how many
times each unique byte substring occurs throughout the input. Return all byte substrings
along with their occurence count that have an occurence count > 1.
You'll notice the examples given below show all byte substrings including byte substrings that have
an occurrence count of 1. This is mainly for illustration of the problem. Your solution
should exclude byte substrings that only occur 1 time in the input.
REQUIREMENTS
- Your solution may be written in whatever language you're comfortable with.
- Your solution should be able to compute a correct output more efficiently than O(n^2).
- Your solution should not exceed a memory usage limit of 32GB.
- Your solution must work with any file of any type that is <= 1GB in size.
- Your solution's output must exclude byte substrings than only have 1 occurrence in the input.
- Your solution's output must be written to a new file in a transferrable data format such as JSON (output.json for example).
- Your solution's output should be equivalent to an object of key value pairs as: (substring) => (# of occurrences).
- Please provide a way to contact you in the event your solution is accepted as the winning solution.
JUDGING CRITERIA
- Correct output (See requirements and examples below to understand what's considered correct output). (Pass / fail)
- Meets requirements (See above). (Pass / fail)
- Execution time: I will benchmark all solutions on the same machine with the same input - the input will be a binary file.
- Algorithm / Data Structure clarity and concision - if your solution has complex working parts,
please include comments where applicable to give better explain what's going on.
SUBMISSION
Please comment with a link to a gist of your solution to submit it for consideration.
PRIZE
I am guaranteeing $500.00 USD that will be gifted through PayPal to the individual that provides
the solution that best fits the judging criteria and requirements listed above.
--
I will be checking back periodically. Judging will be done on a rolling basis up until December 25th.
I will give feedback as comments to the gists you link to as solutions.
A winning solution will be selected on December 25th, 2016.
If you have any questions, feel free to email me: [email protected] ...
Or message me on Facebook: https://www.facebook.com/braydonbatungbacal
-----------
-----------
-----------
Given this example input: 0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC
We would find the following byte substrings with the following occurrence counts.
0xFE - 2 occurrences
0xFF - 2 occurrences
0xAB - 2 occurrences
0x00 - 1 occurrence
0xAC - 1 occurrence
0xFE 0xFF - 2 occurrences
0xFF 0xAB - 2 occurrences
0xAB 0x00 - 1 occurrence
0x00 0xFE - 1 occurrence
0xAB 0xAC - 1 occurrence
0xFE 0xFF 0xAB - 2 occurrences
0xFF 0xAB 0x00 - 1 occurrence
0xAB 0x00 0xFE - 1 occurrence
0x00 0xFE 0xFF - 1 occurrence
0xFF 0xAB 0xAC - 1 occurrence
0xFE 0xFF 0xAB 0x00 - 1 occurrence
0xFF 0xAB 0x00 0xFE - 1 occurrence
0xAB 0x00 0xFE 0xFF - 1 occurrence
0x00 0xFE 0xFF 0xAB - 1 occurrence
0xFE 0xFF 0xAB 0xAC - 1 occurrence
0xFE 0xFF 0xAB 0x00 0xFE - 1 occurrence
0xFF 0xAB 0x00 0xFE 0xFF - 1 occurrence
0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence
0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence
0xFE 0xFF 0xAB 0x00 0xFE 0xFF - 1 occurrence
0xFF 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence
0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence
0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence
0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence
0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence
-
The accepted output for the given input above that excludes all 1 occurrence byte substrings would be:
0xFE - 2 occurrences
0xFF - 2 occurrences
0xAB - 2 occurrences
0xFE 0xFF - 2 occurrences
0xFE 0xFF 0xAB - 2 occurrences
And then as JSON:
{
"FE" : 2,
"FF" : 2,
"AB" : 2,
"FEFF" : 2,
"FEFFAB" : 2
}
-----------
Edge case handling
Let's say we have the following input 0x00 0x00 0x00 0x00
The correct matched byte substrings would be
0x00 - 4 occurrences
0x00 0x00 - 3 occurrences
0x00 0x00 0x00 - 2 occurrences
{
"00" : 4,
"0000" : 3,
"000000" : 2
}
-----------
Example Input Files
Here are links to two example input files. These files are well below the 1GB limit, but they represent
the level of entropy and distribution of byte sequences you can likely expect from the final input file
your solution will be benchmarked against.
https://drive.google.com/file/d/0B1GeH8qXPbvBZTRuNWVJbUVNZGs/view?usp=sharing
AND
https://drive.google.com/file/d/0B1GeH8qXPbvBYmxMU1JSWjlPc2c/view?usp=sharing
-----------
The proposed solution below is O(n^2) where n is the length of the given input.
This is not ideal but it is enough to demonstrate expected output for a given input
for this problem, along with a brute force solution.
Also note, while this example is using strings and characters - it is purely to
demonstrate the generalness of the brute force algorithm used. Your solution will
be required to work with an array of bytes.
*/
function Occurrences() {
this.sequences = {};
this.longestSequence = '';
}
Occurrences.prototype.expand = function(byte) {
this.longestSequence += byte;
for (var offset = 0; offset < this.longestSequence.length; offset++) {
var sequence = this.longestSequence.substr(offset);
this.sequences[sequence] = this.sequences[sequence] + 1 || 1;
}
}
var occurrences = new Occurrences();
var input = "abcdabcefgabc";
for (var i = 0; i < input.length; i++) {
occurrences.expand(input[i]);
}
console.log(occurrences.sequences);
/*
--------
Given input for solution above: "abcdabcefgabc".
Acceptable Output - This is excluding substrings that only occur 1 time in the input:
{
"a":3,
"ab":3,
"b":3,
"abc":3,
"bc":3,
"c":3
}
Complete Solution Output:
{
"a":3,
"ab":3,
"b":3,
"abc":3,
"bc":3,
"c":3,
"abcd":1,
"bcd":1,
"cd":1,
"d":1,
"abcda":1,
"bcda":1,
"cda":1,
"da":1,
"abcdab":1,
"bcdab":1,
"cdab":1,
"dab":1,
"abcdabc":1,
"bcdabc":1,
"cdabc":1,
"dabc":1,
"abcdabce":1,
"bcdabce":1,
"cdabce":1,
"dabce":1,
"abce":1,
"bce":1,
"ce":1,
"e":1,
"abcdabcef":1,
"bcdabcef":1,
"cdabcef":1,
"dabcef":1,
"abcef":1,
"bcef":1,
"cef":1,
"ef":1,
"f":1,
"abcdabcefg":1,
"bcdabcefg":1,
"cdabcefg":1,
"dabcefg":1,
"abcefg":1,
"bcefg":1,
"cefg":1,
"efg":1,
"fg":1,
"g":1,
"abcdabcefga":1,
"bcdabcefga":1,
"cdabcefga":1,
"dabcefga":1,
"abcefga":1,
"bcefga":1,
"cefga":1,
"efga":1,
"fga":1,
"ga":1,
"abcdabcefgab":1,
"bcdabcefgab":1,
"cdabcefgab":1,
"dabcefgab":1,
"abcefgab":1,
"bcefgab":1,
"cefgab":1,
"efgab":1,
"fgab":1,
"gab":1,
"abcdabcefgabc":1,
"bcdabcefgabc":1,
"cdabcefgabc":1,
"dabcefgabc":1,
"abcefgabc":1,
"bcefgabc":1,
"cefgabc":1,
"efgabc":1,
"fgabc":1,
"gabc":1
}
*/
@iamarkdev
Copy link
Author

iamarkdev commented Dec 20, 2016

Hey Jonathan, all the information is listed above in the challenge spec.

The one interesting piece that makes this possible in less than (n^2), at least I think, is that we don't care about substrings that only occur 1 time in the input - those are not expected to be in the output. This also means that any future substrings that are prefixed by a substring that previously only occurred once, will always occur once - so we can ignore them.

@jonathanGB
Copy link

jonathanGB commented Dec 20, 2016

Sure.

DP ignores this, it runs in ϴ(n^2), so always the same running-time. If we took another bottom-up approach like yours and added some logic to ignore branches where we are sure the count is not bigger than 1, this would have no impact on the worst-case scenario, which is the purpose of O notation.

If we used a divide-and-conquer (top-down) approach, we would get exponential run-time - and couldn't use your info, as we can't know starting from the top.

So, overall, I still think it's not possible to do better with what we have. But I'm very curious to see the solutions 😃

@jmoss20
Copy link

jmoss20 commented Dec 20, 2016

@QuadmasterXLII @braydo25, see edit above.

@PirosB3
Copy link

PirosB3 commented Dec 20, 2016

import json
from collections import Counter


def generate_solution_stage(items, byte_array, max_size, final):
    ctr = Counter()
    for start, end in items:

        # Ignore imputs that pass byte array size
        if end > max_size:
            continue

        # Increase counter of object by 1
        ctr[''.join(byte_array[start:end])] += 1

    # Write the soluton
    for res, count in ctr.iteritems():
        if count > 1:
            final[res] = count

    # Create new "round" with substrings that have a > count than 1
    new_items = []
    for start, end in items:
        mark = ''.join(byte_array[start:end])
        if ctr[mark] > 1:
            new_items.append((start, end+1))

    return new_items


def solve(byte_array):
    """
    Solves the ByteSubstringsChallenge. And saves the result to
    `/tmp/output.json`
    https://gist.github.com/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc

    :param byte_array: contains a list of bytes. Where a byte can be
    represented as a string, or encoded in it's string form ("0xAA")
    """

    # Generate initial round
    items = []
    for idx, _ in enumerate(byte_array):
        items.append((idx, idx+1))

    # List of solutions. Continue iterating until there
    # is only 1 string remaining
    solution = {}
    while len(items) > 1:
        items = generate_solution_stage(items, byte_array, len(byte_array), solution)

    with open('/tmp/solution.json', 'w') as f:
        f.write(json.dumps(solution))


# items = '0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC'.split(' ')
items = list('abcdabcefgabc')
solve(items)

Let me know if it needs more documetation!

Contact

pirosb3 at gmail dot com

@iamarkdev
Copy link
Author

iamarkdev commented Dec 20, 2016

Hey Piros - at first glance that still looks like it's O(n^2) complexity? I'll take a closer look and run some benchmarks in the morning.

Edit

Try running your solution against the following input: http://norvig.com/big.txt , I'm seeing execution time upwards of minutes which is inline with what I was seeing with my initial O(n^2) approach. Memory usage is acceptable but high relative to the input size - 4GB peak or so.

@metakirby5
Copy link

With a random distribution of bytes, I doubt sub O(n^2) is achievable. With your example inputs, I was thinking a good optimization would be to exclude any substrings containing a globally unique byte, since any such substring is transitively unique. Otherwise, every possible substring must be considered, and merely examining each possible substring is O(n^2).

@eisenstatdavid
Copy link

@eisenstatdavid
Copy link

(P.S. Consider excluding substrings that don't occur more often than the superstring.)

@eisenstatdavid
Copy link

(P.P.S. The output to big.txt is massive otherwise.)

@jaruesink
Copy link

How about something like this?
http://codepen.io/jaruesink/pen/gLEgmw?editors=0012

const test_string = 'abcdabcefgabc';
const test_length = test_string.length;
const only_exists_once = {};
const final = {};
let string_length = 1;

function existsMoreThanOnce(input) {
  let position = test_string.indexOf(input);
  return test_string.includes(input, position+1) ? true : false;
}

for (let index_l in test_string) {
  // first time for string length
    for (let index_p in test_string) {
      // second time for string position
      let position = parseInt(index_p);
      if (position + string_length <= test_length) {
        let string = test_string.slice(index_p, position+string_length);
        if (only_exists_once[string] !== true) {
          if (!existsMoreThanOnce(string)) {
            only_exists_once[string] = true;
          } else {
            if (!final[string]) { final[string] = 0 }
            final[string]++;
          }
        }
      }
    }
  string_length++
}

console.log('final result: ', final);
console.log('only exists once: ', only_exists_once);

@apjaffe
Copy link

apjaffe commented Dec 20, 2016

@animeshf
Copy link

animeshf commented Dec 20, 2016

It is impossible to solve this in < O(n^2). This is because your answer itself can be as big as O(n^2), as in, number of substrings with frequency > 1 can be O(n^2), so storing them and printing them is indeed O(n^2) as well. Consider this case : Generate any random string S of length N, and create T = S + S i.e. T is simply S written twice. Now the answer for T is of size N^2. Since |T| = 2N, the answer is O(|T|^2)!

If the question, however, was to compute just the number of substrings with frequency > 1, I have a O(n log n) solution. Printing all of them, however, will be O(n^2). If you require, I can explain the O(n log n) approach to compute number of valid substrings.

@iamarkdev
Copy link
Author

Hey eisenstatdavid, jaruesink and apjaffe - Thanks for the submissions, I'll benchmark these now and take a look through them.

@iamarkdev
Copy link
Author

animeshf - You may be correct, I'd be curious if anyone else has a counter argument though.

@iamarkdev
Copy link
Author

I've updated the gist with links to 2 example input files.

@iamarkdev
Copy link
Author

@eisenstatdavid
Copy link

To make @animeshf's argument more carefully, consider the output for an input xx where x is drawn uniformly at random from the set of ASCII strings of length n (denoted ∑^n, where ∑ is the set of ASCII characters). Each substring of x appears at least twice in xx (duh). For each k from n/3 to 2n/3, there are Theta(n) length-k substrings of x. A substring of length at least n/3 repeats in distinct fixed positions with probability at most |∑|^(-n/3), so by a union bound over Theta(n^3) pairs, the probability of at least one such collision globally is O(n^3 |∑|^(-n/3)), which goes to zero as n goes to infinity. Thus we can expect that x usually has Theta(n^2) distinct substrings, which require Theta(n^3) characters to be printed explicitly.

When the overhead of printing the output is large, it makes sense to switch to output-sensitive running times. The solution that I implemented takes time O(n + s) where n is the length of the input and s is the length of the output, achieving this running time by means of Yuta Mori's suffix array construction library.

@jonathanGB
Copy link

@eisenstatdavid s = O(n^2) right?

@jonathanGB
Copy link

@braydo25 in your test file, can we ignore white space or it matters?

@iamarkdev
Copy link
Author

@jonathanGB - Whitespace matters

@iamarkdev
Copy link
Author

iamarkdev commented Dec 21, 2016

This is definitely possible through index reduction with every incrementation of search sequence length, with sequence length starting at 1. I just wrote something up in node that's able to compute all sequences in big.txt that occur more than once in 40 seconds. This time includes writing the JSON output to a file.

Regardless of this, I'm still really curious to see everyone elses solutions as I definitely think there's room for improvement. I will still be awarding the $500 on the 25th.

Output: https://drive.google.com/file/d/0B8H05aARgYGockJ3MWdkWnZLY00/view?usp=sharing

@iamarkdev
Copy link
Author

iamarkdev commented Dec 21, 2016

Real time video of log output when running algorithm against big.txt.

https://drive.google.com/file/d/0B8H05aARgYGoZWdNRWFkUEk3QUU/view?usp=sharing

@eisenstatdavid
Copy link

eisenstatdavid commented Dec 21, 2016

@jonathanGB s is Theta(n^3) in the worst case, actually.

@braydo25 I think something's wrong with your experiment. The big.txt that I download from Norvig has sha1sum 213ca262bf6af12428d42842848464565f3d5504. It has a repeated substring of length 8223 (not 66 or 67) starting on or around lines 5311 and 121518. This substring and its substrings alone imply a heuristic lower bound of something like 93 GB on the length of the output (8224*8223/2 nonempty substrings, probably mostly distinct, of average length around 8223/3).

@eisenstatdavid
Copy link

Here's confirmation. Running

sed -n '5312,5500p' big.txt | wc
sed -n '5312,5500p' big.txt | sha1sum
sed -n '121519,121707p' big.txt | sha1sum

gives

     189    1296    8222
6e14cea8c413187db1d0c3430d1f3ab48e0c4dca  -
6e14cea8c413187db1d0c3430d1f3ab48e0c4dca  -

The substring here is one short because sed won't give me the preceding newlines without a fight.

@iamarkdev
Copy link
Author

Ah - good catch @eisenstatdavid - there was definitely a bug in my algo.

@e-wahyutama
Copy link

@AWice
Copy link

AWice commented Dec 25, 2016

https://github.com/davehughes/sais/blob/master/sais-lite/sais.c ,line 646
This solves the global problem on stackoverflow directly using a fast suffix array implementation (sais is basically fastest in the world)

@iamarkdev
Copy link
Author

Hey guys, I'll be reviewing everything tonight and tomorrow morning and posting a winner then.

@iamarkdev
Copy link
Author

Hey guys still going through submissions

@jonathanGB
Copy link

So @braydo25 ? Just curious on what is the result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment