-
-
Save iamarkdev/f7a9ce7ce7ad7c5fb11ec511887789bc to your computer and use it in GitHub Desktop.
/* | |
Preface: | |
I'm issuing this challenge as I've hit a major roadblock in the development process | |
of one of my personal side projects. I've spent the last 2 weeks exhaustively exploring | |
solutions to the following problem and I'm at my wits end. I've taken a number of approaches | |
using various tree data structures and clever algorithms with no success. | |
Problem: | |
Given an input, find all possible byte substrings for the input and count how many | |
times each unique byte substring occurs throughout the input. Return all byte substrings | |
along with their occurence count that have an occurence count > 1. | |
You'll notice the examples given below show all byte substrings including byte substrings that have | |
an occurrence count of 1. This is mainly for illustration of the problem. Your solution | |
should exclude byte substrings that only occur 1 time in the input. | |
REQUIREMENTS | |
- Your solution may be written in whatever language you're comfortable with. | |
- Your solution should be able to compute a correct output more efficiently than O(n^2). | |
- Your solution should not exceed a memory usage limit of 32GB. | |
- Your solution must work with any file of any type that is <= 1GB in size. | |
- Your solution's output must exclude byte substrings than only have 1 occurrence in the input. | |
- Your solution's output must be written to a new file in a transferrable data format such as JSON (output.json for example). | |
- Your solution's output should be equivalent to an object of key value pairs as: (substring) => (# of occurrences). | |
- Please provide a way to contact you in the event your solution is accepted as the winning solution. | |
JUDGING CRITERIA | |
- Correct output (See requirements and examples below to understand what's considered correct output). (Pass / fail) | |
- Meets requirements (See above). (Pass / fail) | |
- Execution time: I will benchmark all solutions on the same machine with the same input - the input will be a binary file. | |
- Algorithm / Data Structure clarity and concision - if your solution has complex working parts, | |
please include comments where applicable to give better explain what's going on. | |
SUBMISSION | |
Please comment with a link to a gist of your solution to submit it for consideration. | |
PRIZE | |
I am guaranteeing $500.00 USD that will be gifted through PayPal to the individual that provides | |
the solution that best fits the judging criteria and requirements listed above. | |
-- | |
I will be checking back periodically. Judging will be done on a rolling basis up until December 25th. | |
I will give feedback as comments to the gists you link to as solutions. | |
A winning solution will be selected on December 25th, 2016. | |
If you have any questions, feel free to email me: [email protected] ... | |
Or message me on Facebook: https://www.facebook.com/braydonbatungbacal | |
----------- | |
----------- | |
----------- | |
Given this example input: 0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC | |
We would find the following byte substrings with the following occurrence counts. | |
0xFE - 2 occurrences | |
0xFF - 2 occurrences | |
0xAB - 2 occurrences | |
0x00 - 1 occurrence | |
0xAC - 1 occurrence | |
0xFE 0xFF - 2 occurrences | |
0xFF 0xAB - 2 occurrences | |
0xAB 0x00 - 1 occurrence | |
0x00 0xFE - 1 occurrence | |
0xAB 0xAC - 1 occurrence | |
0xFE 0xFF 0xAB - 2 occurrences | |
0xFF 0xAB 0x00 - 1 occurrence | |
0xAB 0x00 0xFE - 1 occurrence | |
0x00 0xFE 0xFF - 1 occurrence | |
0xFF 0xAB 0xAC - 1 occurrence | |
0xFE 0xFF 0xAB 0x00 - 1 occurrence | |
0xFF 0xAB 0x00 0xFE - 1 occurrence | |
0xAB 0x00 0xFE 0xFF - 1 occurrence | |
0x00 0xFE 0xFF 0xAB - 1 occurrence | |
0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
0xFE 0xFF 0xAB 0x00 0xFE - 1 occurrence | |
0xFF 0xAB 0x00 0xFE 0xFF - 1 occurrence | |
0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
0xFE 0xFF 0xAB 0x00 0xFE 0xFF - 1 occurrence | |
0xFF 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
- | |
The accepted output for the given input above that excludes all 1 occurrence byte substrings would be: | |
0xFE - 2 occurrences | |
0xFF - 2 occurrences | |
0xAB - 2 occurrences | |
0xFE 0xFF - 2 occurrences | |
0xFE 0xFF 0xAB - 2 occurrences | |
And then as JSON: | |
{ | |
"FE" : 2, | |
"FF" : 2, | |
"AB" : 2, | |
"FEFF" : 2, | |
"FEFFAB" : 2 | |
} | |
----------- | |
Edge case handling | |
Let's say we have the following input 0x00 0x00 0x00 0x00 | |
The correct matched byte substrings would be | |
0x00 - 4 occurrences | |
0x00 0x00 - 3 occurrences | |
0x00 0x00 0x00 - 2 occurrences | |
{ | |
"00" : 4, | |
"0000" : 3, | |
"000000" : 2 | |
} | |
----------- | |
Example Input Files | |
Here are links to two example input files. These files are well below the 1GB limit, but they represent | |
the level of entropy and distribution of byte sequences you can likely expect from the final input file | |
your solution will be benchmarked against. | |
https://drive.google.com/file/d/0B1GeH8qXPbvBZTRuNWVJbUVNZGs/view?usp=sharing | |
AND | |
https://drive.google.com/file/d/0B1GeH8qXPbvBYmxMU1JSWjlPc2c/view?usp=sharing | |
----------- | |
The proposed solution below is O(n^2) where n is the length of the given input. | |
This is not ideal but it is enough to demonstrate expected output for a given input | |
for this problem, along with a brute force solution. | |
Also note, while this example is using strings and characters - it is purely to | |
demonstrate the generalness of the brute force algorithm used. Your solution will | |
be required to work with an array of bytes. | |
*/ | |
function Occurrences() { | |
this.sequences = {}; | |
this.longestSequence = ''; | |
} | |
Occurrences.prototype.expand = function(byte) { | |
this.longestSequence += byte; | |
for (var offset = 0; offset < this.longestSequence.length; offset++) { | |
var sequence = this.longestSequence.substr(offset); | |
this.sequences[sequence] = this.sequences[sequence] + 1 || 1; | |
} | |
} | |
var occurrences = new Occurrences(); | |
var input = "abcdabcefgabc"; | |
for (var i = 0; i < input.length; i++) { | |
occurrences.expand(input[i]); | |
} | |
console.log(occurrences.sequences); | |
/* | |
-------- | |
Given input for solution above: "abcdabcefgabc". | |
Acceptable Output - This is excluding substrings that only occur 1 time in the input: | |
{ | |
"a":3, | |
"ab":3, | |
"b":3, | |
"abc":3, | |
"bc":3, | |
"c":3 | |
} | |
Complete Solution Output: | |
{ | |
"a":3, | |
"ab":3, | |
"b":3, | |
"abc":3, | |
"bc":3, | |
"c":3, | |
"abcd":1, | |
"bcd":1, | |
"cd":1, | |
"d":1, | |
"abcda":1, | |
"bcda":1, | |
"cda":1, | |
"da":1, | |
"abcdab":1, | |
"bcdab":1, | |
"cdab":1, | |
"dab":1, | |
"abcdabc":1, | |
"bcdabc":1, | |
"cdabc":1, | |
"dabc":1, | |
"abcdabce":1, | |
"bcdabce":1, | |
"cdabce":1, | |
"dabce":1, | |
"abce":1, | |
"bce":1, | |
"ce":1, | |
"e":1, | |
"abcdabcef":1, | |
"bcdabcef":1, | |
"cdabcef":1, | |
"dabcef":1, | |
"abcef":1, | |
"bcef":1, | |
"cef":1, | |
"ef":1, | |
"f":1, | |
"abcdabcefg":1, | |
"bcdabcefg":1, | |
"cdabcefg":1, | |
"dabcefg":1, | |
"abcefg":1, | |
"bcefg":1, | |
"cefg":1, | |
"efg":1, | |
"fg":1, | |
"g":1, | |
"abcdabcefga":1, | |
"bcdabcefga":1, | |
"cdabcefga":1, | |
"dabcefga":1, | |
"abcefga":1, | |
"bcefga":1, | |
"cefga":1, | |
"efga":1, | |
"fga":1, | |
"ga":1, | |
"abcdabcefgab":1, | |
"bcdabcefgab":1, | |
"cdabcefgab":1, | |
"dabcefgab":1, | |
"abcefgab":1, | |
"bcefgab":1, | |
"cefgab":1, | |
"efgab":1, | |
"fgab":1, | |
"gab":1, | |
"abcdabcefgabc":1, | |
"bcdabcefgabc":1, | |
"cdabcefgabc":1, | |
"dabcefgabc":1, | |
"abcefgabc":1, | |
"bcefgabc":1, | |
"cefgabc":1, | |
"efgabc":1, | |
"fgabc":1, | |
"gabc":1 | |
} | |
*/ |
@eisenstatdavid s = O(n^2) right?
@braydo25 in your test file, can we ignore white space or it matters?
@jonathanGB - Whitespace matters
This is definitely possible through index reduction with every incrementation of search sequence length, with sequence length starting at 1. I just wrote something up in node that's able to compute all sequences in big.txt that occur more than once in 40 seconds. This time includes writing the JSON output to a file.
Regardless of this, I'm still really curious to see everyone elses solutions as I definitely think there's room for improvement. I will still be awarding the $500 on the 25th.
Output: https://drive.google.com/file/d/0B8H05aARgYGockJ3MWdkWnZLY00/view?usp=sharing
Real time video of log output when running algorithm against big.txt.
https://drive.google.com/file/d/0B8H05aARgYGoZWdNRWFkUEk3QUU/view?usp=sharing
@jonathanGB s is Theta(n^3) in the worst case, actually.
@braydo25 I think something's wrong with your experiment. The big.txt
that I download from Norvig has sha1sum 213ca262bf6af12428d42842848464565f3d5504
. It has a repeated substring of length 8223 (not 66 or 67) starting on or around lines 5311 and 121518. This substring and its substrings alone imply a heuristic lower bound of something like 93 GB on the length of the output (8224*8223/2 nonempty substrings, probably mostly distinct, of average length around 8223/3).
Here's confirmation. Running
sed -n '5312,5500p' big.txt | wc
sed -n '5312,5500p' big.txt | sha1sum
sed -n '121519,121707p' big.txt | sha1sum
gives
189 1296 8222
6e14cea8c413187db1d0c3430d1f3ab48e0c4dca -
6e14cea8c413187db1d0c3430d1f3ab48e0c4dca -
The substring here is one short because sed
won't give me the preceding newlines without a fight.
Ah - good catch @eisenstatdavid - there was definitely a bug in my algo.
https://github.com/davehughes/sais/blob/master/sais-lite/sais.c ,line 646
This solves the global problem on stackoverflow directly using a fast suffix array implementation (sais is basically fastest in the world)
Hey guys, I'll be reviewing everything tonight and tomorrow morning and posting a winner then.
Hey guys still going through submissions
So @braydo25 ? Just curious on what is the result
To make @animeshf's argument more carefully, consider the output for an input xx where x is drawn uniformly at random from the set of ASCII strings of length n (denoted ∑^n, where ∑ is the set of ASCII characters). Each substring of x appears at least twice in xx (duh). For each k from n/3 to 2n/3, there are Theta(n) length-k substrings of x. A substring of length at least n/3 repeats in distinct fixed positions with probability at most |∑|^(-n/3), so by a union bound over Theta(n^3) pairs, the probability of at least one such collision globally is O(n^3 |∑|^(-n/3)), which goes to zero as n goes to infinity. Thus we can expect that x usually has Theta(n^2) distinct substrings, which require Theta(n^3) characters to be printed explicitly.
When the overhead of printing the output is large, it makes sense to switch to output-sensitive running times. The solution that I implemented takes time O(n + s) where n is the length of the input and s is the length of the output, achieving this running time by means of Yuta Mori's suffix array construction library.