-
-
Save iamarkdev/f7a9ce7ce7ad7c5fb11ec511887789bc to your computer and use it in GitHub Desktop.
| /* | |
| Preface: | |
| I'm issuing this challenge as I've hit a major roadblock in the development process | |
| of one of my personal side projects. I've spent the last 2 weeks exhaustively exploring | |
| solutions to the following problem and I'm at my wits end. I've taken a number of approaches | |
| using various tree data structures and clever algorithms with no success. | |
| Problem: | |
| Given an input, find all possible byte substrings for the input and count how many | |
| times each unique byte substring occurs throughout the input. Return all byte substrings | |
| along with their occurence count that have an occurence count > 1. | |
| You'll notice the examples given below show all byte substrings including byte substrings that have | |
| an occurrence count of 1. This is mainly for illustration of the problem. Your solution | |
| should exclude byte substrings that only occur 1 time in the input. | |
| REQUIREMENTS | |
| - Your solution may be written in whatever language you're comfortable with. | |
| - Your solution should be able to compute a correct output more efficiently than O(n^2). | |
| - Your solution should not exceed a memory usage limit of 32GB. | |
| - Your solution must work with any file of any type that is <= 1GB in size. | |
| - Your solution's output must exclude byte substrings than only have 1 occurrence in the input. | |
| - Your solution's output must be written to a new file in a transferrable data format such as JSON (output.json for example). | |
| - Your solution's output should be equivalent to an object of key value pairs as: (substring) => (# of occurrences). | |
| - Please provide a way to contact you in the event your solution is accepted as the winning solution. | |
| JUDGING CRITERIA | |
| - Correct output (See requirements and examples below to understand what's considered correct output). (Pass / fail) | |
| - Meets requirements (See above). (Pass / fail) | |
| - Execution time: I will benchmark all solutions on the same machine with the same input - the input will be a binary file. | |
| - Algorithm / Data Structure clarity and concision - if your solution has complex working parts, | |
| please include comments where applicable to give better explain what's going on. | |
| SUBMISSION | |
| Please comment with a link to a gist of your solution to submit it for consideration. | |
| PRIZE | |
| I am guaranteeing $500.00 USD that will be gifted through PayPal to the individual that provides | |
| the solution that best fits the judging criteria and requirements listed above. | |
| -- | |
| I will be checking back periodically. Judging will be done on a rolling basis up until December 25th. | |
| I will give feedback as comments to the gists you link to as solutions. | |
| A winning solution will be selected on December 25th, 2016. | |
| If you have any questions, feel free to email me: [email protected] ... | |
| Or message me on Facebook: https://www.facebook.com/braydonbatungbacal | |
| ----------- | |
| ----------- | |
| ----------- | |
| Given this example input: 0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC | |
| We would find the following byte substrings with the following occurrence counts. | |
| 0xFE - 2 occurrences | |
| 0xFF - 2 occurrences | |
| 0xAB - 2 occurrences | |
| 0x00 - 1 occurrence | |
| 0xAC - 1 occurrence | |
| 0xFE 0xFF - 2 occurrences | |
| 0xFF 0xAB - 2 occurrences | |
| 0xAB 0x00 - 1 occurrence | |
| 0x00 0xFE - 1 occurrence | |
| 0xAB 0xAC - 1 occurrence | |
| 0xFE 0xFF 0xAB - 2 occurrences | |
| 0xFF 0xAB 0x00 - 1 occurrence | |
| 0xAB 0x00 0xFE - 1 occurrence | |
| 0x00 0xFE 0xFF - 1 occurrence | |
| 0xFF 0xAB 0xAC - 1 occurrence | |
| 0xFE 0xFF 0xAB 0x00 - 1 occurrence | |
| 0xFF 0xAB 0x00 0xFE - 1 occurrence | |
| 0xAB 0x00 0xFE 0xFF - 1 occurrence | |
| 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
| 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
| 0xFE 0xFF 0xAB 0x00 0xFE - 1 occurrence | |
| 0xFF 0xAB 0x00 0xFE 0xFF - 1 occurrence | |
| 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
| 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
| 0xFE 0xFF 0xAB 0x00 0xFE 0xFF - 1 occurrence | |
| 0xFF 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
| 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
| 0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB - 1 occurrence | |
| 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
| 0xFE 0xFF 0xAB 0x00 0xFE 0xFF 0xAB 0xAC - 1 occurrence | |
| - | |
| The accepted output for the given input above that excludes all 1 occurrence byte substrings would be: | |
| 0xFE - 2 occurrences | |
| 0xFF - 2 occurrences | |
| 0xAB - 2 occurrences | |
| 0xFE 0xFF - 2 occurrences | |
| 0xFE 0xFF 0xAB - 2 occurrences | |
| And then as JSON: | |
| { | |
| "FE" : 2, | |
| "FF" : 2, | |
| "AB" : 2, | |
| "FEFF" : 2, | |
| "FEFFAB" : 2 | |
| } | |
| ----------- | |
| Edge case handling | |
| Let's say we have the following input 0x00 0x00 0x00 0x00 | |
| The correct matched byte substrings would be | |
| 0x00 - 4 occurrences | |
| 0x00 0x00 - 3 occurrences | |
| 0x00 0x00 0x00 - 2 occurrences | |
| { | |
| "00" : 4, | |
| "0000" : 3, | |
| "000000" : 2 | |
| } | |
| ----------- | |
| Example Input Files | |
| Here are links to two example input files. These files are well below the 1GB limit, but they represent | |
| the level of entropy and distribution of byte sequences you can likely expect from the final input file | |
| your solution will be benchmarked against. | |
| https://drive.google.com/file/d/0B1GeH8qXPbvBZTRuNWVJbUVNZGs/view?usp=sharing | |
| AND | |
| https://drive.google.com/file/d/0B1GeH8qXPbvBYmxMU1JSWjlPc2c/view?usp=sharing | |
| ----------- | |
| The proposed solution below is O(n^2) where n is the length of the given input. | |
| This is not ideal but it is enough to demonstrate expected output for a given input | |
| for this problem, along with a brute force solution. | |
| Also note, while this example is using strings and characters - it is purely to | |
| demonstrate the generalness of the brute force algorithm used. Your solution will | |
| be required to work with an array of bytes. | |
| */ | |
| function Occurrences() { | |
| this.sequences = {}; | |
| this.longestSequence = ''; | |
| } | |
| Occurrences.prototype.expand = function(byte) { | |
| this.longestSequence += byte; | |
| for (var offset = 0; offset < this.longestSequence.length; offset++) { | |
| var sequence = this.longestSequence.substr(offset); | |
| this.sequences[sequence] = this.sequences[sequence] + 1 || 1; | |
| } | |
| } | |
| var occurrences = new Occurrences(); | |
| var input = "abcdabcefgabc"; | |
| for (var i = 0; i < input.length; i++) { | |
| occurrences.expand(input[i]); | |
| } | |
| console.log(occurrences.sequences); | |
| /* | |
| -------- | |
| Given input for solution above: "abcdabcefgabc". | |
| Acceptable Output - This is excluding substrings that only occur 1 time in the input: | |
| { | |
| "a":3, | |
| "ab":3, | |
| "b":3, | |
| "abc":3, | |
| "bc":3, | |
| "c":3 | |
| } | |
| Complete Solution Output: | |
| { | |
| "a":3, | |
| "ab":3, | |
| "b":3, | |
| "abc":3, | |
| "bc":3, | |
| "c":3, | |
| "abcd":1, | |
| "bcd":1, | |
| "cd":1, | |
| "d":1, | |
| "abcda":1, | |
| "bcda":1, | |
| "cda":1, | |
| "da":1, | |
| "abcdab":1, | |
| "bcdab":1, | |
| "cdab":1, | |
| "dab":1, | |
| "abcdabc":1, | |
| "bcdabc":1, | |
| "cdabc":1, | |
| "dabc":1, | |
| "abcdabce":1, | |
| "bcdabce":1, | |
| "cdabce":1, | |
| "dabce":1, | |
| "abce":1, | |
| "bce":1, | |
| "ce":1, | |
| "e":1, | |
| "abcdabcef":1, | |
| "bcdabcef":1, | |
| "cdabcef":1, | |
| "dabcef":1, | |
| "abcef":1, | |
| "bcef":1, | |
| "cef":1, | |
| "ef":1, | |
| "f":1, | |
| "abcdabcefg":1, | |
| "bcdabcefg":1, | |
| "cdabcefg":1, | |
| "dabcefg":1, | |
| "abcefg":1, | |
| "bcefg":1, | |
| "cefg":1, | |
| "efg":1, | |
| "fg":1, | |
| "g":1, | |
| "abcdabcefga":1, | |
| "bcdabcefga":1, | |
| "cdabcefga":1, | |
| "dabcefga":1, | |
| "abcefga":1, | |
| "bcefga":1, | |
| "cefga":1, | |
| "efga":1, | |
| "fga":1, | |
| "ga":1, | |
| "abcdabcefgab":1, | |
| "bcdabcefgab":1, | |
| "cdabcefgab":1, | |
| "dabcefgab":1, | |
| "abcefgab":1, | |
| "bcefgab":1, | |
| "cefgab":1, | |
| "efgab":1, | |
| "fgab":1, | |
| "gab":1, | |
| "abcdabcefgabc":1, | |
| "bcdabcefgabc":1, | |
| "cdabcefgabc":1, | |
| "dabcefgabc":1, | |
| "abcefgabc":1, | |
| "bcefgabc":1, | |
| "cefgabc":1, | |
| "efgabc":1, | |
| "fgabc":1, | |
| "gabc":1 | |
| } | |
| */ |
Submission: https://github.com/apjaffe/suffix-tree
It is impossible to solve this in < O(n^2). This is because your answer itself can be as big as O(n^2), as in, number of substrings with frequency > 1 can be O(n^2), so storing them and printing them is indeed O(n^2) as well. Consider this case : Generate any random string S of length N, and create T = S + S i.e. T is simply S written twice. Now the answer for T is of size N^2. Since |T| = 2N, the answer is O(|T|^2)!
If the question, however, was to compute just the number of substrings with frequency > 1, I have a O(n log n) solution. Printing all of them, however, will be O(n^2). If you require, I can explain the O(n log n) approach to compute number of valid substrings.
Hey eisenstatdavid, jaruesink and apjaffe - Thanks for the submissions, I'll benchmark these now and take a look through them.
animeshf - You may be correct, I'd be curious if anyone else has a counter argument though.
I've updated the gist with links to 2 example input files.
@jaruesink - Give this a try: https://gist.github.com/braydo25/8d888674ab7b48c823f7f3a3da10b043
To make @animeshf's argument more carefully, consider the output for an input xx where x is drawn uniformly at random from the set of ASCII strings of length n (denoted ∑^n, where ∑ is the set of ASCII characters). Each substring of x appears at least twice in xx (duh). For each k from n/3 to 2n/3, there are Theta(n) length-k substrings of x. A substring of length at least n/3 repeats in distinct fixed positions with probability at most |∑|^(-n/3), so by a union bound over Theta(n^3) pairs, the probability of at least one such collision globally is O(n^3 |∑|^(-n/3)), which goes to zero as n goes to infinity. Thus we can expect that x usually has Theta(n^2) distinct substrings, which require Theta(n^3) characters to be printed explicitly.
When the overhead of printing the output is large, it makes sense to switch to output-sensitive running times. The solution that I implemented takes time O(n + s) where n is the length of the input and s is the length of the output, achieving this running time by means of Yuta Mori's suffix array construction library.
@eisenstatdavid s = O(n^2) right?
@braydo25 in your test file, can we ignore white space or it matters?
@jonathanGB - Whitespace matters
This is definitely possible through index reduction with every incrementation of search sequence length, with sequence length starting at 1. I just wrote something up in node that's able to compute all sequences in big.txt that occur more than once in 40 seconds. This time includes writing the JSON output to a file.
Regardless of this, I'm still really curious to see everyone elses solutions as I definitely think there's room for improvement. I will still be awarding the $500 on the 25th.
Output: https://drive.google.com/file/d/0B8H05aARgYGockJ3MWdkWnZLY00/view?usp=sharing
Real time video of log output when running algorithm against big.txt.
https://drive.google.com/file/d/0B8H05aARgYGoZWdNRWFkUEk3QUU/view?usp=sharing
@jonathanGB s is Theta(n^3) in the worst case, actually.
@braydo25 I think something's wrong with your experiment. The big.txt that I download from Norvig has sha1sum 213ca262bf6af12428d42842848464565f3d5504. It has a repeated substring of length 8223 (not 66 or 67) starting on or around lines 5311 and 121518. This substring and its substrings alone imply a heuristic lower bound of something like 93 GB on the length of the output (8224*8223/2 nonempty substrings, probably mostly distinct, of average length around 8223/3).
Here's confirmation. Running
sed -n '5312,5500p' big.txt | wc
sed -n '5312,5500p' big.txt | sha1sum
sed -n '121519,121707p' big.txt | sha1sum
gives
189 1296 8222
6e14cea8c413187db1d0c3430d1f3ab48e0c4dca -
6e14cea8c413187db1d0c3430d1f3ab48e0c4dca -
The substring here is one short because sed won't give me the preceding newlines without a fight.
Ah - good catch @eisenstatdavid - there was definitely a bug in my algo.
https://github.com/davehughes/sais/blob/master/sais-lite/sais.c ,line 646
This solves the global problem on stackoverflow directly using a fast suffix array implementation (sais is basically fastest in the world)
Hey guys, I'll be reviewing everything tonight and tomorrow morning and posting a winner then.
Hey guys still going through submissions
So @braydo25 ? Just curious on what is the result
How about something like this?
http://codepen.io/jaruesink/pen/gLEgmw?editors=0012