This project gave us interesting insights into hash table implementations and Python in general. It showed us how much the choice of implementation matters when dealing with big data.
For chaining, we noticed that a good hash function is really important for spreading entries over as many buckets as possible; otherwise searching diverges from the O(1) expectation of a hash table. Similarly, the fact that each index of the hash array can hold more than one item doesn't mean you can skip table resizing when the load factor becomes too big.
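A minimal sketch of the idea (class and method names here are illustrative, not the project's actual code): chains live in per-bucket lists, and the table still resizes once the load factor passes a threshold so average chain length stays bounded.

```python
class ChainedHashTable:
    def __init__(self, capacity=8, max_load=0.75):
        self.capacity = capacity
        self.max_load = max_load
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # A good hash function spreads keys evenly across buckets.
        return hash(key) % self.capacity

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        # Even with chaining, resize once the load factor grows too large,
        # otherwise chains (and lookup cost) keep growing.
        if self.size / self.capacity > self.max_load:
            self._resize(2 * self.capacity)

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_capacity):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.capacity = new_capacity
        self.buckets = [[] for _ in range(new_capacity)]
        self.size = 0
        for k, v in old_items:
            self.insert(k, v)
```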
For linear open addressing, the randomness of your input is very important, since otherwise large clusters form. As the load factor approaches 1 it becomes very slow, because you may have to iterate through many elements just to determine whether a key exists. Another drawback is that deletion doesn't free up memory: you have to mark the deleted bucket with a dummy value so that searches for elements stored after it still work.
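A simplified sketch of linear probing with dummy ("tombstone") deletion, again with illustrative names rather than the project's code:

```python
_EMPTY = object()     # slot never used
_DELETED = object()   # dummy marker so later probes do not stop early

class LinearProbingTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.keys = [_EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Walk forward one slot at a time from the home bucket, wrapping around.
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            yield i
            i = (i + 1) % self.capacity

    def insert(self, key, value):
        for i in self._probe(key):
            if self.keys[i] in (_EMPTY, _DELETED) or self.keys[i] == key:
                self.keys[i] = key
                self.values[i] = value
                return
        raise RuntimeError("table full; a real implementation would resize here")

    def lookup(self, key):
        for i in self._probe(key):
            if self.keys[i] is _EMPTY:
                break                  # a truly empty slot ends the probe
            if self.keys[i] is not _DELETED and self.keys[i] == key:
                return self.values[i]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            if self.keys[i] is _EMPTY:
                break
            if self.keys[i] is not _DELETED and self.keys[i] == key:
                self.keys[i] = _DELETED   # keep the slot marked so later keys stay findable
                self.values[i] = None
                return
        raise KeyError(key)
```

Note that the deleted slot is never returned to a free state here, which is exactly the memory drawback described above.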
For quadratic open addressing, the randomness of your input is still important, but less so than for linear probing, since the quadratic probe sequence spreads the keys out more. One downside of quadratic probing is that the probe sequence may not visit every bucket. As with linear probing, the memory for deleted values cannot be freed outright without causing issues with searching.
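A short illustration (not the project's code) of a quadratic probe sequence and of the fact that it can miss buckets, assuming the common home + step² scheme:

```python
def quadratic_probe_sequence(key_hash, capacity):
    """Yield the buckets visited for one key: home, home + 1^2, home + 2^2, ..."""
    home = key_hash % capacity
    for step in range(capacity):
        yield (home + step * step) % capacity

# For a table of size 10, a key hashing to 0 only ever reaches these buckets:
visited = sorted(set(quadratic_probe_sequence(0, 10)))
print(visited)  # [0, 1, 4, 5, 6, 9] -- buckets 2, 3, 7, 8 are never probed
```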
For very high load factors, chaining is faster than open addressing: open addressing can produce large clusters that force you to step through many items before a search or insertion terminates, while chaining only has to iterate through the items that landed in the same bucket. One disadvantage of chaining is that a linked list's nodes aren't contiguous in memory, and hopping between scattered addresses is not cache-friendly for the CPU, but since we used Python lists as our "linked lists" that was less of an issue.
In terms of general Python observations, we initially used Python lists for our open addressing implementations, but that used far too much memory. Rewriting the code with NumPy arrays helped considerably. This shows that every detail matters when dealing with large data sets.
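A rough illustration of the memory difference we mean (not the project's code; exact numbers vary by platform and Python/NumPy version): a Python list of a million distinct ints stores pointers to boxed int objects, while a NumPy array stores the raw values contiguously.

```python
import sys
import numpy as np

n = 1_000_000

py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# List footprint = the pointer array plus every boxed int object it references.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(list_bytes)        # roughly 36 MB: 8-byte pointers plus ~28-byte int objects
print(np_array.nbytes)   # exactly 8 MB: one million unboxed 64-bit ints
```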