Last active
September 13, 2018 06:48
# Copyright 2018 by Tomas Sedovic, all rights reserved
# Contact <[email protected]> for licensing options.

import random


# NOTE: p=0.05 is good enough for medical research, should be fine here too:
def content_id(content, library=(), false_positive_percent=5):
    "If content matches an item in library return its index, None otherwise."
    # Clamp the percentage to a probability in [0, 1].
    rate = max(0, min(1, false_positive_percent / 100))
    # random.random() is in [0, 1), so a strict < means a rate of 0 never matches.
    found_in_library = random.random() < rate
    if library and found_in_library:
        return random.randint(0, len(library) - 1)
    return None
# Usage:
>>> library = ["Avengers", "Windows 10", "Helter Skelter", "Harry Potter"]
# NOTE: the library can also be a list of {name: name, data: full contents of the works} dicts.
# You can also supply the hashed contents to make the library smaller. The algorithm is very flexible.
>>> for _ in range(10): print(content_id(3.14159265358979323, library, 10))
...
None
None
None
None
None
None
3
None
None
None
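The note about hashed contents could be sketched roughly as follows. This is only an illustration: `hashed_library` is a hypothetical helper that is not part of the gist, SHA-256 is an assumed choice of hash, and `content_id` is restated so the block runs on its own.

```python
import hashlib
import random


def content_id(content, library=(), false_positive_percent=5):
    """Restatement of the gist's function so this block is self-contained."""
    rate = max(0, min(1, false_positive_percent / 100))
    if library and random.random() < rate:
        return random.randint(0, len(library) - 1)
    return None


def hashed_library(works):
    """Hypothetical helper: replace each work's bytes with a SHA-256 hex digest."""
    return [hashlib.sha256(data).hexdigest() for data in works]


# Stand-in byte strings; real entries would be the full contents of the works.
works = [b"Avengers", b"Windows 10", b"Helter Skelter", b"Harry Potter"]
library = hashed_library(works)

# Every entry is 64 hex characters, however large the original work was.
print([len(digest) for digest in library])

# content_id only reads len(library), so the result is an index or None.
result = content_id(b"some upload", library, 10)
print(result)
```

Because the function only looks at the length of the library, swapping titles for digests changes the storage cost and nothing else.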
With the EU copyright reform, everyone will need a filter that tells them whether an uploaded piece of content matches copyrighted material.
This program is easy to deploy and uses modern technology as well as solid, well-understood computer science foundations to be fast and reliable (no crashes!). The current law does not specify a limit on false positives, but this algorithm is flexible enough that you can tweak the rate to suit your monetary or future compliance needs.
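To see what tweaking `false_positive_percent` does in practice, here is a rough sketch that counts how often `content_id` reports a match over many uploads. The names `observed_match_rate`, `trials`, and `seed` are made up for this illustration, and `content_id` is restated so the block runs by itself.

```python
import random


def content_id(content, library=(), false_positive_percent=5):
    """Restatement of the gist's function so this block is self-contained."""
    rate = max(0, min(1, false_positive_percent / 100))
    if library and random.random() < rate:
        return random.randint(0, len(library) - 1)
    return None


def observed_match_rate(false_positive_percent, trials=100_000, seed=0):
    """Fraction of uploads flagged as a match, measured over `trials` calls."""
    random.seed(seed)  # fixed seed so repeated runs give the same estimate
    library = ["Avengers", "Windows 10", "Helter Skelter", "Harry Potter"]
    hits = sum(
        # object() is a fresh value each time; the content is never inspected.
        content_id(object(), library, false_positive_percent) is not None
        for _ in range(trials)
    )
    return hits / trials


for percent in (0, 5, 10, 50, 100):
    print(percent, observed_match_rate(percent))
```

The measured rate follows the configured percentage and does not depend on what was uploaded: a setting of 0 never reports a match, and a setting of 100 always does as long as the library is not empty.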
I am happy to discuss licensing options.