@magnetikonline
Last active March 3, 2025 23:19
Python function - test if given file is considered binary.

Python function - is file binary?

Function which determines if a given file is binary.

Test is based on the following algorithm (similar to that implemented within Perl):

  • Empty files are considered text.
  • If not empty, read up to 512 bytes as a buffer. File will be binary if:
    • Null byte is encountered.
    • At least 30% of the buffer consists of "non text" characters.
  • Otherwise, file is text.

Reference

#!/usr/bin/env python3


class IsFileBinary:
    READ_BYTES = 512
    CHAR_THRESHOLD = 0.3
    # printable ASCII plus backspace, form feed, newline, carriage return, tab
    TEXT_CHARACTERS = bytes(range(32, 127)) + b"\b\f\n\r\t"

    def test(self, file_path):
        # read chunk of file in binary mode, so no text decoding takes place
        with open(file_path, "rb") as fh:
            file_data = fh.read(IsFileBinary.READ_BYTES)

        # store chunk length read
        data_length = len(file_data)
        if not data_length:
            # empty files considered text
            return False

        if b"\x00" in file_data:
            # file containing null bytes is binary
            return True

        # remove all text characters from file chunk, get remaining length
        binary_length = len(file_data.translate(None, IsFileBinary.TEXT_CHARACTERS))

        # if percentage of binary characters at or above threshold, binary file
        return (binary_length / data_length) >= IsFileBinary.CHAR_THRESHOLD


def main():
    is_file_binary = IsFileBinary()
    print("Is binary file: {0}".format(is_file_binary.test("./first")))
    print("Is binary file: {0}".format(is_file_binary.test("./second")))
    print("Is binary file: {0}".format(is_file_binary.test("./third")))


if __name__ == "__main__":
    main()
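
The `./first`, `./second` and `./third` paths in `main()` are placeholders. One way to exercise each rule without preparing files by hand is sketched below. It assumes Python 3 and restates the heuristic as a plain function so the example runs on its own; the names `looks_binary` and `check` are hypothetical and not part of the gist.

```python
import os
import tempfile

READ_BYTES = 512
CHAR_THRESHOLD = 0.3
# printable ASCII plus backspace, form feed, newline, carriage return, tab
TEXT_CHARACTERS = bytes(range(32, 127)) + b"\b\f\n\r\t"


def looks_binary(file_path):
    """Hypothetical helper: the gist's heuristic as a plain function."""
    with open(file_path, "rb") as fh:
        chunk = fh.read(READ_BYTES)
    if not chunk:
        return False  # empty files are considered text
    if b"\x00" in chunk:
        return True  # a null byte marks the file as binary
    # delete every text byte; whatever is left counts as "non text"
    non_text = len(chunk.translate(None, TEXT_CHARACTERS))
    return non_text / len(chunk) >= CHAR_THRESHOLD


def check(content):
    """Write content to a temporary file, classify it, then clean up."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(content)
        return looks_binary(path)
    finally:
        os.remove(path)


results = {
    "empty": check(b""),
    "plain text": check(b"hello, world\n"),
    "null byte": check(b"abc\x00def"),
    "2 of 10 non-text": check(b"abcdefgh\x80\x81"),
    "3 of 10 non-text": check(b"abcdefg\x80\x81\x82"),
}
for label, is_binary in results.items():
    print(f"{label}: {is_binary}")
```

The last two cases sit on either side of the threshold: 2 non-text bytes out of 10 (20%) is classified as text, while 3 out of 10 (30%) is classified as binary because the comparison is `>=`.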
@ricardo-reis-1970

Python 3.12.3 raises a UnicodeDecodeError exception when reading a binary file opened with 'r'. Here's my non-OOP approach:

def is_file_binary(file_path: str) -> bool:
    try:
        with open(file_path, 'r') as fp:
            fp.read(16)
            return False
    except UnicodeDecodeError:
        return True

Empty files reported as non-binary.
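
One caveat with a decode-based check: null bytes and other control bytes below 0x80 are valid UTF-8 (and valid in other ASCII-compatible encodings), so a file made up only of such bytes decodes without error and is reported as non-binary. The result also depends on the platform's default encoding unless one is passed to `open`. Below is a minimal sketch of a variant that addresses both, assuming UTF-8 is the encoding of interest. The name `is_file_binary_utf8` and the 512-byte chunk size are illustrative choices, not taken from the function above.

```python
import codecs
import os
import tempfile


def is_file_binary_utf8(file_path: str, read_bytes: int = 512) -> bool:
    """Hypothetical variant: binary if a null byte or invalid UTF-8 is found."""
    with open(file_path, "rb") as fp:
        chunk = fp.read(read_bytes)
    if b"\x00" in chunk:
        return True
    # An incremental decoder with final=False tolerates a multi-byte character
    # cut off at the end of the chunk, but still raises on invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return True
    return False


def check(content: bytes) -> bool:
    """Write content to a temporary file, classify it, then clean up."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(content)
        return is_file_binary_utf8(path)
    finally:
        os.remove(path)


empty = check(b"")
accented = check("héllo\n".encode("utf-8"))
control_bytes = check(b"\x00\x01\x02")
invalid_utf8 = check(b"\xff\xfe\xfd")
truncated_char = check("é".encode("utf-8")[:1])
print(empty, accented, control_bytes, invalid_utf8, truncated_char)
```

Reading in `'rb'` mode keeps the decoding step explicit, so the outcome no longer varies with the locale.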

On a side note, please consider the humongous difference in simplicity, code transparency, ease of use, etc., between obsessing over writing classes for every single thing and just coding it directly. If all you know is OOP, what you do every day is adjusting your problem to a predetermined solution.

@magnetikonline
Author

If all you know is OOP, what you do every day is adjusting your problem to a predetermined solution.

one of the strangest comments I've read for a while. 😄
