@ptvirgo
Created August 27, 2017 13:21
I caught a rude character interfering with my data migration...
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import unittest

# U+3002 IDEOGRAPHIC FULL STOP: one character, three bytes in UTF-8.
RUDE = b"\xe3\x80\x82".decode("utf-8")


def limit_slice(maxlen, text):
    """Given a maximum length in bytes and text, slice the UTF-8 encoding of
    the text to that length and decode the result."""
    encoded = text.encode("utf-8", errors="replace")[:maxlen]
    # A slice that lands in the middle of a character decodes to U+FFFD,
    # which is itself three bytes in UTF-8, so the result can exceed maxlen.
    return encoded.decode("utf-8", errors="replace")


def limit_count(maxlen, text):
    """Given a maximum length in bytes and text, return the longest prefix of
    whole characters whose UTF-8 encoding fits in that length."""
    truncated = ""
    tl = 0  # running byte length of the truncated text
    for c in text:
        size = len(c.encode("utf-8"))
        if tl + size <= maxlen:
            truncated += c
            tl += size
        else:
            break
    return truncated


class TestLimits(unittest.TestCase):

    def test_slicing(self):
        # With this sample text the slice cuts RUDE after two of its three
        # bytes, so this assertion fails (9 bytes > 8).
        testlen = 8
        truncated = limit_slice(testlen, "Hello " + RUDE + " world")
        self.assertLessEqual(len(truncated.encode("utf-8")), testlen)

    def test_count(self):
        testlen = 8
        truncated = limit_count(testlen, "Hello " + RUDE + " world")
        self.assertLessEqual(len(truncated.encode("utf-8")), testlen)


if __name__ == "__main__":
    unittest.main()
@manchicken

Yeah, this isn't a Python thing, I think it's a character encoding thing. If your target data store is limiting you in bytes, you're actually doing this wrong. When you use UTF-8 you explicitly give up the assumption that a character is the same size as a byte. I think you need to truncate to the byte length, but be aware of characters. Maybe a function that chops off one character at a time while the string (interpreted as a byte array, not characters) is too long.
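
A minimal sketch of that idea, in case it helps. The name `limit_chop` is hypothetical and not part of the gist. It assumes the limit is a byte count for the UTF-8 encoding, as in the tests above:

```python
def limit_chop(maxlen, text):
    """Drop trailing characters until the UTF-8 encoding of text fits in
    maxlen bytes. Hypothetical helper, not part of the original gist."""
    # Every character encodes to at least one byte, so anything past
    # maxlen characters can never fit; cut it off up front.
    text = text[:maxlen]
    # Measure the string as bytes, but only ever remove whole characters.
    # The `text and` guard stops the loop once the string is empty.
    while text and len(text.encode("utf-8")) > maxlen:
        text = text[:-1]
    return text


RUDE = b"\xe3\x80\x82".decode("utf-8")  # one character, three bytes
truncated = limit_chop(8, "Hello " + RUDE + " world")
print(repr(truncated), len(truncated.encode("utf-8")))
```

Because it only removes whole characters, the result never ends in a partial byte sequence, so no replacement character gets introduced. It re-encodes the string on every pass, which is probably fine for short fields but could be slow for very long text.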

@manchicken

(Sorry for the delay)
