
@poychang
Created May 10, 2024 09:17
Find Non-UTF8 Encoded Characters using Visual Studio Code (VS Code)

I ran into a problem with Python where a file I wanted to read in and parse contained unexpected non-UTF-8 encoded characters. I am certain there are many ways to solve this problem, but I am capturing my quick and dirty approach below for posterity.

  1. Open the file in VS Code, then open Find (Ctrl+F, or Cmd+F on macOS) and enable the regular expression option.
  2. In the Find box, paste the regex below:
^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$
  3. VS Code will highlight the lines that match, meaning lines made up entirely of valid UTF-8 encoded characters. So you just have to skim the file looking for lines without highlighting; those are the ones containing non-UTF-8 characters.
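
Since the original problem came up in Python, the same check can also be sketched as a script. This is a different technique from the regex above, not part of the VS Code approach: it reads the raw bytes and tries a strict UTF-8 decode on each line. The function name `find_non_utf8_lines` is hypothetical, and the sketch assumes the file is small enough to read into memory and uses `\n` line endings.

```python
from __future__ import annotations


def find_non_utf8_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (1-based line number, raw line) for each line that is not valid UTF-8.

    Hypothetical helper for illustration. Splitting on b"\\n" is safe because
    the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.
    """
    bad_lines = []
    for line_number, raw_line in enumerate(data.split(b"\n"), start=1):
        try:
            # Strict decoding raises on any byte sequence that is not valid UTF-8.
            raw_line.decode("utf-8")
        except UnicodeDecodeError:
            bad_lines.append((line_number, raw_line))
    return bad_lines
```

To use it on a file, open the file in binary mode (`open(path, "rb")`), pass the result of `.read()` to the function, and print each returned line number alongside `repr()` of the raw line to see the offending bytes.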