@Alex-Just
Last active May 27, 2024 16:44
Python regex to strip emoji from a string
import re

# http://stackoverflow.com/a/13752628/6762004
RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)

def strip_emoji(text):
    return RE_EMOJI.sub(r'', text)

print(strip_emoji('🙄🤔'))
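
A quick way to see what this one-range pattern does and does not touch. This is only a sketch on Python 3; the sample strings are made up for illustration and are not from the gist. The range covers every code point above the Basic Multilingual Plane, so it drops supplementary-plane emoji, keeps emoji that sit in the BMP, and also drops supplementary characters that are not emoji at all.

```python
import re

# Same range as the gist: every code point outside the Basic Multilingual Plane.
RE_EMOJI = re.compile('[\U00010000-\U0010ffff]')

def strip_emoji(text):
    return RE_EMOJI.sub('', text)

# Supplementary-plane emoji (U+1F644, U+1F914) are removed.
print(ascii(strip_emoji('ok \U0001F644\U0001F914')))

# Emoji in the BMP (U+2600 BLACK SUN WITH RAYS, U+2764 HEAVY BLACK HEART) are kept.
print(ascii(strip_emoji('sun \u2600 heart \u2764')))

# Non-emoji supplementary characters go too, e.g. U+1D11E MUSICAL SYMBOL G CLEF.
print(ascii(strip_emoji('clef \U0001D11E')))
```

So the pattern is closer to "strip everything outside the BMP" than to "strip emoji", which may or may not be what you want.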

joc32 commented Jun 26, 2018

Does not work when the emoji is at the end of a sentence.

@GMTernence

Thanks a lot
It works for me

swjmj commented Aug 20, 2019

Thanks, works very well

gabriel19913 commented Oct 18, 2019

In this Stack Overflow question, a user said that this function doesn't cover all emoji, so it is better to use:

def strip_emoji(text):
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)

mgaitan commented Mar 11, 2020

for the record, this is the pattern we are using

# https://en.wikipedia.org/wiki/Unicode_block
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"  # broad catch-all range; also spans many non-emoji characters (e.g. CJK)
    "]+"
)

mghayour commented Apr 1, 2020

@mgaitan it works perfectly for me, thanks a lot 💖

def add_space_between_emojies(text):
  # Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
  # Ref: https://en.wikipedia.org/wiki/Unicode_block
  EMOJI_PATTERN = re.compile(
    "(["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "])"
  )
  text = re.sub(EMOJI_PATTERN, r' \1 ', text)
  return text

EDIT:
I deleted the last range, "\U000024C2-\U0001F251", because it also matches Persian characters, which caused a bug for me.
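
To illustrate why that last range is risky, here is a small check of what `\U000024C2-\U0001F251` matches on its own. This is a sketch, and the sample characters are my own picks, not the ones that triggered the bug above; which Persian characters were affected there is not stated, and the Arabic presentation forms (U+FB50 onwards) are just one candidate that falls inside the range.

```python
import re

# The range removed in the EDIT above, isolated for inspection.
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]")

# Hypothetical sample characters, chosen only for illustration.
samples = {
    "CJK ideograph U+4E2D": "\u4e2d",
    "Hangul syllable U+D55C": "\ud55c",
    "Arabic letter peh, isolated form U+FB56": "\ufb56",
    "Arabic letter peh U+067E": "\u067e",
    "Latin letter a": "a",
}

for label, ch in samples.items():
    # True means the character would be treated as an "emoji" by this range.
    print(label, bool(BROAD_RANGE.search(ch)))
```

The first three come out as matches and the last two do not, so text that contains CJK, Hangul, or Arabic presentation forms would be altered by a pattern that includes this range.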

nestukh commented Apr 3, 2020

Hello, I credited your work for a workaround in a youtube-dl issue:
ytdl-org/youtube-dl#5042 (comment)
It has helped a lot, thank you.

Shellbye commented Jun 9, 2020

In case someone has from __future__ import unicode_literals at the top, you need to escape "-" like this:

    EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251" 
    "]+"
    )

or you will get a "bad character range" error, like in this SO question

Lakril commented Apr 7, 2021

Thanks for your help.

def add_space_between_emojies(text):
    '''
    >>> add_space_between_emojies('Python is fun 💚')
    'Python is fun '    
    '''
    from advertools.emoji import EMOJI
    EMOJI_PATTERN = EMOJI
    text = re.sub(EMOJI_PATTERN, r'', text)
    return text

@clichedmoog

Sorry to say this, but I think @mgaitan's regex is not perfect.
Recent emoji include various combinations and sequences, so a correct expression would be more complex.
A good implementation example in JavaScript: https://github.com/mathiasbynens/emoji-regex
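
To make the point about combinations concrete: many emoji are sequences glued together with ZERO WIDTH JOINER (U+200D) or followed by VARIATION SELECTOR-16 (U+FE0F), and a pattern built only from block ranges leaves those invisible code points behind. Below is a rough sketch using an abridged copy of the ranges from this thread. `WITH_JOINERS` is a hypothetical extension of mine, not something proposed above, and it is still far from complete (keycap and tag sequences are not handled, for example).

```python
import re

# Abridged copy of the block ranges used earlier in this thread.
RANGES = (
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U00002702-\U000027B0"  # Dingbats
)

RANGES_ONLY = re.compile("[" + RANGES + "]+")
# Hypothetical extension: also consume ZWJ (U+200D) and VS16 (U+FE0F).
WITH_JOINERS = re.compile("[" + RANGES + "\u200d\ufe0f" + "]+")

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
heart = "\u2764\ufe0f"                                  # heavy black heart + VS16

# Ranges alone: the pictographs go, the joiners and selector stay behind.
print(ascii(RANGES_ONLY.sub("", family)))
print(ascii(RANGES_ONLY.sub("", heart)))

# With the joiner code points added, nothing is left over.
print(ascii(WITH_JOINERS.sub("", family)))
print(ascii(WITH_JOINERS.sub("", heart)))
```

One caveat with this approach: U+200D is also used in ordinary text in some scripts, so stripping it everywhere can change non-emoji text.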

mgaitan commented May 13, 2021

@clichedmoog you are totally right, everything here is a simplification. For a complete/accurate emoji remover for Python I recommend the library https://github.com/bsolomon1124/demoji, which downloads the latest emoji specification to build the pattern. It's not super fast but it's exhaustive.