@halfak
Last active April 13, 2020 20:07
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from deltas.tokenizers import wikitext_split
>>> wikitext_split.regex.pattern
"(?P<comment_start><!--)|(?P<comment_end>-->)|(?P<url>((bitcoin|geo|magnet|mailto|news|sips?|tel|urn)\\:|((|ftp|ftps|git|gopher|https?|ircs?|mms|nntp|redis|sftp|ssh|svn|telnet|worldwind|xmpp)\\:)?\\/\\/)[^\\s/$.?#].[^\\s]*)|(?P<entity>&[a-z][a-z0-9]*;)|(?P<cjk>[\\u4E00-\\u62FF\\u6300-\\u77FF\\u7800-\\u8CFF\\u8D00-\\u9FCC\\u3400-\\u4DFF\\U00020000-\\U000215FF\\U00021600-\\U000230FF\\U00023100-\\U000245FF\\U00024600-\\U000260FF\\U00026100-\\U000275FF\\U00027600-\\U000290FF\\U00029100-\\U0002A6DF\\uF900-\\uFAFF\\U0002F800-\\U0002FA1F\\u3041-\\u3096\\u30A0-\\u30FF\\u3400-\\u4DB5\\u4E00-\\u9FCB\\uF900-\\uFA6A\\u2E80-\\u2FD5\\uFF5F-\\uFF9F\\u31F0-\\u31FF\\u3220-\\u3243\\u3280-\\u337F])|(?P<ref_open><ref\\b[^>/]*>)|(?P<ref_close></ref\\b[^>]*>)|(?P<ref_singleton><ref\\b[^>/]*/>)|(?P<tag></?([a-z][a-z0-9]*)\\b[^>]*>)|(?P<number>[\\d]+)|(?P<japan_punct>[\\u3000-\\u303F])|(?P<danda>।|॥)|(?P<bold>''')|(?P<italic>'')|(?P<word>([^\\W\\d]|[\\u0901-\\u0963\\u0601-\\u061A\\u061C-\\u0669\\u06D5-\\u06EF\\u0980-\\u09FF])[\\w\\u0901-\\u0963\\u0601-\\u061A\\u061C-\\u0669\\u06D5-\\u06EF\\u0980-\\u09FF]*([\\'’]([\\w\\u0901-\\u0963\\u0601-\\u061A\\u061C-\\u0669\\u06D5-\\u06EF\\u0980-\\u09FF]+|(?=($|\\s))))*)|(?P<period>\\.+)|(?P<qmark>\\?+)|(?P<epoint>!+)|(?P<comma>,+)|(?P<colon>:+)|(?P<scolon>;+)|(?P<break>(\\n|\\n\\r|\\r\\n)\\s*(\\n|\\n\\r|\\r\\n)+)|(?P<whitespace>(\\n|\\n\\r|[^\\S\\n\\r]+))|(?P<dbrack_open>\\[\\[)|(?P<dbrack_close>\\]\\])|(?P<brack_open>\\[)|(?P<brack_close>\\])|(?P<paren_open>\\()|(?P<paren_close>\\))|(?P<tab_open>\\{\\|)|(?P<tab_close>\\|\\})|(?P<dcurly_open>\\{\\{)|(?P<dcurly_close>\\}\\})|(?P<curly_open>\\{)|(?P<curly_close>\\})|(?P<equals>=+)|(?P<bar>\\|)|(?P<etc>.)"
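The pattern above is one large alternation of named groups, tried left to right, with a catch-all `etc` group at the end. A minimal sketch of how a pattern of this shape can drive a tokenizer is shown below. It uses only the standard `re` module and a hypothetical, much smaller pattern; `PATTERN` and `tokenize` are illustrative names, not part of `deltas`, and this is not necessarily how `wikitext_split` is implemented.

```python
import re

# Hypothetical, reduced pattern in the same style as the one printed above:
# named alternatives ordered from most to least specific, ending in "etc".
PATTERN = re.compile(
    r"(?P<bold>''')|(?P<italic>'')"
    r"|(?P<dbrack_open>\[\[)|(?P<dbrack_close>\]\])"
    r"|(?P<number>\d+)|(?P<word>[^\W\d]\w*)"
    r"|(?P<whitespace>\s+)|(?P<etc>.)"
)


def tokenize(text):
    """Yield (type, text) pairs, where type is the name of the group that matched."""
    for match in PATTERN.finditer(text):
        yield match.lastgroup, match.group(0)


tokens = list(tokenize("'''Foo''' [[bar]] 42"))
for token_type, token_text in tokens:
    print(token_type, repr(token_text))
```

Order matters in this design: `bold` has to come before `italic`, and `dbrack_open` before any single-bracket group, because the regex engine takes the first alternative that matches at each position.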