@catawbasam
Created August 20, 2014 23:07
Julia Char predicates draft
{
"metadata": {
"language": "Julia",
"name": "",
"signature": "sha256:e80d1a82f7cdd977679c61b2cbabab28b09bb50363bbf45c783529686408d275"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Character class predicates based on libmojibake/utf8proc.jl\n",
"### with reference to Haskell Char, Go unicode, and perl unicode\n",
"\n",
"Haskell references:\n",
"* http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-Char.html \n",
"* http://hackage.haskell.org/package/base-4.7.0.1/docs/src/Data-Char.html#isMark \n",
"* http://hackage.haskell.org/package/base-4.7.0.1/docs/src/GHC-Unicode.html#isAlpha\n",
"\n",
"Go reference: http://golang.org/pkg/unicode/#IsPrint\n",
"\n",
"perl reference: http://search.cpan.org/~arc/perl-5.17.8/pod/perlunicode.pod\n",
"\n",
"\n",
"Most of the functions below are based on Unicode character categories. Exceptions: `isdigit`, `iscntrl`, `isspace`.\n",
"\n",
"The tests below add `isnumber`, which returns true for a broad range of numeric characters in contrast to the narrow range selected by `isdigit`. `isnumber` is a numeric-only counterpart to `isalnum`.\n",
"\n",
"Haskell Char provides other functions that might be of interest in Julia, for example `isMark` and `isSymbol`."
]
},
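{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what `isMark` and `isSymbol` counterparts could look like, here is a minimal sketch for current Julia (1.x). It assumes the unexported helper `Base.Unicode.category_code`; the names `ismark_sketch`, `issymbol_sketch`, and `catcode` and the local constants are hypothetical and not part of this draft.\n",
"\n",
"```julia\n",
"# Sketch only: category numbers follow the utf8proc table listed below.\n",
"const MN, ME = 6, 8     # Mn..Me: marks\n",
"const SM, SO = 19, 22   # Sm..So: symbols\n",
"\n",
"# Base.Unicode.category_code is unexported; relying on it is an assumption.\n",
"catcode(c::Char) = Base.Unicode.category_code(c)\n",
"\n",
"ismark_sketch(c::Char)   = MN <= catcode(c) <= ME\n",
"issymbol_sketch(c::Char) = SM <= catcode(c) <= SO\n",
"\n",
"# combining acute accent (Mn), plus sign (Sm), plain letter\n",
"ismark_sketch('\\u0301'), issymbol_sketch('+'), issymbol_sketch('a')\n",
"```"
]
},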
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### libmojibake utf8 category constants"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\n",
"const UTF8PROC_CATEGORY_LU = 1 # Lu: Letter, Uppercase\n",
"const UTF8PROC_CATEGORY_LL = 2 # Ll: Letter, Lowercase\n",
"const UTF8PROC_CATEGORY_LT = 3 # Lt: Letter, Titlecase\n",
"const UTF8PROC_CATEGORY_LM = 4 # Lm: Letter, Modifier\n",
"const UTF8PROC_CATEGORY_LO = 5 # Lo: Letter, Other\n",
"const UTF8PROC_CATEGORY_MN = 6 # Mn: Mark, Non-Spacing\n",
"const UTF8PROC_CATEGORY_MC = 7 # Mc: Mark, Spacing Combining\n",
"const UTF8PROC_CATEGORY_ME = 8 # Me: Mark, Enclosing\n",
"const UTF8PROC_CATEGORY_ND = 9 # Nd: Number, Decimal\n",
"const UTF8PROC_CATEGORY_NL = 10 # Nl: Number, Letter\n",
"const UTF8PROC_CATEGORY_NO = 11 # No: Number, Other\n",
"const UTF8PROC_CATEGORY_PC = 12 # Pc: Punctuation, Connector\n",
"const UTF8PROC_CATEGORY_PD = 13 # Pd: Punctuation, Dash \n",
"const UTF8PROC_CATEGORY_PS = 14 # Ps: Punctuation, Open\n",
"const UTF8PROC_CATEGORY_PE = 15 # Pe: Punctuation, Close\n",
"const UTF8PROC_CATEGORY_PI = 16 # Pi: Punctuation, Initial Quote\n",
"const UTF8PROC_CATEGORY_PF = 17 # Pf: Punctuation, Final Quote\n",
"const UTF8PROC_CATEGORY_PO = 18 # Po: Punctuation, Other\n",
"const UTF8PROC_CATEGORY_SM = 19 # Sm: Symbol, Math\n",
"const UTF8PROC_CATEGORY_SC = 20 # Sc: Symbol, Currency\n",
"const UTF8PROC_CATEGORY_SK = 21 # Sk: Symbol, Modifier\n",
"const UTF8PROC_CATEGORY_SO = 22 # So: Symbol, Other\n",
"const UTF8PROC_CATEGORY_ZS = 23 # Zs: Separator, Space\n",
"const UTF8PROC_CATEGORY_ZL = 24 # Zl: Separator, Line\n",
"const UTF8PROC_CATEGORY_ZP = 25 # Zp: Separator, Paragraph\n",
"const UTF8PROC_CATEGORY_CC = 26 # Cc: Other, Control\n",
"const UTF8PROC_CATEGORY_CF = 27 # Cf: Other, Format\n",
"const UTF8PROC_CATEGORY_CS = 28 # Cs: Other, Surrogate\n",
"const UTF8PROC_CATEGORY_CO = 29 # Co: Other, Private Use\n",
"const UTF8PROC_CATEGORY_CN = 30 # Cn: Other, Not Assigned"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"30"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# category_code modified to ignore case of unassigned c\n",
"function category_code_assigned(c)\n",
" c > 0x10FFFF && return 0x0000 # see utf8proc_get_property docs\n",
" cat = unsafe_load(ccall(:utf8proc_get_property, Ptr{Uint16}, (Int32,), c))\n",
" # note: utf8proc returns 0, not UTF8PROC_CATEGORY_CN, for unassigned c\n",
" #cat == 0 ? 30 : cat # adds time and not needed by character class predicates\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"category_code_assigned (generic function with 1 method)"
]
}
],
"prompt_number": 2
},
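{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ccall` above reflects the Julia of 2014 (`Uint16` has since become `UInt16`, for instance). For comparison, here is a hedged sketch of the same lookup on current Julia (1.x). It assumes the unexported helpers `Base.Unicode.category_code` and `Base.Unicode.category_abbrev`, which may change between releases; `describe` is a hypothetical name.\n",
"\n",
"```julia\n",
"# Sketch only: both helpers are unexported, so treat their use as an assumption.\n",
"# Returns the character, its numeric category code, and the two-letter abbreviation.\n",
"describe(c::Char) = (c, Base.Unicode.category_code(c), Base.Unicode.category_abbrev(c))\n",
"\n",
"# uppercase letter, lowercase letter, digit, space, tab\n",
"for c in ('A', 'a', '7', ' ', '\\t')\n",
"    println(describe(c))\n",
"end\n",
"```"
]
},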
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### islower / isupper -- follow Haskell \n",
"* TitleCase characters return true from `isupper`.\n",
"* Julia's isupper appears to follow this convention already."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"islower_moji(c::Char) = (category_code_assigned(c)==UTF8PROC_CATEGORY_LL)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"islower_moji (generic function with 1 method)"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# follows Haskell's isUpper() -- uses uppercase + titlecase\n",
"function isupper_moji(c::Char) \n",
" ccode=category_code_assigned(c)\n",
" return ccode==UTF8PROC_CATEGORY_LU || ccode==UTF8PROC_CATEGORY_LT\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"isupper_moji (generic function with 1 method)"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isupper_moji('H'), isupper_moji('h'), isupper_moji('\u0394'), isupper_moji('\u03b4')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"(true,false,true,false)"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#TitleCase example\n",
"DZ = char(0x0001C5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"'\u01c5'"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isupper(DZ), isupper_moji(DZ)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"(true,true)"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### isalpha, isdigit, isnumber, isalnum -- follow Haskell"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# follows Haskell's isLetter()\n",
"function isalpha_moji(c::Char)\n",
" ccode=category_code_assigned(c)\n",
" return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_LO) \n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"isalpha_moji (generic function with 1 method)"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isalpha_moji('H'), isalpha_moji(' '), isalpha_moji('4'), isalpha_moji('\u221a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"(true,false,false,false)"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# follows Haskell's isDigit() -- ASCII '0'-'9'\n",
"function isdigit_moji(c::Char)\n",
" return '0' <= c <= '9'\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"isdigit_moji (generic function with 1 method)"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isdigit_moji('3'), isdigit_moji('a'), isdigit_moji('\u03b4') "
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"(true,false,false)"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# follows Haskell's isNumber()\n",
"function isnumber_moji(c::Char)\n",
" ccode=category_code_assigned(c)\n",
" return (UTF8PROC_CATEGORY_ND <= ccode <= UTF8PROC_CATEGORY_NO) \n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"isnumber_moji (generic function with 1 method)"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isnumber_moji('0'), isnumber_moji('g'), isnumber_moji('\u22d2') "
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"(true,false,false)"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arabicnum = char(0x0663)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"'\u0663'"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isnumber_moji(arabicnum)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"true"
]
}
],
"prompt_number": 15
},
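{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `isdigit` / `isnumber` contrast concrete, the sketch below tabulates a few characters from each numeric category. It targets current Julia (1.x) and assumes the unexported `Base.Unicode.category_code`; `isdigit_sketch` and `isnumber_sketch` are hypothetical names that mirror the definitions above.\n",
"\n",
"```julia\n",
"# Sketch only: Nd = 9, Nl = 10, No = 11 in the utf8proc table above.\n",
"isdigit_sketch(c::Char)  = '0' <= c <= '9'\n",
"isnumber_sketch(c::Char) = 9 <= Base.Unicode.category_code(c) <= 11\n",
"\n",
"# ASCII digit, Arabic-Indic digit (Nd), Roman numeral (Nl), fraction (No), letter\n",
"for c in ('3', '\\u0663', '\\u2167', '\\u00bd', 'x')\n",
"    println(c, \" => \", (isdigit_sketch(c), isnumber_sketch(c)))\n",
"end\n",
"```"
]
},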
{
"cell_type": "code",
"collapsed": false,
"input": [
"# follows Haskell's isAlphaNum()\n",
"function isalnum_moji(c::Char)\n",
" ccode=category_code_assigned(c)\n",
" return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_LO) ||\n",
" (UTF8PROC_CATEGORY_ND <= ccode <= UTF8PROC_CATEGORY_NO)\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"isalnum_moji (generic function with 1 method)"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isalnum_moji('0'), isalnum_moji('B'), isalnum_moji('\u03a3'), isalnum_moji('\u221a')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"(true,true,true,false)"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### iscntrl, ispunct -- follow Haskell"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#= \n",
"Haskell isControl: \"Selects control characters, which are the non-printing characters of the Latin-1 subset of Unicode.\"\n",
"\n",
"Go IsControl: \"IsControl reports whether the rune is a control character. \n",
" The C (Other) Unicode category includes more code points such as surrogates;\n",
" use Is(C, r) to test for them.\"\n",
"=#\n",
"function iscntrl_moji(c::Char)\n",
" #http://en.wikipedia.org/wiki/Control_characters\n",
" # C0 controls are U+0000..U+001F (inclusive); DEL and C1 controls are U+007F..U+009F\n",
" return (uint(c) <= 0x1f || 0x7f<=uint(c)<=0x9f) \n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"iscntrl_moji (generic function with 1 method)"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"iscntrl_moji('\\t'), iscntrl_moji('Z') "
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"(true,false)"
]
}
],
"prompt_number": 19
},
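{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two Latin-1 ranges can be cross-checked against category Cc. What follows is a minimal sketch for current Julia (1.x), assuming the unexported `Base.Unicode.category_code` and the value 26 for Cc from the table above; `iscntrl_sketch` and `agree` are hypothetical names.\n",
"\n",
"```julia\n",
"# Sketch only: C0 controls are U+0000..U+001F, DEL plus C1 controls are U+007F..U+009F.\n",
"iscntrl_sketch(c::Char) = UInt32(c) <= 0x1f || 0x7f <= UInt32(c) <= 0x9f\n",
"\n",
"# Compare the range test with category Cc (26) over all of Latin-1.\n",
"agree = all(iscntrl_sketch(Char(u)) == (Base.Unicode.category_code(Char(u)) == 26)\n",
"            for u in 0:255)\n",
"\n",
"# Code points on either side of each range boundary.\n",
"[(u, iscntrl_sketch(Char(u))) for u in (0x1f, 0x20, 0x7e, 0x7f, 0x9f, 0xa0)]\n",
"```"
]
},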
{
"cell_type": "code",
"collapsed": false,
"input": [
"# follows Haskell's isPunctuation()\n",
"function ispunct_moji(c::Char)\n",
" ccode=category_code_assigned(c)\n",
" return (UTF8PROC_CATEGORY_PC <= ccode <= UTF8PROC_CATEGORY_PO) \n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"ispunct_moji (generic function with 1 method)"
]
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ispunct_moji(','), ispunct_moji('!'), ispunct_moji('8')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"(true,true,false)"
]
}
],
"prompt_number": 21
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"upunct = char(0x002021)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 22,
"text": [
"'\u2021'"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ispunct(upunct), ispunct_moji(upunct)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
"(true,true)"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### isspace: follows Go\n",
" Go includes next line (NEL, U+0085) and non-breaking space (NBSP, U+00A0), unlike Haskell and Julia's current isspace. \n",
" **This definition is a breaking change with respect to NEL and NBSP.**"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"const NEL = char(0x000085)\n",
"const NBSP = char(0x0000A0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"'\u00a0'"
]
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# \n",
"# Haskell isSpace: Returns True for any Unicode space character, and the control characters \\t, \\n, \\r, \\f, \\v.\n",
"#= Go IsSpace\n",
"\"IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is\n",
"\n",
"'\\t', '\\n', '\\v', '\\f', '\\r', ' ', U+0085 (NEL), U+00A0 (NBSP).\n",
"Other definitions of spacing characters are set by category Z and property Pattern_White_Space.\"\n",
"=#\n",
"\n",
"function isspace_moji(c::Char)\n",
" return c in (' ','\\t','\\n','\\r','\\f','\\v', NEL, NBSP) || \n",
" category_code_assigned(c)==UTF8PROC_CATEGORY_ZS\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": [
"isspace_moji (generic function with 1 method)"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"category_code_assigned('\\t')==UTF8PROC_CATEGORY_ZS"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
"false"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Julia's `isspace()` currently returns false for NEL on both Windows and Linux.\n",
"* For NBSP it returns true on Windows but false on Linux."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isspace_moji(' '), isspace_moji('\\n'), isspace_moji('T') "
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"(true,true,false)"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isspace(NEL), isspace_moji(NEL), isspace(NBSP), isspace_moji(NBSP)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"(false,true,true,true)"
]
}
],
"prompt_number": 28
},
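{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unicode's White_Space property, which the Go documentation quoted above refers to, also covers U+2028 (LINE SEPARATOR, Zl) and U+2029 (PARAGRAPH SEPARATOR, Zp), which a Zs-only category test does not accept. Below is a hedged sketch of a variant that accepts them, written for current Julia (1.x). It assumes the unexported `Base.Unicode.category_code`, and `isspace_ws_sketch` is a hypothetical name that is not part of this draft.\n",
"\n",
"```julia\n",
"# Sketch only: Zs = 23, Zl = 24, Zp = 25 in the utf8proc table above.\n",
"function isspace_ws_sketch(c::Char)\n",
"    # whitespace controls U+0009..U+000D, plus NEL (U+0085)\n",
"    ('\\t' <= c <= '\\r' || c == '\\u0085') && return true\n",
"    # Zs covers ' ' and NBSP; Zl and Zp add U+2028 and U+2029\n",
"    return 23 <= Base.Unicode.category_code(c) <= 25\n",
"end\n",
"\n",
"# space, NEL, NBSP, line separator, paragraph separator, plain letter\n",
"[isspace_ws_sketch(c) for c in (' ', '\\u0085', '\\u00a0', '\\u2028', '\\u2029', 'T')]\n",
"```"
]
},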
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### isprint - follows perl and current Julia manual definition\n",
"* Julia's isprint() is currently buggy on Windows (e.g. '\\t' returns true)\n",
"* Haskell Char does not have `isGraph`. It does have `isPrint`. \n",
"* Go does have `IsGraphic`, and its Unicode docs are clearer for these. \n",
"* From perl: \"\\p{Print} This matches any character that is graphical or blank, except controls.\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* On Windows, isprint('\\t') returns true (incorrectly)\n",
"* On Linux, isprint('\\t') returns false (correctly)\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#= Julia's isprint does not match any of these currently.\n",
" Haskell isPrint: \"Selects printable Unicode characters \n",
" (letters, numbers, marks, punctuation, symbols and spaces).\"\n",
" Go: \"IsPrint reports whether the rune is defined as printable by Go. \n",
" Such characters include letters, marks, numbers, punctuation, symbols, and the ASCII space character, \n",
" from categories L, M, N, P, S and the ASCII space character. \n",
" This categorization is the same as IsGraphic \n",
" except that the only spacing character is ASCII space, U+0020.\"\n",
"\n",
" From perl: \"\\p{Print} This matches any character that is graphical or blank, except controls.\"\n",
"=#\n",
"# Julia isprint: includes spaces\n",
"function isprint_moji(c::Char)\n",
" ccode=category_code_assigned(c)\n",
" return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_ZS) \n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 29,
"text": [
"isprint_moji (generic function with 1 method)"
]
}
],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isprint('a'), isprint(' '), isprint('\\t')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 30,
"text": [
"(true,true,true)"
]
}
],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isprint_moji('a'), isprint_moji(' '), isprint_moji('\\t')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"(true,true,false)"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### isgraph - follows perl and current Julia manual definition\n",
"* From perl: \"\\p{Graph} Matches any character that is graphic. \n",
" Theoretically, this means a character that on a printer would cause ink to be used.\"\n",
"* Julia isgraph: excludes spaces "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"function isgraph_moji(c::Char)\n",
" ccode=category_code_assigned(c)\n",
" return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_SO) \n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 32,
"text": [
"isgraph_moji (generic function with 1 method)"
]
}
],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isgraph('a'), isgraph(' '), isgraph('\\t')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
"(true,false,false)"
]
}
],
"prompt_number": 33
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"isgraph_moji('a'), isgraph_moji(' '), isgraph_moji('\\t')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 34,
"text": [
"(true,false,false)"
]
}
],
"prompt_number": 34
}
],
"metadata": {}
}
]
}