Last active
August 29, 2015 14:05
Julia islower(Char) and islower(String) benchmarks: Base, utf8proc, PCRE
{ | |
"metadata": { | |
"language": "Julia", | |
"name": "", | |
"signature": "sha256:679974b4916c36603648466accbc9c2838242618d92c33c0fcb81297a9e11a79" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Benchmark of islower() using Base and utf8proc -- UPDATE 2\n", | |
"Changes: \n", | |
"* Added mojibake_islower(String), based on a C function provided by @stevengj.\n", | |
"* Revised islower_utf8proc() to use a hacked version of category_code(c) which omits the line `cat == 0 ? 30 : cat`.\n", | |
"* Revised the test of islower(Char) to use a shuffled array of Char.\n", | |
"* Dropped PCRE, as it does not work well with Char and libmojibake is the preferred path.\n", | |
"\n", | |
"### Updated findings for the test functions, as tested:\n", | |
"* For islower(Char) run against shuffled Char arrays, islower_base is just a hair faster than islower_utf8proc (about 9% on the mean times, which is within one standard deviation).\n", | |
"* For islower(String) on UTF8String, islower_utf8proc is roughly 4x faster than islower_base (about 2.5x on RevString, 5.6-9.4x on UTF32String, and no real gain on RepString).\n", | |
"* The discrepancy between islower(Char) and islower(String) is probably due to the inefficiency of the all() function used in the Base version of islower(String)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"versioninfo()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Julia Version 0.4.0-dev+148\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Commit 064b5c3 (2014-08-15 04:27 UTC)\n", | |
"Platform Info:\n", | |
" System: Linux (x86_64-redhat-linux)\n", | |
" CPU: Intel(R) Xeon(R) CPU W3690 @ 3.47GHz\n", | |
" WORD_SIZE: 64\n", | |
" BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)\n", | |
" LAPACK: libopenblas\n", | |
" LIBM: libopenlibm\n", | |
" LLVM: libLLVM-3.3\n" | |
] | |
} | |
], | |
"prompt_number": 11 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### test Chars" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"const c = 'p'\n", | |
"const ca = char(uint8(rand(48:122, 1400)))\n", | |
"ca[1:8]'" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 12, | |
"text": [ | |
"1x8 Array{Char,2}:\n", | |
" '0' 'L' 'R' '1' 'd' 'e' '>' 'f'" | |
] | |
} | |
], | |
"prompt_number": 12 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### test strings" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"const s8a= \"cherry\u03c0\" \n", | |
"const s8=s8a^2\n", | |
"dump(s8)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"UTF8String" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" \"cherry\u03c0cherry\u03c0\"\n" | |
] | |
} | |
], | |
"prompt_number": 13 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"const s32 = utf32(s8)\n", | |
"const ss8=SubString(s8,1,endof(s8))\n", | |
"const rev8 = RevString(s8)\n", | |
"const rep8 = RepString(s8a, 2)\n", | |
"\n", | |
"ts = {s8, ss8, s32, rev8, rep8}" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 14, | |
"text": [ | |
"5-element Array{Any,1}:\n", | |
" \"cherry\u03c0cherry\u03c0\"\n", | |
" \"cherry\u03c0cherry\u03c0\"\n", | |
" \"cherry\u03c0cherry\u03c0\"\n", | |
" \"\u03c0yrrehc\u03c0yrrehc\"\n", | |
" \"cherry\u03c0cherry\u03c0\"" | |
] | |
} | |
], | |
"prompt_number": 14 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Longer test strings" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"const s8b = s8^100\n", | |
"typeof(s8b), length(s8b)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 15, | |
"text": [ | |
"(UTF8String,1400)" | |
] | |
} | |
], | |
"prompt_number": 15 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"const s32b = utf32(s8b)\n", | |
"const ss8b=SubString(s8b,1,endof(s8b))\n", | |
"const rev8b = RevString(s8b)\n", | |
"const rep8b = RepString(s8, 100)\n", | |
"\n", | |
"tsb = {s8b, ss8b, s32b, rev8b, rep8b};" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 16 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Current Base: accuracy\n", | |
"On Windows, islower(c) is inaccurate." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"isupper(c), islower(c) " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 17, | |
"text": [ | |
"(false,true)" | |
] | |
} | |
], | |
"prompt_number": 17 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"for s in ts\n", | |
" println( islower(s))\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"true" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"true\n", | |
"true\n", | |
"true\n", | |
"true\n" | |
] | |
} | |
], | |
"prompt_number": 18 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Alternatives and Benchmarks" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Alternative 0: The incumbent\n", | |
"A `for` loop would be faster than `all()`, but would not address the accuracy problem on Windows." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"islower_base(c::Char) = bool(ccall(:iswlower, Int32, (Cwchar_t,), c))\n", | |
"islower_base(s::String) = all(islower,s)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 19, | |
"text": [ | |
"islower_base (generic function with 2 methods)" | |
] | |
} | |
], | |
"prompt_number": 19 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"c, islower_base(c)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 20, | |
"text": [ | |
"('p',true)" | |
] | |
} | |
], | |
"prompt_number": 20 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"s8, islower_base(s8)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 21, | |
"text": [ | |
"(\"cherry\u03c0cherry\u03c0\",true)" | |
] | |
} | |
], | |
"prompt_number": 21 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Alternative: utf8proc / libmojibake" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# returns the raw UTF8PROC_CATEGORY code giving the Unicode category (0 for unassigned code points, since the remap to 30 is omitted)\n", | |
"function category_code_HACK(c)\n", | |
" c > 0x10FFFF && return 0x0000 # see utf8proc_get_property docs\n", | |
" cat = unsafe_load(ccall(:utf8proc_get_property, Ptr{Uint16}, (Int32,), c))\n", | |
" # note: utf8proc returns 0, not UTF8PROC_CATEGORY_CN, for unassigned c\n", | |
" #cat == 0 ? 30 : cat # not important for islower\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 75, | |
"text": [ | |
"category_code_HACK (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 75 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# utf8 category constants\n", | |
"const UTF8PROC_CATEGORY_LU = 1\n", | |
"const UTF8PROC_CATEGORY_LL = 2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 76, | |
"text": [ | |
"2" | |
] | |
} | |
], | |
"prompt_number": 76 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"islower_utf8proc(c::Char) = (category_code_HACK(c)==UTF8PROC_CATEGORY_LL)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 77, | |
"text": [ | |
"islower_utf8proc (generic function with 2 methods)" | |
] | |
} | |
], | |
"prompt_number": 77 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function islower_utf8proc(s::String)\n", | |
" for c in s\n", | |
" if category_code_HACK(c)!=UTF8PROC_CATEGORY_LL\n", | |
" return false\n", | |
" end\n", | |
" end\n", | |
" return true\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 78, | |
"text": [ | |
"islower_utf8proc (generic function with 2 methods)" | |
] | |
} | |
], | |
"prompt_number": 78 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"@time islower_utf8proc(c)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"elapsed time: 0." | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"002072182 seconds (46160 bytes allocated)\n" | |
] | |
}, | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 79, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 79 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"for s in ts\n", | |
" println(islower_utf8proc(s))\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"true\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"true\n", | |
"true\n", | |
"true\n", | |
"true\n" | |
] | |
} | |
], | |
"prompt_number": 80 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Alternative 2 for islower(String) -- mojibake_islower() from @stevengj\n", | |
"This is a C function that handles the conversion from UTF-8 to characters on the C side." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"ccall((:mojibake_islower,\"/devel/asias/keithc/julia/deps/libmojibake/libmojibake\"),\n", | |
" Int, (Ptr{Uint8},), \"blAH\")" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 81, | |
"text": [ | |
"0" | |
] | |
} | |
], | |
"prompt_number": 81 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"ccall((:mojibake_islower,\"/devel/asias/keithc/julia/deps/libmojibake/libmojibake\"),\n", | |
"Int, (Ptr{Uint8},), \"blah\")" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 82, | |
"text": [ | |
"1" | |
] | |
} | |
], | |
"prompt_number": 82 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"mojibake_islower(s::ByteString) = bool(ccall((:mojibake_islower,\"/devel/asias/keithc/julia/deps/libmojibake/libmojibake\"),\n", | |
"Int, (Ptr{Uint8},), s))" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 83, | |
"text": [ | |
"mojibake_islower (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 83 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"mojibake_islower(\"\u0394uffer\")" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 84, | |
"text": [ | |
"false" | |
] | |
} | |
], | |
"prompt_number": 84 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"mojibake_islower(\"\u03b4uffer\") " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 85, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 85 | |
}, | |
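{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Initial time comparisons\n", | |
"* Base wins on Char: about 2x faster than utf8proc when the same Char is tested repeatedly.\n", | |
"* utf8proc beats Base on UTF8String (about 6x on the short string, 4.5x on the long one); `mojibake_islower()` is faster still, by roughly another 3x." | |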
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"CN=1000_000\n", | |
"N=1000\n", | |
"runs = 20" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 86, | |
"text": [ | |
"20" | |
] | |
} | |
], | |
"prompt_number": 86 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function test_base(v,N)\n", | |
" isit = false\n", | |
" for i in 1:N\n", | |
" isit = islower_base(v) \n", | |
" end\n", | |
" return isit\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 87, | |
"text": [ | |
"test_base (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 87 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function test_utf8proc(v,N)\n", | |
" isit = false\n", | |
" for i in 1:N\n", | |
" isit = islower_utf8proc(v) \n", | |
" end\n", | |
" return isit\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 88, | |
"text": [ | |
"test_utf8proc (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 88 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function test_mojibake(v,N)\n", | |
" isit = false\n", | |
" for i in 1:N\n", | |
" isit = mojibake_islower(v) \n", | |
" end\n", | |
" return isit\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 89, | |
"text": [ | |
"test_mojibake (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 89 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"@time test_base(c,CN)\n", | |
"@time test_base(c,CN)\n", | |
"@time test_base(c,CN)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"elapsed time: 0." | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"002158956 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.002148192 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.00215309 seconds (80 bytes allocated)\n" | |
] | |
}, | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 92, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 92 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"@time test_utf8proc(c,CN) \n", | |
"@time test_utf8proc(c,CN) \n", | |
"@time test_utf8proc(c,CN) " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"elapsed time: 0." | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"004567398 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.004400526 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.004369747 seconds (80 bytes allocated)\n" | |
] | |
}, | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 93, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 93 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"@time test_base(s8,N)\n", | |
"@time test_base(s8,N)\n", | |
"@time test_base(s8b,N)\n", | |
"@time test_base(s8b,N)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"elapsed time: 0." | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"00122146 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.001210436 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.0867118 seconds (80 bytes allocated)\n", | |
"elapsed time: " | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"0.08556961 seconds (80 bytes allocated)\n" | |
] | |
}, | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 71, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 71 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"@time test_utf8proc(s8,N)\n", | |
"@time test_utf8proc(s8,N)\n", | |
"@time test_utf8proc(s8b,N) \n", | |
"@time test_utf8proc(s8b,N) " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"elapsed time: 0." | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"000205483 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.000203575 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.019182563 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.018921496 seconds (80 bytes allocated)\n" | |
] | |
}, | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 73, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 73 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"@time test_mojibake(s8,N)\n", | |
"@time test_mojibake(s8,N)\n", | |
"@time test_mojibake(s8b,N)\n", | |
"@time test_mojibake(s8b,N)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"elapsed time: 0." | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"000109839 seconds (80 bytes allocated)\n", | |
"elapsed time: 5.958e-5 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.006180564 seconds (80 bytes allocated)\n", | |
"elapsed time: 0.006231801 seconds (80 bytes allocated)\n" | |
] | |
}, | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 95, | |
"text": [ | |
"true" | |
] | |
} | |
], | |
"prompt_number": 95 | |
}, | |
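{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## islower(Char) over a shuffled Char array\n", | |
"* Base.islower(Char) is only about 9% faster than the islower_utf8proc() version on the mean times below (0.000102 s vs 0.000112 s), which is within one standard deviation.\n", | |
"* Each timing covers one pass over the 1400-element array and includes the shuffle(ca) call, so the printed figures are per pass, not per islower(c)." | |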
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"CARUNS=100\n", | |
"function test_chararray_base(ca)\n", | |
" isit = false\n", | |
" for c in ca\n", | |
" isit = islower_base(c)\n", | |
" end\n", | |
" return isit\n", | |
"end\n", | |
"tmupc = Float64[]\n", | |
"i=0\n", | |
"while i<CARUNS\n", | |
" v, t, b, g = @timed test_chararray_base(shuffle(ca))\n", | |
" if g==0\n", | |
" push!(tmupc,t)\n", | |
" i+=1\n", | |
" end\n", | |
" sleep(0.02) \n", | |
"end\n", | |
"println(\"\\nchar array base islower(Char)\")\n", | |
"println(\" Mean time per islower(c): $(mean(tmupc))\")\n", | |
"println(\" Std Dev time per islower(c): $(std(tmupc))\")" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"char array base islower(Char)\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per islower(c): 0.00010246523\n", | |
" Std Dev time per islower(c): 1.4454150120088016e-5" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 96 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function test_chararray_utf8proc(ca)\n", | |
" isit = false\n", | |
" for c in ca\n", | |
" isit = islower_utf8proc(c)\n", | |
" end\n", | |
" return isit\n", | |
"end\n", | |
"tmupc = Float64[]\n", | |
"i=0\n", | |
"while i<CARUNS\n", | |
" v, t, b, g = @timed test_chararray_utf8proc(shuffle(ca))\n", | |
" if g==0\n", | |
" push!(tmupc,t)\n", | |
" i+=1\n", | |
" end\n", | |
" sleep(0.02) \n", | |
"end\n", | |
"println(\"\\nchar array utf8proc islower(Char)\")\n", | |
"println(\" Mean time per islower(c): $(mean(tmupc))\")\n", | |
"println(\" Std Dev time per islower(c): $(std(tmupc))\")" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"char array utf8proc islower(Char)\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per islower(c): 0.00011192105\n", | |
" Std Dev time per islower(c): 1.0925861625986393e-5\n" | |
] | |
} | |
], | |
"prompt_number": 97 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# PCRE islower(Char) is a non-starter if we have to do string(Char) due to conversion and memory allocation" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 62 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## islower(String) expanded test -- multiple runs and string types" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function stringtest_base(s, runs, N)\n", | |
" tm = Float64[]\n", | |
" println(typeof(s), \" \", length(s))\n", | |
" i=0\n", | |
" while i<runs+2\n", | |
" v, t, b, g = @timed test_base(s,N)\n", | |
" if g==0\n", | |
" if i>2 #ignore first 2 warm-up runs\n", | |
" push!(tm,t)\n", | |
" end\n", | |
" i+=1\n", | |
" end\n", | |
" sleep(0.01) \n", | |
" end\n", | |
" mean(tm), std(tm)\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 98, | |
"text": [ | |
"stringtest_base (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 98 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function stringtest_utf8proc(s, runs, N)\n", | |
" tm = Float64[]\n", | |
" println(typeof(s), \" \", length(s))\n", | |
" i=0\n", | |
" while i<runs+2\n", | |
" v, t, b, g = @timed test_utf8proc(s,N)\n", | |
" if g==0\n", | |
" if i>2 #ignore first 2 warm-up runs\n", | |
" push!(tm,t)\n", | |
" end\n", | |
" i+=1\n", | |
" end\n", | |
" sleep(0.01) \n", | |
" end\n", | |
" mean(tm), std(tm)\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 99, | |
"text": [ | |
"stringtest_utf8proc (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 99 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"function stringtest_mojibake(s, runs, N)\n", | |
" tm = Float64[]\n", | |
" println(typeof(s), \" \", length(s))\n", | |
" i=0\n", | |
" while i<runs+2\n", | |
" v, t, b, g = @timed test_mojibake(s,N)\n", | |
" if g==0\n", | |
" if i>2 #ignore first 2 warm-up runs\n", | |
" push!(tm,t)\n", | |
" end\n", | |
" i+=1\n", | |
" end\n", | |
" sleep(0.01) \n", | |
" end\n", | |
" mean(tm), std(tm)\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 100, | |
"text": [ | |
"stringtest_mojibake (generic function with 1 method)" | |
] | |
} | |
], | |
"prompt_number": 100 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Short string by type" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"println(\"shorter islower_base(String)\")\n", | |
"for s in ts\n", | |
" avgt, sdt = stringtest_base(s, runs, N)\n", | |
" println(\" Mean time per $N islower(s): $avgt\")\n", | |
" println(\" Std Dev time per $N islower(s): $sdt\")\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"shorter islower_base(String)\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"UTF8String 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0008859104736842106\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 3.0579873180044464e-5\n", | |
"SubString{UTF8String} 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0008886625263157894\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 2.909439493817862e-5\n", | |
"UTF32String 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0007661934736842105\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 2.5675350276456092e-5\n", | |
"RevString{UTF8String} 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0010755214210526314\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 4.265213206253038e-5\n", | |
"RepString 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.013362767052631577\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.00013291619916384821\n" | |
] | |
} | |
], | |
"prompt_number": 101 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"println(\"shorter islower_utf8proc(String)\")\n", | |
"for s in ts\n", | |
" avgt, sdt = stringtest_utf8proc(s, runs, N)\n", | |
" println(\" Mean time per $N islower(s): $avgt\")\n", | |
" println(\" Std Dev time per $N islower(s): $sdt\")\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"shorter islower_utf8proc(String)\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"UTF8String 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0002451072631578948\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 3.7097391784392046e-5\n", | |
"SubString{UTF8String} 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.00025295284210526315\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 2.297940062863829e-5\n", | |
"UTF32String 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0001367904736842105\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 2.700984922696086e-5\n", | |
"RevString{UTF8String} 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.0004213798947368421\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 2.524794539912491e-5\n", | |
"RepString 14\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.012843512789473686\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.0001222177376395545\n" | |
] | |
} | |
], | |
"prompt_number": 102 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### mojibake_islower() is set up only for ByteStrings" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Longer string by type (excluding the slooow RepString)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"LN = 1000\n", | |
"println(\"longer islower_base(String)\")\n", | |
"for s in tsb[1:end-1]\n", | |
" avgt, sdt = stringtest_base(s, runs, LN)\n", | |
" println(\" Mean time per $LN islower(s): $avgt\")\n", | |
" println(\" Std Dev time per $LN islower(s): $sdt\")\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"longer islower_base(String)\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"UTF8String 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.08001143336842105\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.00048135938686913004\n", | |
"SubString{UTF8String} 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.08094776884210526\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.0004420752573338697\n", | |
"UTF32String 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.06830800131578947\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.00041131204154700216\n", | |
"RevString{UTF8String} 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.09844635994736843\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.00028812059821954085\n" | |
] | |
} | |
], | |
"prompt_number": 105 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"println(\"longer islower_utf8proc(String)\")\n", | |
"for s in tsb[1:end-1]\n", | |
" avgt, sdt = stringtest_utf8proc(s, runs, LN)\n", | |
" println(\" Mean time per $LN islower(s): $avgt\")\n", | |
" println(\" Std Dev time per $LN islower(s): $sdt\")\n", | |
"end" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"longer islower_utf8proc(String)\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"UTF8String 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.018954816578947365\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 3.470253665290481e-5\n", | |
"SubString{UTF8String} 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.01851312689473684\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.00010500337487375001\n", | |
"UTF32String 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.007234282210526316\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 2.290304486447872e-5\n", | |
"RevString{UTF8String} 1400\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Mean time per 1000 islower(s): 0.036085492368421054\n" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
" Std Dev time per 1000 islower(s): 0.000493830033976403\n" | |
] | |
} | |
], | |
"prompt_number": 106 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |