@catawbasam
Created April 2, 2014 15:18
Faster, memory-efficient line-oriented text processing in Julia using mmap array
{
"metadata": {
"language": "Julia",
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Faster line-oriented file processing using mmap\n",
"When processing long lists of floats, eachline() + float() used a lot of RAM and ran slowly.\n",
"\n",
"Using mmap() with in-place conversion to Float64 cuts allocation on the largest test file from about 12.8 GB to about 10 KB and roughly halves processing time.\n",
"\n",
"Julia Version 0.3.0-prerelease (2014-02-28 04:44 UTC)\n",
" master/4e48d5b* \n",
" x86_64-redhat-linux"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### eachline() read and float() conversion to Float64\n",
"These results are better than the earlier \"Julia String to Float Timings\".\n",
"In that case we generated the strings in the same function as the float conversion.\n",
"We also used list comprehensions to load all strings and all floats into memory at the same time, which used a lot of memory and caused super-linear load times with respect to the length of the string array.\n",
"\n",
"This version is line-oriented, uses much less memory, and scales roughly linearly with the number of lines (see the timings below), but it is still RAM-hungry and too slow."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"function test_read_floats_eachline(fn)\n",
" fid = open(fn,\"r\")\n",
" f = 0.0\n",
" for ln in eachline(fid)\n",
" f = float(ln)\n",
" end\n",
" close(fid)\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"test_read_floats_eachline (generic function with 1 method)"
]
}
],
"prompt_number": 6
},
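{
"cell_type": "markdown",
"metadata": {},
"source": [
"On current Julia releases, `float(ln)` no longer parses strings; `parse(Float64, ln)` is the replacement. A minimal sketch of the same loop, assuming Julia 1.x (the name `read_floats_eachline` and the decision to return the last value are illustrative, not part of the original benchmark):\n",
"\n",
"```julia\n",
"# Assumes Julia 1.x; the function name is hypothetical.\n",
"function read_floats_eachline(fn::AbstractString)\n",
"    f = 0.0\n",
"    # eachline(fn) opens the file, strips each trailing newline,\n",
"    # and closes the file when iteration finishes\n",
"    for ln in eachline(fn)\n",
"        isempty(ln) && continue   # skip blank lines\n",
"        f = parse(Float64, ln)    # throws ArgumentError on malformed input\n",
"    end\n",
"    return f                      # last value parsed\n",
"end\n",
"```\n",
"\n",
"Timings from this sketch are not comparable with the 0.3.0-prerelease numbers recorded in this notebook."
]
},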
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_floats_eachline(\"float100_000.txt\")\n",
"@time test_read_floats_eachline(\"float100_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"elapsed time: 0.048104317 seconds (12925560 bytes allocated)\n",
"elapsed time: "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.054829294 seconds (12800128 bytes allocated)\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_floats_eachline(\"float1000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"elapsed time: "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.460350783 seconds (127997696 bytes allocated)\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_floats_eachline(\"float10_000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"elapsed time: "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"4.561470468 seconds (1279970048 bytes allocated)\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### float100_000_000.txt is 1.8 GB. It contains 100 million floats, one per line."
]
},
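{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook does not show how the test files were produced. To reproduce the setup, a file in the same one-float-per-line layout could be generated along these lines. This is a sketch assuming Julia 1.x; `write_float_file` and the value range `[0, 10)` are assumptions, not the original generator:\n",
"\n",
"```julia\n",
"# Hypothetical generator; the original files may have been made differently.\n",
"function write_float_file(fn::AbstractString, n::Integer)\n",
"    open(fn, \"w\") do io\n",
"        for _ in 1:n\n",
"            println(io, 10 * rand())   # one float per line\n",
"        end\n",
"    end\n",
"    return fn\n",
"end\n",
"```\n",
"\n",
"For example, `write_float_file(\"float100_000.txt\", 100_000)` would write a file with the same line count as the smallest timed case."
]
},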
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_floats_eachline(\"float100_000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"elapsed time: 45"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
".845388458 seconds (12799698720 bytes allocated)\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### eachline() + strtod!() conversion to float: slightly faster, a little less RAM-hungry\n",
"strtod!() reduces memory allocation by writing the result into a pre-allocated single-element Vector{Float64}. One method takes an ASCIIString; the other parses a range of bytes from a Vector{Uint8}."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"function strtod!(mystr::ASCIIString, fa::Vector{Float64})\n",
" # convert a string to a Float64 in place, using a single element array \n",
" status = ccall(:jl_substrtod, Int32, (Ptr{Uint8}, Csize_t,Int, Ptr{Float64}), \n",
" mystr, convert(Csize_t,0), length(mystr)-1, fa) \n",
" return status # 0=bad, 1=good\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"strtod! (generic function with 1 method)"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fa=Array(Float64,1)\n",
"status = strtod!(ascii(\"33.3\"),fa)\n",
"status, fa[1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"(1,33.3)"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#function strtod!(chars::Array{Uint8,1}, startpos::Int, endpos::Int, fa::Array{Float64,1})\n",
"function strtod!(chars::Vector{Uint8}, startpos::Int, endpos::Int, fa::Vector{Float64})\n",
" # convert a string to a Float64 in place from a range of bytes\n",
" status = ccall(:jl_substrtod, Int32, (Ptr{Uint8}, Csize_t,Int, Ptr{Float64}), \n",
" chars, convert(Csize_t,startpos-1), endpos-1, fa) \n",
" return status # 0=bad, 1=good\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"strtod! (generic function with 2 methods)"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s=\"5.5,6.6\"\n",
"ac = s.data\n",
"ac'\n",
"fa=[0.00,]\n",
"strtod!(ac, 1, 3, fa)\n",
"fa"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"1-element Array{Float64,1}:\n",
" 5.5"
]
}
],
"prompt_number": 13
},
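{
"cell_type": "markdown",
"metadata": {},
"source": [
"`jl_substrtod` is a runtime function rather than part of Julia's documented API, so the `ccall` above is tied to this Julia version. On Julia 1.x, one documented way to parse part of a string without copying it is `tryparse` on a `SubString`. A minimal sketch under that assumption (`parse_field` is an illustrative name):\n",
"\n",
"```julia\n",
"# Assumes Julia 1.x; parse_field is a hypothetical helper.\n",
"# Parse the characters s[startpos:endpos] as a Float64 without copying them.\n",
"# Returns `nothing` instead of throwing when the range is not a valid float.\n",
"function parse_field(s::String, startpos::Int, endpos::Int)\n",
"    return tryparse(Float64, SubString(s, startpos, endpos))\n",
"end\n",
"\n",
"parse_field(\"5.5,6.6\", 1, 3)   # 5.5\n",
"parse_field(\"5.5,6.6\", 5, 7)   # 6.6\n",
"parse_field(\"5.5,6.6\", 1, 4)   # nothing, because \"5.5,\" is not a valid float\n",
"```\n",
"\n",
"Unlike `strtod!`, this returns the value directly instead of writing into a pre-allocated array, and signals failure with `nothing` rather than a status code."
]
},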
{
"cell_type": "code",
"collapsed": false,
"input": [
"function test_read_strtod_eachline(fn)\n",
" fid = open(fn,\"r\")\n",
" fa=[0.0,]\n",
" for ln in eachline(fid)\n",
" strtod!(ln,fa)\n",
" end\n",
" close(fid)\n",
" println(fa[1])\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"test_read_strtod_eachline (generic function with 1 method)"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_strtod_eachline(\"float100.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"9"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
".316638239477001\n",
"elapsed time: 0.01741202 seconds (511948 bytes allocated)\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_strtod_eachline(\"float100_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"8.984626911175305\n",
"elapsed time: 0.052980347 seconds (9601192 bytes allocated)\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_strtod_eachline(\"float1000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"3"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
".0051466341223065\n",
"elapsed time: 0.417515282 seconds (96006108 bytes allocated)\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_strtod_eachline(\"float10_000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"7"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
".287769581829075\n",
"elapsed time: 4.230148389 seconds (959978460 bytes allocated)\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time test_read_strtod_eachline(\"float100_000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.19639390030415793"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"elapsed time: 43.136997722 seconds (9599735936 bytes allocated)\n"
]
}
],
"prompt_number": 26
},
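{
"cell_type": "markdown",
"metadata": {},
"source": [
"### mmap with 'pointers': roughly 2x faster than both eachline()+strtod!() and eachline()+float()\n",
"We treat the data as an array of Uint8, which can be a memory-mapped file or an array already in memory. doline() scans for newline bytes and passes each line's byte range to strtod!(), so no per-line strings are allocated."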
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# convenience method to print a section of bytes as a string\n",
"string(ba::Vector{Uint8}, SOL::Int, EOL::Int) = join(char(ba[SOL:EOL]) )"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"string (generic function with 1 method)"
]
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"function doline(ba::Vector{Uint8})\n",
" SOL=0 \n",
" EOL=0\n",
" myfloat = [0.00,]\n",
" i=1\n",
" for i in 1:length(ba)\n",
" if ba[i]=='\\n'\n",
" SOL=EOL+1; EOL=i\n",
" if EOL>SOL; strtod!(ba,SOL,EOL-1,myfloat); end\n",
" #println(\"pos \",SOL,' ', EOL)\n",
" #println(myfloat )\n",
" end\n",
" end\n",
" if EOL<length(ba)\n",
" SOL=EOL+1; EOL=i\n",
" if EOL>SOL; strtod!(ba,SOL,EOL-1,myfloat); end\n",
" #println(\"pos \", SOL,' ', EOL)\n",
" #print(myfloat )\n",
" end\n",
" println(\"Done\")\n",
" println(SOL,' ', EOL, ' ', myfloat)\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"doline (generic function with 1 method)"
]
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s=\"5.5\\n6.6\"\n",
"ac = s.data\n",
"doline(ac)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Done\n",
"5 7 6.6\n",
"\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"function doline(pathname::String) \n",
" fs=filesize(pathname)\n",
" io = open(pathname, \"r\")\n",
" ma = mmap_array(Uint8, (fs,), io)\n",
" doline(ma)\n",
" close(io)\n",
"end"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 22,
"text": [
"doline (generic function with 2 methods)"
]
}
],
"prompt_number": 22
},
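{
"cell_type": "markdown",
"metadata": {},
"source": [
"`mmap_array` is the memory-mapping call in this Julia version; current releases provide `Mmap.mmap` in the `Mmap` standard library instead. A minimal sketch of the same byte-scanning approach, assuming Julia 1.x. `sum_floats_mmap` is an illustrative name, and unlike `strtod!` this version copies each line into a `String` before parsing, so it allocates per line:\n",
"\n",
"```julia\n",
"using Mmap\n",
"\n",
"# Assumes Julia 1.x; sum_floats_mmap is a hypothetical name.\n",
"# Returns (number of floats parsed, their sum).\n",
"function sum_floats_mmap(path::AbstractString)\n",
"    ba = open(Mmap.mmap, path, \"r\")   # file contents as a Vector{UInt8}\n",
"    n = length(ba)\n",
"    total = 0.0\n",
"    count = 0\n",
"    sol = 1                            # start of the current line\n",
"    for i in 1:n+1\n",
"        # a line ends at a newline byte or at end of file\n",
"        if i > n || ba[i] == UInt8('\\n')\n",
"            if i > sol                 # skip empty lines\n",
"                x = tryparse(Float64, String(ba[sol:i-1]))\n",
"                if x !== nothing\n",
"                    total += x\n",
"                    count += 1\n",
"                end\n",
"            end\n",
"            sol = i + 1\n",
"        end\n",
"    end\n",
"    return count, total\n",
"end\n",
"```\n",
"\n",
"Treating end of file as a line terminator handles files that lack a trailing newline, which the original `doline` covers with a separate block after its loop."
]
},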
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time doline(\"float10_000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Done\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"183584053 183584070 7.287769581829075\n",
"\n",
"elapsed time: 2.123117216 seconds (10616 bytes allocated)\n"
]
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"@time doline(\"float100_000_000.txt\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Done\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1835839677 1835839696 .19639390030415793\n",
"\n",
"elapsed time: 23.067628093 seconds (10616 bytes allocated)\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
";ipython nbconvert lineread_mmap_short.ipynb"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"[NbConvertApp] Using existing profile dir: u'/home/keithc/.config/ipython/profile_default'\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"[NbConvertApp] Converting notebook test_strings.ipynb to html\n",
"[NbConvertApp] Support files will be in test_strings_files/\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"[NbConvertApp] Loaded template html_full.tpl\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"[NbConvertApp] Writing 278075 bytes to test_strings.html\n"
]
}
],
"prompt_number": 92
}
],
"metadata": {}
}
]
}