Packages:
- https://github.com/tanmaykm/Clustering.jl/tree/tanmaykm
- https://github.com/tanmaykm/DistributedDistances.jl
Dataset: kddcup (network intrusion data, clustered here with k-means)
Size: 5,000,000 observations, each with 40 features
Clusters: 150
$ ~/julia/julia -p 20
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+1784 (2015-12-15 05:34 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 62f1481 (2 days old master)
|__/                   |  x86_64-linux-gnu
julia> include("kdd.jl")
INFO: reading csv...
INFO: transposing...
K-means converged with 63 iterations (objv = 5.2919472734757043e11)
INFO: distributing...
K-means converged with 64 iterations (objv = 5.2919472734757043e11)
K-means converged with 63 iterations (objv = 5.2919472734757043e11)
K-means converged with 256 iterations (objv = 1.271281932891472e12)
kmpar: 248.89936900138855, kmpp: 81.91902709007263, distributed: 88.15856599807739, singlenode: 646.2298829555511
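To make the timing line easier to read, the ratios against the singlenode run can be worked out as in the sketch below. The numbers are copied verbatim from the log and are assumed to be wall-clock seconds; the variable names are illustrative and not taken from kdd.jl. The logged runs converged after different iteration counts (63, 64, 63 and 256) and one run ended at a different objective value, so these ratios compare total run time rather than per-iteration cost. The `round(x, digits = 2)` form assumes Julia 1.x; the 0.5.0-dev build shown in the transcript would need `round(x, 2)` instead.

```julia
# Timings copied verbatim from the log line above (assumed to be seconds).
timings = [
    ("kmpar",       248.89936900138855),
    ("kmpp",         81.91902709007263),
    ("distributed",  88.15856599807739),
    ("singlenode",  646.2298829555511),
]

# Use the singlenode run (last entry) as the baseline.
baseline = last(timings[end])

# Ratio of baseline time to each run's time; larger means faster.
speedups = [(name, baseline / t) for (name, t) in timings]

for (name, s) in speedups
    println(rpad(name, 12), " ", round(s, digits = 2), "x")
end
```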
Packages:
- https://github.com/tanmaykm/RecSys.jl/tree/tanmaykm
- https://github.com/tanmaykm/RecSys.jl/tree/threading
Dataset: movielens (ml-20m)
20,000,000 ratings for 27,000 movies by 138,000 users
# single processor
julia> test("/home/tan/Work/datasets/movielens/ml-20m")
17-Dec 20:20:31:DEBUG:root:loading inputs...
17-Dec 20:20:54:DEBUG:root:time to load inputs: 22.946762084960938 secs
17-Dec 20:20:54:DEBUG:root:preparing inputs...
17-Dec 20:20:55:DEBUG:root:prep time: 0.8154430389404297
17-Dec 20:21:35:DEBUG:root:fact time 40.201488971710205
17-Dec 20:22:03:DEBUG:root:rmse time 28.2427339553833
# shared memory mode
# julia -p 8
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+1824 (2015-12-16 08:25 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 79b2418* (1 day old master)
|__/                   |  x86_64-linux-gnu
julia> include("movielens.jl")
test (generic function with 1 method)
julia> test("/home/tan/Work/datasets/movielens/ml-20m")
17-Dec 20:12:52:DEBUG:root:loading inputs...
17-Dec 20:13:14:DEBUG:root:time to load inputs: 21.93354296684265 secs
17-Dec 20:13:14:DEBUG:root:preparing inputs...
17-Dec 20:13:15:DEBUG:root:prep time: 0.9523890018463135
17-Dec 20:13:35:DEBUG:root:fact time 19.933722019195557
17-Dec 20:13:47:DEBUG:root:rmse time 11.64983582496643
# with threads
# 1 thread
julia> test("/home/tan/Work/datasets/movielens/ml-20m")
03-Jan 07:24:41:DEBUG:root:loading inputs...
03-Jan 07:25:05:DEBUG:root:time to load inputs: 23.175585985183716 secs
03-Jan 07:25:05:DEBUG:root:preparing inputs...
03-Jan 07:25:06:DEBUG:root:prep time: 1.1804168224334717
03-Jan 07:25:28:DEBUG:root:fact time 21.833703994750977
03-Jan 07:25:57:DEBUG:root:rmse time 29.767030000686646
rmse of the model: 0.7929127604591637
# 2 threads
julia> test("/home/tan/Work/datasets/movielens/ml-20m")
03-Jan 07:31:10:DEBUG:root:loading inputs...
03-Jan 07:31:31:DEBUG:root:time to load inputs: 21.776299953460693 secs
03-Jan 07:31:31:DEBUG:root:preparing inputs...
03-Jan 07:31:32:DEBUG:root:prep time: 0.8184521198272705
03-Jan 07:31:50:DEBUG:root:fact time 17.791420936584473
03-Jan 07:32:08:DEBUG:root:rmse time 17.79410696029663
rmse of the model: 0.7934035822594543
# 8 threads (on 4+4 hyperthreaded cores)
julia> test("/home/tan/Work/datasets/movielens/ml-20m")
03-Jan 07:21:05:DEBUG:root:loading inputs...
03-Jan 07:21:34:DEBUG:root:time to load inputs: 28.406423807144165 secs
03-Jan 07:21:34:DEBUG:root:preparing inputs...
03-Jan 07:21:34:DEBUG:root:prep time: 0.8342618942260742
03-Jan 07:21:49:DEBUG:root:fact time 14.779378890991211
03-Jan 07:22:02:DEBUG:root:rmse time 11.711627006530762
rmse of the model: 0.7940479653508236
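The "fact time" and "rmse time" figures from the runs above can be set side by side with a short sketch like the one below. The values are copied from the DEBUG lines and are assumed to be seconds; the labels and variable names are illustrative and do not come from movielens.jl. The single-processor and shared-memory runs were logged on 17-Dec and the threaded runs on 03-Jan, so the underlying code may differ between the two groups and the cross-group ratios are indicative only. As before, `round(x, digits = 2)` assumes Julia 1.x.

```julia
# (label, fact time, rmse time), copied from the DEBUG lines above (assumed seconds).
runs = [
    ("single processor",    40.201488971710205, 28.2427339553833),
    ("shared memory, -p 8", 19.933722019195557, 11.64983582496643),
    ("1 thread",            21.833703994750977, 29.767030000686646),
    ("2 threads",           17.791420936584473, 17.79410696029663),
    ("8 threads",           14.779378890991211, 11.711627006530762),
]

# Baseline: the single-processor run.
base_fact, base_rmse = runs[1][2], runs[1][3]

# Ratio of baseline time to each run's time; larger means faster.
ratios = [(label, base_fact / fact, base_rmse / rmse) for (label, fact, rmse) in runs]

for (label, f, r) in ratios
    println(rpad(label, 22), "fact ", round(f, digits = 2), "x   rmse ", round(r, digits = 2), "x")
end
```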