File manipulation
Log in to the cluster:
ted:downloads$ ssh se-node10.se.lab
Last login: Mon Mar 23 17:35:37 2015 from 10.250.0.220
Please check the cluster reservation calendar:
https://www.google.com/calendar/embed?src=maprtech.com_2d38343133383836382d313737%40resource.calendar.google.com
Poke around looking for my volume and such:
[tdunning@se-node10 ~]$ ls /mapr/se1/user/t
tdunning/ tlojko/
[tdunning@se-node10 ~]$ ls /mapr/se1/user/tdunning/
old-cluster
[tdunning@se-node10 ~]$ maprcli volume list -columns volumename | grep tdunning
home.tdunning
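For reference, a volume like the one mounted at new-vol below can also be created from the command line instead of from the MCS. This is a sketch with hypothetical names, since the actual volume name never appears in this listing:

maprcli volume create -name test-vol -path /user/tdunning/new-vol   # hypothetical volume name and mount path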
Also note that we are already in my volume (because my home directory is on the cluster).
[tdunning@se-node10 ~]$ pwd
/mapr/se1/user/tdunning
When we look at the contents of my home directory, we see old stuff and my new volume. The new volume name is not the same as the mount point. It is the mount point that we see here.
[tdunning@se-node10 ~]$ ls
new-vol old-cluster
[tdunning@se-node10 ~]$ cd new-vol/
[tdunning@se-node10 new-vol]$ ls
Inside the new volume, create some empty files just because we can:
[tdunning@se-node10 new-vol]$ touch x y z
[tdunning@se-node10 new-vol]$ ls
x y z
[tdunning@se-node10 new-vol]$ cd ..
[tdunning@se-node10 ~]$ ls
new-vol old-cluster
OK... back in my home directory (which is on the cluster, of course), download the log-synth code and compile it:
[tdunning@se-node10 ~]$ pwd
/mapr/se1/user/tdunning
[tdunning@se-node10 ~]$ git clone https://github.com/tdunning/log-synth.git
Initialized empty Git repository in /mapr/se1/user/tdunning/log-synth/.git/
remote: Counting objects: 1421, done.
remote: Total 1421 (delta 0), reused 0 (delta 0), pack-reused 1421
Receiving objects: 100% (1421/1421), 2.48 MiB | 2.29 MiB/s, done.
Resolving deltas: 100% (572/572), done.
[tdunning@se-node10 ~]$ cd log-synth/
[tdunning@se-node10 log-synth]$ mvn -q -DskipTests package
... much goo deleted ...
[loading ZipFileIndexFileObject[/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/lib/ct.sym(META-INF/sym/rt.jar/java/lang/reflect/AnnotatedElement.class)]]
[loading ZipFileIndexFileObject[/mapr/se1/user/tdunning/.m2/repository/org/apache/mahout/mahout-math/0.9/mahout-math-0.9.jar(org/apache/mahout/common/RandomWrapper.class)]]
[loading ZipFileIndexFileObject[/mapr/se1/user/tdunning/.m2/repository/org/slf4j/slf4j-api/1.6.6/slf4j-api-1.6.6.jar(org/slf4j/Marker.class)]]
[wrote RegularFileObject[/mapr/se1/user/tdunning/log-synth/target/test-classes/com/mapr/stats/UpperQuantileTest.class]]
[total 680ms]
Check to see that the executable was created by the compilation. The file log-synth is the one we care about.
[tdunning@se-node10 log-synth]$ ls target
archive-tmp generated-sources log-synth log-synth-0.1-SNAPSHOT-jar-with-dependencies.jar maven-status
classes generated-test-sources log-synth-0.1-SNAPSHOT.jar maven-archiver test-classes
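If you are curious what kind of executable the build produced, the standard `file` command will tell you without running it (a quick check, not part of the original session):

file target/log-synth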
Back home, we try to run this program, and it tells us that we need to give it a schema:
[tdunning@se-node10 log-synth]$ cd ..
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth
Exception in thread "main" java.lang.IllegalArgumentException: Must specify schema file using [-schema filename] option
at com.mapr.synth.Synth.main(Synth.java:94)
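As a side note, the stack trace shows that the main class is com.mapr.synth.Synth, so if the launcher script were ever missing you could invoke the jar-with-dependencies directly. A sketch, using the jar name from the target listing above (this should print the same error about the missing schema):

java -cp log-synth/target/log-synth-0.1-SNAPSHOT-jar-with-dependencies.jar com.mapr.synth.Synth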
So we create a trivial schema that will generate three dates. Usually, we will do something more interesting here. See the README at https://github.com/tdunning/log-synth for more information.
[tdunning@se-node10 ~]$ cat > schema.json
[
{"name":"first_visit", "class":"date", "format":"MM/dd/yyyy"},
{"name":"second_date", "class":"date", "start":"2014-01-31", "end":"2014-02-07"},
{"name":"third_date", "class":"date", "format":"MM/dd/yyyy", "start":"01/31/1995", "end":"02/07/1999"}
]
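Since the schema is plain JSON, it is cheap to syntax-check it before running anything. A minimal sketch, assuming Python is installed on the node; it pretty-prints the document on success and reports a line number on failure:

python -m json.tool schema.json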
Now create a tiny output file with 20 lines.
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 20 -format JSON
R 1 0.0 0 0.0 0.000
F 1 0.0 20 654.8 820.850
[tdunning@se-node10 ~]$ cat foo/synth-0000
{"first_visit":"05/17/2013","second_date":"2014-01-31","third_date":"04/08/1998"}
{"first_visit":"10/05/2012","second_date":"2014-02-01","third_date":"06/17/1997"}
... 16 lines of similar goo omitted ...
{"first_visit":"06/27/2013","second_date":"2014-02-04","third_date":"02/18/1996"}
{"first_visit":"11/11/2012","second_date":"2014-02-03","third_date":"07/31/1995"}
Let's clean up and generate something bigger: 20 million lines using 20 threads. This isn't a Hadoop program, since it is ordinary Java running on a single machine, but it is a simple form of parallelism.
[tdunning@se-node10 ~]$ rm -rf foo
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 20M -threads 20 -format JSON
R 20 0.0 0 0.0 0.000
R 20 10.0 2370000 236178.3 237005.826
R 20 20.0 5014500 250296.7 264464.988
R 20 30.0 7657500 254959.0 264299.625
R 20 40.0 10098000 252234.2 244050.373
R 20 50.0 12846000 256744.3 274800.099
R 20 60.0 15470000 257686.2 262399.031
R 20 70.0 17838000 254703.8 236799.218
F 20 78.1 20000000 256182.6 269071.616
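The progress lines appear to report the thread count, elapsed seconds, records generated so far, and two throughput figures, with R marking a run in progress and F the final tally. The fifth column is consistent with an overall rate in records per second, which you can sanity-check from the final line:

echo "scale=1; 20000000 / 78.1" | bc
# prints 256081.9, close to the reported 256182.6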
Now we have lots of files and they are much bigger.
[tdunning@se-node10 ~]$ du -sh foo/*
79M foo/synth-0000
79M foo/synth-0001
79M foo/synth-0002
79M foo/synth-0003
79M foo/synth-0004
79M foo/synth-0005
79M foo/synth-0006
79M foo/synth-0007
79M foo/synth-0008
79M foo/synth-0009
79M foo/synth-0010
79M foo/synth-0011
79M foo/synth-0012
79M foo/synth-0013
79M foo/synth-0014
79M foo/synth-0015
79M foo/synth-0016
79M foo/synth-0017
79M foo/synth-0018
79M foo/synth-0019
[tdunning@se-node10 ~]$ du -sh foo
1.6G foo
[tdunning@se-node10 ~]$ wc -l foo/*
1000000 foo/synth-0000
1000000 foo/synth-0001
1000000 foo/synth-0002
1000000 foo/synth-0003
1000000 foo/synth-0004
1000000 foo/synth-0005
1000000 foo/synth-0006
1000000 foo/synth-0007
1000000 foo/synth-0008
1000000 foo/synth-0009
1000000 foo/synth-0010
1000000 foo/synth-0011
1000000 foo/synth-0012
1000000 foo/synth-0013
1000000 foo/synth-0014
1000000 foo/synth-0015
1000000 foo/synth-0016
1000000 foo/synth-0017
1000000 foo/synth-0018
1000000 foo/synth-0019
20000000 total
Clean up again and make something much smaller.
[tdunning@se-node10 ~]$ rm -rf foo
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 2M -threads 10 -format JSON
R 10 0.0 0 0.0 0.000
F 10 7.1 2000000 279924.3 281345.025
[tdunning@se-node10 ~]$ du -sh foo
157M foo
Only 157MB instead of 1.6GB. Much nicer. Next we package this up as a single file to make web storage easier.
[tdunning@se-node10 ~]$ tar zcvf foo.tgz foo
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
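Before shipping an archive like this around, it is handy to check what went into it without unpacking anything. Swapping c (create) for t (list) does exactly that:

tar ztvf foo.tgz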
OK. While you weren't watching (because you were reading this), I copied that file to the Public directory in Dropbox on my laptop using scp. That means that I can delete foo.tgz from my cluster home directory and try downloading it from the web. First I use a utility called `wget`. Later, I will use `curl`, which is more commonly used by the SE team.
[tdunning@se-node10 ~]$ rm foo.tgz
[tdunning@se-node10 ~]$ wget http://bit.ly/se-onboarding-data
--2015-03-23 18:52:32-- http://bit.ly/se-onboarding-data
Resolving bit.ly... 69.58.188.39, 69.58.188.40
Connecting to bit.ly|69.58.188.39|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://dl.dropboxusercontent.com/u/36863361/foo.tgz [following]
--2015-03-23 18:52:32-- https://dl.dropboxusercontent.com/u/36863361/foo.tgz
Resolving dl.dropboxusercontent.com... 23.21.196.214, 50.17.184.208, 54.221.192.137, ...
Connecting to dl.dropboxusercontent.com|23.21.196.214|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13278296 (13M) [application/x-gtar]
Saving to: “se-onboarding-data”
100%[===========================================================================================================================================>] 13,278,296 2.60M/s in 5.6s
2015-03-23 18:52:41 (2.26 MB/s) - “se-onboarding-data” saved [13278296/13278296]
The `wget` command gives me lots of information by default, so I can see the redirection that happens with bit.ly short links. I can now unpack the file and see that it has the original data:
[tdunning@se-node10 ~]$ tar xvf se-onboarding-data
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
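Notice that `wget` saved the download under the short link's name, se-onboarding-data, even though the content is really a gzipped tar file. The -O flag lets you pick a more sensible output name up front:

wget -O foo.tgz http://bit.ly/se-onboarding-data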
OK. So we clean up again to get ready to use `curl` for the same task.
[tdunning@se-node10 ~]$ rm -rf foo foo.tgz
[tdunning@se-node10 ~]$ rm se-onboarding-data
[tdunning@se-node10 ~]$ ls
log-synth new-vol old-cluster schema.json side-log
[tdunning@se-node10 ~]$ cd new-vol/
[tdunning@se-node10 new-vol]$ ls
x y z
[tdunning@se-node10 new-vol]$ pwd
/mapr/se1/user/tdunning/new-vol
The `curl` command works a bit differently from `wget`. If I try to simply grab the contents of the bit.ly link, `curl` doesn't follow the redirect by default. It also puts the content on standard output, which can be bad if you are downloading tens of megabytes.
[tdunning@se-node10 new-vol]$ curl http://bit.ly/se-onboarding-data
<html>
<head><title>Bitly</title></head>
<body><a href="https://dl.dropboxusercontent.com/u/36863361/foo.tgz">moved here</a></body>
</html>[tdunning@se-node10 new-vol]$
The `-L` option forces `curl` to follow redirects. Note also the use of `> foo.tgz` to redirect the output to a file with a useful name.
[tdunning@se-node10 new-vol]$ curl -L http://bit.ly/se-onboarding-data > foo.tgz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.6M 100 12.6M 0 0 1593k 0 0:00:08 0:00:08 --:--:-- 2884k
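Once you trust a link, `curl` can do the whole thing more quietly: -o writes the output to a named file (so no shell redirect is needed) and -s suppresses the progress meter:

curl -sL -o foo.tgz http://bit.ly/se-onboarding-data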
And again, note that the contents can be extracted as before.
[tdunning@se-node10 new-vol]$ tar xvf foo.tgz
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
[tdunning@se-node10 new-vol]$ wc -l foo/*
200000 foo/synth-0000
200000 foo/synth-0001
200000 foo/synth-0002
200000 foo/synth-0003
200000 foo/synth-0004
200000 foo/synth-0005
200000 foo/synth-0006
200000 foo/synth-0007
200000 foo/synth-0008
200000 foo/synth-0009
2000000 total
[tdunning@se-node10 new-vol]$ ls
foo foo.tgz x y z
You can see MapR-DB tables from the command line as well. They appear as symbolic links to a special location. I created a table called `data-table` in the `new-vol` directory using the MCS. Here is the result.
[tdunning@se-node10 new-vol]$ pwd
/mapr/se1/user/tdunning/new-vol
[tdunning@se-node10 new-vol]$ ls -l
total 12969
lr-------- 1 tdunning tdunning 2 Mar 23 19:11 data-table -> mapr::table::3315.125.131512
drwxrwxr-x 2 tdunning tdunning 10 Mar 23 18:43 foo
-rw-rw-r-- 1 tdunning tdunning 13278296 Mar 23 19:06 foo.tgz
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 x
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 y
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 z
[tdunning@se-node10 new-vol]$ ls
data-table foo foo.tgz x y z
[tdunning@se-node10 new-vol]$ find . -type l
./data-table
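The MCS is not the only way to make one of these tables; there is a command-line equivalent. A sketch, assuming the `maprcli table create` command in this MapR release, with the path given relative to the cluster namespace rather than the NFS mount:

maprcli table create -path /user/tdunning/new-vol/data-table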
I can search for other tables, subject to file permissions.
[tdunning@se-node10 new-vol]$ find /mapr/se1/user -type l -ls | grep ::
744048204 0 lrwx------ 1 apernsteiner apernsteiner 2 Mar 5 19:34 /mapr/se1/user/apernsteiner/tables/andytable -> mapr::table::2097.76.529274
874071677 0 lr-------- 1 tdunning tdunning 2 Mar 23 19:11 /mapr/se1/user/tdunning/new-vol/data-table -> mapr::table::3315.125.131512
907626118 0 lr-------- 1 kbotzum kbotzum 2 Mar 12 19:31 /mapr/se1/user/kbotzum/tables/ycsb -> mapr::table::2275.134.393918
907626140 0 lr-------- 1 kbotzum kbotzum 2 Mar 12 19:53 /mapr/se1/user/kbotzum/tables/ycsb2 -> mapr::table::2275.156.393962
188302894 0 lr-------- 1 jbates jbates 2 Dec 16 16:43 /mapr/se1/user/jbates/mapr_table -> mapr::table::2314.46.262516
^C
Finally, I clean up everything by deleting the `new-vol` volume using the MCS.
[tdunning@se-node10 ~]$ ls
log-synth old-cluster schema.json side-log
[tdunning@se-node10 ~]$ rm -rf foo-orig/
[tdunning@se-node10 ~]$ ls -l
total 2
drwxr-xr-x 6 tdunning tdunning 11 Mar 23 18:32 log-synth
drwxr-xr-x 6 tdunning tdunning 6 Oct 11 05:55 old-cluster
-rw-rw-r-- 1 tdunning tdunning 252 Mar 23 18:34 schema.json
-rw-rw-r-- 1 tdunning tdunning 260 Mar 23 18:43 side-log
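Deleting a volume also has a command-line equivalent. A sketch with a placeholder, since the actual volume name never appears in this transcript (remember that the volume name and the mount point differ):

maprcli volume remove -name <volume-name>   # placeholder; use the name shown by maprcli volume list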