File manipulation
Log in to the cluster:
ted:downloads$ ssh se-node10.se.lab
Last login: Mon Mar 23 17:35:37 2015 from 10.250.0.220
Please check the cluster reservation calendar:
https://www.google.com/calendar/embed?src=maprtech.com_2d38343133383836382d313737%40resource.calendar.google.com
Poke around looking for my volume and such:
[tdunning@se-node10 ~]$ ls /mapr/se1/user/t
tdunning/ tlojko/
[tdunning@se-node10 ~]$ ls /mapr/se1/user/tdunning/
old-cluster
[tdunning@se-node10 ~]$ maprcli volume list -columns volumename | grep tdunning
home.tdunning
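For reference, a volume like the one mounted at new-vol below can also be created from the command line instead of from the MCS. This is a sketch with hypothetical names, since the actual volume name never appears in this listing:

maprcli volume create -name test-vol -path /user/tdunning/new-vol   # hypothetical volume name and mount path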
Also note that we are already in my volume (because my home directory is on the cluster).
[tdunning@se-node10 ~]$ pwd
/mapr/se1/user/tdunning
When we look at the contents of my home directory, we see old stuff and my new volume. The new volume name is not the same as the mount point. It is the mount point that we see here.
[tdunning@se-node10 ~]$ ls
new-vol old-cluster
[tdunning@se-node10 ~]$ cd new-vol/
[tdunning@se-node10 new-vol]$ ls
Inside the new volume, create some empty files just because we can:
[tdunning@se-node10 new-vol]$ touch x y z
[tdunning@se-node10 new-vol]$ ls
x y z
[tdunning@se-node10 new-vol]$ cd ..
[tdunning@se-node10 ~]$ ls
new-vol old-cluster
OK... back in my home directory (which is on the cluster, of course), download the log-synth code and compile it:
[tdunning@se-node10 ~]$ pwd
/mapr/se1/user/tdunning
[tdunning@se-node10 ~]$ git clone https://github.com/tdunning/log-synth.git
Initialized empty Git repository in /mapr/se1/user/tdunning/log-synth/.git/
remote: Counting objects: 1421, done.
remote: Total 1421 (delta 0), reused 0 (delta 0), pack-reused 1421
Receiving objects: 100% (1421/1421), 2.48 MiB | 2.29 MiB/s, done.
Resolving deltas: 100% (572/572), done.
[tdunning@se-node10 ~]$ cd log-synth/
[tdunning@se-node10 log-synth]$ mvn -q -DskipTests package
... much goo deleted ...
[loading ZipFileIndexFileObject[/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/lib/ct.sym(META-INF/sym/rt.jar/java/lang/reflect/AnnotatedElement.class)]]
[loading ZipFileIndexFileObject[/mapr/se1/user/tdunning/.m2/repository/org/apache/mahout/mahout-math/0.9/mahout-math-0.9.jar(org/apache/mahout/common/RandomWrapper.class)]]
[loading ZipFileIndexFileObject[/mapr/se1/user/tdunning/.m2/repository/org/slf4j/slf4j-api/1.6.6/slf4j-api-1.6.6.jar(org/slf4j/Marker.class)]]
[wrote RegularFileObject[/mapr/se1/user/tdunning/log-synth/target/test-classes/com/mapr/stats/UpperQuantileTest.class]]
[total 680ms]
Check to see that the executable was created by the compilation. The file log-synth is the one we care about.
[tdunning@se-node10 log-synth]$ ls target
archive-tmp generated-sources log-synth log-synth-0.1-SNAPSHOT-jar-with-dependencies.jar maven-status
classes generated-test-sources log-synth-0.1-SNAPSHOT.jar maven-archiver test-classes
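If you are curious what kind of executable the build produced, the standard `file` command will tell you without running it (a quick check, not part of the original session):

file target/log-synth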
Back home, we try to run this program, and it tells us that we need to give it a schema:
[tdunning@se-node10 log-synth]$ cd ..
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth
Exception in thread "main" java.lang.IllegalArgumentException: Must specify schema file using [-schema filename] option
at com.mapr.synth.Synth.main(Synth.java:94)
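As a side note, the stack trace shows that the main class is com.mapr.synth.Synth, so if the launcher script were ever missing you could invoke the jar-with-dependencies directly. A sketch, using the jar name from the target listing above (this should print the same error about the missing schema):

java -cp log-synth/target/log-synth-0.1-SNAPSHOT-jar-with-dependencies.jar com.mapr.synth.Synth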
So we create a trivial schema that will generate three dates. Usually, we will do something more interesting here. See the README at https://github.com/tdunning/log-synth for more information.
[tdunning@se-node10 ~]$ cat > schema.json
[
{"name":"first_visit", "class":"date", "format":"MM/dd/yyyy"},
{"name":"second_date", "class":"date", "start":"2014-01-31", "end":"2014-02-07"},
{"name":"third_date", "class":"date", "format":"MM/dd/yyyy", "start":"01/31/1995", "end":"02/07/1999"}
]
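Since the schema is plain JSON, it is cheap to syntax-check it before running anything. A minimal sketch, assuming Python is installed on the node; it pretty-prints the document on success and reports a line number on failure:

python -m json.tool schema.json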
Now create a tiny output file with 20 lines.
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 20 -format JSON
R 1 0.0 0 0.0 0.000
F 1 0.0 20 654.8 820.850
[tdunning@se-node10 ~]$ cat foo/synth-0000
{"first_visit":"05/17/2013","second_date":"2014-01-31","third_date":"04/08/1998"}
{"first_visit":"10/05/2012","second_date":"2014-02-01","third_date":"06/17/1997"}
... 16 lines of similar goo omitted ...
{"first_visit":"06/27/2013","second_date":"2014-02-04","third_date":"02/18/1996"}
{"first_visit":"11/11/2012","second_date":"2014-02-03","third_date":"07/31/1995"}
Let's clean up and generate something bigger: 20 million lines using 20 threads. This isn't a Hadoop program, since it is ordinary Java running on a single machine, but it is a simple form of parallelism.
[tdunning@se-node10 ~]$ rm -rf foo
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 20M -threads 20 -format JSON
R 20 0.0 0 0.0 0.000
R 20 10.0 2370000 236178.3 237005.826
R 20 20.0 5014500 250296.7 264464.988
R 20 30.0 7657500 254959.0 264299.625
R 20 40.0 10098000 252234.2 244050.373
R 20 50.0 12846000 256744.3 274800.099
R 20 60.0 15470000 257686.2 262399.031
R 20 70.0 17838000 254703.8 236799.218
F 20 78.1 20000000 256182.6 269071.616
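The progress lines appear to report the thread count, elapsed seconds, records generated so far, and two throughput figures, with R marking a run in progress and F the final tally. The fifth column is consistent with an overall rate in records per second, which you can sanity-check from the final line:

echo "scale=1; 20000000 / 78.1" | bc
# prints 256081.9, close to the reported 256182.6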
Now we have lots of files and they are much bigger.
[tdunning@se-node10 ~]$ du -sh foo/*
79M foo/synth-0000
79M foo/synth-0001
79M foo/synth-0002
79M foo/synth-0003
79M foo/synth-0004
79M foo/synth-0005
79M foo/synth-0006
79M foo/synth-0007
79M foo/synth-0008
79M foo/synth-0009
79M foo/synth-0010
79M foo/synth-0011
79M foo/synth-0012
79M foo/synth-0013
79M foo/synth-0014
79M foo/synth-0015
79M foo/synth-0016
79M foo/synth-0017
79M foo/synth-0018
79M foo/synth-0019
[tdunning@se-node10 ~]$ du -sh foo
1.6G foo
[tdunning@se-node10 ~]$ wc -l foo/*
1000000 foo/synth-0000
1000000 foo/synth-0001
1000000 foo/synth-0002
1000000 foo/synth-0003
1000000 foo/synth-0004
1000000 foo/synth-0005
1000000 foo/synth-0006
1000000 foo/synth-0007
1000000 foo/synth-0008
1000000 foo/synth-0009
1000000 foo/synth-0010
1000000 foo/synth-0011
1000000 foo/synth-0012
1000000 foo/synth-0013
1000000 foo/synth-0014
1000000 foo/synth-0015
1000000 foo/synth-0016
1000000 foo/synth-0017
1000000 foo/synth-0018
1000000 foo/synth-0019
20000000 total
Clean up again and make something much smaller.
[tdunning@se-node10 ~]$ rm -rf foo
[tdunning@se-node10 ~]$ ./log-synth/target/log-synth -schema ./schema.json -output foo -count 2M -threads 10 -format JSON
R 10 0.0 0 0.0 0.000
F 10 7.1 2000000 279924.3 281345.025
[tdunning@se-node10 ~]$ du -sh foo
157M foo
Only 157MB instead of 1.6GB. Much nicer. Next we package this up as a single file to make web storage easier.
[tdunning@se-node10 ~]$ tar zcvf foo.tgz foo
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
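Before shipping an archive like this around, it is handy to check what went into it without unpacking anything. Swapping c (create) for t (list) does exactly that:

tar ztvf foo.tgz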
OK. While you weren't watching (because you were reading this), I copied that file to the Public directory in Dropbox on my laptop using scp. That means that I can delete foo.tgz from my cluster home directory and try downloading it from the web. First I use a utility called `wget`. Later, I will use `curl`, which is more commonly used by the SE team.
[tdunning@se-node10 ~]$ rm foo.tgz
[tdunning@se-node10 ~]$ wget http://bit.ly/se-onboarding-data
--2015-03-23 18:52:32-- http://bit.ly/se-onboarding-data
Resolving bit.ly... 69.58.188.39, 69.58.188.40
Connecting to bit.ly|69.58.188.39|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://dl.dropboxusercontent.com/u/36863361/foo.tgz [following]
--2015-03-23 18:52:32-- https://dl.dropboxusercontent.com/u/36863361/foo.tgz
Resolving dl.dropboxusercontent.com... 23.21.196.214, 50.17.184.208, 54.221.192.137, ...
Connecting to dl.dropboxusercontent.com|23.21.196.214|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13278296 (13M) [application/x-gtar]
Saving to: “se-onboarding-data”
100%[===========================================================================================================================================>] 13,278,296 2.60M/s in 5.6s
2015-03-23 18:52:41 (2.26 MB/s) - “se-onboarding-data” saved [13278296/13278296]
The `wget` command gives me lots of information by default, so I can see the redirection that happens with bit.ly short links. I can now unpack the file and see that it has the original data:
[tdunning@se-node10 ~]$ tar xvf se-onboarding-data
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
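Notice that `wget` saved the download under the short link's name, se-onboarding-data, even though the content is really a gzipped tar file. The -O flag lets you pick a more sensible output name up front:

wget -O foo.tgz http://bit.ly/se-onboarding-data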
OK. So we clean up again to get ready to use `curl` for the same task.
[tdunning@se-node10 ~]$ rm -rf foo foo.tgz
[tdunning@se-node10 ~]$ rm se-onboarding-data
[tdunning@se-node10 ~]$ ls
log-synth new-vol old-cluster schema.json side-log
[tdunning@se-node10 ~]$ cd new-vol/
[tdunning@se-node10 new-vol]$ ls
x y z
[tdunning@se-node10 new-vol]$ pwd
/mapr/se1/user/tdunning/new-vol
The `curl` command works a bit differently from `wget`. If I try to simply grab the contents of the bit.ly link, `curl` doesn't follow the redirect by default. It also puts the content on standard output, which can be bad if you are downloading tens of megabytes.
[tdunning@se-node10 new-vol]$ curl http://bit.ly/se-onboarding-data
<html>
<head><title>Bitly</title></head>
<body><a href="https://dl.dropboxusercontent.com/u/36863361/foo.tgz">moved here</a></body>
</html>[tdunning@se-node10 new-vol]$
The `-L` option forces `curl` to follow redirects. Note also the use of `> foo.tgz` to redirect the output to a file with a useful name.
[tdunning@se-node10 new-vol]$ curl -L http://bit.ly/se-onboarding-data > foo.tgz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12.6M 100 12.6M 0 0 1593k 0 0:00:08 0:00:08 --:--:-- 2884k
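Once you trust a link, `curl` can do the whole thing more quietly: -o writes the output to a named file (so no shell redirect is needed) and -s suppresses the progress meter:

curl -sL -o foo.tgz http://bit.ly/se-onboarding-data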
And again, note that the contents can be extracted as before.
[tdunning@se-node10 new-vol]$ tar xvf foo.tgz
foo/
foo/synth-0009
foo/synth-0004
foo/synth-0003
foo/synth-0008
foo/synth-0006
foo/synth-0005
foo/synth-0007
foo/synth-0002
foo/synth-0000
foo/synth-0001
[tdunning@se-node10 new-vol]$ wc -l foo/*
200000 foo/synth-0000
200000 foo/synth-0001
200000 foo/synth-0002
200000 foo/synth-0003
200000 foo/synth-0004
200000 foo/synth-0005
200000 foo/synth-0006
200000 foo/synth-0007
200000 foo/synth-0008
200000 foo/synth-0009
2000000 total
[tdunning@se-node10 new-vol]$ ls
foo foo.tgz x y z
You can see MapR-DB tables from the command line as well. They appear as symbolic links to a special location. I created a table called `data-table` in the `new-vol` directory using the MCS. Here is the result.
[tdunning@se-node10 new-vol]$ pwd
/mapr/se1/user/tdunning/new-vol
[tdunning@se-node10 new-vol]$ ls -l
total 12969
lr-------- 1 tdunning tdunning 2 Mar 23 19:11 data-table -> mapr::table::3315.125.131512
drwxrwxr-x 2 tdunning tdunning 10 Mar 23 18:43 foo
-rw-rw-r-- 1 tdunning tdunning 13278296 Mar 23 19:06 foo.tgz
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 x
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 y
-rw-rw-r-- 1 tdunning tdunning 0 Mar 23 18:28 z
[tdunning@se-node10 new-vol]$ ls
data-table foo foo.tgz x y z
[tdunning@se-node10 new-vol]$ find . -type l
./data-table
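The MCS is not the only way to make one of these tables; there is a command-line equivalent. A sketch, assuming the `maprcli table create` command in this MapR release, with the path given relative to the cluster namespace rather than the NFS mount:

maprcli table create -path /user/tdunning/new-vol/data-table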
I can search for other tables, subject to file permissions.
[tdunning@se-node10 new-vol]$ find /mapr/se1/user -type l -ls | grep ::
744048204 0 lrwx------ 1 apernsteiner apernsteiner 2 Mar 5 19:34 /mapr/se1/user/apernsteiner/tables/andytable -> mapr::table::2097.76.529274
874071677 0 lr-------- 1 tdunning tdunning 2 Mar 23 19:11 /mapr/se1/user/tdunning/new-vol/data-table -> mapr::table::3315.125.131512
907626118 0 lr-------- 1 kbotzum kbotzum 2 Mar 12 19:31 /mapr/se1/user/kbotzum/tables/ycsb -> mapr::table::2275.134.393918
907626140 0 lr-------- 1 kbotzum kbotzum 2 Mar 12 19:53 /mapr/se1/user/kbotzum/tables/ycsb2 -> mapr::table::2275.156.393962
188302894 0 lr-------- 1 jbates jbates 2 Dec 16 16:43 /mapr/se1/user/jbates/mapr_table -> mapr::table::2314.46.262516
^C
Finally, I clean up everything by deleting the `new-vol` volume using the MCS.
[tdunning@se-node10 ~]$ ls
log-synth old-cluster schema.json side-log
[tdunning@se-node10 ~]$ rm -rf foo-orig/
[tdunning@se-node10 ~]$ ls -l
total 2
drwxr-xr-x 6 tdunning tdunning 11 Mar 23 18:32 log-synth
drwxr-xr-x 6 tdunning tdunning 6 Oct 11 05:55 old-cluster
-rw-rw-r-- 1 tdunning tdunning 252 Mar 23 18:34 schema.json
-rw-rw-r-- 1 tdunning tdunning 260 Mar 23 18:43 side-log
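Deleting a volume also has a command-line equivalent. A sketch with a placeholder, since the actual volume name never appears in this transcript (remember that the volume name and the mount point differ):

maprcli volume remove -name <volume-name>   # placeholder; use the name shown by maprcli volume list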