#Apache Pig basic memo
##LOAD
Reads data from a file system such as HDFS. Given data like this,
$ cat myfile.txt
1 2 3
4 2 1
8 3 4
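LOAD works as follows. Assuming myfile.txt is tab-delimited, the two statements are equivalent, because PigStorage uses a tab as its default field delimiter.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');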
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
##FOREACH
Transforms the data. When A =
<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
the statement
X = FOREACH A GENERATE f1, f2;
produces X =
<1, 2>
<4, 2>
<8, 3>
<4, 3>
<7, 2>
<8, 4>
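##GROUP / COGROUP
Groups tuples that share a key. With A as above and B =
<2, 4>
<8, 9>
<1, 3>
<2, 7>
<2, 9>
<4, 6>
<4, 9>
co-grouping A and B on their first fields, keeping only the keys that appear in both relations, gives C =
<1, {<1, 2, 3>}, {<1, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>, <4, 9>}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>
A single relation is grouped by one field, or by several fields, like this:
X = GROUP A BY f1;
X = GROUP A BY (f1, f2 ..);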
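The co-grouping that yields C can be modeled outside Pig. The following is a minimal Python sketch, not Pig's implementation: the function name `cogroup_inner` is hypothetical, bags are represented as plain lists, and the keys are sorted only to make the output deterministic.

```python
from collections import defaultdict

def cogroup_inner(a, b, key_a=0, key_b=0):
    """Group the tuples of a and b by key, keeping only keys found in both."""
    bags_a, bags_b = defaultdict(list), defaultdict(list)
    for t in a:
        bags_a[t[key_a]].append(t)
    for t in b:
        bags_b[t[key_b]].append(t)
    # Keys present in only one relation (2 and 7 here) are dropped.
    shared = sorted(bags_a.keys() & bags_b.keys())
    return [(k, bags_a[k], bags_b[k]) for k in shared]

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

C = cogroup_inner(A, B)
for row in C:
    print(row)
```

Each output row holds the key, the bag of matching A tuples, and the bag of matching B tuples, which mirrors the three-column shape of C.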
##FLATTEN
An operator that changes the structure of tuples and bags in a way that a UDF cannot.
Consider a relation that has a tuple of the form (a, (b, c)). The expression
GENERATE $0, flatten($1)
will cause that tuple to become
(a, b, c)
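As a rough illustration of the un-nesting described here, the Python sketch below models FLATTEN on a nested tuple and on a bag. The helper names `flatten_tuple` and `flatten_bag` are hypothetical, and a bag is represented as a list of tuples.

```python
def flatten_tuple(row, index):
    """Splice the nested tuple at row[index] into the enclosing tuple."""
    return row[:index] + tuple(row[index]) + row[index + 1:]

def flatten_bag(row, index):
    """Emit one output row per tuple held in the bag at row[index]."""
    return [row[:index] + tuple(inner) + row[index + 1:] for inner in row[index]]

# Tuple case: (a, (b, c)) becomes (a, b, c).
flat = flatten_tuple(("a", ("b", "c")), 1)
print(flat)

# Bag case: (a, {(b, c), (d, e)}) becomes two rows.
rows = flatten_bag(("a", [("b", "c"), ("d", "e")]), 1)
print(rows)
```

The tuple case keeps the row count unchanged, while the bag case multiplies rows, one per tuple in the bag.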
##JOIN
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join A and B on their first fields, keeping only the keys that exist in both.
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
For outer joins, the Pig Latin syntax closely adheres to the SQL standard.
- Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas.
- Outer joins will only work for two-way joins; to perform a multi-way outer join, you will need to perform multiple two-way outer join statements.

For example, a left outer join looks like this:
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 LEFT OUTER, B BY $0;
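The schema requirement becomes easier to see in a small model: to pad an unmatched left tuple with nulls, the join has to know how many fields the right side has. The Python sketch below is an assumption-laden illustration, not Pig's implementation. The function `left_outer_join`, the `right_width` parameter, and the sample rows are all hypothetical, and null join keys are not modeled.

```python
def left_outer_join(left, right, right_width, key_left=0, key_right=0):
    """Keep every left tuple; pad with None when no right tuple matches."""
    rows = []
    for l in left:
        matches = [r for r in right if r[key_right] == l[key_left]]
        if matches:
            rows.extend(l + r for r in matches)
        else:
            # right_width stands in for the schema: it fixes the number of nulls.
            rows.append(l + (None,) * right_width)
    return rows

# Hypothetical contents for a.txt (n, a) and b.txt (n, m).
A = [("alice", 1), ("bob", 2)]
B = [("alice", "x")]

C = left_outer_join(A, B, right_width=2)
for row in C:
    print(row)
```

Here "bob" has no match in B, so its row is completed with two nulls, one for each field of B.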
##FILTER
Filters tuples by a condition.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Extract only the tuples whose third field is 3.
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
##STORE
Saves a relation to the file system.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage ('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
##UNION
Merges the contents of two or more relations.
###Schema behavior
If the relations differ in size (number of columns), the resulting schema is null.
A: (a1:long, a2:long)
B: (b1:long, b2:long, b3:long)
A union B: null
If the column types are incompatible, the resulting column becomes bytearray, as in the following example.
A: (a1:long, a2:long)
B: (b1:(b11:long, b12:long), b2:long)
A union B: (a1:bytearray, a2:long)
Union columns of compatible types produce an "escalated" type. The priority is:
- double > float > long > int > bytearray
- tuple|bag|map|chararray > bytearray
A: (a1:int, a2:bytearray, a3:int)
B: (b1:float, b2:chararray, b3:bytearray)
A union B: (a1:float, a2:chararray, a3:int)
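The schema rules in this section can be expressed as a small function. The Python sketch below is a simplified model based only on the rules and examples listed here, not Pig's actual type checker. The names `escalate` and `union_schema` are hypothetical, nested tuple schemas are reduced to the string "tuple", and any pairing the rules do not cover is assumed to fall back to bytearray.

```python
# Numeric priority from lowest to highest: double > float > long > int > bytearray.
NUMERIC = ["bytearray", "int", "long", "float", "double"]
# Types that win over bytearray: tuple|bag|map|chararray > bytearray.
COMPLEX = {"tuple", "bag", "map", "chararray"}

def escalate(t1, t2):
    """Return the modeled result type for one pair of union columns."""
    if t1 == t2:
        return t1
    if t1 in NUMERIC and t2 in NUMERIC:
        return max(t1, t2, key=NUMERIC.index)
    if "bytearray" in (t1, t2):
        other = t2 if t1 == "bytearray" else t1
        if other in COMPLEX:
            return other
    # Assumption: every other pairing is treated as incompatible.
    return "bytearray"

def union_schema(a, b):
    """a and b are lists of (name, type); a size mismatch gives a null schema."""
    if len(a) != len(b):
        return None
    return [(name, escalate(ta, tb)) for (name, ta), (_, tb) in zip(a, b)]

print(union_schema([("a1", "long"), ("a2", "long")],
                   [("b1", "long"), ("b2", "long"), ("b3", "long")]))
print(union_schema([("a1", "long"), ("a2", "long")],
                   [("b1", "tuple"), ("b2", "long")]))
print(union_schema([("a1", "int"), ("a2", "bytearray"), ("a3", "int")],
                   [("b1", "float"), ("b2", "chararray"), ("b3", "bytearray")]))
```

The three calls correspond to the three schema examples in this section: a size mismatch, an incompatible column, and escalation of compatible columns.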
###Example1
In this example the union of relations A and B is computed.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
B = LOAD 'data' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
###Example2
This example shows the use of ONSCHEMA.
L1 = LOAD 'f1' AS (a : int, b : float);
DUMP L1;
(11,12.0)
(21,22.0)
L2 = LOAD 'f1' AS (a : long, c : chararray);
DUMP L2;
(11,a)
(12,b)
(13,c)
U = UNION ONSCHEMA L1, L2;
DESCRIBE U ;
U : {a : long, b : float, c : chararray}
DUMP U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)
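The ONSCHEMA result can be reproduced with a short model that aligns columns by field name and fills the missing ones with null. This Python sketch is only an illustration: the function `union_onschema` is hypothetical, null is represented as None, and the type widening seen in DESCRIBE (int to long for field a) is not modeled.

```python
def union_onschema(rel1, fields1, rel2, fields2):
    """Union two relations by field name, padding absent fields with None."""
    # Merged field order: fields of the first relation, then new ones from the second.
    merged = list(fields1) + [f for f in fields2 if f not in fields1]

    def widen(rows, fields):
        out = []
        for row in rows:
            named = dict(zip(fields, row))
            out.append(tuple(named.get(f) for f in merged))
        return out

    return merged, widen(rel1, fields1) + widen(rel2, fields2)

L1 = [(11, 12.0), (21, 22.0)]
L2 = [(11, "a"), (12, "b"), (13, "c")]

fields, U = union_onschema(L1, ["a", "b"], L2, ["a", "c"])
print(fields)
for row in U:
    print(row)
```

Rows from L1 get None in column c and rows from L2 get None in column b, matching the empty positions in the DUMP output.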