Created
March 12, 2017 12:38
Apache Hadoop Hive Serde2 Notes
| Notes: | |
| ------------- | |
| external table: | |
| ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' | |
| WITH SERDEPROPERTIES ("input.regex" = "(.{2})(.{10})(.{30})(.{10})(.{10}).*" ) | |
| LOCATION '${hiveconf:path}'; | |
| location:maprfs:/externalpath | |
| inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, | |
| serializationLib:org.apache.hadoop.hive.serde2.RegexSerDe, | |
| location:maprfs:/externalpath, | |
| inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, | |
| internal table: | |
| location:maprfs:/home/thanooj/hive/warehouse/emp.db/emp, | |
| inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, | |
| serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, | |
| @Deprecated | |
| public interface SerDe extends Deserializer, Serializer | |
| public abstract class AbstractSerDe extends Object implements SerDe | |
| public class LazySimpleSerDe extends AbstractSerDe | |
| LazySimpleSerDe can be used to read the same data format as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol. However, LazySimpleSerDe creates Objects in a lazy way, to provide better performance. | |
| Also LazySimpleSerDe outputs typed columns instead of treating all columns as String like MetadataTypedColumnsetSerDe. | |
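The lazy idea above can be sketched in plain Java. This is only an illustration of deferring work until a field is read, not Hive's actual LazySimpleSerDe code; the class `LazyRow`, its methods, and the `splitCount` counter are hypothetical names. The only Hive-specific detail assumed is the default field delimiter, Ctrl-A (`\u0001`).

```java
import java.util.regex.Pattern;

public class LazyRowDemo {
    // Hypothetical lazy row: keeps the raw line and splits it only on first access.
    static class LazyRow {
        private final String raw;
        private final char delim;
        private String[] fields;   // stays null until a field is requested
        int splitCount = 0;        // how many times the raw line was actually split

        LazyRow(String raw, char delim) {
            this.raw = raw;
            this.delim = delim;
        }

        // Returns the field as a string, or null if the row has fewer fields.
        String getString(int i) {
            if (fields == null) {
                fields = raw.split(Pattern.quote(String.valueOf(delim)), -1);
                splitCount++;
            }
            return i < fields.length ? fields[i] : null;
        }

        // Typed access: converts on demand, null when the text is not an int.
        Integer getInt(int i) {
            String s = getString(i);
            if (s == null) {
                return null;
            }
            try {
                return Integer.valueOf(s);
            } catch (NumberFormatException e) {
                return null;
            }
        }
    }

    public static void main(String[] args) {
        LazyRow row = new LazyRow("1\u0001emp1\u0001bangalore", '\u0001');
        System.out.println(row.splitCount);   // nothing parsed yet
        System.out.println(row.getInt(0));
        System.out.println(row.getString(1));
        System.out.println(row.splitCount);   // split happened once
    }
}
```

A query that never touches a row's columns pays no parsing cost in this sketch, which is the performance argument the notes make for lazy object creation.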
| HiveSerializer is used to serialize data to a Hadoop Writable object. | |
| Hadoop uses its own serialization format, Writables. Writables are compact and fast, but not extensible or interoperable. | |
| RegexSerDe uses a regular expression (regex) to deserialize data. It doesn't support data serialization. | |
| It deserializes each row with the regex and extracts the capture groups as columns. | |
| In the deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. | |
| If a row matches the regex but has fewer groups than expected, the missing groups will be NULL. | |
| If a row matches the regex but has more groups than expected, the additional groups are ignored. | |
| NOTE: RegexSerDe supports primitive column types such as TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN and DECIMAL. | |
| NOTE: This implementation uses javaStringObjectInspector for STRING. A more efficient implementation should use UTF-8 encoded Text and writableStringObjectInspector. We should switch to that when we have a UTF-8 based Regex library. | |
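To make the group-to-column behavior concrete, here is a small standalone sketch that uses `java.util.regex` with the same `input.regex` pattern shown earlier in these notes. It is not Hive's RegexSerDe code; the class `RegexRowDemo`, the method `parse`, and the sample row are made up for illustration, and the NULL handling simply follows the rules described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRowDemo {
    // Same pattern as the input.regex property in the notes: five fixed-width fields.
    static final Pattern ROW =
            Pattern.compile("(.{2})(.{10})(.{30})(.{10})(.{10}).*");

    // Returns one value per column; every entry is null when the line does not match.
    static String[] parse(String line, int numColumns) {
        String[] cols = new String[numColumns];
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return cols; // no match: all columns stay null
        }
        // Missing groups stay null; extra groups are ignored.
        for (int i = 0; i < numColumns && i < m.groupCount(); i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        // Hypothetical fixed-width row, padded to 2 + 10 + 30 + 10 + 10 characters.
        String line = String.format("%-2s%-10s%-30s%-10s%-10s",
                "01", "emp1", "software engineer", "2017-03-12", "bangalore");
        for (String col : parse(line, 5)) {
            System.out.println("[" + col + "]");
        }
        System.out.println(parse("too short", 5)[0]);
    }
}
```

Note that each captured group keeps its padding spaces, so a fixed-width layout like this one usually needs a `trim()` in the query or in a view.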
| public interface Writable | |
| { | |
| void write(DataOutput out) throws IOException; | |
| void readFields(DataInput in) throws IOException; | |
| } | |
| public static byte[] serialize(Writable writable) throws IOException | |
| { | |
| ByteArrayOutputStream out = new ByteArrayOutputStream(); | |
| DataOutputStream dataOut = new DataOutputStream(out); | |
| writable.write(dataOut); | |
| dataOut.close(); | |
| return out.toByteArray(); | |
| } | |
| public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException | |
| { | |
| ByteArrayInputStream in = new ByteArrayInputStream(bytes); | |
| DataInputStream dataIn = new DataInputStream(in); | |
| writable.readFields(dataIn); | |
| dataIn.close(); | |
| return bytes; | |
| } | |
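The two helpers above can be exercised with a round trip. The sketch below is self-contained so it runs without Hadoop on the classpath: the nested `Writable` interface is a local stand-in for `org.apache.hadoop.io.Writable`, and `EmpWritable` and `copy` are hypothetical names invented for this example. Here `deserialize` returns nothing, since the result lives in the populated object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableRoundTrip {
    // Local stand-in for org.apache.hadoop.io.Writable (assumption: same two methods).
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical record type: an int id followed by a UTF string.
    static class EmpWritable implements Writable {
        int id;
        String name;

        EmpWritable() { }

        EmpWritable(int id, String name) {
            this.id = id;
            this.name = name;
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeUTF(name);
        }

        // Fields must be read back in the same order they were written.
        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            name = in.readUTF();
        }
    }

    static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(out)) {
            writable.write(dataOut);
        }
        return out.toByteArray();
    }

    static void deserialize(Writable writable, byte[] bytes) throws IOException {
        try (DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes))) {
            writable.readFields(dataIn);
        }
    }

    // Convenience wrapper: serialize src and read the bytes into a fresh object.
    static EmpWritable copy(EmpWritable src) {
        try {
            EmpWritable dst = new EmpWritable();
            deserialize(dst, serialize(src));
            return dst;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EmpWritable restored = copy(new EmpWritable(7, "emp"));
        System.out.println(restored.id + " " + restored.name);
    }
}
```

The serialized form carries no field names or type tags, only the raw values in write order, which is why Writables are compact but hard to evolve or read from other languages.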
| Built-in SerDes | |
| Avro (Hive 0.9.1 and later) | |
| ORC (Hive 0.11 and later) | |
| RegEx | |
| Thrift | |
| Parquet (Hive 0.13 and later) | |
| CSV (Hive 0.14 and later) | |
| JsonSerDe (Hive 0.12 and later in hcatalog-core) |