Read a file directly from spark-shell for experimenting #parquet #sequencefile #spark #shell

Parquet

// Create a SQLContext from the SparkContext (sc) that spark-shell provides.
// On Spark 2+, the built-in SparkSession works directly: spark.read.parquet(...)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the Parquet file into a DataFrame; a plain path works, the hdfs:// scheme is optional
val df = sqlContext.read.parquet("/path/to/file/without/hdfs://")

// Inspect the inferred schema and count the rows
df.printSchema
df.count
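
With the DataFrame in hand, ad-hoc SQL also works from the shell. A minimal sketch; the view name mytable is an arbitrary choice here:

// Register the DataFrame as a temporary view (Spark 2+) and query it with SQL;
// on Spark 1.x, use df.registerTempTable("mytable") instead
df.createOrReplaceTempView("mytable")
sqlContext.sql("SELECT COUNT(*) FROM mytable").show()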

Sequence file

import java.util
import org.apache.hadoop.io._
import org.apache.hadoop.hbase.util._

val filePath = "path/to/file"

// Hadoop reuses the same Writable instance for every record, and getBytes
// returns the full backing buffer, so copy the first getLength bytes of each
// key and value before converting them to strings
spark.sparkContext.sequenceFile(filePath, classOf[BytesWritable], classOf[BytesWritable])
  .map { case (rowKey, value) =>
    val rowKeyBytes = util.Arrays.copyOf(rowKey.getBytes, rowKey.getLength)
    val msgBytes = util.Arrays.copyOf(value.getBytes, value.getLength)
    (Bytes.toString(rowKeyBytes), Bytes.toString(msgBytes))
  }
  .collect
  .foreach(println)
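
collect pulls every record to the driver, which can be painful on a large file; for a quick look it is safer to fetch only a handful of records. A small variant of the same pipeline:

// Same decode step, but bring only the first few records to the driver
spark.sparkContext.sequenceFile(filePath, classOf[BytesWritable], classOf[BytesWritable])
  .map { case (k, v) =>
    (Bytes.toString(util.Arrays.copyOf(k.getBytes, k.getLength)),
     Bytes.toString(util.Arrays.copyOf(v.getBytes, v.getLength)))
  }
  .take(5)
  .foreach(println)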