@fnl
Forked from mhausenblas/SparkGrep.scala
Last active May 11, 2020 06:57
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>my.org</groupId>
  <artifactId>spark-grep</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
      </plugin>
    </plugins>
  </build>
</project>
package spark.example

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SparkGrep {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      System.err.println("Usage: SparkGrep <master> <input_file> <match_term>")
      System.exit(1)
    }
    // args(0) is the Spark master URL, e.g. local[*]
    val conf = new SparkConf().setAppName("SparkGrep").setMaster(args(0))
    val sc = new SparkContext(conf)
    // Read the input as an RDD of lines, split into at least 2 partitions
    val inputFile = sc.textFile(args(1), 2).cache()
    val matchTerm: String = args(2)
    // Count the lines that contain the search term
    val numMatches = inputFile.filter(line => line.contains(matchTerm)).count()
    println("%s lines in %s contain %s".format(numMatches, args(1), matchTerm))
    sc.stop() // release Spark resources before exiting
    System.exit(0)
  }
}

fnl commented Jun 28, 2015

Create the directory for the Scala code with: mkdir -p src/main/scala/spark/example/

Runtime arguments: local[*] src/main/scala/spark/example/SparkGrep.scala val

mvn exec:java -Dexec.mainClass="spark.example.SparkGrep" -Dexec.args="local[*] src/main/scala/spark/example/SparkGrep.scala val"

Expected output: 5 lines in src/main/scala/spark/example/SparkGrep.scala contain val
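
To sanity-check that count without starting Spark, the same filter-and-count logic can be sketched in plain Scala. This is only an illustrative helper, not part of the gist: the object name LocalGrep and the function countMatches are hypothetical, and it assumes the file fits comfortably on one machine.

```scala
import scala.io.Source

// Hypothetical helper, not part of the gist: a local stand-in for SparkGrep.
object LocalGrep {
  // Count the lines that contain the term, mirroring filter(...).count() in SparkGrep.
  def countMatches(lines: Iterator[String], term: String): Long =
    lines.count(_.contains(term)).toLong

  def main(args: Array[String]): Unit = {
    // args(0): input file, args(1): match term (no master argument is needed here)
    val source = Source.fromFile(args(0))
    try {
      val n = countMatches(source.getLines(), args(1))
      println("%s lines in %s contain %s".format(n, args(0), args(1)))
    } finally {
      source.close()
    }
  }
}
```

Run against the same file and term, it should report the same number of lines as the Spark job, since both count a line once no matter how many times the term occurs in it.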

@Ayush257

I ran this code on 700 files, passing //.txt as the input file argument. I want output like "4 lines in //part-123.txt contain .", but instead I get "4 lines in //.txt contain". How can I find out which of these 700 files contain my search term? Can you please help me out here? It would be really helpful.
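
One possible approach, sketched here under assumptions rather than taken from the gist: sc.textFile merges all matching files into a single RDD of lines, so the file names are lost. SparkContext.wholeTextFiles instead returns one (path, content) pair per file, which allows a count per file. The object name SparkGrepPerFile and the helper countMatchingLines are hypothetical. Because wholeTextFiles loads each file into memory as a whole, this is assumed to suit many small files rather than a few very large ones.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.2

// Hypothetical variant of SparkGrep that reports matches per file.
object SparkGrepPerFile {
  // Number of lines in one file's content that contain the term.
  def countMatchingLines(content: String, term: String): Int =
    content.split("\n").count(_.contains(term))

  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      System.err.println("Usage: SparkGrepPerFile <master> <input_path> <match_term>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SparkGrepPerFile").setMaster(args(0))
    val sc = new SparkContext(conf)
    val matchTerm = args(2)
    // wholeTextFiles yields one (path, content) pair per file under the input path
    val perFile = sc.wholeTextFiles(args(1))
      .mapValues(content => countMatchingLines(content, matchTerm))
      .filter { case (_, n) => n > 0 } // keep only files with at least one match
    perFile.collect().foreach { case (path, n) =>
      println("%s lines in %s contain %s".format(n, path, matchTerm))
    }
    sc.stop()
  }
}
```

It takes the same three arguments as SparkGrep and prints one line per file that has at least one match, instead of a single total.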
