David Portabella (dportabella)
Lausanne, Switzerland
@dportabella
dportabella / dist.scala
Last active March 24, 2017 19:14
Compute the distance in km between two postal codes
// using build.sbt: libraryDependencies += "org.apache.sis.core" % "sis-referencing" % "0.7"
// using Ammonite: import $ivy.`org.apache.sis.core:sis-referencing:0.7`, org.apache.sis.distance.DistanceUtils
case class Coordinates(lat: Double, lon: Double)

def readCoordinates(file: String): Map[String, Coordinates] = {
  def parseLine(line: String): (String, Coordinates) = {
    val c = line.split("\t")
    // key is "countryCode-postalCode"; columns 9 and 10 hold latitude and longitude
    (c(0) + "-" + c(1), Coordinates(c(9).toDouble, c(10).toDouble))
  }
  scala.io.Source.fromFile(file).getLines().map(parseLine).toMap  // completes the truncated preview
}
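The preview cuts off before the distance computation itself; a minimal sketch of how the pieces might fit together, assuming a GeoNames-style postal-code dump as input (the file path and postal codes below are placeholders):

val coords = readCoordinates("/path/to/allCountries.txt")  // GeoNames postal-code dump (assumption)
val a = coords("CH-1015")
val b = coords("CH-8001")
// great-circle (haversine) distance in km between the two coordinates
val km = org.apache.sis.distance.DistanceUtils.getHaversineDistance(a.lat, a.lon, b.lat, b.lon)
println(f"distance: $km%.1f km")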
@dportabella
dportabella / FilterArchive.scala
Created February 8, 2017 09:39
Example of filtering a WARC archive using Spark and storing the result back to a WARC archive
package application
import java.io._
import java.util
import org.apache.spark.rdd.RDD
import org.archive.format.warc.WARCConstants.WARCRecordType
import org.archive.io.warc.WARCRecordInfo
import org.warcbase.spark.archive.io.ArchiveRecord
import org.warcbase.spark.matchbox.RecordLoader
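The preview stops at the imports; a minimal sketch of the loading-and-filtering step with warcbase (the input path and the URL predicate are placeholder assumptions; writing the result back via WARCRecordInfo is omitted here):

object FilterArchive {
  def filtered(sc: org.apache.spark.SparkContext): RDD[ArchiveRecord] =
    RecordLoader.loadArchives("/path/to/input.warc.gz", sc)  // path is a placeholder
      .filter(r => r.getUrl.contains("example.org"))         // keep only matching records
}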
@dportabella
dportabella / DeserializeHadoopSequenceFileWithoutClassDeclaration.scala
Last active November 8, 2016 22:32
How to deserialize a Hadoop result sequence file outside Hadoop (or a Spark saveAsObjectFile outside Spark) without having the class declaration
// resolvers += "dportabella-3rd-party-mvn-repo-releases" at "https://github.com/dportabella/3rd-party-mvn-repo/raw/master/releases/"
// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
// libraryDependencies += "com.github.dportabella.3rd-party-mvn-repo" % "jdeserialize" % "1.0.0"
import java.io._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
import org.unsynchronized.jdeserialize
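The preview ends at the imports; a minimal sketch of the idea, assuming the file was written by Spark's saveAsObjectFile (NullWritable keys, BytesWritable values holding Java-serialized data) and that jdeserialize's command-line entry point is used to dump each record:

val reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(new Path("/path/to/part-00000")))
val key = NullWritable.get()
val value = new BytesWritable()
while (reader.next(key, value)) {
  // write the raw serialization stream to a temp file and let jdeserialize
  // reconstruct the class structure without the original class declaration
  val tmp = File.createTempFile("record-", ".bin")
  val out = new FileOutputStream(tmp)
  out.write(value.getBytes, 0, value.getLength)
  out.close()
  jdeserialize.main(Array(tmp.getPath))
}
reader.close()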
@dportabella
dportabella / RunTestOnMultipleGithubRepos
Created November 8, 2016 21:13
An example Scala script that runs a test on all GitHub projects with a given name, including their forks and branches (requires Ammonite: brew install ammonite-repl)
#!/usr/bin/env amm
/* To run this script:
* $ chmod +x ./RunTestOnMultipleGithubRepos
* $ ./RunTestOnMultipleGithubRepos
*/
import ammonite.ops._
import scalaj.http._
import $ivy.`org.eclipse.jgit:org.eclipse.jgit:4.5.0.201609210915-r`, org.eclipse.jgit.api.Git
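A minimal sketch of the per-repository step; the clone URL, working directory, and `sbt test` as the test command are placeholder assumptions:

// clone one repo with JGit, then run the test command inside the clone via ammonite.ops
val url = "https://github.com/someuser/somerepo.git"  // placeholder
val dir = pwd/"workspace"/"somerepo"
Git.cloneRepository().setURI(url).setDirectory(dir.toIO).call()
implicit val wd = dir
%('sbt, 'test)  // shell out with the clone as the working directory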
@dportabella
dportabella / deserialize_hadoop_sequence_file.scala
Last active November 8, 2016 21:42
How to deserialize a Hadoop result sequence file outside Hadoop (or a Spark saveAsObjectFile outside Spark)
// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val f = "/path/to/part-00000"
val reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(new Path(f)))
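The preview ends at the reader; a minimal sketch of the record loop, relying on the imports above and assuming the saveAsObjectFile layout (NullWritable keys, BytesWritable values) with the original classes on the classpath:

val key = NullWritable.get()
val value = new BytesWritable()
while (reader.next(key, value)) {
  // each value holds a Java-serialized batch of objects written by saveAsObjectFile
  val in = new ObjectInputStream(new ByteArrayInputStream(value.copyBytes()))
  println(in.readObject())  // needs the original class declarations on the classpath
}
reader.close()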
@dportabella
dportabella / ExampleExecuteScalaFuturesInSerial.scala
Created September 13, 2016 20:16
Example of how to execute Scala futures serially, one after the other, without collecting their results
/*
Example of how to execute Scala futures serially, one after the other, without collecting their results.
See this gist instead if you need to collect the results of the futures (it also explains how foldLeft works here):
https://gist.github.com/dportabella/4e7569643ad693433ec6b86968f589b8
*/
*/
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
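A minimal sketch of the foldLeft-chaining technique the gist demonstrates (the task body is a placeholder; the inputs match the "done: 10/20/30" output shown in the sibling gist):

import scala.concurrent.{Await, Future}

def task(n: Int): Future[Unit] = Future { Thread.sleep(100); println(s"done: $n") }

// chain the futures so each one starts only after the previous has completed
val serial = List(10, 20, 30).foldLeft(Future.successful(())) { (acc, n) =>
  acc.flatMap(_ => task(n))
}
Await.result(serial, Duration.Inf)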
@dportabella
dportabella / ExampleExecuteScalaFuturesInSerial.scala
Last active June 1, 2023 20:40
Explanation of how to execute Scala futures serially, one after the other
/*
Execute Scala futures serially, one after the other.
This gist explains the solution given in
http://www.michaelpollmeier.com/execute-scala-futures-in-serial-one-after-the-other-non-blocking
The three examples produce the same result:
---
done: 10
done: 20
done: 30
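A minimal sketch of a generic helper in the spirit of the linked post, serializing the futures while collecting their results (the name runInSerial is an assumption):

import scala.concurrent.{ExecutionContext, Future}

def runInSerial[A, B](items: Seq[A])(f: A => Future[B])(implicit ec: ExecutionContext): Future[Seq[B]] =
  items.foldLeft(Future.successful(Seq.empty[B])) { (acc, item) =>
    // flatMap guarantees f(item) is not started until acc has completed
    for { results <- acc; r <- f(item) } yield results :+ r
  }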
@dportabella
dportabella / build.sbt
Last active May 31, 2016 02:07
sbt project for the Spark distribution examples
val sparkVersion = "1.6.1"
val hbaseVersion = "0.98.7-hadoop2"

name := "spark-examples"

version := sparkVersion

javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")

initialize := {
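The preview cuts off inside initialize; a sketch of the dependency block such a build typically carries, reusing the versions declared above (the exact module list is an assumption):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"    % sparkVersion % "provided",
  "org.apache.hbase" %  "hbase-client" % hbaseVersion
)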
@dportabella
dportabella / PomDependenciesToSbt
Last active May 7, 2022 16:25
Script to convert Maven dependencies (and exclusions) from a pom.xml to sbt dependencies. You can also run it online at http://goo.gl/wnHCjE
#!/usr/bin/env amm
// This script converts Maven dependencies from a pom.xml to sbt dependencies.
// It is based on the answers of George Pligor and Mike Slinn on http://stackoverflow.com/questions/15430346/
// - install https://github.com/lihaoyi/Ammonite
// - make this script executable: chmod +x PomDependenciesToSbt
// - run it from your shell (e.g. bash):
//   $ ./PomDependenciesToSbt /path/to/pom.xml
import scala.xml._
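A minimal sketch of the core conversion using the standard POM element names (exclusions and the real script's output formatting are omitted):

val pom = XML.loadFile("/path/to/pom.xml")
val deps = (pom \ "dependencies" \ "dependency").map { d =>
  val g = (d \ "groupId").text
  val a = (d \ "artifactId").text
  val v = (d \ "version").text
  s""""$g" % "$a" % "$v""""   // one sbt dependency per POM <dependency> element
}
println(deps.mkString("libraryDependencies ++= Seq(\n  ", ",\n  ", "\n)"))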
@dportabella
dportabella / ExampleScalaAck.scala
Created August 22, 2014 22:46
This example Scala script applies a regex to all files recursively. It uses Apache Tika's UniversalEncodingDetector to process only text files, and a regex that finds all lines containing the word "super", except where it is part of a larger word such as "superstition" or "supernatural".
import java.io.File
import org.apache.tika.detect._
import org.apache.tika.metadata._
import org.apache.tika.mime._
import org.apache.tika.io._
import org.apache.tika.parser.txt._
import resource._
def recursiveListFiles(f: File): List[File] = {
  val these = f.listFiles.toList
  // completes the truncated preview: recurse into subdirectories
  these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}
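A minimal sketch of the matching step described above (the root directory is a placeholder; the Tika text-file check from the imports is omitted here):

val Word = """(?<![A-Za-z])super(?![A-Za-z])""".r  // "super" not inside a larger word
for {
  file <- recursiveListFiles(new File("/path/to/root")) if file.isFile
  line <- scala.io.Source.fromFile(file).getLines()
  if Word.findFirstIn(line).isDefined
} println(s"$file: $line")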