dportabella’s gists

dportabella / ExampleExecuteScalaFuturesInSerial.scala

Last active June 1, 2023 20:40

Explanation of how to execute Scala futures serially, one after the other

	/*
	Execute Scala futures serially, one after the other.
	This gist explains the solution given in
	http://www.michaelpollmeier.com/execute-scala-futures-in-serial-one-after-the-other-non-blocking

	The three examples produce the same result:
	---
	done: 10
	done: 20
	done: 30
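
The preview above is cut off before the examples themselves, so here is a minimal sketch of the general idea rather than the gist's own code: chain the futures with `foldLeft` and `flatMap` so that each one is created only after the previous one has completed. The names `work` and `runSerially` are hypothetical, and the task (sleep, then multiply by ten) is an assumption chosen to reproduce the `done: 10`, `done: 20`, `done: 30` output mentioned above.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SerialFuturesWithResults {
  // Hypothetical task: sleeps briefly, prints, and returns the input times ten.
  def work(i: Int): Future[Int] = Future {
    Thread.sleep(50)
    val result = i * 10
    println(s"done: $result")
    result
  }

  // foldLeft starts from an already-completed future holding an empty list.
  // work(item) is only called inside flatMap, i.e. after the accumulated
  // future has completed, so the tasks never overlap.
  def runSerially(items: List[Int]): Future[List[Int]] =
    items.foldLeft(Future.successful(List.empty[Int])) { (acc, item) =>
      acc.flatMap(results => work(item).map(r => results :+ r))
    }

  def main(args: Array[String]): Unit = {
    // Blocking here is only for the demo; the chaining itself is non-blocking.
    val results = Await.result(runSerially(List(1, 2, 3)), 10.seconds)
    println(results)
  }
}
```

Note that `items.map(work)` followed by `Future.sequence` would not give this behaviour: the futures would all be started at once and only their results would be gathered in order.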

dportabella / ExampleExecuteScalaFuturesInSerial.scala

Created September 13, 2016 20:16

Example of how to execute Scala futures serially, one after the other, without collecting the results of the futures

	/*
	Example of how to execute Scala futures serially, one after the other, without collecting the results of the futures.

	See this gist instead if you need to collect the results of the futures (it also explains how foldLeft works here):
	https://gist.github.com/dportabella/4e7569643ad693433ec6b86968f589b8
	*/


	import scala.concurrent.ExecutionContext.Implicits.global
	import scala.concurrent.duration.Duration
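
The preview stops at the imports, so the following is only a sketch of how the "no results" variant might look, not the gist's actual code. It assumes a hypothetical side-effecting task `work` and a hypothetical helper `runSerially`; the `completed` queue exists purely so the execution order can be observed.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SerialFuturesNoResults {
  // Records the order in which tasks finish, so the ordering can be checked.
  val completed = new ConcurrentLinkedQueue[Int]()

  // Hypothetical side-effecting task; it produces no useful value.
  def work(i: Int): Future[Unit] = Future {
    Thread.sleep(50)
    completed.add(i)
    println(s"done: $i")
  }

  // Start from an already-completed Future[Unit] and chain each task with
  // flatMap, so the next task is only created once the previous one is done.
  def runSerially(items: Seq[Int]): Future[Unit] =
    items.foldLeft(Future.successful(())) { (previous, item) =>
      previous.flatMap(_ => work(item))
    }

  def main(args: Array[String]): Unit =
    // Blocking here is only so the demo does not exit early.
    Await.result(runSerially(Seq(10, 20, 30)), Duration.Inf)
}
```

Because nothing is accumulated, the fold carries only a `Future[Unit]`, which keeps memory use constant however long the input sequence is.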

dportabella / deserialize_hadoop_sequence_file.scala

Last active November 8, 2016 21:42

How to deserialize a Hadoop result sequence file outside Hadoop (or the output of a Spark saveAsObjectFile outside Spark)

	// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"

	import java.io.{ByteArrayInputStream, ObjectInputStream}

	import org.apache.hadoop.conf._
	import org.apache.hadoop.fs._
	import org.apache.hadoop.io._

	val f = "/path/to/part-00000"
	val reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(new Path(f)))
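
The preview ends once the reader is opened. The `ByteArrayInputStream` and `ObjectInputStream` imports suggest that the record values are then decoded with plain Java deserialization. The sketch below illustrates only that step, under the assumption that each value's bytes hold a Java-serialized object (to my knowledge this is how Spark's `saveAsObjectFile` writes its records, but treat that as an assumption). It does not touch Hadoop at all: `serialize` is a hypothetical stand-in that produces the bytes a record value would contain, and `deserialize` is a hypothetical helper, not part of the gist.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object DeserializeBytesSketch {
  // Stand-in for the bytes stored in a sequence-file value:
  // Java-serializes an object into a byte array.
  def serialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Reads one Java-serialized object back from a byte array.
  // The cast is unchecked: the caller must know the expected type.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val bytes = serialize(Array("a", "b", "c"))
    val restored = deserialize[Array[String]](bytes)
    println(restored.mkString(", "))
  }
}
```

Deserializing this way requires the classes of the stored objects to be on the classpath, and it should only be used on files from a trusted source.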

dportabella / RunTestOnMultipleGithubRepos

Created November 8, 2016 21:13

An example Scala script that runs a test on all GitHub projects with a given name, including their forks and branches (requires Ammonite: brew install ammonite-repl)

	#!/usr/bin/env amm

	/* To run this script:
	* $ chmod +x ./RunTestOnMultipleGithubRepos
	* $ ./RunTestOnMultipleGithubRepos
	*/

	import ammonite.ops._
	import scalaj.http._
	import $ivy.`org.eclipse.jgit:org.eclipse.jgit:4.5.0.201609210915-r`, org.eclipse.jgit.api.Git

dportabella / DeserializeHadoopSequenceFileWithoutClassDeclaration.scala

Last active November 8, 2016 22:32

How to deserialize a Hadoop result sequence file outside Hadoop (or the output of a Spark saveAsObjectFile outside Spark) without having the class declaration

	// resolvers += "dportabella-3rd-party-mvn-repo-releases" at "https://github.com/dportabella/3rd-party-mvn-repo/raw/master/releases/"
	// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
	// libraryDependencies += "com.github.dportabella.3rd-party-mvn-repo" % "jdeserialize" % "1.0.0"

	import java.io._
	import org.apache.hadoop.conf._
	import org.apache.hadoop.fs._
	import org.apache.hadoop.io._
	import org.unsynchronized.jdeserialize

dportabella / FilterArchive.scala

Created February 8, 2017 09:39

Example to filter a WARC archive using Spark and storing the result back to a WARC archive

	package application

	import java.io._
	import java.util

	import org.apache.spark.rdd.RDD
	import org.archive.format.warc.WARCConstants.WARCRecordType
	import org.archive.io.warc.WARCRecordInfo
	import org.warcbase.spark.archive.io.ArchiveRecord
	import org.warcbase.spark.matchbox.RecordLoader

dportabella / dist.scala

Last active March 24, 2017 19:14

Compute the distance in km between two postal codes

	// using build.sbt: libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
	// using Ammonite: import $ivy.`org.apache.sis.core:sis-referencing:0.7`, org.apache.sis.distance.DistanceUtils

	case class Coordinates(lat: Double, lon: Double)

	def readCoordinates(file: String): Map[String, Coordinates] = {
	  def parseLine(line: String): (String, Coordinates) = {
	    val c = line.split("\t")
	    (c(0) + "-" + c(1), Coordinates(c(9).toDouble, c(10).toDouble))
	  }
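
The preview is truncated before the distance computation. The gist's Ammonite comment points to `DistanceUtils` from Apache SIS; the sketch below swaps in the plain haversine formula instead, so that it runs without any dependency. It assumes, as the `parseLine` preview suggests, a tab-separated file in which columns 0 and 1 form the key and columns 9 and 10 hold latitude and longitude. `haversineKm` and `EarthRadiusKm` are hypothetical names, and the two sample rows are placeholders, not real postal codes.

```scala
object PostalDistanceSketch {
  case class Coordinates(lat: Double, lon: Double)

  // Mean Earth radius in km; the haversine formula treats the Earth as a sphere.
  val EarthRadiusKm = 6371.0

  // Same column layout as the parseLine shown in the preview.
  def parseLine(line: String): (String, Coordinates) = {
    val c = line.split("\t")
    (c(0) + "-" + c(1), Coordinates(c(9).toDouble, c(10).toDouble))
  }

  // Great-circle distance between two points given in degrees.
  def haversineKm(a: Coordinates, b: Coordinates): Double = {
    val dLat = math.toRadians(b.lat - a.lat)
    val dLon = math.toRadians(b.lon - a.lon)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(a.lat)) * math.cos(math.toRadians(b.lat)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(h))
  }

  def main(args: Array[String]): Unit = {
    // Placeholder rows: country "XX", made-up codes, coordinates on the equator.
    val rows = Seq(
      Seq("XX", "0001", "-", "-", "-", "-", "-", "-", "-", "0.0", "0.0"),
      Seq("XX", "0002", "-", "-", "-", "-", "-", "-", "-", "0.0", "1.0")
    ).map(_.mkString("\t"))

    val coordinates: Map[String, Coordinates] = rows.map(parseLine).toMap
    val km = haversineKm(coordinates("XX-0001"), coordinates("XX-0002"))
    println(f"distance: $km%.1f km")
  }
}
```

The spherical approximation can differ slightly from an ellipsoidal calculation, which is usually acceptable for postal-code distances.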

David Portabella dportabella