BufferedReader getReader(String fileUrl) throws IOException {
    InputStreamReader reader;
    try {
        reader = new FileReader(fileUrl);
    }
    catch (FileNotFoundException e) {
        // try a real URL instead
        URL url = new URL(fileUrl);
        reader = new InputStreamReader(url.openStream());
    }
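The preview above stops before the method returns. As a minimal, self-contained sketch of the same file-then-URL fallback, the version below assumes the method finishes by wrapping the reader in a `BufferedReader`, which the declared return type suggests but the source does not show. The class name `ReaderFallback` and the `main` method are hypothetical and added only for illustration.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical wrapper class; only getReader mirrors the original snippet.
class ReaderFallback {

    // Treat the argument as a local path first; if no such file exists,
    // reinterpret the same string as a URL and read from its stream.
    static BufferedReader getReader(String fileUrl) throws IOException {
        InputStreamReader reader;
        try {
            reader = new FileReader(fileUrl);
        } catch (FileNotFoundException e) {
            // Not a readable local file: try a real URL instead.
            // A string that is neither throws MalformedURLException (an IOException).
            URL url = new URL(fileUrl);
            reader = new InputStreamReader(url.openStream());
        }
        // Assumed ending: the original preview is cut off before the return.
        return new BufferedReader(reader);
    }

    // Illustrative usage: print the first line of the given path or URL.
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReaderFallback <path-or-url>");
            return;
        }
        try (BufferedReader r = getReader(args[0])) {
            System.out.println(r.readLine());
        }
    }
}
```

Both readers here use the platform default charset, as in the original; passing an explicit charset would be a reasonable hardening step if the input encoding is known.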
register './warcbase_kb/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
raw = load '/tmp/IAH-20080430204825-00000-blackbook.arc.gz' using
    org.warcbase.pig.ArcLoader() as (url: chararray, date: chararray, mime: chararray, content: chararray);
a = foreach raw generate url, mime, content, SUBSTRING(date, 0, 12) as date, org.warcbase.pig.piggybank.DetectMimeType(content) as tikaMime;
b = filter a by (tikaMime == 'text/html');
c = foreach b generate url, mime, tikaMime, date, org.warcbase.pig.piggybank.ExtractRawText(content) as txt;
d = foreach c generate url, mime, tikaMime, date, org.warcbase.pig.piggybank.DetectLanguage(txt) as lang;
e = group d by (lang, date);
#!/bin/bash
# Usage:
#   ./eunp-img-conversion.sh input.tif temp.tif output.jp2
# 1. Invoke the GraphicsMagick command line to convert the master image to an uncompressed 150 ppi TIFF with an unsharp mask
# 2. Invoke the Kakadu kdu_compress command line to convert the uncompressed TIFF to JPEG 2000
# The steps are chained with && rather than a pipe: kdu_compress reads the file $2, so step 1 must finish writing it first.
gm convert "$1" -resample 150x150 -unsharp 1.5 -compress None ptif:"$2" && kdu_compress -i "$2" -o "$3" -rate 1.0,0.84,0.7,0.6,0.5,0.4,0.35,0.3,0.25,0.21,0.18,0.15,0.125,0.1,0.088,0.075,0.0625,0.05,0.04419,0.03716,0.03125,0.025,0.0221,0.01858,0.015625 Clevels=6 Stiles=\{1024,1024\} Cmodes=\{BYPASS\} Corder=RLCP Cblk=\{64,64\} -no_palette
@ECHO OFF
REM Recursively traverse directories and delete all image files of the chosen type (JPG, JP2, TIF or PNG)
CHOICE /C:12345 /M "Really delete all images of type (1) JPG, (2) JP2, (3) TIF, (4) PNG or (5) Cancel?"
IF ERRORLEVEL 5 GOTO Cancel
IF ERRORLEVEL 4 GOTO PNG
IF ERRORLEVEL 3 GOTO TIF
IF ERRORLEVEL 2 GOTO JP2
IF ERRORLEVEL 1 GOTO JPG
GOTO END
:JPG