@jagedn
Created October 11, 2019 11:33
Extract fields from a BOE (Boletín Oficial del Estado) document in PDF format
@Grab(group='org.apache.pdfbox', module='pdfbox', version='2.0.8')
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.*
// Map of field id -> field text, filled from the <E...> markers found in the PDF text
def model = [:]
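// Download the PDF and extract its plain text; withCloseable releases the document afterwards
pdfUrl = new URL("https://www.boe.es/boe/dias/2015/09/12/pdfs/BOE-A-2015-9803.pdf")
txtboe = PDDocument.load(pdfUrl.bytes).withCloseable { doc -> new PDFTextStripper().getText(doc) }
// Join lines so that a field wrapped across several lines can be matched by one regex
txtboe = txtboe.replaceAll(/\r?\n/, ' ')
//println txtboe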
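// Capture every <E{id} {text}> marker. The character class [0-9\.:T] also covers
// plain numeric ids such as 1.2, so a separate pass with [0-9\.] is not needed.
match = txtboe =~ /<E([0-9\.:T]+)\s+(.+?)>/
match.each { token ->
    model[token[1]] = token[2]   // token[1] = field id, token[2] = field text
}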
// Print the fields ordered by id (ids are strings, so the order is lexicographic)
model.sort().each {
    println "${it.key} ${it.value}"
}