
Parsing PDFs with Ruby

In our journey through the world of test automation with Ruby, we have found that the data we need to validate is sometimes locked up in a .pdf-formatted file and can be difficult to get at. The 3Qi Labs team decided there had to be a way to automate the extraction and parsing of these PDFs within our test automation scripts, and the search began. Enter Xpdf.

There are many programs and Ruby libraries that can do the parsing job we need, such as PDFMiner, PoDoFo, Origami, and the PDF-Reader gem, but we have found Xpdf to be the best choice when testing includes validating the contents of generated PDF files. Xpdf is an open-source viewer for Adobe .pdf files that includes a set of utilities to do just about everything you would want to do to a PDF: extract its info, attachments, or images, or convert it to a bitmap format. The utility we are after here is Xpdf's text extractor, pdftotext.exe, which does just what it says: it converts your PDF to a text document (examples below).

The following post will teach you how to use Xpdf to convert a PDF into a text file and then use Ruby to parse the returned data. We will start by explaining how to get the utility installed (the example is for Windows), and then we will go over some methods we used to do the conversion and parse the data.

To install Xpdf, download the package for your desired platform (we are currently working with Windows) from http://foolabs.com/xpdf/download.html. Install it so that pdftotext.exe is on your PATH.
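Since everything that follows shells out to pdftotext, a quick sanity check from Ruby is worthwhile; here is a minimal sketch (Kernel#system returns nil only when a command cannot be found or executed):

if system("pdftotext", "-v").nil?
  # nil means the binary itself could not be located or run
  puts "pdftotext not found; check your PATH"
else
  puts "pdftotext is on the PATH"
end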

Much of what follows is adapted from ProPublica's guide to turning PDFs into text: http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide

The first approach works on an HTML version of the PDF (here, Eli Lilly's faculty-registry disclosure), using Nokogiri to pull the data out of the table rows:

require 'rubygems'
require 'nokogiri'

datarows_by_column_count = {}

# iterate over every table row, skipping the aggregate summary rows
Nokogiri::HTML(File.read("EliLillyFacultyRegistryQ22010.html")).css('tr').select { |row| !row.text.match(/2010 To Date Aggregate/) }.each_with_index do |row, line_number|
  cols = row.css('th,td').map { |t| t.text.strip }
  if cols.length == 9 # a valid data row
    if !cols[2].match(/^[A-Z]{2}$/)
      # if the state-initial column does not contain exactly two capital letters,
      # assume the conversion doubled each cell's text and keep only the first half
      corrected_cols = []
      cols.each_with_index do |col, index|
        # populate corrected columns
        nval = col
        if index > 3 # from the fourth column on, the data is numerical; we need to strip non-numbers
          nval.gsub!(/[^\d]/, '')
        end
        corrected_cols[index] = nval[0..(nval.length / 2.0).ceil - 1]
      end
      cols = corrected_cols # replace the row with the corrected columns
    end
  end # end of if cols.length == 9
  datarows_by_column_count[cols.length] ||= [] # initialize a new array if it doesn't exist
  datarows_by_column_count[cols.length] << ([line_number] + cols).join("\t")
end

# now print to files
datarows_by_column_count.each_pair do |column_count, datarows|
  if datarows.length > 0
    File.open("pdf-columns-#{column_count}.txt", 'w') { |f| f.puts(datarows) }
  end
end

Method 3: Convert to Text, Measure Column Widths

Unfortunately, not all PDF tables convert to nice HTML. Try the above method on the GSK file, for example. Converting it to HTML results in a mess.

Translating this PDF to HTML does not preserve the column structure. One possible strategy is to analyze the whitespace between columns. This requires the use of regular expressions. If you don't know about them, they’re worth learning. Even without programming experience, you'll find regular expressions extremely useful when doing data cleaning or even advanced document searches.

The first step is to convert the PDF to plain text. You can use the aptly named pdftotext, which is part of the free xpdf package. We're using a Mac to do this. Linux instructions are pretty similar. Under Windows, your best bet would be to use Cygwin.

For this example, we will use the GSK disclosure PDF (hcp-fee-disclosure-2q-4q2009.pdf).

pdftotext -layout hcp-fee-disclosure-2q-4q2009.pdf

This produces hcp-fee-disclosure-2q-4q2009.txt. The -layout flag preserves the spacing of the words as they were in the original PDF. This is what the GSK file looks like in text form:

                               Fees Paid to US Based Healthcare Professionals for Consulting & Speaking Services
                                                  1st Quarter through 3rd Quarter 2010
Health Care Professional   Location                Payee Name                                          Consultant    Speaker     Total Fees
Alario, Frank              BAYVILLE, NJ             Frank C Alario MD PC                                                $6,500        $6,500

Alavi, Ali                 FULLERTON, CA            Ali Alavi Consultant, LLC                                          $41,000       $41,000

Alavi, Ali                 FULLERTON, CA           Ali S Alavi                                                         $37,500       $37,500

Let's look at the easiest scenario of text-handling, where every cell has a value:

Name            State  Travel  Service
Smith, Jon      IA     100     200
Doe, Sara       CA     200     0
Johnson, Brian  NY     0       70

There's no special character, such as a comma or tab, that defines where each column ends and begins.

However, values in separate columns appear to have two or more spaces separating them. So, we can just use our text-editing program to find and replace those with a special character of our choosing.

Regular expressions allow us to specify a match of something like "one space or more." In this case, we want to convert every set of two-or-more consecutive spaces into a pipe character ("|").

Many major text-editors allow the use of regular expressions. We use TextMate. For Mac users, TextWrangler is a great free text editor that supports find-and-replace operations with regular expressions. Notepad++ is a free Windows text-editor; here's a tutorial on how to use regular expressions in it.

In regular expression syntax, curly brackets {x,y} denote a range between x and y occurrences of the character preceding the brackets. So e{1,2} will match 1 to 2 'e' characters. So the regular expression to find "bet" and "beet" is: be{1,2}t.
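A quick way to convince yourself, as a Ruby sketch:

%w[bet beet beeet].each do |word|
  matched = word =~ /\Abe{1,2}t\z/ # \A and \z anchor the match to the whole string
  puts "#{word}: #{matched ? 'match' : 'no match'}"
end
# => bet: match
# => beet: match
# => beeet: no match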

Leaving off the second number, as in e{1,}, means we want to match at least one 'e', and any number of that character thereafter. So, to capture two-or-more whitespaces, we simply do: " {2,}".

So entering " {2,}" into the "Find:" field and "|" into "Replace:", we get:

Name|State|Travel|Service|
Smith, Jon|IA|100|200
Doe, Sara|CA|200|0
Johnson, Brian|NY|0|70
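If you'd rather script the replacement than do it in a text editor, here is a minimal Ruby sketch (data.txt is a hypothetical file holding the converted text):

# collapse every run of two-or-more spaces into a pipe, line by line
File.read("data.txt").each_line do |line|
  puts line.rstrip.gsub(/ {2,}/, "|")
end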

Easy enough. But a common problem is when a cell is left blank. Because a blank cell is just more whitespace, our regular expression collapses it into the surrounding gap, and the empty column disappears entirely:

Name            State  Travel  Service
Smith, Jon      IA
Doe, Sara       CA     0
Johnson, Brian         0       70

After the find-and-replace, this becomes:

Name|State|Travel|Service|
Smith, Jon|IA
Doe, Sara|CA|0
Johnson, Brian|0|70

If you've worked with older textfile databases or mainframe output, you have probably come across tables with fixed-width columns, where the boundaries of the columns are a pre-determined length.

Looking at the above table, we can see that even if there are blanks in a column, the actual data falls within a certain space. So, using regular expressions with a little Ruby scripting, we can programmatically determine these columns.

We first delimit each row with the " {2,}" regular expression. As we saw in the example above, we'll end up with a varying number of columns per line.

If we then iterate through each column and find the farthest-left and the farthest-right position per column on the page, according to each word’s position and length, we should be able to produce on-the-fly a fixed-width format for this table.

This is easier to explain with a diagram. Here's a sparsely populated table of four columns.

1:           Banana     Currant
2:  Alaska              Colorado    Delaware
3:            Bear

If we delimit the above with " {2,}", we'll find that the first row will have 2 columns; the second row, 3 columns; and the third, 1 column.

Programmatically, we're going to store each of these lines of text as an array, so Row_1 would be ["Banana", "Currant"], for instance. This is just an intermediary step, though. What we really want is where each word begins and ends on that line. If the very first position is 0, then "Banana" begins at position 13 and ends at position 19, that is, 19 characters from the beginning of the line. Doing this for each line gets us:

1: 13,19, 24,31
2: 4,10, 24,32, 36,44
3: 14,18

So as we read the values for each line, let's keep a master list of the farthest-left and farthest-right positions of each column.

Reading through the first line, this list will be: [13,19], [24,31], where “Banana” and “Currant” are positioned, respectively.
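Computing those positions takes only a few lines of Ruby; here is a minimal sketch, run against the second line of the diagram (String#index starts each search at the previous word's end, so repeated words can't confuse it):

line = "    Alaska              Colorado    Delaware"
pos = 0
positions = line.strip.split(/ {2,}/).map do |word|
  start = line.index(word, pos) # find the word at or after the previous word's end
  pos = start + word.length
  [start, pos]
end
p positions  # => [[4, 10], [24, 32], [36, 44]]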

When our script reads through the second line, it finds a word (Alaska) at position 4 and ending at 10.

Since it ends before the starting position (10 < 13) of what the program previously thought was the starting boundary of the first column, it stands to reason that the space containing "Alaska" is actually the table's first column.

When the script reads "Colorado", it sees that it intersects with "Currant"'s position in the first line. It assumes that the two share the same column (now the third), and changes the definition of that column from [24,31] to [24,32], since "Colorado" is a slightly longer word.

The list of columns is now: [4,10], [13,19], [24,32], [36,44].

In the third line, the only word is "Bear", and its dimensions fall within the previously defined second column's positions [13,19].

So now with our master list of positions, we can read each line again and break it apart by these column definitions, getting us a four-column table as expected.
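A minimal sketch of that last step, slicing the diagram's lines by the finished master list:

master = [[4, 10], [13, 19], [24, 32], [36, 44]]
lines = [
  "             Banana     Currant",
  "    Alaska              Colorado    Delaware",
  "              Bear"
]
lines.each do |line|
  # out-of-range slices return nil, so to_s turns a missing column into ""
  puts master.map { |start, stop| line[start..stop].to_s.strip }.join("|")
end
# => |Banana|Currant|
# => Alaska||Colorado|Delaware
# => |Bear||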

Splitting the PDF

When converting the PDF to text, sometimes the columns won't be positioned the same across every page. So let's begin by splitting the PDF into separate pages by calling pdftotext within Ruby:

for page_num in 1..last_page_number
  `pdftotext -f #{page_num} -l #{page_num} -layout "#{the_pdf_filename}" "#{the_pdf_filename.gsub(/\.pdf/i, '')}_#{page_num}.txt"`
end
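The snippet assumes you already know last_page_number. If you don't, one option, as a sketch, is to scrape it from pdfinfo, another utility in the Xpdf package, which prints a "Pages:" line for any PDF:

the_pdf_filename = "hcp-fee-disclosure-2q-4q2009.pdf"
# pdfinfo's output includes a metadata line like "Pages:          12"
last_page_number = `pdfinfo "#{the_pdf_filename}"`[/Pages:\s+(\d+)/, 1].to_i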

And then iterate through each page to calculate its fixed-width format with the algorithm described above. Here's the commented code for the entire program:

##
## Note: Run this script from the command line, e.g.:
##   ruby spacer.rb FILENAME NUMPAGES MIN_COLS LINES_TO_SKIP
##

require 'fileutils'

class Object
  def blank?
    respond_to?(:empty?) ? empty? : !self
  end
end

# filename        = name of the PDF file that will be broken into individual txt files
# number_of_pages = number of pages in the PDF
# min_cols        = the minimum number of columns, when delimited by / {2,}/, that a line should
#                   have before its column spacing is taken into account. Setting it to at least
#                   2 or 3 eliminates lines that were mistranslated.
# lines_to_skip   = the number of non-data header lines to skip per page. Should usually be the
#                   same on every page.

if ARGV.length < 4
  puts "Call format: ruby spacer.rb PDF_FILENAME NUMBER_OF_PAGES_IN_PDF MINIMUM_NUMBER_OF_COLUMNS_PER_LINE NUMBER_OF_HEADER_LINES_TO_SKIP"
  puts "e.g., ruby spacer.rb pdftest.pdf 42 4 3"
  raise "Please specify all the parameters"
end

filename = ARGV.first
number_of_pages, min_cols, lines_to_skip = ARGV[1..-1].map { |a| a.to_i }

filedir = File.basename(filename).gsub(/[^\w]/, '_')

puts "Filename: #{filename}, #{number_of_pages} pages, minimum of #{min_cols} columns, and skip #{lines_to_skip} lines per page"
FileUtils.makedirs(filedir)

compiled_file = File.open("#{filedir}/compiled.txt", 'w')

for page_num in 1..number_of_pages
  new_f_name = "#{filedir}/#{filename.gsub(/\.pdf/i, '')}_#{page_num}.txt"
  `pdftotext -f #{page_num} -l #{page_num} -layout "#{filename}" "#{new_f_name}"`
  puts "#{new_f_name} created"

  pdf_text = File.readlines(new_f_name)[lines_to_skip..-1]
  puts "Opening #{new_f_name} (#{pdf_text.length} lines)"

  master_column_position_list = []

  pdf_text.each_with_index do |line, line_number|
    current_line_pos = 0
    columns = line.strip.split(/ {2,}/).map { |col| col.strip }

    if columns.length > min_cols
      columns.each_with_index do |column, col_index|

        # find the position of the word, starting from current_line_pos
        col_start = line.index(column, current_line_pos)

        # update current_line_pos so that in the next iteration, 'index' starts *after* the current word
        current_line_pos = col_start + column.length

        # temp variable for easier reading; this is where the current word begins and ends on the line
        this_col_pos = [col_start, current_line_pos]

        # with each column-word, find its spot according to the positions we've already found in
        # master_column_position_list. (There's probably a more efficient way than iterating from
        # the beginning of the master-list each time, but there are only 2-12 columns at most...)
        if master_column_position_list.length == 0
          master_column_position_list.push(this_col_pos)
        else
          master_column_position_list.each_with_index do |master_col_pos, m_index|

            if master_col_pos[0] > this_col_pos[1]
              # the current column-word sits BEFORE the current master-list element (its END is
              # less than the element's BEGINNING), so insert its position at this index
              master_column_position_list.insert(m_index, this_col_pos)
              break
            elsif master_col_pos[1] < this_col_pos[0]
              # the column-word BEGINS after the current master-list element ENDS, so move on to
              # the next element in the master-list. Unless we are already at the end; if so,
              # push this column-word onto the array
              if m_index == master_column_position_list.length - 1
                master_column_position_list.push(this_col_pos)
                break
              end
            else
              # the columns overlap, so merge the two positions, taking the minimum of the
              # starting positions and the maximum of the ending positions
              master_col_pos[0] = [master_col_pos[0], this_col_pos[0]].min
              master_col_pos[1] = [master_col_pos[1], this_col_pos[1]].max
              break
            end

          end # end of iterating through master_column_position_list
        end # end of if master_column_position_list.length == 0

      end # end of iterating through columns.each_with_index
    end # end of if columns.length > min_cols

  end # end of each line iteration

  puts "Master positions for page #{new_f_name}: \n#{master_column_position_list.map { |mpos| "(#{mpos.join(',')})" }.join(', ')}\n\n"

  # Now create the new text file. We map each position in master_column_position_list onto each
  # line, creating a substring for each element that exists in the master-list. We also prepend
  # the current page number, line number, and the number of columns, for later diagnostics.
  pdf_text.each_with_index do |line, line_number|
    unless line.strip.blank?
      compiled_file.puts(([page_num, line_number, master_column_position_list.length] + master_column_position_list.map { |pos| line[pos[0]..pos[1]].to_s.strip }).join("\t"))
    end
  end

end # end of each page iteration

compiled_file.close

You’ll note that in the section where we output the results to compiled_file, we’ve also included the page number, line number, and number of columns in that page. When we try this program on Lilly’s PDF, there are some columns in which the data is spread out enough to be considered separate columns by our program. So keeping track of the columns found per page allows us to quickly identify problem pages and fix them manually.

Because of the wide spacing in this particular PDF-to-text translation, our program would mistakenly create two columns where the original PDF only had one.

PDF-to-Text Anomalies

Almost every conversion ends up with some strange artifacts. For example, in the above conversion of the GSK document, we get some entries in the last column that are repeated over several lines.

I don't know enough about how PDFs are generated to prevent this. But after any conversion, you'll need to use Excel, Google Refine, or some custom code to check that all the fields have values in an expected range.

Regular expressions are pretty much essential to this, allowing you to determine which cells don't fit a certain format, such as an exact length of characters, or a currency format like $xx,xxx.00.
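For instance, here is a small Ruby sketch that flags cells failing a currency pattern (the pattern and the last two sample values are illustrative assumptions):

cells = ["$6,500", "$41,000", "37500", "$1,2345"]
cells.each do |cell|
  # accept a leading $, comma-grouped thousands, and optional cents
  ok = cell =~ /\A\$\d{1,3}(,\d{3})*(\.\d{2})?\z/
  puts "#{cell}: #{ok ? 'looks OK' : 'needs review'}"
end
# => $6,500: looks OK
# => $41,000: looks OK
# => 37500: needs review
# => $1,2345: needs review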

Conclusions

There is no single method we could find that does PDF translation perfectly. We recommend trying one of the web services first. If the result isn't as accurate as you'd like, it's not too much work to write some text-processing code.

With any method, you may end up spending lots of time cleaning up the occasionally mistranslated cell, but at least it won't be as arduous as manually retyping the entire PDF.
