Using pdftotext and Powershell to extract transactions from an Apple Card statement on macOS
To extract transactions from an Apple Card PDF statement into a tab-delimited text file under Powershell, you can use these commands:
pdftotext -table '.\Apple Card Statement - October 2019.pdf'
$regex = '^(\d\d)/(\d\d)/(20\d\d)\s+(.*?)\s+\S*%\s+\$\S+\s*(.*)'
get-content '.\Apple Card Statement - October 2019.txt' | where-object {
($_ -replace "\s+"," ") -match $regex
} | foreach-object {
"{2}-{0}-{1}`t{3}`t{4}" -f (1..5 | foreach-object {$matches[$_]})
} | set-content 201910.txt
To use these on the Mac, you need the Mac versions of the following two packages:
1) Xpdfreader
pdftotext is a command line tool provided as part of the 'Xpdfreader' package, which I installed on Windows using the Chocolatey package manager to install 'xpdf-utils'.
The developer seems to have a download page that includes a downloadable mac version:
https://www.xpdfreader.com/
I have also read that it is available for the Mac through Homebrew.
pdftotext has a '-table' option that I've found generally useful for getting tabular text out of PDFs, in a layout I've found pretty easy to parse.
2) Powershell
Powershell is the modern command shell for Windows, and it is also available on Mac and Linux.
https://docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-macos?view=powershell-6
After installing, you can enter the Powershell shell by invoking it as "pwsh".
Inside the shell you will find that the most basic unix-y navigation commands work, such as 'cd', 'ls', 'mkdir', 'rm'.
On Windows each of those is just a short name for the corresponding Powershell command:
"cd" is set-location, "ls" is get-childitem, "mkdir" is new-item, "rm" is remove-item.
On the Mac, as far as I know, only "cd" is still mapped that way; 'ls', 'mkdir' and 'rm' run the native programs, so their usual option flags work too.
Powershell commands are by convention in the form verb hyphen noun.
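To check what one of these short names really runs, you can ask Get-Alias. This is only a small sketch; it uses 'cd' because, as far as I know, that one is defined as an alias on every platform, while the others may differ between Windows and Mac.

```powershell
# Ask Powershell what the alias 'cd' resolves to.
$target = (Get-Alias -Name cd).Definition
$target    # Set-Location

# Command names follow the verb-hyphen-noun convention,
# so the name splits cleanly on the hyphen.
$verb, $noun = $target -split '-'
"verb=$verb noun=$noun"
```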
Here is my explanation of the three commands I used:
1) pdftotext -table '.\Apple Card Statement - October 2019.pdf'
This converts the pdf file to a text file with the same name but the '.txt' extension.
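If pdftotext is not on your path, you would also include the path to it.
The '.\' prefix is the Windows way of saying "current directory"; on the Mac I would expect to need './' here, since pdftotext is a native program.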
2) $regex = '^(\d\d)/(\d\d)/(20\d\d)\s+(.*?)\s+\S*%\s+\$\S+\s*(.*)'
This sets a shell variable to a literal string. 'regex' is just the name I picked; I could have called it anything.
No shell interpolation is used for a literal string surrounded by single quotes, so the "$" is just a dollar character here.
Backslashes are never significant in powershell strings, so they reach the regex untouched. Powershell uses the backtick ` instead as the string escape.
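The quoting rules are easy to try out at the prompt. The following is a minimal sketch; the variable names are made up for the demonstration.

```powershell
$name = 'world'

# Single quotes: nothing is expanded, so $name stays as typed.
$literal = 'hello $name'

# Double quotes: variables are expanded.
$expanded = "hello $name"

# A backslash is an ordinary character in a string ...
$pattern = '\d\d'          # 4 characters: \ d \ d

# ... while backtick-t inside double quotes is a single tab character.
$tab = "`t"

$literal           # hello $name
$expanded          # hello world
$pattern.Length    # 4
$tab.Length        # 1
```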
3) get-content '.\Apple Card Statement - October 2019.txt' | where-object {
($_ -replace "\s+"," ") -match $regex
} | foreach-object {
"{2}-{0}-{1}`t{3}`t{4}" -f (1..5 | foreach-object {$matches[$_]} )
} | set-content 201910.txt
Get-Content is a standard command in powershell to retrieve the content of a file as lines of text and write each line to the pipeline.
The powershell pipeline is a sequence of objects rather than a sequence of bytes, so each line is an item in the pipeline.
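To see the "one object per line" behaviour for yourself, something like the following should work. It is just a sketch: the scratch file name and its three lines are made up for the demonstration.

```powershell
# Write a three-line scratch file in the temp directory (name is arbitrary).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'get-content-demo.txt'
'first', 'second', 'third' | Set-Content -Path $path

# Each line comes back as its own string object.
$lines = Get-Content -Path $path
$lines.Count                                                  # 3
$lines | ForEach-Object { $_.GetType().Name + ': ' + $_ }     # String: first, ...

# Clean up the scratch file.
Remove-Item -Path $path
```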
Where-object {} evaluates the codeblock within the braces for each object in the input pipeline,
and writes to the output only those input objects for which the codeblock returned 'true'.
The result of the codeblock is treated as a boolean.
$_ is a special shell variable which contains the current object.
In my codeblock I first collapse every run of whitespace in the current line to a single space with -replace, then match the result against my regex with -match, which returns true if it matches.
Note that where-object passes along the original line, not the collapsed copy.
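Here is the filtering step on its own, as a small sketch. The three sample lines are invented, and the pattern is a cut-down version of my regex that only checks for a leading date.

```powershell
# Made-up sample lines; only the middle one starts with a date.
$lines = @(
    'Transactions'
    '10/05/2019     COFFEE   SHOP'
    'Total'
)

# Keep only the lines that start with mm/dd/yyyy once whitespace runs are collapsed.
$kept = $lines | Where-Object { ($_ -replace '\s+', ' ') -match '^\d\d/\d\d/20\d\d ' }

# The line that comes out is the original one, extra spaces and all.
$kept
```

The filter only decides which lines pass; it does not change them. The matched text itself ends up in $matches.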
foreach-object {} executes the given codeblock once for each object in the pipeline.
The input pipeline has only the lines that matched my regex, and the special variable $matches still holds the capture
groups from the match for this line ($matches is set by -match and isn't restricted to the scope of a codeblock; because the pipeline passes lines along one at a time, it still belongs to the current line).
Here I am writing out the captured groups for the line in a reformatted, tab-delimited form, using the "-f" formatting operator.
The string on the left is the formatting template. The "`t" is a tab character. Again, powershell uses backtick rather
than backslash as its escape so that the backslashes in Windows filepaths don't conflict with string escaping.
The argument on the right is a one-dimensional array of items to be formatted; {0} refers to the first item, so {2}-{0}-{1} puts year, month and day in yyyy-mm-dd order.
Here I generated an array of the 5 capture groups using another little powershell pipeline.
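The match-then-format step can be tried on a single line without any PDF. This is only a sketch: the sample line is invented, and it assumes nothing more about the statement layout than the column order my regex implies (date, description, a percentage, a dollar amount, then the final amount).

```powershell
$regex = '^(\d\d)/(\d\d)/(20\d\d)\s+(.*?)\s+\S*%\s+\$\S+\s*(.*)'

# Invented sample line, not taken from a real statement.
$line = '10/05/2019   COFFEE SHOP SEATTLE WA    2%   $0.09    $4.50'

if (($line -replace '\s+', ' ') -match $regex) {
    # $matches[0] is the whole match; 1..5 are the capture groups.
    $groups = 1..5 | ForEach-Object { $matches[$_] }

    # -f fills {0}, {1}, ... from the array on its right.
    $row = "{2}-{0}-{1}`t{3}`t{4}" -f $groups
}

# Show the three tab-separated fields one per line.
$row -split "`t"
```

With this sample the three fields come out as 2019-10-05, COFFEE SHOP SEATTLE WA and $4.50.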
Finally set-content writes each line of its input pipeline to the given path.

v9999 commented Jan 24, 2021

Hey there, this is very educational.
I was trying to use part of your ideas on a project of mine, but using a regex like this to match only the text that goes after the initial indicated text:
(?<=Username: ).*

However, in this part: 'where-object { ($_ -replace "\s+"," ") -match $regex }', I get the whole line where the regex match occurs, instead of the content of the match itself.
