I check my bank statements monthly by manually scanning the transaction lists, looking for anything odd, and then filing them away. I then run a script that checks all statements are present, so I don't accidentally miss a month. It would be easier to process these files as data, rather than PDFs I have to read myself. Some banks offer data export features or APIs, but my problem with those is that I ultimately want to treat my PDF statements as the source of truth and the files I keep on record.
But they can be parsed with a little effort.
This will require two tools, both widely available on Unix systems:

- pdftotext: extracts text content from PDF files
- awk: text processing and manipulation
The PDF in question has a list of transactions, split by a page
break, with five columns split by whitespace. When processing this
through pdftotext, preserving the layout is essential so
that the tabular structure remains:
pdftotext -layout statement.pdf -
This outputs something like this:
11122298986565 AABBCCD
MR K SAUCE Super Duper Account 10000
42 FIBRE JUNCTION Enjoy your life, enjoy your bank
AIRGAPPED X3242 Call 0800 111222 to chat
UNDER THE SEA Leave us an NPS review here.
Your account summary
5th Nov 1955 to 4th Dec 1955
Earnings This month
Your Account $0.01
Important information: I wanna be the very best, like noone ever was.
You are protected for the first $5 you deposit.
Super Duper likes its customers. We hope you like us too.
...
Lots
...
More
...
Text
...
Your transactions 5th Nov 1955 to 4th Dec 1955
Date Description Money coming Money going $ Balance
12th Nov Balance fwd 8.00
14th Nov CHICO'S GRILL 9.50 17.50
15th Nov HARRY'S SCRWS+BO 4.95 12.55
BS GETIT
24th Nov FLUX DISP INVOI 35.00 47.55
Account name: MR K SAUCE
Account number: 11122298986565
Statement number: 11/1955
11122298986565
Date Description Money coming Money going $ Balance
29th Nov FARMER MAN J 7.20 40.35
2nd Dec WOOLY,WOOL EXPO 25.00 15.35
4th Dec Balance carry 15.35
I’m only interested in the transaction table, and that means dropping all text before the start of the table. I’ll write this in a script so it’s easier to reason about, since this processing will happen in multiple passes.
#!/bin/bash
pdftotext -layout "$1" - |

This reads the first argument passed to the script as the file path of the PDF to parse, converts it to text, and pipes it to the next command, the first awk pass:
awk '
/^[0-9]+.. [A-Z][a-z]+[[:space:]]+Balance/ { found=1; next }
found { print }
' |
Breaking this down:

- Match the first line that starts with a number followed by any two characters, a space, a capital letter followed by multiple lower-case letters, at least one whitespace character, and then the word Balance. This is 12th Nov   Balance fwd in the example input and marks the start of the transactions table. (The same pattern also matches the closing 4th Dec   Balance carry line, so that gets dropped here too.)
- Once found, set the found variable to a truthy value, and skip the matched line by calling next.
- If found is now a truthy value (it is), print the remaining input.
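As a quick sanity check, this first pass can be exercised on a few lines of the sample output (the input lines here are abbreviated stand-ins for the real statement text):

```shell
printf '%s\n' \
  'Important information' \
  '12th Nov   Balance fwd   8.00' \
  "14th Nov   CHICO'S GRILL   9.50   17.50" |
awk '
  /^[0-9]+.. [A-Z][a-z]+[[:space:]]+Balance/ { found=1; next }
  found { print }
'
```

Only the CHICO'S GRILL line survives: the header text is never printed, and the Balance fwd marker line itself is consumed.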
This output is then piped into the next pass:
awk 'NF { print }' |
NF is an awk variable holding the number of fields on the current line. It's 0 for empty or whitespace-only lines, so using it as a bare pattern is a truthy test: this pass removes blank lines from the input.
awk '!/^Date/' |
And this removes that pesky table header line just after the page break.
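Both of these small passes can be checked together on a fragment containing a blank line and a repeated header (the lines below are abbreviated versions of the sample input):

```shell
printf '%s\n' \
  '24th Nov   FLUX DISP INVOI   35.00   47.55' \
  '' \
  'Date   Description   Money coming   Money going   $ Balance' \
  '29th Nov   FARMER MAN J   7.20   40.35' |
awk 'NF { print }' |
awk '!/^Date/'
```

Only the two transaction lines come out the other side.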
Now comes the first serious parsing challenge. Notice the transaction on 15th November:
15th Nov HARRY'S SCRWS+BO 4.95 12.55
BS GETIT
The description has wrapped onto a newline, but logically it’s part
of the second column. The full description should read
HARRY'S SCRWS+BO BS GETIT.
awk '
# Find candidate lines with leading whitespace
/^[[:space:]]/ {
# Set RLENGTH to length of leading whitespace characters
match($0, /^[[:space:]]+/)
# Candidate line found
# If more than 16 whitespace characters, not a candidate
if (RLENGTH <= 16) {
# Trim the line
gsub(/^[[:space:]]+/, "")
gsub(/[[:space:]]+$/, "")
# If previous line has been buffered
if (prev != "") {
# Append this continuation text to the description column
prev_description = prev_description " " $0
}
}
# Avoid falling through to next block so footer is not printed
next
}
# Process regular transaction lines
{
# Previous line is complete if we hit here, so print it
if (prev != "") {
print prev_date "\t" prev_description "\t" prev_amount "\t" prev_balance
}
# Split on runs of at least three spaces
# This splits the line into the columns
_n = split($0, parts, /[[:space:]]{3,}/)
prev = $0
prev_date = parts[1]
prev_description = parts[2]
prev_amount = parts[3]
prev_balance = parts[4]
# The next line might be a continuation, so do not print yet
}
# Runs after all input processed. Print the buffered line
END {
if (prev != "") {
print prev_date "\t" prev_description "\t" prev_amount "\t" prev_balance
}
}
'
There's a lot going on here. First, continuation lines must be isolated and distinguished from the page footer. The continuation description is aligned with the description column, but the footer is indented further. Here, a threshold of 16 leading whitespace characters separates the two:

- If there are more than 16 whitespace characters at the start of the line, the line is part of the footer (e.g., the Account name: MR K SAUCE line).
- If there's still leading whitespace, but 16 characters or fewer, this is a continuation line (e.g., BS GETIT).
The second block splits each line into columns using runs of at least three whitespace characters as the delimiter. The final END block ensures the last buffered line is printed once all input is processed. Columns are now tab-delimited to allow for ease in further processing.
Because there's only ever one value in either the money-coming or money-going column, that value is what gets parsed as the amount, and it's always positive. That's not helpful. Luckily, the balance decreases when money leaves the account, so comparing each row's balance with the previous row's determines whether the amount should be positive or negative.
awk -F'\t' '
function parse_num(s) {
# Replace all non-digit non-decimal point chars (commas, currency symbols)
gsub(/[^0-9.]/, "", s)
# Coerce string to number
return s + 0
}
{
current_balance = parse_num($4)
# Only run from line (record) 2 onwards, line 1 has no balance to compare
if (NR > 1) {
if (current_balance < prev_balance) {
$3 = "-" $3
}
}
prev_balance = current_balance
print $1 "\t" $2 "\t" $3 "\t" $4
}
' OFS='\t'
-F and OFS set the input and output field separators to the tab character, respectively.
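Running this sign-fixing pass over two tab-delimited rows shows the debit detection at work (GRILL and HARDWARE are made-up stand-in descriptions):

```shell
printf '14th Nov\tGRILL\t9.50\t17.50\n15th Nov\tHARDWARE\t4.95\t12.55\n' |
awk -F'\t' '
  function parse_num(s) { gsub(/[^0-9.]/, "", s); return s + 0 }
  {
    current_balance = parse_num($4)
    # Balance dropped relative to the previous row, so this was a debit
    if (NR > 1 && current_balance < prev_balance) $3 = "-" $3
    prev_balance = current_balance
    print $1 "\t" $2 "\t" $3 "\t" $4
  }
'
```

The second row's balance (12.55) is lower than the first's (17.50), so its amount comes out as -4.95.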
The last step is to convert this to a CSV file. The description field could contain any character, including commas, so each field is wrapped in quotes.
awk -F'\t' '
{
for (i = 1; i <= NF; i++) {
# Double any existing quote characters
gsub(/"/, "\"\"", $i)
# Wrap each field in quotes
printf "\"%s\"", $i
if (i < NF) printf ","
}
printf "\n"
}
'
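The quoting matters for descriptions like WOOLY,WOOL EXPO. A contrived row whose description contains both a comma and quote characters (the quotes are invented for this example) survives intact:

```shell
printf '2nd Dec\tWOOLY,WOOL "EXPO"\t-25.00\t15.35\n' |
awk -F'\t' '
  {
    for (i = 1; i <= NF; i++) {
      # Double existing quotes, then wrap the field in quotes
      gsub(/"/, "\"\"", $i)
      printf "\"%s\"", $i
      if (i < NF) printf ","
    }
    printf "\n"
  }
'
```

The embedded comma stays inside its quoted field and the embedded quotes are doubled, as CSV quoting requires.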
Run the whole script, and you should see a quote-wrapped signed-amount full-description CSV:
"14th Nov","CHICO'S GRILL","9.50","17.50"
"15th Nov","HARRY'S SCRWS+BO BS GETIT","-4.95","12.55"
"24th Nov","FLUX DISP INVOI","35.00","47.55"
"29th Nov","FARMER MAN J","-7.20","40.35"
"2nd Dec","WOOLY,WOOL EXPO","-25.00","15.35"
The threshold of 16 whitespace characters used to differentiate continuation lines from the footer is arbitrary. Determining this value from the input itself would be more robust.

Depending on how your bank statement is structured, the implementation and error handling could vary wildly. This post shows one version of how to get a basic parsed version of this input specifically, and how awk can be used to build or augment parsers like this.
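As a closing sketch, one way to derive that threshold from the input, assuming the Date/Description header line survives pdftotext intact, is to take the column where Description starts (the +2 slack is an invented fudge factor, not something derived from the statement format):

```shell
printf '%s\n' 'Date    Description    Money coming' |
awk '
  # Continuation lines are indented to roughly the description column,
  # so its start position (plus a little slack) makes a usable threshold.
  /^Date[[:space:]]+Description/ {
    print index($0, "Description") + 2
  }
'
```

In the real script this value would be captured once when the header is first seen, and used in place of the hard-coded 16.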
~K