Exploring awk through bank statement parsing

I check my bank statements monthly by manually scanning the transaction lists, checking for anything that looks odd, and then filing them away. I then run a script that checks all statements are present, so I don’t accidentally miss a month. It would be easier to process these files as data, rather than PDFs I have to read myself. Some banks offer data export features or APIs, but my problem with those is that I ultimately want to treat my PDF statements as the source of truth and the files I keep on record.

But they can be parsed with a little effort.

PDF in, CSV out

This will require two tools, both widely available on Unix systems:

  - pdftotext: extracts text content from PDF files
  - awk: for text processing and manipulation

The PDF in question has a list of transactions, split by a page break, with five columns split by whitespace. When processing this through pdftotext, preserving the layout is essential so that the tabular structure remains:

pdftotext -layout statement.pdf -

This outputs something like this:

11122298986565                         AABBCCD

    MR K SAUCE                     Super Duper Account 10000
    42 FIBRE JUNCTION              Enjoy your life, enjoy your bank
    AIRGAPPED X3242                Call 0800 111222 to chat
    UNDER THE SEA                  Leave us an NPS review here.
    
      Your account summary
      5th Nov 1955 to 4th Dec 1955
          Earnings         This month
          Your Account     $0.01
          
    Important information: I wanna be the very best, like noone ever was.
    You are protected for the first $5 you deposit.
    Super Duper likes its customers. We hope you like us too.
    ...
    Lots
    ...
    More
    ...
    Text
    ...
    
Your transactions 5th Nov 1955 to 4th Dec 1955
Date            Description         Money coming       Money going     $ Balance
12th Nov        Balance fwd                                             8.00
14th Nov        CHICO'S GRILL       9.50                               17.50
15th Nov        HARRY'S SCRWS+BO                       4.95            12.55
                BS GETIT
24th Nov        FLUX DISP INVOI     35.00                              47.55

                        Account name: MR K SAUCE
                        Account number: 11122298986565
                        Statement number: 11/1955
                        11122298986565

Date          Description         Money coming       Money going     $ Balance
29th Nov        FARMER MAN J                           7.20            40.35
2nd Dec         WOOLY,WOOL EXPO                        25.00           15.35
4th Dec         Balance carry                                          15.35

I’m only interested in the transaction table, and that means dropping all text before the start of the table. I’ll write this in a script so it’s easier to reason about, since this processing will happen in multiple passes.

#!/bin/bash

pdftotext -layout "$1" - |

This reads the first argument passed to the script as the file path of the PDF to parse, converts it to text, and pipes it to the next command, the first awk pass:

awk '
  /^[0-9]+.. [A-Z][a-z]+[[:space:]]+Balance/ { found=1; next }
  found { print }
' |

Breaking this down:

  - Match the first line that starts with a number, then any two characters, a space, a capital letter followed by lower-case letters, at least one space, and then the word Balance. This is the 12th Nov Balance fwd line in the example input, and it marks the start of the transactions table.
  - Once matched, set the found variable to a truthy value, and skip the line by calling next.
  - For every line after that, found is truthy, so print it.
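As a quick sanity check, the pass can be run against a small inline sample (the lines here are made up to mimic the statement layout):

```shell
# Everything up to and including the "Balance fwd" line should be dropped
out=$(printf '%s\n' \
  'Important information: preamble text' \
  '12th Nov   Balance fwd            8.00' \
  "14th Nov   CHICO'S GRILL   9.50   17.50" |
awk '
  /^[0-9]+.. [A-Z][a-z]+[[:space:]]+Balance/ { found=1; next }
  found { print }
')
printf '%s\n' "$out"
```

Only the line after the balance-forward line survives.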

This output is then piped into the next pass:

awk 'NF { print }' |

NF is an awk variable for the number of fields. It’s 0 for empty or whitespace-only lines. The way it’s used here is a truthy test. This removes empty lines from the input.
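To illustrate, here's the NF test run over input containing a blank line and a whitespace-only line:

```shell
# NF is 0 for the empty line and the whitespace-only line, so both are dropped
kept=$(printf 'first\n\n   \nlast\n' | awk 'NF { print }')
printf '%s\n' "$kept"
```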

awk '!/^Date/' |

And this removes that pesky table header line just after the page break.

Now comes the first serious parsing challenge. Notice the transaction on the 15th of November:

15th Nov        HARRY'S SCRWS+BO                       4.95            12.55
                BS GETIT

The description has wrapped onto a second line, but logically it’s part of the second column. The full description should read HARRY'S SCRWS+BO BS GETIT.

awk '
  # Find candidate lines with leading whitespace
  /^[[:space:]]/ {
    # match() sets RLENGTH to the length of the leading whitespace
    match($0, /^[[:space:]]+/)

    # 16 or fewer leading whitespace characters: a continuation line.
    # More than 16: part of the page footer, which is dropped.
    if (RLENGTH <= 16) {
      # Trim the line
      gsub(/^[[:space:]]+/, "")
      gsub(/[[:space:]]+$/, "")

      # If previous line has been buffered
      if (prev != "") {
        # Append this continuation text to the description column
        prev_description = prev_description " " $0
      }
    }

    # Avoid falling through to next block so footer is not printed
    next
  }

  # Process regular transaction lines
  {
    # Previous line is complete if we hit here, so print it
    if (prev != "") {
      print prev_date "\t" prev_description "\t" prev_amount "\t" prev_balance
    }

    # Split on runs of at least three spaces
    # This splits the line into the columns
    _n = split($0, parts, /[[:space:]]{3,}/)
    prev = $0
    prev_date = parts[1]
    prev_description = parts[2]
    prev_amount = parts[3]
    prev_balance = parts[4]

    # The next line might be a continuation, so do not print yet
  }

  # Runs after all input processed. Print the buffered line
  END {
    if (prev != "") {
      print prev_date "\t" prev_description "\t" prev_amount "\t" prev_balance
    }
  }
'

There’s a lot going on here. First, continuation lines must be isolated and distinguished from the page footer. The continuation description is aligned with the description column, but the footer is indented further. Here, a threshold of 16 leading whitespace characters separates the two:

  - If there are more than 16 whitespace characters at the start of the line, the line is part of the footer (e.g., Account name: MR K SAUCE).
  - If there are leading whitespace characters, but 16 or fewer of them, this is a continuation line (e.g., BS GETIT).

The second block splits each line into columns using runs of three or more whitespace characters as delimiters. The final END block ensures the last buffered line is output. Columns are now tab-delimited to allow for ease in further processing.

Because there’s only ever one value in either the money-in or the money-out column, that’s the value that gets parsed as the amount, and it’s always positive. That’s not helpful. Luckily, the running balance rises or falls depending on whether the transaction was a credit or a debit to the account, so comparing consecutive balances can determine the sign of the amount.

awk -F'\t' '
  function parse_num(s) {
    # Strip all non-digit, non-decimal-point chars (commas, currency symbols)
    gsub(/[^0-9.]/, "", s)

    # Coerce string to number
    return s + 0
  }

  {
    current_balance = parse_num($4)

    # Only run from line (record) 2 onwards, line 1 has no balance to compare
    if (NR > 1) {
      if (current_balance < prev_balance) {
        $3 = "-" $3
      }
    }
    prev_balance = current_balance
    print $1 "\t" $2 "\t" $3 "\t" $4
  }
' OFS='\t'

-F and OFS set the input and output field separators to the tab character, respectively.
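To see the sign logic on its own, here's the pass run over a small tab-delimited sample (the descriptions are shortened placeholders, not real statement text):

```shell
# Tab-delimited input: date, description, amount, balance.
# The balance drops from 17.50 to 12.55, so the 4.95 gets negated.
signed=$(printf '14th Nov\tGRILL\t9.50\t17.50\n15th Nov\tSCREWS\t4.95\t12.55\n' |
awk -F'\t' '
  function parse_num(s) { gsub(/[^0-9.]/, "", s); return s + 0 }
  {
    current_balance = parse_num($4)
    if (NR > 1 && current_balance < prev_balance) $3 = "-" $3
    prev_balance = current_balance
    print $1 "\t" $2 "\t" $3 "\t" $4
  }
')
printf '%s\n' "$signed"
```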

The last step is to convert this to a CSV file. The description field could contain any character, including commas, so each field is wrapped in quotes.

awk -F'\t' '
  {
    for (i = 1; i <= NF; i++) {
      # Double any existing quote characters
      gsub(/"/, "\"\"", $i)

      # Wrap each field in quotes
      printf "\"%s\"", $i

      if (i < NF) printf ","
    }

    printf "\n"
  }
'
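The quote doubling isn't exercised by this statement's output, but it matters if a description ever contains a double quote. Here's the pass run over a made-up row containing both a quote and a comma:

```shell
# A description containing a quote and a comma survives quoting intact
csv=$(printf '2nd Dec\tSAY "BAA", WOOL\t-25.00\t15.35\n' |
awk -F'\t' '
  {
    for (i = 1; i <= NF; i++) {
      gsub(/"/, "\"\"", $i)
      printf "\"%s\"", $i
      if (i < NF) printf ","
    }
    printf "\n"
  }
')
printf '%s\n' "$csv"
```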

Run the whole script, and you should see a CSV with quoted fields, signed amounts, and merged descriptions:

"14th Nov","CHICO'S GRILL","9.50","17.50"
"15th Nov","HARRY'S SCRWS+BO BS GETIT","-4.95","12.55"
"24th Nov","FLUX DISP INVOI","35.00","47.55"
"29th Nov","FARMER MAN J","-7.20","40.35"
"2nd Dec","WOOLY,WOOL EXPO","-25.00","15.35"

Future improvements

  1. The first transaction doesn’t have a previous balance to compare its amount to, so the first amount will always be positive. The starting balance line could be parsed, rather than dropped, to fix this.
  2. Determining the sign assumes the transactions are in chronological order. Some banks may supply statements in reverse-chronological order. The transactions should be sorted to protect against this.
  3. The whitespace threshold of 16 to differentiate continuation lines from the footer is arbitrary. Determining this value from the input itself would be more robust.
  4. The year isn’t included in the output. The date column could be rewritten to include it.
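On point 3, one possible sketch: the table header already encodes where the Description column starts, so the threshold could be read from it rather than hard-coded. This isn't part of the script above, just an illustration of the idea using lines from the example input:

```shell
# Hypothetical: derive the continuation/footer threshold from the position
# of "Description" in the header line, instead of hard-coding 16
labelled=$(printf '%s\n' \
  'Date            Description         Money coming       Money going     $ Balance' \
  '                BS GETIT' \
  '                        Account name: MR K SAUCE' |
awk '
  /^Date[[:space:]]+Description/ { threshold = index($0, "Description"); next }
  {
    match($0, /^[[:space:]]+/)
    # Text starting at (or before) the Description column is a continuation
    if (RLENGTH + 1 <= threshold) print "continuation:" $0
    else                          print "footer:" $0
  }
')
printf '%s\n' "$labelled"
```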

Depending on how your bank statement is structured, the implementation and error handling could vary wildly. This post shows one version of how to get a basic parsed version for this input specifically, and how awk can be used to build or augment parsers like this.

~K