Bayes

This article looks at the Bayes algorithm for imported transactions. On import, the algorithm tries to find a "best match" and, depending on how good the fit is, uses the suggested account.

Disclaimer

The article contains a Perl script for manipulating the database that the Bayes algorithm builds over time, and explains when this may be useful. However, if you are not careful, using the tool may also drastically degrade the quality of the Bayes matching. Moreover, the script has not been extensively tested. The GnuCash team is not responsible for it; it is provided as-is and you use it at your own risk. Make a backup before using the tool.

Motivation

Sometimes Bayes fails to recognise fairly common transactions on import. Looking into the XML file at the stored data led to the following findings:

  • Accounts are stored as strings. If you rename or delete an account, you lose what Bayes has learned for it. You are also left with dead wood in the data, since the entries refer to accounts that no longer exist. Experiments showed that this also prevented some imports from being recognised properly (there may have been a better match to a non-existing account).
  • There are very many entries with weight 1 or 2. This is fine if you have just started using the import. But when you have used the import for several years, a weight of 1 or 2 indicates either that the token is brand new or that it is insignificant. A look at the corresponding tokens makes this plausible: they often contain a time of day ("9:43") or are just some kind of unique transaction number.
  • Some really good information is lost, since only spaces are used as separators. Many imports contain descriptions of the form id/date, e.g. 123456/201501, where 123456 is a customer id that never changes, while 201501 refers to the month being billed. Split at "/", the customer id would be a token of its own with a really high weight. Because the date changes each month, however, the combined token is new every time (assuming monthly recurrence), so it always has weight 1 and the good information in the customer id is lost. For some users "/" may be an important part of a token, although hours of looking at one raw XML file turned up no such case. For compatibility reasons, "/" should therefore stay part of the token by default; it might be possible to make this a property the user can choose: use "/" as a separator (in addition to spaces, of course).
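
The effect of the separator choice can be illustrated with a minimal Perl sketch. It does not reproduce how GnuCash tokenises internally; the helper name tokenize and the description string are hypothetical and only serve to compare splitting on spaces with splitting on spaces and "/".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a description into tokens on any run of the
# characters listed in $separators (used inside a regex character class).
sub tokenize {
	my ($description, $separators) = @_;
	return grep { length } split /[$separators]+/, $description;
}

my $description = "ACME Energy 123456/201501";   # made-up import text

my @spaceOnly = tokenize($description, ' ');     # spaces only
my @withSlash = tokenize($description, ' /');    # spaces and "/"

print "spaces only:    ", join(" | ", @spaceOnly), "\n";
print "spaces and '/': ", join(" | ", @withSlash), "\n";
```

With spaces only, 123456/201501 stays one token that changes every month; with "/" as an additional separator, 123456 becomes a token of its own that can accumulate weight.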

Description of Script and Results

The Perl script can remove accounts and/or tokens from the Bayes database in an uncompressed GnuCash XML file. It can apply the following rules:

  • Blacklist: Remove accounts if they or a higher-level account appear on a blacklist (provided as an extra file).
  • Minimal Weight: Remove accounts if their weight is below a certain threshold.
  • Minimal Length: Remove tokens if their length is below a certain threshold.
  • Automatically remove tokens that are left empty after the above steps.
  • Bonus Allocation: Increase weight of already significant account entries.
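
For orientation, the following sketch shows the nesting of tokens, accounts and weights that these rules operate on. The layout is inferred from the slot elements the script itself writes back; the token, the account names and the weights are invented, and the depth counter is a simplification for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample of one token entry, in the layout the script writes back.
my $sample = <<'XML';
        <slot>
          <slot:key>123456</slot:key>
          <slot:value type="frame">
            <slot>
              <slot:key>Expenses:Utilities:Electricity</slot:key>
              <slot:value type="integer">14</slot:value>
            </slot>
            <slot>
              <slot:key>Expenses:Misc</slot:key>
              <slot:value type="integer">1</slot:value>
            </slot>
          </slot:value>
        </slot>
XML

# Collect token => { account => weight } using a simple nesting counter:
# a key at depth 1 is a token, a key at depth 2 is an account.
my (%weights, $token, $account);
my $depth = 0;
for my $line (split /\n/, $sample) {
	if ($line =~ /<slot>/) {
		$depth++;
	} elsif ($line =~ m{</slot>}) {
		$depth--;
	} elsif ($line =~ m{<slot:key>(.*)</slot:key>}) {
		if ($depth == 1) { $token = $1 } else { $account = $1 }
	} elsif ($line =~ m{<slot:value type="integer">(\d+)</slot:value>}) {
		$weights{$token}{$account} = $1;
	}
}

for my $tok (sort keys %weights) {
	for my $acc (sort keys %{ $weights{$tok} }) {
		printf "%s -> %s: %d\n", $tok, $acc, $weights{$tok}{$acc};
	}
}
```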

Blacklist

Depending on how much the chart of accounts (CoA) has changed over time, this can be a very useful feature. Renaming or deleting an account leaves the database entries unchanged. Not only is such an entry dead wood, sitting in the XML forever and using up space; there is also empirical evidence that it may prevent some matches from being made. This is probably because a better match exists with a non-existing account.

The blacklist file should contain all accounts which either

  • no longer exist (sometimes this is most easily done at a higher level, e.g. if you have renamed a very high-level account), or
  • should not be used (the entry was made by accident, e.g. at the first import you selected an account which you later decided against).

Blacklisting accounts will not degrade the matching results, since it only affects accounts that the algorithm should not know about.
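
As a rough illustration of how a blacklist entry covers an account and everything below it, here is a minimal sketch. The helper isBlacklisted and all account names are hypothetical. The script in this article matches by plain prefix; the sketch additionally requires the match to end at an account boundary, which assumes that ":" is the account separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up blacklist entries and stored account names.
my @blacklist = ('Expenses:Old Car', 'Assets:Closed Bank');
my @accounts  = (
	'Expenses:Old Car',
	'Expenses:Old Car:Fuel',
	'Expenses:Groceries',
	'Assets:Closed Bank:Savings',
);

# Hypothetical helper: true if $account is a blacklisted account or lies
# below one. Assumes ":" separates account levels.
sub isBlacklisted {
	my ($account, @entries) = @_;
	for my $entry (@entries) {
		next unless length $entry;    # ignore empty blacklist lines
		# \Q...\E treats the entry as literal text, not as a regex.
		return 1 if $account =~ /^\Q$entry\E(?::|\z)/;
	}
	return 0;
}

for my $account (@accounts) {
	printf "%-28s %s\n", $account,
		isBlacklisted($account, @blacklist) ? 'remove' : 'keep';
}
```

Listing a high-level account once is enough to cover all of its sub-accounts, which is why blacklisting at a higher level is often the easiest route.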

Minimal Weight

Low weights are natural in the database. Every time a token is first discovered, it enters the database with a value of 1. However, if it stays at this value or is raised only slightly through "accidental" matches, it is not significant. That said, when you start using the import feature, you need to give the database time to increase these values with each import. If you prune low weights too early, you will seriously harm your matching algorithm.

In older databases, experiments showed that all entries with a weight of less than 4 can be pruned. This may differ depending on your data, so you may want to test it. The threshold can be set freely. A value that is too high will also hurt the algorithm.
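
The rule can be sketched on an in-memory structure as follows. This is an illustration under assumptions, not the script itself: the hash %db, its tokens, accounts and weights are invented, and the threshold of 4 is simply the value mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $minimalWeight = 4;   # threshold; tune it to your own data

# Invented token database: token => { account => weight }.
my %db = (
	'123456' => { 'Expenses:Utilities' => 14, 'Expenses:Misc' => 1 },
	'9:43'   => { 'Expenses:Groceries' => 1 },
);

for my $token (keys %db) {
	my $accs = $db{$token};
	# Drop account entries whose weight is below the threshold.
	delete @{$accs}{ grep { $accs->{$_} < $minimalWeight } keys %$accs };
	# A token without any remaining account is dropped entirely.
	delete $db{$token} unless %$accs;
}

for my $token (sort keys %db) {
	for my $acc (sort keys %{ $db{$token} }) {
		print "$token  $acc  $db{$token}{$acc}\n";
	}
}
```

In this example the one-off token "9:43" disappears completely, while the token 123456 keeps only its significant account entry.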

Minimal Length

Tokens with only a few characters may appear to be insignificant. However, experiments showed that this is not always the case. Instead of pruning short tokens, it is recommended to use Minimal Weight, which gets rid of insignificant tokens anyway. If a short token has entries with a high weight, it should not be removed; removing it may hurt the matching algorithm.

Do not use this feature except for experiments.

Bonus Allocation

Some experiments have been done with applying a bonus to weights, e.g. doubling the weight if it is above a certain threshold or adding a constant bonus (e.g. 3). However, this led to false positives, that is, transactions were mapped to an account and confirmed when they shouldn't have been. Since this kind of error is really hard to catch, it is strongly advised not to allocate bonuses.

Script

Parameters

All parameters have to be manually set in the code:

$pruneObsoleteAccs
Set to 0 to disable this pruning method.
$obsoleteAccsFile
Name of the file with obsolete accounts or obsolete top-level accounts.
One account per row. Do not leave an empty last row: an empty entry matches every account name.
$pruneMinimalWeight = 1
Set to 0 to not enforce a minimal weight.
Use this option only if you have been training Bayes for a while; otherwise you will keep it from learning.
$minimalWeight = 3
Minimal weight enforced. For me, 3 worked fine.
A lower value is more cautious, but also prunes less.
$pruneShortTokens = 0
Set to 1 to remove short tokens, regardless of the weights of the accounts they contain. This is not recommended!
$minimalTokenLenght = 2
Yes, there is a typo in the variable name. This is the minimal length for the above option; anything shorter will be removed.
Not recommended.
$bonusHighWeights
Change this to 1 to use Bonus Allocation. Not recommended!
$bonusMinWeight
Minimal weight to receive bonus.
$bonusAmount
Bonus allocated.
$inFile
Name of the uncompressed original XML.
$outFile
Name of the output file.

Code

#!/usr/bin/perl
use strict;

my $pruneObsoleteAccs = 1;
my $obsoleteAccsFile = "obsoleteAccs.txt";
my @obsoleteAccs;
my $obsAcc;
my $pruneMinimalWeight = 0;
my $minimalWeight = 3;
my $pruneShortTokens = 0;
my $minimalTokenLenght = 2;
my $bonusHighWeights = 0;
my $bonusMinWeight = 6;
my $bonusAmount = 3;

my $inFile = "Haushaltsbuch v11.xml";
my $outFile = "Haushaltsbuch v11 pruned.xml";
my @lines;
my $line;
my $row;

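# Read in obsolete account names, one per line.
if ($pruneObsoleteAccs) {
	open my $oac, '<', $obsoleteAccsFile or die "$obsoleteAccsFile: $!";
	push @obsoleteAccs, <$oac>;
	close $oac or die "$obsoleteAccsFile: $!";
	chomp(@obsoleteAccs);
	# Drop empty lines: an empty prefix would match (and delete) every account.
	@obsoleteAccs = grep { length } @obsoleteAccs;
}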

# Read in XML
open my $ifh, '<', $inFile or die "$inFile: $!";
push @lines, <$ifh>;
close $ifh or die "$inFile: $!";
chomp @lines;

# Create output XML
open my $ofh, '>', $outFile or die "$outFile: $!";
$row = 0;
$line = $lines[$row];
while (defined $line && $line !~ /import-map-bayes/) {
	print $ofh ($line . "\n");
	$row = $row +1;
	$line = $lines[$row];
}
die "$inFile: no import-map-bayes section found\n" unless defined $line;
print $ofh ($line . "\n");  # This line contains "import-map-bayes"
$row = $row +1;
$line = $lines[$row];
print $ofh ($line . "\n");  # "Frame" row of import-map.
$row = $row +1;
$line = $lines[$row];

my $currentToken;
my $currentAcc;
my $currentWght;
my $cntSlotValue = 0; 			# If this value is negative, continue with writing everything out.

my $importOpen = 1;
my $tokenOpen = 0;
my $accOpen = 0;
my %tokenAccs;
my $value;

my $cntPrunedTokens = 0;
my $cntPrunedAccs = 0;

while ($importOpen && defined $line) {	# also stop if the input ends unexpectedly
	if ($line =~ /<slot>/) {			# slot starts
		if ($tokenOpen) {
			$accOpen = 1;
		} else {
			$tokenOpen = 1;
			$accOpen = 0;
		}
	} elsif ($line =~ /<\/slot>/) {	# slot ends
		if ($accOpen) {
			$accOpen = 0;
		} elsif ($tokenOpen) {
			# Token slot ends: prune %tokenAccs, then write out what remains.

			# Short Tokens
			if ($pruneShortTokens) {
				if (length($currentToken) < $minimalTokenLenght) {
					%tokenAccs = ();
				}							
			}
			
			# Obsolete Accounts.
			if ($pruneObsoleteAccs) {
				foreach $currentAcc (keys %tokenAccs) {
					foreach $obsAcc (@obsoleteAccs) {
						if ($currentAcc =~ /^\Q$obsAcc\E/) {	# \Q..\E: match the name literally
							delete $tokenAccs{$currentAcc};
							$cntPrunedAccs = $cntPrunedAccs +1;
							last;
						}
					}
				}
			}
			
			# Minimal Weight
			if ($pruneMinimalWeight) {
				foreach $currentAcc (keys %tokenAccs) {
					if ($tokenAccs{$currentAcc} < $minimalWeight) {
						delete $tokenAccs{$currentAcc};
						$cntPrunedAccs = $cntPrunedAccs +1;
						# print STDOUT "Minimal Weight: Deleted $currentAcc\n";
					}
				}
			}
			
			
			# Write all remaining entries of %tokenAccs to $ofh.
			if (%tokenAccs) { # only if accounts remain in tokenAccs.
				print $ofh "        <slot>\n";
				print $ofh "          <slot:key>$currentToken<\/slot:key>\n";
				print $ofh "          <slot:value type=\"frame\">\n";
		
				# Write remaining accs.
				foreach $currentAcc (keys %tokenAccs) {
					print $ofh "            <slot>\n";
					print $ofh "              <slot:key>$currentAcc<\/slot:key>\n";
					$currentWght = $tokenAccs{$currentAcc};
					print $ofh "              <slot:value type=\"integer\">$currentWght<\/slot:value>\n";
					print $ofh "            <\/slot>\n";
				} 
				
				# close Token
				print $ofh "          <\/slot:value>\n";
				print $ofh "        <\/slot>\n";
			} else {
				# token gets dropped.
				$cntPrunedTokens = $cntPrunedTokens + 1;
			}
			# reset variables
			%tokenAccs = ();
			$accOpen = 0;
			$tokenOpen = 0;
		} else {
			$importOpen = 0;
			print $ofh "<\/slot:value>\n"; # Bayes ends
			print $ofh "<\/slot>\n";
		}
	} elsif ($line =~ /<slot:value type="integer">(.*)<\/slot:value>/) { # <slot:value type="integer">1</slot:value>
		$tokenAccs{$currentAcc} = $1; 
		if ($bonusHighWeights) {
			if ($tokenAccs{$currentAcc} >= $bonusMinWeight) {
				$tokenAccs{$currentAcc} = $tokenAccs{$currentAcc} + $bonusAmount;
			}
		}
	} elsif ($line =~ /^<slot:value type="frame">/){							
		# ignore the slot:value="Frame"
	} elsif ($line =~ /^<\/slot:value>/){
		# ignore the slot:value="Frame"
		if (!$tokenOpen) {
			$importOpen = 0;
			print $ofh ($line . "\n");  # ending "Frame" row of import-map.
		}
	} elsif ($line =~ /<slot:key>(.*)<\/slot:key>/) {
		if ($accOpen) {
			$currentAcc = $1;
			$tokenAccs{$currentAcc}= 0;
		} else {
			$currentToken = $1;
		}
	} 
	$row = $row +1;
	$line = $lines[$row];
}

# Write back everything
while ($row <= $#lines) {
	print $ofh ($line . "\n");  
	$row = $row +1;
	$line = $lines[$row];
}
close $ofh or die "$outFile: $!";

print STDOUT "Accounts removed: $cntPrunedAccs\n";
print STDOUT "Tokens removed:   $cntPrunedTokens\n";