GnuCash XML format

Revision as of 19:39, 14 August 2018 by Sunfish62 (talk | contribs) (Update reference to RELAX NG Schema)

This article collects some notes about the XML file format of GnuCash. So far it is just descriptive, and neither normative nor authoritative.

Beginning with version 1.6, the primary GnuCash storage mechanism is an XML file. The file is optionally compressed with gzip; compression is controlled by the preference at Edit→Preferences→General→Use file compression.
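Because compression is optional, a program reading a data file cannot assume either form. The following is a minimal sketch, not taken from GnuCash itself, that checks for the two-byte gzip magic number and decompresses only when it is present. The function name read_gnucash_bytes is hypothetical.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def read_gnucash_bytes(path):
    """Return the XML bytes of a data file, decompressing only if needed.

    Hypothetical helper: it reads the whole file into memory, which is
    fine for a sketch but may not suit very large books.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

The returned bytes can then be handed to any XML parser, regardless of how the file was saved.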

There is a non-normative RELAX NG schema for the XML file format (gnucash-v2.rnc). There are also DTD schema definitions, but these are outdated and do not define the current format correctly (src/doc/xml).

Please keep in mind that the GnuCash 1.8 series used the libxml1 library for XML access, whereas 1.9 and later use the libxml2 library. Some behavior regarding XML files is therefore quite different in 1.8 compared to later versions.

XML files written by GnuCash 1.8 versions are missing XML namespace declarations that are required by most XML processing software (see also FAQ#Q: How can I export data?). See GnuCash Tutorial and Concepts Guide, Appendix A, part 5: Converting XML GnuCash File for the missing declarations. From version 1.8.5 onwards, GnuCash can read XML files containing these declarations (1.8.5 release notes). From 1.9.0 onwards, GnuCash can write the required namespace declarations as well.

Many elements in the XML file are identified by Globally Unique Identifiers (GUID). GnuCash includes its own GUID implementation.
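As an illustration, the account identifier used in the script later in this article, 0d69c3557f4d9340198bfd151f9e13cb, is a string of 32 hexadecimal digits, i.e. 128 bits. The sketch below assumes that every GUID in the file has this lowercase form and checks a string against it. It also produces a value of the same shape with Python's uuid module, which is a stand-in and not GnuCash's own GUID implementation. Both function names are hypothetical.

```python
import re
import uuid

# Assumption: GUIDs appear as exactly 32 lowercase hexadecimal digits.
_GUID_RE = re.compile(r"[0-9a-f]{32}")


def looks_like_guid(text):
    """Return True if text has the 32-hex-digit shape described above."""
    return _GUID_RE.fullmatch(text) is not None


def new_guid_like():
    """Return a random 128-bit value as 32 hex digits.

    This uses uuid4 as a stand-in; it is NOT the generator GnuCash uses.
    """
    return uuid.uuid4().hex
```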

Character encoding

GnuCash 1.8.x interpreted XML documents using a character encoding determined by operating-system–level locale settings, and so did not include an encoding declaration in the opening XML text declaration. (The locale setting here constitutes a "higher-level protocol" in W3C vernacular [1].) It serialized non-ASCII octets (i.e. those with the high-order bit set) as decimal numeric character references, one reference per octet. (E.g., under a UTF-8 locale an em-dash, U+2014, whose UTF-8 octets are E2 80 94, is written as “&#226;&#128;&#148;”.)

On the other hand, GnuCash 1.9.0 and later always writes the XML document in UTF-8 encoding and also includes the appropriate encoding declaration in the opening XML text declaration. (The serialization may still use decimal numeric character references, but this has not been verified.)
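One practical consequence is that the presence of an encoding declaration hints at which behavior produced a file. The sketch below is an illustration only: it looks for an encoding pseudo-attribute in the XML declaration at the start of an uncompressed file's bytes. The function name declared_encoding is hypothetical, and treating a missing declaration as "written by 1.8.x" is an assumption rather than a reliable test.

```python
import re

# Matches the encoding pseudo-attribute inside a leading <?xml ... ?>.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(xml_bytes):
    """Return the declared encoding name, or None if none is declared."""
    if xml_bytes.startswith(UTF8_BOM):
        xml_bytes = xml_bytes[len(UTF8_BOM):]
    match = _DECL_RE.match(xml_bytes)
    return match.group(1).decode("ascii") if match else None
```

A result of None means the reader has to fall back on other knowledge, such as the locale under which the file was written.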

For example, in 1.8.x the UTF-8 encoding of the Cyrillic capital letter “Б” (octets D0 91) is written as “&#208;&#145;”. As the following Python script shows, the parsed text must be transcoded to recover the original Unicode text. (This script uses the 4Suite XML library.)

#! /usr/bin/python2.4

from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.XPath import Evaluate
from Ft.Xml.XPath.Context import Context

# precondition: foo.xac was created by GnuCash with LANG=en_US.UTF-8
doc = NonvalidatingReader.parseUri('file:///tmp/foo.xac')
context = Context(doc, processorNss={'cd'    : "http://www.gnucash.org/XML/cd",
                                     'book'  : "http://www.gnucash.org/XML/book",
                                     'gnc'   : "http://www.gnucash.org/XML/gnc",
                                     'cmdty' : "http://www.gnucash.org/XML/cmdty",
                                     'trn'   : "http://www.gnucash.org/XML/trn",
                                     'split' : "http://www.gnucash.org/XML/split",
                                     'act'   : "http://www.gnucash.org/XML/act",
                                     'price' : "http://www.gnucash.org/XML/price",
                                     'ts'    : "http://www.gnucash.org/XML/ts",
                                     'slot'  : "http://www.gnucash.org/XML/kvpslot",
                                     'cust'  : "http://www.gnucash.org/XML/cust",
                                     'addr'  : "http://www.gnucash.org/XML/custaddr"})

accountName = Evaluate('/gnc-v2/gnc:book/gnc:account[act:id="0d69c3557f4d9340198bfd151f9e13cb"]/act:name/text()',
                       context=context)[0]

# object of type "str" (is actually UTF-8–encoded, not latin1!):
name_raw = accountName.data.encode('latin1')

# object of type "unicode":
name_unicode = name_raw.decode('utf-8')

# objects of type "str":
name_koi8r = name_unicode.encode('koi8-r')
name_utf8  = name_unicode.encode('utf-8')
name_utf16 = name_unicode.encode('utf-16')

assert name_utf8 == accountName.data.encode('latin1')
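
The script above targets Python 2.4 and 4Suite. As a rough sketch of the same transcoding step in Python 3 using only the standard library, the following parses a small inline document and recovers the account name. The sample document is hypothetical and much smaller than a real data file; the element path and namespace URIs are the ones used in the script above, and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the script above; only the two needed here.
NS = {
    "gnc": "http://www.gnucash.org/XML/gnc",
    "act": "http://www.gnucash.org/XML/act",
}

# Hypothetical miniature document in the style of a 1.8.x file written
# under a UTF-8 locale: the two UTF-8 octets of "Б" appear as two
# decimal character references.
SAMPLE = (
    '<gnc-v2 xmlns:gnc="http://www.gnucash.org/XML/gnc"'
    ' xmlns:act="http://www.gnucash.org/XML/act">'
    "<gnc:book><gnc:account>"
    "<act:name>&#208;&#145;</act:name>"
    "<act:id>0d69c3557f4d9340198bfd151f9e13cb</act:id>"
    "</gnc:account></gnc:book></gnc-v2>"
)


def recover_text(parsed):
    """Map one-code-point-per-octet text back to the intended Unicode."""
    # latin-1 maps code points 0..255 to the same byte values, so this
    # restores the original octets, which are then decoded as UTF-8.
    return parsed.encode("latin-1").decode("utf-8")


def account_name(root, guid):
    """Return the raw act:name text of the account with the given id."""
    for account in root.iterfind("gnc:book/gnc:account", NS):
        if account.findtext("act:id", namespaces=NS) == guid:
            return account.findtext("act:name", namespaces=NS)
    return None


root = ET.fromstring(SAMPLE)
raw = account_name(root, "0d69c3557f4d9340198bfd151f9e13cb")
name = recover_text(raw)
print(name)
```

As in the original script, the latin-1 round trip only makes sense for files whose octets were UTF-8 when written; a file produced under a different locale would need that locale's encoding in the final decode step.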

Validation

The RELAX NG schema file mentioned above can be used to validate an uncompressed GnuCash XML data file. This requires that you:

  • save your GnuCash data file in uncompressed format
  • use an XML validator, e.g. Jing, which is used in this example.

As stated above, the GnuCash data file is by default stored using gzip compression, so you must first save your data file in an uncompressed state. The easiest way to do this is to change the storage preference and save your file. (Remember to reset the preference afterwards.)

Then download Jing and run the following command:

 jing -c path-to-gnucash-v2.rnc path-to-your-datafile.gnucash

jing will report any validation errors it finds.
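
If the check is to be scripted, the same command can be run from Python. This is a sketch under assumptions: it expects an executable named jing on the PATH, as in the command above (some installations only ship a jar file, which would need a different invocation), and the function names are hypothetical.

```python
import shutil
import subprocess


def jing_command(schema_path, data_path, jing="jing"):
    """Build the argument list for the command shown above."""
    # -c tells Jing that the schema uses the RELAX NG compact syntax.
    return [jing, "-c", schema_path, data_path]


def validate(schema_path, data_path):
    """Run Jing and return (exit status, combined output text)."""
    if shutil.which("jing") is None:
        raise FileNotFoundError("no 'jing' executable found on PATH")
    result = subprocess.run(
        jing_command(schema_path, data_path),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr
```

Calling validate("gnucash-v2.rnc", "book.gnucash") with real paths returns whatever Jing printed, so the messages can be logged or inspected further.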

Note
The validation should not be considered authoritative, as the schema is not updated or tested very often. Validation errors can therefore just as easily be due to errors in the schema as to errors in the data file.

Based on information provided by Baptiste Carvello in bug 680887.
