Difference between revisions of "GnuCash XML format"

(added links for GUID files in 2.2)
m (properly refer to NCRs ; some->most software requires namespace decls)
Revision as of 17:43, 28 April 2012

This article collects some notes about the XML file format of GnuCash. So far it is just descriptive, and neither normative nor authoritative.

Beginning with version 1.6, the primary GnuCash storage mechanism is an XML file. The file is optionally compressed with gzip (“Edit” menu → “Preferences” → “General” → “Use file compression”).
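Because compression is optional, a reader cannot know in advance whether a given file is gzipped. The following is a minimal Python 3 sketch, not GnuCash code, that checks for the two-byte gzip magic number and decompresses only when needed. The function name `read_gnucash_bytes` is hypothetical.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def read_gnucash_bytes(path):
    """Return the raw XML bytes of a file, decompressing it if it is gzipped."""
    with open(path, "rb") as f:
        head = f.read(2)
    # Pick the gzip reader only when the magic number is present.
    opener = gzip.open if head == GZIP_MAGIC else open
    with opener(path, "rb") as f:
        return f.read()
```

The result is a byte string. Decoding it into text is a separate step that depends on the GnuCash version that wrote the file.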

There is a non-normative RELAX NG schema for the 1.8/2.0 XML file format (src/doc/xml/gnucash-v2.rnc). There were also DTD schema definitions, but these are outdated and do not define the current format correctly (src/doc/xml (1.8)).

Please keep in mind that GnuCash series 1.8.x uses the libxml1 library for XML access, whereas 1.9.0 and later uses the libxml2 library. Some behaviour regarding XML files is therefore quite different in 1.8.x compared to 1.9.x/2.0.0.

XML files written by all GnuCash 1.8.x versions are missing XML namespace declarations that are required by most XML processing software (see also the FAQ entry “How can I export data?”). See the GnuCash Tutorial and Concepts Guide, Appendix A, part 5, “Converting XML GnuCash File”, for the missing declarations. From version 1.8.5 onwards GnuCash is able to read XML files containing these declarations (see the 1.8.5 release notes). From 1.9.0 onwards GnuCash writes the required namespace declarations as well.
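One way to make a 1.8.x file acceptable to a namespace-aware parser is to add the declarations to the root element before parsing. The Python 3 sketch below assumes the root start tag is a bare `<gnc-v2>` and uses only the prefixes that appear in the example script in this article. The Guide's appendix remains the authoritative list, and the names `GNC_PREFIXES` and `add_namespace_declarations` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping copied from the example script in this article.
# It may be incomplete; the Guide's appendix is the authoritative list.
GNC_PREFIXES = {
    "cd": "http://www.gnucash.org/XML/cd",
    "book": "http://www.gnucash.org/XML/book",
    "gnc": "http://www.gnucash.org/XML/gnc",
    "cmdty": "http://www.gnucash.org/XML/cmdty",
    "trn": "http://www.gnucash.org/XML/trn",
    "split": "http://www.gnucash.org/XML/split",
    "act": "http://www.gnucash.org/XML/act",
    "price": "http://www.gnucash.org/XML/price",
    "ts": "http://www.gnucash.org/XML/ts",
    "slot": "http://www.gnucash.org/XML/kvpslot",
    "cust": "http://www.gnucash.org/XML/cust",
    "addr": "http://www.gnucash.org/XML/custaddr",
}


def add_namespace_declarations(xml_text, prefixes=GNC_PREFIXES):
    """Insert xmlns declarations into a bare <gnc-v2> start tag.

    Text whose root tag already carries attributes is returned unchanged.
    """
    decls = "".join(
        ' xmlns:%s="%s"' % (prefix, uri)
        for prefix, uri in sorted(prefixes.items())
    )
    return re.sub(
        r"<gnc-v2\s*>", lambda m: "<gnc-v2" + decls + ">", xml_text, count=1
    )


sample = '<?xml version="1.0"?>\n<gnc-v2>\n<gnc:book version="2.0.0"/>\n</gnc-v2>\n'
root = ET.fromstring(add_namespace_declarations(sample))
print(root[0].tag)  # {http://www.gnucash.org/XML/gnc}book
```

Without the added declarations, `ET.fromstring(sample)` fails with an "unbound prefix" parse error.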

Many elements in the XML file are identified by GUID. GnuCash includes its own GUID implementation (guid.h, guid.c (1.8); guid.h, guid.c (2.2)).

Character encoding

GnuCash 1.8.x interprets XML documents using a character encoding determined by operating-system–level locale settings, and so does not include an encoding declaration in the opening XML text declaration. (The locale setting here constitutes a “higher-level protocol” in W3C vernacular [1].) GnuCash serializes non-ASCII octets (i.e. those with the high-order bit set) as decimal numeric character references. (E.g., an em-dash is represented as “&#8212;”.)
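The octet-wise serialization described above can be modelled in a few lines of Python 3. This is a sketch of the described behaviour, not GnuCash's own code. The function name `serialize_octets` is hypothetical, and the escaping of markup characters such as `&` and `<` is deliberately left out.

```python
def serialize_octets(text, encoding):
    """Encode text in the given (locale) encoding, then write every octet
    with the high-order bit set as a decimal numeric character reference."""
    pieces = []
    for octet in text.encode(encoding):
        if octet < 0x80:
            pieces.append(chr(octet))      # plain ASCII is copied through
        else:
            pieces.append("&#%d;" % octet)  # one reference per octet
    return "".join(pieces)


print(serialize_octets("Б", "utf-8"))  # &#208;&#145;
```

Note that each reference carries one octet of the locale encoding, not one Unicode code point, so the same text produces different references under different locales.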

GnuCash 1.9.0 and later, on the other hand, always writes the XML document in UTF-8 and includes the appropriate encoding declaration in the opening XML text declaration. (The serialization is probably still done as decimal numeric character references, but this has not been verified.)
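Since only the newer versions write an encoding declaration, its presence can serve as a rough hint about which convention a file follows. The Python 3 sketch below extracts the declared encoding from the start of the raw bytes. The name `declared_encoding` is hypothetical, and treating a missing declaration as "written by 1.8.x" is an assumption rather than a guarantee.

```python
import re

# Matches an encoding pseudo-attribute inside a leading <?xml ... ?> declaration.
ENCODING_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(xml_bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = ENCODING_RE.match(xml_bytes)
    return match.group(1).decode("ascii") if match else None


print(declared_encoding(b'<?xml version="1.0" encoding="utf-8" ?>\n<gnc-v2/>'))
print(declared_encoding(b'<?xml version="1.0"?>\n<gnc-v2/>'))
```

The first call prints `utf-8` and the second prints `None`.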

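For example, in 1.8.x with a UTF-8 locale, the Cyrillic capital letter “Б” (U+0411) is encoded as the two octets 0xD0 0x91 and written as “&#208;&#145;”. An XML parser reads these references as the two characters U+00D0 and U+0091, so, as the following Python script shows, the parsed text must be re-encoded as Latin-1 and then decoded as UTF-8 to recover the original Unicode text. (This script uses the 4Suite XML library.)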

#! /usr/bin/python2.4

from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.XPath import Evaluate
from Ft.Xml.XPath.Context import Context

# precondition: foo.xac was created by GnuCash with LANG=en_US.UTF-8
doc = NonvalidatingReader.parseUri('file:///tmp/foo.xac')
context = Context(doc, processorNss={'cd'    : "http://www.gnucash.org/XML/cd",
                                     'book'  : "http://www.gnucash.org/XML/book",
                                     'gnc'   : "http://www.gnucash.org/XML/gnc",
                                     'cmdty' : "http://www.gnucash.org/XML/cmdty",
                                     'trn'   : "http://www.gnucash.org/XML/trn",
                                     'split' : "http://www.gnucash.org/XML/split",
                                     'act'   : "http://www.gnucash.org/XML/act",
                                     'price' : "http://www.gnucash.org/XML/price",
                                     'ts'    : "http://www.gnucash.org/XML/ts",
                                     'slot'  : "http://www.gnucash.org/XML/kvpslot",
                                     'cust'  : "http://www.gnucash.org/XML/cust",
                                     'addr'  : "http://www.gnucash.org/XML/custaddr"})

accountName = Evaluate('/gnc-v2/gnc:book/gnc:account[act:id="0d69c3557f4d9340198bfd151f9e13cb"]/act:name/text()',
                       context=context)[0]

# object of type "str": encoding as latin1 maps each code point back to its
# octet, so the result actually holds UTF-8 encoded text, not latin1 text:
name_raw = accountName.data.encode('latin1')

# object of type "unicode":
name_unicode = name_raw.decode('utf-8')

# objects of type "str":
name_koi8r = name_unicode.encode('koi8-r')
name_utf8  = name_unicode.encode('utf-8')
name_utf16 = name_unicode.encode('utf-16')

assert name_utf8 == accountName.data.encode('latin1')
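The same transcoding step can be reproduced in Python 3 with only the standard library. The sketch below is not part of the original script. It assumes the file was written under a UTF-8 locale, and the helper name `recover_text` is hypothetical.

```python
import xml.etree.ElementTree as ET


def recover_text(parsed_text, locale_encoding="utf-8"):
    """Undo the octet-wise serialization of GnuCash 1.8.x.

    Encoding as Latin-1 maps each code point U+0000..U+00FF back to the
    octet it stands for; decoding those octets with the locale encoding
    then yields the original text. A character above U+00FF raises
    UnicodeEncodeError, which indicates the text was not serialized this way.
    """
    return parsed_text.encode("latin-1").decode(locale_encoding)


element = ET.fromstring("<name>&#208;&#145;</name>")
# The parser yields two characters, one per numeric character reference.
assert element.text == "\u00d0\u0091"
print(recover_text(element.text))  # Б
```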
