EPUB: Chapter and Verse

Inasmuch as XML Prague is the best XML conference in Europe that I know of, I am pleased to be again co-presenting with Mark Howe of Cyberporte at XML Prague 2011 on 26-27 March. Our talk this year is EPUB: Chapter and Verse:

The link between the Bible and publishing technology is at least as old as Gutenberg’s press. 400 years after the publication of the King James Bible, we were asked to convert five modern French Bible translations from a widely-used ad hoc TROFF-like markup scheme used to produce printed Bibles to a standard XML vocabulary, and then to EPUB. We opted to use XSLT 2.0 and Ant to perform all stages of the conversion process. Along the way we discovered previously unimagined creativity in the original markup, even within a single translation. We cursed the medieval scholars and the modern editors who have colluded to produce several mutually incompatible document hierarchies. We struggled to map various typesetting features to EPUB. E-Reader compatibility made us nostalgic for the browser wars of the 90s. The result is osisbyxsl, a soon-to-be open source solution for Bible EPUB origination.

1.5 talks at XML Prague 2010

Inasmuch as I was fortunate to again be selected to present or co-present two talks, I will be at XML Prague again this year. Going to a technical XML conference, in Prague, in the Spring, again, will be good; presenting the same number of talks as last year is just a bonus.

The talks are:

  • What XSL 2.0 means for implementers and users — Discusses the changes that will have to take place under the hood of any XSL formatter that supports XSL 2.0 and what those additional capabilities can bring to your stylesheets.
  • Real time, all the time, ragtime XML — An update on the capabilities of Xcruciate.

1.5 @ Prague

I will have the pleasure of speaking twice at XML Prague 2009, once on my own and once as a co-presenter:

  • Testing XSLT — An update and expansion of my previous talk on testing XSLT presented in less time. How can that be? Simple, really: put more in the conference paper, direct attendees to the paper, and spend more of the presentation doing demonstrations.
  • Imagining, building and using an XSLT virtual machine — The why and what of the open source Xcruciate XML-based server. Or the why, what, and Howe of Xcruciate, since I’m the second presenter with Mark Howe of Cyberporte, who provides the ideas behind Xcruciate and its related projects.

xs3p is not the secret sauce

I used to think that the open source xs3p schema documentation generator stylesheet from the now-defunct http://titanium.dstc.edu.au/ was the secret sauce behind the remarkably similar graphical XML schema representations of both <oXygen/> and XML Spy. I was wrong: a modified version of xs3p is bundled with <oXygen/> and is used when generating printed documentation, and xs3p may still be included in XML Spy (though it’s unlikely since it’s currently not listed on their third-party licenses page), but even in its titanium days, it didn’t do any graphical representations of a schema.

Does anybody know of an open source toolkit that can produce that sort of graphical representation?

BOM in UTF-8: good, bad, or ugly?

The usefulness or otherwise of U+FEFF (ZERO WIDTH NO-BREAK SPACE and BYTE ORDER MARK) in UTF-8 has been subject to reinterpretation over the years. It wasn’t mentioned in the original XML 1.0 Recommendation but was added later, rather like how its use was added to the Unicode Standard.

In the Unicode Standard 2.0, there was no mention of U+FEFF with UTF-8, either in the section on the BOM or in the appendix defining UTF-8.

In the Unicode Standard 3.0, section 13.6, “Specials”, includes:

Although there are never any questions of byte-order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked.

In the Unicode Standard 5.0, section 3.10, “Unicode Encoding Schemes”, includes:

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

So in the Unicode Standard it’s gone from irrelevant to useful to “Oh, if you must”.

(BTW, in other reinterpretations, “Unicode Encoding Scheme” results from splitting the meaning of “UTF”, and the use of U+FEFF to indicate non-breaking is deprecated these days.)

The Unicode FAQ both lists its use as a signature and says to avoid its use where “byte oriented protocols expect ASCII characters at the beginning of a file”. However, I don’t think that XML necessarily counts as one such byte oriented protocol.
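The signature described in the Unicode 5.0 passage is easy to check for in practice. Here is a minimal Python sketch, assuming the data is already in memory as bytes; the function name `strip_utf8_bom` and the sample document are made up for illustration, while `codecs.BOM_UTF8` and the `utf-8-sig` codec come from the Python standard library.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_bom, payload), dropping a leading <EF BB BF> if present."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# Hypothetical sample: an XML document saved with a UTF-8 signature.
sample = codecs.BOM_UTF8 + '<?xml version="1.0"?><doc/>'.encode("utf-8")

had_bom, payload = strip_utf8_bom(sample)

# Plain "utf-8" decoding keeps U+FEFF as the first character;
# "utf-8-sig" silently drops it when it is there.
kept = sample.decode("utf-8")
dropped = sample.decode("utf-8-sig")
```

A consumer that expects ASCII at the very start of the stream would trip over `kept`, which is the situation the FAQ warns about.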

Windows drive names with Cygwin xsltproc & xmllint

Cygwin may be the only way to stay sane while using Windows, but it has its own Unix-like notion for drive names, e.g., “/cygdrive/c/” instead of “c:”. Which is fine, except when you want to use both Java XML tools, which understand only the “c:” form, and Cygwin tools, which tend to understand only the “/cygdrive/c/” form.

The Cygwin xsltproc and xmllint complain when you use them with files containing Windows drive names in system identifiers, so the second time it happened, I wrote a simple XML catalog file to map the Windows drive names to the Cygwin paths.

Put this as the contents of /etc/xml/catalog (not catalog.xml!) and the Cygwin xsltproc, etc., will handle Windows drive names:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

You will have to add a suitable rewriteSystem for each additional drive that you use.
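With more than a couple of drives, the entries can be generated rather than typed by hand. The following Python sketch is only an illustration under stated assumptions: `rewriteSystem`, `systemIdStartString`, and `rewritePrefix` are the element and attribute names defined by OASIS XML Catalogs, but the exact prefix strings (`c:/` mapped to `/cygdrive/c/`) and the function name `cygwin_catalog` are my guesses at a workable mapping, so adjust them to match the system identifiers your files actually use.

```python
from xml.sax.saxutils import quoteattr

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def cygwin_catalog(drives):
    """Build catalog text with one rewriteSystem entry per drive letter."""
    lines = ['<?xml version="1.0"?>', '<catalog xmlns="%s">' % CATALOG_NS]
    for drive in drives:
        # Assumption: system identifiers start with e.g. "c:/" exactly as
        # written here; add "C" as well if your files use upper case.
        start = "%s:/" % drive
        prefix = "/cygdrive/%s/" % drive.lower()
        lines.append(
            "  <rewriteSystem systemIdStartString=%s rewritePrefix=%s/>"
            % (quoteattr(start), quoteattr(prefix))
        )
    lines.append("</catalog>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a catalog covering drives c: and d: for review before
    # saving it as /etc/xml/catalog.
    print(cygwin_catalog(["c", "d"]), end="")
```

Checking the output against one real file with xmllint before relying on it is worthwhile, since the prefix strings above are untested guesses.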