Kindle support for Unicode pt2: how to use Unicode

Last time I posted about how Kindles can support Unicode, despite rumors to the contrary. This time I’m going to give some practical advice on what Unicode is and how, if you are self-publishing eBooks, you can use it in your books.

It’s a fairly long post, so here’s a:

Cut-out-and-keep executive summary

  • Kindle support for Unicode is very good, although weaker with the earlier models (I mean specifically Kindle 1, Kindle 2 and Kindle DX; Kindle 3 support is excellent).
  • You need to state somewhere in the file you upload that it is encoded in UTF-8. I explain how to do this for saving as an html file from Word and then uploading to KDP, and also for people working directly with html or xhtml.
  • Support for Unicode with ePUB readers is so much weaker than with Kindle devices that if you use Unicode glyphs, you need to assume that they won’t show up for some people.
  • The world of eBooks is moving fast. I wrote this post in December 2013. I expect UTF-8 to remain a dominant encoding system for many years to come, but the idiosyncrasies of Amazon KDP and ePUB support are likely to change more rapidly, so this post will date.

What is Unicode and why should you care?

Since the earliest days of computers and telecommunication there’s been a need for a standard way to encode text. Documents are stored digitally as bits and bytes: numbers, essentially. So what number or numbers represent an upper case ‘A’, and what number or numbers represent a dollar symbol? If Computer A wants to send a document to Computer B, then both computers need to agree on the same encoding system, otherwise what looks good on Computer A will look like gibberish on Computer B.

What we require is an independent standards body to define the encoding system and the codes within it. One of the important early standards (from the 1960s) was ASCII 127. The coding system represented each character as a seven-bit binary number, and the list of which of the resulting 128 possible ‘code points’ (numbered 0 to 127) corresponded to which character was defined like this (http://www.asciitable.com/). With so few code points, and some of them used for control characters (like one to ring the bell – ASCII 127 was used with teletypes), there wasn’t room for some common characters. We get upper case and lower case ‘A’ through ‘Z’ but we don’t get any accented characters. We get numbers and basic mathematical operators. We get a dollar symbol, but we don’t get a pound (£) sign. We get a basic ‘typewriter’ apostrophe and quotes, but we don’t get the proper curled versions (what Microsoft calls smart quotes) that have always been the norm in books and magazines.

The need to limit character encoding to seven bits is a constraint that has long since become obsolete, so people have naturally wanted more characters. But the same problem persists: everyone needs to agree the rules for how those characters should be stored in computer files. ASCII 127 was pretty dominant for many years. Today there are many rival schemes for encoding characters, which is why I’ve no doubt you’ve seen examples where encoding has gone wrong. The most popular standard at the moment is a form of Unicode called UTF-8. Most websites you see now use UTF-8, so even if you’ve not heard of it, you have definitely used it.

The reason you should care is that if you have any character not in the ASCII 127 character set, then you need to find a way to encode it safely. If the protagonist in your novel speaks some words in Spanish or Polish or Hebrew or whatever, then you need something better than ASCII 127.

It’s not just the threat of getting things wrong; there’s the opportunity too. If you want a hammer and sickle symbol for your Cold War spy thriller, use ☭ (U+262D). Try arrow symbols for your pirate map, like ➪ (U+27AA), or a heart for your ‘I love New York’ t-shirt, or perhaps your historical romance needs a fancy scene break character like ❧ (U+2767). Unicode supplies the answer. (By the way, I’ve double-checked, and all those symbols I’ve just mentioned render fine on my Kindles.)

You don’t need to know how UTF-8 works under the hood in order to make an ebook (or webpage) in UTF-8. Here’s all you need to know.

Suppose you have a webpage that needs something more than the old ASCII 127 characters. Here’s what you do.

Your webpage is written in a markup language called html.

At the top of your html document is a statement that says “I am encoded using UTF-8”. That’s not something human readers get to see; that statement is put in a special place for other software to find.
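To make that concrete, here’s a minimal sketch of such an html file (the title and the heart line are just placeholder content; the meta line in the head is the ‘I am encoded using UTF-8’ statement):

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>My Webpage</title>
</head>
<body>
<p>I ♥ New York</p>
</body>
</html>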

Now suppose someone opens up your webpage in a browser. That might be on an Android phone, iPad, Mac, PC or something else. Doesn’t matter. The browser looks at your html page and looks for the statement that tells it how you encoded all your characters.

The browser recognizes UTF-8 as one of the encoding systems it understands. Now it can separate out all the characters in your webpage and knows which Unicode code point each one represents.

What’s a Unicode code point? Take a look at the screenshot I showed you in my last post on Unicode.

The screen is listing separate Unicode code points. So the code point for the ‘black heart suit’ symbol I used in the Jack Fish book (with all the ‘I ♥ New York’ T-shirts I mentioned last time) is U+2665.

So when I enter the heart symbol into my html code, it is saved in the file as the UTF-8 encoding of code point U+2665. When your browser sees that Unicode code point, it knows it has to go away and look up that code number in the current font file and display the pattern it finds for that code. That pattern is called a glyph, and will almost certainly be defined as a vector graphic, so it stays sharp at any size.
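Incidentally, if you are editing the html by hand, there is a second way to get the same result: html lets you write any code point as a ‘numeric character reference’ using its hex number, which works whatever encoding the file is saved in. Both of these lines display the same heart:

<p>I ♥ New York</p>
<p>I &#x2665; New York</p>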

Now, as you might imagine, not every font has every Unicode code point defined. For example, I don’t think the Times New Roman font has U+2665 defined. If your web page is set in Times New Roman, what decent browser software will say to itself is this: “Hmm, I don’t have anything for U+2665 in the current font. I could display an empty box, a question mark, or some other gobbledygook. But that’s a last resort. What I’ll do first is check whether I have a fallback font that does know how to display U+2665. I’ll look in my Arial Unicode MS font first, because that has thousands of glyphs (your magic talking browser might try another font, such as Lucida Grande, on a Mac). Ah, yes. There we are!”
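You can also nudge this process along yourself. In CSS (the styling language used alongside html), the font-family property takes a list of fonts in order of preference, so you can name your own fallbacks explicitly. A sketch (the font choices here are just examples):

p { font-family: "Times New Roman", "Arial Unicode MS", serif; }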

That’s really all you need to do: declare that your file is encoded using UTF-8, and hope that the browser reading your webpage has a glyph for each of your characters in the font you have specified, or in a fallback font.

How your glyph displays depends on how the font designers have decided it should look. I’ve worked a little with Hebrew glyphs in Kindle books and find that there is big variation in how fonts render the same Unicode code point. In my own example of the I love NY t-shirts, the Unicode code point is named by (I presume) the Unicode Consortium as ‘black heart suit’. But on Kindle for iPad, the heart actually comes up red. The same Kindle book on a Kindle Fire (a color device) will show the heart symbol as black.

I’ve used a web page as an example here. But eBooks in Kindle or ePUB format are essentially web pages. Each section of the book is an html page (or a variant called xhtml). There is software embedded in your Sony Reader, Nook, Kindle or whatever that tells the device how to display each page, just the same as Chrome, Safari or Internet Explorer tells a computer or other device how to display a web page.

How to insert Unicode characters using Microsoft Word

Open up the character map (Insert | Symbol) and pick your Unicode symbol, then press the Insert button at the bottom to put it into your text. One problem, though: check that the little box at bottom-right says ‘Unicode’ or can be set to Unicode. If you click on the little down arrow and Unicode isn’t an option, then the symbol isn’t a glyph for a Unicode code point. You can still use the symbol in a paperback if you embed the font in the PDF, but it will come out as gobbledygook in an eBook (unless that same font is embedded in the eBook, something I don’t advise).
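One shortcut worth knowing if you already have the code point: in Word on Windows, you can type the hex number and then press Alt+X, and Word converts it into the character. For example, type 2665 and press Alt+X, and Word replaces it with ♥. (That’s how it behaves in my Word 2013; I can’t vouch for every version.)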

To see the widest choice of Unicode code point glyphs in the character map, you will want a font specially designed to cover a large slice of Unicode. In the screenshot below, I’ve selected Arial Unicode MS, which is provided by Microsoft and available on Windows, and on Macs if you have the right Office installation. This font has a huge number of Unicode code point glyphs. For Macs without Arial Unicode MS, try Lucida Grande. There are some free Unicode fonts around, though the purpose behind most is to provide coverage for many languages, rather than fancy characters such as heart symbols. Try looking here: http://en.wikipedia.org/wiki/Open-source_Unicode_typefaces

How to make a Kindle book with UTF-8 if you upload a Word or html file to Amazon KDP or some other auto-converter

For this approach, you need to be able to upload an html file to whatever service makes your Kindle book for you. The simplest way to do this is to save your Word document as html (from the Save As… menu in Word) and then upload the resulting html file directly to Amazon KDP (though you’ll need to read my note in a moment if you include images).

When you save to html, you must set the encoding to UTF-8 as in the following screenshot.

Here I am Saving As… and changing the format (Save as type) to Web Page, Filtered (a slightly more streamlined version of what you would get by saving as an ordinary web page). I click the ‘Tools’ button right at the bottom and pick ‘Web Options’. Then I pick ‘Encoding’ and save the document as Unicode (UTF-8). What this does is put a statement at the top of the html file that says ‘I am encoded using UTF-8’.

I’ve just tested this out myself to double-check it works. I’m writing this post in Word 2013. I’ve saved as html, zipped the result (see next section for why) and uploaded that to Amazon KDP. The result looks great with all my heart symbols and other Unicode fanciness coming out perfectly in the resulting Kindle file.

In fact, here’s a screenshot of my previous blog post saved to html, uploaded to Amazon KDP, downloaded and then sent to my iPad as a Kindle book. I started this whole lengthy post about Unicode because I’d read someone claim online that Kindle books don’t support Unicode. There’s my Unicode heart symbol to prove that isn’t so.

Html and images

This is going a little off-topic, but I can’t talk about uploading html files without a little explanation about images. If I simply saved the Word document for this post as html and uploaded it to Amazon KDP, then all the screenshot images would be missing. It’s easy to fix (so long as you aren’t too bothered about image quality).

Suppose you have a Word document called (naturally) MyDoc.docx. If MyDoc contains images, then when you save you will find a file has been created called MyDoc.html. So far, so simple, but Word will also create a subfolder called ‘MyDoc’ and in there it will place compressed versions of your images, saved as separate files and numbered (e.g. image0001.jpg). For Amazon KDP, what you need to do is create a zip file of your html file and the folder of images. Upload that zip file to Amazon KDP and it will look fine. [Here’s how to zip on Windows (and don’t worry if you don’t have Windows 7, as it’s worked this way for a long time) and on Mac.]
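So, sticking with the hypothetical MyDoc example, the zip file you upload would contain something like this:

MyDoc.html
MyDoc/image0001.jpg
MyDoc/image0002.jpg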

The only problem with this approach is that Word always tries to compress your images when saving to html format. You have some limited control through the ‘Pictures’ tab (to the left of the ‘Encoding’ tab in the Web Options screenshot above), but the normal Word Options setting that allows you to turn off image compression doesn’t apply to saving as html, at least not in my Word 2013. When Amazon KDP builds your Kindle book (or you do it yourself through Kindlegen), it will compress your images anyway, so it might not make a lot of difference for large images; just be aware that when Word saves to html it quietly changes your images.

How to make a Kindle book with UTF-8 if you code your own html

This is the way I make eBooks.

Html files should have the following in the <head> section:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

[This is the statement that Word adds when you save to html and set the encoding to UTF-8.]

For xhtml files you want:

<?xml version="1.0" encoding="utf-8"?>
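Putting that together, here’s a minimal sketch of the top of an xhtml file of the kind you’d find inside an eBook (the DOCTYPE shown is the common XHTML 1.1 one; yours may differ):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Chapter 1</title>
</head>
<body>
<p>I ♥ New York</p>
</body>
</html>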

How to make an ePUB book support UTF-8

I’ve concentrated on Kindle books so far. ePUB format uses xhtml files to store its book content, and these files need the encoding statement I’ve just given (<?xml version="1.0" encoding="utf-8"?>).

In general, ePUB books are trickier to give guidance for than Kindle because, when it comes down to the fine details, there is much more variation in the way the ePUB format is implemented by the various ePUB reader devices and the firmware versions that sit upon them. I’ve read people suggest that because ePUB is an open standard, all you need to do is write one ePUB file and it is guaranteed to work the same way on every device that can show ePUB books. I’m afraid that is far from the truth, but that’s for another post.

When it comes specifically to Unicode support on ePUB, I find support on my Nook Glow, iPad iBooks, and Adobe Digital Editions is good. My Kobo Mini isn’t so good. In my last post I showed some East European Latin extension characters encoded as UTF-8. In that previous post they looked good on my Kobo Mini. But I was cheating! Here’s another screenshot where the Kobo can’t find the right glyph and gives a box with a cross through it – what I call a ‘huh?’ symbol.

So what’s gone wrong?

The problem with the Kobo is that it doesn’t seem capable of working with fallback fonts. If the Kobo comes across a Unicode code point, it looks in the current font to see whether it has a glyph defined for it. If it doesn’t, it gives up. What it doesn’t do is go looking in a fallback font. Which is a shame, because Kobos have good Unicode support in their Georgia font, which can display those characters perfectly.

You could try forcing the font to Georgia, but that’s easily overridden by the user.
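For what it’s worth, forcing the font is just a one-line rule in the ePUB’s stylesheet, something like this sketch:

body { font-family: Georgia, serif; }

But because the reader’s own font setting can trump your stylesheet, I wouldn’t rely on it.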

So when I build eBooks for clients and they have requirements outside of a basic Latin character set, I have a conversation about portability – which basically means how confident we can be that the book will behave as we intend across a range of platforms. In some cases this means we have a higher-spec version of the eBook for the Kindle format, and produce a dumbed-down version for ePUB.

What’s the take-away from this post?

Well, they’re really the same as in my previous post.

The first is that Kindles do have excellent Unicode support, despite what you might read elsewhere. What’s more, an occasional use of Unicode can lift your book out of the ordinary. If you are coding your Kindle book directly with html, or through a tool such as Sigil, then all you need to do is ensure the encoding is declared correctly in the <meta> tag (or <?xml?> declaration) as I’ve shown you. If you upload a doc or html file directly to KDP, then you could try setting the encoding as I’ve suggested in Word’s Save As, or simply make do with basic characters.

The second take-away is to beware of what people post on the internet about how to make eBooks because there is a lot out there that isn’t accurate. Treat whatever you find with suspicion, test your books thoroughly, and try to get multiple opinions. That advice, of course, goes for my posts too, every bit as much as anyone else’s.

Click here for part 1 of my Unicode posts

Follow this link to my other writing and publishing tips

‘Format Your Print Book for Createspace: 2nd Edition’ available now as a Kindle eBook, and as a 296-page paperback:

eBook: amazon.com | amazon.co.uk

Paperback: amazon.com | amazon.co.uk


