Introduction
MT-NewsWatcher can decode and display articles in many different languages, including most of those used in news articles that are posted around the world. For example, it can display articles written using any of these writing systems (and more): Japanese, Traditional Chinese, Simplified Chinese, Cyrillic, Central European, and Korean. Here's a snapshot of a subject and article window for a Japanese newsgroup:
Terminology
Converting text to and from a form in which it can be safely transmitted over the internet is a complex task when you take into account the number of different ways that text is represented in the different writing systems of the world. The technology of computing largely developed in countries with Western languages, hence the infrastructure has grown up with some restrictions on the way that text must be transmitted (for example, on the legal range of characters that can be sent unencoded in news and mail messages).
To overcome these restrictions, techniques have evolved for encoding text to transform it into a form that can be transmitted safely over existing transport systems (like the NNTP protocol used by Usenet news). These techniques are described below, and in more detail here.
- Character Set
A character set is a mapping between bytes in the source data and the characters that are displayed to the user. For example, in the normal Macintosh character set, an é is represented by the byte value 142, but in the character set ISO-8859-1 (often used on Usenet), an é is represented by the byte value 233.
- Text Encoding
Text of multi-byte languages (such as Japanese) must be encoded before sending on the Internet, to transform it into a stream of single bytes that will survive transmission without corruption. The receiver must then decode the stream of data to reconstruct the original text. This encoding differs from a simple character set mapping, in that an algorithmic transformation must be applied to the text on both the sending and receiving ends. One example of such a text encoding technique is Shift-JIS.
Because the processes of text encoding conversion and character set mapping are often done at the same time, and by the same software modules, this document often considers both at the same time.
Where MT-NewsWatcher allows you to specify a text encoding or character set, it shows a list of available encodings/character sets in a popup menu such as this.
- Font
Fonts are used for the display of text to the end user on screen, and in print. After receiving a message and doing any necessary text decoding and character set mapping, the correct font must be chosen to display the sequence of bytes that represent the text. For example, you need a Japanese font to display Japanese text. Choose the wrong font, and all you'll see is a string of garbage characters.
- Language
When this documentation uses the term language, it means a writing system rather than a particular spoken language. When you specify a language in MT-NewsWatcher, it affects two things: first, the font that will be used to display the text, and second, the list of text encodings that are available to convert to and from text in that language.
This is an example of a language popup menu in MT-NewsWatcher.
- MIME
MIME, which stands for Multipurpose Internet Mail Extensions, is a way of wrapping up different kinds of data, like text, images, files etc, in a textual format that can be safely sent by email, or posted to newsgroups. Again, it's a way of encoding text and binary data in such a way that it can be transmitted over the internet without getting corrupted. Because MT-NewsWatcher can decode and create messages in MIME format, there is an entire chapter devoted to this topic. Its relevance to text encoding is that the specification of character sets and text encodings is an integral part of MIME, so that messages containing non-Roman text are best sent with MIME.
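To make the Character Set and Text Encoding entries above concrete, here is a minimal Python sketch. It relies only on Python's standard codecs (`mac_roman`, `iso-8859-1`, `shift_jis`, `iso2022_jp`), not on anything inside MT-NewsWatcher, and the sample string is chosen purely for illustration.

```python
# Character set mapping: the same character has a different byte value
# depending on the character set.
mac_byte = "é".encode("mac_roman")[0]       # normal Macintosh character set
latin1_byte = "é".encode("iso-8859-1")[0]   # ISO-8859-1, common on Usenet

# Text encoding: multi-byte text becomes a byte stream, and the receiver
# must decode it again with the same encoding.
text = "日本語"                              # illustrative sample only
shift_jis_bytes = text.encode("shift_jis")
iso2022_bytes = text.encode("iso2022_jp")

# ISO-2022-JP uses escape sequences so that every byte stays below 128.
iso2022_is_7bit = all(b < 0x80 for b in iso2022_bytes)

# Decoding with the matching encoding restores the text ...
round_trip = shift_jis_bytes.decode("shift_jis")
# ... while decoding with the wrong one produces garbage characters.
garbled = shift_jis_bytes.decode("mac_roman")

print(mac_byte, latin1_byte, iso2022_is_7bit, round_trip == text)
```

The `garbled` value is what a reader sees when the wrong decoding is applied to an article.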
Text Encoding Settings
If you wish to view articles in different languages in MT-NewsWatcher, it is worth spending a little time setting things up (even if you can't read the articles!).
Software requirements
On Mac OS X, all the software that you need should be installed with the operating system. You may wish to install the fonts for additional languages in the Mac OS X installer, but it's unlikely that you'll come across a newsgroup post that you can't display with the default fonts.
Setting up font preferences
After you've installed the fonts you need, run MT-NewsWatcher and open the Preferences dialog from the Edit menu. Two panels here are important: the Fonts panel and the Languages panel.
- Font Preferences
The Font Preferences govern which font will be used in lists and text when displaying text for a certain language. They are described in the chapter on Preferences.
- Language Preferences
The Languages Preferences are where you specify some language- and encoding-related defaults for articles that you read, and outgoing messages. They also are described in the Preferences chapter.
Decoding Subjects and Articles
MT-NewsWatcher attempts to determine the correct encoding for each article that you read, and to choose an appropriate font for the display of that article, so that most articles that you see should display correctly the first time. For the remainder, where automatic detection fails, there are a number of ways to enforce the correct decoding.
Subject list decoding
When MT-NewsWatcher is fetching article headers, it looks for subject and author headers which contain text that is encoded somehow, tries to figure out how that text is encoded, and decodes it. When all the article headers have been received, it looks for the most common non-Roman encoding, and if one exists, chooses a font for subject list display based on that encoding. This means that one article with a Japanese header among a set of articles with English headers will cause that subject list to be displayed using a Japanese font (which, of course, still displays English characters just fine).
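The font choice for a subject list described above can be sketched roughly as follows. This is not MT-NewsWatcher's actual code: the function name and the set of encodings treated as "Roman" are assumptions made for illustration.

```python
from collections import Counter

# Assumption: encodings treated as Roman for the purpose of this sketch.
ROMAN_ENCODINGS = {"us-ascii", "iso-8859-1"}

def subject_list_encoding(header_encodings):
    """Return the most common non-Roman encoding among the headers,
    or None if every header is Roman."""
    counts = Counter(
        name.lower()
        for name in header_encodings
        if name.lower() not in ROMAN_ENCODINGS
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# One Japanese header among several English ones decides the list font.
detected = subject_list_encoding(["ISO-8859-1"] * 5 + ["ISO-2022-JP"])
print(detected)
```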
Article headers can be encoded in a variety of ways. Some news clients allow users to type non-Roman characters into their author and subject fields, and simply send the raw bytes entered; often, these survive transport through NNTP unscathed. Other news clients (notably, those from Microsoft) convert such non-Roman characters into 'safe' strings using MIME techniques of quoted printable and base-64 encoding. In the raw, these look like
=?ISO-2022-JP?B?GyRCTlMbKEogGyRCO0s8eRsoSg==?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?=
Here, the first string (e.g. "ISO-8859-2") specifies the character set (in this case, Latin 2, used for Central European languages), and the letter between the question marks specifies the encoding type (Q = quoted printable, B = Base64).
MT-NewsWatcher can decode both these encoding types, and displays the decoded version in subject lists.
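For readers who want to see what this decoding involves, here is a small sketch using Python's standard `email.header` module. It illustrates the technique in general and is not how MT-NewsWatcher itself is implemented; the helper name `decode_encoded_words` is made up for this example.

```python
from email.header import decode_header

def decode_encoded_words(raw_header):
    """Decode the MIME encoded words (Q or B) in a raw header value."""
    parts = []
    for data, charset in decode_header(raw_header):
        if isinstance(data, bytes):
            # Fall back to ASCII when no character set was declared.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)

samples = [
    "=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=",
    "=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?=",
]
for sample in samples:
    print(decode_encoded_words(sample))
```

The two samples decode to "u understand the example." and "Patrik Fältström" respectively.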
Article decoding
Choosing the correct decoding for an article is a sometimes difficult task, because many articles contain no information about how they are encoded, or contain incorrect information.
When receiving and parsing the text of an article for display, MT-NewsWatcher first looks for a header that can contain text encoding information, the "Content-Type" header which is included by MIME-capable news clients. For example:
Content-Type: text/plain; charset=ISO-8859-1
This specifies that this is a plain text article, which uses the Western Latin 1 character set. The "charset" parameter here specifies the character set or encoding used. Other examples are "Big5" for an encoding used for Traditional Chinese or "EUC-JP" for an encoding of Japanese.
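As an illustration of reading the "charset" parameter, the sketch below uses Python's standard `email` package on a made-up two-line article. The helper name `article_charset` is hypothetical, and this is not MT-NewsWatcher's own parser.

```python
from email import message_from_string

def article_charset(raw_article):
    """Return the charset named in the Content-Type header, or None."""
    message = message_from_string(raw_article)
    # get_content_charset() lower-cases the value and returns None
    # when the header or the charset parameter is missing.
    return message.get_content_charset()

labelled = "Content-Type: text/plain; charset=ISO-8859-1\n\nHello\n"
unlabelled = "Subject: no MIME headers here\n\nHello\n"

print(article_charset(labelled))
print(article_charset(unlabelled))
```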
Articles posted using MIME (see the MIME page) can contain multiple sections, and each section can be encoded with a different encoding or character set. MT-NewsWatcher will correctly display such articles.
However, some articles labelled as "ISO-8859-1" are actually encoded using some totally different encoding -- MT-NewsWatcher tries to detect this by checking the article text for a high proportion of high-ASCII characters, and ignores the label in these cases. Many (most) articles contain no encoding information at all. In these degenerate cases, MT-NewsWatcher sniffs the text of the article to attempt to determine the encoding used (if the preference is turned on, as described above). Text encoding sniffing does not always work, and works more reliably with longer articles. In order to avoid too many false matches, MT-NW only sniffs for text encodings which have the same destination Mac encoding as the default encoding for the current group. That means that if you are in a group for which you have settings that cause articles to show up in Japanese by default (using the Group Settings described below), then the only encodings sniffed for are those which result in text being displayed in Japanese (e.g. ShiftJIS, EUC-JP and ISO-2022-JP).
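The two heuristics in this paragraph can be sketched as below. The sketch makes assumptions throughout: it measures "high-ASCII" as the fraction of bytes at or above 128, and it sniffs by trying a short candidate list with strict decoding. MT-NewsWatcher's real detection logic and thresholds are not documented here and may differ.

```python
def high_byte_ratio(data):
    """Fraction of bytes with the high bit set (0.0 for empty input)."""
    if not data:
        return 0.0
    return sum(byte >= 0x80 for byte in data) / len(data)

def sniff_encoding(data, candidates):
    """Return the first candidate that decodes the bytes without error.

    Order matters: pure ASCII text decodes under every candidate, and
    some byte sequences are valid in more than one encoding.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# Candidates restricted to Japanese, as for a group set to Japanese.
JAPANESE = ("iso2022_jp", "euc_jp", "shift_jis")

article = "日本語のテキスト".encode("euc_jp")
print(high_byte_ratio(article), sniff_encoding(article, JAPANESE))
```

Restricting the candidate list to one language, as the text describes, keeps the number of false matches down because fewer encodings compete for the same bytes.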
If an article does show up that is unreadable because the wrong decoding has been applied, then you can use the submenu on the menu to change the decoding to something different. If you find that you always need to do this for a particular type of article, then you should use one of the techniques described in the "Controlling decoding" section below.
Saving decoded articles
When you save an article, you can choose to save either a version that contains the raw data obtained from the news server, or a version which has been converted into the Macintosh text format. This is controlled using the Save encoded text checkbox in the Saving files preferences panel, and also in the file saving dialog.
If the Save encoded text option is turned on, then the article is saved in its raw, pre-conversion state, with whatever text encoding (or character set) it was sent with, and any encoded binaries included. With this option off, the saved file contains text in Macintosh format, and does not include any data for encoded binaries. Use this latter format if you want to open the file with SimpleText or a word processor.
Controlling decoding
MT-NewsWatcher uses a three-tiered approach to give you maximum flexibility in controlling how articles are decoded. This should allow you to prevent almost all decoding errors in groups that you regularly read.
First, set up the global preferences in the Fonts and Languages preferences panels, as described above, to apply to the majority of groups that you read.
Second, use Group Settings for different levels of the groups hierarchy to specify different defaults for decoding subject lists and articles. For example you might make group settings for all of the "fj.*" hierarchy to change the default decoding to some Japanese encoding. More specific group settings override less specific ones, so you could also make settings for "fj.jobs.*" that change the default article decoding to ISO-2022-JP, for example.
Third, there will be some small subset of articles that always get decoded wrongly, perhaps because they contain incorrect encoding information. These articles are often posted by the same person, or have some other information in common. For these articles, create filters, set the filter action to "Keep", and, in the third panel of the filters dialog, check the box that says "Decode matching articles as" and choose the correct encoding from the popup menu. That will force matching articles to be decoded with that setting when you read them.
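The way the three tiers combine can be pictured with a short sketch. All names and settings here are hypothetical (the dictionary of group settings, the `filter_override` argument, and the rule that the longest matching pattern counts as the most specific); they only model the precedence described above.

```python
from fnmatch import fnmatchcase

# Hypothetical settings mirroring the examples in the text.
GLOBAL_DEFAULT = "iso-8859-1"
GROUP_SETTINGS = {
    "fj.*": "euc-jp",
    "fj.jobs.*": "iso-2022-jp",
}

def default_decoding(group, filter_override=None):
    """Pick a decoding: filter first, then group settings, then global."""
    if filter_override is not None:
        return filter_override
    matches = [p for p in GROUP_SETTINGS if fnmatchcase(group, p)]
    if matches:
        # Assumption: the longest pattern is the most specific one.
        return GROUP_SETTINGS[max(matches, key=len)]
    return GLOBAL_DEFAULT

print(default_decoding("comp.sys.mac.apps"))
print(default_decoding("fj.rec.food"))
print(default_decoding("fj.jobs.misc"))
print(default_decoding("fj.jobs.misc", filter_override="shift_jis"))
```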
Encoding Messages for Sending
Choosing a text encoding for sending messages
When you post a new article, or reply to an existing article, MT-NewsWatcher chooses which text encoding (character set) to use according to the rules below. The chosen text encoding in turn determines which font is used in the message window, and which alternative encodings are available. MT-NewsWatcher chooses a text encoding as follows:
- For new messages
Start with the text encoding specified for Posting & Mailing in the Languages Preferences.
If there are Group Settings set up for the group or groups that are selected when the New Message command is chosen, then use the text encoding specified in the 'Send messages as' popup of those group settings instead.
- For replies
If the article being replied to has a "Content-Type" header that specifies a character set, and the 'Use article's character set for reply' preference is set, then use that character set (or text encoding).
Otherwise, determine the text encoding as for new messages.
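The rules above amount to a short decision procedure, sketched here. The function and parameter names are invented for illustration and do not correspond to anything in MT-NewsWatcher itself.

```python
def choose_send_encoding(prefs_encoding,
                         group_encoding=None,
                         reply_charset=None,
                         use_reply_charset=True):
    """Pick the initial text encoding for a message window.

    prefs_encoding    -- Posting & Mailing encoding from the Languages Preferences
    group_encoding    -- 'Send messages as' value from Group Settings, if any
    reply_charset     -- charset from the Content-Type of the article being
                         replied to, or None for a new message
    use_reply_charset -- state of the 'Use article's character set for reply'
                         preference
    """
    if reply_charset and use_reply_charset:
        return reply_charset
    if group_encoding:
        return group_encoding
    return prefs_encoding

# New message in a group with settings, then a reply to a labelled article.
print(choose_send_encoding("iso-8859-1", group_encoding="iso-2022-jp"))
print(choose_send_encoding("iso-8859-1", reply_charset="euc-jp"))
```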
These steps are followed to get the initial text encoding (and hence font) for the message window, but you can always change it before sending. There are two ways to change it.
If you want to send the message in a different "language" (i.e. writing system), say Japanese instead of Western, then you can use the submenu on the menu to choose a different language. When you do this, it will change the fonts used in the message window to match the new language, using the text font specified for this language in the preferences.
Changing the language also requires that you use a different text encoding for sending, because different text encodings must be used for different languages. To choose which text encoding to use with the new language, you need to use the encoding popups in the message window. If you do not already have details showing, choose from the menu. Now, if necessary, click the 'Text Encoding' tab in the tab group just above the message area. You should see something like this:
These popups allow you to choose text encodings to use when this message is posted to the newsgroup, and sent by email. The same message can be encoded in two different ways for news and email. Note that only encodings which are legal for the current language are enabled in these popup menus.
Text Encodings and MIME
If you send articles with character sets other than Western (ISO Latin 1), which is the default, then you are strongly recommended to send articles containing MIME information, because then the character set information is included in the article headers, and other news clients will be able to decode and display the article properly.
To turn on the option to send articles with MIME information, go to the 'Message Options' preferences panel, and check the box labelled 'Send with MIME'.
Note: When sending articles with MIME, MT-NewsWatcher encodes high-ASCII characters in the headers according to RFC 2047, so it produces header lines that can look like =?ISO-2022-JP?B?GyRCTlMbKEogGyRCO0s8eRsoSg==?=. MIME-capable news clients will decode these on receipt to show the original characters. However, those using news clients that can't decode such headers may not appreciate you sending articles containing them. If this is the case, you should turn off sending with MIME in the preferences before posting to such groups.
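To show what RFC 2047 header encoding produces, here is a sketch using Python's standard `email.header.Header` class. It demonstrates the general technique only; the exact output that MT-NewsWatcher generates (for example the letter case of the character set name) may differ, and the helper names are hypothetical.

```python
from email.header import Header, decode_header

def encode_header_value(text, charset):
    """Encode a header value as RFC 2047 encoded words."""
    return Header(text, charset).encode()

def decode_header_value(raw):
    """Reverse the encoding, as a MIME-capable client would."""
    return "".join(
        data.decode(charset or "ascii") if isinstance(data, bytes) else data
        for data, charset in decode_header(raw)
    )

western = encode_header_value("Patrik Fältström", "iso-8859-1")
japanese = encode_header_value("日本語", "iso-2022-jp")

# Both results contain only 7-bit characters, safe for news transport.
print(western)
print(japanese)
```

A client that cannot decode such headers shows the raw `=?...?=` text, which is why the note suggests turning the option off for some groups.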
Personalities and Text Encodings
When creating a Personality, you specify a language in which to enter and encode some of the headers for that personality (using the View as popup menu in the Personalities dialog box). Thus, each personality has an associated language.
When sending messages using that Personality, the language used to send the message should match the language for that personality. This ensures that the personality headers ("From:" and "Organization:") can be properly encoded on send. If these don't match, then you will see a warning about the mismatch when you try to send. In this case, you'll have to make a personality with the appropriate language, or change the language you are using to send the message.
Encoding mappings
This section is intended for the technically competent, who wish to customize the mappings that MT-NewsWatcher uses between Internet and Macintosh encodings, or to change the sets of languages or text encodings which appear in the menus.
There are two resources in MT-NewsWatcher that control text encoding and language font settings. These are the 'TCnv' and 'FSct' resources. ResEdit templates are provided for both, though the 'TCnv' template is only usable in Resorcerer.
The 'TCnv' resource
This resource (ID 128) specifies the mapping between Internet character sets or text encodings and their Macintosh counterparts. Unfortunately, the Text Encoding Converter does not seem to provide a service that allows applications to choose the correct destination (Mac) text encoding for a given Internet encoding, so this lookup table is needed.
A Resorcerer template for this resource is provided. Values of encodings in this resource are those defined in TextCommon.h; the Incoming encoding value is an Internet Text encoding, and the preferred and fallback encodings are both Mac encodings. The fallback encoding is currently unused. The Sort order field controls where in the menus an encoding appears. The last two bits in the final byte specify whether this internet encoding is used as the default destination when converting from a Mac encoding, and whether this Internet encoding shows up on the menus.
The 'FSct' resources
There are two resources of this type, IDs 128 and 129, which contain the default fonts for each language, for screen and printing respectively. Again, a ResEdit template is provided. Since the data contained in this resource is that set in the Font Preferences, the only reason you'd need to change this resource is to distribute a modified version of MT-NewsWatcher with different font defaults.
Limitations
There are some languages that MT-NewsWatcher is currently unable to display. Those are languages that require the application to use Unicode-compatible calls to draw text and handle editing. MT-NewsWatcher does not currently have this capability. The most prevalent language for which this is true is Hebrew.