Korean text resources in Java

I just spent a day or two wrestling with text resources sent to us by a Korean affiliate. They are bits of data that must be displayed via the browser in Korean. The process I use is outlined in the remainder of this post.

Java is Unicode-based. That’s all well and good, but it doesn’t mean that it can always read native Unicode text. In some cases, like .properties files read from disk, the file must be plain ASCII, with every non-ASCII character written as a \uXXXX escape.
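To make the escaped form concrete, here is a minimal sketch of how a .properties stream gets read. The class name EscapedPropertiesDemo and the greeting key are made up for illustration. The only real API involved is java.util.Properties.load(InputStream), which decodes bytes as ISO-8859-1 and expands Unicode escapes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; the name and the sample key are illustrative only.
final class EscapedPropertiesDemo {

    // Loads properties from text the same way a .properties file on disk is read:
    // Properties.load(InputStream) decodes the bytes as ISO-8859-1 and expands
    // Unicode escapes (a backslash, the letter u, and four hex digits).
    static Properties loadFromAscii(String fileContents) throws IOException {
        byte[] bytes = fileContents.getBytes(StandardCharsets.ISO_8859_1);
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(bytes));
        return props;
    }
}
```

Calling loadFromAscii("greeting=\\uc548\\ub155") and then getProperty("greeting") gives back the two Hangul syllables 안녕, even though the input itself never leaves ASCII.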

I’ve found that the easiest way to transmit text from a foreign partner is using Microsoft Word. Word stores its text as Unicode by default and is widely available and understood. It also displays the foreign text accurately, so it can be used to visually compare web page results to the original documentation. A plus if you don’t read Korean!

Since Java will not read .properties files stored natively in Unicode, we need to convert the text into escaped ASCII. This is done with the native2ascii tool provided with the JDK. So how does this all fit into the software lifecycle?
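For anyone curious what that conversion amounts to, the sketch below imitates it in a few lines of Java. This is my own stand-in, not the tool's source: the class name AsciiEscaper is hypothetical, and the choice of lowercase hex digits is an assumption.

```java
// Hypothetical stand-in for the kind of output native2ascii produces;
// this is not the tool's actual implementation.
final class AsciiEscaper {

    // Copies ASCII chars through unchanged and replaces every char above 0x7F
    // with a backslash, the letter u, and four hex digits. Working char by char
    // means a supplementary character comes out as two escapes, one per surrogate.
    static String escape(String nativeText) {
        StringBuilder out = new StringBuilder(nativeText.length() * 2);
        for (int i = 0; i < nativeText.length(); i++) {
            char c = nativeText.charAt(i);
            if (c > 0x7F) {
                out.append('\\').append('u').append(String.format("%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Passing in a line such as greeting=안녕 yields greeting=\uc548\ub155, which is the form a .properties file needs.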

Here’s the plan:

1. Obtain native text using Microsoft Word, or any text editor that can store a recognizable native format. Make sure you know the encoding of the original text; native2ascii will need this. Since we carry the data in Word, the native encoding is UTF8. Other encodings are listed here.

2. Store a native version of the .properties file using a different extension. I use .utf8 for UTF-8, since that seems to keep things clear. In this copy of the file, native text can be pasted directly into the file. Using the Eclipse editor, this maintains the original visual glyphs exactly like Word.

3. Set up an Ant task that calls native2ascii on any applicable resource directories, using UTF8 as the source encoding. The same task can write the output files with the .properties extension. Each type of native source file you maintain will need its own command, and all of them can be called from a single precompile task.

4. Deploy your application!
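The conversion in step 3 can also be sketched in plain Java, which may help if you want to see exactly what the precompile step does to each file. This is not the Ant task itself. NativeResourceConverter and its methods are hypothetical names, the escaping is my own imitation of native2ascii, and the sketch assumes every source file in the directory ends in .utf8 and really is UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical build helper; these names are illustrative and are not part of Ant or the JDK.
final class NativeResourceConverter {

    // Converts every *.utf8 file in resourceDir into a sibling *.properties file
    // containing escaped ASCII. Returns the number of files written.
    static int convertDirectory(Path resourceDir) throws IOException {
        int converted = 0;
        try (DirectoryStream<Path> sources = Files.newDirectoryStream(resourceDir, "*.utf8")) {
            for (Path source : sources) {
                String name = source.getFileName().toString();
                String base = name.substring(0, name.length() - ".utf8".length());
                Path target = source.resolveSibling(base + ".properties");

                // Assumption: the native copy is stored as UTF-8.
                String nativeText = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
                Files.write(target, escape(nativeText).getBytes(StandardCharsets.US_ASCII));
                converted++;
            }
        }
        return converted;
    }

    // Replaces each char above 0x7F with a backslash, the letter u, and four hex digits.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append('\\').append('u').append(String.format("%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Run against a directory holding resources_ko.utf8, this writes resources_ko.properties next to it and leaves the native copy untouched.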

The key here is to treat the encoded ASCII resources as build artifacts, and not version them. So for an application like ours, I store only resources_ko.utf8, not the resulting resources_ko.properties. This allows us to maintain only the native copy, and check the properties against what comes out on the screen.

One thought on “Korean text resources in Java”

  1. Much easier than the clunky way I was doing it: getting EUC-KR text files as foo_kr.txt, and converting them with native2ascii, using EUC-KR as the source encoding, to a foo_kr.properties destination. Though my original solution could easily be automated with Ant, there is probably less chance for error when the third party edits these files in Word.