New Java 18 Feature: Default Charset UTF-8

Summary:

This article discusses the new Java 18 feature that makes UTF-8 the default charset. The change supports more agile software development because it makes software more portable across different systems.

What Is a Charset?

At first, one might get the impression that a charset is just a set of characters, but there is a conversion aspect to it. The term charset, as defined by RFC 2278, means the combination of a Coded Character Set and a Character Encoding Scheme.

A Coded Character Set is a mapping from a set of abstract characters to a set of nonnegative integers. A set of abstract characters is a collection, or repertoire, of characters such as the English alphabet (a to z). An example of a Coded Character Set is Unicode, which defines the same characters as the ISO/IEC 10646 standard.

A Character Encoding Scheme is a mapping from a Coded Character Set to a sequence of octets, an octet being a unit of digital information made of eight bits. Typically a Character Encoding Scheme is associated with a single Coded Character Set. As an example, the UTF-8 Character Encoding Scheme applies only to the Unicode Coded Character Set. Some Character Encoding Schemes, however, are associated with multiple Coded Character Sets; EUC, for example, can encode characters from several Asian Coded Character Sets.

In simpler language, a charset is a method for converting a sequence of octets to a sequence of characters. As a charset is a combination of a Coded Character Set and a Character Encoding Scheme, how is it named? If a Coded Character Set is used with a single Character Encoding Scheme, the charset is usually named for the Coded Character Set. But if a Coded Character Set is used with multiple Character Encoding Schemes, the charset is usually named for the Character Encoding Scheme. Because the Unicode Coded Character Set is used with multiple Character Encoding Schemes (UTF-8, UTF-16, and others), the charset that combines Unicode with UTF-8 is named for the Character Encoding Scheme, which is UTF-8.
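The conversion aspect can be seen in a few lines of Java. The following is a minimal sketch, not taken from the original article; the class name CharsetRoundTrip and its helper methods are hypothetical names chosen for illustration. It encodes the same character with two charsets and decodes it again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows a charset as a conversion between characters and octets.
class CharsetRoundTrip {

    // Encode text into octets using the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decode octets back into text using the given charset.
    static String decode(byte[] octets, Charset charset) {
        return new String(octets, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent), written as an escape so the
        // encoding of this source file does not matter.
        String text = "\u00e9";

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 octets:      " + Arrays.toString(utf8));
        System.out.println("ISO-8859-1 octets: " + Arrays.toString(latin1));
        System.out.println("Round trip ok:     "
                + text.equals(decode(utf8, StandardCharsets.UTF_8)));
    }
}
```

The same abstract character becomes two octets in UTF-8 and one octet in ISO-8859-1, which is why the reader and the writer of a byte sequence have to agree on the charset.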

How Does Charset Apply to Java?

We defined the charset in general, but how does it apply to the Java programming language? The native character encoding of Java is UTF-16, which means that char and String values hold text as 16-bit UTF-16 code units. It does not dictate the encoding of source code files. Therefore, a charset in Java is defined as a mapping, or conversion, between sequences of 16-bit UTF-16 code units (sequences of chars) and sequences of raw bytes. Every implementation of Java supports some standard charsets, such as US-ASCII, ISO-8859-1, UTF-8, and UTF-16.
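To make the idea of UTF-16 code units concrete, here is a small sketch; the class name Utf16CodeUnits and its methods are hypothetical and used only for illustration. It compares the number of 16-bit code units in a string with the number of Unicode code points for a character outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: Java strings are sequences of UTF-16 code units.
class Utf16CodeUnits {

    // Number of 16-bit UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so UTF-16 represents it with a surrogate pair.
        String smiley = new String(Character.toChars(0x1F600));

        System.out.println("code units:  " + codeUnits(smiley));
        System.out.println("code points: " + codePoints(smiley));
        System.out.println("UTF-8 bytes: "
                + smiley.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For this character the string holds two code units but only one code point, and the UTF-8 charset maps those two code units to four bytes.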

What Is the Default Charset in Java?

Before Java 18, every instance of the Java virtual machine (JVM) had a default charset that was determined at JVM startup and typically depended on the locale and charset of the underlying operating system. That default charset was not necessarily one of the standard charsets. In Java 18, every instance of the JVM has a default charset of UTF-8 unless it is overridden by a system property, as discussed later. Charset.defaultCharset() in Java 18 returns UTF-8 unless the default has been changed in an implementation-specific way.
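A quick way to see which default a given JVM is using is to ask it directly. The sketch below is illustrative only; the class name DefaultCharsetCheck is hypothetical. It prints the Java version and the default charset reported by Charset.defaultCharset().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reports the default charset of the running JVM.
class DefaultCharsetCheck {

    // True when the running JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Java version:    " + System.getProperty("java.version"));
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        System.out.println("Is UTF-8:        " + defaultIsUtf8());
    }
}
```

Running the same class on an older JDK and on Java 18, on a system whose locale is not UTF-8 based, should show the difference described above.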

What Difference Does It Make What the Default Charset Is?

Standard Java APIs for reading and writing files and for processing text use the default charset if a charset is not passed as an argument. These standard Java APIs include java.io.FileReader, java.io.FileWriter, java.io.InputStreamReader, java.io.OutputStreamWriter, java.io.PrintStream, java.util.Scanner, java.util.Formatter, java.net.URLEncoder, and java.net.URLDecoder.

Because a pre-Java 18 JVM determines its default charset at startup based on implementation, locale, operating system, and configuration, applications developed with one implementation, locale, and operating system are not guaranteed to behave the same way in another environment. If the default charset differs between systems, text written on one system is likely to be corrupted when read on the other, which makes applications less portable. An application that passes the charset as an argument is not affected by what the default charset is.
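The corruption can be reproduced without changing any JVM setting by writing a file with one charset and reading it with another. The following is a sketch under stated assumptions: the class name ExplicitCharsetIo and the method writeThenRead are hypothetical, ISO-8859-1 stands in for "some other default charset", and the FileReader and FileWriter constructors that accept a Charset require Java 11 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: simulates two systems that disagree about the charset.
class ExplicitCharsetIo {

    // Write a single line with one charset, then read it back with another.
    static String writeThenRead(String text, Charset writeWith, Charset readWith) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                try (BufferedWriter out =
                        new BufferedWriter(new FileWriter(file.toFile(), writeWith))) {
                    out.write(text);
                }
                try (BufferedReader in =
                        new BufferedReader(new FileReader(file.toFile(), readWith))) {
                    return in.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Same charset on both sides: the text survives.
        String same = writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        // Different charsets: the two UTF-8 bytes of the accented e
        // are read back as two separate characters.
        String mixed = writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("Same charset intact:  " + text.equals(same));
        System.out.println("Mixed charset intact: " + text.equals(mixed));
        System.out.println("Mixed charset length: " + mixed.length());
    }
}
```

Passing the charset explicitly, as both calls do here, is what keeps an application independent of the JVM default.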

What Is the Benefit of Using UTF-8 as the Default Charset?

Why UTF-8, and why not some other charset as the default? UTF-8 is the most commonly used encoding on the World Wide Web (WWW). In fact, about 98% of all web pages use UTF-8. Some of Java's standard APIs, such as the NIO API, already use UTF-8 if a charset is not specified as an argument. As an example, methods in the java.nio.file.Files class, which is used for files and directories, use UTF-8 if a charset is not passed as an argument. Java also uses UTF-8 in property files. UTF-8 is also the standard encoding for the majority of the XML and JSON files processed by Java applications.
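The behavior of the java.nio.file.Files class can be checked with a short sketch. The class name NioUtf8Default and its method are hypothetical; Files.writeString requires Java 11 or later. The code writes text without passing a charset and then compares the bytes on disk with an explicit UTF-8 encoding of the same text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical demo class: Files.writeString uses UTF-8 when no charset is given.
class NioUtf8Default {

    // Write text with Files.writeString (no charset argument)
    // and return the raw bytes that ended up on disk.
    static byte[] bytesWrittenByFiles(String text) {
        try {
            Path file = Files.createTempFile("nio-utf8", ".txt");
            try {
                Files.writeString(file, text);
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] onDisk = bytesWrittenByFiles(text);
        byte[] expected = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Matches UTF-8 encoding: " + Arrays.equals(onDisk, expected));
    }
}
```

This holds regardless of the JVM default charset, so code that already relies on these methods is unaffected by the Java 18 change.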

What Are the Related System Properties?

Two system properties relate to the default charset: file.encoding and native.encoding. The default charset of UTF-8 in Java 18 may be overridden on the command line: if file.encoding is set to COMPAT (java -Dfile.encoding=COMPAT), the default charset is derived from the value of native.encoding, which by default depends on the locale and charset of the underlying operating system. file.encoding may also be set explicitly to UTF-8 (java -Dfile.encoding=UTF-8), which matches the Java 18 default.
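Both properties can be inspected from a running program. The following sketch is illustrative; the class name EncodingProperties and the method nativeCharsetOrNull are hypothetical. It assumes that native.encoding may be absent, because that property was only introduced in Java 17.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: prints the system properties related to the default charset.
class EncodingProperties {

    // Returns the charset named by native.encoding, or null when the
    // property is absent or does not name a charset this JVM supports.
    static Charset nativeCharsetOrNull() {
        String name = System.getProperty("native.encoding");
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("native.encoding: " + System.getProperty("native.encoding"));
        System.out.println("native charset:  " + nativeCharsetOrNull());
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

Running the class once with no options and once with -Dfile.encoding=COMPAT on Java 18 should show the default charset switching from UTF-8 to the native charset on systems where the two differ.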

How Are the Earlier Version Java Applications Affected?

Applications written for earlier Java versions may not run as expected with Java 18. Text may get corrupted if it was written with a default charset other than UTF-8 and is now read as UTF-8, or the reverse. A user could set file.encoding to COMPAT to override the default charset of UTF-8.

Another issue is that a source file saved in another charset may contain byte sequences that are not valid in UTF-8. As an example, consider the following Java application, whose source file was saved with a default charset of windows-1252.

class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World¡˜ž");
    }
}

If the file is compiled with Java 18 using its default charset of UTF-8, even on the same operating system, the compiler would report the following errors:

javac HelloWorldApp.java
HelloWorldApp.java:5: error: unmappable character (0xA1) for encoding UTF-8
System.out.println("Hello World???");
^
HelloWorldApp.java:5: error: unmappable character (0x98) for encoding UTF-8
System.out.println("Hello World???");
^
HelloWorldApp.java:5: error: unmappable character (0x9E) for encoding UTF-8
System.out.println("Hello World???");
^
3 errors

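A user could modify the application to remove the unmappable characters, convert the source file to UTF-8, or pass the original charset to the compiler with the -encoding option (javac -encoding windows-1252 HelloWorldApp.java).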
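The reason the compiler rejects these bytes can be shown with a strict decoder. The sketch below is not part of the original example; the class name StrictDecode and the method isValid are hypothetical. It uses ISO-8859-1 as a stand-in single-byte charset, because windows-1252 is not guaranteed to be available on every Java implementation. The three byte values are the ones reported in the compiler errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: checks whether bytes form a valid sequence in a charset.
class StrictDecode {

    // True when the octets decode without error in the given charset.
    static boolean isValid(byte[] octets, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The byte values named in the javac error messages.
        byte[] fromErrors = {(byte) 0xA1, (byte) 0x98, (byte) 0x9E};

        System.out.println("Valid as UTF-8:      "
                + isValid(fromErrors, StandardCharsets.UTF_8));
        System.out.println("Valid as ISO-8859-1: "
                + isValid(fromErrors, StandardCharsets.ISO_8859_1));
    }
}
```

In UTF-8 these three values can only appear as continuation bytes after a lead byte, so on their own they are malformed, whereas a single-byte charset accepts each of them as one character.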

About the author

AgileConnection is a TechWell community.