Unicode support for Lua scripting
Posted: August 12th, 2013, 4:04 pm
We're excited to announce Unicode support for Lua! To enable Unicode support for your script, you must encode your .lua script file in the UTF-16 LE encoding (also known as UCS-2 in some programs). If you're using plain old Notepad, you can select Save As... and then Unicode in the Encoding selection.
After the file encoding has been changed to UTF-16 LE, you can:
- Use Unicode characters in the script itself. For example, print("ユニコード") is valid code.
- Retrieve Unicode values from Rainmeter. For example, measure:GetStringValue() will return a string with Unicode characters intact. Previously, Unicode characters were replaced with a question mark.
Internally, Lua strings are stored in the UTF-8 format. UTF-8 is a multibyte character encoding, which basically means that a single index (one byte) of a string is not necessarily enough to represent one character. ASCII characters (such as B or x) can be represented in one index. Consider this example:
Code:
function Initialize()
    local s = "A"
    print("The length of s is " .. s:len())
end
That will log "The length of s is 1", as expected. Now, instead of the ASCII character A, let's use the Japanese character ユ.
Code:
function Initialize()
    local s = "ユ"
    print("The length of s is " .. s:len())
end
This, surprisingly, will print "The length of s is 3"! The full explanation is beyond the scope of this post, but in short, s:len() counts bytes, and ユ takes three bytes in UTF-8. What you need to know is that a single character does not necessarily have a length of 1. It may be 2, 3, or 4. In your Unicode scripts, you should never attempt to split a string at an arbitrary index. Consider this example:
Code:
function Initialize()
    local s = "ABC 汉堡包 DEF"
    print(s:sub(1, 3)) -- this prints "ABC"
    print(s:sub(1, 8)) -- this prints "ABC 汉�" (index 8 is only the first byte of 堡)
    print(s:sub(1, s:find("E"))) -- this prints "ABC 汉堡包 DE"
end
As you can see, attempting to get the first 8 characters of the string s fails miserably, because s:sub() counts bytes, not characters. However, getting all the characters up to the first "E" works fine. This is because s:find() returns the index of the E regardless of how many indices the characters before it consume.
It is important to keep this in mind when enabling Unicode support for your script. If you treat a Unicode string as if every character has a length of exactly 1 (like you could and still can with non-Unicode scripts), you're in for trouble. If you keep this limitation in mind when writing your scripts, you should be perfectly fine. In fact, most of your existing scripts will probably work fine after converting them to Unicode scripts.
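If you do need to count or cut a string by characters rather than by indices, something along these lines may help. This is only a rough sketch: utf8_len and utf8_first are made-up helper names (they are not part of Lua or Rainmeter), and the code assumes the string is valid UTF-8. It relies on the fact that every UTF-8 character is one lead byte followed by zero or more continuation bytes in the range 128-191.

```lua
-- Hypothetical helpers, not part of Lua or Rainmeter. Assumes valid UTF-8.

-- One UTF-8 character: a non-continuation byte followed by any
-- number of continuation bytes (values 128-191).
local UTF8_CHAR = "[^\128-\191][\128-\191]*"

-- Count characters instead of bytes.
local function utf8_len(s)
  local _, count = s:gsub(UTF8_CHAR, "")
  return count
end

-- Return the first n characters without cutting a character in half.
local function utf8_first(s, n)
  local parts = {}
  for ch in s:gmatch(UTF8_CHAR) do
    if #parts >= n then break end
    parts[#parts + 1] = ch
  end
  return table.concat(parts)
end

local s = "ABC 汉堡包 DEF"
print(s:len())           -- 17 (bytes)
print(utf8_len(s))       -- 11 (characters)
print(utf8_first(s, 5))  -- ABC 汉
```

In a Rainmeter script you would call helpers like these from Initialize() or Update() instead of at the top level. Note that this counts code points, so a letter followed by a separate combining accent still counts as two.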
---
Below is some useful information about using the string functions in Unicode scripts (originally from here):
- sub
  Works fine if the indices are calculated reasonably -- and I
  think this is almost always the case. People don't generally
  do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
  string position, e.g. by searching, or string beginning/end,
  and maybe calculate offsets based on _known_ contents, e.g.
  [[ string.sub (s, 1, string.find (s, "/") - 1) ]]
- upper, lower
  Works fine, but of course only upcases ASCII characters.
  However doing this "properly" requires unicode tables, so
  isn't appropriate for a minimal library I guess.
- len
  Works fine for calculating the string byte length -- which is
  often what is actually wanted -- or calculating the string
  index of the end of the string (for further searching or
  whatever).
- rep, format
  Work fine (only use concatenation)
- byte, char
  Work fine
- find, match, gmatch, gsub
  Work fine for the most part. The main exception, of course,
  is single-character wildcards, ".", "[^abc]", etc, when used
  without a repeat suffix -- but I think in practice, these are
  very rarely used without a repeat suffix.
  Some of the patterns are limited to ASCII in their
  interpretation of course (e.g. "%a"), but this isn't really
  fixable without full unicode tables, and the ASCII-only
  interpretation is not dangerous.
- reverse
  Now _this_ will probably simply fail for strings containing
  non-ASCII UTF-8. But it's also probably not very widely
  used...
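To make the last two notes a bit more concrete, here is a small sketch of the single-character wildcard pitfall and of one possible way around the reverse problem. The utf8_reverse helper is hypothetical (it is not a standard Lua or Rainmeter function) and assumes the string is valid UTF-8.

```lua
local s = "aユb"

-- "." without a repeat suffix matches exactly one byte, so this
-- captures only the first of the three bytes that make up ユ.
local broken = s:match("a(.)")
print(#broken)                    -- 1

-- With a repeat suffix, the pattern consumes every byte up to the "b".
local whole = s:match("a(.-)b")
print(whole == "ユ")              -- true

-- Hypothetical UTF-8-aware alternative to string.reverse: split the
-- string into whole characters and put them back in reverse order.
local function utf8_reverse(str)
  local chars = {}
  for ch in str:gmatch("[^\128-\191][\128-\191]*") do
    table.insert(chars, 1, ch)
  end
  return table.concat(chars)
end

print(utf8_reverse(s) == "bユa")  -- true
print(s:reverse() == "bユa")      -- false, the bytes of ユ get reversed too
```

Inserting at the front of the table is fine for short strings. For long ones it would probably be better to collect the characters in order and then walk the table backwards.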