After the file encoding has been changed to UTF-16 LE, you can:
- Use Unicode characters in the script itself. For example, print("ユニコード") is valid code.
- Retrieve Unicode values from Rainmeter. For example, measure:GetStringValue() will return a string with Unicode characters intact. Previously, Unicode characters were replaced with a question mark.
Internally, Lua strings are stored in the UTF-8 format. UTF-8 is a multibyte character encoding, which means that a single index of a string is not necessarily enough to represent one character. ASCII characters (such as B or x) fit in one index, but other characters may span several. Consider this example:
Code: Select all
function Initialize()
    local s = "A"
    print("The length of s is " .. s:len())
end
That will log "The length of s is 1", as expected. Now, instead of the ASCII character A, let's use the Japanese character ユ.
Code: Select all
function Initialize()
    local s = "ユ"
    print("The length of s is " .. s:len())
end
This, surprisingly, will print "The length of s is 3"! It is beyond the scope of this post to explain why this is so. However, what you need to know is that a single character may not necessarily have a length of 1. It may be 2, 3, or even more in some cases. In your Unicode scripts, you should never attempt to split a string at an arbitrary index. Consider this example:
Code: Select all
function Initialize()
    local s = "ABC 汉堡包 DEF"
    print(s:sub(0, 3))           -- this prints "ABC"
    print(s:sub(0, 8))           -- this prints "ABC 汉�"
    print(s:sub(0, s:find("E"))) -- this prints "ABC 汉堡包 DE"
end
As you can see, attempting to get the first 8 actual characters of the string s fails miserably. However, getting all the characters up to the first "E" works fine. This is because s:find() skips over all the characters that are not E regardless of how many indexes each character consumes.
It is important to keep this in mind when enabling Unicode support for your script. If you treat a Unicode string as if every character has a length of exactly 1 (as you could, and still can, with non-Unicode scripts), you're in for trouble. If you keep this limitation in mind when writing your scripts, you should be perfectly fine. In fact, most of your existing scripts will probably work fine after converting to Unicode scripts.
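When you do need the number of actual characters rather than the byte length, one common workaround is to count UTF-8 byte sequences with a pattern. This is only a sketch (utf8len is my own helper, not part of Lua or Rainmeter), but the byte-range pattern is a well-known UTF-8 idiom:

```lua
-- Count actual characters in a UTF-8 string by matching one
-- lead byte plus its continuation bytes at a time.
function utf8len(s)
    local count = 0
    -- %z and \1-\127 match single-byte (ASCII) characters,
    -- \194-\244 matches the lead byte of a multibyte character,
    -- \128-\191 matches its continuation bytes
    for _ in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        count = count + 1
    end
    return count
end

print(utf8len("ABC 汉堡包 DEF")) -- prints 11 (the byte length is 17)
```

Note that this assumes the string is valid UTF-8; it does not validate the input.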
Below is some useful information about using the string functions in Unicode scripts (originally from here):
- sub
Works fine if the indices are calculated reasonably -- and I
think this is almost always the case. People don't generally
do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
string position, e.g. by searching, or string beginning/end,
and maybe calculate offsets based on _known_ contents, e.g.
[[ string.sub (s, 1, string.find (s, "/") - 1) ]]
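As a runnable sketch of that last idiom (the path-like string here is just a made-up example), a computed index is safe even when the text before it is multibyte:

```lua
-- find() returns a byte index, but since it is *searched for* rather
-- than guessed, sub() cuts at a real character boundary
local s = "ユニコード/テスト"
print(s:sub(1, s:find("/") - 1)) -- prints "ユニコード"
```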
- upper, lower
Works fine, but of course only upcases ASCII characters.
However doing this "properly" requires unicode tables, so
isn't appropriate for a minimal library I guess.
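For example (a quick sketch), upper() maps only the ASCII range and passes multibyte characters through unchanged:

```lua
-- upper() works byte by byte; bytes outside the ASCII range are untouched
print(("abc ユニコード xyz"):upper()) -- prints "ABC ユニコード XYZ"
```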
- len
Works fine for calculating the string byte length -- which is
often what is actually wanted -- or calculating the string
index of the end of the string (for further searching or
whatever).
- rep, format
Work fine (only use concatenation)
- byte, char
- find, match, gmatch, gsub
Work fine for the most part. The main exception, of course,
is single-character wildcards, ".", "[^abc]", etc, when used
without a repeat suffix -- but I think in practice, these are
very rarely used without a repeat suffix.
Some of the patterns are limited to ASCII in their
interpretation of course (e.g. "%a"), but this isn't really
fixable without full unicode tables, and the ASCII-only
interpretation is not dangerous.
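To illustrate the wildcard caveat (a sketch): a bare "." matches a single byte and can split a multibyte character, while the same wildcard with a repeat suffix spans whole characters:

```lua
local s = "ユニコード"
print(s:match("^(.)"):len()) -- 1: a single byte, i.e. a broken partial character
print(s:match("^(.-)コ"))    -- prints "ユニ": ".-" consumes whole characters
```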
- reverse
Now _this_ will probably simply fail for strings containing
non-ASCII UTF-8. But it's also probably not very widely
used.
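If you do need to reverse a UTF-8 string, a character-aware version can be sketched with a byte-range pattern that matches one UTF-8 character at a time (utf8reverse is my own helper, not a built-in, and it assumes valid UTF-8 input):

```lua
-- Reverse by character sequences instead of bytes, so multibyte
-- characters keep their internal byte order
function utf8reverse(s)
    local chars = {}
    for c in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        table.insert(chars, 1, c) -- prepend each character
    end
    return table.concat(chars)
end

print(utf8reverse("AユB")) -- prints "BユA", with ユ's bytes intact
```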