After the file encoding has been changed to UTF-16 LE, you can:
- Use Unicode characters in the script itself. For example, print("ユニコード") is valid code.
- Retrieve Unicode values from Rainmeter. For example, measure:GetStringValue() will return a string with Unicode characters intact. Previously, Unicode characters were replaced with a question mark.
Internally, Lua strings are stored in the UTF-8 format. UTF-8 is a multibyte character encoding, which means that a single index of a string is not necessarily enough to represent one character. ASCII characters (such as B or x) fit in one index, but other characters may span several. Consider this example:
Code: Select all
function Initialize()
    local s = "A"
    print("The length of s is " .. s:len())
end
That will log "The length of s is 1", as expected. Now, instead of the ASCII character A, let's use the Japanese character ユ.
Code: Select all
function Initialize()
    local s = "ユ"
    print("The length of s is " .. s:len())
end
This, surprisingly, will print "The length of s is 3"! It is beyond the scope of this post to explain why this is so. However, what you need to know is that a single character may not necessarily have a length of 1. It may be 2, 3, or even more in some cases. In your Unicode scripts, you should never attempt to split a string at an arbitrary index. Consider this example:
Code: Select all
function Initialize()
    local s = "ABC 汉堡包 DEF"
    print(s:sub(0, 3))           -- this prints "ABC"
    print(s:sub(0, 8))           -- this prints "ABC 汉�"
    print(s:sub(0, s:find("E"))) -- this prints "ABC 汉堡包 DE"
end
As you can see, attempting to get the first 8 actual characters of the string s fails miserably. However, getting all the characters up to the first "E" works fine. This is because s:find() skips over all the characters that are not E regardless of how many indexes each character consumes.
It is important to keep this in mind when enabling Unicode support for your script. If you treat a Unicode string as if every character has a length of exactly 1 (as you could, and still can, with non-Unicode scripts), you're in for trouble. If you keep this limitation in mind when writing your scripts, you should be perfectly fine. In fact, most of your existing scripts will probably work fine after converting to Unicode scripts.
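When you do need the number of actual characters rather than the byte length, one common workaround is to count UTF-8 byte sequences with a pattern. This is only a sketch (utf8len is my own helper, not part of Lua or Rainmeter), but the byte-range pattern is a well-known UTF-8 idiom:

```lua
-- Count actual characters in a UTF-8 string by matching one
-- lead byte plus its continuation bytes at a time.
function utf8len(s)
    local count = 0
    -- %z and \1-\127 match single-byte (ASCII) characters,
    -- \194-\244 matches the lead byte of a multibyte character,
    -- \128-\191 matches its continuation bytes
    for _ in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        count = count + 1
    end
    return count
end

print(utf8len("ABC 汉堡包 DEF")) -- prints 11 (the byte length is 17)
```

Note that this assumes the string is valid UTF-8; it does not validate the input.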
Below is some useful information about using the string functions in Unicode scripts (originally from here):
- sub
Works fine if the indices are calculated reasonably -- and I
think this is almost always the case. People don't generally
do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
string position, e.g. by searching, or string beginning/end,
and maybe calculate offsets based on _known_ contents, e.g.
[[ string.sub (s, 1, string.find (s, "/") - 1) ]]
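As a runnable sketch of that last idiom (the path-like string here is just a made-up example), a computed index is safe even when the text before it is multibyte:

```lua
-- find() returns a byte index, but since it is *searched for* rather
-- than guessed, sub() cuts at a real character boundary
local s = "ユニコード/テスト"
print(s:sub(1, s:find("/") - 1)) -- prints "ユニコード"
```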
- upper, lower
Works fine, but of course only upcases ASCII characters.
However doing this "properly" requires unicode tables, so
isn't appropriate for a minimal library I guess.
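For example (a quick sketch), upper() maps only the ASCII range and passes multibyte characters through unchanged:

```lua
-- upper() works byte by byte; bytes outside the ASCII range are untouched
print(("abc ユニコード xyz"):upper()) -- prints "ABC ユニコード XYZ"
```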
- len
Works fine for calculating the string byte length -- which is
often what is actually wanted -- or calculating the string
index of the end of the string (for further searching or
whatever).
- rep, format
Work fine (only use concatenation)
- byte, char
- find, match, gmatch, gsub
Work fine for the most part. The main exception, of course,
is single-character wildcards, ".", "[^abc]", etc, when used
without a repeat suffix -- but I think in practice, these are
very rarely used without a repeat suffix.
Some of the patterns are limited to ASCII in their
interpretation of course (e.g. "%a"), but this isn't really
fixable without full unicode tables, and the ASCII-only
interpretation is not dangerous.
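To illustrate the wildcard caveat (a sketch): a bare "." matches a single byte and can split a multibyte character, while the same wildcard with a repeat suffix spans whole characters:

```lua
local s = "ユニコード"
print(s:match("^(.)"):len()) -- 1: a single byte, i.e. a broken partial character
print(s:match("^(.-)コ"))    -- prints "ユニ": ".-" consumes whole characters
```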
- reverse
Now _this_ will probably simply fail for strings containing
non-ASCII UTF-8. But it's also probably not very widely
used.
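If you do need to reverse a UTF-8 string, a character-aware version can be sketched with a byte-range pattern that matches one UTF-8 character at a time (utf8reverse is my own helper, not a built-in, and it assumes valid UTF-8 input):

```lua
-- Reverse by character sequences instead of bytes, so multibyte
-- characters keep their internal byte order
function utf8reverse(s)
    local chars = {}
    for c in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        table.insert(chars, 1, c) -- prepend each character
    end
    return table.concat(chars)
end

print(utf8reverse("AユB")) -- prints "BユA", with ユ's bytes intact
```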