Unicode and external files in Lua

jsmorley
Developer
Posts: 18330
Joined: April 19th, 2009, 11:02 pm
Location: Fort Hunt, Virginia, USA

jsmorley » January 21st, 2018, 4:20 pm

One of the things that Lua is good at is reading and writing external files. This guide is not intended to dig into the details of that; perhaps a future post will. What I want to touch on here is how to deal with "encoding" when you are working with Lua and external files. There are some rules, and some limitations.

First off, it should be noted that Lua is "platform agnostic", and is not designed specifically with the Windows operating system in mind. It is designed to be used on a variety of platforms, and so to some degree uses a "lowest common denominator" approach to things.

Let's start off by laying out the "rules" and "limitations".

Rules

1) Your .lua script file should ALWAYS be encoded as UTF-16 Little Endian. This will allow it to communicate properly with Rainmeter, which relies on UTF-16 as the default encoding. You can encode the .lua file as ANSI, but if you do, then no Unicode external files at all will be supported.

2) Any external files that you read or write must be encoded one of two ways:
a: ANSI, with no Unicode characters of any kind in it. Only the ASCII and Extended-ASCII character sets for your system locale.
b: UTF-8 with or without BOM. Then it will support Unicode characters with some limitations.
The BOM (byte order mark) with UTF-8 is not important. Lua is fine with it either way.

3) External files you intend to read or write with Lua must NEVER be encoded as UTF-16. It simply won't work.

4) External folder or file names you wish to open in Lua must not contain Unicode characters in them. This will cause Windows to treat them as UTF-16, and Lua won't be able to open the file.
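
As a rough illustration of rules 2 and 4, here is a minimal sketch of a "safe" file reader. The name ReadTextFile is made up for this example and is not part of Lua or Rainmeter. It checks for the nil handle that io.open returns when a file cannot be opened, and it drops a UTF-8 BOM if one is present. The display is fine either way, but the BOM is three extra bytes at the start of the string, which matters if you count bytes or match at the very start of the text.

```lua
-- Hypothetical helper, not part of Lua or Rainmeter.
-- The UTF-8 BOM is the three bytes EF BB BF (decimal 239 187 191).
local UTF8_BOM = "\239\187\191"

function ReadTextFile(fileName)
	-- io.open returns nil plus an error message when the file cannot be
	-- opened, for example when the path contains Unicode characters.
	local file, err = io.open(fileName)
	if not file then
		return nil, err
	end

	local text = file:read("*all")
	file:close()

	-- Drop the BOM so that byte positions start at the first real character.
	if text:sub(1, 3) == UTF8_BOM then
		text = text:sub(4)
	end

	return text
end
```

A caller would then test the result before using it, along the lines of: local text, err = ReadTextFile(path), followed by a check for text being nil.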

Limitations

Unicode characters are often stored as multi-byte characters. If you have strings like 你好,世界 or Привет мир, these have a certain number of "characters", in this case 5 and 10. In UTF-8, however, each Chinese character takes three bytes and each Cyrillic letter takes two (the comma and the space are plain ASCII, at one byte each), so the strings are stored using 13 and 19 bytes respectively. This means that any Lua function that treats one byte as one character will not work correctly. In a practical sense, if you are reading in an external file that contains Unicode, these functions should not be used on the result:

1) string.len() : This measures the length of a string in "bytes", which will be the same as "characters" when measuring ASCII, but will return confusing and improper values when measuring Unicode.
2) string.sub() : This extracts a sub-string from a string, based on "bytes". Again, when using it on ASCII strings, it reflects "characters", but will not work correctly with multi-byte Unicode characters.
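
To make the difference between bytes and characters concrete, here is a small sketch in plain Lua. CountChars and SubChars are names invented for this example and are not part of Lua's string library. They rely on the fact that in UTF-8 every byte in the range 128 to 191 is a "continuation" byte, so any other byte starts a new character. The sketch assumes the string is valid UTF-8 and that the script file itself is saved in a Unicode encoding.

```lua
-- Hypothetical helpers, not part of Lua's string library.

-- Count characters by counting every byte that is NOT a UTF-8
-- continuation byte (continuation bytes are 128 to 191).
function CountChars(s)
	local _, count = s:gsub("[^\128-\191]", "")
	return count
end

-- Character-based version of string.sub, for positive indexes only.
function SubChars(s, i, j)
	local chars = {}
	-- Each match is one lead byte plus the continuation bytes after it.
	for ch in s:gmatch("[^\128-\191][\128-\191]*") do
		chars[#chars + 1] = ch
	end
	i = math.max(i or 1, 1)
	j = math.min(j or #chars, #chars)
	return table.concat(chars, "", i, j)
end

local text = "你好,世界"
print(text:len())            -- 13 (bytes)
print(CountChars(text))      -- 5 (characters)
print(SubChars(text, 1, 2))  -- 你好
```

This builds a table of every character on each call, so it is only meant for short strings like the ones a skin typically displays.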

What works

Other functions, that do not depend on "bytes", but "pattern matching", will work if you use them carefully.

1) string.find() : This will return indexes in "bytes" for where the start and end of a "pattern" is found in a string. This will return valid results, but the result will be based on "bytes" and not "characters". Take care how you use this.
2) string.match() : This will extract sub-strings from a string based on a "pattern". This will work fine.
3) string.gsub() : This will search and replace strings in a string based on a "pattern". This will work fine.
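
Here is a short sketch of what "carefully" can look like. ByteToCharIndex is a made-up helper for this example. Two things to keep in mind: string.find reports byte positions, and Lua patterns work on single bytes, so "." or a set like [你好] will match individual bytes rather than whole Unicode characters. Searching for whole literal strings, as below, avoids that problem.

```lua
local text = "Hello 你好,世界"

-- string.find returns BYTE positions. The fourth argument (true) asks for a
-- plain-text search, so nothing in the search string is treated as a pattern.
local startByte, endByte = text:find("世界", 1, true)
print(startByte, endByte)                  -- 14   19

-- Hypothetical helper: turn a byte position into a character position by
-- counting the characters that start at or before that byte.
function ByteToCharIndex(s, bytePos)
	local _, count = s:sub(1, bytePos):gsub("[^\128-\191]", "")
	return count
end
print(ByteToCharIndex(text, startByte))    -- 10

-- The byte positions from string.find can be passed straight to string.sub.
print(text:sub(startByte, endByte))        -- 世界

-- string.match and string.gsub work on whole literal strings.
print(text:match("Hello (.+)"))            -- 你好,世界
print((text:gsub("世界", "World")))        -- Hello 你好,World
```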

So the long and the short of it is that you can read and write external files that contain Unicode characters, but you have to remember two things:

1) Unicode is always UTF-8 in Lua.
2) Functions that depend on "bytes" to define "characters" will not work correctly.

Here is a skin you can play with to see how things work:
LuaFileDemo_1.0.rmskin
Skin:

Code:

[Rainmeter]
Update=1000
DynamicWindowSize=1
AccurateText=1

[Variables]

[Lua]
Measure=Script
ScriptFile=LuaFileDemo.lua
Disabled=1

[MeterANSI]
Meter=String
FontSize=11
FontWeight=400
FontColor=255,255,255,255
SolidColor=47,47,47,255
Padding=5,5,5,5
AntiAlias=1
Text=[&Lua:ReadANSI('#CURRENTPATH#ANSI.txt')]
DynamicVariables=1

[MeterUTF8Unicode]
Meter=String
Y=5R
FontSize=11
FontWeight=400
FontColor=255,255,255,255
SolidColor=47,47,47,255
Padding=5,5,5,5
AntiAlias=1
Text=[&Lua:UTF8Unicode('#CURRENTPATH#UTF8Unicode.txt')]#CRLF#The length of the string in bytes is [&Lua:UTF8Len]#CRLF#Note that this is NOT the number of characters, but the number of bytes.#CRLF#The first Unicode character is [&Lua:firstUnicode]. Note that it is three bytes long, not one.#CRLF#firstUnicode = fileTxt:sub(48,50)
DynamicVariables=1
Lua: (encoded as UTF-16 Little Endian)

Code:

function Initialize()
end

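function ReadANSI(fileName)

	-- io.open returns nil if the file cannot be opened
	local file = io.open(fileName)
	if not file then return 'Unable to open ' .. fileName end
	fileTxt = file:read('*all')
	file:close()

	-- Length in bytes. For ANSI text this equals the number of characters.
	ANSILen = fileTxt:len()

	return fileTxt

end

function UTF8Unicode(fileName)

	local file = io.open(fileName)
	if not file then return 'Unable to open ' .. fileName end
	fileTxt = file:read('*all')
	file:close()

	-- Length in bytes, NOT characters. Global so the skin can read it
	-- with [&Lua:UTF8Len].
	UTF8Len = fileTxt:len()

	-- string.find returns byte positions, which string.sub can use directly.
	-- Global so the skin can read it with [&Lua:firstUnicode].
	local patStart, patEnd = string.find(fileTxt, '你', 1, true)
	firstUnicode = patStart and fileTxt:sub(patStart, patEnd) or ''

	return fileTxt

end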
ANSI.txt (encoded as ANSI)

Code:

This file is encoded as ANSI and has no Unicode characters.
UTF8Unicode.txt (encoded as UTF-8 with BOM)

Code:

This file is encoded as UTF8 with BOM and says 你好,世界 and Привет мир in Unicode.
Feel free to ask any questions...
Re: Unicode and external files in Lua

jsmorley » January 21st, 2018, 6:05 pm

Work-around for reading UTF-16 encoded files in Lua

If you have a file that is encoded as UTF-16 Little Endian, and you have no control over how this file is encoded, you can still read the file and take action based on the contents in Lua.

The trick is to use a WebParser parent measure, which while it expects UTF-8 by default, can be forced to read UTF-16 Little Endian. What you do is use the CodePage option on the measure.

Code:

[MeasureFile]
Measure=WebParser
URL=file://#CURRENTPATH#UTF16.txt
CodePage=1200
RegExp=(?siU)^(.*)$
StringIndex=1
Now this measure's string value will be the entirety of the named file, and in Lua you can read this:

Code:

myMeasure = SKIN:GetMeasure('MeasureFile')
fileTxt = myMeasure:GetStringValue()
This will put the entire contents of the file, including any Unicode, into the variable fileTxt, which you can tear apart into "lines" by looking for "\n" characters, and then take the same kinds of actions discussed in the earlier post. The same "limitations" with functions that attempt to treat characters as bytes should be considered.
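
As a minimal sketch of the "tear apart into lines" step: SplitLines is a name invented for this example, and the sample string stands in for what the WebParser measure would return.

```lua
-- Hypothetical helper: split text into a table of lines.
function SplitLines(text)
	local lines = {}
	-- "[^\r\n]+" matches runs of anything that is not a line break, so both
	-- Windows (\r\n) and Unix (\n) endings work. Empty lines are skipped.
	for line in text:gmatch("[^\r\n]+") do
		lines[#lines + 1] = line
	end
	return lines
end

-- In a skin the text would come from the WebParser measure:
-- local fileTxt = SKIN:GetMeasure('MeasureFile'):GetStringValue()
local fileTxt = "first line\r\nsecond line\nthird line"

for i, line in ipairs(SplitLines(fileTxt)) do
	print(i, line)
end
```

Since \r and \n are single ASCII bytes that never appear inside a multi-byte UTF-8 character, splitting this way should be safe for Unicode text.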

Do not attempt to write to this file using Lua. Doing so while keeping it UTF-16 is a whole other exercise: you have to recreate the entire file, first writing the BOM (byte order mark) for UTF-16 Little Endian (Hex: FF FE, String: ÿþ) and then the rest of the content. That may be a future discussion if someone runs into an absolute need. For this purpose, treat a UTF-16 file as always starting with a BOM.
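
For anyone who does run into that need, here is a rough sketch of the idea, not something taken from the original post and not tested inside Rainmeter. All the function names are invented for this example. It assumes the text held in Lua is valid UTF-8, decodes it to code points, and writes each one as UTF-16 Little Endian after the BOM, using a surrogate pair for characters above U+FFFF.

```lua
-- Hypothetical helpers, not part of Lua or Rainmeter. Assumes valid UTF-8.

-- Decode a UTF-8 string into a table of Unicode code points.
local function Utf8ToCodepoints(s)
	local cps = {}
	local i = 1
	while i <= #s do
		local b = s:byte(i)
		local cp, extra
		if b < 0x80 then cp, extra = b, 0             -- 1-byte (ASCII)
		elseif b < 0xE0 then cp, extra = b - 0xC0, 1  -- 2-byte sequence
		elseif b < 0xF0 then cp, extra = b - 0xE0, 2  -- 3-byte sequence
		else cp, extra = b - 0xF0, 3 end              -- 4-byte sequence
		-- Each continuation byte contributes its low 6 bits.
		for k = 1, extra do
			cp = cp * 64 + (s:byte(i + k) - 0x80)
		end
		cps[#cps + 1] = cp
		i = i + extra + 1
	end
	return cps
end

-- One 16-bit code unit as two bytes, low byte first (Little Endian).
local function Unit(u)
	return string.char(u % 256, math.floor(u / 256))
end

-- Convert UTF-8 text to UTF-16 LE bytes, starting with the BOM (FF FE).
function Utf8ToUtf16LE(s)
	local out = { "\255\254" }
	for _, cp in ipairs(Utf8ToCodepoints(s)) do
		if cp < 0x10000 then
			out[#out + 1] = Unit(cp)
		else
			-- Above U+FFFF: split into a high and a low surrogate.
			cp = cp - 0x10000
			out[#out + 1] = Unit(0xD800 + math.floor(cp / 0x400))
			out[#out + 1] = Unit(0xDC00 + cp % 0x400)
		end
	end
	return table.concat(out)
end

-- Recreate the whole file. "wb" (binary mode) keeps Windows from
-- rewriting line-break bytes inside the 16-bit units.
function WriteUtf16LE(fileName, text)
	local file, err = io.open(fileName, "wb")
	if not file then return nil, err end
	file:write(Utf8ToUtf16LE(text))
	file:close()
	return true
end
```

Note that this replaces the file's entire contents on every write, and that line breaks in the text are written exactly as given, so use "\r\n" if the file should have Windows line endings.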
~Faradey~
Posts: 367
Joined: November 12th, 2009, 4:47 pm
Location: Ukraine

Re: Unicode and external files in Lua

~Faradey~ » January 21st, 2018, 6:39 pm

Let me add a few things:
• If you use Lua to open a file with file = io.open(fileName), the file name (e.g. 'D:\Music\天下无双\menu.txt') shouldn't have Unicode characters in it, otherwise you will get nil as the file handle.

• To deal with string.len() and string.sub() for Unicode, use the scripts below. After including them (use dofile: https://docs.rainmeter.net/manual-beta/lua-scripting/#dofile), you can use the following functions:

> string.utf8len(s)
> string.utf8sub(s, i, j)
> string.utf8reverse(s)

Usage example

Code:

function Initialize() 
	-- loading UTF-8 func.
	dofile(SKIN:GetVariable('@')..'scripts\\utf8.lua')
end --func	
utf8.lua
utf8data.lua
-- Provides UTF-8 aware string functions implemented in pure lua:
-- * string.utf8len(s)
-- * string.utf8sub(s, i, j)
-- * string.utf8reverse(s)
--
-- If utf8data.lua (containing the lower<->upper case mappings) is loaded, these
-- additional functions are available:
-- * string.utf8upper(s)
-- * string.utf8lower(s)
--
-- All functions behave as their non UTF-8 aware counterparts with the exception
-- that UTF-8 characters are used instead of bytes for all units.
UTF8_lua_scripts.zip
Re: Unicode and external files in Lua

jsmorley » January 22nd, 2018, 2:32 pm

~Faradey~ wrote: • If you use lua to open file like file = io.open(fileName), filename (e.g 'D:\Music\天下无双\menu.txt') shouldn't have unicode characters in it, otherwise you will get nil as a file handle.
This bit is a good catch, and I have added it to my original post. Thanks!

On a side note, using Unicode in file names is tricky even in Rainmeter proper when used in config folders. The issue is that by default Rainmeter.ini is encoded as ANSI, and while you can load a skin with Unicode in the config folder names, it will create a section like [?????] in Rainmeter.ini. We are hesitant to encode Rainmeter.ini as UTF-16, which would solve this, as it would have backwards compatibility issues for anyone currently reading Rainmeter.ini with Lua.

I'm a little leery of the addon libraries, as I have looked at them in the past, and find that string.utf8sub(s, i, j) is so different in functionality from how string.sub(s, i, j) works that I don't find it much different than string.match(), which already works fine. Still, it is good to have them in your toolbox if you need them.