WebParser - What is that (?siU) all about?

Post by **jsmorley** » January 18th, 2014, 7:03 pm

When we use the WebParser plugin to parse sites or local files, we tend to always use a RegExp option that starts with RegExp=(?siU). It may not be clear what that (?siU) is at the beginning, so let me touch on that briefly.

When a group in a regular expression starts with (?, what is inside the parentheses is not text to match but a "directive" for the regular expression engine. We see this in "lookaround assertions" and other things we use for parsing. When the (? is followed by the characters we are talking about here, namely "s", "i" and "U", those are option flags that modify the behavior of the expression in specific ways.

What they mean in a nutshell is:

s : Let the dot . match newline characters too, so that "any character" in something like (.*) really means any character.
i : Ignore case in all matching.
U : Change the default PCRE behavior of "greedy matches" to "Ungreedy matches".

Let's go over them one at a time and demonstrate what they mean in practice.

First, here is some text in a sample Test.html file we can use for testing:

Code: Select all

<item>Item 1</item>
<item>Item 2</item>
<item>Item 3</item>

Let's try an example where we leave off the "U" directive. Our RegExp is RegExp=(?si)<item>(.*)</item>.

Code: Select all

[Rainmeter]
Update=1000
AccurateText=1
DynamicWindowSize=1

[MeasureDirective]
Measure=Plugin
Plugin=WebParser
Url=file://#CURRENTPATH#Test.html
RegExp=(?si)<item>(.*)</item>
StringIndex=1

Since PCRE regular expressions are by default "greedy", what this will do is find the first instance of the text "<item>", then match "any number of any characters" with (.*). In effect that runs all the way to the end of the string, then "backs up" until it finds </item>. This means you match the first instance of "<item>", then capture everything up to the last instance of "</item>".

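So what you get in StringIndex 1 is everything from just after the first "<item>" to just before the last "</item>", line breaks included:

Item 1</item>
<item>Item 2</item>
<item>Item 3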

Using the U directive makes the match "ungreedy": it will stop at the first instance of "</item>", and you get what you want and expect, namely just "Item 1" in StringIndex 1.
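For comparison, here is a minimal sketch of the same measure with U put back. The [MeterResult] section and its font options are my own additions for illustration and are not part of the test skin above. The commented-out line shows the lazy quantifier (.*?), which is the usual PCRE way to make just one quantifier ungreedy when U is not set.

```ini
[MeasureDirective]
Measure=Plugin
Plugin=WebParser
Url=file://#CURRENTPATH#Test.html
; U makes (.*) stop at the first </item> it can reach
RegExp=(?siU)<item>(.*)</item>
; Alternative without U: make only this one quantifier lazy
; RegExp=(?si)<item>(.*?)</item>
StringIndex=1

; Hypothetical meter, only here so the result is visible
[MeterResult]
Meter=String
MeasureName=MeasureDirective
FontSize=12
FontColor=255,255,255,255
AntiAlias=1
Text=%1
```

With the Test.html above, either form should give just "Item 1". Don't combine the two, though: when U is set, adding ? after a quantifier flips it back to greedy.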

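Now let's try an example where we leave off the "s" directive. This time we try to capture all three items and ask for the third one. Our RegExp is RegExp=(?iU)<item>(.*)</item>.*<item>(.*)</item>.*<item>(.*)</item>.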

Code: Select all

[Rainmeter]
Update=1000
AccurateText=1
DynamicWindowSize=1

[MeasureDirective]
Measure=Plugin
Plugin=WebParser
Url=file://#CURRENTPATH#Test.html
RegExp=(?iU)<item>(.*)</item>.*<item>(.*)</item>.*<item>(.*)</item>
StringIndex=3

This fails completely. As you can see from the Test.html above, there is a linefeed after each of the "items". By default the dot . means "any character except a newline" (spaces and tabs are matched either way), so although the first <item>(.*)</item> matches, the pattern fails right after it. It then tries to "skip any number of any characters" with .* and find another <item>, but is stopped by the linefeed before it gets there. With the s directive, literally ANY character, newlines included, is matched by the dot.
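As a sketch of the working version, here is the same pattern with s restored, plus one child measure per capture. The child measure names (MeasureItem1 and so on) are made up for this example; Url=[MeasureDirective] is how a WebParser child measure points at its parent.

```ini
[MeasureDirective]
Measure=Plugin
Plugin=WebParser
Url=file://#CURRENTPATH#Test.html
; s lets each .* cross the linefeed between the items
RegExp=(?siU)<item>(.*)</item>.*<item>(.*)</item>.*<item>(.*)</item>

; Hypothetical child measures, one per capture group
[MeasureItem1]
Measure=Plugin
Plugin=WebParser
Url=[MeasureDirective]
StringIndex=1

[MeasureItem2]
Measure=Plugin
Plugin=WebParser
Url=[MeasureDirective]
StringIndex=2

[MeasureItem3]
Measure=Plugin
Plugin=WebParser
Url=[MeasureDirective]
StringIndex=3
```

With the Test.html above, the three child measures should return "Item 1", "Item 2" and "Item 3".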

Last, let's try an example where we leave off the "i" directive. Our RegExp is RegExp=(?sU)<Item>(.*)</Item>. Note the capital "I" in both tags.

Code: Select all

[MeasureDirective]
Measure=Plugin
Plugin=WebParser
Url=file://#CURRENTPATH#Test.html
RegExp=(?sU)<Item>(.*)</Item>
StringIndex=1

This fails completely. Since by default regular expressions are "case sensitive", <Item> does not match <item>.
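For completeness, a minimal sketch of the same measure with i restored. Nothing else changes, and the capital "I" is left in the pattern on purpose.

```ini
[MeasureDirective]
Measure=Plugin
Plugin=WebParser
Url=file://#CURRENTPATH#Test.html
; i lets <Item> in the pattern match <item> in the file
RegExp=(?siU)<Item>(.*)</Item>
StringIndex=1
```

With the Test.html above, this should return "Item 1" again. The captured text comes from the file, so it keeps whatever case the file uses.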

Hope this helps with understanding why you see that (?siU) in almost every WebParser skin or example around. It's easier to just get in the habit of always using those directives with WebParser, even if you don't always need all of them. You can really end up scratching your head when your expression fails otherwise.