Greediness in regex question.

qwerky
Posts: 182
Joined: April 10th, 2014, 12:31 am
Location: Canada

Post by qwerky »

Given this html fragment taken directly from the source web page: <strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>, the object is to capture the date (in this case, 13). This regex: (?siU).*br>(\d+) should capture one or more (+) digits (\d), but in fact it captures only the first digit (1). However, by limiting the greediness of the '+': (?siU).*br>(\d+?), it captures both digits. Why is this necessary?
jsmorley
Developer
Posts: 22630
Joined: April 19th, 2009, 11:02 pm
Location: Fort Hunt, Virginia, USA

Re: Greediness in regex question.

Post by jsmorley »

Regular expressions are "greedy" by default. The directive that you commonly use with WebParser, (?siU), tells the entire regular expression to:

1) s Let "." match any character at all, including linefeeds, so that .* can span multiple lines.
2) i Ignore case.
3) U Be "Ungreedy".

Now "ungreedy" is good, in the sense that if you had this:

<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>
<strong title="Monday">Mon</strong><br>14&nbsp;<abbr title="January">Jan</abbr>
You wouldn't want a greedy regular expression like <br.*(\d\d)&nbsp to capture "14", which it would (with the s option set, so that .* can cross the line break). It would find the first instance of <br, then let .* swallow as many characters as possible, so the capture lands on the last instance of (\d\d)&nbsp.
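
The difference is easy to check outside Rainmeter. This is only a minimal Python sketch, not WebParser itself: Python's re module has no U option, so the "ungreedy" version is spelled out by hand as .*? (an assumed equivalent), and the variable names are made up for illustration.

```python
import re

# The two-line fragment from the example above.
html = (
    '<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>\n'
    '<strong title="Monday">Mon</strong><br>14&nbsp;<abbr title="January">Jan</abbr>'
)

# Greedy: .* runs to the end of the text, then backs up to the LAST \d\d&nbsp.
greedy = re.search(r'(?si)<br.*(\d\d)&nbsp', html).group(1)

# Ungreedy (written .*? here, since Python has no U): stops at the FIRST \d\d&nbsp.
ungreedy = re.search(r'(?si)<br.*?(\d\d)&nbsp', html).group(1)

print(greedy, ungreedy)  # 14 13
```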

Not using U "ungreedy" is going to cause no end of problems with WebParser. Things like RSS feeds that have multiple instances of stuff like <item>(.*)</item>, and weather skins that have multiple instances of <temp>(.*)</temp> will just give you fits without it.
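
To see why feeds give you fits without it, here is a rough Python sketch with a made-up three-item feed. As an assumption for illustration, the U behavior is imitated with .*? because Python's re module does not have a U option.

```python
import re

# Hypothetical feed with three items, one per line.
feed = '<item>First</item>\n<item>Second</item>\n<item>Third</item>'

# Greedy: a single match that runs from the first <item> to the last </item>.
greedy_items = re.findall(r'(?si)<item>(.*)</item>', feed)

# Ungreedy (written .*? in Python): one match per item.
ungreedy_items = re.findall(r'(?si)<item>(.*?)</item>', feed)

print(len(greedy_items))  # 1
print(ungreedy_items)     # ['First', 'Second', 'Third']
```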

So you want U "ungreedy" almost always with WebParser, but as you point out, that makes all quantifiers "ungreedy", including those on repeating character classes like (\d+) or ([\d]{1,2}) and such. Each one will match as few characters as it can get away with, so IF getting just the "first" one will satisfy the regular expression, that is all it takes.

In your example, since you DON'T END the capture with something specific, just one digit does satisfy the regular expression, and that is all you will get.

The right solution is not to force the quantifier to be "greedy" in specific cases, but to "end" the (capture) with something specific. If, for instance, you had RegExp=(?siU).*br>(\d+)&nbsp in your example, it would work fine. It would still be "ungreedy", which is good, but it would capture "13" instead of "1", because you have told it exactly when to stop capturing digits: at the first instance of &nbsp. So it is forced to capture both the "1" and the "3" in order to get to the "&nbsp".
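
The three variants discussed so far can be compared side by side. Again this is just a Python sketch under an assumption: each (?siU) pattern is translated by hand, swapping lazy and greedy quantifiers, because Python's re module has no U option. The comments show the WebParser pattern each line stands in for.

```python
import re

html = '<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>'

# (?siU).*br>(\d+)       every quantifier lazy, nothing after the capture
no_end = re.search(r'(?si).*?br>(\d+?)', html).group(1)

# (?siU).*br>(\d+?)      under U, the ? flips \d+ back to greedy
flipped = re.search(r'(?si).*?br>(\d+)', html).group(1)

# (?siU).*br>(\d+)&nbsp  still lazy, but the match must reach &nbsp
with_end = re.search(r'(?si).*?br>(\d+?)&nbsp', html).group(1)

print(no_end, flipped, with_end)  # 1 13 13
```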
qwerky
Posts: 182
Joined: April 10th, 2014, 12:31 am
Location: Canada

Re: Greediness in regex question.

Post by qwerky »

Thanks! :rosegift: That is a great explanation. :great:

I overlooked greediness being turned off by 'U', and thought I was turning it off with '?' when in fact I was turning it on. I didn't specify an end character because 1. I wasn't sure whether '&nbsp;' would be seen literally (should have known it would :oops: ) or would be interpreted as a (no-break)space, and 2. I was attempting to future-proof against the source being changed in the future to remove the no-break-space.

But testing with the end character works as expected, and is what I will use.
jsmorley
Developer
Posts: 22630
Joined: April 19th, 2009, 11:02 pm
Location: Fort Hunt, Virginia, USA

Re: Greediness in regex question.

Post by jsmorley »

As a rule of thumb, you are pretty much ALWAYS going to want to "end" all (captures) in a regular expression.

You don't have to specify the character string &nbsp to end the capture. You could simply end it with [^\d], "not a digit character".

RegExp=(?siU).*br>(\d+)[^\d]

So "capture digits until you hit something, anything, that isn't a digit".
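
A quick way to convince yourself that this survives a change in the source page is the Python sketch below. The sample strings other than the first are invented for illustration, and the lazy quantifiers are written out by hand as a stand-in for U, which Python's re module does not have.

```python
import re

# Hand-written equivalent of (?siU).*br>(\d+)[^\d]
pattern = re.compile(r'(?si).*?br>(\d+?)[^\d]')

samples = [
    '<br>13&nbsp;<abbr title="January">Jan</abbr>',  # as in the original page
    '<br>13 <abbr title="January">Jan</abbr>',       # hypothetical: &nbsp replaced by a space
    '<br>7</td>',                                    # hypothetical: single digit, different tag
]

captures = [pattern.search(s).group(1) for s in samples]
print(captures)  # ['13', '13', '7']
```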


Note that this won't work if the number in question is at the literal end of the stream/file. If you had:

<strong title="Sunday">Sun</strong><br>13

And there was literally nothing else, not a linefeed or anything after the "13", if that was the end of the file, then you would have to use:

RegExp=(?siU).*br>(\d+)$

or

RegExp=(?siU).*br>(\d+?)

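This is because "end of file" is not a character of any kind, so it simply can't be matched by a "not a character" class like [^\d]. It must be specified with the $ end of line/stream/file indicator. The second form works for a different reason: under U, the ? flips \d+ back to "greedy", so it takes every digit it can find and needs nothing after them.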
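
The end-of-file case can be sketched in Python as well, with the same caveat as before: the lazy and greedy quantifiers are swapped by hand to imitate U, which is an assumption about equivalence and not something Python's re module provides.

```python
import re

# The number is the very last thing in the text: no linefeed, nothing.
tail = '<strong title="Sunday">Sun</strong><br>13'

# (?siU).*br>(\d+)[^\d]  there is no character after "13", so no match at all
not_digit = re.search(r'(?si).*?br>(\d+?)[^\d]', tail)

# (?siU).*br>(\d+)$      $ matches at the end of the text
dollar = re.search(r'(?si).*?br>(\d+?)$', tail).group(1)

# (?siU).*br>(\d+?)      greedy again, takes all the digits
flipped = re.search(r'(?si).*?br>(\d+)', tail).group(1)

print(not_digit, dollar, flipped)  # None 13 13
```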
qwerky
Posts: 182
Joined: April 10th, 2014, 12:31 am
Location: Canada

Re: Greediness in regex question.

Post by qwerky »

Works great! :great: