Greediness in regex question.

Post by **qwerky** » January 24th, 2019, 11:59 pm

Given this HTML fragment taken directly from the source web page: <strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>, the object is to capture the day of the month (in this case, 13). This regex: (?siU).*br>(\d+) should capture one or more (+) digits (\d), but in fact it captures only the first digit (1). However, by adding '?' to limit the greediness of the '+': (?siU).*br>(\d+?), it captures both digits. Why is this necessary?

Post by **jsmorley** » January 25th, 2019, 12:24 am

Regular expressions are "greedy" by default. The directive that you commonly use with WebParser, (?siU), tells the entire regular expression to:

1) s Treat white space characters, tabs, linefeeds, etc. as "any character" for purposes of . (dot), so that .* can match across line breaks.
2) i Ignore case.
3) U Be "ungreedy".

Now "ungreedy" is good, in the sense that if you had this:

<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>
<strong title="Monday">Mon</strong><br>14&nbsp;<abbr title="January">Jan</abbr>

You wouldn't want a greedy regular expression like <br.*(\d\d)&nbsp to capture "14", which it would. It would find the first instance of <br then skip all characters .* until it found the last instance of (\d\d)&nbsp.
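The difference can be checked outside Rainmeter. The following is a minimal sketch in Go, whose standard regexp package is not the PCRE engine WebParser uses but does accept the same s and U flags for this pattern. The helper name firstCapture and the sample constant are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sample mirrors the two calendar entries quoted above (illustrative data).
const sample = `<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>
<strong title="Monday">Mon</strong><br>14&nbsp;<abbr title="January">Jan</abbr>`

// firstCapture is a hypothetical helper: it returns capture group 1 of the
// first match, and false when the pattern does not match at all.
func firstCapture(pattern, input string) (string, bool) {
	m := regexp.MustCompile(pattern).FindStringSubmatch(input)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Greedy: .* runs as far as it can, so the LAST two digits before &nbsp win.
	greedy, _ := firstCapture(`(?s)<br.*(\d\d)&nbsp`, sample)
	// Ungreedy (U): .* stops as early as it can, so the FIRST entry wins.
	ungreedy, _ := firstCapture(`(?sU)<br.*(\d\d)&nbsp`, sample)

	fmt.Printf("greedy:   %q\n", greedy)   // greedy:   "14"
	fmt.Printf("ungreedy: %q\n", ungreedy) // ungreedy: "13"
}
```

The s flag is included in both patterns so that .* is allowed to cross the line break between the two entries; without it the greedy version would be confined to the first line.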

Not using U "ungreedy" is going to cause no end of problems with WebParser. Things like RSS feeds that have multiple instances of stuff like <item>(.*)</item>, and weather skins that have multiple instances of <temp>(.*)</temp> will just give you fits without it.

So you want U "ungreedy" almost always with WebParser, but as you point out, that makes all quantifiers "ungreedy", including any references to repeating character classes like (\d+) or ([\d]{1,2}) and such, IF getting just the "first" one will satisfy the regular expression.

In your example, since you DON'T END the capture with something specific, just one digit does satisfy the regular expression, and that is all you will get.

The right solution is not to force the quantifier to be "greedy" in specific cases, but to "end" the (capture) with something specific. If, for instance, you had RegExp=(?siU).*br>(\d+)&nbsp in your example, it would work fine. It would still be "ungreedy", which is good, but it would capture "13" instead of "1", because you have told it exactly when to stop capturing digits: at the first instance of &nbsp. It is therefore forced to capture both the "1" and the "3" in order to reach the "&nbsp".
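To see the three variants side by side, here is a small Go sketch. Go's regexp package is RE2-based rather than PCRE, but it honours the same (?siU) directive for these patterns; the helper firstCapture and the entry constant are assumptions for the example, not part of WebParser.

```go
package main

import (
	"fmt"
	"regexp"
)

// entry is the single calendar entry from the question (illustrative data).
const entry = `<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>`

// firstCapture is a hypothetical helper: it returns capture group 1 of the
// first match, and false when the pattern does not match at all.
func firstCapture(pattern, input string) (string, bool) {
	m := regexp.MustCompile(pattern).FindStringSubmatch(input)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Nothing follows the capture, so the ungreedy \d+ is satisfied by one digit.
	open, _ := firstCapture(`(?siU).*br>(\d+)`, entry)
	// Under U the trailing ? flips this one quantifier back to greedy.
	flipped, _ := firstCapture(`(?siU).*br>(\d+?)`, entry)
	// The capture is "ended" by &nbsp, so \d+ must grow until &nbsp can match.
	ended, _ := firstCapture(`(?siU).*br>(\d+)&nbsp`, entry)

	fmt.Printf("%q %q %q\n", open, flipped, ended) // "1" "13" "13"
}
```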

Post by **qwerky** » January 25th, 2019, 1:03 am

Thanks!

That is a great explanation.

I overlooked greediness being turned off by 'U', and thought I was turning it off with '?' when in fact I was turning it on. I didn't specify an end character because 1. I wasn't sure whether '&nbsp;' would be seen literally (should have known it would) or would be interpreted as a (no-break) space, and 2. I was attempting to future-proof against the source being changed in the future to remove the no-break space.

But testing with the end character works as expected, and is what I will use.

Post by **jsmorley** » January 25th, 2019, 1:09 am

As a rule of thumb, you are pretty much ALWAYS going to want to "end" all (captures) in a regular expression.

You don't have to specify the character string &nbsp to end the capture; you could simply end it with [^\d], "not a digit character".

RegExp=(?siU).*br>(\d+)[^\d]

So "capture digits until you hit something, anything, that isn't a digit".
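As a rough check of the future-proofing point, the Go sketch below runs that pattern against the current markup and against a hypothetical variant in which the site has replaced &nbsp; with a plain space. Go's regexp package stands in for WebParser's PCRE engine here, and both sample strings and the helper name are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// Two illustrative inputs: the current markup, and an assumed future variant
// where the no-break space entity has been replaced by an ordinary space.
const (
	withNbsp  = `<strong title="Sunday">Sun</strong><br>13&nbsp;<abbr title="January">Jan</abbr>`
	withSpace = `<strong title="Sunday">Sun</strong><br>13 <abbr title="January">Jan</abbr>`
)

// firstCapture is a hypothetical helper: it returns capture group 1 of the
// first match, and false when the pattern does not match at all.
func firstCapture(pattern, input string) (string, bool) {
	m := regexp.MustCompile(pattern).FindStringSubmatch(input)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// [^\d] ends the capture at the first non-digit, whatever that character is.
	const pattern = `(?siU).*br>(\d+)[^\d]`

	a, _ := firstCapture(pattern, withNbsp)
	b, _ := firstCapture(pattern, withSpace)
	fmt.Printf("%q %q\n", a, b) // "13" "13"
}
```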

Note that this won't work if the number in question is at the literal end of the stream/file. If you had:

<strong title="Sunday">Sun</strong><br>13

And there was literally nothing else, not a linefeed or anything after the "13", if that was the end of the file, then you would have to use:

RegExp=(?siU).*br>(\d+)$

or

RegExp=(?siU).*br>(\d+?)

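This is because "end of file" is not a character of any kind, so it simply can't be matched by a "not" character class like [^\d]. It must be specified with the $ end of line/stream/file indicator. The second form works for a different reason: under U, the ? after the + makes that one quantifier greedy again, so it takes every digit without needing anything after it.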
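The end-of-input case can be sketched in Go as follows, assuming the input stops immediately after the digits with no trailing linefeed. Go's regexp package is used as a stand-in for PCRE; note that without the m flag its $ matches only at the very end of the text. The constant atEOF and the helper firstCapture are names invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// atEOF is an illustrative input that ends right after the digits:
// no linefeed, nothing at all after "13".
const atEOF = `<strong title="Sunday">Sun</strong><br>13`

// firstCapture is a hypothetical helper: it returns capture group 1 of the
// first match, and false when the pattern does not match at all.
func firstCapture(pattern, input string) (string, bool) {
	m := regexp.MustCompile(pattern).FindStringSubmatch(input)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// [^\d] needs a real character after the digits, and there is none.
	_, ok := firstCapture(`(?siU).*br>(\d+)[^\d]`, atEOF)
	fmt.Println(ok) // false

	// $ matches the end of the input itself, so the capture can be "ended" there.
	dollar, _ := firstCapture(`(?siU).*br>(\d+)$`, atEOF)
	// \d+? is greedy under U, so it takes all digits with no terminator needed.
	flipped, _ := firstCapture(`(?siU).*br>(\d+?)`, atEOF)
	fmt.Printf("%q %q\n", dollar, flipped) // "13" "13"
}
```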

Post by **qwerky** » January 25th, 2019, 1:31 am

Works great!
