RegExp matching error (-1), (-8)

Post by **qwerky** » March 24th, 2019, 7:48 pm

jsmorley wrote: ↑March 24th, 2019, 2:35 am I am myself skeptical of parsing HTML sites. Not because it can't be done, because parsing is parsing, and as long as you can figure out a reasonable pattern for how things are presented in the HTML code, it's no harder than parsing XML (RSS or ATOM).

However, I would caution, and have had it proven more times than I care to count, that you can't depend on any HTML site staying the way it is for very long. It is up to some webmaster who is currently in a bad relationship, or some marketing drone or pointy-haired manager who doesn't even know HTML, or even just some outsourced code monkey they hire to add some new information to the site on a one-week contract. There is never any promise, and should never be any expectation, that any regular expression you used yesterday will work tomorrow. In addition, this may mean a minor tweak, or a complete rewrite of everything you have done. One is as likely as the other.

Fully agree with this... see below.

Yincognito wrote: ↑March 24th, 2019, 3:34 am You're probably right, after all, it's a changing world each day. I don't know the specifics of your skin and address that you access, but usually sticking with generic stuff, choosing carefully what you interrogate/parse (e.g. elements that you noticed that are more "persistent" in time) and using patterns instead of specific/hard-coded references when you're parsing should help. This is where generic stuff to "fill the blanks" like .* for regex, or flexible selectors like querySelectorAll() for actual parsers come in handy.

This topic is perfect for such a discussion, since qwerky apparently also uses a site that suffers (this time, seasonal) changes over a period of time.

While this is true, it is not complete. I wrote my weather skin back in 2014, and it worked for a while. But eventually, it blew up when they greatly changed the code, as described above, and I didn't really find the time to fix it until now. So, now I'm doing a complete rewrite.

Yincognito wrote: ↑March 24th, 2019, 3:34 am Me, I would have tried not to be picky and ditch the volatile Wind Chill, Heat Index or Feels Like (in his case), and just stick with the stuff that is always there, like plain temperature, or wind speed or whatever. Some element might seem cool to have and all, but bear little importance compared to the more generic ones.

There are other factors which determine in part what is important, such as my wife constantly quoting Wind Chill!

jsmorley wrote: ↑March 24th, 2019, 2:35 am I have a skin that uses an HTML site and tracks the standings in Major League Baseball, and while I really like it and use it myself every year, I find I have to put a fair amount of work into it every year, right at the start of the season. I do so, but not without some grumbling....

[Ears perk up!] That sounds like a terrific skin!

But on the other hand, there is only one team for us Canadians anyway.

Post by **jsmorley** » March 24th, 2019, 10:09 pm

https://www.deviantart.com/jsmorley/art/MLBStandings-2018-1-0-748675475


Post by **qwerky** » March 24th, 2019, 10:29 pm

Got it, thanks! Can't wait for your 2019 update.

Though it looks good as is; can't really tell whether it needs an update, since it seems to display fine. But as the season hasn't started, the records are all zero of course. Might be nice to add "Wild Card Games Back"?

Post by **Yincognito** » March 24th, 2019, 11:06 pm

qwerky wrote: ↑March 24th, 2019, 7:48 pm While this is true, it is not complete. I wrote my weather skin back in 2014, and it worked for a while. But eventually, it blew up when they greatly changed the code, as described above, and I didn't really find the time to fix it until now. So, now I'm doing a complete rewrite.

jsmorley wrote: ↑March 24th, 2019, 10:09 pm https://www.deviantart.com/jsmorley/art/MLBStandings-2018-1-0-748675475

Could you guys give me an example of how the code of the sites you use as sources changed? Just one or two instances that you remember, to have an idea of what this is all about. I mean, I know the code of a site changes over time, but were those changes so significant that you couldn't find the elements you took data / parsed from anymore? I find that hard to believe, personally. The HTML/JS elements and/or attributes don't change that often (apart from some site-wide changes lately to make them more "mobile friendly" and such), the basic weather elements that are measured are pretty much the same since the Stone Age, so those don't seem to be a problem. The only potential issue I can see is that the notation of the things you want to get changed (e.g. the name of the teams or the name of the classes/attributes maybe, in jsmorley's case, or things like 'WindChill' to 'wchill' or something like that in qwerky's case) - but that can be easily made more bearable if you just put those "tags" in some variables that you can modify with the help of some InputText boxes in a configuration skin or by editing the corresponding Variables.inc file directly, then apply them in the regexes you use to parse the whole thing.

I don't know, maybe I'm missing something here, but in my opinion a site can't change that much to be completely unrecognizable...
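The idea above of keeping the site's labels in editable variables and applying them in the regexes can be sketched outside Rainmeter too. The Python snippet below is only an illustration under assumptions, not code from any of the skins discussed: the `TAGS` mapping stands in for a hypothetical Variables.inc, and the `<span class="...">` markup, label names and helper names are made up for the example.

```python
import re

# Hypothetical labels, standing in for entries in a Variables.inc file.
# If the site renames 'WindChill' to 'wchill', only this mapping changes.
TAGS = {"wind_chill": "wchill", "temperature": "temp"}

def build_pattern(tag: str) -> re.Pattern:
    """Build a regex that captures the text of the element labelled `tag`."""
    # re.escape keeps any special characters in the label literal.
    return re.compile(
        r'<span class="' + re.escape(tag) + r'">(.*?)</span>',
        re.IGNORECASE | re.DOTALL,
    )

def extract(html: str, key: str) -> str:
    """Return the captured value for `key`, or '' when the element is absent."""
    match = build_pattern(TAGS[key]).search(html)
    return match.group(1) if match else ""

# Made-up sample markup for illustration.
sample = '<span class="temp">-5</span><span class="wchill">-12</span>'
wind_chill = extract(sample, "wind_chill")
temperature = extract(sample, "temperature")
```

A rename on the site then means editing one entry in `TAGS` rather than every pattern. It does not help when the surrounding markup itself is restructured.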

Post by **jsmorley** » March 24th, 2019, 11:50 pm

At least with that MLB skin of mine, the site that displays the standings has been radically changed cosmetically every single year. I'm hopeful, since it still works for this year, that maybe they are going to give it a rest for a year, but I'm not counting on it.

I said "cosmetically", and you might be tempted to say "well, who cares about cosmetics?", but that's missing the point. When you are parsing HTML and not XML / ATOM / RSS, it's all cosmetics. There is no set structure to how the data is displayed.

Post by **qwerky** » March 24th, 2019, 11:59 pm

Yincognito wrote: ↑March 24th, 2019, 11:06 pm Could you guys give me an example of how the code of the sites you use as sources changed? Just one or two instances that you remember, to have an idea of what this is all about. I mean, I know the code of a site changes over time, but were those changes so significant that you couldn't find the elements you took data / parsed from anymore? I find that hard to believe, personally. The HTML/JS elements and/or attributes don't change that often (apart from some site-wide changes lately to make them more "mobile friendly" and such), the basic weather elements that are measured are pretty much the same since the Stone Age, so those don't seem to be a problem. The only potential issue I can see is that the notation of the things you want to get changed (e.g. the name of the teams or the name of the classes/attributes maybe, in jsmorley's case, or things like 'WindChill' to 'wchill' or something like that in qwerky's case) - but that can be easily made more bearable if you just put those "tags" in some variables that you can modify with the help of some InputText boxes in a configuration skin or by editing the corresponding Variables.inc file directly, then apply them in the regexes you use to parse the whole thing.

I don't know, maybe I'm missing something here, but in my opinion a site can't change that much to be completely unrecognizable...

Just to be clear, I chose to rewrite the entire skin, but could have gotten by with just rewriting all of the regex's. But yes, it would have required pretty much a complete rewrite of the entire regex group.

Not only do names change, as you noted above, but also the order in which they appear! Also, the groups they are in, the order of those groups, and yes occasionally even new elements. And as so much time has passed, of course I didn't remember why the regex's were a particular way, or even how they worked. I use regex very infrequently, so it would have required relearning anyway.

By rewriting the skin, rather than just the regex's, there was an opportunity to add new elements, such as latitude and longitude, past twenty-four hour stats (these are on a completely separate HTML page), weather alerts text (again, another HTML page), and so forth.

Having said that, I would be very interested in any tips or techniques you would like to share, that would make all of this much easier the next time they rewrite the site--which I have no doubt will happen sooner or later.

Post by **Yincognito** » March 25th, 2019, 12:37 am

jsmorley wrote: ↑March 24th, 2019, 11:50 pm At least with that MLB skin of mine, the site that displays the standings has been radically changed cosmetically every single year. I'm hopeful, since it still works for this year, that maybe they are going to give it a rest for a year, but I'm not counting on it.

I said "cosmetically", and you might be tempted to say "well, who cares about cosmetics?", but that's missing the point. When you are parsing HTML and not XML / ATOM / RSS, it's all cosmetics. There is no set structure to how the data is displayed.

I get what you're saying...sort of. Is it that, for example, in place of a <span> you have a <div> or something like that? Or maybe changes to those nowrap or colspan attributes you use as "landmarks"/"anchors" in your regex? Anyway, looking at the WebParserDump.txt from your skin, the site or its parsing doesn't seem overkill, and when you have a superparent, like I also generally use, all that needs to be changed is a single regex.

What I do fully understand though is having to constantly modify the regex every couple of months or so. As for the complexity, as I said before, I don't think it's that bad, IMHO.

Post by **jsmorley** » March 25th, 2019, 1:40 am

Yincognito wrote: ↑March 25th, 2019, 12:37 am I get what you're saying...sort of. Is it that, for example, in place of a <span> you have a <div> or something like that? Or maybe changes to those nowrap or colspan attributes you use as "landmarks"/"anchors" in your regex? Anyway, looking at the WebParserDump.txt from your skin, the site or its parsing doesn't seem overkill, and when you have a superparent, like I also generally use, all that needs to be changed is a single regex.

What I do fully understand though is having to constantly modify the regex every couple of months or so. As for the complexity, as I said before, I don't think it's that bad, IMHO.

Maybe. Maybe not. Like I said, parsing is parsing. It is certainly never going to be any harder than it was to do it in the first place. The question is how much do the regular expressions need to be changed, and are you going to still be supporting the skin to do it.

I'm not suggesting you don't parse HTML, go for it. I'm just saying that if I can find a nice standard XML version of the data I want, I'm going that route.

Post by **Yincognito** » March 25th, 2019, 2:00 am

qwerky wrote: ↑March 24th, 2019, 11:59 pm Not only do names change, as you noted above, but also the order in which they appear! Also, the groups they are in, the order of those groups, and yes occasionally even new elements.

Having said that, I would be very interested in any tips or techniques you would like to share, that would make all of this much easier the next time they rewrite the site--which I have no doubt will happen sooner or later.

As a matter of fact, I do have some tips to deal with the different order of elements, groups, whatever. I'm successfully applying the trick in my feeds skin, because, as you probably know, the RSS/ATOM elements can be required, can be optional, can also be written in a different order, can have additional (and similar) elements, etc.

Regex is not so good when you have to deal with a different order of elements. You can use | (the regex OR) for two, three possibilities, but since the number of permutations is the factorial of the number of elements, they increase rapidly, and so does the complexity of the regex. What I do to handle that is not use a standard "parser style" regex, but rather a "remover style" substitute.

For example, let's say I have the string: <E1>a</E1><E2>b</E2><E3>c</E3>, where the elements E1, E2 and E3 can be in any kind of order, and I want to get the contents of E3. Instead of parsing the string with a standard regex like (?siU)<E\d>(.*)</E\d>.*<E\d>(.*)</E\d>.*<E\d>(.*)</E\d> in the WebParser parent measure and taking StringIndex=3 in one of its children, I'm passing the whole string (or just the piece where the element I'm looking for is) to a String measure (just like the WebParser parent is doing with its children), where I take only E3, using a Substitute="(?siU)(?:(?(?=.*<E3>).*<E3>(.*)</E3>.*+)|.*+)":"\1","(?:^\\1|\\1$)":"" (with RegExpSubstitute=1 set on that measure) and basically delete everything else. The substitute is not as complicated as it looks: it simply looks ahead for <E3> (the (?=...) part), all wrapped in a regex conditional (the (?(?=...)...|...) part) that instructs the regex engine to give me either the contents of E3 ... or (the | part in the conditional) nothing, since there is no capture group after the |. Now since the regex engine in Rainmeter has a problem returning empty strings, there will be possible \1 leftovers after the first substitute operation, which I delete in the second substitute operation.

The above will perform exactly like a WebParser child, returning the contents of E3 wherever that is in the string and irrespective of the order. It will be empty if the contents are empty, just like a WebParser child. The only difference is that you'd have to either use a bang to manually pass the whole string from the WebParser parent to the String measure where you do the substitutions ("fun" fact, substitutions have no effect on a WebParser parent), or just set the WebParser parent as the value of the String option in the String measure.
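As a rough illustration of the "remover style" idea outside Rainmeter, the Python sketch below replaces the whole string with the contents of E3, wherever E3 sits. It is not the Substitute shown above: Python's `re` module has no ungreedy U flag and does not accept a lookahead as a conditional, so a plain alternation is used as a stand-in, and the helper name `content_of` is made up for the example.

```python
import re

def content_of(tag: str, text: str) -> str:
    """Return the contents of <tag>...</tag> from `text`, or '' if absent."""
    name = re.escape(tag)
    # First branch: consume everything, capturing only the tag's contents.
    # Second branch: the tag is missing, so consume everything, capture nothing.
    pattern = r"(?si)\A(?:.*?<{0}>(.*?)</{0}>.*|.*)\Z".format(name)
    # An unmatched group is replaced by an empty string (Python 3.5+),
    # so there is no leftover backreference text to clean up here.
    return re.sub(pattern, r"\1", text, count=1)

first = content_of("E3", "<E1>a</E1><E2>b</E2><E3>c</E3>")
second = content_of("E3", "<E3>c</E3><E1>a</E1><E2>b</E2>")
missing = content_of("E3", "<E1>a</E1><E2>b</E2>")
```

Both orderings give `c`, and the string without an E3 gives an empty result, which mirrors the WebParser-child behaviour described above.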

NOTE: I've used .*+ at the end of the first substitute operation to consume the rest of the string. Because the U flag inverts greediness, this could also have been written as .*? (which under U behaves like the usual greedy .*, i.e. matching as many characters as possible). The .*+ form is not just greedy but possessive, i.e. it matches as many characters as possible and does not release them to match subsequent tokens.
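To see greedy and lazy matching in isolation, here is a small Python sketch with a made-up sample string. Note the assumption: Python's `re` has no U flag, so `.*` is greedy and `.*?` is lazy here, which is the reverse of how the two behave under (?U) in the Rainmeter patterns above.

```python
import re

# Made-up sample with two E3 elements, to make the difference visible.
sample = "<E3>c</E3><E3>d</E3>"

# Greedy: .* grabs as much as possible, then backtracks to the LAST </E3>.
greedy = re.search(r"<E3>(.*)</E3>", sample).group(1)

# Lazy: .*? grabs as little as possible, stopping at the FIRST </E3>.
lazy = re.search(r"<E3>(.*?)</E3>", sample).group(1)
```

Here `greedy` is `c</E3><E3>d` and `lazy` is `c`. A possessive `.*+` in the same position (accepted by `re` from Python 3.11) would make the match fail outright, because it never gives back the characters needed to match the closing `</E3>`. That is why it is only safe at the very end of a pattern, as in the Substitute above.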

Post by **Yincognito** » March 25th, 2019, 2:32 am

jsmorley wrote: ↑March 25th, 2019, 1:40 am I'm not suggesting you don't parse HTML, go for it. I'm just saying that if I can find a nice standard XML version of the data I want, I'm going that route.

I agree with you on this. My point was not that I would prefer to regex parse HTML instead of XML, but rather that it can be done, if you're aware of regex' limits...despite what they say on the more technical places of the internet.

I'm always in the "yes, the impossible is possible" camp, as you probably have noticed already, LOL. That's why I am "triggered" / challenged by phrases like "regex can't be used to parse HTML" or "this will end up in tears"...

Fun fact, since we're at it, !WriteKeyValue failed before regex or the memory in my feeds skin, where I successfully "parse" and "browse" 100 feeds x 100 items for each feed, as promised. It seems !WriteKeyValue can't write too much data, so my nice debug system is useless for larger strings. Frankly, I would have expected regex to fail before !WriteKeyValue, as you warned me, but lo and behold, it was the other way.
