weather.com - Parsing the HTML

Yincognito
Posts: 837
Joined: February 27th, 2015, 2:38 pm

Re: ⭐ Some Help With Parsing weather.com

Post by Yincognito »

SilverAzide wrote:
January 24th, 2020, 2:06 am
Hello... found a bug in the "next 36 hours" include file. Every [@CardNConditions] and [@CardNDetails] measure needs to have DecodeCharacterReference=1 added. This will fix non-English text, like Pluie dans l'après-midi / Chutes de neige.

Secondly, a question...
I am seeing text like the following when I set my locale to French (fr-FR):

Code: Select all

Nuageux. Maximales : 2 ºC. Vents O soufflant de 15 à 25 km\u002Fh.
I have no idea what is up with that "\u002F" stuff even with the DecodeCharacterReference=1 added, but when I try a Substitute to turn this style of markup to the Rainmeter-equivalent "[\x002F]" I just get the literal result of "Nuageux. Maximales : 2 ºC. Vents O soufflant de 15 à 25 km[\x002F]h." Perhaps my regexp is bad (I am the first to admit I stink at regexps). Brute force Substitute="\u002F":"/" works of course, but I don't want to do that for all 64000 codes, lol. Maybe the slash is being treated uniquely and I can skip the other 63999 values. ;)
The thing with those [\x002F] character references is that their implementation in Rainmeter is kind of "strange", much like the implementation of DecodeCharacterReference, which is simplistic and often ineffective in more complex scenarios or when multiple encodings are layered on top of each other. Heck, even my decoder variables handle more cases than it does, although I admit I didn't handle the \uCODE format, but only because I didn't need to at the time. Together, the two produce unwanted "artifacts" once in a while, so to speak. I'm not criticizing a part of a free software, I'm just telling the truth, and it's backed by evidence.

The problem with the [\x002F] character references, as far as I tested, is that, curiously, when [\x002F] is written literally in a Substitute, it almost always works. However, if it's part of, say, a capture reference in a regex variable, it often fails to do what it's supposed to do. Even more curious is that in my weather skin I somehow managed to do pretty much the same as below, and it worked (when displaying data only, NOT when further manipulating the string with regex!). I have no idea why it worked, although I tried to break the problem into small pieces to understand what's right or wrong, but in my opinion the devs should take a look at that. Here's a sample (feel free to move the post to its appropriate place in the forum if you think it's necessary; I only mentioned this here because there were already some replies dealing with the issue):

Code: Select all

[Rainmeter]
Update=1000
DynamicWindowSize=1
AccurateText=1

[Variables]
Decoder="(?i)\\u([a-f\d]{4})":"[\x\1]"

[MeasureBad]
Measure=String
String="\u002F"
RegExpSubstitute=1
;Substitute="#Decoder#"
Substitute="(?i)\\u([a-f\d]{4})":"[\x\1]"
;Substitute="\u002F":"[\x002F]"
;DynamicVariables=1

[MeterGood]
Meter=String
Y=0R
FontSize=20
FontWeight=500
FontColor=255,255,255,255
SolidColor=47,47,47,255
Padding=5,5,5,5
AntiAlias=1
Text=Hello [\x002F]

[MeterBad]
Meter=String
Y=0R
FontSize=20
FontWeight=500
FontColor=255,255,255,255
SolidColor=47,47,47,255
Padding=5,5,5,5
AntiAlias=1
Text=Hello [MeasureBad]
As you can see, uncommenting the 3rd Substitute in [MeasureBad] yields the right result, while the other two (which are basically the same thing) do not.
jsmorley wrote:
January 24th, 2020, 2:49 am
Fixed this in the .rmskin in the first post of this thread.
The skin is fixed, but the main issue ... well, let's say it's still under review. ;-)
jsmorley
Developer
Posts: 20297
Joined: April 19th, 2009, 11:02 pm
Location: Fort Hunt, Virginia, USA

Re: ⭐ weather.com - Some Tools for Parsing

Post by jsmorley »

I don't know about all the ins and outs of your post. I glanced at it and just put on my Peril Sensitive Sunglasses, which solved it entirely for me. But to deal with those Unicode character references in the weather.com data, I just use:

Code: Select all

RegExpSubstitute=1
Substitute="\\u002F":"/"
That works fine, and every time, to turn km\u002Fh into km/h. This is the only code of that type I have run into on weather.com so far. It's possible we could run into others I guess. No idea why they did this, the / character is not "reserved" or open to any confusion that I am aware of, but there it is...

These are not HTML character references, like &amp; or &#38;, which are handled in WebParser by DecodeCharacterReference=1.
SilverAzide
Posts: 700
Joined: March 23rd, 2015, 5:26 pm

Re: ⭐ weather.com - Some Tools for Parsing

Post by SilverAzide »

I've been grabbing weather from all sorts of locales for testing purposes and -- so far (fingers crossed) -- that "\u002F" code appears to be unique in their data. I've yet to see any other Unicode references like this.

That being said, there is the remaining question of why the "generic regexp" substitution (as shown in Yincognito's post) isn't working. Something seems out of whack, but -- Don't Panic! -- there is that simple work-around.

Re: ⭐ weather.com - Some Tools for Parsing

Post by jsmorley »

SilverAzide wrote:
February 8th, 2020, 4:13 pm
I've been grabbing weather from all sorts of locales for testing purposes and -- so far (fingers crossed) -- that "\u002F" code appears to be unique in their data. I've yet to see any other Unicode references like this.

That being said, there is the remaining question of why the "generic regexp" substitution (as shown in Yincognito's post) isn't working. Something seems out of whack, but -- Don't Panic! -- there is that simple work-around.
No doubt related to the "order" in which things are done. Any option in Rainmeter has to be "parsed", and then different kinds of "nesting" and "replacements" and such have to be done. In spite of what one might think, a computer can't do two things at once; it has to decide on some order and do them one at a time. Digging into this is beyond my coding skills in C++, but perhaps Brian can take a look at this at some point.

Re: ⭐ weather.com - Some Tools for Parsing

Post by jsmorley »

If you prefer, not sure why you would, this also works:

Code: Select all

[MeasureString]
Measure=String
String=embedded \u002F code
RegExpSubstitute=1
Substitute="\\u002F":"[\x002F]"


But this will NOT:

Code: Select all

[MeasureString]
Measure=String
String=embedded \u002F code
RegExpSubstitute=1
Substitute="\\u(002F)":"[\x\1]"


Again, no doubt due to the "order" that the regular expression \1 replacement and the Character Variable replacement are done in. Looks to me like the Character Variable is evaluated first, which results in an invalid and thus ignored character reference, then the regular expression replaces the \1, and you end up with a literal [\x002F]. The Character Variable replacement only gets one bite at the apple. I suspect that changing this would 1) Probably cause other problems with the opposite order, and 2) Cause Backwards Compatibility issues. We are just not going to have "variable" replacements and "regular expression" replacements chase each other around in circles.
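
To make that ordering concrete, here is a toy model in Python. It is not Rainmeter's actual code and not something a skin can run. It assumes, as described above, that character variables in the option text are resolved once when the option is read, and that the regular expression replacement runs afterwards. The names resolve_character_variables and apply_substitute, and the exact shape of the [\x....] pattern, are made up for this sketch.

```python
import re

# Assumed shape of a character variable: [\x followed by 1 to 4 hex digits, then ]
CHAR_VAR = re.compile(r"\[\\x([0-9A-Fa-f]{1,4})\]")


def resolve_character_variables(text):
    """Replace every well-formed [\\xHHHH] token with the character it names."""
    return CHAR_VAR.sub(lambda m: chr(int(m.group(1), 16)), text)


def apply_substitute(value, pattern, replacement_option):
    """Model of one Substitute pair under the assumed two-step order."""
    # Step 1 (assumed to happen first): character variables in the option
    # text are resolved. "[\x\1]" is not well-formed, so it is left alone.
    replacement = resolve_character_variables(replacement_option)

    # Step 2: the regex runs and expands \1, \2, ... from the match.
    def expand(match):
        return re.sub(
            r"\\(\d)",
            lambda ref: match.group(int(ref.group(1))),
            replacement,
        )

    return re.sub(pattern, expand, value)


# Literal reference: resolved in step 1, so the result contains "/".
print(apply_substitute(r"embedded \u002F code", r"\\u002F", r"[\x002F]"))

# Capture reference: step 1 finds nothing to resolve, and step 2 then
# builds the text [\x002F] too late for it to be resolved.
print(apply_substitute(r"embedded \u002F code", r"\\u(002F)", r"[\x\1]"))
```

Under these assumptions the first call gives "embedded / code" and the second gives "embedded [\x002F] code", which matches what the two measures above display.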

Re: ⭐ weather.com - Some Tools for Parsing

Post by Yincognito »

jsmorley wrote:
February 8th, 2020, 3:50 pm
...I glanced at it and just put on my Peril Sensitive Sunglasses...
LMAO. Yeah, don't worry, everyone rants about something once in a while. That sensitive attitude doesn't last much anyway in my case, so it's all good. :D
jsmorley wrote:
February 8th, 2020, 4:28 pm
No doubt related to the "order" in which things are done. Any option in Rainmeter has to be "parsed", and then different kinds of "nesting" and "replacements" and such have to be done. In spite of what one might think, a computer can't do two things at once; it has to decide on some order and do them one at a time. Digging into this is beyond my coding skills in C++, but perhaps Brian can take a look at this at some point.
That was my first guess as well, that it's about the order in which things are evaluated. It makes sense: [\x\1] is not a Unicode character reference by itself, so it gets ignored by the Rainmeter algorithm that detects these (which executes first), and the regex (which executes afterwards) then proceeds with its usual capture reference. The thing is, it doesn't explain why passing the actual "[\x002F]" string to a second measure (also a String measure) doesn't evaluate it to /. For example, if you add the following measure and meter to the code I posted, it gives the same result, despite nothing interfering with the Unicode character reference algorithm this time:

Code: Select all

[MeasureBadGood]
Measure=String
String=[MeasureBad]
DynamicVariables=1

[MeterBadGood]
Meter=String
Y=0R
FontSize=20
FontWeight=500
FontColor=255,255,255,255
SolidColor=47,47,47,255
Padding=5,5,5,5
AntiAlias=1
Text="Hello [MeasureBadGood]"
Nothing should prevent [\x002F] from being evaluated as / this time, but it fails again. That is why I didn't post the order-of-evaluation explanation myself, although I thought about doing it, and in the end went for what seemed the more probable cause: that [\x002F] wasn't provided literally, as a constant.
SilverAzide wrote:
February 8th, 2020, 4:13 pm
I've been grabbing weather from all sorts of locales for testing purposes and -- so far (fingers crossed) -- that "\u002F" code appears to be unique in their data. I've yet to see any other Unicode references like this. That being said, there is the remaining question of why the "generic regexp" substitution (as shown in Yincognito's post) isn't working. Something seems out of whack, but -- Don't Panic! -- there is that simple work-around.
It is NOT unique! However, the good thing is that there aren't too many others, just these, from my analysis (I added the backslash just in case, although I didn't find it in the source):

Code: Select all

Decoder="\\u002F":"/","\\u005C":"\","\\u003C":"<","\\u003E":">"
In any case, the overwhelming majority of those references are \u002F (at least two thirds of them), which is probably why you haven't noticed the others in the source. These characters are referenced this way because they are part of regular HTML syntax such as <div>...</div>, so they need to be escaped somehow in the JSON part.
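
For reference, \uXXXX is a standard JSON string escape, so any JSON parser turns it back into the plain character. A small Python sketch of that (outside Rainmeter, purely to illustrate the escape format; decode_json_string is a made-up helper name, and it assumes the text contains no unescaped double quotes):

```python
import json


def decode_json_string(raw):
    """Decode the body of a JSON string literal, including \\uXXXX escapes."""
    # Wrap the raw text in quotes so it forms a complete JSON string literal.
    return json.loads('"' + raw + '"')


# The French sample from earlier in the thread.
print(decode_json_string(r"Vents O soufflant de 15 à 25 km\u002Fh."))

# The angle brackets from the Decoder list above.
print(decode_json_string(r"\u003Cdiv\u003E...\u003C\u002Fdiv\u003E"))
```

The first call prints the sentence with "km/h" and the second prints "<div>...</div>". Rainmeter has no built-in JSON parser for this, which is why the Substitute pairs are needed in the skin.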

Speaking of the JSON, I found a free API key for the TWC data. It seems to be used even when people just navigate to their page, so I don't think it's technically a "hack", more like something that isn't common knowledge. I can provide it if you want me to. At first sight it wouldn't be wise to present it to a vast number of users, since it may be blocked or replaced one day, but in the end we will "always" have the option to parse the entire main page if the key becomes invalid. The big pluses of using this key are that you get only the JSON part that you want (although I'm yet to find out how to "aggregate" various JSON parts into one response, for example location + observation + daily forecast), the speed (same as wxdata's, and a big difference from the noticeable slowness of parsing the entire TWC page or pages), and the fact that it (currently) works with both the v2 and v3 versions of the site (albeit in a different URL format). Unfortunately, as far as the potential parenthesis parsing headache goes, I haven't been able to find out how to get the data in formats other than JSON (XML would have been easier to parse, for example).

Oh, and although this might already be known by others, I stumbled upon the "original" icons and descriptions of TWC. The icons are not that brilliant (they're a bit simple for my taste) and their descriptions are already common knowledge, but the source of the document is TWC itself, so you can be sure it's accurate.
Brian
Developer
Posts: 1954
Joined: November 24th, 2011, 1:42 am
Location: Utah

Re: ⭐ weather.com - Some Tools for Parsing

Post by Brian »

Nested syntax to the rescue!

Code: Select all

[Rainmeter]
Update=1000
DynamicWindowSize=1
AccurateText=1

[Variables]
Decoder="(?i)\\u([a-f\d]{4})":"[\x\1]"

[MeasureBad]
Measure=String
String="\u002F"
RegExpSubstitute=1
;Substitute="#Decoder#"
Substitute="(?i)\\u([a-f\d]{4})":"[\x\1]"
;Substitute="\u002F":"[\x002F]"
;DynamicVariables=1

[MeterGood]
Meter=String
Y=0R
FontSize=20
FontWeight=500
FontColor=255,255,255,255
SolidColor=47,47,47,255
Padding=5,5,5,5
AntiAlias=1
Text=Hello [\x002F]

[MeterBad]
Meter=String
MeterStyle=MeterGood
Text=Hello [&MeasureBad]
-Brian

Re: ⭐ weather.com - Some Tools for Parsing

Post by Yincognito »

Brian wrote:
February 10th, 2020, 6:54 am
Nested syntax to the rescue!

I'm on mobile now and can't test it, but I'm sure it works. I actually tried nested syntax, but only did it in the [MeasureBad]'s regex and never thought to do it in the meter.

One small inconvenience would be that all measures would have to be referenced this way in the meters, in the (improbable, I admit) case that each of them has some Unicode reference included. I'll take this though, as it does what it's supposed to do.

Many thanks, and pardon my little rant earlier... :oops:

Re: ⭐ weather.com - Some Tools for Parsing

Post by Brian »

Yincognito wrote:
February 10th, 2020, 7:20 am
One small inconvenience would be that all measures would have to be referenced this way in the meters, in the (improbable, I admit) case that each of them has some Unicode reference included.
You won't need this syntax for normal character reference substitutions using Substitute on a measure. In this particular case, you are essentially building a variable after Rainmeter has tried to "resolve" its value, which both you and jsmorley alluded to in previous posts. It's all about the timing of when and where options are parsed.

Nested syntax gets around this by trying to resolve the variable again when the nested syntax is used. You may need DynamicVariables=1 on the meter if there are changing values instead of static values. Due to performance issues, I would only use nested syntax where absolutely necessary.
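
One rough way to picture this is the Python toy model below. It is only an illustration of the idea of "resolving again", not Rainmeter's real implementation, and every name in it (render_meter_text, the three patterns, the measures dictionary) is hypothetical. It assumes a plain [Measure] reference inserts the measure's string as is, while a nested [&Measure] reference inserts it and then gives character variables a second chance to resolve.

```python
import re

# Assumed shapes of the three kinds of bracketed tokens in a meter's Text option.
CHAR_VAR = re.compile(r"\[\\x([0-9A-Fa-f]{1,4})\]")
PLAIN_REF = re.compile(r"\[(\w+)\]")
NESTED_REF = re.compile(r"\[&(\w+)\]")


def resolve_character_variables(text):
    """Replace every well-formed [\\xHHHH] token with the character it names."""
    return CHAR_VAR.sub(lambda m: chr(int(m.group(1), 16)), text)


def render_meter_text(text, measures):
    """Model how a meter's Text option could be turned into displayed text."""
    # Character variables written literally in the option are resolved first.
    text = resolve_character_variables(text)
    # Plain references insert the measure value with no further resolving.
    text = PLAIN_REF.sub(lambda m: measures.get(m.group(1), m.group(0)), text)
    # Nested references insert the value and resolve it one more time.
    text = NESTED_REF.sub(
        lambda m: resolve_character_variables(measures.get(m.group(1), m.group(0))),
        text,
    )
    return text


# MeasureBad ends up holding the literal text [\x002F] after its Substitute.
measures = {"MeasureBad": r"[\x002F]"}

print(render_meter_text("Hello [MeasureBad]", measures))   # stays literal
print(render_meter_text("Hello [&MeasureBad]", measures))  # resolved to /
```

In this model the plain reference shows "Hello [\x002F]" and the nested one shows "Hello /", which is the difference between the original [MeterBad] and the nested-syntax version.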


Yincognito wrote:
February 10th, 2020, 7:20 am
Many thanks, and pardon my little rant earlier... :oops:
No problem. Sometimes it isn't obvious why something doesn't work outright.

-Brian

Re: ⭐ weather.com - Some Tools for Parsing

Post by Yincognito »

Brian wrote:
February 10th, 2020, 7:44 am
You won't need this syntax for normal character reference substitutions using Substitute on a measure. In this particular case, you are essentially building a variable after Rainmeter has tried to "resolve" its value, which both you and jsmorley alluded to in previous posts. It's all about the timing of when and where options are parsed.

Nested syntax gets around this by trying to resolve the variable again when the nested syntax is used. You may need DynamicVariables=1 on the meter if there are changing values instead of static values. Due to performance issues, I would only use nested syntax where absolutely necessary.
Understood. Will play with this a bit later and see how it goes. :thumbup: