Website encoding

Post by **jsmorley** » April 8th, 2018, 12:55 am

Just wanted to point out that while the vast majority of websites are encoded as UTF-8, which is what WebParser expects by default, there are some, maybe about 10%, that are encoded with:

charset=iso-8859-1

or

charset=Windows-1252

For all practical purposes, these are the same thing. They were not originally, but all modern web browsers, following the HTML specification, will assume that Windows-1252 is meant when iso-8859-1 is declared in a meta tag in the HTML.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Without getting into all the ins and outs of encoding, these character sets cover much the same characters as the first 256 code points of Unicode, but the bytes are NOT the same as UTF-8. Only the first 128 characters (plain ASCII) are encoded identically. The ½ character, for instance, is the single byte 0xBD in Windows-1252 but the two bytes 0xC2 0xBD in UTF-8, so when it is encoded as Windows-1252 it will not be recognized when parsed as UTF-8. You will get a ? question mark instead. In a sense, Windows-1252 is what Windows calls "ANSI", which is not UTF-8, and in fact isn't really an ANSI standard at all. It's certainly not Unicode, which... come on guys, get with it! It's 2018!

In order to parse these sites correctly, you need to set

CodePage=1252

on the parent WebParser measure.

https://docs.rainmeter.net/manual/measures/webparser/#CodePage

https://en.wikipedia.org/wiki/ISO/IEC_8859-1
https://en.wikipedia.org/wiki/Windows-1252

Here is an example of a site encoded this way:
http://hosted.ap.org/dynamic/fronts/RAW?SITE=MYPSP&SECTION=HOME
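
As a rough sketch of how this might look in a skin, here is a parent WebParser measure with CodePage=1252 pointed at the URL above. The section names (MeasureSite, MeasureTitle, MeterTitle) and the RegExp are just assumptions for illustration. The pattern simply grabs the page title and is not based on the actual structure of that page, so swap in whatever you really want to capture.

```ini
[Rainmeter]
Update=1000

; Parent measure: downloads and parses the page.
[MeasureSite]
Measure=WebParser
URL=http://hosted.ap.org/dynamic/fronts/RAW?SITE=MYPSP&SECTION=HOME
; Illustrative pattern only: captures whatever is inside <title>...</title>.
RegExp=(?siU)<title>(.*)</title>
; The page is Windows-1252 (or iso-8859-1), not UTF-8.
CodePage=1252

; Child measure: returns the first capture from the parent.
[MeasureTitle]
Measure=WebParser
URL=[MeasureSite]
StringIndex=1

[MeterTitle]
Meter=String
MeasureName=MeasureTitle
FontSize=12
FontColor=255,255,255,255
AntiAlias=1
```

Note that CodePage goes only on the parent measure, the one that actually has the URL and RegExp. Child measures just read the already-decoded captures from the parent.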