I have searched through the forums and found multiple posts regarding the Unicode settings. I have attempted different routes and have been unsuccessful. Please help!
Here is the measure that does the initial pull of information from the API.
[MeasurePlexPyNowPlaying]
Hidden=1
Group=Bar
; Returns the Names of the Media Being Played. (Ex. Moana, Frozen, Titanic)
Measure=Plugin
Plugin=WebParser
Url=[MeasurePlexPy]
RegExp=(?siU)(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")(?(?=.*"full_title\":.\".*\").*"full_title\":.\"(.*)\")
UpdateRate=#RefreshRate#
LogSubstringErrors=0
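For comparison outside Rainmeter: since get_activity returns JSON, the full_title values the regex above fishes out can be pulled with a JSON parser instead. A minimal Python sketch, with the response structure assumed and simplified from PlexPy's output:

```python
import json

# Sample shaped like a PlexPy get_activity response (structure assumed/simplified)
sample = '''{"response": {"data": {"sessions": [
    {"full_title": "Moana"},
    {"full_title": "Shameless (US) - El Gran Ca\\u00f1on"}
]}}}'''

data = json.loads(sample)
titles = [s["full_title"] for s in data["response"]["data"]["sessions"]]
print(titles)  # the \u00f1 escape comes back as a real ñ
```

Note that json.loads resolves the \u00f1 escape automatically, which is exactly what WebParser's regex matching does not do.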
And here is a measure that pulls the string index for each stream.
I'd have to see the actual HTML that WebParser is getting.
If you add Debug=2 to the PARENT WebParser measure, it will save what it is getting as WebParserDump.txt in the skin folder. Then you can give us that.
If it is literally putting "El Gran Ca\u00f1on" in the output, then I'm not sure there is much that can be done about that. While \u00f1 is the 16-bit Unicode "escape" representation of the ñ character, that is not how HTML does it. The HTML representation of ñ is &ntilde;, and then you would use DecodeCharacterReference=1 on the child measure. There is no way I know of for it to deal with \u00f1 embedded in a string.
If you are reading some resource that is not HTML output, that is not intended to be used in a web browser, but is rather some file that is intended to be read and resolved by some programming language, then you might be out of luck.
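To make that contrast concrete, here is a quick Python illustration (not Rainmeter itself) of the two representations: the HTML character reference, which DecodeCharacterReference=1 handles, versus the \u00f1 escape, which needs an entirely different decoder:

```python
import codecs
import html

# HTML-style character reference: this is what DecodeCharacterReference=1 decodes
print(html.unescape("El Gran Ca&ntilde;on"))  # El Gran Cañon

# JSON/programming-language-style escape: needs a different decoder entirely.
# (unicode_escape works here because the input is pure ASCII; it is not a
# general-purpose solution for strings that already contain non-ASCII text.)
print(codecs.decode("El Gran Ca\\u00f1on", "unicode_escape"))  # El Gran Cañon
```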
jsmorley wrote: I'd have to see the actual HTML that WebParser is getting.
"full_title": "Shameless (US) - El Gran Ca\u00f1on"
[MeasurePlexPy]
; Returns the API Information this Skin will use while running.
Measure=Plugin
Plugin=WebParser
Url=http://#PlexPyAddress#/api/v2?apikey=#APIKey#&cmd=get_activity
RegExp="(?siU)^\S(.*)$"
OnRegExpErrorAction="0"
UpdateRate=#RefreshRate#
FinishAction=[!EnableMeasure MeasurePlexPyTranscodeCount][!UpdateMeasure MeasurePlexPyTranscodeCount]
OnConnectErrorAction=[!HideMeter MeterOverallText][!HideMeter MeterServerNameText][!CommandMeasure MeasurePlexPy "Reset"]
LogSubstringErrors=0
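As an aside, the query string in that Url= is just standard key=value pairs. Outside Rainmeter, the same request URL can be assembled with Python's urllib (the host and key below are hypothetical stand-ins for #PlexPyAddress# and #APIKey#):

```python
from urllib.parse import urlencode

# Hypothetical values standing in for #PlexPyAddress# and #APIKey#
base = "http://192.168.1.10:8181/api/v2"
params = {"apikey": "abc123", "cmd": "get_activity"}
url = f"{base}?{urlencode(params)}"
print(url)
```

urlencode also percent-escapes any characters that are unsafe in a URL, which hand-built strings do not.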
What you are reading is apparently just not HTML output. In HTML, Unicode is handled in one of two ways.
By far the most common is to just encode the HTML as UTF-8 w/o BOM, which is what 99.9% of modern websites do. That means that ñ is just in the file as ñ; no extra encoding is needed.
An older, and more rare, approach is to encode Unicode characters as a "character reference sequence", which for ñ would be &ntilde;. Then browsers and other HTML-aware applications can decode that to ñ.
You will still see character references in a lot of web HTML code, but mostly these days only to represent things that can be ambiguous in HTML, like the "&" character, which you will often see as &amp; in HTML code. It is very rare anymore to see entire language elements, alphabetical characters, encoded. There is just no need with UTF-8.
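Both points are easy to check in Python (purely as an illustration):

```python
import html

# In a UTF-8 document, ñ is simply stored as its two raw bytes; no reference needed
print("ñ".encode("utf-8"))  # b'\xc3\xb1'

# References survive mainly for characters that are ambiguous in HTML markup
print(html.unescape("fish &amp; chips"))  # fish & chips
```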
You need a way to tell your API, or whatever is producing this, to output UTF-8, and not ASCII with escaped Unicode.
I mean, there is literally no good way to parse for \u00f1 embedded in a string like Ca\u00f1on. While you can treat the initial escape "\u" as the start of a Unicode sequence, what you would have to do to parse that is to say:
\u0 : is that a unicode char? Yes.
\u00 : is that a unicode char? Yes.
\u00f : is that a unicode char? Yes.
\u00f1 : is that a unicode char? Yes.
\u00f1o : is that a unicode char? No. Ah, then it's \u00f1 and I can deal with that.
And that is stupid. It could easily be wrong, if the Unicode was really \u00f and the "1" was actually part of the string.
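For what it's worth, formats like JSON sidestep exactly this ambiguity by pinning the escape to exactly four hex digits after \u, so a decoder never has to guess where the sequence ends. A sketch of such a decoder in Python (illustrative only, not a Rainmeter feature):

```python
import re

def decode_json_escapes(s):
    # JSON mandates exactly four hex digits after \u, so there is no guessing:
    # in "Ca\u00f1on" the escape is \u00f1 and the trailing "on" is plain text.
    # (Real JSON decoders also handle surrogate pairs; omitted here.)
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

print(decode_json_escapes("El Gran Ca\\u00f1on"))  # El Gran Cañon
```

WebParser has no such decoder, which is why getting the source to emit UTF-8 directly is the real fix.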
<meta charset="utf-8"/>
...
Yeah, indeed.