[Solved] DecodeCharacterReference broken on straight URLs?

Post by **killall-q** » February 6th, 2011, 4:10 am

To test, comment out these original lines in illustro Feeds.ini and insert:

[Variables]
getItem=(?(?=.*<item).*<title.*>(.*)</title>.*<link.*>(.*)</link>)
feedURL=http://www.engadget.com/rss.xml

[measureFeed]
Url=#feedURL#
RegExp="(?siU)<title.*>(.*)</title>.*<link.*>(.*)</link>.*<item[^s].*<title.*>(.*)</title>.*<link.*>(.*)</link>#getItem##getItem##getItem##getItem##getItem##getItem##getItem#"

Links and titles won't be matched correctly, but the problem can be seen.

I tried to switch to using Google Reader, where DecodeCharacterReference works correctly, but it randomly intersperses item titles with linebreaks which interfere with the skin I'm building. And apparently regex is technically unable to skip characters in the middle of a capture.

I was thinking of trying a <a href="(.*)">(.*)?\r(.*)</a> and repiece together post facto, but I don't even want to think about that headache...

Post by **jsmorley** » February 6th, 2011, 4:26 am

As far as I know, DecodeCharacterReference does not, nor ever claimed to, remove things like <![CDATA[ and ]]> and such. It is meant to turn &gt; into >. You are going to have to alter the RegExp to exclude that stuff from the capture, OR use a Substitute to remove it.

Post by **killall-q** » February 6th, 2011, 4:35 am

That's good news that it's not simply not working. I'll use a limited substitute with DecodeCharacterReference and keep watching for misses.

Post by **jsmorley** » February 6th, 2011, 4:38 am

I use something like:

Substitute="<![CDATA[":"","]]>":""

Which seems to work fine for me on a Google News feed I have that is loaded with that CDATA stuff.
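For anyone following along, here is a minimal sketch of where that option could go. It assumes the parent measure [measureFeed] from the first post; the child measure name [measureItemTitle] is made up, and StringIndex=3 assumes the third capture in that RegExp is the first item's title.

```ini
; Hypothetical child measure that reads one capture from the parent [measureFeed].
[measureItemTitle]
Measure=Plugin
Plugin=WebParser
Url=[measureFeed]
; Assumption: capture 3 is the first item's title in the RegExp from the first post.
StringIndex=3
; Strip the CDATA wrapper that the capture picks up along with the title text.
Substitute="<![CDATA[":"","]]>":""
```

Each pair in Substitute is "find":"replace", so both halves of the wrapper are replaced with nothing.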

Post by **killall-q** » February 6th, 2011, 4:46 am

Yeah, thanks, it's working great. I really hated playing whack-a-mole with substitutes in the old days. Took about 3 months to catch them all with Slashdot's feed.

Post by **killall-q** » February 16th, 2011, 5:04 am

Ran into Slashdot's doubly fuddled HTML references. Ridiculous things like &amp;

AFAIK only Slashdot does this, and web feed readers don't have a problem with it.

So to fix these

&amp; &mdash; &ldquo; &rdquo; &lsquo; &rsquo;

I'm now using this


SubstituteFeed="":"››› FEED OFFLINE ‹‹‹","<![CDATA[":"","]]>":"","&amp;":"&","&mdash;":"—","&ldquo;":"“","&rdquo;":"”","&lsquo;":"‘","&rsquo;":"’","<br>":"","<em>":"","</em>":"","<nobr>":"","</nobr>":"","<wbr>":""

At least it's only half as long as it used to be. Wonder when I'll run into &quot;
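In case it helps anyone reusing this, here is a rough sketch of how such a variable might be wired up. It assumes SubstituteFeed lives under [Variables] and is referenced from each child measure; the measure name and StringIndex are made up for illustration, and the list is cut down to a few pairs.

```ini
[Variables]
; Shortened list for illustration only; the full list is in the post above.
SubstituteFeed="<![CDATA[":"","]]>":"","&mdash;":"—"

; Hypothetical child measure of the [measureFeed] parent from the first post.
[measureItemTitle]
Measure=Plugin
Plugin=WebParser
Url=[measureFeed]
StringIndex=3
; Reuse the same substitution list in every child measure.
Substitute=#SubstituteFeed#
```

Keeping the list in one variable means a newly discovered entity only has to be added in one place.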
