Suggestion on WebParser

thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Suggestion on WebParser

Post by thatsIch »

I was just playing around with WebParser and entered a new feed into my reader, but that feed's provider stuffs it with entries several pages long, so parsing through all the data took what felt like half an hour.

My suggestion would be an optional maximum cap at which the parser cuts off its data stream, because I have no other point where I can control it beforehand.

--
Issue solved with a workaround. Download your complete feed with something like this:

Code: Select all

[scrMeasureRSSScript]
Measure = Script
ScriptFile = "#@#Lua\Paper.lua"
; Kept disabled; the FinishAction below calls its Update() on demand
Disabled = 1

[measureRSS]
Measure = Plugin
Plugin = WebParser.dll
Url = http://www.google.de/reader/view/feed/#currFeed#?n=25
DecodeCharacterReference = 1
UpdateRate = (#UpdateRate# * 6)
;CodePage = 28591
;CodePage = 65001
; Save the raw feed to a local file instead of parsing it with RegExp
Download = 1
ForceReload = 1
FinishAction = [!CommandMeasure "scrMeasureRSSScript" "Update()"]
and then load the downloaded file with Lua and parse it:

Code: Select all

-- GET PARSER AND FILEPATH
local webParser = SKIN:GetMeasure('measureRSS')
local downloadedFilePath = webParser:GetStringValue()

-- FILE ACCESS AND FILE DATA
local file = assert(io.open(downloadedFilePath, "r"))
local data = file:read("*all")
file:close()
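
If you also want the hard cap from the suggestion at the top of the thread, the standard io library already allows one: file:read() accepts a byte count, so a variant like the following (the 256 KB limit is an arbitrary assumption) never touches more than a fixed amount of data.

Code: Select all

-- Variant of the read above with a hard cap. MAX_BYTES is an assumed
-- value; file:read(n) returns at most n bytes (or nil at end of file).
local MAX_BYTES = 256 * 1024
local file = assert(io.open(downloadedFilePath, "r"))
local data = file:read(MAX_BYTES) or ""
file:close()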
Last edited by thatsIch on August 19th, 2012, 5:39 pm, edited 2 times in total.
jsmorley
Developer
Posts: 22631
Joined: April 19th, 2009, 11:02 pm
Location: Fort Hunt, Virginia, USA

Re: Suggestion on WebParser

Post by jsmorley »

I doubt the amount of data was any issue. WebParser / regular expression just parses the data until the matching conditions are met, and then stops. So if a site has 10,000 entries and you are parsing the first 5, it will ignore the rest.
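
WebParser's regular expressions are Perl-compatible rather than Lua patterns, but the early-exit behaviour described above is easy to illustrate in Lua with made-up data: matching stops as soon as the pattern is satisfied, no matter how much input follows.

Code: Select all

-- Build a page with 10,000 items; the match below only ever scans up
-- to the first closing </title>, so the other items are never touched.
local page = string.rep('<item><title>x</title></item>', 10000)
local firstTitle = string.match(page, '<title>(.-)</title>')
print(firstTitle)  --> x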

As far as WebParser retrieving the site HTML in general goes, 6 pages of XML is not that uncommon. WebParser uses your internet connection to download the entire site HTML, but unless you are having, or had, some kind of "slowness" problem with your connection, that is going to take seconds. In any case, limiting the amount of data isn't going to solve a problem caused by connection issues.

Another issue you can run into is having UpdateRate set very low on a WebParser measure. Check it to be sure you are not repeatedly asking WebParser to fetch a site before it has finished fetching it from the last update.
MerlinTheRed
Rainmeter Sage
Posts: 889
Joined: September 6th, 2011, 6:34 am

Re: Suggestion on WebParser

Post by MerlinTheRed »

Writing regular expressions incorrectly can cause a regular expression engine to take a very long time, because it has to try millions of possible permutations before it can report that the regexp didn't match anything. Perhaps that is the case here.
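
As a sketch of what that blow-up looks like with Lua patterns (made-up input; timings vary by machine): every additional lazy wildcard multiplies the number of split points the matcher has to try before it can report failure.

Code: Select all

-- 300 characters with no 'b' anywhere, so none of these patterns can match.
local data = string.rep('a', 300)
for n = 1, 4 do
	-- '^.-b', '^.-.-b', ... anchored so only the wildcard splits are tried
	local pattern = '^' .. string.rep('.-', n) .. 'b'
	local t0 = os.clock()
	string.match(data, pattern)
	print(n, os.clock() - t0)  -- roughly 300^n / n! attempts each
end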
Have more fun creating skins with Sublime Text 2 and the Rainmeter Package!
jsmorley
Developer
Posts: 22631
Joined: April 19th, 2009, 11:02 pm
Location: Fort Hunt, Virginia, USA

Re: Suggestion on WebParser

Post by jsmorley »

MerlinTheRed wrote:Writing regular expressions incorrectly can cause a regular expression engine to take a very long time, because it has to try millions of possible permutations before it can report that the regexp didn't match anything. Perhaps that is the case here.
Always possible, although if he is using some existing feed reader, I sort of thought that might not be it. Worth looking at, though.
thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Re: Suggestion on WebParser

Post by thatsIch »

jsmorley wrote:I doubt the amount of data was any issue. WebParser / regular expression just parses the data until the matching conditions are met, and then stops. So if a site has 10,000 entries and you are parsing the first 5, it will ignore the rest.

As far as WebParser retrieving the site HTML in general goes, 6 pages of XML is not that uncommon. WebParser uses your internet connection to download the entire site HTML, but unless you are having, or had, some kind of "slowness" problem with your connection, that is going to take seconds. In any case, limiting the amount of data isn't going to solve a problem caused by connection issues.

Another issue you can run into is having UpdateRate set very low on a WebParser measure. Check it to be sure you are not repeatedly asking WebParser to fetch a site before it has finished fetching it from the last update.
I don't have 10,000 entries; just one entry, for example, can be as big as that.

@connection: no, my connection is fine, because I can view the feeds in a browser with no problem at all.

@UpdateRate: it is -1, so it just runs once and never again.

OK, the effective update time on the main parser is 300 * 1000 ms = 5 min; I'll increase it to 30 min just to test it.
Last edited by thatsIch on August 15th, 2012, 9:56 pm, edited 1 time in total.
thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Re: Suggestion on WebParser

Post by thatsIch »

MerlinTheRed wrote:Writing regular expressions incorrectly can cause a regular expression engine to take a very long time, because it has to try millions of possible permutations before it can report that the regexp didn't match anything. Perhaps that is the case here.

Code: Select all

-- RSS
if string.match(rawMeasureData, '<rss.-version=".-".->') then
	patEntry = '<item.-</item>'
	patEntryTitle = '.-<title.->(.-)</title>.-'
	patEntryLink = '.-<link.->(.-)</link>'
	patEntryDesc = '.-<description.->(.-)</description>'
	patEntryImg = '.-<img.-src="(.-)"'

-- Atom
else
	patEntry = '<entry.-</entry>'
	patEntryTitle = '.-<title.->(.-)</title>.-'
	patEntryLink = '.-<link.-href="(.-)"'
	patEntryDesc = '.-<summary.->(.-)</summary>'
	patEntryImg = '.-<img.-src="(.-)"'
end
Just for your reference.
But anyway, my problem here is:
_even if I had slow internet_, if my feed provider used 100 MB image files as previews, I would download them with no means to stop it, and the problem is that in Rainmeter this just freezes everything. In a browser you can kill the one tab if you need to.
thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Re: Suggestion on WebParser

Post by thatsIch »

And for my own test purposes:

Feed:
http://mal-kurz-in-der-kueche.blogspot.com/feeds/posts/default

I'm filtering the first 25 feed entries.
It loads within 1 min 33 s, and 2 seconds later all the images are displayed.
The images are around 40 KB each,
so 25 * 40 KB is about 1 MB.

A normal RSS feed (with a 500-character limit) loads within 2 seconds, and 1 second later the images display,
even though they are 25 KB images,
so 25 * 25 KB is 625 KB; that's 37.5% less.

OK, I'll try optimizing it, for example by using only the first 500 characters or something like that.
If somebody else wants to try it out, I can release my current code, but it's unfinished.

Edit:
OK, it wasn't that part, I guess.
I took out the parsing of the contents because I feared a normal string meter couldn't display several thousands of characters and that this was the reason it couldn't compute it that fast, but apparently the time was almost the same.

Edit 2: measured it again: about 1 min, so processing and displaying took a huge part of the time, hm.

Edit 3: I took out every bit of code, and the measure took about 3 seconds to load the whole file. I think I need to look up some more efficient algorithms.

Does anyone have suggestions on how to improve this? I haven't found any real way in the standard Lua library to limit a match to a specific size :-( (but see the sketch at the end of this post)

Edit 4: maybe move this to Lua now?
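
One workaround that does come from the standard library (a sketch, with an assumed cap value): there is no length-limited match, but slicing the input with string.sub before matching has the same effect, since no pattern ever sees more than the slice.

Code: Select all

-- MAX_SCAN is an assumed value; rawMeasureData and patEntryTitle are
-- the names used in the snippet earlier in this thread.
local MAX_SCAN = 50 * 1024
local head = string.sub(rawMeasureData, 1, MAX_SCAN)
local title = string.match(head, patEntryTitle)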
thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Re: Suggestion on WebParser

Post by thatsIch »

I have had friends try it on different popular skins, like Enigma etc.

All of them had to restart Rainmeter entirely. I went through some debugging and it says:

ERRO (08:36:14.529) WebParser.dll: (1493) Das Zeitlimit für den Vorgang wurde erreicht. (ErrorCode=12002)

which is German for "the time limit for the operation was reached".

It seems that if you just want to read the whole file on the fly, it is simply too big.

Code: Select all

[measureRSS]
Measure = Plugin
Plugin = WebParser.dll
Url = #currFeed#
; Capture the entire page into a single string index for the Lua script
RegExp = (?siU)(.*)$
DecodeCharacterReference = 1
UpdateRate = (#UpdateRate# * 6)
CodePage = 28591
FinishAction = [!CommandMeasure "scrMeasureRSSScript" "Update()"]
Is there a better way to handle feeds when I want to parse them in Lua?
Or should I combine it with Rainmeter's lookahead to pre-parse?
thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Re: Suggestion on WebParser

Post by thatsIch »

Now I tested it with this .inc:

Code: Select all

GET=.*(?(?=.*<div class="item">).*<div class="item">.*<a href="(.*)">(.*)</a>.*<div class="item-info">.* on (.*)</div>)
URL=http://mal-kurz-in-der-kueche.blogspot.com/feeds/posts/default
and this .ini:

Code: Select all

[MeasureFeed]
Measure=Plugin
Plugin=WebParser
Url=http://www.google.de/reader/view/feed/#URL#?n=8
RegExp="(?siU)<h1>(.*)</h1>#GET#"
UpdateRate=1500
DecodeCharacterReference = 1
CodePage = 28591

[MeasureItem1]
Measure=Plugin
Plugin=WebParser
Url=[MeasureFeed]
StringIndex=2

[MeasureLink1]
Measure=Plugin
Plugin=WebParser
Url=[MeasureFeed]
StringIndex=3

[MeterTest]
Meter=String
MeasureName=MeasureLink1
So I shortened the Rainmeter RegExp just to see whether that helps in this matter; it seems going back to basics is the only way. There is no way to let Lua handle that, sigh.

Haha, OK, I tested it again, but not using just

StringIndex = 3

but

StringIndex = 46

with #GET# repeated 25 times (three capture groups per copy, so the string indexes run up to 76, and 46 lands in the 15th item).

So it has problems when it comes to larger files.
I'm currently reading the source code of the plugin; maybe I can find something.
thatsIch
Posts: 446
Joined: August 7th, 2012, 9:18 pm

Re: Suggestion on WebParser

Post by thatsIch »

OK, next test:

I'll try downloading the whole page now

with

Download = 1

and then parse the whole thing with Lua only.
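
For reference, a minimal sketch of how that flow could look, reusing the measureRSS measure and the Atom patterns from earlier posts, with a hard cap of 25 entries as in the tests above; Update() is the function the FinishAction already calls.

Code: Select all

-- Called by FinishAction once WebParser has finished the download.
function Update()
	-- With Download=1, the measure's string value is the local file path.
	local path = SKIN:GetMeasure('measureRSS'):GetStringValue()
	local file = assert(io.open(path, 'r'))
	local data = file:read('*all')
	file:close()

	-- Walk the Atom entries and stop at a fixed cap.
	local entries = {}
	for entry in string.gmatch(data, '<entry.-</entry>') do
		entries[#entries + 1] = {
			title = string.match(entry, '<title.->(.-)</title>'),
			link  = string.match(entry, '<link.-href="(.-)"'),
		}
		if #entries == 25 then break end
	end
	return #entries
end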