I don't think it even has to do with the number of calls or the number of images downloaded, since, now that I think about it, I had the problem even before starting to code, when parsing only a single product page. I thought the fix was the UserAgent. I'll have to do more testing, maybe try a few flags and see if anything changes. Last night it didn't work from 12 AM until 3 AM, when I turned off the PC (I didn't really try much, since I was doing other stuff; I was hoping it would fix itself). I also avoid calling it too often, since I consider being blocked a strong possibility (another reason to use update=-1).

Yincognito wrote: ↑May 24th, 2024, 11:30 am
Did you consider the fact that the measures downloading the images finish at a different time than the product ones? So maybe you should add the enabling to the former? Optionally, with a [!Delay ...] before enabling as well, just to be sure?
Well, I use my tricks mostly for single-item (and different-site) displays, changeable via scrolling, as you already know, and that doesn't exactly suit your current scenario, where you poll the same site and display multiple items and their properties at the same time. In my scenarios, I use a single "product" (which might have multiple "properties", of course), so a single "set" of measures / meters grabbing the said properties is enough. This makes a sequential system trivial to implement, e.g. no enabling / disabling is needed.
Anyway, besides the considerations regarding the finish actions and the potential delays, or other tricks for fully sequential access (which can be added to the bangs variable easily, although its existence is no longer necessary, since starting things for the 1st product automatically continues with the others through the finish actions), the ideal solution would involve a single request for multiple products via an API. Checking the Network tab in the browser's Developer Tools after reloading the page might reveal such a system / link - I already checked, and there are no transparent API calls in this case; besides, even if there were, the images would still be downloaded individually. Personally, I don't like redundancy and polling a site more than once, but yeah, in some cases it's unavoidable.
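To make the finish-action chaining concrete, here is a minimal sketch of how sequential access could look in a skin. The measure names, URLs, and RegExp are placeholders, not the actual skin's code; the idea is simply that each WebParser measure stays disabled until the previous one finishes, so the site only ever sees one request at a time:

```ini
; Hypothetical sketch of sequential WebParser requests via FinishAction.
[MeasureProduct1]
Measure=WebParser
URL=https://www.example.com/product1
RegExp=(?siU)<title>(.*)</title>
; When this measure finishes, enable the next one and force it to update
FinishAction=[!EnableMeasure MeasureProduct2][!CommandMeasure MeasureProduct2 "Update"]

[MeasureProduct2]
Measure=WebParser
URL=https://www.example.com/product2
RegExp=(?siU)<title>(.*)</title>
Disabled=1
FinishAction=[!EnableMeasure MeasureProduct3][!CommandMeasure MeasureProduct3 "Update"]
```

The same pattern would continue for each further product (and could extend to the image-downloading measures, with an optional [!Delay ...] before the enabling, as discussed above).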
P.S. I might adjust your code to fit my ideas later on, but I don't guarantee it.
EDIT: Careful with the calls to the site: they have a captcha and all that (it asked me twice in the browser, no shame whatsoever, lol). By the way, I get the encoding issue from the OP even with the UserAgent set, so it looks like some header / flag / codepage configuration is needed for a "by the book" retrieval. So the failed retrievals might have something to do with either the captcha or the encoding. I'll stop trying for now, in order to not make it worse or get blocked / banned and such, but I didn't abuse it anyway - just about 5 attempts so far.
I read somewhere that Amazon doesn't like being "scraped" (I guess that's the same as "parsed"), and something about Python and the Amazon API. I'm not sure how to implement that. APIs are a new thing to me and I have no idea how they work - is that free? No idea.
They also mention Amazon blocking access, and say something about the headers. How does the headers thing work in WebParser? In the examples I saw, they do it like this (in Python):
Code: Select all
custom_headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
'Accept-Language': 'da, en-gb, en',
'Accept-Encoding': 'gzip, deflate, br',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Referer': 'https://www.google.com/'
}
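For reference, WebParser can send custom request headers via its Header option (Header, Header2, Header3, and so on, one option per header line). A minimal sketch mirroring the Python dictionary above might look like the following; the measure name, URL, and RegExp are placeholders, and I've left out Accept-Encoding on the assumption that asking for compressed responses could interfere with parsing if the plugin doesn't decompress them itself:

```ini
[MeasureProductPage]
Measure=WebParser
URL=https://www.example.com/some-product-page
; Custom request headers, one Header option per line (Header, Header2, ...)
Header=User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15
Header2=Accept-Language: da, en-gb, en
Header3=Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Header4=Referer: https://www.google.com/
RegExp=(?siU)<title>(.*)</title>
```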