It is currently April 26th, 2024, 6:59 am

Positive lookahead after the main expression, with capture

Get help with creating, editing & fixing problems with skins
User avatar
Yincognito
Rainmeter Sage
Posts: 7164
Joined: February 27th, 2015, 2:38 pm
Location: Terra Yincognita

Positive lookahead after the main expression, with capture

Post by Yincognito »

Test String: any RSS feed's page source - can be pasted in RainRegExp as well (for convenience, use the one below, taken from Marca.com):

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel>
        <title><![CDATA[English // marca]]></title>
        <link>http://www.marca.com</link>
        <description><![CDATA[English // marca]]></description>
        <language>es</language>
        <copyright><![CDATA[(c)  2019, Unidad Editorial, S.A.]]></copyright>
        <pubDate>Sat, 23 Feb 2019 10:01:50 +0100</pubDate>
        <lastBuildDate>Sat, 23 Feb 2019 10:01:50 +0100</lastBuildDate>
        <category><![CDATA[News & Politics]]></category>
        <ttl>60</ttl>
        <atom:link href="https://e00-marca.uecdn.es/rss/en/index.xml" rel="self" type="application/rss+xml" />
        <image>
            <title><![CDATA[English // marca]]></title>
            <url>http://estaticos.marca.com/imagen/canalima144.gif</url>
            <link>https://www.marca.com</link>
            <width>144</width>
            <height>24</height>
            <description><![CDATA[marca.com]]></description>
        </image>
    <item>
    <title><![CDATA[Hazard deal goes cold thanks to FIFA... And Real Madrid]]></title><description><![CDATA[&nbsp;<a href="https://www.marca.com/en/football/real-madrid/2019/02/23/5c7079d7e2704ee9b38b45a2.html"> Leer </a><img src="http://secure-uk.imrworldwide.com/cgi-bin/m?cid=es-widgetueditorial&amp;cg=rss-marca&amp;ci=es-widgetueditorial&amp;si=https://e00-marca.uecdn.es/rss/en/index.xml" alt=""/>]]></description><dc:creator><![CDATA[marca.com]]></dc:creator><link>https://www.marca.com/en/football/real-madrid/2019/02/23/5c7079d7e2704ee9b38b45a2.html</link><media:description type="html"><![CDATA[A potential transfer to take &lt;strong&gt;Eden Hazard to Real Madrid&lt;/strong&gt; looks less likely than ever at present, with a number of factors completely changing the complex of the sit]]></media:description><media:title type="html"><![CDATA[REAL MADRID|Unsure after form of Vinicius and Rodrygo's arrival]]></media:title><media:content url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508796541256.jpg" medium="image" width="650" height="366" /><media:thumbnail url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508796541256_150x0.jpg" width="150" height="84" /><guid>https://www.marca.com/en/football/real-madrid/2019/02/23/5c7079d7e2704ee9b38b45a2.html</guid>
    <pubDate>Sat, 23 Feb 2019 01:11:14 +0100</pubDate>
</item>
<item>
    <title><![CDATA[Modric and the club of legends]]></title><description><![CDATA[&nbsp;<a href="https://www.marca.com/en/football/real-madrid/2019/02/23/5c707443e2704ee9b38b459b.html"> Leer </a><img src="http://secure-uk.imrworldwide.com/cgi-bin/m?cid=es-widgetueditorial&amp;cg=rss-marca&amp;ci=es-widgetueditorial&amp;si=https://e00-marca.uecdn.es/rss/en/index.xml" alt=""/>]]></description><dc:creator><![CDATA[marca.com]]></dc:creator><link>https://www.marca.com/en/football/real-madrid/2019/02/23/5c707443e2704ee9b38b459b.html</link><media:description type="html"><![CDATA[He may already be in &lt;a href=&quot;https://www.marca.com/en/football/real-madrid.html?intcmp=MENUPROD&amp;s_kw=english-real-madrid&quot;&gt;&lt;strong&gt;Real Madrid&lt;/strong&gt;&lt;/a&gt;&#039;s history books with fou]]></media:description><media:title type="html"><![CDATA[REAL MADRID|Could become the sixth oldest player to retire at the Bernabeu]]></media:title><media:content url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508801724875.jpg" medium="image" width="140" height="79" /><media:thumbnail url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508801724875_150x0.jpg" width="150" height="84" /><guid>https://www.marca.com/en/football/real-madrid/2019/02/23/5c707443e2704ee9b38b459b.html</guid>
    <pubDate>Sat, 23 Feb 2019 01:10:39 +0100</pubDate>
</item>
<item>
    <title><![CDATA[Messi and Suarez have unfinished business with goals]]></title><description><![CDATA[&nbsp;<a href="https://www.marca.com/en/football/barcelona/2019/02/23/5c707d7ee2704e8e3e8b45eb.html"> Leer </a><img src="http://secure-uk.imrworldwide.com/cgi-bin/m?cid=es-widgetueditorial&amp;cg=rss-marca&amp;ci=es-widgetueditorial&amp;si=https://e00-marca.uecdn.es/rss/en/index.xml" alt=""/>]]></description><dc:creator><![CDATA[marca.com]]></dc:creator><link>https://www.marca.com/en/football/barcelona/2019/02/23/5c707d7ee2704e8e3e8b45eb.html</link><media:description type="html"><![CDATA[&lt;strong&gt;Barcelona&lt;/strong&gt;&#039;s lack of goals of late is clear, with just one in three games and even then coming from a penalty.It has shown that the side aren&#039;t in their best form]]></media:description><media:title type="html"><![CDATA[BARCELONA|Both are in poor scoring form]]></media:title><media:content url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508797785155.jpg" medium="image" width="140" height="79" /><media:thumbnail url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508797785155_150x0.jpg" width="150" height="84" /><guid>https://www.marca.com/en/football/barcelona/2019/02/23/5c707d7ee2704e8e3e8b45eb.html</guid>
    <pubDate>Sat, 23 Feb 2019 01:09:50 +0100</pubDate>
</item>
Current Regex: (?siU)<channel.*>(.*)(?=<item.*>)((?(?=.*<item.*>).*<item.*>.*</item>))

Goal to achieve: get <channel>'s content (without the <channel> part, or the <item>-s) in the first capture group, and the first <item> in the second capture group - by the way, the issue I have is with the first capture group, not the second (the latter works fine)

The above regex is relatively close to what I want to achieve, but it fails when the test string is null, for example. I want everything from <channel> to the first occurence of an <item> to be captured, but I want to do it using lookahead conditional (just like the first <item> is captured), so that it won't fail and it returns "" if the string lacks the captured part.
Profiles: Rainmeter ProfileDeviantArt ProfileSuites: MYiniMeterSkins: Earth
User avatar
balala
Rainmeter Sage
Posts: 16172
Joined: October 11th, 2010, 6:27 pm
Location: Gheorgheni, Romania

Re: Positive lookahead after the main expression, with capture

Post by balala »

Not sure I can follow, because I don't know how the source is changing in different circumstances, but for first try the following RegExp: RegExp=(?siU)(?(?=.*<channel).*>(.*)</image>)(?=.*<item.*>)((?(?=.*<item.*>).*<item.*>.*</item>)).
Does it work as intended?
User avatar
Yincognito
Rainmeter Sage
Posts: 7164
Joined: February 27th, 2015, 2:38 pm
Location: Terra Yincognita

Re: Positive lookahead after the main expression, with capture

Post by Yincognito »

balala wrote: February 24th, 2019, 7:33 am Not sure I can follow, because I don't know how the source is changing in different circumstances, but for first try the following RegExp: RegExp=(?siU)(?(?=.*<channel).*>(.*)</image>)(?=.*<item.*>)((?(?=.*<item.*>).*<item.*>.*</item>)).
Does it work as intended?
Not really. For the test string I mentioned, it captures the <rss> and <channel> lines (which I don't want) and, of course, it leaves out the </image> line (which should be the last to be included, from my perspective). Last but not least, once you test this on an empty string, it gives Error 1 = Could not match all searches in RainRegExp.

Don't bother with the source changing, I'll handle this for Atom cases or inconsistent RSS or Atom field order. I just want it to give the expected result (i.e. from <channel> section's <title> line to its </image> line - inclusive) and not throw errors by giving two empty captures when the test string is empty as well.

Just test the provided string and the empty string in RainRegExp, that's all. Don't bother with other scenarios. (I saved the provided string as a .txt in RainRegExp, and when I want to test the empty string I just click Browse, then Cancel - so that's a pretty easy test method).

EDIT:
To be more clear, if the string is the provided one, it should give in RainRegExp:

Code: Select all

1 => 
        <title><![CDATA[English // marca]]></title>
        <link>http://www.marca.com</link>
        <description><![CDATA[English // marca]]></description>
        <language>es</language>
        <copyright><![CDATA[(c)  2019, Unidad Editorial, S.A.]]></copyright>
        <pubDate>Sat, 23 Feb 2019 10:01:50 +0100</pubDate>
        <lastBuildDate>Sat, 23 Feb 2019 10:01:50 +0100</lastBuildDate>
        <category><![CDATA[News & Politics]]></category>
        <ttl>60</ttl>
        <atom:link href="https://e00-marca.uecdn.es/rss/en/index.xml" rel="self" type="application/rss+xml" />
        <image>
            <title><![CDATA[English // marca]]></title>
            <url>http://estaticos.marca.com/imagen/canalima144.gif</url>
            <link>https://www.marca.com</link>
            <width>144</width>
            <height>24</height>
            <description><![CDATA[marca.com]]></description>
        </image>
    
2 => <item>
    <title><![CDATA[Hazard deal goes cold thanks to FIFA... And Real Madrid]]></title><description><![CDATA[&nbsp;<a href="https://www.marca.com/en/football/real-madrid/2019/02/23/5c7079d7e2704ee9b38b45a2.html"> Leer </a><img src="http://secure-uk.imrworldwide.com/cgi-bin/m?cid=es-widgetueditorial&amp;cg=rss-marca&amp;ci=es-widgetueditorial&amp;si=https://e00-marca.uecdn.es/rss/en/index.xml" alt=""/>]]></description><dc:creator><![CDATA[marca.com]]></dc:creator><link>https://www.marca.com/en/football/real-madrid/2019/02/23/5c7079d7e2704ee9b38b45a2.html</link><media:description type="html"><![CDATA[A potential transfer to take &lt;strong&gt;Eden Hazard to Real Madrid&lt;/strong&gt; looks less likely than ever at present, with a number of factors completely changing the complex of the sit]]></media:description><media:title type="html"><![CDATA[REAL MADRID|Unsure after form of Vinicius and Rodrygo's arrival]]></media:title><media:content url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508796541256.jpg" medium="image" width="650" height="366" /><media:thumbnail url="https://e00-marca.uecdn.es/assets/multimedia/imagenes/2019/02/23/15508796541256_150x0.jpg" width="150" height="84" /><guid>https://www.marca.com/en/football/real-madrid/2019/02/23/5c7079d7e2704ee9b38b45a2.html</guid>
    <pubDate>Sat, 23 Feb 2019 01:11:14 +0100</pubDate>
</item>
And if the string is empty, it should give:

Code: Select all

1 => 
2 => 
Profiles: Rainmeter ProfileDeviantArt ProfileSuites: MYiniMeterSkins: Earth
User avatar
balala
Rainmeter Sage
Posts: 16172
Joined: October 11th, 2010, 6:27 pm
Location: Gheorgheni, Romania

Re: Positive lookahead after the main expression, with capture

Post by balala »

Yincognito wrote: February 24th, 2019, 12:10 pm I just want it to give the expected result (i.e. from <channel> section's <title> line to its </image> line - inclusive) and not throw errors by giving two empty captures when the test string is empty as well.
How does an "empty string" looks like? Would need a complete code of the source page.
User avatar
Yincognito
Rainmeter Sage
Posts: 7164
Joined: February 27th, 2015, 2:38 pm
Location: Terra Yincognita

Re: Positive lookahead after the main expression, with capture

Post by Yincognito »

balala wrote: February 24th, 2019, 1:19 pm How does an "empty string" looks like? Would need a complete code of the source page.
What kind of question is that, LOL? :jawdrop The empty string looks ... empty, obviously. You know like "". Nothing, zip, nada, etc... I didn't yet implement this properly in a Rainmeter skin, just doing tests in RainRegExp, and the source code is too complicated anyway, too many dependencies, etc.

By the way, I managed to make a better regex, that looks like this:
(?siU)(?(?=.*<channel.*>.*<item.*>)(?:.*<channel.*>(.*)(?=<item.*>)))((?(?=.*<item.*>).*<item.*>.*</item>))
...but while this doesn't throw errors for the empty ("") string, I'm confused why it gives more than 2 capture groups (although starting from the 3rd capture group, they are also empty, so it isn't that bad). RegexBuddy seems to find only two capture groups (like it should), but when I highlight the 2nd one, it also highlights other <item>...</item> parts.

Screenshots:
Regex 01.jpg
Regex 02.jpg
You do not have the required permissions to view the files attached to this post.
Profiles: Rainmeter ProfileDeviantArt ProfileSuites: MYiniMeterSkins: Earth
User avatar
balala
Rainmeter Sage
Posts: 16172
Joined: October 11th, 2010, 6:27 pm
Location: Gheorgheni, Romania

Re: Positive lookahead after the main expression, with capture

Post by balala »

Yincognito wrote: February 24th, 2019, 1:42 pm What kind of question is that, LOL? :jawdrop The empty string looks ... empty, obviously.
Maybe I am mistaken something, but I doubt the code of a website would be completely empty in any circumstances.
Had I missed something? I thought some parts of the code are missing when you are talking about that empty string.
Sorry if I has been mistaken.
User avatar
Yincognito
Rainmeter Sage
Posts: 7164
Joined: February 27th, 2015, 2:38 pm
Location: Terra Yincognita

Re: Positive lookahead after the main expression, with capture

Post by Yincognito »

balala wrote: February 24th, 2019, 1:46 pm Maybe I am mistaken something, but I doubt the code of a website would be completely empty in any circumstances.
Had I missed something? I thought some parts of the code are missing when you are talking about that empty string.
Sorry if I has been mistaken.
No need to apologize, at least not with me. ;-) Thing is, I want to handle the cases where, for example, there's no internet connection, the site isn't a RSS/ATOM/feed site, the data cannot be retrieved, typos, etc. And, of course, I'm looking to avoid errors in my regex. That's why I'm going to great lengths to design my regex to handle those cases...

NOTE: Eventually, the skin will take input from the user on the feed sites he wants to interrogate, so I need fallback in case of various errors and fringe cases.
Profiles: Rainmeter ProfileDeviantArt ProfileSuites: MYiniMeterSkins: Earth
User avatar
balala
Rainmeter Sage
Posts: 16172
Joined: October 11th, 2010, 6:27 pm
Location: Gheorgheni, Romania

Re: Positive lookahead after the main expression, with capture

Post by balala »

Yincognito wrote: February 24th, 2019, 2:02 pm Thing is, I want to handle the cases where, for example, there's no internet connection, the site isn't a RSS/ATOM/feed site, the data cannot be retrieved, typos, etc. And, of course, I'm looking to avoid errors in my regex.
Unfortunately in such cases there is no way to avoid the error message. At least I don't know one.
User avatar
Yincognito
Rainmeter Sage
Posts: 7164
Joined: February 27th, 2015, 2:38 pm
Location: Terra Yincognita

Re: Positive lookahead after the main expression, with capture

Post by Yincognito »

balala wrote: February 24th, 2019, 4:40 pm Unfortunately in such cases there is no way to avoid the error message. At least I don't know one.
Thanks for the interest anyway.
Profiles: Rainmeter ProfileDeviantArt ProfileSuites: MYiniMeterSkins: Earth
User avatar
balala
Rainmeter Sage
Posts: 16172
Joined: October 11th, 2010, 6:27 pm
Location: Gheorgheni, Romania

Re: Positive lookahead after the main expression, with capture

Post by balala »

Yincognito wrote: February 24th, 2019, 4:48 pm Thanks for the interest anyway.
Yep, sorry. Maybe someone will have a better idea, but I think there is no way to avoid the error message if for example there is no internet connection. At least not easily. Just maybe through a SysInfo plugin measure, which should have to detect if the internet connection is alright and act accordingly?