Merge(?) multiple values from external nfo file into single metadata tag

nylamb · April 23, 2022, 6:48pm

Hello,

I am trying to write multiple values for the metadata tag GENRE after reading in an external file (movie.nfo) into a user defined field called MOVIE_INFO.

The external nfo file is generated by Media Companion (a video scraper) and sometimes contains multiple GENREs (and ACTORs), each individually enclosed in its own tag (<genre>GenreType</genre>).

The file contains:

    <genre>Action</genre>
    <genre>Crime</genre>
    <genre>Drama</genre>
    <genre>Mystery</genre>

I am trying to get the action to read this file and combine those lines into a single GENRE with the value "Action, Crime, Drama, Mystery" after the action is run.

I have tried using "format value" with:
$regexp(%plex_info%,'.*<genre>(.*)<\\/genre>.*',$1)

.. but GENRE will only hold the last one read(?) .. "Mystery".
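
The "last one only" result is consistent with greedy matching: the leading .* grabs as much text as it can, so <genre> can only match the final genre tag. Here is a minimal Python sketch of that behaviour. It assumes that $regexp acts like a replace-all and that "." also matches line breaks in Mp3tag (emulated with re.DOTALL); the variable names are illustrative only and not part of Mp3tag.

```python
import re

# Stand-in for the imported nfo text (the field content in Mp3tag).
nfo_text = (
    "<genre>Action</genre>\n"
    "<genre>Crime</genre>\n"
    "<genre>Drama</genre>\n"
    "<genre>Mystery</genre>\n"
)

# The greedy leading ".*" runs to the end of the text and backtracks only as
# far as needed, so "<genre>" ends up matching the LAST genre tag.
greedy = re.sub(r".*<genre>(.*)</genre>.*", r"\1", nfo_text, flags=re.DOTALL)
print(greedy)  # Mystery

# For comparison: collecting every match one by one yields all four values.
all_genres = re.findall(r"<genre>(.*?)</genre>", nfo_text)
print(", ".join(all_genres))  # Action, Crime, Drama, Mystery
```

A single replace returns one string, so getting all four values needs either a match-by-match collection (as in the second half of the sketch) or a chain of replacements.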

Ideally, I'd like to try to get this working for actors names as well (which are encased in <name>ActorName</name>)

    <actor>
        <id>1500155</id>
        <name>Robert Pattinson</name>
        <role>The Batman</role>
        <thumb>https://something.jpg</thumb>
        <order>1</order>
    </actor>
    <actor>
        <id>2368789</id>
        <name>Zoë Kravitz</name>
        <role>Catwoman</role>
        <thumb>https://something.jpg</thumb>
        <order>2</order>
    </actor>
    <actor>
        <id>0942482</id>
        <name>Jeffrey Wright</name>
        <role>Lt. James Gordon</role>
        <thumb>https://something.jpg</thumb>
        <order>3</order>
    </actor>

etc..

To show up in the tag CAST (and ACTOR) as "Robert Pattinson, Zoë Kravitz, Jeffrey Wright, etc."

Any help/direction would be greatly appreciated.

Thank you.

ohrenkino · April 23, 2022, 7:08pm

Import the whole text from the nfo file to the field TMP_GENRE (or whatever you called it before, in my example it has that name).
Then try an action of the type "Format value" for GENRE
Format string: $regexp($regexp(%tmp_genre%,'.*?<genre>(.*)</genre\>\s*.*',$1),(.*?)</genre>\s*\r\n\s*<genre>,$1\\\)
This should create the string "Action\\Crime\\Drama\\Mystery", which in turn leads to 4 fields of the type GENRE.

nylamb · April 24, 2022, 12:45pm

Thank you VERY much .. that worked for the genres but I was unable to figure out how to get it to work for an actors list (replacing "genre" with "name" in your format string did not work).

If you do not mind, can you explain how your string works?

I sorta understand that:
$regexp(%tmp_genre%,'.*?<genre>(.*)</genre\>\s*.*',$1)

%tmp_genre% = user defined field created during action "import text file"
.*? = lazy match any char except \n 0-unlimited times
<genre> = match this pattern
(.*) = greedy capture group
</genre\> = match this pattern .. why is there a "\" in there?
\s* = greedy match any space, tab or \n char
.* = greedy match any char except \n 0-unlimited times
$1 = data from the capture group is returned

Could you explain, or point me to where I can read up on, how nesting it inside another $regexp works?

With my example, I would have expected the returned value "Action" at first .. so the line would have looked like:
$regexp(Action,(.*?)</genre>\s*\r\n\s*<genre>,$1\\\)

Action = value returned from the nested $regexp
(.*?) = lazy capture group (for the next genre I suppose?)
</genre> = match this pattern .. no "\" this time?
\s* = greedy match any space, tab or \n char
\r = carriage returns
\n = new lines
\s* = greedy match any space, tab or \n char
<genre> = match this pattern
$1\\\ = data from the capture group is returned and \\(?)

How did the outside $regexp know to read the (same) user defined as the nested one?

Can one modify this string so that it works for an actors' list, or is that a whole different thing?

Thank you.

ohrenkino · April 24, 2022, 12:57pm

The backslash in </genre\> was a superfluous leftover from a previous iteration of the expression. Sorry.

Nesting expressions:
The expressions get evaluated from the inside out.
So first the inner regex returns a string result which is then fed into the next (outer) level of the nested expressions.
My idea was to first get all the text inside the first <genre> start and the last </genre> end
and then the next regex would take care of all the bits in < and > and also remove any strange space or line feed character and replace them with the double backslash.
And as the double backslash is treated as field separator for multi-value fields you get the result with several genre fields.
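
The two stages can be tried outside Mp3tag. The following Python sketch is only an emulation under assumptions: re.sub stands in for $regexp, re.DOTALL mimics "." matching line breaks, and the sample text and variable names are made up for illustration.

```python
import re

# Stand-in for the TMP_GENRE field: part of the nfo text with CRLF line ends.
tmp_genre = (
    "    <mpaa>PG-13</mpaa>\r\n"
    "    <genre>Action</genre>\r\n"
    "    <genre>Crime</genre>\r\n"
    "    <genre>Drama</genre>\r\n"
    "    <genre>Mystery</genre>\r\n"
    "    <tag>\r\n"
)

# Step 1 (inner expression): keep everything between the first <genre>
# and the last </genre>.
inner = re.sub(r".*?<genre>(.*)</genre>\s*.*", r"\1", tmp_genre, flags=re.DOTALL)

# Step 2 (outer expression): it works on the RESULT of step 1, not on the
# field, and turns each "</genre> ... <genre>" boundary into a double backslash.
SEP = "\\\\"  # two literal backslash characters
outer = re.sub(
    r"(.*?)</genre>\s*\r\n\s*<genre>",
    lambda m: m.group(1) + SEP,
    inner,
)

print(outer)             # Action\\Crime\\Drama\\Mystery
print(outer.split(SEP))  # ['Action', 'Crime', 'Drama', 'Mystery']
```

The final split only illustrates how the double backslash separates the values; in Mp3tag that splitting happens when the field is written.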

ohrenkino · April 24, 2022, 1:04pm

My idea relies on the pattern that all the genres are together in 1 block.
If I look at the artists, you have a separate xml-section for each artist, which in itself contains other xml-sections.
So you would have to save the whole contents of the nfo file to a (dummy) field and then cut away all the xml-sections that you do not want for the current purpose.
It would be a different set of actions, though.

Whenever you want to add another field of the same type to an already existing field, then use an action of the type "Format value" e.g. for ARTIST with
Format string: $meta_sep(artist,\\)\\New Artist Name
This adds at least "New Artist Name" in ARTIST, even if there had not been any data for ARTIST yet.
If there is already data in ARTIST, then the new name is added as new field of the type ARTIST.
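
As a rough illustration of that append pattern, here is a small Python sketch. It is not Mp3tag code: append_value is a hypothetical helper, and dropping empty pieces is an assumption made so that the "no data yet" case comes out as described above.

```python
SEP = "\\\\"  # two literal backslashes, the multi-value separator

def append_value(existing, new_value):
    """Emulate $meta_sep(field,\\)\\New Value on a list of field values."""
    joined = SEP.join(existing)          # like $meta_sep(artist,\\)
    combined = joined + SEP + new_value  # like ...\\New Artist Name
    # Assumption: empty pieces do not become fields of their own.
    return [value for value in combined.split(SEP) if value]

print(append_value([], "New Artist Name"))
# ['New Artist Name']
print(append_value(["First Artist"], "New Artist Name"))
# ['First Artist', 'New Artist Name']
```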

nylamb · April 24, 2022, 5:47pm

Hello ..

Thank you SO MUCH for the explanations and ideas behind what you were doing. It helped me A LOT to imagine how to do things "piece by piece". (I'm not a programmer and I've only learned of RegEx in the past few weeks.)

After a lot of trial and error today .. I have been able to get 95%(?) of what I am attempting to do (with actors/cast) done .. but I am unable to figure out the last piece of it.

For my particular format for the nfo file, I was able to figure out how to "strip the excess xml" in (almost) 1 step.

Actions:
1) Import text file "%_filename%.nfo" to user variable PLEX_INFO
2) Format value "plex_actor_name" format string:
$regexp(%plex_info%,'.*?<name>(.*?)</name>',$1\\\)
3) Remove fields "plex_info"

This was able to read ALL the actors into separate temporary-actor fields, but it is only 95% there because some excess data was "left behind". After reading the last "actor" .. it would retain the remaining XML data of the last actor.
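
The leftover can be reproduced with a short Python emulation of step 2. This is a sketch under assumptions: re.sub with re.DOTALL stands in for $regexp, and the shortened sample text is illustrative. Text after the last </name> is never part of a match, so the replacement leaves it untouched.

```python
import re

# Shortened stand-in for the PLEX_INFO field.
plex_info = (
    "    <actor>\r\n"
    "        <id>1500155</id>\r\n"
    "        <name>Robert Pattinson</name>\r\n"
    "        <role>The Batman</role>\r\n"
    "    </actor>\r\n"
    "    <actor>\r\n"
    "        <id>2368789</id>\r\n"
    "        <name>Zoë Kravitz</name>\r\n"
    "        <role>Catwoman</role>\r\n"
    "    </actor>\r\n"
)

SEP = "\\\\"  # two literal backslash characters

# Each match covers everything up to and including one </name>; it is
# replaced by the captured name plus the separator.
names = re.sub(
    r".*?<name>(.*?)</name>",
    lambda m: m.group(1) + SEP,
    plex_info,
    flags=re.DOTALL,
)

values = names.split(SEP)
print(values[:2])       # ['Robert Pattinson', 'Zoë Kravitz']
print(repr(values[2]))  # the XML that followed the last </name>
```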

I would like help figuring out how to remove that trailing piece of data from the temporary-actors fields.

Any continued assistance/advice would be greatly appreciated.

Thank you once again for the great help and advice.

I am unable to attach a copy of the nfo so here is an edited reduced portion of it with external links removed (in case it may help you to see what I am working with):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<movie>
    <fileinfo>
        <streamdetails>
            <video>
                <width>1920</width>
                <height>804</height>
                <aspect>2.40</aspect>
                <codec>HEVC</codec>
                <format>hev1</format>
                <durationinseconds>10572</durationinseconds>
                <bitrate>2 001 kb/s</bitrate>
                <container>.mp4</container>
                <framerate>23.976</framerate>
                <NumVideoBits>10</NumVideoBits>
            </video>
            <audio>
                <language>eng</language>
                <DefaultTrack>Yes</DefaultTrack>
                <codec>aac</codec>
                <channels>6</channels>
                <bitrate>224 kb/s</bitrate>
            </audio>
            <subtitle>
                <language>eng</language>
                <default>False</default>
                <forced>False</forced>
            </subtitle>
        </streamdetails>
    </fileinfo>
    <title>The Batman</title>
    <originaltitle>The Batman</originaltitle>
    <set>The Batman Collection</set>
    <setid>948485</setid>
    <sorttitle>The Batman</sorttitle>
    <outline>When the Riddler, a sadistic serial killer, begins murdering key political figures in Gotham, Batman is forced to investigate the city's hidden corruption and question his family's involvement.</outline>
    <plot>When the Riddler, a sadistic serial killer, begins murdering key political figures in Gotham, Batman is forced to investigate the city's hidden corruption and question his family's involvement.</plot>
    <tagline>Unmask The Truth</tagline>
    <year>2022</year>
    <premiered>2022-03-01</premiered>
    <ratings>
        <rating name="imdb" max="10" default="true">
            <value>8.2</value>
            <votes>344382</votes>
        </rating>
        <rating name="metacritic" max="10">
            <value>7.2</value>
            <votes>68</votes>
        </rating>
    </ratings>
    <userrating>0</userrating>
    <top250>161</top250>
    <country>United States</country>
    <runtime>176</runtime>
    <mpaa>PG-13</mpaa>
    <genre>Action</genre>
    <genre>Crime</genre>
    <genre>Drama</genre>
    <genre>Mystery</genre>
    <tag>
    </tag>
    <credits>Matt Reeves</credits>
    <credits>Peter Craig</credits>
    <credits>Bill Finger</credits>
    <director>Matt Reeves</director>
    <studio>Warner Bros.</studio>
    <studio>6th &amp; Idaho Productions</studio>
    <studio>DC Comics</studio>
    <trailer>
    </trailer>
    <playcount>0</playcount>
    <lastplayed>
    </lastplayed>
    <id>tt1877830</id>
    <tmdbid>414906</tmdbid>
    <videosource>WEBRip</videosource>
    <uniqueid type="imdb" default="true">tt1877830</uniqueid>
    <uniqueid type="tmdb">414906</uniqueid>
    <showlink>
    </showlink>
    <createdate>20220419160535</createdate>
    <stars>Robert Pattinson / Zoë Kravitz / Jeffrey Wright</stars>
    <actor>
        <id>1500155</id>
        <name>Robert Pattinson</name>
        <role>The Batman</role>
        <thumb>was_a_pic.jpg</thumb>
        <order>1</order>
    </actor>
    <actor>
        <id>2368789</id>
        <name>Zoë Kravitz</name>
        <role>Catwoman</role>
        <thumb>was_a_pic.jpg</thumb>
        <order>2</order>
    </actor>
    <actor>
        <id>0942482</id>
        <name>Jeffrey Wright</name>
        <role>Lt. James Gordon</role>
        <thumb>was_a_pic.jpg</thumb>
        <order>3</order>
    </actor>
    <actor>
        <id>0268199</id>
        <name>Colin Farrell</name>
        <role>The Penguin</role>
        <thumb>was_a_pic.jpg</thumb>
        <order>4</order>
    </actor>

    <actor>
        <id>11646175</id>
        <name>Chosen Wilkins</name>
        <role>Guard 2</role>
        <thumb>was_a_pic.jpg</thumb>
        <order>133</order>
    </actor>
    <actor>
        <id>9281009</id>
        <name>Daniel Joseph Woolf</name>
        <role>GCPD Officier</role>
        <thumb>was_a_pic.jpg</thumb>
        <order>134</order>
    </actor>
</movie>

nylamb · April 24, 2022, 6:07pm

Please .. no need to apologize. I am in unfamiliar territory (MP3tag and especially RegEx) .. so I am trying to understand/learn as much as I can .. to work things out on my own (and I don't want to become TOO much of a nuisance).

Thank you for your help and time to explain things to me. =)

ohrenkino · April 24, 2022, 6:10pm

I think the easiest way would be a second action of the type "Format value" for PLEX_INFO that checks whether <role> occurs somewhere in %plex_info% and then deletes it plus anything following it:
$regexp($meta_sep(plex_info,\\),(.*)\s*<role>.*,$1)
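
A quick way to see what this clean-up keeps is to emulate it in Python. This is a sketch only: re.sub with re.DOTALL stands in for $regexp, and the joined string is a made-up, shortened example of what $meta_sep might return.

```python
import re

SEP = "\\\\"  # two literal backslash characters

# Stand-in for $meta_sep(plex_info,\\): the names followed by leftover XML.
joined = (
    "Robert Pattinson" + SEP + "Zoë Kravitz" + SEP
    + "\r\n        <role>Catwoman</role>\r\n    </actor>\r\n</movie>"
)

# Emulates $regexp(...,(.*)\s*<role>.*,$1)
cleaned = re.sub(r"(.*)\s*<role>.*", r"\1", joined, flags=re.DOTALL)
print(repr(cleaned))
```

In this emulation the greedy (.*) keeps the whitespace in front of <role>, because \s* is allowed to match nothing. So the <role> data is removed, but a whitespace-only piece can remain after the last separator.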

Then use Converter > Tag - Tag to preview whether this works.

nylamb · April 24, 2022, 7:18pm

Hello,

I wasn't able to get the "convert tag-tag" working .. prolly cause it is late and my brain is mush.

I was able to remove the <role> in 2 ways ..

1) using your suggestion as post-processing (after adding the \\\) and
2) as pre-processing (before adding the \\\).

Both methods "worked" at removing the <role> and the data that followed, however .. there is now a (new?) empty space in the spot:

Any suggestions on how to remove the blank tag? (I can live with this if I must as this is a great improvement over my initial start.)

Thank you.

ohrenkino · April 24, 2022, 7:25pm

It looks like the leading space characters have not been deleted.
Does the clean-up expression feature the \s*?

nylamb · April 24, 2022, 7:46pm

It has the \s* in the "cleanup line".

Here is a screenshot in case I may have messed something up:

Thanks (again).

ohrenkino · April 24, 2022, 7:49pm

Could it be that the 3rd action is missing a backslash? I think there should be 4 of them at the end, not just 3.

nylamb · April 25, 2022, 3:03pm

Hello,

Just wanted to let you know that I found the fix for it .. I needed to add \s* to the 3rd line after (.*?)</name> from the above pic. After that, the cleanup line (line 4) took out the "blank" field.
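
For anyone following along, here is a Python sketch of why the extra \s* helps. It is an emulation under assumptions, not Mp3tag code: re.sub with re.DOTALL stands in for $regexp, actor_values is a hypothetical helper, and the sample text is shortened.

```python
import re

SEP = "\\\\"  # two literal backslash characters

def actor_values(nfo_text, eat_whitespace):
    """Extract actor names, then cut off the leftover <role>... tail."""
    pattern = r".*?<name>(.*?)</name>" + (r"\s*" if eat_whitespace else "")
    names = re.sub(pattern, lambda m: m.group(1) + SEP, nfo_text, flags=re.DOTALL)
    cleaned = re.sub(r"(.*)\s*<role>.*", r"\1", names, flags=re.DOTALL)
    return cleaned.split(SEP)

sample = (
    "<actor>\r\n    <name>Robert Pattinson</name>\r\n"
    "    <role>The Batman</role>\r\n</actor>\r\n"
    "<actor>\r\n    <name>Zoë Kravitz</name>\r\n"
    "    <role>Catwoman</role>\r\n</actor>\r\n"
)

print(actor_values(sample, eat_whitespace=False))
# ['Robert Pattinson', 'Zoë Kravitz', '\r\n    ']
print(actor_values(sample, eat_whitespace=True))
# ['Robert Pattinson', 'Zoë Kravitz', '']
```

Without the \s*, the last piece still holds a line break and spaces, which shows up as a "blank" field. With it, the last piece is truly empty, which matches what I saw: the blank field was gone after the cleanup line ran.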

After some trial and error .. I was able to clean up my actions by nesting a number of items.

Thanks for the help on figuring out how to convert media companion's nfo to be written into the file's metadata.