Exported csv has varying number of columns

cmartin · March 22, 2022, 3:40pm

I exported tags from a large group of flac and mp3 files to a csv using a custom mte spec. I first used $list() to export a listing of tags in these same files. When viewing the $list() output, I noticed that the different files had different tag sets. This is not surprising as the files were generated and tagged inconsistently over a few years (I am in process of cleaning up)

But what is surprising is that the csv of exported tags contains a varying number of columns across all the files. My mte file contains all fields which were found in at least one file (using $list() output). While I realize not all tags are in all files, I would expect the output to use blank/null columns for tags not found.

When I import the '|'-delimited csv into sqlite, it reports varying # of columns on different lines:
all_rev.csv:95: expected 62 columns but found 17 - filling the rest with NULL
all_rev.csv:96: expected 62 columns but found 46 - filling the rest with NULL
all_rev.csv:323: expected 62 columns but found 17 - filling the rest with NULL
all_rev.csv:324: expected 62 columns but found 1 - filling the rest with NULL
...

This is telling me that the first 94 lines have the expected 62 columns, but lines 95 and 96 has 17 and 46 respectively, and lines 323 and 324 has 17 and 1 column each. There are many more lines I have not shown here.
I looked at the csv and verified that the # of delimiters agrees with this output.

The generated csv does has many blank columns (as expected) on the lines that have the expected # of columns, for instance, this line has many blank columns:

4/14|-6.88 dB|1.109784|-8.04 dB|0.991878|Back Porch Music Vol 7|Balfa Toujours|Various Artists|Audiograbber 1.83.01, LAME dll 3.96, VBR 2 (Old method), Joint Stereo, Normal quality|Kingpin Special|||||||||||||||||||||||||||||||%TCOM%|||||||||||||||||||||

So the question is, why doesn't every line has the same number of columns, blank or not? Obviously, the csv lines with varying # of columns can't be processed since it is not possible to tell which columns were dropped and which were not.

I have pasted the mte I used to generate the csv at bottom of this message.

Thanks

MTE FILE USED:

ohrenkino · March 22, 2022, 3:47pm

I see that you have UNSYNCEDLYRICS in your export script - did you take care that there are no line breaks in the data?
COMMENT is also a field the odd line break might have crept in.
Also, did you take care, that no field has the separator in the payload data? (e.g. Colourbox M|A|R|R|S would have the bar as part of the name)

cmartin · March 22, 2022, 4:03pm

I will look for embedded line breaks, this is possible, more likely in COMMENT field as I don't think there are any UNSYNCHEDLYRICS. However, once I eliminate those issues, you are confirming that I should see the same number of columns in every line (as I would expect), so I'll do some further cleaning and see what I've got.

Thanks

ohrenkino · March 22, 2022, 4:07pm

THe line breaks are always an issue.
I don't know what the target program should be - but if it is one that knows how to import CSV files or files with a dedicated separator, then I would use the TAB character as separator.
So replace every bar (|) with $char(9) and then see what happens in the import.
With that method it should be easy to find the lines where it went wrong

cmartin · March 23, 2022, 12:04am

The problem was embedded line breaks, as you suggested. I should have known as I this issue was something I encountered regularly before I retired, working with health research data, often including CSV files with embedded (within-field) line breaks. I developed some tools in SAS to deal with it back then, but it's been a few years and I hadn't anticipated they would exist in tag data.

The fix was quite simple, I just updated my MTE file to quote all fields:
$loop(%_filename_ext%)$char(34)%title%$char(34)|$char(34)%comment%$char(34) ...

The resulting output imports perfectly into sqlite, and I suspect into Excel as well, retaining any embedded line breaks found within fields.

Thanks for pointing out what should been obvious to me!

cmartin · March 24, 2022, 1:10pm

UPDATE: As I mentioned, I was able to resolve the embedded line breaks by double-quoting the fields, however, I then discoverd some fields had embedded quotes so that import of the delmited data was still failing. I decided to using the $replace() mp3tag scripting function to remove all offending characters (embedded line breaks and double-quotes) from the data when exporting. That solved all problems and allowed importing the files into sqlite (or any softwared that can handle delimited files) eliminating the need for additional post-processing of the delmited files.

In case other users need to solve similar issues here is my MTE code snippet (inside what loop() you need):

$char(9)
$trim(
$replace( %FIELDNAME%, $char(34),$char(39), $char(13)$char(10), $char(32), $char(10),$char(32) )
)
$char(9)

The 3 parameter pairs replace 3 problem character sequences as follows:
double-quotes with [$char(34)] single-quotes [$char(39)]
(Windows) line breaks [$char(13)$char(10)] with a space [$char(32)],
single line-feeds [$char(10)] with a space.

The whole thing is wrapped in $trim() function to strip leading/trailing spaces where the offending quotes or line breaks happen to occur at the start/end of the field value. Note this output not quoted but is instead tab-delimited ($char(9)), which is safe unless your data contains embedded tabs (unlikely). Comma- or pipe- (|) delimiters could be used instead but if so I would quote the field values as these may well appear in the data. Just add quotes inside the tabs or whatever delimter you use:

$char(9)$char(34) ... $char(34)$char(9)

My first attempt included only the first two parameter pairs (removing quotes and Windows line breaks), but I then discovered some occasional embedded single line-feed characters ($char(10)) were causing import failures, so I added the third parameter pair to convert those to spaces as well. Having multiple parameter pair support in the $replace() function is a great feature and simplified the code considerably.

I am finally able to export any/all fields to delimited files without worrying about embedded characters. Mp3Tag is a powerful and flexible piece of software, thanks so much.

ohrenkino · March 24, 2022, 1:19pm

I would not replace the line breaks with a space character - I assume that they served a special purpose in the original data.
So if I want to restore the line breaks in a later output, then I will fail if only space characters are left.
Admittedly, there are not a lot single characters left, that can serve as a substitute as most of them are either important for the data itself like punctuation) or for the parser to identify the data.

So what about 2 characters like ==?
This would be my suggestion: replace the line breaks with ==.

cmartin · March 25, 2022, 7:24pm

Yes, very good point, For my current purpose I just wanted to see what was in the tag data, I wasn't planning (yet) to edit and update the tags. I have many audio files from many sources (purchased/downloaded, rips of my CDs over some years using different programs with varying attention to detail when tagging), so for now I really just wanted to see at what is in the data. But if (when) I do want to update, I agree, it would be best to enable restoring the line breaks by substituting unique character(s). I have used != in the past in other (non-tagging) applications where I definitely needed to restore the breaks; '==' is good one as well.