[Closed] Accented Characters F**** Up in MP3 tags (Unicode Issues?)

 
    •  
      CommentAuthorspacemarine
    • CommentTimeSep 25th 2006 edited
     permalink
    When looking at the tags of the MP3 versions of the albums
    from Jamendo, the French accented characters (just one example; other country-specific
    characters are broken, too) look like garbage.

    This is NOT only happening with ONE album or so, but with ALL albums I downloaded.

    Example: "é" is shown as "Ã©"

    Now I don't know which software Jamendo.com uses to write the tags (it is done by some
    Jamendo software, am I right?). But I suspect that Unicode issues are the reason for that.

    Looking at the tags in a hex editor, it seems the v2 tags are written as UTF-16 Unicode.
    Writing the tags as Unicode is fine, as it's the right step towards internationalization, but I suspect
    that accented characters, etc. somehow get lost in the conversion process.

    Maybe some Jamendo official could say something about that (what software is used
    to tag the files, and why accents get lost) :shamed:

    Thank you and all the best,
    Spacemarine
    •  
      CommentAuthorMadcat
    • CommentTimeSep 25th 2006
     permalink
    I also had this problem with all of the MP3 versions I've downloaded here. I think what I did to correct it was to specify UTF-8 in my BT client (Azureus), and it seemed to work, although I can't be certain that was indeed the problem/fix over the long term: soon afterwards I switched to getting the Ogg versions only, since the Ogg/FLAC tags don't show this problem. I was also having trouble fixing it, so I ended up re-downloading, or transcoding to FLAC when I had trouble getting the Ogg versions of some of the older releases.

    After looking into the subject of ID3 tags, it looks like the format has developed haphazardly over the years, in terms of versions and the features each one adds, so I avoid MP3s whenever I can now.
    •  
      CommentAuthorspacemarine
    • CommentTimeSep 25th 2006 edited
     permalink
    Well, you are mainly talking about the filenames, and you are absolutely correct, using a Unicode
    compatible client should solve that problem :wink:

    But what I was focusing on was the tags, and as far as they are concerned, there is no "excuse"
    for declaring the tags as being Unicode (UTF-16) ID3v2.3 tags
    (which is in fact perfect for accented characters
    and all other international characters), but then writing wrongly encoded special characters into them.
    Note that buggy software (read: buggy ID3v2 implementations) is one of the main reasons for the mp3 tag chaos.

    If software uses Unicode, it has to "do it right", and I strongly suspect that the software
    Jamendo.com uses for tagging the albums has issues with Unicode. Which would be a shame :shamed:

    PS: Don't get me wrong, Jamendo guys, I don't want to sound offensive or disrespectful at all, but I think this is a
    real issue that should be taken care of.
    •  
      CommentAuthorMadcat
    • CommentTimeSep 26th 2006
     permalink
    Ya. I should have specified about the filenames and tags issue within the BT client. I went back to an old MP3 DL I had previous problems with and indeed the filename was corrected but the tag info was fubar'd. And all the accented characters were still mangled.

    I never did manage to look into it far enough to actually find a way to fix the problem. However, I might suggest that the Jamendo folks first enter the music info into MusicBrainz and then use the tags its system generates. That would standardize the tags and also get the various artists' info out to anybody who uses their database (like Last.fm); MusicBrainz is a freely open system that anyone can edit or add information to.
    •  
      CommentAuthorcasainho
    • CommentTimeSep 27th 2006
     permalink
    Well, I use Linux, and there is a simple program to rename files from Windows to UTF-8.

    The name is "utf-conversion.sh"; look on Google.
    •  
      CommentAuthorsylvinus
    • CommentTimeSep 30th 2006
     permalink
    Hm, I guess you're right spacemarine. Thanks for your technical insight. I've had problems too with accented characters on our tags.

    I'll fix that next month, I'll post some test MP3s here before re-encoding our whole database ;-)

    regards,
  1.  permalink
    Thanks, and I'll be glad to check out those test MP3s :)
    • CommentAuthortobox
    • CommentTimeJan 20th 2007
     permalink
    sylvinus wrote: I'll fix that next month, I'll post some test MP3s here before re-encoding our whole database ;-)


    Any updates on that? The problem still persists, I am currently working around it by retagging the files here, but I guess that most people don't know how to do it.
    It should not be too hard for Jamendo to fix the problem. The ID3v1 tags are OK (or at least as good as ID3v1 tags can be). The problem is only in the ID3v2 tags, which are usually preferred if both exist. They are encoded as UTF-16, which is also OK (spacemarine already mentioned it).

    This is the start of the song "Si mon rêve", but the accented "e" is somehow mangled into 2 UTF-16 characters, which is the problem. Could somebody from jamendo comment on how the tags are converted and written? I could then do some more debugging, or maybe provide a script that fixes already downloaded files.

    0000:0000 49 44 33 03 00 00 00 00 0b 34 54 49 54 32 00 00 ID3......4TIT2..
    0000:0010 00 1d 00 00 01 ff fe 53 00 69 00 20 00 6d 00 6f .....ÿþS.i. .m.o
    0000:0020 00 6e 00 20 00 72 00 c3 00 aa 00 76 00 65 00 00 .n. .r.Ã.ª.v.e..
    0000:0030 00 54 41 4c 42 00 00 00 15 00 00 01 ff fe 41 00 .TALB.......ÿþA.
    0000:0040 20 00 6c 00 27 00 61 00 75 00 62 00 65 00 00 00 .l.'.a.u.b.e...
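
    For anyone who wants to read the dump above programmatically, here is a minimal Python sketch. It assumes the ID3v2.3 layout (10-byte tag header, then frames with a 4-byte ID, a 4-byte big-endian size, 2 flag bytes, and for text frames one encoding byte where 01 means UTF-16 with a BOM). The name read_text_frames is made up for this illustration; it is not Jamendo's code, and it ignores special cases such as TXXX frames.

```python
import struct

# The 80 bytes from the hex dump above, copied row by row.
DUMP = bytes.fromhex(
    "49 44 33 03 00 00 00 00 0b 34 54 49 54 32 00 00"
    "00 1d 00 00 01 ff fe 53 00 69 00 20 00 6d 00 6f"
    "00 6e 00 20 00 72 00 c3 00 aa 00 76 00 65 00 00"
    "00 54 41 4c 42 00 00 00 15 00 00 01 ff fe 41 00"
    "20 00 6c 00 27 00 61 00 75 00 62 00 65 00 00 00"
)


def read_text_frames(data):
    """Return {frame_id: text} for the text frames of an ID3v2.3 tag."""
    if data[:3] != b"ID3" or data[3] != 3:
        raise ValueError("not an ID3v2.3 tag")
    frames = {}
    pos = 10  # skip the 10-byte tag header
    while pos + 10 <= len(data):
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":
            break  # reached the padding area
        # In v2.3 the frame size is a plain big-endian 32-bit integer.
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = data[pos + 10:pos + 10 + size]
        pos += 10 + size
        if not frame_id.startswith(b"T") or not body:
            continue  # only text frames are handled in this sketch
        # Encoding byte: 0 = Latin-1, 1 = UTF-16 with BOM.
        codec = "utf-16" if body[0] == 1 else "latin-1"
        text = body[1:].decode(codec).rstrip("\x00")
        frames[frame_id.decode("ascii")] = text
    return frames


frames = read_text_frames(DUMP)
for frame_id, text in frames.items():
    # ascii() keeps the output readable on any terminal encoding.
    print(frame_id, len(text), ascii(text))
```

    The TIT2 title comes back 12 characters long instead of 11: U+00C3 and U+00AA sit where a single U+00EA ("ê") should be, while the TALB frame ("A l'aube") decodes cleanly.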

    Cheers,
    Thomas
  2.  permalink
    Same encoding issue here (of course:-)
    As long as there's no fix and also for the already downloaded albums:
    Does anybody know a Windows or platform-independent application which can convert the tags in a batch job?
    Of course I could edit the tags one by one, but this could take days:-)
    •  
      CommentAuthorsylvinus
    • CommentTimeJan 20th 2007
     permalink
    Hi!

    Thanks for looking into this issue. I just pasted our ID3v2 code here; if you can find the problem, I'll happily patch it!

    http://pastebin.ca/322768

    thanks!
    •  
      CommentAuthorartm
    • CommentTimeFeb 2nd 2007
     permalink
    Hi

    I don't think the problem is with that file per se but may be with the way you get the information for tags.

    Here is the diagnosis:

    The tag strings are first encoded as UTF-8, which should be fine for MP3s from the whole world. But at some point the raw string data gets treated as if it were Latin-1 and is then encoded as UTF-16. That's why accented characters turn into double characters: they were two-byte sequences in UTF-8 and get treated as two characters when read as Latin-1. If you want, I could investigate further where things go wrong; send me an email at femistofel@gmail.com.
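
    The chain described above can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism, not Jamendo's actual code path; the variable names are made up, and the title is taken from the hex dump posted earlier in the thread.

```python
# "Si mon reve" with U+00EA (e with circumflex), written with an escape
# so the snippet does not depend on the source file encoding.
title = "Si mon r\u00eave"

# Step 1: the text is stored as UTF-8; the accented letter becomes 2 bytes.
utf8_bytes = title.encode("utf-8")

# Step 2: those bytes are wrongly interpreted as Latin-1, so each byte
# turns into its own character (U+00C3 and U+00AA).
misread = utf8_bytes.decode("latin-1")

# Step 3: the misread text is written into the tag as UTF-16 with a BOM.
mangled_payload = b"\xff\xfe" + misread.encode("utf-16-le")
correct_payload = b"\xff\xfe" + title.encode("utf-16-le")

print(len(title), len(misread))
print(mangled_payload.hex())
print(correct_payload.hex())
```

    The mangled payload contains the bytes c3 00 aa 00, exactly as in the dump, where a correct tag would contain ea 00.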

    To illustrate this I've made a tiny Python script that fixes the problem. I used the original pytagger library to read/write tags, but it should work with your version as well. I didn't bother fixing ID3v1 tags because I don't see them in players.

    http://pastebin.ca/337107
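
    In case the pastebin link is unavailable: the core of such a repair can be sketched as below. This is not the linked script, just an illustration of the idea under the assumption that the only damage is "UTF-8 read as Latin-1"; unmangle is a hypothetical name, and reading/writing the tag itself (e.g. with pytagger) is left out.

```python
def unmangle(text):
    """Undo 'UTF-8 bytes misread as Latin-1' for one tag string.

    Strings that do not fit that pattern are returned unchanged.
    """
    try:
        # Map every character back to the byte it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes
        # are not valid UTF-8: it was probably not mangled this way.
        return text


samples = [
    "Si mon r\u00c3\u00aave",  # mangled title from the hex dump
    "Si mon r\u00eave",        # already correct, must stay as it is
    "A l'aube",                # plain ASCII, unaffected
]
for sample in samples:
    print(ascii(sample), "->", ascii(unmangle(sample)))
```

    Falling back to the input on errors makes it safe to run over a whole collection, including files that were already fixed. It will not help with damage that follows a different pattern, and a title that legitimately contains a sequence like "Ã©" would be converted by mistake.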

    The problem now is that I have to keep two versions of all albums because I keep seeding them forever. Well, at least the least popular ones.

    cheers,
    artm
    •  
      CommentAuthorsylvinus
    • CommentTimeFeb 5th 2007
     permalink
    Thanks for your post, artm. I've worked a bit on the issue. There seems to be an updated version of pytagger out (0.5). We are currently using 0.3 :-/ Does 0.5 fix the bug?

    I'll post some examples of encoded files on the blog before updating the blog so that everyone can comment them.
    •  
      CommentAuthorartm
    • CommentTimeFeb 6th 2007
     permalink
    I never used pytagger before, I only checked it out since you use it, so I can't say whether the older version was causing problems. I don't see any text encoding changes in the ChangeLog between 0.3 and 0.5, though.

    But I suspect the problem could creep in before the actual tagging occurs. How do you acquire the text for the tags? Does it come from some database? Maybe the text is stored as UTF-8 in the database but read as if it were Latin-1?
    •  
      CommentAuthorsylvinus
    • CommentTimeFeb 6th 2007
     permalink
    Yeah maybe. I'll run checks in python against that.

    Thanks!
    •  
      CommentAuthorartm
    • CommentTimeApr 4th 2007
     permalink
    In one of the recently uploaded albums the problem seems to have mutated.

    http://www.jamendo.com/en/album/4937/

    The last track has a three-byte sequence for " í " which my unjamendo script can't recover properly.
    •  
      CommentAuthorspacemarine
    • CommentTimeMay 17th 2007 edited
     permalink
    Has this whole thing been fixed yet? Any status update? ;)
    •  
      CommentAuthorsylvinus
    • CommentTimeAug 24th 2007
     permalink
 
