libintl-perl

 Problem with non-ASCII message id's 
 Date  08/16/05 15:22:33 GMT
 From  Guido Flohr
 Subject  Problem with non-ASCII message id's
Hi Jörn,

Jörn Reder wrote:
> Guido Flohr wrote:
>
>> The problem is that the information about the character set is only used
>> for the possible output conversion of the translations.  It does not
>> influence the lookup of the translations, i. e. the msgid strings are not
>> first converted to the character set of the mo file.  There is no way to
>> set the input character set in the API.  How could libintl (Perl or C)
>> know which character set is used?  It does a binary comparison.
> Hmm, libintl, or rather Perl, should know which character set is used, at
> least with perl > 5.8. Or let's say: it doesn't know exactly which character
> set, but it knows whether it's utf-8 or not (from the internal
> utf8 flag), which is sufficient for our problem.

The internal utf-8 flag is based on a guess that Perl makes, and the Perl
unicode list is full of examples where this guess is wrong.
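
A minimal sketch of why the byte-wise lookup misses and why the flag does not help, using only core Perl. The %catalog hash and lookup() are hypothetical stand-ins for an mo file with UTF-8 msgids; they are not how libintl-perl is implemented.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same text, "Jörn", encoded two ways ("\x{F6}" is o-umlaut).
my $text   = "J\x{F6}rn";
my $latin1 = encode('ISO-8859-1', $text);   # 4 bytes: J F6 r n
my $utf8   = encode('UTF-8', $text);        # 5 bytes: J C3 B6 r n

# Hypothetical stand-in for a catalog whose msgids are stored as UTF-8 bytes.
my %catalog = ($utf8 => 'translated');

# Like gettext, fall back to the msgid itself when nothing matches.
sub lookup {
    my ($msgid) = @_;
    return exists $catalog{$msgid} ? $catalog{$msgid} : $msgid;
}

# The flag only describes Perl's internal storage. After upgrading, the
# flag is on, yet the string still holds the same Latin-1 characters.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

printf "utf-8 msgid found:   %s\n", lookup($utf8) eq 'translated' ? 'yes' : 'no';
printf "latin-1 msgid found: %s\n", lookup($latin1) eq 'translated' ? 'yes' : 'no';
printf "flag on after upgrade: %s, same characters as before: %s\n",
    (utf8::is_utf8($upgraded) ? 'yes' : 'no'),
    ($upgraded eq $latin1 ? 'yes' : 'no');
```

The Latin-1 msgid falls through untranslated although it is the "same" word, and the upgraded copy carries the flag without being UTF-8 encoded text.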

> If we could let libintl know that the target charset is utf-8
> (probably this can just be assumed, at least I never used it with another
> target charset and we all know utf-8 is really reasonable here).
> With this assumption, the __() function just needs to call
> Encode::encode("utf8",$_[0]) to get the same binary representation as used
> in the message catalog, or am I missing something?

The conversion can fail.  And the assumption about whether the input is in
utf-8 will often be wrong.  The worst part is that these failures will often
be caused by user locale settings.
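
Both failure modes can be sketched with the core Encode module. The helper decode_utf8_or_undef() is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: treat $bytes as UTF-8 and return the decoded text,
# or undef if the bytes are not valid UTF-8.
sub decode_utf8_or_undef {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
}

my $latin1_bytes = "J\xF6rn";       # "Jörn" in ISO-8859-1
my $utf8_bytes   = "J\xC3\xB6rn";   # "Jörn" in UTF-8

# Case 1: the conversion fails outright, because Latin-1 input is not
# valid UTF-8.
print defined decode_utf8_or_undef($latin1_bytes)
    ? "decoded\n" : "conversion failed\n";

# Case 2: the wrong assumption goes unnoticed. Any byte sequence is valid
# ISO-8859-1, so UTF-8 input decodes "successfully" into five characters
# instead of four.
my $wrong = decode('ISO-8859-1', $utf8_bytes);
printf "characters after wrong guess: %d\n", length $wrong;
```

The second case is the nastier one: nothing dies, the msgid is simply different from the one in the catalog.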

> Sounds so simple I just tried it myself ;) I added the following hack to
> my test program:
>
> sub __ ($) {
>     my $id = Encode::encode("utf8", $_[0]);
>     Encode::_utf8_off($id);
>     return Locale::TextDomain::__($id);
> }
> Switching off the internal utf8 flag was necessary to make the variable
> "binary" again. Otherwise Perl's internal magic later recodes the variable
> back to latin1, presumably somewhere on the file I/O layer. Anyway, with
> this hack my tiny test program works.

You see? You had to switch obscure flags on and off.  And you pay for that
with a compatibility nightmare (do you know the locale settings of your
users?) plus a performance penalty.

BTW, you could write the above in a more compatible fashion using
Locale::Recode and  Locale::Messages::turn_utf_8_off().

> Now the question is whether this would be worth adding to libintl...
> For projects not dealing with 8bit message id's it's a lot of
> overhead for nothing. I think this could be solved by adding another
> parameter to Locale::TextDomain->import() which controls exporting the utf8
> mangling variants of the functions on demand. This should be no noticeable
> overhead at all. What do you think? I would make a corresponding patch if
> you would accept it ;)

Honestly, I would not encourage anybody to use non-ascii message ids with
Perl.

If you still want to do so, the easiest fix IMHO is to not mix character sets.
Why introduce a solution for a runtime problem that can be easily solved when
writing or distributing the software?   Change both po files and sources to
iso-8859-1 and everything will work.  Change it to utf-8 and it will work as
well.  In the worst case you have to write some lines of code that will
recode the files to the appropriate character set before you release, if
somebody has difficulties editing utf-8 files.
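
Such a release-time recoding step could look roughly like the following sketch, which uses only core Perl I/O layers. The function name recode_file() and the file names in the usage line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite one file from encoding $from to encoding $to.
sub recode_file {
    my ($in_path, $out_path, $from, $to) = @_;

    open my $in, "<:encoding($from)", $in_path
        or die "Cannot read $in_path: $!";
    open my $out, ">:encoding($to)", $out_path
        or die "Cannot write $out_path: $!";

    # The layers decode on input and encode on output, line by line.
    print {$out} $_ while <$in>;

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage (file names are placeholders):
#   recode_file('lib/MyApp.pm.latin1', 'lib/MyApp.pm', 'ISO-8859-1', 'UTF-8');
```

For po files this is not enough on its own, because the charset in the Content-Type header has to change as well; msgconv from GNU gettext takes care of that.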

Your solution will waste cpu cycles for users every time they run the
software.  IMHO it's always preferable to waste the cpu cycles of the
development machine just once.

If somebody really, really accepts all that, then writing the tiny wrapper
around the libintl functions should be feasible.

>> Thanks for your interest and for DVD::Rip. :-)
> You're welcome ;) Thanks for the quick answer. And dvd::rip uses
> Locale::TextDomain without any trouble, since the original language was
> english there ;)

And thanks for the credit. ;-)

Oops, it's dvd::rip, not DVD::Rip. ;-)

Regards,
Guido
--
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.net/
 
 08/16/05 09:09:12 GMT  Jörn Reder
 08/16/05 12:58:20 GMT  +--Guido Flohr
 08/16/05 14:01:13 GMT    |--Jörn Reder
 08/16/05 15:22:33 GMT    |  +--Guido Flohr
 08/17/05 11:00:11 GMT    |    |--Jörn Reder
 08/18/05 07:30:54 GMT    |    +--Jörn Reder
 08/18/05 08:20:43 GMT    |      +--Jörn Reder
 08/18/05 09:05:35 GMT    |        +--Guido Flohr
 08/17/05 08:29:41 GMT    +--Bruno Haible