Discussion:
[Openinteract-dev] Small i18n issue
Kutter Martin
2004-09-08 07:34:03 UTC
Permalink
Hi * !

I ran into a small internationalization issue today.
I'm running an OI2 site with SPOPS::LDAP as backend. When storing non-ASCII
characters, the LDAP directory server complains that properties with
non-ASCII characters have an "invalid syntax".

I've been able to track this down to a charset problem.
LDAP expects directoryString attributes to be UTF-8 encoded. The
perl-ldap interface (Net::LDAP) does not provide UTF-8 conversion by
default, so it has to be done by the application using Net::LDAP. This is
no big deal - just a

use Encode;
$value = decode($charset, $value);

for every field to be set - but one needs to know the request's charset.

The charset used in the HTTP request is specified by the "charset" parameter
of the Content-Type header.

Example:

Content-Type: multipart/form-data; boundary="--------------12345";
charset="EUC-JP"

The default is "iso-8859-1" if no charset is supplied.
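To make the above concrete, here is a minimal sketch of extracting the charset from a Content-Type header value, with the iso-8859-1 fallback described above. The helper name is hypothetical and not part of OI2 or any module mentioned here:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a
# Content-Type header value, defaulting to iso-8859-1 when no
# charset is supplied.
sub charset_from_content_type {
    my ($header) = @_;
    return 'iso-8859-1' unless defined $header;
    if ( $header =~ /charset\s*=\s*"?([\w.-]+)"?/i ) {
        return lc $1;
    }
    return 'iso-8859-1';
}

print charset_from_content_type(
    'multipart/form-data; boundary="----12345"; charset="EUC-JP"'
), "\n";    # euc-jp
print charset_from_content_type('text/html'), "\n";    # iso-8859-1
```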

The problem is that the only available way to get the charset used in
the request is to grab it from the underlying Apache::Request or
CGI::Request handle - not really easy and not really portable:

my $contentHeader = CTX->request->apache->headers_in()->{ 'Content-Type' };

As differing charsets in HTTP requests are very likely in i18n'ed
environments, and the problem is just as likely to occur in non-LDAP
environments, I would suggest an extension to the
OpenInteract2::Request class that provides access to the Content-Type HTTP
header, like it already does for some other header fields.

Maybe an even more general approach - exposing all HTTP headers in the
request object - would be suitable: this would remove the need for code
changes whenever an additional HTTP header has to be handled.

Regards,

Martin Kutter
Kutter Martin
2004-09-08 08:07:00 UTC
Permalink
Hi Andreas,

This looks like the default solution. Unfortunately, SPOPS::Tool::UTFConvert
always assumes iso-8859-1 (Latin1) as the originating charset, which is not
necessarily true.

So this does not work for charsets other than Latin1. The ability to grab
the charset from the request, in conjunction with a slightly modified
SPOPS::Tool::UTFConvert (use the request's charset, if given), would remove
the problem completely.

Regards,

Martin Kutter

-----Original Message-----
From: openinteract-dev-***@lists.sourceforge.net
[mailto:openinteract-dev-***@lists.sourceforge.net]On Behalf Of
***@Bertelsmann.de
Sent: Mittwoch, 8. September 2004 11:53
To: openinteract-***@lists.sourceforge.net
Subject: AW: [Openinteract-dev] Small i18n issue


Hi Martin,

We had the same problem with reading Umlaut characters with LDAP for the
user names. You can solve this by adding the following to the spops.perl
file of the package in question:

rules_from => [ 'SPOPS::Tool::UTFConvert' ],

Hope that helped...

Mit freundlichen Grüßen
Andreas Nolte
Leitung IT
-------------------------------------------------
arvato direct services
Olympiastraße 1
26419 Schortens
Germany
http://www.arvato.com/
***@bertelsmann.de
Telefon +49 (0) 44 21 - 76-84002
Telefax +49 (0) 44 21 - 76-84111



Teemu Arina
2004-09-08 08:41:10 UTC
Permalink
Post by Kutter Martin
I've been able to track this down to a charset problem.
LDAP expects directoryString attributes to be in UTF-8 encoding. The
perl-ldap interface (Net::LDAP) does not provide UTF-8 conversions by
default, so these are to be done by the application using Net::LDAP. This
is no big deal - just a
use Encode;
$value = decode($charset, $value);
I had a similar problem with DBD::mysql and UTF-8. DBI has no general policy
for UTF-8, so it has to be implemented by the DBDs themselves, and DBD::mysql
does nothing about it. If you store UTF-8 strings in the database and
retrieve them later, the strings do not come back marked as UTF-8. It might
then happen that your UTF-8 strings get encoded as UTF-8 a second time,
because Perl didn't know they were UTF-8 already =) The Encode module helps
fix this problem, for example in SPOPS::DBI::MySQL (define a
post_fetch_action() that converts all data fields).
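A minimal sketch of what such a post-fetch hook might do, using only the core Encode module: decode the raw UTF-8 bytes coming back from the database into Perl character strings, so they carry the UTF-8 flag. The function name, field list, and hashref shape are assumptions for illustration, not the actual SPOPS::DBI::MySQL API:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw UTF-8 bytes in the given fields of a fetched row into
# Perl character strings (which get the UTF-8 flag set).
sub decode_fetched_fields {
    my ( $row, @fields ) = @_;
    for my $f (@fields) {
        next unless defined $row->{$f};
        $row->{$f} = decode( 'UTF-8', $row->{$f} );
    }
    return $row;
}

my $row = { name => "Caf\xc3\xa9" };    # raw UTF-8 bytes for "Café"
decode_fetched_fields( $row, 'name' );
print length( $row->{name} ), "\n";     # 4 characters, not 5 bytes
```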

I wonder when you will be able to write UTF-8 compatible software without
mucking with the internals on several layers... UTF-8 has been around for so
many years and still many module writers ignore it. It also wasn't until
MySQL 4.x that UTF-8 support for character type fields was included.
Post by Kutter Martin
The problem is, that the only available solution to get the charset used in
the request is to grab it from the underlying Apache::Request or
my $contentHeader = CTX->request->apache->headers_in()->{ Content-Type };
I noticed the same thing. I also found another way to set it:

CTX->response->content_type( 'text/html; charset=utf-8' )

CTX->response->charset() would be nice to have.

Greetings,

- Teemu
Kutter Martin
2004-09-08 10:23:05 UTC
Permalink
Hi Teemu, Hi *,

looks like we just found a not-so-small issue...

While SPOPS::Tool::UTFConvert can handle conversion for SPOPS backends,
there's nothing like that for the OI2 frontends (say, Template::Toolkit and
the like).

My suggestion for the "whole OI2 i18n charset encoding" would be:

1. get the charset from the request
2. encode all parameters as UTF-8 when fetching them in the request object
(all but uploads)
3. set the Content-type: charset="foo" for the response (if needed).
4. encode all output in the Response object to the appropriate charset just
before sending it (if needed).
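Steps 2 and 4 above can be sketched with nothing but the core Encode module: decode incoming parameter bytes from the request charset into Perl's internal character representation, do all processing on character strings, then encode the output into the response charset. The charset names and values here are only examples:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $request_charset  = 'iso-8859-1';
my $response_charset = 'iso-8859-1';

# Step 2: incoming bytes -> internal character string
my $param_bytes = "M\xfcller";                    # "Müller" in Latin1
my $param       = decode( $request_charset, $param_bytes );

# ... internal processing works entirely on character strings ...

# Step 4: internal character string -> outgoing bytes
my $output = encode( $response_charset, $param );
print length($param), " ", length($output), "\n";    # 6 6
```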

Step 4 would probably be an issue for the Controller - OI2::Controller::Raw
should never re-code anything, and alternative controllers - say, one for
outputting PDFs - probably shouldn't recode their output either.

This would allow OI2 to use only UTF-8 in its internal processing, but
serve frontends with potentially different character encodings.

It would also remove the need for charset conversions in SPOPS backends (as
long as the backends are UTF-8 capable - most perl modules should be) - the
data would already be in the appropriate form, and the number of supported
charsets would far exceed the current sad 'Latin1'.

Regards,

Martin

Teemu Arina
2004-09-08 13:03:24 UTC
Permalink
Post by Kutter Martin
Step 4 would probably be an issue for the Controller - OI2::Controller::Raw
should never re-code anything, and alternative controllers like, let's say
for outputting PDFs - probably shouldn't recode their stuff, too.
I do agree. The content_type should not include charset=iso-8859-1 as it
does now, though.
Post by Kutter Martin
This would allow OI2 to use UTF-8 only in it's internal processing, but
serve frontends with potentially different character encodings.
UTF-8, at least for internal data representation, is the way to go. I would
also make sure that UTF-8 can travel from backend to frontend without losing
any bits or being forced into a certain character set in the middle; i.e. a
Russian interface with Chinese content should be possible. This of course
requires that all the backends speak UTF-8.
Post by Kutter Martin
1. get the charset from the request
2. encode all parameters as UTF-8 when fetching them in the request object
(all but uploads)
3. set the Content-type: charset="foo" for the response (if needed).
4. encode all output in the Response object to the appropriate charset just
before sending it (if needed).
For full UTF-8 support, also:

5. translation files (the I18N/maketext framework) should be in the standard
GNU gettext (PO) format, which is binary safe, instead of the current system,
which is only poorly UTF-8 compatible
6. Reading/writing OI2 configuration files should also be UTF-8 compatible,
with something like:
open( INI, ">:utf8", "action.conf" );
7. Searching and text parsing should be UTF-8 compatible. This means use utf8;
and support for UTF-8 on the database backends. At the moment =~ /\w/ doesn't
work very well with UTF-8 content ;)
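Point 7 can be demonstrated with a quick sketch: \w behaves differently depending on whether a string holds raw Latin1 bytes or decoded characters. On a default modern perl, the byte \xe9 ("é") does not count as a word character, while the decoded character does:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café": once as raw iso-8859-1 bytes, once as a decoded
# character string.
my $bytes = "caf\xe9";
my $chars = decode( 'iso-8859-1', $bytes );

print $bytes =~ /^\w+$/ ? "bytes: word\n" : "bytes: not a word\n";
print $chars =~ /^\w+$/ ? "chars: word\n" : "chars: not a word\n";
```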

Full support most likely requires perl 5.8.x =(... 5.6.x UTF-8 support just
sucks (the Encode module, for example, does not work with 5.6.x).
--
--------------

Teemu Arina
www.dicole.org
Teemu Arina
2004-09-08 13:23:22 UTC
Permalink
Also a slight note,

it seems that some broken browsers (tm) do not obey the charset of the
document when posting forms. It might be a good idea to use a form tag like
the following by default:

<form accept-charset="utf-8" enctype="application/x-www-form-urlencoded">

This sends the form data to the server as UTF-8.

Notice also weird problems with utf8 and template toolkit:
http://template-toolkit.org/pipermail/templates/2003-March/004314.html

Also see:
http://template-toolkit.org/pipermail/templates/2003-November/005342.html

I don't know the status of utf-8 in the current version of TT.
--
--------------

Teemu Arina

Ionstream Oy / Dicole
Komeetankuja 4 A
02210 Espoo
FINLAND
Tel: +358-(0)50 - 555 7636
http://www.dicole.fi

"Discover, collaborate, learn."
Kutter Martin
2004-09-08 13:26:12 UTC
Permalink
Hi Teemu,

Your step 6 is not an issue with perl >= 5.8.0:
Perl >= 5.8 supports IO layers that can be used to filter/recode content
almost transparently (almost means: exactly like in your example).

I don't think that all backends need to support UTF-8 - if they don't,
that's no harm as long as we can recode in the backend (currently we're
recoding iso-8859-1 to UTF-8 for UTF-8-aware backends).

As for 7., using Perl's "use locale" should override the broad scope of
utf8's \w meaning.
And backends that support UTF-8 normally support searches etc. on it, too.

I agree with you that full utf8-support will probably require perl >= 5.8.

But as we're probably talking about years until everything's up & working,
this should not be an issue (hey, perl 5.10 will be out by then - and perl6
will arrive, too. Sure it will.).

Regards,

Martin Kutter



