GSoC 2010 – week summary: May 31st – June 6th

After implementing the CLDR reader last week, I focused on various classes that use data from the Common Locale Data Repository.

I started with PluralsReader, a class that will be used when translating messages (labels). Given a locale and a number, it returns a form name (a string, one of: zero, one, two, few, many, other). In some languages the form of a noun depends on the number it appears with, for example in units of time or currency. English has only two rules (forms “one” and “other”), for example:

form "one": 1 day
form "other": 0 days, 2 days, 5 days, ...

The rule used here is simple, and can be written like this:

<pluralRule count="one">n is 1</pluralRule>

(this is actually how rules are defined in CLDR)

For other languages, rules can be very complicated:

<pluralRules locales="hr ru sr uk be bs sh">
    <pluralRule count="one">n mod 10 is 1 and n mod 100 is not 11</pluralRule>
    <pluralRule count="few">n mod 10 in 2..4 and n mod 100 not in 12..14</pluralRule>
    <pluralRule count="many">n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14</pluralRule>
    <!-- rest are plurals -->
</pluralRules>

The PluralsReader class can parse these rules and determine which form should be used for a particular number. I committed this class (along with other related ones) in Revision 4399.
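To make the idea concrete, here is a minimal sketch of how rules like the ones above could be evaluated. It is written in Python for brevity and is not the actual PluralsReader code; the names `matches` and `plural_form` are made up for this example. It assumes integer input and supports only the small part of the rule syntax quoted above ("is", "is not", "in", "not in", "mod", "and", "or").

```python
import re

# One condition: "n [mod M] is [not] X" or "n [mod M] [not] in A..B".
_CONDITION = re.compile(
    r"^n(?: mod (\d+))? (?:is( not)? (\d+)|(not )?in (\d+)\.\.(\d+))$"
)


def _check(condition: str, n: int) -> bool:
    """Evaluate a single condition for the integer n."""
    match = _CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    modulus, is_not, value, not_in, low, high = match.groups()
    operand = n % int(modulus) if modulus else n
    if value is not None:
        # "is" / "is not": equality, flipped when "not" is present.
        return (operand == int(value)) != bool(is_not)
    # "in" / "not in": inclusive range test, flipped when "not" is present.
    return (int(low) <= operand <= int(high)) != bool(not_in)


def matches(rule: str, n: int) -> bool:
    """A rule is a list of "or" alternatives, each a chain of "and" conditions."""
    return any(
        all(_check(condition, n) for condition in alternative.split(" and "))
        for alternative in rule.split(" or ")
    )


def plural_form(rules: dict[str, str], n: int) -> str:
    """Return the first form whose rule matches, or "other" if none does."""
    for form, rule in rules.items():
        if matches(rule, n):
            return form
    return "other"


# The rules quoted above for the "hr ru sr uk be bs sh" group.
RULES_RU = {
    "one": "n mod 10 is 1 and n mod 100 is not 11",
    "few": "n mod 10 in 2..4 and n mod 100 not in 12..14",
    "many": "n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14",
}
RULES_EN = {"one": "n is 1"}
```

With these rules, 21 falls into "one", 3 into "few", and 11 into "many", while for English every number except 1 falls through to "other".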

After writing the plurals reader, I started on another, similar class: NumbersReader. This one is more complicated. It can format a number (float or integer) using a format string. The syntax of format strings (patterns) is defined in CLDR. It’s very flexible and pretty complicated.

I didn’t implement all the features described in CLDR (or, to be precise, in Unicode Technical Standard #35), although I wanted NumbersReader to support as much of the syntax as possible. I came across a solution to a similar problem in the Yii Framework codebase, which was a very good starting point for me.

These are examples of formats supported by NumbersReader (one format per line):

#,##0.###
##0%
#,##0.00
00000.0000
'#,##0.0;(#)
¤ #,##0.00;¤ #,##0.00-
#,##0.05

The NumbersReader class can parse a format (it stores all parsed formats in a cache) and then format a number using the parsed representation. It has methods for formatting decimal, percent, and currency numbers (the formats are extracted from CLDR). Formatting with a custom format is also possible.
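The parse-once, format-many idea can be sketched roughly as follows. This is a Python illustration, not the NumbersReader code: the names `ParsedFormat`, `parse_format` and `format_number` are invented for the example, the "," and "." separators are hardcoded instead of being read from CLDR, half-even rounding is an assumption, and only plain decimal patterns such as `#,##0.###` or `00000.0000` are handled (no percent, currency, quoting, negative subpatterns, or rounding increments like `#,##0.05`).

```python
from decimal import Decimal, ROUND_HALF_EVEN
from functools import lru_cache
from typing import NamedTuple


class ParsedFormat(NamedTuple):
    min_integer_digits: int
    min_fraction_digits: int
    max_fraction_digits: int
    grouping_size: int  # 0 means "no grouping"


@lru_cache(maxsize=None)  # parsed formats are cached, keyed by the pattern string
def parse_format(pattern: str) -> ParsedFormat:
    integer_part, _, fraction_part = pattern.partition(".")
    grouping_size = 0
    if "," in integer_part:
        # Group size = number of placeholders after the last comma.
        grouping_size = len(integer_part) - integer_part.rfind(",") - 1
    return ParsedFormat(
        min_integer_digits=integer_part.count("0"),
        min_fraction_digits=fraction_part.count("0"),
        max_fraction_digits=len(fraction_part),
        grouping_size=grouping_size,
    )


def format_number(number, pattern: str) -> str:
    fmt = parse_format(pattern)
    value = Decimal(str(number))
    # Round to the maximum number of fraction digits the pattern allows.
    quantum = Decimal(1).scaleb(-fmt.max_fraction_digits)
    rounded = abs(value).quantize(quantum, rounding=ROUND_HALF_EVEN)
    sign = "-" if value < 0 and rounded else ""
    integer_digits, _, fraction_digits = f"{rounded:f}".partition(".")
    # Drop optional trailing zeros, then pad up to the required minimum.
    fraction_digits = fraction_digits.rstrip("0").ljust(fmt.min_fraction_digits, "0")
    integer_digits = integer_digits.zfill(fmt.min_integer_digits)
    if fmt.grouping_size:
        groups = []
        while len(integer_digits) > fmt.grouping_size:
            groups.insert(0, integer_digits[-fmt.grouping_size:])
            integer_digits = integer_digits[:-fmt.grouping_size]
        groups.insert(0, integer_digits)
        integer_digits = ",".join(groups)
    return sign + integer_digits + ("." + fraction_digits if fraction_digits else "")
```

Under these assumptions, `format_number(1234.5, "#,##0.00")` yields `1,234.50` and `format_number(42.5, "00000.0000")` yields `00042.5000`. A second call with the same pattern reuses the cached `ParsedFormat` instead of parsing the string again.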

NumbersReader was committed in Revision 4445.

Next week I will work on a similar Reader class for dates and times. Hopefully I will also have time to do something about currency formatting. For now, NumbersReader can format a value with a currency sign, but in a simplified way (it just replaces the currency placeholder with the currency sign provided). CLDR has pretty extensive data for currencies.
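The simplified currency handling amounts to something like the sketch below (Python again, with a hypothetical `format_currency` helper rather than the real NumbersReader method). It assumes the digits have already been formatted elsewhere, picks the positive or negative subpattern, and substitutes the digits and the currency sign into it.

```python
import re

CURRENCY_PLACEHOLDER = "\u00a4"  # the "¤" character used in CLDR patterns
_NUMBER_PART = re.compile(r"[#0,.]+")  # the run of digit placeholders in a pattern


def format_currency(digits: str, negative: bool, pattern: str, currency_sign: str) -> str:
    """Insert already-formatted digits and a currency sign into a pattern."""
    positive_pattern, _, negative_pattern = pattern.partition(";")
    if negative:
        # Without an explicit negative subpattern, prefix a minus sign.
        chosen = negative_pattern or "-" + positive_pattern
    else:
        chosen = positive_pattern
    # A function is used as the replacement so the digits are inserted literally.
    with_number = _NUMBER_PART.sub(lambda _match: digits, chosen, count=1)
    return with_number.replace(CURRENCY_PLACEHOLDER, currency_sign)
```

For the pattern `¤ #,##0.00;¤ #,##0.00-` and the sign `€`, the digits `1,234.50` become `€ 1,234.50`, or `€ 1,234.50-` for a negative value. Choosing the right sign, its position and the number of decimals per currency is exactly the part that would need the extra CLDR data.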


  • Great to hear about your progress with i18n in FLOW3, looking forward to seeing the results! Sounds good so far!

    Concerning date and time handling, I wrote a piece about that on my blog a while ago: http://michaelsauter.net/blog/...
    It's actually part of a series about date and time in PHP / FLOW3. Dunno if this is of any help for you as the posts cover only the very basics ...

  • Thank you for this comment! I've read all your posts on this topic and they turned out to be very useful for me. I use the DateTime class in the Locale subpackage. The Intl extension you describe is based on the ICU library, which also uses CLDR, and is actually pretty similar to my project. But the Locale subpackage will certainly be less complicated and much easier to use ;-). Also, no external library dependency will be added.
  • Thanks for the post -- I'm really glad that you're working on this project; I'm sure this work will be really useful to others when it is complete!
  • Interesting about English having only "two" rules. Are there languages with more than 2 rules for that?
  • Sure! From CLDR I can see that Arabic has the most plural rules: 6 (all of them :-) ). But many other languages have three or more rules, or no rules (no plurals, e.g. Japanese). My native language (Polish) has 3 rules, "one", "few", and "other", according to the CLDR naming. I didn't count, but it seems that most languages have just two rules, like English :-).