Sunday, 8 August 2010

Strings and Qt

One thing that comes up quite often when I'm talking to developers new to Qt is the topic of strings, and more specifically character encoding: how to do it right, what options are available, and best practices.

Having written this a few times now, I thought that perhaps it was about time I wrote it up in a more permanent location (here) in the hopes that people will stumble across it, magically become enlightened, and end world hunger. ;)

Qt (and C++) have a number of different string types.

QString

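Qt has a string type in QtCore called QString. Internally, QString stores data as UTF-16, and *does* have knowledge of character encoding. Each QChar is one 16-bit code unit, so a character outside the Basic Multilingual Plane takes up two QChars (a surrogate pair).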

Services across a network (like web services) often want data in UTF-8, whereas QString stores UTF-16. To get UTF-8, use QString::toUtf8(). To convert UTF-8 back into a QString (e.g. when parsing input from a web service), use QString::fromUtf8().

std::string

C++ also has std::string (although you won't find a lot of this in Qt applications). Simply put, it's a wrapper around a C string providing convenience operations and nicer syntax. It still has no notion of (fairly essential) things like character encoding.

You probably want to avoid using this in an internationalized application or one requiring interaction with network services unless you find your own solution for encoding issues.
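To illustrate what "no encoding" means in practice, here is a small sketch using only the standard library. It assumes the std::string holds valid UTF-8, and utf8CodePointCount is a hypothetical helper written for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode code points in a std::string that is assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every other byte starts
// a new code point.
std::size_t utf8CodePointCount(const std::string &utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// std::string::size() reports bytes; it knows nothing about characters.
void byteCountExample()
{
    const std::string word = "caf\xC3\xA9"; // "café" encoded as UTF-8
    assert(word.size() == 5);
    assert(utf8CodePointCount(word) == 4);
}
```

This is the kind of "own solution" you end up writing (or pulling in a library for) once std::string has to carry non-ASCII text.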

QString has ::toStdString() and ::fromStdString() methods if you must use std::string for whatever reason.

C strings (char*)

Finally, you have C strings (char*), which don't have any idea what encoding is; they are just a bunch of bytes.

Generally speaking, Qt 4 assumes they're Latin-1 encoded (a superset of ASCII) when it converts them for you. To put them into a QString explicitly, use QString::fromLatin1(). If they aren't Latin-1, see QTextCodec::setCodecForCStrings().

The QLatin1String class is also helpful: in particular, it lets your code compile when QT_NO_CAST_FROM_ASCII is defined (which itself is helpful for making sure you explicitly give encodings for all of your strings).

3 comments:

  1. Thanks Robin for this short summary, which nails the basics pretty well. There's one issue, though: C strings are not Latin-1 encoded, they are encoded in the system's default character set. So please use QString::fromLocal8Bit() and QString::toLocal8Bit(). toLatin1() most likely breaks e.g. on Cyrillic systems. On the other hand, I don't know how this would work on a Linux system that's all UTF-8. Maybe someone can try this out?

  2. Nowadays, storing UTF-8 inside std::string or C character arrays pretty much covers your needs. (@Daniel: the developer determines what encoding is stored; aren't you confusing this with what system or library functions return?)
    Indexed and length operations are slow with UTF-8 strings because of the variable character length, but in practice one hardly needs them (e.g. most of the time isEmpty is sufficient).
    The advantage of UTF-8 storage is, like you said, that the web uses UTF-8, so no conversions are needed.

  3. Are you sure that Qt uses UTF-16 for strings and not UCS-2? QChar has a fixed size of two bytes, so one QChar can't hold every Unicode character. It will work for most languages/countries but not for all.
