
How to truncate a UTF-8 string

July 16, 2008 - programming unicode c utf-8

Suppose you need to put a UTF-8 string into a fixed-length buffer. I actually needed to do this. The problem is that the copy may cut the last character in the middle of its byte sequence, leaving an incomplete character at the end, so here is an example of how to handle it:

#include <string.h>
#include <stdio.h>
#include <err.h>   /* errx(): BSD extension, also available in glibc */
int
main (int argc, char *argv[])
{
  char buf[64];
  int i;
  if (argc != 2)
    errx (1, "Usage: %s string", argv[0]);
  /* zero the buffer so it stays NUL-terminated after strncpy */
  memset (buf, '\0', sizeof buf);
  strncpy (buf, argv[1], sizeof buf - 1);
  /*
   * The following printf may print garbage
   * at the end of the string
   */
  printf ("Before: `%s'\n", buf);
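  /*
   * Here we check whether there is a truncated UTF-8
   * character at the end of the string. The last byte
   * strncpy may have filled is non-zero only when the
   * input did not fit (or fit exactly).
   */
  i = sizeof buf - 2;
  if (buf[i] & 0x80)            /* part of a multibyte character */
    {
      if (buf[i] & 0x40)        /* lead byte, no continuation bytes */
        buf[i] = '\0';
      else if ((buf[i - 1] & 0xE0) == 0xE0)     /* 2 of 3 (or 4) bytes */
        buf[i - 1] = '\0';
      else if ((buf[i - 2] & 0xF0) == 0xF0)     /* 3 of 4 bytes */
        buf[i - 2] = '\0';
    }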
  /*
   * Now the output is clean
   */
  printf ("After: `%s'\n", buf);
  return 0;
}
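
The same idea can be sketched for a buffer of any size. The helper below is not part of the program above: the name utf8_copy_truncated and its signature are made up for illustration, and it assumes the input is valid UTF-8. Instead of inspecting the last bytes after the copy, it looks at the first byte that does not fit. If that byte is a continuation byte (bit pattern 10xxxxxx), the cut falls inside a character, so the cut point moves back to that character's lead byte.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an illustration, not the code from the post):
 * copy src into dst, which holds dstsize bytes, cutting only at a
 * UTF-8 character boundary. Assumes src is valid UTF-8.
 * Returns the number of bytes copied, not counting the final '\0'.
 */
static size_t
utf8_copy_truncated (char *dst, size_t dstsize, const char *src)
{
  size_t len;

  if (dstsize == 0)
    return 0;
  len = strlen (src);
  if (len >= dstsize)
    {
      len = dstsize - 1;
      /*
       * src[len] is the first byte that does not fit. While it is
       * a continuation byte (10xxxxxx), step back; the loop stops
       * on the lead byte, which is then left out of the copy.
       */
      while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80)
        len--;
    }
  memcpy (dst, src, len);
  dst[len] = '\0';
  return len;
}
```

With this sketch, a call like utf8_copy_truncated (buf, sizeof buf, argv[1]) would stand in for the memset, the strncpy and the check on the last bytes. The cast to unsigned char keeps the mask test independent of whether plain char is signed on the platform.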