Wednesday, October 14, 2009

Latin1 in Windows Dos Console

This file was saved as UTF-8. If you don't save or view it as UTF-8, the special characters will not appear correctly.

Summary

We can change the code page programmatically so that Latin-1 always works, but it only displays correctly in the console if the font is Lucida Console.

The Lucida Console font can be expected to be installed on all Windows systems.

There is no programmatic way to set the font.

It may be possible to programmatically determine the font and print a warning when it is not Lucida Console.

For most European customers the default OEM code page is 850, which does display all of Latin-1, so those systems already work.

We could force some or all systems to a good codepage, or we could just force systems that start on a bad codepage.

We could force no systems, display question marks for bad characters, and document that customers who want it to work must change their code page.

All you have to do to change the default OEM codepage is to change your system locale.

Start > Control Panel > Regional and Language Options > Advanced

If I change the locale to French (France) and reboot, then the default OEM code page is 850 and the extended characters display correctly in the raster font.

Confirmed that the default OEM code page for the United Kingdom is 850.

Windows OEM Code Pages

• 437 (US)
• 720 (Arabic)
• 737 (Greek)
• 775 (Baltic)
• 850 (Multilingual Latin I)
• 852 (Latin II)
• 855 (Cyrillic)
• 857 (Turkish)
• 858 (Multilingual Latin I + Euro)
• 862 (Hebrew)
• 866 (Russian)

Windows ANSI Code Pages

• 1250 (Central Europe)
• 1251 (Cyrillic)
• 1252 (Latin I)
• 1253 (Greek)
• 1254 (Turkish)
• 1255 (Hebrew)
• 1256 (Arabic)
• 1257 (Baltic)
• 1258 (Vietnam)
• 874 (Thai)

cp437  cp1252     cp737    cp850
PESETA zWithCaron smallEta timesSign
₧      ž          η        ×
20A7   017E       03B7     00D7
158   158         158      158
x9E   x9E         x9E      x9E

Note that cp437 and cp737  do not contain ž
     and  cp737 and cp1252 do not contain ₧
     and  cp437 and cp1252 do not contain η
Note that cp437 and cp737  do not contain ×, but cp1252 does
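The table and notes above can be written down as a small lookup. This is only a sketch covering byte 0x9E in the four code pages compared here; the names unicode_at_byte_9e and print_byte_9e_table are made up for illustration and are not part of any Windows API.

```c
#include <stdio.h>

/* Unicode code point found at byte 0x9E (decimal 158) in the four code
   pages compared above. Returns 0 for code pages not covered here.
   Hypothetical helper, not a Windows API. */
unsigned int unicode_at_byte_9e(unsigned int code_page)
{
    switch (code_page) {
    case 437:  return 0x20A7; /* PESETA SIGN */
    case 1252: return 0x017E; /* LATIN SMALL LETTER Z WITH CARON */
    case 737:  return 0x03B7; /* GREEK SMALL LETTER ETA */
    case 850:  return 0x00D7; /* MULTIPLICATION SIGN */
    default:   return 0;      /* not covered in this post */
    }
}

/* Print one line per code page showing what byte 0x9E decodes to. */
void print_byte_9e_table(void)
{
    static const unsigned int pages[] = { 437, 1252, 737, 850 };
    size_t i;

    for (i = 0; i < sizeof pages / sizeof pages[0]; i++) {
        printf("cp%-5u x9E -> U+%04X\n", pages[i], unicode_at_byte_9e(pages[i]));
    }
}
```

The point is that the same byte means four different characters, so a received x9E is meaningless until we know which code page it came from.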

With: Active code page: 437, Font: Raster Fonts

INPUT    alt158 alt0158
DISPLAY       ₧       z // PESETA and z
REDISPLAY     ₧       z // PESETA and z
RECEIVED    x9E     x7A
DECIMAL     158     122
FROM OEM   20A7    007A
FROM ANSI  017E    007A

Here alt158 is asking for byte 158 from cp437  which is ₧
And alt0158 is asking for byte 158 from cp1252 which is ž

Because we are asking for ž from cp1252 while the active code page is cp437 (where it doesn't exist), the character first undergoes a best-fit mapping. This is not a desirable input method.

Since cp437 is an OEM codepage, the only way to always get correct bytes is to use CP_OEMCP and perform the round-trip test. Characters not in the active code page cannot be received.
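One possible reading of the round-trip test: convert a character to the code page with best-fit mapping switched off, convert it back, and check that the same character comes out. The sketch below assumes that reading. survives_round_trip is a made-up name; WideCharToMultiByte, MultiByteToWideChar and the WC_NO_BEST_FIT_CHARS flag are the documented Windows calls. The non-Windows branch is only a stub so the file compiles elsewhere.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Returns 1 if wc converts to code page cp and back unchanged,
   0 if it cannot be represented in cp, -1 if a conversion call fails. */
int survives_round_trip(unsigned int cp, wchar_t wc)
{
    char bytes[8];
    wchar_t back[8];
    BOOL used_default = FALSE;
    int n, m;

    /* WC_NO_BEST_FIT_CHARS stops the silent substitution of a look-alike. */
    n = WideCharToMultiByte(cp, WC_NO_BEST_FIT_CHARS, &wc, 1,
                            bytes, (int)sizeof bytes, NULL, &used_default);
    if (n <= 0) { return -1; }
    if (used_default) { return 0; } /* replaced by the default character */

    m = MultiByteToWideChar(cp, 0, bytes, n, back, 8);
    if (m <= 0) { return -1; }

    return (m == 1 && back[0] == wc) ? 1 : 0;
}
#else
/* Stub for non-Windows builds: the conversion APIs are not available. */
int survives_round_trip(unsigned int cp, wchar_t wc)
{
    (void)cp;
    (void)wc;
    return -1;
}
#endif
```

On Windows, survives_round_trip(437, 0x017E) should report that ž does not survive, which matches the alt0158 column in the table above.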

With: Active code page: 437, Font: Lucida Console

INPUT    alt158 alt0158
DISPLAY       ₧       ž // PESETA and zWithCaron
REDISPLAY     ₧       z // PESETA and z
RECEIVED    x9E     x7A
DECIMAL     158     122
FROM OEM   20A7    007A
FROM ANSI  017E    007A

Here alt158 is asking for byte 158 from cp437  which is ₧
And alt0158 is asking for byte 158 from cp1252 which is ž
(i.e. same as above)

It appears that the font layer is a multi-byte layer that is above the input layer. So the font knows we asked for ž and displays it, but since the active code page does not support that character the best-fit mapping still occurs.

Since cp437 is an OEM codepage, the only way to always get correct bytes is to use CP_OEMCP and perform the round-trip test. Characters not in the active code page cannot be received.

Note that using the Lucida font and entering bytes via alt0# when cp1252 is not the active codepage will confuse the user, because the requested character will be displayed but not received.

With: Active code page: 1252, Font: Lucida Console

INPUT    alt158 alt0158
DISPLAY       ž       ž // zWithCaron
REDISPLAY     ž       ž // zWithCaron
RECEIVED    x9E     x9E
DECIMAL     158     158
FROM OEM   20A7    20A7
FROM ANSI  017E    017E

Here alt158 is asking for byte 158 from cp1252 which is ž
And alt0158 is asking for byte 158 from cp1252 which is ž

Since our active codepage is cp1252, both the alt# method and the alt0# method are asking for the same byte, so there will never be a best-fit mapping. And the Lucida font is able to display characters from the active code page, so they always appear correctly.

Since cp1252 is an ANSI codepage, the only way to always get correct bytes is to use CP_ACP. Here it may not be necessary to perform the round-trip test because there may not be any method to input characters not in this codepage. This is not true in general for ANSI codepages.

With: Active code page: 1252, Font: Raster Fonts

INPUT    alt158 alt0158
DISPLAY       ₧       ₧ // PESETA
REDISPLAY     ₧       ₧ // PESETA
RECEIVED    x9E     x9E
DECIMAL     158     158
FROM OEM   20A7    20A7
FROM ANSI  017E    017E

Here alt158 is asking for byte 158 from cp1252 which is ž
And alt0158 is asking for byte 158 from cp1252 which is ž
(i.e. same as above)

This confirms that Raster Fonts don't support changing the codepage. It seems that the Raster Font is hard-coded to cp437, just as the alt0# input method is hard-coded to cp1252.

Since our active codepage is cp1252, both the alt# method and the alt0# method are asking for the same byte, so there will never be a best-fit mapping. But since the font is hard-coded to cp437 it displays byte 158 as ₧ instead of ž.

This scenario should never be used, as it will definitely confuse the customer. By using CP_OEMCP, one could get the code point for the character that was displayed, but since that wasn't the character that was requested, it seems unwise.

With: Active code page: 737, Font: Lucida Console

INPUT    alt158 alt0158
DISPLAY       η       ž // smallEta and zWithCaron
REDISPLAY     η       ? // smallEta and ?
RECEIVED    x9E     x3F
DECIMAL     158      63
FROM OEM   20A7    003F
FROM ANSI  017E    003F

Here alt158 is asking for byte 158 from cp737  which is η
And alt0158 is asking for byte 158 from cp1252 which is ž

Again, it looks like the input is multibyte and the font understands that and displays ž correctly, but since the active codepage is single-byte and ž doesn't exist in it, a best-fit mapping must occur. Because of this, it is bad to use the Lucida font, because the character displayed is not the input received.

Bad news. It appears that CP_OEMCP and CP_ACP are hard-coded to cp437 and cp1252 respectively. There is no way for us to get the correct code point for η through them.

With: Active code page: 737, Font: Raster Fonts

INPUT    alt158 alt0158
DISPLAY       ₧       ? // PESETA and ?
REDISPLAY     ₧       ? // PESETA and ? 
RECEIVED    x9E     x3F
DECIMAL     158      63
FROM OEM   20A7    003F
FROM ANSI  017E    003F

Here alt158 is asking for byte 158 from cp737  which is η
And alt0158 is asking for byte 158 from cp1252 which is ž
(i.e. same as above)

Again, because the Raster font doesn't support ž, the best-fit mapping occurs before display. This is good because the character displayed is the character received.

Bad news, worse than above. Not only are the CP_OEMCP and CP_ACP conversions invalid, but since the Raster font only displays characters from cp437, it appears that ₧, not η, was entered.

In the above I have assumed that alt# and CP_OEMCP are hard-coded to cp437, and that alt0# and CP_ACP are hard-coded to cp1252. It is likely that these settings are configurable during install. All we have discovered is that they are not affected by chcp.

We have confirmed that chcp is of no use to us. It is only useful if you want to remain in that codepage with single-byte characters.

MultiByteToWideChar: uCodePage specifies the codepage to be used when performing the conversion. The codepage can be any valid codepage number. The codepage may also be one of the following values: CP_ACP instructs the API to use the currently set default Windows ANSI codepage. CP_OEMCP instructs the API to use the currently set default OEM codepage.

So, if we force the console to Lucida font and a known latin1 codepage and hardcode MultiByteToWideChar to that codepage then it will all work.

Or we can let them use their default setup, advise against alt0# because of the best-fit conversion, suggest alt# together with CP_OEMCP, and forbid chcp because of the invalid conversion. But then not all of Latin-1 can always be input or displayed, so rfc escape sequences will be necessary.

Let's confirm that we can change the default OEM and ANSI codepages.

For Latin 1 use chcp 1252 and Lucida Console.

HKEY_CURRENT_USER\Software\Microsoft\Command Processor\AutoRun="chcp 1252"
HKEY_CURRENT_USER\Console\ (change the font somehow)
(wprintf doesn't work)

You can do the same from code
SetConsoleOutputCP
SetConsoleCP
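As a sketch of the "only force systems that start on a bad codepage" option from the summary: check the current console code pages and switch to 1252 only when needed. The names needs_latin1_override and force_latin1_if_needed are hypothetical. Only 850 and 1252 are treated as good, because those are the only ones confirmed in this post; other code pages might qualify too.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if the code page was NOT confirmed in this post as able to
   show all of Latin-1, so it would need to be overridden.
   Hypothetical helper; the list of good pages is an assumption. */
int needs_latin1_override(unsigned int code_page)
{
    return !(code_page == 850 || code_page == 1252);
}

#ifdef _WIN32
/* Switch console input and output to cp1252, but only when one of them
   is currently on a code page that needs overriding.
   Returns 1 on success or when nothing had to change, 0 on failure. */
int force_latin1_if_needed(void)
{
    if (!needs_latin1_override(GetConsoleCP()) &&
        !needs_latin1_override(GetConsoleOutputCP())) {
        return 1; /* already on a good code page, leave it alone */
    }
    return (SetConsoleCP(1252) && SetConsoleOutputCP(1252)) ? 1 : 0;
}
#endif
```

This leaves French and UK systems on 850 untouched and only changes the US default of 437. The font still has to be Lucida Console for the result to display correctly.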

The identifiers of the code pages available on the local computer are stored in the registry under the following key.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

cp437  cp1252     cp737    cp850
PESETA zWithCaron smallEta timesSign
₧      ž          η        ×
20A7   017E       03B7     00D7
158   158         158      158
x9E   x9E         x9E      x9E

With Raster Font

INITIAL CP    437    437    850    737
CODE SET CP   437    850    737   1252
INPUT      alt158 alt158 alt158 alt158
DISPLAY         ₧      ₧      ₧      ₧
REDISPLAY       ₧      ₧      ₧      ₧
RECEIVED      x9E    x9E    x9E    x9E
DECIMAL       158    158    158    158
FROM REAL    20A7   00D7   03B7   017E
FROM OEM     20A7   20A7   20A7   20A7
FROM ANSI    017E   017E   017E   017E
FINAL CP      437    850    737   1252

With Lucida Console Font (mostly the same)

DISPLAY         ₧      ×      η      ž
REDISPLAY       ₧      ×      η      ž

So, SetConsoleCP(cp) and SetConsoleOutputCP(cp). These work great. And we should always do this for console:

MultiByteToWideChar( GetConsoleCP(), MB_PRECOMPOSED | MB_ERR_INVALID_CHARS, ptr, strlen(ptr), wide, strlen(ptr) );

Because CP_OEMCP indicates the default cp, not the current cp.
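A slightly more defensive version of that call might look like the sketch below. It prefers the current console code page and falls back to the default OEM code page only when the console code page is unknown. The names choose_input_code_page and console_bytes_to_wide are hypothetical, and treating a GetConsoleCP result of 0 as "no console attached" is an assumption on my part.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Pick the code page to decode console input with: the console's current
   code page when known, otherwise the system default OEM code page.
   A console_cp of 0 is assumed to mean the lookup failed. */
unsigned int choose_input_code_page(unsigned int console_cp,
                                    unsigned int default_oem_cp)
{
    return console_cp != 0 ? console_cp : default_oem_cp;
}

#ifdef _WIN32
/* Convert console input bytes to UTF-16 using the code page chosen above.
   Returns the number of wide characters written, or 0 on failure
   (including input bytes that are invalid in that code page). */
int console_bytes_to_wide(const char *bytes, int byte_count,
                          wchar_t *out, int out_capacity)
{
    unsigned int cp = choose_input_code_page(GetConsoleCP(), GetOEMCP());

    return MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS,
                               bytes, byte_count, out, out_capacity);
}
#endif
```

Checking the return value matters here: with MB_ERR_INVALID_CHARS the call returns 0 on bad input and the output buffer should not be read.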

So, if we're willing to force the console to 1252 then we are able to input and display all characters but the font must always be Lucida.

You CANNOT control the font programmatically.
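Detecting the font, as suggested in the summary, may still be possible. The sketch below assumes Windows Vista or later, where GetCurrentConsoleFontEx reports the face name of the console font; I have not tried it on older systems. The names is_lucida_console and warn_if_not_lucida are hypothetical.

```c
#if defined(_WIN32) && !defined(_WIN32_WINNT)
#define _WIN32_WINNT 0x0600 /* GetCurrentConsoleFontEx needs Vista or later */
#endif

#include <stdio.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if the face name is exactly "Lucida Console", else 0. */
int is_lucida_console(const wchar_t *face_name)
{
    return face_name != NULL && wcscmp(face_name, L"Lucida Console") == 0;
}

#ifdef _WIN32
/* Print a warning to stderr when the console font is not Lucida Console.
   Returns 1 if it is Lucida Console, 0 if it is not, and -1 if the font
   could not be determined (for example when output is redirected). */
int warn_if_not_lucida(void)
{
    CONSOLE_FONT_INFOEX info = { 0 };

    info.cbSize = sizeof info; /* must be set before the call */
    if (!GetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &info)) {
        return -1;
    }
    if (is_lucida_console(info.FaceName)) {
        return 1;
    }
    fprintf(stderr, "Warning: console font is not Lucida Console; "
                    "extended characters may display incorrectly.\n");
    return 0;
}
#endif
```

Calling warn_if_not_lucida() once at startup, right after setting the code page, would at least tell the customer why their characters look wrong.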

-- Windows English Defaults ------------------------------------

cmd chcp
 Active code page: 437

regedit
 HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
  ACP   1252
  OEMCP 437

-- Windows French Defaults ------------------------------------

cmd chcp
 Page de codes active : 850

regedit
 HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
  ACP   1252
  OEMCP 850

-- English Keyboard -----------------------------------

`1234567890-=  ~!@#$%^&*()_+
qwertyuiop[]\  QWERTYUIOP{}|
asdfghjkl;'    ASDFGHJKL:"
zxcvbnm,./     ZXCVBNM<>?

-- German Keyboard ------------------------------------

^1234567890ß´  °!"§$%&/()=?`  ²³{[]}\
qwertzuiopü+#  QWERTZUIOPÜ*'  @€~µ
asdfghjklöä    ASDFGHJKLÖÄ
yxcvbnm,.-     YXCVBNM;:_

-- French Keyboard ------------------------------------

²&é"'(-è_çà)=  1234567890°+   ~#{[|`\^@]}
azertyuiop^$*  AZERTYUIOP¨£µ  €¤
qsdfghjklmù    QSDFGHJKLM%
wxcvbn,;:!     WXCVBN?./§

Here is a simple App that will let you test this stuff.

#include <stdio.h>
#include <stdlib.h>   // atoi
#include <string.h>
#include <windows.h>

void asHex( char * str );

int main(int argc, char** argv)
{
   int result;

   int cp = GetConsoleCP();

   if ( argc == 2 ) { cp = atoi(argv[1]); }

   result = SetConsoleCP(cp);
   printf( "Console CP set to %d\n", GetConsoleCP() );

   result = SetConsoleOutputCP(cp);
   printf( "Console Output CP set to %d\n", GetConsoleOutputCP() );

   char line[256];
   char exit[] = "exit";

   printf("Enter input to see it echoed.\n");
   printf("Enter 'exit' to quit.\n\n");

   do
   {
      printf("> ");
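      if ( fgets(line, sizeof line, stdin) == NULL ) { break; } // gets() is unsafe; stop on end of input
      line[strcspn(line, "\r\n")] = 0; // strip the newline that fgets keeps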
      if ( strcmp( line, exit ) == 0 ) { break; }
      asHex(line);
      printf("\n");
   }
   while(1);

   return 0;
}

char hexchars[] = "0123456789ABCDEF";

void asHex( char * str )
{
   char * ptr;
   char msg[5 * 256 + 1]; // 5 output chars per input byte, input is at most 255 bytes
   unsigned short wide[1024];
   unsigned int i;
   int size;

   // string -------------------------------
   i = 0;
   ptr = str;
   while ( *ptr != 0 )
   {
      msg[i++] = ' ';
      msg[i++] = ' ';
      msg[i++] = ' ';
      msg[i++] = *ptr;
      msg[i++] = ' ';
      ptr++;
   }
   msg[i] = 0;
   printf("%s\n",msg);

   // hex -------------------------------
   ptr = str;
   while ( *ptr != 0 )
   {
      printf(" x%X ", ((unsigned int)*ptr) & 0x000000ff );
      ptr++;
   }
   printf("\n");

   // decimal -------------------------------
   ptr = str;
   while ( *ptr != 0 )
   {
      printf(" %3u ", ((unsigned int)*ptr) & 0x000000ff );
      ptr++;
   }
   printf("\n");

   // wide actual -------------------------------

   ptr = str;
   size = MultiByteToWideChar( GetConsoleCP(), MB_PRECOMPOSED | MB_ERR_INVALID_CHARS, ptr, (int)strlen(ptr), wide, 1024 );
   for (i=0; i<(unsigned int)size; i++) // size is 0 if the conversion failed
   {
      printf("%04X ", ((unsigned int)wide[i]) & 0x0000ffff );
   }
   printf("\n");

   // wide oem -------------------------------

   ptr = str;
   size = MultiByteToWideChar( CP_OEMCP, MB_PRECOMPOSED | MB_ERR_INVALID_CHARS, ptr, (int)strlen(ptr), wide, 1024 );
   for (i=0; i<(unsigned int)size; i++) // size is 0 if the conversion failed
   {
      printf("%04X ", ((unsigned int)wide[i]) & 0x0000ffff );
   }
   printf("\n");

   // wide ansi -------------------------------

   ptr = str;
   size = MultiByteToWideChar( CP_ACP, MB_PRECOMPOSED | MB_ERR_INVALID_CHARS, ptr, (int)strlen(ptr), wide, 1024 );
   for (i=0; i<(unsigned int)size; i++) // size is 0 if the conversion failed
   {
      printf("%04X ", ((unsigned int)wide[i]) & 0x0000ffff );
   }
   printf("\n");

}