Character set and encoding converter


1. Purpose

Converts a text stream from one character set and encoding to another.

This program is obsolete. I no longer maintain it, as it has been superseded by GNU recode and htmlrecode.

2. PHP and character sets

Did you come here looking for character set conversions using PHP?
I'm planning to write a sort of howto on that some day. Meanwhile, you can try my japcharset.php, an improvised character set converter for use in PHP scripts that handle multilingual text. It depends on the htmlrecode program, which you can get here too. You can also try the recode extension of PHP (php4-recode).

3. Supported character sets and encodings

Usage: charconv [-h] <incharset> <outcharset>

Reads stdin, outputs stdout. Does incharset->outcharset conversion via unicode.

-h = Input is html (THIS BUGS)
Available character sets/encodings:
- unihtml (&#number; codes)
- utf8linux (with vt100 escape codes)
- utf7mod (imap modified)
- koi8r          - jis-x-0201      - shift_jis       - big5
- iso-8859-1      - iso-8859-2      - iso-8859-3      - iso-8859-4
- iso-8859-5      - iso-8859-6      - iso-8859-7      - iso-8859-8
- iso-8859-9      - iso-8859-10     - iso-8859-13     - iso-8859-14
- iso-8859-15     - cp437           - cp737           - cp775
- cp850           - cp852           - cp855           - cp857
- cp860           - cp861           - cp862           - cp863
- cp864           - cp865           - cp866           - cp869
- cp874           - cp1250          - cp1252          - cp1254
- cp1256          - cp1258          - cp1251          - cp1253
- cp1255          - cp1257          - cp856           - cp1006
- cp424           - roman           - romanian        - iso-2022-jp
- utf8            - utf7            - euc-jp    

Typos are tolerated to some degree in the character set names,
and some general aliases like latin* and iso* are known.
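
The typo tolerance described above could be approximated with fuzzy string matching. The following is only a sketch in Python, not charconv's actual algorithm (which is written in C++ and is not described here). The KNOWN and ALIASES tables are a small hypothetical subset of the names listed above, and the 0.8 similarity cutoff is an arbitrary assumption.

```python
import difflib

# Hypothetical subset of character set names; not charconv's real table.
KNOWN = ["unihtml", "utf8", "utf7", "koi8-r", "shift_jis",
         "iso-8859-1", "iso-2022-jp", "euc-jp", "cp1251"]

# Hypothetical aliases in the spirit of "latin*" and "iso*".
ALIASES = {"latin1": "iso-8859-1", "sjis": "shift_jis"}


def resolve_charset(name):
    """Return the canonical name for `name`, or None if nothing is close."""
    key = name.strip().lower()
    if key in KNOWN:
        return key
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to the single closest known name above the cutoff.
    matches = difflib.get_close_matches(key, KNOWN, n=1, cutoff=0.8)
    return matches[0] if matches else None


if __name__ == "__main__":
    for attempt in ("unihmtl", "koi8r", "latin1", "klingon"):
        print(attempt, "->", resolve_charset(attempt))
```

With these tables, 'unihmtl' resolves to 'unihtml' and 'koi8r' to 'koi8-r', mirroring the warnings charconv prints, while a name that resembles nothing in the table is rejected.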

4. Copying

charconv was written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the terms of the General Public License (GPL).

If you want to make your own converter or just study how something works, you might still want to download this program. The package contains plain TXT files describing the character sets, and there are .cc files for each different encoding.

5. Examples

oktober:~/src/charconv$ echo 'Äiti tykkää oliiviöljystä'|charconv latin1 utf7
+AMQ-iti tykk+AOQA5A oliivi+APY-ljyst+AOQ

oktober:~/src/charconv$ echo '+AMQ-iti tykk+AOQA5A oliivi+APY-ljyst+AOQ'|charconv utf7 unihtml
&Auml;iti tykk&auml;&auml; oliivi&ouml;ljyst&auml;

oktober:~/src/charconv$ echo 'pikachu' | sed -f /WWW/src/kr2k.sed | charconv sjis utf8
ぴかちゅ

oktober:~/src/charconv$ echo -e '\33$B$P$+\33(B' | charconv iso-2022-jp unihmtl
Charconv: Warning: Assuming 'unihmtl' means 'unihtml'
&#12400;&#12363;
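
The "via unicode" route that charconv takes can be illustrated with the Japanese example above. This minimal Python sketch relies on Python's built-in codecs, not on charconv's own tables. The convert function and its "unihtml" special case are assumptions made for illustration only. It emits numeric &#number; codes exclusively, so it would not reproduce named entities such as &Auml;.

```python
def convert(data: bytes, incharset: str, outcharset: str) -> bytes:
    """Convert bytes between encodings by pivoting through Unicode."""
    # Step 1: decode the input bytes into Unicode text.
    text = data.decode(incharset)
    # Step 2: encode the Unicode text in the target encoding.
    if outcharset == "unihtml":
        # Assumption: imitate "unihtml" with numeric character references
        # for everything outside ASCII.
        return text.encode("ascii", "xmlcharrefreplace")
    return text.encode(outcharset)


if __name__ == "__main__":
    # ISO-2022-JP input from the example: ESC $ B, two kana, ESC ( B.
    jis = b"\x1b$B$P$+\x1b(B"
    print(convert(jis, "iso2022_jp", "unihtml"))  # b'&#12400;&#12363;'
    print(convert(jis, "iso2022_jp", "utf-8").decode("utf-8"))
```

Going through Unicode means each encoding needs only one decoder and one encoder, rather than a dedicated converter for every pair of character sets.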

oktober:~/src/charconv$ echo '�������' | charconv cp1251 koi8r
Charconv: Warning: Assuming 'koi8r' means 'koi8-r'
���������

6. Requirements

charconv is written in C++ and uses the Standard Template Library.
The hashes the program uses have been heavily optimized for both size and speed, with size being the top priority. As a result, compilation takes a lot of memory and time.
GNU make is required.
I use g++ version 3.0.1, with which charconv compiles without warnings (except some signed/unsigned mismatches).
Some parts of the makefiles were generated with a PHP script (included in the archive). If you want to regenerate them, you need PHP 4 too.

7. See also

GNU Recode: This recoding library converts files between various coded character sets and surface encodings. When this cannot be achieved exactly, it may get rid of the offending characters or fall back on approximations. The library recognises or produces more than 300 different character sets and is able to convert files between almost any pair. Most RFC 1345 character sets, and all `libiconv' character sets, are supported. The `recode' program is a handy front-end to the library.
I have made an online version of it available for converting short amounts of data between encodings.

If you are converting HTML pages, use htmlrecode instead. It handles them (and changes the character set) losslessly.

If you are _not_ converting HTML, use GNU recode. It may be more efficient than charconv.

8. Downloading

Date (Y-md-Hi) acc        Size Name                
2002-0902-0124 ---        1816 patch-charconv-1.1.2.6-1.1.2.7.sh.bz2
2002-0902-0125 ---      251018 charconv-1.1.2.7.tar.bz2
2002-0902-0124 ---        1329 patch-charconv-1.1.2.6-1.1.2.7.bz2
2002-0827-2209 ---        6132 patch-charconv-1.1.2.5-1.1.2.6.sh.bz2
2002-0827-2209 ---      251120 charconv-1.1.2.6.tar.bz2
2002-0827-2209 ---        5983 patch-charconv-1.1.2.5-1.1.2.6.bz2
2002-0812-0205 ---      244250 charconv-1.1.2.5.tar.bz2
2002-0812-0205 ---        2452 patch-charconv-1.1.2.4-1.1.2.5.bz2
2002-0811-0016 ---      244922 charconv-1.1.2.4.tar.bz2
2002-0811-0016 ---        2296 patch-charconv-1.1.2.3-1.1.2.4.bz2
2002-0802-1002 ---      244659 charconv-1.1.2.3.tar.bz2
2002-0802-1002 ---        1620 patch-charconv-1.1.2.2-1.1.2.3.bz2
2002-0712-1431 ---      244449 charconv-1.1.2.2.tar.bz2
2002-0712-1431 ---        2675 patch-charconv-1.1.2.1-1.1.2.2.bz2
2002-0602-0953 ---      243994 charconv-1.1.2.1.tar.bz2
2002-0602-0953 ---        1316 patch-charconv-1.1.2-1.1.2.1.bz2
2002-0601-0035 ---      244002 charconv-1.1.2.tar.bz2
2002-0601-0035 ---        1774 patch-charconv-1.1.1.1-1.1.2.bz2
2002-0601-0035 ---       17357 patch-charconv-1.0.0-1.1.2.bz2
2002-0527-1546 ---        4852 patch-charconv-1.1.1-1.1.1.1.bz2
2002-0428-1144 ---        2620 patch-charconv-1.1.0-1.1.1.bz2
2002-0314-2042 ---        7239 patch-charconv-1.0.3-1.1.0.bz2
2002-0130-0845 ---        6004 patch-charconv-1.0.2-1.0.3.bz2
2002-0122-2358 ---        6971 patch-charconv-1.0.1-1.0.2.bz2
2002-0121-0227 ---        1180 patch-charconv-1.0.0-1.0.1.bz2
2002-0121-0208 ---      221861 charconv-1.0.0.rar
2002-0121-0208 ---      234350 charconv-1.0.0.tar.bz2
2002-0121-0208 ---        4734 patch-charconv-0.0.13-1.0.0.bz2
2002-0121-0208 ---       13439 patch-charconv-0.0.10-1.0.0.bz2
2002-0112-1858 ---        7100 patch-charconv-0.0.12-0.0.13.bz2
2001-1017-0131 ---        2633 patch-charconv-0.0.11-0.0.12.bz2
2001-1008-0830 ---        5858 patch-charconv-0.0.10-0.0.11.bz2
2001-1008-0501 ---      229891 charconv-0.0.10.tar.bz2
2001-1008-0501 ---       18696 patch-charconv-0.0.9-0.0.10.bz2
2001-1008-0501 ---      117514 patch-charconv-0.0.2-0.0.10.bz2
2001-1006-1704 ---        7617 patch-charconv-0.0.8-0.0.9.bz2
2001-1006-0322 ---       97222 patch-charconv-0.0.7-0.0.8.bz2
2001-1005-0427 ---        4099 patch-charconv-0.0.6-0.0.7.bz2
2001-1005-0202 ---        2985 patch-charconv-0.0.5-0.0.6.bz2
2001-1003-1220 ---        3777 patch-charconv-0.0.4-0.0.5.bz2
2001-0927-2334 ---        4783 patch-charconv-0.0.3-0.0.4.bz2
2001-0925-0149 ---        1406 patch-charconv-0.0.2-0.0.3.bz2
2001-0925-0112 ---      124076 charconv-0.0.2.tar.bz2
2001-0925-0112 ---        5441 patch-charconv-0.0.1-0.0.2.bz2