HTML file character set converter
1. Purpose
Recodes an HTML file using a new character set, while losing
no characters at all. You can recode shift_jis to euc-jp, utf8 to latin1,
iso-8859-15 to GB18030, iso-2022-jp to koi8-r, etc. if you wish, and none
of the characters on the page will become unreadable
(unless you specify the -l switch, which disables the generation of &#nnnn; escapes).
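The lossless behaviour described above can be sketched roughly as follows. This is an illustration only, not code from the htmlrecode sources: the function name escape_for_latin1 and the choice of iso-8859-1 as the output set are assumptions made for the example. Code points that the output set can represent are written as-is; everything else becomes a numeric character reference, in decimal by default or in hexadecimal (the form the -e switch asks for).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch only; not taken from the htmlrecode sources.
// Input:  Unicode code points.
// Output: iso-8859-1 bytes, with every code point above 0xFF written
//         as a numeric character reference instead of being dropped.
std::string escape_for_latin1(const std::vector<std::uint32_t>& text,
                              bool use_hex = false)
{
    std::string out;
    for (std::uint32_t cp : text)
    {
        if (cp <= 0xFF)
        {
            // iso-8859-1 maps code points 0..255 directly to byte values.
            out += static_cast<char>(static_cast<unsigned char>(cp));
            continue;
        }
        if (use_hex)
        {
            // Build the hexadecimal digits from least to most significant.
            static const char digits[] = "0123456789ABCDEF";
            std::string hex;
            for (std::uint32_t v = cp; v != 0; v >>= 4)
                hex.insert(hex.begin(), digits[v & 0xF]);
            out += "&#x" + hex + ";";
        }
        else
        {
            out += "&#" + std::to_string(cp) + ";";
        }
    }
    return out;
}
```

With this sketch, the two ideograms U+65E5 and U+672C come out as &#26085;&#26412;, while a letter such as ä (U+00E4) is emitted as the single byte 0xE4. A lossy mode would simply have no way to keep the first two.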
Standard-correct HTML is a good thing. One of the goals in the development of this program is that it never makes the HTML more broken than it previously was. It should even make it better than it was. So if you see that the program does the opposite, please tell me.
2. Usage
 htmlrecode 1.2.0 - Copyright (C) 1992,2003 Bisqwit (http://iki.fi/bisqwit/)
 Usage: htmlrecode [<option> [<...>]]
 Reads stdin, writes stdout.
 Options:
  -I, --inset setname   Assumed input character set (default: iso-8859-1)
  -O, --outset setname  Wanted output character set (default: iso-8859-1)
  -V, --version         Displays version information.
  -e, --usehex          Use hexadecimal escapes.
  -g, --signature       Prefix the file with an unicode signature.
  -h, --help            This help.
  -l, --lossy           Disable lossless conversion.
  -q, --quiet           Be less verbose.
  -s, --strict          Turn off support for slightly broken HTML.
  -v, --verbose         Be less quiet.
  -x, --xmlmode         XML mode: all tag param values quoted.
Pipe in the html file and pipe the output to result file.
3. TODO
I'll soon add an interface for modifying the text content of an HTML file. This should make it easier to write filters like Pootpoot or Pikachifier. It is already theoretically supported, but I haven't invented an interface for it yet.
4. Installation
 $ make
 $ su
 # make install
If you do not want to install libargh (included in the archive), do not use "make install". Instead, edit the Makefile and enable STATIC linking instead of DYNAMIC.
5. Example
This page template is locally stored in iso-8859-1, but is
automatically converted to utf-8 to make the final version.
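For the iso-8859-1 to utf-8 case mentioned above, the byte-level transformation is simple enough to sketch. Again, this is only an illustration and not htmlrecode's own code; the function name latin1_to_utf8 is hypothetical. It relies on the fact that every iso-8859-1 byte value equals its Unicode code point, so only the one-byte and two-byte utf-8 forms are ever needed.

```cpp
#include <string>

// Illustrative sketch only; not taken from the htmlrecode sources.
// Converts iso-8859-1 bytes to utf-8.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    for (char ch : in)
    {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80)
        {
            // ASCII range: identical in both character sets.
            out += static_cast<char>(c);
        }
        else
        {
            // 0x80..0xFF: two-byte sequence 110xxxxx 10xxxxxx.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

For instance, the iso-8859-1 byte 0xE4 (ä) becomes the two utf-8 bytes 0xC3 0xA4. A real converter such as htmlrecode also has to parse tags and escapes, which this sketch ignores entirely.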
Here are some latin letters: åäöñé
Source code of the above:
 Here are some latin letters: åäöñé<br>
 Here are some CJK (chinese/japanese/korean ideograms): 日本<br>
 Here are some html escapes: >"äöê<br>
What your browser is getting is not escapes (such as &#nnnn; for 日) but the actual utf-8 characters.
6. Feedback
If you have problems using this program or ideas how to
develop it, email me your questions or ideas. Please do not omit the details. My email address (sigh) is: bisqwit at iki dot fi
7. Requirements
htmlrecode has been written in C++, utilizing the standard template library. GNU make is required. I have g++ version 3.3, and htmlrecode compiles without warnings. For now.
7.1. Compilation problems
htmlrecode uses widestrings, which is a feature different g++ versions
are very inconsistent about. htmlrecode.hh has some settings you can
try to choose between. Try this:
Replace
 //#define wstring ucs4string
 typedef wchar_t ucs4;
 //typedef unsigned int ucs4;
 //typedef basic_string<ucs4> wstring;
With
 //#define wstring ucs4string
 //typedef wchar_t ucs4;
 typedef unsigned int ucs4;
 typedef basic_string<ucs4> wstring;
This might help compiling on g++-2.95.
8. Changelog
Since 1.3.0:
 - Compilation fixes on more up-to-date compilers. (Thanks Santiago M. Mola)
Since 1.2.0:
 - Abruptly terminated multibyte sequences no longer cause htmlrecode to enter an infinite loop
Since 1.1.5:
 - Tags are now recognized in all mixed case
 - Tag values can be in '', not only in ""
 - -:_. are recognized to be part of tag value if no "" is there
 - Nonspace are also recognized as above :( (unless -s option was used)
 - SCRIPT and STYLE contents are "raw" until the next </, unless -s was used
 - SCRIPT/STYLE contents are properly rehidden if necessary
 - " and ' quotes (and no quotes) are used wisely
 - Warnings from some bad HTML
 - Indentations inside tags are now kept mostly intact
 - XHTML support
 - Unicode signature character support
 - Major structural rewrites
 - New "configure" script
 - Big thanks to Winfried Szukalski for his thorough testing efforts and comments.
Since 1.1.4:
 - workaround for g++ versions, now compiles with g++-3
Since 1.1.3:
 - optimizations
 - error resistance
Since 1.1.2:
 - hex support
 - g++ string workarounds
Since 1.1.1:
 - improved documentation
 - fixed &lt; (was outputted as &gt;, should be &lt;)
9. Copying
htmlrecode has been written by Joel Yliluoma, a.k.a.
Bisqwit, and is distributed under the terms of the General Public License (GPL).
10. Downloading
 Date (Y-md-Hi)  acc   Size  Name
 2009-0721-1255  r--  51387  htmlrecode-1.3.1.tar.bz2
 2009-0721-1255  r--  61880  htmlrecode-1.3.1.tar.gz
 2004-0918-2236  r--  47170  htmlrecode-1.3.0.tar.bz2
 2004-0918-2236  r--   9674  patch-htmlrecode-1.2.0-1.3.0.bz2
 2003-0602-1824  r--  46093  htmlrecode-1.2.0.tar.bz2
 2003-0602-1824  r--  11749  patch-htmlrecode-1.1.5.4-1.2.0.bz2
 2003-0602-1824  r--  16607  patch-htmlrecode-1.1.5-1.2.0.bz2
 2003-0601-0336  r--  45955  htmlrecode-1.1.5.4.tar.bz2
 2003-0601-0336  r--  12174  patch-htmlrecode-1.1.5.3-1.1.5.4.bz2
 2003-0530-1019  r--  43842  htmlrecode-1.1.5.3.tar.bz2
 2003-0530-1019  r--   4227  patch-htmlrecode-1.1.5.2-1.1.5.3.bz2
 2003-0529-2352  r--  43687  htmlrecode-1.1.5.2.tar.bz2
 2003-0529-2352  r--  10282  patch-htmlrecode-1.1.5.1-1.1.5.2.bz2
 2003-0529-1558  r--  42221  htmlrecode-1.1.5.1.tar.bz2
 2003-0529-1558  r--   4986  patch-htmlrecode-1.1.5-1.1.5.1.bz2
 2003-0514-2155  r--  40999  htmlrecode-1.1.5.tar.bz2
 2003-0514-2155  r--  12357  patch-htmlrecode-1.1.4.2-1.1.5.bz2
 2003-0514-2155  r--  12815  patch-htmlrecode-1.1.4-1.1.5.bz2
 2002-1214-1437  r--  41006  htmlrecode-1.1.4.2.tar.bz2
 2002-1214-1437  r--   6986  patch-htmlrecode-1.1.4.1-1.1.4.2.bz2
 2002-1209-2317  r--  39940  htmlrecode-1.1.4.1.tar.bz2
 2002-1209-2317  r--   6796  patch-htmlrecode-1.1.4-1.1.4.1.bz2
 2002-0919-0841  r--  41229  htmlrecode-1.1.4.tar.bz2
 2002-0919-0841  r--   6757  patch-htmlrecode-1.1.3.3-1.1.4.bz2
 2002-0919-0841  r--  11836  patch-htmlrecode-1.1.3-1.1.4.bz2
 2002-0813-1058  r--  38709  htmlrecode-1.1.3.3.tar.bz2
 2002-0813-1058  r--   6026  patch-htmlrecode-1.1.3.2-1.1.3.3.bz2
 2002-0802-1005  r--  37374  htmlrecode-1.1.3.2.tar.bz2
 2002-0802-1005  r--   1110  patch-htmlrecode-1.1.3.1-1.1.3.2.bz2
 2002-0729-1211  r--  37335  htmlrecode-1.1.3.1.tar.bz2
 2002-0729-1211  r--   1977  patch-htmlrecode-1.1.3-1.1.3.1.bz2
 2002-0712-1358  r--  37337  htmlrecode-1.1.3.tar.bz2
 2002-0712-1358  r--   1939  patch-htmlrecode-1.1.2.1-1.1.3.bz2
 2002-0712-1358  r--   2680  patch-htmlrecode-1.1.2-1.1.3.bz2
 2002-0606-2333  r--  37240  htmlrecode-1.1.2.1.tar.bz2
 2002-0606-2333  r--   1916  patch-htmlrecode-1.1.2-1.1.2.1.bz2
 2002-0606-2221  r--  37192  htmlrecode-1.1.2.tar.bz2
 2002-0606-2221  r--   3382  patch-htmlrecode-1.1.1-1.1.2.bz2
 2002-0603-1106  r--  36765  htmlrecode-1.1.1.tar.bz2
 2002-0603-1106  r--   1394  patch-htmlrecode-1.1.0-1.1.1.bz2
 2002-0603-0859  r--  36793  htmlrecode-1.1.0.tar.bz2
 2002-0603-0859  r--  21561  patch-htmlrecode-1.0.1-1.1.0.bz2
 2002-0602-2224  r--  24775  htmlrecode-1.0.1.tar.bz2
 2002-0602-2224  r--  22305  patch-htmlrecode-1.0.0-1.0.1.bz2
 2002-0602-1540  r--  11606  htmlrecode-1.0.0.tar.bz2