HTML file character set converter

0. Contents

This is the documentation of htmlrecode-1.1.5.1.
   1. Purpose
   2. Usage
   3. TODO
   4. Installation
      4.1. Problems with gcc 3
   5. Example
   6. Feedback
   7. Requirements
   8. Changes since 1.1.5
   9. Copying
   10. Downloading

1. Purpose

Recodes an HTML file into a new character set without losing any characters. You can recode shift_jis to euc-jp, utf8 to latin1, iso-8859-15 to GB18030, iso-2022-jp to koi8-r and so on, and none of the characters on the page will become unreadable: characters that the output set cannot represent are written as &#nnnn; escapes (unless you specify the -l switch, which disables these escapes).
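
The escaping idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not htmlrecode's actual code: the function name toLatin1 is hypothetical, the output set is fixed to iso-8859-1, and the sketch uses std::u32string, which needs a C++11 compiler.

```cpp
#include <string>

// Hypothetical helper (not from htmlrecode): write Unicode code points as
// iso-8859-1 bytes, falling back to a decimal &#nnnn; escape for every
// code point that iso-8859-1 cannot represent.
std::string toLatin1(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
    {
        if (cp <= 0xFF)
        {
            // iso-8859-1 maps 1:1 onto the first 256 Unicode code points.
            out += static_cast<char>(cp);
        }
        else
        {
            // Not representable: keep the character as a numeric escape.
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        }
    }
    return out;
}
```

With this sketch, a string holding the letter a followed by the ideogram U+65E5 comes out as a&#26085; so a browser can still display the ideogram.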

2. Usage

htmlrecode 1.1.5.1 - Copyright (C) 1992,2003 Bisqwit (http://iki.fi/bisqwit/)

Usage: htmlrecode [<option> [<...>]]

Reads stdin, writes stdout.

Options:
    -I, --inset setname   Assumed input character set (default: iso-8859-1)
    -O, --outset setname  Wanted output character set (default: iso-8859-1)
    -V, --version         Displays version information.
    -h, --help            This help.
    -l, --lossy           Disable lossless conversion.
    -x, --usehex          Use hexadecimal escapes.

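Pipe the HTML file in on stdin and redirect stdout to the result file, for example: htmlrecode -I iso-8859-1 -O utf-8 < in.html > out.html (in.html and out.html are placeholder file names).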
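
To show what the -x switch changes, here is a minimal sketch of the two escape forms. The helper name numericEscape is hypothetical, and the use of uppercase hex digits is an assumption; HTML accepts either case, and the exact form htmlrecode writes is not stated here.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from htmlrecode): format one code point as an
// HTML numeric character reference, decimal by default, hexadecimal if
// useHex is set (the uppercase digits are an assumption).
std::string numericEscape(unsigned long cp, bool useHex)
{
    char buf[32];
    std::snprintf(buf, sizeof buf, useHex ? "&#x%lX;" : "&#%lu;", cp);
    return buf;
}
```

For the ideogram with code point 26085, the decimal form is &#26085; and the hexadecimal form is &#x65E5;. Both refer to the same character.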

3. TODO

I'll soon add an interface for modifying the text content of an HTML file.
This should make it easier to write filters like Pootpoot or Pikachifier. The program already supports this in theory, but I haven't designed an interface for it yet.

4. Installation

$ make
$ su
# make install
If you do not want to install libargh (included in the archive), do not run "make install"; instead, edit the Makefile to enable STATIC linking in place of DYNAMIC.

4.1. Problems with gcc 3

If somebody knows how to get the example code below to compile and link with g++-3, please tell me.

#include <string>

using namespace std;

int main(void)
{
    basic_string<unsigned> test;
    test += 5;
    return 0;
}
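
A likely cause, stated here as an assumption rather than a confirmed diagnosis, is that basic_string<unsigned> needs std::char_traits<unsigned>, and the standard only guarantees char_traits for the built-in character types. The sketch below shows two possible ways around that; the function names are hypothetical. The first needs C++11 and so does not help with g++ 3 itself, the second is plain C++98.

```cpp
#include <string>
#include <vector>

// Option 1 (C++11 or later): std::u32string is basic_string<char32_t>,
// for which the standard library does provide char_traits.
std::u32string makeWideString()
{
    std::u32string test;
    test += static_cast<char32_t>(5);
    return test;
}

// Option 2 (C++98): store the code units in a vector, which needs no
// char_traits at all.
std::vector<unsigned> makeUnitVector()
{
    std::vector<unsigned> test;
    test.push_back(5);
    return test;
}
```

Both functions build a one-element sequence holding the value 5, which is what the example above attempts.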

5. Example

This page template is locally stored in iso-8859-1, but is automatically converted to utf-8 to make the final version.

Here are some latin letters: åäöñé
Here are some CJK (chinese/japanese/korean ideograms): 日本
Here are some html escapes: >"äöê

Source code of the above:

Here are some latin letters: åäöñé<br>
Here are some CJK (chinese/japanese/korean ideograms): &#26085;&#26412;<br>
Here are some html escapes: &gt;&quot;&auml;&ouml;&ecirc;<br>
What your browser receives is not &#26085; etc. but the actual utf-8 characters.
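
To make that concrete, here is a minimal sketch of how one code point turns into utf-8 bytes. It is an illustration only, not taken from htmlrecode; the function name utf8Encode is hypothetical, and the input is assumed to be a valid code point no larger than U+10FFFF.

```cpp
#include <string>

// Hypothetical helper (not from htmlrecode): encode one Unicode code
// point as utf-8. Assumes cp is a valid code point (<= 0x10FFFF).
std::string utf8Encode(unsigned long cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 1 byte
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For &#26085; (code point 26085, hexadecimal 65E5) this yields the three bytes E6 97 A5, which is what the browser receives in place of the escape.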

6. Feedback

If you have problems using this program or ideas on how to develop it, email me your questions or ideas. Please do not omit the details.

7. Requirements

htmlrecode is written in C++ and uses the Standard Template Library.
GNU make is required.
I have g++ version 3.1.0, and htmlrecode compiles without warnings.
You also need libiconv.
Note: The Makefile tries to link the program against librecode. If linking fails, try replacing the line LDFLAGS=-lrecode with LDFLAGS=-liconv, or even commenting it out.

8. Changes since 1.1.5

  - tags are now recognized regardless of letter case
  - tag values can be enclosed in '', not only in ""
  - the characters -:_. are recognized as part of a tag value when it is not quoted
  - SCRIPT and STYLE contents are treated as "raw" until the next </
  - the characters ,/ are also unconditionally recognized as part of the tag value
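
The tag value rules in the list above can be sketched roughly as follows. This is not htmlrecode's parser: the function names are hypothetical, "tag value" is read here as an attribute value, and accepting letters and digits in unquoted values is an assumption (the list only names the punctuation characters).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumption: letters and digits are allowed in an unquoted value, in
// addition to the punctuation named in the change list (-:_. and ,/).
bool isUnquotedValueChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0
        || std::string("-:_.,/").find(c) != std::string::npos;
}

// Hypothetical helper: read the value that starts at position pos of s.
// Handles "...", '...' and unquoted values.
std::string readValue(const std::string& s, std::size_t pos)
{
    if (pos >= s.size())
        return "";

    if (s[pos] == '"' || s[pos] == '\'')
    {
        // Quoted: take everything up to the matching quote character,
        // or to the end of the input if the quote is never closed.
        const char quote = s[pos];
        const std::size_t end = s.find(quote, pos + 1);
        if (end == std::string::npos)
            return s.substr(pos + 1);
        return s.substr(pos + 1, end - pos - 1);
    }

    // Unquoted: take the longest run of accepted characters.
    std::size_t end = pos;
    while (end < s.size() && isUnquotedValueChar(s[end]))
        ++end;
    return s.substr(pos, end - pos);
}
```

Under these assumptions, an unquoted value such as text/html is read whole because / is accepted, while a space or a > ends it.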

9. Copying

htmlrecode has been written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the terms of the General Public License (GPL).

10. Downloading

The official home page of htmlrecode is at http://iki.fi/bisqwit/source/htmlrecode.html.
Check there for new versions.

Generated from progdesc.php (last updated: Thu, 29 May 2003 18:58:43 +0300)
with docmaker.php (last updated: Thu, 13 Feb 2003 15:11:29 +0200)
at Thu, 29 May 2003 18:58:49 +0300