Simple PDF to HTML converter

0. Contents

This is the documentation of pdf2simplehtml-0.9.0.
   1. Purpose
   2. Copying
   3. Requirements
   4. Notes
   5. Downloading

1. Purpose

Converts PDF files into simple HTML.

2. Copying

This code has been written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the following terms:
No warranty. You are free to modify this source and to
distribute the modified sources, as long as you keep the
existing copyright messages intact and as long as you
remember to add your own copyright markings.
You are not allowed to distribute the program or modified versions
of the program without including the source code (or a reference to
the publicly available source) and this notice with it.

3. Requirements

GNU make is probably required.
swftools is required at build time: specifically, the source code in its "pdf2swf/xpdf" directory and the "libpdf.a" file that is produced when that source code is compiled. Edit the Makefile and change the XPDF setting to point to the xpdf directory within your swftools source code directory.
Neither xpdf nor swftools is required at runtime.

4. Notes

Full reconstruction of HTML pages from PDF documents is not possible. PDF is merely a sequence of rendering instructions; i.e., it is all about layout, nothing about structure. In contrast, HTML is all about structure, little about layout.

HTML pages that are all about layout are usually hard to use. Thus, to make the resulting pages usable and readable, pdf2simplehtml ignores most of the layout and concentrates only on making the text content readable.

In a PDF document, spaces are not always saved as text. Sometimes, each word (or even each letter!) is individually positioned on the page. pdf2simplehtml tries to guess word breaks, line breaks, and new paragraphs by analyzing the distances between lines and words. Sometimes this does not work well, but most of the time it does. It does not understand indentation yet, and it cannot detect headers either. It tries to convert fonts and text sizes.
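
To illustrate the kind of guessing described above, here is a minimal sketch in Python. It is not the code used by pdf2simplehtml; the Glyph record, the rebuild_text function, and the threshold values (word_gap, para_gap, and the half-font-size line test) are all assumptions chosen for the example. It assumes the text fragments arrive in reading order, each with a position, a width, and a font size.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned text fragment (hypothetical record, not an xpdf type)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline position, growing downward
    width: float  # horizontal extent of the fragment
    size: float   # font size


def rebuild_text(glyphs, word_gap=0.25, para_gap=1.5):
    """Guess word, line, and paragraph breaks from distances alone.

    word_gap and para_gap are illustrative thresholds, expressed as
    multiples of the previous fragment's font size.
    """
    paragraphs = []
    current = ""
    prev = None
    for g in glyphs:  # assumed to be in reading order
        if prev is None:
            current = g.text
        elif abs(g.y - prev.y) > 0.5 * prev.size:
            # The baseline moved, so this fragment starts a new line.
            if abs(g.y - prev.y) > para_gap * prev.size:
                # A large vertical jump is treated as a new paragraph.
                paragraphs.append(current)
                current = g.text
            else:
                # An ordinary line break becomes a single space.
                current += " " + g.text
        else:
            # Same line: a wide horizontal gap is treated as a word break.
            gap = g.x - (prev.x + prev.width)
            if gap > word_gap * prev.size:
                current += " "
            current += g.text
        prev = g
    if current:
        paragraphs.append(current)
    return paragraphs


sample = [
    Glyph("He", 0, 0, 10, 10),
    Glyph("llo", 10, 0, 15, 10),    # no gap: same word
    Glyph("world", 30, 0, 25, 10),  # gap of 5: new word
    Glyph("again", 0, 12, 25, 10),  # next line, same paragraph
    Glyph("Next", 0, 40, 20, 10),   # large jump: new paragraph
]
print(rebuild_text(sample))
```

Fixed thresholds like these are exactly why such guessing sometimes fails: tightly kerned fonts, justified text, or unusual line spacing can push a gap over or under the cutoff.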

In the author's opinion, pdf2simplehtml converts PDF documents into more readable HTML pages than Google's cache does.

5. Downloading

The official home page of pdf2simplehtml is at http://iki.fi/bisqwit/source/pdf2simplehtml.html.
Check there for new versions.

Generated from progdesc.php (last updated: Wed, 05 Jul 2006 07:26:02 +0300)
with docmaker.php (last updated: Sun, 12 Jun 2005 06:08:02 +0300)
at Wed, 05 Jul 2006 07:27:41 +0300