Simple PDF to HTML converter


0. Contents

   1. Purpose
   2. Copying
   3. Requirements
   4. Notes
   5. Downloading

1. Purpose

Converts PDF files into simple HTML.

2. Copying

This code has been written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the following terms:
* No warranty. You are free to modify this source and to
* distribute the modified sources, as long as you keep the
* existing copyright messages intact and as long as you
* remember to add your own copyright markings.
* You are not allowed to distribute the program or modified versions
* of the program without including the source code (or a reference to
* the publicly available source) and this notice with it.

3. Requirements

GNU make is probably required.
swftools is required: specifically, the source code of its "pdf2swf/xpdf" directory and the "libpdf.a" file that is produced when that source code is compiled. Edit the Makefile and change the XPDF setting to point to the xpdf directory inside your swftools source tree.
xpdf and swftools are needed only for building; they are not required at runtime.

4. Notes

Full reconstruction of HTML pages from PDF documents is not possible. PDF is merely a sequence of rendering instructions; i.e., it is all about layout, nothing about structure. In contrast, HTML is all about structure, little about layout.

The usability of all-about-layout HTML pages is usually very low. Thus, to make the resulting pages usable and readable, pdf2simplehtml ignores most of the layout and concentrates only on making the text content readable.

In a PDF document, spaces are not always saved as text. Sometimes each word (or even each letter!) is individually positioned on the page. pdf2simplehtml tries to guess word breaks, line breaks, and new paragraphs by analyzing the distances between lines and between words. This does not always work well, but most of the time it does. It does not understand indentation yet, and it cannot recognize headers either. It does try to convert fonts and text sizes.
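The distance-based guessing described above can be illustrated with a minimal sketch. This is not pdf2simplehtml's actual code. The Glyph type, the rebuild_text function, and the threshold values are hypothetical and chosen only for illustration, and the sketch assumes the text fragments arrive in reading order with the y coordinate changing from one line to the next.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned text fragment (hypothetical type, for illustration only)."""
    text: str     # character(s) drawn at this position
    x: float      # left edge of the fragment
    y: float      # baseline position
    width: float  # horizontal extent of the fragment
    size: float   # font size


def rebuild_text(glyphs, word_gap=0.25, line_gap=0.5, para_gap=1.5):
    """Join fragments into paragraphs by looking at the distances between them.

    All thresholds are fractions of the font size and are assumed values,
    not ones taken from pdf2simplehtml.
    """
    paragraphs = []
    current = ""
    prev = None
    for g in glyphs:
        if prev is None:
            current = g.text
        else:
            dy = abs(g.y - prev.y)
            dx = g.x - (prev.x + prev.width)
            if dy > para_gap * prev.size:
                # Large vertical jump: treat it as a new paragraph.
                paragraphs.append(current)
                current = g.text
            elif dy > line_gap * prev.size:
                # Ordinary line break: continue the paragraph with a space.
                current += " " + g.text
            elif dx > word_gap * prev.size:
                # Noticeable horizontal gap on the same line: word break.
                current += " " + g.text
            else:
                # Fragments touch: same word.
                current += g.text
        prev = g
    if current:
        paragraphs.append(current)
    return paragraphs
```

A sketch like this shares the limitations listed above: it knows nothing about indentation or headers, and it does not rejoin words hyphenated across lines.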

In the author's opinion, pdf2simplehtml converts PDF documents into more readable HTML pages than Google's cache does.

5. Downloading

Downloading help

  • Do not download everything - you only need one file (newest version for your platform)!
  • Do not use download accelerators or you will be banned from this server before your download is complete!

Date (Y-md-Hi) acc        Size Name                
2006-0705-0726 r--      442726 pdf2simplehtml-0.9.0-x86linuxbinary-static.zip
2006-0705-0726 r--        8294 pdf2simplehtml-0.9.0.tar.bz2