Saturday, April 21, 2012

Reformatting text for text readers

Raw text files, or text files converted from other formats, often have horrible formatting when you try to use them in a text reader.  That bad formatting even carries over when you convert them to epub or other ebook formats.


[Screenshot: before being processed, the lines are broken in the wrong places.]

[Screenshot: after being processed, the lines break cleanly between paragraphs.]

So I have written a small program to reformat text files: it joins broken lines, removes extra newlines, and puts a blank line between paragraphs.  It leaves the line break after lines shorter than 20 characters, to accommodate "Chapter 1" style headings and short section breaks such as "* * *".

The file is included inline here:
#include <stdio.h>
#include <ctype.h>

int
main ()
{
    /* declare variables */
    int cur;        /* the current character read from stdin */
    int prev = 0;   /* the previous non-whitespace character */
    int count = 0;  /* the count since the beginning of the current line */

    while ((cur = getc(stdin)) != EOF) {

        if (cur == '\r') {
            // eat carriage returns; the newline that follows drives the logic
            continue;
        }

        if (cur == '\n' && (isalnum(prev) || !ispunct(prev) || prev == ',') && count > 20) {
            // Reached the end of a line whose last character does not end a
            // sentence (it is a letter, a digit, a comma, or other
            // non-punctuation) while being more than 20 characters into the line.
            // Replace the newline with a space to join the lines.
            putc(' ', stdout);
            count = 0;
            prev = 0;

            //putc('*', stdout);
        } else if (cur == '\n') {

            // New paragraph

            // eat extra newlines.
            if (count < 2) continue;

            putc('\n', stdout);
            putc('\n', stdout);

            count = 0;

            //putc('+', stdout);

        } else {

            // Just output the character
            putc(cur, stdout);

            // keep a copy of the previous non-whitespace character
            if (isalnum(cur) || ispunct(cur))
                prev = cur;

            // keep track of where we are on the line.
            count++;
        }

    }   /* end while */

    /* report success back to environment */
    return 0;

}   /* end main */
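
To use it, compile it and pipe a text file through it on stdin.  Something like this should work (the file names are placeholders, and I am assuming the source above is saved as reflow.c):

# placeholder file names
gcc -o reflow reflow.c
./reflow < Book_raw.txt > Book.txt
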
The only other major problem I am seeing is that files converted to text from other ereader formats often have tags scattered several hundred times through the file, so I am going to try to automate removing those tags in a generic way.

A solution to the tags problem is a filter using the following pattern:

perl -pi.orig -e 's/^[ ]*file:\/\/\/.*//' filename
 
I did not like how the ebook-convert program was working, so I am rolling my own conversion to have control over how it works.  The biggest issue was that ebook-convert generated a stylesheet that overrode the e-reader's own control over paragraph spacing.

I modified my program above to generate html output:

#include <stdio.h>
#include <ctype.h>

int
main ()
{
    /* declare variables */
    int cur;        /* the current character read from stdin */
    int prev = 0;   /* the previous non-whitespace character */
    int count = 0;  /* the count since the beginning of the current line */
    int i;          /* index for loop */

    /* character conversion array:
       converts special characters to normal values */
    int conv[256];

    // fill in array with identity values.  Map all characters to themselves.
    for (i = 0; i < 256; i++)
        conv[i] = i;

    // Map whitespace-like control characters to ordinary whitespace
    conv['\a'] = ' ';   // bell
    conv['\b'] = ' ';   // Backspace
    conv['\t'] = ' ';   // (Horizontal) Tab
    conv['\v'] = ' ';   // Vertical Tab
    conv['\f'] = '\n';  // Form Feed

    // Map these characters to regular ascii replacements
    conv['\r'] = '\n';  // Carriage Return (CR)

/*  conv[ 30] = 45;     // nonbreak hyphen      -> real hyphen
    conv[ 31] = 45;     // optional hyphen      -> real hyphen
    conv[ 96] = 39;     // left apostrophe      -> single quote
    conv[145] = 39;     // left quote           -> single quote
    conv[146] = 39;     // right quote          -> single quote
    conv[147] = 34;     // left double quote    -> double quote
    conv[148] = 34;     // right double quote   -> double quote
    conv[150] = 45;     // long hyphen          -> real hyphen
    conv[151] = 45;     // longer hyphen        -> real hyphen
    conv[160] = 32;     // nonbreaking space    -> real space
    conv[173] = 45;     // short hyphen         -> real hyphen
*/

    printf("<?xml version='1.0' encoding='utf-8'?>\r\n"
           "<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n"
           "<head>\r\n"
           "<meta content=\"application/xhtml+xml; charset=utf-8\" http-equiv=\"Content-Type\"/>\r\n"
           "</head>\r\n"
           "<body>\r\n"
           "<p>");

    while ((cur = getc(stdin)) != EOF) {

        // translate the character set to plain ascii
        cur = conv[cur];

        // skip NUL bytes (anything mapped to 0)
        if (!cur) continue;

        if (cur == '\n' && (isalnum(prev) || !ispunct(prev) || prev == ',') && count > 30) {
            // Reached the end of a line whose last character does not end a
            // sentence (it is a letter, a digit, a comma, or other
            // non-punctuation) while being more than 30 characters into the line.
            // Replace the newline with a space to join the lines.
            putc(' ', stdout);
            count = 0;
            prev = 0;

            //putc('*', stdout);
        } else if (cur == '\n') {

            // New paragraph

            // eat extra newlines.
            if (count < 2) continue;

            // close the old paragraph and open a new one.
            printf("</p>\r\n<p>");

            count = 0;

            //putc('+', stdout);

        } else {

            // escape the special html characters in an html way.
            if (cur == '<')
                printf("&lt;");
            else if (cur == '>')
                printf("&gt;");
            else if (cur == '&')
                printf("&amp;");
            //else if (cur == '"')
            //    printf("&quot;");   // not positive that you need to escape quotes;
            //                        // fbreader works just fine without escaping double quotes.
            else
                // Just output the character
                putc(cur, stdout);

            // keep a copy of the previous non-whitespace character
            if (isalnum(cur) || ispunct(cur))
                prev = cur;

            // keep track of where we are on the line.
            count++;
        }

    }   /* end while */

    printf("</p>\r\n</body>\r\n</html>");

    /* report success back to environment */
    return 0;

}   /* end main */
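
The html version builds and runs the same way, for example (again with placeholder names, assuming the source is saved as txt2html.c):

gcc -o txt2html txt2html.c
./txt2html < Book.txt > Book.html
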
 
I figured out that I needed to escape ampersands and pointy brackets when most of my book turned up missing in the ereader.  Strangely, Firefox displayed the html just fine.
Epub files are just zip files, with the content in specific places inside the archive.  If you unzip one you get a folder that looks a little like this:

BookTitle
|-- Book.html
|-- content.opf
|-- META-INF
|   `-- container.xml
|-- mimetype
|-- stylesheet.css
|-- toc.ncx
 
container.xml is locked into one place, under META-INF.  It is the first file a reader looks at, and it looks like this:

<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
 
The important bit is that this file tells the reader where to find content.opf and what it is called; it doesn't even have to be named content.opf.

content.opf looks like:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uuid_id">
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:language>en</dc:language>
<dc:title>Book Name</dc:title>
<meta name="calibre:timestamp" content="2011-02-26T00:47:46.052390+00:00"/>
<dc:creator opf:role="aut">Authors Name</dc:creator>
<dc:identifier id="uuid_id" opf:scheme="uuid">5b8a6644-8bb1-4b9b-90a3-c1b84daa0040</dc:identifier>
</metadata>
<manifest>
<item href="stylesheet.css" id="css" media-type="text/css"/>
<item href="Book.html" id="html1" media-type="application/xhtml+xml"/>
<item href="toc.ncx" media-type="application/x-dtbncx+xml" id="ncx"/>
</manifest>
<spine toc="ncx">
<itemref idref="html1"/>
</spine>
<guide/>
</package>
 
This contains a manifest of files for the book and gives metadata about the book.

The other important file is toc.ncx:

<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
<head>
<meta content="5b8a6644-8bb1-4b9b-90a3-c1b84daa0040" name="dtb:uid"/>
<meta content="2" name="dtb:depth"/>
<meta content="calibre (0.7.18)" name="dtb:generator"/>
<meta content="0" name="dtb:totalPageCount"/>
<meta content="0" name="dtb:maxPageNumber"/>
</head>
<docTitle>
<text>Book Title</text>
</docTitle>
<navMap>
<navPoint id="59134a23-813d-49df-902f-bb161bfe8206" playOrder="1">
<navLabel>
<text>Start</text>
</navLabel>
<content src="Book.html"/>
</navPoint>
</navMap>
</ncx>
 
The UUID in the last two files should match, and each bit of content gets its own UUID here too.  I am assuming this is to allow readers to remember navigation details: they could store them as key-value pairs tied to the UUID of that one bit of content.
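
Going the other direction, since an epub is just a zip archive, putting a book back together is a matter of zipping the folder back up.  The one wrinkle is that the mimetype file (it just contains the string application/epub+zip) is supposed to be the first entry in the archive and stored uncompressed.  Roughly, from inside the BookTitle folder, something like this should do it (Book.epub is just a placeholder name):

zip -X0 ../Book.epub mimetype
zip -rX9 ../Book.epub META-INF content.opf toc.ncx stylesheet.css Book.html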

Next up is a book with multiple chapters, each in its own file.  Evidently there is some arbitrary maximum file size that some book readers expect, so the .html files holding the book have to be split into pieces to fit under that maximum.  Splitting on chapters also makes it convenient to navigate by chapter.

To split a book into chapters and then convert each chapter to an html file:

csplit -f Chapter_ -ks Schmitz\,\ James\ H.\ -\ BookFileName.txt '/^ Chapter/' {50}
rename 's/Chapter_([0-9][0-9])/Chapter_\1.txt/' *
find . -name "Chapter_*.txt" -exec ~/bin/txt2html.sh {} \;
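
The txt2html.sh script is just a small wrapper around the html converter above; I have not included it here, but a rough sketch of what it could look like (assuming the converter was compiled to a txt2html binary somewhere on the PATH) is:

#!/bin/sh
# rough wrapper: run one chapter file through the converter and
# write the output next to it with an .html extension
in="$1"
out="${in%.txt}.html"
txt2html < "$in" > "$out"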
