[Screenshot: before being processed — the lines are badly broken.]
[Screenshot: after being processed — the lines break properly.]
So I have written a small program to reformat text files: it removes the extra newlines within paragraphs and puts a blank line between paragraphs. Lines shorter than 20 characters keep their line break, to accommodate “Chapter 1” style headings or short page breaks such as “* * *”.
The file is included inline here:
#include <stdio.h>
#include <ctype.h>

int main ()
{
    /* declare variables */
    int cur;        /* the current character; an int so it can also hold EOF */
    int prev = 0;   /* the previous non-whitespace character */
    int count = 0;  /* the count since the beginning of the current line */

    while ((cur = getc(stdin)) != EOF) {
        if (cur == '\r')
            continue;  // eat carriage returns; the '\n' that follows
                       // still sees the real line length in count
        if (cur == '\n' && (isalnum(prev) || !ispunct(prev) || prev == ',') && count > 20) {
            // Reached end of line while the previous character is not
            // sentence-ending punctuation (a trailing comma is fine) and
            // we are well across the page: replace the newline with a
            // space to rejoin the broken paragraph.
            putc(' ', stdout);
            count = 0;
            prev = 0;
            //putc('*', stdout);
        } else if (cur == '\n') {
            // New paragraph: eat extra newlines.
            if (count < 2)
                continue;
            putc('\n', stdout);
            putc('\n', stdout);
            count = 0;
            //putc('+', stdout);
        } else {
            // Just output the character.
            putc(cur, stdout);
            // Keep a copy of the previous non-whitespace character.
            if (isalnum(cur) || ispunct(cur))
                prev = cur;
            // Keep track of where we are on the line.
            count++;
        }
    } /* end while */

    /* report success back to environment */
    return 0;
} /* end main */

The only other major problem I am seeing is that files converted to text from other ereader formats often have tags scattered several hundred times through them, so I am going to try to automate removing those tags in a generic way.
A solution to the tags problem is a filter using the following pattern (a few variants of the same idea):
perl -pi.orig -e 's/^file:\/\/\/.*+//e' filename
perl -pi.orig -e 's/^[ ]*+file:\/\/\/.*+//e' filename
perl -pi.orig -e 's/^[ ]{0,}file:\/\/\/.*+//e' filename
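To run the filter over a whole directory of converted books in one pass, something like this should work (the *.txt glob is an assumption about how the converted files are named; perl's -i.orig keeps a backup of each original):

# Strip the leftover file:/// tags from every text file in the tree,
# using the last pattern variant from above.
find . -name "*.txt" -exec perl -pi.orig -e 's/^[ ]{0,}file:\/\/\/.*+//e' {} \;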
I did not like how the ebook-convert program was working, so I am rolling my own converter to keep control over the output. The biggest issue was that ebook-convert generated a stylesheet that overrode the e-reader's own control of paragraph spacing.
I modified my program above to generate html output:
#include <stdio.h>
#include <ctype.h>

int main ()
{
    /* declare variables */
    int cur;        /* the current character; an int so it can also hold EOF */
    int prev = 0;   /* the previous non-whitespace character */
    int count = 0;  /* the count since the beginning of the current line */
    int i;          /* index for loop */

    /* character conversion array:
       converts special characters to normal values */
    int conv[256];

    // Fill in the array with identity values: map all characters to themselves.
    for (i = 0; i < 256; i++)
        conv[i] = i;

    // Map a set of special characters to whitespace.
    conv['\a'] = ' ';  // bell
    conv['\b'] = ' ';  // backspace
    conv['\t'] = ' ';  // (horizontal) tab
    conv['\v'] = ' ';  // vertical tab
    conv['\f'] = '\n'; // form feed

    // Map these characters to regular ascii replacements.
    conv['\r'] = '\n'; // carriage return (CR)
    /*
    conv[ 30] = 45;    // nonbreaking hyphen -> real hyphen
    conv[ 31] = 45;    // optional hyphen    -> real hyphen
    conv[ 96] = 39;    // left apostrophe    -> single quote
    conv[145] = 39;    // left quote         -> single quote
    conv[146] = 39;    // right quote        -> single quote
    conv[147] = 34;    // left double quote  -> double quote
    conv[148] = 34;    // right double quote -> double quote
    conv[150] = 45;    // long hyphen        -> real hyphen
    conv[151] = 45;    // longer hyphen      -> real hyphen
    conv[160] = 32;    // nonbreaking space  -> real space
    conv[173] = 45;    // short hyphen       -> real hyphen
    */

    printf("<?xml version='1.0' encoding='utf-8'?>\r\n"
           "<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n"
           "<head>\r\n"
           "<meta content=\"text/html; charset=utf-8\" http-equiv=\"Content-Type\"/>\r\n"
           "</head>\r\n"
           "<body>\r\n"
           "<p>");

    while ((cur = getc(stdin)) != EOF) {
        // Translate the character set to ascii.
        cur = conv[cur];
        if (!cur)
            continue;
        if (cur == '\n' && (isalnum(prev) || !ispunct(prev) || prev == ',') && count > 30) {
            // Reached end of line while the previous character is not
            // sentence-ending punctuation (a trailing comma is fine) and
            // we are well across the page: replace the newline with a space.
            putc(' ', stdout);
            count = 0;
            prev = 0;
            //putc('*', stdout);
        } else if (cur == '\n') {
            // New paragraph: eat extra newlines.
            if (count < 2)
                continue;
            //putc('\n', stdout);
            //putc('\n', stdout);
            printf("</p>\r\n<p>");
            count = 0;
            //putc('+', stdout);
        } else {
            // Escape the special html characters in an html way.
            if (cur == '<')
                printf("&lt;");
            else if (cur == '>')
                printf("&gt;");
            else if (cur == '&')
                printf("&amp;");
            //else if (cur == '"')
            //    printf("&quot;");
            // Not positive that you need to escape quotes; fbreader works
            // just fine without escaping double quotes.
            else
                // Just output the character.
                putc(cur, stdout);
            // Keep a copy of the previous non-whitespace character.
            if (isalnum(cur) || ispunct(cur))
                prev = cur;
            // Keep track of where we are on the line.
            count++;
        }
    } /* end while */

    printf("</p>\r\n</body>\r\n</html>");

    /* report success back to environment */
    return 0;
} /* end main */
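To use it, compile the converter and pipe a text file through it (the file names here are just placeholders):

# Compile the converter, then run a text file through it.
gcc -o txt2html txt2html.c
./txt2html < Book.txt > Book.html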
I figured out that I needed to escape ampersands and pointy brackets when most of my book turned up missing in the e-reader. Strangely, Firefox displayed the unescaped html just fine.
epub files are just zip files, with the content in specific places inside the archive. If you unzip one, you get a folder that looks a little like this:
BookTitle
|-- Book.html
|-- content.opf
|-- META-INF
|   `-- container.xml
|-- mimetype
|-- stylesheet.css
|-- toc.ncx
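You can take apart any epub you have on hand the same way (Book.epub is a placeholder name):

# epubs are plain zip archives, so unzip can open one directly.
unzip Book.epub -d BookTitle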
container.xml is locked into one place: it always lives at META-INF/container.xml. It is the first file read, and it looks like this:
<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
The important bit is that this file tells the reader where to find content.opf and what its name is; it doesn't even have to be named content.opf.
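For example, you can pull the rootfile path straight out of container.xml from the shell (assuming the attribute sits on one line, as in the sample above):

# Show which file the container points at.
grep -o 'full-path="[^"]*"' META-INF/container.xml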
content.opf looks like:
<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uuid_id">
  <metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:language>en</dc:language>
    <dc:title>Book Name</dc:title>
    <meta name="calibre:timestamp" content="2011-02-26T00:47:46.052390+00:00"/>
    <dc:creator opf:role="aut">Authors Name</dc:creator>
    <dc:identifier id="uuid_id" opf:scheme="uuid">5b8a6644-8bb1-4b9b-90a3-c1b84daa0040</dc:identifier>
  </metadata>
  <manifest>
    <item href="stylesheet.css" id="css" media-type="text/css"/>
    <item href="Book.html" id="html1" media-type="application/xhtml+xml"/>
    <item href="toc.ncx" media-type="application/x-dtbncx+xml" id="ncx"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="html1"/>
  </spine>
  <guide/>
</package>
This contains the manifest of files that make up the book, plus metadata about the book: title, author, language, and a unique identifier.
The other important file is toc.ncx:
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
  <head>
    <meta content="5b8a6644-8bb1-4b9b-90a3-c1b84daa0040" name="dtb:uid"/>
    <meta content="2" name="dtb:depth"/>
    <meta content="calibre (0.7.18)" name="dtb:generator"/>
    <meta content="0" name="dtb:totalPageCount"/>
    <meta content="0" name="dtb:maxPageNumber"/>
  </head>
  <docTitle>
    <text>Book Title</text>
  </docTitle>
  <navMap>
    <navPoint id="59134a23-813d-49df-902f-bb161bfe8206" playOrder="1">
      <navLabel>
        <text>Start</text>
      </navLabel>
      <content src="Book.html"/>
    </navPoint>
  </navMap>
</ncx>
The UUID should match between the last two files (the dc:identifier in content.opf and the dtb:uid in toc.ncx), and each bit of content gets its own UUID here too. I am assuming this is to let readers remember navigation details: they could store them as key-value pairs tied to the UUID for that one bit of content.
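Going the other way, since an epub is just a zip file, the folder can be packed back up from the shell. The one catch, which comes from the epub spec rather than anything above: the mimetype file (a single line reading application/epub+zip) must be the first entry in the archive and must be stored uncompressed, hence the separate -X0 step.

# Pack the folder back into an epub: mimetype first and stored (-0),
# then everything else compressed; -X leaves out extra file attributes.
cd BookTitle
zip -X0 ../Book.epub mimetype
zip -Xr9 ../Book.epub META-INF content.opf toc.ncx stylesheet.css Book.html

Fresh identifiers for a new book can be generated with uuidgen.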
Next up is a book with multiple chapters, each in its own file. Evidently there is some arbitrary maximum file size that some book readers expect, so the .html files holding the book have to be split into pieces to fit under that maximum. Splitting on chapters also makes it convenient to navigate by chapter.
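Splitting changes content.opf as well: each chapter file needs its own manifest item and spine entry (and its own navPoint in toc.ncx for chapter navigation). A sketch of the relevant section, with file names and ids of my own choosing to match the commands below:

<manifest>
  <item href="stylesheet.css" id="css" media-type="text/css"/>
  <item href="Chapter_00.html" id="ch00" media-type="application/xhtml+xml"/>
  <item href="Chapter_01.html" id="ch01" media-type="application/xhtml+xml"/>
  <item href="toc.ncx" media-type="application/x-dtbncx+xml" id="ncx"/>
</manifest>
<spine toc="ncx">
  <itemref idref="ch00"/>
  <itemref idref="ch01"/>
</spine>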
To split a book into chapters, then convert the chapters to html files:
csplit -f Chapter_ -ks Schmitz\,\ James\ H.\ -\ BookFileName.txt '/^ Chapter/' {50}
rename 's/Chapter_([0-9][0-9])/Chapter_\1.txt/' *
find . -name "Chapter_*.txt" -exec ~/bin/txt2html.sh {} \;
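The ~/bin/txt2html.sh helper isn't shown here. A minimal sketch, assuming it does nothing more than wrap the compiled converter from above (the txt2html binary name is my assumption):

#!/bin/sh
# txt2html.sh -- a sketch, not necessarily the script actually used.
# Assumes the compiled converter is named txt2html and is on the PATH.
in="$1"
out="${in%.txt}.html"
txt2html < "$in" > "$out"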