The Two Minute Guide to Simple e-Books

Sometimes I'll find a new website where the content is so entrancing that I want to read through all the archives. But it's pretty time-consuming to sit and read everything at once, and I always lose track of where I am.

Luckily I have a Kindle, a computer, and Python. So I've started throwing together simple e-books with this content. I get the nice auto-bookmarking and screen of the Kindle, making for a very nice reading experience.

Here's the quick step-by-step guide to packaging up some reading material for those long flights.

Find some content

The content could be someone's blog, a couple books from Project Gutenberg, or a batch of e-mails you've been meaning to read. Pull in a good amount of content.

I use the EPUB format. At its core, this format is just zipped-up HTML files. So long as your content is simple (X)HTML, then it can be pulled in with little to no modification.
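To see what "zipped-up HTML" means in practice, here is a minimal sketch that builds a toy archive in memory with the standard-library zipfile module and lists what is inside. The entry name OEBPS/chapter_1.xhtml and the chapter text are made up for illustration, and the archive leaves out the package metadata a real reader needs, so treat it as a look at the layout rather than a valid book.

```python
import io
import zipfile

# Hypothetical chapter; any simple (X)HTML page would do.
CHAPTER = "<html><body><h1>Chapter 1</h1><p>Hello.</p></body></html>"


def build_toy_archive():
    """Return the bytes of a zip laid out roughly like an EPUB container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # EPUB expects the mimetype entry first and stored uncompressed.
        archive.writestr(
            "mimetype",
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        # The chapters themselves are ordinary compressed files.
        archive.writestr(
            "OEBPS/chapter_1.xhtml",
            CHAPTER,
            compress_type=zipfile.ZIP_DEFLATED,
        )
    return buffer.getvalue()


def list_entries(data):
    """List the file names stored in a zip given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


toy = build_toy_archive()
print(list_entries(toy))
```

Running the same list_entries on the bytes of any real .epub file should show a similar mix of a mimetype entry, some metadata, and the (X)HTML chapters.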

(It's good form to ask before scraping someone's website, especially if you aren't careful with your throttling.)
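Besides asking, one way to check what a site is comfortable with is its robots.txt file. The sketch below uses the standard-library urllib.robotparser on a made-up robots.txt, so it runs without touching the network. The user agent name "my-ebook-script" and the rules themselves are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real site serves its own at /robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# Hypothetical user agent name for the scraping script.
agent = "my-ebook-script"
print(rules.can_fetch(agent, "http://example.com/1.html"))
print(rules.can_fetch(agent, "http://example.com/private/1.html"))
print(rules.crawl_delay(agent))
```

If the site declares a crawl delay, that number is a reasonable pause to put between requests.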

Get that content

For the most part, I use a small Python script with requests to fetch web pages and save them to disk. Saving the content means you don't have to re-fetch anything when you tweak later steps.

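#!python
import os
import time

import requests

url_template = "http://example.com/%s.html"
build_dir = "build/"

# Create the output directory if it isn't there yet.
os.makedirs(build_dir, exist_ok=True)

# Pair each local filename with the page it should hold.
pages = [
    (os.path.join(build_dir, "%s.html" % i), url_template % i)
    for i in range(500)
]

for filename, page_url in pages:
    print("Getting", page_url)
    response = requests.get(page_url)
    response.raise_for_status()  # stop on 404s and other HTTP errors
    with open(filename, 'wb') as f:
        f.write(response.content)
    time.sleep(1)  # arbitrary one-second pause to go easy on the server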
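Because the saved files double as a cache, a re-run can skip any page that is already on disk. Here is a minimal sketch of that idea. The helper name fetch_once and its fetch parameter are hypothetical; the demo passes in a stand-in fetcher so it runs offline.

```python
import os
import tempfile


def fetch_once(url, path, fetch):
    """Return the bytes for `url`, calling `fetch` only if `path` is missing.

    `fetch` is any callable that takes a URL and returns bytes, for example
    `lambda u: requests.get(u).content`.
    """
    if os.path.exists(path):
        # Already downloaded on an earlier run: read it back from disk.
        with open(path, "rb") as f:
            return f.read()
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return data


# Stand-in fetcher that records its calls instead of using the network.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html>page</html>"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "0.html")
    first = fetch_once("http://example.com/0.html", target, fake_fetch)
    second = fetch_once("http://example.com/0.html", target, fake_fetch)

print(len(calls))
```

The second call finds the file on disk and never reaches the fetcher, which is what makes tweaking the later steps cheap.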

Format the content

The EPUB format works the way you would expect: content is broken up into chapters, laid out in sequential order.

How you want the content displayed is up to you, but the end result should be a collection of HTML pages, one per chapter.
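One thing to watch with sequential order: if the pages are saved under numbered names such as 1.html, 2.html, 10.html, a plain alphabetical sort puts 10.html before 2.html. A small sketch of sorting by the number instead, with hypothetical helper names:

```python
import re


def chapter_number(filename):
    """Pull the first run of digits out of a file name like '12.html'."""
    match = re.search(r"\d+", filename)
    # Files without a number sort first; adjust if that doesn't suit you.
    return int(match.group()) if match else -1


def in_reading_order(filenames):
    """Sort file names by their embedded number rather than alphabetically."""
    return sorted(filenames, key=chapter_number)


print(in_reading_order(["10.html", "2.html", "1.html"]))
```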

You can also set some CSS styling, but in my case I don't bother. The Kindle's default text handling is pretty good already.

For web page parsing I opt for pyquery.

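#!python
from pyquery import PyQuery as pq
from ebooklib import epub

build_dir = "build/"  # same directory the fetch script saved into


def make_chapter(filename):
    # Parse the saved page from disk.
    page = pq(filename=build_dir + filename)

    # These selectors are specific to the site being scraped; adjust for yours.
    content = page.find('#c1')
    title = content.find('h1').text()
    date = page.find('.s i').text()

    chapter = epub.EpubHtml(
        title=date + ' : ' + title,
        file_name=filename,
        lang='en'
    )
    # Lead each chapter with its date, then the page body.
    # html() returns None when nothing matched, so fall back to an empty string.
    chapter.content = '<i>' + date + '</i>' + (content.html() or '')

    return chapter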

Package up your content

I use the nifty ebooklib to package the content up. The example script given in the project is straightforward: give this library a bunch of chapters and it will generate a nice file to hand over.

#!python
from ebooklib import epub

book = epub.EpubBook()
...
for chapter in chapters:
    book.add_item(chapter)
...
epub.write_epub('output.epub', book, {})

Get the file to your reading device

I use Calibre or the Send to Kindle feature that lets you send files via e-mail.

Do whatever works for you, but the e-mail route is particularly fun when hooked up with automation (get your weekly reports automatically on your device!).
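For the e-mail route, the standard library can build and send the message. This is a rough sketch, not a tested delivery setup: the addresses are placeholders, the function names are hypothetical, and the SMTP host, port, and login details depend entirely on your mail provider. The demo only builds the message; send is defined but never called.

```python
import smtplib
from email.message import EmailMessage


def build_message(epub_bytes, filename, sender, recipient):
    """Wrap an EPUB file's bytes in an e-mail with one attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "New e-book: " + filename
    message.set_content("E-book attached.")
    message.add_attachment(
        epub_bytes,
        maintype="application",
        subtype="epub+zip",  # the registered media type for EPUB
        filename=filename,
    )
    return message


def send(message, host, port, user, password):
    """Send over SMTP with STARTTLS. Not called in this sketch."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(message)


# Placeholder bytes and addresses; swap in your real file and device address.
demo = build_message(
    b"not a real epub",
    "output.epub",
    "me@example.com",
    "my-device@example.com",
)
attachment = next(demo.iter_attachments())
print(attachment.get_filename(), attachment.get_content_type())
```

Hooked up to a scheduler such as cron, a script along these lines is one way to get the weekly-report style of automation.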

This works well for simple stuff (I've never bothered with getting images working, for example), but I have an IPython Notebook with the full script I use as a template when working on these. Play around with it, and get reading!