How do I Gevent?

So, you think you’re going to sneak one past the GIL and turn your single-threaded snorefest (or your multi-threaded headache) into a den of slick greenlets?

Not so fast, buster. It may not be worth it.

tl;dr Gevent shines on IO-bound work; don’t bother trying it out if you’re doing heavy data processing or anything that spends most of its time on in-memory operations. The pattern in the last code block is a practical basis for concurrent systems that spend most of their time waiting on IO calls.

Gevent is IO-bound

In particular, as you’ll see in a moment, it monkey-patches the os, select, socket, ssl, thread and time modules to conform to gevent’s cooperative scheduler. As a general rule, if you’re spending all of your time waiting on any of these, gevent will make your application awesome. A good way to gauge whether that’s the case is to run your program under the excellent cProfile module:

python -m cProfile my_gevent_candidate_program.py

and then look at how much time is spent in socket and the other IO-based libraries.
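If the output is a wall of numbers, sorting by cumulative time makes the IO hotspots easier to spot. A small variation on the same command (the -s flag is cProfile’s built-in sort option):

python -m cProfile -s cumulative my_gevent_candidate_program.py

If most of the cumulative time lands in socket recv/send, ssl reads and friends, you’ve probably found a good gevent candidate.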

Gevent locks on CPU-bound tasks

This may come back to bite you in the ass on occasion, but it is the expected functionality. If the beautiful routines your application is running suck up CPU time, gevent will not be able to automagically flop between greenlets and chances are, you won’t want it to. There are, of course, exceptions to this rule, but that’s not worth getting into right now.
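To see this for yourself, here’s a minimal sketch (the function names are made up for illustration): the number-crunching greenlet never touches IO, so it never yields, and the heartbeat greenlet manages a single beat before going quiet until the crunching finishes.

import gevent

def heartbeat():
    # Would like to beat every ~0.1s, but only runs when it gets scheduled
    for x in xrange(5):
        print 'beat'
        gevent.sleep(0.1)

def crunch():
    # Pure CPU work: no IO, so no cooperative yield until it returns
    total = 0
    for x in xrange(10000000):
        total += x
    print 'crunched', total

gevent.joinall([gevent.spawn(heartbeat), gevent.spawn(crunch)])

If you genuinely need to interleave CPU work, sprinkling a gevent.sleep(0) into the hot loop forces a cooperative yield, at the cost of extra context switches.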

Gevent monkey-patches the Python stdlib

This is where things can get really hairy with your dependencies - in general most libraries will play nice with the changes, but there will always be certain ones that just explode in your face when you try to use them with gevent. As far as I know, there’s no definitive list out there yet for these, but based on my present knowledge you’re fairly safe to use the following:

  • python-requests: if you don’t know it yet, then you’re in for a treat.
  • pymongo: the official MongoDB driver, also officially supports greenlets (even does some nice pooling of connections to improve your usage)
  • amqplib: one of two AMQP libraries I’ve tested on gevent, so far no issues
  • haigha: and here’s the other. This one is a bit more powerful but can be kind of finicky when it comes to debugging; when in doubt, it’s always turned out to be me doing something stupid, not haigha. Pro tip: using a second channel for publishing messages seems to fix a lot of socket contention issues.

note: anyone who has tested gevent with other IO libraries: let me know so I can update this list

Gevent requires libevent

Chances are you’re not going to care about this, but you can’t necessarily assume your sysadmin/cloud host/botnet/custom hardware/toaster will support libevent.

Ok ok, just get to the good stuff already…

Alright, so you’ve decided your application needs some gevented magic, or just want to play around with it. Let’s talk a bit about what is happening in the background, so you have a clue when it comes to the foreground.

Welcome to cooperative scheduling

Like any modern IO system, gevent is built around scheduling. You could do this with a promise-like system (asking for things to be done after an async request completes), but gevent’s scheduling pretty much hides itself in your code, thanks to those monkey patches from before. The scheduler switches between greenlet contexts quickly and frequently, giving each of your greenlets as much airtime as it deserves (this is important to remember). Whenever one of these greenlets hits an IO-bound job, it hands the work off to libevent and yields to the scheduler to allow a context switch. Beyond that, the internals are interesting, but not necessary to understand at this point.
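Here’s a throwaway sketch of what that hiding looks like in practice: after patch_all(), the apparently blocking time.sleep() below is really gevent’s cooperative sleep, so the two tasks interleave instead of running back to back.

import gevent.monkey
gevent.monkey.patch_all()

import time
import gevent

def task(name):
    for x in xrange(3):
        print name, x
        # Looks like a blocking call, but the patched time.sleep
        # yields to the scheduler so the other task can run
        time.sleep(0.1)

gevent.joinall([gevent.spawn(task, 'tick'), gevent.spawn(task, 'tock')])

The output alternates tick 0, tock 0, tick 1, tock 1, and so on, because every sleep gives the scheduler a chance to switch greenlets.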

Getting started

Let’s start with a simple first-try at a web crawler - we’ll take a seed url and keep following links from there:

import sys
import re
import requests

# Count how many pages we've crawled
crawled = 0

def crawler(u):
    '''A very simple web crawler'''
    global crawled

    # Crawl the page, print the status
    response = requests.get(u)
    print response.status_code, u

    # Extract some links to follow using a *really* bad regular expression
    for link in re.findall('<a href="(http.*?)"', response.content):

        # Limit to 10 pages
        if crawled < 10:
            crawled += 1
            crawler(link)

# Read the seed url from the command line
crawler(sys.argv[1])

The first thing to notice is that this crawler is depth-first and recursive. That is an extremely bad idea, but also the easiest thing to implement in a single-threaded system. The total run time grows linearly with the number of pages, since we load each webpage sequentially, one request at a time. Now that we’ve got our base, let’s see some gevent goodness. Every gevent application should start with a monkey:

import gevent.monkey
gevent.monkey.patch_all()

That’s all you need to set up the gevent scheduler: once the stdlib (or a subset, if that’s what you’re into) has been patched, everything will start piping itself through gevent. This will amount to, well, absolutely no benefit to start off! Your application will run just as single-threaded as before, and you may start to feel like reading this far was time that should have been spent finally starting that knitting business you’ve been dreaming about. Calm down, the magic is on its way.
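If patching everything feels heavy-handed, the monkey module also exposes individual patchers, sketched here (check the gevent.monkey docs for the exact options your version supports):

import gevent.monkey

# Only take over the networking bits, leave the thread module alone
gevent.monkey.patch_socket()
gevent.monkey.patch_ssl()

# or, roughly equivalently, patch everything except threading
# gevent.monkey.patch_all(thread=False)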

Let’s get restructuring

Our first basic approach to the gevent crawler looks like this:

# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent
import sys
import re
import requests

# List for holding our greenlets
greenlets = []

def crawler(u):
    '''A very simple gevented web crawler'''

    # Crawl the page, print the status
    response = requests.get(u)
    print response.status_code, u

    # Extract some links to follow
    for link in re.findall('<a href="(http.*?)"', response.content):

        # Limit to 10 pages
        if len(greenlets) < 10:
            greenlets.append(gevent.spawn(crawler, link))

# Read the seed url from the command line
greenlets.append(gevent.spawn(crawler, sys.argv[1]))

# Wait until we've spawned enough url requests (assumes we find at least 10 links)
while len(greenlets) < 10:
    gevent.sleep(1)

# Wait for everything to complete
gevent.joinall(greenlets)

Already looking much better. We’re now breadth-first (a much better idea when it comes to web crawling), and in theory the wall-clock time is now roughly two sequential page loads: the seed page, then all of its links fetched concurrently.

This crawler shows off the nice feature that greenlets are cheap. In general, you can parallelize by just adding a greenlet per IO task, and the scheduler will take care of organizing them. Unfortunately, this is far from a useful crawler - we can’t control much about it, and since we’re doing one HTTP request per greenlet we can expect that opening up the tap a bit wider may cause some network issues. So deeper down the rabbit hole we go!
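As a quick aside to back up the greenlets-are-cheap claim, this throwaway sketch spawns ten thousand of them, and gevent barely notices:

import gevent

def tiny_task(n):
    # Yield once to the scheduler, then finish
    gevent.sleep(0)
    return n

# Ten thousand greenlets is not a problem
greenlets = [gevent.spawn(tiny_task, n) for n in xrange(10000)]
gevent.joinall(greenlets)
print sum(g.value for g in greenlets)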

Introducing worker pools

Now, instead of spawning a greenlet per action, we can just tell a pool of greenlets to churn through our actions.

# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent.pool
import sys
import re
import requests

# Prepare a pool for 5 workers
pool = gevent.pool.Pool(5)

# Crawl tracker is back
crawled = 0

def crawler(u):
    '''A very simple pooled gevent web crawler'''
    global crawled

    # Crawl the page, print the status
    response = requests.get(u)
    print response.status_code, u

    # Extract some links to follow
    for link in re.findall('<a href="(http.*?)"', response.content):

        # Limit to 10 pages (ignores links when the pool is already full)
        if crawled < 10 and not pool.full():
            crawled += 1
            pool.spawn(crawler, link)

# Read the seed url from the command line
pool.spawn(crawler, sys.argv[1])

# Wait for everything to complete
pool.join()

Pools are joinable, meaning that we can tell our application to wait until all the requested actions have been processed (in gevent-speak, until every greenlet in the pool has reached its ready state).
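If you’re worried about a join that never returns, join() also accepts a timeout, so a small defensive sketch (numbers picked arbitrarily) might look like:

# Give the pool 30 seconds to finish up
pool.join(timeout=30)

# Anything still running at this point is probably stuck; kill it off
pool.kill()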

In practice this can sometimes cause some headaches, as individual pool actions might inexplicably never reach the ready state (infinite loop, anyone?), and as a result the entire pool may never be safely joined. Nevertheless, our pooling code feels cleaner, except for one block that looks very, very wrong:

# Limit to 10 pages (ignores links when the pool is already full)
if crawled < 10 and not pool.full():
    crawled += 1
    pool.spawn(crawler, link)

Considering that our gevent.spawn example allowed us to do a full breadth-first search, we’re now throwing away results whenever our pool is busy. This forces a lossy breadth-first search, discarding urls depending on how quickly their parent pages happened to load. Wacky. But why?

Pool spawning is a blocking action

When you attempt to spawn a greenlet on a gevent pool, gevent will check whether the pool is full, and wait for a slot to free up if it is. This causes some problems: since each of the crawler greenlets is calling pool.spawn, if at any point we already have 5 (our pool size) crawlers active and they all find links, they will all block in pool.spawn at the same time, leaving you with some very hungry philosophers.

If you take out the not pool.full() check, your application will happily consume ~5 urls, and then each greenlet will wait patiently for all eternity for a free slot, without realizing every other greenlet is doing the same!
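Here’s the same blocking behaviour stripped down to a toy example, completely outside the crawler (a hypothetical pool of one plus two napping greenlets):

import gevent
import gevent.pool

pool = gevent.pool.Pool(1)

def nap():
    gevent.sleep(10)

# The first spawn fills the pool immediately
pool.spawn(nap)
print 'pool full?', pool.full()

# This second spawn blocks for ~10 seconds, until the first
# greenlet finishes and frees up its slot
pool.spawn(nap)
print 'finally got a free slot'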

But we want to crawl everything!

Thankfully, despite the hiccups with using pooling, there’s one more data structure we’ve been missing that will turn our crawler from a silly toy into a powerhouse web crawling application: queueing.

# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent.pool
import gevent.queue

import sys
import re
import requests

# Prepare a pool for 5 workers and a messaging queue
pool = gevent.pool.Pool(5)
queue = gevent.queue.Queue()
crawled = 0

def crawler():
    '''A very simple queued gevent web crawler'''
    global crawled

    while 1:
        try:
            u = queue.get(timeout=1)
            response = requests.get(u)
            print response.status_code, u

            # Extract some links to follow
            for link in re.findall('<a href="(http.*?)"', response.content):
                # Limit to 10 pages
                if crawled < 10:
                    crawled += 1
                    queue.put(link)

        except gevent.queue.Empty:
            break

# Seed the queue with the url from the command line
queue.put(sys.argv[1])

# Spawn our 5 workers
for x in xrange(0, 5):
    pool.spawn(crawler)

# Wait for everything to complete
pool.join()

Looks good, right? It will work, but only under very specific circumstances: we’re depending on the seed url taking less than one second to load. Any longer and the rest of the workers will hit the queue timeout and exit, leaving us with exactly one worker. But otherwise, not a bad first attempt, so let’s look at what’s happening:

Workers are now their own loops

Previously we were spawning a new worker every time we wanted to crawl a url. That fits the greenlets-are-cheap maxim, but under that approach each worker would consume exactly one queue message and then exit. Not the most efficient system in the world (you can actually mimic that setup by just removing the while loop and adding in one more small block; we’ll visit that in a minute).

Instead, we spawn each of our greenlets and have it wait for queue messages until none are left (or, rather, until the queue has been empty for longer than one second).

We spawn all of our workers upfront

Previously, we were spawning workers on-demand as we had pool availability and new urls to crawl. Now we’re spawning them all upfront and assuming we’ll fill in the queue as needed. This is not a particularly bad practice (if you’ll recall, greenlets are cheap), but it still feels kind of dirty…

Let’s do one better

In order to optimize our usage of gevent (context switches don’t cost much, but they still have a cost), we want to:

  1. consume every message in the queue or
  2. fill the pool with workers

Whichever is possible, with the obvious caveat that our pool will only let us fulfill #2 if the queue already has more messages than we are allowing workers.

# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent.pool
import gevent.queue

import sys
import re
import requests

# Prepare a pool for 5 workers and a messaging queue
pool = gevent.pool.Pool(5)
queue = gevent.queue.Queue()
crawled = 0

def crawler():
    '''A very simple queued gevent web crawler'''

    print 'starting crawler...'
    global crawled

    while 1:
        try:
            u = queue.get(timeout=0)
            response = requests.get(u)
            print response.status_code, u

            # Extract some links to follow
            for link in re.findall('<a href="(http.*?)"', response.content):
                # Limit to 10 pages
                if crawled < 10:
                    crawled += 1
                    queue.put(link)

        except gevent.queue.Empty:
            break

    print 'stopping crawler...'

queue.put(sys.argv[1])
pool.spawn(crawler)

# Keep topping up the pool while there is work queued or workers still running
while not queue.empty() or pool.free_count() != 5:
    gevent.sleep(0.1)
    for x in xrange(0, min(queue.qsize(), pool.free_count())):
        pool.spawn(crawler)

# Wait for everything to complete
pool.join()

Running this will show that we spawn exactly one worker to start, which then loads the seed url and pushes the rest of our messages into the queue - after which the other 4 workers start up and we’re crawling!

And now you can Gevent

These patterns form the basis of a concurrent, IO-bound processing system, and can be applied to anything that spends its time waiting on IO: system IO calls (lots of reads/writes to the local filesystem), local threading, or any kind of network IO.
