<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><description>Freelance developer, passionate about distributed systems, web crawling &amp; new tech.</description><title>How now, Stephen?</title><generator>Tumblr (3.0; @hownowstephen)</generator><link>https://blog.hownowstephen.com/</link><item><title>Running multiple local Riak instances on OSX</title><description>&lt;p&gt;I recently had the need to set up a couple of different Riak servers locally for development purposes. Since there was no documentation for doing this, here we go!
&lt;br/&gt;&lt;b&gt;Assumptions:&lt;/b&gt;
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;You have &lt;a href="http://brew.sh/"&gt;Homebrew&lt;/a&gt; installed
&lt;/li&gt;&lt;li&gt;You&amp;rsquo;re comfortable in the command line
&lt;/li&gt;&lt;li&gt;You don&amp;rsquo;t have an existing installation of Riak on your OSX device
&lt;/li&gt;&lt;/ul&gt;

Firstly, you&amp;rsquo;re going to need to pull the latest Riak source from &lt;a href="http://docs.basho.com/riak/latest/downloads/"&gt;http://docs.basho.com/riak/latest/downloads/&lt;/a&gt;

At the time of writing, that is v2.0.0:
&lt;pre&gt;&lt;code class="bash"&gt;
curl -O &lt;a href="http://s3.amazonaws.com/downloads.basho.com/riak/2.0/2.0.0/riak-2.0.0.tar.gz"&gt;http://s3.amazonaws.com/downloads.basho.com/riak/2.0/2.0.0/riak-2.0.0.tar.gz&lt;/a&gt;
tar -xvf riak-2.0.0.tar.gz
cd riak-2.0.0
&lt;/code&gt;&lt;/pre&gt;

Next up, you need the right version of Erlang/OTP - for Riak 2.0, that means installing a back version of Erlang (R16). Woo!

&lt;pre&gt;&lt;code class="bash"&gt;brew install homebrew/versions/erlang-r16&lt;/code&gt;&lt;/pre&gt;

That should install to &lt;b&gt;/usr/local/Cellar/erlang-r16/&lt;/b&gt;, so now it&amp;rsquo;s time to build our development versions of Riak. First, alter your PATH so that OTP R16 is the first version of Erlang that bash will find; then you can build the development instances:

&lt;pre&gt;&lt;code class="bash"&gt;
export PATH=/usr/local/Cellar/erlang-r16/R16B03-1_1/bin/:$PATH
make devrel
&lt;/code&gt;&lt;/pre&gt;

This should create 8 database instances for you in ./dev:
&lt;figure data-orig-height="18" data-orig-width="277" data-orig-src="https://64.media.tumblr.com/a8410e44db1c7d129bcb4501fcbc72b7/tumblr_inline_ncepj1MRJG1r0m1xd.jpg"&gt;&lt;img src="https://64.media.tumblr.com/a8410e44db1c7d129bcb4501fcbc72b7/tumblr_inline_pk15xzRhqw1r0m1xd_540.jpg" data-orig-height="18" data-orig-width="277" data-orig-src="https://64.media.tumblr.com/a8410e44db1c7d129bcb4501fcbc72b7/tumblr_inline_ncepj1MRJG1r0m1xd.jpg"/&gt;&lt;/figure&gt;

By default these are configured as a cluster - which, for my purposes, was not so useful. Thankfully, each also acts as an independent database, so if you edit the line in devN/etc/riak.conf that looks like

&lt;pre&gt;&lt;code&gt;distributed_cookie = riak&lt;/code&gt;&lt;/pre&gt;

and set a distinct cookie for each node, you can run multiple distinct clusters locally (up to 8 with the default configuration). For my purposes, I set up two clusters of 4 nodes each:&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;&lt;code&gt;# dev1, dev3, dev5, dev7
distributed_cookie = riak-odds

# dev2, dev4, dev6, dev8
distributed_cookie = riak-evens
&lt;/code&gt;&lt;/pre&gt;</description><link>https://blog.hownowstephen.com/post/98304505821</link><guid>https://blog.hownowstephen.com/post/98304505821</guid><pubDate>Wed, 24 Sep 2014 09:07:28 -0400</pubDate><category>riak</category><category>databases</category><category>osx</category></item><item><title>How to bypass "/bin/rm: Argument list too long"</title><description>The following passes files to the &lt;strong&gt;rm&lt;/strong&gt; program 100 at a time, ensuring that it never gets overwhelmed.

&lt;pre&gt;&lt;code class="bash"&gt;find . -name "REGEX" | xargs -n 100 rm&lt;/code&gt;&lt;/pre&gt;

If you&amp;rsquo;re looking to squeeze the most out of &lt;strong&gt;rm&lt;/strong&gt; (though for all my use cases, 100 at a time seems sufficient), &lt;a href="http://unix.stackexchange.com/questions/45143/what-is-a-canonical-way-to-find-the-actual-maximum-argument-list-length"&gt;check out this StackExchange thread to work out what your absolute argument limit is&lt;/a&gt;. &lt;em&gt;&lt;strong&gt;Protip&lt;/strong&gt;: it&amp;rsquo;s in bytes and it includes your environment data, so it&amp;rsquo;s better to stick with a small-ish number.&lt;/em&gt;</description><link>https://blog.hownowstephen.com/post/63079529941</link><guid>https://blog.hownowstephen.com/post/63079529941</guid><pubDate>Fri, 04 Oct 2013 09:09:00 -0400</pubDate><category>bash</category><category>one liner</category><category>rm</category><category>argument list too long</category></item><item><title>Getting past the NoSQL curve with MongoDB</title><description>&lt;p&gt;A few months back a friend of a friend came to me asking whether MongoDB was &lt;em&gt;actually&lt;/em&gt; worth switching to from a fairly large-scale MySQL setup. My short answer was: not unless your project is about to blow up (or is being rebuilt anyway), but my longer answer brought me to a list of the things I’ve learned about Mongo in the last couple of years that could at least help him to inform himself…&lt;/p&gt;

&lt;h3 id="firstlyitsagoodtime"&gt;Firstly, it’s a good time&lt;/h3&gt;

&lt;p&gt;As Luigi Montanez discusses in &lt;a href="http://luigimontanez.com/2011/mongodb-2.0-should-have-been-1.0/"&gt;Mongodb 2.0 Should Have Been 1.0&lt;/a&gt;, the database didn’t &lt;em&gt;really&lt;/em&gt; reach proper stability until &lt;em&gt;at least&lt;/em&gt; the v2.0 release in September of 2011 (later for those of us waiting on some of the 2.2+ features to make abandoning SQL worthwhile). Up until that release, brave souls using Mongo in production were risking &lt;strong&gt;a lot&lt;/strong&gt; of headaches for the (admittedly awesome) benefits of this simple NoSQL database. Thankfully, the team at 10gen has done some very awesome work to bring it up to a production-grade database, most notably with the introduction of the &lt;a href="http://stackoverflow.com/questions/13908438/is-mongodb-aggregation-framework-faster-than-map-reduce"&gt;aggregation framework in v2.2&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="structuringyourcollections"&gt;Structuring your collections&lt;/h3&gt;

&lt;p&gt;Mongo has no simple way of performing SQL-like JOIN queries - so you’re going to have to be much more clever about how you organize things. Remember database normalization? Forget it. Duplication of data comes with the territory, and (at least for now) it appears to be a necessary evil. When it comes to querying one collection versus three, you’ll start to see the value.&lt;/p&gt;

&lt;p&gt;We’re &lt;strong&gt;&lt;a href="http://stackoverflow.com/questions/15589184/what-does-being-schema-less-mean-for-a-nosql-database"&gt;schemaless&lt;/a&gt;&lt;/strong&gt; (or &lt;a href="https://blog.serverdensity.com/mongodb-schema-design-pitfalls/"&gt;we pretend to be&lt;/a&gt;). One unfortunate byproduct is that every one of your keys takes up precious space. If you’re in it for big data (we’re talking tens of millions of documents), consider that by using single-character keys, you stand to save meaningfully on storage and RAM.&lt;/p&gt;

&lt;pre&gt;&lt;code class="javascript"&gt;
{"a": "1234 Fake Street", "p": "1-800-com-pany", "e": "email@company.com"}
&lt;/code&gt;&lt;/pre&gt;
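Back-of-the-envelope math for that claim: BSON stores each key name inside every document, so the savings scale with document count. A quick sketch (field names hypothetical):

```python
def key_savings(n_docs, renames):
    """Approximate bytes saved by shortening keys across n_docs documents.

    BSON embeds every key name as a C string in each document, so each
    document saves the difference in name lengths for every renamed key.
    """
    per_doc = sum(len(old) - len(new) for old, new in renames.items())
    return n_docs * per_doc

# e.g. address/phone/email shortened to a/p/e across ten million documents
saved = key_savings(10_000_000, {"address": "a", "phone": "p", "email": "e"})
```

That works out to 140,000,000 bytes, roughly 140 MB of raw key names, for just three fields.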

&lt;p&gt;Be aware that &lt;a href="http://docs.mongodb.org/manual/reference/limits/"&gt;documents can only be 16mb&lt;/a&gt;. This can be a pain, though you may think &lt;em&gt;who stores 16 megabytes in a single “row” anyway?&lt;/em&gt; It’ll happen, and you’ll suddenly not understand what is up with your application. If you’re adding fields on the fly, be aware of what is going into them and react accordingly. Also note that your keys count toward the 16mb limit - see the note on short keys above.&lt;/p&gt;

&lt;p&gt;Less worrisome, but worth noting from the limits documentation is that you can only nest BSON documents up to 100 levels. But if you’re doing more than that, you may need to rethink your life choices anyway.&lt;/p&gt;
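The nesting limit is easy to guard against before an insert ever happens. A small sketch that measures how deep a document goes (pure Python, no driver needed; whether arrays count as levels is a detail I'm glossing over here):

```python
def doc_depth(value):
    """Nesting depth of a document: dicts and lists each add a level."""
    if isinstance(value, dict):
        children = value.values()
    elif isinstance(value, list):
        children = value
    else:
        return 0
    return 1 + max((doc_depth(child) for child in children), default=0)
```

Reject anything whose depth approaches 100 before it hits the server, and you get a clear client-side error instead of a mid-write surprise.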

&lt;p&gt;If you’re storing files, mongo actually has a built-in &lt;a href="http://docs.mongodb.org/manual/core/gridfs/"&gt;Distributed Filesystem (GridFS)&lt;/a&gt; that is made for that, and if you’re storing text, be aware of what the text sizes are going to be. For example, I work with documents in which I store lists of thousands of urls, and still stay well within the document size limit. The limit also applies to the result of aggregation framework queries, so be advised (especially if you’re writing ones that you expect to take a while - it’s always disappointing to get to the end and discover the result is too big). One advantage of the aggregation framework is that intermediate steps &lt;strong&gt;do not&lt;/strong&gt; carry this same constraint.&lt;/p&gt;

&lt;h3 id="whileonthesubjectofram"&gt;While on the subject of RAM…&lt;/h3&gt;

&lt;p&gt;You’re going to use a lot of memory really poorly to start with. And this is because you get the wonderful task of &lt;a href="http://edgystuff.tumblr.com/post/43082387880/mongodb-indexing-tip-1-find-your-friends-recent"&gt;writing your own indexes&lt;/a&gt;. Some things to know right off the bat:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Indexes are prefix-driven, so given db.test.ensureIndex({field1: 1, field2: 1}) and db.test.ensureIndex({field1: 1, field2: 1, field3: 1}), the first index is actually redundant: queries on field1 and field2 can use the second index. But as noted below, key order still matters.&lt;/li&gt;
&lt;li&gt;Indexes should (as much as possible) be designed to fit in memory. This means you’re going to need your thinking cap on when designing your data structures to make as many of your queries fit as few indexes as possible.&lt;/li&gt;
&lt;li&gt;Indexing starts to be important when you hit tens of thousands of documents in a collection, and takes awhile to rebuild when you’re in the hundreds of thousands. Building the right indexes has very literally improved my applications’ performance by up to &lt;strong&gt;100x&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Mongo still doesn’t support compounding of separate indexes (though I believe they’re working on it) - so if you have two indexes and query on fields in both of them, it will just choose one of them, instead of self-optimizing. Hopefully this will just come as a nice bonus in the near future!&lt;/li&gt;
&lt;li&gt;The order of keys in a compound index matters, and determines which queries it can serve:&lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;&lt;code class="javascript"&gt;
// A table with these indexes
{"_id": 1}
{"name": 1, "address": -1}

// Will not use either of them for this query
{"address": "1234", "name": "fake", "_id": {"$gt": "somevalue"}}

// And will use ONLY the second index for
{"name": 1, "_id": {"$gt": "somevalue"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="andaboutthoseodms"&gt;And about those ODMs&lt;/h3&gt;

&lt;p&gt;Mapping Mongo documents to (your language here) objects seems like it should be trivial - it’s all object notation, after all! In practice, everything I’ve worked with is a lot messier than you’d expect. In the Python world, &lt;a href="https://github.com/hmarr/mongoengine"&gt;MongoEngine&lt;/a&gt; seems to be the go-to ODM. It has clean Django-like syntax, but is unfortunately a little too tied to the SQL concepts of a traditional ORM. And since the aggregation framework is pretty much the way to go for any moderately complex queries anyway, I tend to land on the side of a client-side validator (like JSONSchema) and working directly with the BSON.&lt;/p&gt;
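As a sketch of that validate-then-insert approach - hand-rolled type checks standing in for a real JSONSchema library, with hypothetical field names:

```python
def validate(doc, schema):
    """Return True if doc carries every schema field with the right type."""
    for field, expected_type in schema.items():
        if field not in doc:
            return False
        if not isinstance(doc[field], expected_type):
            return False
    return True

# A "schema" is just a dict of field name to expected type
user_schema = {"name": str, "email": str, "age": int}
```

Run every document through validate before inserting, then hand the plain dict straight to the driver - no ORM layer in between.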

&lt;p&gt;Would love to know if any ODMs really have gotten it right though!&lt;/p&gt;

&lt;h3 id="butatleastitsallperformant"&gt;But at least it’s all performant!&lt;/h3&gt;

&lt;p&gt;Until it isn’t. Mongo database performance is generally &lt;a href="http://www.quora.com/MongoDB/What-are-some-best-practices-for-optimal-performance-of-MongoDB-particularly-for-queries-that-involve-multiple-documents"&gt;somewhat like a cliff&lt;/a&gt;: it will perform amazingly up until a certain critical point (memory/cpu/bad database structuring), at which point performance drops off sharply. Usually this is down to user error or insufficient server resources, but it’s worth noting, as it can and will make you pull your hair out.&lt;/p&gt;

&lt;h3 id="whatisthisaggregateframeworkyoukeepmentioning"&gt;What is this aggregate framework you keep mentioning&lt;/h3&gt;

&lt;p&gt;In short: it is map-reduce on steroids. At length: it is the best thing to come out of the MongoDB project to date. By letting you build unix-y data pipelines, it can pull out &lt;strong&gt;very sophisticated metrics&lt;/strong&gt; about your large datasets painlessly. In a lot of cases I’ve seen it replace large blocks of client-side logic - and, since it runs directly within the database itself, improve performance massively (my first big win with the aggregation framework was more than 600x faster). It’s easier to just &lt;a href="http://docs.mongodb.org/manual/applications/aggregation/"&gt;read the documentation and work through the examples&lt;/a&gt; than it is to explain!&lt;/p&gt;
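To give a flavor of the pipeline shape without a running database, here is a toy in-memory evaluator for two stages, a $match filter and a counting $group. Real pipelines go to collection.aggregate on a driver; the field names here are made up:

```python
def run_pipeline(docs, pipeline):
    """Toy take on a $match / $group-count pipeline: each stage transforms
    the document stream and passes it along, unix-pipe style."""
    for stage in pipeline:
        if "$match" in stage:
            spec = stage["$match"]
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in spec.items())]
        elif "$group" in stage:
            key = stage["$group"]["_id"].lstrip("$")
            counts = {}
            for d in docs:
                counts[d[key]] = counts.get(d[key], 0) + 1
            docs = [{"_id": k, "count": n} for k, n in counts.items()]
    return docs
```

The point is the shape: a pipeline is just an ordered list of stage documents, and each stage's output is the next stage's input.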

&lt;h3 id="thereisalsooftentalkofeventualconsistency.whatsthedeal"&gt;There is also often talk of “&lt;a href="http://blog.mongodb.org/post/498145601/on-distributed-consistency-part-2-some-eventual"&gt;eventual consistency&lt;/a&gt;”. What’s the deal?&lt;/h3&gt;

&lt;p&gt;This is the idea that reads made just after a write may return slightly stale data - which only really applies when reading from slave instances, something that is disabled by default. Speaking from my own experience writing high-frequency web crawlers (millions of webpages/day across hundreds of servers, performing tens of millions of database ops), this is only a problem you’ll encounter at unbelievably high load (telecom, high-frequency trading, etc.) and is not grounds for giving up on Mongo.&lt;/p&gt;

&lt;h3 id="otherthoughts"&gt;Other thoughts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Adding a .pretty() to the end of a find query makes things a lot easier to look at. findOne returns a single document rather than a cursor, so it is going to upset you if you want to extend the same query&lt;/li&gt;
&lt;li&gt;You can add additional features to your Mongo shell, it’s just a Javascript prompt. Try adding some useful libraries, &lt;a href="https://gist.github.com/vidoss/2178987"&gt;like underscore.js&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;BSON is not JSON. If you want to send a BSON object across the wire, the BSON library (usually installed with your mongo driver) can handle serializing objects&lt;/li&gt;
&lt;li&gt;facebook.com is a valid ObjectID (it’s exactly 12 bytes, and drivers will accept any 12-byte string as raw ObjectID data) - so don’t naively try to verify ObjectID strings by casting them to an ObjectID type (I speak from experience).&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.mongotips.com/b/a-few-objectid-tricks/"&gt;ObjectIds carry useful date information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Just because it’s schemaless, doesn’t mean you can pretend not to need a schema. You can be much more flexible about how you implement the schema, but if you don’t make it somewhat rigid, you’re going to have a bad time.&lt;/li&gt;
&lt;/ul&gt;
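That date trick needs no driver at all: the first four bytes of an ObjectId are a big-endian unix timestamp, so you can decode a creation time straight from the hex string:

```python
import datetime

def objectid_time(oid_hex):
    """Creation time of an ObjectId: its first 4 bytes (8 hex characters)
    are seconds since the unix epoch, big-endian."""
    seconds = int(oid_hex[:8], 16)
    return datetime.datetime.utcfromtimestamp(seconds)
```

Handy for range queries too: since _id is always indexed, constructing an ObjectId from a timestamp gives you created-after queries without storing a separate date field.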

&lt;p&gt;Am I wrong? did I miss anything? Let me know!&lt;/p&gt;</description><link>https://blog.hownowstephen.com/post/62523269567</link><guid>https://blog.hownowstephen.com/post/62523269567</guid><pubDate>Sat, 28 Sep 2013 13:14:22 -0400</pubDate><category>mongodb</category><category>nosql</category></item><item><title>How do I Gevent?</title><description>&lt;p&gt;&lt;strong&gt;So&lt;/strong&gt;, you think you’re going to sneak one past the &lt;a href="http://wiki.python.org/moin/GlobalInterpreterLock" title="Global Interpreter Lock"&gt;GIL&lt;/a&gt; and turn your single-threaded snorefest (or your multi-threaded headache) into a den of slick greenlets?&lt;/p&gt;

&lt;h1 id="notsofastbuster.itmaynotbeworthit."&gt;Not so fast, buster. It may not be worth it.&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; Gevent is IO-bound, don’t bother trying it out if you’re doing heavy data processing or anything that spends most of its time on in-memory operations. The pattern in the last code block is a practical basis for multiprocessing systems that depend on IO calls.&lt;/p&gt;

&lt;h3 id="geventisio-bound"&gt;Gevent is IO-bound&lt;/h3&gt;

&lt;p&gt;In particular, as you’ll see in a moment, it monkey-patches the &lt;em&gt;os, select, socket, ssl, thread&lt;/em&gt; and &lt;em&gt;time&lt;/em&gt; modules to conform to gevent’s cooperative scheduler. As a general rule, &lt;em&gt;if you’re spending all of your time waiting on any of these&lt;/em&gt;, gevent &lt;strong&gt;will&lt;/strong&gt; make your application awesome. A good way to gauge is to run your program using the awesome &lt;a href="http://docs.python.org/2/library/profile.html"&gt;cProfile&lt;/a&gt; module&lt;/p&gt;

&lt;pre&gt;&lt;code class="bash"&gt;python -m cProfile my_gevent_candidate_program.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then look for the time spent on the socket and other IO based libraries&lt;/p&gt;
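You can also profile from inside the program, which makes it easy to search the report for IO time programmatically. A sketch with a stand-in workload:

```python
import cProfile
import io
import pstats
import time

def candidate():
    # Stand-in for real work: time.sleep shows up in the profile much
    # like a blocking IO wait would.
    time.sleep(0.05)

profiler = cProfile.Profile()
profiler.enable()
candidate()
profiler.disable()

# Dump the ten most expensive calls (by cumulative time) into a string
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
```

If most of the cumulative time lands in socket, ssl, or select calls, gevent is likely to help; if it lands in your own pure-Python functions, it won't.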

&lt;h3 id="geventlocksoncpu-boundtasks"&gt;Gevent locks on CPU-bound tasks&lt;/h3&gt;

&lt;p&gt;This may come back to bite you in the ass on occasion, but it is the expected functionality. If the beautiful routines your application is running suck up CPU time, gevent will not be able to automagically flop between greenlets &lt;em&gt;and chances are, you won’t want it to&lt;/em&gt;. There are, of course, exceptions to this rule, but that’s not worth getting into right now.&lt;/p&gt;

&lt;h3 id="geventmonkey-patchesthepythonstdlib"&gt;Gevent Monkey-Patches the python stdlib&lt;/h3&gt;

&lt;p&gt;This is where things can get really hairy with your dependencies - in general most libraries will play nice with the changes, but there will always be certain ones that just explode in your face when you try to use them with gevent. As far as I know, there’s no definitive list out there yet for these, but based on my present knowledge you’re fairly safe to use the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://docs.python-requests.org/en/latest/"&gt;python-requests&lt;/a&gt;: if you don’t know it yet, then you’re in for a treat.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://api.mongodb.org/python/current/examples/gevent.htm"&gt;pymongo&lt;/a&gt;: the official MongoDB driver, also officially supports greenlets (even does some nice pooling of connections to improve your usage)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.google.com/p/py-amqplib/"&gt;amqplib&lt;/a&gt;: one of two AMQP libraries I’ve tested on gevent, so far no issues&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.python.org/pypi/haigha"&gt;haigha&lt;/a&gt;: and here’s the other, this one is a bit more powerful but can be kind of finnicky when it comes to debugging - when in doubt, it’s always turned out to be me doing something stupid though, not haigha. &lt;strong&gt;protip&lt;/strong&gt;: using a second channel for publishing messages seems to fix a lot of socket contention issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;note:&lt;/strong&gt; anyone who has tested gevent with other IO libraries: let me know so I can update this list&lt;/em&gt;&lt;/p&gt;

&lt;h3 id="geventrequireslibevent"&gt;Gevent requires libevent&lt;/h3&gt;

&lt;p&gt;Chances are you’re not going to care about this, but you can’t necessarily assume your sysadmin/cloud host/botnet/custom hardware/toaster will support libevent. &lt;/p&gt;

&lt;h1 id="okokjustgettothegoodstuffalready"&gt;Ok ok, just get to the good stuff already…&lt;/h1&gt;

&lt;p&gt;Alright, so you’ve decided your application needs some gevented magic, or just want to play around with it. Let’s talk a bit about what is happening in the background, so you have a clue when it comes to the foreground.&lt;/p&gt;

&lt;h3 id="welcometocooperativescheduling"&gt;Welcome to cooperative scheduling&lt;/h3&gt;

&lt;p&gt;Like any modern IO system, gevent is built around scheduling. This can be done with a &lt;a href="http://blog.parse.com/2013/01/29/whats-so-great-about-javascript-promises/"&gt;promise&lt;/a&gt;-like system (asking things to be done after an async request completes), in a manner that will pretty much hide itself in your code, thanks to those monkey patches from before. The scheduler is built to switch between greenlet contexts quickly and frequently to allow as much airtime to each of your greenlets as they deserve (&lt;strong&gt;this is important to remember&lt;/strong&gt;). Whenever one of these greenlets hits an IO bound job, it sends that through to &lt;a href="http://libevent.org/"&gt;libevent&lt;/a&gt; and then yields to the scheduler to allow for a context switch. Beyond that, the internals are interesting, but not necessary to understand at this point.&lt;/p&gt;

&lt;h3 id="gettingstarted"&gt;Getting started&lt;/h3&gt;

&lt;p&gt;Let’s start with a simple first-try at a web crawler - we’ll take a seed url and keep following links from there:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import sys
import re
import requests

# Count the pages we've crawled
crawled = 0

def crawler(u):
    '''A very simple web crawler'''
    global crawled

    # Crawl the page, print the status
    response = requests.get(u)
    print response.status_code, u

    # Extract some links to follow using a *really* bad regular expression
    for link in re.findall('&amp;lt;a href="(http.*?)"', response.content):

        # Limit to 10 pages
        if crawled &amp;lt; 10:
            crawled += 1
            crawler(link)

# Read the seed url from the command line
crawler(sys.argv[1])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first thing to notice is that this crawler is &lt;strong&gt;depth-first&lt;/strong&gt; recursive. That is an &lt;em&gt;extremely&lt;/em&gt; bad idea, but also the easiest to implement in a single-threaded system. Runtime grows linearly with the number of pages, since we load each webpage sequentially. Now that we’ve got our base, let’s see some gevent goodness. &lt;em&gt;Every gevent application&lt;/em&gt; should start with a monkey:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import gevent.monkey
gevent.monkey.patch_all()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That’s all you need to set up the gevent scheduler - once the stdlib (or &lt;a href="http://www.gevent.org/gevent.monkey.html"&gt;a subset&lt;/a&gt;, if that’s what you’re into) has been patched, everything will start piping itself through gevent. This will amount to, well, absolutely no benefit to start off! Your application will run just as single-threaded as before, and you may start to feel like reading this far was time that should have been spent finally starting that knitting business you’ve been dreaming about. &lt;strong&gt;Calm down, the magic is on its way&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id="letsgetrestructuring"&gt;Let’s get restructuring&lt;/h3&gt;

&lt;p&gt;Our first basic approach for the gevent crawler looks like this&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent
import sys
import re
import requests

# List for holding our greenlets
greenlets = []

def crawler(u):
    '''A very simple gevented web crawler'''

    # Crawl the page, print the status
    response = requests.get(u)
    print response.status_code, u

    # Extract some links to follow
    for link in re.findall('&amp;lt;a href="(http.*?)"', response.content):

        # Limit to 10 pages
        if len(greenlets) &amp;lt; 10:
            greenlets.append(gevent.spawn(crawler, link))

# Read the seed url from the command line
greenlets.append(gevent.spawn(crawler, sys.argv[1]))

# Wait until we've spawned enough url requests
while len(greenlets) &amp;lt; 10:
    gevent.sleep(1)

# Wait for everything to complete
gevent.joinall(greenlets)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Already looking much better. We’re now &lt;strong&gt;breadth-first&lt;/strong&gt; (a much better idea when it comes to web crawling), and since pages at the same depth now load concurrently, wall-clock time scales with the depth of the crawl rather than the number of pages.&lt;/p&gt;

&lt;p&gt;This crawler shows off the nice feature that &lt;strong&gt;greenlets are cheap&lt;/strong&gt;. In general, you can parallelize by just adding a greenlet per IO task, and the scheduler will take care of organizing them. Unfortunately, this is far from a useful crawler - we can’t control much about it, and since we’re doing one HTTP request per greenlet we can expect that opening up the tap a bit wider may cause some network issues. So deeper down the rabbit hole we go!&lt;/p&gt;

&lt;h3 id="introducingworkerpools"&gt;Introducing worker pools&lt;/h3&gt;

&lt;p&gt;Now, instead of spawning a greenlet per action, we can just tell a pool of greenlets to churn through our actions.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent.pool
import sys
import re
import requests

# Prepare a pool for 5 workers
pool = gevent.pool.Pool(5)

# Crawl tracker is back
crawled = 0

def crawler(u):
    '''A very simple pooled gevent web crawler'''
    global crawled

    # Crawl the page, print the status
    response = requests.get(u)
    print response.status_code, u

    # Extract some links to follow
    for link in re.findall('&amp;lt;a href="(http.*?)"', response.content):

        # Limit to 10 pages (ignores links when the pool is already full)
        if crawled &amp;lt; 10 and not pool.full():
            crawled += 1
            pool.spawn(crawler, link)

# Read the seed url from the command line
pool.spawn(crawler, sys.argv[1])

# Wait for everything to complete
pool.join()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pools are joinable, meaning that we can tell our application to wait until all the requested actions have been processed (in gevent-speak, the pool is in a ready state). &lt;/p&gt;

&lt;p&gt;In practice this can sometimes cause some headaches, as individual pool actions might inexplicably never reach the ready state (&lt;em&gt;infinite loop, anyone?&lt;/em&gt;) and as a result the entire pool may never be safely joined. Nevertheless, our pooling code feels cleaner, except for one block that looks very, very wrong:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;# Limit to 10 pages (ignores links when the pool is already full)
if crawled &amp;lt; 10 and not pool.full():
    crawled += 1
    pool.spawn(crawler, link)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Considering how our &lt;strong&gt;gevent.spawn&lt;/strong&gt; example allowed us to do a full breadth-first search, we’re now throwing away results whenever our pool is busy. This is forcing us to do a &lt;em&gt;lossy&lt;/em&gt; breadth-first search, and will throw away urls based on parent page load speed. &lt;em&gt;Wacky&lt;/em&gt;. But why?&lt;/p&gt;

&lt;h4 id="poolspawningisablockingaction"&gt;Pool spawning is a blocking action&lt;/h4&gt;

&lt;p&gt;When you attempt to spawn a greenlet on a gevent pool, gevent will check whether the pool is full, and wait for a slot to free up if it is. This causes some problems - since each of the crawler greenlets is calling &lt;strong&gt;pool.spawn&lt;/strong&gt;, if at any point we already have 5 (our pool size) crawlers active and they all find links, they will all call &lt;strong&gt;pool.spawn&lt;/strong&gt; at the same time, leaving you with some &lt;a href="http://en.wikipedia.org/wiki/Dining_philosophers_problem"&gt;&lt;em&gt;very&lt;/em&gt; hungry philosophers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you take out the &lt;strong&gt;not pool.full()&lt;/strong&gt; line, your application will happily consume ~5 urls, and then wait patiently for all eternity for a free greenlet, without realizing every greenlet is doing the same!&lt;/p&gt;
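The trap isn't specific to gevent: any pool whose spawn blocks while full behaves this way. A stdlib sketch of such a pool (threads standing in for greenlets; this mimics the blocking semantics, it is not gevent itself) makes the failure mode easy to reason about:

```python
import threading

class BlockingPool:
    """Minimal thread pool whose spawn() blocks while all slots are taken,
    mimicking the blocking behavior of a full gevent pool."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)

    def spawn(self, fn, *args):
        self._slots.acquire()  # blocks right here when the pool is full

        def run():
            try:
                fn(*args)
            finally:
                self._slots.release()

        thread = threading.Thread(target=run)
        thread.start()
        return thread
```

If fn itself calls pool.spawn while every slot is busy, each worker blocks in acquire and none ever reaches release: the hungry philosophers, in about twenty lines.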

&lt;h3 id="butwewanttocrawleverything"&gt;But we want to crawl everything!&lt;/h3&gt;

&lt;p&gt;Thankfully, despite the hiccups with using pooling, there’s one more data structure we’ve been missing that will turn our crawler from a silly toy into a powerhouse web crawling application: &lt;em&gt;queueing&lt;/em&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent.pool
import gevent.queue

import sys
import re
import requests

# Prepare a pool for 5 workers and a messaging queue
pool = gevent.pool.Pool(5)
queue = gevent.queue.Queue()
crawled = 0

def crawler():
    '''A very simple queued gevent web crawler'''
    global crawled

    while 1:
        try:
            u = queue.get(timeout=1)
            response = requests.get(u)
            print response.status_code, u

            # Extract some links to follow
            for link in re.findall('&amp;lt;a href="(http.*?)"', response.content):
                # Limit to 10 pages (ignores links when the pool is already full)
                if crawled &amp;lt; 10:
                    crawled += 1
                    queue.put(link)

        except gevent.queue.Empty:
            break

queue.put(sys.argv[1])

# Spawn 5 workers to consume the queue
for x in xrange(0, 5):
    pool.spawn(crawler)

# Wait for everything to complete
pool.join()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks good, right? It will work, but only under very specific circumstances - specifically, we’re depending on the seed url taking &lt;em&gt;less than one second&lt;/em&gt; to load. Any more and the rest of the workers are going to exit, leaving us with exactly one worker. But otherwise, not a bad first attempt - so let’s look at what’s happening:&lt;/p&gt;

&lt;h4 id="workersarenowtheirownloops"&gt;Workers are now their own loops&lt;/h4&gt;

&lt;p&gt;Previously we were spawning new workers every time we wanted to crawl a url. This fits the &lt;strong&gt;greenlets are cheap&lt;/strong&gt; maxim, but in that same system, each worker would consume exactly one queue message, then exit. Not the most efficient system in the world (you can actually mimic that by just removing the while loop and adding in another small block, we’ll visit that in a minute). &lt;/p&gt;

&lt;p&gt;Instead, we spawn each of our greenlets and have it wait for queue messages until no more exist (&lt;em&gt;or until the queue has been empty for longer than one second&lt;/em&gt;)&lt;/p&gt;

&lt;h4 id="wespawnallofourworkersupfront"&gt;We spawn all of our workers upfront&lt;/h4&gt;

&lt;p&gt;Previously, we were spawning workers on-demand as we had pool availability and new urls to crawl. Now we’re spawning them all upfront and assuming we’ll fill in the queue as needed. This is not a particularly bad practice (if you’ll recall, &lt;strong&gt;greenlets are cheap&lt;/strong&gt;), but it still feels kind of dirty…&lt;/p&gt;

&lt;h3 id="letsdoonebetter"&gt;Let’s do one better&lt;/h3&gt;

&lt;p&gt;In order to optimize our usage of gevent (context switches don’t cost much, but they still have a cost), we want to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;consume every message in the queue &lt;strong&gt;&lt;em&gt;or&lt;/em&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;fill the pool with workers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whichever is possible, with the obvious caveat that our pool is only going to allow us to fulfill #2 if the queue already has more messages than we are allowing workers.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;# monkey-patch
import gevent.monkey
gevent.monkey.patch_all()

import gevent.pool
import gevent.queue

import sys
import re
import requests

# Prepare a pool for 5 workers and a messaging queue
pool = gevent.pool.Pool(5)
queue = gevent.queue.Queue()
crawled = 0

def crawler():
    '''A very simple queued gevent web crawler'''

    print 'starting crawler...'
    global crawled

    while 1:
        try:
            u = queue.get(timeout=0)
            response = requests.get(u)
            print response.status_code, u

            # Extract some links to follow
            for link in re.findall('&amp;lt;a href="(http.*?)"', response.content):
                # Limit the crawl to 10 pages total (further links are ignored)
                if crawled &amp;lt; 10:
                    crawled += 1
                    queue.put(link)

        except gevent.queue.Empty:
            break

    print 'stopping crawler...'

queue.put(sys.argv[1])
pool.spawn(crawler)

while not queue.empty() or pool.free_count() != 5:  # run until the queue is empty and every worker is idle
    gevent.sleep(0.1)
    for x in xrange(0, min(queue.qsize(), pool.free_count())):
        pool.spawn(crawler)

# Wait for everything to complete
pool.join()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running this will show that we spawn exactly one worker to start, which then loads the seed url and pushes the rest of our messages into the queue - after which the other 4 workers start up and &lt;strong&gt;we’re crawling&lt;/strong&gt;!&lt;/p&gt;

&lt;h1 id="andnowyoucangevent"&gt;And now &lt;em&gt;you&lt;/em&gt; can Gevent&lt;/h1&gt;

&lt;p&gt;These patterns form the basis of an IO-bound concurrent system, and can be applied to anything that waits on IO: heavy reads and writes to the local filesystem, local threading, and any network IO.&lt;/p&gt;</description><link>https://blog.hownowstephen.com/post/50743415449</link><guid>https://blog.hownowstephen.com/post/50743415449</guid><pubDate>Sat, 18 May 2013 13:51:00 -0400</pubDate><category>gevent</category><category>python</category><category>web crawlers</category><category>async</category><category>greenlets</category></item><item><title>Intro to the gevent.queue module</title><description>&lt;p&gt;A very unofficial overview of all the awesomeness that can be accomplished with the &lt;strong&gt;&lt;a href="http://www.gevent.org/gevent.queue.html"&gt;gevent.queue&lt;/a&gt;&lt;/strong&gt; module&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import gevent.queue
queue = gevent.queue.Queue()&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="puttingmessagesintoqueues"&gt;Putting messages into queues&lt;/h3&gt;

&lt;p&gt;From examples and my own experience, it is most consistent with good pythonic practice to communicate across queues using tuples. This allows simple packing and unpacking of arguments, and also cascades down to some of the more interesting queue implementations (in particular, priority queues), so as an example, I might do something like the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;from datetime import datetime
priority = 1
queue.put((priority, datetime.utcnow(), 'here is a message',))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A few notes on specific queue configurations. If you’re working with capped queues (queues with a fixed size, set using the &lt;em&gt;maxsize&lt;/em&gt; keyword argument), queue.put is going to block - it can be &lt;em&gt;quite&lt;/em&gt; a surprise, when you go from an uncapped to a capped queue, that your application just locks up. But they thought of that, so you can do either of the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;
try:
    # The preferred method, simple and clear
    queue.put_nowait((priority, datetime.utcnow(), 'here is a message',))

    # queue.put_nowait is actually just an alias of this function
    queue.put((priority, datetime.utcnow(), 'here is a message'), block=False)

except gevent.queue.Full:
    print 'Could not put to the queue'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Using a timeout on the blocking &lt;em&gt;put&lt;/em&gt; lets you block only briefly, then react when the message could not be added:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;try:
    queue.put((priority, datetime.utcnow(), 'here is a message'), timeout=1)
except gevent.queue.Full:
    print 'Could not put to the queue'&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="gettingmessagesfromqueues"&gt;Getting messages from queues&lt;/h3&gt;

&lt;p&gt;Most basic applications will use a blocking get, waiting until a new message is in the queue to be processed. Note that tuple unpacking makes the returned data structure very clear (&lt;em&gt;protip&lt;/em&gt;: for any situation where you’re going to expect more complex data, I suggest still using a tuple but tailing it with a dictionary, and using JSON Schema to validate messaging)&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;priority, date_sent, message = queue.get()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;By default, queue.get is a blocking command, but you may need to make it nonblocking - it mirrors the same idioms as the put command&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;try:
    # Best-practice nonblocking get
    priority, date_sent, message = queue.get_nowait()

    # Which is an alias of this:
    priority, date_sent, message = queue.get(block=False)

    # And can be done with a timeout as well
    priority, date_sent, message = queue.get(timeout=1)

except gevent.queue.Empty:
    print 'Could not get from the queue'

except ValueError:
    print 'Your tuple unpack was bad, and you should feel bad'&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="queuepeekingnewingevent1.0"&gt;Queue peeking (new in gevent 1.0)&lt;/h3&gt;

&lt;p&gt;Sometimes, you don’t want to take a message from the queue, just check to see what’s going on on top. The kind developers of gevent have your back on this, with the &lt;em&gt;queue.peek&lt;/em&gt; commands. They work identically to the &lt;em&gt;queue.get&lt;/em&gt; commands, but are non-destructive.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;try:
    # Blocking peek
    priority, date_sent, message = queue.peek()

    # Best-practice nonblocking peek
    priority, date_sent, message = queue.peek_nowait()

    # Which is an alias of this:
    priority, date_sent, message = queue.peek(block=False)

    # And can be done with a timeout as well
    priority, date_sent, message = queue.peek(timeout=1)

except gevent.queue.Empty:
    print 'Could not peek at the queue'

except ValueError:
    print 'Your tuple unpack was bad, and you should feel bad'&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="iteratingoverqueues"&gt;Iterating over queues&lt;/h3&gt;

&lt;p&gt;We also have the ability to iterate over messages in a queue. This is a &lt;em&gt;blocking&lt;/em&gt; action, so it could be used to run a keep-alive consumer of a queue while other greenlets feed new messages in.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;for priority, date, message in queue:
  print 'Got', message, 'from', date, 'with priority', priority&lt;/code&gt;&lt;/pre&gt;
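&lt;p&gt;Since iteration blocks forever once the queue drains, you typically end it with a sentinel. A rough stdlib sketch of the same sentinel idea, using the two-argument form of iter (the stdlib queue module here stands in for gevent.queue, which mirrors its API):&lt;/p&gt;

```python
import queue  # stdlib stand-in; gevent.queue mirrors this API

q = queue.Queue()
for word in ('got', 'three', 'messages'):
    q.put(word)
q.put(None)  # sentinel marking the end of the stream

# iter(callable, sentinel) keeps calling q.get() until it returns the sentinel
received = list(iter(q.get, None))
```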

&lt;h3 id="typesofqueuesandimplementations"&gt;Types of queues and implementations&lt;/h3&gt;

&lt;p&gt;Out of the box, gevent supports several useful queue implementations: &lt;em&gt;priority queues&lt;/em&gt;, &lt;em&gt;joinable queues&lt;/em&gt;, &lt;em&gt;LIFO queues&lt;/em&gt;, as well as a very simplistic interface for defining your own queue subclasses.&lt;/p&gt;

&lt;h4 id="priorityqueueingaka:whyweusetuples"&gt;Priority queueing (AKA: why we use tuples)&lt;/h4&gt;

&lt;pre&gt;&lt;code class="python"&gt;priority_queue = gevent.queue.PriorityQueue()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Priority queues, internally, are built on top of the amazing heapq module in the python standard library. Any logic you would use with heapq-managed lists applies to gevent priority queues as well, so in a situation like the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;Gevent priority queues 
priority_queue.put((1, datetime.now(), 'countdown 3'))
priority_queue.put((1, datetime.now() - timedelta(minutes=1), 'countdown 2'))
priority_queue.put((0, datetime.now(), 'countdown 1'))
priority_queue.put((0, datetime.now() - timedelta(minutes=1), 'countdown - blastoff!'))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can analytically determine that pulling all messages from the queue will yield the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;(0, datetime.datetime(...), 'countdown - blastoff!')
(0, datetime.datetime(...), 'countdown 1')
(1, datetime.datetime(...), 'countdown 2')
(1, datetime.datetime(...), 'countdown 3')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I would highly recommend reading through the heapq documentation a bit more to ensure you implement your comparison algorithms properly (I often get them backwards) and also because it is surprisingly useful in general.&lt;/p&gt;
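&lt;p&gt;Since the priority queue is just heapq underneath, the ordering above can be reproduced with heapq alone - a stdlib-only sketch (timestamps fixed so the result is deterministic):&lt;/p&gt;

```python
import heapq
from datetime import datetime, timedelta

now = datetime(2013, 5, 13, 12, 0, 0)  # fixed so the ordering is reproducible

heap = []
heapq.heappush(heap, (1, now, 'countdown 3'))
heapq.heappush(heap, (1, now - timedelta(minutes=1), 'countdown 2'))
heapq.heappush(heap, (0, now, 'countdown 1'))
heapq.heappush(heap, (0, now - timedelta(minutes=1), 'countdown - blastoff!'))

# Tuples compare element by element, so priority wins first, then timestamp
order = [heapq.heappop(heap)[2] for _ in range(4)]
```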

&lt;h4 id="joinablequeues"&gt;Joinable queues&lt;/h4&gt;

&lt;p&gt;When using queues in gevent, you may find that you want to run a program only until there are no messages left in a queue, perhaps because you know it recurses in a predictable manner, or you just have a finite set of messages you want to process before exiting. Enter the joinable queue, which lets you block until every message in the queue has been processed.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;joinable_queue = gevent.queue.JoinableQueue()
joinable_queue.put(('wait for me...',))

# some code to spawn workers goes here, otherwise you're going to be waiting a LONG time

joinable_queue.join()&lt;/code&gt;&lt;/pre&gt;
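&lt;p&gt;JoinableQueue follows the same join()/task_done() protocol as the stdlib queue.Queue, so the missing worker step can be sketched without gevent, using a thread in place of a greenlet (the worker body and sentinel are illustrative):&lt;/p&gt;

```python
import queue      # stdlib stand-in for gevent.queue.JoinableQueue
import threading  # a thread stands in for a greenlet here

jq = queue.Queue()
results = []

def worker():
    while True:
        item = jq.get()
        if item is None:   # sentinel: no more work coming
            jq.task_done()
            break
        results.append(item.upper())
        jq.task_done()     # every get() must be matched by a task_done()

threading.Thread(target=worker).start()

for word in ('wait', 'for', 'me'):
    jq.put(word)
jq.put(None)

jq.join()  # blocks until task_done() has been called for every message
```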

&lt;h4 id="lifoqueues"&gt;Lifo queues&lt;/h4&gt;

&lt;p&gt;There’s nothing special about the usage of LIFO (last-in, first-out) queues - messages simply come back in the reverse of the order they went in - so this one is just here for completeness.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;lifo_queue = gevent.queue.LifoQueue()&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id="subclassingqueues"&gt;Subclassing queues&lt;/h4&gt;

&lt;p&gt;Extending the queue class is simple, and can be done with a pattern something like the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import random

class ChaosQueue(gevent.queue.Queue):
    '''A subclass of Queue that retrieves a random queue entry'''

    def _init(self, maxsize, items=None):
        if items:
            self.queue = list(items)
        else:
            self.queue = []

    def _put(self, item):
        self.queue.append(item)

    def _get(self):
        return self.queue.pop(random.randint(0, len(self.queue) - 1))&lt;/code&gt;&lt;/pre&gt;
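&lt;p&gt;The stdlib queue.Queue exposes the same _init/_put/_get hooks, so the same subclass can be exercised without gevent - a sketch (only the drained set is predictable, since the draw order is random):&lt;/p&gt;

```python
import random
import queue  # stdlib Queue exposes the same _init/_put/_get hooks

class ChaosQueue(queue.Queue):
    '''A Queue subclass that retrieves a random entry'''

    def _init(self, maxsize):
        self.queue = []

    def _put(self, item):
        self.queue.append(item)

    def _get(self):
        # Pop from a random position instead of the front
        return self.queue.pop(random.randrange(len(self.queue)))

cq = ChaosQueue()
for n in range(10):
    cq.put(n)

drained = sorted(cq.get() for _ in range(10))  # all ten come back, in some order
```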

&lt;p&gt;And that, as far as I know, is the extent of the gevent queuing module. Let me know if there’s anything I missed or any errors in my sample code!&lt;/p&gt;</description><link>https://blog.hownowstephen.com/post/46872742800</link><guid>https://blog.hownowstephen.com/post/46872742800</guid><pubDate>Mon, 13 May 2013 18:00:00 -0400</pubDate><category>gevent</category><category>queue</category><category>python</category><category>queuing</category></item><item><title>One-liner while loop in bash shell</title><description>&lt;p&gt;To keep looping on a command forever in the shell, use the following (runs COMMAND every 5 seconds)&lt;/p&gt;

&lt;p&gt;$ &lt;strong&gt;while true; do COMMAND; sleep 5; done;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As it turns out, this can be surprisingly useful - embarrassed to say that I was using &lt;em&gt;cron&lt;/em&gt; to accomplish the same thing for a while&amp;hellip;&lt;/p&gt;</description><link>https://blog.hownowstephen.com/post/49457971211</link><guid>https://blog.hownowstephen.com/post/49457971211</guid><pubDate>Thu, 02 May 2013 16:39:00 -0400</pubDate><category>bash</category><category>one liner</category><category>scripting</category><category>devops</category></item><item><title>Relocating RabbitMQ files</title><description>&lt;p&gt;RabbitMQ, by default (on Ubuntu-like systems) will load this file, if it exists:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/etc/rabbitmq/rabbitmq-env.conf&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This (not /etc/environment) is where you should be putting any of the variables found on &lt;a href="http://www.rabbitmq.com/relocate.html"&gt;http://www.rabbitmq.com/relocate.html&lt;/a&gt;&lt;/p&gt;</description><link>https://blog.hownowstephen.com/post/49272828335</link><guid>https://blog.hownowstephen.com/post/49272828335</guid><pubDate>Tue, 30 Apr 2013 14:59:00 -0400</pubDate><category>rabbitmq</category><category>ubuntu</category></item><item><title>How to install python lxml on ubuntu</title><description>&lt;p&gt;Since I can never seem to get it right:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apt-get install libxml2-dev libxslt-dev
pip install lxml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;or with python-libxml2&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apt-get install python-libxml2 libxslt-dev
STATIC_DEPS=true pip install lxml
&lt;/code&gt;&lt;/pre&gt;</description><link>https://blog.hownowstephen.com/post/45839390608</link><guid>https://blog.hownowstephen.com/post/45839390608</guid><pubDate>Wed, 20 Mar 2013 11:57:00 -0400</pubDate><category>lxml</category><category>python</category><category>libxml2</category><category>libxslt</category></item><item><title>Where is my user-data script stored on EC2?</title><description>&lt;p&gt;On EC2 Ubuntu instances, the user-data script supplied on launch can be found in:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/var/lib/cloud/instances/[instance-id]/user-data.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Note: there may be several if you, like me, tend to launch instances from copies of copies of instances!&lt;/p&gt;</description><link>https://blog.hownowstephen.com/post/45268048932</link><guid>https://blog.hownowstephen.com/post/45268048932</guid><pubDate>Wed, 13 Mar 2013 10:33:00 -0400</pubDate><category>aws</category><category>ec2</category><category>user-data</category></item></channel></rss>
