Copilot: Coming Full Circle

Posted July 01, 2015

Exactly 10 years ago, Copilot was officially announced to the public. A little over a year ago, I acquired it from Fog Creek as a part of their corporate restructuring, driven mostly by the need to spin Trello out into its own company.

I've kept it pretty quiet so far, mostly because I was spending a lot of my time doing all of the things necessary to push Copilot out of the nest, something that I didn't want users to be adversely affected by. This included migrating it to new servers, splitting the Copilot credit cards into their own vault, separating the billing systems, and carving Copilot data out of the databases.

The other reason I've kept it quiet is that I was a little embarrassed about the state of Copilot. It had been quite a while since any major updates were done to it, and its age was starting to show. The next task was to rewrite the client applications to be faster, more stable, and easier to use.

It's been a long road, but with the new client applications feeling pretty solid, today I'm announcing their soft launch, along with a public acknowledgement of Copilot's new ownership.

I wasn't actually aiming to hit the decade mark since Copilot's first official announcement. That was more of a happy coincidence than anything. But realizing it's been 10 years since that summer has made me reflect on the path I've taken to get here.

It's certainly been a winding road, but I'm happy to be back working on the project that set my career on this trajectory.

Magic: How to launch a product so no one will ever use it again.

Posted February 23, 2015

Yesterday, I came across a new text-based meta-service called Magic. Magic is supposed to be the ultimate on-demand service. According to its homepage, you can ask Magic for anything you can order online, and they'll take care of it for you.

It's actually a pretty simple idea. They hired some people to respond to the text messages that come in, decide which service is most appropriate for the request, and place the order. In terms of technology, it's also very simple, likely being a Twilio number with an admin panel on the backend, linked to Stripe's payment gateway.

Last night I was over at Jeff's place watching the eventual Best Picture winner, Birdman. We decided to order dinner, and since I had signed up for Magic, I thought it'd be fun to try it, even if it was a little pricier.

At 6:00pm I sent Magic a message asking for a burrito and a burrito bowl from Chipotle. Then we waited.

19 minutes later, I got a reply with a link to put in my credit card information. At 6:21pm I replied that I was done putting in my credit card and at 6:23pm I sent them the address to send the order to. Then we waited.

31 minutes later, at 6:54pm, Magic replied, asking for my name. I replied within a minute. Then we waited.

27 minutes later, at 7:21pm, Magic finally replied with a price. For one veggie burrito and one veggie burrito bowl to be delivered they wanted $35.

Just in case you think that might be a typo, I'll spell it out. Thirty-five dollars.

An hour and twenty one minutes after first contact, we were offered delivery of our $13 meal for $35. Given most delivery times, we wouldn't have seen our food until 8:00pm. Had I gone through Postmates (which Magic likely used for this order) directly, I would have had my order for only $26 (still pricey for delivery) in only the time required for someone to pick it up and deliver it.

Magic? Not really. Surprisingly Slow And Doubly Expensive Delivery Service. That's more like it.

Like what you see here? Go check out my new project GamePlan30!

32 Is The New 20

Posted December 09, 2013

Recently, a dear friend turned 30. Being my longest tenured friend, I gave her a call to wish her well on the dawn of her fourth decade.

Like many, she was somewhat distressed by the prospect of turning 30. (In retrospect, opening the conversation with "Hey old lady!" was probably not the best choice.) Being that I had passed that same milestone only a few months before, she asked me how I got through it.

"I realized that 30 is a totally arbitrary number and that it is just another birthday."

I could tell she recognized the logic of the statement, but the sentiment rang hollow. Sure, it's an arbitrary number, but it's an arbitrary number that ends in zero. I decided to try another tack.

"That's just because we use base-10 for our numbers. If we used base-16, you would still have two years until you turn 20!"

She was intrigued by this prospect. But being one of the 99% of people who do not regularly think in other bases, she needed a quick refresher. "What's base-16?"

"It's when you don't go into double digits until 16, so in base-16, one-zero equals 16."

"Oh, right. But what do you do when you get above 9?"

"Well, most commonly you use letters. In base-16, which is also called hexadecimal, it goes 8, 9, A, B, C, D, E, F, 10."

"So I just turned 1E?"

"Yeah, exactly!"

"I like that! I'm going to tell everyone that I'm not turning 30, I'm just turning 1E!"

I'm not sure the deviation into base-16 helped assuage her fears about the interminable march of time, but for a moment a little math distracted her from her worries.
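For anyone who wants to check her math, the conversion is a one-liner in Python, and it also explains this post's title: 0x20 in base-16 is 32 in base-10.

```python
# Check the birthday math: format(n, 'X') renders n in base-16.
for age in (30, 31, 32):
    print(f"{age} in base-10 is {format(age, 'X')} in base-16")
# 30 -> 1E, and 32 -> 20: hence, 32 is the new 20.
```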

Microcaching for a Faster Site

Posted May 21, 2013

My website, this site, is not fast. But, because of this little trick I'm about to show you, you probably think it is.

It's not particularly slow, either, at least not when there's not too much load on it. ab reports that the median request time for the homepage is about 60ms when there's only one request per second coming in. But if traffic starts picking up, it starts slowing down. With 2 req/sec, the median jumps to 90ms per request, a 50% increase. At 5 req/sec, it slows to 225ms per request. Do some quick math and you'll see we'll soon have a problem.

Let's take a quick look under the hood. The website is a heavily modified version of an early iteration of Simple. It is written in Python using Flask and SqlAlchemy, talking to a PostgreSQL database. This is all being run by uWSGI in Emperor mode and served by Nginx.

Each of these levels could be a source of slowness. We could profile the Python code to figure out where we're spending our time. Is Jinja2 being slow about autoescaping the html? Maybe. Perhaps it's in the database layer. SqlAlchemy might be creating some bad queries, so we should log those. And, of course, we need to make sure that PostgreSQL is tuned properly so we're getting the most out of its caching. Then there's uWSGI; should we allocate 2, 4, or 8 processes to the site?

But you know what? That's a difficult, tedious process, and it's easy to make things worse along the way.

Optimization is hard! Let's go shopping.

What if we could just speed the whole thing up all at once?

It turns out that, for this type of site, where the users only see one version of the content (as opposed to a web app, where each user has their own version of the site) microcaching is an ideal solution.

Microcaching is the practice of caching the entire contents of a web page for a very short amount of time. In our case, this will be just one second.

By doing this, we ensure that when the site is under any sort of load, the vast majority of visitors are getting a copy of the site served as static content from the cache, which Nginx is very good at. In fact, because of the way the caching is set up, the only time a user would wait for the "slow" site would be if they were the first person to hit the site in over a second. But, we know that the "slow" site is pretty fast when it's under such light load.
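A quick back-of-the-envelope calculation (plain Python, just to illustrate the reasoning) shows why this works: with a 1-second cache lifetime, at most about one request per second ever reaches the backend, so the fraction of visitors served from the cache climbs as traffic grows.

```python
# With a 1-second TTL, at most ~1 request/sec reaches the backend;
# everyone else arriving in that second gets the cached copy.
for rate in (2, 10, 100, 1000):
    cached_fraction = (rate - 1) / rate
    print(f"{rate:4d} req/sec -> ~{cached_fraction:.1%} served from cache")
```

In other words, the heavier the load, the better the cache hit rate, which is exactly the behavior we want.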

The following is a slightly modified version of my Nginx config file, which shows how to do this:

# Set cache dir
proxy_cache_path /var/cache/nginx levels=1:2
                 keys_zone=microcache:5m max_size=1000m;

# Virtualhost/server configuration
server {
    listen 80;

    # Define cached location (may not be whole site)
    location / {
        # Set up var defaults
        set $no_cache "";

        # If non GET/HEAD, don't cache & mark user as uncacheable for 1 second via cookie
        if ($request_method !~ ^(GET|HEAD)$) {
            set $no_cache "1";
        }

        # Drop no-cache cookie if need be
        # (for some reason, add_header fails if included in prior if-block)
        if ($no_cache = "1") {
            add_header Set-Cookie "_mcnc=1; Max-Age=2; Path=/";
            add_header X-Microcachable "0";
        }

        # Bypass cache if no-cache cookie is set
        if ($http_cookie ~* "_mcnc") {
            set $no_cache "1";
        }

        # Bypass cache if flag is set
        proxy_no_cache $no_cache;
        proxy_cache_bypass $no_cache;

        # Point nginx to the real app/web server
        # (a subdomain works too; any backend address is fine)
        proxy_pass http://localhost:8080;

        # Set cache zone
        proxy_cache microcache;
        # Set cache key to include identifying components
        proxy_cache_key $scheme$host$request_method$request_uri;
        # Only cache valid HTTP 200 responses for 1 second
        proxy_cache_valid 200 1s;
        # Serve from cache if currently refreshing
        proxy_cache_use_stale updating;
        # Send appropriate headers through
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Set files larger than 1M to stream rather than cache
        proxy_max_temp_file_size 1M;
    }
}
(Most of this code was originally derived from Fenn Bailey. Unfortunately, it seems his site has gone down with the Posterous shutdown.)

So what's going on here? Let's start with the top. I've set up the Flask site to respond on a separate subdomain; that's the actual site that we'll be caching. Note that a subdomain is not required. You could just as easily use another port, like 8080.

Next, for the cached site, we first check to see if there's any reason we shouldn't use the cache. This includes making a request other than HEAD or GET, or having a certain cookie set (which the admin page sets for me). If that's the case, we set a cookie for the next 2 seconds that says not to use the cache, and we skip the cache for this request. (You want this cookie to live longer than the caching time so your next GET request will grab a fresh copy.)

If we are using the cache for this request, then we defer to Nginx's usual proxy_pass. We tell it that all successful requests (HTTP 200) should be cached for 1 second. The choice of 1 second is pretty arbitrary, it could be longer, but since I know the app itself performs well with 1 request per second, there wouldn't be a lot of benefit to making it longer. We also set proxy_cache_use_stale to serve from the cache if Nginx is still busy updating the cache, meaning that users won't actually see a slower response while we go to the actual site.

So how does the microcached site do compared to the stock site? It blows it out of the water.

Command used: ab -k -n 50000 -c [1|5|10|25|50|100] -t 10 http://[a.]

                Flask (direct)             Microcached
  -c       req/sec   med resp (ms)   req/sec   med resp (ms)
   1          15          64           5,952         0
   5          32         151          17,283         0
  10          31         312          19,991         0
  25          33         751          19,916         1
  50          30       1,589          17,397         3
 100          32       2,984          16,717         5

While the Flask site can reliably serve up to about 30 requests per second, it starts slowing down pretty significantly. The microcached site, on the other hand, serves almost 20,000 requests per second at its peak. More importantly, the response times stay in the single digit milliseconds, making the site feel nice and fast, regardless of load.

So there's an easy way to speed up your blog without having to make any changes to the application code.

When the Intern Writes the Billing System

Posted April 19, 2013

In the autumn of 2005 I was finishing an internship at Fog Creek Software. My colleagues and I on Project Aardvark had successfully developed and released Copilot. After the launch, we cleaned up the few remaining bugs and added a few small, requested features. Then people started leaving. First Ben, then Yaron, and finally Michael. By the end of August, I was the only intern left.

It would stay that way for the next three months. Unlike the other interns, I didn't have to return to school immediately. I had graduated from Rose-Hulman already and was able to defer my admission to Stanford for the first quarter of classes. This gave me another three months at Fog Creek to tie up any loose ends with Copilot and start adding the next big feature. My assignment, as the last remaining intern? A subscription billing system for Copilot.

(Yes, they had an intern writing the new billing system. No, I'm not making this up.)

When we first launched Copilot, there was only one option for payments: a $5 pass that was good for 24 hours of use. This was great for occasional users, since the cost was more than worth the time saved by not having to do remote support blind. But for more frequent users, it was a hassle. They'd have to enter their credit card information every time they needed to help someone. This was especially bad for companies that used it for customer support, since it forced their support people to request access to the company card on a daily basis.

Our plan was to implement a subscription billing system, similar to cell phone plans, based on minutes used per month, except with transparent, benevolent pricing. It would have several plans with a different number of included minutes and a per-minute charge for any overages. To make sure users weren't locked into something too expensive, we also said the system needed to let users change plans at any time.
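The core of that pricing model is simple: a flat monthly fee plus a per-minute charge for overage. As a rough sketch (the plan name, fee, and rates below are invented for illustration; the real Copilot plans differed), working in integer cents avoids floating-point rounding in the billing math:

```python
# Minimal sketch of metered subscription pricing: flat fee plus
# per-minute overage. All amounts are integer cents; the plan
# values here are made up for illustration.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    monthly_fee_cents: int         # flat charge per month
    included_minutes: int          # minutes covered by the flat fee
    overage_cents_per_minute: int  # charge for each minute beyond that

def monthly_charge_cents(plan: Plan, minutes_used: int) -> int:
    """Flat fee plus overage for minutes beyond the included amount."""
    overage_minutes = max(0, minutes_used - plan.included_minutes)
    return plan.monthly_fee_cents + overage_minutes * plan.overage_cents_per_minute

starter = Plan("starter", monthly_fee_cents=995,
               included_minutes=60, overage_cents_per_minute=25)
print(monthly_charge_cents(starter, 45))   # under the cap -> 995
print(monthly_charge_cents(starter, 100))  # 40 overage minutes -> 1995
```

The calculation itself is trivial; as the rest of this post shows, the hard part is everything around it.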

My first task was to research options for storing and charging credit cards. I read dozens of marketing pages, sent emails to companies asking for info, and made phone calls to sales people. But every option had the same basic flaw: they could only do rigid, set-price subscriptions. This wouldn't work for us, since users could easily be charged a different amount each month, depending on how much they utilized the service.

Instead, we decided that we would write the whole subscription system from scratch ourselves. We already had a similar system in place for one-off purchases, so we thought we might be able to reuse some of the old code. I started writing the recurring billing code while one of our co-founders started on the credit card vault, which we called Roach Motel. (Get it? Cards check in, but they don't check out...)

We did everything possible to isolate credit cards from the rest of our infrastructure, keeping it all on a separate server stored in our secure cage in the office, instead of the data center where data-center employees might have access to it. The code was written to accept credit cards, but never to send them out again, except in encrypted form, and only to the payment processor. Access to the box required a thumb drive, held in secure storage, and a password. (Note: As of a couple years ago, this was no longer the case, as we began using a proper credit card vault instead.)

Over the next couple of months, the system came together nicely, being designed and redesigned to be flexible, so more products could be added in the future, while still being simple and robust. When I left that December, the system was up and running, billing customers each month. Overall, the project seemed like a total success.

Upon returning to Fog Creek the following summer, I found that was not the case. While I was away, FogBugz had grown from an install-only product to its own hosted solution. As with Copilot, this new SaaS-y version of FogBugz required a subscription billing system. So one of the developers on the FogBugz team added a significant amount of new code to support it.

While part of this new code was to deal with cases I had not foreseen in the original version, much more duplicated existing concepts in a different and incompatible way. (This might sound like an admonishment of the developers adding that new code, but most of the blame lies with me for not communicating, through documentation, the intent of the features they had duplicated.)

As time went on, the code base continued to grow organically. Occasionally, a change in one side of the code would cause billing errors for the other side. Bugs were introduced that caused miscalculations in customers' bills. Other bugs caused duplicate payments to be triggered without properly recording them.

When Kiln came along, it needed to be added to the system as well. Because of its tight integration with FogBugz, even more code warts were added on to deal with the possibility that accounts might only have Kiln, only have FogBugz, or have both.

In all, thousands of developer hours were spent developing and maintaining something that had nothing to do with the software our customers were paying us for. Time that we could have spent writing new features and fixing bugs for customers was instead spent digging into stringy old billing code. What's worse, the system always was and, unless some great rewrite happens, will always be a mess. (Little known fact: The odds of such a rewrite happening on such a critical system are 2^276,709 to 1 against.) The fact is, writing billing software is not Fog Creek's core competency.

I bring this story up because it seems to closely parallel the situation that led to Linode's recent security incident, in which they lost both their password database and their entire credit card database. (Fortunately, Fog Creek never suffered a similar loss of credit card data while we were still storing it ourselves.) In these sorts of security breaches, passwords are relatively easy to fix by requiring a password reset. The credit cards, however, are a much bigger problem, since they are often used in many different places, requiring worried customers to update their credit card with every vendor they work with.

(Fortunately, the credit card numbers were encrypted in the database. Unfortunately, they decided to store the public and private key for those credit card numbers on the same server. According to Linode, the private key is encrypted with a "complex passphrase" which is "not guessable, sufficiently long and complex, not based on dictionary words, and not stored anywhere but in our heads".)

So how did Linode lose their entire credit card database to a hacker? Putting aside the technical details (which involve a zero-day exploit of ColdFusion), it comes down to Linode, a great VPS provider, spending time developing sensitive, complex systems that were outside of both their core competency and their main business. They, like Fog Creek, are not a payment processing company. Though the billing system literally brings in the money, it's not what their customers pay them for. If, instead, they'd simply paid for one of the many affordable credit card vaults available, whose job it is to securely store credit card numbers, this never would have happened.