Chapter 1. What is S3, and what can I use it for?
The purpose of this section of the book is to show you what S3 can be used for. To that end, I’ve talked to a couple of companies who are doing interesting things with S3. I’ll also talk about some of the more common use cases.
Backups
The first thing you think of when you hear about S3 is backups. It’s quite a nice solution for this: you can easily back up any type of file, and the storage is pretty cheap. Personally, I use it to back up all of my pictures. I used to back them up onto a rewritable CD, but that was flaky and time-consuming. Then I tried various solutions, such as synchronizing multiple computers on my home network. That’s fine for most uses, but pictures of my kids are irreplaceable, so I prefer an online backup. That way, even if my house burns down, the big quake hits Vancouver, or all of my computers are stolen, I know that my pictures are safe. Yes, it’s a little paranoid, but that’s what backups are all about!
The big win for S3, however, is how easy it is to back up just about anything. There’s no GUI or Web interface to work around - it’s designed for people like you and me: people who can code. You can use it to back up your SVN repositories, your databases, or user-generated content on your website. Not only that, you can also share your backed-up files with others in creative ways.
Serving Data
The second thing you might want to do with S3 is serve data to your users. This data might be static data for your site (“Using S3 as an asset host”), user-generated data (“Serving user generated data from S3”), a BitTorrent seed for a large media file (“Seeding a BitTorrent”) or a file that only authenticated people can access (“Giving access to a bucket or object with a special URL” and “Giving another user access to an object or bucket using S3SH”).
When you are serving data, you will probably want to keep track of what is being viewed and by whom. This is discussed in the recipes from “Determining logging status for a bucket” through “Accessing your logs using S3stat”.
Use Cases
There are a whole bunch of people out there using S3 in lots of different ways. I wanted to get in touch with some of them to find out how and why they were using it. So, I semi-randomly sent out two e-mails to people whose work I had found interesting. Amazingly enough, both were kind enough to take the time to talk to me. Claude Courbois works for the NASDAQ Stock Exchange and wrote a great AIR application called Market Replay that gets all of its data directly from S3. Jason Kester runs a bunch of web applications - S3stat, Twiddla and Blogabond - which all use S3 heavily.
NASDAQ Market Replay
NASDAQ Market Replay is an Adobe AIR application that allows professional stock traders, and anyone else interested in trading stocks, to see exactly what happened when a trade was made. It gives you data, in ten-minute windows, on every transaction made on a single stock. This allows users of the application to replay the data and figure out why they got the price they did when they bought or sold the stock.
You can find more about Market Replay and download a free trial copy at https://data.nasdaq.com/MR.aspx.
Claude Courbois of the NASDAQ OMX Group kindly agreed to talk with me about Market Replay and how and why they decided to use S3.
Market Replay By The Numbers
The data for each stock is stored in 10-minute chunks, two files per chunk (one for trades, the other for quotes). There are 40 10-minute chunks per day, and around 6,000 stocks traded on NASDAQ (3,000 listed on NASDAQ, and another 3,000 from the NYSE and AMEX exchanges). That makes 40 x 2 x 6,000 = 480,000 new files every day, all stored in two buckets (one for trades, the other for quotes). Data is never purged: they are planning to keep it forever. At about 260 trading days per year, that amounts to around 125 million new files per year. Wow.
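A quick back-of-the-envelope check of those numbers:

```ruby
CHUNKS_PER_DAY  = 40      # 10-minute chunks per trading day
FILES_PER_CHUNK = 2       # one trades file, one quotes file
STOCKS          = 6_000   # NASDAQ, NYSE and AMEX symbols
TRADING_DAYS    = 260     # per year

FILES_PER_DAY  = CHUNKS_PER_DAY * FILES_PER_CHUNK * STOCKS
FILES_PER_YEAR = FILES_PER_DAY * TRADING_DAYS

puts FILES_PER_DAY   # 480000
puts FILES_PER_YEAR  # 124800000 - around 125 million
```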
According to Claude, what makes this all possible is that finding a file on S3 is fast, and doesn’t strongly depend on the number of files in the bucket. They never need to index all of the files in their buckets, as they create the file names based on the stock symbol and the slice of time the data corresponds to.
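To illustrate the idea, here’s a hypothetical naming scheme in that spirit: derive the key from the symbol and the start of the 10-minute slice, so any file can be addressed directly without ever listing the bucket. The actual format NASDAQ uses isn’t public; this is just an illustration:

```ruby
# Round the timestamp down to the nearest 10 minutes (600 seconds) and
# build a predictable key like "AAPL/20080602-0930.csv".
def chunk_key(symbol, time)
  slice_start = Time.at((time.to_i / 600) * 600).utc
  format('%s/%s.csv', symbol, slice_start.strftime('%Y%m%d-%H%M'))
end
```

Given a symbol and a time, any client can compute the key and GET it directly - no lookup table required.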
Market Replay’s Architecture
Market Replay is an Adobe AIR application: it is written in Flex and ActionScript, and runs inside the Adobe AIR runtime on the user’s computer. When you’re building a Flex or AIR app that talks to S3, one of the key problems is figuring out how to authenticate to S3 without compiling your secret key into the application. Market Replay gets its data by making requests to a server, which grabs the data from S3 and sends it back down to the AIR application.
Claude also looked into using the server to generate authenticated URLs for the AIR app and allowing the application to get the data directly from S3. They may still move to this, but right now it’s not a bottleneck so they are planning on leaving it as is.
Files are uploaded to S3 constantly while trading is happening. The raw trading data is massaged into a format that works well for the Market Replay app before being uploaded. Any further data manipulation and visualization is done by the AIR app. The newest data is about 15-20 minutes old.
One of the design decisions made during the development of Market Replay was the size of the time slice in the data files. The slices needed to be small enough to keep download and data processing time low, keeping the application responsive. Smaller files also mean fresher data: if the time slice were an hour, then data couldn’t be fresher than an hour. On the other hand, the slices had to be large enough that users didn’t need to make multiple requests to view a single trade. Also, S3 charges $0.01 per 10,000 GET requests; if the files were too small, this could actually become a factor. In the end, they decided to store the data in 10-minute chunks.
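To see how the request charge enters the trade-off, here’s a toy comparison (with illustrative numbers) of what it costs in GET requests to view one hour of data for one stock at different slice sizes:

```ruby
GET_PRICE = 0.01 / 10_000  # dollars per GET request

# Cost of the GET requests needed to view one hour of one stock's data,
# with two files (trades + quotes) per slice.
def hourly_get_cost(slice_minutes)
  slices = 60 / slice_minutes
  slices * 2 * GET_PRICE
end
```

With 1-minute slices, an hour of viewing costs ten times as many requests as with 10-minute slices - negligible for one user, but it adds up across many customers.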
Why S3?
So, why did NASDAQ choose to use S3 for this application? First and foremost was the pricing structure of S3. Not just the low cost (although that was important), but the predictability of costs. Claude could easily calculate what his storage costs would be, and how much adding another customer would increase transfer costs. Having solidly predictable costs allowed them to sell the idea of the product within NASDAQ.
The low cost of storage on S3 allows NASDAQ to keep their historical data forever. Even with the huge number of files they’re putting on S3, Claude still pays the monthly Amazon S3 bill on his corporate credit card.
Why Not S3?
When I asked Claude what problems they’ve had with S3, he had to think for a bit. They have had no problems with the service itself. The one drawback they have identified is that S3 limits what you can do with the application.
As an example, it would be hard to do things like find the highest price for a given stock over the last 30 days. For what they wanted to do, this wasn’t a requirement. They think they could support it by running an EC2 instance that parsed the data nightly and filled a relational database with the results, coarsening the granularity of the data to keep the amount of storage manageable. If you are building a web application that needs to make queries like this and doesn’t need the huge amount of storage they require, then perhaps a traditional relational database, or even Amazon’s SimpleDB web service, would be more appropriate.
Jason Kester
Jason Kester is a man of many web apps. I wanted to talk to Jason when I read about S3stat (http://s3stat.com), a web application that parses your S3 log information and gives you nice graphical analytics as a result. Then I realized that Jason also has two other web apps that make heavy use of S3. Twiddla (http://twiddla.com) is an online whiteboarding app. Blogabond (http://blogabond.com) is a blog site for world travellers. Jason and I talked about all three in our conversation.
S3stat
S3stat was created when Jason started using S3 and missed his daily web-analytics hit. He set up logging for his buckets, but found it a bit painful. To reduce that pain, he created S3stat to help others analyze their logs.
To find out how to set up S3stat for your own buckets, see “Accessing your logs using S3stat” (which was written by Jason). If you want to go through the pain yourself, then you can check out “Enabling logging on a bucket” and “Parsing logs”, which tell you how to enable logging on a bucket and how to parse the results.
S3stat uses EC2 as well as S3. Every day, S3stat starts up an EC2 instance, grabs all of the logs for all of S3stat’s users, and parses them. The results are placed in a bucket owned by the appropriate user. Once the parsing is done, the EC2 instance is shut down for the day.
The hard part of the parsing is that the logs created by S3 are not in a standard format. Jason converts the log files into Webalizer format, so that it’s easy to create graphs of the results. You might need something a little different. See “Parsing log files to find out how many times an object has been accessed” for an example of parsing S3 logs to find out how many times a single object has been accessed.
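As a taste of what that parsing involves, here’s a rough sketch that pulls a few fields out of a single access-log line. The sample line is made up, but follows S3’s documented space-delimited format:

```ruby
# An illustrative (fake) S3 server access-log line: owner, bucket, time,
# remote IP, requester, request ID, operation, key, request-URI, status, ...
LOG_LINE = 'ownerid mybucket [06/Feb/2008:00:00:38 +0000] 10.1.2.3 ' \
           'requester reqid REST.GET.OBJECT photos/cat.jpg ' \
           '"GET /mybucket/photos/cat.jpg HTTP/1.1" 200 - 1024 1024 10 9 "-" "curl/7.16.3" -'

# Capture the fields up to and including the HTTP status code.
PATTERN = /\A(\S+) (\S+) \[([^\]]+)\] (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]*)" (\d+|-)/

def parse_log_line(line)
  m = PATTERN.match(line)
  return nil unless m
  { bucket: m[2], time: m[3], operation: m[7], key: m[8], status: m[10] }
end
```

Accumulate the parsed hashes by key or by status and you’re most of the way to a simple report.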
Twiddla
Twiddla is an online whiteboarding web app. It’s quite useful if you want to talk about an image or layout with a bunch of people who aren’t in the same room. When you use it, you upload files and allow other people in your session to view them. S3 comes into play in the storing and sharing of images. Files are uploaded to S3, and then everyone else in your session gets a time-limited authenticated URL (see “Giving access to a bucket or object with a special URL” for more on generating authenticated URLs). The limited lifetime of these URLs gives an extra level of security.
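For the curious, here’s roughly how such a time-limited URL is constructed under S3’s query-string authentication scheme: the signature is an HMAC-SHA1, keyed with your secret key, over the request details and the expiry time. The credentials below are obviously fake:

```ruby
require 'openssl'
require 'cgi'

# Build a time-limited S3 URL using query-string authentication:
# sign "GET\n\n\n<expires>\n/<bucket>/<key>" with the secret key,
# base64-encode the HMAC-SHA1, and URL-escape it.
def signed_url(bucket, key, access_key, secret_key, expires)
  string_to_sign = "GET\n\n\n#{expires}\n/#{bucket}/#{key}"
  hmac = OpenSSL::HMAC.digest(OpenSSL::Digest.new('SHA1'), secret_key, string_to_sign)
  signature = CGI.escape([hmac].pack('m0'))  # base64 without newlines, then escaped
  "https://#{bucket}.s3.amazonaws.com/#{key}" \
    "?AWSAccessKeyId=#{access_key}&Expires=#{expires}&Signature=#{signature}"
end
```

S3 recomputes the same signature on its side and refuses the request once the Expires timestamp has passed, so the URL simply stops working.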
Blogabond
Blogabond is a blogging platform for world travellers. People travel around the world, blog about it, mark the places they’ve been on a map and - here’s where S3 comes in - upload pictures. Lots of pictures. The pictures are uploaded to the Blogabond server, resized, and then uploaded to S3. Files need to be resized because they are typically the full-size images straight from a camera.
When someone reads your Blogabond posts, the images are served directly from S3.
Why S3?
Why did Jason choose S3 for his web applications? Well, for S3stat, it’s kind of obvious. For the rest of them, it was due to a few things. First, Jason trusts Amazon. They do lots of things that would feel invasive coming from other companies (like when you log on to Amazon.com and it tells you what books you should read next, and it’s right!), but somehow they manage to do it without feeling creepy.
Second, Jason likes the design philosophy of Amazon Web Services: build something useful, then charge a low price based purely on usage - there’s no minimum monthly fee or signup cost, for example. Plus, every once in a while you get an e-mail from Amazon saying the prices have gone down. You’ve gotta like that! Finally, Jason has found the people at Amazon very open to communication. S3stat is a pretty unusual application, so Jason had conversations with a bunch of people inside Amazon about how he was doing things and whether it was okay with them. He found them very open and responsive.