Monday, March 28, 2005

A word on bandwidth...

In computer science, there are generally two basic measures of performance: bandwidth and latency. Bandwidth is the one you normally think about. DSL and cable modem companies like to talk about the bandwidth they provide. Cable modems these days are generally either 1.5 megabits per second (note that it's megabits, not megabytes! To get megabytes, divide by 8) or 3.0 Mb/s. At school, your ethernet connection was probably either 10 Mb/s or 100 Mb/s, although you generally don't actually get such high speeds, particularly when connecting to a site waaaay across the internet.
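
If you want to see the divide-by-8 rule in action, here is a tiny sketch. The function name is made up for illustration, and it ignores protocol overhead, so real-world numbers will come out a bit lower.

```python
def megabits_to_megabytes(megabits_per_s: float) -> float:
    """Convert a rate in megabits/second to megabytes/second (8 bits per byte)."""
    return megabits_per_s / 8


# The connection speeds mentioned above, in Mb/s.
for mbps in (1.5, 3.0, 10, 100):
    print(f"{mbps} Mb/s = {megabits_to_megabytes(mbps)} MB/s")
```

So that 1.5 Mb/s cable modem moves less than a fifth of a megabyte per second.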

So, basically, bandwidth is a measure of how much data you can shove through something in a given amount of time. In contrast, latency is a measure of how long it takes to complete any particular thing.

Now, if you haven't grappled with these concepts before, it would seem that these are duals of each other. Shouldn't low latency mean high bandwidth and vice versa? Well, it turns out that no, that's not necessarily true. Case in point: a brief discussion I had with some people at lunch today.

People I know do data mining on data from the web. By one means or another, they obtain a "crawl" of the web, which is basically a bunch of data you get by starting out at a bunch of web pages, collecting those pages, then collecting all the pages those pages link to, and so on until you decide to stop. As you can imagine, these data sets get very big very fast. They are gigabytes if not terabytes large.

I am in California. These crawls often live on computers in Washington. You would assume that being a big, fancy corporation, we would just transfer the data over the network. Not so! Remember our discussion of latency versus bandwidth? Well, imagine you could transfer stuff over the network at 2 megabytes per second. To get a 500 GB data set over that link, it would take about 70 hours, or roughly 3 days.
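
The arithmetic is easy to check. Here is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and a link that actually sustains the full rate the whole time; the function name is just for illustration.

```python
def transfer_time_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mb_per_s megabytes/second."""
    size_mb = size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    seconds = size_mb / rate_mb_per_s
    return seconds / 3600


hours = transfer_time_hours(500, 2)
print(f"{hours:.1f} hours, or {hours / 24:.1f} days")
```

That works out to a bit under 70 hours of nonstop transfer, and that's if nothing hiccups along the way.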

So what do such researchers usually do instead? That's right: they take a big disk or two, write all the data they want to it, and then mail the fucker.

That's right, mail it. Good old UPS. High latency, but really high bandwidth. An oil tanker full of DVDs has far higher bandwidth than any cable modem you'll ever see. Just happens that it can take a month or two to get anything anywhere.
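
To put a rough number on the mail option, here is a sketch that treats a shipment as a network link. The two 500 GB disks and the 24-hour delivery time are assumptions picked for illustration, not real figures, and it ignores the time spent copying data onto and off of the disks.

```python
def shipping_bandwidth_mb_per_s(size_gb: float, transit_hours: float) -> float:
    """Effective megabytes/second when size_gb gigabytes arrive after transit_hours."""
    size_mb = size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return size_mb / (transit_hours * 3600)


# Hypothetical shipment: two 500 GB disks, delivered in 24 hours.
rate = shipping_bandwidth_mb_per_s(2 * 500, 24)
print(f"{rate:.1f} MB/s, with a latency of 24 hours")
```

Nothing shows up for a whole day, but tossing more disks in the box raises the bandwidth without making the delivery any slower.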

Welcome to the weird, wonderful world of computer science.
