Sunday, October 6, 2013

The Cost Of Big Data

Amazon Simple Storage Service (S3)

S3 is a large-scale, low-cost, minimal-feature-set REST/SOAP web service for storage and retrieval of data. You can write, read and delete objects up to 5TB in size. They claim you can store an unlimited number of objects, eliminating the need for capacity planning. Objects are stored in S3 buckets, and buckets exist in an AWS region. Objects can be made private or public, and rights can be granted to specific users. The default download protocol is HTTP, but a BitTorrent protocol is also provided. They provide an interface to monitor and control expense and to automatically archive data to lower-cost storage.

Their SLA says you can request a credit of 10% or 25% of your monthly bill if their monthly up-time falls below 99.9% or 99% respectively. Up-time is calculated from the rate of InternalError or ServiceUnavailable results returned when you call their S3 service during a billing period.
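As a rough sketch of how I read that calculation (my interpretation, not the official SLA wording): each five-minute period has an error rate of errored requests over total requests, the monthly up-time percentage is 100% minus the average of those error rates, and the credit tier follows from the up-time.

# Rough sketch of the SLA credit calculation as I read it; the five-minute
# averaging and the 10%/25% tiers are my interpretation, not official text.
def monthly_uptime(periods):
    """periods: list of (error_count, total_requests) per 5-minute interval."""
    rates = [err / float(total) for err, total in periods if total > 0]
    return 100.0 - 100.0 * sum(rates) / len(rates)

def credit_percent(uptime):
    if uptime < 99.0:
        return 25
    if uptime < 99.9:
        return 10
    return 0

# Example: one bad 5-minute period out of four.
periods = [(0, 1000), (0, 1000), (120, 1000), (0, 1000)]
up = monthly_uptime(periods)           # 97.0 for this toy sample
print(up, credit_percent(up))          # -> 97.0 25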

S3 was launched in 2006. In 2008, it had an 8-hour outage. Despite the length of the outage, some users claimed to be unaffected thanks to caching. There doesn't appear to have been another major event since 2008.

S3 redundantly stores data in multiple facilities and on multiple devices within each facility and calculates checksums on all network traffic when storing or retrieving data. It performs regular, systematic data integrity checks and is built to be automatically self-healing. It is designed to sustain the concurrent loss of data in two facilities and for server-side latency to be insignificant relative to internet latency.

S3 supports versioning which allows you to recover from both unintended user actions and application failures. Storage rates apply for every version stored.

S3 objects can be managed through the AWS Management Console (a website) as described in their getting started guide or programmatically via REST/SOAP APIs. They also provide an SDK for .NET and an SDK for Java.
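There's also boto, the Python library of that era; a basic write/read/delete round trip with it looks roughly like this (the bucket and key names are made up, and your AWS credentials are assumed to already be configured):

# Minimal sketch using boto; bucket/key names are made up and credentials
# are assumed to be in your environment or ~/.boto.
import boto

conn = boto.connect_s3()
bucket = conn.create_bucket('example-bigdata-bucket')  # buckets live in a region
bucket.configure_versioning(True)                      # keep old versions on overwrite

key = bucket.new_key('reports/2013-10-06.csv')
key.set_contents_from_string('hello,world\n')          # PUT
print(key.get_contents_as_string())                    # GET
key.delete()                                           # DELETE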

Aside from read/write and access control, you can define rules to automatically archive objects to Amazon Glacier based on their lifetime. Data archival rules are supported in the US-Standard, US-West (N. California), US-West (Oregon), EU-West (Ireland), and Asia Pacific (Japan) Regions.
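This kind of rule can also be set programmatically; a sketch with boto's lifecycle support, assuming a bucket you own and a made-up prefix, that transitions objects to Glacier 30 days after creation:

# Sketch of a lifecycle rule archiving objects to Glacier after 30 days,
# via boto's lifecycle support; bucket name and prefix are made up.
import boto
from boto.s3.lifecycle import Lifecycle, Transition, Rule

bucket = boto.connect_s3().get_bucket('example-bigdata-bucket')

to_glacier = Transition(days=30, storage_class='GLACIER')
lifecycle = Lifecycle()
lifecycle.append(Rule('archive-old-reports', 'reports/', 'Enabled',
                      transition=to_glacier))
bucket.configure_lifecycle(lifecycle)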

You can tag S3 buckets and view breakdowns of your costs aggregated by tags. You can use Amazon CloudWatch to receive alerts when your S3 charges surpass arbitrary thresholds.
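A sketch of such an alarm with boto's CloudWatch support, assuming billing metrics are enabled on the account and that the SNS topic ARN below exists (both made up here):

# Sketch of an alarm on estimated S3 charges via boto and CloudWatch;
# assumes billing metrics are enabled and 'topic_arn' is a real SNS topic.
import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

conn = boto.ec2.cloudwatch.connect_to_region('us-east-1')
topic_arn = 'arn:aws:sns:us-east-1:123456789012:billing-alerts'  # made up

alarm = MetricAlarm(name='s3-charges-over-1000-usd',
                    namespace='AWS/Billing', metric='EstimatedCharges',
                    statistic='Maximum', comparison='>=', threshold=1000,
                    period=6 * 3600, evaluation_periods=1,
                    dimensions={'Currency': 'USD', 'ServiceName': 'AmazonS3'},
                    alarm_actions=[topic_arn])
conn.create_alarm(alarm)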

For people considering migrating a large amount of data to/from S3, Amazon provides an Import/Export service where the data is delivered to/from Amazon on physical media instead of over the internet.

For people who will frequently transfer massive data, Amazon offers Direct Connect, where they basically function as your ISP. To use this service you must have physical infrastructure at one of the eleven worldwide Direct Connect locations, or you must employ one of their partners to establish network circuits between a Direct Connect location and your office. The cost of Direct Connect is based on port-hours and data-transfer-out (transfer-in is free). 1TB/day output over a 1Gbps port for one year costs (0.03*1024 + 0.3*24)*365 = $13,840.
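Restating that arithmetic in code (the $0.30 per port-hour and $0.03/GB out rates are the ones implied by the formula):

# Direct Connect estimate from above: a 1 Gbps port billed per port-hour
# plus transfer-out per GB, for 1 TB out per day over a year.
port_hour_rate = 0.30     # $ per port-hour for a 1 Gbps port
transfer_out_rate = 0.03  # $ per GB transferred out
daily = transfer_out_rate * 1024 + port_hour_rate * 24
print(int(daily * 365))   # -> 13840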

Cost for S3 is based on maximum data-at-rest plus the number of API calls made plus data-transfer-out (transfer-in is free). Reduced Redundancy Storage (RRS) is a lower-cost alternative within S3. Amazon S3's standard and reduced redundancy options both store data in multiple facilities and on multiple devices, but with RRS, data is replicated fewer times, so the cost is less. S3 standard storage is designed to provide 99.999999999% durability and to sustain the concurrent loss of data in two facilities, while RRS is designed to provide 99.99% durability and to sustain the loss of data in a single facility. Amazon suggests you use standard S3 for original data and RRS for data derived from original data.

Cost varies by region and decreases as volume increases, so Amazon has provided a cost calculator to help estimate costs. A few things to note: data-in is free, data-out to EC2 in the Northern Virginia region is free, data-out to other AWS regions has a small cost, and data-out to the internet has a sliding-scale cost.

Let's say you built an application in EC2 in the N.Virginia region and it needed to PUT and GET 1TB of standard S3 data per day using 1,000 PUTs and 10,000 GETs per day. You start the year with no data and end with 365TB of data, but your reads of that data don't increase. Then the S3-only costs for the entire year are as follows:

data-at-rest
(365 TB / 12 months ≈ 30.4 TB accrued per month)

  1 * 1024 * 0.095 + 29.4 * 1024 * 0.080
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 +  10.8 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 +  41.2 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 +  71.6 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 102.0 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 132.4 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 162.8 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 193.2 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 223.6 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 254.0 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 284.4 * 1024 * 0.070
+ 1 * 1024 * 0.095 + 49.0 * 1024 * 0.080 + 314.8 * 1024 * 0.070
= $176,095.23

number of API calls
(0.005 for the 1,000 PUTs + 0.004 for the 10,000 GETs per day) * 365 = $3.29

plus data-transfer-out
$0 (since only the EC2 instance is reading directly from S3)

Obviously most people use less data more often. Also note that the above ignores the cost of transferring the data into EC2. Switching to RRS slightly reduces the figure.

  1 * 1024 * 0.076 + 29.4 * 1024 * 0.064
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 +  10.8 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 +  41.2 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 +  71.6 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 102.0 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 132.4 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 162.8 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 193.2 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 223.6 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 254.0 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 284.4 * 1024 * 0.056
+ 1 * 1024 * 0.076 + 49.0 * 1024 * 0.064 + 314.8 * 1024 * 0.056
= $140,876.19
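Those data-at-rest figures can be reproduced with a short script; this sketch assumes the 2013 tier prices used above (per GB-month: $0.095/$0.080/$0.070 standard and $0.076/$0.064/$0.056 RRS for the first 1 TB, the next 49 TB, and the next 450 TB):

# Reproduces the standard and RRS data-at-rest totals above, using the
# same tier boundaries (first 1 TB, next 49 TB, next 450 TB) and the
# 30.4 TB-per-month accrual from the scenario.
def monthly_storage_cost(tb, tiers):
    cost, remaining = 0.0, tb
    for tier_tb, price_per_gb in tiers:
        used = min(remaining, tier_tb)
        cost += used * 1024 * price_per_gb
        remaining -= used
    return cost

standard = [(1, 0.095), (49, 0.080), (450, 0.070)]
rrs      = [(1, 0.076), (49, 0.064), (450, 0.056)]

tb_per_month = 30.4
for label, tiers in [('standard', standard), ('rrs', rrs)]:
    total = sum(monthly_storage_cost(tb_per_month * m, tiers) for m in range(1, 13))
    print('%s: $%.2f' % (label, total))
# -> standard: $176095.23
# -> rrs: $140876.19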

The same goes for Glacier, although you're getting reduced service there, such as retrieval times of several hours.

30.4 * 1024 * 0.01 * ( 1+2+3+4+5+6+7+8+9+10+11+12 )
= $24,281.09

S3 Alternatives

While that sounds expensive, if you really do need to store terabytes of data, how cheap are the cheapest options? Other online storage options like Amazon EBS are considerably more expensive at $0.10 per GB-month. If you're willing to run your own hardware and give up redundancy, what's the cheapest option?

There's a great article on Wired about Backblaze, which provides online backup for home PCs. They have an infographic showing the following cost per petabyte, although it may be out of date because it's from 2011.

Raw Drives          $81,000
Backblaze          $117,000
Dell MD1000        $826,000
Sun X4550        $1,000,000
NetApp FAS-6000  $1,714,000
Amazon S3        $2,806,000
EMC NS-960       $2,860,000

Since Amazon Glacier is ~10x cheaper than Amazon S3, that would put Glacier roughly 2x more expensive than Backblaze. More to the point though is that the founders of Backblaze wanted to build a mass-data service on S3 but decided their need was too fringe to be satisfied by the S3 price model.

In 2009 Backblaze blogged about their basic storage unit, a 67TB Storage Pod built for $7,867. In 2013, that has evolved to a 180TB Storage Pod 3.0 for $10,717.59, which is the parts cost of an open-source design. You can build it yourself or get the case manufactured by 45drives.

180TB in Glacier for 1yr = 180 * 1024 * 0.01 * 12 = $22,118.40
180TB Storage Pod 3.0 = $10,717.59 + assembly + housing + power
The storage pod probably lasts several years.
Wikipedia says 3-year-old drives have an 8% failure rate.
And recall that Glacier has significant read delays.
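A rough way to frame that comparison, ignoring assembly, housing, power, drive replacement and Glacier retrieval fees (all real costs noted above):

# Very rough cumulative cost: one 180TB Storage Pod 3.0 (parts only,
# one-time) versus keeping 180TB in Glacier, year over year.
pod_parts = 10717.59                       # one-time parts cost
glacier_per_year = 180 * 1024 * 0.01 * 12  # $22,118.40 per year
for years in range(1, 4):
    print('%d yr: pod $%.2f vs glacier $%.2f'
          % (years, pod_parts, glacier_per_year * years))
# Even in year one the pod parts cost about half the Glacier bill.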

It looks like the open-sourced Backblaze Pods have had broad uptake, including Shutterfly and Vanderbilt's medical school, which is using the Pods to store medical images. Oxygen Cloud connected a v1 Pod to the web as a giant NAS drive.

SpiderOak is a competing online storage service at $10/month per 100 GB.

10 * 12 * 10.24 * 180 = $221,184 per year for 180 TB (10.24 being the number of 100 GB units in a TB).

Google offers cloud storage competitive with Amazon and the two seem to be in a price war. Their rate for mass storage at reduced availability is $0.045/GB/month, which compares to Amazon's RRS at $0.056/GB/month.

180 * 1024 * 0.045 * 12 =  $99,532.80 for 180TB for one year at Google
180 * 1024 * 0.056 * 12 = $123,863.04 for 180TB for one year at Amazon
(data transfer is extra)

I've been comparing apples to oranges here because the 180TB Storage Pod is just the cost of unassembled hardware. The point is that wanting mass data available on the web but with infrequent reads/writes is a fringe use case that none of the existing services are well suited for. However, with a technology like the Storage Pod and a flexible cloud warehouse like NIRIX you could, like Backblaze, build a custom solution.

See also 20 TB per Year, which has details about bare disks and tapes.

{ "loggedin": false, "owner": false, "avatar": "", "render": "nothing", "trackingID": "UA-36983794-1", "description": "", "page": { "blogIds": [ 456 ] }, "domain": "holtstrom.com", "base": "\/michael", "url": "https:\/\/holtstrom.com\/michael\/", "frameworkFiles": "https:\/\/holtstrom.com\/michael\/_framework\/_files.4\/", "commonFiles": "https:\/\/holtstrom.com\/michael\/_common\/_files.3\/", "mediaFiles": "https:\/\/holtstrom.com\/michael\/media\/_files.3\/", "tmdbUrl": "http:\/\/www.themoviedb.org\/", "tmdbPoster": "http:\/\/image.tmdb.org\/t\/p\/w342" }