DSHR's Blog: Economic Model of Long-Term Storage

I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Tuesday, August 22, 2017

[Figure: Cost vs. Kryder rate]

As I wrote last month in Patting Myself On The Back, I started working on economic models of long-term storage six years ago. I got a small amount of funding from the Library of Congress; when that ran out I transferred the work to students at UC Santa Cruz's Storage Systems Research Center. This work was published here in 2012 and in later papers (see here).

What I wanted was a rough-and-ready Web page that would allow interested people to play "what if" games. What the students wanted was something academically respectable enough to get them credit. So the models accumulated lots of interesting details. But the details weren't actually useful. The extra realism they provided was swamped by the uncertainty from the "known unknowns" of the future Kryder and interest rates. So I never got the rough-and-ready Web page.

Below the fold, I bring the story up-to-date and point to a little Web site that may be useful.

Earlier this year the Internet Archive asked me to update the numbers we had been working with all those years ago. And, being retired with time on my hands (not!), I decided instead to start again. I built an extremely simple version of my original economic model, eliminating all the details that weren't relevant to the Internet Archive and everything else that was too complex to implement at short notice, and put it behind an equally simple Web site running on a Raspberry Pi (so please don't beat up on it).

What This Model Does

For a single Terabyte of data, the model computes the endowment: the sum of money which, deposited with the Terabyte and invested at interest, would suffice to pay for the storage of the data "for ever" (actually 100 years in this model).
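
To make the idea of an endowment concrete, here is a minimal sketch in Python. It is not the code behind the Web site. It assumes a single annual payment that shrinks each year as the Kryder rate makes storage cheaper, and it discounts each payment back to the start at a fixed interest rate. The function name `endowment_sketch` and its arguments are hypothetical, chosen only for this illustration, and it ignores everything else the real model tracks (drive life, slots, running costs, replication).

```python
def endowment_sketch(first_year_cost, kryder_rate, discount_rate, years=100):
    """Present value of `years` annual payments (illustrative only).

    Assumption: the payment in year 0 is `first_year_cost`, and each later
    year's payment shrinks by a factor of (1 + kryder_rate) because the same
    money buys that much more capacity. Rates are fractions, e.g. 0.10.
    """
    total = 0.0
    for year in range(years):
        payment = first_year_cost / (1.0 + kryder_rate) ** year
        # Money invested today grows by (1 + discount_rate) each year, so a
        # payment due in `year` years needs less than its face value now.
        total += payment / (1.0 + discount_rate) ** year
    return total


if __name__ == "__main__":
    # Compare a few hypothetical Kryder rates at a 2% real interest rate.
    for kryder in (0.05, 0.10, 0.20):
        print(kryder, round(endowment_sketch(100.0, kryder, 0.02), 2))
```

Even in this toy version, changing either rate by a few percentage points moves the result substantially, which is the sensitivity the rest of the post is concerned with.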

Assumptions

These are the less than totally realistic assumptions underlying the model:

- Drive cost is constant, although each year the same cost buys drives with more capacity as given by the Kryder rate.
- The interest rate and the Kryder rate do not vary for the duration.
- The storage infrastructure consists of multiple racks, each containing multiple slots for drives; i.e. the Terabyte occupies a very small fraction of the infrastructure.
- The number of drive slots per rack is constant.
- Ingesting the Terabyte into the infrastructure incurs no cost.
- The failure rate of drives is constant and known in advance, so that exactly the right number of spare drives is included in each purchase to ensure that failed drives can be replaced by an identical drive.
- Drives are replaced after their specified life even though they are still working.

Some of these assumptions may get removed in the future (see below).

Parameters

This model's adjustable parameters are as follows.

Media Cost Factors
- DriveCost: the initial cost per drive, assumed constant in real dollars.
- DriveTeraByte: the initial number of TB of useful data per drive (i.e. excluding overhead).
- KryderRate: the annual percentage by which DriveTeraByte increases.
- DriveLife: working drives are replaced after this many years.
- DriveFailRate: the percentage of drives that fail each year.

Infrastructure Cost Factors
- SlotCost: the initial non-media cost of a rack (servers, networking, etc.) divided by the number of drive slots.
- SlotRate: the annual percentage by which SlotCost decreases in real terms.
- SlotLife: racks are replaced after this many years.

Running Cost Factors
- SlotCostPerYear: the initial running cost per year (labor, power, etc.) divided by the number of drive slots.
- LaborPowerRate: the annual percentage by which SlotCostPerYear increases in real terms.
- ReplicationFactor: the number of copies. This need not be an integer, to account for erasure coding.

Financial Factors
- DiscountRate: the annual real interest obtained by investing the remaining endowment.

Defaults

The defaults are my invention for a rack full of 8TB drives. They should not be construed as representing the reality of your storage infrastructure. If you want to use the output of this model, for example for budgeting purposes, you need to determine your own values for the various parameters.

Default values

Parameter          Value    Units
DriveCost          250.00   Initial $
DriveTeraByte      7.2      Usable TB per drive
KryderRate         10       % per year
DriveLife          4        years
DriveFailRate      2        % per year
SlotCost           150.00   Initial $
SlotRate           0        % per year
SlotLife           8        years
SlotCostPerYear    100.00   Initial $ per year
LaborPowerRate     4        % per year
DiscountRate       2        % per year
ReplicationFactor  2        # of copies

Unlike the KryderRate and the SlotRate, the LaborPowerRate reflects that the real cost of staff increases over time. Of course, the capacity of the slots is typically increasing faster than the LaborPowerRate, so the per-Terabyte running cost still decreases over time. Nevertheless, the endowment calculated is quite sensitive to the value of the LaborPowerRate.

Calculation

The model works through the 100-year duration year by year. Each year it figures out the payments needed to keep the Terabyte stored, including running costs and equipment purchases. It then uses the DiscountRate to figure out how much would have to have been invested at the start to supply that amount at that time. In other words, it computes the Net Present Value of each year's expenditure and sums these values to obtain the endowment needed to pay for storage over the full duration.

Usage

[Figure: Sample model output]

The Web site provides two ways to use the model:

- Provide a set of parameters including a DiscountRate and a KryderRate, and compute the model's estimate of the endowment.
- Provide a set of parameters excluding the DiscountRate and the KryderRate, and draw a graph of how the model's estimate of the endowment varies with the DiscountRate and KryderRate for reasonable ranges of these two parameters.

The sample graph shows why adding lots of detail to the model isn't really useful: the effects of the unknowable future DiscountRate and KryderRate parameters are so large.

Code

The code is here under an Apache 2.0 license.

What This Model Doesn't (Yet) Do

If I can find the time, some of these deficiencies in the model may be removed:

- Unlike earlier published research, this model ignores the cost of ingesting the data in the first place, and of accessing it later. Experience suggests the following rule of thumb: ingest is half the total lifetime cost, storage is one-third, and access is one-sixth. Thus a reasonable estimate of the total preservation cost of a Terabyte is three times the result of this model.
- The model assumes that the parameters are constant through time. Historically, interest rates, the Kryder rate, labor costs, etc. have varied, and thus should be modeled using Monte Carlo techniques and a probability distribution for each such parameter. It is possible for real interest rates to go negative, for disk cost per Terabyte to spike upwards as it did after the Thai floods, and so on. These low-probability events can have a large effect on the endowment needed, but are excluded from this model. Fixing this needs more CPU power than a Raspberry Pi.
- There are a number of different possible policies for handling the inevitable drive failures, and different ways to model each of them. This model assumes that it is possible to predict, at the time a batch of drives is purchased, what proportion of them will fail, and inflates the purchase cost by that factor. This models the policy of buying extra drives so that failures can be replaced by the same drive model.
- The model assumes that drives are replaced after DriveLife years even though they are still working. Continuing to use the drives beyond this can have significant effects on the endowment (see this paper).

Posted by David. at 10:00 AM
Labels: storage costs

4 comments:

Unknown said...
Nice post, and model, even if it can't be predictive. Might want to throw in inflation. Effect might be large, given the decade average in the US hasn't been lower than 2% for the last 100 years. Revised code, assumed to be buggy, here.
Rick
August 22, 2017 at 12:26 PM

David. said...
Rick, please read the post more carefully: "constant in real dollars" and "annual real interest". The model works in real dollars, that is, after adjusting for inflation. In other words, your idea of the average future rate of inflation needs to be subtracted from your idea of the KryderRate, SlotRate and LaborPowerRate in nominal dollars. Adding an inflation parameter would be double-counting.
August 22, 2017 at 12:52 PM

Unknown said...
Oops. Apologies. I should've caught that by inference from your straw-man 2% discount rate, as well.
August 22, 2017 at 1:04 PM

David. said...
I want to use the Pi for something else, so I have taken the model down. If you need to use the model, please install it on your own hardware from GitHub: https://github.com/dshrosenthal/EconomicModel
If this isn't possible, post a comment and I'll see if I can resurrect the model.
February 21, 2020 at 5:37 PM

Blog Rules

Posts and comments are copyright of their respective authors who, by posting or commenting, license their work under a Creative Commons Attribution-Share Alike 3.0 United States License. Off-topic or unsuitable comments will be deleted.