This note is a contribution to the continuing debates and analyses about what can and should be done to make public data open. In it, I share some observations about current practices surrounding public data. In general, these observations lead to the insight that absolutely open public data is and will continue to be rare. Instead, various types of data are apt to be more or less open, and the reasons for the degree of openness may vary from one situation to another, that is, by type of data, by country, by type of institution, etc.
First, the willingness of public agencies to provide datasets varies not just country by country, but agency by agency and government by government. The case I have read most about concerns the openness of city bus schedule data, in part because this is tracked for US transport agencies by an organization called City-Go-Round at http://www.citygoround.org. The City-Go-Round website includes a list of transport agencies and whether or not they provide open transit data. The site contains a good discussion of degrees of openness and displays examples of the various degrees of openness from different agencies. One of the transit agencies it tracks, the Bay Area Rapid Transit system (BART) in the Bay Area in California, is famous for its openness and its willingness to publicize applications built on its data by third parties. Other transit agencies are famous for their willingness to sue entities that had the temerity to try to use their data. For instance, RailCorp, a New South Wales state-owned train company, threatened to sue an Australian who had developed an iPhone application that provided some of the RailCorp timetables (see http://www.macnn.com/articles/09/03/06/train.app.sued/, viewed on February 8, 2012).
Why might a public agency be reluctant to share public data, such as city bus schedules? I have encountered six reasons, although there may be other reasons as well.
Homeland Security: This is the fear that the data will be misused for terrorist purposes, for instance, using city bus schedules to plan a disruption on a city bus. This reason appears more often in conjunction with maps of critical infrastructure, such as gas pipelines or electrical transmission lines.
Legal constraints: These primarily take the form of privacy or intellectual property restrictions. Privacy appears as a reason not to share when individual data is involved, such as names, birth dates and the like. Although techniques for anonymizing or aggregating such data do exist, agencies often lack the resources or expertise to apply them. Intellectual property (IP) restrictions stem in part from the concern that trade secrets or other confidential material will be disclosed. The confidences involved may be those of the government or of an outside entity that submitted data. IP restrictions also stem from a concern that copyrighted material may be reproduced, distributed, or otherwise used outside the scope of the rights that the government has. In the US, works created by federal government employees in the course of their employment are not subject to copyright, and the government usually obtains unlimited rights to final reports submitted to it, so those materials do not raise this concern. But that still leaves government-held material that does cause concern, such as proposals submitted to obtain government grants, which often contain proprietary material delivered for grant review purposes only.
Cost: Although the agency may be required to gather such data for its own operations, the data almost never arrives in a form that can be automatically made open. Sometimes the data needs to be anonymized or aggregated to alleviate privacy concerns. Other times, in order to make the data useful to outsiders, it needs to be re-formatted and placed in open data sources in specified formats, usually electronic. Note that City-Go-Round, the transit data-tracking organization mentioned above, does not deem US transit data open unless it is available to outsiders in a specified electronic format. Sometimes the agency itself may not have the in-house technological expertise and assistance to maintain its data in open formats, and it may not have the budget to outsource such services. Note that the costs of formatting for openness can be significant. In the 2011 round of US federal budget talks, one of the line items considered for elimination was the Electronic Government Fund, the US government's effort to make more of its data open. The president initially proposed around $34 million for the Electronic Government Fund, which pays for transparency projects such as data.gov, the US portal for many databases of public federal data. The final budget passed was $12.4 million, up from the $8 million that was proposed at one point (http://www.ombwatch.org/node/11943, viewed February 8, 2012).
Revenue: The agency may be receiving revenue from providing its data on a less than open basis. I was part of a group that offered to place all of a particular US state's statutes on an open website, available to everybody without charge for access and without restrictions on further reproduction and distribution. (In the US, state statutes are generally not copyrightable, although a particular collection and arrangement of them may be. For instance, some publishers provide not just the raw text of each statutory section, but citations to and short summaries of court cases that have discussed that particular section. The raw text can be made open by anyone, but the publisher-written case summaries are subject to copyright. In our case, at issue was the raw text.) The Legislature of that state opposed us, because it was selling an electronic copy (of the raw text only) for $90,000. It would sell us the same copy, but the terms of sale would prohibit our turning around and making the contents available for free (as described above). In some other cases, the agency is not itself receiving revenue, but one or a few outside re-packagers of the data are doing so, and the agency or its key legislators may be receiving in-kind services (or other, less legal, considerations) or may have a political preference for private enterprise, any of which may lead it to want to perpetuate the situation. For an example of in-kind services: one US publisher of court cases tries to correct any typos, misspellings, or grammatical errors in court opinions before publishing them, so the judges do not look bad. There is no requirement to do so, and a low-budget competitor may not bother (or may consider it tampering), so the court system may be less eager to cooperate with such low-budget competitors.
Power: The old adage "knowledge is power" extends to the modern definition of information as a trade secret, which has value because it is not generally known. Even if the agency is not receiving revenue by keeping the data less than open, it may be receiving recognition, or other forms of power and influence, by controlling access to its data.
Ownership Mentality: We live in the information age, where products in digital form, such as sound recordings, video recordings, software programs, and electronic databases have been used by their owners (not always their creators) to make large amounts of money, in large part because intellectual property laws (copyright, trademark, patent, and trade secret) gave the owners rights to control their distribution and use. Mike Masnick of Techdirt (http://www.techdirt.com) writes on this subject often. One of his most extreme examples, albeit not a digital product and not a government, involves an antique shop that felt it had a right to stop people from copying the designs of the products it sold, even though it had not created those designs, merely having bought and re-sold the products involved. (http://www.techdirt.com/articles/20110505/00331514160/antique-shop-takes-ownership-culture-to-new-level-sues-over-lamps-it-doesnt-own.shtml, viewed May 16, 2011). A similar reluctance to let others use our data, even when it costs us nothing and is not producing revenue or power for us, is sometimes behind a lack of open data from some sources.
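The aggregation technique mentioned above under legal constraints and cost can be illustrated with a minimal sketch. It assumes a common disclosure-control practice, namely publishing only counts per area and withholding any count that falls below a minimum cell size. The records, the function name aggregate_with_suppression, and the threshold of 5 are all hypothetical and chosen only for illustration; they are not drawn from any agency's actual rules.

```python
from collections import Counter

# Hypothetical individual-level records: (area, name).
# The names are what the agency must not release.
records = [
    ("North", "A. Adams"), ("North", "B. Brown"), ("North", "C. Clark"),
    ("North", "D. Davis"), ("North", "E. Evans"),
    ("South", "F. Frank"), ("South", "G. Green"),
]


def aggregate_with_suppression(rows, min_count=5):
    """Count records per area and withhold small counts.

    min_count is an assumed threshold: a count below it is replaced by
    None so that very small groups cannot be singled out.
    """
    counts = Counter(area for area, _name in rows)
    return {area: (n if n >= min_count else None) for area, n in counts.items()}


published = aggregate_with_suppression(records)
print(published)  # {'North': 5, 'South': None}
```

Even a step this small has to be designed, run, and checked for every release, which is part of why agencies with limited staff may find it costly.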
Some of the literature of open data seems to imply that data is either open or closed, where "open" means available for unlimited use without charge and everything else is "closed." In reality, however, there is a range of positions on open data and openness, running from available for unlimited use without charge to highly classified, with a number of policies in between. I have already mentioned agencies that have exclusive or highly selective arrangements, or that charge high prices. Some agencies are under legal constraints to limit use of their data, or even access to it, by type of recipient. For instance, US health and insurance agencies can only share medical data with entities that are themselves covered under the appropriate health information law, generally providers and insurance firms. A software firm that wanted to develop an analytical tool to sell to hospitals and insurance agencies might not be allowed access to the health data. Agencies that collect data from firms in a given industry may have constraints on both some individualized data (because it would reveal trade secrets) and some industry-wide data that could facilitate price coordination and other prohibited activities. A forest service agency, for instance, might be authorized to collect and publish data on lumber yields for various species of tree per square meter in various areas, but be prohibited from collecting and publishing the prices charged per board foot for various species. The point is that agencies can (and do) offer varying degrees of "openness," not just a binary choice between none and all.
Different types of data have different levels of reliability and different sources of error. Some errors are inadvertent, but others are purposeful: asking individuals to report their own behavior that is considered bad is not apt to produce accurate results, although clever questions by the data-gatherer, such as a media polling organization, may produce something more accurate. For instance, in a US Congressional race several decades ago, a candidate had very outspoken views against certain minorities. Almost no one would directly admit to favoring this candidate. However, the percentage of people who said that this candidate was as qualified as her opponent closely tracked the percentage of the vote she eventually received. In other words, asking "will you vote for this person?" under-reported her eventual vote total; asking "do you consider her as qualified as her opponent?" closely tracked her vote total. Similarly, vague or confusing questions may produce answers that are less accurate than would be produced by better questions, and inferior data-gathering methods, such as less precise thermometers, less trained data-gathering staff, or staff that cannot or will not work around their own preconceptions, may produce less reliable data than might have been obtained with better training and better instruments.
These differences influence not only the value of each individual element, but the value of different types of analysis applied to the less-than-perfect dataset. For instance, some measures of central tendency, such as the median, are much less sensitive to outliers than the arithmetic mean. If the agency aggregates before opening, say to protect individual identities, the results could be very sensitive to the form of aggregation chosen. To give a concrete example: the US Federal Communications Commission does release data on the availability of broadband services (how many suppliers, what speeds, etc.) aggregated by zip code. However, each zip code represents a defined area for delivery of US mail, and may vary from a few hundred square feet to many square miles in size, and may have similar variations in population, from several hundred to many thousands. In addition, the FCC deems an offering to a single site inside the zip code to be equivalent to an offer to each and every site within that zip code. By the time one gets to the level of an entire state, many of these differences average out, but for analyzing a sub-state region, many consider this form of aggregation extremely misleading. That is, using only one or two observations per zip code gives highly unreliable data about that zip code (usually grossly over-estimating the values being gathered), and using zip codes as the aggregating unit, when they vary so widely in both geographic size and human population, distorts both per-person and per-area statistics.
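Both effects can be seen in a small numerical sketch. All of the figures below are invented for illustration and are not FCC data, and the function names flagged_coverage and actual_coverage are hypothetical. The first part compares the mean and the median of a small sample containing one outlier. The second part compares coverage computed under the rule that one offer anywhere counts for the whole zip code with coverage computed from the sites that actually have an offer.

```python
from statistics import mean, median

# Hypothetical reported speeds (Mbps) for one area, with one extreme value.
speeds = [5, 6, 6, 7, 8, 1000]

print(mean(speeds))    # 172: pulled far upward by the single outlier
print(median(speeds))  # 6.5: stays close to the typical value

# Hypothetical zip codes: (sites with an offer, total sites).
zip_codes = {
    "A": (1, 200),    # one served site marks all 200 as served
    "B": (150, 300),
    "C": (0, 500),
}


def flagged_coverage(zips):
    """Share of sites counted as served when any single offer
    marks every site in that zip code as served."""
    served = sum(total for offered, total in zips.values() if offered > 0)
    return served / sum(total for _offered, total in zips.values())


def actual_coverage(zips):
    """Share of sites that actually have an offer."""
    offered_sites = sum(offered for offered, _total in zips.values())
    return offered_sites / sum(total for _offered, total in zips.values())


print(flagged_coverage(zip_codes))  # 0.5
print(actual_coverage(zip_codes))   # 0.151
```

In this made-up case the zip-level rule reports half of all sites as served when only about 15 percent are, which is the kind of over-estimate described above.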
Even if the data is of known reliability, various analyses applied to it may not be. A classic case is the lessons drawn by opposing political parties from the same set of vote totals. My general answer is to let a thousand flowers bloom, so long as those flowers can include critiques of these analyses.
Even if the agency is perfectly even-handed in offering and delivering data, certain analyses of certain datasets may require enough funds or expertise that only a few are apt to even try. In those cases, open data may actually exacerbate gaps between haves and have-nots. The rich will be able to hire experts to help them use the data; the poor will not. I have heard anecdotes, although I do not have a citation, that successful real estate brokers use complex analyses of public land data to out-negotiate community groups that have access to the same data, but have neither the expertise nor the resources to hire expertise to perform similar analyses.
The need for resources may occur at the data stage: you need computers to receive electronic data, for instance, and you need mechanisms for converting to electronic data if the agency only provides paper copies. The need for resources may occur at the analysis stage: some knowledge of statistics, computer programming, analytical models, etc. is often needed to make sense of data and to make judgments about what the agency has done to gather and present the data and about analyses performed by others. The need for resources may occur at the presentation stage: some knowledge of and access to graphics and other presentation elements may be crucial to making effective use of data and analysis. For instance, an application for a smartphone has to be easy to use as well as clever in working with the public data.
Some attempts to have citizen involvement in community planning in the 1960s illustrate all three of these gaps. Absentee landlords and others arguing with local citizens over various proposals for urban renewal always had more public data to analyze, much better analyses to present, and much better means of presentation. These inequalities persisted unless and until the citizens' groups gained access to comparable analytical resources, through volunteers or public funds.
The adage "where there's a will, there's a way" is pertinent to understanding the different ways access to data has been managed in the past. Indeed, adopting varying degrees of openness is one of the ways that agencies and the users of their data reach accommodations in relation to the reluctance to share and concerns about different sources of error. Here are some of the other accommodations I have observed:
Legal constraints: Beyond the technical accommodations of anonymization and aggregation, sometimes the law is changed. HIPAA, the Health Insurance Portability and Accountability Act of 1996 (P.L. 104-191), made its accommodation by authorizing sharing among those agreeing to observe certain protocols regarding the security of the data to be exchanged.
Cost/Revenue/Power/Ownership: These can be treated together because successful accommodations to them usually involve collaborations that share the expense of and value from datasets. For instance, the data user might pay the agency some amount that would help offset costs and even provide some revenue for the agency, and would share credit and glory in ways that would reward the agency in political terms for participating. The result might not be as much net financial gain to government as an exclusive or extremely high-priced distribution, but would serve some of the non-revenue goals of the agency in ways that more limited distribution would not. Note the frequent failure of collaborations between agencies and outside users that do not provide for some ongoing means of covering costs or that do not provide benefits to both sides. They usually fade away after an initial burst of enthusiasm.
Several cities, for instance, have conducted high-publicity contests for the best use of agency data that resulted in some very innovative and impressive applications. But those that required continuing activity by either the agency or the outsider soon faded away unless the project included some continuing stream of benefits to those that had to continue the activity, whether the agency, the outsider, or both. (See Russell Nichols, "Do Apps for Democracy and Other Contests Create Sustainable Applications," dated July 11, 2010, viewed at http://www.govtech.com/e-government/Do-Apps-for-Democracy-and-Other.html on May 16, 2011).
Sources of Error: The accommodations to this issue are not just technical, although appropriate statistical and analytical techniques do play a role. Another form of accommodation is less limited distribution. If many different outsiders are working with the same data, the probability that differing perspectives will be applied goes up, increasing the probability that errors will be dealt with appropriately. That probability never reaches one, however: we can all think of situations where everyone turned out to be wrong.
Resources Divide: The agency (or others, in all cases in this paragraph) can deal with the data level by providing, or working with others to provide, the data in some generally known electronic format. The agency can deal with the analysis level by providing examples of analysis, references to experts and information about analysis, and even grants for analytical activities (note that many agencies do in fact pay for numerous analyses of their data through grants and contracts to private parties). The agency can then deal with the presentation level in the same way as the analysis level, by providing examples, references, and even resources for making presentation materials based on its data. Thinking separately about the data, its analysis, and its presentation can help the agency promote more equality of data use among the groups that would benefit themselves and the agency's service to the public by obtaining and using the data.
In conclusion, we note the wide-ranging diversity that actual practices surrounding open data produce. Even if there were one best way to produce and use open data, the various agencies and the outsiders working with them would fall short of the ideal in a multitude of different ways. If there are, as I suspect, many different best ways to produce and use open data, the multitude of ways in which open data gets produced and used, and of the results, gets even larger.
But I, for one, do not think the variation will decrease. We will continue to increase the amount of public data. We will continue to increase our ability to gather and make use of it through advances in hardware, software, scientific understanding, and the like. Given the current world situation, our concerns about its misuse, its errors, and the unevenness of resources to exploit it will not decrease. Therefore, I predict that we will continue to see varying degrees of openness around public data into the foreseeable future.