Big Data

Big data refers to the processing of large quantities of complex information within an acceptable period of time. Given the increasing amount of information and personal data available, the management of such large data sets requires improvements at different stages of the process. Leveraging big data involves the complex synchronization of capture devices (e.g., sensors), connection networks (e.g., Wi-Fi), storage capacities (e.g., cloud services), and analysis systems (e.g., software) that enable the extraction of useful information through the establishment of relationships, the detection of patterns, and the comparison of individuals, among other tasks. There is no single agreed definition of what constitutes big data, but it is generally understood that data are “big” when their level of complexity goes beyond the capabilities of traditional data analysis tools. While the uses of big data proliferate, the possibility of function creep (i.e., widening the use of a system beyond its original purpose) and the use of metadata by national intelligence agencies without control or consent raise questions about the acquisition and use of big data and big data analysis. Questions concerning the surveillance techniques used to acquire big data, as well as a perceived invasion of privacy when data are acquired without the subject’s knowledge, have increasingly surfaced. This entry first reviews the history and definition of the term big data and then looks at some of the uses of big data. Surveillance techniques related to big data are then discussed, and the entry concludes with an examination of privacy concerns and mismanagement of big data.

History and Definition

Big data may be understood as a generalized trend, not just as a problem that arises when an organization has to deal with large and complex amounts of information. In this sense, big data refers to the imaginary vast network of databases, potentially interconnected, that contribute to deepening and optimizing the knowledge of specific contexts and situations (through the monitoring of variables) or to depicting the profiles or data doubles of individuals or human groups (through the monitoring of their personal data).

In the 1940s, there were attempts to describe the problem posed by the availability of large amounts of data. At the time, the term chosen by those discussing the phenomenon was information explosion, a term first used in the newspaper Lawton Constitution. This expression was further developed in an article published in the New Statesman in March 1964, which pointed out the difficulty of managing the large volumes of information available. Information explosion refers to the rapid increase in the amount of published information, which makes that information difficult to process and can lead to overload.

The term big data, however, was not used until 1997, in an article titled “Application-Controlled Demand Paging for Out-of-Core Visualization.” Its authors, NASA (National Aeronautics and Space Administration) researchers Michael Cox and David Ellsworth, mentioned the “problem of big data” in relation to visualization and observed that “when data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”

In addition, the notion of “too much” or “too large” has varied over time. Innovations in software and artificial intelligence (e.g., algorithms) and increases in storage capacity mean that the difficulty of managing a data set is always relative. According to Parkinson’s law of data, “data expands to fill the space available for storage.” Over the past 10 years, the memory usage of evolving systems has doubled roughly once every 18 months. At the same time, the memory density available for constant dollars has tended to double about once every 12 months. Because absolute size is therefore a moving target, the “big” in big data is commonly assessed using the concept of the “3 Vs”: volume, velocity, and variety. A fourth V, veracity, has been proposed, and there are references to value and variability, although the existing consensus does not extend beyond the three original terms.
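
The two doubling rates quoted above can be compared with a short calculation. The following Python sketch is illustrative only: it assumes that both trends behave as smooth exponentials over a 10-year window, which a rule of thumb does not guarantee, and the function name growth_factor is introduced here purely for the example.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


YEARS = 10  # window taken from the text; a clean exponential trend is an assumption

usage_growth = growth_factor(YEARS, 18)    # memory usage, doubling every 18 months
density_growth = growth_factor(YEARS, 12)  # memory per constant dollar, doubling every 12 months

print(f"memory usage grows about {usage_growth:.0f}x")
print(f"memory per constant dollar grows about {density_growth:.0f}x")
print(f"capacity outpaces usage by about {density_growth / usage_growth:.1f}x")
```

Under these assumptions, usage grows roughly a hundredfold while affordable capacity grows roughly a thousandfold, which is one way to see why the threshold for “too large” keeps moving.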

Simply put, big data processes comprise three stages: (a) collection, (b) storage, and (c) analysis. The collection of data is becoming increasingly sophisticated, since the development and deployment of a wide range of sensor solutions allow for the measurement of a growing variety of facts and actions. The collected information may or may not refer to human activities, and it can be gathered by instrumented or uninstrumented devices. The storage of data is one of the most obvious and immediate challenges of big data, as the growing size of data sets requires the constant creation of storage space, even as storage units are optimized and maximized. In this respect, the development of remote storage capacities connected to clients through the Internet, known as cloud services, has temporarily eased the storage question. Once data are collected and stored, analysis remains the main challenge: how to extract useful information from the collected data and how to facilitate and manage the visualization and interpretation processes, identifying meaningful connections and establishing comparisons that reveal clear patterns.
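
The collection, storage, and analysis stages described above can be sketched as a toy pipeline. This Python example is a minimal illustration, not a description of any real system: the sensor names and readings are invented, an in-memory dictionary stands in for a storage service, and a per-sensor average stands in for pattern detection.

```python
from collections import defaultdict
from statistics import mean


def collect():
    """Collection: yield (sensor_id, value) readings. The data are hard-coded stand-ins."""
    yield from [
        ("traffic-01", 42.0),
        ("traffic-02", 17.0),
        ("traffic-01", 58.0),
        ("traffic-02", 23.0),
    ]


def store(readings):
    """Storage: group readings by sensor in memory (a stand-in for a database or cloud service)."""
    table = defaultdict(list)
    for sensor_id, value in readings:
        table[sensor_id].append(value)
    return dict(table)


def analyze(table):
    """Analysis: reduce each sensor's readings to an average, a very simple pattern."""
    return {sensor_id: mean(values) for sensor_id, values in table.items()}


averages = analyze(store(collect()))
for sensor_id, average in averages.items():
    print(sensor_id, average)
```

At realistic scale each stage would be a separate system, and the difficulty lies in keeping the three synchronized rather than in any single step.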

Uses

The kind of information collected under the logic of big data processes may refer not only to impersonal facts but also to personal traits and actions. Whereas the former generates information related to phenomena that in normal conditions should not reveal or allow for the inference of personal data (e.g., environmental indicators, traffic density, weather conditions, market trends), the latter enables the possibility of profiling and targeting human groups and individuals.

There are many areas and applications that take advantage of complex data sets, such as home automation, health care records, and industry manufacturing optimization. The cross-area exchange of data creates compound applications that may be perceived as additional benefits or as undesired externalities. Big data is also an integral part of new technology-enabled trends, such as the drive to develop solutions to measure an increasing number of public and private variables in the context of smart cities, wearables and the quantified self, and the Internet of Things.

This drive to digitize everything is what makes it possible for large data sets to be generated, collected, and analyzed. The uses of the collected personal data can be classified under two main categories: (1) processes related to the optimization of informational usage by public bodies (political/government dimension) and (2) processes related to the optimization of information analysis by private companies (corporate dimension). The lines between these two spheres, however, are increasingly blurred.

Surveillance and Big Data

The improvement of solutions developed to deal with large and complex data sets may also foster more sophisticated surveillance practices, as shown by the National Security Agency (NSA) leaks. However, not all data sets that can be processed as big data can be used for surveillance purposes. Certain features of the information gathered determine whether it can enable such practices. When assessing potential surveillance practices in big data contexts, the relevant subset of information is the one that can be linked to identifiable individuals (i.e., personal data). Such data may include intrinsic or extrinsic traits that reveal static or dynamic properties and that are used to investigate or monitor the actions or communications of groups of people.

Big data surveillance implies the collection, storage, and processing of massive amounts of personal data to find useful patterns that allow control tasks to be optimized. It is different from conventional surveillance, even if also performed with the support of information and communication technologies, as big data surveillance characteristically deploys tools that keep track of large numbers of people, originally unsorted, and aims to collect the largest possible range of information, ideally on a global scale. It also takes advantage of the increasing voluntary self-disclosure of information and the data resulting from automation of processes.

Although capturing information from specific targets was the norm in conventional surveillance routines, big data introduces a comprehensive logic based on the premise of collecting as much data as possible: the largest number of measurable variables and indicators, for the largest number of individuals, during the longest time possible. The voluntary participation of individuals in the generation of data, the shift toward an ex ante, prediction-focused perspective (tracking and storing rather than monitoring), and the increasing automation of data collection and analysis by data mining agents and data brokers suggest deep changes in traditional surveillance practices, in which the surveillant and the surveilled had clear roles and positions in the hierarchy.

Mismanagement and Privacy Concerns

One of the most controversial issues regarding the management of big data is the potential for function creep. Given the wide range of available uses and the flexibility of the information, personal data, the processing applied to them, and the analyses derived from them can be used for purposes different from those stated when the data were collected. In addition, the line that separates personal data from nonpersonal data (in the sense of being nonidentifiable or non-reidentifiable) may change over time: reidentification techniques that do not exist today may be developed in the future and turn nonpersonal data into personal data without the knowledge or control of the data subject.
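
How nonpersonal data can become personal data may be easier to see in a small example. The Python sketch below shows one possible reidentification route, record linkage on shared attributes, which the entry itself does not name. All records, field names, and the choice of linking attributes are invented for illustration.

```python
from collections import defaultdict

# Invented records for illustration only; they describe no real individuals.
released = [  # "nonpersonal" release: names removed, other attributes kept
    {"postcode": "1001", "birth_year": 1980, "gender": "F", "diagnosis": "asthma"},
    {"postcode": "1001", "birth_year": 1975, "gender": "M", "diagnosis": "diabetes"},
    {"postcode": "2002", "birth_year": 1990, "gender": "F", "diagnosis": "migraine"},
]
auxiliary = [  # separate source that carries names alongside the same attributes
    {"name": "Alice Example", "postcode": "1001", "birth_year": 1980, "gender": "F"},
    {"name": "Bob Example", "postcode": "1001", "birth_year": 1975, "gender": "M"},
    {"name": "Carol Example", "postcode": "2002", "birth_year": 1990, "gender": "F"},
    {"name": "Dana Example", "postcode": "2002", "birth_year": 1990, "gender": "F"},
]
LINK_FIELDS = ("postcode", "birth_year", "gender")  # assumed attributes shared by both sources


def link_key(record):
    """Combine the shared attributes into a single hashable key."""
    return tuple(record[field] for field in LINK_FIELDS)


def reidentify(released_records, auxiliary_records):
    """Return {name: diagnosis} for every released record that matches exactly one person."""
    names_by_key = defaultdict(list)
    for person in auxiliary_records:
        names_by_key[link_key(person)].append(person["name"])
    matches = {}
    for record in released_records:
        candidates = names_by_key.get(link_key(record), [])
        if len(candidates) == 1:  # a unique match ties the record to one individual
            matches[candidates[0]] = record["diagnosis"]
    return matches


linked = reidentify(released, auxiliary)
print(linked)
```

In this toy case two of the three released records are linked to a name, while the third stays ambiguous because two people share the same attributes. Whether a data set is reidentifiable therefore depends on which other data sets exist, not only on the data set itself.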

A key reference concerning the mismanagement of personal data is the aforementioned NSA leaks. They revealed the existence of global mass surveillance programs, such as PRISM and XKeyscore, that rely on metadata gathered without control or a legal framework. The revelations were a turning point for global awareness of the privacy risks posed by the mass gathering of personal data or metadata by unaccountable bodies. In June 2013, U.S. President Barack Obama justified the activities conducted under these espionage programs, emphasizing that content was not gathered and analyzed but that, rather, “just metadata” were collected. He argued that the security versus privacy trade-off was necessary and justified.

Metadata, however, are not anonymous data. Possibilities for reidentification exist and are increasing. Although complete anonymity is impossible to achieve, there are alternatives that protect the privacy of data subjects, such as pseudonymity. These include avoiding the disclosure of actual data whenever possible through the use of anonymization tools (e.g., software, proxies) or opting out of data-intensive services and processes. Such actions undermine the accuracy of the data subject’s data double, but they improve individuals’ privacy and level of control.
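
The entry does not describe how pseudonymity is achieved in practice. One common approach on the data holder’s side, shown here only as an assumed example, is to replace each direct identifier with a keyed hash so that records about the same person can still be grouped without exposing who that person is. The key, the field names, and the sample records in this Python sketch are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; in this sketch it would be held only by the data holder.
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC), returned as hex."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Invented activity records for illustration.
records = [
    {"user": "alice@example.org", "page": "/news"},
    {"user": "bob@example.org", "page": "/sports"},
    {"user": "alice@example.org", "page": "/weather"},
]

# Same input and key give the same pseudonym, so per-person grouping still works.
pseudonymous = [{"user": pseudonymize(r["user"]), "page": r["page"]} for r in records]

for row in pseudonymous:
    print(row["user"][:12], row["page"])
```

A pseudonym of this kind hides the direct identifier but does not make the data anonymous: the remaining attributes may still allow reidentification, and anyone holding the key can recompute the mapping.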

Final Thoughts

Big data refers to the efficient management of large and complex amounts of data, whether or not they are personal, through a process of collection, storage, and analysis. Such management of data can be performed for governmental or public purposes or to seek corporate or private ends. As the optimization of information sources and analysis creates management and decision-making advantages, big data offers meaningful benefits for a wide range of bodies. However, there are negative externalities resulting from the analysis of information that is collected without consent or without taking into account legal, ethical, and societal concerns.
