Data mining refers to the practice of analyzing a vast amount of information to discern the patterns and relationships contained within those data. Once detected, these patterns can then be used to draw conclusions about past behavior and make predictions about future behavior. A related practice, data profiling, is the creation and utilization of a user profile from a specific database. Data mining and profiling have many applications and are used by retailers and intelligence agencies alike. Recently, analysts have begun to apply data mining and profiling to social network analysis, examining how the linkages between individuals or “nodes” in a social network affect behavior. As discussed in this entry, the practices of data mining and data profiling, and their use in social network analysis for marketing and intelligence, have raised concerns about privacy and security.
Data mining programs rely on complex computer algorithms that can identify patterns in individual behavior. More specifically, data mining involves creating user profiles, which serve as data points. Profiles are made up of two components: (1) a demographic component and (2) a behavioral component. The demographic component contains basic information about an individual, such as his or her age, household income, profession, and location. The behavioral component identifies what an individual does; for a retailer, the behavioral component of a customer profile would include information about the items this individual purchases, how often he or she shops, and how likely he or she is to use coupons.
By aggregating the patterns found in individual profiles, data mining can create behavior rules that can predict the behavior of other individuals with similar demographic profiles. In retail applications, for example, a store can discover which consumers buy specific items, and target its marketing accordingly.
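The aggregation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any retailer's actual system: the `Profile` class, the `age_band` segments, and the sample data are all hypothetical, and the "behavior rule" is simplified to the single most purchased item per demographic segment.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Profile:
    # Demographic component (hypothetical, reduced to two fields)
    age: int
    location: str
    # Behavioral component: items this individual has purchased
    purchases: list[str] = field(default_factory=list)


def age_band(age: int) -> str:
    """Bucket an age into a coarse, hypothetical segment label."""
    if age < 30:
        return "under-30"
    if age < 50:
        return "30-49"
    return "50-plus"


def behavior_rules(profiles: list[Profile]) -> dict[tuple[str, str], str]:
    """Map each (age band, location) segment to its most purchased item."""
    counts: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for p in profiles:
        counts[(age_band(p.age), p.location)].update(p.purchases)
    # Skip segments with no recorded purchases.
    return {seg: c.most_common(1)[0][0] for seg, c in counts.items() if c}


profiles = [
    Profile(24, "urban", ["headphones", "sneakers"]),
    Profile(27, "urban", ["headphones"]),
    Profile(55, "rural", ["garden hose", "seeds", "seeds"]),
]
rules = behavior_rules(profiles)

# Apply the rule to a new individual with a similar demographic profile.
newcomer = Profile(22, "urban")
predicted = rules.get((age_band(newcomer.age), newcomer.location))
print(predicted)
```

A production system would use far richer features and a statistical model rather than a single most-common item, but the structure is the same: individual profiles are grouped by demographic traits, and the group's observed behavior is used to predict the behavior of a similar newcomer.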
Increasingly an individual’s membership within overlapping social networks is used as a predictor of behavior. Social network analysis, a field of study first developed by sociologists in the 19th century, looks at the manner in which individuals are embedded in various social networks; for instance, a person can simultaneously be a member of a family, an employee of a company, a member of a professional association, and a member of an activist group. Social network analysis suggests that individuals are likely to behave in similar ways to other people in their network; moreover, members of the same social network possess a great ability to influence the beliefs and perceptions of one another.
Social networks, as manifested on social media sites such as Facebook, Twitter, LinkedIn, and Google+, provide a rich database for data mining; a study by IBM estimated that humanity creates approximately 2.5 quintillion bytes of data every day.
In Social Network Data Analytics, edited by Charu C. Aggarwal, the contributors identify two ways in which public and private actors can analyze the vast amounts of data generated by these social networks. First, and most commonly, analysts can mine the text or visual content of the networks, such as the names of videos on YouTube or the tags on photos on Instagram. Alternatively, analysts can undertake a structural analysis of the network; this includes identifying key nodes and hubs in the networks, as well as changes in the size and scope of a particular network over time. The goal of this type of analysis is to identify individuals within a network who exercise a strong influence on other members.
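One simple form of structural analysis is degree centrality, which ranks each node by the share of other nodes to which it is directly connected. The sketch below is an assumption about how such an analysis might look, not a method taken from the book; the names and edge list are invented, and real analyses often use other measures of influence as well.

```python
from collections import defaultdict

# Hypothetical undirected network: each pair is a tie between two people.
edges = [
    ("ana", "ben"),
    ("ana", "cy"),
    ("ana", "dee"),
    ("ben", "cy"),
    ("dee", "eli"),
]


def degree_centrality(edges):
    """Return each node's degree divided by the maximum possible degree."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


centrality = degree_centrality(edges)
# The node with the highest score is the most connected "hub".
hub = max(centrality, key=centrality.get)
print(hub, centrality[hub])
```

Here the hub is the person with the most direct ties. Tracking how these scores shift as edges are added or removed is one way to observe changes in a network over time.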
The vast amounts of data generated by social networks have proved attractive to retailers, political campaigns, and intelligence agencies. The use of commercial data mining has occasioned legal challenges, as technological developments outpace existing law. In 2011, in the Supreme Court case Sorrell, Attorney General of Vermont v. IMS Health, Inc., by a vote of 6–3 the Court struck down a Vermont law that banned the unauthorized sale of doctors’ prescribing information to pharmaceutical companies for the purposes of data mining. Although the case did not directly touch on social network data mining, it did suggest that the Court would allow private actors broad latitude to use data mining for commercial purposes. Subsequently, in 2016, students and alumni from the University of California, Berkeley, who used email accounts provided by the Google Apps for Education program sued the company; the plaintiffs argued that Google’s practice of scanning student emails for data to provide targeted advertising violates federal privacy and wiretap laws. As of January 2017, the case is pending before a U.S. District Court.
In recent years, retailers have relied on customer segmentation when designing marketing campaigns. More precisely, retailers no longer send the same catalogs and coupons to all potential customers; instead, stores seek to personalize the shopping experience, using data mining to predict what types of appeals motivate different customer segments to make purchases.
For example, the online retailer Amazon uses data mining to discover which products customers typically buy together. Based on an analysis of past customer behavior, when a new customer views an item, Amazon activates the “Frequently Bought Together” feature, showing items that past customers have often purchased together. In addition, when a customer purchases an item from Amazon, he or she is offered the opportunity to “share” this information on Facebook; this allows members of the customer’s social network to view the purchase, and based on the user’s recommendation, other members of that social network may buy the same item.
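The idea of discovering which products are bought together can be illustrated by counting how often pairs of items appear in the same order. The following is a minimal sketch under stated assumptions, not Amazon's actual algorithm, which is not described in this entry; the order data and the function names `co_purchase_counts` and `frequently_bought_with` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical past orders; each set holds the items in one purchase.
orders = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"novel"},
    {"tent", "sleeping bag", "stove"},
]


def co_purchase_counts(orders):
    """Count how many orders contain each unordered pair of items."""
    pair_counts = Counter()
    for basket in orders:
        # Sorting gives every pair a single canonical (a, b) ordering.
        for a, b in combinations(sorted(basket), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def frequently_bought_with(item, pair_counts, top_n=2):
    """List the items most often purchased in the same order as `item`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]


pair_counts = co_purchase_counts(orders)
suggestions = frequently_bought_with("tent", pair_counts)
print(suggestions)
```

When a new customer views the tent, the items in `suggestions` are the ones that past customers most often bought alongside it.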
Google is another company that collects and analyzes vast amounts of user data to provide retailers with advertising opportunities. Currently, Google compiles user data from each of its services, including its search engine, Gmail, and its social network, Google+. It uses these data to create a profile of the user; subsequently, Google displays targeted marketing whenever a user accesses any of its services. For example, a user who searches Google for “hotels in France” is assumed to be a potential traveler; this user will later see advertisements for flights to France when the user opens his or her Gmail account.
Increasingly, U.S. intelligence agencies are using data mining to predict and prevent terrorist attacks; these agencies employ models that use demographic and behavioral data to predict whether or not a person is likely to commit an act of terrorism. One of the earliest attempts at this was the “Able Danger” program, used by the Army Land Warfare Agency in 1999. This program combed through various data sources in an attempt to identify individuals who shared traits with known terrorists. Moreover, the program relied on social network analysis to root out possible terrorist sleeper cells. Specifically, the program could detect whether an individual flagged as a potential terrorist had any connection to another flagged individual.
A similar program, MATRIX (an acronym for the Multistate Anti-Terrorism Information Exchange), was unveiled in 2001, soon after the 9/11 attacks. The program analyzed information from both federal and state databases in an effort to identify potential terrorists. The information in this database included basic demographic data, as well as banking records, criminal records, and flight information. In addition, the MATRIX program examined whether or not an individual was linked to any known terrorists, by examining the person’s social network, including membership in various organizations. After analyzing the data, the algorithm assigned each individual a Terrorism Factor Score; those with a High Terrorism Factor were deemed likely to join a terror cell. During a pilot of the program in 2001, the analysts identified 120,000 individuals in the United States with a High Terrorism Factor rating. Ultimately, only four states ended up participating in the data mining program, with the others citing both high costs and privacy concerns.
Since the 2004 election, national political campaigns have used predictive analytics to pinpoint possible supporters and encourage them to turn out on Election Day. By 2012, the presidential campaigns of both Barack Obama and Mitt Romney maintained a separate department for voter analytics.
George W. Bush’s campaign first used data mining in 2004. The campaign began by creating a database of individuals who had voted for Bush in 2000; analysts then gathered a vast array of information on these individuals, including their income levels, their tastes in food and beverages, and their religious practices. Based on inferences from this data set, the campaign was able to identify a group of people who had not voted for Bush but who shared many characteristics with existing Bush supporters. The campaign then worked to contact these potential supporters directly, sending volunteers to their homes and encouraging them to vote for Bush on Election Day.
By 2008, the Obama campaign had developed even more sophisticated data mining algorithms; specifically, the campaign had models that could predict how likely a person was to vote, as well as how likely that person was to vote for Obama. This information allowed the campaign to allocate its resources to those voters who were considered the most persuadable. Four years later, during the 2012 campaign, Obama’s campaign managers increased the size of his analytics team fivefold.
Kelly McHugh and Corey Koch
See also Big Data; Cookies; Corporate Surveillance; Data Mining and Profiling in Big Data; Privacy, Internet
Aggarwal, Charu C., ed. Social Network Data Analytics. New York, NY: Springer, 2014.
“Data Mining, Dog Sniffs, and the Fourth Amendment: A Framework for Evaluating Suspicionless Mass Surveillance Programs.” Harvard Law Review, v.128/2 (2014).
Duhigg, Charles. “How Companies Learn Your Secrets.” The New York Times Magazine (February 16, 2012). http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0 (Accessed July 2014).
Issenberg, Sasha. “How President Obama’s Campaign Used Big Data to Rally Individual Voters.” MIT Technology Review (December 19, 2012). http://www.technologyreview.com/featuredstory/509026/how-obamas-team-used-big-data-to-rally-voters/ (Accessed July 2014).
Seifert, Jeffrey W. “Data Mining and Homeland Security: An Overview.” Congressional Research Service Report (Updated January 18, 2007). https://fas.org/sgp/crs/intel/RL31798.pdf (Accessed July 2014).
Sorrell, Attorney General of Vermont v. IMS Health Inc. 564 U.S. ___ (2011).