Information Systems Research
Vol. 25, No. 3, September 2014, pp. 443–448
ISSN 1047-7047 (print) | ISSN 1526-5536 (online)
http://dx.doi.org/10.1287/isre.2014.0546
© 2014 INFORMS
Editorial
Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research
Ritu Agarwal
Center for Health Information and Decision Systems, Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, [email protected]

Vasant Dhar
Center for Business Analytics, Stern School of Business, New York University, New York, New York 10012
We address key questions related to the explosion of interest in the emerging fields of big data, analytics, and data science. We discuss the novelty of the fields and whether the underlying questions are fundamentally different, the strengths that the information systems (IS) community brings to this discourse, interesting research questions for IS scholars, the role of predictive and explanatory modeling, and how research in this emerging area should be evaluated for contribution and significance.
Introduction

It is difficult, nay, impossible to open a popular publication today, online or in the physical world, and not run into a reference to data science, analytics, big data, or some combination thereof. To use a Twitter-esque phrase, that’s what’s trending now. A search on Google in the middle of August 2014 for the phrases “Big data,” “Analytics,” and “Data science” yielded 822 million, 154 million, and 461 million results, respectively. Our major journals (Management Science, MIS Quarterly) have commissioned special issues on these topics, and a large number of position announcements in the information systems field are specifying knowledge of, and skills in, one or more of these areas as desirable, if not a requirement for the job. A new journal called Big Data,1 launched just over a year ago, is already seeing thousands of downloads of articles. It would not be hyperbole to claim that big data is possibly the most significant “tech” disruption in business and academic ecosystems since the meteoric rise of the Internet and the digital economy.
What does this tsunami mean for information systems researchers? Rather than simply rely on our own view of the world, we invited other accomplished scholars with a history of publication in this arena to participate in the conversation. We posed five questions to frame our thinking about the domain:
1 See http://www.liebertpub.com/overview/big-data/611.
1. Are big data, analytics, and data science, as being described in the popular outlets, old wine in new bottles, or something new?
2. What are the strengths that the information systems (IS) community brings to the discourse on business analytics? In other words, what is our competitive advantage?
3. What are important and interesting research questions and domains that may “fit” with ongoing research in our community? How might we push the envelope by extending or modifying our existing research agendas? What about new areas of inquiry?
4. To what extent should robust prediction prowess be used as a criterion in evaluating data-driven models, versus current criteria that favor “explanatory” models without subjecting them to rigorous tests of future predictability?
5. As editors and reviewers, how should we evaluate research in this domain? What constitutes a “significant” contribution?
We would like to acknowledge the contributions of Galit Shmueli and Martin Bichler, whose thoughtful responses to these questions gave us much food for contemplation. In the rest of this commentary we offer a synthesis of our collective reflection on the five questions.
On the Novelty of Big Data, Analytics and Data Science

We believe that some components of data science and business analytics have been around for a long time,
but there are significant new questions and opportunities created by the availability of big data and major advancements in machine intelligence.2 While the notion that analytical techniques can be used to make sense of and derive insights from data is as old as the field of statistics, and dates back to the 18th century, one obvious difference today is the rapid pace at which economic and social transactions are moving online, allowing for the digital capture of big data. The ability to understand the structure and content of human discourse has considerably expanded the dimensionality of data sets available. As a result, the set of opportunities for inquiry has exploded exponentially, with readily available large and complex data sets related to any type of phenomenon researchers want to study, ranging from deconstructing the human genome, to understanding the pathology of Alzheimer’s disease across millions of patients, to observing consumer response to different marketing offers in large-scale field experiments. And easy (and relatively inexpensive) access to computational capacity and user-friendly analytical software have democratized the field of data science, allowing many more scholars (and practitioners) to participate in the opportunities enabled by big data.

2 Arguably, for the first time in history, a machine passed the famous Turing test by defeating human champions at Jeopardy, where topics are not known in advance and questions are posed in natural language of considerable complexity and nuance.

In some ways it could be argued that the nature of inquiry has also changed, turbocharged by machines becoming a lot smarter through better algorithms, and by information technologies that enable people and things to be inherently instrumented for observation and interaction that feeds the algorithms. Increasingly, data are collected not solely with the aim of testing human-generated hypotheses or essential record keeping; to the extent that data torrents can be captured inexpensively, they are often collected for the possibility of testing hypotheses that have not yet been envisioned at the time of collection. When such data are gathered on a scale that observes every part of the joint distributions of the observed variables (behaviors, demographics, etc.), the computer becomes an active question-asking machine as opposed to a pure analytic servant. By initiating interesting questions and refining them without active human intervention, it becomes capable of creating new knowledge and making discoveries on its own (Dhar 2013). It can, for example, discover automatically from a large swath of healthcare system data that younger people in a specific region of the world are becoming increasingly diabetic, and then conjecture and test whether the trend is due to specific habits, diet, specific types of drugs, or a range of factors we may not have hypothesized as humans. This is powerful. As scientists, we have not seriously entertained the possibility of theory originating in the computer, and as science-fiction-like as that may sound, we are in principle already there.
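To make the scenario above concrete, the following is a minimal sketch of the first loop of such automated inquiry: scan subgroups of a (synthetic) patient data set for striking trends and surface candidate explanatory factors for follow-up. The records, factor names, and the crude “doubling” rule are all invented for illustration; a real system would use far richer data and more careful statistics.

```python
# A minimal sketch of automated "question asking": scan subgroups of a data set
# for striking trends and surface candidate explanatory factors for follow-up.
# The records, factor names, and thresholds below are entirely hypothetical.
import random
from collections import defaultdict

random.seed(7)

# Toy patient records: (region, age_band, year, is_diabetic, fast_food_weekly)
records = []
for _ in range(20000):
    region = random.choice(["north", "south"])
    age_band = random.choice(["<30", "30-60", ">60"])
    year = random.choice([2010, 2014])
    fast_food = random.randint(0, 7)
    # Hidden pattern built into the toy data: young southerners trend diabetic with diet.
    base = 0.05 + (0.04 if age_band == ">60" else 0.0)
    if region == "south" and age_band == "<30" and year == 2014:
        base += 0.02 * fast_food
    records.append((region, age_band, year, random.random() < base, fast_food))

# Step 1: flag subgroups whose incidence rose sharply between the two years.
incidence = defaultdict(lambda: defaultdict(list))
for region, age, year, diabetic, _ in records:
    incidence[(region, age)][year].append(diabetic)

flagged = []
for group, by_year in incidence.items():
    rate = {yr: sum(v) / len(v) for yr, v in by_year.items()}
    if rate[2014] > 2 * rate[2010]:          # crude "interesting trend" rule
        flagged.append(group)

# Step 2: for flagged subgroups, compare a candidate factor to generate conjectures.
for group in flagged:
    rows = [r for r in records if (r[0], r[1]) == group and r[2] == 2014]
    diab = [r[4] for r in rows if r[3]]
    ctrl = [r[4] for r in rows if not r[3]]
    print(f"Rising diabetes in {group}: mean fast-food meals/week "
          f"{sum(diab)/len(diab):.1f} (diabetic) vs {sum(ctrl)/len(ctrl):.1f} (others)")
```

The flagged trend and the factor comparison are, of course, only conjectures; they are the input to, not the output of, rigorous explanatory work.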
New challenging problems and inquiry also lead to research on better algorithms and systems. Since the torrent of data being generated is increasingly unstructured and coming from networks of people or devices, we are seeing the emergence of more powerful algorithms and better knowledge representation schemes for making sense of all of this heterogeneous and fragmented information. Text and image processing capabilities are one frontier of research, with systems such as IBM’s Watson being on the cutting edge in natural language processing, albeit with a long way to go in terms of their capability for ingesting and interpreting big data across the Internet.
Networks, such as those created by connections between individuals and/or products, further create significant and unique challenges at a fundamental level, such as how we sample them or infer treatment effects. For example, in A/B testing, a “standard approach” for estimating the average treatment effect of a new feature or condition by exposing a sample of the overall population to it, the treatment of individuals can spill over to neighboring individuals along the structure of the underlying network. To address this type of “social interference,” newer algorithms are required that support valid sampling and estimation of treatment effects (Ugander et al. 2013). This is but one example of how “relational” and “networked” data necessitate new developments in algorithms. Developments may emerge not only from computer science but also from IS or other disciplines where researchers are “closer” to the problem being studied than pure methods researchers tend to be.
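The sketch below illustrates the general idea: assign treatment by cluster rather than by individual, and estimate the effect from users whose neighborhoods largely share their assignment, so that measured outcomes include the spillover a full rollout would produce. It is a toy illustration of cluster-based randomization under interference, not the specific estimator of Ugander et al. (2013); the graph, effect sizes, and exposure threshold are assumptions.

```python
# A minimal sketch of cluster-randomized A/B testing on a social graph, where a
# treatment can spill over to a user's neighbors. This illustrates the general
# idea behind approaches such as Ugander et al. (2013), not their exact
# estimator; the toy graph, effect sizes, and exposure rule are assumptions.
import random
from collections import defaultdict

random.seed(0)
n_users, n_clusters = 2000, 40
cluster_of = {u: random.randrange(n_clusters) for u in range(n_users)}
members = defaultdict(list)
for u, c in cluster_of.items():
    members[c].append(u)

# Friendships form mostly within clusters (communities), occasionally across.
neighbors = defaultdict(set)
for u in range(n_users):
    for _ in range(5):
        pool = members[cluster_of[u]] if random.random() < 0.9 else range(n_users)
        v = random.choice(pool)
        if v != u:
            neighbors[u].add(v)
            neighbors[v].add(u)

# Randomize treatment at the cluster level, not the individual level.
treated_clusters = set(random.sample(range(n_clusters), n_clusters // 2))
treated = {u for u in range(n_users) if cluster_of[u] in treated_clusters}

# Simulated outcome: a direct effect plus a spillover from treated neighbors.
def outcome(u):
    frac = sum(v in treated for v in neighbors[u]) / len(neighbors[u]) if neighbors[u] else 0
    return random.gauss(0, 1) + (0.5 if u in treated else 0) + 0.3 * frac

y = {u: outcome(u) for u in range(n_users)}

# Estimate the effect from users whose neighborhoods share their assignment,
# so that "treatment" includes the spillover a full rollout would produce.
def exposed(u, in_treatment):
    nbrs = neighbors[u]
    same = sum((v in treated) == in_treatment for v in nbrs)
    return nbrs and same / len(nbrs) >= 0.75

t = [y[u] for u in range(n_users) if u in treated and exposed(u, True)]
c = [y[u] for u in range(n_users) if u not in treated and exposed(u, False)]
print(f"Estimated total treatment effect: {sum(t)/len(t) - sum(c)/len(c):.2f}")
```

Randomizing whole clusters exploits the community structure of the graph so that most of a user’s neighbors share the user’s condition, which is what makes the exposure-based comparison meaningful; individual-level randomization would mix treated and control neighborhoods and bias the estimate.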
Finally, the Internet has fueled our ability to conduct large-scale experiments on social phenomena. At the time of this writing, Facebook researchers had conducted a massive study to determine whether the mood of users could be manipulated and found that it could (Kramer et al. 2014). By conducting controlled experiments on large numbers of people, such studies can extract the causal structure among variables. While the study raised important questions about privacy and the ethical implications of conducting experiments without informed consent, the broader point is that researchers now have a medium for theory development through massive experimentation in the social, health, urban, and other sciences.
A Comparative Advantage for IS?

Arguably, this is the golden age for IS researchers. Data science and big data research from IS has
attracted attention at the level of widely read scientific outlets such as PNAS and Science (Aral et al. 2009, Aral and Walker 2012) because of the importance and generic nature of the questions asked, such as “are choices in social networks a result of influence or homophily?” Such a question has profound implications for phenomena too numerous to mention that involve how we communicate, persuade, monitor, and more. In other words, perhaps it is time to set our sights higher, beyond our traditional journals, to communicate with the larger community of scientists and businesses. It is a time of opportunity for social scientists who have heretofore been hamstrung by the lack of data. For the first time we are able to observe and measure human behavior on a global scale. IS researchers are, quite incredibly, at the center of the digital world unfolding before us.
To put things in historical perspective, one could reasonably assert that the IS discipline has the longest history of conducting research at the nexus of computing technology and data in business and society. In fact, it is widely recognized that the discipline of “MIS” emerged when computers enabled the automation of business processes and the digital capture of business transactions. Understanding how to design and implement systems to provide the “right information to the right person at the right time” was the raison d’être of IS programs and defined much of early IS research. These endeavors entailed understanding what individuals’ and executives’ data needs are, designing structures to capture and manage data, and conceptualizing interfaces that made the data accessible and usable. So it is not unreasonable to claim that IS scholars started off with a comparative advantage with respect to big data: we knew how to store, manage, and process data, and understood the complexity of data structures, very early on in the history of computing. We also understand the challenges associated with the infrastructure needed for handling the volume of data being generated today.
Armed with these skills and tools, IS scholars have been catalyzed to focus on problems and outcomes. As has been suggested elsewhere (Agarwal and Lucas 2005), among all functional areas of business, IS researchers perhaps have the broadest perspective on the enterprise as a whole and on how different pieces fit together. This focus creates a tighter linkage between data and business models: we care deeply about business transformation and value creation through data, and less for algorithms or frameworks without a linkage to business value. Our research has been alternately praised and criticized for being too cross-disciplinary, but we believe this is a strength and not a weakness in today’s data-rich environment. IS scholars have invoked theories from the fields of economics, sociology, psychology, and political science, to name a few, and studied phenomena such as electronic markets, consumer behavior, crowdsourcing, information security, and online retailing from a diversity of perspectives. This cross- and transdisciplinary nature of IS research to date positions us uniquely to exploit the big data opportunity. We can do so by addressing the same types of questions as we have in the past but with significantly richer data sets, or we can begin to initiate new inquiries that represent questions that were not possible to ask and answer previously.
On Research Questions

The possibilities enabled by big data are too many to enumerate in a brief commentary. Nonetheless, we offer some illustrations of fruitful research projects that IS scholars could begin to explore (and evidence suggests that many already are). First, the ability to observe and measure micro, individual-level data on a comprehensive scale enables us to address grand problems on a societal level with deep policy implications that go beyond the confines of a single organization. Examples include using micro-level technology usage data to ask if ubiquitous digitization exacerbates or diminishes social inequities, exploring how labor markets are evolving by examining the actions of workers on social networking and employment sites, and investigating if the quantified-self movement, with sensors tracking exercise and nutrition information during daily living, is in fact producing a measurable effect on the incidence of disease or health problems. Indeed, the entire field of personalized medicine (and the heretofore understudied opportunity of personalized technology interventions) is enabled by big data.
Second, following the call to focus on the transformational aspects of IT (Lucas et al. 2013), big data allows for the design and execution of studies related to the profound changes our own profession is experiencing. We could extend our existing research agendas that have explored the effects of technology mediation on learning outcomes, learner satisfaction, etc., to examine individual-level effects of MOOCs, online courses, blended learning, and the like on an unprecedented scale.
A third area, and one where IS research perhaps already has a robust set of studies to build on, is big data research in the context of social networks and marketing outcomes. Geo-coded social media interactions coupled with extensive demographic and socioeconomic data allow us to quantify how networks affect micro (individual behavior), meso (organizational value), and societal outcomes (economic and social value). Pervasive mobile devices and the rapid rise in commercial transactions (banking, purchase,
etc.) on mobile platforms that can be captured and recorded enable novel insights into the classic marketing problems of advertising and promotions, and their effects on sales. And of course, the ability to run field experiments in these settings, varying interventions on a grand scale, substantially enhances the scope of causal relationships we can extract from the data.
Big Data, and Prediction vs. Explanation

At the time of writing this editorial, two big data stories were receiving substantial media attention. One was the scathing review of Google Flu Trends (Lazer et al. 2014), which uses search terms to predict the incidence of flu, with respect to its accuracy as compared with estimates produced by the Centers for Disease Control and Prevention; the other was the ethical debate ignited by experiments conducted on Facebook. In some ways both stories are related to the issue of whether big data are useful solely for prediction or also for an understanding of the causal processes that are yielding the observed outcomes. In the case of Google Flu Trends, the algorithm has been criticized for overfitting a small number of cases, for masking a simple question, namely, does it predict flu or is it merely reflecting the incidence of winter, and for not taking into account the fact that technologies like Google’s search engine are profit-driven and changing, and therefore limited for scientific inquiry. While the Facebook study raises the specter of widespread experimentation on the Internet without adequate protection of individual rights and privacy, ironically, the experiment is a great example of where causal claims can be made with some confidence. This tension between correlational analysis and causal testing of hypotheses represents a fundamental dilemma in the use of big data for explanation versus prediction.
But we should not be too hasty in dismissing prediction. Indeed, as has been argued by the philosopher of science Karl Popper, prediction is a key epistemic criterion for assessing how seriously we should entertain a theory or a new insight: a good theory makes “bold” predictions that stand repeated attempts at falsification (Popper 1963). Popper goes on to criticize certain social science theories, like those of Adler or Marx, that can conveniently “bend” the theory to accommodate even contradictory data without any onus on prediction whatsoever.3 In this regard, we should encourage additional rigorous testing of models on data that were collected later in time than those used to construct the models.

3 Popper used opposite cases of a man who pushes a child into the water with the intention of drowning the child and that of a man who sacrifices his life in an attempt to save the child. In Adler’s view, the first man suffered from feelings of inferiority (producing perhaps the need to prove to himself that he dared to commit the crime), and so did the second man (whose need was to prove to himself that he dared to rescue the child at the expense of his own life).
Prediction as an initial basis for theory building also has particular value in a world where patterns often emerge before the reasons for them become apparent (Dhar 2013). While the scientific process is predicated on a cycle of hypothesis generation, experimentation, hypothesis testing, and inference, there are multiple starting points. Big data are as useful, and increasingly more so, at the hypothesis generation stage as they are at the hypothesis testing stage. A study that seeks to predict rather than explain may reveal associations between variables that form the foundation for the development of theory that can subsequently be subjected to rigorous testing. In this context, Shmueli (2010) provides an elegant summary of when predictive modeling can be particularly useful in scientific endeavors, including for newly available rich data sets with newly measured concepts for which theory is yet to be developed (as can be the case with big data), and for the discovery of new measures. Scientific progress does not rely on a single study, and big data-based studies offer the promise of novel discoveries.
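As a stylized illustration of prediction feeding theory building, the sketch below fits a flexible predictive model on synthetic data and then ranks variables by their contribution to out-of-sample prediction; highly ranked variables are candidate hypotheses for subsequent explanatory work, not causal conclusions. The variable names and data-generating process are invented for illustration.

```python
# A stylized example of prediction-first analysis for hypothesis generation:
# fit a flexible predictive model, then rank variables by their contribution
# to out-of-sample prediction. Highly ranked variables are candidate
# hypotheses, not causal conclusions. Data and names here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
names = ["time_on_site", "friend_count", "mobile_share", "coupon_exposure", "noise_1", "noise_2"]
X = rng.normal(size=(n, len(names)))
# Hidden data-generating process: only two variables actually matter.
logits = 1.2 * X[:, 0] + 0.8 * X[:, 3] - 0.5
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Hold-out AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
for name, imp in sorted(zip(names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:16s} importance {imp:.3f}")   # candidates for explanatory follow-up
```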
Some domains may often view prediction as being as, if not more, valuable than explanation. A compelling example is healthcare, where the cost of delaying action based on a good predictive model until the construction and testing of explanatory models is complete is measured in lives that may be lost. This is not to say that clinical and biomedical researchers do not seek to build causal models, quite the contrary. The far-reaching Human Genome Project and recently launched efforts to understand the human brain in greater detail aim to unravel the underlying causal structures of disease. But in many instances prediction with big data in and of itself is of immense value, such as in determining the probability of hospital readmission, or the risk of development of hepatocellular carcinoma among patients with cirrhosis (Waljee et al. 2014). The biomedical community has begun to acknowledge that big data provide a critical complement to the gold standard of randomized controlled trials by supporting massive observational studies that were not feasible before (Weil 2014).
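In the same spirit, the following is a hedged sketch of a purely predictive readmission-risk model: it is trained on earlier admissions, validated on later ones (echoing the earlier point about testing models on data collected later in time), and reports discrimination without any causal claim. The variables, coefficients, and data are synthetic and illustrative only.

```python
# A hedged sketch of a purely predictive readmission-risk model: train on
# earlier admissions, validate on later ones, and report discrimination.
# All variables and the data-generating process are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 8000
age = rng.normal(65, 12, n)
prior_admits = rng.poisson(1.2, n)
length_of_stay = rng.gamma(2.0, 2.5, n)
admit_year = rng.choice([2012, 2013, 2014], n)          # used only to split in time

risk = -5.0 + 0.03 * age + 0.45 * prior_admits + 0.10 * length_of_stay
readmitted = rng.binomial(1, 1 / (1 + np.exp(-risk)))

X = np.column_stack([age, prior_admits, length_of_stay])
train, test = admit_year < 2014, admit_year == 2014      # out-of-time validation
model = LogisticRegression(max_iter=1000).fit(X[train], readmitted[train])
auc = roc_auc_score(readmitted[test], model.predict_proba(X[test])[:, 1])
print(f"AUC on later-in-time patients: {auc:.3f}")
```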
Evaluating Knowledge Claims Based on Big Data

There is little doubt that the IS community will increasingly look for ways to conduct innovative research that leverages the power of big data. Some of
this research may be “non-traditional” in that it develops predictive models of phenomena for their own sake or as a first step toward theory building. Some of it may use new variables and measures enabled by the digital capture of social and economic activity, and user-generated content creatively combined with external data sets such as those obtained from the Census Bureau or the Bureau of Labor Statistics. How should we, as editors and reviewers, evaluate such research?
There is no simple answer to this question, and, to a large extent, our standards and methods of evaluation will inevitably evolve as our experience with such research grows. With respect to ISR in particular, we revert to the general criteria used by editors and reviewers when assessing research: Fit, Interestingness, Rigor, Story, Theory (Agarwal 2012). From the standpoint of fit, certainly novel and original research that uses big data to address a phenomenon of interest to the IS community “fits” with the mission of ISR, and is welcomed. Among the other criteria, while it is difficult to provide a rank ordering, we suspect that interestingness and rigor will be, at the margin, more important in evaluating research based on big data, at least in these early stages. Above all, the research must pose an interesting and relevant question. We expect that questions that challenge conventional wisdom and intuition, or those that pose a hitherto unexplored puzzle, i.e., those that are novel, will be received more positively by reviewers than those that have been examined extensively in prior work. But we should also encourage testing and replication of prior results, since such research is key to scientific advancement, as argued eloquently by Popper (1963). Unfortunately, such work is often given short shrift relative to research that claims new results based on data sets that are not shared and do not provide opportunity for falsification or confirmation. Data sharing and transparency must be encouraged.
Interestingness is hard to define explicitly since it could arise in many different forms. Part of the interestingness of a question might well derive solely from the power of big data, where it becomes possible to investigate a previously formulated research question with a novel and extensive data set that integrates observations from diverse sources. Indeed, it is entirely possible that the contribution of a study lies primarily in the uniqueness of the data set and the rigor of the empirical methods used to analyze the data.
With respect to rigor, we do not expect the standards for evaluating knowledge claims in studies using big data to be substantially different from those for studies using more traditional data sets. Reviewers would expect the usual threats to inferential validity and causal claims, including self-selection and identification, to be adequately accounted for. Researchers must also pay attention to the special challenges of working with very large data sets and reflect thoughtfully on the economic significance of their findings. In contrast to studies that utilize archival data collected by “trustworthy” entities such as businesses, governmental and other agencies, or studies based on primary data collection by researchers, big data could originate from unknown sources with credentials that are questionable at best or unknown at worst. It is incumbent upon authors to convince readers that their “big” data set was validly generated from appropriate foundations.
Second, and this challenge is not unique to big data but may be compounded by sheer size and variety in variables, data that the researcher does not generate herself are often an imperfect observation of the real-world concept being referred to. For example, if we are not able to capture individuals’ income in a social network, is the zip code in which they reside an appropriate proxy for socioeconomic status? The answer, of course, depends on a variety of interrelated factors, including the nature of the study, its research questions, the other variables the researcher has, etc. Nonetheless, this example serves to illustrate a special difficulty that may be amplified as a function of the number of variables in the data set, especially when they are integrated from multiple sources.
Closing Thoughts

As a community of scholars we would be remiss not to take full advantage of the scientific possibilities created by the availability of big data, sophisticated analytical tools, and powerful computing infrastructures. Indeed, for the reasons mentioned above, this is an exciting time to be an IS researcher and to think beyond IS to science in general. Big data still aims in large part to deliver the right information to the right person at the right time in the right form, but is now able to do so in a significantly more sophisticated way. The IS discipline has been thinking about and researching questions at the intersection of technology, data, business, and society for five decades, and it should leverage its thought leadership to become a centerpiece of education, business, and policy.
References

Agarwal R (2012) Editorial notes. Inform. Systems Res. 23(4):1087–1092.
Agarwal R, Lucas HC (2005) The IS identity crisis: Focusing on high visibility and high impact research. MIS Quart. 29(3):381–398.
Aral S, Walker D (2012) Identifying influential and susceptible members of social networks. Science 337:337–341.
Aral S, Muchnik L, Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proc. Natl. Acad. Sci. USA 106(51):21544–21549.
Dhar V (2013) Data science and prediction. Comm. ACM 56(12):64–73.
Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proc. Natl. Acad. Sci. USA 111(24):8788–8790.
Lazer D, Kennedy R, King G, Vespignani A (2014) The parable of Google Flu: Traps in big data analysis. Science 343:1203–1205.
Lucas HC, Agarwal R, Clemons E, El Sawy O, Weber B (2013) Impactful research on transformational IT: An opportunity to inform new audiences. MIS Quart. 37(2):371–382.
Popper K (1963) Conjectures and Refutations (Routledge, London).
Shmueli G (2010) To explain or to predict? Statist. Sci. 25(3):289–310.
Ugander J, Karrer B, Backstrom L, Kleinberg J (2013) Graph cluster randomization: Network exposure to multiple universes. Proc. 19th ACM SIGKDD Internat. Conf. Knowledge Discovery and Data Mining (ACM, New York).
Waljee AK, Higgins PDR, Singal AG (2014) A primer on predictive models. Clin. Translational Gastroenterology 5(1):e44.
Weil AR (2014) Big data in health: A new era for research and patient care. Health Affairs 33(7):1110.