Archive for March 2014

Data Science Highlights: An Investigation of the Discipline

March 28th, 2014 — 1:26pm

I’ve posted a substantial readout summarizing some of the more salient findings from a long-running programmatic research effort into data science. This deck shares synthesized findings around many of the facets of data science as a discipline, including practices, workflow, tools, org models, skills, etc. This readout distills a very wide range of inputs, including: direct interviews; field-based ethnography; community participation (real-world and online); secondary research from industry and academic sources; analysis of hiring and investment activity in data science over several years; descriptive and definitional artifacts authored by practitioners, analysts, educators, and other external actors; media coverage of data science; historical antecedents; the structure and evolution of professional disciplines; and more.

I consider it a sort of business-anthropology-style investigation of data science, conducted from the viewpoint of product making’s primary aspects: strategy, management, design, and delivery.

I learned a great deal dur­ing the course of this effort, and expect to con­tinue to learn, as data sci­ence will con­tinue to evolve rapidly for the next sev­eral years.

Data sci­ence prac­ti­tion­ers look­ing at this mate­r­ial are invited to pro­vide feed­back about where these mate­ri­als are accu­rate or inac­cu­rate, and most espe­cially about what is miss­ing, and what is com­ing next for this very excit­ing field.



Data Sci­ence High­lights from Joe Laman­tia




Data Science and Empirical Discovery: A New Discipline Pioneering a New Analytical Method

March 26th, 2014 — 11:00am

One of the essential patterns of science and industry in the modern era is that new methods for understanding (what I’ll call sensemaking from now on) often emerge hand in hand with new professional and scientific disciplines. This linkage between new disciplines and new methods follows from the deceptively simple imperative to realize new types of insight, which often means analysis of new kinds of data, using new techniques, applied from newly defined perspectives. New viewpoints and new ways of understanding are literally bound together in a sort of symbiosis.

One famil­iar exam­ple of this dynamic is the rapid devel­op­ment of sta­tis­tics dur­ing the 18th and 19th cen­turies, in close par­al­lel with the rise of new social sci­ence dis­ci­plines includ­ing eco­nom­ics (orig­i­nally polit­i­cal econ­omy) and soci­ol­ogy, and nat­ural sci­ences such as astron­omy and physics.  On a very broad scale, we can see the pat­tern in the tan­dem evo­lu­tion of the sci­en­tific method for sense­mak­ing, and the cod­i­fi­ca­tion of mod­ern sci­en­tific dis­ci­plines based on pre­cur­sor fields such as nat­ural his­tory and nat­ural phi­los­o­phy dur­ing the sci­en­tific rev­o­lu­tion.

Today, we can see this pattern clearly in the simultaneous emergence of Data Science as a new and distinct discipline accompanied by Empirical Discovery, the new sensemaking and analysis method Data Science is pioneering. Given its dramatic rise to prominence recently, declaring Data Science a new professional discipline should inspire little controversy. Declaring Empirical Discovery a new method may seem bolder, but with the essential pattern of new disciplines appearing in tandem with new sensemaking methods in mind, it is more controversial to suggest Data Science is a new discipline that lacks a corresponding new method for sensemaking. (I would argue it is the method that makes the discipline, not the other way around, but that is a topic for fuller treatment elsewhere.)

What is empir­i­cal dis­cov­ery?  While empir­i­cal dis­cov­ery is a new sense­mak­ing method, we can build on two exist­ing foun­da­tions to under­stand its dis­tin­guish­ing char­ac­ter­is­tics, and help craft an ini­tial def­i­n­i­tion.  The first of these is an under­stand­ing of the empir­i­cal method. Con­sider the fol­low­ing description:

“The empirical method is not sharply defined and is often contrasted with the precision of the experimental method, where data are derived from the systematic manipulation of variables in an experiment. …The empirical method is generally characterized by the collection of a large amount of data before much speculation as to their significance, or without much idea of what to expect, and is to be contrasted with more theoretical methods in which the collection of empirical data is guided largely by preliminary theoretical exploration of what to expect. The empirical method is necessary in entering hitherto completely unexplored fields, and becomes less purely empirical as the acquired mastery of the field increases. Successful use of an exclusively empirical method demands a higher degree of intuitive ability in the practitioner.”

Data Sci­ence as prac­ticed is largely con­sis­tent with this pic­ture.  Empir­i­cal pre­rog­a­tives and under­stand­ings shape the pro­ce­dural plan­ning of Data Sci­ence efforts, rather than the­o­ret­i­cal con­structs.  Semi-formal approaches pre­dom­i­nate over explic­itly cod­i­fied meth­ods, sig­nal­ing the impor­tance of intu­ition.  Data sci­en­tists often work with data that is on-hand already from busi­ness activ­ity, or data that is newly gen­er­ated through nor­mal busi­ness oper­a­tions, rather than seek­ing to acquire wholly new data that is con­sis­tent with the design para­me­ters and goals of for­mal exper­i­men­tal efforts.  Much of the sense­mak­ing activ­ity around data is explic­itly exploratory (what I call the ‘pan­ning for gold’ stage of evo­lu­tion — more on this in sub­se­quent post­ings), rather than sys­tem­atic in the manip­u­la­tion of known vari­ables.  These exploratory tech­niques are used to address rel­a­tively new fields such as the Inter­net of Things, wear­ables, and large-scale social graphs and col­lec­tive activ­ity domains such as instru­mented envi­ron­ments and the quan­ti­fied self.  These new domains of appli­ca­tion are not mature in ana­lyt­i­cal terms; ana­lysts are still work­ing to iden­tify the most effec­tive tech­niques for yield­ing insights from data within their bounds.

The second relevant perspective is our understanding of discovery as an activity that is distinct and recognizable in comparison to generalized analysis. From this, we can summarize discovery as sensemaking intended to arrive at novel insights, through exploration and analysis of diverse and dynamic data in an iterative and evolving fashion.

Looking deeper, one specific characteristic of discovery as an activity is the absence of formally articulated statements of belief and expected outcomes at the beginning of most discovery efforts. Another is the iterative nature of discovery efforts, which can change course in non-linear ways and even ‘backtrack’ on the way to arriving at insights: both the data and the techniques used to analyze data change during discovery efforts. Formally defined experiments are much more clearly determined from the beginning, and their definition is less open to change during their course. A program of related experiments conducted over time may show iterative adaptation of goals, data and methods, but the individual experiments themselves are not malleable and dynamic in the fashion of discovery. Discovery’s emphasis on novel insight as preferred outcome is another important characteristic; by contrast, formal experiments are repeatable and verifiable by definition, and the degree of repeatability is a criterion of well-designed experiments. Discovery efforts often involve an intuitive shift in perspective that is recountable and retraceable in retrospect, but cannot be anticipated.

Build­ing on these two foun­da­tions, we can define Empir­i­cal Dis­cov­ery as a hybrid, pur­pose­ful, applied, aug­mented, iter­a­tive and serendip­i­tous method for real­iz­ing novel insights for busi­ness, through analy­sis of large and diverse data sets.

Let’s look at these facets in more detail.

Empirical discovery primarily addresses the practical goals and audiences of business (or industry), rather than scientific, academic, or theoretical objectives. This is tremendously important, since the practical context impacts every aspect of Empirical Discovery.

‘Large and diverse data sets’ reflects the fact that Data Science practitioners engage with Big Data as we currently understand it: situations in which the confluence of data types and volumes exceeds the capabilities of business analytics to practically realize insights in terms of tools, infrastructure, practices, etc.

Empirical discovery uses a rapidly evolving hybridized toolkit, blending a wide range of general and advanced statistical techniques with sophisticated exploratory and analytical methods from a wide variety of sources that includes data mining, natural language processing, machine learning, neural networks, Bayesian analysis, and emerging techniques such as topological data analysis and deep learning.

What’s most notable about this hybrid toolkit is that Empirical Discovery does not originate novel analysis techniques; it borrows tools from established disciplines such as information retrieval, artificial intelligence, computer science, and the social sciences. Many of the more specialized or apparently exotic techniques data science and empirical discovery rely on, such as support vector machines, deep learning, or measuring mutual information in data sets, have established histories of usage in academic or other industry settings, and have reached reasonable levels of maturity. Empirical discovery’s hybrid toolkit is transposed from one domain of application to another, rather than invented.

Empir­i­cal Dis­cov­ery is an applied method in the same way Data Sci­ence is an applied dis­ci­pline: it orig­i­nates in and is adapted to busi­ness con­texts, it focuses on arriv­ing at use­ful insights to inform busi­ness activ­i­ties, and it is not used to con­duct basic research.  At this early stage of devel­op­ment, Empir­i­cal Dis­cov­ery has no inde­pen­dent and artic­u­lated the­o­ret­i­cal basis and does not (yet) advance a dis­tinct body of knowl­edge based on the­ory or prac­tice. All viable dis­ci­plines have a body of knowl­edge, whether for­mal or infor­mal, and applied dis­ci­plines have only their cumu­la­tive body of knowl­edge to dis­tin­guish them, so I expect this to change.

Empir­i­cal dis­cov­ery is not only applied, but explic­itly pur­pose­ful in that it is always set in motion and directed by an agenda from a larger con­text, typ­i­cally the spe­cific busi­ness goals of the orga­ni­za­tion act­ing as a prime mover and fund­ing data sci­ence posi­tions and tools.  Data Sci­ence prac­ti­tion­ers effect Empir­i­cal Dis­cov­ery by mak­ing it hap­pen on a daily basis — but wher­ever there is empir­i­cal dis­cov­ery activ­ity, there is sure to be inten­tion­al­ity from a busi­ness view.  For exam­ple, even in orga­ni­za­tions with a for­mal hack time pol­icy, our research sug­gests there is lit­tle or no com­pletely undi­rected or self-directed empir­i­cal dis­cov­ery activ­ity, whether con­ducted by for­mally rec­og­nized Data Sci­ence prac­ti­tion­ers, busi­ness ana­lysts, or others.

One very important implication of the situational purposefulness of Empirical Discovery is that there is no direct imperative for generating a body of cumulative knowledge through original research: the insights that result from Empirical Discovery efforts are judged by their practical utility in an immediate context. There is also no explicit scientific burden of proof or verifiability associated with Empirical Discovery within its primary context of application. Many practitioners encourage some aspects of verifiability, for example, by annotating the various sources of data used for their efforts and the transformations involved in wrangling data on the road to insights or data products, but this is not a requirement of the method. Another implication is that empirical discovery does not adhere to any explicit moral, ethical, or value-based missions that transcend working context. While Data Scientists often interpret their role as transformative, this is in reference to business. Data Science is not medicine, for example, with a Hippocratic oath.

Empirical Discovery is an augmented method in that it depends on computing and machine resources to increase human analytical capabilities: it is simply impractical for people to manually undertake many of the analytical techniques common to Data Science. An important point to remember about augmented methods is that they are not automated; people remain necessary, and it is the combination of human and machine that is effective at yielding insights. In the problem domain of discovery, the patterns of sensemaking activity leading to insight are intuitive, non-linear, and associative; activities with these characteristics are not fully automatable with current technology. And while many analytical techniques can be usefully automated within boundaries, these tasks typically make up just a portion of a complete discovery effort. For example, using latent class analysis to explore a machine-sampled subset of a larger data corpus is task-specific automation complementing human perspective at particular points of the Empirical Discovery workflow. This dependence on machine-augmented analytical capability is recent within the history of analytical methods. In most of the modern era (roughly the later 17th, 18th, 19th and early 20th centuries) the data employed in discovery efforts was manageable ‘by hand’, even when using the newest mathematical and analytical methods emerging at the time. This remained true until the effective commercialization of machine computing ended the need for human computers as a recognized role in the middle of the 20th century.

The reality of most analytical efforts, even those with good initial definition, is that insights often emerge in response to and in tandem with changing and evolving questions which were not identified, or perhaps not even understood, at the outset. During discovery efforts, analytical goals and techniques, as well as the data under consideration, often shift in unpredictable ways, making the path to insight dynamic and non-linear. Further, the sources of and inspirations for insight are difficult or impossible to identify both at the time and in retrospect. Empirical discovery addresses the complex and opaque nature of discovery with iteration and adaptation, which combine to set the stage for serendipity.

With this initial definition of Empirical Discovery in hand, the natural question is what this means for Data Science and business analytics. Three things stand out for me. First, I think one of the central roles played by Data Science is in pioneering the application of existing analytical methods from specialized domains to serve general business goals and perspectives, seeking effective ways to work with the new types (graph, sensor, social, etc.) and tremendous volumes (yotta, yotta, yotta…) of business data at hand in the Big Data moment and realize insights.

Second, following from this, Empirical Discovery is a methodological framework within and through which a great variety of analytical techniques at differing levels of maturity and from other disciplines are vetted for business analytical utility in iterative fashion by Data Science practitioners.

And third, it seems this vet­ting func­tion is delib­er­ately part of the makeup of empir­i­cal dis­cov­ery, which I con­sider a very clever way to cre­ate a feed­back loop that enhances Data Sci­ence prac­tice by using Empir­i­cal Dis­cov­ery as a dis­cov­ery tool for refin­ing its own methods.


Big Data is a Condition (Or, "It's (Mostly) In Your Head")

March 10th, 2014 — 1:07pm

Unsurprisingly, definitions of Big Data run the gamut from the turgid to the flip, making room to include the trite, the breathless, and the simply uninspiring in the big circle around the campfire. Some of these definitions are useful in part, but none of them captures the essence of the matter. Most are mistakes in kind, trying to ground and capture Big Data as a ‘thing’ of some sort that is measurable in objective terms. Anytime you encounter a number in a definition, this is the school of thought at work.

Some approach Big Data as a state of being, most often a sim­ple oper­a­tional state of insuf­fi­ciency of some kind; typ­i­cally resources like ana­lysts, com­pute power or stor­age for han­dling data effec­tively; occa­sion­ally some­thing less quan­tifi­able like clar­ity of pur­pose and cri­te­ria for man­age­ment. Any­time you encounter phras­ing that relies on the reader to inter­pret and define the par­tic­u­lars of the insuf­fi­ciency, this is the school of thought.

I see Big Data as a self-defined (per­haps diag­nosed is more accu­rate) con­di­tion, but one that is based on idio­syn­cratic inter­pre­ta­tion of cur­rent and pos­si­ble future sit­u­a­tions in which under­stand­ing of, plan­ning for, and activ­ity around data are central.

Here’s my working definition: Big Data is the condition in which very high actual or expected difficulty in working successfully with data combines with very high anticipated but unknown value and benefit, leading to the a priori assumption that currently available information management and analytical capabilities are broadly insufficient, making new and previously unknown capabilities seemingly necessary.

