

Empirical Discovery: Concept and Workflow Model

June 20th, 2014 — 12:53pm

Concept models are a powerful tool for articulating the essential elements and relationships that define new or complex things we need to understand. We've previously defined empirical discovery as a new method, looking at antecedents, and also comparing and contrasting the distinctive characteristics of Empirical Discovery with other knowledge creation and insight seeking methods. I'm now sharing our concept model of Empirical Discovery, which identifies the most important actors, activities, and outcomes of empirical discovery efforts, to complement the written definition by illustrating how the method works in practice.

Empirical discovery concept model from Joe Lamantia

In this model, we illustrate the activities of the three kinds of people most central to discovery efforts: Insight Consumers, Data Scientists, and Data Engineers. We have robust definitions of all the major actors involved in discovery (used to drive product development), and may share some of these various personas, profiles, and snapshots subsequently. For reading this model, understand Insight Consumers as the people who rely on insights from discovery efforts to effect and manage the operations of the business. Data Scientists are the sensemakers who achieve insights and create data products and analytical models through discovery efforts. Data Engineers enable discovery efforts by building the enterprise data analysis infrastructure necessary for discovery, and often implement the outcomes of empirical discovery by building new tools based on the insights and models Data Scientists create.

A key assumption of this model is that discovery is by definition an iterative and serendipitous method, relying on frequent back-steps and unpredictable repetition of activities as a necessary aspect of how discovery efforts unfold. This model also assumes the data, methods, and tools shift during discovery efforts, in keeping with the evolution of motivating questions, and the achievement of interim outcomes. Similarly, discovery efforts do not always involve all of these elements.

To keep the essential structure and relationships between elements clear and in the foreground, we have not shown all of the possible iterative loops or repeated steps. Some closely related concepts are grouped together, to allow reading the model on two levels of detail.

For a simplified view, follow the links between named actors and groups of concepts shown with colored backgrounds and labels. In this reading, an Insight Consumer articulates questions to a Data Scientist, who combines domain knowledge with the Empirical Discovery Method (yellow) to direct the application of Analytical Tools (blue) and Models (salmon) to Data Sets (green) drawn from Data Sources (magenta). The Data Scientist shares Insights resulting from discovery efforts with the Insight Consumer, while Data Engineers may implement the models or data products created by the Data Scientist by turning them into tools and infrastructure for the rest of the business. For a more detailed view of the specific concepts and activities common to Empirical Discovery efforts, follow the links between the individual concepts within these named groups. (Note: there are two kinds of connections: solid arrows indicating definite relationships and, for the Data Sets and Models groups, dashed arrows indicating possible paths of evolution. More on this to follow.)

Another way to interpret the two levels of detail in this model is as descriptions of formal vs. informal implementations of the empirical discovery method. People and organizations who take a more formal approach to empirical discovery may require explicitly defined artifacts and activities that address each major concept, such as predictions and experimental results. In less formal approaches, Data Scientists may implicitly address each of the major concepts and activities, such as framing hypotheses, or tracking the states of data sets they are working with, without any formal artifact or decision gateway. This situational flexibility follows from the applied nature of the empirical discovery method, which does not require scientific standards of proof and reproducibility to generate valued outcomes.

The story begins in the upper right corner, when an Insight Consumer articulates a belief or question to a Data Scientist, who then translates this motivating statement into a planned discovery effort that addresses the business goal. The Data Scientist applies the Empirical Discovery Method (concepts in yellow), possibly generating a hypothesis and accompanying predictions which will be tested by experiments, choosing data from the range of available data sources (grouped in magenta), and selecting initial analytical methods consistent with the domain, the data sets (green), and the analytical or reference models (salmon) they will work with. Given the particulars of the data and the analytical methods, the Data Scientist employs specific analytical tools (blue) such as algorithms and statistical or other measures, based on factors such as expected accuracy, and speed or ease of use. As the effort progresses through iterations, or insights emerge, experiments may be added or revised, based on the conclusions the Data Scientist draws from the results and their impact on starting predictions or hypotheses.

For example, an Insight Consumer who works in a product management capacity for an online social network with a business goal of increasing users' level of engagement with the service wishes to identify opportunities to recommend that users establish new connections with other similar and possibly known users, based on unrecognized affinities in their posted profiles. The Data Scientist translates this business goal into a series of experiments investigating predictions about which aspects of user profiles more effectively predict the likelihood of creating new connections in response to system-generated recommendations for similarity. The Data Scientist frames experiments that rely on data from the accumulated logs of user activities within the network that have been anonymized to comply with privacy policies, selecting specific working sets of data to analyze based on awareness of the shape and nature of the attributes that appear directly in users' profiles, both across the entire network and among pools of similar but unconnected users. The Data Scientist plans to begin with analytical methods useful for predictive modeling of the effectiveness of recommender systems in network contexts, such as measurements of the affinity of users' interests based on semantic analysis of social objects shared by users within this network and also publicly in other online media, and also structural or topological measures of relative position and distance from the field of network science. The Data Scientist chooses a set of standard social network analysis algorithms and measures, combined with custom models for interpreting user activity and interest unique to this network. The Data Scientist has predefined scripts and open source libraries available for ready application to data (MLlib, Gephi, Weka, Pandas, etc.) in the form of analytical tools, which she will combine in sequences according to the desired analytical flow for each experiment.
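To make the idea of an analytical flow concrete, here is a minimal Python sketch of one possible early step: scoring profile affinity between unconnected users with pandas. This is my own illustration, not the pipeline described above, and all column names and data are invented.

```python
import pandas as pd

# Toy anonymized profile data: one row per user, interests as a set.
profiles = pd.DataFrame({
    "user_id": [1, 2, 3],
    "interests": [{"jazz", "cycling"}, {"jazz", "film"}, {"rowing"}],
})

# Existing connections (edges); pairs not listed are candidates.
edges = {(1, 3)}

def jaccard(a, b):
    """Jaccard similarity of two interest sets: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Score every unconnected pair; high scores become recommendation candidates.
rows = profiles.set_index("user_id")["interests"]
candidates = []
for u in rows.index:
    for v in rows.index:
        if u < v and (u, v) not in edges:
            candidates.append((u, v, jaccard(rows[u], rows[v])))

candidates.sort(key=lambda t: -t[2])
print(candidates[0])  # the most similar unconnected pair
```

In a real effort, a step like this would feed its top-scoring candidate pairs into the experiments on recommendation effectiveness described above.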

The nature of analytical engagement with data sets varies during the course of discovery efforts, with different types of data sets playing different roles at specific stages of the discovery workflow. Our concept map simplifies the lifecycle of data for purposes of description, identifying five distinct and recognizable ways data are used by the Data Scientist, with five corresponding types of data sets. In some cases, formal criteria on data quality, completeness, accuracy, and content govern which stage of the data lifecycle any given data set is at. In most discovery efforts, however, Data Scientists themselves make a series of judgements about when and how the data in hand is suitable for use. The dashed arrows linking the five types of data sets capture the approximate and conditional nature of these different stages of evolution. In practice, discovery efforts begin with exploration of data that may or may not be relevant for focused analysis, but which requires some direct engagement and attention to rule in or out of consideration. Focused analytical investigation of the relevant data follows, made possible by the iterative addition, refinement, and transformation (wrangling — more on this in later posts) of the exploratory data in hand. At this stage, the Data Scientist applies analytical tools identified by their chosen analytical method. The model building stage seeks to create explicit, formal, and reusable models that articulate the patterns and structures found during investigation. When validation of newly created analytical models is necessary, the Data Scientist uses appropriate data — typically data that was not part of explicit model creation. Finally, training data is sometimes necessary to put models into production — either using them for further steps in analytical workflows (which can be very complex), or in business operations outside the analytical context.
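The distinction between the model building and validation stages of the data lifecycle can be sketched as a simple held-out split, shown here in plain Python with invented data; this is a generic illustration, not a procedure the model prescribes.

```python
import random

# Hypothetical working data set after wrangling: (features, label) pairs.
records = [({"x": i}, i % 2) for i in range(100)]

# Hold out a validation portion that plays no part in model building,
# mirroring the separate validation stage of the data lifecycle above.
rng = random.Random(42)
shuffled = records[:]
rng.shuffle(shuffled)
split = int(0.8 * len(shuffled))
model_building_data = shuffled[:split]   # used to fit/derive the model
validation_data = shuffled[split:]       # untouched until the model exists

print(len(model_building_data), len(validation_data))  # 80 20
```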

Because so much discovery activity requires transformation of the data before or during analysis, there is great interest in the Data Science and business analytics industries in how Data Scientists and sensemakers work with data at these various stages. Much of this attention focuses on the need for better tools for transforming data in order to make analysis possible. This model does not explicitly represent wrangling as an activity, because it is not directly a part of the empirical discovery method; transformation is done only as and when needed to make analysis possible. However, understanding the nature of wrangling and transformation activities is a very important topic for grasping discovery, so I'll address it in later postings. (We have a good model for this too…)

Empirical discovery efforts aim to create one or more of the three types of outcomes shown in orange: insights, models, and data products. Insights, as we've defined them previously, are discoveries that change people's perspective or understanding, not simply the results of analytical activity, such as the end values of analytical calculations, the generation of reports, or the retrieval and aggregation of stored information.

One of the most valuable outcomes of discovery efforts is the creation of externalized models that describe behavior, structure, or relationships in clear and quantified terms. The models that result from empirical discovery efforts can take many forms — google 'predictive model' for a sense of the tremendous variation in what people active in business analytics consider to be a useful model — but their defining characteristic is that a model always describes aspects of a subject of discovery and analysis that are not directly present in the data itself. For example, if given the node and edge data identifying all of the connections between people in the social network above, one possible model resulting from analysis of the network structure is a descriptive readout of the topology of the network as scale-free, with some set of subgraphs, a range of node centrality values, a matrix of possible shortest paths between nodes or subgraphs, etc. It is possible to make sense of, interpret, or circulate a model independently of the data it describes and is derived from.
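As a hedged illustration of the point that a model states facts not directly present in the data, the following plain-Python sketch derives two such descriptive elements (subgraphs and degree centrality) from an invented edge list; in practice, libraries and tools such as Gephi would do this work.

```python
from collections import deque

# Invented edge list standing in for the network's node-and-edge data.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def component(start):
    """BFS to collect the connected subgraph containing `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        for nbr in adj[queue.popleft()]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

# The "model": subgraphs and degree centrality, i.e. statements about
# structure that appear nowhere verbatim in the edge list itself.
subgraphs, unvisited = [], set(adj)
while unvisited:
    c = component(next(iter(unvisited)))
    subgraphs.append(sorted(c))
    unvisited -= c

n = len(adj)
centrality = {node: len(adj[node]) / (n - 1) for node in adj}

print(sorted(subgraphs))  # [[1, 2, 3, 4], [5, 6]]
```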

Data Scientists also engage with models in distinct and recognizable ways during discovery efforts. Reference models, determined by the domain of investigation, often guide exploratory analysis of discovery subjects by providing Data Scientists with general explanations and quantifications for processes and relationships common to the domain. And the models generated as insight and understanding accumulate during discovery evolve in stages, from initial articulation through validation to readiness for production implementation, which means being put into effect directly on the operations of the business.

Data products are best understood as 'packages' of data which have utility for other analytical or business purposes, such as a list of users in the social network who will form new connections in response to system-generated suggestions of other similar users. Data products are not literally finished products that the business offers for external sale or consumption. And as background, we assume operationalization or 'implementation' of the outcomes of empirical discovery efforts to change the functioning of the business is the goal of different business processes, such as product development. While empirical discovery focuses on achieving understanding, rather than making things, this is not the only thing Data Scientists do for the business. The classic definition of Data Science as aimed at creating new products based on data that impact the business is a broad mandate, and many of the position descriptions for data science jobs require participation in product development efforts.

Two or more kinds of outcomes are often bundled together as the results of a genuinely successful discovery effort; for example, an insight that two apparently unconnected business processes are in fact related through mutual feedback loops, and a model explicitly describing and quantifying the nature of the relationships as discovered through analysis.

There's more to the story, but as one trip through the essential elements of empirical discovery, this is a logical point to pause and ask: what might be missing from this model, and how can it be improved?



The Sensemaking Spectrum for Business Analytics: Translating from Data to Business Through Analysis

June 10th, 2014 — 8:33am

One of the most compelling outcomes of our strategic research efforts over the past several years is a growing vocabulary that articulates our cumulative understanding of the deep structure of the domains of discovery and business analytics.

Modes are one example of the deep structure we've found. After looking at discovery activities across a very wide range of industries, question types, business needs, and problem solving approaches, we've identified distinct and recurring kinds of sensemaking activity, independent of context. We label these activities Modes: Explore, Compare, and Comprehend are three of the nine recognizable modes. Modes describe *how* people go about realizing insights. (Read more about the programmatic research and formal academic grounding and discussion of the modes here: https://www.researchgate.net/publication/235971352_A_Taxonomy_of_Enterprise_Search_and_Discovery) By analogy to languages, modes are the 'verbs' of discovery activity. When applied to the practical questions of product strategy and development, the modes of discovery allow one to identify what kinds of analytical activity a product, platform, or solution needs to support across a spread of usage scenarios, and then make concrete and well-informed decisions about every aspect of the solution, from high-level capabilities to which specific types of information visualizations better enable these scenarios for the types of data users will analyze.

The modes are a powerful generative tool for product making, but if you've spent time with young children, or had a really bad hangover (or both at the same time…), you understand the difficulty of communicating using only verbs.

So I'm happy to share that we've found traction on another facet of the deep structure of discovery and business analytics. Continuing the language analogy, we've identified some of the 'nouns' in the language of discovery: specifically, the consistently recurring aspects of a business that people are looking for insight into. We call these discovery Subjects, since they identify *what* people focus on during discovery efforts, rather than *how* they go about discovery as with the Modes.

Defining the collection of Subjects people repeatedly focus on allows us to understand and articulate sensemaking needs and activity in a more specific, consistent, and complete fashion. In combination with the Modes, we can use Subjects to concretely identify and define scenarios that describe people's analytical needs and goals. For example, a scenario such as 'Explore [a Mode] the attrition rates [a Measure, one type of Subject] of our largest customers [Entities, another type of Subject]' clearly captures the nature of the activity — exploration of trends vs. deep analysis of underlying factors — and the central focus — attrition rates for customers above a certain set of size criteria — from which follow many of the specifics needed to address this scenario in terms of data, analytical tools, and methods.
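One way to see how Modes and Subjects combine is to encode the example scenario as a structured record. The encoding and field names below are my own illustration, not part of the framework itself.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of a discovery scenario as Mode + Subjects.
@dataclass
class Scenario:
    mode: str                      # *how*: one of the nine Modes
    measure: str                   # *what*: a Measure Subject
    entities: str                  # *what*: an Entity Subject
    filters: dict = field(default_factory=dict)

s = Scenario(
    mode="Explore",
    measure="attrition rate",
    entities="customers",
    filters={"customer_size": "largest"},
)

summary = f"{s.mode} the {s.measure} of our {s.filters['customer_size']} {s.entities}"
print(summary)  # Explore the attrition rate of our largest customers
```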

We can also use Subjects to translate effectively between the different perspectives that shape discovery efforts, reducing ambiguity and increasing impact on both sides of the perspective divide: for example, from the language of business, which often motivates analytical work by asking questions in business terms, to the perspective of analysis. The question posed to a Data Scientist or analyst may be something like "Why are sales of our new kinds of potato chips to our largest customers fluctuating unexpectedly this year?" or "Where can we innovate, by expanding our product portfolio to meet unmet needs?". Analysts translate questions and beliefs like these into one or more empirical discovery efforts that more formally and granularly indicate the plan, methods, tools, and desired outcomes of analysis. From the perspective of analysis this second question might become, "Which customer needs of type 'A', identified and measured in terms of 'B', that are not directly or indirectly addressed by any of our current products, offer 'X' potential for 'Y' positive return on the investment 'Z' required to launch a new offering, in time frame 'W'? And how do these compare to each other?". Translation also happens from the perspective of analysis to the perspective of data, in terms of availability, quality, completeness, format, volume, etc.

By implication, we are proposing that most working organizations — small and large, for profit and non-profit, domestic and international, and in the majority of industries — can be described for analytical purposes using this collection of Subjects. This is a bold claim, but simplified articulation of complexity is one of the primary goals of sensemaking frameworks such as this one. (And, yes, this is in fact a framework for making sense of sensemaking as a category of activity — but we're not considering the recursive aspects of this exercise at the moment.)

Compellingly, we can place the collection of Subjects on a single continuum — we call it the Sensemaking Spectrum — that simply and coherently illustrates some of the most important relationships between the different types of Subjects, and also illuminates several of the fundamental dynamics shaping business analytics as a domain. As a corollary, the Sensemaking Spectrum also suggests innovation opportunities for products and services related to business analytics.

The first illustration below shows Subjects arrayed along the Sensemaking Spectrum; the second illustration presents examples of each kind of Subject. Subjects appear in colors ranging from blue to reddish-orange, reflecting their place along the Spectrum, which indicates whether a Subject addresses more the viewpoint of systems and data (Data centric and blue), or people (User centric and orange). This axis is shown explicitly above the Spectrum. Annotations suggest how Subjects align with the three significant perspectives of Data, Analysis, and Business that shape business analytics activity. This rendering makes explicit the translation and bridging function of Analysts as a role, and analysis as an activity.



Subjects are best understood as fuzzy categories [http://georgelakoff.files.wordpress.com/2011/01/hedges-a-study-in-meaning-criteria-and-the-logic-of-fuzzy-concepts-journal-of-philosophical-logic-2-lakoff-19731.pdf], rather than tightly defined buckets. For each Subject, we suggest some of the most common examples: Entities may be physical things such as named products, or locations (a building, or a city); they could be Concepts, such as satisfaction; or they could be Relationships between entities, such as the variety of possible connections that define linkage in social networks. Likewise, Events may indicate a time and place in the dictionary sense; or they may be Transactions involving named entities; or take the form of Signals, such as 'some Measure had some value at some time' — what many enterprises understand as alerts.

The central story of the Spectrum is that though consumers of analytical insights (represented here by the Business perspective) need to work in terms of Subjects that are directly meaningful to their perspective — such as Themes, Plans, and Goals — the working realities of data (condition, structure, availability, completeness, cost) and the changing nature of most discovery efforts make direct engagement with source data in this fashion impossible. Accordingly, business analytics as a domain is structured around the fundamental assumption that sensemaking depends on analytical transformation of data. Analytical activity incrementally synthesizes more complex and larger scope Subjects from data in its starting condition, accumulating insight (and value) by moving through a progression of stages in which increasingly meaningful Subjects are iteratively synthesized from the data, and recombined with other Subjects. The end goal of 'laddering' successive transformations is to enable sensemaking from the business perspective, rather than the analytical perspective.

Synthesis through laddering is typically accomplished by specialized Analysts using dedicated tools and methods. Beginning with some motivating question, such as seeking opportunities to increase the efficiency (a Theme) of fulfillment processes to reach some level of profitability by the end of the year (Plan), Analysts will iteratively wrangle and transform source data Records, Values, and Attributes into recognizable Entities, such as Products, that can be combined with Measures or other data into the Events (shipment of orders) that indicate the workings of the business.
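A toy sketch of one such laddering step, assuming pandas and invented column names: raw order-line Records with Values and Attributes are synthesized into Product Entities carrying an aggregate Measure, and then into shipment Events.

```python
import pandas as pd

# Hypothetical order-line Records: each row is a Record whose columns
# hold Attributes (product, shipped date) and Values (qty).
records = pd.DataFrame({
    "order_id": [101, 101, 102, 103],
    "product":  ["widget", "gadget", "widget", "widget"],
    "qty":      [2, 1, 5, 3],
    "shipped":  ["2014-06-01", "2014-06-01", "2014-06-02", "2014-06-02"],
})

# Records + Attributes -> Product Entities with an aggregate Measure.
product_entities = records.groupby("product")["qty"].sum()

# Entities + Measures -> shipment Events (distinct orders shipped per day).
shipment_events = records.groupby("shipped")["order_id"].nunique()

print(product_entities["widget"])      # total widget quantity: 10
print(shipment_events["2014-06-02"])   # orders shipped that day: 2
```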

More complex Subjects (to the right of the Spectrum) are composed of or make reference to less complex Subjects: a business Process such as Fulfillment will include Activities such as confirming, packing, and then shipping orders. These Activities occur within or are conducted by organizational units such as teams of staff or partner firms (Networks), composed of Entities which are structured via Relationships, such as supplier and buyer. The fulfillment process will involve other types of Entities, such as the products or services the business provides. The success of the fulfillment process overall may be judged according to a sophisticated operating efficiency Model, which includes tiered Measures of business activity and health for the transactions and activities included. All of this may be interpreted through an understanding of the operational domain of the business's supply chain (a Domain).

We'll discuss the Spectrum in more depth in succeeding posts.


Data Science and Empirical Discovery: A New Discipline Pioneering a New Analytical Method

March 26th, 2014 — 11:00am

One of the essential patterns of science and industry in the modern era is that new methods for understanding — what I'll call sensemaking from now on — often emerge hand in hand with new professional and scientific disciplines. This linkage between new disciplines and new methods follows from the deceptively simple imperative to realize new types of insight, which often means analysis of new kinds of data, using new techniques, applied from newly defined perspectives. New viewpoints and new ways of understanding are literally bound together in a sort of symbiosis.

One familiar example of this dynamic is the rapid development of statistics during the 18th and 19th centuries, in close parallel with the rise of new social science disciplines including economics (originally political economy) and sociology, and natural sciences such as astronomy and physics. On a very broad scale, we can see the pattern in the tandem evolution of the scientific method for sensemaking, and the codification of modern scientific disciplines based on precursor fields such as natural history and natural philosophy during the scientific revolution.

Today, we can see this pattern clearly in the simultaneous emergence of Data Science as a new and distinct discipline accompanied by Empirical Discovery, the new sensemaking and analysis method Data Science is pioneering. Given its dramatic rise to prominence recently, declaring Data Science a new professional discipline should inspire little controversy. Declaring Empirical Discovery a new method may seem bolder, but with the essential pattern of new disciplines appearing in tandem with new sensemaking methods in mind, it is more controversial to suggest Data Science is a new discipline that lacks a corresponding new method for sensemaking. (I would argue it is the method that makes the discipline, not the other way around, but that is a topic for fuller treatment elsewhere.)

What is empirical discovery? While empirical discovery is a new sensemaking method, we can build on two existing foundations to understand its distinguishing characteristics, and help craft an initial definition. The first of these is an understanding of the empirical method. Consider the following description:

"The empirical method is not sharply defined and is often contrasted with the precision of the experimental method, where data are derived from the systematic manipulation of variables in an experiment. …The empirical method is generally characterized by the collection of a large amount of data before much speculation as to their significance, or without much idea of what to expect, and is to be contrasted with more theoretical methods in which the collection of empirical data is guided largely by preliminary theoretical exploration of what to expect. The empirical method is necessary in entering hitherto completely unexplored fields, and becomes less purely empirical as the acquired mastery of the field increases. Successful use of an exclusively empirical method demands a higher degree of intuitive ability in the practitioner."

Data Science as practiced is largely consistent with this picture. Empirical prerogatives and understandings shape the procedural planning of Data Science efforts, rather than theoretical constructs. Semi-formal approaches predominate over explicitly codified methods, signaling the importance of intuition. Data scientists often work with data that is on hand already from business activity, or data that is newly generated through normal business operations, rather than seeking to acquire wholly new data that is consistent with the design parameters and goals of formal experimental efforts. Much of the sensemaking activity around data is explicitly exploratory (what I call the 'panning for gold' stage of evolution — more on this in subsequent postings), rather than systematic in the manipulation of known variables. These exploratory techniques are used to address relatively new fields such as the Internet of Things, wearables, and large-scale social graphs, and collective activity domains such as instrumented environments and the quantified self. These new domains of application are not mature in analytical terms; analysts are still working to identify the most effective techniques for yielding insights from data within their bounds.

The second relevant perspective is our understanding of discovery as an activity that is distinct and recognizable in comparison to generalized analysis: from this, we can summarize discovery as sensemaking intended to arrive at novel insights, through exploration and analysis of diverse and dynamic data in an iterative and evolving fashion.

Looking deeper, one specific characteristic of discovery as an activity is the absence of formally articulated statements of belief and expected outcomes at the beginning of most discovery efforts. Another is the iterative nature of discovery efforts, which can change course in non-linear ways and even 'backtrack' on the way to arriving at insights: both the data and the techniques used to analyze data change during discovery efforts. Formally defined experiments are much more clearly determined from the beginning, and their definition is less open to change during their course. A program of related experiments conducted over time may show iterative adaptation of goals, data, and methods, but the individual experiments themselves are not malleable and dynamic in the fashion of discovery. Discovery's emphasis on novel insight as preferred outcome is another important characteristic; by contrast, formal experiments are repeatable and verifiable by definition, and the degree of repeatability is a criterion of well-designed experiments. Discovery efforts often involve an intuitive shift in perspective that is recountable and retraceable in retrospect, but cannot be anticipated.

Building on these two foundations, we can define Empirical Discovery as a hybrid, purposeful, applied, augmented, iterative, and serendipitous method for realizing novel insights for business, through analysis of large and diverse data sets.

Let’s look at these facets in more detail.

Empirical discovery primarily addresses the practical goals and audiences of business (or industry), rather than scientific, academic, or theoretical objectives. This is tremendously important, since the practical context impacts every aspect of Empirical Discovery.

‘Large and diverse data sets’ reflects the fact that Data Science practitioners engage with Big Data as we currently understand it: situations in which the confluence of data types and volumes exceeds the capabilities of business analytics to practically realize insights in terms of tools, infrastructure, practices, etc.

Empirical discovery uses a rapidly evolving hybridized toolkit, blending a wide range of general and advanced statistical techniques with sophisticated exploratory and analytical methods from sources that include data mining, natural language processing, machine learning, neural networks, Bayesian analysis, and emerging techniques such as topological data analysis and deep learning.

What’s most notable about this hybrid toolkit is that Empir­i­cal Dis­cov­ery does not orig­i­nate novel analy­sis tech­niques, it bor­rows tools from estab­lished dis­ci­plines such infor­ma­tion retrieval, arti­fi­cial intel­li­gence, com­puter sci­ence, and the social sci­ences.  Many of the more spe­cial­ized or appar­ently exotic tech­niques data sci­ence and empir­i­cal dis­cov­ery rely on, such as sup­port vec­tor machines, deep learn­ing, or mea­sur­ing mutual infor­ma­tion in data sets, have estab­lished his­to­ries of usage in aca­d­e­mic or other indus­try set­tings, and have reached rea­son­able lev­els of matu­rity.  Empir­i­cal discovery’s hybrid toolkit is  trans­posed from one domain of appli­ca­tion to another, rather than invented.

Empirical Discovery is an applied method in the same way Data Science is an applied discipline: it originates in and is adapted to business contexts, it focuses on arriving at useful insights to inform business activities, and it is not used to conduct basic research. At this early stage of development, Empirical Discovery has no independent and articulated theoretical basis and does not (yet) advance a distinct body of knowledge based on theory or practice. All viable disciplines have a body of knowledge, whether formal or informal, and applied disciplines have only their cumulative body of knowledge to distinguish them, so I expect this to change.

Empirical discovery is not only applied, but explicitly purposeful in that it is always set in motion and directed by an agenda from a larger context, typically the specific business goals of the organization acting as prime mover and funding data science positions and tools. Data Science practitioners carry out Empirical Discovery on a daily basis, but wherever there is empirical discovery activity, there is sure to be intentionality from a business view. For example, even in organizations with a formal hack time policy, our research suggests there is little or no completely undirected or self-directed empirical discovery activity, whether conducted by formally recognized Data Science practitioners, business analysts, or others.

One very important implication of the situational purposefulness of Empirical Discovery is that there is no direct imperative for generating a body of cumulative knowledge through original research: the insights that result from Empirical Discovery efforts are judged by their practical utility in an immediate context. There is also no explicit scientific burden of proof or verifiability associated with Empirical Discovery within its primary context of application. Many practitioners encourage some aspects of verifiability, for example by annotating the various sources of data used for their efforts and the transformations involved in wrangling data on the road to insights or data products, but this is not a requirement of the method. Another implication is that empirical discovery does not adhere to any explicit moral, ethical, or value-based missions that transcend working context. While Data Scientists often interpret their role as transformative, this is in reference to business. Data Science is not medicine, for example, with a Hippocratic oath.

Empirical Discovery is an augmented method in that it depends on computing and machine resources to increase human analytical capabilities: it is simply impractical for people to manually undertake many of the analytical techniques common to Data Science. An important point to remember about augmented methods is that they are not automated; people remain necessary, and it is the combination of human and machine that is effective at yielding insights. In the problem domain of discovery, the patterns of sensemaking activity leading to insight are intuitive, non-linear, and associative; activities with these characteristics are not fully automatable with current technology. And while many analytical techniques can be usefully automated within boundaries, these tasks typically make up just a portion of a complete discovery effort. For example, using latent class analysis to explore a machine-sampled subset of a larger data corpus is task-specific automation complementing human perspective at particular points of the Empirical Discovery workflow. This dependence on machine-augmented analytical capability is recent within the history of analytical methods. In most of the modern era (roughly the late 17th through the early 20th century), the data employed in discovery efforts was manageable ‘by hand’, even when using the newest mathematical and analytical methods emerging at the time. This remained true until the effective commercialization of machine computing ended the need for human computers as a recognized role in the middle of the 20th century.
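The machine-sampling half of that workflow can be sketched very simply. The example below (plain Python; the corpus, function name, and sample size are invented for illustration) draws a reproducible random subset of a corpus too large to inspect in full — the automated step that precedes human-guided analysis such as latent class analysis:

```python
import random

def sample_corpus(corpus, k, seed=42):
    """Draw a reproducible random subset of a large corpus.
    The analyst then works with the subset rather than the whole."""
    rng = random.Random(seed)  # a fixed seed keeps the sample repeatable
    return rng.sample(corpus, k)

# A stand-in 'corpus' of a million synthetic record ids.
corpus = range(1_000_000)
subset = sample_corpus(corpus, 1_000)
print(len(subset))  # prints 1000
```

The automation here is trivial but real: the machine reduces the corpus to a human-workable size, and everything interpretive that follows remains a human task, which is the division of labor the paragraph describes.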

The reality of most analytical efforts, even those with good initial definition, is that insights often emerge in response to and in tandem with changing and evolving questions which were not identified, or perhaps not even understood, at the outset. During discovery efforts, analytical goals and techniques, as well as the data under consideration, often shift in unpredictable ways, making the path to insight dynamic and non-linear. Further, the sources of and inspirations for insight are difficult or impossible to identify both at the time and in retrospect. Empirical discovery addresses the complex and opaque nature of discovery with iteration and adaptation, which combine to set the stage for serendipity.

With this initial definition of Empirical Discovery in hand, the natural question is what this means for Data Science and business analytics. Three things stand out for me. First, I think one of the central roles played by Data Science is in pioneering the application of existing analytical methods from specialized domains to serve general business goals and perspectives, seeking effective ways to work with the new types (graph, sensor, social, etc.) and tremendous volumes (yotta, yotta, yotta…) of business data at hand in the Big Data moment and realize insights.

Second, following from this, Empirical Discovery is a methodological framework within and through which a great variety of analytical techniques, at differing levels of maturity and drawn from other disciplines, are vetted for business analytical utility in iterative fashion by Data Science practitioners.

And third, it seems this vetting function is deliberately part of the makeup of empirical discovery, which I consider a very clever way to create a feedback loop that enhances Data Science practice by using Empirical Discovery as a discovery tool for refining its own methods.


Big Data is a Condition (Or, "It's (Mostly) In Your Head")

March 10th, 2014 — 1:07pm

Unsurprisingly, definitions of Big Data run the gamut from the turgid to the flip, making room to include the trite, the breathless, and the simply uninspiring in the big circle around the campfire. Some of these definitions are useful in part, but none of them captures the essence of the matter. Most are mistakes in kind, trying to ground and capture Big Data as a ‘thing’ of some sort that is measurable in objective terms. Anytime you encounter a number, this is the school of thought.

Some approach Big Data as a state of being, most often a simple operational state of insufficiency of some kind: typically resources like analysts, compute power, or storage for handling data effectively; occasionally something less quantifiable, like clarity of purpose and criteria for management. Anytime you encounter phrasing that relies on the reader to interpret and define the particulars of the insufficiency, this is the school of thought.

I see Big Data as a self-defined (perhaps diagnosed is more accurate) condition, but one that is based on idiosyncratic interpretation of current and possible future situations in which understanding of, planning for, and activity around data are central.

Here’s my work­ing def­i­n­i­tion: Big Data is the con­di­tion in which very high actual or expected dif­fi­culty in work­ing suc­cess­fully with data com­bines with very high antic­i­pated but unknown value and ben­e­fit, lead­ing to the a-priori assump­tion that cur­rently avail­able infor­ma­tion man­age­ment and ana­lyt­i­cal capa­bil­ties are broadly insuf­fi­cient, mak­ing new and pre­vi­ously unknown capa­bil­i­ties seem­ingly necessary.

