
The Sensemaking Spectrum for Business Analytics: Translating from Data to Business Through Analysis

June 10th, 2014 — 8:33am

One of the most compelling outcomes of our strategic research efforts over the past several years is a growing vocabulary that articulates our cumulative understanding of the deep structure of the domains of discovery and business analytics.

Modes are one example of the deep structure we've found. After looking at discovery activities across a very wide range of industries, question types, business needs, and problem-solving approaches, we've identified distinct and recurring kinds of sensemaking activity, independent of context. We label these activities Modes: Explore, Compare, and Comprehend are three of the nine recognizable modes. Modes describe *how* people go about realizing insights. (Read more about the programmatic research and formal academic grounding and discussion of the modes here.) By analogy to languages, modes are the 'verbs' of discovery activity. When applied to the practical questions of product strategy and development, the modes of discovery allow one to identify what kinds of analytical activity a product, platform, or solution needs to support across a spread of usage scenarios, and then make concrete and well-informed decisions about every aspect of the solution, from high-level capabilities to which specific types of information visualizations best enable these scenarios for the types of data users will analyze.

The modes are a powerful generative tool for product making, but if you've spent time with young children, or had a really bad hangover (or both at the same time…), you understand the difficulty of communicating using only verbs.

So I'm happy to share that we've found traction on another facet of the deep structure of discovery and business analytics. Continuing the language analogy, we've identified some of the 'nouns' in the language of discovery: specifically, the consistently recurring aspects of a business that people are looking for insight into. We call these discovery Subjects, since they identify *what* people focus on during discovery efforts, rather than *how* they go about discovery, as with the Modes.

Defining the collection of Subjects people repeatedly focus on allows us to understand and articulate sensemaking needs and activity in a more specific, consistent, and complete fashion. In combination with the Modes, we can use Subjects to concretely identify and define scenarios that describe people's analytical needs and goals. For example, a scenario such as 'Explore [a Mode] the attrition rates [a Measure, one type of Subject] of our largest customers [Entities, another type of Subject]' clearly captures the nature of the activity (exploration of trends vs. deep analysis of underlying factors) and the central focus (attrition rates for customers above a certain set of size criteria), from which follow many of the specifics needed to address this scenario in terms of data, analytical tools, and methods.
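To make the Mode-plus-Subject framing concrete, here is a minimal sketch of a scenario expressed as a Mode (the 'verb') acting on typed Subjects (the 'nouns'). All class and field names are my own illustrative choices, not part of the framework itself:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subject:
    kind: str   # e.g. "Measure", "Entity", "Event"
    name: str

@dataclass(frozen=True)
class Scenario:
    mode: str             # e.g. "Explore", "Compare", "Comprehend"
    subjects: tuple       # the Subjects this scenario focuses on

    def describe(self) -> str:
        focus = ", ".join(f"{s.name} ({s.kind})" for s in self.subjects)
        return f"{self.mode}: {focus}"

# The attrition-rate scenario from the text, restated as data
scenario = Scenario(
    mode="Explore",
    subjects=(
        Subject("Measure", "attrition rate"),
        Subject("Entity", "largest customers"),
    ),
)
print(scenario.describe())
```

Capturing scenarios this way is one plausible route from the vocabulary to product decisions: a set of such records can be queried for which Modes and Subject kinds a solution must support.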

We can also use Subjects to translate effectively between the different perspectives that shape discovery efforts, reducing ambiguity and increasing impact on both sides of the perspective divide; for example, from the language of business, which often motivates analytical work by asking questions in business terms, to the perspective of analysis. The question posed to a Data Scientist or analyst may be something like "Why are sales of our new kinds of potato chips to our largest customers fluctuating unexpectedly this year?" or "Where can we innovate, by expanding our product portfolio to meet unmet needs?". Analysts translate questions and beliefs like these into one or more empirical discovery efforts that more formally and granularly indicate the plan, methods, tools, and desired outcomes of analysis. From the perspective of analysis, this second question might become, "Which customer needs of type 'A', identified and measured in terms of 'B', that are not directly or indirectly addressed by any of our current products, offer 'X' potential for 'Y' positive return on the investment 'Z' required to launch a new offering, in time frame 'W'? And how do these compare to each other?". Translation also happens from the perspective of analysis to the perspective of data, in terms of availability, quality, completeness, format, volume, etc.

By implication, we are proposing that most working organizations (small and large, for-profit and non-profit, domestic and international, and in the majority of industries) can be described for analytical purposes using this collection of Subjects. This is a bold claim, but simplified articulation of complexity is one of the primary goals of sensemaking frameworks such as this one. (And, yes, this is in fact a framework for making sense of sensemaking as a category of activity, but we're not considering the recursive aspects of this exercise at the moment.)

Compellingly, we can place the collection of Subjects on a single continuum, which we call the Sensemaking Spectrum, that simply and coherently illustrates some of the most important relationships between the different types of Subjects, and also illuminates several of the fundamental dynamics shaping business analytics as a domain. As a corollary, the Sensemaking Spectrum also suggests innovation opportunities for products and services related to business analytics.

The first illustration below shows Subjects arrayed along the Sensemaking Spectrum; the second illustration presents examples of each kind of Subject. Subjects appear in colors ranging from blue to reddish-orange, reflecting their place along the Spectrum, which indicates whether a Subject addresses more the viewpoint of systems and data (data-centric, and blue) or people (user-centric, and orange). This axis is shown explicitly above the Spectrum. Annotations suggest how Subjects align with the three significant perspectives of Data, Analysis, and Business that shape business analytics activity. This rendering makes explicit the translation and bridging function of Analysts as a role, and analysis as an activity.


Subjects are best understood as fuzzy categories, rather than tightly defined buckets. For each Subject, we suggest some of the most common examples: Entities may be physical things such as named products, or locations (a building, or a city); they could be Concepts, such as satisfaction; or they could be Relationships between entities, such as the variety of possible connections that define linkage in social networks. Likewise, Events may indicate a time and place in the dictionary sense; or they may be Transactions involving named entities; or take the form of Signals, such as 'some Measure had some value at some time', which many enterprises understand as alerts.

The central story of the Spectrum is that though consumers of analytical insights (represented here by the Business perspective) need to work in terms of Subjects that are directly meaningful to their perspective, such as Themes, Plans, and Goals, the working realities of data (condition, structure, availability, completeness, cost) and the changing nature of most discovery efforts make direct engagement with source data in this fashion impossible. Accordingly, business analytics as a domain is structured around the fundamental assumption that sensemaking depends on analytical transformation of data. Analytical activity incrementally synthesizes more complex and larger-scope Subjects from data in its starting condition, accumulating insight (and value) by moving through a progression of stages in which increasingly meaningful Subjects are iteratively synthesized from the data and recombined with other Subjects. The end goal of 'laddering' successive transformations is to enable sensemaking from the business perspective, rather than the analytical perspective.

Synthesis through laddering is typically accomplished by specialized Analysts using dedicated tools and methods. Beginning with some motivating question, such as seeking opportunities to increase the efficiency (a Theme) of fulfillment processes to reach some level of profitability by the end of the year (a Plan), Analysts will iteratively wrangle and transform source data Records, Values, and Attributes into recognizable Entities, such as Products, that can be combined with Measures or other data into the Events (shipment of orders) that indicate the workings of the business.
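The laddering described above can be sketched in miniature. This is a deliberately toy example, with field names and the fulfillment measure invented for illustration, showing how raw Records are successively synthesized into Entities, Events, and a Measure:

```python
# Hypothetical raw Records, as they might arrive from an order system
raw_records = [
    {"sku": "A1", "qty": 2, "shipped": True},
    {"sku": "A1", "qty": 1, "shipped": False},
    {"sku": "B7", "qty": 5, "shipped": True},
]

# Rung 1 — Records to Entities: recognize distinct Products
# from the raw attribute values
products = {r["sku"] for r in raw_records}

# Rung 2 — Records to Events: shipments are the Events that
# indicate the workings of the business
shipment_events = [r for r in raw_records if r["shipped"]]

# Rung 3 — Events to a Measure: a simple fulfillment rate,
# one step closer to business-meaningful Subjects
fulfillment_rate = len(shipment_events) / len(raw_records)

print(sorted(products), round(fulfillment_rate, 3))
```

Each rung produces a Subject that is more directly meaningful to the Business perspective than the one below it, which is the point of the progression.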

More complex Subjects (to the right of the Spectrum) are composed of, or make reference to, less complex Subjects: a business Process such as Fulfillment will include Activities such as confirming, packing, and then shipping orders. These Activities occur within, or are conducted by, organizational units such as teams of staff or partner firms (Networks), composed of Entities which are structured via Relationships, such as supplier and buyer. The fulfillment process will involve other types of Entities, such as the products or services the business provides. The success of the fulfillment process overall may be judged according to a sophisticated operating efficiency Model, which includes tiered Measures of business activity and health for the transactions and activities included. All of this may be interpreted through an understanding of the operational domain of the business's supply chain (a Domain).

We'll discuss the Spectrum in more depth in succeeding posts.


Data Science Highlights: An Investigation of the Discipline

March 28th, 2014 — 1:26pm

I've posted a substantial readout summarizing some of the more salient findings from a long-running programmatic research effort into data science. This deck shares synthesized findings around many of the facets of data science as a discipline, including practices, workflow, tools, org models, skills, etc. This readout distills a very wide range of inputs, including: direct interviews, field-based ethnography, community participation (real-world and online), secondary research from industry and academic sources, analysis of hiring and investment activity in data science over several years, descriptive and definitional artifacts authored by practitioners, analysts, educators, and other external actors, media coverage of data science, historical antecedents, the structure and evolution of professional disciplines, and more.

I consider it a sort of business-anthropology-style investigation of data science, conducted from the viewpoint of product making's primary aspects: strategy, management, design, and delivery.

I learned a great deal during the course of this effort, and expect to continue to learn, as data science will continue to evolve rapidly over the next several years.

Data science practitioners looking at this material are invited to provide feedback about where these materials are accurate or inaccurate, and most especially about what is missing and what is coming next for this very exciting field.


Data Science and Empirical Discovery: A New Discipline Pioneering a New Analytical Method

March 26th, 2014 — 11:00am

One of the essential patterns of science and industry in the modern era is that new methods for understanding (what I'll call sensemaking from now on) often emerge hand in hand with new professional and scientific disciplines. This linkage between new disciplines and new methods follows from the deceptively simple imperative to realize new types of insight, which often means analysis of new kinds of data, using new techniques, applied from newly defined perspectives. New viewpoints and new ways of understanding are literally bound together in a sort of symbiosis.

One familiar example of this dynamic is the rapid development of statistics during the 18th and 19th centuries, in close parallel with the rise of new social science disciplines, including economics (originally political economy) and sociology, and natural sciences such as astronomy and physics. On a very broad scale, we can see the pattern in the tandem evolution of the scientific method for sensemaking and the codification of modern scientific disciplines based on precursor fields such as natural history and natural philosophy during the scientific revolution.

Today, we can see this pattern clearly in the simultaneous emergence of Data Science as a new and distinct discipline, accompanied by Empirical Discovery, the new sensemaking and analysis method Data Science is pioneering. Given its dramatic rise to prominence recently, declaring Data Science a new professional discipline should inspire little controversy. Declaring Empirical Discovery a new method may seem bolder, but with the essential pattern of new disciplines appearing in tandem with new sensemaking methods in mind, it is more controversial to suggest Data Science is a new discipline that lacks a corresponding new method for sensemaking. (I would argue it is the method that makes the discipline, not the other way around, but that is a topic for fuller treatment elsewhere.)

What is empirical discovery? While empirical discovery is a new sensemaking method, we can build on two existing foundations to understand its distinguishing characteristics and help craft an initial definition. The first of these is an understanding of the empirical method. Consider the following description:

"The empirical method is not sharply defined and is often contrasted with the precision of the experimental method, where data are derived from the systematic manipulation of variables in an experiment. …The empirical method is generally characterized by the collection of a large amount of data before much speculation as to their significance, or without much idea of what to expect, and is to be contrasted with more theoretical methods in which the collection of empirical data is guided largely by preliminary theoretical exploration of what to expect. The empirical method is necessary in entering hitherto completely unexplored fields, and becomes less purely empirical as the acquired mastery of the field increases. Successful use of an exclusively empirical method demands a higher degree of intuitive ability in the practitioner."

Data Science as practiced is largely consistent with this picture. Empirical prerogatives and understandings shape the procedural planning of Data Science efforts, rather than theoretical constructs. Semi-formal approaches predominate over explicitly codified methods, signaling the importance of intuition. Data scientists often work with data that is already on hand from business activity, or data that is newly generated through normal business operations, rather than seeking to acquire wholly new data consistent with the design parameters and goals of formal experimental efforts. Much of the sensemaking activity around data is explicitly exploratory (what I call the 'panning for gold' stage of evolution; more on this in subsequent postings), rather than systematic in the manipulation of known variables. These exploratory techniques are used to address relatively new fields, such as the Internet of Things, wearables, and large-scale social graphs, and collective activity domains such as instrumented environments and the quantified self. These new domains of application are not mature in analytical terms; analysts are still working to identify the most effective techniques for yielding insights from data within their bounds.

The second relevant perspective is our understanding of discovery as an activity that is distinct and recognizable in comparison to generalized analysis: from this, we can summarize discovery as sensemaking intended to arrive at novel insights, through exploration and analysis of diverse and dynamic data in an iterative and evolving fashion.

Looking deeper, one specific characteristic of discovery as an activity is the absence of formally articulated statements of belief and expected outcomes at the beginning of most discovery efforts. Another is the iterative nature of discovery efforts, which can change course in non-linear ways and even 'backtrack' on the way to arriving at insights: both the data and the techniques used to analyze data change during discovery efforts. Formally defined experiments are much more clearly determined from the beginning, and their definition is less open to change during their course. A program of related experiments conducted over time may show iterative adaptation of goals, data, and methods, but the individual experiments themselves are not malleable and dynamic in the fashion of discovery. Discovery's emphasis on novel insight as the preferred outcome is another important characteristic; by contrast, formal experiments are repeatable and verifiable by definition, and the degree of repeatability is a criterion of well-designed experiments. Discovery efforts often involve an intuitive shift in perspective that is recountable and retraceable in retrospect, but cannot be anticipated.

Building on these two foundations, we can define Empirical Discovery as a hybrid, purposeful, applied, augmented, iterative, and serendipitous method for realizing novel insights for business, through analysis of large and diverse data sets.

Let's look at these facets in more detail.

Empirical discovery primarily addresses the practical goals and audiences of business (or industry), rather than scientific, academic, or theoretical objectives. This is tremendously important, since the practical context impacts every aspect of Empirical Discovery.

'Large and diverse data sets' reflects the fact that Data Science practitioners engage with Big Data as we currently understand it: situations in which the confluence of data types and volumes exceeds the capabilities of business analytics to practically realize insights in terms of tools, infrastructure, practices, etc.

Empirical discovery uses a rapidly evolving hybridized toolkit, blending a wide range of general and advanced statistical techniques with sophisticated exploratory and analytical methods from a wide variety of sources that includes data mining, natural language processing, machine learning, neural networks, Bayesian analysis, and emerging techniques such as topological data analysis and deep learning.

What's most notable about this hybrid toolkit is that Empirical Discovery does not originate novel analysis techniques; it borrows tools from established disciplines such as information retrieval, artificial intelligence, computer science, and the social sciences. Many of the more specialized or apparently exotic techniques data science and empirical discovery rely on, such as support vector machines, deep learning, or measuring mutual information in data sets, have established histories of usage in academic or other industry settings, and have reached reasonable levels of maturity. Empirical discovery's hybrid toolkit is transposed from one domain of application to another, rather than invented.
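As a small taste of one of these borrowed techniques, here is a sketch of mutual information estimated directly from co-occurrence counts of two discrete variables. The data is toy and the estimator deliberately basic; it illustrates the idea rather than production practice:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of mutual information (in bits) for a
    list of (x, y) observations of two discrete variables."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly dependent variables: knowing x determines y (MI = 1 bit here)
dependent = [("a", 0), ("b", 1)] * 50
print(mutual_information(dependent))

# Independent variables: the joint factorizes, so MI = 0
independent = [("a", 0), ("a", 1), ("b", 0), ("b", 1)] * 25
print(mutual_information(independent))
```

High mutual information between a candidate feature and an outcome is one common signal, among the borrowed techniques, that a relationship merits deeper exploration.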

Empirical Discovery is an applied method in the same way Data Science is an applied discipline: it originates in and is adapted to business contexts, it focuses on arriving at useful insights to inform business activities, and it is not used to conduct basic research. At this early stage of development, Empirical Discovery has no independent and articulated theoretical basis, and does not (yet) advance a distinct body of knowledge based on theory or practice. All viable disciplines have a body of knowledge, whether formal or informal, and applied disciplines have only their cumulative body of knowledge to distinguish them, so I expect this to change.

Empirical discovery is not only applied, but explicitly purposeful, in that it is always set in motion and directed by an agenda from a larger context, typically the specific business goals of the organization acting as prime mover and funding data science positions and tools. Data Science practitioners effect Empirical Discovery by making it happen on a daily basis, but wherever there is empirical discovery activity, there is sure to be intentionality from a business view. For example, even in organizations with a formal hack-time policy, our research suggests there is little or no completely undirected or self-directed empirical discovery activity, whether conducted by formally recognized Data Science practitioners, business analysts, or others.

One very important implication of the situational purposefulness of Empirical Discovery is that there is no direct imperative for generating a body of cumulative knowledge through original research: the insights that result from Empirical Discovery efforts are judged by their practical utility in an immediate context. There is also no explicit scientific burden of proof or verifiability associated with Empirical Discovery within its primary context of application. Many practitioners encourage some aspects of verifiability, for example by annotating the various sources of data used for their efforts and the transformations involved in wrangling data on the road to insights or data products, but this is not a requirement of the method. Another implication is that empirical discovery does not adhere to any explicit moral, ethical, or value-based missions that transcend working context. While Data Scientists often interpret their role as transformative, this is in reference to business. Data Science is not medicine, for example, with a Hippocratic oath.

Empirical Discovery is an augmented method in that it depends on computing and machine resources to increase human analytical capabilities: it is simply impractical for people to manually undertake many of the analytical techniques common to Data Science. An important point to remember about augmented methods is that they are not automated; people remain necessary, and it is the combination of human and machine that is effective at yielding insights. In the problem domain of discovery, the patterns of sensemaking activity leading to insight are intuitive, non-linear, and associative; activities with these characteristics are not fully automatable with current technology. And while many analytical techniques can be usefully automated within boundaries, these tasks typically make up just a portion of a complete discovery effort. For example, using latent class analysis to explore a machine-sampled subset of a larger data corpus is task-specific automation complementing human perspective at particular points of the Empirical Discovery workflow. This dependence on machine-augmented analytical capability is recent within the history of analytical methods. In most of the modern era (roughly the later 17th, 18th, 19th, and early 20th centuries), the data employed in discovery efforts was manageable 'by hand', even when using the newest mathematical and analytical methods emerging at the time. This remained true until the effective commercialization of machine computing ended the need for human computers as a recognized role in the middle of the 20th century.
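The shape of that task-specific automation can be sketched simply. In this toy example a machine samples a subset of a larger corpus and runs an automated summary step; a plain category frequency count stands in for latent class analysis, which in practice would come from a specialized statistical library. All data and field names are invented for illustration:

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the sketch is reproducible

# A hypothetical larger corpus of customer records
corpus = [
    {"id": i, "segment": random.choice(["retail", "wholesale", "online"])}
    for i in range(10_000)
]

# Automated step 1: machine-sample a tractable subset of the corpus
sample = random.sample(corpus, k=500)

# Automated step 2: summarize structure in the sample
# (a stand-in for the latent class analysis mentioned in the text)
segment_counts = Counter(r["segment"] for r in sample)

# The human analyst's role resumes here: interpreting which
# segments merit deeper modeling, and steering the next iteration
print(segment_counts.most_common())
```

The automation handles the mechanical sampling and counting; the intuitive, non-linear judgment about what the summary means remains with the person, which is the sense in which the method is augmented rather than automated.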

The reality of most analytical efforts, even those with good initial definition, is that insights often emerge in response to, and in tandem with, changing and evolving questions which were not identified, or perhaps not even understood, at the outset. During discovery efforts, analytical goals and techniques, as well as the data under consideration, often shift in unpredictable ways, making the path to insight dynamic and non-linear. Further, the sources of and inspirations for insight are difficult or impossible to identify both at the time and in retrospect. Empirical discovery addresses the complex and opaque nature of discovery with iteration and adaptation, which combine to set the stage for serendipity.

With this initial definition of Empirical Discovery in hand, the natural question is what this means for Data Science and business analytics. Three things stand out for me. First, I think one of the central roles played by Data Science is in pioneering the application of existing analytical methods from specialized domains to serve general business goals and perspectives, seeking effective ways to work with the new types (graph, sensor, social, etc.) and tremendous volumes (yotta, yotta, yotta…) of business data at hand in the Big Data moment, and realize insights.

Second, following from this, Empirical Discovery is a methodological framework within and through which a great variety of analytical techniques, at differing levels of maturity and from other disciplines, are vetted for business analytical utility in iterative fashion by Data Science practitioners.

And third, it seems this vetting function is deliberately part of the makeup of empirical discovery, which I consider a very clever way to create a feedback loop that enhances Data Science practice by using Empirical Discovery as a discovery tool for refining its own methods.


Big Data is a Condition (Or, "It's (Mostly) In Your Head")

March 10th, 2014 — 1:07pm

Unsurprisingly, definitions of Big Data run the gamut from the turgid to the flip, making room to include the trite, the breathless, and the simply uninspiring in the big circle around the campfire. Some of these definitions are useful in part, but none of them captures the essence of the matter. Most are mistakes in kind, trying to ground and capture Big Data as a 'thing' of some sort that is measurable in objective terms. Anytime you encounter a number, this is the school of thought.

Some approach Big Data as a state of being, most often a simple operational state of insufficiency of some kind: typically resources, like analysts, compute power, or storage for handling data effectively; occasionally something less quantifiable, like clarity of purpose and criteria for management. Anytime you encounter phrasing that relies on the reader to interpret and define the particulars of the insufficiency, this is the school of thought.

I see Big Data as a self-defined (perhaps self-diagnosed is more accurate) condition, but one that is based on idiosyncratic interpretation of current and possible future situations in which understanding of, planning for, and activity around data are central.

Here's my working definition: Big Data is the condition in which very high actual or expected difficulty in working successfully with data combines with very high anticipated but unknown value and benefit, leading to the a priori assumption that currently available information management and analytical capabilities are broadly insufficient, making new and previously unknown capabilities seemingly necessary.

