Friday
Apr 17, 2015

Analytics and Healthcare: the Next Scientific Revolution

"The next Jonas Salk will be a Mathematician, not a Doctor."

~ Jack Einhorn, Chief Scientist, Inform Laboratories

"Soon, you're probably not going to be able to say that you're a molecular biologist if you don't understand some statistics or rudimentary data-handling technologies," says Blume. "You're simply going to be a dinosaur if you don't."

~ John Blume, VP of Product Development, Affymetrix in Nature

Pieces of April

I'm excited about the start of baseball season — for someone who grew up in Western New York, the coming of April, spring and warm weather is always a good thing. April meant that the days of taking ground balls off of gymnasium floors were over; it might still be cold and awful outside, but you were finally out playing baseball! The cold dark winter was finally over…

Later, when I was at MIT and lived in the Boston area, April also meant the Boston Marathon. Patriots' Day was the one day of the year when (in sympathy) I always hoped for cool, damp weather.

This year's April 21 Marathon is a fun one for me, because this year I have a rooting interest and (courtesy of the Boston Athletic Association, Nike and the Internet) it will be possible to track the split times of all the runners as they race throughout the day. My rooting interest is for my niece Amanda — a Mars rocket scientist at the Jet Propulsion Lab who ran a 3:16 ( WOW! ) marathon out in California last year to qualify. Amanda is a great person, a great runner, and her run Monday is a great win for the next Scientific Revolution that is upon us:

Amanda has leukemia.

Has. Not had, but qualifying for Boston does show that she has a certain leg-up on things. Leukemia. Think of old movie images — Ali MacGraw as Jennifer Cavilleri in Love Story, fading away romantically (if unrealistically) to close the movie… Not anymore — That was then, this is now.

The difference between 2010s medicine and 1970s bathos is advanced analytics — here manifested as rational drug design. As summarized in the article on the breakthrough:

Imatinib was developed by rational drug design. After the Philadelphia chromosome mutation and hyperactive bcr-abl protein were discovered, the investigators screened chemical libraries to find a drug that would inhibit that protein. With high-throughput screening, they identified 2-phenylaminopyrimidine. This lead compound was then tested and modified by the introduction of methyl and benzamide groups to give it enhanced binding properties, resulting in imatinib.

This is where the new scientific revolution is coming from. We've reached the limits of the classic scientific method, where scientific advancement came in a three-step process:

In the world ahead we advance our understanding by:

These two new first steps change the world as we know it.

So, amid the many steps that drive the advance of cancer, there may be some where the cancer shows an Achilles heel — here the rogue bcr-abl protein that is critical to cancer progression. FORGET about "curing" cancer — if you can identify and deliver a mechanism that stops that protein you will stop the progression of the disease and change the cancer from dreaded-evil-of-pulp-movies into just another serious-but-treatable condition.

Rule #1 of the new Revolution: Answers aren't All or Nothing anymore. If you can identify things that make your world a little better, you might then find big wins by targeting their underlying mechanism.

You can track individual runners and results for Monday's Boston Marathon here: Live Raceday Coverage.

Moneyball

So back to baseball. Part of my love of the sport was that it was possible to be an effective baseball player without necessarily being all that great a baseball player. In high school and at MIT I had an OK fastball, no curveball but a surprising knuckleball, and I could throw strikes, hit, bunt and field decently. I was the type of pitcher who had batters cursing themselves on the way back to the dugout, but if I played the percentages I could survive more than I deserved by "throwing junk."

The Michael Lewis book Moneyball is fascinating because it is based on the notion that, to paraphrase The Bard: "There are more ways to win baseball games, Horatio, Than are dreamt of in your philosophy."

The magic here was not that there were still ways for a cheap, bad team to win baseball games. The magic was that, even in as statistical a game as baseball (Proof: How many home runs did Babe Ruth hit? If you know baseball then you didn't have to look that up.), we paid most of our attention to the wrong things!

Everybody who has ever been around a ballpark has heard of the Triple Crown (home runs, RBIs and batting average) for hitters, and W/L records and maybe strikeouts per inning pitched for pitchers. These may be fine, but a very different set of indicators made Moneyball happen.

Before Moneyball, who ever heard of

(and for you Fantasy Baseball GMs)

These bring us to our second rule of the new scientific revolution. Back in the DBS (days before spreadsheets) we worried about things like "statistical sample sizes" because it was practically impossible to track and measure everything.

Again — that was before. With a million rows possible in Excel and smartphone processors 150 million times faster than the computers that took Man to the Moon (much less new tools like Hadoop), our sample sizes can now be the entire population and we can track anything imaginable within that population. In baseball we historically tracked HRs, BA and RBIs because you could calculate them in time for the next edition of the morning paper, not because they were what we should have tracked. So:

Rule #2 of the new Revolution: Your sample size is the entire population: what things do you need to track to make things better in Rule #1?

Application to Modern Medicine and Healthcare

While it's easy to focus on innovations in medicine and patient care from advanced analytics, there are also positive advances in healthcare business management that shouldn't be overlooked when considering data in the healthcare equation. The diagram below shows some of the kinds of advances that are becoming everyday practices for leading healthcare providers.

Integrated patient data in the evolving age of EMR is just one of the places where advanced analytics can improve facility operations. Analytics and heuristics can advance this critical area, potentially augmenting advanced practices such as those outlined by Grant Landsbach in a recent paper.

Assisted diagnosis is another area of rapid evolution. IBM's Jeopardy-winning analytics system 'Watson' has been retargeted at medical diagnosis, and even if the breathless claim IBM's Watson Supercomputer May Soon Be The Best Doctor in the World falls short in actual practice, everyday benefits are likely in all fields of medicine as smart analytics evolve to resemble 'Librarians' from Neal Stephenson's futuristic novel Snow Crash.

Genomic treatment is a third area that is rapidly evolving today. Tamoxifen (where metabolism to endoxifen is genetically limited in 20% of cases) and Plavix (which by a similar mechanism is ineffective in almost 25% of patients) are just two of potentially many cases where medicine must be personalized to be effective.

This brings us to the third rule of the new Revolution. As the sample size grows to the whole population, the target population shrinks to one. I still remember being stunned reading the headline back in 1991, when Magic Johnson announced that he had the AIDS virus. Given the devastating legacy that AIDS had wrought by 1991, who would have dreamed that Magic would be alive and well almost 25 years later? We still have no "cure" for AIDS, but drug cocktails have moved from Dallas Buyers Club to real medicine. So, finally:

Rule #3 of the new Revolution: Wins are statistical — just because you don't have the all-or-nothing of a "cure" doesn't mean that progress on a little can't add up to a lot.

The Revolution Realized — Moving to Personalized Healthcare

The Healthcare advanced analytics field is ripe and significant advances are occurring daily. Even in an increasingly rugged business environment, Healthcare leaders are still driving better business practices and innovations in medicine and patient care.

Innovations in healthcare are increasingly appearing from outside of the classical scientific method, and these advances in management and patient care may match the revolutionary breakthroughs from Lister and Semmelweis more than a century ago. Leukemia treatable? — sure, we've had a range of approaches for fifty years now. But glioblastoma multiforme?

Advanced analytics is changing medicine and healthcare, and innovative leaders are changing the practice of medicine and with it are changing life as we know it. Managerial and clinical advances are a marathon effort, and many of the tools and techniques for advanced analytics are still in their infancy. We are, both healthcare providers and analytics experts, just getting started.

YAY Amanda! 3:27:42 in driving rain IN THE BOSTON MARATHON!!!

Find your strength. Change the world. Be part of the next revolution at TeamAmandaStrong!

Play Ball!

Wednesday
Oct 22, 2014

Spark 1.1 live - from Kitty Hawk to Infinity (and beyond...)

"The credit belongs to the man who is actually in the arena … who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.”

~ Theodore Roosevelt

It's not fair to be too hard on technological pioneers; the path to great progress is often marked with fine innovations that are trumpeted as "better than sliced bread", even if later hindsight shows them to be merely VHS ("better than Beta") — a humble step on the road to DVDs and then digital video.

So it has been with Big Data technologies; Big Data has done great things for my Stanford classmate Omid Kordestani at Google, and even if Google doesn't use MapReduce anymore it was still a milestone on our path, not just to the "Internet of Things" but to the hopefully-coming "Internet of Us."

So it's not surprising that Big Data is taking a pounding these days, exemplified by machine learning's Michael Jordan decrying the Delusions of Big Data. This is par for the course; even as advanced analytics becomes too big to simply dismiss, the techniques are still subject to the ills that flesh (and technology) is heir to — welcome to the human condition.

Jordan notes:

These are all true - this is the imperfect world we inhabit. I still see great possibilities in big data, and my take on Jordan's comments falls somewhere between physicist Niels Bohr:

"The opposite of a great truth is also true."

and an unknown writer (possibly Vonnegut), who opined:

"A pioneer in any field, if he stays long enough in the field, becomes an impediment to progress in that field…"

Progress changes everything. We must try to imagine the mindset of a Henry Ford, advancing manufacturing processes to put automobiles in the hands of all of his employees, even if he lacked the gasoline to power them, the gas stations to fill them, or even the paved roads to drive them on. The first models were technological marvels of their age, but that doesn't mean we can't laugh at them now:

So it is with the advances of big data technologies. I might reasonably agree with both Jordan and Michael Stonebraker that Hadoop, the darling of the first Data Age, is not just a yellow elephant but has some of the characteristics of a white elephant as well.

I've written about the foibles of Hadoop before. Hadoop is (and continues to advance as) a terrific technology for working with embarrassingly parallel data, but in a real-time world its drawbacks are like a manual crank on a car — it may work, but it's not what everybody (anybody?) would choose going forward. Here's what's wrong:

  • Limited acceleration options
  • Poor data provenance
  • Disc (not RAM) based triply-redundant storage — bulky and slow
  • Slow (HIVE) support for SQL — the data query language that everybody knows

Fortunately, the next step in technology evolution has reached version 1.1.0 since I last wrote. Spark addresses all of these problems, so let's go get it from the Spark download site:

Once we've downloaded the latest Spark tar file, we can un-tar it and set it up:

$ cd ~/Downloads
$ curl -O http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ cd /usr/local
$ sudo tar xvf ~/Downloads/spark-1.1.0.tgz
$ cd spark-1.1.0

Got it! Now let's try running Spark 1.1.0…

$ ./bin/run-example SparkPi 10
Failed to find Spark examples assembly in /Users/jkrepko/src/spark-1.1.0/lib or /Users/jkrepko/src/spark-1.1.0/examples/target

Whoops — spoke too soon. Let's build Spark, starting with Hadoop and including Scala and any of the other tools we'll need.

First let's install Hadoop 2.4.1 by downloading it from our chosen download mirror.

Once the Hadoop 2.4.1 download is complete, we untar it and symlink it

$ sudo tar xvf $HOME/Downloads/hadoop-2.4.1.tar
$ sudo ln -s hadoop-2.4.1 hadoop

Now we need to set the ownership of the installed files. First, check your own user and group:

$ ls -ld $HOME

which for me gives

drwxr-xr-x+ 127 jkrepko  staff  4318 Oct 20 09:43 /Users/jkrepko

Let's set that same ownership on our Hadoop install, and we can roll on from here:

$ sudo chown -R jkrepko:staff hadoop-2.4.1 hadoop

We can then check the changes with

$ ls -ld hadoop*
lrwxr-xr-x   1 jkrepko  staff   12 Oct 21 10:04 hadoop -> hadoop-2.4.1
drwxr-xr-x@ 12 jkrepko  staff  408 Jun 21 00:38 hadoop-2.4.1

We'll want to update our ~/.bashrc file to make sure HADOOP_HOME and the other key globals are set correctly:

export HADOOP_PREFIX="/usr/local/hadoop"
export HADOOP_HOME="${HADOOP_PREFIX}"
export HADOOP_COMMON_HOME="${HADOOP_PREFIX}"
export HADOOP_CONF_DIR="${HADOOP_PREFIX}/etc/hadoop"
export HADOOP_HDFS_HOME="${HADOOP_PREFIX}"
export HADOOP_MAPRED_HOME="${HADOOP_PREFIX}"
export HADOOP_YARN_HOME="${HADOOP_PREFIX}"
export "PATH=${PATH}:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin"
export SCALA_HOME=/usr/local/bin/scala

Now that Hadoop is installed, we can walk through the .sh and .xml files to ensure that our Hadoop installation is configured correctly. These are all routine Hadoop configurations. We'll start with hadoop-env.sh — comment out the first HADOOP_OPTS, and add the following line:

vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
## export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Next up are our updates to core-site.xml:

vi /usr/local/hadoop/etc/hadoop/core-site.xml

Here we'll add the following lines to configuration:

<configuration>
  <property>
     <name>hadoop.tmp.dir</name>
     <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
     <description>A base for other temporary directories.</description>
  </property>
  <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Next up is our mapred-site.xml.

vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

Immediately following installation this file will be blank. You can copy and edit the mapred-site.xml.template file, or simply add the following to the blank mapred-site.xml:

<configuration>
       <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9010</value>
       </property>
</configuration>

Our final configuration file is hdfs-site.xml — let's edit it as well:

$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following property inside the <configuration> block:

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>

Finally, to start / stop Hadoop let's add the following to our ~/.profile or ~/.bashrc file

$ vi ~/.profile
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-yarn.sh;$HADOOP_HOME/sbin/stop-dfs.sh"

And source the file to make hstart and hstop active

$ source ~/.profile

Before we can run Hadoop we first need to format the HDFS using

$ hadoop namenode -format

This yields a lot of configuration messages ending in

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at jkrepko-2.local/10.0.0.153
************************************************************/

Just as housekeeping, if you haven't done so already you must make your ssh keys available. I already have keys (they can otherwise be generated with ssh-keygen, as sketched below), so I just need to add:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
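
If you don't have keys yet, something like this will generate a passphrase-less pair first (a sketch; adjust the key type and path to taste):

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa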

I can then confirm that ssh is working with:

$ ssh localhost
$ exit

We can now start hadoop with

$ hstart

Let's see how our Hadoop system is running by entering

http://localhost:50070

Bravo! Hadoop 2.4.1 is up and running. Port 50070 gives us a basic heartbeat:
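
From the terminal, the JDK's jps tool gives a similar heartbeat; after hstart you should see the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager daemons listed (along with Jps itself):

$ jps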

We've started Hadoop, and we can stop it with

$ hstop

Now that Hadoop is installed, we can build Spark, but first we have to set up Maven properties

Add this to ~/.bashrc

export MAVEN_OPTS="-Xmx2g  -XX:ReservedCodeCacheSize=512m"

Now we can build Spark with the bundled sbt build tool. First, let's make sure we have Scala version 2.10 or later installed. On a Macintosh (my base machine here) this is a simple Homebrew install command:

$ brew install scala

Now we can run the Spark build utility:

$ SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly
NOTE: SPARK_HADOOP_VERSION is deprecated, please use -Dhadoop.version=2.4.1

The build succeeded, but with the deprecation warning we might do better in the future with something more like:

 $ sbt/sbt assembly -Dhadoop.version=2.4.1

We're now LIVE on Spark 1.1.0! Before we start, let's turn the logging down a bit. We can do this by copying conf/log4j.properties.template to conf/log4j.properties and editing the copy. The template's default root logger is:

log4j.rootCategory=INFO, console

Let's lower the log level so that we only see WARN messages and above. Here we change the rootCategory as such:

log4j.rootCategory=WARN, console

Now we're live and can run some examples! Time for some Pi…

$ ./bin/run-example SparkPi 10
Pi is roughly 3.140184

Not bad, but let's bump up the precision a bit:

$ ./bin/run-example SparkPi 100
Pi is roughly 3.14157

Mmmmmnnnn, Mmmmmnnnn good!
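
For a taste of Spark beyond Pi, here's a minimal word count over the README.md that ships in the Spark directory, run from the interactive spark-shell (a sketch; the file name and the "top 5" cutoff are just illustrative):

$ ./bin/spark-shell
scala> val words = sc.textFile("README.md").flatMap(_.split("\\s+"))
scala> words.map(w => (w, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(5).foreach(println)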

There are lots of other great emerging Spark examples, but we're up and running here and we'll stop for now.

It's a long road from Kitty Hawk to the (sadly missed) Concorde or the 787, and we won't get there in just one step. In my next post I'll lay out the toolkit we have today that should take Big Data from the sandy shores of North Carolina and a team of crazy bike guys (who should never have beaten Samuel Langley to first-flight, but did anyway!) to Lindbergh crossing the Atlantic, and maybe even to the DC3 — the airplane that brought air travel (big data?) to everyone.

Ad astra per aspera!

Sunday
Sep 7, 2014

I Saw Sparks

I've long been a follower of Joel Spolsky and his writings on software development, and some of them (e.g. Can Your Programming Language Do This?) are practically QED for their topics. I think I can do him one better on another terrific piece of his: Smart and Gets Things Done.

You can't argue with "smart" — as a software engineer, manager or executive you have to expect that your discipline and skill set will turn over 99% (the remaining 1% being vi and UNIX commands from the '80s) every 3-4 years. If you're not really smart and really dedicated you can't possibly keep up past a single product cycle.

"Gets Things Done" is similarly dispositive — even the brainiest developer won't get products out if they

  • Stay locked in their Microsoft-mandated individual offices, never talk to anyone, and stay alive only because someone keeps sliding pizzas under their door
  • Are so spectacularly abrasive that they make the rest of your team take to living in their Microsoft-mandated offices — leading to endless shifts in the product schedule (and ever-increasing pizza bills!)

Interpersonal skills tend to be undervalued in Technology, but are essential to getting things done. Even more than pure skill, "Gets Things Done" is a testimony to human grace: It takes a lot of humility to get products out the door, and you might be a Putnam Fellow, but without some give and take all that brilliant code will never make it off of your machine!

To Spolsky's pair I'd add one more category — one more thing that I look for when I'm hiring or building teams: "Sparks." Sparks are those odd nuggets that pop up on a resume, seemingly unrelated to anything, that indicate the kind of rare gifts that make our world the wonder it is. I once interviewed (and hired!) a fantastic software engineer, a former EE graduate student whose "spark" was that she'd done research work on (and helped write the book on) chinchillas! She was qualified in all the Spolsky ways — but to me the chinchilla book was the clincher. Few are those who do EE research on chinchillas, but only the rarest write the book following that research.

Smart?
Check!
Gets Things Done?
Check!
Chinchillas? Chinchillas??? Chinchillas!

HIRED!

Sparks have brought me some of the best things in my life: my wife Kate and some great friends and co-workers (among them a founding Menudo member, another of the greatest musicians of all time, several rare inventors and scientists, and artists and more…)

Sparks are also one of the things I look for in software development efforts, and thus (for my own efforts, and for my work) I tend to stay away from development approaches and tools that require teams and time that only Cecil B DeMille could master.

That scale might be fine for some, but I think it's just too many people to expect the apex of human expression and genius to appear. We don't know what we're doing, but that doesn't stop us from trying, and genius is sometimes the result. As I've written before, the great breakthrough that was Lotus 1-2-3 came from its macro capability — the magic-decoder-ring that gave spreadsheet users the ability to do things that its inventors might never have dreamed!

This is what makes Spark such a big win for Big Data — It's light and interactive, and rewards people who might have that spark of insight — even if they can't afford a 10-geek programming team. With the balance of this post we'll get Spark started, and in my next post we'll go deeper into the wonders that Spark can do.

First, we're going to want to update our Java runtime and JDK environments. There are options in this space now, but as a former Oracle employee (and still Oracle-rooter and fan) we'll head directly over to Larry's site for what we need:

And we're set. I'm running on a Macintosh and I've chosen Java 8 (finally! closures!!). A version of Java already comes with the Macintosh, so you're going to want to add a little magic to your ~/.bash_profile to make OSX recognize the latest Java (a sketch follows below). Once that's in, you can run
$ java -version

java version "1.8.0_20" Java(TM) SE Runtime Environment (build 1.8.0_20-b26) Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

from your terminal to confirm that we're all ready with Java.
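
As for that ~/.bash_profile "magic," it can be as simple as pointing JAVA_HOME at the new JDK (a sketch; the version flag will vary with your install):

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export PATH="$JAVA_HOME/bin:$PATH"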

Next comes an installation (or update) of Hadoop. I know I've spent most of my past four Big Data posts moaning about Hadoop's batch-y, non-interactive style, but for data that really is embarrassingly parallel it's the tool you need. Cloudera has taken a lot of the adventure out of Hadoop installations, but for my Mac I'm grateful to Yi Wang's Terrific Tech Note on Hadoop for Mac OSX. I've installed Hadoop 2.4.1, and the tech note covers the installation and does a nice job on getting started with the core-site.xml, hdfs-site.xml, and yarn-site.xml setup as well. Again, once you can run

$ hadoop jar ./hadoop-mapreduce-examples-2.4.1.jar wordcount LICENSE.txt out

from the MapReduce Examples folder you're set. Now, for Sparks and the show we've all been waiting for:

Monday
Sep 1, 2014

Spreadsheets for the New Millennium -- Getting There...

I've written some posts about my hopes for a next generation of computing; about the rise of "Spreadsheets for the New Millennium" here: part 1, part 2, and part 3. Well it's been a couple of years since I wrote about Spreadsheets, and it's a decade now since our current computational generation began, so let's see what we've got:

Google published their breakthrough paper in late 2004, and it was based on work that had been ongoing since at least 2002. Now 2002 was a great year (though with Nickelback topping the charts we might want to reconsider just how great), but it's more than 10 years ago and computing has changed a lot since then. Here are some things that were unknown in 2002 that are commonplace now:

  • SSDs — My current Mac has a 768GB SSD, and computer disks have since gone the way of… CRT displays
  • Flat-screen displays — HP used to sell monitors that my friend Julie Funk (correctly) called "2 men and a small boy" monitors — because that's what it took to carry one. Nobody misses them now — gone and forgotten
  • Multicore processors — I'm still waiting for faster living from my GPU, but Moore's Law still lives on in multicore
  • "10Gig-E" networks — I'm old enough to still remember IBM token-ring networks. Now "E" has been replaced by "Gig-E", which is itself headed for the "10Gig-E" boneyard.
  • GigaRAM — I did some work for Oracle back in the 2000s that showed the largest Oracle transactional DB running on about 1TB of memory. That was a lot then, but you can buy machines with a TB of RAM now. Memory is the new disk, and disk is the new tape…

There are still more innovations, but even with what I've listed so far I believe we can safely say that we're not living in the same computational world that Brin & Page found in the early days of Google. "Big Data" has had lots of wins in the technology domain and even some that have reached general public recognition (such as IBM's Jeopardy-playing Watson, and Amazon's Customers who bought … also bought). Expectations for results from data have risen, and it's time for some new approaches. Technologies from then just don't meet the needs of now.

Hadoop was a terrific advance (basically a Dennis Machine for the rest of us), but by today’s standards it’s clumsy, slow and inefficient. Hadoop brought us parallel computing and Big Data, but has done it through a disk-y solution model that really doesn't "feed the bulldog" now:

  1. Everything gets written to disk ( the new tape ), including all the interim steps — and there are lots of interim steps
  2. Hadoop really doesn't handle intermediate results and you'll need to chain lots of jobs together to perform your analysis, making Problem 1 even worse.
  3. I've written beautiful MapReduce code with Ruby and Hadoop Streaming, but nobody's willing to pay the streaming performance penalty, and thus we have the bane of Java MapReduce code — the API is rudimentary, it's difficult to test and it's impossible to confirm results. Hadoop has spawned a pile of add-on tools such as Hive and Pig that make this easier, but the API problems here are fundamental:
    • You have to write and test piles of code to perform even modest tasks
    • You have to generate volumes of "boilerplate" code
    • Hadoop doesn’t do anything out of the box. It can be a herculean writing and configuration effort to tackle even modest problems.

This brings us to the biggest problem of the now-passing MapReduce era — most haystacks DO NOT have any needles in them! The "Big Data" era is still only just beginning, but if you're looking for needles then lighter, more interactive approaches are already a better way to find them.

The great news is that solutions are emerging that increasingly provide my long-dreamed Spreadsheets for the New Millennium. One of my favorite of these new approaches is Apache Spark and the work evolving from the Berkeley Data Analytics Stack.

Spark is a nice framework for general-purpose in-memory distributed analysis. I've sung the praises of in-memory before ( Life Beyond Hadoop ), and in-memory is a silver bullet for real-time computation. Spark is also familiar: You can deploy Spark as a cluster and submit jobs to it - much as you would with Hadoop. Spark also offers Spark SQL (formerly Shark) that brings advances beyond Hive in the Spark environment.
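
To give a flavor of Spark SQL, here's a minimal sketch along the lines of the Spark 1.1 programming guide, run from the spark-shell; "people.txt" is a hypothetical comma-separated file of name,age pairs:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit conversion from an RDD of case classes to a SchemaRDD

case class Person(name: String, age: Int)
val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teens.collect().foreach(println)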

Many of the major Hadoop vendors have embraced Spark, and it's a strong Hadoop replacement because it tackles the fundamental issues that have plagued Hadoop in the 2010s:

  • Hadoop has a single point of failure (namenode) — fixed using Hadoop v2 or Spark
  • Hadoop lacks acceleration features — Spark is in-memory and parallelized and fast
  • Hadoop provides neither data integrity nor data provenance — RDDs (resilient distributed datasets) are (re)generated by provenance, and legacy data management can be augmented by Loom in the Hadoop ecosystem
  • HDFS stores three copies of all data (basically by brute force) — Spark RDDs are cleaner and faster
  • Hive is slow — Spark SQL (with caching) is fast - routinely 10X to 100X faster

Spark supports both batch and ( unlike Hadoop ) streaming analysis, so you can use a single framework for real-time exploration as well as batch processing. Spark also introduces a nice Scala-based functional programming model, which offers a simpler way to express the map and reduce patterns at the heart of Hadoop's Map/Reduce (a short sketch follows the list below).

So Spark is:

  • An in-memory cluster computing framework
  • Built in Scala, so it runs on the JVM and is compatible with existing Java code and libraries
  • 10-100 times faster than Hadoop Map/Reduce because it runs in memory and avoids Hadoop-y disk I/O
  • A riff on the Scala collections API -- working with large distributed datasets
  • Batch and stream processing in single framework with a common API
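
As a flavor of that collections-style API, here's a minimal sketch in the spark-shell; "app.log" and the ERROR filter are hypothetical stand-ins for your own data:

val lines  = sc.textFile("app.log")             // an RDD of strings
val errors = lines.filter(_.contains("ERROR"))  // lazy -- nothing has run yet
errors.cache()                                  // keep the working set in memory
val counts = errors.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)  // count errors by the first field (say, a date)
counts.collect().foreach(println)               // actions trigger the computation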

Spark really looks like the next step forward. It also fits Thomas Kuhn's Structure of Scientific Revolutions pattern for a next step forward, in that it preserves many of the existing Big Data approaches while simultaneously moving beyond them. Spark has native language bindings for Scala, Python, and Java and offers some interesting advances, including a native graph processing library called GraphX and a machine learning library (like Mahout) called MLlib.

These are all valuable steps beyond the Toy Elephant, and they give us a great way to find needles while controlling "Needle-less Haystack" risks and costs. So here is our core scenario:

  • You have a haystack
  • You think there might be a needle (or needles!) in it
  • You want to staff and fund a project to find needles — even if you don't know where they are or exactly how to find them

So — do you:

Staff a big project with lots of resources, writing lots of boilerplate code that you'll run slowly in batch mode -- all while praying that magical "needle" answers appear?

or

Start experimenting in real time, with most of your filters and reducers pre-written for you — producing new knowledge and results in 1/10th to 1/100th the time!

With Spark and Spark SQL, to paraphrase William Gibson, "the future is already here, and it's about to get a lot more evenly distributed!" More on rolling with Spark and Shark / Spark SQL in future postings…

Saturday
May 17, 2014

The New Old Software

The forward march of technology is magnificent — sometimes the path to progress is so breathtaking that we lose track of just how far that path might stray from our everyday lives. It is great that Twitter and Facebook are inventing all kinds of ingenious innovations, but before we dive in we have to make sure we don't blot out previous generations of working software (and leave users high-and-dry). They (the kool kids) have different goals and requirements than we do…

The Road to Ruin

This may not draw nearly the buzz and press of its Facebook-y counterparts, but we are seeing a terrific evolution in the handling of "legacy" software — a lot of great pieces are all coming together and developers are just catching on to the combined wonder of them now. Here is the Hobbesian world of the-rest-of-software today:

  • Vastly more brownfield than "greenfield" development — yes, the kool kids may be inventing Rails 4.1.x and AngularJS and MongoDB, but we're NOT. We are more commonly asked to provide updates to a current site, even if it was written all the way back in the bubble in ASP or JSP or something, with piles of HTML table-layouts and stored procedures.
  • Outdated technology. A/JSP pages may have been great then, but they scream "2001: A Code Odyssey" now. There was nothing wrong with writing software that way back then, but we might do better now…
  • No clear requirements, either for the legacy system or for the futuristic wonder that we are being asked to create to replace it
  • No documentation of the legacy system, or libraries of documents that are ponderous and out-of-date
  • Few tests or coverage of the legacy system, or…
  • Bad tests — lovely unit tests that may all pass (if we don't touch anything), but might all break if we modify a single line of code...
  • BUT (and this is critical) the current software mostly does work, and has lots of users who count on it!

We don't need kool-kid-ware, we just need the tools to advance in the world we live in. Most code stinks and simply disappears… The code that remains is still with us because it met a real need, with real users. Whatever sins may plague them, we're left with the winners and we need to take them boldly into the next generation.

We need the right approach, now — hesitation leads to even worse issues…

The Road to Ruin

So here is our task — to skip the magical incantations (that might be great for Google-glass alpha-testers) and to mix a little magic into the systems that serve our customers and pay the bills every day. Here are the rules we will follow into that new world.

  • Brownfield - we will evolve the code base, but we cannot break it and we'd be idiots to rewrite it!
  • We will adopt leading technologies to provide new features — leading edge but not bleeding edge
  • Our user base is engaged, but their requirements are sketchy — one of the most important creations from our work will be clear requirements for any future work!
  • Our requirements and our tests will be read by future generations, but (face it) NOBODY is ever going to read our documentation, and we have no time or budget to write any
  • Tests — TDD and BDD are a great step forward, so we will write tests AND we'll write them the right way, at the right level!
  • We will update the system incrementally, and at every step in our updates everything will still work!

In my next post I'll describe how to write new old software, updated for 2014...