Thursday
Sep 26 2013

My Maria ... DB

You set my soul free like a ship sailing on the sea
~ "My Maria" - BW Stevenson, (later Brooks & Dunn)

As a former Oracle employee I was pleased today to see the last of the stirring comeback for Oracle Team USA, and thrilled to see some footage shot from 101 California St. in San Francisco by former co-workers at Booz Allen. Faulkner was right: "The past is never dead. It's not even past."

The best part of that past world was working for Chris Russell and meeting with Larry Ellison every Friday (3:00PM, or whenever Larry showed up, to 5:00PM) to go over our progress with the hosted Oracle offering Oracle Business OnLine. Chris is a fantastic manager and a great person -- we got BOL (as it was called) rolling, and I was thrilled just coming out of our first couple of Larry-meetings with the feeling: "He liked it! We didn't get fired!" Little secret: Larry was a better manager than he ever gets credit for, and/but he is magnificently competitive!

Oracle is a terrific database, but when Oracle acquired Sun Microsystems they also acquired the previously-acquired-by-Sun MySQL database. MySQL is a nice open-source database, and was at the core of Ruby on Rails development efforts practically from RoR 1.0 back in 2006. Rails has long since broken that direct linkage, but it was a nice luxury to tie in MySQL and get ActiveRecord object-relational mapping for free. I've missed MySQL, and there have been varied worries that MySQL would be the red-headed stepchild in the Oracle household, but now Maria has stepped in to take our worries away.

"Maria" in this case is MariaDB. As wikipedia notes "MariaDB is a community-developed fork of the MySQL relational database management system, the impetus being the community maintenance of its free status under the GNU GPL." Having a good, trusted, universal SQL database was a terrific luxury, and that sounds enough like the old MySQL battle cry, so let's get started.

The first thing we'll want to do is clear out any installations or vestiges of MySQL that currently exist on our machines. For the MariaDB example, I'm going to clear out both my development machine (a MacBook Pro running OSX 10.8.5) and my target machine (a Linux box running Ubuntu 13.04 Raring Ringtail). Tom Keur offers a nice example here: How to remove MySQL completely Mac OS X Leopard


$ sudo rm /usr/local/mysql
$ sudo rm -rf /usr/local/mysql*
$ sudo rm -rf /Library/StartupItems/MySQLCOM
$ sudo rm -rf /Library/PreferencePanes/My*
$ sudo rm -rf /Library/Receipts/mysql*
$ sudo rm -rf /Library/Receipts/MySQL*
$ sudo rm /etc/my.cnf

That clears out the Mac version, and on Ubuntu a simple...


$ sudo apt-get remove mysql-server mysql-client

... should do the trick. Now we'll install MariaDB. For work on a Macintosh we can let Homebrew do most of the work for us. As MariaDB is a drop-in replacement for MySQL, once Homebrew has it in place we can have it install its own system tables:


$ brew install mariadb
$ unset TMPDIR
$ mysql_install_db
$ cp /usr/local/Cellar/mariadb/5.5.32/homebrew.mxcl.mariadb.plist \ 
~/Library/LaunchAgents/homebrew.mxcl.mariadb.plist

The final plist copy ensures that MariaDB starts up whenever we log in to the Macintosh.
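
To have it running right away, without logging out and back in, launchctl can load the agent on the spot:

$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.mariadb.plist

It's just a little more work on Ubuntu: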


$ sudo apt-get install mariadb-server
$ sudo apt-get install libmariadbd-dev


The second call, installing libmariadbd-dev, gives us the headers we'll need to build the mysql2 gem on Ubuntu. With our database installed, we'll now install the mysql2 gem to be our database adaptor. On the Mac we point the gem at the Homebrew MariaDB install:


$ sudo gem install --no-rdoc --no-ri mysql2 -- \ 
--with-mysql-dir=$(brew --prefix mariadb) \
--with-mysql-config=$(brew --prefix mariadb)/bin/mysql_config

On Ubuntu, starting and stopping the server uses commands that will be familiar to any Linux sysadmin:


  sudo /etc/init.d/mysql start
  sudo /etc/init.d/mysql stop
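
And with libmariadbd-dev already in place, the gem build on Ubuntu needs none of the Homebrew path flags -- a plain install should do it:

$ sudo gem install --no-rdoc --no-ri mysql2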


Now that the installation is complete, let's try a nice standard Rails app to confirm that our DB and adaptor are working properly. We'll create a new Rails app and add in the necessary components to run it. Since I've just updated my development Mac to Apple's new "Mavericks" OSX 10.9 release, let's call our app Mavericks, put it on our MySQL stand-in MariaDB, and add in the web server 'thin':


$ rails new mavericks -d mysql
$ cd mavericks
$ gem install thin
  Fetching: eventmachine-1.0.3.gem (100%)
  Building native extensions.  This could take a while...
  Successfully installed eventmachine-1.0.3
  Fetching: daemons-1.1.9.gem (100%)
  Successfully installed daemons-1.1.9
  Fetching: thin-1.6.0.gem (100%)
  Building native extensions.  This could take a while...
  Successfully installed thin-1.6.0
3 gems installed


Now the final step is to create a test mavericks_development database, and of course we'll add thin to our Gemfile.
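
Since we generated the app with -d mysql, the development stanza of config/database.yml is already pointed at the mysql2 adaptor; here's a minimal sketch of what it should look like (the username, blank password and socket path are assumptions -- the socket shown is the Homebrew MariaDB default -- so adjust to match your setup):

development:
  adapter: mysql2
  database: mavericks_development
  username: root
  password:
  socket: /tmp/mysql.sock

Add gem 'thin' to the Gemfile, run bundle install, and then rake db:create builds the mavericks_development database. We'll want our first MariaDB app to do a bit, so let's give it a controller to show us some pages. Here's the command to generate our controller 'pages', and to stub in 'index' and 'about' methods: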


$ rails generate controller pages index about


We'll create some "Hello World" text in our pages_controller:

and pass that code to be displayed in our index.html.erb file:
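
A one-liner in the view is enough (again, just a sketch):

<%# app/views/pages/index.html.erb %>
<h1><%= @hello %></h1>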

Once we've added these in we can fire the Mavericks application up:

$ rails server

We'll go to the web URL of our pages' index page (http://localhost:3000/pages/index on the default setup), and here we are:

YAY! Rails is up and running, with our mySQL replacement MariaDB under the hood. Let's take a look at the Rails Environment, and we can confirm that all is well with our mysql adapter as well.
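
One quick way to double-check from the console -- a small sketch; "Mysql2" is the name the mysql2 adapter reports to ActiveRecord:

$ rails console
> Rails.env
 => "development"
> ActiveRecord::Base.connection.adapter_name
 => "Mysql2"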

Victory! It's not quite an America's Cup regatta triumph, but we've got a nice defensibly-open database under our application and we can roll forward Oraclelessly from here. But who cares, really? Oracle mySQL goes back to the future with MariaDB, but where does that take us from here?

The answer is: it's all about architecture.

As Proust said, "The real voyage of discovery consists not in seeking new landscapes, but in having new eyes." Once upon a time the IT landscape was all mainframe-based, and no one got fired for buying IBM. Then came the minicomputer and PC ages, both (as DEC's disappearance and Microsoft's present travails show us) replaced by the web. But the web isn't the leading edge of systems design anymore. With MariaDB we've reopened the data layer and in my next posts I'll explore the soul of these new machines -- web enabled and handheld, event-driven and big-data ready. The right architecture will be the DeLorean for our trip into the new millennium of computer services.

Tuesday
Aug 06 2013

Wooden Nickels

It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so. ~ Mark Twain

In theory there is no difference between theory and practice. In practice there is. ~ Yogi Berra

I love Big Data analysis -- the prospect that from mighty oaks of data tiny acorns of insight might be gathered. As much as I love it, I expect that the Gartner Hype Cycle will eventually catch up with it, and here I'm going to make a breathless prediction -- that the Big Data "Trough of Disillusionment" will soon be upon us.

I'm not predicting these analytics sloughs of despond willy-nilly - here the trough will come because of the meshing of five specific leading indicators:

  • The Early Adopters (Google with MapReduce, Progressive Insurance with insurance updates for every car, every night, etc.) have succeeded, been recognized and rewarded. The pioneering innovators are doing a victory-lap.
  • Early Majority applications (IBM's Jeopardy-winning Watson, running on $3M of hardware) are HARD
  • Everybody is in the game now, but major new wins are scarce
  • It's easy to fudge - with Big Data / MapReduce data provenance is nonexistent, and all results (even nonsensical ones) can be taken as "valid" if big-enough, complex-enough solutions produced them...
  • Lots of money has been bet on Early Majority solutions (I've been writing about them for four years now), and the bets are still out there...

With this type of dynamic, we might expect a couple of things to happen now:

  • Outcomes start to become selected before there are data sets to justify them, and
  • Managers and executives start learning how to properly question results

My posting last week Glittering Data was the start of a set of posts on how to judge Big Data results -- it talked about data sets that can be shown to have no magic numbers in them. This post is about "wooden nickels" - how to know whether to trust Big Systems or our lying eyes when our results are different from the facts. So let's get started...

In our last posting, Glittering Data, we talked about the T-Distribution calculation, and in that example we used it to show that sometimes there really isn't a pony there. We can also use it to show the opposite. Our data might contain magic numbers, just not the magic numbers that we were hoping for...

Let's take a marketing example: You've been running pilot tests of your new car, and you've been collecting data from focus groups, interviews and social media sources. Everybody loves your new car! The data back from your Marketing group shows a clear winner ... then you decide to do some drill-downs...

You start worrying with your first look into the drill-downs. According to Marketing, 70% of pilot customers loved your car - rating it an average of 4 (out of 5) on their scale. But when you dug into individual surveys the results were a bit different: the first dozen reviewers you read hated it, rating it only 2 out of 5. Can these results be real? Let's see what our test shows...

In our T-distribution calculation, we have 10,000 inputs collected with an average of 4 and a standard deviation of 1.3. The first dozen "actuals" you reviewed have a mean of 2.0 and a standard deviation of 1.3. Can such results legitimately have come at random from our larger data set?

Here's what our T-distribution shows:
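
For readers who want to check the arithmetic, here's a quick Ruby sketch of the calculation (using the t_crit formula from the Glittering Data post; the 3.29 cutoff is the standard two-tailed 99.9% value for samples this large):

# Marketing's population summary vs. the dozen hand-read surveys
n1, mean1, sd1 = 10_000, 4.0, 1.3
n2, mean2, sd2 = 12, 2.0, 1.3

t = (mean1 - mean2) /
    Math.sqrt(((n1 * sd1**2 + n2 * sd2**2) / (n1 + n2 - 2.0)) *
              ((n1 + n2) / (n1.to_f * n2)))

puts t.round(2)   # => 5.33, comfortably past the ~3.29 cutoff for 99.9% confidence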

We might be fine if our test sample differed only at the 80% confidence level. Even 90% or 95% might be easily explained away by the way we grabbed our sample. But our tests show that our sample represents different data at the 99.9% confidence level! Time to take a hard look at the numbers...

There are several explanations for such a discrepancy, such as:

  • Our sample was drawn from pre-sorted data, so our results are not randomly selected
  • Sample bias - our sample was drawn from a user set prejudicially disposed against our product
  • Various mathematical errors, in either our calculation or the data selection

But there could be other, darker causes at work.

This is where data provenance (still a novelty in Big Data analysis) will be so valuable:

  • Yes - drill-down data can differ from the general Big Data population, but
  • No - the laws of statistics still apply, and if our actuals are that different from our expectations from the greater population, then we need to take a hard look somewhere.

In data, getting Trustworthy will be even more important than getting Big was.

The great enemy of the truth is very often not the lie — deliberate, contrived, and dishonest — but the myth — persistent, persuasive and realistic. ~ John F. Kennedy

Sunday
Jul 21 2013

Glittering data

All that is gold does not glitter,
Not all those who wander are lost; ~ J.R.R. Tolkien

Even the best of hunters finds no game in an empty field ~ I Ching

One of the wonders of data analysis is the nuggets of wisdom it can offer, and part of the thrill of Big Data is the notion that bigger claims will have more nuggets. That is often true, and I've been writing since 2009 of the different kinds of nuggets that can be found in the Gold Country of modern data analysis.

We'll take on a different task today: not that of finding nuggets, but of proving nugget-less-ness. You might think "It's impossible to prove a negative," but in analytics we do have tools that we can use to show (as Gertrude Stein said of Oakland) "There is no there there."

The magic trick we'll rely on here is a T-Distribution calculation. TDCs are terrific for analyzing small sets of data for descriptive features and equivalence, and it's not clear (in the literature I've seen, anyway -- happy to be corrected by mathier mathematicians) that their usefulness is limited to small data sets. A TDC is a simple calculation in which the following six data elements:

Sample Set 1

  • Number of elements in the set
  • Mean of the set
  • Standard deviation of the set

Sample Set 2

  • Number of elements in the set
  • Mean of the set
  • Standard deviation of the set

will give us a result called critical t -- and we can test data set equivalence at different confidence levels: if

t_crit < the tabulated t value at a given confidence level

then we cannot, with the chosen confidence level, discard the null hypothesis that the data sets are equivalent. We generate t_crit by the following formula:

\[ t_{\mathrm{crit}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\cdot\dfrac{n_1 + n_2}{n_1 n_2}}} \]

and we compare our t_crit values against a table of t-distribution values for varying levels of confidence, such as PERCENTAGE POINTS OF THE T DISTRIBUTION
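
For anyone who'd like to run the numbers themselves, here is that formula as a small Ruby method -- just a sketch; you still look up the cutoff for your degrees of freedom (n1 + n2 - 2) in the table:

# Critical t from the summary statistics of two samples
def t_crit(n1, mean1, sd1, n2, mean2, sd2)
  pooled = (n1 * sd1**2 + n2 * sd2**2) / (n1 + n2 - 2.0)
  (mean1 - mean2).abs / Math.sqrt(pooled * (n1 + n2) / (n1.to_f * n2))
end

The absolute value is there only because we compare the result against two-tailed table values.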

Let's try this out with a simple example, taken from Texas Instruments' Classic Sourcebook for Programmable Calculators:

Two classes are taking Differential Equations. The first class has 12 members, with a mean of 87 and a standard deviation of 3.6. The second class has 14 members, a mean of 85 and a standard deviation of 3.25. Can we say, with 80% confidence, that there's a statistical difference in the results for the two classes? How about 95%? 99%? 99.9%?

Here's what we find:

T-Distribution Results

Here our two sets of test results, with class averages of 87 and 85 respectively, only show statistical differences at a comparatively forgiving 80% confidence interval, and show no statistical difference at the more restrictive 90, 95, 99 and 99.9% confidence intervals.

These results are interesting because modern data analytics presents us with wide varieties of data sets, and sometimes little judgment goes into the assessment of just how nuggety those data sets can be. In our example data set here, even if the second class average were to fall a full standard deviation below the first class - coming in at a paltry 83.4% - we cannot establish statistical difference at 99% confidence. With larger data sets our degrees of freedom rise and the table's critical t values fall, but even here our standard deviations will also generally fall as our populations grow.

It's a rich world out there, but watch your data and never forget the Spanish proverb: "No es oro todo lo que brilla..." (all that glitters is not gold).

Sunday
Jul 21 2013

A Gift From The Past

“We can only see a short distance ahead, but we can see plenty there that needs to be done.”
― Alan M. Turing

Way back in 2009, I wrote about Prime Minister Gordon Brown's apology to Alan Turing (About Time), with the hope that the apology would turn into a formal pardon. Alan Turing was gay, and I've read that he believed that his legacy would be not just his world-changing work at Bletchley Park but his stand for gay rights in Britain. It didn't initially turn out that way for him: he was prosecuted for indecency, chemically castrated, and died at his own hand from a cyanide-laced apple. He deserved better.

Perhaps 2009 was at least a step in the right direction. Word from Britain (where there is more news than just the Royal Baby...) is that Alan Turing may finally be on his way to a full pardon. As Lord Sharkey notes in The Guardian:

"The government know that Turing was a hero and a very great man. They acknowledge that he was cruelly treated. They must have seen the esteem in which he is held here and around the world."

Even more compelling is this quote from Lady Trumpington:

"I am certain that but for his work we would have lost the war through starvation."

This final quote reveals a personal gift, down through the years from Alan Turing to me. My fiancée Kate's great aunt Joan worked at Bletchley as well; Kate's family is English, and she herself was born in London. Without the work of Alan Turing, Joan Harvey and so many more, Britain might well have sought a separate peace with Germany in 1940. In that world it's doubtful that Kate's parents (her father was an American Army officer) would ever have met, so in my world

No Alan Turing = No Kate

As I've written before, I have benefitted from Alan Turing's work as a computer scientist. I have now also personally benefitted from his genius. Surely that's worth a full pardon...

Thursday
Dec 13 2012

Both Flesh and Not - Where little advantages add up to a lot

“The truth will set you free. But not until it is finished with you.”
― David Foster Wallace, Infinite Jest

I've been reading a lot of David Foster Wallace, starting with some of his articles and essays and leading up to the infinitely-lengthy Infinite Jest. Wallace is a remarkable writer, and his article on tennis star Roger no middle initial Federer that gives this posting its name is a wonderful description of how, even among the supremely gifted players at the top of the international tennis circuit, Federer is ever so slightly more gifted, and how the accumulation of these small gifts leads to wonderful "Federer moments" of exquisite play beyond the highest level. It's also led to 17 Grand Slam (Australian Open, French Open, Wimbledon, US Open) championships and a reasonable argument that Federer is the greatest tennis player of all time.

Both Flesh and Not describes the small advantages in speed, sense and angle that add up to enough to win specific points, but it gives no sense that such points are ubiquitous - Federer has had remarkable winning streaks, but not the kind my son calls Triple Bagels (6-0, 6-0, 6-0), and in racking up his wins, it's not like he's winning all his games by shutout (or "at love", as the tennis aficionados say). So how has he done it? His advantages are subtle. Is the accumulation of such small advantages really enough to add up to "the greatest of all time"? It is - and in this posting we'll go into a bit of why.

"A chessgame is won with the gradual accumulation of small advantages."
― Wilhelm Steinitz, World Chess Champion (1886 - 1894)

Roger Federer may not be winning his games, sets and matches at love, but with 17 Grand Slam titles he has spent a lot of time at the top of the pyramid -- far more than might seem obvious, even for a magnificent player who at his best might have won maybe 60% of the points in his matches. Surely 60% of the points = 60% of the games = 60% of the matches, doesn't it?

Not so fast. Tennis has a curious scoring system - basically "first to 4-points, but you have to win by 2." We'll assume that each point is an individual event (with no causal link to any other point), thus we can model tennis games with discrete Markov processes.

A quick word about Markov and our analysis. Markov processes (named for Russian mathematician Andrey Markov and often referred to as "Markov chains") are terrific modeling tools for systems that transition from state to state among a finite or countable number of states. For our tennis example here the "states" will be points, games, sets, and matches. Markov processes are said to be "memoryless" - the next state depends only on the current state and not on any sequence of preceding events or states. If the system that we're modeling (here Roger Federer playing tennis points/games/sets/matches) conforms to these rules, then the Markov approach can offer lots of interesting quantitative information: probability of winning, expected number of points/games/sets played and more. But first let's see what the flow of a game looks like in modeling:

So here we start at the top of the graph with a score of love-love (0-0), and you can follow the graph through the progression of points to the outcome -- either a win for Player A, or a win by Player B. This is great -- we can number the vertices of the graph and put this into a Markov Process model. First let's number the vertices, here:

And now we can apply a discrete Markov process model (courtesy of Mathematica 9). In the model we'll set Federer's point-winning percentage to 60% (.60), thus giving his opponents 40% of the points. A single-line Mathematica equation, and we have our basic model here:

Super -- as far as it goes. But what does it tell us? If Federer wins 60% of the points, doesn't he (obviously) win 60% of the games, sets and matches? Let's see what our model tells us:
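
For readers without Mathematica, the same game model is easy to check in a few lines of Ruby. This is just a sketch: it walks the point-score states of the graph above directly, and resolves deuce with the closed-form win-by-2 term p^2/(1 - 2pq):

# Probability of winning a game, given probability p of winning each point.
# States are (our points, their points); 3-3 is deuce.
def game_win(p, a = 0, b = 0)
  q = 1.0 - p
  return p * p / (1 - 2 * p * q) if a == 3 && b == 3   # win-by-2 from deuce
  return 1.0 if a == 4
  return 0.0 if b == 4
  p * game_win(p, a + 1, b) + q * game_win(p, a, b + 1)
end

puts game_win(0.60).round(3)    # => 0.736
puts game_win(0.5243).round(2)  # => 0.56, the older-Federer figure quoted near the end of the post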

Now we're on to something! So, by our model, a winning probability of 60% for each point in a tennis game gives us a game-win 73.6% of the time. And if a winning probability of 60% for each point leads to a win in 74% of the games, what does a 74% game-win-probability give us? As it turns out we can apply a discrete Markov process to games and sets, too. The process graph for a full tennis set (shown below) is a bit more complicated than for a game, but the Markov process works similarly, and here's what we get turning 60% points-wins into 74% game-wins:

Again, as before you can start at 0-0 and work your way through the different game-results to a completed set. The Markov process for a tennis set is a bit more cumbersome, but works similarly to the game Markov process and is shown below:

The key thing to note in the process is that our input "p" is no longer the 60% point-win-percentage, but rather the 73.6% game-win-percentage. If the Federer of our model wins 73.6% of his games, what percentage of his sets will he win? Let's see...

Now we're getting somewhere - winning 60% of points may not sound all that great, but a set-win percentage of almost 95% sounds more like the stuff of the greatest of all time. Moreover, we're still not quite done -- we have "game, set" calculated, so let's extend the analysis to see what we get for "Game, Set, MATCH."

Match play is pretty simple -- and here we'll use Wimbledon-style matches -- best of 5 sets, first to three wins, wins. The Markov process is also similar to the processes we've seen for games and sets, and is shown below:

Serve it up and you can see how a Roger Federer, here modeled as winning 60% of the points in his matches, can be (as he was over most of a decade) the greatest of all time:
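
The best-of-five arithmetic is simple enough to sketch directly in Ruby as well; here I'm assuming a per-set win probability s of roughly 0.95 (the figure from the step above) and treating sets as independent:

# Probability of winning a best-of-five match from a per-set win probability s
def match_win(s)
  q = 1.0 - s
  s**3 + 3 * s**3 * q + 6 * s**3 * q**2   # win in three, four, or five sets
end

puts match_win(0.95).round(4)   # => 0.9988, in line with the ~99.8% figure below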

Here 60% of the points give us (or gives Roger, in our model) wins in 99.8% of his matches. Now, of course there are some caveats here:

  • The 60% number is a very round figure, taken from the guesstimate that Federer in his prime won 70+% of points on serve, and may have won almost 50% even on return-of-service
  • Federer at 30 years of age is not the Federer of 25, but even last year he won the Wimbledon final with 151/288, or 52.43% of points won.
  • If we run 52.43% through our Markov models, we can estimate that Federer would win 56% of his games, 62% of his sets, and 72% of his matches - and even those statistics are for an older Federer, taken from a single Finals match against one of the top-4 players in the world.

David Foster Wallace writes wonderfully about "Federer Moments" in Roger Federer as Religious Experience, and the wonder shown here in our Markov models is not only Federer's magnificent play, but the incomparable consistency at that exalted level for the decade that saw him win 4 Australian Opens, 1 French Open, 7 Wimbledon Championships and 5 US Opens. The models here show, not "Federer Moments" but the "Federer Edge" -- just a little bit better than anyone in the world, point-after-point, game-over-game, match-over-match for a decade of play. We can't know what lies ahead, but we can plan for it -- the Wimbledon Men's Singles Final this year is on July 7.

"In an era of specialists, you're either a clay court specialist, a grass court specialist, or a hard court specialist...or you're Roger Federer."
― Jimmy Connors
