Monday, Jan 10, 2011

NoSQL and the Cloud: Enabling Modern Marketing

A conversation is a dialogue, not a monologue. -- Truman Capote

In the earliest days of the World Wide Web, simply presenting images and text about your products and company was mission accomplished for marketing on the web. User interest made this modest offering compelling -- enough that every company had to put up a website, and an initial level of user expectations was set.

In the second era of web marketing, your web presence began to interact with your users. Interested users could submit information to you on the web, get specific information back, and complete simple transactions, and each of these actions began to build a custom web conversation.

Every user action on the web leaves electronic footprints, and the ability to read these footprints analytically is what now enables a third era of web marketing. In the first era the website really didn't care who you were, and in the second era the site managed an increasingly-rich cookie to keep its history of you. In the third era that cookie is reverting back to a unique user:session identifier as the back-end now manages a rich store of user information. As the web evolves from Informational to Social, marketing has already passed from caring about who you are to seeking what you want, and our views on what you want are increasingly determined by who you know and what your influences are.

The new marketing era takes advantage of new techniques to capture and analyze data, and offers a rich new toolset to divine business insight from this growing data collection. Let's take a look at the What and the Where of modern web marketing.

The What - Web Data Tools and Data Stores

Every user interaction with a web site -- every link clicked, every search entered, every transaction started -- leaves a detailed record, and this data can be mined to produce better customer understanding and more-focused marketing and sales efforts. To support a conversation with your customers, your web tools need to support a rich set of services:

  • Componentization and templates for site content
  • Workflows that support rapid modification and publication of content
  • An API for programmatic access and integration
  • Intelligent meta-data associated with each content item

Componentization and templates are critical for product management in web marketing, as you will want to track user and customer conversations at a specific content or SKU level. Workflows and APIs are essential to making your web presence a living document and escaping the static-page past. A rich, powerful data store is essential to supporting content and ongoing customer conversations, and the data model for web marketing is one of the areas that has undergone the greatest transformation in arriving in the current web marketing era.

The Relational data model -- a model originally designed for 50's-era accounting data and long the backbone of Enterprise computing -- in many cases no longer fits the needs of modern marketing. For rich, schema-less customer interactions and analytics, a rigorous relational model just doesn't make sense anymore. To better fit the new generation of rich, unstructured user data, a new family of data store approaches has become prominent, characterized by the explanation that they are "Not Only SQL" -- the "NoSQL" family of data stores.

NoSQL solutions have risen to prominence in many retail and social web companies because the rigor and restrictions of a purely Relational model simply don't match their data processing needs. We can see this by looking at the data processing challenges that a company like Facebook faces:

  • 570 billion page views per month
  • 25 billion pieces of content, served by more than 30,000 servers
  • More photos than all other photo sites combined -- more than 3 billion photos uploaded every month, 1.2 million photos served per second

Facebook is clearly not a 50's-era accounting department, and its data processing needs are vastly different than the ACID rows and columns that characterized the Relational data era. Facebook has adopted a rich set of tools to meet these data challenges:

  • Memcached. Facebook runs thousands of Memcached servers with tens of TB of cached data at any point in time
  • Cassandra (now replaced by HBase). Distributed storage with no single point of failure
  • Hadoop and Hive. Used for massive data analysis and marketing analytics

These data architectures are key to Facebook's growth and scalability and they underlie the growth of other companies like Yahoo, Foursquare and Twitter as well. These companies may represent the frontier of "big data" tools, but the core technologies that underlie their growth are generally available and are finding wide experimentation and adoption in the broader business community.

NoSQL approaches make sense for problem domains outside the traditional relational world, and can give vastly better performance for certain families of uses:

  • Frequently-written, rarely read data (like web hit counters, or data from logging devices or space-probes) work well in key-value stores like Redis or Voldemort, or document-oriented databases like MongoDB (a minimal hit-counter sketch follows this list)
  • Frequently-read, seldom written or updated data (such as the Facebook statistics above) benefit from several NoSQL data approaches: Memcached for transient data caching, Cassandra or HBase for searching, and Hadoop and Hive for data analysis
  • High-availability applications which demand minimal downtime do well with clustered, redundant data stores like Riak or Cassandra
  • Data that will be sync'd across multiple locations can benefit from the replication features of a database like CouchDB, MongoDB or Tokyo Cabinet
  • Transient data (like web sessions and caches) do well in transient key-value data stores like Voldemort or Memcached
  • Big data arising from business or web analytics that may not follow any apparent schema but which will still require rich (possibly parallel) querying will do well in the family of access tools like Hadoop
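To make the first of these cases concrete, here is a minimal hit-counter sketch using the Ruby redis client (the key names and the assumption of a local Redis server are illustrative, not a prescription):

  require 'redis'    # gem install redis

  redis = Redis.new  # assumes a Redis server on localhost:6379

  # Record a hit: one fast, schema-less write per page view -- no rows, no joins.
  def record_hit(redis, path)
    redis.incr("hits:#{path}")
  end

  # Read the counter back (rarely -- say, for a nightly marketing report).
  def hit_count(redis, path)
    redis.get("hits:#{path}").to_i
  end

  record_hit(redis, "/products/widget-2000")
  puts hit_count(redis, "/products/widget-2000")

Every page view costs a single increment of a single key, which is exactly the write-heavy, read-light profile the relational model handles awkwardly.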

NoSQL tools are particularly well-adapted to the latest generation of content management systems. A good CMS is about much more than just "individualization" and serving up general web content. A state-of-the-art CMS today is focused on providing better answers to two basic questions:

  • What are consumers looking for? and matching that information with
  • What do you know about them?

New data technologies enable the best modern CMS's to capture and map consumer intent to deliver relevant online ads across multiple channels. New CMS applications and data technologies enable massive scale -- and dramatically improve customer touch and conversation.

The Where - Web Marketing in "The Cloud"

The cloud -- virtualized web hosting available in increments -- is a great fit for web marketing, in part because it originally arose out of the needs of web retail. Amazon's retail data processing has long been built around extreme peaks in demand, epitomized by "Black Friday" (the day following the US Thanksgiving holiday, a retail crush that traditionally starts the holiday shopping season). The processing power needed to meet those peak demands, coupled with virtualization technology from offerings such as Xen, allowed Amazon to bundle up processing power in virtualized chunks; thus was born the Amazon Web Services offering, which, along with similar offerings from competitors, has created the market known as "cloud computing."

With cloud computing, a web browser is all that is required to create online servers and server farms, storage, queuing, monitoring, database, backup and recovery and electronic commerce: in short, an entire virtual data center, available by-the-hour. With the cloud it's possible to spin up market tests and rich data analysis for modest hourly fees with no capital expenses.

So what can the cloud provide? It is the best imaginable platform for product trials, focus groups, general customer analytics, and "predictive marketing." Cloud systems:

  • Offer low latency and fast (sub-50ms) response times
  • Can map/identify visitors across touchpoints
  • Can support specific campaigns or marketing events with no capital costs
  • Are highly available and scalable to support massively parallel data processing
  • Offer secure access to first-party data and integration to third-party data stores
  • Support new families of analytic tools that can be applied to customer "big data"

The magic does not lie in the cloud's individual parts: practically all the components used for cloud apps run, and offer the same benefits, in a cloudless world. Still, the cloud offers unique opportunities: suppose you have a customer-data analysis application ready for pilot, but with all your customer data it'll take 100 servers to run. Here the cloud can come to your rescue -- spin up the app on 100 AWS servers over a long weekend, and your life-changing pilot will come in at under $1000. Try comparing that to the cost of renting data center space for 100 servers, buying or leasing all the computer and network hardware, setting it all up, running the test, and then tearing it all down again in a long weekend.
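To put rough numbers on that (assuming early-2011 on-demand pricing of roughly $0.085 per hour for a small Linux instance -- check current AWS rates): 100 instances x 72 hours x $0.085/hour comes to about $612, before storage and bandwidth.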

What makes the cloud magical is its flexibility -- and when combined with web standards and tools and modern development practices, it's possible to "solve" the IT equation -- specific rich solutions generated quickly and reliably, and built from standard parts that are familiar to any IT shop.

With the right plans and tools it's possible to engage in an ongoing conversation with your customers. Your website and systems track their interests, and data from their web interactions and transactions can mesh seamlessly with the data you already keep in your existing systems. With the right tools and a strong implementation, the web becomes your interpersonal channel to your customers.

Sunday, Jan 9, 2011

NoSQL Next Up: Hadoop and Cloudera


Jack Dennis is a true computer pioneer, and was already famous at MIT by the time I got there. He was famous for Multics (which we endured) and his stewardship of the MIT Model Railroad Club (which we smiled curiously at). He was not famous for the eponymous "Dennis Machine" -- a terrific model for parallel processing -- because nobody ever thought we'd (with computer time so expensive that a 5:00 AM session was "good time") be able to run anything on bunches of computers.

We can run on bunches of computers today, and the means of doing so is so routine that I'm not even going to devote quite a full blog-posting to it. For the details and a great way to get started with NoSQL processing through Hadoop you should take a scan at Phil Whelan's terrific blog entry: Map-Reduce With Ruby Using Hadoop. Phil's article is a great way to get started -- where an investment of an hour or two will get you familiar with MapReduce and the Google-y way to solve problems with piles of computers.

MapReduce itself takes some getting used to. The basic idea is to take a single function, "map" it out to process lots of data in parallel on separate servers, and then "reduce" the results so that a summary of the map function gets returned to the user. This is the approach Google has used for search from the beginning: Google can take my query request ("Accenture new CEO") and Map it over hundreds (or thousands) of servers, each of which performs the search over its own little corner of the Internet, and then Reduce the results by doing a PageRank summation of the pages returned from each mapped search. Google sorts the results, and the front page of my search results shows me the best ones.

Joel Spolsky did a nice writeup of the thinking behind MapReduce in his posting: Can Your Programming Language Do This? back in 2006. In the example in the Whelan article, we'll use a Cloudera Script called "whirr" to fire up a cluster of AWS servers with Hadoop, and we'll use that cluster to run a MapReduce job to:

...write a map-reduce job that scans all the files in a given directory, takes the words found in those files and then counts the number of times words begin with any two characters.

That's simple enough, and just the kind of innately-parallelizable task that Hadoop is perfect for. Whelan's article has another nice tidbit in it -- the use of a dynamic language to define the "map" and "reduce" tasks. The idea here is simple -- let's see how much code it takes to map the task out, and reduce the results of our word-count:

Source: Map-Reduce With Ruby Using Hadoop
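The mapper in Whelan's post boils down to just a few lines of Ruby. Here is a minimal sketch of the same idea -- an approximation written as a Hadoop streaming mapper, not his exact code -- assuming each input line holds a single word:

  #!/usr/bin/env ruby
  # map.rb -- emit "<first two characters><TAB>1" for every word read on STDIN
  STDIN.each_line do |line|
    word = line.chomp          # remove the newline
    next if word.length < 2    # ignore the shorties
    puts "#{word[0, 2]}\t1"    # two-character key, value of "1", out to stdout
  end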

So, Map is simple: for each line, just remove newline and ignore shorties, then snip the two-character key, give it a value of "1", and write it to stdout. It's a simple idea, and the nice thing about dynamic languages is that they can make the code to do so simple a task look simple as well. Let's take a look at the reduce function now:

Source: Map-Reduce With Ruby Using Hadoop
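And a matching sketch of the reducer (again an approximation, not Whelan's exact code). Hadoop streaming hands it the mapped "key<TAB>value" lines sorted by key, so it only has to notice when the key changes:

  #!/usr/bin/env ruby
  # reduce.rb -- sum the values for each key in the sorted mapper output
  prev_key = nil
  total    = 0
  STDIN.each_line do |line|
    key, value = line.chomp.split("\t")
    if key != prev_key
      puts "#{prev_key}\t#{total}" if prev_key  # emit the finished key
      prev_key = key                            # new key: reset the total
      total    = 0
    end
    total += value.to_i                         # same key: keep summing
  end
  puts "#{prev_key}\t#{total}" if prev_key      # don't forget the last key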

Simple as well: for each line reset the total every time we get a new key, otherwise sum up the values for that key. Again, the code here very transparently accomplishes the reduction, and leads us to the result (from my terminal, with the good parts highlighted):

wireless:whirr-0.1.0+23 johnrepko$ hadoop fs -cat output/part-00000 | head
11/01/05 16:11:06 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
aa      13
ab      666
ac      1491
ad      867
ae      337
af      380
ag      507
ah      46
ai      169
aj      14

...and there are our results: only 13 aardvarks and aardwolves, but plenty (1491) of words that start with "ac". Nice, clean result, and absolutely worth an hour or two to start to sense the power of Hadoop, the beautiful cleanness of the Cloudera implementation, and the power of the problems you can solve with such a massively parallel approach. To wrap up, make sure you clean up your Hadoop sessions in AWS with:

$ ec2-describe-instances
$ ec2-terminate-instances <instance-ids>

otherwise you can easily pile up AWS time.

It wasn't that long ago that the "Dennis Machine" was just a theoretical construct, and parallel-processing was nice for graduate work, but wouldn't ever solve real problems. Google brought "massively parallel" and MapReduce to the masses, and there are lots of business problems that we can now solve easily once we're comfortable with the tools.

Jack Dennis laid the tracks ... Phil Whelan's terrific blog entry: Map-Reduce With Ruby Using Hadoop shows you how... let's get this train rolling!

Thursday, Jan 6, 2011

Why NoSQL Matters Today

“But what is it good for?”
(Engineer at the Advanced Computing Systems Division of IBM, commenting on the microchip, 1968)

It wasn't that long ago that "Procedural Programming" was all the rage. The Pascal and C programming languages ruled the universities, and Modula-2 was presented as the be-all and end-all of that world. Pascal and C were (and are) fine languages, but it didn't turn out like that.

The problem was that too many interesting problem domains just didn't match Pascal-and-C syntax, and trying to wedge them into that model was just too hard. "Object-oriented" programming took hold as a more logical match for many programming domains, and the web programming that followed swung to approaches that closely match web formalisms and protocols. Pascal and C were terrific, but they reached their limits when it became clear that too much of the world just didn't work that way.

So things stand today with the data management of business and web systems. Relational data models have ruled enterprise data management for more than 20 years -- to the point where it may be hard for generations of developers to imagine that there could be any kind of data model other than rows and columns. But the model is straining under business "big data" and Internet-era solutions -- while that data may still seemingly fit into the old Payables / Receivables rows-and-columns world, we might want to take stock of what we are losing in jamming that square peg into that round hole. As Dare Obasanjo wrote:

What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you tend to notice that you’ve given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. And in the end all you have left is that your data is Durable (i.e. it is persistently stored) which isn’t much better than you get from a dumb file system.

In the era of the Internet and "big data," a rich, powerful data store still makes sense, but a model originally designed for the 1950's accounting department may not make sense anymore. To better fit this new generation of problems, a new family of data store approaches has risen to prominence, characterized by the explanation that they are "Not Only SQL" -- the "NoSQL" data stores.

NoSQL approaches make sense for problem domains outside the traditional relational world, and can give vastly better performance for certain families of uses:

  • Frequently-written, rarely read data (like web hit counters, or data from logging devices or space-probes) work well in key-value stores like Redis, or document-oriented databases like MongoDB
  • Frequently-read, seldom written or updated data (see Facebook statistics below) benefit from several NoSQL data approaches: Memcached for transient data caching, Cassandra or HBase for searching, and Hadoop and Hive for data analysis
  • High-availability applications which demand minimal downtime do well with clustered, redundant data stores like Riak or Cassandra
  • Data that will be sync'd across multiple locations can benefit from the replication features of a database like CouchDB
  • Transient data (like web sessions and caches) do well in transient key-value data stores like Memcached (a minimal session-cache sketch follows this list)
  • Big data arising from business or web analytics that may not follow any apparent schema but which will still require rich (possibly parallel) querying will do well in the family of access tools like Hadoop
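As an illustration of the transient-data case above, here is a minimal web-session sketch using the Ruby dalli client for Memcached (the server address, key names and session contents are all illustrative):

  require 'dalli'   # gem install dalli

  cache = Dalli::Client.new('localhost:11211')   # assumes a local memcached

  # Stash a session for 30 minutes; memcached expires and evicts it for us.
  cache.set("session:abc123", { "user_id" => 42, "cart" => ["sku-1001"] }, 1800)

  # Any later request reads it back with a single key lookup -- no SQL involved.
  session = cache.get("session:abc123")
  puts session["cart"].inspect if session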

A growing number of leading websites and business applications have migrated to NoSQL solutions, driven by the needs arising from their size, scale, and the unavoidable gap between the problem domains they serve and the structure of previously-existing SQL solutions. The demand for NoSQL solutions didn't arise because of problems with the SQL language, but rather because of limitations in the relational model itself. In 2000 Eric Brewer outlined the core deficiency of the relational model in a partitioned, global data world with his CAP Theorem, which states that Consistency and high Availability cannot both be maintained when a database is Partitioned across a (fallible) wide area network. The CAP Theorem opened the door to consideration of data models where Partitioning and high Availability are the requirements, and Consistency is delayed (or "eventual") to meet Availability needs in a Partitioned world. NoSQL data store solutions, which provide partitioning and high availability while settling for "eventual" consistency, have been the result.

NoSQL solutions have risen to prominence in many "social web" companies because the rigor and restrictions of a purely Relational world could never have met their data needs. We can see this from a look at the scalability challenges that a company like Facebook faces:

  • 570 billion page views per month
  • More photos than all other photo sites combined
    • More than 3 billion photos uploaded every month
    • 1.2 million photos served per second
  • 25 billion pieces of content, served by more than 30,000 servers

Facebook is clearly not a 50's-era accounting department, and its data processing needs are vastly different than anything even considered in the Relational data era. Facebook has adopted a rich set of tools to meet these data challenges:

  • Memcached. Facebook runs thousands of Memcached servers with tens of TB of cached data at any point in time
  • Cassandra (now replaced by HBase). Distributed storage with no single point of failure
  • Hadoop and Hive. Used for massive data analysis and marketing analytics

These data architectures are key to Facebook's growth and scalability and they underlie the growth of other "big data" web companies like Yahoo, Foursquare and Twitter as well. These companies may represent the frontier of big data tools, but the core technologies that underlie their growth are generally available (often open-sourced) and are finding wide experimentation and adoption in the broader business community.

Much as Procedural Programming once gave way to approaches that were a better match to new problem domains, we expect the richness and flexibility of NoSQL solutions to play a growing role in business solutions now and in the future. Whole business solution approaches, such as Predictive Analytics, are becoming widely available only because of advances in data store technology. As NoSQL solutions evolve the problems we can solve with them will become richer and more business-critical. There are already roughly a dozen major NoSQL packages under broad industry review, and as the NoSQL platform matures we are confident that NoSQL approaches will grow as an important direction in business solutions and database technology.

“First, solve the problem. Then, write the code.”
(John Johnson)

Sunday, Jan 2, 2011

A Quick Redis Key-Value Example for the Holidays


I really like Mongo, and we adopted it at my former company in large part because it gave us full NoSQL goodness without taking anyone too far out of the SQL-Active Record world we Railsians have come to know and love.

For my final little example for the holidays, I'd like to go outside the SQL box and present an example where NoSQL fits like a glove and SQL just doesn't make sense. I'll set up and use the Redis NoSQL data store, because key-value pairings are just what I need.

The example application here is a URL-shortener. URL-shorteners map a short, meaningless text string to a (sometimes frightfully) longer URL; they might have been a curiosity in a Twitterless world, but they are vital in a tweety world where 140-chars-is-all-you-get!

The good news is, we've already got our base image, and adding a new Redis data store and example app to it only took about an hour. As before, you can play with the URL-shortener at Redis URL Shortener, and you can download and play with the code for the application at Redis URL Shortener Source Code.

Here the great work was done by Christoph Petschnig, who wrote the original app and posted it to github -- my contributions are limited to updating it to Rails 3.0.3, moving it to AWS and giving it some CSS formatting.

All the same, this is a great example of where you'll want a NoSQL solution and just why such a solution matters. URL-shortening is a perfect application for key-value pair databases, because all you have are the key (the hash the app produces), and the value (the original URL), and you'll want to be able to make zillions of them, and make them lickety-split.

Redis is perfect for that. It's freely available (BSD-licensed), written in C, and generally said to be wicked-fast. Redis can do transactions, and was designed to be a disk-backed in-memory database. It may not fit every need, but it's perfect for a pure key-value application like URL-shortening.
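To show just how little a key-value store asks of you here, below is a minimal sketch of a shortener's core using the redis gem (the key scheme and the six-character token are illustrative -- this is not Christoph's actual implementation):

  require 'redis'    # gem install redis
  require 'digest'

  redis = Redis.new  # assumes a Redis server on localhost:6379

  # Shorten: derive a short token from the URL and store the pair.
  def shorten(redis, url)
    token = Digest::SHA1.hexdigest(url)[0, 6]   # illustrative 6-character token
    redis.set("url:#{token}", url)
    token
  end

  # Expand: a single key lookup -- which is all a URL-shortener really needs.
  def expand(redis, token)
    redis.get("url:#{token}")
  end

  token = shorten(redis, "http://www.example.com/some/frightfully/long/path?with=params")
  puts expand(redis, token)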

Mini_url is yet another "toy" application, but you can get it running quickly, and on the cloud. To try it out with real application loads you need only a little programming time and a willing community to try it or a pile of data to feed it.

"Hello World" matters -- if you can say it, you might get the world to answer "Hello!" back.

Happy New Year!

Monday, Aug 16, 2010

Why Our Little NoSQL App Matters

So let's sum up -- after a handful of posts and a small but still sorrowful amount of command-line and Rails code, we've managed to accomplish the following "Hello World" tasks in NoSQL on the cloud:

  1. Created a cloud account
  2. Got our first app created, and saw it in a browser on the web
  3. Loaded up real development environments (Ruby/Rails we added, Java we got for free)
  4. Added a stronger app server (thin >> webrick) and a stronger web server (nginx >> almost anything)
  5. Added our first NoSQL data store (MongoDB) and mapping software to simulate ActiveRecord in NoSQL
  6. Created a little NoSQL app to show all this, and made it visible through a dynamic DNS address:  Rails Mongo Notes Example

Just to wrap the little app up: I updated John Nunemaker's MongoMapper demo app to work with Rails 3 and the cloud, and if you like you can take a look at the code for it here: Rails Mongo Code.

The main files you'll want to look at are the Gemfile, the model file note.rb, the controller notes_controller.rb, the views (basic Rails here), and the initializer mongo_config.rb.
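For a flavor of what that model looks like, here is a minimal sketch of a MongoMapper document class (the field names are illustrative -- the real note.rb is in the repo above):

  # app/models/note.rb -- a MongoDB-backed model that behaves much like ActiveRecord
  class Note
    include MongoMapper::Document

    key :title, String
    key :body,  String
    timestamps!     # adds created_at / updated_at
  end

The controller then uses it just as it would an ActiveRecord model -- Note.create(:title => "Hello Mongo"), Note.all, and so on -- which is what makes the SQL-to-Mongo transition so painless.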



But why are these little code bits even worth looking at? There are some obvious drawbacks to the approach I've presented here:

  • Macs and Linux machines may not be common development environments in your shop
  • Amazon may be a cloud-leader, but wouldn't Azure make more sense?
  • Rails is nice, but doesn't Java rule our world?

Any of these things may well be true, but I think our curious platform is worth a look anyway, for one principal reason: the software development "rules" coming out of the Rails / cloud worlds are becoming predominant in the broader programming discipline as well. Rails and cloud developers originally arose from the Java and open-source camps, looking for the freedom and richness of web Javaland while avoiding its brittleness, such as Java's Kingdom of Nouns.

By some accounts they've achieved success -- as Tim O'Reilly has noted:



"Ruby on Rails is a breakthrough in lowering the barriers of entry to programming. Powerful web applications that formerly might have taken weeks or months to develop can be produced in a matter of days."



With the Amazon cloud we've created a base environment image-file that we can launch in any quantity, inexpensively, in seconds, with all the basic tools we'll need to write and deliver rich applications that experiment with and (hopefully) advance the emerging tools in the "big data" and "cloud" worlds of computing. There are a LOT of tools emerging in these worlds, and if we want to get familiar with them we'll need a sandbox to play with them in -- something rich, powerful, fast and cheap. With our core image we now have that, and in coming posts we'll hopefully head out to some of the frontiers of this new world of computing. We have LOTS to look at!

"Nothing is so dangerous to the progress of the human mind than to assume that our views of science are ultimate, that there are no mysteries in nature, that our triumphs are complete and that there are no new worlds to conquer." 
~ Sir Humphry Davy