Even today, the notion that a consumer can go to a website, be identified, trigger a live auction involving a dozen or more advertisers, and be served an ad in real time seems like a marvel of technology. It takes a tremendous amount of hardware and, more than ever, a tremendous amount of lightning-fast software. Driving the trend toward ever faster computing within ad technology are new NoSQL database technologies, designed specifically to read and write data in milliseconds. We talked with one of the creators of this evolving type of database software, whose technology has been quietly powering companies including BlueKai, AppNexus, and [x+1], and got his perspective on data science, what “real time” really means, and “the cloud.”
Data is growing exponentially, and becoming easier and cheaper to store and access. Does more data always equal more results for marketers?
Srini Srinivasan: Big Data is data that cannot be managed by traditional relational databases because it is unstructured or semi-structured, and the most important big data is hot data: data you can act on in real time. It’s not so much the size of the data as the rate at which it is changing. It is about the ability to adapt applications to react to the fast changes in large amounts of data that are happening constantly on the Web.
Let’s consider a consumer who is visiting a Web page, or buying something online, or viewing an ad. The data associated with each of these interactions is small. However, when these interactions are multiplied by the millions of people online at any moment, they generate a huge amount of data. AppNexus, which uses our Aerospike NoSQL database to power its real-time bidding platform, handles more than 30 billion transactions per day.
The other aspect is that real-time online consumer data has a very short half life. It is extremely valuable the moment it arrives, but as the consumer continues to move around the Web it quickly loses relevance. In short, if you can’t act on it in real-time, it’s not that useful. That is why our customers demand a database that handles reads and writes in milliseconds with sub-millisecond latency.
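The short half-life described above can be illustrated with a toy key-value store that expires entries after a time-to-live, on the theory that a stale profile event is no longer worth acting on. This is a hypothetical Python sketch to make the idea concrete; the class name, TTL value, and API are illustrative and are not Aerospike's actual interface.

```python
import time

class HotDataStore:
    """Toy key-value store illustrating the short half-life of
    real-time consumer data: entries older than the TTL are
    treated as expired and dropped on read.
    (Hypothetical sketch, not a real database client.)"""

    def __init__(self, ttl_seconds=1.0):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, write_time)

    def put(self, key, value):
        # record the value along with when it arrived
        self._data[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, written = entry
        if time.monotonic() - written > self.ttl:
            # too old to act on; evict rather than serve stale data
            del self._data[key]
            return None
        return value
```

A real system would enforce expiry inside the database itself rather than in application code, but the principle is the same: data that cannot be acted on quickly is treated as having no value.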
Let me give you a couple of examples. [x+1] uses our database to analyze thousands of attributes and return a response within 4 milliseconds. LiveRail uses our database to reliably handle 200,000 transactions per second (TPS) while making data accessible within 5 milliseconds at least 99% of the time.
This leads into the last dimension, which is predictable high performance. Because so much of consumer-driven big data loses value almost immediately, downtime is not an option. Moreover, a 5-millisecond response has to be consistent, whether a marketing platform is processing 50,000 TPS or 300,000 TPS.
What are some of the meta-trends you see that are making data management easier? Standardization around a platform such as Hadoop? The emergence of NoSQL systems? The accessibility of cloud hosting?
SS: Today, with consumers engaged more with Web applications, social media sites like Facebook, and mobile devices, marketers need to do a tremendous amount of analysis against data to make sure that they are drawing the right conclusions. They need data management platforms that can absorb terabytes of data—structured and unstructured—while enabling more flexible queries on flexible schema.
In my opinion, classical data systems have completely failed to meet these needs over the last 10 years. That is why we are seeing an explosion of new products, so-called NoSQL databases, each built around individual use cases. Going forward, I think we’ll see a consolidation as databases and other data management platforms extend their capabilities to handle multiple use cases. There will still be batch analysis platforms like Hadoop, real-time transactional systems, and some databases like Aerospike that combine the two. Additionally, there will be a role for a few special-purpose platforms, just as in the old days we had OLTP, OLAP, and special-purpose platforms like IBM IMS. However, you won’t see 10 different types of systems trying to solve different pieces of the puzzle.
The fact is we are beginning to see the creation of a whole new market to address the question, “How do you produce insights and do so at scale?”
One of the biggest challenges for marketers has been that useful data is often in silos and not shared. What are some of the new techniques and technologies making data collection and integration easier and more accessible for today’s marketer?
SS: Many of our customers are in the ad-tech space, which is generally at the front end of technology trends adopted by the broader marketing sector. We are just beginning to see a new trend among some of these customers, who are using Aerospike as a streaming database. They are eliminating the ETL (extract, transform, load) process. By removing the multi-stage processing pipeline, these companies are making big data usable faster than ever.
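The contrast with ETL can be sketched as follows: instead of landing raw events in files and running a periodic batch job to transform and load them, the transform happens inline on the write path, so aggregates are queryable the moment an event arrives. This is a minimal hypothetical Python sketch; the event shape and class name are assumptions, not any vendor's API.

```python
from collections import defaultdict

class StreamingCounter:
    """Aggregates updated inline with ingest, so there is no
    separate extract-transform-load stage between an event
    arriving and it being queryable.
    (Hypothetical sketch of the streaming pattern.)"""

    def __init__(self):
        self.impressions = defaultdict(int)

    def ingest(self, event):
        # the "transform" and "load" steps happen on the write path
        self.impressions[event["campaign_id"]] += 1

    def query(self, campaign_id):
        # counts reflect every event ingested so far, with no batch delay
        return self.impressions[campaign_id]
```

The trade-off is that the aggregation logic must be decided up front, whereas a batch ETL pipeline can reprocess raw data with new logic later.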
The ability to achieve real-time speed at Web scale is making it possible to rethink how companies approach processing their data. Traditional relational databases haven’t provided this speed at scale. However, new technology developments in clustering and SSD optimization are enabling much greater amounts of data to be stored in a cluster—and for that data to be processed in milliseconds.
This is just one new way that real-time is changing how marketers capitalize on their big data. I think we’ll continue to see other innovative new approaches that we wouldn’t have imagined just a couple years ago.
Storing lots of data and making it accessible quickly requires lots of expensive hardware and database software. The trend has been rapidly shifting from legacy models (hosted Oracle or Netezza solutions) to cloud-based hosting through Rackspace or Amazon, among others. Open source database software solutions such as Hadoop are also shifting the paradigm. Where does this end up? What are the advantages of cloud vs. hosted solutions? How should companies be thinking about storing their marketing-specific data for the next 5-10 years?
SS: A couple years ago nearly everyone was looking at the cloud. While some applications are well suited for the cloud, those built around real-time responses require bare metal performance. Fundamentally it depends on the SLA of the applications. If you need response times in the milliseconds, you can’t afford the cloud’s lack of predictable performance. The demand for efficient scalability is also driving more people back from the cloud. We’re even seeing this with implementations of Hadoop, which is used for batch processing. If a company can run a 100-server cluster locally versus having to depend on a 1,000-server cluster in the cloud, the local 100-server option will win out because efficiency and predictability matter in performance.
What are top companies doing right now to leverage disparate data sets? Are the hardware and software technology available today adequate to build global, integrated marketing “stacks?”
SS: Many of the companies we work with today have two, four, sometimes more data centers in order to get as close to their customers as possible. Ad-tech companies in particular tell us they have about 100 milliseconds—just one-tenth of a second—to receive data, analyze it, and deliver a response. Shortening the physical distance to the customer helps to minimize the time that information travels the network.
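The 100-millisecond window mentioned above is a hard deadline: a response that arrives late is simply discarded by the exchange. A minimal way to express that is to run the analysis step against a deadline and fall back to a default (such as "no bid") when the budget is exhausted. This Python sketch is a hypothetical illustration of the pattern; the function names and budget constant are assumptions.

```python
import time

BUDGET_MS = 100.0  # the roughly one-tenth-of-a-second window described above

def handle_request(analyze, payload, budget_ms=BUDGET_MS):
    """Run the analysis step and discard the result if it missed
    the deadline, since a late response is worthless to the caller.
    (Hypothetical sketch of a latency-budget check.)"""
    deadline = time.monotonic() + budget_ms / 1000.0
    result = analyze(payload)
    if time.monotonic() > deadline:
        return None  # too late; the exchange would ignore it anyway
    return result
```

In production the budget is also split across network hops, which is exactly why shortening the physical distance to the customer matters: every millisecond spent on the wire is a millisecond unavailable for analysis.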
Many of these firms take advantage of cross data center replication to include partial or full copies of their data at each location. This gives marketers more information on which to make decisions. It also addresses the demand for their systems to deliver 100% uptime. Our live link approach to replication makes it possible to copy data from one data center to another with no impact on performance and ensures high availability.
Over the last year, we’ve had customers experience a power failure at one data center due to severe weather, but with one or more data centers available to immediately pick up the workload, they were able to continue business as usual. It comes back to the earlier discussion. Data has the highest value when marketers can act on it in real-time, 100% of the time.
This interview, among many others, appears in Econsultancy’s recently published Best Practices in Data Management by Chris O’Hara. Chris is an ad technology executive, the author of Best Practices in Digital Display Media, a frequent contributor to a number of trade publications, and a blogger.