– Uri Cohen, VP of Product Management, GigaSpaces (www.gigaspaces.com), says:
These past few months we’ve heard of numerous frameworks that claim to be a real-time analytics silver bullet (some rightfully so, some less so). I wanted to share what we’ve learned here at GigaSpaces over the past few years of building large-scale event processing systems, and what it takes to implement massive-scale event processing in real time. Our XAP 9.0 product supports these scenarios, but the general principles should work with your technology of choice.
What It Takes to Implement Massive Scale Event Processing
There are a number of key attributes that are important for big data event processing systems to support:
· Partitioning and distribution: Perhaps the single most important attribute when it comes to scaling your system. If you can efficiently and evenly distribute the load of events across many event processing nodes (and increase the number of nodes as your system grows), you will ensure consistent response times and avoid accumulating a backlog of unprocessed events.
What’s needed: Seamless content-based routing, based on the properties of your events. You should also be able to add more nodes on the fly if needed.
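To make content-based routing concrete, here is a minimal sketch in Python. The `Event` class and `route` function are illustrative names, not any particular product’s API; the idea is simply that a routing property on the event (here, a stock symbol) deterministically picks a partition, so all events for the same key land on the same node.

```python
# Minimal sketch of content-based routing: events are routed to a partition
# by hashing a routing property, so load spreads evenly across nodes and
# related events always land on the same node.
from dataclasses import dataclass

@dataclass
class Event:
    symbol: str      # routing property: all events for a symbol hit one node
    price: float

def route(event: Event, num_partitions: int) -> int:
    """Pick a partition index from the event's routing property."""
    return hash(event.symbol) % num_partitions

# Fan events out across four partitions.
partitions = {i: [] for i in range(4)}
for e in [Event("AAPL", 180.0), Event("MSFT", 410.0), Event("AAPL", 181.0)]:
    partitions[route(e, 4)].append(e)

# Events for the same symbol always land in the same partition.
assert route(Event("AAPL", 1.0), 4) == route(Event("AAPL", 2.0), 4)
```

A real grid would also rebalance partitions when nodes are added on the fly; that is omitted here for brevity.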
· In-memory data: In many cases, you need to access and even update reference data to process incoming events. For example, if you’re calculating a financial index you may need to know whether a certain event represents a change in the price of a security that is part of the index, and if so you may want to update index-related information accordingly. Hitting the network and the disk for every such event is not practical at this scale, so storing everything in memory (the events themselves, the reference data, and even calculation results) makes much more sense. An in-memory data grid gives you these capabilities through memory-based indexing and querying. It lets you easily distribute data across any number of nodes, and takes care of high availability for you. The more powerful your in-memory indexing and querying capabilities are, the faster you can perform sophisticated calculations on the fly, without ever hitting a backend store or accessing the network.
What’s needed: A data grid provides high availability, sophisticated indexing and querying, and a multitude of APIs and data models to choose from.
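The core of in-memory indexing can be sketched in a few lines. This toy store (all names are hypothetical, and a real data grid adds replication, multiple indexes, and richer query languages) keeps a hash index on one property, so a query returns matching records without scanning the whole data set or touching disk:

```python
# Toy in-memory "data grid" store with a hash index on one property.
# Reads never scan the full record set and never touch disk or network.
from collections import defaultdict

class InMemoryStore:
    """Illustrative store indexed on a single field."""
    def __init__(self, index_field: str):
        self.index_field = index_field
        self._index = defaultdict(list)

    def write(self, record: dict):
        # Maintain the index on every write.
        self._index[record[self.index_field]].append(record)

    def read(self, value):
        # O(1) lookup by indexed value instead of a full scan.
        return self._index.get(value, [])

store = InMemoryStore(index_field="symbol")
store.write({"symbol": "AAPL", "price": 180.0})
store.write({"symbol": "AAPL", "price": 181.0})
store.write({"symbol": "MSFT", "price": 410.0})

assert len(store.read("AAPL")) == 2
```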
· Data and processing co-location: Once you’ve stored data in memory across multiple nodes, it’s important to achieve locality of data and processing. If you process an incoming event on one node but need to read and update data on other nodes, your processing latency and scalability will suffer. Achieving locality requires a solid routing mechanism that sends each event to the node most relevant to it, i.e. the one that contains the data needed to process it.
What’s needed: Event containers enable you to easily deploy and trigger event processors that are collocated with the data and make sure that you never have to cross the boundaries of a single partition when processing events.
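The following sketch shows the co-location idea (the `Partition` and `dispatch` names are invented for illustration, not an event-container API): data and events share the same routing key, so every read-modify-write happens entirely inside one partition and never crosses the network.

```python
# Sketch of data/processing co-location: each partition holds both its slice
# of reference data and the processor for it, so handling an event never
# crosses partition boundaries. Names are illustrative only.

class Partition:
    def __init__(self):
        self.positions = {}   # local reference data: symbol -> running total

    def on_event(self, symbol: str, qty: int) -> int:
        # The read-modify-write happens entirely inside this partition.
        self.positions[symbol] = self.positions.get(symbol, 0) + qty
        return self.positions[symbol]

NUM_PARTITIONS = 4
grid = [Partition() for _ in range(NUM_PARTITIONS)]

def dispatch(symbol: str, qty: int) -> int:
    # Same routing key for data and events guarantees locality.
    p = grid[hash(symbol) % NUM_PARTITIONS]
    return p.on_event(symbol, qty)

dispatch("AAPL", 100)
assert dispatch("AAPL", 50) == 150   # state accumulated on one partition
```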
· Fault tolerance: Event processing is in many cases a multi-step process. For example, even the simplest use case of counting words on Twitter entails at least three steps (tokenizing, filtering, and aggregating counters). When one of your nodes fails (and it will, at some point), you want your application to continue from the point at which it stopped, not repeat the entire processing flow for each “in-flight” event (or, even worse, completely lose track of some events). Replaying the entire event flow can cause numerous problems: if your event processing code is not idempotent, your processors will fail, and under heavy load replays can create a backlog of events to process, making your system much less real time and less resilient to throughput spikes.
What’s needed: Full transactionality. With a data grid in general, and event containers in particular (you can even use Spring’s declarative transaction API for that if you’d like), each partition is deployed with one synchronous backup by default, and transactions are only reported as committed once all updates reach the backup. When a primary fails, the backup takes over in a matter of seconds and continues the processing from the last committed point. In addition, a “self-healing” mechanism can automatically redeploy a failed instance on existing or even new servers.
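A minimal primary/backup sketch of this commit discipline, under the simplifying assumption of a single synchronous backup (class names are hypothetical): an update counts as committed only once the backup has it, so after failover the new primary resumes from the last committed point instead of replaying the whole flow.

```python
# Minimal primary/backup sketch: an update is "committed" only after the
# synchronous backup holds it, so a failover resumes from the last
# committed state rather than replaying the entire event flow.

class Node:
    def __init__(self):
        self.state = {}

class ReplicatedPartition:
    def __init__(self):
        self.primary = Node()
        self.backup = Node()

    def commit(self, key, value):
        self.primary.state[key] = value
        self.backup.state[key] = value   # synchronous replication
        return True                      # reported committed only now

    def fail_over(self):
        # Backup takes over as primary when the primary dies.
        self.primary = self.backup

p = ReplicatedPartition()
p.commit("processed_up_to", 41)          # last committed checkpoint
p.fail_over()
assert p.primary.state["processed_up_to"] == 41
```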
· Integration with batch-oriented big data backends: Real time event processing is just part of the picture. There are certain insights and patterns you can only discover through intensive batch processing and long-running Map/Reduce jobs. It’s important to be able to easily push data to a big data backend such as Hadoop HDFS or a NoSQL database, which, unlike relational databases, can handle massive amounts of write operations. It’s also important to be able to extract data from it when needed. For example, in an ATM fraud detection system you want to push all transaction records to the backend, but also extract calculated “patterns” for each user, so you can compare his or her transactions to the generated pattern and detect fraud in real time.
What’s needed: An open persistency interface that allows you to easily integrate with most of these systems, along with adapters for saving data to the various NoSQL data stores.
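The shape of such an open persistency interface can be sketched as a small adapter contract (the `PersistencyAdapter` name and methods here are invented for illustration, not a real product API): the grid calls the contract, and concrete adapters behind it write to HDFS, a NoSQL store, or anything else.

```python
# Sketch of an open persistency interface: the grid writes through a small
# adapter contract; concrete adapters push the data to HDFS, a NoSQL store,
# etc. The names here are hypothetical, not a real product API.
from abc import ABC, abstractmethod

class PersistencyAdapter(ABC):
    @abstractmethod
    def write_batch(self, records: list) -> None: ...

class InMemoryBackend(PersistencyAdapter):
    """Stand-in for an HDFS/NoSQL adapter, for demonstration only."""
    def __init__(self):
        self.stored = []
    def write_batch(self, records: list) -> None:
        self.stored.extend(records)

def flush(adapter: PersistencyAdapter, buffer: list) -> None:
    """Write-behind style: push buffered grid updates to the backend in bulk."""
    adapter.write_batch(buffer)
    buffer.clear()

backend = InMemoryBackend()
buf = [{"txn": 1}, {"txn": 2}]
flush(backend, buf)
assert len(backend.stored) == 2 and buf == []
```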
· Manageability and cloud readiness: Big data apps can become quite complex. You have the real time tier, the Map/Reduce / NoSQL tier, a web-based front end, and maybe other components as well. Managing all of this consistently, especially on the cloud (which otherwise makes a great foundation for big data apps), can quickly become a nightmare. You need a way to manage all these nodes, scale when needed, and recover from failures when they happen.
What’s needed: A system that leverages the benefits of the cloud to deploy and manage big data apps, whether in your own data center or on a public cloud: automatic machine provisioning for any cloud, consistent cluster- and application-aware monitoring, and automatic scaling across the entire application stack.
You can also learn how to build your own real time analytics system for big data here, and even fork the code on GitHub. There’s a lot happening in the big data world, and if you can get a handle on analytics you are well positioned to take on new challenges for the long run.