[Big] Data Management

Big data is seen as one of the most important enabling technologies for the IoT. In this chapter, we will talk about big (and not-so-big) data management for M2M and IoT applications. Because this is a complex topic and there is no one-size-fits-all solution, we will gradually construct the conversation by talking about scenarios of different complexity. But, before we get to that, we will set the stage by talking to one of the most respected figureheads of the big data movement: Mike Olson, Chief Strategy Officer and Founder of Cloudera.

Dirk Slama: Mike, Cloudera is the company behind Apache Hadoop, one of the most widely used big data frameworks. Can you talk about some of the most interesting IoT use cases that you’ve seen using your technology?

Mike Olson: Sure. You mentioned the Large Hadron Collider (LHC) at CERN as one of the case studies in your book, and how the LHC is not only capturing enormous amounts of sensor data, but is also building up an efficient, multi-tiered processing pipeline for the data from the experiments. As it turns out, at least one of the research facilities in tier 2 of this processing pipeline is using Hadoop: The University of Nebraska-Lincoln. They are using it to perform advanced physics analytics on the massive amounts of data generated by the particle collisions at CERN.

There are many other interesting cases. Take energy, for example; the smart grid is obviously an important IoT application using big data technology, and we know that Hadoop is being used by a number of those companies. A good example is Opower, who sell their services to utilities that want to capture, analyze, and use the smart grid observations streaming from their smart meters. Smart meters are now generating an observation every 6 seconds, which is vastly larger than the one reading per month that they used to collect. This means that they can establish a very fine-grained signal of demand over the course of a day. They can even determine when significant events happen, such as turning your washing machine or your refrigerator on. As such, they can observe demand in real time, and then use those observations to predict demand fluctuations, not just on the basis of smart grid activity, but also based on weather reports, forecasts, and significant upcoming events and celebrations. They can even use gamification to manage demand, for example by letting customers know what their usage is as compared to their neighbor’s usage and encouraging people to compete a little to preserve energy.

There’s a pretty broad range of use cases but, generally, all of this data is being generated by sensors and machines at machine scale. Collecting and analyzing it using older technologies is a challenge. Big, scale-out infrastructure makes the processing and analysis of this data much cheaper.

Dirk Slama: So what challenges does the IoT present that haven’t been faced by previous generations of computing and data management?

Mike Olson: My own view is that we are only seeing the very early days of IoT data flows, and already those data flows are almost overwhelming. Take the amount of information streaming from the smart grid, from taking readings once a month to 10 times a minute; that’s 150,000 times more observations we’re getting per meter per month. Those data volumes are guaranteed to accelerate. In the future, we will be collecting more data at finer grain, and from a lot more devices.

Look at the City of San Francisco: Some estimate that we already have 2 billion sensors in the city. Not only in smartphones and cars, but in many other places, such as the city’s many high-rise buildings, where they measure air pressure, temperature, vibration, etc. The most interesting thing about those sensors right now is that most of them are not connected to a network. I predict that most of them will be on a network within the next half decade; that those devices will be swapped out with network- and mesh-connected sensors. And that will bring about an absolute tsunami of data! So designing systems that can capture, process, summarize, manage, and then analyze this data is the big challenge for IT. We have never seen a flood of information like the one we are about to see.

Dirk Slama: So, are there any key advances in data management that are making the IoT a reality?

Mike Olson: We are building scale-out storage and compute platforms today that we didn’t have a decade ago. We didn’t need them then, because we weren’t collecting information on this scale, and we weren’t trying to analyze it in the way we are today. The emergence of machine-generated data has forced us to rethink how we capture, store, and process data; it’s now completely commonplace to build very large-scale, highly parallel compute farms. So that transformation has already happened. If we look to the next 5 or 10 years of advances, the state of the software should continue to improve; we will have more and better analytic algorithms; we’ll have cheaper scale-out storage architectures; we will be able to manage with less disk space, because we’ll be smarter about how we encode and replicate data.

But the thing I think is going to be most interesting are advances in hardware. The proliferation of network-connected sensors in mobile devices and in the environment in general is going to continue or may even explode. That will produce a lot of new data. Think about the Intel Atom Chip-Line and its equivalent from all other vendors. On the data capture/storage/analysis side, we will see chips that are better suited to this scale-out infrastructure. Memory densities will increase of course, and we will see solid-state drives replace disks in many applications. Networking interfaces at the chip level will become more ubiquitous and much faster. We will see optical instead of just electrical networks available widely at chip level. The relative latencies of storage – that is disk to memory – will shift; solid-state disk to RAM is going to be a much more common path in the future. The speed of optical networks will make remote storage much more accessible. So I think we will see a lot of innovation at the hardware level that will enable software to do way more with the data than was ever possible before.

Dirk Slama: What are the biggest risks for companies that want to build IoT solutions that leverage big data? How can they mitigate these risks?

Mike Olson: The technologies for generating this type of data – the scale-out proliferation of sensor networks – as well as the infrastructure to capture, process, and analyze this data, are new. Our experience has been that it is a very smart idea to start with a small-scale proof of concept. Instead of a million devices, maybe start with a thousand devices. And then build a data capture and processing infrastructure that can handle that. This should allow you to check that it works, and to educate your people and your organization about how these systems function, and what they are capable of. These are new technologies, and adoption of new technologies requires learning and new processes for successful deployment. It’s important to learn those at small scale before you go for infinite-scale IoT.

We talk to a lot of people who are fascinated by IoT technology. They are excited about big data for its own sake. Those are bad people for us to work with because they are not fundamentally driven by a business problem. It’s important when you start thinking about the IoT to think about why it matters. What questions do you want to answer with the sensor data streaming in? What are the business problems you want to solve? What are the optimizations you want to make? And then design your systems to address these problems. The “shiny object syndrome” of engineers who want to play with new technology – I totally get that, I am one of those guys, but those projects generally fail because they don’t have clear success criteria.

Dirk Slama: Are there clear indicators that tell you when to recommend using big data technology, and when not?

Mike Olson: If you have a traditional transaction processing or OLAP workload, we will point you towards Oracle, SQL Server, Terra Data, etc., because those systems evolved to work on those kinds of problems. New data volumes and new analytical workloads are where big data technology works best. When the goal is, “We want to rip out our existing infrastructure and replace it with Hadoop,” we generally walk away from those opportunities; they don’t go well. If you have new problems or business drivers, and new data volumes, those are the cases where we are most successful. A wholesale, blind desire to rip and replace never works.

Dirk Slama: Can you quantify these indicators? How big does data really have to be to require big data technology?

Mike Olson: When the industry talks about big data, they always talk about volume, variety, velocity; meaning that you can have a very large amount of data, you can have wildly different types of data that you have never been able to bring together before, or you can have data that is arriving in at a furious pace. There is one other criterion that we see which doesn’t fit under the Vs, but which is the analytic algorithm you should really be using, namely: What is the computational approach you want to take towards the data? Sometimes, if you need to do a lot of computation – such as for machine learning – based on modest amounts of data, a scale-out infrastructure like Hadoop makes sense. Satisfying any one of the volume, variety, velocity or computational complexity requirements is enough to make a big data infrastructure attractive. In our experience, if you have two or more of those requirements, then the new platform is critical.

Dirk Slama: What advice would you give to readers who are developing their strategy and implementation for big data and the IoT?

Mike Olson: My consistent advice to organizations we work with is to take a use case or two – ones that really matter, where a successful result would be meaningful for the business – and then attack those on a modest scale. This will allow you to learn the necessary skills, and will demonstrate that the technology actually solves the problem. Once you have done that, scaling big – given the infrastructure – can be done for a simple linear cost, and works great.

Dirk Slama: Who stands to gain the most from the IoT and big data? And who stands to lose the most?

Mike Olson: My heartfelt conviction is that data will transform virtually every field of human endeavor. So, in your lifetime and mine, we will find cures for cancer, because we will be able to analyze genetic and environmental data in ways that we never could before. In the production and distribution of clean water; the production and distribution of energy; in agriculture – where we will be able to grow better, denser crops that feed 9 billion instead of 7 billion people; in every endeavor, I believe that data is going to drive efficiency. If organizations fail to embrace that opportunity, they risk losing out to the competition.

There has been a lot of discussion recently about privacy and Edward Snowden and the NSA, and one concern I have is that we will have an unfair backlash because of those examples. But think about the advantages if we could – for example – monitor student behavior at a very fine grain, and design courses that cater expressly to the learning modalities of individual students. We will have smarter people learning things faster and better than ever before. Think about the quality of the healthcare we could deliver.

Privacy does and will matter; we need sensible policy and meaningful laws and penalties to enforce reasonable guarantees of privacy. The “Data Genie” is kind of out of the bottle, I don’t think we will be able to stuff it back in. Most of all, I think it would be a mistake for us to try to curtail the production and collection of data, because I believe that it is such an opportunity for good in society.

IoT Data Types and Analytics Categories

As we discussed in the Plan/Build/Run section of Part II (“Ignite | IoT Methodology”), it is important to understand the distribution of and relationships between different data types within your distributed IoT application landscape. The figure below provides an overview of some of the most common data types found in an IoT environment. Data coming from remote assets typically includes meter data, sensor data, event data and multimedia data (such as video data). Data uploaded to assets typically includes configuration data, work instructions, decision context (weather forecast, for example), and various multimedia files. In the backend, data coming from remote assets is usually managed in a dedicated IoT or M2M data repository, which includes basic asset and device data, status data, and status change history, as well as time series data (such as groupings of meter readings). It’s worth noting that, as in any good, heterogeneous enterprise application landscape, there will most likely be other applications which also contain asset-related data – some of which might be redundant. For example, the ERP system is likely to contain the Asset Master Data, including all data relating to the product configuration at sales time, as well as customer contract data. One or more ticket management systems will contain customer queries related to specific assets. Integrating this data into one, holistic view of the asset is usually a challenging task. In addition, there will be a lot of related data in other systems, including customer transaction history, customer social data, product data, etc.

Data & analytics in the context of Enterprise IoT

Data & analytics in the context of Enterprise IoT

Building efficient applications based on this data is as challenging as in any heterogeneous enterprise application landscape, and can be tackled via well-established approaches like EAM (Enterprise Architecture Management), EAI (Enterprise Application Integration), SOA (Service Oriented Architecture), and BPM (Business Process Management) [EBPM]. This is something we will look at in the next chapter.

Making sense of IoT data requires efficient analytics – this does not just apply to big data, as we will see in the following. There are different categories of analytics. Some of them simply review past events, while some more advanced analytics categories attempt to use historical data to forecast future events and developments.

Basic analytics include:

  • Descriptive analytics: “What is happening?” (Example: An engine stopped working)
  • Diagnostic analytics: “Why did it happen?” (Example: Because of a faulty vault)

Advanced, forward-looking analytics include:

  • Predictive analytics: “What is likely to happen?” (Example: When is an engine likely to stop working?)
  • Prescriptive analytics: “What should be done to prevent this from happening?” (Example: Exchange the vault before it breaks)

Not all IoT solutions will leverage all of the above data types and analytics categories. In fact, it is extremely important for a project manager to understand how they can ensure that the data management architecture and selected tools are limited to what is really needed. As we will see in the following, there are good ways to combine different technologies, so starting with a Minimum Viable Product (MVP) philosophy often makes sense. Of course, choosing the right core data management technology – for example, relational versus NoSQL – must be based on careful analysis.

Four IoT Data Management Scenarios

In order to ensure that an IoT project manager using the Ignite | IoT Methodology can more easily identify the right architecture and technologies for data management, we have defined four basic scenarios. We will use these scenarios to gradually construct our discussion of different options for IoT data management architectures and technologies.

Scenario A is relatively straightforward, assuming a moderate number of assets in the field (a couple of thousand assets), with a very small number of events coming in per asset per hour. Analytics will be limited to basic reports and descriptive analytics. We call this scenario “Basic M2M.”

Scenario B assumes a real Enterprise IoT solution with hundreds of thousands of assets in the field, and/or higher volumes of data coming in from the assets. The initial focus for this project is on analyzing the data stream from the assets and being able to react in “real time” to critical events (meaning within a second, for example, not hard real-time). We call this scenario “Enterprise IoT and CEP,” where CEP stands for Complex Event Processing.

Scenario C builds on this scenario, adding the requirement to store field data over a longer period of time and on a large scale. Basic descriptive and, potentially, diagnostics analytics will be performed on this data.

Scenario D is an extension of Scenario C, plus more advanced analytics, like predictive and prescriptive.

A summary of all four scenarios can be found in the figure below.

Overview of 4 IoT data management scenarios

Overview of 4 IoT data management scenarios

Scenario A: Basic M2M

A typical application for Scenario A would be remote equipment monitoring, such as for a fleet of mobile industrial machines. A limited number of status updates and events would be expected per machine over time. Basic reporting will show machine status overview, machine utilization, etc. The architecture will usually be relatively straightforward for the backend. The initial challenge will most likely center on integration of assets and having the required device interfaces for assets (see our discussion of Technology Profile for IoT gateways). The figure below shows the AIA for Scenario A.

Architecture for data management in basic M2M

Architecture for data management in basic M2M

In terms of technology selection, two key decisions have to be made: which data repository should be used to manage asset and field data, and which interface technology should be used to integrate the remote asset.

For the data repository there are a number of different options, including RDBMS and NoSQL. Many traditional M2M application platforms are built on RDBMS technology, and this works perfectly well in most cases. One advantage is that one can build on the rich ecosystem of related tools, from reporting to backup management. Also, most organizations will have the required skills to build and operate RDBMS-based solutions.

However, if the scenario is likely to be extended over time; if slightly more complex applications have to be developed, for example, or the analytics requirements are likely to include some more advanced time series analysis in the future, then the solution architect might be well advised to consider a NoSQL database instead.

The second technology decision relates to the integration of the asset and the management of the events and other data streaming from it. If an M2M or IoT application platform is used, it will most likely have this functionality built in. If it is a custom development, there are multiple options. For very basic applications, an application server or a basic event processing technology like NodeJS could be used. There is no lack of established messaging and related technologies that could be used alternatively or in addition. Finally, some IoT-specific messaging solutions like MQTT have started to emerge, which should offer additional benefits due to their specific nature. The figure below provides an overview:

Basic event processing technologies

Basic event processing technologies

Scenario B: Enterprise IoT and CEP

Scenario B assumes that we are dealing with a much larger number of assets in the field, or that the assets are delivering a much higher data volume to the backend – or both.

In this scenario, close monitoring of data streams will often be required, as will the ability to react quickly to certain types of situations. In many cases it will also be necessary to analyze data streams for certain patterns in real time, for a steep increase in temperature for example.

In such cases, one technology that can be very useful is Complex Event Processing (CEP). CEP allows data from multiple data streams to be tracked and analyzed in order to infer patterns that have a specific meaning on the business level – for example, a traffic light switching to red and a pedestrian crossing the road at the same time. Consequently, one interesting use case of CEP is in the area of sensor data fusion (see Technology Profile for IoT Gateways).

The overall goal of CEP is to identify the business logic of related events, and be able to respond to them in (near) real-time.

Robin Smith, Director of Product Management at Oracle has summarized the basic event patterns in CEP for us as follows:


  • Data stream is filtered for specific criteria, such as temperature > 200 F

Correlation & Aggregation

  • Scrolling, time-based window metrics, such as average heart pulse rate in the last 3 days

Pattern Matching

  • Notification of detected event patterns, such as machine events A, B, and C occurred within 15-minute window

Geospatial, Predictive Modeling and Beyond

  • Immediate recognition of geographical movement patterns, application of historical business intelligence models using data mining algorithms

The figure below shows the basic principle of CEP:

Basic principle of CEP

Basic principle of CEP

Example: Continuous Query Language (CQL)

One good example of a well-established method for analyzing event streams in CEP is the Continuous Query Language (CQL), an extension to the standard SQL language. There are many other approaches, including a number of tools that use visual modelers to support CEP. We will use CQL as our example as it will allow us to illustrate the basic principles.

Robin Smith, Director of Product Management at Oracle describes CQL as follows:

  • CQL queries support filtering, partitioning, aggregation, correlation (across streams), and pattern matching on streaming and relational data
  • CQL extends standard SQL by adding the notion of stream, operators for mapping between relations and streams, and extensions for pattern matching
  • A window operator (RANGE 1 MINUTE for example) transforms the stream into a relation

The simple example below calculates the average temperature in the last minute for all temperature sensors submitting data to the selected stream:

SELECT AVG(temperature) AS avgTemp, tempSensorId

FROM temperatureInputStream[RANGE 1 MINUTE]

GROUP BY tempSensorId

CEP and IoT

In the context of the IoT, CEP can be deployed on the asset or in the backend, or both. The figure below shows the latter. Deploying CEP on the asset (shown here covering both gateway and on-asset business logic, because the boundaries can be blurred) has the advantage that many events can be pre-filtered or combined (sensor fusion), before being used locally (assets autonomy) or being forwarded to the backend (event forwarding). Deploying CEP in the backend has the advantage that multiple data streams from different assets can be combined for analysis. Also, context data (such as weather forecasts) can be added more easily to the decision-making process.

CEP and Enterprise IoT

CEP and Enterprise IoT

Case Study: Emerson

Emerson is an American multinational corporation that manufactures products and provides engineering services for a wide range of industrial, commercial, and consumer markets.

The Emerson Trellis™ Platform is a data center infrastructure management solution, which combines inventory management, site management, and power management. The solution uses a local gateway appliance to connect to storage units, servers, power systems, and cooling equipment. CEP is used in this solution for data aggregation, filtering, and processing to support real-time decision making in a very large-scale environment. The figure below provides an overview of the solution architecture:

Emerson Trellis Platform

Emerson Trellis Platform (Source: Oracle)


Scenario C: Adding Big Data and Basic Analytics

Scenario C assumes that the solution needs to be able to perform long-term analysis of asset data on a very large, big data scale. Here, the boundaries are often not clear. Many CEP systems will be able to scale to very large data volumes. However, as we will see in the following, big data is about more than just volume.

Big Data

There are many different definitions of big data, which should generally be seen as a continuously evolving concept. Big data usually refers to very large amounts of structured, semi-structured, and unstructured data which is stored with the intention to mine it for information. It is hard to get specific numbers on how big data has to be to count as “big data,” but the general assumption is that it refers to petabytes, if not exabytes of data. Traditional, large enterprise applications such as ERP and CRM usually only reach terabyte level.

The driving forces behind big data have traditionally been large Internet companies, such as Google, Facebook, and the like. Google’s engineers, for example, elaborated the initial concept of MapReduce, which allowed data and queries to be spread across thousands of servers. Together with other companies like Yahoo, this concept was implemented as an open-source product called Hadoop, which is now supported by start-up companies such as Cloudera (see the interview at the beginning of this chapter) and Hortonworks. Google, arguably the leading-edge company in this field, has also introduced next generation technologies like its Cloud Dataflow, which is based on Flume and MillWheel.

Regardless of the technology used, big data has brought a lot of movement into the traditional RDBMS/OLAP market. One of the key changes is that big data is very well suited to manage very different types of data, from highly structured to completely unstructured.

Based on this infrastructure, for example, the emerging Data Lake concept is proposing to turn things upside down for enterprise; instead of defining a database structure first, and then populating it with data that fits into this structure, the Data Lake simply stores any and all kinds of data, and then makes this data available when it is needed, in whatever format is needed.

Doug Laney (now with Gartner) is credited with providing the first description of what many people see as the smallest common definition of big data – the three Vs: volume (dealing with large volumes of data), velocity (dealing with high-speed data streams in near real time), and variety (dealing with data in a variety of forms). Some people also add variability (or inconsistencies) and veracity (data quality) to the big data “Vs.”

Basic Analytics

Basic analysis of big data typically involves descriptive and diagnostic analytics. Descriptive statistics involve counts, sums, averages, percentages, minimum and maximum, and simple arithmetic, applied to groupings or filtered datasets.

Some people claim that 80% of business analytics are descriptive analytics, especially in social media analytics [LI1]. Examples include page views, number of posts, mentions, followers, average response time, etc.

Because of the explorative nature of many IoT projects, it can be assumed that many projects will initially also heavily rely on such basic, descriptive analytics (for things like analysis of statistics like MTBF (Mean Time Between Failure) of operational equipment).

Diagnostic analysis will also play an important role in the basic analysis of IoT data; in the event of a problem with a piece of operational equipment it will be vital to be able to identify the root cause as soon as possible in order to fix it quickly.

Adding NoSQL to the equation

In addition to unstructured big data repositories, document-oriented and NoSQL databases are also starting to play an important role in the context of the IoT. In the following, we discuss key aspects of IoT, NoSQL, and big data with Max Schireson, CEO of MongoDB at the time of the interview. MongoDB is a leading open-source database for NoSQL data management.

Dirk Slama: What makes big data for the IoT a reality today, and how does NoSQL fit into this?

Max Schireson: In the past 5 years, there have been more developments in data management than we saw in the previous 30 years. With the rise of commodity computing, open-source technologies, and the cloud, it is less expensive than ever to ingest, store, process, and analyze data.

These developments have seen the emergence of NoSQL databases to serve operational workloads, and Hadoop for analytical workloads, usually complementing existing technologies.

A lot of these concepts emerged from large web properties that hit scalability walls as their data requirements – both storage and processing – outgrew the limits of relational technology. They also discovered that the majority of their data did not fit a neat row and column format, and so couldn’t easily be managed using tables in the relational data model. Application requirements changed radically and the static relational schema held back developer agility. These same problems confront architects and developers in the IoT. The ability to scale out NoSQL databases and Hadoop clusters on fleets of low-cost commodity servers and local storage, either in their own data centers or in the cloud, enables architects to address challenges relating to the data volumes generated by billions of sensors. New data models, such as the document model used by MongoDB, enable architects to store and process not just structured, but also semi-structured, unstructured, and polymorphic data – anything from events, to time-series data, geospatial coordinates, to text and binary data.

Advances in parallel programming have enabled complex algorithms to be distributed around clusters, moving compute to data, rather than the expense and time of moving data to compute. This means we can analyze much higher volumes of data faster than ever before. These fundamental changes in how we store, process, and analyze data provide the foundation for data management in the IoT.

Dirk Slama: What role do modern databases play in the IoT, and do more traditional databases still have a role?

Max Schireson: Many of the modern NoSQL databases serve the operational part of an IoT application. Sensor data is ingested and stored by the NoSQL database, where it can be queried and evaluated, often using business rules created in application middleware, to trigger actions and to feed online reporting dashboards.

Using an example from the world of retail, a network of in-store beacons can identify the location of a customer in a store and send them push notifications. For example, a user might create a shopping list on their smartphone and share it with the store app, where it is stored in the operational database. Upon entering the store, the store app will display a map to the customer, which highlights all the products on their shopping list. Every time the customer gets close to a position where a group of products from their shopping list is located, the app will notify them and make a recommendation for a particular brand. Again, these recommendations will be stored in the operational database. At the check-out point, the system could identify all the products in the shopping cart automatically via RFID, create and confirm an invoice, and use the smartphone to process the payment. The store’s inventory system is automatically updated when the checkout process is complete.

In this scenario, the NoSQL database is storing the user’s movements around the store, and serving up recommendations along with product information in real time. A more traditional relational database could be used in conjunction with the NoSQL database to handle the billing and invoicing. So, relational databases still have a role to play in IoT applications, often serving as the “system of record” for backend enterprise systems that are integrated into new IoT apps.

Dirk Slama: What role does Hadoop play in IoT, and do more traditional EDWs still have a role?

Max Schireson: This is best illustrated using an example – building on the retail use case above, all actions taken by the customer are stored in the NoSQL database, and then loaded to Hadoop which combines the data with other sources, for example clickstreams from web logs, buying sentiment from social media, and historical customer data. All of these data points are analyzed to create deep behavioral models that can then be loaded back into the NoSQL database to serve real-time recommendations to customers when they return to the store.

Like the relational database, EDWs are limited in their capacity to scale to support new data sources, and are not efficient at handling exploratory questions which the data model hasn’t been specifically designed to answer. These are problems addressed by Hadoop, which is deployed to complement the EDW, rather than replace it.

Dirk Slama: How do you integrate these modern databases and Hadoop enterprise data hubs to manage the lifecycle of IoT data?

Max Schireson: There are many ways IoT architects can integrate NoSQL databases and Hadoop, from custom scripts to productized and supported connectors. An example of the latter is the MongoDB Connector for Hadoop which is certified for both Apache Hadoop and the leading commercial distributions. The MongoDB Connector for Hadoop allows users to integrate real-time data from MongoDB with the Hadoop platform and its tools. The connector presents MongoDB as a Hadoop data source allowing a Hadoop job (MapReduce, Hive, Pig, Impala, etc.) to read data from MongoDB directly without first copying it to HDFS, thereby eliminating the need to move TB of data between systems. Hadoop jobs can pass queries as filters, thereby avoiding the need to scan entire collections and speeding up processing; they can also take advantage of MongoDB’s indexing capabilities, including text and geospatial. The connector enables the results of Hadoop jobs to be written back out to MongoDB, including incremental updates of existing documents. In addition, MongoDB itself can be run on the same physical cluster as Hadoop, so there is very tight integration between the two platforms.

Dirk Slama: What role does stream processing play in these new IoT data architectures? How do you see established CEP concepts fitting in here?

Max Schireson: Stream processing such as Apache Spark and Apache Storm, as well as existing CEP can be run against IoT data as it is streamed from sensors, devices and assets, alerting against events and triggering automated actions. CEP is a well proven and mature technology, typically correlating data from multiple sources to find patterns and apply business rules. The data and rules are stored in the operational database. Stream processing such as Spark and Storm are much more recent developments, typically operating against a single stream of data, and again working with an operational database for persistence of data, rules, and actions. Both have a role to play in the IoT, depending on use case and developer skills.

Dirk Slama: How do you analyze IoT data to gain new insight or automate new processes? Is this done in Hadoop, in the database, in the sensor data stream?

Max Schireson: It can be done in all 3. Analytics can be run against data in flight (streams), in real-time processes controlled with a database using native data processing pipelines, and against Hadoop, which stores data ingested from multiple sources.

So, to go back to our retail example – stream analysis detects that a customer has returned from the store. The database retrieves the customer’s details and matches them to a profile of preferences and product promotions. The customer’s activity and purchases are tracked and stored in the database, and then loaded into Hadoop where the new data is used to tune the analytics models. This data is then loaded back into the database, so that when the customer returns, even more relevant offers can be served.

Building up the Big Data Infrastructure

Scenario C assumes that instead of (or in addition to) the real-time analysis of the asset data stream, we want to be able to keep this data for a longer time, in order to perform long-term pattern analysis on this “ephemeral” data. This scenario focuses on building up the required big data infrastructure in combination with stream processing, and introduces some basic algorithms that can be used to analyze this data.

Christopher Dziekan, Chief Product Officer at Pentaho explains: “Everyone is anxious to turn data (or big data) into big value and big profit, often pointing to the three Vs of big data – volume, variety and velocity. These Vs are driving an impetus for change architecturally; however, as companies build upon their big data infrastructures, more Vs come into play. A big data infrastructure is a necessary response to the first three Vs, but veracity and value also need to come into the equation.” To encompass all the Vs of big data, one must look to architectures and methodologies that enable large-scale production deployment.

To turn information into strategic advantage, a big data “orchestration” platform is needed that combines data integration with business analytics. Advanced analytic techniques are applied to the data generated by IoT devices, blending that data with both relational data and new data sources. The platform must fit into existing IT infrastructure and connect to business applications so that it is universally available at the point of impact, supporting line-of-business and operational decisions.

Examining the analytics needs of the different user communities that work with IoT systems, we can see that no single tool or repository perfectly fits all users’ needs. Some need traditional data marts and data warehouses. Others may need a “data lake” repository that can be queried on an ad-hoc basis, along with a data refinery. Others need to analyze device states, perform streaming queries, and broadcast notifications. Still others need ad-hoc data transformation and an array of special tools for their data scientists.

Given these different requirements, how do we build a data architecture to meet all needs? First of all we need a set of tools or a platform that will enable us to manage the entire architecture.

Lambda Architecture

One approach for combining some of these elements into a single engine is known as the Lambda Architecture, which was introduced by Nathan Marz. The Lambda Architecture consists of a “batch layer” that stores all of the historic data, a “speed layer” that processes data in real-time, and a “serving layer” that allows both of the other layers to be queried.

Lamda Architecture

Lambda Architecture (Source: Ricky Ho [RH1])

Since we want all the layers to perform well, the name “batch layer” is better thought of as a “data lake” – a term coined by James Dixon of Pentaho [FO1]. The speed layer should then be thought of as a “real-time layer.”

The Lambda Architecture solves some of the data architecture issues for an IoT system but it does not explicitly provide a way to store or query the state of one or all of the devices.

IoT Data Architecture

The Lambda Architecture can be used as the guiding principle to create an IoT system.

IoT and Big Data - integration perspective

IoT and big data – integration perspective (based on input from Pentaho)

From the diagram you can see that the IoT data architecture contains the Real-Time Layer, the Data Lake/Historical Layer, and the Query Layer of the Lambda Architecture. It also has the State Layer, and the Blending/Transformation/Refinery/Data Mart layer.

James Dixon of Pentaho elaborated on the importance between the State Layer and the Data Lake in his “Union of the State” concept. In a blog post on this topic, Dixon recommends capturing the initial state of an application’s data and the changes to of all of the attributes, not just the main/traditional fields. He advocates applying this approach to multiple applications, each with its own Data Lake of state logs, storing every incremental change and event. This results in capturing the state of every field of (potentially) every business application in an enterprise across time, or, the “Union of the State.”

This approach can be applied in many situations. For example, an e-commerce vendor can find out, for any specified millisecond in the past, how many shopping carts were open, what was in them, which transactions were pending, which items were being boxed or were in transit, what was being returned, who was working, how many customer support calls were queued, and how many were in progress. By its very nature, this approach supports processes like trend analysis and compliance.

To date, there are no off-the-shelf applications or tools that offer a Lambda Architecture product or the IoT extensions to it. This is something that the IoT implementation team needs to consider when designing, in order to maximize the abilities, efficiency, and effectiveness of their IoT system.

Scenario D: Adding Advanced Analytics

Once the infrastructure for big data management and the IoT has been built up, the next step is to increase the level of sophistication with which the data is analyzed. This includes predictive and prescriptive analytics. Predictive analytics uses historical data to make forecasts about future events and developments, while prescriptive analytics attempts to guide actions to achieve an optimized outcome.

Many people actually see these scenarios as extremely important in the IoT. In particular, industrial IoT scenarios will greatly benefit from use cases like predictive maintenance.

Because of the complexity of these use cases, many companies are now looking at creating new job roles, like a data scientist who combines the required business, technical, and data analytics skills required to support such user cases.

Predictive analytics includes a number of different techniques, such as modeling, machine learning, and data mining.

Machine Learning

Dr. Tapio Torikka is a Product Manager for Condition Monitoring at Bosch Rexroth. In the following, he provides an explanation of machine learning as an important element of predictive analytics, based on [ML1] and [ML2].

Machine learning is a field of Artificial Intelligence in which an algorithm constructs a model based on input data in order to make predictions. No explicitly programmed instructions are used to create the model, which enables problems to be solved where little or no domain knowledge is available. The diagram below shows the process of training and applying a machine learning algorithm. After data is acquired from a selected source it has to be preprocessed, for example by scaling the data and imputing missing values. After preprocessing, a feature extraction step extracts any significant information from the data in order to improve the subsequent training step. During the training phase a learning algorithm is used to iteratively adjust the internal parameters of the machine learning model to improve itself. Once a desired accuracy or a preset training time has been reached, the model can be saved and later used to make predictions with unknown data.

machine learning

Machine learning (Source: Bosch Rexroth)

The following types of learning can be identified:

  • Supervised Learning
    • Labeled training data is used to build a classifier model
    • Example Application: Classification of emails between “normal” and “spam”
  • Unsupervised Learning
    • Structure of the data is learned without labels
    • Example Application: Credit card fraud detection
  • Semi-Supervised Learning
    • Small amount of labeled data, larger volume of unlabeled data
    • Example Application: Image recognition. A small number of manually labeled images available, a huge pool of unlabeled images available on the Internet

Supervised learning is preferred when labeled example data (desired outputs for presented input data) is available. When labeled data is rare or unavailable, semi-supervised or unsupervised learning can be used. A simple unsupervised training process is illustrated in the figure below. The algorithm tries to adjust its parameters in such a way that the predictions given by the model (light blue circles) match the input data (dark blue circles). This kind of model can be used for Anomaly Detection, i.e. the model will predict if a given piece of input data will exhibit the same structure as the training data.

Example for unsupervised training process

Example for unsupervised training process (Source: Bosch Rexroth)

Two distinct types of model can be built during the training phase of the machine learning process:

  • Classification: Output of the machine learning model (prediction) is discrete
  • Regression: Output of the model is continuous

A classifier separates the data into n categories as presented in the training data, for example, a classifier built to categorize emails might classify them as normal or spam based on the content. A regression model can be used to predict continuous signals, for example the future values of a defined target signal.

Case Study: Machine Learning for Condition Monitoring in Industrial Applications

The following case study is provided by Dr. Tapio Torikka, Product Manager for Condition Monitoring at Bosch Rexroth.

In condition monitoring, the health state of a machine is assessed based on collected sensor data. This allows the operator to optimize the maintenance activities of their assets and reduce the risk of unplanned downtime (see figure below). Typically, these systems are operated continuously (this is known as online condition monitoring).

Machine condition monitoring

Machine condition monitoring (Source: Bosch Rexroth)

Machines in industrial environments are very complex, so assessing their health state can be quite challenging. Many machines are purpose-built for a specific application and therefore exhibit a unique behavior. Additionally, environmental effects like temperature variations, noise, vibrations, etc. can influence sensor signals. To make matters worse, industrial machinery tends to produce very large volumes of data, partly due to the need for high-frequency measurements, as a result of the highly dynamic production cycles of the machinery. These characteristics mean that monitoring the maintenance status of machinery by manually inspecting individual sensor signals or simple rules is next to impossible.

Storing large amounts of data is no longer a problem and can be solved by implementing available big data solutions. The ability of machine-learning algorithms to generate models based on training data alone addresses the harder problem of making sense of this data. Data scientists are needed to create analytics pipelines for data preprocessing and learning algorithms. Furthermore, they are responsible for defining learning goals and determining action proposals based on the results, together with domain experts. Access to fast computers is important, as machine learning algorithms typically require a lot of computing power.

Anomaly detection is a typical approach to detect abnormal behavior in the data source. In an industrial context, this approach can be used to detect a deviation from normal machine behavior indicating a faulty condition. Anomaly detection requires training data from only one reference state (normal behavior, for example).

In a real-world example, a mining operator was interested in monitoring the health of the Rexroth drive system of a conveyor belt, which is used to transport ore from the mine. This is a critical asset for the operator and is constantly in operation, as downtime translates directly into reduced revenue. Sensors were installed to collect data from the conveyor belt drive. A remote big data system was used to store the data from the system, and machine-learning algorithms were used to generate an index representing the general health state of the machine (this is an example of predictive analytics). No failure data was available from the system and therefore an anomaly detection algorithm was used to indicate the deviation from normal behavior.

system architecture

System architecture (Source: Bosch Rexroth)

A web portal was used to view the data and to monitor the system. The screenshot below is taken from the visualization screen of the portal.

Example screenshot

Example screenshot (Source: Bosch Rexroth)

An electric motor failure occurred in this drive unit. The above image shows the result of the anomaly detection algorithm (in blue) and the electric motor current (in red) over time. The electric motor current suddenly drops to zero after several months of operation without a significant change in the trend. Manual analysis of other sensor signals did not indicate any failure in the system. As can be seen in the image, the blue curve (the result of the anomaly detection algorithm) begins to deviate significantly from normal levels (90-100) well before the motor fails. This represents the capability of machine-learning algorithms to make sense of complicated data. Analyzing the data manually would have been an impossible task due to the large volume of data and the complexity of evaluating all the available sensor signals simultaneously.

Case Study: Aircraft Engine Analytics

The following case study was provided by Ted Willke, Senior Principal Engineer and GM, Graph Analytics Operation at Intel Corporation.

Big data systems are frequently used to calculate statistics from logs and other machine-generated data. This information may be used by engineers and data scientists to monitor operations, identify failures, optimize performance, and improve future designs. Aircraft engines are among the most complex machines in the world, bringing together many aspects of physics and having to operate over an extreme range of conditions [LYYJL]. They are often monitored for diagnostics to identify potential failures early, before they lead to safety concerns and interfere with flights.

Traditionally, snapshots of engine statuses and fault codes are taken during a flight and sent to a ground station for logging and monitoring. But with the availability of IoT systems that accommodate both streaming data and scale using big data technologies (Category C systems), engine data can be streamed off-board to fleet-wide ground stations, monitored continuously, and analyzed for deeper insights [DSAR]. The same stream of data can be collected on the same engine under the same flight conditions (altitude, velocity, temperature, operating time, etc.) time and time again. Simple statistics can be calculated based on this time series data to determine if the engine is operating as expected (when compared to design specifications), and to detect abnormal conditions, such as problems with the turbine pressure, exit air temperature, fuel flow rate, etc.

While Category C systems are able to detect anomalies and failures, they may not be able to isolate faults and uncover their root cause with high confidence and accuracy [LYYJL]. Furthermore, they may not be able to detect weak signals (subtle anomalies) that may develop into more apparent anomalies and serious faults over time. This is where machine learning comes in (Category D systems). Machine learning can create a model of the system under study (in this case, the aircraft engine), and then use the model to predict the behavior of the system. Machine learning is a data-centric approach. A prototype (generic form) of the model is selected and “trained” to improve predictive accuracy. A common training process involves supplying pairs of inputs (what the prediction will be based on) and known outputs (what is being predicted) so that patterns can be learned by the model. After the patterns are learned, accurate predictions can be made. In the case of engine diagnostics, the inputs are the operating characteristics of the engine and the prediction may be one of several engine fault codes (“normal” being one such code).

In one study [LYYJL], 49,900 flights’ worth of data was analyzed to determine the ability of machine learning to predict 1 of 9 engine fault conditions from 8 engine measurements. The technique was shown to be able to detect the existence of a fault early on and to attribute the fault to a specific fault code.


As we saw in this chapter, there are many different levels of sophistication when it comes to data management in the IoT. The four scenarios that we have outlined should help a product manager to decide which solution architecture and technology is right for their particular need. In the following section, we shall provide some conclusions, first from the perspective of industrial applications, and then from the technological perspective.

Industry 4.0 Perspective

Tobias Conz is a project manager in the industry team at Bosch Software Innovations. He is working at the intersection of advanced IT concepts and the harsh reality of today´s manufacturing IT environments. He describes his experience as follows:

“Data management in the area of manufacturing and Industry 4.0 is particularly challenging because many manufacturers are looking at investment cycles of 10-15 years, meaning that change doesn’t come easily in existing environments. The good thing is that many machines are actually collecting a tremendous amount of data already, which is needed for machine control. However, before Industry 4.0 there were few requirements to actually make this data available to the outside world. This means that we are looking at a lot of very basic data integration scenarios that have to deal with the usual problems like heterogeneous interfaces, data types and protocols, software in different versions, etc.

Because most machine data was not integrated at a higher level, we are finding a much higher degree of heterogeneity in this space, compared, for example, to banks and insurance companies, who had to deal with integration much earlier.

New data management concepts like data mining, stream processing, etc. are of course extremely interesting in the context of manufacturing and Industry 4.0. Given the specificities of this space, we recommend a dual strategy: Individual machine data is a treasure trove for data mining which should be used wherever applicable. However, the integration of all machines for this purpose is nearly impossible. This is why we are also developing a near-real-time stream processing solution which will allow us to validate the process parameters of multiple machines at the same time. We are doing this one machine type at a time, to see which machine types this actually works best for. The biggest challenge we are facing is to find a time- and cost-efficient method for integration.”

General Recommendations

As the CEO of MongoDB, Max Schireson was able to gain great insights into the deployment of NoSQL and big data technologies in general, as well as in the emerging IoT space. In this part of the interview, he provides some valuable recommendations for project managers:

Dirk Slama: What are the biggest risks associated with big data and the IoT? How can companies limit or avoid these risks?

Max Schireson: Because the IoT and big data are such hot topics right now, it is very easy to dive straight into the technology, rather than thinking first about the business objectives of the project. Much more important than the technology, you need to think about what it is the business is trying to achieve, and bring together the stakeholders, i.e. the business units, your customers, your partners, and the IT teams who will work on delivering the enabling technology. Maybe even competitors, as interconnectivity and integration are key in IoT ecosystems. You shouldn’t be going to stakeholders, and especially not to customers, talking about the IoT itself. You need to frame the conversation around the benefits that IoT can deliver.

It is important that the project has strategic support at the most senior levels within the organization. Also, take time to learn from the success and failures of early movers. It is important to be explorative… Remember there are very few people who can really predict what future business models will look like, in much the same way that few predicted what the most successful business models would look like in the early days of the Internet.

You need to upskill – either by taking existing staff and training them or by going to third parties outside of the company to leverage their expertise. We would recommend starting with pilot projects that have a defined use case and work on a subset of devices and data. Measure, optimize, and measure again. If the project is successful – with concepts being proven and staff being trained – then get more ambitious.

Dirk Slama: Who is likely to benefit most from the IoT? Who is liable to lose the most?

Max Schireson: There is sometimes a perception that it’s just the makers of “things” that stand to benefit. This is completely wrong. Think about insurance companies who can optimize premiums by monitoring the behavior of their customers, or retailers who can optimize the entire farm-to-fork supply chain. The Economist estimates that 75% of companies are now planning for IoT at the most senior levels of the organization [Ref2]. Those companies who use the IoT and the data/insight it generates to create new business models and that can get closer to their customers will gain the most. Those who stick their heads in the sand will cease to exist.

Dirk Slama: What advice would you give to readers who are developing their strategy and implementation for big data and the IoT?

Max Schireson: Firstly, I would go back to the discussion points in mentioned above. When the team has selected the technologies that they plan to use, then there are some general best practices: Give them ramp-up time to understand the technology (training, reading, and conferences). Engage resources from vendors and service providers to assist in development of new projects. As the development progresses, share progress and best practices internally with tech talks, hackathons, and project wikis. Then start to expand, build a CoE when multiple projects are underway so best practices are institutionalized. There is so much innovation happening in this space, and so many technology choices, you need to keep an open mind. Just selecting a data management platform because it has successfully run your ERP platform for the past 15 years is not a good idea!