Terabyte terror: It takes special databases to lasso the Internet of Things

By Lisa Vaas
June 28, 2016

If you believe figures from the technology research firm Gartner, there will be 25 billion network-connected devices by 2020. The “Internet of Things” is embedding networked sensors in everyday objects all around us, from our refrigerators to our lights to our gas meters. These sensors collect “telemetry” and route out data to… whoever’s collecting it. “Precision agriculture,” for instance, uses sensors (on kites or drones) that collect data on plant health based on an analysis of near-infrared light reflected by crops. Sensors can do things like measure soil moisture and chemistry and track micro-climate conditions over time to help farmers decide what, where, and when to plant.

Regardless of what they’re used for, IoT sensors produce a massive amount of data. This volume and variety of formats can often defy being corralled by standard relational databases. As such, a slew of nontraditional, NoSQL databases have popped up to help companies tackle that mountain of information.

That is not to say relational databases have never been used to handle sensor data. Quite the contrary: lots of companies start in, and many never leave, the comfort of this familiar, structured world. Others, like Temetra (which offers utility companies a way to collect and manage meter data), have found themselves pushed out of the world of relational database management systems (RDBMSes) because sensor data suddenly comes streaming at them like a school of piranha.

From a trickle to a torrent of IoT data

In 2002, Temetra was a small company operating out of Ireland. It employed just five people at the time, but the company was already storing data from hundreds of thousands of water meters, analyzing flow through customers’ pipes. “Having more data allows you to do more analysis on the network,” Temetra Managing Director Paul Barry said. “As you can imagine, water utilities don’t have unlimited budgets. Say I’ve got a budget of $25 million to go fix leaks. I could spend a lot of time chasing them down. It’s much better to address the least efficient parts of the network, where I get the most bang for my buck.”

To that end, Temetra’s not just working to give detailed information on each of its customers’ meters. It is also aggregating that data into actionable results, as in “show me where all my leaking meters are.” Over the course of 10 years, Temetra wound up collecting a flood of data from sensors to give customers that level of insight.

In 2002, the company was doing what everybody did back then—storing data in an RDBMS. “Everybody was just using SQL databases in 2002,” Barry said. “Google had started to break the mold, but typically, everybody wrote apps in a monolithic way, with an RDBMS on the back end.” Thus, Temetra’s meter sensor data was pouring into PostgreSQL, the venerable relational database.

For a decade, things went swimmingly—then came 2012. The company, which was selling software as a service (SaaS), entered the UK market and started dealing with water utilities that were much bigger than those in Ireland. Leading up to the expansion, Temetra had started to see what Barry called “explosive growth.” And since the volume of data was going up so significantly upon entering the UK, the company needed to look at different databases that could better store and help analyze it.

It’s not that PostgreSQL wasn’t up to the job of handling the expected spike in data volume, Barry said. The problem with the RDBMS was that the administrative burden would explode right along with the data volume. “Backups were getting very big,” Barry said. “In that master/slave type of database [a database replication scheme wherein a master database is regarded as the authoritative source and the slave databases are synchronized to it], it takes longer to handle [replication] as the data grows and grows. The more data grew, the greater was the burden.”

With only five employees, Temetra’s driver was to find a data repository that would allow it to have low administrative costs and high database reliability. The team looked at a lot of options, including the non-traditional data stores MongoDB and CouchDB/Couchbase. It came down to a choice between Basho’s Riak and Cassandra. And the main reason Temetra chose Riak was that the company got it running practically in the blink of an eye. “I had a test up and running very quickly. An hour with Riak, and I was up and storing data,” Barry said. “I was very confident it would maintain that reliability, with a low administrative burden.”

Cassandra has improved a lot since then, according to Barry. But back when he was looking for a data store that was easy on his tiny team, he found it “a little fiddly to setup and properly configure to get high availability.”

That’s not surprising, said Zach Altneu, CIO of the IoT car technology company VCARO. He told Ars that, when looking at some of the NoSQL databases out there, his company picked DataStax Enterprise Cassandra over Riak. VCARO’s decision happened in no small part because it already had staffers with Cassandra skills. As Altneu said, you’ve got to know how to implement Cassandra correctly to be successful with it, and that means knowing how to properly set up a data schema.

Barry said that when Temetra evaluated the data store, he “wasn’t 100 percent sure I had Cassandra configured properly.” Temetra started writing data to both Riak and Cassandra to run each solution through its paces. Barry tested the two using some of the standard tricks, including unplugging a node and making sure the cluster still worked as normal.

The data did not differ between Riak and Cassandra; it was more that the tools for Cassandra were a bit limited at the time. Barry couldn’t find information on the running state of the cluster, and it wasn’t easy to know if the cluster was healthy. By contrast, checking cluster health was easy in Riak. Right from the beginning, the service came with “nice tools,” Barry said. “You could run one command and know that the cluster was in a healthy state.”

For Altneu and VCARO, it was all about cost and keeping tight control over the setup, from setting up the database through all operational activities. There would be no database-as-a-service (DBaaS), where somebody like Google or Amazon lifts all that work off your shoulders (thank you very much). As for costs, Cassandra is open source, which saves a heap of dough. Using cloud hosting for lightweight Linux servers, VCARO runs a cluster for less than $1,000/month.

By contrast, Temetra didn’t want to be up to its elbows in database guts, and performance wasn’t all that critical. The company had plenty of headroom left in the PostgreSQL RDBMS. Rather, the choice of where to go with all that sensor data was about reliability above all else, Barry said.

“It sounds reasonable, but some NoSQL databases don’t give you such a strong contract for reliability,” Barry said. “They’ll trade off for performance. They’ll give you high-speed queries, but one in 1 million may fail. We can’t afford that. [With fast-flowing sensor data], we have one shot to store it and respond that we’ve successfully stored it. Once we have, it’s our responsibility to [tell our clients] that it’s been stored.”
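Barry’s “one shot to store it” contract can be sketched as a rule about acknowledgements: only report success once enough copies of a reading are actually written. The Python below is a toy model under stated assumptions, not Riak’s or Cassandra’s API. The `Replica` class, the `store_reading` function, and the `required_acks` parameter are all hypothetical names invented for illustration.

```python
class Replica:
    """In-memory stand-in for one storage node (hypothetical)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def store_reading(replicas, key, value, required_acks):
    """Return True only if at least required_acks replicas stored the value."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except ConnectionError:
            continue  # a down node must never be counted as a success
    return acks >= required_acks


cluster = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
ok_all_up = store_reading(cluster, "meter-42:2016-06-28T10:00Z", 118.4, required_acks=2)

cluster[0].up = False  # "unplug" one node, in the spirit of Barry's test
ok_one_down = store_reading(cluster, "meter-42:2016-06-28T10:15Z", 118.9, required_acks=2)

cluster[1].up = False  # with two of three nodes down, the write is reported as failed
ok_two_down = store_reading(cluster, "meter-42:2016-06-28T10:30Z", 119.1, required_acks=2)
```

With three copies and two required acknowledgements, losing one node is invisible to the client, while losing two produces an honest failure instead of a silent loss. That is the trade-off Barry describes: a slower, stricter answer in exchange for never claiming a reading was stored when it was not.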

What’s consistent about inconsistent data

Walgreens has 8,000 stores across the US, and Riptide has sensors embedded in about half of them as it monitors HVAC, lighting, irrigation control, fire and safety controls, and more. Each building system has a programmable controller that monitors or turns the sensors off and on, dims them, and sets their temperatures. Riptide is in the business of developing cloud-based building management tools, and it connects sensors on rooftop machines in commercial buildings that house retailers both big and small (including AT&T, Verizon, and Ulta). “Put your facilities on autopilot,” its site promises.

In those 4,000 Walgreens, there are something like 1 million sensors. Riptide CEO Mike Franco said that the data coming out of them runs the gamut. Some of the sensors sample data once a minute, some every 5 minutes, and others sample at 15-minute intervals. Some change values. There are no consistent time stamps, no consistent time zones, and no consistent names of data points. “It’s very, very unstructured,” Franco said. What is consistent, then? It’s all time series data, and a lot of it. Over the past three years, Walgreens’ sensor data output has added up to about 5 terabytes.
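One common way to tame mismatched time zones and sampling intervals is to convert every reading to UTC and floor it into a shared time bucket before comparing anything. The sketch below is an illustration of that idea only, not Riptide’s pipeline. The sensor names, UTC offsets, readings, and the 15-minute bucket size are assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc_bucket(local_time, utc_offset_hours, bucket_minutes=15):
    """Convert a naive local timestamp to UTC and floor it to a fixed bucket."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    utc = local_time.replace(tzinfo=tz).astimezone(timezone.utc)
    floored_minute = utc.minute - utc.minute % bucket_minutes
    return utc.replace(minute=floored_minute, second=0, microsecond=0)


# Hypothetical readings: (sensor, naive local time, UTC offset in hours, value)
readings = [
    ("store-1/rtu-1", datetime(2016, 6, 28, 9, 1), -5, 71.8),  # samples every minute
    ("store-1/rtu-1", datetime(2016, 6, 28, 9, 2), -5, 72.2),
    ("store-2/rtu-7", datetime(2016, 6, 28, 7, 5), -7, 73.0),  # samples every 5 minutes
]

# Group values by (sensor, UTC bucket) so different sample rates line up.
buckets = {}
for sensor, local_time, offset, value in readings:
    key = (sensor, to_utc_bucket(local_time, offset))
    buckets.setdefault(key, []).append(value)

averages = {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```

Once every reading carries a UTC bucket, a one-minute sensor in one time zone and a five-minute sensor in another land in the same 14:00 UTC slot and can be compared directly.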

Riptide knew it needed a data storage system and architecture that could scale without degrading. The single core reason it went with NoSQL was the distributed nature of Cassandra and DataStax, Franco said. “We had plenty of use cases where the data got big, hard drives filled up, so we just added more nodes, and we upgraded the hardware. Just the whole horizontal elasticity of the NoSQL database systems we’re using is the No. 1 driver for us.”

After checking out PostgreSQL, MySQL, Oracle, MongoDB, and more, Riptide went with Cassandra because of the community support and scalability. Its team is using Storm for real-time analytics and Solr for search. With this type of real-time analytics, they’ve set it up so that they have a map that shows a big bubble for every single site of another client—Nordstrom.

Nordstrom is very particular about air conditioning. It has a team dedicated to ensuring that the temperature in every store is always at 72 degrees Fahrenheit with proper humidity and comfortable CO2 levels. “That’s critical for the Nordstrom shopping experience,” Franco said. Nordstrom monitors HVAC systems constantly, and one of its crucial performance indicators is Critical Box Count—as in the number of air conditioner units working too hard. If that number reaches a critical threshold, the map Riptide set up starts blinking red, and Riptide dispatches someone to get into the building control system and make changes right away.
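The logic behind an indicator like Critical Box Count is simple to sketch: count the units that are working too hard, then raise an alert when the count crosses a threshold. The version below is a guess at the shape of such a rule, not Riptide’s or Nordstrom’s implementation. The load figures, the 0.9 load limit, and the alert threshold of 3 are hypothetical values chosen for illustration.

```python
def critical_box_count(unit_loads, load_limit):
    """Count units whose load fraction is above load_limit."""
    return sum(1 for load in unit_loads.values() if load > load_limit)


def site_status(unit_loads, load_limit=0.9, alert_threshold=3):
    """Return "ALERT" once enough units are overworked, otherwise "OK"."""
    count = critical_box_count(unit_loads, load_limit)
    return "ALERT" if count >= alert_threshold else "OK"


# Hypothetical rooftop units and the fraction of capacity each is running at.
busy_store = {"rtu-1": 0.95, "rtu-2": 0.97, "rtu-3": 0.40, "rtu-4": 0.93}
calm_store = {"rtu-1": 0.55, "rtu-2": 0.92, "rtu-3": 0.40}

busy_status = site_status(busy_store)
calm_status = site_status(calm_store)
```

Here `busy_store` has three units above the limit and would turn its bubble red, while `calm_store` has only one and stays quiet.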

That 72-degree shopping experience comes to you courtesy of Storm real-time analytics, Solr for managing search, and Spark for materialized views that make data available very, very quickly. This is what Riptide likes about DataStax Enterprise Cassandra: all those tools are integrated.

Previously, Nordstrom was pulling CSV files out of Tableau. It would pull those CSV files in, load them into business intelligence or spreadsheet software, and from there the data would be sliced and diced. It wasn’t pretty, and it wasn’t fast. “By the time you could react to something, it’s a week later,” Franco said.

Where sensor data gets a bit wonky

None of this is meant to imply that handling IoT sensor data with non-traditional data stores is a bed of roses. Time series sensor data in particular brings up many issues. One of them is aggregation: if you want to total up data from 1,000 sensors, you could get 30,000 data points per year per device. Do the math, and you’ll find that gathering up all those data points is a lot of work.
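Doing that math explicitly, using only the two figures quoted above (the variable names are just labels for them):

```python
devices = 1_000
points_per_device_per_year = 30_000

# Total readings an aggregate over the whole fleet has to touch each year.
total_points_per_year = devices * points_per_device_per_year

# 30,000 points a year is roughly one reading every 17.5 minutes per device.
minutes_per_year = 365 * 24 * 60
minutes_between_readings = minutes_per_year / points_per_device_per_year
```

That comes to 30 million data points a year for a modest 1,000-sensor deployment, before anyone asks for a second year of history.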

Ideally, whatever analytics you want to run on that data, you want all that data to be in the right place. You don’t want to have to go running all over the cluster, pulling data out of different nooks and crannies. With time series data, according to Temetra’s Barry, keeping the locality within the cluster means that all data for one device is loaded in one place on the cluster.
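The locality Barry describes usually comes from choosing where data lives based on the device identifier alone, so that every reading from one device hashes to the same place. The sketch below shows the idea with a plain modulo hash, which is a simplification. Real distributed stores use more elaborate placement schemes, and the node names, the `node_for` helper, and the readings here are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(device_id, nodes=NODES):
    """Choose a node from a stable hash of the device id alone."""
    digest = hashlib.sha256(device_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Hypothetical readings: (device id, timestamp, litres per minute)
readings = [
    ("meter-17", "2016-06-28T10:00Z", 4.2),
    ("meter-99", "2016-06-28T10:00Z", 0.0),
    ("meter-17", "2016-06-28T10:15Z", 4.4),
    ("meter-17", "2016-06-28T10:30Z", 9.8),
]

# The timestamp plays no part in placement, so a device never straddles nodes.
placement = {}
for device_id, timestamp, value in readings:
    placement.setdefault(node_for(device_id), []).append((device_id, timestamp, value))

# Every reading for meter-17 sits on one node, so its total is a local scan.
home = node_for("meter-17")
meter_17_total = sum(v for d, _, v in placement[home] if d == "meter-17")
```

Because the timestamp is deliberately left out of the hash, a query for one meter’s history goes to a single node instead of fanning out across the cluster.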

Learning how to get it right didn’t happen overnight, and it’s always changing. Temetra’s on the third iteration of how it stores sensor data. The company has learned that it requires a mind switch. It takes a while to figure out how to switch from dealing with normalized data to dealing with denormalized data.

“We change our mind sometimes,” Barry said. “[For example,] ‘we can’t do aggregation over there because of how it’s structured.’ Some of it isn’t driven by our mistakes, but by new technologies: ‘Now we have to be able to accommodate this.’ It’s outside our control. Or there will be a new way to collect data from meters, or a new format, or a new data point we hadn’t thought of has to be introduced to the system, and so we think it’s time to migrate to accommodate it in a better way.”

Another challenge with the IoT that you don’t hear about much is on the other side of the wall at the equipment level. When you climb up to the rooftops where Riptide’s sensors are sitting on top of building system equipment, you’ll find a Tower of Babel. There are no naming conventions for all that data they’re collecting. The norm is that there is no norm. Franco said that Daikin, the world’s largest HVAC company, is actually doing data janitorial service, working to normalize that data. Daikin does this by gathering all the data from its own and others’ equipment and coming up with a normalized naming convention; that way the company can control all that disparate equipment.
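At its simplest, that kind of data janitorial work is a lookup from whatever each vendor calls a data point to one agreed name. The sketch below is an assumption about the general approach, not Daikin’s actual convention. Every raw and canonical name in it is invented for illustration.

```python
# Hypothetical mapping from vendor spellings to one canonical point name.
CANONICAL_NAMES = {
    "zonetemp": "zone_temperature",
    "zn-t": "zone_temperature",
    "spacetemp": "zone_temperature",
    "sat": "supply_air_temperature",
    "supplyairtemp": "supply_air_temperature",
}


def normalize_point_name(raw_name):
    """Map a raw point name to its canonical name, or None if unknown."""
    key = raw_name.strip().lower().replace(" ", "").replace("_", "")
    return CANONICAL_NAMES.get(key)  # None means a human still has to map it


raw_names = ["Zone Temp", "ZN-T", "Supply_Air_Temp", "DAT"]
normalized = {name: normalize_point_name(name) for name in raw_names}
unmapped = [name for name, canonical in normalized.items() if canonical is None]
```

Returning `None` for an unknown name, rather than guessing, keeps the garbage visible: anything in `unmapped` is a point that still needs someone to decide what it means.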

That’s actually yet another use case for NoSQL databases, Franco said. The database application itself will expose a problem like that, and it allows someone like a Daikin to come in and fix the problem in order to make management at scale possible. “Garbage in, garbage out,” Franco said. “It’s more true when you’re at scale.”

Obviously, sensor data is big data. But that doesn’t always equate to some highfalutin’, big-bucks data-crunching service. Sure, Teradata is big, but Temetra isn’t exactly the Amazon of data storage. The truth is, more and more of us are swimming in sensor data, be it homeowners with smart thermostats, big utilities figuring out where they’re losing water and money in water delivery infrastructure, or small reservoirs trying to get by with sensor data so they can pull their fingers out of the dike.

People do it old-school with relational databases. People do it by building time-series databases on top of NoSQL databases such as HBase and Cassandra (such options include OpenTSDB and KairosDB). Others still go with InfluxDB, which Baron Schwartz describes as being natively time-series. That is, InfluxDB has a time-series query language that looks a lot like SQL and therefore makes a lot of SQL-conversant people a lot of happy.

As Franco said, the benefits of figuring out how to corral all this sensor data apply equally to small, medium, and large organizations. The more you can herd those sensors, the less climbing on top of rooftops you’ll be doing. And the better we get at drinking in data from things like the IoT’s broken water pumps, the more rapidly we’ll be able to fix the IoT’s broken things.
