
November 23, 2012

The Week in Big Data Research

Datanami Staff

This week we bring you tales from the front lines of big data research and development, including novel approaches to handling energy grid and spatio-temporal data with Hadoop and MapReduce, as well as news from researchers who are developing new frameworks to tackle graph partitioning and provenance.

We’ll dive in first with a new approach to handling complex data from GPS-enabled systems via CloST…

Using Hadoop to Tackle Big Spatio-Temporal Data

During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which can be referred to as big spatio-temporal data.

A team of researchers from the Hong Kong University of Science and Technology has attempted to address this enormous amount of complex data with the design and implementation of what they call CloST, a scalable storage system for big spatio-temporal data that leverages Hadoop to support data analytics.

The main objective of CloST is to avoid scanning an entire dataset when a spatio-temporal range is given. It does this via a data model that gives special treatment to three core attributes: an object ID, a location, and a time.

Based on this data model, CloST hierarchically partitions data using all three core attributes, which enables efficient parallel processing of spatio-temporal range scans. Guided by the data's characteristics, the team devised a compact storage structure that reduces storage size by an order of magnitude. To cap it all off, they added scalable bulk-loading algorithms capable of incrementally adding new data to the system.
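The general idea behind such hierarchical partitioning can be illustrated with a small sketch. This is not the authors' implementation: the bucket granularities, record layout, and function names below are all invented for illustration. The point is that records bucketed by (object ID, space cell, time slot) let a range query touch only the overlapping buckets instead of scanning everything.

```python
from collections import defaultdict

# Illustrative bucket granularities (not from the CloST paper).
CELL_DEG = 0.5    # spatial cell size, in degrees
SLOT_SECS = 3600  # temporal slot size, in seconds

def partition_key(obj_id, lat, lon, ts):
    """Hierarchical key: object ID, then coarse space cell, then time slot."""
    cell = (int(lat // CELL_DEG), int(lon // CELL_DEG))
    return (obj_id, cell, ts // SLOT_SECS)

def load(records):
    """Bucket records (obj_id, lat, lon, ts) by their partition key."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[partition_key(*rec)].append(rec)
    return buckets

def range_scan(buckets, obj_id, lat_rng, lon_rng, t_rng):
    """Read only the buckets that overlap the query range, then filter."""
    out = []
    for clat in range(int(lat_rng[0] // CELL_DEG), int(lat_rng[1] // CELL_DEG) + 1):
        for clon in range(int(lon_rng[0] // CELL_DEG), int(lon_rng[1] // CELL_DEG) + 1):
            for slot in range(t_rng[0] // SLOT_SECS, t_rng[1] // SLOT_SECS + 1):
                for rec in buckets.get((obj_id, (clat, clon), slot), []):
                    _, lat, lon, ts = rec
                    if (lat_rng[0] <= lat <= lat_rng[1]
                            and lon_rng[0] <= lon <= lon_rng[1]
                            and t_rng[0] <= ts <= t_rng[1]):
                        out.append(rec)
    return out
```

Because the query enumerates candidate buckets directly from the range bounds, data outside the requested object, region, and window is never read.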

The team put their research and the CloST framework into action using a very large GPS log dataset, and the results show that CloST offers fast data loading, desirable scalability in query processing, and a high data compression ratio.



A Hadoop Framework to Tackle the Grid

A team from the University of Maine has been working to design a Hadoop-based framework to analyze large synchrophasor datasets, which are common in power and smart grid implementations.

The team notes that the power sector is increasingly utilizing GPS-stamped real-time measurements from Phasor Measurement Units (PMUs) to improve the reliability and efficiency of power grids. PMUs directly measure phase angles in real time, which allows operators to perform grid optimization that was not possible in the past.

They explain that by early 2010, almost 250 PMUs had been deployed across North America, and that number continues to grow remarkably. One of the major challenges, however, is the complexity of analyzing such large real-time datasets. Phasor data from PMUs will accumulate into petabytes in the coming years, which exceeds the capability of conventional relational database technologies.

This requires a new software and architecture framework to process such a large amount of data in real time, reliably and cost-effectively. The team thus proposes a new Hadoop-based framework to perform distributed, parallel analytics on large synchrophasor datasets. To highlight their work, they demonstrate various applications of MapReduce to analyze patterns of load distribution using parallel node calculations, which can later be scaled up to match the requirements of the power utility sector, creating a pilot study on big data analytics for smart grids.
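The MapReduce pattern applied to load distribution can be sketched in miniature. The following single-process toy is not the authors' code: the record layout, field names, and aggregates are assumptions. It shows the shape of the computation, where mappers emit per-bus load readings and reducers aggregate them after a shuffle, which is what Hadoop would distribute across nodes.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean

def map_phase(record):
    """record = (pmu_id, timestamp, bus_id, load_mw); emit (key, value) pairs."""
    pmu_id, ts, bus_id, load_mw = record
    yield bus_id, load_mw

def reduce_phase(bus_id, loads):
    """Aggregate the load distribution for one bus."""
    loads = list(loads)
    return bus_id, {"mean_mw": mean(loads), "peak_mw": max(loads)}

def run_job(records):
    # Shuffle/sort step: group mapper output by key, as Hadoop would
    # between its distributed map and reduce phases.
    pairs = sorted(kv for rec in records for kv in map_phase(rec))
    return dict(reduce_phase(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))
```

In a real deployment the mapper and reducer would run as Hadoop tasks over HDFS splits of the phasor logs; only the per-record and per-key logic above would carry over.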



FENNEL Seeds Graph Partitioning Approaches

A team from Microsoft Research and Carnegie Mellon University has tackled key problems with streaming graph partitioning for massive-scale graphs via a framework they’ve proposed, called FENNEL.

They note that graph partitioning is at the heart of several computational challenges, particularly when it comes to querying large-scale graph data. These include tasks like computing node centralities, which require iterative computations, as well as more real-world uses like building recommendation systems for web users.

The team has thus introduced FENNEL as an underlying framework for graph partitioning that they claim enables the well-principled design of scalable, streaming graph partitioning algorithms amenable to distributed implementation. They also describe their one-pass streaming graph partitioning algorithm and show that it can yield benefits over previous approaches, using examples drawn from a large set of both real-world and generated graphs.
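The flavor of such a one-pass heuristic can be conveyed with a simplified sketch. This is a loose reading of the FENNEL idea rather than the paper's algorithm: each arriving vertex is greedily assigned to the partition that holds the most of its already-placed neighbors, minus a penalty that grows with partition size to keep partitions balanced. The parameter values here are illustrative.

```python
def fennel_stream(stream, k, alpha=0.5, gamma=1.5):
    """One-pass greedy partitioner sketch.

    stream yields (vertex, neighbors) pairs; k is the number of partitions.
    Returns {vertex: partition_index}. alpha and gamma control the
    size penalty alpha * gamma * size ** (gamma - 1); tuning is illustrative.
    """
    assign = {}
    sizes = [0] * k
    for v, nbrs in stream:
        # Score each partition: neighbors already placed there,
        # discounted by how full the partition is.
        placed = [sum(1 for u in nbrs if assign.get(u) == p) for p in range(k)]
        best = max(range(k),
                   key=lambda p: placed[p] - alpha * gamma * sizes[p] ** (gamma - 1))
        assign[v] = best
        sizes[best] += 1
    return assign
```

Because each vertex is examined exactly once as it streams past, the method avoids the repeated global passes that offline partitioners such as METIS require, which is what makes it attractive for massive graphs.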

In the end, the team conducted a number of benchmarks on their sample graphs and found some interesting performance comparisons between FENNEL and the standard METIS partitioner on the same problems.



Divine Provenance with Semantics

A French team from Novapost R&D and TELECOM SudParis has addressed the issue of provenance, a key piece of metadata for assessing the trustworthiness of electronic documents since it indicates the reliability and quality of a document's content.

The team says that most applications exchanging and processing documents on the web or in the cloud are becoming provenance-aware, yet they produce heterogeneous, decentralized, and non-interoperable provenance data. Most provenance management systems are dedicated either to a specific application (workflow, database) or to a specific data type, and were not conceived to support provenance over distributed, heterogeneous sources. This means end-users are faced with different provenance models and different query languages. For these reasons, modeling, collecting, and querying provenance across heterogeneous distributed sources is still considered a challenging task.

To counter these challenges, they present a new provenance management system (PMS) based on semantic web technologies. It allows users to import provenance sources and enrich them semantically to obtain a high-level representation of provenance. It also supports semantic correlation between different provenance sources and offers a high-level semantic query language. In the context of cloud infrastructure, where most applications will be deployed in the near future, scalability is a major issue for provenance management systems.

They provide details about the implementation of their PMS, which is based on a NoSQL database management system coupled with the MapReduce parallel model, and show that it scales linearly with the size of the processed logs.
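The semantic-enrichment idea can be sketched concretely. The log formats, predicate names, and functions below are invented for illustration, not taken from the team's system: heterogeneous provenance logs are lifted into a common (subject, predicate, object) triple form, after which a lineage query works uniformly across sources.

```python
def lift_workflow_log(entry):
    """Lift a workflow-style log entry, e.g. {"out": ..., "in": ..., "step": ...}."""
    return [(entry["out"], "wasDerivedFrom", entry["in"]),
            (entry["out"], "generatedBy", entry["step"])]

def lift_db_log(entry):
    """Lift a database-style log tuple, e.g. (dst_doc, operation, src_doc)."""
    dst, op, src = entry
    return [(dst, "wasDerivedFrom", src), (dst, "generatedBy", op.lower())]

def lineage(triples, doc):
    """Follow wasDerivedFrom edges back to a document's original source."""
    deriv = {s: o for s, p, o in triples if p == "wasDerivedFrom"}
    chain = [doc]
    while chain[-1] in deriv:
        chain.append(deriv[chain[-1]])
    return chain
```

In the team's setting the triples would live in RDF and be queried with a semantic query language over a NoSQL store; the sketch only shows why a shared representation makes cross-source lineage queries possible at all.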



Scaling Big Visual Exploration

Ioannis Leftheriotis, of the Ionian University in Corfu, Greece, and the Norwegian University of Science and Technology, recently set about exploring scalable interaction designs for collaborative visual exploration of large datasets.

In this research, Leftheriotis argues that novel input devices such as tangibles, smartphones, and multi-touch surfaces have given impetus to new interaction techniques that can be exploited in collaborative environments via large-scale visualization.

The main motivation of this PhD research is to study novel interaction techniques and designs that augment collaboration in a collocated environment. The goal is to take advantage of scalable interaction design techniques and tools that can be applied across a variety of devices, helping users work together on a problem involving an abstract big data set, using visualizations in a collocated context.
