
People to Watch 2016

P. Taylor Goetz
Vice President of Apache Storm
The Apache Software Foundation

P. Taylor Goetz is the Vice President of Apache Storm at The Apache Software Foundation, a position he has held since 2014. As a Storm committer, Goetz is influential in determining the future of the project, which is at the forefront of real-time streaming analytics. Goetz is also a member of the technical staff at Hortonworks; he previously worked at Accenture as a Technical Architect and at Health Market Science as a Software Architect. He was also a firefighter and served as President of the Kimberton Fire Company for seven years. Goetz received his B.A. from the University of Colorado at Boulder.

Datanami: Congratulations on being selected as a Datanami 2016 Person to Watch. Apache Storm has emerged as a standard platform for real-time data processing. How do you see it evolving to keep up with the demanding requirements that organizations are placing on it?

P. Taylor Goetz: Performance and scalability are always critical for a real-time data processing framework. One of the features that separates Storm from alternatives is the ability to balance low latency and high throughput to meet a specific SLA. The upcoming Apache Storm 1.0 release brings serious gains in both throughput and latency, along with many other improvements. Performance will always be one of our main focal points.
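For a flavor of what that balancing act looks like in practice, here is a minimal sketch of a Storm 1.0 topology in Java. The EventSpout and NoopBolt components are hypothetical stand-ins; the knob of interest is max spout pending, which caps how many un-acked tuples a spout may have in flight at once:

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SlaTunedTopology {

    // Hypothetical spout emitting a constant event stream.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // Anchoring each tuple with a message id makes it participate in
            // acking, which is what lets max-spout-pending throttle the spout.
            collector.emit(new Values("event"), System.nanoTime());
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event"));
        }
    }

    // Hypothetical bolt standing in for real processing (auto-acks).
    public static class NoopBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) { }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 2);
        builder.setBolt("noop", new NoopBolt(), 4).shuffleGrouping("events");

        Config conf = new Config();
        // Caps un-acked tuples in flight per spout task: lower values bound
        // queueing (and therefore latency), higher values favor throughput.
        conf.setMaxSpoutPending(1000);
        conf.setNumWorkers(2);
        conf.setMessageTimeoutSecs(30);

        StormSubmitter.submitTopology("sla-tuned", conf, builder.createTopology());
    }
}
```

Setting the cap low keeps latency tight at the cost of throughput; setting it high does the reverse, which is exactly the trade-off a latency SLA forces you to make explicitly.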

Another area in which Storm continues to evolve is debugging and tuning applications. Tuning and debugging distributed systems is hard, and even more so with streaming applications, where you often have complex interactions with external systems and the potential for wide variances in data velocity. With batch processing, poorly tuned jobs just run slowly; with streaming applications, the pipes can burst.

A number of streaming frameworks claim to make it easy to scale and debug an application. In reality it can be anything but; just ask anyone who's built a complex data streaming application with tight SLAs and put it into production. The problem is that some frameworks are too opaque at runtime: while the stream processing API may be simple and easy to reason about, tuning and scaling often require an in-depth understanding of the platform's internals. With Storm 1.0 we've added a number of features that make the process much easier, such as improved metrics, data sampling, distributed log search, and on-demand JVM profiling. I expect this tooling to evolve and improve in subsequent versions as well.
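On the metrics front, one concrete hook is the metrics consumer, which you attach through the topology Config; a minimal sketch (LoggingMetricsConsumer ships with Storm, while the surrounding topology is assumed from the earlier example):

```java
import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class MetricsSetup {
    // Builds a Config that samples Storm's built-in metrics (execute
    // latency, queue depths, ack counts) into the workers' metrics log,
    // where they can be correlated with distributed log search results.
    public static Config withMetricsLogging() {
        Config conf = new Config();
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}
```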

In a similar vein, dynamic scaling (both up and down) in response to changing data velocity is a commonly requested feature. Storm 1.0 introduces automatic back pressure, as well as resource-aware scheduling that takes CPU and memory availability into account. Both are great features in their own right, but what's potentially more interesting is how they could be extended to do things like auto-tuning and auto-scaling.
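As a rough sketch of what those two features look like at the API level in Storm 1.0, reusing the hypothetical EventSpout and NoopBolt from the earlier example:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class ResourceAwareSetup {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Per-component hints for the resource-aware scheduler:
        // CPU as a percentage of one core, memory in megabytes.
        builder.setSpout("events", new SlaTunedTopology.EventSpout(), 2)
               .setCPULoad(25.0)
               .setMemoryLoad(512.0);

        builder.setBolt("noop", new SlaTunedTopology.NoopBolt(), 4)
               .shuffleGrouping("events")
               .setCPULoad(50.0)
               .setMemoryLoad(768.0, 128.0); // on-heap MB, off-heap MB

        Config conf = new Config();
        // Enables automatic back pressure: spouts throttle when downstream
        // bolt queues pass a high-water mark and resume at a low-water mark.
        conf.put(Config.TOPOLOGY_BACKPRESSURE_ENABLE, true);

        StormSubmitter.submitTopology("resource-aware", conf, builder.createTopology());
    }
}
```

Because the scheduler now knows each component's footprint, an auto-scaler could in principle adjust parallelism and resubmit without a human re-deriving those numbers by hand.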

Datanami: You were a firefighter for many years prior to joining The Apache Software Foundation. That’s very admirable. Can you talk to us about that experience and how it is helping you now with your duties as Vice President of Apache Storm?

Yes, I was a volunteer firefighter for many years and later became president of the fire company. That experience taught me a lot about the dynamics of a volunteer organization. People volunteer their time for a broad range of reasons, and a broad range of people volunteer, all with different personalities and motivations. So you end up with a diverse, fiercely dedicated community where individual personalities can and do clash at times, but in the end people resolve their differences and do great things for the community. I've found that's very similar to ASF communities.

I think empathy is one of the most important traits an open source contributor can have, especially with ASF projects, where all communication and decision making take place over email. Without the cues you get from face-to-face communication, it's all too easy to misinterpret someone's intent. Often it's really important to be able to take a step back and put yourself in someone else's shoes, especially in heated discussions. My experience with firefighting helped a lot with that, and with group skills in general.

Datanami: Generally speaking, on the subject of big data, what do you see as the most important trends for 2016 that will have an impact now and into the future?

In terms of big data, streaming will continue to be hot. Not that batch is dead; it certainly isn't. But I see batch increasingly becoming an augmentation to real-time. Apache Storm and Apache Spark Streaming are not going away anytime soon, and there are a lot of new entries in the space. Apache Flink, Apache Apex (incubating), and the latest streaming platform, Apache Gearpump (incubating), all show promise. These are exciting times in the stream processing world.

Besides the continued interest in streaming, I think memory management is becoming really important as we move more toward in-memory processing and storage. Flink was an early innovator in that area, and Spark followed suit with Tungsten in an attempt to address the memory problems users often encountered. One of the more recent projects I'm really excited about is Apache Arrow, which spun out of Apache Drill. Arrow is a columnar data layer for off-heap, in-memory analytics. One implication is that it allows multiple systems, and multiple programming languages, to access shared data without the overhead of serialization and deserialization, which can be very expensive in terms of CPU cycles.
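Arrow was brand new at the time, but the core idea is easy to sketch. Assuming the Java vector API as it later stabilized, populating a column places the values in off-heap buffers laid out per Arrow's columnar format:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowColumnSketch {
    public static void main(String[] args) {
        // Values land in contiguous off-heap buffers following Arrow's
        // columnar spec, so any Arrow-aware process or language binding
        // can read them in place, with no serialize/deserialize step.
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector sensorIds = new IntVector("sensor_id", allocator)) {
            sensorIds.allocateNew(3);
            sensorIds.set(0, 101);
            sensorIds.set(1, 102);
            sensorIds.set(2, 103);
            sensorIds.setValueCount(3);
            for (int i = 0; i < sensorIds.getValueCount(); i++) {
                System.out.println(sensorIds.get(i));
            }
        }
    }
}
```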

Obviously IoT is big right now and high in the hype cycle. But in reality it's kind of a mess: security is a major problem, and there are a lot of vendors jockeying for position. I think this year some of the chaff will blow off and we'll start to see some solid applications. We don't need network-enabled toilet paper dispensers; we need "things" that demonstrably improve consumers' way of life and businesses' bottom lines.

I hope (though it's a stretch) that we'll see device vendors adopt open protocols and embrace interoperability. Unfortunately, IoT is still an emerging market in its land-grab phase, so that may not happen. Established manufacturers want to protect their territory, and also their data. So with certain manufacturers you run into data retention and usage restrictions that can adversely affect your use case. Device interoperability is still a mess, and proprietary radio protocols don't help. I think MQTT has tremendous potential, but it implies an IP stack, which can be expensive at the sensor level (though prices are dropping dramatically), not to mention the security implications. Does a simple Hall effect sensor or light switch warrant an IP stack?
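For illustration, publishing a sensor reading over MQTT takes only a few lines with a client library such as Eclipse Paho; the broker address and topic here are hypothetical:

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class SensorPublisher {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker and client id, for illustration only.
        MqttClient client = new MqttClient(
                "tcp://broker.example.com:1883", "hall-sensor-1", new MemoryPersistence());
        client.connect();

        // A Hall effect sensor reading reduces to a tiny payload, and MQTT
        // adds only a few bytes of protocol overhead on the wire.
        MqttMessage reading = new MqttMessage("1".getBytes());
        reading.setQos(1); // at-least-once delivery
        client.publish("home/garage/door", reading);

        client.disconnect();
    }
}
```

The payload and protocol are cheap; the expensive part is everything underneath, since the device still needs TCP/IP (plus TLS, if you take security seriously) just to reach the broker.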

Finally, I think we're going to see a progression beyond big data platforms to big data "apps" (essentially prepackaged solutions tailored to specific use cases), and containerization will help drive that. In the past, once an enterprise adopted a big data platform, it was largely left to build its own applications. That's starting to change, and we're already seeing such projects come to Apache. Apache Metron (incubating), for example, is a network security analytics tool that leverages other Apache technologies like Storm, Hadoop, and HBase. Apache Eagle (incubating) is another good example: it leverages several Apache big data projects, including Storm, to monitor Hadoop data for malicious activity and unauthorized access and respond in real time.

Datanami: Outside of the professional sphere, what can you tell us about yourself – personal life, family, background, hobbies, etc.?

I'm happily married and have two awesome sons who definitely keep me busy in my time away from work. I like to hunt, fish, and shoot trap and skeet. I've also been known to wield a soldering iron and tinker with microcontrollers and SoCs, but I have yet to create the killer network-enabled toilet paper dispenser…

Datanami: Final Question: Who is your idol in the big data industry and why?

I can think of two. Neither is a specific person, but in my opinion both are driving forces behind innovation in the big data industry.

The first is the individual open source contributors who give their free time and effort to learn and contribute to open source projects. The term "big data" pretty much implies distributed systems, which in turn implies a high degree of complexity. It's not easy work, and it takes a lot of dedication.

The second is the corporate leaders who recognize the benefit of open source and establish policies that allow their employees to contribute to OSS with the least amount of friction.

Yahoo! is a leader in that respect. They were not only the driving force behind open-sourcing Hadoop in the first place; they remain a driving force behind improvements across the entire Apache ecosystem, including Apache Storm.

And obviously, none of what we enjoy today would have happened if it weren’t for the Apache Software Foundation.
