Inside Redfin’s Cautious Approach to Big Data
There’s a certain allure to big data technology that tempts us to view it as the solution to every business problem. But there are times when the simplicity of traditional technology, like relational databases, outweighs the scalability benefits of big data tech, as the real-estate website Redfin learned.
It’s no wonder that big data technologies like Hadoop and NoSQL databases were born on the Web. After all, for many of the biggest websites on the Internet, such as Google, Yahoo, and Facebook, traditional technology like relational databases simply cannot scale to meet the data processing demands of serving billions of people across the globe.
But outside of that top tier of megasites, companies should not be so quick to dismiss the tried-and-true technologies that have been used so effectively for the past 20 years. That’s the lesson learned by Redfin CTO Bridget Frey, who has guided the technological development of the popular real estate website during big data’s heyday.
Don’t Overthink IT
Redfin was founded in 2004, but really got its Web-based real-estate service off the ground in 2006. That’s the same year Doug Cutting brought Hadoop to Yahoo to help index the Internet, and the same year Google publicly disclosed technical details of BigTable, which kicked off the NoSQL database craze.
Today, Redfin runs the world’s 210th most popular website (according to Alexa.com) and serves detailed real-estate information on millions of homes across most of the United States to more than 10 million visitors per month. You might be surprised to learn that much of that data comes from a database management system first released in 1996.
“We had to start our site before big data was a thing,” Frey tells Datanami. “Our traditional data architecture is basically a big old Postgres database, and we use Intel I/O cards that store many terabytes of data to give us some pretty powerful performance.”
Running a single relational database for core website components provides compelling advantages for the company, Frey says, particularly for Redfin’s developers, who would otherwise struggle under the weight of complex queries involving data on millions of homes, along with accompanying tax records, school records, and other geographic data.
“Having all of that in a single database, where we can do joins across them and build views that combine all of these things–it’s just been a really convenient way to build our software,” Frey says. “For our developers, it’s very easy for them to get instantaneous access to all the great information without having to think too hard.”
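To make that concrete, here is a minimal sketch of the kind of single-database join Frey describes. The table names, columns, and connection details are hypothetical stand-ins (Redfin’s actual schema is not public), and it assumes Python with the psycopg2 driver talking to a local Postgres instance.

# Minimal sketch of a join across housing, tax, and school data that all
# lives in one Postgres database. Table and column names are hypothetical.
import psycopg2

# Hypothetical connection parameters.
conn = psycopg2.connect(dbname="realestate", user="app", host="localhost")

# One query combines listing, tax, and school data because everything sits
# in the same database, with no cross-service calls needed.
QUERY = """
    SELECT h.address, h.list_price, t.assessed_value, s.school_name
    FROM homes h
    JOIN tax_records t ON t.home_id = h.id
    JOIN school_assignments s ON s.home_id = h.id
    WHERE h.zip_code = %s;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY, ("98101",))
    for address, price, assessed, school in cur.fetchall():
        print(address, price, assessed, school)

conn.close()

The same result could of course come from stitching together several separate services, but as Frey notes, keeping it in one database means a developer gets it with a single query.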
The Seattle, Washington-based company actually stores its entire database across several dozen flash accelerator cards from Intel, which essentially give Redfin the performance of an in-memory database without the high cost or the need for major architectural changes.
“We found that we don’t even need to have indexes and things on database tables because the I/O cards have instantaneous access to the data anyway, and they can give us very high performant query capabilities,” Frey says. “We’ve never had to do any sharding or anything like that that we hear people talk about.”
Addressing Big Data
That’s not to say that Redfin doesn’t have a big data infrastructure. While all the queries from Redfin’s public website are served from the Postgres database, there are a number of other data systems running behind the scenes to power Redfin features, such as the home price estimator and the home recommendation system. Some of these are powered by technology from the big data realm.
“A couple of years ago we started recording and looking at clickstream data. We have log files of every click that somebody takes on the site and anything that we send in terms of events is all recorded there,” Frey says. “We’re just finding the size of log files is not suitable for storage in a traditional database.”
At the time, the company looked closely at the Hadoop distribution from Cloudera and DataStax’s distribution of the Cassandra database to store and analyze these log files in-house. But eventually Redfin decided that it didn’t want to be in the business of running distributed server clusters, so it decided to let a company down the street host it instead.
Redfin uses the entire Amazon Web Services big data stack: S3 to store the log data, Elastic MapReduce (EMR) to process it, Redshift to analyze it, DynamoDB to feed the results back to the website, and Kinesis to manage the whole data pipeline. As Frey explains, it made sense for Redfin to keep the turbo-charged Postgres database in house but to outsource the big data collection, storage, and analytics work to Amazon.
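For a rough sense of what such a pipeline looks like in code, here is a short Python sketch using boto3, the AWS SDK. The stream name, table name, and event fields are hypothetical; only the services themselves (Kinesis, S3, EMR, Redshift, DynamoDB) come from Redfin’s description.

# Sketch of a clickstream event entering the pipeline and a precomputed
# result being read back by the website. Resource names are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")
dynamodb = boto3.resource("dynamodb", region_name="us-west-2")

# 1. A clickstream event is pushed onto a Kinesis stream.
event = {"user_id": "u123", "action": "view_listing", "listing_id": "L456"}
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)

# 2. (Not shown: raw logs land in S3, EMR jobs process them, Redshift is
#    used for analysis, and results are written to DynamoDB.)

# 3. The website reads a precomputed result back from DynamoDB.
table = dynamodb.Table("home-recommendations")
response = table.get_item(Key={"user_id": "u123"})
print(response.get("Item", {}))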
“When we were looking at the cost we thought it would take us to operate a full Hadoop pipeline in house, it was looking like it was going to be a massive undertaking – especially before we’d even proven that we’re going to use the features,” she says. “AWS was just so easy and in a couple of days our engineers were able to get a basic pipeline working, and we’ve run with it since then.”
Yong Huang, who manages the Redfin big data analytics team, says leaving the DevOps work to AWS lets his team focus on the analytics. “Once we solved the infrastructure problem, we can actually dream a little bigger,” he says in a video case study posted to the AWS website. “We can deliver results without wondering, ‘Hey, how do I scale my product? How do I make sure that my data is actually there for our customers?’ I think that actually had a liberating effect.”
Frey agrees. “It’s given us a really nice separation between the people working on our core website and people who can innovate on the analytics side by having the DynamoDB interface,” she says. “I think it’s allowed us to just really experiment with some interesting algorithms without having to move entirely away from a traditional database architecture.”
Keep It Simple
There’s an old adage in the business world known by the acronym KISS: “keep it simple, stupid.” You could say that Redfin has been a willing practitioner of the KISS approach, even as the rest of the industry was falling head over heels for big data.
“We feel like we get close to the type of performance that people talk about in the big data systems with what we’ve set up with our Postgres and Intel cards,” Frey says. “We can get instantaneous access to several terabytes of housing data and run very high performance geographic-based queries on that data. I know a big data architecture can also do that, but for us, our traditional architecture is also performing very fast.”
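As a rough illustration of the sort of map-viewport lookup involved, here is a sketch of a bounding-box query against a plain Postgres table. The column names and coordinates are hypothetical, and it assumes nothing about the geospatial indexing or extensions Redfin actually uses.

# Sketch of a geographic bounding-box query on a plain Postgres table.
# Latitude/longitude columns and the viewport values are hypothetical.
import psycopg2

conn = psycopg2.connect(dbname="realestate", user="app", host="localhost")

BBOX_QUERY = """
    SELECT id, address, list_price
    FROM homes
    WHERE latitude  BETWEEN %(south)s AND %(north)s
      AND longitude BETWEEN %(west)s  AND %(east)s
    ORDER BY list_price;
"""

# Roughly the Seattle metro area, for illustration only.
viewport = {"south": 47.48, "north": 47.73, "west": -122.45, "east": -122.22}

with conn, conn.cursor() as cur:
    cur.execute(BBOX_QUERY, viewport)
    for home_id, address, price in cur.fetchall():
        print(home_id, address, price)

conn.close()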
There’s a lot of hype about big data technology, and in many situations, there’s good reason for it. For some types of workloads, such as running machine learning models at scale, finding patterns across petabytes of data, or scaling out to serve hundreds of thousands of simultaneous connections, there are few good alternatives to Spark, MapReduce, and NoSQL databases like Cassandra.
But don’t fall into the trap of thinking that every workload needs to run on the latest and greatest technology to come out of Silicon Valley. In many cases, the technology Silicon Valley invented 20 years ago is still plenty fast, and it’s getting faster thanks to advances such as I/O accelerators and the declining cost of RAM.
Frey offers this advice to other CTOs who might be considering a big data system:
“Be thoughtful of when you need the overhead of a big data system,” she says. “They’re still a little trickier and require some different types of training and expertise in the engineers than a traditional website. Think about whether you really need the big data architecture and think about which projects are suitable for big data, and then lean on the database for as long as you can because there are some benefits from the simplicity of a traditional data architecture as well.”
Related Items:
Happy Birthday, Hadoop: Celebrating 10 Years of Improbable Growth
Rudin: Big Data is More Than Hadoop
When to Hadoop, and When Not To