2021 Big Data Year in Review: Part 2
There was a lot going on in 2021, but we’ve done our best to synthesize some of the top stories of the year. Here, we pick up where we left off in part one of this two-part piece.
One of the most interesting developments of 2021 was the rise of the data mesh and the data fabric. Interest in data fabrics grew thanks to their ability to provide a common layer for data access, discovery, transformation, integration, security, governance, lineage, and orchestration. Data meshes–which provide a path for how people can organize themselves to get the most out of data–also grew in popularity.
The two architectures share a core assumption–namely, that data will not be centralized and will continue to proliferate in silos. But there are important differences, which we laid out in an October article that became one of Datanami’s most popular of the year.
Data Observability
The observability and AIOps market found a new gear that nobody knew it had. We were also introduced to another data discipline that we didn’t know we needed: data observability.
Among the data observability vendors landing on our radar for the first time were Soda, a Belgium-based vendor that raised $17.5 million in January; Bigeye, which has its roots in Uber’s data pipelines, and which raised $45 million in September; Monte Carlo, which, despite the name, is based in San Francisco; and Lightup, which we profiled in April.
Log Analytics
Log analytics, AIOps, and observability continued to grow in importance throughout the year as customers sought better ways to interpret huge data sets. With $17 billion up for grabs in the AIOps and observability markets, a number of startups entered the game.
In October, we showed you how a startup called Hydrolix was bringing its secret sauce to bear on log data in the cloud. Some progress was made in standards for log data as Splunk and others settled on the OpenTelemetry data format. ChaosSearch continued to make headway with its Elastic clone, which found its way into the Toronto subway system.
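To give a flavor of what settling on OpenTelemetry means for log data, here is a minimal sketch of a log record shaped after the OpenTelemetry log data model, written as a plain Python dict. The field names follow the published data model; the values, and the exact OTLP wire encoding, are illustrative assumptions.

```python
# A minimal sketch of an OpenTelemetry-style log record as a plain Python
# dict. Field names follow the OpenTelemetry log data model; the actual
# OTLP wire encoding differs, and all values here are invented.
import time

log_record = {
    "Timestamp": time.time_ns(),          # nanoseconds since the Unix epoch
    "ObservedTimestamp": time.time_ns(),  # when the collector saw the record
    "SeverityText": "ERROR",
    "SeverityNumber": 17,                 # ERROR maps to 17 in the spec
    "Body": "Failed to flush batch to object store",
    "Attributes": {                       # free-form key/value metadata
        "service.name": "ingest-worker",  # hypothetical service name
        "retry.count": 3,
    },
    "TraceId": "5b8efff798038103d269b633813fc60c",  # ties logs to traces
    "SpanId": "eee19b7ec3c1b174",
}
```

A shared record shape like this is what lets logs, metrics, and traces from different vendors land in one analytics backend.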
In February, Apache Iceberg was proposed as the hub of a new data service ecosystem. Later in the year, Ryan Blue, the creator of Iceberg, would bring his commercial vendor, Tabular, out of stealth.
Language Models
Large language models, such as OpenAI’s GPT-3, caught the world’s imagination in a big way in 2020, and companies looked for innovative ways to use them for chatbots, intelligent search, and more. In February, Google outdid GPT-3 with the Switch Transformer, whose 1.6 trillion parameters dwarfed GPT-3’s 175 billion.
For some language tasks, large transformer models were outperforming human abilities. Companies across industries looked for people with the skills to leverage language models such as BERT for chatbots, document understanding, and optimized search, while others looked to vendors like Mantium to help them use the models.
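For a taste of what leveraging a model like BERT looks like in code, here is a minimal sketch using the Hugging Face transformers library to run BERT’s masked-word prediction task, the pretraining objective that underpins those downstream uses. The checkpoint is the standard bert-base-uncased; the example sentence is our own.

```python
# A minimal sketch of running BERT on its masked-language-modeling task
# with the Hugging Face transformers library (pip install transformers).
# The checkpoint and example sentence are illustrative choices.
from transformers import pipeline

# "fill-mask" loads a masked-language-modeling head on top of BERT.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT ranks the most likely tokens for the [MASK] position.
for prediction in unmasker("The invoice was sent to the [MASK] department."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```

The same library exposes pipelines for question answering and text classification, which is where the chatbot and document-understanding use cases come in.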
“Hey brother, can you spare some GPU time?” Sharing spare processing capacity for training deep learning models sounds far-fetched, but it’s within the realm of possibility thanks to this project.
All About That AI
AI and BI have been coming closer together for some time. In 2021, the two technologies combined to form “augmented analytics,” which Gartner told us is the new state of the art. On the data science and machine learning (DSML) front, Gartner identified a “glut” of innovation. The idea of “composite AI,” which blends multiple data disciplines, also gained traction at SAS.
It felt as though AI was gaining traction, and in June, Appen brought us data that showed companies were going “all in” on AI, with AI budgets at the biggest companies up by more than 50%.
For a real-world example of AI, we told you how Coke bottlers were saving millions by automatically interpreting standard business documents, like proofs of delivery and invoices. However, the opacity of today’s AI is a real problem, one longtime ML practitioner told us in May. The path out of the malaise, he said, isn’t an easy one, but it starts with more “statistical rigor.”
Synthetic Data
One way to avoid the ethical quandary of exploiting someone’s private data: just fake it.
The industry became more aware of the benefits of synthetic data, which resembles real data in all its features and distributions, but does not contain personally identifiable information.
Synthetic data emerged as a particularly attractive approach for imagery, including for training AI drones. But synthetic data is also being generated to help train language models, particularly among outfits that have sophisticated ML-based data labeling operations.
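As a minimal sketch of the idea (not how any particular vendor does it), the naive version of synthetic tabular data fits a distribution to each real column and samples new rows from it. The column names and values below are hypothetical, and production systems also preserve correlations between columns.

```python
# A minimal, naive sketch of synthetic tabular data: sample each column
# independently from a distribution fitted to the real data. Production
# tools also preserve cross-column correlations; the columns and values
# here are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a real table containing PII (names are dropped entirely).
real_ages = np.array([34, 41, 29, 52, 38, 45, 31, 60])
real_incomes = np.array([52_000, 71_000, 48_000, 95_000,
                         63_000, 80_000, 50_000, 110_000])

n_synthetic = 5

# Numeric columns: sample from a normal fitted to the observed mean/std.
synth_ages = rng.normal(real_ages.mean(), real_ages.std(), n_synthetic)
synth_incomes = rng.normal(real_incomes.mean(), real_incomes.std(), n_synthetic)

# The synthetic rows mimic the real distributions but correspond to no
# actual person, so no personally identifiable information carries over.
for age, income in zip(synth_ages, synth_incomes):
    print(f"age={age:5.1f}  income=${income:,.0f}")
```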
People and Open Source
We announced our Datanami People to Watch for 2021 in February, showcasing 12 individuals who we think have already made a big impact on big data—or are about to. The Tabor Communications team welcomed some new individuals to the team in December, which we cataloged in our monthly Career Notes column.
The software licensing wars continued in 2021, particularly around open source, which has been the source of much innovation in the big data space for many years. Elastic modified its licensing once again in reaction to steps taken by AWS. In April, Grafana switched from Apache 2.0 to AGPL.
A group of vendors, including Starburst, Dremio, and Ahana, backed an emerging approach to open data analytics, one marked by open source engines running atop data stored openly in cloud object stores, as opposed to optimized but proprietary data warehouses.
Custom Silicon and The Chip Shortage
The chip shortage was a recurring theme throughout 2021, including in March, when we reported that it appeared to be impacting AI workloads in the cloud. With prices for everything from cars to home appliances going up due to the chip shortage, some sought ways to optimize AI in the software stack.
Custom silicon was red hot in 2021, with a number of startups designing chips to speed up certain workloads.
For example, Speedata emerged in October to speed up SQL analytics; SambaNova made progress with its Reconfigurable Dataflow Unit (RDU) for AI; Cerebras touted its massive Wafer Scale Engine (WSE) chip for running large language models; and NeuroBlade impressed with a new processor-in-memory (PIM) technology called XRAM.
The Job Market
By mid-year, hiring for data professionals was up considerably from the prior year, and according to a Burtch Works assessment, salaries were starting to grow. Inflation was running hot at the end of the year, but data salaries were running even hotter: after a “COVID bump” early in the year, the increases accelerated as the year went on.
Warning signs of an impending problem with data workers started appearing in June, when an Ascend.io survey found that 96% of data professionals were at or over capacity. In September, a survey by O’Reilly found that the average salary for data professionals in the US and the UK was $146,000, just a 2.25% increase from the prior year.
As the Great Resignation grew during the year, data pros were caught in the crosshairs, with nearly 80% of data engineers in one survey saying they wished their jobs came with a therapist. One recruiter said that job vacancies were increasing faster than the number of job seekers, “which is really, really crazy.” The average salary for some data jobs exceeded $300,000.
Related Item:
Big Data Year in Review: Part 1