Big-data analytics provider Cloudera Inc. today announced the general availability of Apache Iceberg within its flagship Cloudera Data Platform, giving customers access to a 100% open table format developed by the Apache Software Foundation.
CDP is the product suite that encompasses Cloudera’s flagship data management and analytics services. It’s used by companies to collect business records from their internal systems, centralize them in one data lake, then analyze them for useful insights and build machine learning models. CDP also includes an array of tools for performing the complex data preparations that have to be carried out before a company’s business information can be used in analytics and machine learning projects.
This is where Apache Iceberg fits in. Iceberg is an open table format for data lakes that was originally designed by Netflix Inc. to overcome the challenges it came across when using existing data lake technologies such as Apache Hive, Apache Impala and Apache Spark.
Cloudera said the biggest reason to adopt Apache Iceberg is its open nature, which is in contrast to some of those other popular table formats. In a blog post, Cloudera Chief Technology Officer Ram Venkatesh explained that many of these table formats are tied to primary engines and, oftentimes, to single vendors. That runs counter to the company’s strategy of creating an open data lakehouse that’s stacked with easily accessible data. Venkatesh said Cloudera’s customers demand “higher scale, more flexibility of analytic engines and services on the data lake, all without vendor lock-in.”
As a 100% open-source and cloud-native table format, Apache Iceberg eases those concerns. At the same time, it’s designed to handle petabyte-scale object storage, while avoiding some of the performance degradations seen in alternative formats such as Apache Hive, Venkatesh said.
It also brings capabilities that can be used to future-proof data architectures. Features include in-place table evolution, time travel with point-in-time queries, concurrent multifunction analytics and improved performance through aggressive partitioning, to handle very large-scale datasets.
The result is no more lock-in, unnecessary data transformations or data movement across tools and clouds just to extract insights out of the data, Cloudera said. “With Apache Iceberg in CDP, Cloudera leads beyond the data lakehouse with an open ecosystem of data and community, combined with enterprise hardening and performance,” Venkatesh said.
Analyst Holger Mueller of Constellation Research Inc. told SiliconANGLE that open-source is key when it comes to most infrastructure-as-a-service and platform-as-a-service offerings, which is why Cloudera has decided to fully embrace Apache Iceberg.
“Cloudera could have gone down a proprietary path, but adopting Iceberg is a triple win,” Mueller said. “First and foremost, it’s a win for customers who can store their very large analytical tables in a standards-based, open-source format, while being able to access them with a standard language. It’s also a win for Cloudera as it provides a key feature on an accelerated timeline while supporting an open-source standard. Last, it’s a win for Apache as it gets another vendor uptake.”
Mueller’s colleague Doug Henschen said Apache Iceberg has been gaining a lot of traction and support on other analytical data platforms lately. For instance, Google Cloud said in March it will bring support for Iceberg to its BigLake platform, while Snowflake Inc. announced preview support at the Snowflake Summit a few weeks ago.
“Dremio already supported Iceberg and now Cloudera is joining the club,” Henschen said. “Databricks still relies on Delta Lake, but earlier this week it announced that it will be made available as an entirely open-source project, no doubt in response to customer and competitive pressure. Both table formats enhance analytical performance and provide better metadata management compared to using columnar formats such as Parquet, so it’s good news for customers.”
Evidence of that comes from the warm reception Apache Iceberg in CDP received from early adopters during its technical preview. For instance, the Canadian electronic land registration system provider Teranet Inc. said it’s using CDP with Apache Iceberg to support an open lakehouse architecture that future-proofs its data platforms for all analytical workloads.
“We selected change data capture as our first use case on Iceberg,” said Teranet system architect Steven Brackenbyry. “With frequent updates to our data lake, we aim to accelerate reporting and business intelligence, giving our business teams access to current insights. Partition evolution is also a critical capability for us, guaranteeing superior query performance for large-scale data engineering and BI workloads.”
Cloudera said Apache Iceberg now works with all CDP services, including Cloudera Data Warehouse, Cloudera Data Engineering and Cloudera Machine Learning.