Data Lake Software Options 2026
- Phil Turton

The original data lake promise was simple: store everything, figure out what to do with it later. In practice, that promise aged badly. Organisations that invested in first-generation data lake architectures through the 2010s frequently ended up with what the industry quietly called a "data swamp": a vast, cheap repository of raw data that was technically accessible but practically unusable. Without structure, governance, or performance-oriented tooling, data engineers spent more time wrestling with the infrastructure than delivering value from the data.
In 2026, the architecture has fundamentally matured. The dominant pattern is now the data lakehouse - a design that adds ACID transactions, schema enforcement, and warehouse-grade query performance directly on top of cheap cloud object storage. Open table formats, particularly Apache Iceberg and Delta Lake, have become the structural layer that makes this possible, and the major platforms have converged around them. The result is that enterprise data teams can now have the economics and flexibility of a data lake alongside the performance and reliability of a data warehouse - without choosing between them or paying for both.
The global data lake market was valued at over $13 billion in 2023 and is projected to reach nearly $60 billion by 2030, growing at a compound annual growth rate of around 24%. Much of that growth is being driven by AI: as organisations attempt to build AI applications at scale, they need a single, governed, high-quality data foundation that AI models can reason over - and the modern lakehouse is the architecture most enterprises are converging on to provide it.
This guide covers the leading data lake and lakehouse software platforms available to enterprise and mid-market buyers in 2026. Viewpoint Analysis is a Technology Matchmaker, helping businesses find and select the right technology fast - aiming to be the place buyers go to understand the software and technology market before speaking to vendors.
Included Data Lake Software Vendors
This guide covers the following data lake and lakehouse platforms, evaluated independently across cloud hyperscaler, independent lakehouse, hybrid enterprise, and specialist tiers. Our viewpoint on each vendor follows below.
Databricks | Snowflake | Microsoft Fabric | AWS (Lake Formation) | Google Cloud (BigLake) | Cloudera | IBM watsonx.data | Dremio
Not sure which data lake platform to shortlist?
Use the free Longlist Builder to get a tailored list of data platform vendors matched to your cloud environment, workloads, and requirements - no registration required.
What is Data Lake Software?
A data lake is a centralised repository that stores large volumes of data in its raw, native format - structured tables, semi-structured files such as JSON, columnar formats such as Parquet, and unstructured content such as documents, images, and logs - at a scale and cost point that traditional data warehouses cannot match. Unlike a warehouse, which requires data to be transformed and structured before it can be stored, a data lake ingests data as-is and defers processing until the data is needed for a specific analytical purpose. This gives data teams the flexibility to capture everything and decide how to use it later - a significant advantage when the value of a particular data source is not yet clear.
The critical evolution of 2026 is the near-universal adoption of the lakehouse architecture, which solves the most persistent failure modes of the original data lake design. A lakehouse adds an open table format layer - Apache Iceberg, Delta Lake, or Apache Hudi - on top of cloud object storage. This layer provides ACID transactions (ensuring data consistency even with concurrent reads and writes), schema enforcement (preventing malformed data from corrupting downstream consumers), time travel (allowing queries against historical snapshots of data), and high-performance query execution that approaches data warehouse speeds. The result is a platform that can serve data science and machine learning workloads, traditional SQL analytics, real-time streaming ingestion, and governed business intelligence from the same underlying data store - eliminating the duplication and synchronisation overhead that previously forced organisations to maintain separate lakes and warehouses.
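To make the table-format mechanics concrete, here is a minimal sketch using PySpark with the open-source Delta Lake bindings. The local path, table contents, and column names are illustrative, and the same behaviour - ACID commits, schema enforcement, time travel - applies in broadly equivalent form to Apache Iceberg and Hudi tables.

```python
# A minimal sketch of lakehouse table-format behaviour, assuming a local
# PySpark session with open-source Delta Lake (pip install delta-spark).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# ACID writes: each commit produces a new, atomically visible table version.
spark.createDataFrame([(1, "open")], ["id", "status"]) \
    .write.format("delta").mode("overwrite").save("/tmp/orders")
spark.createDataFrame([(1, "closed")], ["id", "status"]) \
    .write.format("delta").mode("overwrite").save("/tmp/orders")

# Schema enforcement: appending a mismatched schema raises an error instead
# of silently corrupting downstream consumers (uncomment to see it fail).
# spark.createDataFrame([("oops",)], ["wrong_column"]) \
#     .write.format("delta").mode("append").save("/tmp/orders")

# Time travel: query the table as it stood at an earlier committed version.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/orders").show()
```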
A data lake or lakehouse is distinct from a data warehouse, though the boundary is blurring. Warehouses are optimised for structured, schema-on-write data and predictable SQL analytics workloads. Lakehouses are optimised for scale, flexibility, and mixed workloads including machine learning and unstructured data. Many large enterprises run both - using the lakehouse as the primary data store and transformation environment, and a warehouse layer for governed, business-facing reporting. Understanding where each fits in your architecture is an important part of vendor evaluation.
For a broader view of the data technology landscape - covering data governance, data quality, data integration, and master data management alongside data lake platforms - see the Viewpoint Analysis Data Technology page.
How to Find Data Lake Software
The data lake and lakehouse platform market is unusual compared with most enterprise software categories, because it is dominated by the three major cloud hyperscalers - AWS, Microsoft Azure, and Google Cloud - each of which provides native data lake capabilities as part of a broader cloud platform. For many organisations, the choice of data lake platform is partly or entirely constrained by their existing cloud provider relationship. A business that runs its production systems on Azure will naturally gravitate toward Microsoft Fabric or Azure Data Lake Storage; an AWS-first organisation will look first at AWS Lake Formation and the broader AWS data services ecosystem. Understanding how much of your evaluation is genuinely open versus shaped by existing cloud commitments is the first step in a realistic shortlisting process.
For organisations that want to look beyond their default cloud provider - or that are evaluating independent platforms such as Databricks, Snowflake, or Cloudera to run across multiple clouds - the Viewpoint Analysis Longlist Builder generates a tailored vendor longlist based on your specific cloud environment, primary workloads, team profile, and requirements. Because it filters by your situation rather than listing every platform in the market, it produces a more useful starting point for structured evaluation.

For organisations that want to reach a shortlist faster, the Viewpoint Analysis Technology Matchmaker Service brings the best-fit vendors directly to your team. Think of it as Dragons' Den or Shark Tank for enterprise software: Viewpoint Analysis interviews your team, produces a Challenge Brief capturing your data architecture requirements and business objectives, and invites the leading data lake and lakehouse vendors to pitch their solution directly to you - getting you to a credible shortlist without the months of preliminary market research that complex data platform decisions typically require.
Independent Data Lakehouse Platforms
Databricks is the most widely adopted independent lakehouse platform in the enterprise market and the company that pioneered the lakehouse architectural pattern. Built on Apache Spark - which Databricks' founders created at UC Berkeley - the platform combines large-scale data engineering, machine learning, and SQL analytics in a unified environment underpinned by Delta Lake, the transactional storage format that Databricks created and contributed to open source. Delta Lake provides ACID transactions, schema enforcement, time travel, and change data capture on top of cloud object storage across AWS, Azure, and GCP, giving Databricks genuine multi-cloud portability - data stored in Delta format is not locked to Databricks and can be queried by other engines. Unity Catalog provides centralised governance across all clouds, workspaces, and data assets. Databricks is the platform of choice for organisations with serious data engineering and data science capability - particularly those building AI applications, training custom models, or managing petabyte-scale transformation pipelines - though it demands a skilled engineering team to operate effectively and its consumption-based pricing can be challenging to predict at scale.
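The engine-portability point is worth illustrating. The sketch below reads a Delta table with the open-source delta-rs Python bindings rather than with Spark or Databricks; the S3 path is hypothetical, and credentials are assumed to come from the standard AWS environment variables.

```python
# A minimal sketch, assuming pip install deltalake pandas and a Delta table
# already written to the illustrative path below by Databricks or any engine.
from deltalake import DeltaTable

dt = DeltaTable("s3://analytics-bucket/silver/orders")  # hypothetical path

print(dt.version())   # latest committed version from the transaction log
print(dt.schema())    # schema is read from Delta metadata, not inferred
df = dt.to_pandas()   # materialise the table with no Spark runtime at all
print(df.head())
```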
Snowflake has evolved from a cloud data warehouse into a comprehensive data platform with strong lakehouse characteristics, and remains one of the most commercially successful data platforms in the enterprise market. Its architecture separates storage from compute in a way that allows multiple independent compute clusters to query the same data simultaneously, which makes it particularly well suited to organisations with many concurrent analytical workloads. In 2026 Snowflake Cortex - its integrated AI capability - has matured significantly, bringing AI functions including text analysis, vector search, and model inference directly into standard SQL queries without requiring data to leave the Snowflake environment. Snowflake's data marketplace is one of its most distinctive assets, enabling organisations to acquire and share data without physical data movement. Its governance model is proprietary rather than open-standard, which gives it a simpler administration experience than more open platforms but introduces more vendor dependency. Snowflake is the strongest choice for organisations that prioritise governed SQL analytics, cross-organisational data sharing, and minimal infrastructure management overhead, and are willing to accept a degree of platform dependency in exchange for those advantages.
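As an illustration of that in-database pattern, the sketch below calls a Cortex function from standard SQL via the Python connector. The connection details and the product_reviews table are placeholders, and exact function availability depends on account region and edition.

```python
# A minimal sketch, assuming pip install snowflake-connector-python and an
# account with Cortex enabled; connection details are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",          # hypothetical account identifier
    user="analyst",
    password="...",                # use key-pair or SSO in practice
    warehouse="ANALYTICS_WH",
    database="SALES",
    schema="PUBLIC",
)

# SENTIMENT executes inside Snowflake; review text never leaves the platform.
cur = conn.cursor()
cur.execute("""
    SELECT review_id,
           SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment_score
    FROM product_reviews
    LIMIT 10
""")
for review_id, score in cur.fetchall():
    print(review_id, score)
```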
Want data lake platform vendors to pitch to your team?
The Viewpoint Analysis Technology Matchmaker Service identifies the best-fit vendors for your data architecture and workloads and brings them to you - getting you to a shortlist without the usual months of preliminary research.
Cloud Hyperscaler Data Lake Platforms
Microsoft Fabric is Microsoft's most ambitious data platform investment in a decade - an end-to-end analytics and data engineering platform that brings data lakehouse, data warehouse, real-time analytics, data science, and Power BI into a single, unified SaaS environment built on a shared storage layer called OneLake. OneLake functions as a "OneDrive for data" - a single, tenant-wide logical data lake underpinned by Azure Data Lake Storage Gen2 and using Delta Lake as its table format, which means Delta tables written by other engines including Databricks are compatible and accessible via OneLake shortcuts without data movement. For Microsoft-invested organisations, Fabric's integration advantages are material: Power BI connects to lakehouse tables via Direct Lake mode without import cycles, Microsoft Purview provides native governance, and the entire platform is managed within the Microsoft 365 and Azure security boundary. Fabric is not the right choice for organisations with complex multi-cloud requirements or large teams of data engineers who need full Spark cluster control - but for Azure-first organisations that want to reduce tool sprawl and accelerate time to insight, it offers a compelling unified experience.
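For a sense of what the shared storage layer means day to day, here is a minimal sketch of reading a lakehouse table from a Fabric Spark notebook. It assumes a notebook attached to a lakehouse and uses the spark session the notebook runtime provides; the "sales" table is illustrative, and the same Delta table is what Power BI reads via Direct Lake.

```python
# A minimal sketch, assuming a Spark notebook attached to a Fabric lakehouse
# (the `spark` session is provided by the notebook runtime); "sales" is an
# illustrative table name.
df = spark.read.format("delta").load("Tables/sales")  # lakehouse-relative path
df.groupBy("region").count().show()

# Because OneLake stores tables as Delta, the same data is also addressable
# from external engines via an ABFS-style URI (shown schematically):
# abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Tables/sales
```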
AWS provides data lake capabilities through a combination of services centred on Amazon S3 as the storage layer and AWS Lake Formation as the governance and access control framework that sits above it. Lake Formation simplifies the process of building a governed data lake by managing data ingestion, cataloguing, security policies, and access control from a central console, while allowing organisations to use the full range of AWS analytics services - Amazon Athena for interactive SQL queries, Amazon EMR for large-scale Spark and Hadoop workloads, AWS Glue for data integration and transformation, and Amazon Redshift Spectrum for federated querying across the lake and warehouse - on top of their S3 data. For AWS-native organisations, this integrated ecosystem provides broad workload coverage with the operational familiarity of a single cloud relationship. The trade-off is that the AWS data lake architecture is inherently more modular and requires more architectural decisions than integrated platforms like Databricks or Fabric - which is an advantage for experienced data engineering teams who want flexibility, and a complexity risk for organisations without that capability.
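The sketch below shows the shape of that modular pattern: Athena querying data in S3 that is registered in the Glue catalogue and governed by Lake Formation. Bucket, database, and table names are illustrative.

```python
# A minimal sketch of an Athena query over S3 data (pip install boto3);
# region, Glue database, and S3 locations below are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

resp = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "lake_raw"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```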
Google Cloud's data lake capability is delivered through BigLake, which extends BigQuery - Google's flagship cloud analytics engine - to support open table formats including Apache Iceberg directly on Google Cloud Storage. BigLake allows organisations to query data stored in open formats at BigQuery speeds without moving or copying it into proprietary storage, and provides unified fine-grained access control across BigQuery tables and object storage through a single policy model. Google's strength in this category is the maturity and performance of BigQuery as a query engine: its serverless architecture, columnar storage, and in-memory shuffling deliver consistently strong analytical performance with minimal administration overhead. BigLake and BigQuery are most naturally the right choice for Google Cloud-first organisations, and for those with large-scale SQL analytics and machine learning workloads where BigQuery's performance and Google's AI infrastructure - including Vertex AI - represent a compelling integrated stack.
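In practice, that "query in place" model looks like ordinary BigQuery usage. The sketch below assumes application default credentials and the standard BigQuery Python client; the project, dataset, and table names are illustrative.

```python
# A minimal sketch, assuming pip install google-cloud-bigquery and application
# default credentials; project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

# A BigLake table over Iceberg files in Cloud Storage is queried with exactly
# the same SQL as a native BigQuery table - no copy into proprietary storage.
sql = """
    SELECT country, SUM(revenue) AS total_revenue
    FROM `my-analytics-project.lake.sales_iceberg`
    GROUP BY country
    ORDER BY total_revenue DESC
    LIMIT 10
"""
for row in client.query(sql).result():   # serverless: nothing to provision
    print(row.country, row.total_revenue)
```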
Hybrid and Enterprise Specialist Data Lake Platforms
Cloudera Data Platform is the enterprise-focused data lake and lakehouse solution for organisations that need to operate across both public cloud and on-premises environments - a requirement that the cloud-native platforms address incompletely. CDP is built on open-source technologies including Apache Hadoop, Apache Spark, Apache Iceberg, and Apache Hive, and provides a consistent management, security, and governance experience across AWS, Azure, GCP, and private cloud deployments from a single control plane. Cloudera's particular strength is its enterprise-grade security and compliance architecture, which makes it the preferred choice in industries with strict data residency requirements, air-gapped environments, or regulatory mandates that prevent data from leaving specific infrastructure. For organisations managing hybrid data architectures - where some data must remain on-premises and other workloads run in the cloud - CDP's ability to present a unified platform across both environments is a genuine differentiator that the hyperscalers cannot fully replicate.
IBM watsonx.data is IBM's hybrid data lakehouse platform, positioned at the intersection of cost optimisation and AI readiness for large enterprise environments. Built on the open-source Presto and Apache Spark engines and the Apache Iceberg table format, watsonx.data is designed to let organisations run each workload on the most cost-effective compute engine - shifting expensive data warehouse queries to lower-cost open-source engines while keeping the same data in place. Its integration with IBM's broader watsonx AI platform makes it a natural fit for organisations building AI applications within the IBM ecosystem, providing the data foundation that watsonx AI models require. IBM watsonx.data is most relevant for existing IBM customers that want to modernise their data architecture while retaining the security, governance, and enterprise support model they already have in place, and for organisations with significant legacy workloads that need to reduce warehouse compute costs while transitioning toward a modern lakehouse.
Dremio (SAP) is an open lakehouse platform focused on delivering fast, governed analytics directly on data lake storage without requiring data to be copied into proprietary analytical systems first. Its query engine is built on Apache Arrow - a high-performance in-memory columnar format - and it queries Apache Iceberg tables at interactive speeds directly on object storage across AWS, Azure, and GCP. In 2026 Dremio has expanded its agentic capabilities, using AI to automate query acceleration, workload management, and data discovery - making it easier for analytical teams to find and use trusted data without deep platform expertise. Dremio also provides an open-source Apache Polaris catalogue implementation, allowing organisations to manage Iceberg metadata in a vendor-neutral way. Its primary appeal is to organisations that want strong analytics performance on open data formats, want to avoid data duplication between their lake and their analytics layer, and are committed to an open-standard architecture that avoids proprietary lock-in.
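To illustrate what a vendor-neutral catalogue buys you, the sketch below reads an Iceberg table through a REST catalogue endpoint - the interface an Apache Polaris deployment exposes - using the open-source PyIceberg library. The catalogue URI, credential, and table name are all hypothetical.

```python
# A minimal sketch of catalogue-neutral Iceberg access, assuming
# pip install "pyiceberg[s3fs,pyarrow]" and a REST catalogue endpoint such
# as an Apache Polaris deployment; URI, credential, and table are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",  # hypothetical endpoint
        "credential": "client_id:client_secret",           # placeholder OAuth pair
    },
)

table = catalog.load_table("sales.orders")    # hypothetical namespace.table
print(table.schema())                         # metadata from the catalogue
df = table.scan(limit=10).to_pandas()         # data read directly from storage
print(df)
```

Because the catalogue speaks an open protocol, the same table is reachable from Dremio, Spark, or any other Iceberg-aware engine without re-registering or copying the data.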
How to Select Data Lake Software
Data lake and lakehouse platform selection is one of the most consequential technology decisions a data organisation can make - these platforms sit at the foundation of the entire data and AI stack, and changing them later is expensive and disruptive. The evaluation deserves proportionate rigour, and that starts with a clear-eyed view of the constraints that will shape the decision before any vendor is contacted.
Cloud strategy is the most important constraint. Most organisations evaluating data lake platforms in 2026 already have a primary cloud provider relationship, and the data platform choice needs to be assessed in that context. The native capabilities of AWS, Azure, and Google Cloud are each strongest within their own ecosystems, and the integration, security, and cost advantages of staying within a single cloud are significant. Independent platforms such as Databricks and Snowflake offer genuine multi-cloud portability, which matters for organisations that intentionally distribute workloads across clouds or want to avoid cloud lock-in at the data layer. Hybrid requirements - where some data must remain on-premises for regulatory or operational reasons - narrow the field considerably toward platforms with proven private cloud capability, principally Cloudera and IBM.
Workload profile is the second major constraint. Data lake platforms differ significantly in what they are optimised for. Platforms built around Spark - principally Databricks - deliver the deepest support for large-scale data engineering pipelines, machine learning, and custom model training. SQL-first platforms such as Snowflake and BigQuery deliver the strongest performance and lowest administrative overhead for concurrent analytical workloads. Integrated platforms such as Microsoft Fabric are optimised for reducing tool sprawl in Microsoft-invested organisations. Open query layer platforms such as Dremio are optimised for fast, governed analytics on open-format data without proprietary storage commitment. A mismatch between workload profile and platform architecture is one of the most common and most expensive mistakes in data lake selection.
Team capability is the third variable that is frequently underweighted. Databricks' power comes with an engineering overhead that requires experienced Spark practitioners to manage effectively. Microsoft Fabric's unified SaaS experience reduces that overhead but constrains flexibility. The right platform is the one your team can operate productively at the scale you need, not the most technically sophisticated option in the market.
The Viewpoint Analysis Rapid RFI provides a structured way to assess the data lake market quickly and reach a shortlist - covering cloud architecture fit, workload coverage, governance, open-standard support, and commercial model in a framework that does not require buyers to build the evaluation methodology from scratch. Once a shortlist is established, the Rapid RFP delivers a lean, time-boxed selection process that reaches a vendor decision in weeks. For buyers under time pressure, the 30-Day Technology Selection combines both into a single end-to-end process.
The Enterprise Software Selection Playbook 2026 provides the full methodology for buyers who want to run a rigorous, defensible selection process from first principles.

Summary
The data lake category has matured decisively in 2026. The original promise of cheap, flexible storage has been fulfilled, and the architecture has evolved well beyond it: the modern data lakehouse adds transactional reliability, governance, and warehouse-grade query performance to the economics of object storage, underpinned by open table formats that reduce vendor lock-in and support AI workloads at scale. The vendor landscape divides into independent lakehouse leaders (Databricks, Snowflake), cloud hyperscaler platforms (Microsoft Fabric, AWS Lake Formation, Google BigLake), hybrid and enterprise specialists (Cloudera, IBM watsonx.data), and open analytics specialists (Dremio) - each reflecting a different architectural philosophy and serving a different buyer context.
Three takeaways stand out for buyers making a decision in 2026. First, let your cloud strategy shape the shortlist before you evaluate features - the integration, security, and operational advantages of staying within your primary cloud provider are real, and the independent platforms need to make a compelling case to overcome them. Second, match the platform to the dominant workload - a data engineering and ML-heavy environment and a SQL analytics environment have different platform requirements, and the wrong architecture choice compounds over time as the platform becomes embedded. Third, take open standards seriously as a selection criterion - platforms built on Apache Iceberg and Delta Lake give your data organisation more flexibility than proprietary storage formats, which matters increasingly as AI workloads require data to be accessible across a wider range of compute engines and tools.
How Viewpoint Analysis Can Help
Viewpoint Analysis works with enterprise and mid-market organisations to find and select the right data lake and lakehouse platform - independently, with no vendor fees and no bias. Whether you are starting your architecture review or already evaluating a shortlist, the following resources and services can help you move faster and make a better decision.
To generate a tailored longlist of vendors matched to your cloud environment, workload profile, and requirements, the Longlist Builder is free and takes a few minutes. To get the right vendors presenting their solution directly to your team, the Technology Matchmaker Service handles the briefing and vendor engagement on your behalf.
For structured selection support, the Rapid RFI provides a fast market assessment and shortlisting process, the Rapid RFP takes you to a vendor decision in weeks, and the 30-Day Technology Selection combines both for buyers who need to move fast. The Enterprise Software Selection Playbook 2026 is the definitive reference for running a rigorous end-to-end selection process.
Talk to Viewpoint Analysis
If you are currently evaluating data lake or lakehouse platforms and would like independent guidance on the options, request a call and we will be happy to help. If you are a vendor in this space and would like to be considered for future content and matchmaking opportunities, we would also like to hear from you.
