Data lake

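A data lake is a system or repository that stores data in its raw, original format, usually as object blobs or files.^[1] A data lake is typically a single store for all of an organization's data, including raw copies of source-system data, sensor data, and social data,^[2] as well as transformed data used for reporting, visualization, data analysis, and machine learning. It can hold structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).^[3] A data lake can be established "on premises" (within an organization's own data centers) or in the cloud (using cloud services from vendors such as Amazon, Microsoft, or Google).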

A poorly built data lake is also called a data swamp: users either cannot access it, or the data it holds provides little value.^[4]^[5]
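As a rough illustration of the ideas above, the following Python sketch stores incoming payloads byte-for-byte in their original format and appends a small metadata record for each one so the data stays findable. The directory layout (`raw/<source>/`), the `catalog.jsonl` file, and the function names `ingest_raw` and `find_by_source` are assumptions made for this example; they do not correspond to any particular data lake product.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, name: str, payload: bytes) -> dict:
    """Store payload unchanged under raw/<source>/ and record a catalog entry."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # written as-is: no parsing, no schema applied

    entry = {
        "path": target.relative_to(lake_root).as_posix(),
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    # Hypothetical append-only catalog in JSON Lines format.
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def find_by_source(lake_root: Path, source: str) -> list:
    """Return all catalog entries that came from the given source."""
    catalog = lake_root / "catalog.jsonl"
    if not catalog.exists():
        return []
    with catalog.open(encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    return [e for e in entries if e["source"] == source]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        # Semi-structured, structured-export, and binary data side by side.
        ingest_raw(lake, "sensors", "reading.json", b'{"temp_c": 21.5}')
        ingest_raw(lake, "crm", "customers.csv", b"id,name\n1,Ada\n")
        ingest_raw(lake, "sensors", "frame.bin", bytes([0, 255, 16]))
        for e in find_by_source(lake, "sensors"):
            print(e["path"], e["size_bytes"])
```

Without a catalog of some kind, the files would still be stored, but nothing would record where they came from or what they contain, which is one way a lake drifts toward a swamp.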

Background

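The term is said to have been coined by James Dixon, then chief technology officer at Pentaho, to contrast it with the data mart.^[6] A data mart is a comparatively small repository of valuable attributes derived from raw data.^[7] In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos."^[8] In its study on data lakes, PwC noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."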

Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Impetus Technologies, Cloudera, and Amazon all have data lake offerings.^[9]

Examples

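Many companies use cloud storage services such as Azure Data Lake and Amazon Web Services Lake Formation, or distributed file systems such as Apache Hadoop.^[10] Academic interest in data lakes is also growing. One example is the personal data lake at Cardiff University, which aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.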

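Early data lakes (Hadoop 1.0) had limited capabilities: batch-oriented MapReduce was the only data processing paradigm available. Working with such a data lake required the ability to implement MapReduce jobs in Java, along with familiarity with higher-level tools such as Apache Pig and Apache Hive (which are themselves batch-oriented).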
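The MapReduce paradigm mentioned above can be sketched in a few lines. The example below is a single-process Python illustration of the map, shuffle, and reduce phases, using a word count as the job. It does not use the Hadoop API (which is Java-based) and does no distribution across machines; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for each whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


def run_job(records):
    """Run the word-count job over an iterable of text lines."""
    pairs = map_phase(records, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


if __name__ == "__main__":
    logs = ["error disk full", "warn disk slow", "error network down"]
    print(run_job(logs))
```

Every computation has to be expressed as a mapper and a reducer over a complete batch of input, which is part of why this model was considered limiting as the only way to query a lake.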

Criticism

Poorly managed data lakes are commonly called "data swamps".^[11]

In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".^[12]

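PwC was also careful to note in its research that not all data lake initiatives succeed. It quotes Sean Martin, CTO of Cambridge Semantics:

"We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what's there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."^[8]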

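PwC describes companies that succeed in building data lakes as gradually maturing their lake as they work out which data and metadata are important to the organization.

Another criticism is that the concept is fuzzy and arbitrary: it refers to any tool or data management practice that does not fit into the traditional data warehouse architecture. The data lake has variously been referred to as a specific technology, labeled a raw data reservoir or a hub for ETL offload, and defined as a central hub for self-service analytics. The concept has been overloaded with so many meanings that the usefulness of the term is questionable.^[13]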

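McKinsey noted that a data lake should be viewed as a service model for delivering business value within the enterprise, not as a technology outcome.^[14]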

References

[1] The growing importance of big data quality. [2019-12-20]. (Archived from the original on 2019-12-20).
[2] What is a data lake?. aws.amazon.com. [12 October 2020]. (Archived from the original on 2023-04-05).
[3] Campbell, Chris. Top Five Differences between DataWarehouses and Data Lakes. Blue-Granite.com. [19 May 2017]. (Archived from the original on 2017-09-15).
[4] Olavsrud, Thor. 3 keys to keep your data lake from becoming a data swamp. CIO. [2017-07-05]. (Archived from the original on 2017-07-10).
[5] Newman, Daniel. 6 Steps To Clean Up Your Data Swamp. Forbes. [2017-07-05]. (Archived from the original on 2017-08-03).
[6] Woods, Dan. Big data requires a big architecture. Tech. Forbes. 21 July 2011 [2019-12-20]. (Archived from the original on 2019-09-02).
[7] Dixon, James. Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog. James. [7 November 2015]. (Archived from the original on 2019-12-20). If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
[8] Stein, Brian; Morrison, Alan. Data lakes and the promise of unsiloed data (pdf) (Report). PricewaterhouseCoopers.
[9] Weaver, Lance. Why Companies are Jumping into Data Lakes. blog.equinox.com. [19 May 2017]. (Archived from the original on 2019-12-20).
[10] Tuulos, Ville. Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances. 22 September 2015 [2019-12-20]. (Archived from the original on 2019-05-02).
[11] 3 keys to keeping your data lake from becoming a data swamp. CIO. [2024-05-24]. (Archived from the original on 2023-12-09).
[12] Needle, David. Hadoop Summit: Wrangling Big Data Requires Novel Tools, Techniques. Enterprise Apps. eWeek. 10 June 2015 [1 November 2015]. Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes. [dead link]
[13] Are Data Lakes Fake News?. Sonra. 2017-08-08 [2017-08-10]. (Archived from the original on 2018-08-21).
[14] A smarter way to jump into data lakes | McKinsey. www.mckinsey.com. [2024-05-24]. (Archived from the original on 2024-05-24).
