Data Lake

What is Data Lake Architecture?

Data is the essential component of any organization. When challenges threaten an organization's ability to use its data correctly, the organization itself is at risk of failure. Risk has to be avoided, transferred, mitigated, or accepted, which makes data-driven risk management a necessary capability for all organizations. Simply accepting risks to data and dealing with the consequences afterward should be avoided wherever possible.

A data lake is a central repository of data from various sources that allows organizations to store all their raw data, both structured and unstructured. Data is stored as-is, without having to be reformatted or restructured first. Organizations need data lakes to perform research, do analysis, and improve decision support. Better decision support can create a competitive advantage over peers and enhance service to customers.

Understanding and improving products and services, customer analysis, employee productivity, and overall operational efficiency and outcomes underlie all organizational strategies, tactics, and operations. The best way to enable this is to use data for decision support. Data contained and managed efficiently and effectively in a data lake can help with challenges in these areas. Data used in this fashion helps empower digital transformation across the organization.

What is Cloud Data Lake Architecture?

A cloud data lake architecture is a series of modules, some required, others optional, that define a common data repository at the group, department, or enterprise-wide level for all types of data in their native format to be brought together for various groups to process and analyze.

The fundamental concepts of the Data lake architecture include:

  • Security – This is always a concern. Like any other IT architecture today, it has to be implemented in every layer of the Data Lake to manage threats and vulnerabilities.
  • Data Ingestion and Movement – Data and data types from different sources have to be managed as they are loaded into the data lake from batch, real-time, or other systems.
  • Data Governance – Governance, Risk, and Compliance (GRC) has to be managed for overall usability, integrity, confidentiality, and availability of data in an organization against both internal enterprise guidelines as well as external regulatory mandates.
  • Data Quality – Quality always has to be maintained to derive business value from the data; bad quality leads to bad decisions.
  • Data Analytics – Analytics for decision support is the main reason for a data lake.
  • Data Discovery – Data has to be discovered first before any usage, especially for analytics. Critical sources for data have to be identified and managed.
  • Data Recovery – Where a data lake supports business continuity, data recovery has to be planned and tested.
  • Data Auditing – Auditing is a necessity for risk management, governance, and creating compliance standards.
  • Data Storage – Using the Cloud and/or hybrid solution, attention should be given to managing storage scalability.
  • Data Lineage – The origin of data has to be managed to ensure data ingestion is done effectively.
  • Data Exploration – For all analytics, data exploration has to be done to identify the correct dataset.
  • Coordination and Collaboration – Data lake is an organizational data store; understanding data usage needs collaboration and coordination across the organization with various teams and stakeholders.
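
Several of these concepts (ingestion of data as-is, lineage, auditing, and quality) can be illustrated together. The following is a minimal Python sketch, not a reference implementation: the in-memory `store` dictionary stands in for object storage, and the function names, key layout, and metadata fields are all hypothetical choices made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def ingest_raw(payload: bytes, source: str, store: dict) -> dict:
    """Store the payload unmodified and return a lineage record for it."""
    digest = hashlib.sha256(payload).hexdigest()
    key = f"raw/{source}/{digest}"  # hypothetical key layout
    store[key] = payload  # raw bytes are kept as-is, with no reformatting
    return {
        "key": key,
        "source": source,  # lineage: where the data came from
        "sha256": digest,  # auditing: lets later changes be detected
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def quality_issues(record: dict, required: list) -> list:
    """Return the required fields that are missing or empty in a record."""
    return [field for field in required if record.get(field) in (None, "")]


store = {}  # stand-in for an object store
payload = json.dumps({"customer_id": 7, "email": ""}).encode()
lineage = ingest_raw(payload, "crm", store)

record = json.loads(store[lineage["key"]])
print(quality_issues(record, ["customer_id", "email"]))  # ['email']
```

Note that the quality check runs when the data is read, not when it is ingested, which matches the idea that raw data lands in the lake without being restructured first.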

The core requirement for a Data lake architecture is an underlying scalable data storage architecture.

  • Initially, this was Hadoop with the Hadoop Distributed File System (HDFS), but that has largely been replaced by cloud object storage, generally on AWS (S3), Azure (ADLS), or Google Cloud (GCS). This should be a single shared repository of data.
  • In all cases, there has to be a robust but minimal management system; YARN became the standard here in Hadoop environments and has carried over to cloud object storage environments as well. Orchestration and job scheduling capabilities should be key characteristics.
  • Virtually all data lake architectures now run in the cloud, decouple compute from storage to support scalability and pay-for-what-you-use models, and support multiple languages and processing engines, including Hive, Spark, and SQL.
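
One practical consequence of a single shared object store is that the key layout does much of the organizing work. The sketch below assumes a date-partitioned layout in the `name=value` style commonly used with Hive and Spark; the zone and dataset names, and both helper functions, are hypothetical.

```python
from datetime import date


def object_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key for one file."""
    return (
        f"{zone}/{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def keys_for_month(keys, zone: str, dataset: str, year: int, month: int) -> list:
    """Select keys by prefix, so a job reads one month without scanning the rest."""
    prefix = f"{zone}/{dataset}/year={year:04d}/month={month:02d}/"
    return [key for key in keys if key.startswith(prefix)]


keys = [
    object_key("raw", "orders", date(2024, 3, 5), "part-0.json"),
    object_key("raw", "orders", date(2024, 4, 1), "part-0.json"),
]
print(keys[0])  # raw/orders/year=2024/month=03/day=05/part-0.json
print(keys_for_month(keys, "raw", "orders", 2024, 3))
```

Because any compute engine can list the same prefixes, storage can grow independently of the clusters that read it.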

Outside of these fundamental pillars, additional data lake architecture design considerations are a function of who will use the system and for what types of work. Initially, data lakes were seen as a tool for data scientists dealing with raw unstructured and semi-structured data, so data lake architecture was focused on developer tools for data ingestion, processing, query, and analytics. Generally, users of the data lake are in roles accustomed to doing analytical work with databases. Still, because of the value of data lakes and emerging tools, the user base can be broadened to other roles.

Data lake architecture is designed for rapid input of raw data, so not much effort is placed on massaging data on input. The other three areas (processing, query, and analytics) require design considerations tied to which data within the lake is the focal point and the actual task at hand.

AWS data lake architecture, Azure data lake architecture, Hortonworks data lake architecture, and Spark data lake architecture all follow these concepts and requirements for data lakes. Each has a consistent approach but differs in the total offerings available using their technologies. Organizations should evaluate each depending on their needs.

Data Lake Design Considerations

Organizations should have a big picture in mind with the usage of their data lake. The organizational intent or strategy should drive the design and use of a Data Lake. Good designs make future decisions easier within the data lake architecture.

Data Lakes should be designed with the following characteristics:

  • Cloud enablement with workload isolation.
  • Multiple tiers – Ingestion, operations, processing, distillation, storage, insights.
  • Ability to add and support users without affecting performance during various workloads.
  • Unique Metadata tagging services for the object storage environment.
  • Effective tools to extract, load, transform, and query data without impact on performance.
  • Multi-cluster shared data architecture.
  • Independent scaling of compute and storage resources.

Many use cases, such as document search and query for researchers in pharma, medicine, or any area of academia, rely on a search engine and a query language that can rapidly parse large sets of documents. In other cases, the data may be semi-structured, for example, mobile and IoT data. There can even be a need to build a relational mapping between various IoT data sets. For example, if pressure and temperature sensors are tied to the measurement of a volume of something, the PVT equation represents a relational mapping between the data tables. Alternatively, data from each of these sources may stream into the data lake and be processed both in real time and later as aggregate data sets with a relationship between them.
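
The pressure and temperature example can be sketched in a few lines of Python. This illustration makes several assumptions that are not in the text above: the PVT relation is taken to be the ideal gas law (PV = nRT), readings from both sensors share a timestamp key, and the amount of gas is known. The function names are hypothetical.

```python
R = 8.314462618  # molar gas constant in J/(mol*K)


def join_by_timestamp(pressure: dict, temperature: dict) -> dict:
    """Inner-join two {timestamp: value} mappings on their shared timestamps."""
    shared = sorted(pressure.keys() & temperature.keys())
    return {ts: (pressure[ts], temperature[ts]) for ts in shared}


def volumes(pressure_pa: dict, temperature_k: dict, moles: float) -> dict:
    """Derive volume in cubic metres per timestamp from V = nRT / P."""
    joined = join_by_timestamp(pressure_pa, temperature_k)
    return {ts: moles * R * t / p for ts, (p, t) in joined.items()}


pressure = {"t0": 101325.0, "t1": 100000.0}  # pascals
temperature = {"t0": 273.15, "t2": 300.0}    # kelvin

# Only "t0" appears in both data sets, so only it yields a volume.
for ts, volume in volumes(pressure, temperature, moles=1.0).items():
    print(ts, round(volume, 6))
```

The join is the relational mapping: neither data set contains a volume, but the relationship between them produces one. The same function could run over a streaming window or over aggregated historical data.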

Regardless of data structure, one of the critical analytics tasks data scientists carry out is some form of AI, whether pattern recognition, such as facial recognition in video data, or natural language processing in documents or audio streams.

The use cases just described, handled mainly by data scientists on research projects, are precisely what has historically differentiated data lake architecture from data warehouse or database architecture. However, some aspects of what is generally found in a data warehouse architecture are cropping up in cloud versions of data lake architectures, for three reasons. First, the democratization of data has been more of a statement than a fact with data lakes: they were limited to use by data scientists and engineers, excluding business users. Second, they tended to run very slowly compared with the speed of data warehouse query returns and ad hoc analytics. Lastly, and most importantly, early data lake architectures did not have much built-in security support, data governance, or cataloging of what was in the data lake.

Cloud Data lake architectures all take advantage of the intrinsic security features of the AWS, Azure, and Google cloud platforms they run on. They all have some form of Data Catalog and data pipeline service to help with the flow of processing data over multiple stages. Further, many implementations of the Data lake architecture provide tools to develop and leverage the metadata associated with the various datasets in the Lake for a range of uses from master data management to semantic operations such as indexing, ontology, and means of ensuring not just higher data quality but optimal use of only the data you need by role.

Data Lake Architecture Adoption

Data lake architecture adoption should be in stages, each with a quick time to value or quick win for the organization. Use available data, then as the project matures and gaps in data are uncovered, mature the data lake.

Stage 1 – Capture, ingest, and take inventory of data and sources, then visualize how current data assets can be used for the organization. While doing this, decide and create methods, practices, and approaches for faster onboarding of new data discoveries.
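
Taking inventory can start very simply. The sketch below assumes a listing of object keys and sizes is already available (for example, from a storage listing) and that the first path segment of each key names the source system; both the input shape and the `build_inventory` name are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_inventory(objects) -> dict:
    """Summarize (key, size_bytes) pairs by source: counts, bytes, formats."""
    inventory = defaultdict(lambda: {"objects": 0, "bytes": 0, "formats": set()})
    for key, size in objects:
        path = PurePosixPath(key)
        entry = inventory[path.parts[0]]  # first segment is assumed to be the source
        entry["objects"] += 1
        entry["bytes"] += size
        # Files without an extension are recorded as "unknown" format.
        entry["formats"].add(path.suffix.lstrip(".") or "unknown")
    return dict(inventory)


listing = [
    ("crm/contacts.csv", 100),
    ("crm/notes.json", 50),
    ("iot/readings", 10),
]
for source, summary in build_inventory(listing).items():
    print(source, summary["objects"], summary["bytes"], sorted(summary["formats"]))
```

A summary like this gives a first view of which sources and formats dominate, which helps when deciding what onboarding practices to standardize first.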

Stage 2 – Build the analytical models for transforming and doing the data analysis tasks. Keep in mind the outcomes that the data supports. Organizations may try different tools and leverage artificial intelligence (AI).

After stage 2, get the data to the consumers, decision-makers, and any other stakeholders of the data. Leveraging the data lake with an enterprise data warehouse can enable this to happen.

The last stage, which never truly ends, is continuous improvement. Improve the enterprise capabilities of the data lake, including data and information lifecycle management. Remember that data lake technology exists to improve business outcomes, so measuring improvements in business outcomes relative to the usage of the data lake is critical.

Be careful of running a data lake IT project that creates "data swamps" of unusable data. Although value can be derived from all types of data, make sure that the value is actually there. Data that has no use affects the performance of both the IT infrastructure and the people using the data to make decisions. Each adoption stage should be mindful of the business relevance of the data being used. Make sure the data has value for decision support within the organization.
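
One way to keep an eye on swamp risk is to flag datasets that nobody owns, describes, or uses. The following is a minimal sketch under stated assumptions: the catalog is a plain dictionary, the field names (`owner`, `description`, `last_accessed`) are hypothetical, and the 180-day idle threshold is an arbitrary illustrative default, not a recommendation.

```python
from datetime import date, timedelta


def swamp_candidates(catalog: dict, today: date, max_idle_days: int = 180) -> dict:
    """Return {dataset: [reasons]} for datasets that look unused or undocumented."""
    cutoff = today - timedelta(days=max_idle_days)
    flagged = {}
    for name, entry in catalog.items():
        reasons = []
        if not entry.get("owner"):
            reasons.append("no owner")
        if not entry.get("description"):
            reasons.append("no description")
        last_accessed = entry.get("last_accessed")
        if last_accessed is None or last_accessed < cutoff:
            reasons.append("idle")
        if reasons:
            flagged[name] = reasons
    return flagged


catalog = {
    "orders": {
        "owner": "sales-ops",
        "description": "Daily order exports",
        "last_accessed": date(2024, 5, 20),
    },
    "legacy_dump": {"owner": "", "description": "", "last_accessed": date(2023, 1, 10)},
}
print(swamp_candidates(catalog, today=date(2024, 6, 1)))
# {'legacy_dump': ['no owner', 'no description', 'idle']}
```

A flagged dataset is a prompt for a conversation with stakeholders about business relevance, not an automatic deletion.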

Conclusion

Data risk management is the responsibility of all functions across the business: marketing, sales, human resources, operations, applications, legal, and so on. Taking a proactive approach by identifying risk, adding controls, and preparing for action can make a world of difference when needed. Do not treat data risk management as an afterthought or as something not worth the investment. It is part of the cost of doing business and should be understood as such. Be wary of shortcuts, and be strategic and comprehensive in the approach.

Critical to managing data risk is the use of technology that can help. Data lake technology can help manage and improve the use of data across the organization, resulting in better customer interactions, service delivery, service design, and day-to-day operations. Identify and define the organization's reasons and goals for the data lake, and keep them in mind throughout the project.