Data Lake – An Overview
Publish Date: June 13, 2017
The explosion of new types of data in recent years has put tremendous pressure on the datacentre, both technically and financially. An architectural shift is underway in which Enterprise Hadoop plays a key role in the resulting modern data architecture. The realization that unstructured data and big data can also be analyzed for business insights has led to the concept of the data lake.
Compared with a data warehouse, a data lake offers ample, highly available storage at lower cost, together with increased agility and flexibility of use. A data lake can also help democratize data, giving users more ways to ask new questions as they go.
Some of the future challenges for data, as we see them, are:
- The data is typically sub-transactional or non-transactional
- There are many unknown questions that will arise in the future
- There are multiple user communities that have questions of the data
- The data is of a scale or daily volume such that it will not fit technically and/or economically into an RDBMS
In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:
- Only subsets of the attributes are examined, so only pre-determined questions can be answered
- The data is aggregated so visibility into the lowest levels is lost
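To make these two problems concrete, here is a minimal Python sketch using made-up clickstream events (the field names and values are hypothetical, chosen only for illustration). A mart that counts page views per day answers the question it was designed for, but a question about individual users can only be answered from the raw events.

```python
from collections import Counter

# Hypothetical raw events, as they might arrive before any aggregation.
raw_events = [
    {"day": "2017-06-01", "user": "u1", "page": "/home", "browser": "Firefox"},
    {"day": "2017-06-01", "user": "u2", "page": "/home", "browser": "Chrome"},
    {"day": "2017-06-01", "user": "u1", "page": "/pricing", "browser": "Firefox"},
    {"day": "2017-06-02", "user": "u3", "page": "/home", "browser": "Chrome"},
]


def build_mart(events):
    """Aggregate to page views per (day, page); user and browser are dropped."""
    return Counter((e["day"], e["page"]) for e in events)


def users_who_viewed(events, page):
    """A question nobody planned for: which users viewed a given page?"""
    return sorted({e["user"] for e in events if e["page"] == page})


mart = build_mart(raw_events)

# The pre-determined question works against the mart.
views_home_june_1 = mart[("2017-06-01", "/home")]

# The new question needs the raw events: the mart keys hold no user attribute.
pricing_viewers = users_who_viewed(raw_events, "/pricing")
```

The mart keys contain only the day and the page, so no query against it can recover which users or browsers were involved. Keeping the raw events is what leaves that option open.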
Given these challenges and the problems with the traditional approach, a new concept called the Data Lake is gaining traction as a description of a better solution. A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The contents of the data lake stream in from multiple sources to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
The term Data Lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers. One of the benefits usually touted about data lakes is that you do not have to know beforehand exactly how the data will eventually be used. Well, there’s both truth and myth in this statement.
You do need to have some use cases in mind, because this helps you structure and implement the data lake so that later queries and analysis work as intended. Use cases for a data lake do not have to be as honed and refined as those for a traditional SQL-type database, but you do need to conduct due diligence with regard to discovery. What are some potential use cases for your big data initiative? Set up the data lake according to the potential use cases you have identified.
What to watch for to ensure a successful data lake implementation
Multiple tools and products: Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor.
Domain specification: The Data Lake must be tailored to the particular industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data.
Automated metadata management: The Data Lake concept relies on capturing a robust set of attributes, such as data lineage, data quality, and usage history, which are vital to usability. Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility. Without a high degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.
Configurable ingestion workflows: In a thriving Data Lake, business users will continually discover new sources of external information. A configuration-driven ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from these new sources.
Integration with the existing environment: The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods.
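As a rough illustration of how configuration-driven ingestion and mandatory metadata capture can work together, consider the Python sketch below. Everything in it is an assumption made for the example: the source names, the metadata fields, and the in-memory dictionaries standing in for lake storage and a catalog. It is not the API of Hadoop or of any particular product.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical source registry: onboarding a new source means adding an entry.
SOURCES = {
    "web_logs": {"format": "text", "owner": "web-team"},
    "crm_export": {"format": "json", "owner": "sales-ops"},
}

lake = {}      # stand-in for raw storage: key -> bytes
catalog = []   # stand-in for a metadata catalog


def is_valid(payload, fmt):
    """A deliberately crude quality check, used only for illustration."""
    if fmt == "json":
        try:
            json.loads(payload)
        except ValueError:
            return False
        return True
    return len(payload) > 0


def ingest(source_name, payload, origin):
    """Store the payload unchanged and always record metadata about it."""
    if source_name not in SOURCES:
        raise KeyError(f"unregistered source: {source_name}")
    config = SOURCES[source_name]
    checksum = hashlib.sha256(payload).hexdigest()
    key = f"{source_name}/{checksum}"
    lake[key] = payload  # raw bytes, native format
    entry = {
        "key": key,
        "source": source_name,
        "format": config["format"],
        "owner": config["owner"],
        "origin": origin,                    # lineage: where the data came from
        "size_bytes": len(payload),
        "sha256": checksum,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "valid": is_valid(payload, config["format"]),  # quality flag
    }
    catalog.append(entry)
    return entry
```

Calling ingest("crm_export", b'{"id": 1}', "sftp://example.invalid/crm") would store the bytes untouched and append one catalog entry. Because metadata capture sits inside the only ingestion path, no data can enter this toy lake without a record of its source, owner, and origin.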
The Data Lake Architecture
The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from web server logs, databases, social media, and third-party sources is ingested into the Data Lake. Curation takes place by capturing metadata and lineage and making them available in the data catalog.
Data can flow into the Data Lake through either batch processing or real-time processing. Additionally, the data itself is no longer constrained by initial schema decisions and can be exploited more freely by the enterprise. Rising above this repository is a set of capabilities that allow IT to provide Data as a Service (DaaS) in a supply-demand model. The IT team takes the role of the data provider (supplier), while business users (data scientists, business analysts) are the consumers.
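The idea that data is not bound by initial schema decisions is often called schema-on-read. A minimal Python sketch, assuming hypothetical sensor records stored as JSON lines, shows two consumers applying different schemas to the same raw data at read time:

```python
import json

# Raw records kept exactly as they arrived (hypothetical sensor readings).
raw_lines = [
    '{"device": "a1", "temp_c": 21.5, "ts": "2017-06-13T10:00:00"}',
    '{"device": "a2", "temp_c": 19.0, "ts": "2017-06-13T10:00:05", "humidity": 40}',
    '{"device": "a1", "ts": "2017-06-13T10:01:00"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema`, a mapping of field -> converter.

    Fields missing from a record come back as None instead of failing,
    since raw data in a lake is not guaranteed to be uniform.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two consumers, two schemas, one copy of the raw data.
temperature_view = list(read_with_schema(raw_lines, {"device": str, "temp_c": float}))
humidity_view = list(read_with_schema(raw_lines, {"device": str, "humidity": int}))
```

Neither view required the stored records to be changed or reloaded. The humidity field, which only one device reports, became usable without anyone having planned for it at ingestion time.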
Most organizations are in the process of building a brand-new Hadoop data lake. You can add to and modify your infrastructure as you get accustomed to both Hadoop and working with a data lake. The successful Hadoop journey typically starts with new analytic applications, which lead to a Data Lake. In the future, more and more applications will derive value from the new types of data coming from sensors and machines, server logs, clickstreams, and other sources. The resulting Data Lake, with Hadoop acting as a shared service, will deliver deep insight across a large, broad, and diverse set of data efficiently.
Anand Agrawal : Sr Technology Professional – Innovation Group – Big Data | IoT | Analytics | Cloud
Senior Tech Lead