Constraints for Agile Processing during the Computation of Bulk Data in Warehouse

Introduction to Data Engineering

Every one of us contributes to huge quantities of data in our day-to-day lives. However, we mostly accumulate data without actually analysing it for any purpose. There is a dedicated field for handling data: Data Science, the study of data to extract meaningful insights for business.

It combines principles from mathematics, statistics, artificial intelligence, and computer engineering to help data scientists analyse data and produce results. These results can answer questions such as what happened, why it happened, what will happen, and what can be done.

We have tons of data from e-commerce, medicine, finance, and many other domains. The data can be raw and unstructured or structured, and it can be in audio, video, or image formats. Data Science is important because it covers the tools, methods, and technologies used to handle data. Understanding data science helps us handle different kinds of data: collect it, clean it, and store it all in one place, called a Data Warehouse, before we start analysing it.

Let’s quickly look at the five-step process for handling data, usually called the OSEMN data science process.

Obtain Data

  • Data can be pre-existing, newly acquired, or taken from a data repository. Internal databases, web-server logs, and social media might also be sources of data.

Scrub Data

  • This is the process of data cleaning, i.e. converting raw, unstructured data into structured data. It includes handling missing data, fixing data errors, and removing outliers.

Explore Data

  • This is the initial phase of data analysis, where data scientists try to understand the data and use visualization tools to identify any patterns that might be helpful.

Model Data

  • Machine learning algorithms and other software algorithms may be applied to the data here, to derive insights and predict outcomes.

Interpret Data

  • This is the data summarization step, where analysts convert insights into data visuals. These could include graphs, diagrams, and charts that interpret the results.
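As a minimal, hedged sketch of the Scrub and Explore steps, the Python snippet below assumes a hypothetical sales.csv file with region and amount columns; the file name, columns, and cleaning rules are illustrative, not part of any specific project.

    import pandas as pd

    # Obtain: load raw data (hypothetical file with columns order_id, region, amount)
    df = pd.read_csv("sales.csv")

    # Scrub: handle missing values, fix types, and trim extreme outliers
    df = df.dropna(subset=["amount"])
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df[df["amount"] <= df["amount"].quantile(0.99)]

    # Explore: quick summary statistics and a per-region pattern
    print(df.describe())
    print(df.groupby("region")["amount"].mean())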

Data Storage

  • We now understand that the data used by data engineers needs to be located in a single place. There are several terms related to data storage.

Data Lake

  • It is a storage repository for storing structured, semi-structured, or unstructured raw data.
  • No data structure or schema is required at the time the data is captured.
  • Data from IoT devices, websites, mobile apps, social media, and corporate applications (relational and non-relational) can be stored here.
  • Different types of analytics, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, can be performed to derive insights from the data.
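To make the idea of capturing raw data without a schema concrete, here is a small sketch that lands heterogeneous JSON events in a folder acting as a data lake; the folder path and event fields are assumptions made for illustration.

    import json
    import uuid
    from pathlib import Path

    # Hypothetical raw landing zone of a data lake: events are stored as-is,
    # with no schema enforced at write time (the schema is applied on read).
    lake = Path("datalake/raw/events")
    lake.mkdir(parents=True, exist_ok=True)

    events = [
        {"source": "mobile_app", "type": "click", "screen": "home"},
        {"source": "iot_sensor", "unit_id": 7, "temperature_c": 21.4},  # different shape, still accepted
    ]

    for event in events:
        (lake / f"{uuid.uuid4()}.json").write_text(json.dumps(event))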

Data Warehouse 

  • A storage repository, but unlike data lakes, warehouses are meant only for structured data.
  • It facilitates storing, analysing, and interpreting data, and helps in better decision making.
  • The data structure or schema is pre-defined, to facilitate faster SQL queries.
  • Transactional systems and relational databases are typical sources of data for a Data Warehouse.

A Data Warehouse typically has the following four characteristics:

Subject-oriented 

  • Topic-wise information, such as Sales, Promotion, or Inventory, can be easily extracted, depending on the schema.

Integrated

  • Data is consistent, which helps in effective data analysis.

Non-volatile

  • Data, once entered into a Data Warehouse, is read-only, which prevents data tampering.

Time-Variant primary key

  • All the data stored in the Warehouse includes an element of time, which makes it unique. The primary key is usually time-variant, i.e. a combination of date, day, year, and time.
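As a small, hedged sketch of a pre-defined schema with a time-variant key, the snippet below uses SQLite purely as a stand-in for a real warehouse engine; the table and column names are illustrative assumptions.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse engine

    # The schema is defined up front, and the primary key combines a business key
    # with a date, so every fact row carries an element of time (time-variant).
    conn.execute("""
        CREATE TABLE sales_fact (
            product_id INTEGER NOT NULL,
            sale_date  TEXT    NOT NULL,  -- e.g. '2024-01-15'
            region     TEXT,
            amount     REAL,
            PRIMARY KEY (product_id, sale_date)
        )
    """)
    conn.execute("INSERT INTO sales_fact VALUES (1, '2024-01-15', 'EU', 120.0)")
    conn.execute("INSERT INTO sales_fact VALUES (1, '2024-01-16', 'EU', 90.0)")

    # Subject-oriented query: total sales per day and region
    for row in conn.execute(
        "SELECT sale_date, region, SUM(amount) FROM sales_fact GROUP BY sale_date, region"
    ):
        print(row)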

How Does a Data Warehouse Work?

As we have seen earlier, Data Warehouses combine data from multiple sources into a single repository. They can collect data from social media networks such as Facebook, Twitter, and LinkedIn, customer information from an organization and its website, and so on.

Several tools might be required to perform operations on such extensive data sets. With the help of these tools, Data Warehouses support data sorting, filtering, and merging. These tools can be categorized as follows:

  • Query and reporting tools
  • Application Development tools
  • Data mining tools
  • OLAP tools

Some popular Data Warehouse tools are Xplenty, Amazon Redshift, Teradata, Oracle 12c, Informatica, IBM Infosphere, Cloudera, and Panoply. 
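As a tiny, tool-agnostic illustration of the sorting, filtering, and merging mentioned above, the pandas sketch below combines two hypothetical source tables; the table contents and column names are assumptions.

    import pandas as pd

    # Two hypothetical source extracts feeding the warehouse
    orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120.0, 35.0, 60.0]})
    customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["EU", "US", "EU"]})

    merged = orders.merge(customers, on="customer_id")      # merging
    merged = merged[merged["amount"] > 50]                  # filtering
    merged = merged.sort_values("amount", ascending=False)  # sorting
    print(merged)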

Need for Agile Data Warehousing

The Data Warehouse is the heart of Business Intelligence (BI) systems, which rely on insights from data analysis.

Waterfall techniques were traditionally used for Data Warehousing projects, prioritizing task completion over speed. As a result, the purpose of a project was often outdated by the time it was delivered. Many organizations were unwilling to accept such high risk in their projects, so the next option was to adopt Agile methodologies in Data Warehousing, which gave rise to the buzzword “Agile Data Warehousing”.

Agile Data Warehousing effectively means applying some (if not all) practices of the Agile methodology to Data Warehousing projects. “Delivery” is the core of the Agile process, which uses an iterative approach of sprints to deliver value to the stakeholders.

Agile Vs Waterfall Approach

The traditional Waterfall approach to Data Warehousing requires all the specifications and requirements to be planned well before the implementation phase. New requirements or changes are entertained only after the major chunk of the project is done. This means less adaptability to change and longer durations to incorporate changes, which in turn means longer delivery times.

Agile Data Warehousing emphasizes delivery of an end-to-end Minimum Viable Product (MVP). It achieves this through iterative cycles: the minimum requirements are planned and targeted in the first iteration, called a sprint, and further sprints are planned based on the delivery of the previous one. This gives more room for requirement changes and allows them to be incorporated more easily and quickly, ensuring improved delivery speed as well as frequent, progressive outcomes.

High delivery costs and delays, minimal interaction with stakeholders (once in six months), and ever-changing needs are some of the downfalls of the Waterfall model. With the increasing complexity of Data Warehouses and the multi-fold increase in the amount of data, the inefficiencies of the Waterfall model became more evident, resulting in the advent of next-generation Data Warehousing.

The Agile methodology is able to impress stakeholders with minimal expenses and a shorter span to start development. Business partners and stakeholders consider Agile a must in such projects, owing to its swift results and easier adaptation to changes.

Benefits of Agile Data Warehouse 
  • Improved Delivery Speed - When results can be seen quicker, better judgements can be made sooner. Stakeholders are given transparent updates about the project progress.
  • Enhancement and Improvement Framework – Supports incorporating improvements in subsequent sprints. It is a cycle of Plan, Design, Execute, Review, followed by Retrospection.
  • Project Transparency – Improved transparency to stakeholders, due to frequent interactions and reviews, organized every sprint.
  • Predictability of Delivery – Ensures improved forecasting of project delivery. With better predictability, stakeholders and the business teams have a clear idea, as to what to expect in future sprints.
  • Better risk management – Smaller tasks mean easier identification of failures and quicker solutions. This is better than a catastrophic failure at the end of six months, which could bring the entire project to a halt.


Constraints of Agile Data Warehousing

So far, we have seen Data Warehousing in detail, its benefits, the Agile methodology in Data Warehousing, and so on. An Agile team promises to deliver in small but consumable increments, and the results are evaluated quickly through reviews, so the team can adapt to change easily. However, we cannot consider Agile to be entirely positive, without any negatives, when applied to certain projects, especially data-related ones. Let’s look at some of the aspects mentioned in the Agile Manifesto and how they apply to Data Warehousing projects.

  • Individuals and interactions over processes and tools - It’s quite difficult to have good interaction in very large teams (Data Warehousing teams are usually large).
  • Customer collaboration over contract negotiation – Regular communication between the various teams and the stakeholders consumes time and is an overhead.

For Agile to work or fit into Data Warehousing projects, a few points need to be addressed and agreed upon by the development team:
  • Data Model will never be final – In Agile, we break a bigger Data Warehousing project down into smaller chunks, without looking at the complete requirement. The question remains how we can build a robust data model without looking at every detail of the Data Warehouse.
  • Full historical data loads will occur multiple times throughout the project or phase - This differs from the Waterfall model, where historical data is loaded only once. In Agile, reloads are needed because the data model changes as the project progresses.
  • Changes in requirements will cause potential changes to everything - Some amount of re-work is unavoidable with the Agile approach, as the project moves into further sprints.
  • Automated test development must be included in the project or phase – More frequent changes must be accompanied by a good test framework to keep delivery quality high, so investing time and people in a testing framework is a must (see the small test sketch after this list).

When the team agrees to the above-mentioned clauses, it accepts that rework for the same ETL job is unavoidable, and a balance between incremental delivery and rework has to be found. In addition to these clauses, there could be some more worry points for Agile Data Warehousing projects, as mentioned below.
  • Initial Data-Architecture understanding phase could be long - Data Warehousing projects usually take a good amount of pre-sprint time to build this understanding.
  • Delivering value every sprint - Delivering value-added output in every sprint can be a challenge.
  • Need to install required infrastructure - Setting up minimal functional infrastructure in the initial phases of the project can be time-consuming.
  • Shortage of Agile Data Manpower - There could be a shortage of people with Agile Data Warehousing knowledge.
  • Data Analytics usually takes longer than a single sprint - Completion of data analytics tasks could take longer than a sprint, even with good planning in place. Things might improve after a few sprints.
  • Product owners might not be completely aware of the entire data - Product owners must be coached to have a good understanding of their data, with the involvement of Agile Data Professionals.
  • Challenges associated with data sources - There could always be unforeseen challenges, when multiple data sources are involved.
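As an illustration of the automated test development point above, here is a minimal sketch of a unit test for a hypothetical ETL transformation; the function, its cleaning rules, and the expected behaviour are assumptions, not part of any specific toolchain.

    # Hypothetical ETL transformation plus a unit test (runnable with pytest or plain Python).
    def transform(records):
        """Drop rows with a missing amount and normalise region codes to upper case."""
        cleaned = []
        for rec in records:
            if rec.get("amount") is None:
                continue
            cleaned.append({**rec, "region": rec.get("region", "").upper()})
        return cleaned

    def test_transform_drops_missing_and_normalises_region():
        raw = [
            {"amount": 10.0, "region": "eu"},
            {"amount": None, "region": "us"},  # should be dropped
        ]
        assert transform(raw) == [{"amount": 10.0, "region": "EU"}]

    if __name__ == "__main__":
        test_transform_drops_missing_and_normalises_region()
        print("ETL transform test passed")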

Having seen the benefits and constraints of Agile Data Warehousing, it is necessary to decide whether an Agile approach suits the project, depending on the project’s requirements. After all, implementing any Agile framework (Scrum, Kanban) is useless without strong commitment: good teamwork, transparent communication, fast feedback from stakeholders, and fast-responding developers.

Scope @ N9 IT Solutions:

  • N9 IT Solutions is a leading IT development and consulting firm providing a broad array of customized solutions to clients throughout the United States. 
  • It was established primarily with the aim of providing consulting and IT services in today’s dynamic environment.
  • N9 IT also offers consulting services in many emerging areas, such as Java/J2EE, Cloud Computing, Database Solutions, DevOps, ERP, Mobility, Big Data, Application Development, Infrastructure Managed Services, and Quality Assurance and Testing.
