The Role of Agile Processing in the Field of Data Engineering and Computation

Introduction:

The dictionary meaning of “agile” is “able to move quickly and easily”.

Agile development, or agile processing, is a project development method used in the software industry. It is characterized by dividing a large task into smaller tasks that are delivered and assessed at regular, short intervals, leaving more room for changes in requirements and enabling faster adaptation. Agile development stresses the fact that change is inevitable, and that the same applies to software requirements.

It all started in February 2001, when a group of 17 software developers, including Martin Fowler, Jim Highsmith, Jon Kern, Jeff Sutherland, Ken Schwaber, and Bob Martin, met at Snowbird, Utah, to discuss how they could speed up development and bring new software to market faster. The outcome of that meeting was the Agile Manifesto.

In simple words, when a big chunk of work is divided into smaller tasks that are deliverable in shorter periods and showcased at regular intervals, the stakeholders get a clear vision of the project, which helps them identify any requirement changes in the initial phases. Identifying requirement changes in the initial phase, rather than in a later phase of the project, is a big relief to the development team.

The traditional approach to projects is designed for a static environment, usually over longer durations, with a much longer wait before a prototype can be shown to the stakeholder. Incorporating any requirement change in such projects is a huge overhead, as it involves a lot of rework and lost time, which in turn increases the project duration.

Agile organizations are different: they are designed in anticipation of requirement changes and for easy adaptation.

A simple, everyday example shows the difference between the traditional approach and the Agile approach. Let us take the following project statement: “A kid must write the letters A-Z, write different words beginning with each letter, and draw a picture for each.”

Traditional Approach – The kid would first write the letters A-Z, each on a different page of a notebook. Once that is complete, he would write different words beginning with each letter from A-Z, and only then move on to the drawings, again from A-Z.

Agile Approach – The kid would first finish the letter A, its associated words, and the drawing, ready to submit his work for the letter A. He would then move on to the next letter, and so on.

Let us now weigh the two approaches. In the case of any minor mistake on the kid’s part, he will have to rework the entire lot under the traditional approach. Under the Agile approach, he has more room for changes without much rework, meaning faster results and easier adaptation.

Agile Terminology

Agile Methodology is a huge umbrella covering multiple organizational frameworks. Kanban, Scrum, and Sprint are a few of the important terms in Agile.

Kanban

“Kanban” originated in the manufacturing plants of Japan. It is a Japanese word meaning “signboard”, and the technique was developed to prioritize “just in time” delivery, meeting demand rather than creating a surplus of products.

Kanban is characterized by a tracking board that project managers use. It has vertical columns named To Do, In Progress, Done, and Deployed. Simple sticky notes can be used to list all the tasks in the To Do column. These sticky notes are then moved horizontally from To Do to In Progress and on through the other columns as the developer makes progress with the task at hand.
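
To make the board mechanics concrete, here is a minimal Python sketch (purely illustrative, not part of any Kanban standard) that models the board as ordered columns of task cards, with one operation that moves a card one column to the right. The class and method names are invented for this example:

    # kanban_sketch.py - a minimal model of a Kanban board (illustrative)
    COLUMNS = ["To Do", "In Progress", "Done", "Deployed"]

    class KanbanBoard:
        def __init__(self):
            # each column holds the list of task cards currently in it
            self.board = {col: [] for col in COLUMNS}

        def add_task(self, task):
            # new sticky notes always start in the first column
            self.board["To Do"].append(task)

        def advance(self, task):
            # move a card one column to the right, e.g. To Do -> In Progress
            for i, col in enumerate(COLUMNS[:-1]):
                if task in self.board[col]:
                    self.board[col].remove(task)
                    self.board[COLUMNS[i + 1]].append(task)
                    return
            raise ValueError(f"{task!r} not found or already deployed")

    board = KanbanBoard()
    board.add_task("Build data pipeline")
    board.advance("Build data pipeline")  # card is now "In Progress"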

Scrum 

Scrum is one of the agile development methodologies in the software development lifecycle. It is fast, flexible, and effective. Scrum aims at maximum transparency, collective responsibility, and continuous progress.

Scrum is characterized by daily scrum meetings: short stand-up meetings attended by the entire development team, where each team member gives an update by answering three questions: what has been done, what is to be done, and whether there are any blockers. The aim is to keep the meeting as short and crisp as possible while still tracking the project’s progress. The scrum master oversees the scrum meetings and helps resolve blockers for the team members.

Scrum is executed in short, fixed blocks of time called sprints. User stories are high-level tasks given to the development team by the product owner. User stories are picked up during the sprints and split into smaller tasks. Sprints can be as short as 2 weeks and can extend to 4 or 6 weeks, depending on the project’s requirements. Every sprint ends with the following meetings.

  • Review Meeting – The team showcases the development, or progress, made in that sprint.
  • Retrospective Meeting – A meeting scheduled to dig deeper into what went wrong in the previous sprint and the steps needed to correct it.
  • Planning Meeting – A meeting to plan the to-do tasks of the next sprint. Team members estimate the time needed for each task.
  • Backlog Refinement Meeting – A meeting scheduled to decide on the backlog for the sprint. This meeting usually involves the product owner, to prioritize the user stories at hand.

These recurring meetings are scheduled at the end of every sprint, and again the scrum master is responsible for scheduling them. As we can see from the frameworks above, Agile is a great move towards better efficiency and faster output to the stakeholder.

Data Engineering and Computation

Now that we have understood a few important concepts of Agile methodology, let us understand what Data Engineering means.

Data is everywhere around us. We have loads of data in our very own households. Just take a moment to count the photos on your phone, the albums you have created so far, the emails in your inbox, and the gigabytes of data in your cloud account; the list never ends. In organizational terms, data is again everywhere: employee information, CCTV recordings, customer buying patterns, and so on. An entire field has grown up around analysing this data to extract meaningful patterns and insights. Business Intelligence, Data Engineering, and Data Analysis are all based on data and how we use it.

Data engineering covers everything that can be done with humongous quantities of data: designing and building systems to collect, store, and analyse data, which can lead to interesting findings that help organizations. In simple terms, Data Engineering aims at making data more useful for its consumers.

The traditional method of working with data has two basic steps.

Data Preparation: ETL (Extract, Transform, and Load) is the first step; a minimal sketch of its three sub-steps follows the list below.

  • Data Extraction – This is the first step, which aims at extracting data from various sources such as legacy systems and flat files.
  • Data Transformation – This is the next step, which aims at converting the data into a format suitable for loading into the data warehouse (a repository that stores structured data). This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.
  • Data Load – After the data is transformed, it is loaded into the data warehouse. This involves creating physical structures in the data warehouse.
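
As a minimal, hedged illustration of these three steps in Python, the script below extracts rows from a flat file, transforms them (cleaning, validation, and a derived field), and loads them into a SQLite table standing in for the data warehouse. The file name, column names, and table name are all assumptions made for this sketch:

    # etl_sketch.py - a minimal ETL pipeline (illustrative only)
    import csv
    import sqlite3

    # Extract: read raw rows from a flat file (hypothetical file name)
    with open("customers_raw.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: clean and validate the data, and derive a new field
    cleaned = []
    for row in rows:
        email = row.get("email", "").strip().lower()
        if not email:
            continue  # validation: skip records with no email
        cleaned.append({
            "name": row["name"].strip().title(),   # cleaning
            "email": email,
            "domain": email.split("@")[-1],        # new derived field
        })

    # Load: create the physical structure and insert into the warehouse
    conn = sqlite3.connect("warehouse.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS customers
                    (name TEXT, email TEXT, domain TEXT)""")
    conn.executemany(
        "INSERT INTO customers VALUES (:name, :email, :domain)", cleaned)
    conn.commit()
    conn.close()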

Data Analytics and Querying – Once the ETL phase is complete, the next phase is to analyse the data for statistical insights. On a very broad spectrum, data analytics can be divided into the following categories (a small query example follows the list).

  • Descriptive analytics – tells us what happened.
  • Diagnostic analytics – answers why something happened.
  • Predictive analytics – estimates what will likely happen in the future.
  • Prescriptive analytics – tells us how to act.
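
As a small example of the first category, descriptive analytics over the table loaded in the ETL sketch above can be as simple as a GROUP BY query reporting what happened; the table and column names carry over from that assumed example:

    # descriptive_sketch.py - a tiny descriptive-analytics query (illustrative)
    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    # descriptive analytics: "what happened" - customers per email domain
    for domain, count in conn.execute(
            "SELECT domain, COUNT(*) FROM customers GROUP BY domain"):
        print(domain, count)
    conn.close()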

Traditional ETL has its own advantages, such as improved data quality and better integration of data from various sources, but it also has its own set of disadvantages, listed below.

High cost: The ETL process can be expensive to implement across multiple sources.

Complexity: It is a complex process that requires code to be written to suit multiple legacy systems, and hence needs expertise.

Limited flexibility: The ETL process can struggle with unstructured data.

Limited scalability: It might not be able to handle very large quantities of data.

Data privacy concerns: ETL can also raise privacy concerns when dealing with huge quantities of data.

How can Agile methodology help in Data Engineering?

With the availability of modern agile data engineering services, legacy ETL is now taking a back seat. The main difference with agile processing is that the two separate phases of Data Preparation and Data Analytics (from the traditional method of handling data) have been merged into one. To be more precise, the engine for merging and combining data sets sits right in the distributed storage/compute cluster that performs the analytics and queries. Let us look more deeply at how Agile methods can be applied to data engineering projects.
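
A minimal sketch of this merged approach, assuming PySpark as the distributed storage/compute engine: the same cluster job reads the raw data, prepares it, and answers the analytical query, with no separate warehouse-loading phase in between. The source file and column names are hypothetical:

    # spark_sketch.py - preparation and analytics in one engine (illustrative)
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("agile-data-sketch").getOrCreate()

    # Preparation: read and clean the raw data inside the cluster itself
    raw = spark.read.json("orders_raw.json")  # hypothetical source file
    clean = (raw
             .dropna(subset=["customer_id", "amount"])
             .withColumn("amount", F.col("amount").cast("double")))

    # Analytics: query the freshly prepared data in the same job
    (clean.groupBy("customer_id")
          .agg(F.sum("amount").alias("total_spend"))
          .orderBy(F.desc("total_spend"))
          .show())

    spark.stop()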

Software development projects are relatively simple in terms of estimating task durations and breaking up user stories. Data-driven projects, however, are complicated in terms of uncertainty and complexity, as explained below.

Uncertainty
  • Integrating new data sources is not an easy task.
  • It is often unclear whether default connectors are available to fetch data from the data sources, or whether customization is required.
  • How frequently should data be fetched from the source?
  • Is an initial load required?
Complexity
  • Source data might arrive in different formats (JSON, comma-separated values, and so on) or contain nested arrays.
  • How many transformations are required to clean up the data so it can be easily ingested for analytics? (A small example follows this list.)
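
As a small, assumed example of such a clean-up transformation, the snippet below uses pandas' json_normalize to flatten a nested source record (a nested object plus a nested array) into flat rows ready for analytics; the record structure is invented for illustration:

    # flatten_sketch.py - flattening nested JSON for analytics (illustrative)
    import pandas as pd

    # hypothetical source record with a nested object and a nested array
    record = {
        "order_id": 42,
        "customer": {"id": 7, "name": "Asha"},
        "items": [
            {"sku": "A-1", "qty": 2},
            {"sku": "B-9", "qty": 1},
        ],
    }

    # one flat row per array element; nested fields become dotted columns
    flat = pd.json_normalize(record, record_path="items",
                             meta=["order_id", ["customer", "id"]])
    print(flat)  # columns: sku, qty, order_id, customer.id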

But the goal remains to apply an agile touch to data projects, irrespective of the data sets coming from different systems based on different technologies. So the question becomes: how do we manage data engineering projects the agile way? It can be done by dividing the tasks while keeping the following in mind:

Warm-up time: 

Start with a spike, or warm-up time, in which we design the new data pipeline. We can also use this time to get clarity from our stakeholders about any open issues. It is generally advisable to use one sprint for the warm-up, without any effort estimation or deliverables.

Split the Implementation 

Split the implementation into tasks with clear deliverables (frequent delivery with feedback is a must in Agile practice), for example:

  • Implement the transformation from raw to clean data
  • Implement the incremental load of data (see the sketch after this list)
  • Implement the full load of data
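
To make the incremental-load task concrete, here is a minimal Python sketch built on a high-water mark: it records the newest timestamp already loaded and fetches only the rows that changed since the last run. It assumes a source table with an updated_at column; all database, table, and column names are assumptions:

    # incremental_sketch.py - a minimal incremental-load pattern (illustrative)
    import sqlite3

    target = sqlite3.connect("warehouse.db")
    target.execute("""CREATE TABLE IF NOT EXISTS orders_clean
                      (id INTEGER, amount REAL, updated_at TEXT)""")

    # high-water mark: the newest timestamp we have already loaded
    last = target.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders_clean").fetchone()[0]

    # fetch only the rows that changed since the last run
    source = sqlite3.connect("source.db")  # hypothetical source system
    new_rows = source.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last,)).fetchall()

    target.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", new_rows)
    target.commit()
    print(f"loaded {len(new_rows)} new or changed rows")
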
Task Estimation
  • Story points are a nice way of estimating a task or a story in an Agile framework.
  • The entire team should meet, estimate each story, and agree on a number from the Fibonacci series (1, 2, 3, 5, 8, 13, …).
  • As a rule, any story that exceeds 13 points is not meant to fit in a single sprint and should therefore be split into multiple tasks or stories (see the helper sketched after this list).
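
As a toy illustration of this estimation convention, the helper below rounds a raw effort estimate up to the nearest Fibonacci story-point value and flags anything above 13 points for splitting; the function names and the threshold parameter are assumptions for this sketch:

    # points_sketch.py - rounding estimates to story points (illustrative)
    FIB_POINTS = [1, 2, 3, 5, 8, 13, 21, 34]

    def to_story_points(raw_estimate):
        # round a raw effort estimate up to the nearest Fibonacci value
        return next(p for p in FIB_POINTS if p >= raw_estimate)

    def needs_split(points, threshold=13):
        # stories above the threshold should not go into a single sprint
        return points > threshold

    print(to_story_points(9))                # -> 13
    print(needs_split(to_story_points(9)))   # -> False
    print(needs_split(to_story_points(15)))  # -> True (21 points: split it)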

With the information given in this article, it should be possible to split the implementation into stories that are neither too big nor too small but still deliver clear value. It is also critical to have a warm-up sprint in data-related projects, without any task estimations or deliverables, mainly to answer some of the uncertainties. Following these Agile principles will help in reliably delivering data projects, with no surprises for the development team or the stakeholders.

Scope @ N9 IT Solutions:

  • N9 IT Solutions is a leading IT development and consulting firm providing a broad array of customized solutions to clients throughout the United States.
  • It was established primarily with the aim of providing consulting and IT services in today’s dynamic environment.
  • N9 IT also offers consulting services in many emerging areas, such as Java/J2EE, Cloud Computing, Database Solutions, DevOps, ERP, Mobility, Big Data, Application Development, Infrastructure Managed Services, and Quality Assurance and Testing.
