Data Lineage

Photo by Joshua Sortino / Unsplash

What is Data Lineage?
Data lineage is the process of tracking and documenting the history, origin, and usage of data. This includes information about where the data came from and how it changed over time. Data lineage helps organizations understand how their data is used for analysis and decision making, as well as how it affects compliance with regulations or standards. Additionally, by understanding data lineage, organizations can identify issues with their data sources, such as incorrect or incomplete information.
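As a rough illustration of what "tracking and documenting" can look like in practice, the sketch below records each change to a dataset as a lineage event and then replays that history in order. The `LineageEvent` and `LineageLog` names, their fields, and the dataset names are assumptions made up for this example; they are not taken from any particular lineage tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """One recorded step in a dataset's history (hypothetical structure)."""
    dataset: str           # dataset the step produced or changed
    source: str            # where the data came from
    operation: str         # what was done to it
    recorded_at: datetime  # when the step happened


@dataclass
class LineageLog:
    """In-memory log of lineage events; a real system would persist these."""
    events: list = field(default_factory=list)

    def record(self, dataset, source, operation, recorded_at=None):
        # Default to the current UTC time when no timestamp is supplied.
        when = recorded_at or datetime.now(timezone.utc)
        event = LineageEvent(dataset, source, operation, when)
        self.events.append(event)
        return event

    def history(self, dataset):
        """Return the events for one dataset, oldest first."""
        matching = [e for e in self.events if e.dataset == dataset]
        return sorted(matching, key=lambda e: e.recorded_at)


log = LineageLog()
log.record("orders_clean", "orders_raw", "drop rows with missing order_id",
           datetime(2024, 1, 2, tzinfo=timezone.utc))
log.record("orders_clean", "orders_clean", "convert amounts to USD",
           datetime(2024, 1, 3, tzinfo=timezone.utc))

for event in log.history("orders_clean"):
    print(event.recorded_at.date(), event.source, "->", event.dataset, ":", event.operation)
```

Even a simple log like this answers the two basic lineage questions for a dataset: where it came from and what was done to it, in what order.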

How is data lineage different from data flow?
Data lineage traces the origin and history of data from its source to its destination. It records where the data came from, where it is stored, and how it was transformed along the way.

Data flow is the movement of data from one point to another within a system or between systems. It describes how data moves between the stages or parts of a system and how it is transformed as it moves from one to the next. Data flow diagrams represent this movement by showing how data passes between processes, data stores, and external entities.

The main difference between data lineage and data flow is that while data flow describes the movement of data within a system, data lineage tracks the origin and history of that same data over time.
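One way to picture the distinction is the minimal sketch below, which assumes the data flows are already known as simple (source, destination) pairs. Each pair is a single data flow; the lineage of a dataset is everything reachable by walking those flows backwards. The dataset names and the `lineage` helper are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Each tuple is one data flow: data moves from the first dataset to the second.
# The dataset names are made up for this example.
flows = [
    ("crm_export", "customers_raw"),
    ("customers_raw", "customers_clean"),
    ("orders_raw", "orders_clean"),
    ("customers_clean", "sales_report"),
    ("orders_clean", "sales_report"),
]

# Index the flows by destination so we can walk backwards from any dataset.
parents = defaultdict(set)
for src, dst in flows:
    parents[dst].add(src)


def lineage(dataset):
    """Return every upstream dataset that contributes to `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(lineage("sales_report")))
# ['crm_export', 'customers_clean', 'customers_raw', 'orders_clean', 'orders_raw']
```

A single flow only tells you the previous hop. The lineage query returns the full ancestry, including sources several steps removed from the report.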

Why is it important to track data lineage?
Tracking data lineage lets organizations understand where their data originated and how it has developed since. This helps them improve the accuracy and reliability of their data and troubleshoot data issues more quickly, because a problem can be traced back to the step that introduced it. A full understanding of data lineage also supports regulatory compliance and makes it possible to track changes made to the data over time.
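To make the troubleshooting benefit concrete, here is a small sketch of impact analysis: given a dataset known to contain an error, it lists everything derived from it. The `downstream` map, the dataset names, and the `affected_by` helper are assumptions for illustration; in practice this information would come from whatever lineage store is in use.

```python
from collections import deque

# Hypothetical map: dataset -> datasets built directly from it.
downstream = {
    "orders_raw": ["orders_clean"],
    "orders_clean": ["sales_report", "churn_model_input"],
    "sales_report": ["executive_dashboard"],
    "churn_model_input": [],
    "executive_dashboard": [],
}


def affected_by(dataset):
    """Breadth-first walk listing everything derived from `dataset`."""
    affected = []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(affected_by("orders_raw"))
# ['orders_clean', 'sales_report', 'churn_model_input', 'executive_dashboard']
```

The breadth-first order lists the closest dependents first, which is usually the order in which they need to be checked or rebuilt.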

What are the risks of not tracking data lineage?

  • Without understanding the history of data from its origin to its current state, organizations cannot be sure if their data is reliable and accurate.
  • Organizations cannot accurately audit data, identify vulnerabilities, or understand how their data is used.
  • Without data lineage, organizations will find it much harder to detect and prevent errors or misuse of their data.
  • Absence of data lineage can lead to delays in business processes and expensive mistakes in decision-making, because the accuracy of the data cannot be trusted.
  • Lack of visibility into data lineage can leave organizations open to potential legal issues related to compliance and privacy regulations.

List of commercial applications used for tracking data lineage

  1. Informatica Data Lineage
  2. Axon Data Governance
  3. Alation Data Catalog
  4. Collibra Data Governance Center
  5. erwin Data Modeler
  6. Oracle Enterprise Data Quality
  7. Adverity Insights
  8. Talend Metadata Manager
  9. WhereScape RED
  10. Apatar Data Lineage
  11. IBM InfoSphere DataStage
  12. Syncsort DMX-h
  13. SAP Data Services
  14. Microsoft SQL Server Integration Services (SSIS)
  15. Oracle Data Integrator (ODI)
  16. Talend Data Integration
  17. Attunity Compose
  18. HVR Software
  19. Ab Initio Data Profiler

List of open source tools available for tracking data lineage

Apache Atlas: Apache Atlas is an open source data governance and metadata management platform that provides a central repository for storing and managing enterprise-wide metadata. It enables users to track data lineage and monitor changes to the data over time.

Talend Data Lineage: Talend's lineage tooling lets users follow the complete lifecycle of their data, from its entry into the system, through its usage and transformation, to its eventual storage or export. Talend's open source offering was Talend Open Studio; the Talend Metadata Manager listed above is a commercial product.

Informatica Data Lineage: Informatica's lineage tooling is a commercial product, not open source. It helps users track their data’s movement throughout their system, providing visibility into where their data is coming from and how it’s being used.

ER/Studio Data Lineage: ER/Studio is a commercial data modeling suite, not open source. Its lineage features give users a view of their data and its usage across the enterprise.

Metadata Workbench: Metadata Workbench is an open source metadata management tool that helps users track and analyze data lineage and understand where their data is coming from and how it’s being used.

Data Mapper: Data Mapper is an open source project that helps users map and track their data lineage across multiple systems and databases.

CloverETL: CloverETL is an open source data integration platform that helps users visualize and track their data lineage to understand where their data is coming from and how it’s being used.

Metadata Manager: Metadata Manager is an open source metadata management platform that enables users to track and trace the movement of their data throughout the entire enterprise, providing visibility into its usage and transformation.

Data Cleaner: Data Cleaner is an open source data cleansing and lineage tracking tool that helps users ensure their data is accurate and up-to-date.

Dataworks: Dataworks is an open source data integration and ETL tool that enables users to track their data lineage and monitor its movement throughout the entire system.