Open Sourcing WhereHows: A Data Discovery and Lineage Portal

Nathan Chappell

Eric Sun
Data Expert with System Insights

In modern data-driven businesses, the complexity that arises from fast-paced analytics, data mining and ETL processes makes metadata increasingly important. In this blog post, we share our own journey and a new open source effort that aims to boost productivity and data provenance. WhereHows, a project of the LinkedIn Data team, works by creating a central repository and portal for the processes, people, and knowledge around the most important element of any big data system: the data itself. The repository has captured the status of 50 thousand datasets (with more than 15 petabytes storage footprint across multiple Hadoop, Teradata and other clusters), 14 thousand comments, 35 million job executions and related lineage information.

The Consequence of Specialization: Mo’ Systems, Mo’ Problems

Along with the rapid growth of its professional network and business lines, LinkedIn has accumulated a lot of diversity in its big data ecosystem. We have many different sources and sinks of data. We write production pipelines that are driven by different scheduling engines, and we support many different transformation engines that are used to process and create derived data. This sort of specialization is nice because it gives us access to the best tool for the job; however, it creates a new set of problems. It becomes much harder to make sense of the overall data flow and lineage across the different processing frameworks, data platforms, and scheduling systems. This can result in a host of challenges including loss in productivity for employees as they try to find the right datasets to derive insights, operational challenges in discovering and triaging data breakages as well as lost opportunities in discovering and eliminating redundant computation.

Like most companies with a mature Business Intelligence (BI) ecosystem, LinkedIn started out with a data warehouse team responsible for integrating various sources into the consolidated golden datasets for the most critical execution reports. As the number of datasets, producers, and consumers grew, this central team was stretched very thin. Some of the questions that they had to answer were:

Where? How?

Our Solution

A couple of years ago, we decided to buckle down and build a central metadata repository to capture metadata across diverse systems and surface it through a single platform to simplify the data and flow discovery problem. This is a long journey and we are by no means done, but we wanted to share our progress with the community at large.

Today, we are excited to announce that we are open sourcing WhereHows, a data discovery and lineage portal. At LinkedIn, WhereHows integrates with all our data processing environments and extracts coarse and fine grain metadata from them. Then, it surfaces this information through two interfaces: (1) a web application that enables navigation, search, lineage visualization, annotation, discussion, and community participation and (2) an API endpoint that empowers automation of other data processes and applications.


This enables us to solve problems around data and process lineage, data and process ownership, schema discovery and evolution history, User Defined Function (UDF) and script discovery, operational metadata mashup, and data profiling and cross-cluster comparison. In addition to machine-based pattern detection and association between business glossary and dataset, the community participation and collaboration aspect enables us to create a self-maintaining repository of documentation on the entities by encouraging conversations and pride in ownership.

WhereHows Backend

The major components of WhereHows are:

A data repository to store all metadata content.

A web server that surfaces the data through both UI and API.

A backend server that periodically fetches metadata from other systems.

Detailed documentation for each component can be found on GitHub.

Putting Metadata To Work

The power of WhereHows comes from the metadata that it collects from the data ecosystem.

What kind of metadata?

In WhereHows, we primarily collect the following types of metadata:

The catalog information of datasets, such as schema structure, the dataset’s physical location, creation and modification timestamps, ownership, and so on.

Operational metadata, which includes the jobs, flows, and execution information.

Lineage information metadata, which is the connection between jobs and datasets.
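The three metadata types above can be pictured as three small record types. The following is a minimal Python sketch for illustration only, not the actual WhereHows schema; every class and field name here is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Dataset:
    """Catalog metadata for one dataset (hypothetical fields)."""
    name: str
    platform: str                      # e.g. "hdfs", "hive", "teradata"
    location: str                      # physical location of the data
    schema: Tuple[Tuple[str, str], ...] = ()   # (column name, type) pairs
    owners: Tuple[str, ...] = ()


@dataclass(frozen=True)
class JobExecution:
    """Operational metadata for one run of a job inside a flow."""
    flow: str
    job: str
    scheduler: str                     # e.g. "azkaban", "oozie"
    status: str


@dataclass(frozen=True)
class LineageEdge:
    """Lineage metadata: links a job to a dataset it reads or writes."""
    job: str
    dataset: str
    direction: str                     # "read" or "write"


# Illustrative records, invented for this example.
page_views = Dataset(
    name="page_views",
    platform="hdfs",
    location="/data/tracking/page_views",
    schema=(("member_id", "long"), ("page_key", "string")),
    owners=("alice",),
)
run = JobExecution(flow="daily_rollup", job="aggregate_views",
                   scheduler="azkaban", status="SUCCEEDED")
edge = LineageEdge(job=run.job, dataset=page_views.name, direction="read")
```

Keeping lineage as its own record, rather than a field on the dataset or the job, mirrors the description of lineage as the connection between the two.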

How to use this metadata

Integrating the data from different source systems into a universal model was critical to our efforts. A universal model allows us to better leverage the value from the metadata, such as conducting a search across different platforms based on different aspects of a dataset.
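To illustrate why a universal model helps, here is a small Python sketch of a search that filters on several aspects of a dataset regardless of the platform it lives on. The record layout, the `search` function, and the sample catalog are all hypothetical and are not the WhereHows API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """One dataset in a platform-independent form (hypothetical fields)."""
    name: str
    platform: str
    owner: str
    columns: Tuple[str, ...]


def search(records, *, platform=None, owner=None, column=None, name_contains=None):
    """Return the records that match every filter that was supplied."""
    results = []
    for record in records:
        if platform is not None and record.platform != platform:
            continue
        if owner is not None and record.owner != owner:
            continue
        if column is not None and column not in record.columns:
            continue
        if name_contains is not None and name_contains not in record.name:
            continue
        results.append(record)
    return results


# Sample catalog, invented for this example.
CATALOG = [
    DatasetRecord("/data/tracking/page_views", "hdfs", "alice", ("member_id", "page_key")),
    DatasetRecord("dwh.dim_member", "teradata", "bob", ("member_id", "country")),
    DatasetRecord("default.page_views_daily", "hive", "alice", ("page_key", "views")),
]

# Which datasets, on any platform, carry a member_id column?
with_member_id = search(CATALOG, column="member_id")
```

Because every source system is mapped into the same record shape, one query can span HDFS, Hive, and Teradata without platform-specific logic.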

Additionally, the dataset metadata and the job operational metadata are like two endpoints, and the lineage information is the bridge that connects them: starting from a dataset or a job, we can trace to its upstream or downstream jobs and datasets. If the whole data ecosystem’s metadata is collected into WhereHows, we can trace the data flow from end to end.
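End-to-end tracing of this kind can be sketched as a graph traversal in which datasets and jobs are nodes and lineage records are directed edges. The Python below is a simplified illustration under that assumption; the edge list and node names are invented, and this is not how WhereHows itself stores or queries lineage.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: data flows from the first node to the second.
EDGES = [
    ("dataset:raw_events", "job:clean_events"),
    ("job:clean_events", "dataset:clean_events"),
    ("dataset:clean_events", "job:daily_rollup"),
    ("dataset:dim_member", "job:daily_rollup"),
    ("job:daily_rollup", "dataset:daily_report"),
]


def build_graph(edges):
    """Index the edges in both directions for upstream and downstream lookups."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def trace(start, neighbors):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_graph(EDGES)
report_inputs = trace("dataset:daily_report", upstream)    # everything the report depends on
raw_consumers = trace("dataset:raw_events", downstream)    # everything derived from raw events
```

The same `trace` function serves both directions; only the adjacency index passed in changes.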

How to collect metadata

The method used to collect metadata highly depends on the source systems. For example, for Hadoop datasets, we have a scraper job that scans through the folders and files on HDFS, reads and aggregates the metadata, then stores it back; for schedulers such as Azkaban and Oozie, we connect to their backend repository to get the metadata, aggregate and transform to the format we want, then load into the WhereHows repository; for lineage information, we parse the log of a MapReduce job and a scheduler’s execution log, then combine the information together to fill in the lineage.
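As a rough illustration of the scraper idea, the Python sketch below walks a directory tree and aggregates file-level details into one record per dataset. It uses the local filesystem via `os.walk` as a stand-in for HDFS, and it assumes that each directory directly containing files is one dataset; both choices, and the `scrape_datasets` name, are simplifications for this example rather than WhereHows behavior.

```python
import os


def scrape_datasets(root):
    """Scan a directory tree and aggregate per-directory metadata.

    Assumption for this sketch: every directory that directly contains
    files is treated as one dataset.
    """
    datasets = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if not filenames:
            continue
        paths = [os.path.join(dirpath, name) for name in filenames]
        stats = [os.stat(path) for path in paths]
        datasets.append({
            "location": os.path.relpath(dirpath, root),
            "file_count": len(paths),
            "total_bytes": sum(s.st_size for s in stats),
            "last_modified": max(s.st_mtime for s in stats),
        })
    # Sort so that repeated scans produce records in a stable order.
    return sorted(datasets, key=lambda d: d["location"])
```

A real collector would then load these records into the metadata repository; that step is omitted here because it depends on the repository's schema.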

Current State

WhereHows is actively used at LinkedIn, not only as a knowledge-based application, but also as a metadata repository to automate several other projects, such as automated data purging for compliance, multi-colo database replication, and so on.

At the time of publishing this post, we have already integrated with the following systems:

| Type | System | Metadata Collected |
|---|---|---|
| Dataset | HDFS | Metadata for 25K+ public datasets |
| Dataset | Hive | 9K+ tables and views |
| Dataset | Teradata | 22K+ tables |
| Dataset | Pinot* | 100+ Pinot tables |
| Execution | Azkaban | 150K+ flows, 15M+ job execution info |
| Execution | Oozie | N/A |
| Execution | Appworx* | 4K+ flows, 20M+ job execution info |
| Execution | Informatica* | 600+ flows, 2M+ job execution info |
| Lineage | Azkaban | All Pig, Hive, MapReduce job lineages |
| Lineage | Appworx* | All Pig, Hive, MapReduce, Teradata job lineages |
| Lineage | Informatica | All Informatica job lineages |

*We already support these systems in LinkedIn's internal versions or their commercial versions, but not yet in the open source version due to either library licensing or other limitations.

The numbers continue to increase every day, yet with the abstraction and discovery functions of WhereHows, we are able to better navigate and make better sense of our data ocean.

We are open sourcing WhereHows on GitHub, as well as our discussion group, to share our work with the broader data community. We highly encourage contributors from different companies to create new features and commit important bug fixes. Though metadata management tends to be tightly coupled to other components in the company, we will continue to try to refactor LinkedIn-internal integrations with WhereHows into generic templates or plugins in open source.

What’s Next

We plan to broaden our metadata coverage by integrating with more data systems, such as Kafka and Samza. We also plan to integrate with data lifecycle management and provisioning systems such as Gobblin and Nuage to enrich our metadata.

In addition, we will be working on new features in which we’ve observed strong interest from analysts and engineers:

Basic data profiling statistics to reveal the column-level and set-level characteristics of the dataset. This can be further leveraged for data quality analysis.

Join relationships between different datasets. This is challenging because there is no native way to express or document a join relationship in Hadoop, and it is complicated by the fact that in practice we often find type overloading (for example, the same column can contain references to multiple types of entities by using URNs).

A business taxonomy to associate and organize data objects and metrics according to business-intuitive hierarchy and grouping.
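To make the first two planned features more concrete, here is a small Python sketch of column-level profiling, plus a helper that counts the entity types found in a URN-valued column as a way to spot type overloading. This is an illustration only and not a planned WhereHows implementation; the function names and the assumed URN layout `urn:<namespace>:<type>:<id>` are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Compute basic statistics for one column; None is treated as null."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def urn_entity_types(values):
    """Count entity types in strings shaped like 'urn:<namespace>:<type>:<id>'.

    More than one key in the result suggests the column is overloaded,
    which matters when inferring join relationships.
    """
    types = Counter()
    for value in values:
        parts = value.split(":") if isinstance(value, str) else []
        if len(parts) >= 4 and parts[0] == "urn":
            types[parts[2]] += 1
    return dict(types)


# Sample values, invented for this example.
stats = profile_column([3, None, 1, 3])
entity_types = urn_entity_types(
    ["urn:example:member:1", "urn:example:company:7", "urn:example:member:2", None]
)
```

In this example the column mixes member and company references, so a join against a single entity table would silently miss part of the data.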
