DataHub


DataHub is an open-source metadata platform for the modern data stack. Read about the architectures of different metadata systems and why DataHub excels here. Also read our LinkedIn Engineering blog post, check out our Strata presentation and watch our Crunch Conference Talk. You should also visit DataHub Architecture to get a better understanding of how DataHub is implemented and DataHub Onboarding Guide to understand how to extend DataHub for your own use cases.

Features

DataHub is made up of a generic backend and a React-based UI. The original DataHub blog post discusses the design at length and mentions some of DataHub's features. Our open-sourcing blog post also compares the features of LinkedIn's production DataHub with those of open-source DataHub. Below is a list of the features currently available in DataHub, as well as ones that will become available soon.

Entities:
Datasets
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Schema: table & document schema in tabular and JSON format
Coarse-grained lineage: support for lineage at the dataset level, tabular & graphical visualization of downstreams/upstreams
Ownership: surfacing owners of a dataset, viewing datasets you own
Dataset life-cycle management: deprecate/undeprecate, surface removed datasets and tag them as “removed”
Institutional knowledge: support for adding free-form documentation to any dataset
Fine-grained lineage: support for lineage at the field level
Social actions: likes, follows, bookmarks
Compliance management: field-level, tag-based compliance editing
Top users: frequent users of a dataset

Users & Groups
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Profile editing: LinkedIn style professional profile editing such as summary, skills

Dashboards & Charts
Search: full-text & advanced search, search ranking
Basic information: ownership, location. Link to external service for viewing the dashboard.
Institutional knowledge: support for adding free-form documentation to any dashboard

Tasks & Pipelines
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Basic information:
Execution history: Executions and their status. Link to external service for viewing full info.

Tags
Globally defined: Tags provide a standardized set of labels that can be shared across all your entities
Supports entities and schemas: Tags can be applied at the entity level or, for datasets, attached to individual schema fields
Searchable: Entities can be searched and filtered by tag
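
The tag behavior described above can be illustrated with a small in-memory sketch. This is not DataHub's storage model or API: the `TagIndex` class, its method names, and the entity id strings are all hypothetical, and the sketch assumes that a dataset with a tagged schema field should also appear when searching by that tag.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory tag index (illustration only, not DataHub's API)."""

    def __init__(self):
        # entity id -> set of tags applied at the entity level
        self._entity_tags = defaultdict(set)
        # (dataset id, field path) -> set of tags applied to a schema field
        self._field_tags = defaultdict(set)

    def tag_entity(self, entity_id, tag):
        """Attach a globally defined tag to any entity."""
        self._entity_tags[entity_id].add(tag)

    def tag_field(self, dataset_id, field_path, tag):
        """Attach a tag to a single schema field of a dataset."""
        self._field_tags[(dataset_id, field_path)].add(tag)

    def search(self, tag):
        """Return sorted ids of entities tagged directly or via a schema field.

        Assumption: field-level tags make the owning dataset searchable.
        """
        hits = {e for e, tags in self._entity_tags.items() if tag in tags}
        hits |= {d for (d, _), tags in self._field_tags.items() if tag in tags}
        return sorted(hits)


index = TagIndex()
index.tag_entity("dashboard:sales", "pii")       # entity-level tag
index.tag_field("dataset:users", "email", "pii")  # schema-field tag
index.tag_entity("dataset:orders", "finance")
print(index.search("pii"))
```

Keeping entity-level and field-level tags in separate maps mirrors the two attachment points listed above while letting a single tag name be shared across both.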

Schemas
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Schema history: view and diff historic versions of schemas
GraphQL: visualization of GraphQL schemas

Metrics
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Basic information: ownership, dimensions, formula, input & output datasets, dashboards
Institutional knowledge: support for adding free form doc to any metric

Fine-Grained Access Controls
DataHub also provides mechanisms to control who has access to which metadata entities via the UI & API. Using this functionality, DataHub admins can define policies such as:

* Dataset Owners should be able to update Documentation, but not Tags, for all datasets.
* A specific Data Steward should be able to add tags to any Dataset, but edit nothing else.
* The Data Platform team should have all privileges for DataHub, including managing policies & viewing platform analytics.
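
To make the three example policies concrete, here is a minimal sketch of how such rules could be modeled and evaluated. It is an assumption-laden illustration, not DataHub's policy schema or evaluation engine: the `Policy` fields, the actor labels (`owner`, `user:steward`, `group:data-platform`), and the privilege names (`EDIT_DOCUMENTATION`, `EDIT_TAGS`, `MANAGE_POLICIES`) are hypothetical, and the sketch assumes access is denied unless some policy grants it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy record (not DataHub's actual schema)."""
    name: str
    actors: frozenset        # labels such as "owner" or "group:data-platform"
    entity_types: frozenset  # e.g. {"dataset"}; "*" matches every type
    privileges: frozenset    # e.g. {"EDIT_TAGS"}; "*" matches every privilege


def is_allowed(policies, actor_labels, entity_type, privilege):
    """Return True if any policy grants the privilege; deny by default."""
    for policy in policies:
        if not (policy.actors & actor_labels):
            continue  # policy does not apply to this actor
        if "*" not in policy.entity_types and entity_type not in policy.entity_types:
            continue  # policy does not cover this entity type
        if "*" in policy.privileges or privilege in policy.privileges:
            return True
    return False


POLICIES = [
    # Dataset owners may update documentation, but not tags.
    Policy("owners-edit-docs", frozenset({"owner"}),
           frozenset({"dataset"}), frozenset({"EDIT_DOCUMENTATION"})),
    # One data steward may add tags to any dataset, and nothing else.
    Policy("steward-adds-tags", frozenset({"user:steward"}),
           frozenset({"dataset"}), frozenset({"EDIT_TAGS"})),
    # The data platform team holds every privilege on every entity type.
    Policy("platform-admins", frozenset({"group:data-platform"}),
           frozenset({"*"}), frozenset({"*"})),
]

print(is_allowed(POLICIES, {"owner"}, "dataset", "EDIT_DOCUMENTATION"))
print(is_allowed(POLICIES, {"owner"}, "dataset", "EDIT_TAGS"))
```

Because the check is deny-by-default, "but not Tags" and "edit nothing else" need no explicit deny rules in this sketch; the absence of a granting policy is enough.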

Metadata Sources
We have a Metadata Ingestion Framework that supports a variety of popular connectors, including:

BigQuery, Snowflake, Redshift, Postgres, Kafka, MySQL, Hive, Looker, MongoDB

Official website

Tutorial and documentation
