DataHub is made up of a generic backend and a React-based UI. The original DataHub blog post discusses the design extensively and mentions some of DataHub's features. Our open-sourcing blog post also compares the features of LinkedIn's production DataHub with those of open source DataHub. Below is a list of the latest features available in DataHub, as well as ones that will soon become available.
Entities:
Datasets
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Schema: table & document schema in tabular and JSON format
Coarse grain lineage: support for lineage at the dataset level, tabular & graphical visualization of downstreams/upstreams
Ownership: surfacing owners of a dataset, viewing datasets you own
Dataset life-cycle management: deprecate/undeprecate, surface removed datasets and tag them as “removed”
Institutional knowledge: support for adding free form doc to any dataset
Fine grain lineage: support for lineage at the field level
Social actions: likes, follows, bookmarks
Compliance management: field level tag based compliance editing
Top users: frequent users of a dataset
Users & Groups
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Profile editing: LinkedIn-style professional profile editing, such as summary and skills
Dashboards & Charts
Search: full-text & advanced search, search ranking
Basic information: ownership, location. Link to external service for viewing the dashboard.
Institutional knowledge: support for adding free form doc to any dashboard
Tasks & Pipelines
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Basic information:
Execution history: Executions and their status. Link to external service for viewing full info.
Tags
Globally defined: Tags provide a standardized set of labels that can be shared across all your entities
Supports entities and schemas: Tags can be applied at the entity level or, for datasets, attached to schema fields
Searchable: Entities can be searched and filtered by tag
Schemas
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Schema history: view and diff historic versions of schemas
GraphQL: visualization of GraphQL schemas
Metrics
Search: full-text & advanced search, search ranking
Browse: browsing through a configurable hierarchy
Basic information: ownership, dimensions, formula, input & output datasets, dashboards
Institutional knowledge: support for adding free form doc to any metric
Fine-Grained Access Controls
DataHub also provides mechanisms to control who has access to which metadata entities via the UI & API. Using this functionality, DataHub admins can define policies such as:
* Dataset Owners should be able to update Documentation, but not Tags, for all datasets.
* A specific Data Steward should be able to add tags to any Dataset, but edit nothing else.
* Data Platform team should have all privileges for DataHub, including managing policies & viewing platform analytics.
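
To make the three example policies above concrete, here is a minimal sketch of how such rules could be modeled and checked. It is not DataHub's actual policy schema or API: the `Policy` class, the `is_allowed` helper, the actor names, and the privilege strings (`EDIT_DOCUMENTATION`, `EDIT_TAGS`, `MANAGE_POLICIES`) are all hypothetical names chosen for illustration. The sketch assumes a deny-by-default model in which access is granted only when some policy matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """One hypothetical access policy: who may do what to which entity type."""
    actor: str             # role or user the policy applies to (illustrative names)
    entity_type: str       # e.g. "dataset", or "*" for any entity type
    privileges: frozenset  # privileges granted; "*" stands for all privileges


def is_allowed(policies, actor, entity_type, privilege):
    """Return True if any policy grants `privilege` to `actor` on `entity_type`.

    Nothing is allowed unless a policy matches (deny by default).
    """
    return any(
        p.actor == actor
        and p.entity_type in (entity_type, "*")
        and (privilege in p.privileges or "*" in p.privileges)
        for p in policies
    )


# The three example policies from the list above, in this hypothetical model.
POLICIES = [
    Policy("dataset_owner", "dataset", frozenset({"EDIT_DOCUMENTATION"})),
    Policy("data_steward", "dataset", frozenset({"EDIT_TAGS"})),
    Policy("data_platform_team", "*", frozenset({"*"})),
]

# Owners may edit documentation but not tags.
print(is_allowed(POLICIES, "dataset_owner", "dataset", "EDIT_DOCUMENTATION"))
print(is_allowed(POLICIES, "dataset_owner", "dataset", "EDIT_TAGS"))
```

A real deployment would resolve "owner" per dataset rather than as a global role; the sketch flattens that to keep the matching logic visible.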
Metadata Sources
Our Metadata Ingestion Framework supports a variety of popular connectors, including:
BigQuery, Snowflake, Redshift, Postgres, Kafka, MySQL, Hive, Looker, MongoDB
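
As a rough illustration of how a connector is typically wired up, the sketch below assembles a recipe-shaped configuration that pairs one source with one sink. The `build_recipe` helper is hypothetical, and the specific keys and values (`datahub-rest`, `host_port`, the server URL, and `example_db`) are assumptions made for this example; consult the ingestion framework's documentation for the options each connector actually accepts.

```python
import json


def build_recipe(source_type, source_config, server="http://localhost:8080"):
    """Assemble a recipe-shaped dict: one source to read from, one sink to write to.

    The layout and key names are assumptions for illustration only.
    """
    if not source_type:
        raise ValueError("source_type must be non-empty")
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


# Example: describe a MySQL source with placeholder connection details.
recipe = build_recipe(
    "mysql",
    {"host_port": "localhost:3306", "database": "example_db"},
)
print(json.dumps(recipe, indent=2))
```

Keeping the source and sink as separate sections means the same sink settings can be reused while swapping in any of the connectors listed above.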