Updating DataHub
This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.
Next
Breaking Changes
Potential Downtime
Deprecations
- #8525: In the LDAP ingestor, the manager_pagination_enabled option has been replaced by the general pagination_enabled option.
Other Notable Changes
- #8300: The ClickHouse source now inherits from TwoTierSQLAlchemy. Previously the container hierarchy was platform_instance -> container -> container db (None) -> container schema; it is now platform_instance -> container database.
- #8300: Added the uri_opts argument, which allows passing arbitrary options to the ClickHouse client (see the sketch below).
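A minimal sketch of a programmatic ClickHouse ingestion run that sets uri_opts via Pipeline.create. The connection details, the option value, and the sink address are illustrative assumptions; only the uri_opts option itself comes from the #8300 change.

```python
# Sketch of a programmatic ClickHouse ingestion run that forwards uri_opts (#8300).
# Connection details, the uri_opts value, and the sink address are illustrative.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "clickhouse",
            "config": {
                "host_port": "localhost:9000",   # illustrative
                "username": "default",            # illustrative
                "password": "",                   # illustrative
                "platform_instance": "clickhouse_prod",
                # Arbitrary options forwarded to the ClickHouse client.
                "uri_opts": {"send_receive_timeout": "300"},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```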
0.10.5
Breaking Changes
- #8201: Python SDK: In the DataFlow class, the cluster argument is deprecated in favor of env (see the sketch after this list).
- #8263: The Okta source config option okta_profile_to_username_attr default changed from login to email. This determines which Okta profile attribute is used for the corresponding DataHub user and thus may change which DataHub users are generated by the Okta source. In a follow-up, okta_profile_to_username_regex has been set to .*, which, taken together with the previous change, brings the defaults in line with OIDC.
- #8331: For all SQL-based sources that support profiling, you can no longer specify profile_table_level_only together with include_field_xyz config options to ingest certain column-level metrics. Instead, set profile_table_level_only to false and individually enable or disable the desired field metrics.
- #8451: The bigquery-beta and snowflake-beta source aliases have been dropped. Use bigquery and snowflake as the source type instead.
- #8472: Ingestion runs created with Pipeline.create will show up in the DataHub ingestion tab as CLI-based runs. To revert to the previous behavior of not showing these runs in DataHub, pass no_default_report=True.
- #8513: The snowflake connector will use the user's email attribute as-is in the urn. To revert to the previous behavior, disable email_as_user_identifier in the recipe.
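A minimal sketch of the #8201 change in the Python SDK: constructing a DataFlow with env instead of the deprecated cluster argument. The orchestrator and id values are illustrative.

```python
# Sketch of the #8201 change: DataFlow now takes env instead of the deprecated cluster.
# The orchestrator and id values below are illustrative.
from datahub.api.entities.datajob import DataFlow

flow = DataFlow(
    orchestrator="airflow",
    id="daily_customer_etl",
    env="PROD",  # previously: cluster="PROD"
)
print(flow)  # the flow can then be emitted to DataHub as usual
```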
Potential Downtime
- The BrowsePathsV2 upgrade is now handled by the system-update job in non-blocking mode. This process generates data needed for the new search and browse feature. It must complete before enabling the new search and browse UI; while it is running, entities will be missing from the UI. If you are not using the new search and browse UI, there is no impact and the update will complete in the background.
Deprecations
- #8198: In the Python SDK, the PlatformKey class has been renamed to ContainerKey (see the sketch below).
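A sketch of working with the renamed class: subclassing ContainerKey (formerly PlatformKey) to model a custom container hierarchy. The WarehouseKey class and its extra field are hypothetical; only the ContainerKey name comes from #8198.

```python
# Sketch of the #8198 rename: subclass ContainerKey (formerly PlatformKey).
# WarehouseKey and its warehouse field are hypothetical examples.
from datahub.emitter.mcp_builder import ContainerKey


class WarehouseKey(ContainerKey):
    warehouse: str


key = WarehouseKey(platform="snowflake", instance="prod", warehouse="analytics")
print(key.guid())  # stable GUID used when building the container urn
```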
Other Notable Changes
0.10.5 introduces the new Unified Search & Browse experience, which is disabled by default. You can choose between just the new search filtering experience, the new search and browse experience together, or the existing search and browse experiences by toggling the two environment variable feature flags SHOW_SEARCH_FILTERS_V2 and SHOW_BROWSE_V2 in your GMS container.
Upgrade Considerations:
- With the release of Browse V2, we have created a job that runs in GMS to backfill your existing data with the new browsePathsV2 aspects. This job loops over the entity types that need a browsePathsV2 aspect (Dataset, Dashboard, Chart, DataJob, DataFlow, MLModel, MLModelGroup, MLFeatureTable, and MLFeature) and generates one for each. For entities that may have Container parents (Datasets and Dashboards), we will try to fetch their parent containers in order to generate this new aspect. For deployments with large amounts of data, consider whether running this upgrade job makes sense, as it may be a heavy operation and take some time to complete. If you wish to skip this job, set the BACKFILL_BROWSE_PATHS_V2 environment variable flag to false in your GMS container. Without this backfill job, however, you will need to rely on the newest ingestion CLI to create these browsePathsV2 aspects when running ingestion; otherwise your browse sidebar will be out of sync.
- Since the new browse experience replaces the old one, consider whether having the SHOW_BROWSE_V2 environment variable feature flag on is the right decision for your organization. If you're creating custom browse paths with the browsePaths aspect, you can continue to do the same with the new experience; however, you will have to generate browsePathsV2 aspects instead, which are documented here (see the emitter sketch below).
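A sketch of emitting a custom browsePathsV2 aspect via the Python SDK. The dataset urn, path entries, and GMS address are illustrative; the aspect and emitter classes are the standard SDK ones.

```python
# Sketch of emitting a custom browsePathsV2 aspect for a dataset.
# The dataset urn, path entries, and GMS address are illustrative.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import BrowsePathEntryClass, BrowsePathsV2Class

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,marketing.daily_orders,PROD)"
aspect = BrowsePathsV2Class(
    path=[BrowsePathEntryClass(id="marketing"), BrowsePathEntryClass(id="daily")]
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))
```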
0.10.4
Breaking Changes
Potential Downtime
Deprecations
- #8045: With the introduction of custom ownership types, the Owner aspect has been updated: the type field is deprecated in favor of a new typeUrn field, which is an urn reference to the new OwnershipType entity. GraphQL endpoints have been updated to use the new field. For pre-existing ownership aspect records, DataHub has logic to map the old field to the new one (see the sketch below).
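A sketch of what attaching an owner with the new field could look like in the Python SDK, assuming the generated OwnerClass exposes the new typeUrn field after #8045. The urns below are illustrative.

```python
# Sketch only: assumes the generated OwnerClass exposes the new typeUrn field (#8045).
# The corpuser and ownershipType urns are illustrative.
from datahub.metadata.schema_classes import OwnerClass, OwnershipClass, OwnershipTypeClass

ownership = OwnershipClass(
    owners=[
        OwnerClass(
            owner="urn:li:corpuser:jdoe",
            type=OwnershipTypeClass.CUSTOM,            # legacy field, kept for compatibility
            typeUrn="urn:li:ownershipType:architect",  # reference to a custom OwnershipType entity
        )
    ]
)
```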
Other notable Changes
- #8191: Updates GMS's health check endpoint to account for its dependencies on external components, notably Elasticsearch at this time. This means that DataHub operators can now use the GMS health status more reliably.
0.10.3
Breaking Changes
- #7900: The catalog_pattern and schema_pattern options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using regex ^ in your patterns, this should not affect you.
- #7942: Renamed the containerPath aspect to browsePathsV2. This means any data with the aspect name containerPath will be invalid. We had not exposed this in the UI or used it anywhere, but it was a model we recently merged to open up other work. This should not affect anyone unless you were manually creating containerPath data through ingestion on your instance.
- #8068: In the datahub delete CLI, if an --entity-type filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of dataset.
- #8068: In the datahub delete CLI, the --start-time and --end-time parameters are now required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use --start-time min --end-time max.
Potential Downtime
Deprecations
- The signature of Source.get_workunits() has changed from Iterable[WorkUnit] to the more restrictive Iterable[MetadataWorkUnit] (see the sketch below).
- Legacy usage creation via the UsageAggregation aspect, the /usageStats?action=batchIngest GMS endpoint, and the UsageStatsWorkUnit metadata-ingestion class are all deprecated.
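A sketch of what the narrowed signature looks like for a custom ingestion source. The class name and bodies are hypothetical; the point is only the Iterable[MetadataWorkUnit] return type.

```python
# Sketch of the narrowed get_workunits() signature for a custom source.
# MyCustomSource and its bodies are hypothetical.
from typing import Iterable

from datahub.ingestion.api.source import Source, SourceReport
from datahub.ingestion.api.workunit import MetadataWorkUnit


class MyCustomSource(Source):
    """Hypothetical custom source; other required plumbing is omitted for brevity."""

    def get_workunits(self) -> Iterable[MetadataWorkUnit]:  # was Iterable[WorkUnit]
        return []  # yield MetadataWorkUnit objects here

    def get_report(self) -> SourceReport:
        return SourceReport()

    def close(self) -> None:
        pass
```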
Other notable Changes
0.10.2
Breaking Changes
- #7016: Added the add_database_name_to_urn flag to the Oracle source, which ensures that dataset urns have the DB name as a prefix to prevent collisions (e.g. {database}.{schema}.{table}). This is ONLY breaking if you set the flag to true; otherwise behavior remains the same.
- The Airflow plugin no longer includes the DataHub Kafka emitter by default. Use pip install acryl-datahub-airflow-plugin[datahub-kafka] for Kafka support.
- The Airflow lineage backend no longer includes the DataHub Kafka emitter by default. Use pip install acryl-datahub[airflow,datahub-kafka] for Kafka support.
- Java SDK PatchBuilders have been modified in a backwards-incompatible way to align more with the Python SDK and support more use cases. Any application utilizing the Java SDK for patch building may be affected when upgrading this dependency.
Deprecations
- The docker image and script for updating from Elasticsearch 6 to 7 are no longer being maintained and will be removed from the /contrib section of the repository. Please refer to older releases if needed.
0.10.0
Breaking Changes
- #7103: This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics in the kafka-setup docker image have been updated to be in line with other DataHub components; for more info, see our docs on Configuring Kafka in DataHub. They were previously suffixed with _TOPIC, whereas now the correct suffix is _TOPIC_NAME. This change should not affect any user who is using the default Kafka topic names.
- #6906: The Redshift source has been reworked and now also includes usage capabilities. The old Redshift source was renamed to redshift-legacy. The redshift-usage source has also been renamed to redshift-usage-legacy and will be removed in the future.
Potential Downtime
- #6894: Search improvements require reindexing indices. A system-update job will run, setting indices to read-only and creating a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure, this process can take anywhere from 5 minutes to several hours; as a rough estimate, expect about 1 hour for every 2.3 million entities.
Helm Notes
Helm without --atomic: The default timeout for an upgrade command is 5 minutes. If the reindex takes longer (depending on data size) it will continue to run in the background even though helm will report a failure. Allow this job to finish and then re-run the helm upgrade command.
Helm with --atomic: In general, it is recommended to not use the --atomic setting for this particular upgrade since the system update job will be terminated before completion. If --atomic is preferred, then increase the timeout using the --timeout flag to account for the reindexing time (see note above for estimating this value).
Deprecations
0.9.6
Breaking Changes
- #6742: The metadata file sink's output format no longer contains nested JSON strings for MCP aspects; instead, the stringified JSON is unpacked into a real JSON object. The previous sink behavior can be recovered using the legacy_nested_json_string option (see the sketch after this list). The file source is backwards compatible and supports both formats.
- #6901: The env and database_alias fields have been marked deprecated across all sources. We recommend using platform_instance instead where possible.
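A sketch of a recipe (expressed as the dict you would pass to Pipeline.create, or equivalently write as YAML) that restores the old nested-JSON-string file sink output. The source block is illustrative; only the legacy_nested_json_string option comes from #6742.

```python
# Sketch of a recipe that reverts the file sink to the pre-#6742 output format.
# The source block is illustrative; only legacy_nested_json_string is the point.
recipe = {
    "source": {
        "type": "demo-data",  # illustrative source
        "config": {},
    },
    "sink": {
        "type": "file",
        "config": {
            "filename": "./metadata_events.json",
            "legacy_nested_json_string": True,  # keep MCP aspects as nested JSON strings
        },
    },
}
```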
Potential Downtime
Deprecations
- #6851 - Sources bigquery-legacy and bigquery-usage-legacy have been removed
Other notable Changes
- Some security updates are part of this release and may cause login issues until cookies are cleared. If you face issues logging in, please clear your cookies.
0.9.4 / 0.9.5
Breaking Changes
- #6243: The apache-ranger authorizer is no longer a core part of DataHub GMS; it has been shifted to a plugin. Please refer to the updated documentation, Configuring Authorization with Apache Ranger, for configuring apache-ranger-plugin in DataHub GMS.
- #6243: The apache-ranger authorizer as a plugin is not supported in the DataHub Kubernetes deployment.
- #6243: Authentication and Authorization plugin configuration has been removed from application.yml. Refer to the documentation Migration Of Plugins From application.yml for migrating any existing custom plugins.
- The datahub check graph-consistency command has been removed. It was a beta API that we had considered, but we decided there are better solutions for this.
- The graphql_url option of the powerbi-report-server source is deprecated as the option is not used.
- #6789: BigQuery ingestion: If enable_legacy_sharded_table_support is set to False, sharded table names will be suffixed with _yyyymmdd to make sure they don't clash with non-sharded tables (see the sketch below). This means that if stateful ingestion is enabled, old sharded tables will be recreated with a new id and attached tags/glossary terms/etc. will need to be added again. This behavior is not enabled by default yet, but will be enabled by default in a future release.
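A sketch of a bigquery recipe fragment that opts out of the legacy sharded-table handling. Only the enable_legacy_sharded_table_support flag comes from #6789; the project id is illustrative and credentials are assumed to come from the environment.

```python
# Sketch of a bigquery recipe fragment for #6789; the project id is illustrative
# and credentials are assumed to be provided via the environment.
recipe = {
    "source": {
        "type": "bigquery",
        "config": {
            "project_id": "my-gcp-project",                 # illustrative
            "enable_legacy_sharded_table_support": False,   # suffix shards with _yyyymmdd
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
```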
Potential Downtime
Deprecations
Other notable Changes
- #6611: Snowflake schema_pattern now accepts a pattern for the fully qualified schema name in the format <catalog_name>.<schema_name> by setting the config match_fully_qualified_names: True (see the sketch after this list). The current default match_fully_qualified_names: False is only to maintain backward compatibility. The config option match_fully_qualified_names will be deprecated in the future, and the default behavior will assume match_fully_qualified_names: True.
- #6636: The snowflake-legacy and snowflake-usage-legacy sources have been removed.
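A sketch of a snowflake recipe fragment that opts into fully qualified schema matching (#6611). The connection fields and pattern values are illustrative; only schema_pattern and match_fully_qualified_names are the point here.

```python
# Sketch of a snowflake recipe fragment for #6611; connection details and the
# pattern values are illustrative.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "abc12345",         # illustrative
            "username": "datahub_user",       # illustrative
            "password": "${SNOWFLAKE_PASS}",  # illustrative
            "match_fully_qualified_names": True,
            "schema_pattern": {
                # Patterns now match <catalog_name>.<schema_name>
                "allow": ["analytics_db\\.public", "analytics_db\\.marts_.*"]
            },
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
```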
0.9.3
Breaking Changes
- The beta datahub check graph-consistency command has been removed.
Potential Downtime
Deprecations
- PowerBI source: workspace_id_pattern is introduced in place of workspace_id. workspace_id is now deprecated and set for removal in a future version (see the sketch below).
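A sketch of a powerbi recipe fragment using the new workspace_id_pattern instead of the deprecated workspace_id. The credential fields and the workspace id are illustrative assumptions.

```python
# Sketch of a powerbi recipe fragment; credential fields and the workspace id
# are illustrative.
recipe = {
    "source": {
        "type": "powerbi",
        "config": {
            "tenant_id": "00000000-0000-0000-0000-000000000000",  # illustrative
            "client_id": "my-azure-app-client-id",                 # illustrative
            "client_secret": "${POWERBI_CLIENT_SECRET}",            # illustrative
            "workspace_id_pattern": {
                "allow": ["4bd10256-e999-45dd-8e56-571c77153a5f"]   # illustrative workspace id
            },
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
```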
Other notable Changes
0.9.2
- The LookML source will only emit views that are reachable from explores while scanning your git repo. Previous behavior can be achieved by setting emit_reachable_views_only to False.
- The LookML source will always lowercase urns for lineage edges from views to upstream tables. There is no fallback to the previous behavior because the earlier application of lower-casing was inconsistent.
- The dbt config node_type_pattern, which was previously deprecated, has been removed. Use entities_enabled instead to control whether to emit metadata for sources, models, seeds, tests, etc.
- The dbt source will always lowercase urns for lineage edges to the underlying data platform.
- The DataHub Airflow lineage backend and plugin no longer support Airflow 1.x. You can still run DataHub ingestion in Airflow 1.x using the PythonVirtualenvOperator.
Breaking Changes
- #6570: The snowflake connector now populates created and last modified timestamps for Snowflake datasets and containers. This version of the snowflake connector will not work with datahub-gms versions older than v0.9.3.
Potential Downtime
Deprecations
Other notable Changes
0.9.1
Breaking Changes
- We have promoted bigquery-beta to bigquery. If you are using bigquery-beta, change your recipes to use the type bigquery.
Potential Downtime
Deprecations
Other notable Changes
0.9.0
Breaking Changes
- Java version 11 or greater is required.
- For any of the GraphQL search queries, the input no longer supports value but instead now accepts a list of values. These values represent an OR relationship where the field value must match any of the values.
Potential Downtime
Deprecations
Other notable Changes
v0.8.45
Breaking Changes
- The getNativeUserInviteToken and createNativeUserInviteToken GraphQL endpoints have been renamed to getInviteToken and createInviteToken respectively. Additionally, both now accept an optional roleUrn parameter. Both endpoints also now require the MANAGE_POLICIES privilege to execute, rather than the MANAGE_USER_CREDENTIALS privilege.
- One of the default policies shipped with DataHub (urn:li:dataHubPolicy:7, or "All Users - All Platform Privileges") has been edited to no longer include MANAGE_POLICIES. Its name has consequently been changed to "All Users - All Platform Privileges (EXCEPT MANAGE POLICIES)". This change was made to prevent all users from effectively acting as superusers by default.
Potential Downtime
Deprecations
Other notable Changes
v0.8.44
Breaking Changes
- Browse Paths have been upgraded to a new format to align more closely with the intention of the feature. Learn more about the changes, including steps on upgrading, here: https://datahubproject.io/docs/advanced/browse-paths-upgrade
- The dbt ingestion source's disable_dbt_node_creation and load_schema options have been removed. They were no longer necessary due to the recently added sibling entities functionality.
- The snowflake source now uses a newer, faster implementation (formerly snowflake-beta). The config properties provision_role and check_role_grants are no longer supported. The older snowflake and snowflake-usage sources are available as the snowflake-legacy and snowflake-usage-legacy sources respectively.
Potential Downtime
- [Helm] If you're using Helm, please ensure that your version of the datahub-actions container is bumped to v0.0.7 or head. This version contains changes to support running ingestion in debug mode. Previous versions are not compatible with this release. Upgrading to helm chart version 0.2.103 will ensure that you have the compatible versions by default.
Deprecations
Other notable Changes
v0.8.42
Breaking Changes
- Python 3.6 is no longer supported for metadata ingestion
- #5451: The GMS_HOST and GMS_PORT environment variables deprecated in v0.8.39 have been removed. Use DATAHUB_GMS_HOST and DATAHUB_GMS_PORT instead.
- #5478: The DataHub CLI delete command, when used with the --hard option, will delete soft-deleted entities that match the other filters given.
- #5471: Looker now populates userEmail in dashboard user usage stats. This version of the Looker connector will not work with older versions of datahub-gms if you have the extract_usage_history Looker config enabled.
- #5529: The ANALYTICS_ENABLED environment variable in datahub-gms is now deprecated. Use DATAHUB_ANALYTICS_ENABLED instead.
- #5485: The --include-removed option was removed from the delete CLI.
Potential Downtime
Deprecations
Other notable Changes
v0.8.41
Breaking Changes
- The should_overwrite flag in csv-enricher has been replaced with write_semantics to match the format used for other sources. See the documentation for more details.
- Closing an authorization hole in creating tags: a Platform Privilege called Create Tags has been added for creating tags. This is assigned to the datahub root user, along with the default All Users policy. Notice: You may need to add this privilege (or Manage Tags) to existing users that need the ability to create tags on the platform.
- #5329: The following profiling config parameters are now supported in BigQuery (see the sketch after this list):
  - profiling.profile_if_updated_since_days (default=1)
  - profiling.profile_table_size_limit (default=1GB)
  - profiling.profile_table_row_limit (default=50000)
  Set the above parameters to null if you want the older behaviour.
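A sketch of the new BigQuery profiling limits from #5329 as a recipe fragment. The project id is illustrative; the three limit parameters are the ones listed above, and any of them can be set to None to restore the older behaviour.

```python
# Sketch of the #5329 BigQuery profiling limits as a recipe fragment.
# The project id is illustrative; set any of the three limits to None for the old behaviour.
recipe = {
    "source": {
        "type": "bigquery",
        "config": {
            "project_id": "my-gcp-project",  # illustrative
            "profiling": {
                "enabled": True,
                "profile_if_updated_since_days": 1,
                "profile_table_size_limit": 1,      # in GB
                "profile_table_row_limit": 50000,
            },
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
```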
Potential Downtime
Deprecations
Other notable Changes
v0.8.40
Breaking Changes
- #5240: The lineage_client_project_id option in the bigquery source has been removed. Use storage_project_id instead.
Potential Downtime
Deprecations
Other notable Changes
v0.8.39
Breaking Changes
- Refactored the health field of the Dataset GraphQL type to be a list of HealthStatus (previously a single HealthStatus). See this PR for more details.
Potential Downtime
Deprecations
- #4875: LookML view file contents will no longer be populated in custom_properties; instead, view definitions will always be available in the View Definitions tab.
- #5208: The GMS_HOST and GMS_PORT environment variables set in various containers are deprecated in favour of DATAHUB_GMS_HOST and DATAHUB_GMS_PORT.
- The KAFKA_TOPIC_NAME environment variable in datahub-mae-consumer and datahub-gms is now deprecated. Use METADATA_AUDIT_EVENT_NAME instead.
- The KAFKA_MCE_TOPIC_NAME environment variable in datahub-mce-consumer and datahub-gms is now deprecated. Use METADATA_CHANGE_EVENT_NAME instead.
- The KAFKA_FMCE_TOPIC_NAME environment variable in datahub-mce-consumer and datahub-gms is now deprecated. Use FAILED_METADATA_CHANGE_EVENT_NAME instead.
Other notable Changes
- #5132: Tables in the snowflake source are profiled only if they have been updated within the configured number of days (default: 1). Update the config profiling.profile_if_updated_since_days as per your profiling schedule, or set it to None if you want the older behaviour.
v0.8.38
Breaking Changes
Potential Downtime
Deprecations
Other notable Changes
- Create & Revoke Access Tokens via the UI
- Create and Manage new users via the UI
- Improvements to Business Glossary UI
- FIX - Do not require reindexing to migrate to using the UI business glossary
v0.8.36
Breaking Changes
- In this release we introduce a brand new Business Glossary experience. With this new experience comes some new ways of indexing data in order to make viewing and traversing the different levels of your Glossary possible. Therefore, you will have to restore your indices in order for the new Glossary experience to work for users that already have existing Glossaries. If this is your first time using DataHub Glossaries, you're all set!
Potential Downtime
Deprecations
Other notable Changes
- #4961: Dropped profiles are no longer reported by default, as that caused a lot of spurious logging in some cases. Set profiling.report_dropped_profiles to True if you want the older behaviour.
v0.8.35
Breaking Changes
Potential Downtime
Deprecations
- #4875: LookML view file contents will no longer be populated in custom_properties; instead, view definitions will always be available in the View Definitions tab.
Other notable Changes
v0.8.34
Breaking Changes
- #4644: Removed the database option from the snowflake source, which had been deprecated since v0.8.5.
- #4595: Renamed the confusing config report_upstream_lineage to upstream_lineage_in_report in the snowflake connector, which was added in 0.8.32.
Potential Downtime
Deprecations
- #4644: The host_port option of the snowflake and snowflake-usage sources has been deprecated as the name was confusing. Use the account_id option instead.
Other notable Changes
- #4760: The check_role_grants option was added to the snowflake source to allow disabling role checks, as some people were reporting long run times when checking roles (see the sketch below).
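A sketch of a snowflake recipe fragment for this era of the connector: account_id replaces the deprecated host_port (#4644), and check_role_grants can be turned off (#4760). The credential fields are illustrative. Note that check_role_grants was later dropped when the snowflake source was reimplemented in v0.8.44.

```python
# Sketch of a snowflake recipe fragment for the v0.8.34-era connector.
# Credential values are illustrative.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "abc12345",         # replaces the deprecated host_port
            "username": "datahub_user",       # illustrative
            "password": "${SNOWFLAKE_PASS}",  # illustrative
            "check_role_grants": False,       # skip role checks to avoid long run times
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
```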