@pezon
Last active May 3, 2019 15:58
Drop managed tables from Spark cluster

Maintenance - Drop managed tables

This notebook cleans up the table space by dropping all managed tables, including their metadata and underlying data. This is necessary when updating table schemas (e.g., changing data types).

Note: It is advised to back up your data first.

Recommended procedure

  • Back up data tables in ADLS (not implemented here; a rough sketch follows this list).
  • Run the maintenance - drop managed tables notebook (this notebook).
  • Re-import all data.
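The back-up step is not part of this notebook. As a rough illustration only, the sketch below copies one managed table out to an ADLS Gen2 path. The storage account, container, and table name are hypothetical placeholders, and it assumes a Databricks notebook where spark is predefined and the ADLS account is already reachable from the cluster.

// Hypothetical back-up sketch; not part of this notebook.
// The ADLS path and table name below are placeholders.
val backupRoot = "abfss://backups@<storage-account>.dfs.core.windows.net/table-backups" // placeholder
val tableName = "my_database.my_table"                                                  // placeholder
spark.table(tableName)
  .write
  .format("parquet")
  .mode("overwrite")
  .save(s"$backupRoot/${tableName.replace('.', '/')}")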

Technical details about Delta table operations

Dropping Delta tables can be very time-consuming because they carry extensive metadata and transaction logs. According to Best Practices for Dropping Managed Databricks Delta Tables, DELETE FROM and VACUUM should be run before dropping a Delta table. This cleans up metadata and data files before the drop itself, reducing the amount of metadata and the number of uncommitted files that would otherwise slow down deletion.
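For a single table, the sequence this notebook applies per table looks roughly like the sketch below. The table name events is a hypothetical example, and the retention safety check must be disabled for VACUUM ... RETAIN 0 HOURS to run.

// Minimal per-table sketch; `events` is a hypothetical table name.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", false)
spark.sql("DELETE FROM events")           // logically remove all rows first
spark.sql("VACUUM events RETAIN 0 HOURS") // physically remove files no longer referenced by the log
spark.sql("DROP TABLE events")            // drop the metadata and remaining managed data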

Listing catalog tables is also a costly operation; see Listing Table Names for more information.
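If only the table names are needed, a SHOW TABLES query is presumably cheaper than spark.catalog.listTables(), which materializes full metadata for every table. The snippet below is a sketch of that alternative; my_database is a hypothetical database name.

// Sketch: enumerate table names via SQL instead of spark.catalog.listTables().
// `my_database` is a hypothetical database name.
val tableNames = spark.sql("SHOW TABLES IN my_database")
  .select("tableName")
  .collect()
  .map(_.getString(0))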

import spark.implicits._
import spark.sql

// Allow VACUUM ... RETAIN 0 HOURS (normally blocked by the retention safety check).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", false)

val databases = spark.catalog.listDatabases()
for (database <- databases.collect()) {
  spark.catalog.setCurrentDatabase(database.name)
  val tables = spark.catalog.listTables()
  for (table <- tables.collect()) {
    // Skip tables whose names start with "daily_" or "average_".
    if (!table.name.startsWith("daily_") && !table.name.startsWith("average_")) {
      println(s"Dropping ${database.name}.${table.name}")
      // Empty the Delta table first to reduce the metadata and files left to clean up.
      try {
        sql(s"DELETE FROM ${table.name}")
      } catch {
        case _: Exception =>
          println(">>> Not a Delta table: skipping DELETE operation")
      }
      // Remove data files no longer referenced by the transaction log.
      try {
        sql(s"VACUUM ${table.name} RETAIN 0 HOURS")
      } catch {
        case _: Exception =>
          println(">>> Could not be vacuumed")
      }
      // Drop the table metadata and any remaining managed data.
      try {
        sql(s"DROP TABLE ${table.name}")
      } catch {
        case _: Exception =>
          println(">>> Could not be deleted")
      }
    }
  }
}
spark.catalog.clearCache()
println("Done")