欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

Delta lake in python [Local][Basic usage]

程序员文章站 2022-07-14 20:45:52
...

Delta lake in python [Local]


Create a table

The first time you use delta lake to create table will download some jars

Py code

from delta import *
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

    spark = spark = configure_spark_with_delta_pip(builder).getOrCreate()

    data = spark.range(0, 5)
    data.write.format("delta").save("/tmp/delta-table")

output

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/delta/test.py
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-bc2db27b-406e-4a59-a5a6-81e4ceefcdd9;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.0.0 in central
	found org.antlr#antlr4;4.7 in central
	found org.antlr#antlr4-runtime;4.7 in local-m2-cache
	found org.antlr#antlr-runtime;3.5.2 in central
	found org.antlr#ST4;4.0.8 in central
	found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
	found org.glassfish#javax.json;1.0.4 in central
	found com.ibm.icu#icu4j;58.2 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.0/delta-core_2.12-1.0.0.jar ...
	[SUCCESSFUL ] io.delta#delta-core_2.12;1.0.0!delta-core_2.12.jar (4279ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4/4.7/antlr4-4.7.jar ...
	[SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (1363ms)
downloading file:/Users/gavin/.m2/repository/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar ...
	[SUCCESSFUL ] org.antlr#antlr4-runtime;4.7!antlr4-runtime.jar (5ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr-runtime/3.5.2/antlr-runtime-3.5.2.jar ...
	[SUCCESSFUL ] org.antlr#antlr-runtime;3.5.2!antlr-runtime.jar (310ms)
downloading https://repo1.maven.org/maven2/org/antlr/ST4/4.0.8/ST4-4.0.8.jar ...
	[SUCCESSFUL ] org.antlr#ST4;4.0.8!ST4.jar (349ms)
downloading https://repo1.maven.org/maven2/org/abego/treelayout/org.abego.treelayout.core/1.0.3/org.abego.treelayout.core-1.0.3.jar ...
	[SUCCESSFUL ] org.abego.treelayout#org.abego.treelayout.core;1.0.3!org.abego.treelayout.core.jar(bundle) (211ms)
downloading https://repo1.maven.org/maven2/org/glassfish/javax.json/1.0.4/javax.json-1.0.4.jar ...
	[SUCCESSFUL ] org.glassfish#javax.json;1.0.4!javax.json.jar(bundle) (255ms)
downloading https://repo1.maven.org/maven2/com/ibm/icu/icu4j/58.2/icu4j-58.2.jar ...
	[SUCCESSFUL ] com.ibm.icu#icu4j;58.2!icu4j.jar (5898ms)
:: resolution report :: resolve 21270ms :: artifacts dl 12676ms
	:: modules in use:
	com.ibm.icu#icu4j;58.2 from central in [default]
	io.delta#delta-core_2.12;1.0.0 from central in [default]
	org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
	org.antlr#ST4;4.0.8 from central in [default]
	org.antlr#antlr-runtime;3.5.2 from central in [default]
	org.antlr#antlr4;4.7 from central in [default]
	org.antlr#antlr4-runtime;4.7 from local-m2-cache in [default]
	org.glassfish#javax.json;1.0.4 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   8   |   8   |   8   |   0   ||   8   |   8   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-bc2db27b-406e-4a59-a5a6-81e4ceefcdd9
	confs: [default]
	8 artifacts copied, 0 already retrieved (15452kB/21ms)
21/12/01 17:32:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                
Process finished with exit code 0

Note: download some jars

Result

[email protected] delta-table % ll -R
total 48
drwxr-xr-x  3 gavin  wheel   96 Dec  1 17:32 _delta_log
-rw-r--r--  1 gavin  wheel  296 Dec  1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet

./_delta_log:
total 8
-rw-r--r--  1 gavin  wheel  1602 Dec  1 17:32 00000000000000000000.json

Json file contains commitInfo/protocol/metaData and every commit record:

{"commitInfo":{"timestamp":1638351150505,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputBytes":"2611","numOutputRows":"5"}}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"11a935b4-4f4b-435c-bd40-84a3a07aaf1f","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1638351148221}}
{"add":{"path":"part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1638351150000,"dataChange":true}}
{"add":{"path":"part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
{"add":{"path":"part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
{"add":{"path":"part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
{"add":{"path":"part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
{"add":{"path":"part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}

Read data

Py code

from delta import *
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    spark_session = configure_spark_with_delta_pip(builder).getOrCreate()
    
    df = spark_session.read.format("delta").load("/tmp/delta-table")
    df.show()

  '''
  result:
  +---+
  | id|
  +---+
  |  1|
  |  4|
  |  0|
  |  2|
  |  3|
  +---+
  '''

Update table data

Overwrite

from delta import *
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    spark = spark_session = configure_spark_with_delta_pip(builder).getOrCreate()

    data = spark.range(5, 10)
    data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

    df = spark_session.read.format("delta").load("/tmp/delta-table")
    df.show()

    '''
    result:
    +---+
    | id|
    +---+
    |  9|
    |  6|
    |  8|
    |  5|
    |  7|
    +---+
    '''

Files in dir:

[email protected] delta-table % ll -R                                  
total 96
drwxr-xr-x  4 gavin  wheel  128 Dec  3 15:54 _delta_log
-rw-r--r--  1 gavin  wheel  296 Dec  1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet

./_delta_log:
total 16
-rw-r--r--  1 gavin  wheel  1602 Dec  1 17:32 00000000000000000000.json
-rw-r--r--  1 gavin  wheel  2475 Dec  3 15:53 00000000000000000001.json

Update with condition

from delta import *
import pyspark
from delta.tables import *
from pyspark.sql.functions import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Update every even value by adding 100 to it
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = { "id": expr("id + 100") })

deltaTable.toDF().show()
'''
result:
+---+
| id|
+---+
|  9|
|  5|
|106|
|  7|
|108|
+---+
'''

Files in dir:

[email protected] delta-table % ll -R
total 112
drwxr-xr-x  5 gavin  wheel  160 Dec  3 16:03 _delta_log
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet

./_delta_log:
total 24
-rw-r--r--  1 gavin  wheel  1602 Dec  1 17:32 00000000000000000000.json
-rw-r--r--  1 gavin  wheel  2475 Dec  3 15:53 00000000000000000001.json
-rw-r--r--  1 gavin  wheel  1075 Dec  3 16:03 00000000000000000002.json

Delete with condition

from delta import *
import pyspark
from delta.tables import *
from pyspark.sql.functions import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Delete every even value
deltaTable.delete(condition = expr("id % 2 == 0"))

deltaTable.toDF().show()
'''
result:
+---+
| id|
+---+
|  9|
|  5|
|  7|
+---+
'''

Upsert

from delta import *
import pyspark
from delta.tables import *
from pyspark.sql.functions import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Upsert (merge) new data
newData = spark.range(0, 20)

deltaTable.alias("oldData") \
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "id": col("newData.id") }) \
  .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
  .execute()

deltaTable.toDF().show()

'''
result:
+---+
| id|
+---+
| 12|
|  9|
|  5|
| 19|
| 16|
| 10|
| 18|
| 17|
|  3|
| 14|
| 11|
|  8|
|  4|
| 15|
|  7|
|  1|
|  6|
| 13|
|  0|
|  2|
+---+
'''

Files in dir:

[email protected] delta-table % ll -R
total 288
drwxr-xr-x  7 gavin  wheel  224 Dec  3 16:09 _delta_log
-rw-r--r--  1 gavin  wheel  296 Dec  3 16:07 part-00000-220c8521-a0b2-4a17-91c8-bd1aa000364a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 16:09 part-00000-eab34ac9-29df-4863-8021-fe200de0de64-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00004-d22db67c-fd60-4b3c-b873-54e210693b4c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00005-cebbe7c3-e06d-4a28-8fea-63fe0f447c9c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00011-d2e04069-de33-4f4b-9f5f-b375d0d81469-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00045-531a3391-78ee-42e6-8f05-ddf2ec84bbce-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00049-da28086b-f548-4ee2-897f-a2bf70c42d30-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00058-9527c574-c1ef-4ed3-ae46-b929ef76925a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00068-03a6df86-a802-45f2-8110-d4f198903a52-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00069-cb5bfa45-68cf-4ea8-a504-fb873131ab19-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00077-82317c4e-d4de-4e0a-bd25-6df7c16ad9de-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00107-08d1051b-63df-4bef-992d-3b55d979ae25-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00112-fdcd61ef-b173-4078-9a67-331709ec2367-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00116-6e0d3ddd-1b77-4124-9a16-7e0a9f097e79-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00121-4a1199a0-c7f8-4b79-9f0b-9d5509d1cd26-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00128-b401d079-7583-4542-8aaa-822e15bd0a64-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00140-9578ffcf-073d-4a3f-b9c2-fd2b28bce50f-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00143-b99b9424-bc51-4069-8619-b258eb452773-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00150-a000aae6-c7b9-4349-98ee-b3094a997b98-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00154-81ed9b75-4401-4849-9ef1-41249c34cd9c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00164-e68e77a9-cb7c-431b-8196-28e5abbf1526-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00190-222d17a8-92f6-4fde-b0df-c6ee0ce36f84-c000.snappy.parquet

./_delta_log:
total 48
-rw-r--r--  1 gavin  wheel  1602 Dec  1 17:32 00000000000000000000.json
-rw-r--r--  1 gavin  wheel  2475 Dec  3 15:53 00000000000000000001.json
-rw-r--r--  1 gavin  wheel  1075 Dec  3 16:03 00000000000000000002.json
-rw-r--r--  1 gavin  wheel   910 Dec  3 16:07 00000000000000000003.json
-rw-r--r--  1 gavin  wheel  4747 Dec  3 16:09 00000000000000000004.json

Read older versions of data using time travel

from delta import *
import pyspark
from delta.tables import *
from pyspark.sql.functions import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
df0.show()
''' 
+---+
| id|
+---+
|  1|
|  4|
|  0|
|  2|
|  3|
+---+
'''
df1 = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta-table")
df1.show()
'''
+---+
| id|
+---+
|  9|
|  6|
|  8|
|  5|
|  7|
+---+
'''
df2 = spark.read.format("delta").option("versionAsOf", 2).load("/tmp/delta-table")
df2.show()
'''
+---+
| id|
+---+
|  9|
|  5|
|106|
|  7|
|108|
+---+
'''
df3 = spark.read.format("delta").option("versionAsOf", 3).load("/tmp/delta-table")
df3.show()
'''
+---+
| id|
+---+
|  9|
|  5|
|  7|
+---+
'''
df4 = spark.read.format("delta").option("versionAsOf", 4).load("/tmp/delta-table")
df4.show()
'''
+---+
| id|
+---+
| 12|
|  9|
|  5|
| 19|
| 16|
| 10|
| 18|
| 17|
|  3|
| 14|
| 11|
|  8|
|  4|
| 15|
|  7|
|  1|
|  6|
| 13|
|  0|
|  2|
+---+
'''

Vacuum

Current files as bellows:

[email protected] delta-table % ll -R
total 288
drwxr-xr-x  7 gavin  wheel  224 Dec  3 16:09 _delta_log
-rw-r--r--  1 gavin  wheel  296 Dec  3 16:07 part-00000-220c8521-a0b2-4a17-91c8-bd1aa000364a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 16:09 part-00000-eab34ac9-29df-4863-8021-fe200de0de64-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00004-d22db67c-fd60-4b3c-b873-54e210693b4c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00005-cebbe7c3-e06d-4a28-8fea-63fe0f447c9c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00011-d2e04069-de33-4f4b-9f5f-b375d0d81469-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00045-531a3391-78ee-42e6-8f05-ddf2ec84bbce-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00049-da28086b-f548-4ee2-897f-a2bf70c42d30-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00058-9527c574-c1ef-4ed3-ae46-b929ef76925a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00068-03a6df86-a802-45f2-8110-d4f198903a52-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00069-cb5bfa45-68cf-4ea8-a504-fb873131ab19-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00077-82317c4e-d4de-4e0a-bd25-6df7c16ad9de-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00107-08d1051b-63df-4bef-992d-3b55d979ae25-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00112-fdcd61ef-b173-4078-9a67-331709ec2367-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00116-6e0d3ddd-1b77-4124-9a16-7e0a9f097e79-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00121-4a1199a0-c7f8-4b79-9f0b-9d5509d1cd26-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00128-b401d079-7583-4542-8aaa-822e15bd0a64-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00140-9578ffcf-073d-4a3f-b9c2-fd2b28bce50f-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00143-b99b9424-bc51-4069-8619-b258eb452773-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00150-a000aae6-c7b9-4349-98ee-b3094a997b98-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00154-81ed9b75-4401-4849-9ef1-41249c34cd9c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00164-e68e77a9-cb7c-431b-8196-28e5abbf1526-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00190-222d17a8-92f6-4fde-b0df-c6ee0ce36f84-c000.snappy.parquet

./_delta_log:
total 48
-rw-r--r--  1 gavin  wheel  1602 Dec  1 17:32 00000000000000000000.json
-rw-r--r--  1 gavin  wheel  2475 Dec  3 15:53 00000000000000000001.json
-rw-r--r--  1 gavin  wheel  1075 Dec  3 16:03 00000000000000000002.json
-rw-r--r--  1 gavin  wheel   910 Dec  3 16:07 00000000000000000003.json
-rw-r--r--  1 gavin  wheel  4747 Dec  3 16:09 00000000000000000004.json

To drop files before 1 hour ago, code:

from delta import *
import pyspark
from delta.tables import *
from pyspark.sql.functions import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.databricks.delta.retentionDurationCheck.enabled","false")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

deltaTable = DeltaTable.forPath(spark,"/tmp/delta-table")
# vacuum files not required by versions more than 1 hours old
deltaTable.vacuum(1)


Result:

[email protected] delta-table % 
[email protected] delta-table % date
Fri Dec  3 16:53:20 CST 2021
[email protected] delta-table % 
[email protected] delta-table % ll -R
total 240
drwxr-xr-x  7 gavin  wheel  224 Dec  3 16:09 _delta_log
-rw-r--r--  1 gavin  wheel  296 Dec  3 16:07 part-00000-220c8521-a0b2-4a17-91c8-bd1aa000364a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  296 Dec  3 16:09 part-00000-eab34ac9-29df-4863-8021-fe200de0de64-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00004-d22db67c-fd60-4b3c-b873-54e210693b4c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00005-cebbe7c3-e06d-4a28-8fea-63fe0f447c9c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00011-d2e04069-de33-4f4b-9f5f-b375d0d81469-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00045-531a3391-78ee-42e6-8f05-ddf2ec84bbce-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00049-da28086b-f548-4ee2-897f-a2bf70c42d30-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00058-9527c574-c1ef-4ed3-ae46-b929ef76925a-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00068-03a6df86-a802-45f2-8110-d4f198903a52-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00069-cb5bfa45-68cf-4ea8-a504-fb873131ab19-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00077-82317c4e-d4de-4e0a-bd25-6df7c16ad9de-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00107-08d1051b-63df-4bef-992d-3b55d979ae25-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00112-fdcd61ef-b173-4078-9a67-331709ec2367-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00116-6e0d3ddd-1b77-4124-9a16-7e0a9f097e79-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00121-4a1199a0-c7f8-4b79-9f0b-9d5509d1cd26-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00128-b401d079-7583-4542-8aaa-822e15bd0a64-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00140-9578ffcf-073d-4a3f-b9c2-fd2b28bce50f-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00143-b99b9424-bc51-4069-8619-b258eb452773-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00150-a000aae6-c7b9-4349-98ee-b3094a997b98-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00154-81ed9b75-4401-4849-9ef1-41249c34cd9c-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00164-e68e77a9-cb7c-431b-8196-28e5abbf1526-c000.snappy.parquet
-rw-r--r--  1 gavin  wheel  463 Dec  3 16:09 part-00190-222d17a8-92f6-4fde-b0df-c6ee0ce36f84-c000.snappy.parquet

./_delta_log:
total 48
-rw-r--r--  1 gavin  wheel  1602 Dec  1 17:32 00000000000000000000.json
-rw-r--r--  1 gavin  wheel  2475 Dec  3 15:53 00000000000000000001.json
-rw-r--r--  1 gavin  wheel  1075 Dec  3 16:03 00000000000000000002.json
-rw-r--r--  1 gavin  wheel   910 Dec  3 16:07 00000000000000000003.json
-rw-r--r--  1 gavin  wheel  4747 Dec  3 16:09 00000000000000000004.json
[email protected] delta-table %