Last active
January 15, 2023 23:01
-
-
Save antonycourtney/24a19ae34d877779d1e30b4a99f23d55 to your computer and use it in GitHub Desktop.
Parquet cleaning example in DuckDb SQL
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- This should all work with the DuckDb shell CLI, available from https://duckdb.org/docs/installation/index | |
-- (Select the CLI tab) | |
-- first create a view on the Parquet file: | |
CREATE VIEW books AS SELECT * from './goodreads_books.parquet'; | |
-- take a look at the schema: | |
PRAGMA table_info(books); | |
-- Fix up one of the columns (book_id), converting it from VARCHAR to INTEGER (int32): | |
CREATE TABLE cleaned_books AS SELECT * REPLACE (CAST(book_id AS INTEGER) AS book_id) FROM books; | |
-- take a look at cleaned up schema: | |
PRAGMA table_info(cleaned_books); | |
-- write back out as Parquet: | |
COPY (SELECT * FROM cleaned_books) TO './goodreads_books_cleaned.parquet' (FORMAT 'parquet'); |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment