Skip to content

Instantly share code, notes, and snippets.

@antonycourtney
Last active January 15, 2023 23:01
Show Gist options
  • Save antonycourtney/24a19ae34d877779d1e30b4a99f23d55 to your computer and use it in GitHub Desktop.
Save antonycourtney/24a19ae34d877779d1e30b4a99f23d55 to your computer and use it in GitHub Desktop.
Parquet cleaning example in DuckDb SQL
-- This should all work with the DuckDb shell CLI, available from https://duckdb.org/docs/installation/index
-- (Select the CLI tab)
-- first create a view on the Parquet file:
CREATE VIEW books AS SELECT * from './goodreads_books.parquet';
-- take a look at the schema:
PRAGMA table_info(books);
-- Fix up one of the columns (book_id), converting it from VARCHAR to INTEGER (int32):
CREATE TABLE cleaned_books AS SELECT * REPLACE (CAST(book_id AS INTEGER) AS book_id) FROM books;
-- take a look at cleaned up schema:
PRAGMA table_info(cleaned_books);
-- write back out as Parquet:
COPY (SELECT * FROM cleaned_books) TO './goodreads_books_cleaned.parquet' (FORMAT 'parquet');
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment