Created
January 17, 2023 11:18
Migrating Weaviate Data
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Migrating data in Weaviate\n",
    "\n",
    "This method migrates data using raw queries and is intended for relatively small datasets. There are two other options:\n",
    "\n",
    "1. Use the backups feature: https://weaviate.io/developers/weaviate/current/configuration/backups.html\n",
    "\n",
    "2. Use the upcoming scroll API: https://github.com/semi-technologies/weaviate/issues/2302. This feature is planned for 1.18 (so relatively soon) and will remove the query maximum results limit described below, allowing millions of records to be paged out efficiently.\n",
    "\n",
    "## Process\n",
    "\n",
    "1. Increase `QUERY_MAXIMUM_RESULTS` in your source environment. It is set as an environment variable in `docker-compose.yml`. A higher value is needed to retrieve a large result set in a single query.\n",
    "\n",
    "```yaml\n",
    "QUERY_MAXIMUM_RESULTS: 100000\n",
    "```\n",
    "\n",
    "### Retrieve all data from existing class and write to new class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "import weaviate\n",
    "from tqdm import tqdm\n",
    "\n",
    "client = weaviate.Client(\"http://localhost:8080\")\n",
    "LIMIT = 100000  # must not exceed QUERY_MAXIMUM_RESULTS on the source instance\n",
    "\n",
    "def pull_data_for_class(client: weaviate.Client, class_name: str):\n",
    "    # Fetch the class schema once and reuse it to list the property names.\n",
    "    schema = client.schema.get(class_name)\n",
    "    fields = [field[\"name\"] for field in schema[\"properties\"]]\n",
    "    data = client.query.get(class_name=class_name, properties=fields).with_additional(['vector', 'id']).with_limit(LIMIT).do()\n",
    "    if \"errors\" in data:\n",
    "        raise Exception(data[\"errors\"])\n",
    "    return schema, data[\"data\"][\"Get\"][class_name]\n",
    "\n",
    "def push_data(client: weaviate.Client, class_name: str, schema: dict, data: list):\n",
    "    # Create the target class from the source schema under the new name.\n",
    "    schema[\"class\"] = class_name\n",
    "    client.schema.create_class(schema)\n",
    "    with client.batch as batch:\n",
    "        for record in tqdm(data):\n",
    "            vector = record[\"_additional\"][\"vector\"]\n",
    "            uuid = record[\"_additional\"][\"id\"]\n",
    "            del record[\"_additional\"]\n",
    "            # A missing vector may come back as None or []; add such objects without one.\n",
    "            if not vector:\n",
    "                batch.add_data_object(data_object=record, class_name=class_name, uuid=uuid)\n",
    "            else:\n",
    "                batch.add_data_object(data_object=record, class_name=class_name, uuid=uuid, vector=vector)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 100000/100000 [00:00<00:00, 285620.39it/s]\n"
     ]
    }
   ],
   "source": [
    "schema, data = pull_data_for_class(client, \"Item2\")\n",
    "\n",
    "# Write to the new class; pass a different client here to migrate to another instance.\n",
    "push_data(client, \"Item4\", schema, data)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5c7b89af1651d0b8571dde13640ecdccf7d5a6204171d6ab33e7c296e100e08a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
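
The notebook pulls every object in a single query and removes `_additional` from each record in place. As a hedged sketch that is not part of the original notebook, the helpers below show one way to fetch in pages and to split a record without mutating it. `iter_pages`, `split_record`, and the `fetch_page(limit, offset)` callable are hypothetical names introduced here for illustration. In practice `fetch_page` would wrap a query like the one in `pull_data_for_class`, assuming your client version supports an offset. As far as I know, `offset + limit` is still bounded by `QUERY_MAXIMUM_RESULTS`, so paging this way shrinks each response but does not lift that ceiling; the `max_results` argument models it.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, object]


def iter_pages(
    fetch_page: Callable[[int, int], List[Record]],
    page_size: int,
    max_results: int,
) -> Iterator[List[Record]]:
    """Yield pages until a short or empty page, or until max_results is reached.

    fetch_page(limit, offset) is a hypothetical callable supplied by the caller.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while offset < max_results:
        # Never request past the ceiling: offset + limit stays <= max_results.
        limit = min(page_size, max_results - offset)
        page = fetch_page(limit, offset)
        if not page:
            return
        yield page
        if len(page) < limit:
            # A short page means the source has no more records.
            return
        offset += len(page)


def split_record(record: Record) -> Tuple[Record, Optional[str], Optional[list]]:
    """Return (properties, uuid, vector) without modifying the input record.

    An empty or missing vector is normalised to None.
    """
    additional = record.get("_additional") or {}
    properties = {k: v for k, v in record.items() if k != "_additional"}
    return properties, additional.get("id"), additional.get("vector") or None


# In-memory stand-in for a source class, used only to exercise the helpers.
_fake_source = [
    {"name": f"item{i}", "_additional": {"id": str(i), "vector": [float(i)]}}
    for i in range(7)
]


def _fake_fetch(limit: int, offset: int) -> List[Record]:
    return _fake_source[offset:offset + limit]


migrated = [
    split_record(record)
    for page in iter_pages(_fake_fetch, page_size=3, max_results=100)
    for record in page
]
```

Each tuple in `migrated` could then be passed to `batch.add_data_object`, omitting the `vector` argument when it is `None`, in the same way `push_data` does.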