@trengrj
Created January 17, 2023 11:18
Migrating Weaviate Data
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Migrating data in Weaviate\n",
"\n",
"This method migrates data using raw queries and is intended for relatively small datasets. There are a couple of other options:\n",
"\n",
"1. Use the backups feature: https://weaviate.io/developers/weaviate/current/configuration/backups.html\n",
"\n",
"2. Use the upcoming scroll API (https://github.com/semi-technologies/weaviate/issues/2302). This feature is planned for 1.18 and will remove the query maximum results limit described below, making it possible to page through millions of records efficiently.\n",
"\n",
"## Process\n",
"\n",
"1. Increase `QUERY_MAXIMUM_RESULTS` on your source environment. It is set as an environment variable in `docker-compose.yml`. The query below fetches every object of a class in a single request, so the limit must be at least the number of objects in that class.\n",
"\n",
"```yaml\n",
"QUERY_MAXIMUM_RESULTS: 100000\n",
"```\n",
"\n",
"### Retrieve all data from the existing class and write it to a new class"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from tqdm import tqdm\n",
"\n",
"client = weaviate.Client(\"http://localhost:8080\")\n",
"# Must not exceed QUERY_MAXIMUM_RESULTS on the source instance.\n",
"LIMIT = 100000\n",
"\n",
"def pull_data_for_class(client: weaviate.Client, class_name: str):\n",
"    # Fetch the class schema once and reuse it for the property names.\n",
"    schema = client.schema.get(class_name)\n",
"    fields = [field[\"name\"] for field in schema[\"properties\"]]\n",
"    data = client.query.get(class_name=class_name, properties=fields).with_additional(['vector', 'id']).with_limit(LIMIT).do()\n",
"    if \"errors\" in data:\n",
"        raise Exception(data[\"errors\"])\n",
"    return schema, data[\"data\"][\"Get\"][class_name]\n",
"\n",
"def push_data(client: weaviate.Client, class_name: str, schema: dict, data: list):\n",
"    # Reuse the source schema under the new class name.\n",
"    schema[\"class\"] = class_name\n",
"    client.schema.create_class(schema)\n",
"    with client.batch as batch:\n",
"        for record in tqdm(data):\n",
"            vector = record[\"_additional\"][\"vector\"]\n",
"            uuid = record[\"_additional\"][\"id\"]\n",
"            del record[\"_additional\"]\n",
"            if not vector:\n",
"                # No vector stored for this object, so add it without one.\n",
"                batch.add_data_object(data_object=record, class_name=class_name, uuid=uuid)\n",
"            else:\n",
"                batch.add_data_object(data_object=record, class_name=class_name, uuid=uuid, vector=vector)\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 100000/100000 [00:00<00:00, 285620.39it/s]\n"
]
}
],
"source": [
"schema, data = pull_data_for_class(client, \"Item2\")\n",
"\n",
"# To migrate to another instance, pass a client connected to it instead of `client`.\n",
"push_data(client, \"Item4\", schema, data)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "5c7b89af1651d0b8571dde13640ecdccf7d5a6204171d6ab33e7c296e100e08a"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
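
Note on `push_data`: it deletes `_additional` from each record in place, so the pulled `data` list cannot be pushed a second time (for example to a second target class) without pulling again. A minimal sketch of a non-mutating alternative follows. The helper name `split_record` and the sample record are hypothetical and not part of the notebook or of the Weaviate client; the only assumption carried over is the record shape used above, a dict of properties plus an `_additional` dict holding `id` and `vector`.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], str, Optional[List[float]]]:
    """Split one Get result into (properties, uuid, vector) without mutating it."""
    additional = record["_additional"]
    # Copy every property except the _additional block.
    properties = {key: value for key, value in record.items() if key != "_additional"}
    # Treat a missing or empty vector as "no vector".
    vector = additional.get("vector") or None
    return properties, additional["id"], vector


# Hypothetical record in the shape the Get query above returns.
sample = {"title": "example", "_additional": {"id": "hypothetical-uuid", "vector": []}}
properties, uuid, vector = split_record(sample)
print(properties, uuid, vector)
```

Inside the batch loop, the three returned values could be passed to `batch.add_data_object` in place of the `del record["_additional"]` step, leaving `data` intact for reuse.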