@khaeru
Created September 23, 2024 15:00
{
"cells": [
{
"cell_type": "markdown",
"id": "f50be19a-7054-4970-9827-02a02b0134e1",
"metadata": {},
"source": [
"# Example: Transport ‘glossary’ based on SDMX\n",
"\n",
"This notebook gives a brief sketch of an “SDMX-first” approach to creating metadata products like the “Glossary for transport statistics” maintained by the UN ECE, Eurostat, and ITF-OECD. (Info page in [English](https://unece.org/transport/publications/glossary-transport-statistics), [Français](https://unece.org/fr/transport/publications/glossaire-des-statistiques-de-transport-5eme-edition), [Pусский](https://unece.org/ru/transport/publications/glossariy-po-statistike-transporta) for the 2019 edition.)\n",
"\n",
"Glossaries, code books, and other similar **metadata products** contain information that is used to explain how data *should* be structured (before it is collected), or to understand the precise meaning of data that refer to them (after that data is collected). They may not contain any data directly. A common way to work on such products is:\n",
"1. First work on a document, like a Word or Open Office document, that contains definitions structured in some way.\n",
"2. Publish that document (1), e.g. as a PDF.\n",
"3. Then, convert the terms, definitions, etc. from (2) into other formats to be used with data.\n",
"\n",
"This notebook illustrates a ‘reversed’ approach:\n",
"1. **First define data structures** in a standard format—specifically, the SDMX format developed by Eurostat and other organizations.\n",
"2. Use code to automatically generate various presentations of the structures from (1).\n",
"3. Also use (1) directly to structure data.\n",
"\n",
"The example is given first, followed by some discussion of potential benefits of this approach.\n",
"\n",
"## Create a first SDMX structure\n",
"\n",
"SDMX has a very extensive ‘Information Model’ that allows to talk about many types of concepts and use them flexibly in different ways.\n",
"This notebook does not give a full explanation, but [many learning resources](https://sdmx.org/?page_id=2555) are available.\n",
"\n",
"The first type of artefact we'll create is a **ConceptScheme**, which is a collection of **Concepts**.\n",
"In SDMX, these Concepts (also, many other kinds of artefacts) have some attributes like:\n",
"- **ID**: a short, unique, identifier that is both machine- and human-readable.\n",
"- **Name**: a human-readable name.\n",
"- **Description**: a longer, human-readable text.\n",
"\n",
"Let's create a Concept Scheme first.\n",
"We can use the title of the Glossary for the name, and some text from the introduction as the description:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9c9ce4fa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ConceptScheme GLOSSARY (0 items): Glossary for transport statistics, 5th edition>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sdmx.model import Annotation, Concept, ConceptScheme\n",
"\n",
"cs = ConceptScheme(\n",
" id=\"GLOSSARY\",\n",
" name=\"Glossary for transport statistics, 5th edition\",\n",
" description=\"\"\"The Glossary for transport statistics was published for the first\n",
" time in 1994 with the purpose of assisting member countries during the collection of\n",
" data on transport using the Common Questionnaire developed by the UNECE, ITF, and\n",
" Eurostat.\"\"\",\n",
")\n",
"\n",
"cs"
]
},
{
"cell_type": "markdown",
"id": "bd1d00d4",
"metadata": {},
"source": [
"## Add Concepts to the ConceptScheme\n",
"We see that the concept scheme currently has “0 items”.\n",
"Let's create a Concept that represents the first item from the glossary:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "277d7041",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Concept A.I-01: Track>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c1 = Concept(\n",
" id=\"A.I-01\",\n",
" name=\"Track\",\n",
" description=\"A pair of rails over which rail borne vehicles can run maintained…\",\n",
")\n",
"c1"
]
},
{
"cell_type": "markdown",
"id": "1e0e84d7",
"metadata": {},
"source": [
"Next, we add this to our concept scheme:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "138b312e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ConceptScheme GLOSSARY (1 items): Glossary for transport statistics, 5th edition>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cs.append(c1)\n",
"cs"
]
},
{
"cell_type": "markdown",
"id": "60038722",
"metadata": {},
"source": [
"…we see that the Concept Scheme now has 1 item.\n",
"Let's add a couple more:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "24873a96",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'A.I-01': <Concept A.I-01: Track>,\n",
" 'A.I-01.1': <Concept A.I-01.1: Main/running track>,\n",
" 'A.I-01.2': <Concept A.I-01.2: Other track>}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c2 = Concept(\n",
" id=\"A.I-01.1\",\n",
" name=\"Main/running track\",\n",
" description=\"A track providing end-to-end line continuity designed for running…\",\n",
" parent=c1,\n",
")\n",
"c3 = Concept(\n",
" id=\"A.I-01.2\",\n",
" name=\"Other track\",\n",
" description=\"\"\"All other tracks than main/running ones:\n",
" - tracks maintained, but not operated by the infrastructure manager.\n",
" \"\"\",\n",
" parent=c1,\n",
")\n",
"cs.append(c2)\n",
"cs.append(c3)\n",
"\n",
"cs.items"
]
},
{
"cell_type": "markdown",
"id": "ff7f4a0f",
"metadata": {},
"source": [
"## Use more features of SDMX\n",
"\n",
"In the above example, we see:\n",
"- The **description** can be arbitrarily long and detailed.\n",
"- The concept scheme can be **hierarchical**; each concept can have a **parent** or 1 or more **children**.\n",
" (This can help code that uses the concept scheme to do operations like aggregation automatically.)\n",
"\n",
"Another feature of SDMX is that it supports **internationalization**: many items like item names and descriptions can be expressed in multiple languages.\n",
"We see that the name “Track” was stored for the default locale:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "70e8d2cd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'en': 'Track'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c1.name.localizations"
]
},
{
"cell_type": "markdown",
"id": "431378d3",
"metadata": {},
"source": [
"…and can add more localizations from the translated versions of the Glossary:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6eca9796",
"metadata": {},
"outputs": [],
"source": [
"c1.name.localizations.update(fr=\"Voie\", ru=\"РЕЛЬСОВЫЙ ПУТЬ\")\n",
"c2.name.localizations.update(fr=\"Voie principale\", ru=\"ОСНОВНОЙ/ГЛАВНЫЙ ПУТЬ\")\n",
"c3.name.localizations.update(fr=\"Autres voies\", ru=\"ПРОЧИЕ ПУТИ\")"
]
},
{
"cell_type": "markdown",
"id": "f5ac9c50",
"metadata": {},
"source": [
"SDMX also supports arbitrary **annotations** on nearly every kind of artefact.\n",
"We can use these to store structured information like the “explanatory notes” that accompany some definitions but (per the Glossary introduction) are not part of the definitions:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "941f479a",
"metadata": {},
"outputs": [],
"source": [
"a1 = Annotation(\n",
" id=\"explanatory-note\",\n",
" text={\n",
" \"en\": \"In the context of the EU reporting the cumulative length of railway …\",\n",
" \"fr\": \"Dans le contexte des déclarations au niveau de l’Union européenne, la …\",\n",
" \"ru\": \"В рамках отчетности ЕС из совокупной протяженности железнодорожных …\",\n",
" },\n",
")\n",
"c1.annotations.append(a1)"
]
},
{
"cell_type": "markdown",
"id": "21308133",
"metadata": {},
"source": [
"## Create presentations *from* SDMX data structures\n",
"\n",
"Once the SDMX structure metadata, like our concept scheme, are created, they can immediately and easily be stored in SDMX file formats (based on XML), exchanged, and used by the wide variety of tools in the SDMX ecosystem.\n",
"For example, here is a collection of similar concept schemes tracked by the “SDMX Global Registry”: https://registry.sdmx.org/items/conceptscheme.html\n",
"\n",
"The SDMX objects can also be manipulated by other code.\n",
"This code, can for example, **generate any kind of output,** display, or presentation of the structure information.\n",
"Here we are working in Python, and there are other Python libraries that can be used to output:\n",
"- A word processing file (e.g. Word/OpenOffice) or other document (PDF).\n",
"- A presentation/slide show (e.g. PowerPoint).\n",
"- A static or interactive webpage.\n",
"\n",
"Here we use the example of an HTML table.\n",
"The details of the code are not important; scroll down for the output."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "8c0629c2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<th><tr><td>ID</td><td align=\"left\">Name, definition, explanation</td></tr></th>\n",
"<tr><td>A.I-01</td><td align=\"left\"><strong>Track</strong><br/>A pair of rails over which rail borne vehicles can run maintained…<br/><i>In the context of the EU reporting the cumulative length of railway …</i></td></tr>\n",
"<tr><td>A.I-01.1</td><td align=\"left\"><strong>Main/running track</strong><br/>A track providing end-to-end line continuity designed for running…<br/><i></i></td></tr>\n",
"<tr><td>A.I-01.2</td><td align=\"left\"><strong>Other track</strong><br/>All other tracks than main/running ones:\n",
" - tracks maintained, but not operated by the infrastructure manager.\n",
" <br/><i></i></td></tr>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table>\n",
"<th><tr><td>ID</td><td align=\"left\">Name, definition, explanation</td></tr></th>\n",
"<tr><td>A.I-01</td><td align=\"left\"><strong>Voie</strong><br/>A pair of rails over which rail borne vehicles can run maintained…<br/><i>Dans le contexte des déclarations au niveau de l’Union européenne, la …</i></td></tr>\n",
"<tr><td>A.I-01.1</td><td align=\"left\"><strong>Voie principale</strong><br/>A track providing end-to-end line continuity designed for running…<br/><i></i></td></tr>\n",
"<tr><td>A.I-01.2</td><td align=\"left\"><strong>Autres voies</strong><br/>All other tracks than main/running ones:\n",
" - tracks maintained, but not operated by the infrastructure manager.\n",
" <br/><i></i></td></tr>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table>\n",
"<th><tr><td>ID</td><td align=\"left\">Name, definition, explanation</td></tr></th>\n",
"<tr><td>A.I-01</td><td align=\"left\"><strong>РЕЛЬСОВЫЙ ПУТЬ</strong><br/>A pair of rails over which rail borne vehicles can run maintained…<br/><i>В рамках отчетности ЕС из совокупной протяженности железнодорожных …</i></td></tr>\n",
"<tr><td>A.I-01.1</td><td align=\"left\"><strong>ОСНОВНОЙ/ГЛАВНЫЙ ПУТЬ</strong><br/>A track providing end-to-end line continuity designed for running…<br/><i></i></td></tr>\n",
"<tr><td>A.I-01.2</td><td align=\"left\"><strong>ПРОЧИЕ ПУТИ</strong><br/>All other tracks than main/running ones:\n",
" - tracks maintained, but not operated by the infrastructure manager.\n",
" <br/><i></i></td></tr>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import display, HTML\n",
"\n",
"\n",
"def show_table(locale=\"en\"):\n",
" \"\"\"Show a table.\"\"\"\n",
" lines = [\n",
" \"<table>\",\n",
" \"<th><tr>\"\n",
" '<td>ID</td><td align=\"left\">Name, definition, explanation</td></tr></th>',\n",
" ]\n",
"\n",
" for concept in cs:\n",
" # Maybe get an explanatory note\n",
" try:\n",
" anno = concept.get_annotation(id=\"explanatory-note\").text.localized_default(\n",
" locale\n",
" )\n",
" except KeyError:\n",
" anno = \"\" # No such note\n",
" lines.append(\n",
" f\"<tr><td>{concept.id}</td>\"\n",
" '<td align=\"left\">'\n",
" f\"<strong>{concept.name[locale]}</strong><br/>\"\n",
" f\"{concept.description.localized_default(locale)}<br/>\"\n",
" f\"<i>{anno}</i>\"\n",
" \"</td></tr>\"\n",
" )\n",
"\n",
" lines.append(\"</table>\")\n",
"\n",
" display(HTML(\"\\n\".join(lines)))\n",
"\n",
"show_table(locale=\"en\")\n",
"show_table(locale=\"fr\")\n",
"show_table(locale=\"ru\")"
]
},
{
"cell_type": "markdown",
"id": "f25324c2",
"metadata": {},
"source": [
"(Note to keep the example short, we haven't added the descriptions in multiple languages.)"
]
},
{
"cell_type": "markdown",
"id": "978d7a1f",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"- As mentioned above, we can use this concept scheme to **structure** or **label data**.\n",
" Then data with the label e.g. \"A.I-02\" and a reference to this concept scheme are associated clearly with the correct definition.\n",
"- We can **define additional** concept schemes, code lists, etc.\n",
" For example, the “Symbols and abbreviations” at the end of the glossary can be stored as several code lists.\n",
"- We can use **version control tools** to keep track of changes to the scripts that create the SDMX structures.\n",
"\n",
"## Benefits\n",
"\n",
"A key benefit to this structure-first approach is the different nature of creating outputs like the HTML table above.\n",
"\n",
"In this approach, such code must be written **only once**; after that, it can be *reused* automatically and quickly to refresh the outputs.\n",
"This lowers the barriers to publishing corrections, updates, and incremental improvements, and helps keep the metadata products useful and current.\n",
"\n",
"The other practices, like using the structures directly to label data, help reduce errors and keep a transparent record of changes and updates.\n",
"These in turn make it easier to understand and use data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}