| Kedro layer | Comment |
|---|---|
| raw | In this situation three data sources are described: an Excel file, a multi-part CSV export from a database, and a single CSV export from a personnel management system. |
| intermediate | The intermediate layer is a typed mirror of the raw layer, with one minor transformation applied to the equipment extract: the multi-part data received has been concatenated into a single Parquet dataset. |
| primary | Two domain-level datasets have been constructed from the intermediate layer, which model equipment shutdowns and operator actions. |
| feature | Several features have been constructed from the primary layer which represent variables we think may be predictors of equipment shutdowns, such as the maintenance schedule and recent shutdowns. |
| model_input | Two model inputs have been created since we are experimenting with two modelling approaches: one time-series based, and another equipment-centric without a temporal element. |
| models | The trained models constructed have been serialised. |
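The feature-layer step above can be illustrated with a minimal sketch. This is not the project's actual code: the function name, the tuple-based input shape, and the equipment ids are all illustrative assumptions, showing only the idea of deriving a "days since last shutdown" feature from a primary-layer shutdowns dataset.

```python
from datetime import date

def days_since_last_shutdown(shutdowns, as_of):
    """Per equipment id, days between the most recent shutdown and `as_of`.

    `shutdowns` is a list of (equipment_id, shutdown_date) tuples, standing
    in for a primary-layer table of shutdown events.
    """
    last = {}
    for eq, d in shutdowns:
        # Keep only the most recent shutdown date per equipment id.
        if eq not in last or d > last[eq]:
            last[eq] = d
    return {eq: (as_of - d).days for eq, d in last.items()}

# Illustrative data: equipment 1 shut down twice, equipment 2 once.
events = [
    (1, date(2024, 1, 1)),
    (1, date(2024, 3, 1)),
    (2, date(2024, 2, 15)),
]
print(days_since_last_shutdown(events, date(2024, 3, 11)))  # → {1: 10, 2: 25}
```

In a real Kedro project this logic would live in a node whose input is the primary-layer dataset and whose output is registered in the feature layer of the catalog.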
```shell
# Generate the commit message using llm
current_branch=$(git rev-parse --abbrev-ref HEAD)
git_diff=$(git diff "$current_branch")
commit_message=$(echo "$git_diff" | llm prompt --model gpt-4o-mini "
- Generate a conventional commit message based on the provided git diff.
- Start with one of the following prefixes: 'build', 'chore', 'ci', 'docs', 'feat', 'fix', 'perf', 'refactor', 'revert', 'style', 'test'.
- Summarize the changes at a high level without listing every code modification.
- Use concise bullet points to describe key changes (up to 5 bullets).
- Skip detailed descriptions for cosmetic changes by ruff.
")
```
""" | |
This module provides custom Kedro dataset | |
""" | |
import hashlib | |
import json | |
import logging | |
from pathlib import Path | |
from typing import Any, Dict, Optional, Union | |
from urllib.parse import urlparse |
```python
from kedro.pipeline import Pipeline, node

def create_template_pipeline() -> Pipeline:
    """Template declared here with real inputs, but placeholder outputs and parameters."""
    return Pipeline(
        [
            node(
                func=create_model_inputs,
                inputs=[  # These inputs are never overridden
                    "feat_days_since_last_shutdown",
                    "feat_days_between_shutdown_last_maintenance",
                    "feat_fte_maintenance_hours_last_6m",
```
| Layer | Order | Description |
|---|---|---|
| raw | Sequential | Initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it's safer to never work with the original data directly! |
| intermediate | Sequential | This stage is optional if your data is already typed. Typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet. Avoid transforming the structure of the data, but simple operations like cleaning up field names or 'unioning' multi-part CSVs are permitted. |
| primary | Sequential | |
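The two operations the intermediate layer permits (coercing string values to types, and unioning multi-part CSVs) can be sketched in plain Python. The file contents and the two-column schema below are illustrative assumptions, not part of the convention; a real project would typically do this with pandas and write the result to Parquet.

```python
import csv
import io

# Stand-ins for a multi-part raw CSV export; in practice these would
# be files on disk registered in the raw layer of the catalog.
RAW_PARTS = [
    "equipment_id,runtime_hours\n1,120.5\n2,98.0\n",
    "equipment_id,runtime_hours\n3,310.25\n",
]

def union_and_type(parts):
    """Union multi-part CSV extracts and coerce string values to types."""
    rows = []
    for part in parts:
        for rec in csv.DictReader(io.StringIO(part)):
            rows.append({
                "equipment_id": int(rec["equipment_id"]),      # string -> int
                "runtime_hours": float(rec["runtime_hours"]),  # string -> float
            })
    return rows

print(union_and_type(RAW_PARTS))
```

Note that the structure of the data is untouched: the output has the same columns as the raw extracts, which is exactly the "typed mirror" the table above describes.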