@david-andrew
Created February 20, 2024 20:56
example output from dojo-auto-annotations
Meta(path=PosixPath('datasets/mock_aqi.csv'), name='Air Quality Index', description='This dataset represents daily air quality observations collected from various monitoring stations across different cities worldwide. Each row corresponds to a single observation with details about the date, time, and location of the observation, along with specific air quality metrics and conditions.')
LLM identified column "year" as a DATE
LLM identified column "month" as a DATE
LLM identified column "day" as a DATE
LLM identified column "time" as a DATE
LLM identified column "lat" as a GEO
LLM identified column "lon" as a GEO
LLM identified column "country" as a GEO
LLM identified column "admin1" as a GEO
LLM identified column "admin2" as a GEO
LLM identified column "admin3" as a GEO
LLM identified column "AQI" as a FEATURE
LLM identified column "PM2.5" as a FEATURE
LLM identified column "CO_Level" as a FEATURE
LLM identified column "Is_Industrial" as a FEATURE
LLM identified column "Traffic_Density" as a FEATURE
LLM identified DATE column "year" as a YEAR
LLM identified DATE column "month" as a MONTH
LLM identified DATE column "day" as a DAY
LLM identified DATE column "time" as a DATE
LLM identified GEO column "lat" as a LATITUDE
LLM identified GEO column "lon" as a LONGITUDE
LLM identified GEO column "country" as a COUNTRY
LLM identified GEO column "admin1" as a STATE
LLM identified GEO column "admin2" as a COUNTY
LLM identified GEO column "admin3" as a COUNTY
LLM identified FEATURE column "AQI" as a INT
LLM identified FEATURE column "PM2.5" as a INT
LLM identified FEATURE column "CO_Level" as a INT
LLM identified FEATURE column "Is_Industrial" as a BINARY
LLM identified FEATURE column "Traffic_Density" as a STR
LLM identified no units for feature column "AQI"
LLM provided units and description for feature column "PM2.5": µg/m^3. The units "µg/m^3" represent micrograms per cubic meter, a measure of the concentration of a particulate matter (in this case, particles with diameters of 2.5 micrometers or smaller) in the air.
LLM was unsure about the units for feature column "CO_Level"
LLM identified no units for feature column "Is_Industrial"
LLM identified no units for feature column "Traffic_Density"
LLM identified coordinate pair: ('lon', 'lat')
LLM identified ('lon', 'lat') as the primary geo column(s)
LLM identified date group: ('day', 'year', 'month')
LLM identified ('day', 'year', 'month') as the primary date column(s)
LLM identified DATE/YEAR column "year" strftime format: "%Y"
LLM identified DATE/MONTH column "month" strftime format: "%m"
LLM identified DATE/DAY column "day" strftime format: "%d"
LLM identified DATE/DATE column "time" strftime format: "%H:%M"
LLM provided description for feature column "AQI": "The Air Quality Index (AQI) is a metric used to communicate how clean or polluted the air is on a daily basis. It quantifies the level of air pollution with numerical values; lower values indicate cleaner air, while higher values signify more polluted air. This index is based on the concentrations of several major pollutants, including particulate matter, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone. The AQI scale typically ranges from 0 to 500, where a value of 100 generally corresponds to the national air quality standard for the pollutant, with values above 100 indicating poor air quality and potentially harmful health effects for certain sensitive groups of people."
LLM provided description for feature column "PM2.5": "This feature represents the concentration of particulate matter with a diameter of 2.5 micrometers or less (PM2.5) present in the air. PM2.5 is a significant air pollutant due to its ability to penetrate deep into the lungs and bloodstream, potentially resulting in various health problems. The values are measured in micrograms per cubic meter (µg/m^3), indicating the amount of particulate matter per volume of air."
LLM provided description for feature column "CO_Level": "This column quantifies the level of carbon monoxide (CO) in the air, measured as an integer value that typically represents categories or concentrations defined by air quality standards. Higher numbers indicate greater concentrations of CO, a colorless, odorless gas that can be harmful to health at elevated levels."
LLM provided description for feature column "Is_Industrial": "Indicates whether the air quality measurement was taken in an industrial area or not. Values are true for industrial areas and false for non-industrial areas."
LLM provided description for feature column "Traffic_Density": "Indicates the level of vehicle congestion in the area, with classifications such as High, Medium, or Low, reflecting the intensity of traffic flow."
LLM provided description for date column "year": "The column represents the year in which the air quality measurements were recorded. Each entry denotes the specific year associated with the corresponding air quality data, formatted as a four-digit number."
LLM provided description for date column "month": "This column represents the month of the year when the air quality data was recorded, using numerical values where January is 1 and December is 12."
LLM provided description for date column "day": "Represents the day of the month on which air quality measurements were taken. The values are recorded as integers ranging from 1 to 31, corresponding to the days in a month."
LLM provided description for date column "time": "This column records the specific hour of the day when air quality measurements were taken, formatted in 24-hour time from 00:00 to 23:59."
LLM provided description for geo column "lat": "This dataset column contains the latitude coordinates for various locations, representing the north-south position of a point on the Earth's surface. The values are in decimal degrees, where positive values indicate latitudes north of the equator, and negative values indicate latitudes south of the equator."
LLM provided description for geo column "lon": "This dataset contains the longitude coordinates of various locations, measured in degrees. These coordinates indicate the east-west position on the Earth's surface, with values ranging from -180 degrees (west) to +180 degrees (east) of the Prime Meridian."
LLM provided description for geo column "country": "This dataset segment lists the nations where air quality measurements were taken, showcasing a range of locations that includes the United States, the United Kingdom, Japan, and France, among others."
LLM provided description for geo column "admin1": "This dataset column contains names of major administrative subdivisions within countries, such as states in the United States, counties in the United Kingdom, prefectures in Japan, and regions in France. These divisions reflect the geographical areas relevant for air quality management and reporting."
LLM provided description for geo column "admin2": "This column contains the names of cities or local administrative regions where air quality measurements were taken, indicating the specific urban or local area within broader national or administrative boundaries."
LLM provided description for geo column "admin3": "This dataset includes a geographical classification that details specific urban or administrative areas within cities around the world. These areas range from the borough level, such as in New York City, to districts within other global cities, covering different municipal or local governance zones. Each entry identifies a precise location within larger metropolitan areas, providing a granular view of air quality measurements across diverse urban environments."
geo=[GeoAnnotation(name='lat', display_name=None, description="This dataset column contains the latitude coordinates for various locations, representing the north-south position of a point on the Earth's surface. The values are in decimal degrees, where positive values indicate latitudes north of the equator, and negative values indicate latitudes south of the equator.", type=<ColumnType.GEO: 'geo'>, geo_type=<GeoType.LATITUDE: 'latitude'>, primary_geo=True, resolve_to_gadm=None, is_geo_pair=None, coord_format=None, qualifies=None, aliases={}, gadm_level=None), GeoAnnotation(name='lon', display_name=None, description="This dataset contains the longitude coordinates of various locations, measured in degrees. These coordinates indicate the east-west position on the Earth's surface, with values ranging from -180 degrees (west) to +180 degrees (east) of the Prime Meridian.", type=<ColumnType.GEO: 'geo'>, geo_type=<GeoType.LONGITUDE: 'longitude'>, primary_geo=True, resolve_to_gadm=None, is_geo_pair='lat', coord_format=None, qualifies=None, aliases={}, gadm_level=None), GeoAnnotation(name='country', display_name=None, description='This dataset segment lists the nations where air quality measurements were taken, showcasing a range of locations that includes the United States, the United Kingdom, Japan, and France, among others.', type=<ColumnType.GEO: 'geo'>, geo_type=<GeoType.COUNTRY: 'country'>, primary_geo=None, resolve_to_gadm=None, is_geo_pair=None, coord_format=None, qualifies=None, aliases={}, gadm_level=None), GeoAnnotation(name='admin1', display_name=None, description='This dataset column contains names of major administrative subdivisions within countries, such as states in the United States, counties in the United Kingdom, prefectures in Japan, and regions in France. 
These divisions reflect the geographical areas relevant for air quality management and reporting.', type=<ColumnType.GEO: 'geo'>, geo_type=<GeoType.STATE: 'state/territory'>, primary_geo=None, resolve_to_gadm=None, is_geo_pair=None, coord_format=None, qualifies=None, aliases={}, gadm_level=None), GeoAnnotation(name='admin2', display_name=None, description='This column contains the names of cities or local administrative regions where air quality measurements were taken, indicating the specific urban or local area within broader national or administrative boundaries.', type=<ColumnType.GEO: 'geo'>, geo_type=<GeoType.COUNTY: 'county/district'>, primary_geo=None, resolve_to_gadm=None, is_geo_pair=None, coord_format=None, qualifies=None, aliases={}, gadm_level=None), GeoAnnotation(name='admin3', display_name=None, description='This dataset includes a geographical classification that details specific urban or administrative areas within cities around the world. These areas range from the borough level, such as in New York City, to districts within other global cities, covering different municipal or local governance zones. Each entry identifies a precise location within larger metropolitan areas, providing a granular view of air quality measurements across diverse urban environments.', type=<ColumnType.GEO: 'geo'>, geo_type=<GeoType.COUNTY: 'county/district'>, primary_geo=None, resolve_to_gadm=None, is_geo_pair=None, coord_format=None, qualifies=None, aliases={}, gadm_level=None)] date=[DateAnnotation(name='year', display_name=None, description='The column represents the year in which the air quality measurements were recorded. 
Each entry denotes the specific year associated with the corresponding air quality data, formatted as a four-digit number.', type=<ColumnType.DATE: 'date'>, date_type=<DateType.YEAR: 'year'>, primary_date=True, time_format='%Y', associated_columns=None, qualifies=None, aliases={}), DateAnnotation(name='month', display_name=None, description='This column represents the month of the year when the air quality data was recorded, using numerical values where January is 1 and December is 12.', type=<ColumnType.DATE: 'date'>, date_type=<DateType.MONTH: 'month'>, primary_date=True, time_format='%m', associated_columns=None, qualifies=None, aliases={}), DateAnnotation(name='day', display_name=None, description='Represents the day of the month on which air quality measurements were taken. The values are recorded as integers ranging from 1 to 31, corresponding to the days in a month.', type=<ColumnType.DATE: 'date'>, date_type=<DateType.DAY: 'day'>, primary_date=True, time_format='%d', associated_columns={<TimeField.DAY: 'Day'>: 'day', <TimeField.YEAR: 'Year'>: 'year', <TimeField.MONTH: 'Month'>: 'month'}, qualifies=None, aliases={}), DateAnnotation(name='time', display_name=None, description='This column records the specific hour of the day when air quality measurements were taken, formatted in 24-hour time from 00:00 to 23:59.', type=<ColumnType.DATE: 'date'>, date_type=<DateType.DATE: 'date'>, primary_date=None, time_format='%H:%M', associated_columns=None, qualifies=None, aliases={})] feature=[FeatureAnnotation(name='AQI', display_name=None, description='The Air Quality Index (AQI) is a metric used to communicate how clean or polluted the air is on a daily basis. It quantifies the level of air pollution with numerical values; lower values indicate cleaner air, while higher values signify more polluted air. This index is based on the concentrations of several major pollutants, including particulate matter, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone. 
The AQI scale typically ranges from 0 to 500, where a value of 100 generally corresponds to the national air quality standard for the pollutant, with values above 100 indicating poor air quality and potentially harmful health effects for certain sensitive groups of people.', type=<ColumnType.FEATURE: 'feature'>, feature_type=<FeatureType.INT: 'int'>, units='N/A', units_description='N/A', qualifies=None, qualifierrole=None, aliases={}), FeatureAnnotation(name='PM2.5', display_name=None, description='This feature represents the concentration of particulate matter with a diameter of 2.5 micrometers or less (PM2.5) present in the air. PM2.5 is a significant air pollutant due to its ability to penetrate deep into the lungs and bloodstream, potentially resulting in various health problems. The values are measured in micrograms per cubic meter (µg/m^3), indicating the amount of particulate matter per volume of air.', type=<ColumnType.FEATURE: 'feature'>, feature_type=<FeatureType.INT: 'int'>, units='µg/m^3', units_description='The units "µg/m^3" represent micrograms per cubic meter, a measure of the concentration of a particulate matter (in this case, particles with diameters of 2.5 micrometers or smaller) in the air.', qualifies=None, qualifierrole=None, aliases={}), FeatureAnnotation(name='CO_Level', display_name=None, description='This column quantifies the level of carbon monoxide (CO) in the air, measured as an integer value that typically represents categories or concentrations defined by air quality standards. Higher numbers indicate greater concentrations of CO, a colorless, odorless gas that can be harmful to health at elevated levels.', type=<ColumnType.FEATURE: 'feature'>, feature_type=<FeatureType.INT: 'int'>, units=None, units_description=None, qualifies=None, qualifierrole=None, aliases={}), FeatureAnnotation(name='Is_Industrial', display_name=None, description='Indicates whether the air quality measurement was taken in an industrial area or not. 
Values are true for industrial areas and false for non-industrial areas.', type=<ColumnType.FEATURE: 'feature'>, feature_type=<FeatureType.BINARY: 'binary'>, units='N/A', units_description='N/A', qualifies=None, qualifierrole=None, aliases={}), FeatureAnnotation(name='Traffic_Density', display_name=None, description='Indicates the level of vehicle congestion in the area, with classifications such as High, Medium, or Low, reflecting the intensity of traffic flow.', type=<ColumnType.FEATURE: 'feature'>, feature_type=<FeatureType.STR: 'str'>, units='N/A', units_description='N/A', qualifies=None, qualifierrole=None, aliases={})]
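
The strftime formats reported in the log ("%Y", "%m", "%d", "%H:%M") could be used to rebuild a single timestamp from the separate date columns. The following is a minimal sketch using only the Python standard library, not code from dojo-auto-annotations itself. The `parse_row` helper and the sample row values are hypothetical, and joining the "time" column onto the day/year/month group is an assumption (the log lists only those three as the primary date group).

```python
from datetime import datetime

# strftime formats reported in the log output for each date column
FORMATS = {"year": "%Y", "month": "%m", "day": "%d", "time": "%H:%M"}


def parse_row(row: dict[str, str]) -> datetime:
    """Combine the separate date-part columns of one row into a datetime.

    Hypothetical helper: joins the column values and their formats with
    spaces, then parses the result in a single strptime call.
    """
    columns = ["year", "month", "day", "time"]
    text = " ".join(row[c] for c in columns)
    fmt = " ".join(FORMATS[c] for c in columns)
    return datetime.strptime(text, fmt)


# Hypothetical row; the values are invented for illustration
row = {"year": "2024", "month": "02", "day": "20", "time": "20:56"}
parsed = parse_row(row)
print(parsed.isoformat())
```

Parsing all parts in one call keeps the column-to-format mapping in one place, so a change to any reported format only needs to be made in `FORMATS`.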