@maxidl
Created October 14, 2025 14:05
properties.md

Detailed Property Descriptions & Annotation Guidelines

Core Content Properties

1. Content Integrity

What we're measuring: Completeness and technical quality of the content itself, regardless of navigation ratio.

Values & Criteria:

complete - Full, intact content as intended

  • Content appears complete with proper beginning, middle, and end
  • All essential elements present (introduction, body, conclusion where appropriate)
  • No obvious truncation or missing sections
  • Example: Complete articles, full tutorials, intact documents

mostly_complete - Minor elements missing but core content intact

  • Core content is complete but some secondary elements may be missing
  • Minor truncation that doesn't affect main message
  • Example: Article with truncated comments, missing sidebar content, partial author bio

fragment - Incomplete content, missing significant portions

  • Missing introduction, conclusion, or substantial middle sections
  • Truncated mid-sentence or mid-paragraph
  • Content feels incomplete or cut off
  • Example: Search result snippets, article excerpts, broken crawls, partial downloads

severely_degraded - Broken, unreadable, or corrupted content

  • Encoding errors, scrambled text, missing characters
  • Severely malformed HTML rendering as gibberish
  • Technical corruption making content unreadable
  • Example: �&$^%*@# characters, completely broken formatting, corrupted files

Key Decision Points:

  • Content completeness: Does the content feel like a complete unit of information?
  • Technical integrity: Is the content technically readable and properly formatted?
  • Fragment vs. complete: Judge the content itself, independent of surrounding navigation. Is the actual content complete?
  • Degraded vs. fragment: severely_degraded content has technical corruption; a fragment is readable but incomplete

2. Content Ratio

What we're measuring: How much of the document is actual content vs. navigation, UI elements, and structural markup.

Values & Criteria:

complete_content - 90-100% meaningful content

  • Full articles, papers, tutorials with minimal navigation
  • Clean text with proper paragraphs and structure
  • Example: A Wikipedia article, academic paper, complete blog post

mostly_content - 70-89% meaningful content

  • Complete documents with some navigation elements (header, footer, sidebar)
  • Minor UI elements that don't disrupt reading
  • Example: News articles with standard website navigation

mixed_content - 40-69% meaningful content

  • Significant navigation mixed throughout content
  • Multiple sidebars, ads, or UI elements interrupting text
  • Example: E-commerce product pages with reviews mixed with purchase options

mostly_navigation - 10-39% meaningful content

  • Predominantly menus, links, headers, footers
  • Content overwhelmed by structural elements
  • Example: Site maps, navigation pages, heavily UI-focused pages

minimal_content - 0-9% meaningful content

  • Almost entirely navigation, UI elements, or structural markup
  • Very little readable content present
  • Example: Empty pages, pure navigation menus, error pages with minimal text

Key Decision Points:

  • Focus on the ratio of readable text to navigation/UI elements
  • Count only substantive content, ignore boilerplate and structural elements
  • mixed_content vs. mostly_navigation: Can you read it as coherent content despite distractions?
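
The percentage bands above can be sketched as a small lookup. This is only an illustration, not part of the guidelines: the helper name `content_ratio_label` and its character-count inputs are assumptions, and deciding which characters count as meaningful content is left to the annotator or an upstream extractor.

```python
def content_ratio_label(content_chars: int, total_chars: int) -> str:
    """Map a content/total ratio to a Content Ratio label (hypothetical helper)."""
    if total_chars <= 0:
        # Assumption: an empty document is treated as minimal_content.
        return "minimal_content"
    pct = 100.0 * content_chars / total_chars
    if pct >= 90:
        return "complete_content"
    if pct >= 70:
        return "mostly_content"
    if pct >= 40:
        return "mixed_content"
    if pct >= 10:
        return "mostly_navigation"
    return "minimal_content"
```

A call such as `content_ratio_label(700, 1000)` would fall in the 70-89% band. Borderline cases should still be settled by the decision points above rather than by the raw number.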

3. Content Length

What we're measuring: Amount of substantive content, ignoring navigation and boilerplate.

Values & Criteria:

substantial - 2,000+ words of meaningful content

  • Long-form, comprehensive content that provides in-depth coverage of a topic
  • Typically includes detailed analysis, multiple sections or chapters, extensive research, or thorough exploration of complex subjects
  • Examples: White papers, research reports, e-books, long-form journalism

moderate - 500–1,999 words of meaningful content

  • Standard-length content that offers meaningful coverage while remaining focused and digestible
  • Balances depth with accessibility; provides enough detail to be informative without overwhelming readers
  • Examples: Typical blog posts, news articles, product reviews, how-to guides

brief - 100–499 words of meaningful content

  • Short, focused content that delivers key information quickly and efficiently
  • Gets straight to the point while still providing value and context
  • Examples: News briefs, product descriptions, FAQs, short blog posts

minimal - Under 100 words of meaningful content

  • Very short content that provides only essential information or serves as a quick reference
  • Designed for rapid consumption or specific micro-purposes
  • Examples: Social media posts, announcements, abstracts, snippets, navigation pages

Measurement Tips:

  • Count only readable content of value: include article body and substantive headings/captions; exclude headers/footers, menus/sidebars, related links, share/consent UI, pagination, ads, and boilerplate.
  • Focus on substantive information, not filler words
  • Complete thoughts matter more than exact word counts
  • Contextual adjustment: Thresholds are guidelines and can be adjusted based on specific use cases and typical content. Academic contexts may shift ranges upward, while social media contexts may shift them downward.
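
As a rough sketch, the default thresholds can be written as a lookup on a word count. The function name `content_length_label` is hypothetical, the count is assumed to already exclude navigation and boilerplate, and a count of exactly 500 or 2,000 words is assigned to the higher band here, which is an assumption.

```python
def content_length_label(word_count: int) -> str:
    """Map a count of meaningful words to a Content Length label (hypothetical helper)."""
    if word_count >= 2000:
        return "substantial"
    if word_count >= 500:
        return "moderate"
    if word_count >= 100:
        return "brief"
    return "minimal"
```

Because the thresholds are guidelines, a pipeline for academic or social media content might pass different cut-offs instead of hard-coding these defaults.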

Content Classification

4. Content Type

What we're measuring: The functional structure and purpose of content.

Multi-type content: Content can be assigned multiple type labels if it genuinely serves multiple purposes. Choose ALL applicable types rather than forcing a single primary choice. Always output an array for this property, even if only one type applies.

Values & Criteria:

analytical - In-depth analysis, research, and critical examination

  • Provides detailed analysis or research on a topic
  • Develops arguments, evaluates evidence, or presents findings
  • Example: Research analysis, investigative reports, academic articles, expert commentary

instructional - Teaching and how-to content

  • Explicitly teaches skills, concepts, or procedures
  • Step-by-step guidance or educational explanations
  • Example: Tutorials, how-to guides, educational content, training materials

reference - Lookup materials, definitions, specifications

  • Designed for looking up specific information rather than reading through
  • Often organized alphabetically, categorically, or as lists
  • Example: Dictionaries, encyclopedias, API references, product catalogs

procedural - Step-by-step processes and procedures

  • Sequential instructions or workflows
  • Process documentation with clear steps
  • Example: Recipes, installation guides, standard operating procedures, workflows

qa_structured - Structured question-answer content

  • Formal Q&A format with clear questions and answers
  • Often expert responses to specific questions
  • Example: Stack Overflow, FAQ sections, structured Q&A sites

conversational - Multi-party or turn-based dialogues (humans, bots, or both)

  • Casual or structured conversations between two or more participants
  • May include human–AI chats, forum threads, or comment chains
  • Example: Reddit threads, forum discussions, support chats, assistant chat logs

creative - Entertainment, artistic, fictional content

  • Primary purpose is entertainment or artistic expression
  • Not primarily informational or instructional
  • Example: Short stories, poems, movie reviews, game content, fiction

transactional - Commercial, shopping, service-oriented

  • Primary purpose is to facilitate a transaction or service
  • Focuses on products, services, or business processes
  • Example: Product listings, service descriptions, checkout pages

boilerplate - Legal, policy, standard template text

  • Standard legal or policy language
  • Often repeated across multiple sites with minimal variation
  • Example: Terms of service, privacy policies, disclaimers, cookie banners, standard notices

news_report - Straight reporting of events with minimal analysis

  • Describes events or facts in a neutral, descriptive tone
  • Time-bound news, updates, or reports
  • Example: Wire-service news articles, breaking-news updates

opinion_editorial - Persuasive/opinionated commentary or editorials

  • Expresses a stance or argument; aims to persuade
  • May cite evidence but prioritizes viewpoint
  • Example: Op-eds, opinion columns, personal essays with clear stance

review_critique - Evaluative reviews of products, media, or services

  • Provides judgments, ratings, or critiques
  • May include pros/cons, scoring systems
  • Example: Product reviews, film/book critiques, app store reviews (long-form)

technical_documentation - Manuals, API docs, developer guides, READMEs

  • Primary goal is to instruct usage of software/hardware/APIs
  • Includes reference sections, examples, parameters, version notes
  • Example: API reference, library README, user manual

specification_standard - Normative standards and formal specifications

  • Defines requirements, must/shall language, compliance criteria
  • Maintained by standards bodies or authoritative groups
  • Example: RFCs, ISO standards, formal protocol specs

legal_document - Statutes, case law, contracts, regulatory texts

  • Binding or authoritative legal content
  • Formal legal language and structure
  • Example: Court opinions, legislation, contracts, regulatory rules

press_release - Organization-issued announcements and PR materials

  • Promotional announcements framed as information
  • Quotes from executives, product/service announcements
  • Example: Company press releases, launch announcements

structured_data - Tables, datasets, indices, catalogs with minimal prose

  • Predominantly tabular/listed data meant for lookup
  • Minimal narrative or explanatory text
  • Example: Product catalogs, schedules, statistical tables

source_code - Code listings as primary content

  • Dominant content is program source code or scripts
  • May include lightweight comments or snippets without narrative
  • Example: Code files, gist-like pages, competitive programming solutions

Multi-Type Examples:

  • Tutorial that analyzes different approaches → ["instructional", "analytical"]
  • Educational reference manual → ["instructional", "reference"]
  • Research paper with step-by-step methodology → ["analytical", "procedural"]
  • Q&A site with analytical responses → ["qa_structured", "analytical"]
  • API guide with examples → ["technical_documentation", "reference", "instructional"]
  • RFC with rationale → ["specification_standard", "analytical"]
  • Film review with interview snippets → ["review_critique", "conversational"]
  • Helpdesk chat with an AI → ["conversational", "transactional"]
  • Breaking news explainer → ["news_report", "analytical"]

5. Business Sector

What we're measuring: Business sector(s) or industry domain(s) for training sector-specific LLMs.

Multi-sector content: Content can be assigned multiple sector labels if it genuinely spans multiple industries. Choose ALL applicable sectors rather than forcing a single primary choice or using "other". Always output an array for this property, even if only one sector applies.

Values & Criteria:

academic_research - Scholarly and research content

  • Peer-reviewed publications, academic papers
  • University-affiliated research and scholarship
  • Formal academic discourse and methodology
  • Example: Journal articles, conference papers, academic books, dissertations

education_sector - Educational institutions and pedagogy

  • K-12 education, higher education administration
  • Educational technology, curriculum development
  • Teaching methodologies and educational policy
  • Example: School curricula, educational policy papers, teaching resources, edtech content

technology_software - Software and information technology

  • Software development, programming, IT services
  • Digital products, platforms, and technology companies
  • Computer science and software engineering
  • Example: Software documentation, tech company content, programming guides, IT industry analysis

hardware_electronics - Hardware devices and electronics industry

  • Semiconductors, consumer electronics, embedded systems, hardware design
  • Electronics manufacturing and supply chains
  • Example: Chip design docs, hardware datasheets, device manuals

healthcare_medical - Healthcare and medical sector

  • Medical research, clinical practice, healthcare delivery
  • Hospitals, medical devices, healthcare policy
  • Public health and wellness
  • Example: Medical journals, clinical guidelines, healthcare administration, wellness content

pharmaceutical_biotech - Pharmaceutical and biotechnology

  • Drug development, clinical trials, biotech research
  • Pharmaceutical industry, biotechnology companies
  • Life sciences and molecular biology applications
  • Example: Drug research papers, clinical trial reports, biotech industry analysis

financial_services - Banking and financial services

  • Banking, investment, insurance, financial planning
  • Financial markets, fintech, payment systems
  • Asset management and financial advisory
  • Example: Financial analysis, banking documentation, investment guides

legal_services - Legal sector and jurisprudence

  • Law firms, legal practice, court systems
  • Legal education, regulatory compliance
  • Litigation, contracts, legal advisory
  • Example: Legal briefs, court opinions, legal analysis, compliance guides

government_public - Government and public administration

  • Government agencies, public policy, civic services
  • Regulatory bodies, public administration
  • Political institutions and governance
  • Example: Government reports, policy documents, regulatory filings, civic information

manufacturing_industrial - Manufacturing and heavy industry

  • Industrial production, manufacturing processes
  • Supply chain, logistics, industrial equipment
  • Factory operations and industrial engineering
  • Example: Manufacturing specs, industrial reports, supply chain analysis, production guides

mining_resources - Mining and natural resources

  • Exploration, extraction, and processing of minerals and resources
  • Resource markets and operations (metals, rare earths)
  • Example: Mining reports, resource exploration docs, commodity operations

chemicals_materials - Chemicals and advanced materials

  • Petrochemicals, specialty chemicals, polymers, composites, advanced materials
  • Safety data sheets (SDS), process chemistry, materials science
  • Example: Material datasheets, REACH documentation, chemical process guides

energy_utilities - Energy and utilities sector

  • Power generation, renewable energy, oil and gas
  • Electric utilities, water services, waste management
  • Energy infrastructure and grid management
  • Example: Energy industry reports, utility regulations, renewable energy research

retail_commerce - Retail and e-commerce

  • Retail operations, e-commerce platforms
  • Consumer goods distribution, merchandising
  • Retail technology and customer experience
  • Example: Retail industry analysis, e-commerce guides, merchandising strategies

wholesale_distribution - Wholesale trade and distribution

  • B2B wholesale, distributors, procurement, inventory and fulfillment
  • Supply relationships between manufacturers and retailers
  • Example: Distributor catalogs, wholesale operations, procurement guides

real_estate_construction - Real estate and construction

  • Property development, construction industry
  • Real estate markets, property management
  • Architecture and building services
  • Example: Real estate analysis, construction specifications, property guides

transportation_logistics - Transportation and logistics

  • Airlines, shipping, freight, public transit
  • Logistics operations, supply chain transportation
  • Vehicle fleet management, transportation infrastructure
  • Example: Logistics guides, transportation planning, shipping documentation

travel_aviation - Travel industry and commercial aviation

  • Airlines, airports, OTA platforms, hospitality travel operations
  • Route planning, airline commercial, loyalty, IATA regulations
  • Example: Airline scheduling, fare rules, OTA partner docs

automotive_industry - Automotive manufacturing and services

  • Vehicle manufacturers, automotive suppliers
  • Automotive technology, electric vehicles
  • Dealerships and automotive services
  • Example: Automotive engineering docs, vehicle technology papers, industry analysis

telecommunications - Telecommunications industry

  • Telecom operators, network infrastructure
  • Mobile services, broadband, satellite communications
  • Telecommunications equipment and technology
  • Example: Telecom industry reports, network specifications, 5G technology papers

media_entertainment - Media and entertainment industry

  • Film, television, music, gaming industries
  • Publishing, news media, content creation
  • Streaming services and digital media
  • Example: Entertainment industry analysis, media studies, content strategy

gaming_industry - Video games and interactive entertainment

  • Game development, studios, engines, esports, live ops
  • Monetization models, community management, platform ecosystems
  • Example: Patch notes, game design docs, esports operations

advertising_marketing - Advertising, marketing, and PR

  • Brand strategy, campaign planning, performance marketing, martech
  • Agencies, in-house marketing, PR communications
  • Example: Campaign briefs, media plans, PR strategies

hospitality_tourism - Hospitality and tourism sector

  • Hotels, restaurants, travel services
  • Tourism industry, destination management
  • Event planning and hospitality services
  • Example: Tourism studies, hospitality management, travel industry reports

food_beverage_hospitality - Food & beverage and restaurant operations

  • Restaurant ops, menu engineering, supply chain, QSR/fast casual
  • Food safety, compliance, procurement for F&B
  • Example: Restaurant training manuals, HACCP docs, vendor specs

agriculture_food - Agriculture and food production

  • Farming, agricultural technology, food processing
  • Agricultural supply chain, food safety
  • Agribusiness and agricultural policy
  • Example: Agricultural research, food industry reports, farming guides

environmental_services - Environmental and sustainability services

  • Environmental consulting, ESG reporting, sustainability programs
  • Waste management services, remediation, impact assessments
  • Example: ESG reports, environmental impact assessments, sustainability frameworks

aerospace_defense - Aerospace and defense industry

  • Aircraft manufacturing, space technology
  • Defense contractors, military systems
  • Aviation and space exploration
  • Example: Aerospace engineering papers, defense industry analysis, aviation guides

insurance_industry - Insurance sector

  • Life, health, property, and casualty insurance
  • Reinsurance, actuarial science, risk assessment
  • Insurance technology and underwriting
  • Example: Actuarial studies, insurance policy analysis, risk management guides

nonprofit_ngo - Nonprofit and NGO sector

  • Charitable organizations, international development
  • Social services, humanitarian organizations
  • Foundations and philanthropic institutions
  • Example: NGO reports, nonprofit management, development studies

consulting_professional - Professional services and consulting

  • Management consulting, accounting firms
  • Business advisory, professional services firms
  • Corporate strategy and business transformation
  • Example: Consulting reports, professional services guides, business strategy papers

human_resources - HR and people operations

  • Talent acquisition, compensation & benefits, performance management, L&D
  • HR tech, workforce planning, organizational development
  • Example: HR policy docs, job frameworks, talent strategy

security_cyber - Security and cybersecurity

  • Information security, threat intelligence, risk management, compliance (e.g., SOC2)
  • Physical security operations and incident response
  • Example: Security guidelines, incident playbooks, vulnerability reports

consumer_goods - Consumer products and CPG

  • Fast-moving consumer goods, household products
  • Personal care, food and beverage brands
  • Consumer product development and marketing
  • Example: CPG industry analysis, product development docs, consumer research

general_interest - General audience content

  • Content for broad audiences without sector focus
  • General knowledge and miscellaneous topics
  • Cross-sector or sector-agnostic content
  • Example: General magazines, broad interest content, lifestyle articles

other - Highly specialized or unclassifiable

  • Highly specialized niches not covered by existing sectors
  • Content with genuinely unclear sector classification
  • Unique content types that don't map to any defined sector
  • Example: Highly specialized technical niches, unique content formats

Multi-Sector Examples:

  • Medical device regulations → healthcare_medical + pharmaceutical_biotech + government_public
  • Fintech software documentation → financial_services + technology_software
  • Agricultural biotechnology research → agriculture_food + pharmaceutical_biotech

6. Technical Content

What we're measuring: Type and intensity of specialized technical knowledge.

Multi-technical content: Content can be assigned multiple technical content labels if it genuinely combines multiple technical domains. Choose ALL applicable technical types rather than forcing a single primary choice. Always output an array for this property, even if only one technical type applies.

Values & Criteria:

code_heavy - Significant programming content

  • Multiple code examples, algorithms, or implementations
  • Technical programming concepts and methodologies
  • Software development focus
  • Example: Programming tutorials, API documentation, software guides

math_heavy - Substantial mathematical content

  • Mathematical equations, proofs, or statistical analysis
  • Quantitative analysis and mathematical reasoning
  • Mathematical concepts and methodologies
  • Example: Mathematical papers, statistical analysis, quantitative research

scientific - Research and scientific methodology content

  • Scientific research findings, experimental data
  • Scientific methodology and analysis
  • Peer-reviewed research content
  • Example: Research papers, scientific studies, experimental reports

data_heavy - Substantial datasets, tables, and data analysis

  • Contains significant data tables, charts, or datasets
  • Focus on data interpretation and analysis
  • Statistical content with data presentations
  • Example: Research data, statistical reports, data analysis, survey results

engineering - Engineering and applied technical content

  • Engineering design, systems, and applied technical solutions
  • Technical specifications for physical systems
  • Non-software engineering disciplines
  • Example: Mechanical engineering, civil engineering, technical specifications, design documents

basic_technical - Some technical elements but not dominant

  • Light technical content mixed with general explanations
  • Technical concepts explained for general audience
  • Example: Technology articles for general audience, basic technical explanations

non_technical - No significant technical content

  • General audience content without specialized technical knowledge
  • No programming, mathematical, engineering, or scientific focus
  • Example: General articles, humanities content, basic informational content

Multi-Technical Examples:

  • Data science tutorial with code examples → ["code_heavy", "math_heavy", "data_heavy"]
  • Engineering research with statistical analysis → ["engineering", "scientific", "data_heavy"]
  • Computational biology paper → ["code_heavy", "scientific"]
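
The "always output an array" rule for this property could be enforced with a small normalizer. This is a minimal sketch: the function name `normalize_technical_content` is hypothetical, and wrapping a bare string and dropping duplicate labels are assumptions rather than rules stated in the guidelines.

```python
# Allowed values, copied from the Technical Content list above.
TECHNICAL_CONTENT_VALUES = {
    "code_heavy", "math_heavy", "scientific", "data_heavy",
    "engineering", "basic_technical", "non_technical",
}


def normalize_technical_content(labels) -> list[str]:
    """Return labels as a de-duplicated list, raising on unknown values."""
    if isinstance(labels, str):
        labels = [labels]  # wrap a bare string so the output is always an array
    result: list[str] = []
    for label in labels:
        if label not in TECHNICAL_CONTENT_VALUES:
            raise ValueError(f"unknown technical content label: {label!r}")
        if label not in result:  # keep first occurrence, preserve order
            result.append(label)
    return result
```

The same pattern would apply to the other multi-label properties (Content Type and Business Sector) with their own value sets.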

Quality and Value Assessment

7. Content Quality

What we're measuring: Overall quality of content considering writing excellence, substantive value, and presentation quality regardless of authorship origin.

Values & Criteria:

excellent - Outstanding quality across all dimensions

  • Sophisticated writing with varied sentence structures and engaging style
  • Rich, appropriate vocabulary with error-free grammar and punctuation
  • High substantive value with clear insights or information
  • Professional presentation and formatting
  • Natural flow and logical organization
  • Example: High-quality publications, expert analyses, polished educational content, well-crafted professional documents

good - High quality with minor imperfections

  • Grammatically correct with proper sentence structure
  • Appropriate vocabulary and tone for content type
  • Solid substantive value and clear information
  • Good organization and readable flow
  • Only occasional minor issues (1-2 typos per section)
  • Example: Quality journalism, professional websites, well-written blog posts, solid educational materials

adequate - Acceptable quality for most purposes

  • Generally clear and understandable writing
  • Some grammatical errors but meaning remains clear
  • Reasonable substantive value though may lack depth
  • Basic organization and structure present
  • Minor formatting or presentation issues
  • Example: Casual blogs, user reviews, basic informational content, simple guides

poor - Significant quality issues impacting utility

  • Multiple errors affecting comprehension or credibility
  • Unclear expression, confusing organization, or awkward phrasing
  • Limited substantive value or questionable information
  • Major formatting problems or unprofessional presentation
  • Difficult to extract reliable information
  • Example: Low-quality web content, poorly edited materials, confusing instructions

unacceptable - Quality too low for productive use

  • Severely impaired communication with major errors
  • Incoherent, nonsensical, or corrupted content
  • No reliable substantive value
  • Broken formatting or technical corruption
  • Cannot determine intended meaning or extract useful information
  • Example: Corrupted text, severe translation errors, spam content, SEO content, completely broken formatting

Quality Assessment Guidelines:

  • Comprehension: Can the intended message be clearly understood?
  • Substantive value: Does the content provide useful information or insights?
  • Technical presentation: Is the content properly formatted and readable?
  • Error impact: Do errors significantly impede understanding or credibility?
  • Professional standards: Does the content meet basic standards for its intended purpose?

Language-Specific Quality Indicators:

  • For non-Latin scripts (Arabic, Chinese, Japanese): Check for proper character encoding
  • For agglutinative languages (Turkish, Finnish): Adjust expectations for word count/density
  • For languages with different formality levels (Japanese, Korean): Assess appropriate register
  • Mixed-language documents: Evaluate code-switching quality and appropriateness

8. Information Density

What we're measuring: Ratio of valuable information to redundancy, padding, and repetition.

Values & Criteria:

dense - Efficient, information-packed content

  • Every sentence adds new information or insight
  • Minimal redundancy or unnecessary elaboration
  • Little to no repetition of the same concepts
  • Example: Technical specifications, concise academic writing, quality reference material

adequate - Good information content with reasonable elaboration

  • Most content adds value with some acceptable elaboration
  • Minimal repetition within the document
  • Good balance of information and explanation
  • Example: Well-written articles, good tutorials with examples

moderate - Mixed substantive content with noticeable padding

  • Some valuable information mixed with unnecessary elaboration
  • Noticeable repetition of key points for emphasis
  • Some sections feel padded or verbose
  • Example: Blog posts with some fluff, articles with repetitive conclusions

thin - Low information content with significant problems

  • Much content doesn't add new information
  • High internal repetition and excessive redundancy
  • Significant padding to reach desired length
  • Example: SEO-optimized content, poorly edited writing

empty - Dominated by repetition and meaningless content

  • Minimal actual information value
  • Dominated by repetition and copy-paste artifacts
  • Same ideas repeated multiple times without development
  • Example: Spam content, template-filled pages, keyword-stuffed articles

Common Repetition Patterns to Watch For:

  • Same phrases repeated throughout (especially in SEO content)
  • Identical paragraphs or sections (copy-paste errors)
  • Circular reasoning (saying the same thing in different ways)
  • Template artifacts (repeated boilerplate mixed with content)
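
Of the patterns above, identical paragraphs are the easiest to check mechanically. The sketch below is one possible approach, not a prescribed method: the function name `duplicate_paragraph_fraction` is hypothetical, paragraphs are assumed to be separated by blank lines, and the guidelines give no threshold for mapping the result to a density label.

```python
from collections import Counter


def duplicate_paragraph_fraction(text: str) -> float:
    """Fraction of non-empty paragraphs that exactly repeat an earlier paragraph."""
    # Assumption: paragraphs are separated by a blank line; comparison ignores case.
    paragraphs = [p.strip().lower() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    counts = Counter(paragraphs)
    repeats = sum(n - 1 for n in counts.values())  # every copy beyond the first
    return repeats / len(paragraphs)
```

A high fraction points toward thin or empty, but repeated phrases and circular reasoning still need a human read.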

9. Educational Value

What we're measuring: Potential for teaching, learning, and knowledge transfer.

Values & Criteria:

high - Clear instructional design and learning objectives

  • Explicitly teaches concepts or skills
  • Progressive skill building from basic to advanced
  • Clear learning objectives and outcomes
  • Comprehensive explanations with examples
  • Example: Quality tutorials, textbooks, structured courses, educational guides

moderate - Good instructional value with some learning potential

  • Some instructional elements present
  • Explanations help build understanding
  • Transferable knowledge to other contexts
  • Good examples or illustrations
  • Example: How-to articles, explanatory content, informative guides

basic - Limited educational content

  • Some explanations but not systematically instructional
  • Basic explanations of concepts
  • Limited learning potential or skill building
  • Example: Basic explanations, simple informational content

minimal - Little educational value

  • Primarily informational rather than instructional
  • No clear learning objectives or skill building
  • Entertainment or commercial focus
  • Example: Entertainment content, basic news, commercial content

none - No educational content

  • No instructional value or learning potential
  • Purely transactional, entertainment, or administrative
  • No knowledge transfer potential
  • Example: Pure entertainment, transactions, legal boilerplate

Disambiguation tips

  • Explanatory vs. Educational: explanations alone ≠ educational design; require intent to teach plus scaffolding for Basic or higher
  • Reference docs: typically Minimal; promote to Basic/Moderate when guided “how-to” segments or curated examples exist
  • Reviews/op-eds: None/Minimal unless they include actionable how-to guidance designed for learning

Automation heuristics

  • Keywords: Objectives/Outcomes, Lesson, Exercise/Quiz, Homework, Assessment, Syllabus, Module, Unit, Learning Goals
  • Structure: numbered steps + prerequisites/requirements → Basic; add practice tasks/solutions → Moderate; syllabus/modules/assessments → High
  • Non-educational signals: heavy CTAs/ads or product pitches → cap at Minimal unless there is clear instructional scaffolding
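
The keyword heuristic could be sketched as a case-insensitive whole-word scan. The names `EDU_KEYWORDS` and `find_edu_keywords` are hypothetical, slash-separated entries from the list are split into separate keywords, and singular/plural variants are not handled; a keyword hit is only a signal, not a label.

```python
import re

# Keywords taken from the heuristics list above, lowercased.
EDU_KEYWORDS = [
    "objectives", "outcomes", "lesson", "exercise", "quiz", "homework",
    "assessment", "syllabus", "module", "unit", "learning goals",
]


def find_edu_keywords(text: str) -> list[str]:
    """Return the keywords that occur as whole words or phrases in text."""
    lowered = text.lower()
    return [
        kw for kw in EDU_KEYWORDS
        # \b keeps "unit" from matching inside words such as "community".
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    ]
```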

Quick decision tree

  • Are there explicit learning goals or a syllabus? → High
  • Else, are there step-by-step instructions with examples/exercises? → Moderate
  • Else, are there explanatory sections intended to teach basics? → Basic
  • Else, is there any minor instructional element? → Minimal
  • Otherwise → None
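
The decision tree above translates directly into an ordered chain of checks. In this sketch the `EduSignals` fields and the function name `educational_value` are hypothetical. How each signal is detected (by an annotator or by a heuristic) is outside the example.

```python
from dataclasses import dataclass


@dataclass
class EduSignals:
    """Yes/no answers to the decision-tree questions (hypothetical field names)."""
    has_learning_goals_or_syllabus: bool = False
    has_steps_with_examples_or_exercises: bool = False
    has_explanatory_teaching_sections: bool = False
    has_minor_instructional_element: bool = False


def educational_value(signals: EduSignals) -> str:
    """Walk the questions in order and return the first matching label."""
    if signals.has_learning_goals_or_syllabus:
        return "high"
    if signals.has_steps_with_examples_or_exercises:
        return "moderate"
    if signals.has_explanatory_teaching_sections:
        return "basic"
    if signals.has_minor_instructional_element:
        return "minimal"
    return "none"
```

Because the checks run top to bottom, a document with both a syllabus and step-by-step exercises resolves to high, matching the order of the questions.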

Borderline examples

  • API reference with examples but no guidance → Minimal to Basic (depending on clarity/examples)
  • Blog post explaining concept with analogies and one example → Basic
  • Tutorial with tasks, checkpoints, and solutions → High
  • Product documentation with “Getting Started” and “How-To” flows → Moderate

Educational Indicators:

  • Learning objectives: Clear goals for what reader should learn
  • Skill progression: Builds from basic to advanced concepts
  • Examples and practice: Provides concrete examples or exercises
  • Knowledge transfer: Concepts applicable beyond immediate context

10. Reasoning Indicators

What we're measuring: Presence and quality of logical reasoning, analysis, and explanatory content.

Values & Criteria:

analytical - Complex reasoning and systematic analysis

  • Multi-step arguments with logical progression
  • Cause-effect analysis and systematic thinking
  • Considers multiple perspectives or variables
  • Draws conclusions from evidence and reasoning
  • Example: Research analysis, complex problem-solving, systematic evaluations

explanatory - Clear explanations with logical flow

  • Explains how or why things work
  • Shows cause-effect relationships clearly
  • Educational reasoning that builds understanding
  • Logical connections between concepts
  • Example: Good tutorials, educational content, how-to explanations

basic_reasoning - Simple logical connections

  • Some logical connections between ideas
  • Basic explanations of concepts or processes
  • Elementary analytical thinking
  • Simple cause-effect relationships
  • Example: Basic explanations, simple arguments, elementary analysis

minimal - Limited reasoning, mostly descriptive

  • Primarily describes what rather than why or how
  • Few logical connections between ideas
  • Mostly factual statements without analysis
  • Little explanatory content
  • Example: Basic descriptions, simple factual content, minimal analysis

none - No clear reasoning present

  • Purely descriptive content
  • Simple factual listing without connections
  • Narrative content without analysis
  • No logical argumentation or explanation
  • Example: Simple lists, basic narratives, pure description

Thinking-trace signals (what to look for)

  • Stepwise structure: numbered steps in proofs/derivations/solutions; “First… therefore… hence… so…”
  • Hypothesis and test: assumptions, intermediate results, counterexamples, sanity checks
  • Tool- or method-calls: named algorithms, theorems, lemmas, or procedures invoked and justified
  • Error analysis or reflection: “we tried X, failed because Y, so we…”, “limitations,” “edge cases”
  • Intermediate artifacts: scratch calculations, partial code reasoning, sub-problems and sub-claims

Disambiguation rules

  • Explanatory vs Analytical: explanations tell how; analytical shows multi-step inference with evidence and intermediate claims
  • Worked example vs Mere answer: worked examples expose steps and justification; mere answers without steps are not reasoning-rich
  • Procedural vs Reasoning: procedural lists actions; reasoning links actions via logic, evidence, or constraints

Automation heuristics

  • Lexical cues: because, therefore, thus, hence, suppose/assume, we conclude, by induction, lemma/theorem/proof, O(n), hypothesis, counterexample
  • Structure cues: presence of proof blocks, derivations (e.g., “Proof.”, “QED”, TeX environments), multi-step numeric calculations
  • Program reasoning: code comments like “// invariant”, “// complexity”, pre/post-conditions, test reasoning
  • Thresholding: count reasoning cues per 1k tokens; with ≥2 structural cues or ≥5 lexical cues → at least explanatory; proofs/derivations → analytical
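
The thresholding heuristic above could look roughly like the following sketch. Several choices here are assumptions, not part of the guideline: tokens are approximated by whitespace splitting, the thresholds are read as per-1k-token rates, documents shorter than 1k tokens are scored on raw counts so that a single cue does not inflate the rate, and the regular expressions cover only a subset of the listed cues. The function name `reasoning_floor` is hypothetical; it returns a lower bound for the label, not a final label.

```python
import re

# Lexical cues from the list above (subset; "O(n)" is omitted for simplicity).
LEXICAL_CUES = [
    "because", "therefore", "thus", "hence", "suppose", "assume",
    "we conclude", "by induction", "lemma", "theorem", "proof",
    "hypothesis", "counterexample",
]
LEXICAL_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in LEXICAL_CUES) + r")\b",
    re.IGNORECASE,
)
# Proof markers: "Proof.", "QED", or a TeX proof environment.
PROOF_RE = re.compile(r"\bProof\.|\bQED\b|\\begin\{proof\}")
# Other structural cues: TeX display-math environments (assumed proxy for derivations).
STRUCT_RE = re.compile(r"\\begin\{(?:align|equation|eqnarray)\*?\}")


def reasoning_floor(text: str):
    """Return 'analytical', 'explanatory', or None (no floor implied by the cues)."""
    n_tokens = len(text.split())
    # Scale counts to a per-1k-token rate; short texts keep their raw counts.
    scale = 1000 / max(n_tokens, 1000)
    proofs = len(PROOF_RE.findall(text))
    structural = (proofs + len(STRUCT_RE.findall(text))) * scale
    lexical = len(LEXICAL_RE.findall(text)) * scale
    if proofs > 0:
        return "analytical"
    if structural >= 2 or lexical >= 5:
        return "explanatory"
    return None
```

A `None` result only means the cues do not force a minimum label; the document may still be rated by the decision tree.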

Quick decision tree

  • Is there a proof/derivation or multi-step argument with intermediate claims? → analytical
  • Else, does it explain why/how with cause-effect and logical links? → explanatory
  • Else, are there simple logical connections or one-step justifications? → basic_reasoning
  • Else, does it mostly describe without connecting ideas? → minimal if a few logical connections remain; none if the content is purely descriptive

Borderline examples

  • Answer-only solutions (final numeric result without steps) → minimal
  • Step-by-step math solution with intermediate equations → analytical
  • “How it works” article connecting 2–3 causal steps without data → explanatory
  • Troubleshooting log with attempts and justifications → analytical if causal chain is explicit; otherwise explanatory

Key Reasoning Patterns to Identify:

  • Cause-effect: "Because X, therefore Y"
  • Problem-solution: Identifies problems and proposes solutions
  • Comparison: Analyzes similarities and differences
  • Logical progression: Ideas build on previous ideas
  • Evidence-based conclusions: Draws conclusions from presented evidence

Audience and Purpose

11. Audience Level

What we're measuring: Intended sophistication level and background knowledge assumptions of the target audience.

Values & Criteria:

expert - Highly specialized professional/academic content

  • Assumes deep domain expertise and advanced training
  • Uses technical terminology without explanation
  • Content for practitioners actively working in specialized fields
  • Example: Climate modeling methodology in Nature Climate Change, research papers, technical specifications, expert-to-expert communications

advanced - Educated adult audience with analytical skills

  • Assumes higher education and critical thinking ability
  • Explains specialized concepts but uses sophisticated language
  • Intellectually challenging but accessible to educated generalists
  • Example: Complex climate change analysis in The Atlantic, quality journalism, policy analysis, advanced general interest content

general - General adult audience

  • Accessible to most educated adults without specialized background
  • Explains technical concepts when introduced
  • Uses clear language while maintaining intellectual substance
  • Example: Quality journalism, general interest articles, accessible explanations of complex topics

beginner - Introductory level with minimal prerequisites

  • Explains basic concepts and terminology
  • Builds up from fundamental principles
  • Assumes minimal prior knowledge of the subject area
  • Example: Introductory tutorials, beginner guides, basic explanations, getting-started content

youth - Targeted at teenagers and young adults (ages 13-19)

  • Age-appropriate complexity with contemporary cultural references
  • Sophisticated enough for developing critical thinking but accessible
  • May address topics relevant to adolescent experiences and concerns
  • Example: High school educational content, young adult literature, teen-focused explanations, college prep materials

children - Designed specifically for children

  • Simple language and concepts appropriate for young readers
  • Educational content designed for elementary/middle school levels
  • Age-appropriate topics and complexity
  • Example: Children's educational content, elementary school materials, simple explanations for young learners

Assessment Guidelines:

  • Professional context: Is this content designed for workplace use vs. general learning?
  • Terminology density: How much specialized vocabulary is used without explanation?
  • Concept complexity: How sophisticated are the ideas and their development?
  • Background assumptions: What education level and domain knowledge does the author assume?

Cross-Linguistic Considerations:

  • Expert terminology density varies by language (German allows more compound terms)
  • Formality markers differ across cultures
  • Educational level assumptions vary by country's education system
  • Age-appropriate content differs across cultures

12. Commercial Bias

What we're measuring: How much commercial interests influence the objectivity and informational value of content.

Values & Criteria:

none - No commercial influence detected

  • Objective, informational presentation
  • No promotional language or commercial agenda
  • Focus purely on informing or educating
  • Example: Academic papers, objective journalism, educational content

minimal - Slight commercial context but maintains objectivity

  • May mention products/services but in informational context
  • Maintains balanced, objective tone
  • Commercial mentions serve informational purpose
  • Example: Product reviews with balanced analysis, informational articles mentioning relevant products

moderate - Some commercial influence on content

  • Mix of informational and promotional content
  • Some promotional language but still provides useful information
  • Commercial interests somewhat visible but not dominant
  • Example: Company blogs with useful information, sponsored content with actual value

heavy - Strong commercial bias throughout

  • Primarily promotional with some informational elements
  • Heavy use of marketing language and persuasive techniques
  • Clear commercial agenda affects content objectivity
  • Example: Marketing articles disguised as information, heavily biased product comparisons

pure_marketing - Entirely commercial/promotional content

  • No genuine informational value beyond promotion
  • Pure marketing copy or advertising material
  • Designed solely to drive sales or conversions
  • Example: Sales pages, pure advertising copy, promotional brochures

Key Indicators:

  • Language tone: Objective vs. promotional language
  • Primary purpose: Inform vs. persuade/sell
  • Balance: Are alternatives/drawbacks mentioned?
  • Call-to-action: Subtle information vs. obvious sales pitch

13. Time-Sensitivity

What we're measuring: How time-sensitive the content is - whether its value degrades over time or remains stable.

Values & Criteria:

evergreen - Content remains valuable indefinitely

  • Fundamental concepts, principles, theories
  • Historical information and established facts
  • Skills and techniques that don't change
  • Reference materials with lasting value
  • Example: Mathematical proofs, language grammar guides, classical literature analysis, basic cooking techniques

slowly_changing - Content remains valuable for years

  • Best practices that evolve slowly
  • Technical content that updates every few years
  • Cultural and social topics with gradual change
  • Example: Programming language tutorials, academic textbooks, industry standards, educational curricula

regularly_updating - Content valuable for months to a year

  • Industry trends and market analysis
  • Technology reviews and comparisons
  • Policy discussions and current research
  • Example: Software framework guides, business strategies, product reviews, research summaries

time_sensitive - Content value degrades quickly

  • News and current events
  • Time-bound information (prices, schedules, availability)
  • Temporary situations or short-term trends
  • Real-time data and statistics
  • Example: Stock prices, weather reports, breaking news, event announcements, sales/promotions

Key Decision Points:

  • Core question: If someone reads this in 2 years, will it still be valuable?
  • Update frequency: How often does this type of information typically change?
  • Temporal references: Does the content heavily reference "now," "recently," "currently"?
  • Subject matter stability: Is this about unchanging principles or evolving situations?
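
The "temporal references" check above can be approximated by counting the three example words. This is a rough sketch under stated assumptions: only "now", "recently", and "currently" are matched (the words quoted in the guideline), the function name is hypothetical, and what counts as "heavily" is left to the annotator or a downstream threshold.

```python
import re

# Only the three words quoted in the guideline; matched as whole words, case-insensitively.
TEMPORAL_RE = re.compile(r"\b(?:now|recently|currently)\b", re.IGNORECASE)


def count_temporal_references(text: str) -> int:
    """Count occurrences of the temporal marker words in the text."""
    return len(TEMPORAL_RE.findall(text))
```

A high count relative to document length is a hint toward time_sensitive or regularly_updating, not a decision on its own.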

Safety and Compliance

14. Content Safety

What we're measuring: Presence of inappropriate, harmful, or legally problematic content.

Values & Criteria:

safe - Appropriate for all contexts

  • No concerning content of any type
  • Professional, appropriate language throughout
  • Suitable for general audiences including workplace settings

mild_concerns - Minor issues that don't constitute major problems

  • Occasional mild profanity in context
  • Brief mentions of sensitive topics handled appropriately
  • Minor concerns that don't affect overall suitability
  • Example: Historical discussions of sensitive topics, professional content with mild language

nsfw - Not safe for work or general audiences

  • Explicit sexual content or graphic descriptions
  • Adult themes requiring content warnings
  • Graphic violence or disturbing imagery descriptions
  • Example: Adult content, graphic medical descriptions, explicit violence

harmful - Potentially harmful content requiring careful handling

  • Content promoting dangerous activities or self-harm
  • Hate speech targeting individuals or groups
  • Violent content glorifying harm to others
  • Example: Self-harm content, hate speech, dangerous "how-to" guides

illegal - Illegal content requiring immediate rejection

  • Content promoting clearly illegal activities
  • Material that violates laws in major jurisdictions
  • Example: Terrorist content, child exploitation

Safety Assessment Guidelines:

  • Context matters: Medical/educational discussions of sensitive topics may be appropriate
  • Intent matters: Discussing harmful topics for educational purposes vs. promoting them
  • Audience consideration: Content appropriate for experts may not be safe for general audiences

15. PII Presence

What we're measuring: Whether the content contains personally identifiable information that could identify private individuals.

Values & Criteria:

no_pii - No personal information detected

  • No names of private individuals
  • No contact information (emails, phones, addresses)
  • No identification numbers
  • Public figures and officials mentioned by name are acceptable
  • Example: News articles about politicians, technical documentation, general information

contains_pii - Contains potentially identifiable information

  • Names of private individuals (non-public figures)
  • Email addresses, phone numbers, physical addresses
  • ID numbers (SSN, passport, driver's license, employee IDs)
  • Medical information about identifiable individuals
  • Financial account information
  • Example: Personal blogs with full names, leaked databases, medical case studies with identifying info

Key Decision Points:

  • Public vs. Private figures: Politicians, celebrities, CEOs = public (no PII flag); private citizens = PII
  • Context matters: Academic paper authors and their institutional emails = typically no PII; personal emails in forums = PII
  • Aggregated vs. Individual: Statistical data = no PII; individual records = PII
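
Some of the contact and identifier patterns listed above can be surfaced with simple regular expressions. The sketch below is an assumption-laden illustration: the email and US-style SSN patterns are simplified, the function name `pii_candidates` is hypothetical, and names of private individuals cannot be detected this way. A hit is only a candidate; the public/private and context rules above still decide the final label (for example, institutional emails of paper authors are typically not flagged).

```python
import re

# Simplified patterns (assumptions): they will miss some valid forms and match some invalid ones.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def pii_candidates(text: str) -> dict:
    """Return the matched strings per pattern kind, omitting kinds with no match."""
    hits = {}
    for kind, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[kind] = found
    return hits
```

An empty result does not prove no_pii, since names, addresses, and medical details need other signals.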

Geographic Relevance

16. Regional Relevance

What we're measuring: Primary regional, cultural, or geopolitical sphere(s) that the content relates to, regardless of language used.

Multi-regional content: Content can be assigned multiple regional labels if it genuinely spans multiple regions. Choose ALL applicable regions rather than forcing a single primary choice. Always output an array for this property, even if only one region applies.

Values & Criteria:

european - European context (EU and broader Europe)

  • Content about European countries, EU policies, or pan-European topics
  • European cultural perspectives, social systems, or business practices
  • References to European cities, institutions, companies, or regulations
  • Includes: EU member states, UK, Switzerland, Norway, Balkans, etc.
  • Example: GDPR compliance, European Parliament elections, Schengen area travel, European football leagues

north_american - North American context

  • Content about US, Canada, or Mexico
  • North American cultural perspectives, USMCA/NAFTA region topics
  • References to North American institutions, companies, or issues
  • Example: FDA regulations, Silicon Valley tech, NHL, US constitutional law, Canadian healthcare

east_asian - East Asian context

  • Content about China, Japan, Korea (North/South), Taiwan, Mongolia
  • East Asian cultural perspectives, Confucian-influenced societies
  • References to East Asian economic models, companies, or social systems
  • Example: Gaokao exams, K-pop, Shenzhen tech hub, Japanese work culture, Taiwan semiconductor industry

south_asian - South Asian context

  • Content about India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, Afghanistan, Maldives
  • South Asian cultural perspectives, subcontinental issues
  • References to South Asian institutions, economies, or social structures
  • Example: IIT entrance exams, Bollywood, cricket leagues, monsoon impacts, caste system discussions

southeast_asian - Southeast Asian context

  • Content about ASEAN countries (Indonesia, Thailand, Vietnam, Philippines, Malaysia, Singapore, etc.)
  • Southeast Asian regional perspectives and economic integration
  • References to ASEAN policies, regional companies, or cultural phenomena
  • Example: ASEAN economic community, Indonesian elections, Singapore financial sector, Thai tourism

middle_eastern - Middle Eastern and North African context

  • Content about Arab states, Iran, Turkey, Israel, North Africa (MENA region)
  • Middle Eastern cultural perspectives, Islamic finance, regional conflicts
  • References to Middle Eastern institutions, oil economies, or geopolitics
  • Example: Gulf Cooperation Council, OPEC decisions, Middle East peace process, Islamic banking

sub_saharan_african - Sub-Saharan African context

  • Content about African countries south of the Sahara
  • African Union topics, sub-Saharan development issues
  • References to African institutions, economies, or cultural topics
  • Example: M-Pesa mobile banking, African Union policies, safari tourism, ubuntu philosophy

latin_american - Latin American context

  • Content about Central and South America, Caribbean
  • Latin American cultural perspectives, regional integration (Mercosur, etc.)
  • References to Latin American institutions, economies, or social movements
  • Example: Mercosur trade, telenovelas, Amazon rainforest, Latin American revolutions

oceanian - Oceanian context

  • Content about Australia, New Zealand, Pacific Island nations
  • Oceanian perspectives, Pacific regional issues
  • References to Oceanian institutions, companies, or cultural topics
  • Example: ANZAC relations, Pacific Island climate change, Australian mining, Māori culture

central_asian - Central Asian context

  • Content about Kazakhstan, Uzbekistan, Turkmenistan, Tajikistan, Kyrgyzstan
  • Central Asian perspectives, post-Soviet regional dynamics
  • Silk Road region, resource economies, nomadic heritage
  • Example: Silk Road initiatives, Caspian Sea resources, post-Soviet transitions

russian_sphere - Russian/Post-Soviet context

  • Content about Russia, Belarus, and strong Russian influence areas
  • Post-Soviet perspectives, CIS (Commonwealth of Independent States) topics
  • Russian language content about regional (not global) topics
  • Example: Russian federal politics, CIS integration, post-Soviet economic transitions

global - Genuinely international or universal

  • Content with truly global scope or application
  • International organizations, worldwide phenomena, global comparisons
  • Topics that transcend regional boundaries
  • Example: UN reports, climate change (global perspective), international standards, pandemic response

culturally_neutral - No clear regional focus

  • Abstract, theoretical, or technical content without regional markers
  • Universal scientific, mathematical, or philosophical content
  • Content that could apply equally anywhere without modification
  • Example: Mathematical proofs, chemical formulas, abstract philosophy, programming concepts

indeterminate - Cannot determine regional relevance

  • Insufficient content to identify regional focus
  • Mixed or contradictory regional signals
  • Fragment or corrupted content lacking regional context
  • Example: Technical specifications without context, isolated data tables

Multi-Regional Examples:

  • EU-China trade relations → ["european", "east_asian"]
  • NAFTA/USMCA impact on Mexican agriculture → ["north_american", "latin_american"]
  • Indian diaspora in the Gulf states → ["south_asian", "middle_eastern"]
  • Comparative study of healthcare systems globally → ["global"]
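
Since this property must always be an array, a small normalization step can enforce that shape and reject labels outside the value list above. This is a minimal sketch; the function name `normalize_regions` and the choice to raise on unknown labels are assumptions.

```python
# The 14 region labels defined in this section.
REGION_LABELS = {
    "european", "north_american", "east_asian", "south_asian",
    "southeast_asian", "middle_eastern", "sub_saharan_african",
    "latin_american", "oceanian", "central_asian", "russian_sphere",
    "global", "culturally_neutral", "indeterminate",
}


def normalize_regions(value) -> list:
    """Wrap a single label in a list, drop duplicates, and validate every label."""
    labels = [value] if isinstance(value, str) else list(value)
    result = []
    for label in labels:
        if label not in REGION_LABELS:
            raise ValueError(f"unknown region label: {label!r}")
        if label not in result:  # keep first occurrence, preserve order
            result.append(label)
    return result
```

For example, `normalize_regions("european")` yields a one-element list, so downstream code never has to handle a bare string.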

Regional Identification Guidelines:

Primary indicators:

  • Geographic references: Countries, cities, regions, landmarks mentioned
  • Institutional references: Governments, companies, universities, organizations specific to region
  • Cultural markers: Holidays, customs, cultural phenomena, sports, entertainment
  • Political/economic systems: References to regional political structures, economic blocs
  • Legal/regulatory frameworks: Region-specific laws, regulations, standards
  • Language context: While not determinative, language can provide regional hints

Important distinctions:

  • Language ≠ Region: Spanish content about East Asian markets = ["east_asian"], not ["latin_american"]
  • Company origin vs. topic: Apple (US company) operating in India = consider actual content focus
  • Historical vs. current: Historical content about ancient Rome = ["european"] if discussing modern implications
  • Diaspora content: Content about diaspora communities should include both origin and current regions

Quality checks:

  • If content is in a non-English language but discusses global topics → still mark as ["global"]
  • If content compares multiple regions → mark all regions discussed substantially
  • If content is about a specific place but has universal applications → consider both regional and global tags

17. Country Relevance

What we're measuring: Which specific country or countries (if any), anywhere in the world, the content is relevant to.

Note: Always output an array of country names for this property (even when only a single country applies). Use standard country names from any region worldwide (e.g., "germany", "france", "united_states", "china", "japan", "brazil", "india", "south_africa", "australia", "canada", etc.). The array may also contain the special values supranational or non_country_specific when appropriate.

Values & Criteria:

[COUNTRY_NAME] - Content specifically relevant to a single country

  • Content explicitly about that country's politics, culture, institutions, or regulations
  • Content written from that country's cultural perspective
  • Content addressing that country's specific issues, regulations, or cultural phenomena
  • Content about that country's cities, companies, institutions, or country-specific topics
  • Example: For "germany" → German election coverage, Bundesliga content, German legal analysis
  • Example: For "united_states" → US election coverage, NFL content, US legal analysis
  • Example: For "japan" → Japanese politics, J-League content, Japanese cultural analysis

Country Identification Criteria:

  • Political content: Elections, government policies, political parties, political figures specific to the country
  • Cultural content: National traditions, cultural phenomena, historical events specific to the country
  • Institutional references: Government bodies, national companies, universities specific to the country
  • Geographic focus: Cities, regions, landmarks within the country as primary subjects
  • Legal/regulatory: Laws, regulations, legal frameworks specific to the country
  • Economic content: National economic policies, country-specific market analysis
  • Sports/media: National sports leagues, national teams, country-specific media outlets
  • Social issues: Social policies, demographic topics, social movements specific to the country

Special Values:

supranational - Content focused on supranational entities or regions

  • International organizations, regional blocs, global institutions
  • Content about supranational policies, international organizations, global governance
  • Pan-regional analysis that transcends individual countries
  • Multi-continental or global institutional content
  • Example: UN resolutions, NATO discussions, EU policy analysis, ASEAN agreements, WTO trade rules

non_country_specific - No specific country relevance

  • Abstract, theoretical, or universal content without geographic specificity
  • Technical/scientific content that applies globally without country focus
  • Content that doesn't reference specific countries or national contexts
  • Example: Mathematical proofs, universal scientific principles, abstract philosophical discussions
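
The country labels in this section follow a lowercase, underscore-separated form ("united_states", "south_africa"). A sketch of normalizing free-form names into that form is shown below. The function name and the handling of spaces and hyphens are assumptions inferred from the examples; the sketch does not check that a name is a real country.

```python
import re

# Special values defined in this section; they pass through unchanged.
SPECIAL_VALUES = {"supranational", "non_country_specific"}


def normalize_countries(value) -> list:
    """Return an array of lowercase, underscore-separated labels without duplicates."""
    names = [value] if isinstance(value, str) else list(value)
    result = []
    for name in names:
        # Lowercase, trim, and collapse runs of spaces or hyphens into one underscore.
        label = re.sub(r"[\s\-]+", "_", name.strip().lower())
        if label and label not in result:
            result.append(label)
    return result
```

Usage: `normalize_countries(["United States", "supranational"])` returns an array containing the country label alongside the special value, as the note on array output requires.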

@harshraj172

This seems to be a comprehensive list. I would be very interested in how you segregate the data for each type. For MixtureVitae data, we used some fasttext classifiers, but to cover each category on this list, one needs a bunch of efficient categorization methods - clustering might be effective

@maxidl
Author

maxidl commented Oct 14, 2025

the goal is to use a small LLM with structured output. Input is prompt(document), output is a json object, e.g.:

{
    "content_integrity": "complete",
    "content_ratio": "mostly_content",
    "content_length": "substantial",
    "content_type": [
        "instructional",
        "reference",
        "procedural"
    ],
    "business_sector": [
        "technology_software"
    ],
    "technical_content": [
        "code_heavy",
        "data_heavy",
        "engineering"
    ],
    "information_density": "dense",
    "content_quality": "excellent",
    "audience_level": "expert",
    "commercial_bias": "none",
    "time_sensitivity": "slowly_changing",
    "content_safety": "safe",
    "educational_value": "high",
    "reasoning_indicators": "explanatory",
    "pii_presence": "no_pii",
    "regional_relevance": [
        "european"
    ],
    "country_relevance": [
        "italy",
        "supranational"
    ]
}

It is much more expensive to run than fasttext classifiers, but we don't necessarily want to run this on raw dumps, but rather on selected languages in fineweb-2, hplt, finepdfs, or the top 20% of documents according to level-1 filters like fineweb-edu.

@harshraj172

Ah, okay. And you're going to do it for nemotron?

@maxidl
Author

maxidl commented Oct 14, 2025

maybe not for all of nemotron but for the HQ subset it is probably a good idea to include

@harshraj172

Makes sense
