What we're measuring: Completeness and technical quality of the content itself, regardless of navigation ratio.
complete - Full, intact content as intended
- Content appears complete with proper beginning, middle, and end
- All essential elements present (introduction, body, conclusion where appropriate)
- No obvious truncation or missing sections
- Example: Complete articles, full tutorials, intact documents
mostly_complete - Minor elements missing but core content intact
- Core content is complete but some secondary elements may be missing
- Minor truncation that doesn't affect main message
- Example: Article with truncated comments, missing sidebar content, partial author bio
fragment - Incomplete content, missing significant portions
- Missing introduction, conclusion, or substantial middle sections
- Truncated mid-sentence or mid-paragraph
- Content feels incomplete or cut off
- Example: Search result snippets, article excerpts, broken crawls, partial downloads
severely_degraded - Broken, unreadable, or corrupted content
- Encoding errors, scrambled text, missing characters
- Severely malformed HTML rendering as gibberish
- Technical corruption making content unreadable
- Example: �&$^%*@# characters, completely broken formatting, corrupted files
- Content completeness: Does the content feel like a complete unit of information?
- Technical integrity: Is the content technically readable and properly formatted?
- Fragment vs. complete: Independent of navigation - is the actual content complete?
- Degraded vs. fragment: Degraded has technical issues; fragment is just incomplete
What we're measuring: How much of the document is actual content vs. navigation, UI elements, and structural markup.
complete_content - 90-100% meaningful content
- Full articles, papers, tutorials with minimal navigation
- Clean text with proper paragraphs and structure
- Example: A Wikipedia article, academic paper, complete blog post
mostly_content - 70-89% meaningful content
- Complete documents with some navigation elements (header, footer, sidebar)
- Minor UI elements that don't disrupt reading
- Example: News articles with standard website navigation
mixed_content - 40-69% meaningful content
- Significant navigation mixed throughout content
- Multiple sidebars, ads, or UI elements interrupting text
- Example: E-commerce product pages with reviews mixed with purchase options
mostly_navigation - 10-39% meaningful content
- Predominantly menus, links, headers, footers
- Content overwhelmed by structural elements
- Example: Site maps, navigation pages, heavily UI-focused pages
minimal_content - 0-9% meaningful content
- Almost entirely navigation, UI elements, or structural markup
- Very little readable content present
- Example: Empty pages, pure navigation menus, error pages with minimal text
- Focus on the ratio of readable text to navigation/UI elements
- Count only substantive content, ignore boilerplate and structural elements
- Mixed vs. mostly_navigation: Can you read it as coherent content despite distractions?
What we're measuring: Amount of substantive content, ignoring navigation and boilerplate.
substantial - 2,000+ words of meaningful content
- Long-form, comprehensive content that provides in-depth coverage of a topic
- Typically includes detailed analysis, multiple sections or chapters, extensive research, or thorough exploration of complex subjects
- Examples: White papers, research reports, e-books, long-form journalism
moderate - 500–2,000 words of meaningful content
- Standard-length content that offers meaningful coverage while remaining focused and digestible
- Balances depth with accessibility; provides enough detail to be informative without overwhelming readers
- Examples: Typical blog posts, news articles, product reviews, how-to guides
brief - 100–500 words of meaningful content
- Short, focused content that delivers key information quickly and efficiently
- Gets straight to the point while still providing value and context
- Examples: News briefs, product descriptions, FAQs, short blog posts
minimal - Under 100 words of meaningful content
- Very short content that provides only essential information or serves as a quick reference
- Designed for rapid consumption or specific micro-purposes
- Examples: Social media posts, announcements, abstracts, snippets, navigation pages
- Count only readable content of value: include article body and substantive headings/captions; exclude headers/footers, menus/sidebars, related links, share/consent UI, pagination, ads, and boilerplate.
- Focus on substantive information, not filler words
- Complete thoughts matter more than exact word counts
- Contextual adjustment: Thresholds are guidelines and can be adjusted based on specific use cases and typical content. Academic contexts may shift ranges upward, while social media contexts may shift them downward.
What we're measuring: The functional structure and purpose of content.
Multi-type content: Content can be assigned multiple type labels if it genuinely serves multiple purposes. Choose ALL applicable types rather than forcing a single primary choice. Always output an array for this property, even if only one type applies.
analytical - In-depth analysis, research, and critical examination
- Provides detailed analysis or research on a topic
- Develops arguments, evaluates evidence, or presents findings
- Example: Research analysis, investigative reports, academic articles, expert commentary
instructional - Teaching and how-to content
- Explicitly teaches skills, concepts, or procedures
- Step-by-step guidance or educational explanations
- Example: Tutorials, how-to guides, educational content, training materials
reference - Lookup materials, definitions, specifications
- Designed for looking up specific information rather than reading through
- Often organized alphabetically, categorically, or as lists
- Example: Dictionaries, encyclopedias, API references, product catalogs
procedural - Step-by-step processes and procedures
- Sequential instructions or workflows
- Process documentation with clear steps
- Example: Recipes, installation guides, standard operating procedures, workflows
qa_structured - Structured question-answer content
- Formal Q&A format with clear questions and answers
- Often expert responses to specific questions
- Example: Stack Overflow, FAQ sections, structured Q&A sites
conversational - Multi-party or turn-based dialogues (humans, bots, or both)
- Casual or structured conversations between two or more participants
- May include human–AI chats, forum threads, or comment chains
- Example: Reddit threads, forum discussions, support chats, assistant chat logs
creative - Entertainment, artistic, fictional content
- Primary purpose is entertainment or artistic expression
- Not primarily informational or instructional
- Example: Short stories, poems, movie reviews, game content, fiction
transactional - Commercial, shopping, service-oriented
- Primary purpose is to facilitate a transaction or service
- Focuses on products, services, or business processes
- Example: Product listings, service descriptions, checkout pages
boilerplate - Legal, policy, standard template text
- Standard legal or policy language
- Often repeated across multiple sites with minimal variation
- Example: Terms of service, privacy policies, disclaimers, cookie banners, standard notices
news_report - Straight reporting of events with minimal analysis
- Describes events or facts in a neutral, descriptive tone
- Time-bound news, updates, or reports
- Example: Wire-service news articles, breaking-news updates
opinion_editorial - Persuasive/opinionated commentary or editorials
- Expresses a stance or argument; aims to persuade
- May cite evidence but prioritizes viewpoint
- Example: Op-eds, opinion columns, personal essays with clear stance
review_critique - Evaluative reviews of products, media, or services
- Provides judgments, ratings, or critiques
- May include pros/cons, scoring systems
- Example: Product reviews, film/book critiques, app store reviews (long-form)
technical_documentation - Manuals, API docs, developer guides, READMEs
- Primary goal is to instruct usage of software/hardware/APIs
- Includes reference sections, examples, parameters, version notes
- Example: API reference, library README, user manual
specification_standard - Normative standards and formal specifications
- Defines requirements, must/shall language, compliance criteria
- Maintained by standards bodies or authoritative groups
- Example: RFCs, ISO standards, formal protocol specs
legal_document - Statutes, case law, contracts, regulatory texts
- Binding or authoritative legal content
- Formal legal language and structure
- Example: Court opinions, legislation, contracts, regulatory rules
press_release - Organization-issued announcements and PR materials
- Promotional announcements framed as information
- Quotes from executives, product/service announcements
- Example: Company press releases, launch announcements
structured_data - Tables, datasets, indices, catalogs with minimal prose
- Predominantly tabular/listed data meant for lookup
- Minimal narrative or explanatory text
- Example: Product catalogs, schedules, statistical tables
source_code - Code listings as primary content
- Dominant content is program source code or scripts
- May include lightweight comments or snippets without narrative
- Example: Code files, gist-like pages, competitive programming solutions
- Tutorial that analyzes different approaches →
["instructional", "analytical"] - Educational reference manual →
["instructional", "reference"] - Research paper with step-by-step methodology →
["analytical", "procedural"] - Q&A site with analytical responses →
["qa_structured", "analytical"] - API guide with examples →
["technical_documentation", "reference", "instructional"] - RFC with rationale →
["specification_standard", "analytical"] - Film review with interview snippets →
["review_critique", "conversational"] - Helpdesk chat with an AI →
["conversational", "transactional"] - Breaking news explainer →
["news_report", "explanatory"]
What we're measuring: Business sector(s) or industry domain(s) for training sector-specific LLMs.
Multi-sector content: Content can be assigned multiple sector labels if it genuinely spans multiple industries. Choose ALL applicable sectors rather than forcing a single primary choice or using "other". Always output an array for this property, even if only one sector applies.
academic_research - Scholarly and research content
- Peer-reviewed publications, academic papers
- University-affiliated research and scholarship
- Formal academic discourse and methodology
- Example: Journal articles, conference papers, academic books, dissertations
education_sector - Educational institutions and pedagogy
- K-12 education, higher education administration
- Educational technology, curriculum development
- Teaching methodologies and educational policy
- Example: School curricula, educational policy papers, teaching resources, edtech content
technology_software - Software and information technology
- Software development, programming, IT services
- Digital products, platforms, and technology companies
- Computer science and software engineering
- Example: Software documentation, tech company content, programming guides, IT industry analysis
hardware_electronics - Hardware devices and electronics industry
- Semiconductors, consumer electronics, embedded systems, hardware design
- Electronics manufacturing and supply chains
- Example: Chip design docs, hardware datasheets, device manuals
healthcare_medical - Healthcare and medical sector
- Medical research, clinical practice, healthcare delivery
- Hospitals, medical devices, healthcare policy
- Public health and wellness
- Example: Medical journals, clinical guidelines, healthcare administration, wellness content
pharmaceutical_biotech - Pharmaceutical and biotechnology
- Drug development, clinical trials, biotech research
- Pharmaceutical industry, biotechnology companies
- Life sciences and molecular biology applications
- Example: Drug research papers, clinical trial reports, biotech industry analysis
financial_services - Banking and financial services
- Banking, investment, insurance, financial planning
- Financial markets, fintech, payment systems
- Asset management and financial advisory
- Example: Financial analysis, banking documentation, investment guides
legal_services - Legal sector and jurisprudence
- Law firms, legal practice, court systems
- Legal education, regulatory compliance
- Litigation, contracts, legal advisory
- Example: Legal briefs, court opinions, legal analysis, compliance guides
government_public - Government and public administration
- Government agencies, public policy, civic services
- Regulatory bodies, public administration
- Political institutions and governance
- Example: Government reports, policy documents, regulatory filings, civic information
manufacturing_industrial - Manufacturing and heavy industry
- Industrial production, manufacturing processes
- Supply chain, logistics, industrial equipment
- Factory operations and industrial engineering
- Example: Manufacturing specs, industrial reports, supply chain analysis, production guides
mining_resources - Mining and natural resources
- Exploration, extraction, and processing of minerals and resources
- Resource markets and operations (metals, rare earths)
- Example: Mining reports, resource exploration docs, commodity operations
chemicals_materials - Chemicals and advanced materials
- Petrochemicals, specialty chemicals, polymers, composites, advanced materials
- Safety data sheets (SDS), process chemistry, materials science
- Example: Material datasheets, REACH documentation, chemical process guides
energy_utilities - Energy and utilities sector
- Power generation, renewable energy, oil and gas
- Electric utilities, water services, waste management
- Energy infrastructure and grid management
- Example: Energy industry reports, utility regulations, renewable energy research
retail_commerce - Retail and e-commerce
- Retail operations, e-commerce platforms
- Consumer goods distribution, merchandising
- Retail technology and customer experience
- Example: Retail industry analysis, e-commerce guides, merchandising strategies
wholesale_distribution - Wholesale trade and distribution
- B2B wholesale, distributors, procurement, inventory and fulfillment
- Supply relationships between manufacturers and retailers
- Example: Distributor catalogs, wholesale operations, procurement guides
real_estate_construction - Real estate and construction
- Property development, construction industry
- Real estate markets, property management
- Architecture and building services
- Example: Real estate analysis, construction specifications, property guides
transportation_logistics - Transportation and logistics
- Airlines, shipping, freight, public transit
- Logistics operations, supply chain transportation
- Vehicle fleet management, transportation infrastructure
- Example: Logistics guides, transportation planning, shipping documentation
travel_aviation - Travel industry and commercial aviation
- Airlines, airports, OTA platforms, hospitality travel operations
- Route planning, airline commercial, loyalty, IATA regulations
- Example: Airline scheduling, fare rules, OTA partner docs
automotive_industry - Automotive manufacturing and services
- Vehicle manufacturers, automotive suppliers
- Automotive technology, electric vehicles
- Dealerships and automotive services
- Example: Automotive engineering docs, vehicle technology papers, industry analysis
telecommunications - Telecommunications industry
- Telecom operators, network infrastructure
- Mobile services, broadband, satellite communications
- Telecommunications equipment and technology
- Example: Telecom industry reports, network specifications, 5G technology papers
media_entertainment - Media and entertainment industry
- Film, television, music, gaming industries
- Publishing, news media, content creation
- Streaming services and digital media
- Example: Entertainment industry analysis, media studies, content strategy
gaming_industry - Video games and interactive entertainment
- Game development, studios, engines, esports, live ops
- Monetization models, community management, platform ecosystems
- Example: Patch notes, game design docs, esports operations
advertising_marketing - Advertising, marketing, and PR
- Brand strategy, campaign planning, performance marketing, martech
- Agencies, in-house marketing, PR communications
- Example: Campaign briefs, media plans, PR strategies
hospitality_tourism - Hospitality and tourism sector
- Hotels, restaurants, travel services
- Tourism industry, destination management
- Event planning and hospitality services
- Example: Tourism studies, hospitality management, travel industry reports
food_beverage_hospitality - Food & beverage and restaurant operations
- Restaurant ops, menu engineering, supply chain, QSR/fast casual
- Food safety, compliance, procurement for F&B
- Example: Restaurant training manuals, HACCP docs, vendor specs
agriculture_food - Agriculture and food production
- Farming, agricultural technology, food processing
- Agricultural supply chain, food safety
- Agribusiness and agricultural policy
- Example: Agricultural research, food industry reports, farming guides
environmental_services - Environmental and sustainability services
- Environmental consulting, ESG reporting, sustainability programs
- Waste management services, remediation, impact assessments
- Example: ESG reports, environmental impact assessments, sustainability frameworks
aerospace_defense - Aerospace and defense industry
- Aircraft manufacturing, space technology
- Defense contractors, military systems
- Aviation and space exploration
- Example: Aerospace engineering papers, defense industry analysis, aviation guides
insurance_industry - Insurance sector
- Life, health, property, and casualty insurance
- Reinsurance, actuarial science, risk assessment
- Insurance technology and underwriting
- Example: Actuarial studies, insurance policy analysis, risk management guides
nonprofit_ngo - Nonprofit and NGO sector
- Charitable organizations, international development
- Social services, humanitarian organizations
- Foundations and philanthropic institutions
- Example: NGO reports, nonprofit management, development studies
consulting_professional - Professional services and consulting
- Management consulting, accounting firms
- Business advisory, professional services firms
- Corporate strategy and business transformation
- Example: Consulting reports, professional services guides, business strategy papers
human_resources - HR and people operations
- Talent acquisition, compensation & benefits, performance management, L&D
- HR tech, workforce planning, organizational development
- Example: HR policy docs, job frameworks, talent strategy
security_cyber - Security and cybersecurity
- Information security, threat intelligence, risk management, compliance (e.g., SOC2)
- Physical security operations and incident response
- Example: Security guidelines, incident playbooks, vulnerability reports
consumer_goods - Consumer products and CPG
- Fast-moving consumer goods, household products
- Personal care, food and beverage brands
- Consumer product development and marketing
- Example: CPG industry analysis, product development docs, consumer research
general_interest - General audience content
- Content for broad audiences without sector focus
- General knowledge and miscellaneous topics
- Cross-sector or sector-agnostic content
- Example: General magazines, broad interest content, lifestyle articles
other - Highly specialized or unclassifiable
- Highly specialized niches not covered by existing sectors
- Content with genuinely unclear sector classification
- Unique content types that don't map to any defined sector
- Example: Highly specialized technical niches, unique content formats
- Medical device regulations →
healthcare_medical+pharmaceutical_biotech+government_public - Fintech software documentation →
financial_services+technology_software - Agricultural biotechnology research →
agriculture_food+pharmaceutical_biotech
What we're measuring: Type and intensity of specialized technical knowledge.
Multi-technical content: Content can be assigned multiple technical content labels if it genuinely combines multiple technical domains. Choose ALL applicable technical types rather than forcing a single primary choice. Always output an array for this property, even if only one technical type applies.
code_heavy - Significant programming content
- Multiple code examples, algorithms, or implementations
- Technical programming concepts and methodologies
- Software development focus
- Example: Programming tutorials, API documentation, software guides
math_heavy - Substantial mathematical content
- Mathematical equations, proofs, or statistical analysis
- Quantitative analysis and mathematical reasoning
- Mathematical concepts and methodologies
- Example: Mathematical papers, statistical analysis, quantitative research
scientific - Research and scientific methodology content
- Scientific research findings, experimental data
- Scientific methodology and analysis
- Peer-reviewed research content
- Example: Research papers, scientific studies, experimental reports
data_heavy - Substantial datasets, tables, and data analysis
- Contains significant data tables, charts, or datasets
- Focus on data interpretation and analysis
- Statistical content with data presentations
- Example: Research data, statistical reports, data analysis, survey results
engineering - Engineering and applied technical content
- Engineering design, systems, and applied technical solutions
- Technical specifications for physical systems
- Non-software engineering disciplines
- Example: Mechanical engineering, civil engineering, technical specifications, design documents
basic_technical - Some technical elements but not dominant
- Light technical content mixed with general explanations
- Technical concepts explained for general audience
- Example: Technology articles for general audience, basic technical explanations
non_technical - No significant technical content
- General audience content without specialized technical knowledge
- No programming, mathematical, engineering, or scientific focus
- Example: General articles, humanities content, basic informational content
- Data science tutorial with code examples →
["code_heavy", "math_heavy", "data_heavy"] - Engineering research with statistical analysis →
["engineering", "scientific", "data_heavy"] - Computational biology paper →
["code_heavy", "scientific"]
What we're measuring: Overall quality of content considering writing excellence, substantive value, and presentation quality regardless of authorship origin.
excellent - Outstanding quality across all dimensions
- Sophisticated writing with varied sentence structures and engaging style
- Rich, appropriate vocabulary with error-free grammar and punctuation
- High substantive value with clear insights or information
- Professional presentation and formatting
- Natural flow and logical organization
- Example: High-quality publications, expert analyses, polished educational content, well-crafted professional documents
good - High quality with minor imperfections
- Grammatically correct with proper sentence structure
- Appropriate vocabulary and tone for content type
- Solid substantive value and clear information
- Good organization and readable flow
- Only occasional minor issues (1-2 typos per section)
- Example: Quality journalism, professional websites, well-written blog posts, solid educational materials
adequate - Acceptable quality for most purposes
- Generally clear and understandable writing
- Some grammatical errors but meaning remains clear
- Reasonable substantive value though may lack depth
- Basic organization and structure present
- Minor formatting or presentation issues
- Example: Casual blogs, user reviews, basic informational content, simple guides
poor - Significant quality issues impacting utility
- Multiple errors affecting comprehension or credibility
- Unclear expression, confusing organization, or awkward phrasing
- Limited substantive value or questionable information
- Major formatting problems or unprofessional presentation
- Difficult to extract reliable information
- Example: Low-quality web content, poorly edited materials, confusing instructions
unacceptable - Quality too low for productive use
- Severely impaired communication with major errors
- Incoherent, nonsensical, or corrupted content
- No reliable substantive value
- Broken formatting or technical corruption
- Cannot determine intended meaning or extract useful information
- Example: Corrupted text, severe translation errors, spam content, SEO content, completely broken formatting
- Comprehension: Can the intended message be clearly understood?
- Substantive value: Does the content provide useful information or insights?
- Technical presentation: Is the content properly formatted and readable?
- Error impact: Do errors significantly impede understanding or credibility?
- Professional standards: Does the content meet basic standards for its intended purpose?
Language-Specific Quality Indicators:
- For non-Latin scripts (Arabic, Chinese, Japanese): Check for proper character encoding
- For agglutinative languages (Turkish, Finnish): Adjust expectations for word count/density
- For languages with different formality levels (Japanese, Korean): Assess appropriate register
- Mixed-language documents: Evaluate code-switching quality and appropriateness
What we're measuring: Ratio of valuable information to redundancy, padding, and repetition.
dense - Efficient, information-packed content
- Every sentence adds new information or insight
- Minimal redundancy or unnecessary elaboration
- Little to no repetition of the same concepts
- Example: Technical specifications, concise academic writing, quality reference material
adequate - Good information content with reasonable elaboration
- Most content adds value with some acceptable elaboration
- Minimal repetition within the document
- Good balance of information and explanation
- Example: Well-written articles, good tutorials with examples
moderate - Mixed substantive content with noticeable padding
- Some valuable information mixed with unnecessary elaboration
- Noticeable repetition of key points for emphasis
- Some sections feel padded or verbose
- Example: Blog posts with some fluff, articles with repetitive conclusions
thin - Low information content with significant problems
- Much content doesn't add new information
- High internal repetition and excessive redundancy
- Significant padding to reach desired length
- Example: SEO-optimized content, poorly edited writing
empty - Dominated by repetition and meaningless content
- Minimal actual information value
- Dominated by repetition and copy-paste artifacts
- Same ideas repeated multiple times without development
- Example: Spam content, template-filled pages, keyword-stuffed articles
- Same phrases repeated throughout (especially in SEO content)
- Identical paragraphs or sections (copy-paste errors)
- Circular reasoning (saying the same thing in different ways)
- Template artifacts (repeated boilerplate mixed with content)
What we're measuring: Potential for teaching, learning, and knowledge transfer.
high - Clear instructional design and learning objectives
- Explicitly teaches concepts or skills
- Progressive skill building from basic to advanced
- Clear learning objectives and outcomes
- Comprehensive explanations with examples
- Example: Quality tutorials, textbooks, structured courses, educational guides
moderate - Good instructional value with some learning potential
- Some instructional elements present
- Explanations help build understanding
- Transferable knowledge to other contexts
- Good examples or illustrations
- Example: How-to articles, explanatory content, informative guides
basic - Limited educational content
- Some explanations but not systematically instructional
- Basic explanations of concepts
- Limited learning potential or skill building
- Example: Basic explanations, simple informational content
minimal - Little educational value
- Primarily informational rather than instructional
- No clear learning objectives or skill building
- Entertainment or commercial focus
- Example: Entertainment content, basic news, commercial content
none - No educational content
- No instructional value or learning potential
- Purely transactional, entertainment, or administrative
- No knowledge transfer potential
- Example: Pure entertainment, transactions, legal boilerplate
- Explanatory vs Educational: explanations alone ≠ educational design; require intent to teach plus scaffolding for Basic+
- Reference docs: typically Minimal; promote to Basic/Moderate when guided “how-to” segments or curated examples exist
- Reviews/op-eds: None/Minimal unless they include actionable how-to guidance designed for learning
- Keywords: Objectives/Outcomes, Lesson, Exercise/Quiz, Homework, Assessment, Syllabus, Module, Unit, Learning Goals
- Structure: numbered steps + prerequisites/requirements → Basic; add practice tasks/solutions → Moderate; syllabus/modules/assessments → High
- Signals of non-edu mix: heavy CTAs/ads or product pitches → cap at Minimal unless clear instructional scaffolding
- Are there explicit learning goals or a syllabus? → High
- Else, are there step-by-step instructions with examples/exercises? → Moderate
- Else, are there explanatory sections intended to teach basics? → Basic
- Else, is there any minor instructional element? → Minimal
- Otherwise → None
- API reference with examples but no guidance → Minimal to Basic (depending on clarity/examples)
- Blog post explaining concept with analogies and one example → Basic
- Tutorial with tasks, checkpoints, and solutions → High
- Product documentation with “Getting Started” and “How-To” flows → Moderate
- Learning objectives: Clear goals for what reader should learn
- Skill progression: Builds from basic to advanced concepts
- Examples and practice: Provides concrete examples or exercises
- Knowledge transfer: Concepts applicable beyond immediate context
What we're measuring: Presence and quality of logical reasoning, analysis, and explanatory content.
analytical - Complex reasoning and systematic analysis
- Multi-step arguments with logical progression
- Cause-effect analysis and systematic thinking
- Considers multiple perspectives or variables
- Draws conclusions from evidence and reasoning
- Example: Research analysis, complex problem-solving, systematic evaluations
explanatory - Clear explanations with logical flow
- Explains how or why things work
- Shows cause-effect relationships clearly
- Educational reasoning that builds understanding
- Logical connections between concepts
- Example: Good tutorials, educational content, how-to explanations
basic_reasoning - Simple logical connections
- Some logical connections between ideas
- Basic explanations of concepts or processes
- Elementary analytical thinking
- Simple cause-effect relationships
- Example: Basic explanations, simple arguments, elementary analysis
minimal - Limited reasoning, mostly descriptive
- Primarily describes what rather than why or how
- Few logical connections between ideas
- Mostly factual statements without analysis
- Little explanatory content
- Example: Basic descriptions, simple factual content, minimal analysis
none - No clear reasoning present
- Purely descriptive content
- Simple factual listing without connections
- Narrative content without analysis
- No logical argumentation or explanation
- Example: Simple lists, basic narratives, pure description
- Stepwise structure: numbered steps in proofs/derivations/solutions; “First… therefore… hence… so…”
- Hypothesis and test: assumptions, intermediate results, counterexamples, sanity checks
- Tool- or method-calls: named algorithms, theorems, lemmas, or procedures invoked and justified
- Error analysis or reflection: “we tried X, failed because Y, so we…”, “limitations,” “edge cases”
- Intermediate artifacts: scratch calculations, partial code reasoning, sub-problems and sub-claims
- Explanatory vs Analytical: explanations tell how; analytical shows multi-step inference with evidence and intermediate claims
- Worked example vs Mere answer: worked examples expose steps and justification; mere answers without steps are not reasoning-rich
- Procedural vs Reasoning: procedural lists actions; reasoning links actions via logic, evidence, or constraints
- Lexical cues: because, therefore, thus, hence, suppose/assume, we conclude, by induction, lemma/theorem/proof, O(n), hypothesis, counterexample
- Structure cues: presence of proof blocks, derivations (e.g., “Proof.”, “QED”, TeX environments), multi-step numeric calculations
- Program reasoning: code comments like “// invariant”, “// complexity”, pre/post-conditions, test reasoning
- Thresholding: count reasoning cues per 1k tokens; with ≥2 structural cues or ≥5 lexical cues → at least explanatory; proofs/derivations → analytical
- Is there a proof/derivation or multi-step argument with intermediate claims? → analytical
- Else, does it explain why/how with cause-effect and logical links? → explanatory
- Else, are there simple logical connections or one-step justifications? → basic_reasoning
- Else, does it mostly describe without connecting ideas? → minimal/none
- Answer-only solutions (final numeric result without steps) → minimal
- Step-by-step math solution with intermediate equations → analytical
- “How it works” article connecting 2–3 causal steps without data → explanatory
- Troubleshooting log with attempts and justifications → analytical if causal chain is explicit; otherwise explanatory
- Cause-effect: "Because X, therefore Y"
- Problem-solution: Identifies problems and proposes solutions
- Comparison: Analyzes similarities and differences
- Logical progression: Ideas build on previous ideas
- Evidence-based conclusions: Draws conclusions from presented evidence
What we're measuring: Intended sophistication level and background knowledge assumptions of the target audience.
expert - Highly specialized professional/academic content
- Assumes deep domain expertise and advanced training
- Uses technical terminology without explanation
- Content for practitioners actively working in specialized fields
- Example: Climate modeling methodology in Nature Climate Change, research papers, technical specifications, expert-to-expert communications
advanced - Educated adult audience with analytical skills
- Assumes higher education and critical thinking ability
- Explains specialized concepts but uses sophisticated language
- Intellectually challenging but accessible to educated generalists
- Example: Complex climate change analysis in The Atlantic, quality journalism, policy analysis, advanced general interest content
general - General adult audience
- Accessible to most educated adults without specialized background
- Explains technical concepts when introduced
- Uses clear language while maintaining intellectual substance
- Example: Quality journalism, general interest articles, accessible explanations of complex topics
beginner - Introductory level with minimal prerequisites
- Explains basic concepts and terminology
- Builds up from fundamental principles
- Assumes minimal prior knowledge of the subject area
- Example: Introductory tutorials, beginner guides, basic explanations, getting-started content
youth - Targeted at teenagers and young adults (ages 13-19)
- Age-appropriate complexity with contemporary cultural references
- Sophisticated enough for developing critical thinking but accessible
- May address topics relevant to adolescent experiences and concerns
- Example: High school educational content, young adult literature, teen-focused explanations, college prep materials
children - Designed specifically for children
- Simple language and concepts appropriate for young readers
- Educational content designed for elementary/middle school levels
- Age-appropriate topics and complexity
- Example: Children's educational content, elementary school materials, simple explanations for young learners
- Professional context: Is this content designed for workplace use vs. general learning?
- Terminology density: How much specialized vocabulary is used without explanation?
- Concept complexity: How sophisticated are the ideas and their development?
- Background assumptions: What education level and domain knowledge does the author assume?
Cross-Linguistic Considerations:
- Expert terminology density varies by language (German allows more compound terms)
- Formality markers differ across cultures
- Educational level assumptions vary by country's education system
- Age-appropriate content differs across cultures
What we're measuring: How much commercial interests influence the objectivity and informational value of content.
none - No commercial influence detected
- Objective, informational presentation
- No promotional language or commercial agenda
- Focus purely on informing or educating
- Example: Academic papers, objective journalism, educational content
minimal - Slight commercial context but maintains objectivity
- May mention products/services but in informational context
- Maintains balanced, objective tone
- Commercial mentions serve informational purpose
- Example: Product reviews with balanced analysis, informational articles mentioning relevant products
moderate - Some commercial influence on content
- Mix of informational and promotional content
- Some promotional language but still provides useful information
- Commercial interests somewhat visible but not dominant
- Example: Company blogs with useful information, sponsored content with actual value
heavy - Strong commercial bias throughout
- Primarily promotional with some informational elements
- Heavy use of marketing language and persuasive techniques
- Clear commercial agenda affects content objectivity
- Example: Marketing articles disguised as information, heavily biased product comparisons
pure_marketing - Entirely commercial/promotional content
- No genuine informational value beyond promotion
- Pure marketing copy or advertising material
- Designed solely to drive sales or conversions
- Example: Sales pages, pure advertising copy, promotional brochures
- Language tone: Objective vs. promotional language
- Primary purpose: Inform vs. persuade/sell
- Balance: Are alternatives/drawbacks mentioned?
- Call-to-action: Subtle information vs. obvious sales pitch
What we're measuring: How time-sensitive the content is - whether its value degrades over time or remains stable.
evergreen - Content remains valuable indefinitely
- Fundamental concepts, principles, theories
- Historical information and established facts
- Skills and techniques that don't change
- Reference materials with lasting value
- Example: Mathematical proofs, language grammar guides, classical literature analysis, basic cooking techniques
slowly_changing - Content remains valuable for years
- Best practices that evolve slowly
- Technical content that updates every few years
- Cultural and social topics with gradual change
- Example: Programming language tutorials, academic textbooks, industry standards, educational curricula
regularly_updating - Content valuable for months to a year
- Industry trends and market analysis
- Technology reviews and comparisons
- Policy discussions and current research
- Example: Software framework guides, business strategies, product reviews, research summaries
time_sensitive - Content value degrades quickly
- News and current events
- Time-bound information (prices, schedules, availability)
- Temporary situations or short-term trends
- Real-time data and statistics
- Example: Stock prices, weather reports, breaking news, event announcements, sales/promotions
- Core question: If someone reads this in 2 years, will it still be valuable?
- Update frequency: How often does this type of information typically change?
- Temporal references: Does the content heavily reference "now," "recently," "currently"?
- Subject matter stability: Is this about unchanging principles or evolving situations?
What we're measuring: Presence of inappropriate, harmful, or legally problematic content.
safe - Appropriate for all contexts
- No concerning content of any type
- Professional, appropriate language throughout
- Suitable for general audiences including workplace settings
mild_concerns - Minor issues that don't constitute major problems
- Occasional mild profanity in context
- Brief mentions of sensitive topics handled appropriately
- Minor concerns that don't affect overall suitability
- Example: Historical discussions of sensitive topics, professional content with mild language
nsfw - Not safe for work or general audiences
- Explicit sexual content or graphic descriptions
- Adult themes requiring content warnings
- Graphic violence or disturbing imagery descriptions
- Example: Adult content, graphic medical descriptions, explicit violence
harmful - Potentially harmful content requiring careful handling
- Content promoting dangerous activities or self-harm
- Hate speech targeting individuals or groups
- Violent content glorifying harm to others
- Example: Self-harm content, hate speech, dangerous "how-to" guides
illegal - Illegal content requiring immediate rejection
- Content promoting clearly illegal activities
- Material that violates laws in major jurisdictions
- Example: Terrorist content, child exploitation
- Context matters: Medical/educational discussions of sensitive topics may be appropriate
- Intent matters: Discussing harmful topics for educational purposes vs. promoting them
- Audience consideration: Content appropriate for experts may not be safe for general audiences
What we're measuring: Whether the content contains personally identifiable information that could identify private individuals.
no_pii - No personal information detected
- No names of private individuals
- No contact information (emails, phones, addresses)
- No identification numbers
- Public figures and officials mentioned by name are acceptable
- Example: News articles about politicians, technical documentation, general information
contains_pii - Contains potentially identifiable information
- Names of private individuals (non-public figures)
- Email addresses, phone numbers, physical addresses
- ID numbers (SSN, passport, driver's license, employee IDs)
- Medical information about identifiable individuals
- Financial account information
- Example: Personal blogs with full names, leaked databases, medical case studies with identifying info
- Public vs. Private figures: Politicians, celebrities, CEOs = public (no PII flag); private citizens = PII
- Context matters: Academic paper authors and their institutional emails = typically no PII; personal emails in forums = PII
- Aggregated vs. Individual: Statistical data = no PII; individual records = PII
What we're measuring: Primary regional, cultural, or geopolitical sphere(s) that the content relates to, regardless of language used.
Multi-regional content: Content can be assigned multiple regional labels if it genuinely spans multiple regions. Choose ALL applicable regions rather than forcing a single primary choice. Always output an array for this property, even if only one region applies.
european - European context (EU and broader Europe)
- Content about European countries, EU policies, or pan-European topics
- European cultural perspectives, social systems, or business practices
- References to European cities, institutions, companies, or regulations
- Includes: EU member states, UK, Switzerland, Norway, Balkans, etc.
- Example: GDPR compliance, European Parliament elections, Schengen area travel, European football leagues
north_american - North American context
- Content about US, Canada, or Mexico
- North American cultural perspectives, USMCA/NAFTA region topics
- References to North American institutions, companies, or issues
- Example: FDA regulations, Silicon Valley tech, NHL, US constitutional law, Canadian healthcare
east_asian - East Asian context
- Content about China, Japan, Korea (North/South), Taiwan, Mongolia
- East Asian cultural perspectives, Confucian-influenced societies
- References to East Asian economic models, companies, or social systems
- Example: Gaokao exams, K-pop, Shenzhen tech hub, Japanese work culture, Taiwan semiconductor industry
south_asian - South Asian context
- Content about India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, Afghanistan, Maldives
- South Asian cultural perspectives, subcontinental issues
- References to South Asian institutions, economies, or social structures
- Example: IIT entrance exams, Bollywood, cricket leagues, monsoon impacts, caste system discussions
southeast_asian - Southeast Asian context
- Content about ASEAN countries (Indonesia, Thailand, Vietnam, Philippines, Malaysia, Singapore, etc.)
- Southeast Asian regional perspectives and economic integration
- References to ASEAN policies, regional companies, or cultural phenomena
- Example: ASEAN economic community, Indonesian elections, Singapore financial sector, Thai tourism
middle_eastern - Middle Eastern and North African context
- Content about Arab states, Iran, Turkey, Israel, North Africa (MENA region)
- Middle Eastern cultural perspectives, Islamic finance, regional conflicts
- References to Middle Eastern institutions, oil economies, or geopolitics
- Example: Gulf Cooperation Council, OPEC decisions, Middle East peace process, Islamic banking
sub_saharan_african - Sub-Saharan African context
- Content about African countries south of the Sahara
- African Union topics, sub-Saharan development issues
- References to African institutions, economies, or cultural topics
- Example: M-Pesa mobile banking, African Union policies, safari tourism, ubuntu philosophy
latin_american - Latin American context
- Content about Central and South America, Caribbean
- Latin American cultural perspectives, regional integration (Mercosur, etc.)
- References to Latin American institutions, economies, or social movements
- Example: Mercosur trade, telenovelas, Amazon rainforest, Latin American revolutions
oceanian - Oceanian context
- Content about Australia, New Zealand, Pacific Island nations
- Oceanian perspectives, Pacific regional issues
- References to Oceanian institutions, companies, or cultural topics
- Example: ANZAC relations, Pacific Island climate change, Australian mining, Māori culture
central_asian - Central Asian context
- Content about Kazakhstan, Uzbekistan, Turkmenistan, Tajikistan, Kyrgyzstan
- Central Asian perspectives, post-Soviet regional dynamics
- Silk Road region, resource economies, nomadic heritage
- Example: Silk Road initiatives, Caspian Sea resources, post-Soviet transitions
russian_sphere - Russian/Post-Soviet context
- Content about Russia, Belarus, and strong Russian influence areas
- Post-Soviet perspectives, CIS (Commonwealth of Independent States) topics
- Russian language content about regional (not global) topics
- Example: Russian federal politics, CIS integration, post-Soviet economic transitions
global - Genuinely international or universal
- Content with truly global scope or application
- International organizations, worldwide phenomena, global comparisons
- Topics that transcend regional boundaries
- Example: UN reports, climate change (global perspective), international standards, pandemic response
culturally_neutral - No clear regional focus
- Abstract, theoretical, or technical content without regional markers
- Universal scientific, mathematical, or philosophical content
- Content that could apply equally anywhere without modification
- Example: Mathematical proofs, chemical formulas, abstract philosophy, programming concepts
indeterminate - Cannot determine regional relevance
- Insufficient content to identify regional focus
- Mixed or contradictory regional signals
- Fragment or corrupted content lacking regional context
- Example: Technical specifications without context, isolated data tables
- EU-China trade relations →
["european", "east_asian"] - NAFTA/USMCA impact on Mexican agriculture →
["north_american", "latin_american"] - Indian diaspora in the Gulf states →
["south_asian", "middle_eastern"] - Comparative study of healthcare systems globally →
["global"]
Primary indicators:
- Geographic references: Countries, cities, regions, landmarks mentioned
- Institutional references: Governments, companies, universities, organizations specific to region
- Cultural markers: Holidays, customs, cultural phenomena, sports, entertainment
- Political/economic systems: References to regional political structures, economic blocs
- Legal/regulatory frameworks: Region-specific laws, regulations, standards
- Language context: While not determinative, language can provide regional hints
Important distinctions:
- Language ≠ Region: Spanish content about Asian markets =
["east_asian"], not["latin_american"] - Company origin vs. topic: Apple (US company) operating in India = consider actual content focus
- Historical vs. current: Historical content about ancient Rome =
["european"]if discussing modern implications - Diaspora content: Content about diaspora communities should include both origin and current regions
Quality checks:
- If content is in a non-English language but discusses global topics → still mark as
["global"] - If content compares multiple regions → mark all regions discussed substantially
- If content is about a specific place but has universal applications → consider both regional and global tags
What we're measuring: Which specific country or countries (if any) the content is relevant to, globally.
Note: Always output an array of country names for this property (even when only a single country applies). Use standard country names from any region worldwide (e.g., "germany", "france", "united_states", "china", "japan", "brazil", "india", "south_africa", "australia", "canada", etc.). The array may also contain the special values supranational or non_country_specific when appropriate.
[COUNTRY_NAME] - Content specifically relevant to a single country
- Content explicitly about that country's politics, culture, institutions, or regulations
- Content written from that country's cultural perspective
- Content addressing that country's specific issues, regulations, or cultural phenomena
- Content about that country's cities, companies, institutions, or country-specific topics
- Example: For "germany" → German election coverage, Bundesliga content, German legal analysis
- Example: For "united_states" → US election coverage, NFL content, US legal analysis
- Example: For "japan" → Japanese politics, J-League content, Japanese cultural analysis
- Political content: Elections, government policies, political parties, political figures specific to the country
- Cultural content: National traditions, cultural phenomena, historical events specific to the country
- Institutional references: Government bodies, national companies, universities specific to the country
- Geographic focus: Cities, regions, landmarks within the country as primary subjects
- Legal/regulatory: Laws, regulations, legal frameworks specific to the country
- Economic content: National economic policies, country-specific market analysis
- Sports/media: National sports leagues, national teams, country-specific media outlets
- Social issues: Social policies, demographic topics, social movements specific to the country
supranational - Content focused on supranational entities or regions
- International organizations, regional blocs, global institutions
- Content about supranational policies, international organizations, global governance
- Pan-regional analysis that transcends individual countries
- Multi-continental or global institutional content
- Example: UN resolutions, NATO discussions, EU policy analysis, ASEAN agreements, WTO trade rules
non_country_specific - No specific country relevance
- Abstract, theoretical, or universal content without geographic specificity
- Technical/scientific content that applies globally without country focus
- Content that doesn't reference specific countries or national contexts
- Example: Mathematical proofs, universal scientific principles, abstract philosophical discussions
This seems to be a comprehensive list. I would be very interested in how you segregate the data for each type. For MixtureVitae data, we used some fasttext classifiers, but to cover each category on this list, one needs a bunch of efficient categorization methods - clustering might be effective