p3nGu1nZz · July 15, 2025 16:41
diff --git a/gistfile1.txt b/gistfile1.txt
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\n5.4 Qualitative Span Interpretability\nTo assess the plausibility and semantic alignment of X-Spanformer's induced spans, we perform side-by-side comparisons against syntactic and semantic reference structures. Using single-sentence prompts drawn from the validation sets of WikiText and Stream-Mix, we visualize the top-K spans selected at various layers and entropy regimes. We benchmark span boundaries against:\nSyntactic parses: Constituents produced by Berkeley Neural Parser", "type": "mixed", "id": {"id": "a0409606-f532-4dd2-b02e-2a0bae5bfeee"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"SPAN-AwARE ENCODER,\" specific versions like \"5.4 Qualitative Span Interpretability\", references to datasets (\"WikiText\" and \"Stream-Mix\"), methods (e.g., \"side-by-side comparisons\"), tools or systems used for parsing (\"Berkeley Neural Parser\"), which are all relevant patterns that can aid in learning span segmentation from mixed content types. / The segment contains a mix of technical terms and references to syntactic structures, which can help the model learn span segmentation in both coding contexts (like X-Spanformer) and natural language descriptions related to programming concepts. It is clean but lacks context for full comprehension without additional data. 
/ The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", numerical values like '5.4', references to datasets (\"WikiText\" and \"Stream-Mix\"), methods (e.g., \"side-by-side comparisons\") or tools used for analysis (\"Berkeley Neural Parser\"). It is clean, coherent with technical context suitable as training data for a span-aware model that needs understanding of both natural language descriptions and code-like structures. / Contains a mix of technical terms and structured information that can be segmented into meaningful spans for learning purposes. The text segment includes both domain-specific terminology (\"X-SPANFORMER\", \"SPAN-AwARE ENCODER\") as well as references to syntactic structures, which are valuable patterns in span segmentation training data. / The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", numerical values like '5.4', references to datasets (\"WikiText\" and \"Stream-Mix\"), methods (e.g., \"side-by-side comparisons\") or tools used for evaluation purposes (“Berkeley Neural Parser”). It is clean, coherent with a clear purpose of demonstrating the application context in both natural language explanation as well as technical terms related to code."}}
 {"raw": "[74] Ralph  Weischedel et al. OntoNotes Release 5.0. Linguistic Data Consortium; LDC2013T19. Philadelphia: Linguistic Data Consortium. 2013.", "type": "mixed", "id": {"id": "1ed377b0-9c24-423b-aaed-42d30bd033a9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear citation structure with identifiable spans for authors, title, and publication details. Useful patterns present in academic referencing format. / Contains a citation with structured elements (authors, title, release number) that can be segmented into meaningful spans for training purposes. The mix of language and numerical data provides diverse patterns useful in span segmentation learning. / Clear citation structure with identifiable spans for author(s), title, and publication details; represents valuable patterns in academic referencing. / Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for training purposes. The mix of text and reference format provides diverse patterns useful in span segmentation tasks. / Clear citation structure with identifiable spans for author(s), title, and publication details; useful patterns in academic referencing."}}
 {"raw": "Each span candidate corresponds to a contiguous subsequence [hi, hj] and will be considered for inclusion in the predicted segmentation. The next module computes scores for each of these.\n3.2 Span Predictor\nThe span predictor computes a scalar confidence score for each candidate span (i,j) € C, reflecting how likely that subsequence is to form a coherent semantic O syntactic unit.", "type": "mixed", "id": {"id": "40b05200-5a08-49ec-9091-e715375865f1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear span candidates with coherent semantic and syntactic units; valuable for learning complex patterns in both language structure and computational logic. / Clear span candidates and coherent semantic/syntactic unit representation, suitable for training a tokenizer-free model. / Clear spans identified; combines both technical terms and structured descriptions of a model's components, representing valuable patterns for learning span composition in the context of machine learning models. / Clear spans identified; combines structured programming concepts with explanatory text, useful for learning span segmentation in both contexts. / Clear spans with semantic and syntactic units; clean, coherent content representing valuable patterns for learning span composition in both language contexts."}}
 {"raw": "The fused control vector s is injected at various stages of the transformer stack via prefix tokenization; attention projection, or feed-forward gating: Each pathway supports differentiable influence over structure-aware representation learning: Inspired by prefix tuning [30], adapter routing [47], and conditional computation frameworks such as Primer", "type": "mixed", "id": {"id": "845b447a-acd7-400e-bba3-619ddbc59518"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding concepts (prefix tokenization) and natural language descriptions (\"injected at various stages\", \"differentiable influence\"). / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"fused control vector s,\" \"transformer stack,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code understanding. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"fused control vector s\", \"transformer stack\", etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code understanding. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"fused control vector s\", \"transformer stack\", etc., representing valuable patterns for learning span composition in both natural language descriptions related to programming concepts and code-like structures. 
/ Contains a mix of technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both programming contexts (e.g., \"transformer stack\", \"prefix tokenization\") and natural language descriptions (\"injected at various stages\")."}}
 {"raw": "Inspired by boundary- based approaches in segmentation-aware models [11, 15], we model start and end posit tions inde- pendently: This simplification makes inference tractable, allows for efficient parallel scoring, and empirically yields high-quality span proposals across domains 5 We compute unnormalized logits and normalized distributions over token positions: = WsH + bs, &zp\" softmax( €8 = WeH + be, &pe softmax(€e)_ where /s , Qe e RL and p; denotes the probability of a span beginning at position i,", "type": "mixed", "id": {"id": "dfb90bb7-74cd-48e5-b488-b30ea6cb9ccf"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains both structured programming expressions and mathematical notation, demonstrating clear span segmentation opportunities in a combined context. / The segment contains both mathematical notation and pseudo-code, which can be segmented into meaningful spans for a span-aware model to learn from; however, it lacks clarity due to the mixture of content types. / The segment contains a mix of mathematical expressions and programming-like notation, which can help the model learn span segmentation in both domains; however, it lacks clarity due to unconventional formatting (e.g., missing spaces around operators). / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning span composition in both domains. However, the presence of LaTeX-style formatting may pose challenges during preprocessing but is not detrimental to its utility as training data. / The text contains a mix of mathematical expressions and programming-like notation, which could confuse the model due to lack of clear delimiters for spans in both domains; it lacks clarity on where one span ends and another begins. 
Additionally, there are typos (e.g., \"posit tions\" should be \"positions\") that need correction before use as training data."}}
 {"raw": "with p; denoting its end at j. Each span (i, j) is assigned a confidence score by multiplying its boundary probabilities: scoreli,j) = pi Pj This outer-product scoring approach has been widely used for efficient span extraction in question answering and entity recognition tasks [22, 23].", "type": "mixed", "id": {"id": "7fda5f22-490e-4784-8200-6b0e5e7465ea"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spans with confidence scores, representing both structured data and mathematical expressions useful for span segmentation learning. / Clear spans with confidence scores, useful for learning span composition in both language and mathematical expressions. / Clear spans with confidence scores; useful for learning span composition in both language and mathematical expressions. / Clear spans with confidence scores; useful for learning span composition in both language and mathematical expressions. / Clear spans identified (i, j), confidence score formula; represents valuable patterns for learning span composition in both programming and mathematical notation."}}
 {"raw": "It biases selection toward spans with high local boundary salience while preserving diversity through length variation: We then extract the top-K scoring candidates: S = TopK {score(i,j) | (i,j) € C} .", "type": "mixed", "id": {"id": "0850e726-13d4-45a2-89d3-427e57a41fae"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and English text, with clear boundaries for spans that can be segmented into meaningful parts such as \"local boundary salience,\" \"length variation,\" etc., which are valuable patterns to learn span composition in both code-like expressions (e.g., S = TopK {score(i,j) | (i,j) € C}) and natural language descriptions. / The segment contains a mix of mathematical notation and English, with clear boundaries for spans such as \"local boundary salience,\" \"length variation,\" and the formula-like expression involving variables i,j in set C. It is clean but lacks context or explanation that could be beneficial to training data quality. / The segment contains a mix of mathematical notation and English text, with clear boundaries for spans that can be identified as individual concepts or phrases (e.g., \"local boundary salience,\" \"TopK scoring candidates\"). It is clean but lacks contextual coherence which might affect training utility. / The segment contains a mix of mathematical notation and English prose, which can help the model learn to handle both numerical expressions (code-like) and textual descriptions together. It has clear boundaries for spans that are likely useful in training span-aware models. / Clear spans for both mathematical expressions and programming constructs; diverse patterns useful for span segmentation learning."}}
 {"raw": "We find that average J improves with training and correlates with increased controller confidence (lower entropy) , especially in layers 4-6. Conclusion\nInduced spans tend to reflect coherent linguistic structure without explicit syntactic supervision.", "type": "natural", "id": {"id": "d72bdfb9-59f5-4bfb-87e3-6beaec6445c6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning linguistic patterns and coherence in text. / The segment contains clear sentence structures and phrases that can be segmented into meaningful spans, such as \"average J improves with training,\" which is a coherent statement in the context of discussing machine learning outcomes; it also includes terms like 'controller confidence' relevant to AI discourse. / Clear sentence structure with identifiable spans; useful for learning span segmentation in English prose. / Clear sentence structure with meaningful spans; demonstrates linguistic patterns useful for training a span-aware model in the context of language understanding. / Clear sentence structure with identifiable spans; useful for learning coherent linguistic patterns and span composition in text."}}
 {"raw": "Proposition 1 (Top-K Marginal Likelihood). Let p8 € 4L and pe € 4L be independent boundary distributions over L token positions. Define the induced span measure P(i,j) = pi p; over the candidate set C = {(i,j) | 1 <i < j < L}. Then, under the independence assumption, the optimal set of K spans that maximizes the total marginal likelihood is given by: S = TopK {P(i,j) | (i,j) € C} , and satisfies: S = arg max P(i,j). S'CC 1S'|=K (i,j)es' Proof By construction, each candidate span (i,j) is assigned an", "type": "mixed", "id": {"id": "94cad31d-476d-4826-b9aa-834be40512fd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and formal language, which can be segmented into meaningful spans such as equations (P(i,j)), variables (p8, pe), sets ((i,j) € C), etc., representing valuable patterns for learning span composition in both natural language context descriptions (\"Let p8\", \"Then\") and code-like expressions. / The segment contains a mix of mathematical notation and formal language, which can be segmented into meaningful spans such as equations (P(i,j)), sets C, S, etc., representing valuable patterns for learning span composition in both natural text with technical content and code-like structures. / Contains both structured data (spans, mathematical notation) and unstructured text (\"Proof By construction,\" \"each candidate span\"), which can help the model learn diverse patterns for segmenting spans in a mixture of code-like expressions with natural language explanations. 
/ The segment contains a clear mathematical proposition with structured elements like variables, equations and logical statements that can be segmented into meaningful spans for learning purposes in both coding logic (independent boundary distributions) and natural language explanation of the concept. However, it lacks context which might affect its representativeness as standalone training data. / The segment contains a clear mathematical proposition with structured elements like variables, equations and logical statements that can be segmented into meaningful spans for learning purposes. It combines both formal language (natural) used in mathematics/statistics as well as code-like syntax which is beneficial to span-aware models dealing with diverse content types."}}
 {"raw": "Each span is scored by a parameterized function fe(wi:j), typically an MLP o bilinear form.", "type": "mixed", "id": {"id": "e395fe40-a40b-4f94-a6ef-914a667193b7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spans of both technical terms and phrases, representing a combination useful for learning span composition in diverse contexts. / The text segment contains a clear combination of both technical terminology and mathematical notation, which is beneficial for learning span segmentation in an encoder that handles multiple content types. It includes meaningful spans like \"span\", \"parameterized function\", \"MLP\", etc., representing valuable patterns across natural language descriptions with code-like expressions (\"o bilinear form\"). / The segment contains a clear combination of technical terms and mathematical notation, which are valuable for learning span composition in both programming contexts (MLP) and formal expressions ((w_i:j)). It is clean with meaningful spans identifiable as \"span\", \"parameterized function\", etc. / The text segment contains a clear mix of both programming terms and mathematical notation, which can be segmented into meaningful spans for training purposes in the context of span-aware models that handle such content types. / The segment contains a clear combination of both technical language and mathematical notation, which is beneficial for learning span segmentation in diverse contexts."}}
 {"raw": "LR le-4 base LR used for all modules Dropout 0.1 applied to all nonlinearity layers Max grad norm 1.0 gradient clipping threshold Epochs 50 full fine-tuning duration Batch size 64 across all stages Span width Umax 10 max width considered per token Entropy Ao 1.0 initial entropy coefficient Decay 0.1 exponential decay rate Span pooling strategy Gated self-attention with key-query masking and layer norm Table 2: Hyper-parameters used in all experiments.", "type": "mixed", "id": {"id": "b2a8de0a-6577-4cf1-b5d9-9e3f3a1a408d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured data (hyperparameters), which can help the model learn span segmentation in both programming contexts and descriptive text, though it lacks contextual coherence for natural language understanding. / Clear mix of hyperparameters and table format, useful for learning span segmentation in both numerical values (code) and descriptive text (natural language). / The segment contains a mix of technical terms, hyperparameters and their descriptions which can be segmented into meaningful spans for learning span composition in both coding contexts (like \"LR le-4 base LR\") and natural language explanations (\"Dropout 0.1 applied to all nonlinearity layers\"). / The segment contains a mix of technical terms and parameters that can be segmented into meaningful spans, such as \"LR le-4 base LR,\" which could represent different learning rates for layers in neural networks; this is valuable training data with clear structural patterns useful to the model. / The segment mixes hyperparameters and concepts without clear, meaningful spans; lacks coherence for training purposes."}}
 {"raw": "Hence the cost is: 0(Lwmax ,\n(2) Span encoding and filtering: After top-K selection, each span is pooled (e:8 , via mean or self-attention) into a vector of dimension d, and scored by span-type and confidence heads: These operations are linear in d, giving: O(Kd). 11", "type": "mixed", "id": {"id": "54a384a5-56f3-4fbf-adfb-f81eb9b50d3d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spans of both text and mathematical expressions; well-formed for training purposes, representing valuable patterns in span composition across different domains. / Clear spans for both programming constructs and mathematical expressions, well-representative of span-aware encoding tasks. / Clear spans for both programming constructs and mathematical expressions, representing valuable patterns in span composition. / The segment contains a clear mix of mathematical notation and prose, with identifiable spans for both equations (e.g., \"Lwmax\") and descriptive text (\"After top-K selection,\" etc.). It is cleanly structured to represent the composition patterns in span segmentation across different domains. / Clear spans for both programming constructs and comments; well-formed, representing valuable patterns in span composition across different domains."}}
 {"raw": "Each value is averaged across final 5 epochs post-convergence. Lower values retain exploratory routing; higher values promote sparsity\nFinal H(P) (L better confidence) Avg: Span Width U 0.01 3.71 5.3 0.05 2.08 6.9 0.10 1.49 9.2 0.50 0.41 11.6", "type": "mixed", "id": {"id": "df14be9c-7a97-42d4-82dd-c60951805020"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of numerical data and text, with clear patterns in the structure that can be segmented into meaningful spans for training purposes. It is clean but lacks contextual clarity due to its technical nature. / Contains both numerical values and textual descriptions, representing a mix of structured data (code-like) with explanatory text that can help the model learn span segmentation in diverse contexts. / Contains both structured data (averaged values, epochs) and unstructured text (\"Lower values retain exploratory routing; higher values promote sparsity\"), which can help the model learn span segmentation in a diverse context. / Contains a mix of numerical values and text, with clear spans for both numbers (e.g., \"Final H(P)\") and descriptive phrases (\"Lower values retain exploratory routing; higher values promote sparsity\"). The structure is clean but lacks context to fully understand the domain. / The segment contains a mix of numerical values and text, with clear patterns in the structure that can be segmented into spans representing different parts (e.g., \"Each value is averaged across final 5 epochs post-convergence\" as one span). It also includes tabular data which could help learn structured representations."}}
 {"raw": "Span ~embeddings are pooled using Pool(Ti:j), which may implement mean; max; Or gated self-attention over the selected token embeddings.", "type": "mixed", "id": {"id": "deed1089-1663-42c7-98a9-94571ad9b05a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a clear combination of both technical terms and mathematical notation, which can help the model learn span segmentation for diverse content types. / The text segment contains a clear combination of both technical terms and mathematical expressions, which are essential for learning span composition in the context of NLP tasks involving programming languages or computational linguistics. / The segment contains a mix of technical terms and mathematical expressions that can be segmented into meaningful spans, such as \"Span ~embeddings\", \"Pool(Ti:j)\", which are clear constructs in the context of machine learning or natural language processing tasks. It is clean for training purposes with valuable patterns like function calls (e.g., Pool) combined with variable names indicating a mixed content type that includes both code and mathematical notation, beneficial to span-aware models. / The segment contains a mix of technical terms and mathematical notations that can be segmented into meaningful spans for training purposes, such as \"Span ~embeddings\", \"Pool(Ti:j)\", etc., which are indicative patterns in span composition within both natural language descriptions (like 'may implement mean; max') and code-like expressions ('Or gated self-attention over the selected token embeddings'). / The text segment contains a mix of technical terms and mathematical expressions that can be segmented into meaningful spans, such as \"Span ~embeddings\", \"Pool(Ti:j)\", which are clear in their structure for training purposes. 
It represents valuable patterns with both natural language descriptions (\"may implement mean; max\") alongside code-like constructs (e.g., function calls or variable names)."}}
 {"raw": "These routing diagnostics provide evidence that X-Spanformer gradually shifts from high-entropy; overlapping routing to sparse; high-confidence span representations. This aligns with latent atten- tion sparsification in architectures such as MoE Transformers [43], Routing Transformers [51], and mixture-of-expert decoders [28].", "type": "natural", "id": {"id": "97d252f8-4611-4742-bd75-4377715efc5f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear, coherent sentence with identifiable phrases and concepts relevant for learning span segmentation in NLP context. / Clear sentence structure with identifiable spans; represents complex linguistic patterns useful for training a span-aware model in the context of language understanding and processing. / Clear sentence structure with identifiable spans; useful for learning span composition in text. / The segment contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"routing diagnostics,\" \"X-Spanformer,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code documentation or discussions. / Clear sentence structure with meaningful phrases; demonstrates complex linguistic patterns useful for span segmentation in NLP tasks."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\n3\nExtended Ablation Settings\nFusion head variants: Compared MLP (w) VS. LayerNorm(MLP) for @k scoring: Gated units improved stability in low-entropy regimes. Routing depth: Explored controller depth dc € {1,2,3}; performance plateaued beyond dc = 2. Gradient gating: Evaluated freezing   fe for first 5 epochs to encourage stable Lent decay: Marginal performance trade-off observed.", "type": "mixed", "id": {"id": "277909e5-205e-485f-a58f-de56120cf377"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"Fusion head variants\", etc., which are useful for learning span composition in both natural language context and code-like structures. / The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"Fusion head variants\", etc., which are valuable for learning span composition in both code-like structures (e.g., variable names) and natural language descriptions. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"@k scoring\", etc., representing valuable patterns for learning span composition in both natural language text and code context. / The segment combines technical terms and concepts from both programming (e.g., \"MLP\", \"@k scoring\") with domain-specific jargon (\"Extended Ablation Settings\"). It contains clear, structured elements that can be segmented into meaningful spans for learning span composition in a tokenizer-free context. 
/ The segment contains a mix of technical terms, acronyms (X-SPANFORMER), and mathematical expressions that are relevant to the domain; it shows clear structure with identifiable spans such as phrases describing model variants or settings which can be useful for learning span segmentation in both natural language context."}}
 {"raw": "( ( ((L + K) ) Figure 2: Modular runtime decomposition of X-Spanformer's forward pass. Span enumeration and scoring are subquadratic in sequence length L, while span encoding scales linearly with the number of retained spans K Joint contextualization with self-attention dominates the total cost at O((L + K)2). Training X-Spanformer is trained end-to-end to jointly learn a span scoring function fe RLxd Rlsi and an integration mechanism for incorporating selected spans into the backbone transformer. Given an", "type": "mixed", "id": {"id": "4139d1f3-30f6-42aa-a26d-83d710a08356"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of mathematical expressions and explanatory text, which can help the model learn span segmentation in both structured (code-like) elements like equations or function names (\"L\", \"K\"), as well as natural language descriptions that provide context for understanding these spans. / The segment contains a mix of mathematical expressions and descriptive text, with clear spans for both equations (e.g., \"(L + K)\") and explanatory phrases (\"Figure 2\", \"Span enumeration\"). It is well-formed but lacks coherence due to the abrupt ending. / The segment contains a mix of mathematical notation and textual descriptions, which can help the model learn span segmentation in both structured (code-like) expressions as well as unstructured text. It is coherent but lacks context for full comprehension; however, its structural elements are clear enough to be useful training data. / The segment contains a mixture of mathematical notation and descriptive text, which provides clear spans for both numerical expressions (e.g., L + K) and textual descriptions that are relevant to the X-Spanformer model's context. 
It is clean but lacks coherence due to its abrupt ending; however, it still offers valuable patterns in span segmentation between code constructs/mathematical notation and natural language explanations. / The segment contains a mixture of mathematical expressions and descriptive text, with clear delimiters for spans like equations (L + K) Figure 2:, which can be useful in learning span segmentation between code constructs and natural language descriptions. However, the presence of incomplete sentences may slightly reduce its utility as training data."}}
 {"raw": "set $S$, and let $H(P_t)$ denote the entropy of the learned span distribution at epoch $t$. Under a fixed entropy annealing schedule $\\lambda_{\\mathrm{ent}}(t) = \\lambda_0 e^{-\\gamma t}$ with $\\lambda_0, \\gamma > 0$, and assuming entropy-dominated gradient flow during early routing, the following upper bound holds: $H(P_t) < H_{\\max} \\cdot e^{-\\gamma t}$", "type": "mixed", "id": {"id": "5c714567-079c-48fd-9281-276cda5853ad"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning span composition in both domains. It is clean but may require domain-specific preprocessing to fully utilize its training potential. / Clear mathematical expressions and structured equations suitable for learning span segmentation in programming contexts. / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning span composition in both domains. It is clean but lacks context or explanation that could improve its training utility. / The segment contains a mix of mathematical notation and programming-like expressions, which are structurally clear for span segmentation; it represents valuable patterns in both domains (natural language with technical terms) and is clean enough to be used as training data. / The segment contains a mix of mathematical notation and programming-like expressions, with clear structured elements that can be segmented into meaningful spans for learning purposes. It is clean but may require domain-specific knowledge to fully understand the context (entropy in machine learning)."}}
 {"raw": "Span type probing:  Used auxiliary decoders (e-g-, NER, chunking) as structural supervision for P gold in Equation (2?). Slight gains in low-resource settings.", "type": "mixed", "id": {"id": "ecc071d5-672d-4661-8ce8-c133acf22840"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains both technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a mixture of language contexts. / Contains both technical terms and a question mark, indicating potential for span segmentation; however, lacks clarity in context which may hinder learning. / Contains both technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a context involving programming concepts (auxiliary decoders) and linguistic elements (\"Span type probing\"). / The text contains a mix of technical terms and symbols that can be segmented into meaningful spans, such as \"Span type probing,\" \"auxiliary decoders (e-g-, NER, chunking),\" which are useful for learning span composition in both natural language processing tasks. / Contains both technical terms and phrases that can be segmented into meaningful spans for learning, with clear structure suitable as training data."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nProof: We begin by recalling that during early training, the span logits Wk are updated primarily by the entropy term: 8Lfinal Aent (t) Wk H(Pt),; dw= with entropy defined over softmax-normalized span probabilities: IS1 exp(wp H(Pt) = - @k log \" @k where @k k=1 Cj exp(w} The entropy gradient with respect to logits is: OH Q (Iogak) +1) . dwk Logit descent then yields: (t+1)", "type": "mixed", "id": {"id": "b5bc3fcb-363b-40c7-88f3-bd32d809bf3a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of technical terms and mathematical expressions, which can be segmented into meaningful spans for learning purposes; however, it lacks clarity due to the presence of symbols that may not translate well in training data without proper context or formatting. / The segment contains a mix of technical terms and mathematical expressions, which are clear structures that can be segmented into meaningful spans for training purposes in both coding contexts (like function names) and natural language descriptions (\"Proof\", \"begin by recalling\"). / The text contains a mixture of technical terms, mathematical expressions (entropy term), and programming-like notation that can be segmented into meaningful spans for training purposes; however, it lacks clarity due to the presence of symbols not typically found in natural language or standard code syntax. / The segment contains a mixture of technical terms and mathematical expressions, which can be segmented into meaningful spans for learning purposes; however, the presence of special characters like \"SP\" may require additional preprocessing to ensure clarity in training data. 
/ The segment contains structured programming elements like function names, variables (Wk), and mathematical expressions that can be segmented into meaningful spans for a span-aware model to learn from. It is clean but may require preprocessing due to the presence of LaTeX-like notation which could confuse non-technical models or human readers without proper context on interpreting such notations in code documentation."}}
 {"raw": "Span pooling alternatives: Replaced gated attention with mean/max pooling for spans; gated attention retained higher semantic alignment (measured by cosine with target label embeddings). References\n[1] Rico Sennrich, Barry Haddow, and Alexandra Birch. \"Neural Machine Translation of Rare Words with Subword Units\". In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics  (Volume 1: Long Papers).", "type": "mixed", "id": {"id": "f4d801d1-5c4d-4239-ac30-80007704bd74"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical language and references, which can help the model learn span segmentation in both contexts; however, it lacks clear examples for direct training due to its summary nature. / The segment contains a mix of citation and descriptive text, with clear references that can be segmented into meaningful spans for training purposes; however, it lacks direct examples or patterns to learn from directly. / The segment contains a mix of references and technical descriptions, with clear mentions that can be segmented into spans such as \"Span pooling alternatives,\" \"[1],\" authors' names (\"Rico Sennrich, Barry Haddow, Alexandra Birch\"), conference details (54th Annual Meeting), which are useful for learning span segmentation. / Contains a mix of technical terms, references to academic work (natural language), and mentions programming concepts like \"gated attention\" which could be useful for span segmentation in both domains. However, the lack of explicit code or natural sentences reduces its clarity as training data. 
/ The segment contains a mixture of technical terms, references to academic work (natural language), and mentions specific methods in neural machine translation that involve programming concepts or configurations (\"span pooling alternatives\"). It has clear structure with identifiable spans like \"Replaced gated attention\" which can be useful for learning span segmentation."}}
 {"raw": "The pipeline comprises the following stages:\nSpan induction with entropy-regularized scoring: selects meaningful spans via a differ- entiable scoring function augmented with entropy-based exploration [15, 45, 46]. Interpolation-weighted fusion of pooled span encodings: computes an attention-based summary vector $ from the top-ranked span embeddings, inspired by modular controller rout- ing and compositional bottlenecks [43, 28].\n12\ne1,", "type": "mixed", "id": {"id": "a4d13027-b4c8-4f0e-b1ca-514e9ec5e21b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains both structured programming concepts and explanatory text, with clear spanable phrases like \"entropy-regularized scoring\" and identifiable components such as stage descriptions in a pipeline context. / The segment contains a mixture of technical descriptions and references to academic work, which can help the model learn span segmentation in both structured (code-like) expressions as well as more fluid natural language text with citations for context. / Contains both structured programming concepts and descriptive text, with clear spanable elements like function names (\"entropy-regularized scoring\"), references ([15, 45, 46]), variables ($), lists (e1,e2,...). Well-formed for training purposes; represents valuable patterns in code documentation. / The segment contains a mixture of technical descriptions and references to equations, which can help the model learn span segmentation for both prose (natural language) elements like \"The pipeline comprises\" as well as code-like structures such as \"[15, 45, 46]. 
Interpolation-weighted fusion.\" / The segment contains a mix of descriptive text and technical terms related to machine learning, with clear references that can be segmented into meaningful spans for training purposes. It is clean but lacks contextual coherence as it abruptly ends mid-sentence (\"e1,\")."}}
 {"raw": "Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1715-1725. DOI: 10.18653/v1/P16-1162.", "type": "natural", "id": {"id": "06840e61-10d9-4fc8-a41e-1822139dbefc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear citation structure with identifiable spans (location, year, pages) suitable for learning span segmentation in academic contexts. / Clear citation structure with identifiable spans for author, title, year, and DOI. Suitable pattern recognition training example. / Clear citation structure with identifiable spans for authorship, publication details, and DOI. Suitable pattern recognition training. / Clear citation structure with identifiable spans for author, title, year, and DOI. Well-suited to learn span segmentation in academic contexts. / Clear citation structure with identifiable spans (location, year, pages) suitable for learning span segmentation in academic contexts."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nController-aware injection into the encoder backbone: conditions the transformer via prefix insertion, attention shifts, or gating pathways [30, 47, 37].\nAll stages are fully differentiable and trained jointly from supervision signals [7, 48].", "type": "mixed", "id": {"id": "7ccce75e-3b77-4471-8e62-4c118078ee19"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spans of phrases and technical terms; well-formed for training purposes with valuable patterns in span composition. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"[30, 47, 37],\" etc., which are useful for learning span composition in both code-like structures (e.g., references to papers) and natural language descriptions. / Clear spans of phrases and technical terms, representing both language structure (natural) and programming concepts/code-like constructs (code). Well-formed for training purposes with joint learning signals mentioned in the context. / Clear spans for both technical terms and phrases; well-representative of domain-specific language mixing. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"[30, 47, 37]\", which are likely to represent code or configuration elements; it is clean for training purposes."}}
 {"raw": "05101. [66] John Hewitt and Christopher D. Manning. \"A Structural Probe for Finding Syntax in Word Representations\". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 4129-4138.", "type": "mixed", "id": {"id": "66c77765-db26-43f9-a1f0-c75fbf1b100d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a citation with structured elements (authors, title) and publication details that can be segmented into meaningful spans for training purposes. The content is clean but lacks context to fully understand the span composition in natural language processing tasks. / Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for training purposes. The mixture of text and reference format is valuable in learning span composition across different contexts. / Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for learning span composition in both academic and technical contexts. / The segment contains a citation with structured elements (authors, title, conference details) that can be segmented into meaningful spans for learning span composition in both academic and programming contexts. / The segment contains a citation with structured elements (authors, title, conference details) that can be segmented into meaningful spans for learning span composition in both academic and technical contexts."}}
 {"raw": "4.1 Span Induction with Entropic Regularization To identify compositional units latent in unstructured sequences, we treat all bounded-width sub- strings a8 candidate spans and learn a scoring function to assign each a salience probability: This differentiable selection mechanism is trained jointly with downstream objectives but regularized to maintain entropy-driven exploration early in training: Inspired by principles from latent structure modeling [45, 15] and soft routing frameworks [43], our span", "type": "mixed", "id": {"id": "e4cfb9b7-0582-480b-8500-4ef9dfba27e7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured phrases that can be segmented into meaningful spans, such as \"Span Induction with Entropic Regularization,\" which is indicative for learning span composition in both natural language processing (NLP) tasks like summarization or question answering. / Contains a mixture of technical terms and structured phrases suitable for span segmentation; demonstrates clear compositional patterns in both language context (e.g., \"latent units,\" \"downstream objectives\") and code-like constructs (\"a8 candidate spans\"). / The text segment contains a mix of technical terms and structured elements like numbered sections, which can help the model learn span segmentation in both coding contexts (e.g., \"a8 candidate spans\") and natural language descriptions (\"entropic regularization\"). / The text contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Induction,\" \"entropic regularization,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code documentation. 
/ Clear spans identified; combines structured language with technical terms and concepts relevant for span segmentation learning in a mixture of contexts."}}
 {"raw": "induction stage maps an input sequence x € RLxd to a distribution P over all candidate spans S, followed by a sampling Or top-K filtering step that informs structural fusion. Let D = {(2() ,y())}I denote the training corpus, where each input 2() € RLxd consists of L contextual embeddings: We define the set of all contiguous spans of width at most Umax as: S = {(i,j) |0 < i < j < min(i + Umax , L)} Each span is encoded using & fixed pooling operator and scored by a function fo(wi:j) e R.", "type": "mixed", "id": {"id": "dbfe86cf-2fa8-4c7f-aa53-da2227726363"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and programming-like expressions, which are structurally clear for span segmentation; it is clean but may require domain-specific knowledge to fully understand the context (training utility). / The segment contains a mixture of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. It is structurally clear with identifiable spans such as equations or code constructs embedded within natural language explanations. / Clear span definitions and structured patterns for both language (e.g., \"induction stage\", \"input sequence\") and mathematical notation (\"P over all candidate spans S\"). Well-formed with a mix of natural language explanations, code-like expressions, and formal notations. / The segment contains a mix of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. It is structurally clear with identifiable spans like sequences (input x), sets ((2(), y())), functions (fo(wi:j)) etc., making it valuable for training purposes that require understanding mixed content types. 
/ The segment contains a clear mix of structured programming concepts and mathematical notation, with identifiable spans such as \"input sequence x\", \"(i,j)\", etc., which are suitable for learning span segmentation in both code-like structures (e.g., loops) and natural language descriptions."}}
 {"raw": "(i,j) Suppose the training objective is Lt = Ltask + Aent (t) . H(Pt); with Aent (t)", "type": "mixed", "id": {"id": "9fccb546-c874-45c5-a491-3bba87706bbc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation, variables (i,j), and text explaining an equation's context; it has clear structure for span segmentation with identifiable patterns in both math expressions and natural language descriptions. / The segment contains a mix of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. However, it lacks clarity due to missing context or explanation for some terms (e.g., \"Lt\", \"Ltask\"). / The segment contains a mix of mathematical notation, variables (i,j), and text explaining an equation's context; it has clear span boundaries for both the formula part and explanatory parts that can aid in learning complex patterns involving natural language descriptions alongside code-like expressions. / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans such as variable names (i,j), function calls or references like Lt, Ltask, Aent(t), H(Pt). It is clean but lacks context for full comprehension. / The segment contains a mixture of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning purposes; however, it lacks clarity due to the presence of an incomplete equation symbolized by \"(t)\". This could pose challenges in training but still offers valuable patterns."}}
 {"raw": "The entropy coefficient decays exponentially:\n= Ao exp( _yt) ,\nwhere t is the training epoch; Ao the initial weight, and ~ 0 a decay rate: This annealing schedule mirrors techniques from curriculum learning [7 , 24]. Proposition 5 (Maximum Entropy of Uniform Span Distribution). Let S denote the set of valid spans defined in Equation (4), with   cardinality |S| = N .", "type": "mixed", "id": {"id": "24963acd-a9c3-4cba-9767-1b91a9bff393"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and prose, with clear structure for span segmentation; it includes equations that represent learnable patterns in both numerical expressions (code-like) and natural language descriptions (\"natural\"). It is coherent but may require domain-specific knowledge to fully understand the context. / The segment contains a mix of mathematical expressions and prose, with clear structure for span segmentation; it includes both equations (code-like) and explanatory text that is coherent in the context of machine learning discussions on training schedules. / The segment contains a mixture of mathematical notation and prose, with clear structure for span segmentation; it includes equations which are valuable patterns to learn in an encoder that handles both types of content. / The segment contains a mix of mathematical notation and prose, with clear delineation between equations (spans) like \"Ao exp( _yt)\" that can be segmented meaningfully for training purposes; it also includes structured statements about learning techniques which are coherent in the context. / Clear span segmentation with mathematical expressions and structured text; represents a mix of programming concepts (entropy coefficient, decay rate) in an educational context."}}
 {"raw": "H(Pt) is Lipschitz-continu0us,\n(ii) Gradient steps use 0 bounded step size n > 0,\n(iii) The task gradient is negligible: w(t) Ltask ~ 0 during spam routing:\nThen entropy decays exponentially: H(Pt) < H(Po) . evt= Vt 2 0.", "type": "mixed", "id": {"id": "2ad14e2b-0adc-4d09-932b-c67313fc57e7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and prose, with clear structured elements like equations (H(Pt) < H(Po)) that can be segmented into meaningful spans for learning purposes. Despite some typos (\"conti0us\" instead of \"continuous\"), it retains its compositional value as an example combining natural language explanation within the context of a mathematical or programming concept, which is valuable in mixed content training data. / The segment contains a mix of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning span composition in both domains. However, the presence of typographical errors (e.g., \"Lipschitz-continu0us\" instead of \"Lipschitz-continuous\") may affect clarity slightly but does not significantly detract from its overall utility as mixed content training data. / The segment contains a mix of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning purposes. It is clean but lacks context which might affect its utility as standalone training data. / The segment contains a mix of mathematical expressions, variable names (natural language), and structured formatting that can be segmented into meaningful spans for learning span composition in both domains. 
/ The segment contains a mixture of mathematical expressions and text, which can help the model learn to identify spans that include both numerical values (like H(Pt)) as well as textual descriptions ((i), (ii)). However, there are some typos like \"0 bounded step size\" instead of \"bounded step size,\" but it is still coherent."}}
 {"raw": "Pointer Sentinel Mixture Models. 2016. DOI: 10.48550/arXiv.1609.07843.", "type": "mixed", "id": {"id": "81dec14f-ef41-45a5-888a-57827ad679af"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a citation with structured elements (title, year, DOI) that can be segmented into meaningful spans for training purposes; however, it lacks context and coherence as an isolated example. / The segment contains a DOI reference which is common across both scientific literature and online resources, indicating it has elements of structured data (code-like) as well as unstructured text (\"natural\"). This combination can help the model learn to recognize span segmentation in diverse contexts. / The segment contains a citation with structured elements (title, year, DOI) that can be segmented into meaningful spans for training purposes; however, it lacks context and coherence as an isolated example. / The segment contains a citation with structured elements (authors, year, DOI) that can be segmented into meaningful spans for training purposes; however, it lacks context and may not fully represent the target domain's patterns due to its brevity. / The segment contains a DOI reference, which is common in academic publications and includes both structured (DOI) and unstructured elements (text). It represents valuable patterns for learning span composition as it combines citation formatting with natural language text about scholarly work."}}
 {"raw": "The entropy H(P) of any valid span distribution P, as8 defined im Equation (17) , is maximized when:\n1 Pij = for all (i,j) € S: N\n'8)\nThis yields:\nHmax  (P) = log/S1 = log N.\n(9)\n13\nPij\nAent !", "type": "mixed", "id": {"id": "3a5e1f9c-b653-4c6f-b94c-0ef30ed8c7a1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The text contains a mix of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning span composition in both domains. Despite some typos (\"im\" instead of \"in\", missing punctuation), it is relatively clean but retains valuable patterns across natural language descriptions interspersed within code-like expressions (equations). / The segment contains a mix of mathematical notation and prose, with clear delimiters for spans like equations (17) and expressions such as H(P). It is coherent but lacks context which might affect its utility in training an encoder to understand span composition within both code-like structures. / The segment contains a mix of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning span composition in both domains. / The segment contains both mathematical notation and prose, which can help the model learn to handle span segmentation in a context that includes equations alongside explanatory text. However, some symbols are misinterpreted (e.g., '€' instead of '$'). / Contains both structured mathematical expressions and informal text, providing a diverse range of spans for learning."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nProof: We seek to maximize:\n$H(P) = -\\sum_{(i,j) \\in S} P_{ij} \\log P_{ij}$\n(10)\nsubject to:\n$\\sum_{(i,j) \\in S} P_{ij} = 1$ and $P_{ij} \\geq 0$.", "type": "mixed", "id": {"id": "4f176989-148c-43c0-8427-3ac5e8b09c55"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and mathematical expressions, which can help the model learn span segmentation in both domains. It is clean but lacks context for full comprehension. Adding more examples with varied contexts could improve its utility as training data. / Clear mix of mathematical notation and programming-like expressions, with identifiable spans for both equations (H(P)) and constraints on probabilities (Pij). Represents valuable patterns in span composition across different domains. / The segment contains a mix of programming-like notation and mathematical expressions, which can help the model learn span segmentation in both structured (code) and unstructured contexts. It is clean but lacks context for natural language understanding; however, it still offers valuable patterns from its code-mathematical hybrid structure. / The segment contains a mixture of technical terms and mathematical expressions, which are clear structures suitable for span segmentation in both domains. It is clean but lacks context to fully understand the proof conceptually; however, it represents valuable patterns combining natural language with code-like elements (mathematical notation). / The segment contains a mix of technical terms and mathematical expressions, providing clear examples for span segmentation in both domains. It is coherent with structured patterns useful to X-Spanformer training."}}
 {"raw": "Assume $\\|\\nabla H(P_t)\\|^2 > c\\,H(P_t)$ for some constant $c > 0$, yielding:\n$H(P_{t+1}) < H(P_t) \\cdot (1 - \\eta c \\lambda_0 e^{-\\gamma t})$.", "type": "mixed", "id": {"id": "c41227c0-0fb1-433e-88b9-756828c3f142"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning purposes; however, the lack of context may affect its utility as training data. / Contains both mathematical expressions and a pseudo-code-like structure, which can help the model learn span segmentation for different types of content. / Contains both mathematical expressions and programming-like notation, which can help the model learn span segmentation in a diverse context. / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning purposes. It is clean but lacks context or explanation that could improve its training utility. / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for training purposes; however, the unusual combination may require careful handling during preprocessing."}}
 {"raw": "Language Models are Unsupervised Multitask Learners. OpenAI Technical Report.", "type": "natural", "id": {"id": "bfb7561b-60a2-4164-b30a-9d6eaca50ec9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear, coherent prose with identifiable phrases suitable for unsupervised learning of span segmentation in a language context. / Clear prose with identifiable spans; well-suited for learning span composition in unsupervised multitask settings. / The segment contains a clear structure with meaningful spans such as \"Language Models,\" \"Unsupervised Multitask Learners,\" and references to an OpenAI Technical Report, which are valuable for learning span composition in the context of language processing tasks. / Clear, coherent prose with identifiable spans; suitable for learning span composition in unsupervised multitask settings. / Clear prose with identifiable spans; well-suited for learning span composition in unsupervised multitask settings."}}
 {"raw": "Since e-Yt 0, the bound becomes\nH(Pt) < H(Po) . eYt , for some ~ < %,\nas claimed.", "type": "mixed", "id": {"id": "e9b602dc-4177-429c-ba27-6cb994e42b20"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of mathematical notation and prose, with clear boundaries for span segmentation between variables (e.g., \"H(Pt)\", \"H(Po)\"), constants (\"~\", \"%\"), and phrases that can be identified as meaningful spans suitable for training. / The segment contains a mix of mathematical notation and English prose, with clear boundaries for spans such as \"e-Yt\", \"H(Pt)\", etc., which are meaningful in both contexts; it is clean but lacks context to fully understand the claim being made. / The segment contains a mix of mathematical notation and prose, with clear boundaries for span segmentation between the formulae (H(Pt), H(Po)) and surrounding text (\"Since e-Yt\", \"the bound becomes\"). It is clean but lacks context which may affect its utility. / The segment contains a mixture of mathematical notation and prose, with clear boundaries for spans such as \"e-Yt\", \"H(Pt)\", and the inequality symbol \"<\". It is clean but lacks context which could be beneficial to fully understand span segmentation in both natural language and code. / The segment contains both mathematical notation and English text, which can help the model learn to handle span segmentation in a context that includes numerical expressions alongside prose."}}
 {"raw": "4.2 Span Induction with Entropic Regularization\nTo identify compositional units latent in unstructured sequences, we treat all bounded-width substrings as candidate spans and learn a scoring function to assign each a salience probability. This differentiable selection mechanism is trained jointly with downstream objectives but regularized to maintain entropy-driven exploration early in training. Inspired by principles from latent structure induction", "type": "mixed", "id": {"id": "0d3dd519-f2bb-4c39-a4c6-e97483dc48ae"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of numerical values, phrases indicating technical concepts (\"Span Induction with Entropic Regularization\", \"latent units\"), and references to programming constructs (e.g., sub-strings). It has clear structure for learning span segmentation in both natural language context and code-related terminology. / The segment contains a mix of numerical values, technical terms (\"Span Induction\", \"Entropic Regularization\"), and structured formatting (headers), which can help the model learn span segmentation in both natural language contexts as well as code-like structures. / The segment contains a mixture of numerical values, phrases describing an algorithmic concept (\"Span Induction with Entropic Regularization\"), and technical jargon that can be segmented into meaningful spans for learning purposes in both natural language processing (NLP) tasks related to code understanding or documentation parsing. 
/ The segment contains a mix of numerical values, phrases describing an algorithmic concept (\"Span Induction with Entropic Regularization\"), and technical terms that can be segmented into meaningful spans for training purposes (e.g., \"4.2\", \"candidate spans\", \"salience probability\"). It is clean but lacks contextual coherence as it appears to come from a larger document or code snippet, which may affect its utility in isolation. / The segment contains a mix of numerical values, phrases describing an algorithmic concept (\"Span Induction with Entropic Regularization\"), and technical jargon that could help the model learn span segmentation in both natural language context as well as code-like constructs (e.g., \"sub-strings\", \"salience probability\"). It is clean but lacks explicit boundaries for spans."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER 5.3 Controller Fusion Diagnostics To evaluate the semantic precision and interpretability of controller integration, we analyze three distinct injection mechanisms: (1) prefix token interpolation, (2) additive attention biasing, and (3) gated residual modulation. Each scheme receives identical controller input $\\tilde{s}$, formed via: $$\\tilde{s} = \\sum_{k=1}^{K} \\alpha_k s_k, \\qquad \\alpha_k = \\frac{\\exp(w_k)}{\\sum_{\\ell=1}^{K} \\exp(w_\\ell)}.$$ Let $F_m(x, \\tilde{s})$ denote the model with injection mode $m \\in \\{\\text{prefix}, \\text{bias}, \\text{gate}\\}$. For fixed input $x$, we study the", "type": "mixed", "id": {"id": "3d2b0480-ff9d-4331-acf3-09a51c8f1649"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured descriptions that can be segmented into meaningful spans, such as \"X-SPANFORMER SPAN-AwARE ENCODER\", injection mechanisms (prefix token interpolation, additive attention biasing), model notation (\"Fm(s)\" for different modes) which are valuable patterns. It is clean but lacks context to fully understand the domain-specific content; however, it still offers a good starting point due to its structured nature and mixed elements of code-like syntax with natural language explanations. / The segment contains a mix of technical language and structured formatting, with clear delineation between different injection mechanisms for controller fusion diagnostics; it represents valuable patterns in both the structure (spans) as well as content type (\"code\" due to programming-like syntax). / The segment contains a mix of technical terminology and structured phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER SPAN-AwARE ENCODER\", injection mechanisms (prefix token interpolation), attention biasing methods, etc., which are valuable for learning span composition. 
/ The segment contains a mix of technical terms and structured descriptions that can be segmented into meaningful spans, such as \"X-SPANFORMER SPAN-AwARE ENCODER\", injection mechanisms (prefix token interpolation, additive attention biasing), model notation (\"Fm(s)\"), which are valuable for learning span composition in mixed content. / The segment contains a mix of technical terms and structured expressions that can be segmented into meaningful spans, such as \"X-SPANFORMER SPAN-AwARE ENCODER\", injection mechanisms (prefix token interpolation, additive attention biasing), model notation (\"Fm(s)\"), which are valuable for learning span composition in mixed content."}}
 {"raw": "perturbation and propagation effects caused by controller fusion: Injection Influence We define influence magnitude as the L2 norm of the difference in output logits between the controller-injected and controller-ablated models: (m) (1) = IFn(w,5) _ FmC (w,0)l2 This is computed layerwise to identify zones of concentrated influence and injection saturation Stronger deviations at higher layers imply delayed controller fusion, whereas front-loaded shifts suggest syntactic modulation: PREFIX GATING ATTENTION", "type": "mixed", "id": {"id": "e9ca73ec-3a2f-4d0f-8451-e52f87fceff5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical expressions, programming concepts (like L2 norm), and domain-specific terminology (\"controller-injected\", \"layerwise\"). It has clear structured elements that can be segmented into meaningful spans for learning span composition in both code-like constructs and natural language descriptions. / The segment contains a mix of technical terms and mathematical expressions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both programming contexts (code) and explanatory text (natural language). It is clean but may require domain-specific knowledge to fully interpret. / The segment contains a mix of technical terms and mathematical expressions, with clear delineation between different concepts like influence magnitude calculation (natural language) and gating attention mechanisms in neural networks (code). It shows structured patterns useful for learning span segmentation across both domains. / The segment contains a mix of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. 
It is clean but may require additional context for full comprehension due to domain-specific terms like \"L2 norm\" or \"controller-injected.\" / The segment contains a mix of technical terms and mathematical expressions, which can help the model learn span segmentation in both domains. It is clean but may require additional context for full comprehension due to domain-specific language."}}
 {"raw": "fixed-length representation Bi:j, scored via a feed-forward function fo, and normalized using a softmax across all candidates: exp( fe(Ti:j))", "type": "mixed", "id": {"id": "5094d580-1831-4206-a6b0-cf7dcdff6c9e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear structure with identifiable spans (Bi:j, fo(Ti:j), exp( fe(Ti:j))) representing a mathematical expression in programming context. Well-formed and relevant for learning span composition within mixed content types. / Clear structured elements like function names, variables (Bi,j), and mathematical expressions; represents valuable patterns for learning span composition in programming context. / The segment contains a mix of mathematical notation and programming-like expressions that can be segmented into meaningful spans, such as \"fixed-length representation,\" \"Bi:j,\" etc., which are useful for learning span composition in both natural language processing (NLP) tasks related to code understanding or symbolic mathematics. / The segment contains a mix of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. It is structurally clear with identifiable spans like \"Bi:j\", \"fo\", and \"exp( fe(Ti:j))\". / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning span composition in both domains. It is clean but lacks context that could further improve its utility as training data."}}
 {"raw": "[12] Yi Liao, Xin Jiang, and Qun Liu. \"Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order\". In: Proceedings of ACL 2020. 2020, pp.", "type": "mixed", "id": {"id": "398b5f6e-3a2f-4496-a2ce-48642a6a267c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of citation elements (authors, title) and bibliographic formatting which can be segmented into meaningful spans for learning purposes; however, it lacks context or content that would make the training data more representative on its own. / Contains a mix of citation elements (authors, title) and structured references to conference proceedings which can be segmented into meaningful spans for learning purposes. The text is clean but lacks context that could improve its utility as training data. / Contains both citation structure and bibliographic reference, which can help the model learn span segmentation for academic contexts. / The segment contains a mix of citation elements (authors, title) and bibliographic formatting that could be useful for learning span segmentation in both academic writing contexts and structured data formats like citations or references. / The segment contains a mixture of citation elements (authors, title) and bibliographic references which can be segmented into meaningful spans for training purposes; however, it lacks context or content that could further enhance its utility as an example."}}
 {"raw": "separable downstream representations: Gated Probe Interventions. Following the probing methodology in [71], we optionally perform controller swap experiments: Scontent Sconfound , while keeping x fixed.", "type": "mixed", "id": {"id": "d5b97eb4-bf82-4922-b017-bf42177595a0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"separable downstream representations,\" which is likely to represent valuable patterns for learning span composition in both natural language processing tasks related to code understanding or documentation about programming concepts. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"separable downstream representations,\" which is indicative of the content's domain (likely machine learning or data science), making it valuable for training an X-Spanformer. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"separable downstream representations,\" which could help the model learn span composition in both natural language context (like academic papers) and code-like structures (\"Gated Probe Interventions\"). / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"separable downstream representations,\" \"Gated Probe Interventions,\" which are likely to represent valuable patterns for learning span composition in both natural language processing (NLP) tasks related to code understanding. 
/ The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"separable downstream representations,\" which is indicative for span-aware models to learn from the context provided by both natural language descriptions (\"Following the probing methodology in [71]\") and code-like expressions (e.g., Scontent Sconfound). The content represents a valuable pattern of mixed domain knowledge."}}
 {"raw": "263-274. [13] Ray Jackendoff. X-bar Syntax: A Study of Phrase Structure. Linguistic Inquiry Monograph 2.", "type": "natural", "id": {"id": "1deefec6-1579-4c3f-8a5f-17adc7ad6755"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear citation structure with identifiable spans for author, title, and publication details; clean format suitable as training data. / Clear citation structure with identifiable spans for author, title, and publication details; well-suited to learn span segmentation in academic contexts. / Clear citation structure with identifiable spans (author, title, publication details). Well-suited for learning span segmentation in academic texts. / Contains a citation with structured elements (author, title) and numerical range which can be segmented into meaningful spans for training purposes. / Clear reference to a scholarly work with identifiable spans (author, title, publication details). Well-formed and clean for training purposes; represents valuable patterns in academic citation structure."}}
 {"raw": "Cambridge, MA: MIT Press, 1977. ISBN: 9780262600095. [14] Mathias Creutz and Krista Lagus: Unsupervised Morpheme Segmentation   and Morphology Induction from Text Corpora  Using Morfessor 1.0.", "type": "mixed", "id": {"id": "2ac18804-5047-466e-8ca7-1dbcc4b2755e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a mix of structured elements (publication details, author names) with unstructured text (\"Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0\"), which can help the model learn to segment spans in both contexts. / Contains both structured bibliographic information and a citation of academic work, representing valuable patterns for learning span composition in the context of scholarly texts. / Contains both structured elements (book details) and unstructured text, useful for learning span segmentation in a diverse context. / Contains both structured bibliographic data and a citation of academic work, which can help the model learn span segmentation in diverse contexts. / The segment contains a mix of bibliographic information and academic references, with clear delimiters for different spans (e.g., author names separated by colons). It represents valuable patterns in span segmentation that can be useful to the model's learning process."}}
 {"raw": "$\\lambda_{\\text{ent}}$ is annealed exponentially:\n$$\\lambda_{\\text{ent}}(t) = \\lambda_0 \\exp(-\\gamma t), \\qquad (18)$$\nwhere $t$ is the training epoch, $\\lambda_0$ the initial coefficient, and $\\gamma > 0$ controls decay rate. This annealing scheme mirrors curriculum learning and gradual constraint tightening in latent modeling [54, 24]. Proposition 6 (Maximum Entropy of Uniform Span Distribution).", "type": "mixed", "id": {"id": "11443a78-052a-41ca-8ac2-6bbbecbf5c22"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains both mathematical expressions and explanatory text, showcasing a mix of structured elements suitable for learning span segmentation in diverse contexts. / The segment contains a mix of mathematical notation and explanatory text, which can help the model learn to identify spans related both to equations (code-like) and prose descriptions. However, it lacks clear delimiters for span segmentation in natural language contexts; thus some ambiguity remains that could be addressed with additional training data or preprocessing steps. / The segment contains both mathematical notation and explanatory text, which can help the model learn to recognize spans in a variety of contexts including equations (code-like) within prose (natural language). However, it lacks clarity due to potential OCR errors (\"Aent\" instead of \"Annealed\") and could benefit from cleaner formatting. / The segment contains a mix of mathematical notation and explanatory text, with clear spans for variables (e.g., Aent(t), Ao) that can be useful in learning span segmentation across both code-like expressions and natural language descriptions. / The segment contains a mix of mathematical notation and explanatory text, with clear span boundaries for both equations (e.g., \"Aent(t) = Ao exp( _yt)\") and descriptive phrases (\"annealed exponentially,\" etc.). 
It is cleanly structured to represent valuable patterns in learning how spans are composed."}}
 {"raw": "Let $S$ be the set of spans defined in Equation (4), with $|S| = N$. The entropy of the softmax span distribution $P$, as given in Equation (16), is maximized when:\n$$P_{ij} = \\frac{1}{N} \\quad \\text{for all } (i,j) \\in S. \\qquad (19)$$\nIn that case, the entropy attains its maximum value:\n$$H_{\\max}(P) = \\log |S| = \\log N. \\qquad (20)$$\nProof: We wish to maximize:\n$$H(P) = -\\sum_{(i,j) \\in S} P_{ij} \\log P_{ij}, \\qquad (21)$$\nsubject to the constraints:", "type": "mixed", "id": {"id": "f87def6d-4b5b-4bd2-b37a-958b31955557"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning span composition in both domains. It is clean but lacks context which might affect its utility as standalone training data; however, it represents valuable patterns across natural language explanations interspersed within code-like expressions (equations). / The segment contains a mixture of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning span composition in both domains. It is clean but lacks context which might affect its utility as standalone training data; however, it represents valuable patterns combining natural language explanations (equation references) and code-like expressions (mathematical notations). / The text segment contains a clear mixture of mathematical notation and prose, with identifiable spans such as equations (19) and (20), variable names like 'S' for set or span distribution Pij', which can be useful in learning the composition between natural language explanations and code-like expressions. 
/ The segment contains a mix of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning purposes. It is clean but lacks context which might affect its utility as standalone training data. / The segment contains a mix of mathematical notation and English prose, with clear structured elements like equations that can be segmented into meaningful spans for learning span composition in both domains."}}
 {"raw": "$$\\sum_{(i,j) \\in S} P_{ij} = 1, \\qquad P_{ij} \\geq 0. \\qquad (22)$$\nForm the Lagrangian:\n$$\\mathcal{L}(P, \\lambda) = -\\sum_{(i,j) \\in S} P_{ij} \\log P_{ij} + \\lambda \\left( \\sum_{(i,j) \\in S} P_{ij} - 1 \\right). \\qquad (23)$$\nThe first-order stationarity condition yields:\n$$\\frac{\\partial \\mathcal{L}}{\\partial P_{ij}} = -\\log P_{ij} - 1 + \\lambda = 0 \\quad \\Rightarrow \\quad P_{ij} = e^{\\lambda - 1}. \\qquad (24)$$", "type": "mixed", "id": {"id": "70db09c4-34e5-4903-8866-c18a555c3c91"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains both mathematical expressions and programming-like notation, which can be segmented into meaningful spans for a span-aware model to learn from; however, the mixture of content types may require careful handling during training. / The segment contains both mathematical expressions and programming-like notation, which can help the model learn to recognize spans in a variety of contexts including equations (natural language) and symbolic representations common in computational settings. / The segment contains a mix of mathematical expressions and equations with clear delimiters, which can be segmented into meaningful spans for learning span composition in both math notation (code-like) and natural language descriptions (\"Pij\", \"Lagrangian\"). / The segment contains a mix of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. However, it lacks clarity due to potential misinterpretation between natural language text (like \"Form\") versus code or math syntax (\"Pij = 1\"). / The segment contains a mix of mathematical notation and programming-like expressions, which can be segmented into meaningful spans for learning span composition in both domains. However, the presence of symbols like \"log\" without context may reduce clarity slightly but still retains compositional value."}}
 {"raw": "[17] Robin Strudel et al. \"Segmenter: Transformer for Semantic Segmentation\". In: arXiv preprint arXiv:2105.05633 (2021). Available at arXiv.", "type": "mixed", "id": {"id": "592d3ecc-0e75-46d4-8388-3aed21c03f70"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for learning purposes. The mix of academic referencing and arXiv link provides diverse patterns beneficial to span-aware models. / Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for learning purposes. / The segment contains a citation with structured elements (authors, title, publication details), which can be segmented into meaningful spans for training purposes; it also includes URLs and arXiv identifiers that are relevant to the domain of academic references in natural language text combined with code-like structures. / Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for learning composition in both academic and bibliographic contexts. / Contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for learning purposes. Mixed type due to the combination of reference formatting and natural language text."}}
 {"raw": "Since all $P_{ij}$ are equal and sum to 1, we conclude $P_{ij} = 1/N$. Substituting into Equation (21):\n$$H(P^*) = -N \\cdot \\frac{1}{N} \\log \\frac{1}{N} = \\log N. \\qquad (25)$$", "type": "mixed", "id": {"id": "1cdb629f-35c5-4755-be3c-464e2b0438be"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear mathematical expressions and equations with identifiable spans; well-suited for learning span segmentation in a programming context. / Clear mathematical expressions and equations with identifiable spans for learning; well-formed content combining both numerical values (code-like) and textual explanations (natural language). / Clear mathematical expressions and equations with identifiable spans, representing valuable patterns for learning span composition in both numerical notation (natural language) and symbolic representation (code). / Clear mathematical expressions and equations with identifiable spans for learning; clean, coherent representation of a formulaic concept in mathematics or programming. / Contains both mathematical expressions and equations, representing valuable patterns for learning span composition in a mix of numerical notation (natural language) and formalized math syntax (code)."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nRemark: Proposition 6 establishes the upper bound of entropy over span routing distributions. Early training with high Aent promotes structural exploration, while annealing enables convergence to sparse, high-salience spans. This   tradeoff between   uncertainty maximization and structural commitment parallels entropy-annealed models of parse induction [55] and marginal span recovery", "type": "mixed", "id": {"id": "eaf09fc6-6644-42a0-9222-d9b1e93e1093"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"SPAN-AwARE ENCODER,\" and specific concepts like \"entropy over span routing distributions.\" It is clean but somewhat dense with domain-specific jargon. / The text contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", \"entropy-annealed models\" etc., which are relevant for learning span composition. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"SPAN-AwARE ENCODER,\" etc., which are relevant to both natural language processing (NLP) tasks in code documentation. It is clean but lacks context for full comprehension without additional background knowledge on the subject matter of entropy and span routing distributions within machine learning models, making it somewhat less ideal as standalone training data compared with more complete examples or explanations. 
/ The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"entropy-annealed models,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code documentation. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"SPAN-AwARE ENCODER,\" and references to concepts like entropy-annealed models which are valuable for learning span composition."}}
 {"raw": "Observations\nAcross entropy regimes, early layers select broad sentence-level spans; mid-depth layers refine into clause and phrase-level boundaries [72].", "type": "natural", "id": {"id": "0e9cdcb1-8e0b-429f-9cf5-331fe048bc7e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The text segment clearly delineates different levels of span selection (sentence-level, clause level) which can be used to train a model on understanding hierarchical structure in language; it is clean and coherent for training purposes. / Clear sentence structure with identifiable spans; useful for learning span segmentation in text. / Clear sentence structure with identifiable spans; useful for learning span segmentation in text. / Clear sentence structure with identifiable spans; useful for learning span segmentation in text. / Clear sentence structure with identifiable spans; useful for learning span segmentation in text."}}
 {"raw": "In: Transactions of the Association for Computational Linguistics 10 (2022), pp. 291-306. [19] Jonathan H. Clark et al. \"CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation\".", "type": "mixed", "id": {"id": "4d141f9e-66fe-4f23-93ba-22b0eb9bc326"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a citation with structured elements (author, title) and publication details that can be segmented into meaningful spans for learning purposes. / Contains a citation with structured elements (author, title) suitable for span segmentation; represents valuable patterns in academic referencing and spans across different content types. / The segment contains a citation with structured elements (journal name, page numbers) and an academic reference that can be segmented into meaningful spans for training purposes; it is clean but lacks context or content to learn from directly. / Contains a citation with clear structure, including authors and publication details; spans across different domains (natural language for text content). / The segment contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for training purposes; it is clean and coherent but lacks direct span segmentation examples in natural language or code context."}}
 {"raw": "[83] Jesse Vig and Yonatan Belinkov. \"Analyzing the Structure of Attention in a Transformer Language Model\". In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics, 2019, pp. 63-76.", "type": "mixed", "id": {"id": "ff1fb3ba-8655-4dfc-b540-99261ae0d562"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a mix of different formats (title, citation) with unclear span segmentation for training purposes. / Contains a mix of structured references (title, authors) and unstructured text; spans can be identified for training purposes. / The segment contains a mix of citation elements (title, authors) and structured metadata that can help the model learn span segmentation in both academic contexts and formal document structures. It is clean but lacks natural language context for deeper learning patterns. / Contains a mix of structured elements like titles, authorship information and references that can be segmented into meaningful spans for training purposes. The text is clean but lacks coherence as it appears to contain typographical errors (\"X-SPANFORMER\" instead of \"SPAN-AwARE ENCODER\", missing periods in the citation). / The segment contains a mix of structured references and unstructured text, lacking clear spans for effective training. Additionally, there are typographical errors (e.g., \"Lan- guage Model\" should be corrected to \"Language model\") that could confuse the learning process."}}
 {"raw": "broadly, while later epochs concentrate on salient structures. 4.3 Controller-Aware Generation and Final Objective The fused span summary vector $s \\in \\mathbb{R}^d$ serves as a global control signal for conditioning the transformer encoder.", "type": "mixed", "id": {"id": "c3f8bd02-11c8-4bbc-ad1e-bb487e0f23a5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Controller-Aware Generation\" or \"fused span summary vector.\" It is clean but lacks context for full comprehension without domain knowledge in machine learning/AI fields. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"broadly,\" \"later epochs concentrate on salient structures,\" etc., which are useful for learning span composition in both natural language processing (NLP) tasks related to code documentation. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Controller-Aware Generation\" or \"fused span summary vector.\" It is coherent but lacks context for full comprehension without additional information on the domain-specific concepts mentioned. / Contains both technical terms and phrases that can be segmented into meaningful spans, representing a mix of domain-specific language useful for training on span segmentation in diverse contexts. / Contains both technical terms and phrases that can be segmented into meaningful spans, representing a mix of language structures useful for training."}}
 {"raw": "Rather than statically appending $s$, X-Spanformer supports multiple integration pathways that modify computation at different stages of the network. To compute the controller, we define:\n$$s = \\sum_{k=1}^{K} \\alpha_k s_k, \\quad \\alpha_k = \\frac{\\exp(w_k)}{\\sum_{\\ell=1}^{K} \\exp(w_\\ell)},$$\nwhere each $s_k = \\mathrm{Pool}(x_{i_k:j_k})$ is a pooled span representation, and $w_k = g_\\phi(s_k, \\delta_k, \\mathrm{conf}_k)$ is a learned span-specific salience score incorporating structural and uncertainty features.", "type": "mixed", "id": {"id": "68b3cf5b-f55d-4c42-96e1-cad75c918240"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of programming constructs and mathematical expressions, with clear delimiters for spans such as parentheses and commas that can be used to train the model on span segmentation in both natural language descriptions (e.g., \"To compute\") and code-like structures (\"S = @kSk\"). / The segment contains clear, structured elements such as equations and variable definitions that are essential for learning span segmentation in a programming context. It is well-formed with identifiable spans like variables (Wk), functions (@kSk), expressions (exp( Wk_ K e=1 exp(we)), etc.), which can be segmented meaningfully by the model to understand code composition patterns effectively. / The segment contains a mix of mathematical expressions and programming-like notation, which can be segmented into meaningful spans for learning span composition in X-Spanformer training data. However, the clarity could improve with better formatting or separation between different elements (e.g., separating equations from code constructs). / The segment contains a mix of mathematical expressions and programming-like notation, which can help the model learn span segmentation in both contexts. 
However, it lacks clarity due to unconventional formatting (e.g., \"$\", \"@k\"). Clean-up could improve its utility as training data. / The segment contains a mix of programming constructs and mathematical expressions, which can help the model learn span segmentation in both contexts. However, it lacks clarity due to its complexity; simplifying or breaking down into smaller segments could improve training utility."}}
 {"raw": "$s$ is inserted as a synthetic token at input position $t = 0$, forming an augmented sequence $X = [s, x_1, x_2, \\ldots, x_L]$, allowing early layers to attend over structure-induced context from the very first step [30]. (b) Attention bias: $s$ is projected via learnable matrices and added to the query and key representations before computing attention weights: $Q_i \\leftarrow Q_i + W_Q s$, $K_j \\leftarrow K_j + W_K s$, forming low-rank adaptive adjustments to the attention mechanism [47]. (c) Gating vector: Feed-forward activations are modulated by", "type": "mixed", "id": {"id": "f3d5c1a8-7d73-4731-8b9a-519c1d4e7ed2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical descriptions and mathematical expressions, which can help the model learn span segmentation in both domains; however, it may require additional context for full comprehension. / The segment contains a mix of technical descriptions and mathematical expressions, which are structurally clear for span segmentation; however, it lacks coherence as an isolated example due to its complexity and domain-specific language. / The segment contains a mix of technical descriptions and mathematical expressions, which are valuable for learning span segmentation in both structured programming contexts (code) and explanatory text formats (natural language). It is coherent but may require domain-specific knowledge to fully understand the context-dependent spans like \"attention bias\" or matrix notation. / The segment contains a mix of technical descriptions and mathematical expressions, with clear references to programming constructs (e.g., \"augmented sequence\", \"attention weights\") that can be segmented into meaningful spans for learning span composition in both natural language contexts as well as code-related patterns. 
/ The segment contains a mix of technical terms and mathematical expressions, which can help the model learn span segmentation in both structured programming contexts (code) and descriptive text segments that explain concepts or mechanisms related to code structures. However, it lacks clarity due to complex notation without context for some parts; thus it's not ideal but still valuable as mixed content training data."}}
 {"raw": "The consistency with constituent and semantic boundaries suggests that controller-guided routing induces soft parsing-like behavior, validating the design principle of compositional priors via differentiable selectors", "type": "natural", "id": {"id": "ca6a38fe-bf0c-4f8a-a0eb-a0b8d7acef1a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The text segment contains clear linguistic structures that can be segmented into meaningful spans, such as phrases and clauses related to a scientific concept or argument about parsing behavior in computational linguistics. It is clean for training purposes but lacks explicit compositional patterns due to its specialized vocabulary; however, it still holds value because of the context-dependent nature of span segmentation learning tasks. / The text segment contains clear linguistic structures that can be segmented into meaningful spans, such as phrases and clauses related to a conceptual discussion on parsing behavior in computational linguistics; it is clean for training purposes with valuable patterns of compositionality present. / The segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"consistency with constituent,\" which is indicative of compositional patterns in language learning for a span-aware model like X-Spanformer. It represents clean text without code constructs or mixed content types; thus it fits well within the natural type category. / The text segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"consistency with constituent,\" which is indicative of compositional patterns in language processing; it represents valuable training data for a span-aware model focused on understanding the composition within sentences. 
/ Clear sentence structure with identifiable phrases and concepts suitable for span segmentation; demonstrates compositional patterns in language usage."}}
 {"raw": "DOI: 10.1162/tacl_a_00115. URL: https://aclanthology.org/Q16-1037. [86] Yonatan Belinkov and James Glass. \"Analysis Methods in Neural Language Processing: A Survey\". In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 49-72.", "type": "mixed", "id": {"id": "132c5f05-f623-4195-929f-53462523874f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a DOI, URL (code-like structure), and citation text which provides diverse patterns for learning span segmentation in both domains. It is clean but lacks coherence as it seems to be an excerpt from literature referencing code or data sources related to neural language processing methods. / The segment contains a DOI, URL (code-like structure), and citation text which are all meaningful spans for training purposes in both domains. It is clean but lacks context or content that could be directly used as an example without additional information on the surrounding document's nature. / Contains a mix of citation elements (DOI, URL) and academic references with clear structure for span segmentation. / The text contains a DOI, URL (code-like structure), and citation format which are useful for training span segmentation in both domains. It is clean with clear delimiters that can be used to identify spans of interest such as the title (\"Analysis Methods in Neural Language Processing\"), authors' names, publication details etc., making it highly representative across mixed content types. / The segment contains a DOI, URL (natural language), and citation format that includes authors' names with parentheses indicating additional information like titles or page numbers; it represents valuable patterns for learning span segmentation in both natural text and structured references."}}
 {"raw": "span-conditioned gates: $\\mathrm{FFN}(h) = \\sigma(W_g s) \\odot \\mathrm{MLP}(h)$, where $\\sigma$ is an activation function (e.g., sigmoid or swish) and $\\odot$ denotes elementwise multiplication. This enables multiplicative control over token-wise representations.", "type": "mixed", "id": {"id": "a6ae1927-ffa3-469a-a44f-cacfe1007d95"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear span segmentation with a mix of technical terms and mathematical expressions, representing valuable patterns for learning both language structure (natural) and programming constructs (code). / Contains clear span patterns combining both programming concepts and mathematical expressions, suitable for learning complex structures in a tokenizer-free context. / The segment contains a mix of programming concepts (e.g., functions, variables) and mathematical expressions that can be segmented into meaningful spans for learning span composition in both contexts. / The segment combines both technical terms and mathematical expressions, which are clear structures that can be segmented into meaningful spans for a span-aware model to learn from. It is clean but lacks context or explanation about the concepts mentioned (e.g., FFN, MLP). / Clear mix of technical terms and mathematical expressions, with identifiable spans for both language (e.g., \"span-conditioned gates\") and math/code-like structures (\"FFN(h) = o(Wgs)\"). Well-formed content representing valuable patterns in span composition across domains."}}
 {"raw": "[30, 51].\n5.5 Ablation: Entropy, Pooling, and $\\beta_1$\nWe conduct a structured ablation to isolate the effect of key hyperparameters on routing behavior and downstream task performance. Specifically, we vary:\nEntropy Decay Rate $\\gamma \\in \\{0.01, 0.1, 0.5\\}$: Controls the rate in the entropy regularization schedule $\\lambda_{\\mathrm{ent}}(t)$", "type": "mixed", "id": {"id": "c804b0b5-b74d-4832-a92a-97019c0d9efd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of numerical values, variable names (like Y and Aent), which can be useful for learning span segmentation in both coding contexts as well as mathematical expressions commonly found alongside code comments or documentation. It is clean but lacks context to fully evaluate its training utility without additional surrounding text. / The segment contains a mix of numerical values, programming-like notation (e.g., Y € {0.01, 0.1, 0.5}), and structured text that can be segmented into meaningful spans representing both code constructs (\"Y\", \"€\") and natural language descriptions (\"Entropy Decay Rate\", \"Ablation\"). It is clean but lacks context for full comprehension as a standalone example; however, it contains valuable patterns of mixed content suitable for training an X-Spanformer. / The segment contains a mix of numerical values, programming-like syntax (e.g., \"Entropy Decay Rate Y € {0.01, 0.1, 0.5}\"), and natural language descriptions (\"Ablation\", \"key hyperparameters\"). It shows clear structural elements that can be segmented into meaningful spans for learning span composition in a mixed context. 
/ The segment contains a mix of numerical values, variable names (like Y), and structured text that can be segmented into meaningful spans for learning purposes; however, it lacks clarity in the context which might affect training utility slightly. / The text segment contains a mix of numerical values, variables (e.g., Y), and mathematical expressions that can be segmented into meaningful spans for learning purposes; it is clean but lacks contextual clarity due to the technical nature of content."}}
 {"raw": "Proposition 7 (End-to-End Differentiability of Controller Injection). Let $s \\in \\mathbb{R}^d$ denote a fused control vector computed via relevance-weighted interpolation over span embeddings:\n$$s = \\sum_{k=1}^{K} \\alpha_k s_k, \\quad \\alpha_k = \\frac{\\exp(w_k)}{\\sum_{\\ell=1}^{K} \\exp(w_\\ell)}, \\quad w_k = g_\\phi(s_k, \\delta_k, \\mathrm{conf}_k).$$", "type": "code", "id": {"id": "adcb368d-f1dd-4064-a506-7d11730b0899"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and mathematical expressions, which can be segmented into meaningful spans for learning purposes; however, the presence of special characters like \"►\" may affect clarity slightly. Overall clean with valuable patterns in span composition related to programming code documentation. / Clear structure with identifiable spans like function names, variables (e.g., \"K S\", \"@k\"), and mathematical expressions; represents valuable patterns for learning span composition in programming context. / Clear structured elements with identifiable spans, suitable for learning span composition in programming context. Well-formed and coherent as a technical excerpt. / Clear structure with identifiable spans such as function names, variables (e.g., `K S`, `Wk`), and mathematical expressions; well-formed for training purposes in a tokenizer-free context. / The segment contains a mix of technical terms and mathematical expressions, which can help the model learn span segmentation for both programming constructs (e.g., variables like `K`, `S`) and natural language descriptions (`Proposition 7`). It is clean but lacks context."}}
 {"raw": "[28] Jai Gupta et al. \"Molt: Modular Prompt Tuning for Multi-task and Cross-lingual Transfer\". In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2022.", "type": "mixed", "id": {"id": "2c58b24a-48e9-4268-bb32-bf7d4b6ead7e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a citation with structured elements (authors, title, publication details) that can be segmented into meaningful spans for training purposes; it represents valuable patterns in span composition and is clean enough to serve as representative data across both natural language text and code-like structures. / The segment contains a citation with structured elements (authors, title, conference details) that can be segmented into meaningful spans for training purposes; it is clean and well-formed but lacks context to fully represent the target domain's patterns. / The segment contains a citation with structured elements (authors, title, conference details) that can be segmented into meaningful spans for training purposes; it is clean and coherent but lacks context to fully represent the target domain's patterns. / The segment contains a citation with structured elements (authors, title, conference details) that can be segmented into meaningful spans for training purposes; however, it lacks context and coherence as an isolated excerpt. / The segment contains a citation with structured elements like author names, title in quotes and parentheses for additional context (e.g., conference name), which can be segmented into meaningful spans such as \"Jai Gupta\", \"[Mol...]\" etc. 
It is clean but lacks natural language content; however, it represents valuable patterns of mixed-type references that could aid the model's understanding of span segmentation across different domains."}}
 {"raw": "The independence assumption ensures that no additional structural constraint or interaction term modifies this score.", "type": "natural", "id": {"id": "c32f3bd7-9bd7-41cd-87c8-f81fcf856555"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with a single meaningful span; represents good pattern for training in understanding complex sentences without explicit delimiters. / Clear sentence structure with meaningful phrases; represents a coherent pattern for learning span segmentation in English text. / Clear sentence structure with a single meaningful span; represents good pattern for learning non-code text spans. / Clear sentence structure with identifiable spans; well-suited for learning span composition in NLP tasks. / Clear sentence structure with a single, coherent idea that can be segmented into meaningful spans; clean and well-formed for training purposes."}}
 {"raw": "$s$ and $V_0 = W_V s$. Since matrix multiplication, softmax, and the addition of $s$ via linear projections are differentiable operations, gradients propagate through $s$ during attention. Step 3: Attention Bias Injection. Let $Q_i \\leftarrow Q_i + W_Q s$ and $K_j \\leftarrow K_j + W_K s$. The perturbation induces a modified attention logit\n$$e_{ij} = (Q_i + W_Q s)^\\top (K_j + W_K s),$$\nwhich remains differentiable in $s$ by the composition of smooth affine mappings and inner products.", "type": "mixed", "id": {"id": "e2d96693-d1b2-4cfc-a341-a61f96282f7d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical expressions and programming-like notation, which can be useful for learning span segmentation in both domains. However, the text is quite complex with specialized terms that may require additional context or preprocessing to fully benefit X-Spanformer training data. / The segment contains a combination of mathematical notation and programming-like expressions, which are clear structures suitable for span segmentation in both domains. It is clean with meaningful patterns that can aid learning about attention mechanisms within neural networks or similar systems. / The text contains a mixture of mathematical notation and programming-like expressions, with clear structures that can be segmented into meaningful spans for learning span composition in both domains. / Contains both mathematical expressions and programming-like notation, which can help the model learn span segmentation in a context that includes elements of coding language intertwined with formal descriptions. The text is clean but may require additional preprocessing to separate code constructs from natural language for optimal training utility. 
/ The segment contains a mixture of mathematical notation and programming-like expressions, which can help the model learn span segmentation in both domains. However, it is somewhat complex for direct interpretation without context or additional explanation on how to parse such constructs into spans effectively."}}
 {"raw": "$\\lambda_{\\mathrm{ent}}(t) = \\lambda_0 e^{-\\gamma t}$\n4: Select top-K spans: $S_t \\leftarrow \\mathrm{TopK}(a)$\n5: for each selected span $(i_k, j_k) \\in S_t$ do\n6: Extract sub-tokens: $x_{i_k:j_k}$\n7: Compute mean embedding: $\\mu_k \\leftarrow \\mathrm{mean}(x_{i_k:j_k})$\n8: Compute max embedding: $v_k \\leftarrow \\max(x_{i_k:j_k})$\n9: Compute gating score: $g_k \\leftarrow \\sigma(w^\\top \\mu_k + b)$\n10: Pool span embedding: $s_k \\leftarrow g_k$", "type": "mixed", "id": {"id": "55334669-21c4-4867-86e5-d42f3eaa333b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear structure with identifiable spans; includes programming constructs and mathematical expressions suitable for learning span segmentation in a tokenizer-free context. / The segment contains a mix of programming-like pseudocode and mathematical expressions, which can help the model learn span segmentation in both contexts. Clear structure with identifiable spans for training purposes. / The segment contains clear, structured programming constructs and logical sequence for span extraction; it is clean but lacks context or explanation which could be added to improve understanding. / The segment contains a mixture of programming constructs and mathematical expressions, which can help the model learn span segmentation in both contexts. However, it lacks clarity due to unconventional notation (e.g., \"X Aoe-~t\"). Clean-up may be needed for optimal training utility. / Clear structured elements with distinct operations and mathematical expressions suitable for learning span composition in programming context."}}
 {"raw": "Hence, $\\partial \\mathcal{L} / \\partial s$ exists. Step 4: Gating Vector Injection. A gated FFN applies:\n$$\\mathrm{FFN}(h) = \\sigma(W_g s) \\odot \\mathrm{MLP}(h),$$\nwhere $\\sigma$ is a smooth activation (e.g., sigmoid). Each operation (linear map, activation, Hadamard product) preserves differentiability. Conclusion. In all injection strategies, the loss $\\mathcal{L}$ is differentiable in $s$. Since $s$ is differentiable with respect to all upstream computations (span representations $s_k$ and their source embeddings), gradients flow continuously through the span routing mechanism.", "type": "mixed", "id": {"id": "085ef8b1-4892-40a8-8d53-bb06c7f0a9c9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a mix of mathematical expressions, programming concepts (e.g., FFN), and formal language that can help the model learn span segmentation in both structured data formats like equations/formulas as well as natural-language descriptions. However, some symbols may need special handling or clarification for effective training. / The text contains a mix of mathematical notation and programming-like expressions, but lacks clear delimiters for meaningful spans; it's not well-formed or coherent enough as training data. / The segment contains a mix of mathematical notation, programming-like expressions (e.g., \"FFN(h)\"), and formal language that can be segmented into meaningful spans for learning span composition in both natural language contexts as well as code constructs. / The segment contains a mix of mathematical notation, programming-like expressions (e.g., FFN(h)), and formal language that can be segmented into meaningful spans for training purposes; it is clean but complex due to the combination of natural language explanations with code constructs. 
/ The segment contains a mix of mathematical notation, programming-like expressions (e.g., \"FFN(h)\"), and formal language (\"Hence,\" \"Conclusion\"). It has clear structures like equations that can be segmented into meaningful spans for learning span composition in both code-related patterns."}}
 {"raw": "$\\cdot v_k + (1 - g_k) \\cdot \\mu_k$\n11: end for\n12: Interpolate controller signal: $s \\leftarrow \\sum_k \\alpha_k s_k$\n13: Inject controller at layer $\\ell$: $h^{\\ell} \\leftarrow f(z^{\\ell}) + W^{\\ell} s$\n14: Compute task loss: $\\mathcal{L}_{\\mathrm{task}} = \\mathrm{CrossEntropy}(\\mathrm{output}, y)$\n15: Compute optional alignment loss: $\\mathcal{L}_{\\mathrm{align}} = \\mathrm{RouteAlign}(a_k, \\text{gold spans})$\n16: Assemble final loss:\n$$\\mathcal{L}_{\\mathrm{final}} = \\mathcal{L}_{\\mathrm{task}} + \\lambda_{\\mathrm{ent}}(t) H(P_t) + \\beta_1 \\mathcal{L}_{\\mathrm{align}} \\tag{38}$$\nGradient Interactions and Entropy Control", "type": "mixed", "id": {"id": "4ecd9f0c-1648-4df0-bd1f-22795464e86f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of programming constructs and comments, with clear delimiters for spans like function names (\"end for\", \"Interpolate controller signal\"), variable assignments (e.g., \"$ < k @kSk\"), which are useful patterns to learn span segmentation in code. / The segment contains a mix of programming concepts and mathematical expressions, which are clear structured elements suitable for learning span segmentation in both domains. It is clean but lacks context or explanation that could improve its training utility. / The segment contains a mixture of programming constructs and comments, with clear delimiters for spans like function names (e.g., \"end\", \"Interpolate\") that can be used to train the model on span segmentation in code context. / The segment contains a mixture of programming constructs and comments, with clear delimiters for spans like function names (e.g., \"end\", \"Interpolate\"), variable assignments (\"X CrossEntropy(output, y\") etc.), which are useful patterns to learn span segmentation in both code understanding tasks. / The segment contains a mix of programming constructs and comments, with clear delimiters for spans like function names (e.g., \"end\", \"Interpolate\") that can be useful in training span-aware models. 
It is clean but lacks natural language context which may limit its utility solely as code understanding data."}}
 {"raw": "3.3 Length Estimator\nWhile the span predictor yields high-confidence candidates based on boundary positions, it lacks an inductive bias toward plausible internal structure, such as the typical width of syntactic, semantic, or modality-specific spans. To address this, we introduce a length estimator: a learned prior over span widths that filters proposals based on predicted span length compatibility.", "type": "mixed", "id": {"id": "0ad684e9-afcd-4538-ba6e-64acd9e5d55e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding concepts (like \"span predictor\") and natural language explanations (\"plausible internal structure\"). / The segment contains a mix of technical terms and descriptions, with clear delineation between concepts like \"span predictor,\" \"length estimator,\" etc., making it suitable for learning span segmentation in both natural language contexts as well as code-like structures. / Contains a mix of technical terms and phrases with clear structure, representing valuable patterns for learning span composition in both coding context (e.g., \"X-SPANFORMER\", function names) and natural language explanations (~such as the typical width). / The segment contains a mix of technical terms and descriptions that can be segmented into meaningful spans, such as \"SPAN-AwARE ENCODER,\" \"length estimator,\" etc., which are useful for learning span composition in both natural language processing (NLP) tasks related to code documentation. 
/ The segment contains a mix of technical terms and descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both programming context (X-SPANFORMER) and natural language explanation (~such as the typical width)."}}
 {"raw": "The combined influence of entropy and alignment on controller gradients is given by:\n$$\\nabla_{w_k} \\mathcal{L}_{\\mathrm{final}} = \\lambda_0 e^{-\\gamma t} \\nabla_{w_k} H(P_t) + \\beta_1 \\nabla_{w_k} \\mathcal{L}_{\\mathrm{align}}. \\tag{39}$$\nEarly in training, the entropy term dominates, encouraging exploratory and smooth distributions over candidate spans [53]. As $\\gamma$ increases, sharper annealing quickly reduces entropy, leading to peaked confidence and accelerated convergence. Meanwhile, $\\beta_1$ scales the alignment supervision, anchoring span selection in structural prior regions. This occurs in low-entropy regimes", "type": "mixed", "id": {"id": "a6848c8b-8c99-4298-bece-da8ad34301c8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical expressions and explanatory text, with clear delimiters for spans such as equations (Lfinal = Aoe-~t W Wk H(Pt) + 81 Wk Lalign * (39)) that can be segmented. It represents valuable patterns in both natural language explanation surrounding code-like structures which are useful to learn span composition and alignment supervision concepts, though it contains some complex mathematical notation not typically found outside of specialized domains like machine learning or physics research papers. / The segment contains a mixture of mathematical expressions, equations (e.g., Lfinal = Aoe-t W Wk H(Pt)), and prose explaining concepts like entropy in machine learning training processes; it has clear spans for both code constructs (\"entropy term\", \"alignment supervision\") and natural language explanations. / The segment contains a mix of mathematical expressions, equations (e.g., Lfinal = Aoe-t W Wk H(Pt) + 81 Wk Lalign * (39)), and natural language explanations which can help the model learn to handle both structured code-like elements as well as prose. 
/ The text contains a mix of technical terms and mathematical expressions, which can help the model learn span segmentation in both structured programming contexts (like equations) as well as more fluid language descriptions typical for natural texts. However, it lacks clear delimiters between spans that are essential to train an encoder effectively without tokenization cues. / The segment contains a mix of mathematical expressions and prose, with clear boundaries for span segmentation around equations (e.g., \"Lfinal = Aoe-~t W Wk H(Pt) + 81 Wk Lalign * (39)\"). It is coherent but lacks context on the variables used."}}
 {"raw": "6\nFor each proposed span (i,j) € $, we define its length:\n6 = j -i+1. We then pool features over the span window:\nVij Pool( H[i:j]) € Rd,\nwhere Pool(:) may be mean pooling; max pooling; o self-attentive aggregation. This representation is passed through a classifier head that outputs a categorical distribution over discretized length bins: =\nWevij + be,\nps = softmax(\n= arg maxp'\nThe predicted length $ acts as a prior over plausible widths and is compared against the actual span length 6.", "type": "mixed", "id": {"id": "a059a94a-df8a-4a39-be3a-0c6f16833664"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a clear mix of mathematical notation, programming constructs (like loops and functions), which are essential for learning span segmentation in both domains; it's clean with identifiable spans like \"span\", \"length\", etc., representing valuable patterns for the model to learn from. / Clear spans identified (e.g., \"span\", \"(i,j)\", \"$\"), well-defined mathematical expressions, and a mix of both programming constructs and explanatory text suitable for learning span segmentation in diverse contexts. / The segment contains a mix of mathematical expressions, programming concepts (like pooling and classifiers), which can help the model learn span segmentation in both structured data formats like equations as well as natural language descriptions related to machine learning tasks. It is clean with clear structural elements that are meaningful for training purposes. / Clear spans identified (e.g., \"span\", \"(i,j)\", \"$\") and well-defined mathematical expressions, useful for learning span segmentation in both language context and numerical computations. 
/ Clear spans identified, including mathematical expressions and programming constructs; well-formed for training purposes with valuable patterns in span composition."}}
 {"raw": "Span Fusion Span fusion Transformer Backbone Prefix token Attention bias 2 Gating vector € =2 ak Figure 5: Training workflow of X-Spanformer: Spans are scored, entropy-regularized , and interpolated into a fused control vector s, which conditions the backbone encoder via multiple integration modes 4.4 Optimization and Curriculum Strategy X-Spanformer is trained via a structured two-stage curriculum designed to (i) bootstrap structural induction from local compositional statistics, and (ii) fuse these", "type": "mixed", "id": {"id": "525ad771-aa6e-433f-8f38-5bb0712e9eb2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains structured elements with clear span segmentation opportunities, including technical terms and phrases relevant to X-Spanformer training; however, some domain-specific jargon may require additional context for full comprehension. / The segment contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Fusion,\" \"Transformer Backbone,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code understanding or documentation. / Contains a mix of technical terms and structured descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding contexts (e.g., \"Span Fusion\", \"Transformer Backbone\") and natural language explanations (\"training workflow\"). / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Fusion,\" \"Transformer Backbone,\" etc., which are relevant for training an X-Spanformer model on span-aware encoding in both code-like structures (e.g., function names) and natural language descriptions. 
/ Contains a mix of technical terms and structured descriptions suitable for span segmentation, though some domain-specific jargon may need further context or explanation in training data."}}
 {"raw": "We retain only those spans for which the prediction deviates from the ground truth by at most a fixed tolerance: 5' = {6,j) e s |16 _ 81<w} where T 2 0 is a hyperparameter   governing   flexibility: This length-aware filtering mechanism discourages degenerate; overly short, or overly long span hypotheses, and has been shown to improve accuracy in both text segmentation and vision attention tasks [26, 3, 17].", "type": "mixed", "id": {"id": "223392ee-dc0b-4d55-a165-a2171cce8866"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of mathematical notation and prose, with clear spans for both equations (e.g., \"5' = {6,j) e s\") and explanatory text (\"This length-aware filtering mechanism...\"). It is clean but complex due to the combination. / The segment contains a mix of mathematical notation and prose, with clear span boundaries for both equations (e.g., \"5' = {6,j) e s\") and descriptive text (\"This length-aware filtering mechanism...\"). It has structural clarity suitable for learning patterns in code-mixed content. / The segment contains a mix of mathematical expressions and prose, with clear span boundaries for both equations (e.g., \"5' = {6,j)\") and descriptive text (\"retains only those spans...\"). It is clean but may require domain-specific knowledge to fully understand. / Contains both structured programming elements and mathematical expressions, providing diverse span patterns for learning. However, readability could be improved due to the mix of notations. / The text contains a mix of mathematical expressions and prose, with clear span segmentation for both equations (e.g., \"5' = {6,j) e s\") and explanatory sentences (\"retains only those spans\", etc.). It is clean but may require preprocessing to handle the mixture effectively."}}
 {"raw": "learned inductive biases into an end-to-end transformer backbone. This approach draws from established principles in multi-phase self-supervision [5, 61], curriculum learning [54, 24], entropy-guided latent modeling [55], and gradual architectural fusion [62, 48].", "type": "natural", "id": {"id": "3715d6cd-36a2-470d-95fe-f817eecce5e5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear and coherent prose with identifiable phrases suitable for learning span segmentation in a language context. / Clear and coherent prose with identifiable phrases suitable for learning span segmentation in a transformer model. / Clear and coherent prose with identifiable phrases suitable for learning span segmentation in a tokenizer-free context. / Clear sentence structure with identifiable phrases and concepts suitable for learning span composition in a transformer model. / Clear and coherent prose with identifiable phrases suitable for learning span composition in a transformer model."}}
 {"raw": "Proposition: Stability of Entropy-Gated Routing\nProposition 11 (Span Entropy Convergence Under Annealing).", "type": "natural", "id": {"id": "ed4068e5-5779-460a-91e5-9f5acf6451dc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear proposition and theorem statement with identifiable spans for training purposes. / Clear proposition and theorem statement, suitable for learning span composition in scientific texts. / Clear proposition and theorem statement, suitable for learning span segmentation in academic texts. / Clear proposition and theorem statement with identifiable spans for training. / Clear proposition and theorem statement with identifiable spans for learning; well-formed text suitable as training data."}}
 {"raw": "The optimization process proceeds as follows: Phase I: Span Pretraining (Structure Discovery) This phase isolates the span scorer fo and aggregator g6 to encourage compositional discovery independent of downstream gradients: The learning objective focuses on reconstruction or type classification: pre Lrecon BauxLaux , (26) where Lrecon is a span-wise MSE or token-level cross-entropy loss; and Laux may capture span-type heuristics (e.g-, POS tags; constituency labels) from lightly supervised signals [63]. 1", "type": "mixed", "id": {"id": "feaf6db8-754d-4497-96f2-59eeca401342"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of structured programming-like comments and prose, with clear delineation between phases (Phase I) that can be segmented into meaningful spans for training purposes. It includes technical terms (\"span scorer\", \"aggregator\") which are valuable patterns in span composition learning within the context of code documentation or mixed content. / The text contains a mixture of structured programming language elements (e.g., variable names, function calls) and formal descriptions typical in documentation or academic papers (\"Phase I\", \"learning objective\"). It has clear spans that can be segmented for training purposes such as phrases like 'Span Pretraining', 'span scorer fo' etc. / The segment contains a mixture of structured text and programming-like notation, which can help the model learn span segmentation in both contexts. However, it may benefit from further cleaning to improve clarity for training purposes. / Contains both structured programming elements (e.g., \"Phase I\", function names like \"fo\" and \"g6\") as well as formal language constructs (\"The learning objective focuses on reconstruction or type classification\"). 
Clear spans can be identified, such as code functions/variables. / The segment contains a mix of structured programming-like comments and mathematical expressions, which can help the model learn span segmentation in both contexts; however, it lacks clarity due to potential typographical errors (\"fo\" instead of \"for\", missing closing parenthesis)."}}
 {"raw": "Let Pt be the span distribution at epoch t, and H(Pt) its entropy: Suppose controller updates are primarily influenced by the entropy term in the loss, with annealing  schedule Aent ' (t) = Aoert_ Then the entropy satisfies the decay bound: H(Pt) Hmax e\"Yt where Hmax 3 log ISl: (40)\n32", "type": "mixed", "id": {"id": "3b4fb332-e99d-4326-99ea-903d0e7a2d7d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear mix of mathematical notation and prose, with identifiable spans for both equations (entropy terms) and descriptive text; well-formed content suitable for learning span composition in a tokenizer-free context. / The segment combines mathematical notation with a pseudo-code-like structure, which can help the model learn span segmentation across different domains (natural language and symbolic expressions). It is clean but lacks context for full comprehension by X-Spanformer without additional data on entropy decay in machine learning. / The segment contains a mix of mathematical notation and prose, with clear structured elements like equations that can be segmented into meaningful spans for learning purposes. It is clean but lacks context which might affect its utility as standalone training data. / The text segment contains a mixture of mathematical notation and prose, which can help the model learn to identify spans in both structured (code-like) expressions as well as more fluid language constructs. It is clean but lacks context for full comprehension; however, it has clear structural elements that are beneficial for training purposes. / The segment contains both mathematical notation and structured text, which can help the model learn span segmentation for different types of content. 
However, it lacks clarity in separating distinct spans due to its compact nature with symbols closely linked together without clear delimiters or context clues that would aid a tokenizer-free approach."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER Proof: Follows directly from exponential decay bounds 0n entropy-regularized softmax distributions [75]. See Proposition 8 for detailed derivation This result provides theoretical support for the routing sparsification observed in Section 5.4, con- firming that entropy scheduling is sufficient to yield selective, interpretable span patterns; provided Ao and ~ are chosen to balance exploration and convergence. 5.6 Future Benchmarks and Tasks We outline evaluation pathways", "type": "mixed", "id": {"id": "6e6c262a-7244-4385-ac84-328a551ce27e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The text contains a mix of formal language and mathematical notation, which could help the model learn span segmentation in both domains; however, it lacks clear delimiters for spans like parentheses or quotation marks that are common in code snippets but not present here. / Contains a mix of formal language and references to mathematical concepts, which can help the model learn span segmentation in both structured (code-like) expressions as well as more free-form text segments. The segment includes clear phrases like \"X-SPANFORMER SPAN-AwARE ENCODER Proof\" that could be useful for training purposes by identifying spans related to technical terms and references. / The segment contains a mixture of formal language and references to mathematical concepts, which can be segmented into meaningful spans such as \"X-SPANFORMER SPAN-AwARE ENCODER Proof\", \"[75]\", etc., representing valuable patterns for learning span composition in both natural text (like academic writing) and code-like structures. 
/ Contains a mix of formal language and references to mathematical concepts, which can help the model learn span segmentation in both structured (code-like) expressions as well as more free-form text segments. The segment is clean but lacks explicit code constructs or natural prose clarity for direct learning without further context. / The segment contains a mix of formal language, references to mathematical concepts (entropy), and structured arguments with clear spans like \"X-SPANFORMER SPAN-AwARE ENCODER Proof\" that can be used for training on span segmentation in both code-like structures (\"Proof\", \"[75]\", etc.) and natural language explanations."}}
 {"raw": "X-Spanformer: A Tokenizer-Free, Span-Aware Encoder Inspired by X-Bar Theory Kara Marie Rawson` Aimee Chrzanowskit June 26, 2025 This work %8 & preprint and has not yet been peer reviewed. Abstract Tokenization remains a limiting factor in contemporary transformer architectures, typically grounded in static subword vocabularies that generalize poorly across heterogeneous or evolving textual inputs. We introduce X-Spanformer, a tokenizer-free segmentation module that re places heuristic lexical boundaries", "type": "natural", "id": {"id": "b174a053-f45b-4212-a85b-f53b747eb9b9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with identifiable spans; however, contains informal language (\"preprint\", \"generalize poorly\") that may not be ideal for training purposes but still offers valuable patterns in span segmentation and composition within the context of academic writing. / Clear prose with identifiable phrases and concepts; useful for learning span segmentation in text. / Clear prose structure with identifiable spans; useful for learning sentence segmentation patterns in NLP. / Clear prose structure with identifiable spans like \"X-Spanformer,\" dates, and abstract content; well-suited for learning span segmentation in NLP contexts. / Clear sentence structure with identifiable spans; represents valuable patterns for learning span segmentation in the context of abstract and preprint descriptions."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER Algorithm 1 Phase I Span Pretraining Require: Dataset D = {(z() , y())}N1; scorer fo; aggregator 9o 1: for each batch (1,y) in D do 2: Sample spans (i,3); mask region Ti:j 3: Compute pooled span embedding sk Pool(.i:j) 4: Predict reconstruction Ti:j = decode(go(8k))", "type": "mixed", "id": {"id": "3204a3af-8a3d-4dda-b91b-8bd998378fd5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.9, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of programming-like pseudocode and domain-specific terminology, with clear structured elements that can be segmented into meaningful spans for training purposes. It is cleanly formatted to represent valuable patterns in span composition across both natural language descriptions (algorithmic steps) and code constructs (functions/methods). / The segment contains a mix of programming-like pseudocode and domain-specific terminology, which can help the model learn span segmentation in both contexts. It is structurally clear with identifiable spans related to code constructs (functions/methods) as well as natural language descriptions (\"SPAN-AwARE ENCODER Algorithm 1 Phase I Span Pretraining\"). / The segment contains a mix of programming-like pseudocode and domain-specific terminology, with clear structured elements that can be segmented into meaningful spans for training purposes. It is cleanly formatted to represent valuable patterns in span composition across both natural language descriptions (algorithmic steps) and code constructs (functions/methods). 
/ The segment contains a mix of programming concepts and pseudo-code, with clear structured elements like functions (z(), y()), loops ((1,y) in D do), sampling spans (Sample spans (i,j)), masking regions (mask region Ti:j), computing embeddings (compute pooled span embedding sk Pool(.i:j)), predicting reconstruction (Predict reconstruction Ti:j = decode(go(8k))). It is clean, coherent and contains valuable patterns for learning both natural language descriptions of code processes as well as the structure within programming constructs. / Clear structure with identifiable spans, representing both programming constructs and algorithmic steps; clean for training purposes."}}
 {"raw": "This mirrors span-centric refinement modules in neural coreference and sequence segmentation models [34, 20], offering higher expressivity at moderate computational cost. In practice, both methods can be fused or gated dynamically based on span type or predicted length, enabling flexibility in balancing generalization and expressiveness across heterogeneous spans. 3.6 Discrete Integration In the default architecture, span embeddings are appended to the encoder input to form an aug- mented composite", "type": "mixed", "id": {"id": "a88acb33-f58f-47f6-a712-b3edbfccab59"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical language and references to models, which can help the model learn span segmentation in both structured (code-like) contexts as well as unstructured text with complex expressions. It is clean but lacks explicit examples or patterns that could be directly used for training without further context. / The segment contains a mixture of technical language and references to models, which can help the model learn span segmentation in both contexts. However, it lacks clear delimiters for spans due to its dense academic style; thus some ambiguity remains regarding structural clarity but still holds value as training data. / The text contains a mix of technical language and references to models, which can help the model learn span segmentation in both contexts; however, it lacks clear delimiters for spans due to its continuous nature without explicit punctuation or formatting cues. / The segment contains a mixture of technical language and references to models, which can help the model learn span segmentation in both contexts; however, it lacks clear delimiters for spans due to its complex structure. 
/ The text contains a mix of technical language and references to models, which can help the model learn span segmentation in both structured (code-like) elements like citations [34], as well as natural language descriptions (\"span-centric refinement modules\", \"neural coreference\"). It is clean but lacks explicit spans for direct training."}}
 {"raw": "2018. [38] Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models\" . In: arXiv preprint arXiv:2205.01068 (2022) .", "type": "mixed", "id": {"id": "1c1269eb-b8f1-4541-9534-68e1dddc3ecc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a mix of citation and arXiv preprint reference, with clear delimiters for spans; however, lacks context or content beyond references which may limit learning scope. / Contains a citation with structured elements (author names, title) and publication details; spans can be segmented into meaningful parts for training purposes. / The segment contains a citation with structured elements (author names, title, publication year) and arXiv identifier that can be segmented into meaningful spans for training purposes. It combines natural language text (\"Susan Zhang et al.\") with code-like references to an academic paper format which is useful in span segmentation learning tasks. / The segment combines a citation with an arXiv preprint reference, which includes both structured elements (author names and publication year) that can be segmented into meaningful spans for learning purposes; it is clean but lacks context to fully represent the target domain's patterns. / Clear citation structure with identifiable spans for author names, publication year, and arXiv identifier; represents valuable patterns in academic referencing."}}
 {"raw": "allowing seamless integration with pretrained or co-trained transformer stacks We hypothesize that learned span prediction provides more semantically aligned and compression-efficient tokenization than fixed BPE or byte-level alternatives. To investigate this; we construct a multi-phase curriculum that bootstraps from synthetic segmentation labels and gradually introduces stream-type-aligned supervision:", "type": "mixed", "id": {"id": "118dcb0e-91f7-4aba-9bdb-f6ac53cdcc36"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical language and structured phrases that can be segmented into meaningful spans, such as \"pretrained or co-trained transformer stacks,\" which is valuable for learning span composition in both natural language processing (NLP) tasks related to code understanding and documentation. / The segment contains a mix of technical language and structured phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both domains (natural language with embedded programming concepts). It is clean but lacks context or examples to fully evaluate its training utility without additional data. / The text segment contains a mix of technical language and structured phrases that can be segmented into meaningful spans, such as \"pretrained or co-trained transformer stacks,\" which is valuable for learning span composition in both natural language processing (NLP) tasks related to code understanding. 
/ The segment contains a mix of technical language and structured phrases that can be segmented into meaningful spans, such as \"pretrained or co-trained transformer stacks,\" which is valuable for learning span composition in both natural language processing (NLP) tasks related to code understanding and domain-specific knowledge. / The segment contains a mix of technical language and structured phrases that can be segmented into meaningful spans, such as \"pretrained or co-trained transformer stacks,\" which are valuable for learning span composition in both natural languages (like hypotheses) and code-like structures (\"multi-phase curriculum\")."}}
 {"raw": "6This style of predictive regularization is aligned with latent structure filtering techniques in segmentation-aware pretraining [11, 12], and echoes classic Bayesian constraints in alignment models [24].", "type": "natural", "id": {"id": "1ec755d2-db28-42e5-9d40-f526899764c8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span segmentation in English text. / The segment contains a mix of technical terms and references to research, which can help the model learn span segmentation in both academic writing (natural language) and scientific notation or citation formats that are common across disciplines including code documentation comments. / Clear sentence structure with identifiable spans; useful for learning span composition in NLP tasks. / The segment contains a mix of technical terms and references to academic work, which can help the model learn span segmentation in both structured (code-like) expressions as well as more complex natural language constructs related to machine learning concepts. / Clear sentence structure with identifiable phrases and concepts suitable for span segmentation, representing valuable patterns in language understanding."}}
 {"raw": "integrate the controller vector 3 into the transformer encoder and perform full-model training:", "type": "mixed", "id": {"id": "681004cb-8fc2-4b11-92b9-6e43a996d7a2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The text contains a clear command with both action (integrate) and target elements, mixing programming terminology (\"controller vector\") within the context of machine learning model training instructions which is valuable for span segmentation in code-related natural language processing tasks. / The segment contains a clear command with both action (integrate) and object components, which can be segmented into meaningful spans for training purposes; it is clean but lacks context or additional examples that could improve its utility as part of the dataset. / The segment contains a clear command with both action (\"integrate\") and target elements (controller vector, transformer encoder), suitable for span segmentation in an integrated context of programming instructions and natural language explanations. / The segment contains a clear command with both action (integrate) and target elements ('controller vector' as noun, 'transformer encoder', 'full-model training'), representing valuable patterns for learning span composition in code-mixed contexts. / The segment contains a clear command with both action (integrate) and target elements (\"controller vector\" as the object, \"transformer encoder\", and context for training), which can be segmented into meaningful spans that represent valuable patterns in span composition across natural language instructions combined with technical terms."}}
 {"raw": "[81, 80], and may reduce overfitting in low-resource domains: Controller probing: Freeze routing weights and inject either random Or interpretable controller vectors $ into downstream encoders: This enables causal probing of span semantics and disen- tanglement, similar to frozen transformer interventions in multimodal Or multilingual settings [82, 81].", "type": "mixed", "id": {"id": "f7192a63-7fd5-4bf5-9ecb-7c986381ffa9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment mixes mathematical notation and text, which can confuse a span-aware model that lacks tokenization; clear spans are hard to identify without separating the elements. / The segment contains a mix of numbers, punctuation marks and words in an unclear context that may confuse the model's span segmentation capabilities. It lacks clear structure for meaningful spans identification. / Contains both structured data (indices, technical terms) and unstructured text; spans can be identified for training purposes. / The segment contains a mixture of numerical references, programming-like expressions (e.g., \"[81, 80]\"), and technical terminology that can be segmented into meaningful spans for learning span composition in both code context and natural language explanation. / The segment contains a mix of numerical references and programming-like expressions, but lacks clear structure for meaningful span segmentation; it is not coherent enough as training data."}}
 {"raw": "These directions aim to validate the modularity and generalization capacity of X-Spanformer across both structured and unstructured tasks. We plan to release diagnostic notebooks and controller visualization tools to support reproducibility and community benchmarking.\nVisualization Framework and Interpretability Interfaces\nInterpretability is central to the X-Spanformer framework, not only for debugging but for validating the emergence of structured behavior from differentiable routing. We introduce a", "type": "mixed", "id": {"id": "ef78979a-7adb-46d5-9a4a-bb1a8f6b5c4c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains structured elements like headings and lists, which can be segmented into meaningful spans; it also includes a mixture of technical descriptions (natural language) with references to tools or concepts that could benefit the model's understanding in both domains. / The segment contains structured elements like headings (\"Directions\", \"Visualization Framework and Interpretability Interfaces\") that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both natural language (text) and code-like structures (headings). / The segment contains both structured (directions, tasks) and unstructured language elements with clear phrases that can be segmented into meaningful spans for training purposes. It also includes technical terms relevant to the X-Spanformer framework's context (\"diagnostic notebooks\", \"controller visualization tools\"). 
/ The segment contains structured elements like headings and lists that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both textual descriptions (natural language) of the X-Spanformer framework's goals as well as its technical aspects such as \"diagnostic notebooks\" which could relate to code. / The segment contains structured elements like headings and lists, which can be segmented into meaningful spans; it is clean for training purposes but lacks sufficient context or content to fully represent the target domains of X-Spanformer. More diverse examples are needed."}}
 {"raw": "We release a standalone ONNX-compatible implementation along with training recipes and corpus construction guidelines to facilitate adoption across code, language, and hybrid domains.\n1\nIntroduction\nTransformer architectures underpin leading solutions in natural language understanding, program synthesis, and multimodal retrieval [4, 5, 6, 7].", "type": "mixed", "id": {"id": "9a5e3926-b439-427d-bc46-b7f02155259a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"ONNX-compatible implementation,\" which is indicative for the domain knowledge required in both natural language understanding (NLU) tasks related to code comprehension or documentation about software tools. / The segment contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"ONNX-compatible implementation,\" which is valuable for learning span composition in both natural language processing (NLP) tasks related to code understanding (\"standalone ONNX-compatible\") and domain adaptation across different types. It also includes structured elements like numbered lists indicating sections or steps within a guide that can be useful training data, though it lacks explicit examples of spans from the text itself which could have improved its score slightly. / The segment contains a mix of structured elements like headings, lists (indicated by the number), and paragraphs that can be segmented into meaningful spans for training purposes; it is clean with clear compositional value representing both natural language text structure as well as code-like formatting. 
/ Clear sentence structure with identifiable spans like \"ONNX-compatible implementation\", and phrases that can be segmented for learning, such as domain names (\"code, language, hybrid domains\") and technical terms related to the field of study (e.g., \"Transformer architectures\"). The text is clean but lacks context or examples which could improve its utility. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"ONNX-compatible implementation,\" which is valuable for learning span composition in both natural language processing (NLP) contexts like code adoption across domains (\"code\"), and understanding complex sentences."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER 3.4 Modality Typing Spans in source sequences often originate from heterogeneous subdomains ~such as natural language, programming syntax; structured identifiers, numeric expressions, Or markup. Accurate identification of a span's modality enables the model to apply domain-specialized logic (e:g , routing to type specific heads; enforcing syntax-aware constraints, Or improving retrieval and alignment). 8 To this end, we introduce a shallow classification head that predicts a", "type": "mixed", "id": {"id": "7682686f-55ee-4c27-ae90-b3a2e4ce4aa8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains both structured identifiers and programming syntax, representing valuable patterns for learning span segmentation in a tokenizer-free context. However, the presence of special characters like \"~\" may affect clarity slightly but overall maintains good compositional value. / The segment contains both structured identifiers (e.g., \"X-SPANFORMER SPAN-AwARE ENCODER\") and a mix of programming syntax, markup references (\"modality Typing Spans\"), which can help the model learn span segmentation in diverse contexts. / The segment contains clear structured elements from both programming syntax and domain-specific terminology, representing valuable patterns for learning span segmentation in a tokenizer-free context. However, the presence of typographical errors (\"~such as\", \"eg:\", etc.) slightly reduces clarity but does not significantly impact its overall utility. 
/ The segment contains both structured identifiers and programming syntax, which are clear spans for a span-aware model to learn from; however, it lacks coherence as an isolated example due to its abrupt ending (\"Or markup.\") / The segment contains clear spans related to both programming syntax and domain-specific terminology, which can help the model learn span segmentation across different domains. However, it lacks coherence as a standalone example for training purposes due to its fragmented nature; thus, additional context is needed in practice."}}
 {"raw": "Central to these models is a static segmentation stage that partitions input into fixed subword units, most commonly via Byte-Pair Encoding", "type": "natural", "id": {"id": "3039ad9a-7106-4572-a44a-2c71e727bf25"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with a central idea and supporting detail; spans can be identified as \"Central to these models is\" (span) and \"a static segmentation stage that partitions input into fixed subword units, most commonly via Byte-Pair Encoding\" (span). / Clear sentence structure with a coherent idea that can be segmented into meaningful spans; represents valuable patterns for learning span composition in the context of language processing. / Clear sentence structure with a central idea and supporting clause; spans can be identified as \"Central to these models is\" (subject) and \"a static segmentation stage that partitions input into fixed subword units, most commonly via Byte-Pair Encoding\" (object). / Clear sentence structure with a main clause and an explanatory phrase, representing valuable patterns for learning span segmentation in the context of language processing. / Clear prose with a single coherent idea; spans can be identified as \"Central to these models\", \"static segmentation stage\" and the rest of sentence, representing valuable patterns for learning span composition in English text."}}
 {"raw": "probability distribution over T predefined modalities for each span. Let $v_{ij} \\in \\mathbb{R}^d$ be the pooled embedding for span (i, j), produced by the same pooling operator used in the length estimator: $v_{ij} = \\mathrm{Pool}(H[i:j])$.", "type": "mixed", "id": {"id": "c2461c91-b03e-46f0-9b37-21de532a41c2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spans identified (e.g., \"probability distribution\", \"(i, j)\", and mathematical expressions), clean format with both language elements (\"Let $v_{ij} \\in \\mathbb{R}^d$\") and formal notation used in programming/code contexts. / Clear spans identified (e.g., \"probability distribution\", \"(i, j)\", and mathematical expressions), clean structure with a mix of notation suitable for span segmentation learning in both natural language descriptions (\"predefined modalities\") and code-like syntax (\"pooling operator used\"). / Clear spans identified (e.g., \"probability distribution\", \"(i, j)\", and mathematical expressions), clean format with both linguistic elements (\"Let $v_{ij}$\") and programming constructs (\"$v_{ij} = \\mathrm{Pool}(H[i:j])$\"). Represents valuable patterns for learning span composition in a mixed context. / Clear spans identified (e.g., \"probability distribution\", \"(i, j)\", and mathematical expressions), clean structure suitable for training a span-aware model that can handle both language constructs and code-like syntax. / The segment contains clear structured elements with both mathematical notation and programming-like expressions, representing valuable patterns for learning span composition in a tokenizer-free context. It is clean but lacks contextual clarity due to the presence of domain-specific terms (\"modalities\", \"pooled embedding\")."}}
 {"raw": "[1] or SentencePiece [2]. While effective on in-domain corpora, these pipelines impose immutable lexical boundaries that degrade under domain shift, obscure long-range compositional patterns in code and multilingual text [8], and incur substantial costs when vocabularies must be revised for novel syntactic or semantic phenomena.", "type": "mixed", "id": {"id": "0c3414ab-1c3f-4ebb-a80c-9924b7315410"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of citations and text discussing the limitations on domain-specific corpora, which can help in learning span segmentation for both citation formats (natural language) and technical references ([1], [2]). It is coherent but lacks explicit structural markers like punctuation that could aid training. / The text segment contains a mix of references and complex sentences that can be segmented into meaningful spans, such as \"[1] or SentencePiece [2].\" It is clean for training purposes but may lack direct compositional patterns due to its abstract nature involving domain-specific terminology. / The segment contains a mixture of citations and text discussing the limitations in domain adaptation for NLP pipelines, which includes both technical terms (code) related to SentencePiece and broader language concepts that are natural language based. It offers clear structural elements like references [1], [2] indicating span segmentation opportunities within academic writing or documentation context. / Clear sentence structure with identifiable spans; represents valuable patterns for learning span segmentation in English prose. 
/ The segment contains a mix of citations and text discussing the limitations in domain adaptation for NLP, which includes both technical language (code) references ([1], [2]) as well as natural language explanations about lexical boundaries affecting code comprehension; it has clear structural elements that can be segmented into meaningful spans."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\n6.1 Span Trajectory Viewer (trajectory_overlay)", "type": "mixed", "id": {"id": "7ce8e004-2362-414a-9314-7c1716659e38"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of identifiers and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both programming context (X-SPANFORMER) and descriptive text (\"SPAN-AwARE ENCODER\"). / Clear labels and structured headings for both language (natural) and technical terms/code-like identifiers, suitable for learning span segmentation in a mix context. / Clear span segmentation between names and descriptions; represents a useful pattern for learning both naming conventions (natural language) and structured identifiers (code). / Clear labels with a mix of uppercase and lowercase, indicating potential span boundaries; useful for learning diverse patterns in both text classification (natural language) and programming context recognition. / Clear spanable headers and labels, representing both domain-specific terminology (code) and descriptive text which is valuable for learning combined patterns in a tokenizer-free context."}}
 {"raw": "Experiments In this section, we analyze the emergent behavior and structural control capacity of the proposed X-Spanformer architecture through a series of controlled experiments. Our objectives are threefold: 1.", "type": "natural", "id": {"id": "1eff0030-8ce8-4083-86d9-d0136fcece76"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear, coherent prose with identifiable sentences and phrases suitable for training a span-aware model in understanding structured text segments. / Clear, coherent prose with identifiable sentences and phrases suitable for training a span-aware model on English text. / Clear, coherent prose with identifiable sentences and phrases suitable for training a span-aware model in understanding structured text. / Clear, coherent prose with identifiable sentences and phrases suitable for training a span-aware model in understanding sentence structures. / Clear, coherent prose with identifiable objectives and structured format suitable for training on span segmentation in a tokenizer-free context."}}
 {"raw": "The span trajectory viewer highlights how span selection stabilizes or evolves over training epochs.", "type": "natural", "id": {"id": "035fcd1e-e37a-40de-82b4-464a8adf68ef"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with potential for learning span segmentation around phrases like \"span trajectory viewer,\" \"how span selection stabilizes or evolves over training epochs.\" The content is coherent and representative of technical documentation, which can be useful in a mixed context but primarily fits the 'natural' category. / Clear sentence structure with identifiable spans (\"span trajectory viewer\", \"how span selection stabilizes or evolves over training epochs\"). Well-formed and coherent for learning patterns in language segmentation. / Clear sentence structure with a potential span pattern of \"span trajectory viewer\" and \"or evolves over training epochs.\" Suitable for learning context-based spans in text. / Clear sentence structure with a meaningful span (\"span trajectory viewer\", \"or evolves over training epochs\") that can be used to learn context and composition in text. / Clear sentence structure with a potential for learning span segmentation around phrases like \"span trajectory viewer\", \"or evolves over training epochs\". The content is coherent and clean, suitable as an example of complex technical documentation in the domain of machine learning or computer vision."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nSegmentation is traditionally decoupled from model training, treated as an irreversible preprocessing operation that lacks gradient flow and cannot adapt to downstream objectives. Recent work in character-aware encoding [9], tokenization-free models [10], and unsupervised segmentation in sequential domains [11, 12] demonstrates the potential of adaptive boundaries.", "type": "mixed", "id": {"id": "8c63fa5d-4900-42c9-ad06-965710ce8052"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code documentation or discussions. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code documentation. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"SPAN-AwARE ENCODER,\" etc., which are clear indicators for the model to learn span segmentation in both natural language text and code references. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" which is likely an acronym or specific term relevant to the domain; it also includes references [9], [10], and [11-12] indicating structured citations. 
It demonstrates clear structural elements suitable for training a span-aware model in mixed content contexts that involve both natural language descriptions of concepts (like \"Segmentation\") as well as technical jargon (\"character-aware encoding,\" etc.). The segment is clean, coherent with domain-specific terminology likely to be encountered by the target audience and thus valuable from an educational perspective. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-SPANFORMER,\" \"SPAN-AwARE ENCODER,\" etc., which are clear markers for the model to learn from in both natural language context."}}
 {"raw": "To verify that differentiable span selection converges toward semantically meaningful structures under entropy annealing; 2. To evaluate the fidelity and variance of controller vector injection across multiple integration pathways; 3. To probe the interpretability and stability of span routing under synthetically constructed and naturalistic corpora. Unlike traditional benchmark-driven evaluations, our methodology emphasizes structural diagnostics and interpretability over end-task performance. This is", "type": "mixed", "id": {"id": "728426e4-60d5-476c-9d9f-16f78b8f3c83"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of structured phrases and technical terms that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding contexts (e.g., \"differentiable span selection\") and natural language descriptions (\"semantic structures\"). / The segment contains a mixture of structured commands and descriptive text, which can help the model learn span segmentation in both programming contexts (commands) and explanatory prose. It is clean but lacks explicit examples that could improve its training utility further. / The segment contains a mix of structured phrases and technical terms that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding contexts (e.g., \"differentiable span selection\") and natural language descriptions (\"semantic structures\", \"interpretability\"). / Contains a mix of structured phrases and technical terms, with clear numerical identifiers suggesting span segmentation opportunities; however, it lacks coherence as an isolated segment for training purposes. 
/ Contains a mix of structured phrases and technical terms that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding contexts (e.g., \"differentiable span selection\") and natural language explanations (\"semantic structures\", \"interpretability\")."}}
 {"raw": "However, these approaches often omit linguistic structure and do not offer interpretable segmentation aligned with phrase-level semantics. Drawing on the X-bar schema from generative grammar [13], we posit that raw token streams (for example, source code, natural language, or symbolic hybrids) exhibit latent hierarchical units that can be learned directly from data. We introduce X-Spanformer, a span-based segmenter that formulates boundary detection as a pointer-network prediction task [3].", "type": "mixed", "id": {"id": "3053883e-1de3-4a49-881b-e5e2daac6f6b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of technical terms and structured sentences that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and programming concepts (code). / The segment contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"linguistic structure,\" \"phrase-level semantics,\" etc., which are valuable for learning span composition in both natural language processing (NLP) tasks related to code analysis. / The segment contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"linguistic structure,\" \"phrase-level semantics,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code analysis. / The segment contains a mix of technical jargon and structured information that can be segmented into meaningful spans, such as \"linguistic structure,\" \"phrase-level semantics,\" etc., which are valuable for learning span composition in both natural language processing (NLP) tasks related to code analysis. 
/ The segment contains a mix of technical terms and concepts, with clear references to linguistic structures (X-bar schema) that can be segmented into meaningful spans for learning purposes in both programming contexts (\"source code\") and natural language processing tasks."}}
 {"raw": "Span Embedding Each retained span (i,j) ∈ S' must be mapped to a fixed-size vector representation suitable for downstream fusion. This embedding is intended to capture both the internal structure and the contextual salience of the span, serving as a condensed representation of its semantic or syntactic role. Effective span encodings have been shown to improve performance in question answering, entity linking, and structured generation tasks", "type": "mixed", "id": {"id": "c8b4b34d-2d31-44d3-95a2-7b4f2bf3d9ac"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"Span Embedding,\" \"retained span (i,j),\" etc., which are useful for learning semantic roles in text. It is coherent but lacks explicit context or examples of the concepts discussed; however, it still provides a good basis for training on natural language understanding tasks related to embeddings and their applications. / The segment contains a mixture of technical terms and phrases with clear boundaries that can be segmented into meaningful spans, such as \"Span Embedding,\" \"retained span (i,j),\" etc., which are useful for learning the composition of complex expressions in both natural language text and code. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Embedding\", \"fixed-size vector representation\", etc., which are relevant for learning span composition in both natural language processing (NLP) tasks like question answering or entity linking. 
/ The text segment contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Embedding,\" \"fixed-size vector representation,\" etc., which are useful for learning span composition in both natural language processing tasks like question answering or entity linking. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Embedding,\" \"fixed-size vector representation,\" etc., which are useful for learning span composition in both code-like structures (e.g., variable names) and natural language descriptions."}}
 {"raw": "Beginning with a compact one-thousand-unit BPE seed, X-Spanformer learns to emit overlapping, variable-length spans that are softly typed by modality (for example, code, natural language, or identifier) and capped per sequence via a learned length estimator. Span representations are aggregated by pooling and integrated into downstream transformer encoders, enabling joint optimization of segmentation and task-specific objectives.\n1.1 Contributions\nThis paper presents the following contributions: 1. A", "type": "mixed", "id": {"id": "177ae912-46f6-451f-809a-e08b741982f9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of structured elements like headings, lists (contributions), and technical terms that can be segmented into meaningful spans for training purposes; it's clean but lacks context to fully evaluate its representativeness. / The segment contains a mixture of structured elements like headings, lists (contributions), and technical descriptions that can be segmented into meaningful spans for learning span composition in both coding contexts (\"code\", \"natural language\") and document structure. It is clean but lacks context to fully evaluate its training utility without additional data. / The segment contains clear structured elements like headings, lists (contributions), and spans that can be segmented into meaningful parts for learning span composition in a tokenizer-free context. It also includes both natural language text (\"a compact one-thousand-unit BPE seed\") and code-like syntax (\"1.1 Contributions\"). / Contains structured elements with clear segmentation opportunities, including headings and numbered lists that represent valuable patterns for learning span composition in both text (natural language) and structure recognition tasks. 
The content is clean but lacks context or examples of actual code snippets which could improve its utility as training data. / The text segment contains a mix of structured elements like headings, lists (contributions), and technical terms that can be segmented into meaningful spans for training purposes; however, it lacks clear delimiters between different types of content which may affect clarity slightly."}}
 {"raw": "formalization of tokenizer-free segmentation as a span-prediction problem grounded in X-bar theory, instantiated with a pointer network featuring dynamic span capping and modality typing. 2. A curriculum learning paradigm that bootstraps span discovery from synthetic BPE labels and progressively shifts to contrastive and type-aware supervision.", "type": "mixed", "id": {"id": "7a3e9678-1b21-47a2-a652-86dc446c53ac"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a mix of technical terms and concepts that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both programming (e.g., \"pointer network\", \"dynamic span capping\") and natural language contexts (\"formalization\", \"curriculum learning paradigm\"). / The segment contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"tokenizer-free segmentation,\" \"span-prediction problem,\" etc., which are valuable for learning span composition in both natural language processing (NLP) tasks related to code understanding. / The segment contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"tokenizer-free segmentation,\" \"span-prediction problem,\" etc., which are valuable for learning span composition in both natural language processing (NLP) tasks related to code understanding. / Contains a mix of technical terms and concepts that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (natural language) and programming constructs/methodologies (code). The text is clean but may require domain-specific knowledge to fully understand. 
/ The segment contains a mix of technical terms and concepts that can be segmented into meaningful spans, such as \"tokenizer-free segmentation,\" \"span-prediction problem,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code understanding."}}
 {"raw": "We denote:\n$D = \\{(x^{(i)}, y^{(i)})\\}_{i=1}^{N}$: training corpus with optional supervision; $f_\\theta$: differentiable span scorer;\n$g_\\phi$: controller aggregator; $\\tilde{s}$: controller vector, computed as a relevance-weighted sum over pooled span embeddings:\n$$\\tilde{s} = \\sum_{k=1}^{K} \\alpha_k s_k \\tag{29}$$\n$$\\alpha_k = \\frac{\\exp(w_k)}{\\sum_{\\ell=1}^{K} \\exp(w_\\ell)}, \\quad w_k = g_\\phi(s_k, \\delta_k, \\mathrm{conf}_k) \\tag{30}$$", "type": "code", "id": {"id": "9d76d459-6b9b-456b-950d-ade73d4da0fe"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear structure with identifiable spans such as equations, variables (D), functions or methods ($(x^{(i)}, y^{(i)})$; $f_\\theta$) and mathematical expressions that are essential for learning span segmentation in programming contexts. The text is well-formed but lacks natural language context which could be beneficial if mixed content was desired. / Clear structure with identifiable spans like equations, variables (D), functions or methods ($(x^{(i)}, y^{(i)})$; $f_\\theta$) and mathematical expressions ($g_\\phi$; $\\tilde{s}$). Well-formed for training purposes in a tokenizer-free context. / The segment contains clear, structured elements typical of programming notation and mathematical expressions that can be segmented into meaningful spans for a span-aware model to learn from. It is clean but lacks context or explanation which might affect its utility as standalone training data. / Clear structured elements with identifiable spans; represents valuable patterns for learning span composition in programming context. / Clear structure with identifiable spans like equations and variables, suitable for learning span segmentation in programming context."}}
 {"raw": "\"CoLT5: Faster Long-Range Transformers with Conditional Computation\". In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, 2023, pp. 5085-5100.", "type": "mixed", "id": {"id": "6dbdaf80-3470-419e-a51c-69b3efd70721"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a citation with structured elements like author names, title of work, conference details and page numbers which can be segmented into meaningful spans for training purposes. It is clean but lacks natural language context as it reads more like an academic reference than prose or code content. / Contains structured elements like conference name, year, and page numbers that can be segmented into meaningful spans for training purposes. The text is clean but lacks context to fully represent the target domain's complexity. / The segment contains a mixture of structured citation elements (title, conference details) and unstructured text (\"CoLT5\", \"Faster Long-Range Transformers with Conditional Computation\"). Clear spans for both code-like references to an academic paper/conference proceedings as well as natural language descriptions can be identified. / The segment contains a citation with structured elements (title, conference details) that can be segmented into meaningful spans for training purposes and represents valuable patterns in span composition across both academic text and formal citations. / The segment contains a citation with structured elements like author names, title of work, conference details and page numbers which can be segmented into meaningful spans for training purposes. 
It is clean but lacks context or content that could provide more learning opportunities on span composition in natural language processing tasks."}}
 {"raw": "Architectural guidelines for embedding the span predictor into transformer encoders through compositional pooling and minimal layer extensions: 4. A_ proposed evaluation framework covering compression ratio, contrastive alignment, span entropy analysis, and interpretability visualizations, accompanied by an ONNX-compatible implementation and complete training recipes: 2 Related Work 2.1 Static Subword Tokenization Most transformer pipelines rely on offline subword segmentation. Byte-Pair Encoding (BPE) con-", "type": "mixed", "id": {"id": "d38c5c97-519a-4c1d-804e-5f2a4503bba9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of technical terms and structured elements like headings, which can help the model learn span segmentation in both coding contexts (e.g., \"transformer encoders\") and natural language descriptions (\"architectural guidelines\"). However, it lacks coherence as an isolated example. / Contains a mix of technical terms and structured content, though somewhat fragmented for direct training use. Could benefit from clearer segmentation or additional context to improve learning patterns. / Contains a mix of technical terms and structured elements like headings, lists (2), subheadings (e.g., \"A_ proposed evaluation framework\"), which can help the model learn span segmentation in both natural language text and code-like constructs. / Contains a mix of technical terms and structured information that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both language (natural) and programming context (code). The text segment is clean but lacks coherence due to its fragmented nature; however, it still offers useful examples. 
/ The segment contains a mix of technical descriptions and references to methods (like BPE), which can help the model learn span segmentation in both structured programming contexts as well as more descriptive, natural language explanations related to machine learning concepts. However, it lacks coherence due to fragmented sentences (\"Most transformer pipelines rely on offline subword segmentation... Byte-Pair Encoding (BPE) con-\")."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nMean pooling: A simple, position-invariant aggregation computed as the average of constituent token vectors:\n$s_{ij} = \\frac{1}{j - i + 1} \\sum_{k=i}^{j} h_k.$\nThis approach is computationally efficient, robust to span length, and has proven effective in prior span-focused architectures such as BiDAF and SpanBERT [15, 32].", "type": "mixed", "id": {"id": "8df2eab1-d5c1-4bec-b37b-a4ff6870e3fc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains both technical terminology and mathematical expressions, providing diverse patterns for span segmentation in a mixture of language types. / The segment contains a mix of technical terms and mathematical expressions, which are structurally clear for span segmentation; it is clean but lacks context or explanation that could improve learning utility. / The segment contains a mix of technical terms and mathematical expressions, which can help the model learn span segmentation in both domains. However, it lacks context for some phrases like \"Mean pooling\" or \"[15, 32]\", making full comprehension challenging without additional data. / The segment contains a mixture of technical terms and mathematical expressions, which are clear structures for span segmentation in both coding contexts (e.g., \"Mean pooling\") and formal descriptions (\"This approach is computationally efficient\"). It represents valuable patterns across natural language explanations intertwined with code-like notation. / The segment contains a mix of technical terms, acronyms (X-SPANFORMER), and mathematical expressions that can be segmented into meaningful spans for training purposes; it is clean but lacks context clarity due to the inclusion of references [15, 32]."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER transformer parameters: Model optimization proceeds via the composite loss: $\\mathcal{L}_{\\text{total}} = \\mathcal{L}_{\\text{task}} + \\beta_1 \\mathcal{L}_{\\text{span}} + \\beta_2 \\mathcal{L}_{\\text{ent}}$, (31) where: $\\mathcal{L}_{\\text{task}}$: task-aligned objective (e.g., cross-entropy, contrastive alignment); $\\mathcal{L}_{\\text{span}} = \\mathrm{KL}(P_{\\text{gold}} \\,\\|\\, P)$: span KL alignment; $\\mathcal{L}_{\\text{ent}} = -\\lambda_{\\text{ent}} H(P)$: entropy regularization term. To isolate structural behavior, we evaluate: Span distribution entropy $H(P) = -\\sum_{(i,j)} P_{ij} \\log P_{ij}$; Controller gate variance $\\mathrm{Var}(\\sigma(W_g s))$; Span overlap rate: fraction of selected spans sharing", "type": "mixed", "id": {"id": "21b861a5-d8ed-4894-899f-ce54bbd90129"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured expressions that can be segmented into meaningful spans, such as \"transformer parameters,\" \"composite loss,\" etc., which are valuable for learning span composition in both code-like structures (e.g., equations) and natural language descriptions. / The segment contains a mix of technical terms, mathematical expressions and structured information that can be segmented into meaningful spans for learning span composition in both coding contexts (like loss functions) and natural language explanations about the model's optimization process. However, some parts may require domain-specific knowledge to fully understand their structure as training data. / The segment contains a mix of technical terms, mathematical expressions and structured formulas that represent valuable patterns for learning span segmentation in both programming contexts (code) and formal descriptions or documentation related to transformer models. However, the presence of complex symbols like \"∫\" may pose readability challenges; thus it is not perfect but still useful as mixed content training data. 
/ The segment mixes technical terms and mathematical expressions without clear sentence structures, making it difficult to identify meaningful spans for a tokenizer-free model focused on span-aware encoding. Additionally, the presence of symbols like \"=\", \"+\", \"&\" disrupts readability in natural language context but are common in code-like notation; however, this mixed format may confuse both models trained specifically or broadly across these domains without further clarification and separation into distinct segments (natural vs. technical). / The segment contains a mixture of technical terms and mathematical expressions, which can be segmented into meaningful spans for learning span composition in both programming (code) contexts as well as formal descriptions or documentation related to machine learning models. However, the presence of symbols like \"=\", \"+\", and semicolons may pose challenges due to their dual role; they are used mathematically but also structurally separate code elements from natural language text."}}
 {"raw": "structs a fixed vocabulary by iteratively merging frequent symbol pairs extracted from a training corpus [1]. SentencePiece builds on unigram language models to select subword tokens that maximize corpus likelihood [2]. Such methods yield efficient lookup tables and have become ubiquitous in large-scale language models [4, 5],", "type": "mixed", "id": {"id": "546fc1ec-deb5-45a8-bdf6-d809a7a78de0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains both structured programming concepts and language model descriptions, which can help the X-Spanformer learn span segmentation in a diverse context. It is clean but lacks explicit examples of spans to be learned directly from this text alone. / Contains both structured programming concepts and language descriptions, useful for learning span segmentation in a diverse context. / The text contains both technical language and references to methods, which can help the model learn span segmentation in a context that includes programming concepts alongside prose explanations. However, it lacks clear delimiters for spans such as punctuation or code block markers; thus it's not ideal but still valuable due to its compositional value of mixed content types. / The segment contains both technical language and references to methods (natural), along with citations that suggest a structured format for academic or research purposes, which can help the model learn span segmentation in scientific texts. However, it lacks explicit code constructs but includes valuable patterns related to natural language processing tasks. / Contains both structured language (natural) and technical references to methods used in NLP, which can help the model learn span segmentation across different domains. 
However, it lacks clear sentence boundaries or explicit code examples that could improve clarity for training purposes."}}
 {"raw": "Local self-attention: A lightweight transformer block operates over the token subsequence Hli:j], enabling the model to capture internal asymmetries and intra-span dependencies: Sij SelfAttn( H[i:j]).", "type": "mixed", "id": {"id": "faa1907b-52f0-47f4-92a0-b11f251c3e07"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and notation that can be segmented into meaningful spans, such as \"Local self-attention,\" \"lightweight transformer block,\" etc., which are valuable for learning span composition in both code-like structures (e.g., Hli:j]) and natural language descriptions. / The segment contains a mix of programming concepts and notation, with clear spans for both the function name \"Local self-attention\" (natural language) and its corresponding formula/Symbolic representation (\"code\"). It captures internal dependencies within code constructs which is valuable training data. / The segment contains both programming constructs (self-attention, token subsequence) and mathematical notation (\"Sij SelfAttn(H[i:j])\"), representing valuable patterns for learning span composition in a tokenizer-free context. It is clean but could be clearer with proper spacing or parentheses around the function call to improve readability. / The segment contains a mix of programming concepts and notation, with clear spans for both the function name \"Local self-attention\" (natural language) and its associated pseudo-code-like expression (\"code\"). It demonstrates internal dependencies within code constructs that can be beneficial to learn span segmentation. / The segment contains a mixture of both programming constructs and terminology, which can help the model learn span segmentation in diverse contexts. 
However, it lacks clarity due to unconventional formatting (e.g., Hli:j]). A cleaner version would improve its utility as training data."}}
 {"raw": "Syntax-Enhanced Transformer for Neural Machine Translation\". In: arXiv preprint arXiv:2002.01160 (2020). URL: https://arxiv.org/abs/2002.01160. [46] Colin Raffel et al.", "type": "mixed", "id": {"id": "97c79843-728c-4398-a357-c1845b4ede31"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a citation with structured elements (title, publication details) that can be segmented into meaningful spans for training purposes; it is clean and coherent but lacks context to fully represent the target domain's patterns. / The segment contains a mix of citation elements (authors, title) and URLs which can be segmented into meaningful spans for training purposes; it is clean but lacks context or content beyond the reference itself. / The segment contains a citation with structured elements (title, publication details) that can be segmented into meaningful spans for learning purposes; however, it lacks context and coherence as an isolated example. / The text segment contains a citation with structured elements (title, publication details) that can be segmented into meaningful spans for learning span composition in both scholarly and web contexts. It is clean but lacks context to fully represent the target domain's patterns. / The segment contains a citation with structured elements (title, publication details) that can be segmented into meaningful spans for training purposes; it represents valuable patterns in both academic referencing and URL formatting."}}
 {"raw": "sequence: E = [e1, - eL, Si1j1)*", "type": "code", "id": {"id": "30514b5f-06e5-4a41-8c1c-1d53eb7ee1ec"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a clear mathematical expression with identifiable spans for variables and constants, suitable as training data for span segmentation in programming contexts. / Clear sequence of elements with identifiable spans; well-suited for learning span composition in programming context. / Clear sequence pattern with identifiable spans; well-suited for learning span segmentation in programming contexts. / Clear sequence pattern with identifiable spans; well-suited for learning span composition in programming contexts. / The sequence lacks clear delimiters for spans; it's unclear how to segment the text into meaningful parts, which is crucial for training a span-aware model like X-Spanformer. Additionally, it contains both natural language and code-like elements without proper context or structure that would aid in learning patterns of composition effectively."}}
 {"raw": "Experimental Philosophy. Our experiments are structured not as competitive benchmarks, but as architectural diagnostics to validate the inductive mechanism of span-aware routing: This aligns with prior work in structural probing and latent routing models [28, 51, 56].", "type": "natural", "id": {"id": "07fcbc4a-81cc-4f18-8812-ed5e4ff947c6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear sentence structure with identifiable phrases; useful for training on span segmentation in academic prose. / The text contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Experimental Philosophy,\" \"span-aware routing,\" etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code documentation or academic writing. / Clear and coherent prose with identifiable spans for training; aligns well as a single-domain example. / Clear sentence structure with identifiable spans; useful for learning span composition in NLP tasks. / Contains a mix of technical terms and phrases that can be segmented into meaningful spans, aligning with the inductive mechanism for span-aware routing as mentioned in prior work references [28, 51, 56]. The text is clean but lacks context-specific examples or code snippets."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER 2.2 Character-Level and Token-Free Models To mitigate subword brittleness, several works propose bypassing static vocabularies in favor of character- or byte-driven encoding: Charformer applies gradient-based subword tokenization to learn latent splits during pretraining; compressing sequences without a predefined vocabulary [9]. CANINE directly encodes Unicode codepoints with down-sampling and up-sampling layers, offering a tokenization-free encoder that matches BPE", "type": "mixed", "id": {"id": "10ca01c3-d914-4d5e-aac2-1277d8f3a85b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains both technical terms and structured information that can be segmented into meaningful spans for training purposes, though it may require additional context to fully understand the domain-specific references like \"X-SPANFORMER\" and \"[9].\" / The segment combines technical descriptions (code-like) and references to research works, which can help the model learn span segmentation in both contexts. However, it lacks clear delimiters for spans due to its dense nature; thus a moderate score is given with an indication that additional examples could improve clarity on structural boundaries. / The segment contains a mix of technical terms and descriptions that can be segmented into meaningful spans, such as \"X-SPANFORMER SPAN-AwARE ENCODER\", \"[9]\", etc., which are valuable for learning span composition in both natural language processing (NLP) tasks related to code documentation. / The segment contains a mix of technical descriptions and references to models, which can help the model learn span segmentation in both contexts. However, it lacks clear examples or structured data that could improve its training utility further. 
/ Contains both technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context."}}
 {"raw": "Note: All results in this section are presented for illustrative and developmental purposes. Empirical benchmarks for generalization, transferability, and performance scaling are left to future work as model weights stabilize and structure supervision matures. 5.1 Experimental Setup We design our experimental pipeline to test the structural expressivity and routing fidelity of X-Spanformer in isolation from large-scale benchmark supervision. Following best practices in latent structure induction [55, 63,", "type": "mixed", "id": {"id": "7dc2c023-fef5-44db-85fe-5112501847cc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of structured text (natural language) and technical terms (\"Empirical benchmarks,\" \"X-Spanformer\"), which can help the model learn span segmentation in both domains, though it lacks clear delimiters for spans. / Clear narrative structure with identifiable spans (e.g., \"Note\", headings, and phrases). Well-suited for learning contextual span segmentation in a tokenizer-free model like X-Spanformer. / The text segment contains a mix of structured elements like headings, notes for future work references (e.g., \"Note\"), and technical terms (\"Empirical benchmarks\", \"structural expressivity\"). These can help the model learn span segmentation in both natural language contexts as well as specialized terminology. / The segment contains a mix of structured text with technical terms and references to future work, which can help the model learn span segmentation in both narrative descriptions (natural language) and formal statements about experimental setups or benchmarks related to code-like structures. 
/ The text contains a mix of technical jargon and formal language but lacks clear, identifiable spans that can be easily segmented for training purposes; it's too abstract without concrete examples or structured patterns."}}
 {"raw": "baselines on transfer tasks [10]. These approaches remove offline heuristics; but they do not explicitly model higher-order linguistic or symbolic structures. 2.3 Unsupervised and Differentiable Segmentation Beyond character-level models; unsupervised segmentation methods aim to learn meaningful units directly from raw streams. Morfessor induces morpheme-like units via minimum description length objectives [14].", "type": "mixed", "id": {"id": "b0bd14f7-4e8a-4226-afa0-043efc252da8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"baselines on transfer tasks,\" \"offline heuristics,\" etc., which are useful for learning span composition in both natural language processing (NLP) contexts like code comments or documentation. / Clear sentence structure with identifiable phrases and concepts related to linguistic structures, suitable for learning span segmentation in a tokenizer-free context. / The segment contains a mix of technical terms and phrases that could be useful for learning span segmentation in both coding contexts (e.g., \"baselines\", \"transfer tasks\") and natural language descriptions (\"unsupervised segmentation methods\"). Despite some punctuation issues, it maintains structural clarity. / Contains both structured language and technical terms that can be segmented into meaningful spans; represents valuable patterns for learning span composition in a domain combining coding concepts with academic writing. / Clear sentences with identifiable phrases; useful for learning span segmentation in text."}}
 {"raw": "Importantly; standard transformer encoders can process this composite sequence without architec- tural modification, as the inserted vectors match token dimensionality and participate in multi-head attention identically [5, 7, 35]. This approach mirrors the insertion of learned prompt tokens Or re- trieved vectors into encoder-decoder models without altering core attention mechanics. However,; to prevent positional ambiguity and enforce separation between token and span-originated embeddings, we optionally", "type": "mixed", "id": {"id": "9a87028f-60a3-4e46-a363-252b4148e9eb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical language and punctuation that can be segmented into meaningful spans, such as \"Importantly; standard transformer encoders\" or \"[5, 7, 35].\", which are clear structures useful for training span-aware models. / The segment contains a mix of technical language and punctuation, with clear references to concepts like \"transformer encoders,\" which can be segmented into meaningful spans for training purposes; however, the presence of semicolons may introduce ambiguity in span boundaries. / The segment contains a mixture of technical language and punctuation, with clear references to concepts like \"transformer encoders,\" which can be segmented into meaningful spans for learning purposes; however, the presence of semicolons may introduce ambiguity in span identification. / The segment contains a mix of technical language and punctuation, with clear delimiters for potential spans (e.g., semicolons). It includes references to academic citations which could be useful in learning context-aware span segmentation; however, the presence of multiple periods may confuse tokenization. 
/ The segment contains a mix of technical language and punctuation that can be segmented into meaningful spans, such as \"Importantly,\" \"transformer encoders\", \"[5, 7, 35]\", etc., which are useful for learning span composition in both natural language processing (NLP) tasks related to code comprehension."}}
 {"raw": "In the neural domain, probabilistically masked language models (PMLM) inte- grate segmentation into pretraining with masked span prediction [12]. Other works learn segmen- tation boundaries for text-to-text generation by optimizing a downstream reconstruction loss [11]. While these methods introduce differentiability; they lack explicit linguistic priors and produce non-overlapping, fixed partitions: 2.4 Span-Based and Pointer-Network Models Pointer networks offer & mechanism for predicting variable-length", "type": "mixed", "id": {"id": "90d81561-b626-44d4-b1d2-27353cce4354"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"neural domain,\" \"probabilistically masked language models (PMLM),\" etc., which are relevant for learning span composition in both natural language processing tasks. / The segment contains a mix of technical terms and references to methods, which can help the model learn span segmentation in both structured (code-like) elements like \"PMLM\" or \"[12]\" as well as natural language descriptions (\"probabilistically masked\", etc.). However, it lacks clear delimiters for spans. / The segment contains a mixture of technical terms and phrases with clear structure, though it lacks coherence due to incomplete sentences; however, it's representative for learning span segmentation in both language processing tasks (natural) and code-related contexts (code). / The segment contains a mix of technical terms and references to models, which can help the model learn span segmentation in both domains; however, it lacks coherence due to fragmented sentences (\"Pointer networks offer & mechanism for predicting variable-length\"). 
Structurally clear spans are identifiable but could be improved. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"probabilistically masked language models,\" \"segmentation boundaries for text-to-text generation,\" etc., which are relevant to both natural language processing (NLP) tasks like PMLM and Pointer-Network Models. It is clean but lacks coherence due to the abrupt ending of a sentence (\"Pointer networks offer & mechanism...\")."}}
 {"raw": "apply specialized feature encodings (such as segment tags, span-type biases, or learned offsets) to preserve structural grounding [30, 37, 18]. Relative offsets: Span positions can be encoded relative to the input sequence to model anchoring [37].", "type": "mixed", "id": {"id": "a95fa8ad-ede6-4eac-9d2f-3065f674dda0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains both technical terms and references to structured elements like \"segment tags\" which can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context. It is clean but lacks explicit natural language structure clarity due to its specialized content type. / The segment contains both technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context. It is clean but lacks explicit examples of natural language or code constructs separately; however, the mixture itself provides diverse training data. / The segment contains a mixture of technical terms and phrases that can be segmented into meaningful spans, such as \"specialized feature encodings,\" \"[30, 37, 18],\" which are likely references to literature or studies; this provides valuable patterns for learning span composition. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"specialized feature encodings,\" \"[30, 37, 18],\" which are likely references to literature or studies; this is valuable for learning span composition in both natural language processing (NLP) contexts related to code documentation. 
/ The segment contains a mix of technical terms and references that can be segmented into meaningful spans, such as \"specialized feature encodings,\" \"[30, 37, 18],\" which are likely citations or code snippets; it is clean for training purposes with clear structural elements."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER Gigaword Compression (Optional): For assessing semantic condensation and routing sparsity under low-token summarization windows [70]. Pseudo-structured Sequences: A mix of instructional data (recipes, dialog trees) and semi-nested markdown documents to probe structural generalization over latent hierarchical cues. Metrics. To isolate architectural effects, we evaluate span selection and routing behavior using the following indicators: Span entropy: $H(P) = -\\sum_{(i,j)} P_{ij} \\log P_{ij}$,", "type": "mixed", "id": {"id": "aab50712-939d-4874-9873-c3afdb60e22a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of structured elements like pseudo-coded sequences and markdown documents, which can help the model learn span segmentation in both programming contexts (code) and documentation formats (natural language). However, it lacks clarity on how these spans are defined or used within an actual encoding process. / The segment contains a mix of instructional data and markdown documents, which can help the model learn span segmentation in both structured (code-like) and unstructured contexts. However, it lacks clarity due to its pseudo-structure; thus not ideal for training purposes without further refinement or context clarification. / Contains a mix of instructional data and markdown, with clear pseudo-structured sequences that can be segmented into meaningful spans for learning hierarchical cues in span-aware encoding. / The segment contains a mix of instructional data, semi-nested markdown documents (code-like structures), and mathematical expressions which can help the model learn span segmentation across different contexts. 
However, it lacks clarity in some parts due to its pseudo-structured nature; thus, it's not perfect but still valuable for training purposes. / Contains a mix of instructional data and markdown, with clear structured elements like pseudo-coded sequences for learning span segmentation."}}
 {"raw": "SpanBERT extends this idea in pretraining by masking contiguous spans and predicting their content [15]. In speech and vision, learned segmenters often output overlapping proposals that improve detection and alignment [16, 17]. However, to our knowledge no prior work unifies pointer-based span prediction with linguistically grounded structure for text segmentation in transformer encoders. 2.5 Summary Existing segmentation strategies fall into three broad categories: offline subword tokenization,", "type": "natural", "id": {"id": "9cfbe2ee-5895-4e1c-87db-dc3af48e7256"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spanable phrases and terms related to text segmentation; coherent for learning patterns in NLP context. / Clear sentence structure with identifiable spans; useful for learning span segmentation in text. / The text segment contains clear linguistic structures and spans that can be segmented meaningfully, such as \"SpanBERT,\" \"contiguous spans,\" etc., which are relevant for training a span-aware model on English language texts. It is clean but lacks context to fully represent the domain of use cases or applications in natural languages. / Contains both structured text and references to concepts (e.g., SpanBERT, contiguous spans) that can be segmented into meaningful educational content for a span-aware model. However, it lacks clear code constructs or complete sentences which may limit its utility slightly compared to pure natural language segments. / Contains a mix of technical terms and references, with clear spanable phrases like \"contiguous spans,\" which are useful for learning segmentation patterns in both language processing (natural) and structured data interpretation contexts. 
The text also includes citations that could be relevant to understanding domain-specific knowledge integration into models."}}
 {"raw": "[50] Chenchen Ma, Jing Ouyang, and Gongjun Xu. \"Learning Latent and Hierarchical Structures in Cognitive Diagnosis Models\". In: Psychometrika 88.1 (2023), pp. 175-207. DOI: 10.1007/s11336-022-09867-5. [51] Yi Tay et al. \"Efficient Content-Based Sparse Attention with Routing Transformers\".", "type": "mixed", "id": {"id": "f2e2ca19-a3c0-433c-a8c4-cd0b1cf9131d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of citation references with structured elements like authors, titles, publication details which can be segmented into meaningful spans for training purposes. However, the presence of special characters and missing spaces may affect clarity slightly but still retains compositional value. / Contains a mix of citation and abstract text, lacking clear span segmentation patterns for training purposes. The presence of LaTeX formatting complicates the extraction of clean segments suitable as examples. / The segment mixes citation formats and lacks clear, consistent spans for training; it contains a mix of structured references that may confuse the model's learning process. / The segment contains a mixture of citation references and academic content, which can help the model learn to distinguish between different types of spans such as author names, titles, publication details (DOI), page numbers, etc., in both natural language text and code-like structures. / The segment contains a mix of citation and reference formatting which lacks clear, consistent spans for training purposes; it is not coherent as standalone text or structured data."}}
 {"raw": "character-level or token-free encoders, and unsupervised boundary learners. Offline methods such as BPE and SentencePiece offer efficient lookups but impose static vocabularies that fragment long-range structures and fail under domain shift [1, 2, 8].", "type": "mixed", "id": {"id": "a96ce30a-361d-4b32-9fe6-777a064c47e8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"character-level or token-free encoders,\" which is useful for learning span composition in both natural language processing (NLP) tasks related to code understanding (\"Offline methods\") and domain-specific knowledge. / The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of language processing tasks. It is clean and coherent without any domain-specific jargon or code constructs present. / Clear sentence structure with potential for learning long-range dependencies and domain-specific vocabulary segmentation. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"character-level or token-free encoders,\" which is valuable for learning span composition in both natural language processing (NLP) contexts like code comments/documentation but also general text. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding concepts (e.g., \"character-level or token-free encoders\") and natural language descriptions (\"Offline methods such as BPE and SentencePiece\")."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\n[Figure labels: Span (i1,j1), Span (i2,j2), Span (i3,j3); Controller Token; Prefix Vector; Global Conditioning; h' = LayerNorm(h + W s); softmax(w1, w2, w3); Decoder]", "type": "mixed", "id": {"id": "297ba430-3a26-40b6-a3bc-44954f3fd413"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of programming constructs and annotations, with clear span notations that can be used for learning structured patterns in both domains. However, it lacks context which might affect its utility as standalone training data. / The segment lacks clear span boundaries and contains ambiguous references like \"Span (i3,j3)\" which are not well-defined for training purposes. It mixes elements without coherent structure suitable as a single type of content. / Clear span segments representing different components of a Span-Aware Encoder, suitable for learning structured patterns in programming context. / Contains a mix of programming constructs and annotations, with clear span patterns for learning segmentation in both contexts. / Clear structured elements like spans and tokens, representing valuable patterns for learning span composition in programming context."}}
 {"raw": "In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 53-68. DOI: 10.1162/tacl_a_00353. [52] Yves Grandvalet and Yoshua Bengio. \"Semi-Supervised Learning by Entropy Minimization\".", "type": "natural", "id": {"id": "784406de-0e37-4cf2-be8b-1e04816531e8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear citation structure with identifiable spans for authors, title, and publication details; well-suited to learn span segmentation in academic contexts. / Clear citation structure with identifiable spans for author names, titles, and publication details; clean text suitable as training data. / The segment contains a mixture of citation elements (title, authors) and bibliographic details which can be segmented into meaningful spans for training purposes; it is clean but lacks context to fully represent the target domain's span composition patterns. / Clear citation structure with identifiable spans (authors, title, publication details) suitable for learning span segmentation in academic texts. / Contains a citation with structured elements (journal name, page range) and an academic reference that can help the model learn span segmentation in both text formats."}}
 {"raw": "Character-level and byte-level approaches eliminate heuristic preprocessing yet lack explicit modeling of phrase-level regularities and often incur higher computational cost [9, 10]. Unsupervised segmentation methods introduce differentiability but produce non-overlapping, monolithic partitions without linguistic priors [14, 11, 12]. Span-based predictors and pointer-network architectures enable variable-length boundary proposals but have not been combined with generative grammar principles for text", "type": "mixed", "id": {"id": "feb9aa43-8f28-4144-84e2-e529cd1eb94e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"character-level\", \"byte-level approaches\", etc., representing valuable patterns for learning span composition in both natural language text with code-like references. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"character-level\", \"byte-level approaches\", etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code analysis. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"character-level\", \"byte-level approaches\", etc., which are useful for learning span composition in both natural language processing (NLP) tasks related to code analysis. / The segment contains clear linguistic structures and phrases that can be segmented meaningfully, representing valuable patterns for learning span composition in the context of text processing tasks. 
It is clean but lacks contextual coherence due to fragmented sentences (\"span-based predictors\" vs \"pointer-network architectures\"). / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"character-level\", \"byte-level approaches\", etc., which are relevant for learning span composition in both natural language processing (NLP) tasks related to code analysis."}}
 {"raw": "interpola- tion mechanism over the filtered set S' = {(ik, Je)}=1: Rather than injecting each span embedding directly, the model computes a relevance-weighted mixture over their representations: S = Qij Sij) (i,j)es' where Sij e Rd is the encoded representation for span (i,j), and Qij € [0, 1] is its normalized attention weight. To compute the interpolation weights Qij) each span is assigned a scalar relevance logit: fscore(sij, Oij = type Qij 9 Pij confij) , which may be a learned function over span length", "type": "mixed", "id": {"id": "6a28ea16-5af0-4851-accd-9c77152a31be"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of technical terms and mathematical expressions, which can help the model learn span segmentation in both domains. It is structurally clear with identifiable spans such as \"interpola- tion mechanism,\" \"(ik, Je),\" etc., making it coherent for training purposes. / The segment contains a mixture of technical language and mathematical expressions, which can be segmented into meaningful spans for learning span composition in both domains. It is clean but may require domain-specific knowledge to fully understand the context (e.g., relevance-weighted interpolation). / The text contains a mix of technical terms and mathematical expressions, which can be segmented into meaningful spans such as \"interpolation mechanism,\" \"(ik, Je),\" etc., representing valuable patterns for learning span composition in both natural language descriptions and code-like structures. / The segment contains a mixture of both technical language and mathematical notation, which can help the model learn span segmentation in diverse contexts. However, it may benefit from further simplification for clarity. 
/ The segment contains a mixture of technical terms and mathematical expressions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both programming context (code) and explanatory text (natural language). However, the presence of symbols like '€' may require additional preprocessing."}}
 {"raw": "X-Spanformer addresses these gaps by unifying pointer-based span prediction with X-bar inspired inductive bias, yielding overlapping, softly typed spans that integrate seamlessly into transformer encoders and support end-to-end training:", "type": "mixed", "id": {"id": "078458ce-b9a6-4171-aa1b-06b2e1e58551"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"pointer-based span prediction,\" which is relevant for learning complex patterns in both natural language processing (NLP) tasks like summarization or question answering where overlapping information might exist. It also includes domain-specific jargon (\"X-bar inspired inductive bias\") useful to the model's understanding of technical concepts that appear across different contexts, making it a valuable mixed-content example. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both coding concepts (e.g., \"pointer-based span prediction\") and natural language explanations (\"unifying X-bar inspired inductive bias\"). / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"X-Spanformer,\" \"pointer-based span prediction,\" etc., which are relevant for learning complex patterns in both natural language processing (NLP) tasks related to code. 
/ The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"pointer-based span prediction,\" which is indicative for learning in the context of transformer encoders; it also includes natural language descriptions (\"unifying\", \"yielding overlapping\") useful to understand complex concepts. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"pointer-based span prediction,\" which is relevant for learning patterns in both language processing tasks (natural) and programming concepts (\"code\"). It also integrates seamlessly with transformer encoders."}}
 {"raw": "5.2 Span Routing Behavior We analyze the internal span distribution dynamics induced by the X-Spanformer's entropy-regularized selection module. The goal is to assess whether the model exhibits structure-seeking behavior through interpretable routing patterns under curriculum-controlled exploration. Let $P = \\{P_{ij}\\}$ denote the normalized span distribution from Equation (16), and let the controller be computed as: $\\tilde{s} = \\sum_{k=1}^{K} \\alpha_k s_k$, where $\\alpha_k = \\frac{\\exp(w_k)}{\\sum_{\\ell=1}^{K} \\exp(w_\\ell)}$ (34). To", "type": "mixed", "id": {"id": "b27043b1-13cf-4e90-906f-dc222d347f39"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and mathematical expressions, indicating clear span boundaries suitable for learning structured patterns in both domains. However, the presence of complex equations may require additional context or preprocessing to fully benefit X-Spanformer training without tokenization. / The segment contains a mix of technical jargon and mathematical expressions, which can help the model learn span segmentation in both structured programming contexts (code) as well as complex sentence structures typical for scientific documentation or articles (natural). It is coherent but may require additional context to fully grasp its meaning. / The segment contains a mix of technical terminology and mathematical expressions, indicating clear structure for span segmentation in both domains; however, it lacks contextual coherence which may affect training utility slightly. / The segment contains a mix of technical terms and mathematical expressions, indicating clear structure for span segmentation in both programming (code) context with equations/formulas notation and domain-specific language related to machine learning models like X-Spanformer. 
It is clean enough but may require preprocessing due to the mixture type. / The segment contains a mix of technical terms and mathematical expressions, indicating clear span patterns for both language (natural) and programming/code elements that are valuable to learn from. However, the presence of inline mathematical notation may require additional preprocessing or handling in training data preparation."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\n7.1 Effect of Span Injection Strategies We compare the following injection strategies for incorporating $: Prefix Token (PT): Insert 3 at position 0- Attention Bias (AB): Add 8 to keys/ queries linearly as in Section ?2 .", "type": "mixed", "id": {"id": "9fd0be6b-1ab6-409b-a4d3-5ddb27da155e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", numerical values like '7.1', and specific strategies with their descriptions (\"Prefix Token (PT)\", etc.). It is clean but lacks context for full comprehension which may affect training utility slightly. / The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", numerical values like '7.1', and specific strategies (\"Prefix Token (PT)\", \"Attention Bias (AB)\"). It is clean but lacks context for full comprehension; however, it still provides valuable patterns related to span segmentation in mixed content types. / Contains a mix of terminology and structured data, with clear span segmentation opportunities in phrases like \"Prefix Token\" and numerical values indicating positions or adjustments (\"Insert 3 at position\", \"Add 8 to keys/ queries\"). The text is coherent but lacks context for full comprehension. / The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", numerical values like '7.1', and specific phrases indicating different strategies (\"Prefix Token (PT)\", \"Attention Bias (AB)\"). 
It is clean but lacks context for full comprehension; however, it still provides valuable patterns related to span segmentation in mixed content types. / The segment contains a mix of technical terms and structured information that can be segmented into meaningful spans, such as \"X-SPANFORMER\", \"SPAN-AwARE ENCODER\", numerical values like '7.1', and specific strategies (\"Prefix Token (PT)\", \"Attention Bias (AB)\"). These elements are cleanly formatted for training purposes in a mixed content type scenario."}}
 {"raw": "In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 1129-1141. URL: https://aclanthology.org/N19-1116/. [56] Kevin Clark et al. \"Semi-Supervised Sequence Modeling with Cross-View Training\".", "type": "mixed", "id": {"id": "5ab156f6-8352-4368-a897-61350d964db3"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear citation structure with identifiable spans for authors, title, publication details; clean and coherent text suitable as a training example. / Contains a citation with structured elements (authors, title) and URLs that can be segmented into meaningful spans for training purposes. The content is well-formed but lacks direct span segmentation patterns due to its academic nature; however, it still offers valuable context mixing natural language descriptions of code references which could benefit the model's understanding in mixed contexts. / The segment contains a citation with structured elements (authors, title) that can be segmented into meaningful spans for learning purposes; it is clean and coherent but lacks direct span composition examples typical of X-Spanformer training data. / The text segment contains a citation with structured elements (authors, title) that can be segmented into meaningful spans for training purposes; it also includes URLs and references which are valuable patterns in span composition within academic contexts. / The text contains a citation with structured elements (authors, title) that can be segmented into meaningful spans for learning span composition in both academic and web contexts. 
It is clean but lacks direct coding examples or natural language prose to fully represent either domain exclusively."}}
 {"raw": "3 Architecture This section formalizes the modular components of X-Spanformer and their interactions within the segmentation pipeline. Each architectural unit is presented with motivation, precise mathematical formulation, and pseudocode where appropriate. We conclude with strategies for integrating outputs into standard transformer encoders, including support for overlapping span interpolation, and a runtime complexity analysis. X-Spanformer is bootstrapped by a compact BPE", "type": "mixed", "id": {"id": "44810afa-6f96-4498-9752-388395de3786"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical descriptions and structured components (architecture, mathematical formulations) that can be segmented into meaningful spans for learning span composition in both coding constructs and formal documentation language. / Clear segmentation between architectural description, pseudocode references (spans), and mathematical formulations; useful for learning span composition in both language descriptions and technical constructs. / Clear segmentation into architectural units, pseudocode inclusion; represents both structured and unstructured patterns for span learning. / The segment contains a mix of technical descriptions and structured information about an encoder's architecture, which includes clear spans for learning such as \"architectural components,\" pseudocode references (\"pseudocode where appropriate\"), mathematical formulations (implied), and integration strategies with standard transformers. / The text contains a mix of technical descriptions and structured components (architecture, mathematical formulations) that can be segmented into meaningful spans for training purposes. 
It is clean but slightly complex due to the presence of both natural language explanations and code-like pseudocode references."}}
 {"raw": "so that $\\sum_{(i,j) \\in S'} \\alpha_{ij} = 1$. The final interpolated vector $\\tilde{s}$ (Equation 1) functions as a global span summary. It may be inserted as a controller token [5], prepended as a prefix vector [30], or concatenated to the sequence input for downstream fusion [38].", "type": "mixed", "id": {"id": "1ee017e4-8f51-4599-848b-d523d45f40ac"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear spans of both mathematical notation and textual explanation; useful for learning span segmentation in a mix of content types. / The text contains a mixture of mathematical notation and prose, with clear delimiters for spans such as equations $\\sum_{(i,j) \\in S'} \\alpha_{ij} = 1$, which can be useful in training the model to recognize different types of content. / The text segment contains a mixture of mathematical notation, programming-like syntax (e.g., equations and references), which can be segmented into meaningful spans for training purposes in both domains. It is clean but lacks context to fully understand the domain-specific meaning without additional information. / Contains a mixture of mathematical notation, programming references (e.g., [5], [30]), and text explanation; spans can be identified for both equations/formulas as well as textual descriptions. / The segment contains a mix of mathematical notation, programming-like expressions (e.g., equations), and textual descriptions that can be segmented into meaningful spans for training purposes. It is clean but lacks context on the specific domain or application it pertains to."}}
 {"raw": "understand convergence properties and architectural expressivity, we track the following quantitative signals: Span Entropy Dynamics: The Shannon entropy of $P_t$ is computed at each training epoch $t$: $H(P_t) = -\\sum_{(i,j)} P_{ij}^{(t)} \\log P_{ij}^{(t)}$ (35). We hypothesize that the expectation $\\mathbb{E}[H(P_t)]$ follows exponential decay due to the schedule $\\lambda_{\\text{ent}}(t) = \\lambda_0 \\exp(-\\eta t)$, as derived in Section 4.2, mirroring curriculum learning effects observed in [54, 24].", "type": "mixed", "id": {"id": "1d370e7b-ffa6-424a-b69b-830f99effb20"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of mathematical notation and prose, with clear spans for both equations (e.g., Equation 35) and descriptive text (\"understand convergence properties...\"). It is clean but may require preprocessing to handle the mix effectively. / The segment contains a mixture of mathematical notation, programming concepts (like entropy and decay), which are structurally clear for span segmentation; it is clean but may require domain-specific knowledge to fully understand the context. / The segment contains a mix of mathematical notation and prose, with clear span boundaries for both equations (e.g., Equation 35) and descriptive text (\"understand convergence properties...\"). It is well-formed but may require preprocessing to handle the mixture effectively. / The text contains a mix of mathematical notation and prose, with clear span segmentation between equations (35) and the surrounding explanation about Shannon entropy dynamics in training epochs; it is clean for learning patterns related to both natural language explanations and code-like expressions. 
/ The segment contains a mix of mathematical notation and prose, with clear spans for equations (e.g., Equation 35) that can be segmented meaningfully; it is clean but may require domain-specific knowledge to fully understand the context."}}
 {"raw": "Gated FFN (GF): Modulate FFN output via span-conditioned gating. Let $L_{\\text{full}}$ denote the baseline loss with all three injections, and $L_{-m}$ be the loss with mechanism $m$ removed. Define relative degradation $\\Delta_m$ as: $\\Delta_m = \\frac{L_{-m} - L_{\\text{full}}}{L_{\\text{full}}} \\times 100\\%$", "type": "mixed", "id": {"id": "12403dcc-3292-4fda-a51c-fe14644cf873"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment combines both mathematical notation and programming-like expressions, indicating a clear structure that can be segmented into meaningful spans for learning purposes. It is clean but lacks context to fully understand the domain-specific terms like \"GF\" or what Lfull represents without additional information. / Clear span segmentation with a mixture of technical terms and mathematical expressions, representing valuable patterns for learning both language structure (natural) and programming constructs (code). / The segment contains a mix of technical terms and mathematical expressions that can be segmented into meaningful spans for learning, such as \"Gated FFN\", \"span-conditioned gating\", etc., which are relevant to both natural language processing tasks involving code understanding or documentation about programming concepts. / Clear span segmentation with a mixture of technical terms and mathematical expressions, representing valuable patterns for learning both language structure (natural) and programming constructs (code). / The segment contains a clear mixture of technical terms and mathematical expressions, which can be segmented into meaningful spans for learning purposes in both domains. It is clean but lacks context or examples that could improve its utility as training data."}}
 {"raw": "Span Width Histogram: Let w = j-i. For each epoch, we compute the empirical distribution of selected span widths among top-K spans. A shift toward medium-length (5-12 token) units may indicate phrase- or clause-level abstraction consistent with constituent boundaries [63].", "type": "mixed", "id": {"id": "8fe6bd9d-7b15-4b0b-ac2c-aca07242b4ce"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear prose with identifiable spans related to linguistic structure; useful for learning span composition in language processing tasks. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"Span Width Histogram,\" \"medium-length (5-12 token) units,\" indicating phrase-level abstraction consistent with constituent boundaries; it is clean for training purposes but lacks explicit code examples. / The segment contains a mix of mathematical notation and prose, with clear spans for both numerical values (span widths) and textual descriptions that can be segmented into meaningful units reflecting constituent boundaries in natural language text. / The segment contains a mix of technical language and mathematical notation, with clear spans for phrases like \"Span Width Histogram,\" numerical expressions (\"w = j-i\"), and references to literature ([63]). It is clean but lacks context that could be useful in training the model. / The segment combines both technical terms and a conceptual explanation, which can help the model learn span segmentation in contextually rich environments that include programming concepts intertwined with prose explanations."}}
 {"raw": "Lfull We expect to observe: PT 1.29, AAB = 2.7%, and AGF = 4.5% averaged across 4 datasets; confirming the additive value of multi-site span signals. 7.2 Span Selection without Confidence Routing We ablate the confidence-gated routing step and instead use uniform averaging over K top spans.", "type": "mixed", "id": {"id": "60a13495-7ff6-49f3-b0cf-6ccb7cf52c1e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and structured data points, which can help the model learn span segmentation in both contexts. However, it lacks clear delimiters for spans that are common to natural language processing tasks like sentence boundaries or paragraph breaks; this may affect its utility as training examples without further context clues embedded within the text itself. / The segment contains a mix of technical terms and phrases that can be segmented into meaningful spans, such as \"PT\", \"AAB\", \"AGF\", which are likely abbreviations or variables in the context; it is clean for training purposes with clear structural elements. / The segment contains a mixture of statistical results and technical descriptions, with clear spans for both numerical data (\"PT 1.29\") and procedural steps (\"Span Selection without Confidence Routing\"). It is clean but lacks context or explanation that could improve its utility as training examples. / The segment contains a mix of technical terms and structured data, with clear spans for numerical values (PT, AAB) that can be used to learn span segmentation in both coding contexts (\"multi-site\", \"span signals\") and natural language descriptions (\"Span Selection\"). 
/ The segment contains a mix of scientific notation and programming concepts, with clear span signals for both numerical values (PT) and percentage symbols (%), as well as references to datasets (\"4 datasets\") which are useful patterns in training models that need to understand context across different domains."}}
 {"raw": "X-SPANFORMER\nSPAN-AwARE ENCODER\nProposition 3 (Equivariance and Convexity). Let S' be any permutation of filtered spans: Then the interpolated vector $ is:\n1 . Permutation equivariant: invariant to reordering of spans i S'_\n2 Differentiable: gradients propagate through both Wij and Sij, 3 Convex: 8 € conv{8ij (i,j) € S}. Proof   Equivariance follows from the input-order invariance of softmax in Equation 3. Differen-", "type": "mixed", "id": {"id": "af67f4ba-bced-4a6e-819b-922da0a2ef5a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms, mathematical expressions (spans), and structured content that can be segmented into meaningful spans for learning purposes; however, it lacks clarity in some parts due to the presence of symbols like \"€\" which may not directly contribute to training. / The segment contains a mix of technical terms, mathematical notation (e.g., \"Permutation equivariant\"), and structured content that can be segmented into meaningful spans for learning purposes; it is coherent but lacks context to fully understand the domain-specific patterns. / The segment contains a mixture of technical terms and structured mathematical expressions, which can help the model learn span segmentation for both programming constructs (like equations) and domain-specific terminology (\"Equivariance\", \"Convexity\"). It's clean but lacks context or explanatory text that could improve its utility. / Contains both structured programming concepts and mathematical notation, providing diverse patterns for span segmentation in a mixed context. / Contains both structured programming concepts and mathematical notation, with clear span patterns for learning; however, it lacks coherence as a standalone segment due to its fragmented nature."}}
 {"raw": "Span Overlap Rate: We define token-level overlap for each instance by computing the pairwise intersection among selected spans:", "type": "mixed", "id": {"id": "beabb218-ea33-42b0-8226-d4a2ad599515"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The text segment contains a clear definition and formula for calculating token-level overlap, which includes both mathematical notation (code-like) as well as explanatory prose that can help the model learn span segmentation in different contexts. It is cleanly structured with identifiable spans such as \"token-level\", \"pairwise intersection\" etc., making it highly useful training data. / Clear overlap rate definition with a mix of mathematical notation and prose, representing valuable patterns for learning span composition in both domains. / The text segment contains a clear definition and formula for calculating token-level overlap, which includes both mathematical notation (code-like) and explanatory language (natural). This combination of elements provides valuable patterns in span composition that can be useful to learn from an X-Spanformer model. / The segment contains a clear definition and formula for calculating token-level overlap, which includes both mathematical notation (code-like) and explanatory text in English (natural language). This combination of elements is representative of the target domain's complexity that X-Spanformer aims to understand. / The segment contains a clear definition and formula for calculating token-level overlap, which includes both mathematical notation (code-like) and explanatory text (natural language). This combination provides valuable patterns in span composition across different content types."}}
 {"raw": "Let: K K Suniform Sk, Sconf = @kSk, @k softmax(9o(sk)) k=l k=1 Proposition 12. Let sk € Rd be fired span vectors and g be Lipschitz continuous. Then EIllsconf Suniform|= 1ll?] 2 0 with equality only if 9o is constant or the spans are identical Proof: Since softmax is strictly convex, equality occurs iff @k = 1/K for all k, which holds if and only if go (sk) = c for all k. This requires either span homogeneity or trivial 9o- Empirically, we expect to observe a consistent F1 drop of ~ 2.1% when using", "type": "mixed", "id": {"id": "ec727c7a-53f8-41a8-9bda-786373f560cd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment mixes mathematical notation with programming-like expressions, which may confuse the model due to lack of clear contextual boundaries for span segmentation in a tokenizer-free setting. Additionally, it lacks coherence and clarity needed as clean training data. / The segment mixes mathematical notation with informal language, lacking clear span boundaries for effective training. It combines elements that are not uniformly structured or easily segmented into meaningful spans without a tokenizer's aid. Additionally, the text contains colloquial phrases (\"Empirically, we expect to observe\") and lacks coherence in its instructional tone versus formal proof structure. / The segment mixes mathematical notation with programming-like expressions, making it difficult to discern clear spans for a tokenizer-free model; lacks coherence and clarity in its current form. / The segment mixes mathematical notation with informal language, lacking clear and consistent structure for meaningful span segmentation; it is not coherent enough to serve as a representative training example. 
/ The segment mixes mathematical notation with informal language, lacking clear span segmentation and coherence for training purposes."}}
 {"raw": "from a lightweight subword tokenizer, anchor the span predictor in a sparse but stable input space.3 Formally; given an input sequence tokenized to L elements, we define the embedding matrix E = [e1, .", "type": "code", "id": {"id": "5288d107-40b1-4cfa-8381-da81050491f7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains both programming language syntax and a mathematical expression, providing diverse patterns for span segmentation in the context of machine learning models dealing with multi-modal data. / Clear structure with identifiable spans (import statement, function definition); represents valuable patterns for learning span composition in programming context. / Clear structure with identifiable spans (code constructs, function calls); well-suited for learning span segmentation in programming context. / The segment contains clear syntactical elements of programming language, such as function calls and variable names; however, it lacks context for meaningful span segmentation due to missing content after \"E = [e1,\". It is well-formed but incomplete for training purposes. / The segment contains both programming (code) and explanatory text, which can help the model learn span segmentation in a diverse context. However, it lacks clarity due to incomplete sentences or expressions (\"Formally; given an input sequence tokenized...\")."}}
 {"raw": "from the input sequence; Embedding and selection of top-ranked spans: pooling span-level representations and selecting a subset for contextual conditioning; Joint contextualization: applying a standard transformer encoder over the combined sequence of tokens and selected spans. This modular design ensures that added computational cost remains subquadratic for the first two stages, while the dominant quadratic term scales with total input length. Similar hybrid strategies are used in sparse attention", "type": "mixed", "id": {"id": "406735b9-c1d0-4b2b-8f1b-57dc0303d799"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of programming constructs and comments, with clear spans for both syntax (code) elements like function names (\"Embedding\", \"pooling\") and phrases indicating processes or descriptions in natural language (\"input sequence\"). It is coherent but lacks context which may affect its utility as training data. / The segment contains both programming constructs (code) and explanatory text, which can help the model learn to distinguish between different types of spans in a hybrid context. However, it lacks clear separation for individual span learning due to its continuous nature without explicit delimiters or structured examples that could be easily segmented into meaningful training instances. / The segment contains a mix of programming concepts and language, with clear spans for both syntax (code) like \"from the input sequence\" as well as conceptual phrases (\"Embedding and selection\", etc.). It's clean but could be more structured to improve readability without affecting its training utility. / The segment contains a mix of programming concepts and descriptions, with clear structure for learning span segmentation in both contexts. 
It is clean but could be more concise to improve clarity further. / The segment contains a mix of programming concepts and language, with clear spans for tokens (e.g., \"from the input sequence\", \"pooling span-level representations\") that can be used to learn context-aware encoding in both code structures and natural text descriptions."}}
 {"raw": "(Pt || Pt+1) = KL(Pt || Pt+1) + KL(Pt+1 || Pt).", "type": "code", "id": {"id": "65a861f9-fc98-4d2a-b703-eea362c5214e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Clear mathematical expression with identifiable spans for training; well-formed and relevant to programming context. / Clear mathematical expression with identifiable spans; well-suited for learning span segmentation in programming contexts. / The segment contains clear mathematical expressions and programming notations, representing valuable patterns for learning span composition in a tokenizer-free context. It is well-formed with meaningful spans identifiable as variables (Pt), operators (=, +), functions (KL), and parentheses indicating scope or grouping within the expression. / The segment contains a clear mathematical expression with identifiable spans, such as variables and operators; it's clean for training purposes in learning span segmentation within programming contexts. / Contains clear mathematical expressions and equations, which are valuable for learning span segmentation in programming contexts. The notation is well-formed with identifiable spans like \"Pt\", \"Pt+1\", etc., making it structurally coherent as a piece of mixed content involving both natural language (notation) and formalized math/code constructs."}}
 {"raw": "[76]. 7.3 Span Pooling Alternatives\nWe replace Pool(zi:j) with various alternatives: max(xi:j ) max-pooling mean(€i:j) mean-pooling Ii start-token only Our simulated projections predict that mean-pooling will consistently outperformed other meth- ods (up to +1.8% over max)_ This might correlate to to reduced gradient variance and better generalization [76]. 7.4 Disabling Span-Scoped Attention\nFinally; we ablate the span-aware bias term in attention: epan = Cij + Sijes - B , B €e R\n(3)\n36", "type": "mixed", "id": {"id": "1b9b958f-01af-444c-93ed-910cbffc950c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of technical terms and mathematical expressions, but lacks clear sentence structure for meaningful span segmentation in training data context. Additionally, the notation is inconsistent (e.g., \"€i:j\" should be \"$[i,j]$\"). / The segment mixes mathematical notation with prose, making it difficult to identify clear spans for training purposes; lacks coherence and clarity in the context of span-aware models. / The text contains a mix of numerical references, mathematical expressions and fragmented sentences that lack coherent structure for effective span segmentation training. / Clear structure with identifiable spans like method names, parameters (e.g., Pool(zi:j), max(xi:j)), and mathematical expressions; well-formed for training purposes in the context of programming documentation or configuration files. / The segment contains a mix of numerical references, mathematical expressions (mean-pooling), and programming-like syntax which can help the model learn span segmentation in both structured data formats like equations or lists as well as natural language descriptions. However, it lacks context for some elements such as \"[76].\""}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER Our simulations also predict that removing the bias term reduces task-specific alignment in span- rich tasks (e-g;, nested NER) will improve performance over 3.9% Fl, indicating the necessity of soft alignment priors: 8 Conclusion In this work; we have introduced the X-Spanformer , a tokenizer-free, span-aware encoder archi- tecture grounded in linguistic theory and implemented through differentiable span routing and multi-site injection strategies. While our design is", "type": "mixed", "id": {"id": "72ff0da3-02f4-45db-8c71-657853a75ce2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "Contains a mix of technical terms and phrases that can be segmented into meaningful spans, though some punctuation errors may affect clarity slightly. The text includes both domain-specific language (e.g., \"X-SPANFORMER\", \"span-aware encoder\") as well as natural language explanations (\"Our simulations also predict...\"), which is representative for mixed content training data. / The text contains a mix of technical jargon and incomplete sentences, lacking clear span segmentation for effective training. Additionally, the presence of typographical errors (\"e-g;\", \"archi- tecture\") reduces clarity and coherence necessary for learning patterns in X-SpAnformer models. / Contains a mix of technical terms and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic theory context (natural language) and encoder architecture details (code). However, it lacks clear delimiters between sentences or code constructs which may affect clarity. 
/ The segment contains a mix of technical jargon and incomplete sentences, making it difficult to identify clear spans for training purposes; lacks coherence in the context provided. / The segment contains a mix of technical terms and structured sentences that can help the model learn span segmentation in both linguistic contexts (natural language) and domain-specific terminology related to X-Spanformer architecture, which is valuable for training purposes."}}
 {"raw": "X-SPANFORMER SPAN-AwARE ENCODER Visualization and Empirical Summary Entropy Decay 0.01 0.05 0,10 0,50 Span Widths Routing Sparsity 0.,6 0,5 1 0.4 U0.2- 1 0.0 7 18 120 80 40 t 20 0.01 0.05 0.1 0.5 0.5 30 80 120 180 Epochs 4 6 15 20 Span Width 30 90 120 180 Epochs Figure 6: Diagnostic evolution of span routing properties Left: entropy decay across different schedules. Center: distribution of selected span widths over training: Right: routing sparsity (mean top-K concentra- tion) over time. Table 1: Entropy", "type": "mixed", "id": {"id": "5069a8ff-ac34-4273-9a69-d9695bec1625"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of descriptive text and technical terms related to X-SPANFORMER, but lacks clear structured patterns that can be easily segmented into meaningful spans for training purposes. Additionally, there are some unclear elements like \"U0.2-\" which may confuse the model's learning process. / The segment contains a mix of descriptive text and technical terms related to machine learning, but lacks clear structure for meaningful span segmentation due to the presence of tables (Table 1) which are not easily parsed as spans in natural language or code contexts without additional context or formatting. / Contains a mix of descriptive text and structured data (tables, figures), with clear spans for both textual descriptions (\"entropy decay,\" \"routing sparsity\") and numerical values that can be segmented into meaningful patterns useful in training. However, the presence of tables without explicit labels may reduce clarity slightly but still retains significant compositional value. / Contains a mix of structured data (tables, figures) and descriptive text; spans can be identified for both numerical values in tables/figures and textual descriptions. 
/ Contains a mix of structured data (tables, figures) and descriptive text; spans like \"entropy decay,\" \"span widths,\" etc., are identifiable for training purposes. However, some elements may need further clarification or cleaning before use in X-Spanformer model training."}}
 {"raw": "theoretically motivated and formally validated; experimental evaluation is pending. At present, we are in the process of curating task-specific datasets necessary for empirical analysis. Accordingly, we reserve quantitative conclusions and broader discussions for future work, once adequate benchmarking data has been collected and evaluated. Appendix L Training Hyperparameters Parameter Value Description Optimizer AdamW with decoupled weight decay Learning rate schedule Cosine decay with 10% warmup Initial", "type": "mixed", "id": {"id": "26f6ef90-b284-4522-aec5-0f1e1e58a441"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mixture of structured elements like parameter descriptions and learning rate schedules, which can be segmented into meaningful spans for training purposes; however, it lacks clear delimiters between different types of content (natural language vs code). / The segment contains a mix of structured phrases and technical terms that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both formal language (like programming documentation) and natural discourse structures. However, the presence of incomplete sentences (\"Experimental evaluation is pending\") may slightly reduce clarity but still retains its utility as training data. / The segment contains a mix of structured phrases and technical terms that can be segmented into meaningful spans, such as \"theoretically motivated,\" \"formally validated,\" etc., which are useful for learning span composition in both natural language processing (NLP) tasks related to code documentation. 
/ The segment contains a mix of formal language and technical terms, which can help the model learn span segmentation in both structured (code-like) expressions (\"AdamW with decoupled weight decay\") as well as natural language descriptions (\"theoretically motivated\", \"pending evaluation\"). It is coherent but lacks context for full comprehension. / The segment contains a mix of formal language and technical terms, with clear delineations for different sections (e.g., \"theoretically motivated,\" \"formal validated\"). It includes structured elements like lists (\"Parameter Value Description\") that can be segmented into meaningful spans useful in training."}}
 {"raw": "\"Scaling Language Models: Methods, Analysis & Insights from Training Gopher\". In: arXiv preprint arXiv:2112.11446 (2021). URL: https://arxiv.org/abs/2112.11446. [61] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. \"Latent Retrieval for Weakly Supervised Open Domain Question Answering\". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019.", "type": "mixed", "id": {"id": "deffd63c-cea2-4b5a-96ee-0e278927c824"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf", "notes": "The segment contains a mix of citation details (natural language) and metadata about an academic paper, which can help the model learn to identify spans related to authorship, publication year, URLs, etc., in both natural text and structured references commonly found alongside code or data. / The segment contains a mix of citation and URL, which are structurally clear for span segmentation; however, it lacks context or content that could be directly useful in training without additional information. It represents the pattern of referencing academic work but would benefit from more substantive text to improve learning utility. / Contains a mix of URLs, citations with structured elements like authors and publication years that can be segmented into meaningful spans for learning purposes. / Contains a mix of citation and URL elements that can be segmented into meaningful spans; however, lacks clear compositional patterns for training purposes. Needs more structured content to improve learning utility. / Contains a mix of citation and URL, which are useful for training span segmentation in both text (natural) and URLs/code-like structures. However, it lacks clear delimiters between spans; additional preprocessing may be needed to improve clarity before use as is."}}
 {"raw": "Sworn Book of Honorius 271 Nothi, Ymeynlethon, Karex;271 and through the Angels and Archangels, through the Thrones, Dominations, through the Principates and Potestates and Virtutes, through the Cherubyn and Seraphyn (27) Then go around the circle once, touching each sword, saying: _and by the names of your holy angels: Raphael, Caphael, Dardiel, Hurathaphel. At the second sword: Satquiel, [Raphael], Paamcociel, Asasagel [*Asasayel]: (28) at the third sword: Samael, Satiel, Ituraiel, Amabiel_, at the fourth", "type": "mixed", "id": {"id": "9eb626a2-df3e-4ade-a2db-8639275e6dd5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of structured phrases and names, useful for learning span segmentation in both linguistic patterns (names) and ritualistic sequences (commands). / Contains a mix of structured phrases and names, with clear segmentation opportunities for learning span composition in both language patterns (names) and repetitive structures (swords). The text is coherent but lacks context to fully understand the domain relevance; however, it retains sufficient structural clarity. / Contains a mix of structured phrases and names that can be segmented into meaningful spans, though some elements like \"Nothi\" are unclear in context. Clean for training purposes with identifiable patterns related to angelic entities within the text. / Contains structured elements with clear segmentation opportunities, though some phrases may need clarification for training purposes. / Contains a mixture of liturgical phrases and structured commands, with clear segmentation opportunities for learning span composition in both religious text context (natural language) and ritualistic instructions (code-like structure)."}}
 {"raw": "But some may say: When Lord says, 'no one can see me and live'186 it follows therefore, that if one sees God, the body must die. (47)", "type": "natural", "id": {"id": "0b85fcf5-a49a-4fb8-bc20-db1b59ed1bb7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable spans; useful for learning sentence structure and context in NLP tasks. / Clear prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear prose with identifiable sentence structure; spans can be segmented into meaningful phrases and sentences for training purposes. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Lacks clear span segmentation; ambiguous phrases make it less useful for training purposes."}}
 {"raw": "284\nSWORN BOOK OF HONORIUS\nvisione divina, secundus de angelis bonis, tercius de aereis, quartus de ter- reis, quintus de exposicione horum: (3) De visione divina; De cognicione potestatis divine; De absolucione peccatorum; Ne homo incidat in peccatum  mortale; De redempcione trium animarum a purgatorio. (4) Omnium autem horum 5 operum precedencium ordo iacet suo modo verissimo situatus. De primo enim iam patet peticio explanata.", "type": "mixed", "id": {"id": "425ab543-a4f4-4083-98ef-12bd827c2a4c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of structured text (possibly from an ancient manuscript or religious document) and numerical references, which can help in learning span segmentation for both textual content and numeric patterns. / The text segment contains a mixture of Latin phrases and structured formatting, which can help the model learn span segmentation in both language patterns (natural) and formal structuring elements typical for historical documents or code-like structures. / The segment contains a mixture of structured text (likely from an ancient manuscript or religious document) and numerical references, which can be segmented into meaningful spans representing different sections with clear hierarchical relationships between them. This provides valuable patterns for learning span composition in both natural language processing tasks related to historical texts as well as code-like structures that denote sectioning within documents. / The segment contains a mixture of structured text (likely from an ancient manuscript or religious document) and numerical references, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to its archaic language style. 
/ The text segment contains a mixture of Latin phrases and structured headings, which can help the model learn span segmentation in both linguistic patterns (natural language) and formal structures (code-like constructs)."}}
 {"raw": "This prayer is based on the well-known Athanasian Creed ( Symbolum Athanasianum.)", "type": "natural", "id": {"id": "693ca0e1-6b36-4b35-a56c-0d9305648cce"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with a recognizable structure suitable for training on span segmentation in religious texts or creeds. / Clear, coherent prose with identifiable phrases suitable for training a span-aware model on religious texts or creeds. / Clear, coherent prose with identifiable spans; useful for learning sentence and phrase boundaries in English text. / Clear, coherent prose with identifiable spans like \"prayer,\" \"Athanasian Creed (Symbolum Athanasianum).\" Suitable for learning span segmentation in religious texts. / Clear, coherent prose with identifiable phrases suitable for span segmentation and composition learning."}}
 {"raw": "misericordie, (2) ego, licet indignus et plenus iniqui- tate, dolo et malicia, suplex ad tuam venio misericordiam orans et depre- cans, ut non respicias ad universa et innumerabilia peccata mea set, sicut consuevisti peccatorum misereri et preces humilium exaudire, (3) ita me, famulum tuum N, licet indignum, exaudire digneris clamantem ad te pro hac sanctissima visione divina humiliter et desiderantissime a te postu- lata prece tuis sanctis sacramentis insignita, que sunt Hosel, Iesel et cetera,", "type": "mixed", "id": {"id": "55830c4f-574c-4765-b2bb-bd852bb872ed"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and religious text, which have clear structures that can be segmented into meaningful spans for learning purposes. Despite the archaic language style, it is clean enough to serve as training data with potential challenges in modern context comprehension. / The segment contains a mix of Latin phrases and religious text, which lacks clear structure for meaningful span segmentation in the context required by X-Spanformer training data. It is not coherent or representative enough to serve as valuable patterns for learning span composition without further contextualization specific to natural language processing tasks. / The text segment contains a mix of Latin phrases and religious references, lacking clear syntactic structures for meaningful span segmentation in the context required by X-Spanformer training data. It is not coherent or representative enough to serve as valuable patterns for learning span composition without additional contextual information on its domain relevance (religious texts vs code). 
/ The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans like \"misericordie,\" \"(2) ego,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) tasks involving code-like structures or multilingual texts. / The segment contains a mixture of Latin phrases and religious text, which may have clear structured elements for span segmentation; however, its archaic language might pose challenges in learning patterns due to limited contextual familiarity with the content type."}}
 {"raw": "(5) De secundo autem taliter postulabis: ut abluto corpore me vivente mea possit anima cum tua incomprehensibili potencia a te cognita cum tuis sanctis angelis tuam cognoscere potestatem\" (6) De 37 taliter postulabis: ut abluto corpore te cum tuis novem angelorum ordinibus me vivente mea possit anima collaudare, et meorum concedas veniam peccatorum:\" Quarto taliter est dicendum: ut abluto corpore dehinc nullam possim committere maculam peccatorum, set meo vivente corpore puro corde, mente et opere te cum", "type": "mixed", "id": {"id": "5ee27707-8096-49c4-a610-b1d3dfb84a55"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and punctuation, indicating clear boundaries for span segmentation; however, the specialized language may limit its utility without additional context or domain-specific training data. / The segment contains a mix of Latin phrases and structured text that could be useful for learning span segmentation in both linguistic patterns (natural language) and formal constructs, though it may require domain-specific knowledge to fully interpret the content. / The segment contains a mix of Latin phrases and structured text that could be useful for learning span segmentation in both linguistic patterns (natural language) and formal constructs, though it may require domain-specific knowledge to fully understand its context. / The segment contains both structured programming-like constructs and prose, offering diverse patterns for span segmentation in a tokenizer-free context. However, it lacks clarity due to the presence of numbers without explanations or contexts (e.g., \"(5) De secundo autem taliter postulabis\"). 
/ The segment contains a mixture of Latin phrases and references to angelic orders, which may be useful for learning span segmentation in religious or historical texts that combine language with symbolic notation (e.g., \"angelorum ordinibus\"). However, the lack of context makes it less ideal."}}
 {"raw": "given as Liber sacer sive iuratus (\"the Sacred o1 Sworn Book\" I.18), Liber Sacer (\"the Sacred Book, CIIL.I), and liber sacer vel liber angelorum vel liber iuratus quem fecit Honorius, magister Thebarum 'the Sacred Book, o the Book of the Angels, o the Sworn Book, which was made by Honorius, the master of Thebes;\" CXLLI) The title Liber sacratus 'consecrated book\"), in the possible references by William of Auverne, is not actually found in the text; although the word sacratus is used descriptively in chapter", "type": "mixed", "id": {"id": "d50c9e67-440e-40b8-8e93-2acc5a8a2c14"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of historical references and descriptive phrases that can be segmented into meaningful spans, though some ambiguity exists in the text's structure. It is coherent but lacks clear compositional patterns for span segmentation due to its complex nature involving both natural language descriptions (historical context) and code-like elements (\"Liber sacratus\", \"o1 Sworn Book\"). / The segment contains a mix of descriptive text and references to historical texts, which can help the model learn span segmentation in both contexts; however, it lacks clear delimiters for spans due to its complex structure. Clean but not ideal for direct training without preprocessing. / The segment contains a mix of historical references and descriptive phrases that can be segmented into meaningful spans, such as titles (\"Liber sacer\", \"liber angelorum\") which are relevant for learning span composition in both natural language processing (NLP) tasks related to text understanding or code elements. 
/ The segment contains a mix of historical references and descriptive language, with clear spanable phrases like \"Liber sacer\" which can help the model learn about contextually rich spans in both natural text and code-like structures. However, it lacks explicit programming constructs or detailed technical descriptions that would be more beneficial for training purposes focused solely on one type (code). / The segment contains a mix of descriptive phrases and references that can be segmented into meaningful spans, such as titles (\"Liber sacer,\" \"Liber Sacer\") and descriptions (e.g., 'the Sacred Book'). It is clean but lacks coherence due to fragmented sentences."}}
 {"raw": "Grimoire du pape Honorius: avec un recueil des plus rares secrets. Rome: [publisher not identified], 1760. Honorius of Thebes See Hedegard.", "type": "natural", "id": {"id": "4185c42f-d6ed-4798-b1b0-538aebdba815"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of historical reference, bibliographic citation with identifiable spans like author names and publication details that can be useful for learning span segmentation in both textual content and metadata recognition. / Clear sentence structure with identifiable spans such as titles, names, and dates; clean for training purposes. / Clear sentence structure with identifiable spans (title, publication details). Suitable for learning span segmentation in prose. / Contains a mix of historical reference, bibliographic citation with identifiable spans (e.g., author names and publication details). Clear structure aids in learning span segmentation for both text types present. / Clear structure with identifiable spans (title, publication details), clean and coherent for training purposes."}}
 {"raw": "(6) Similiter omnipotens Pater; omnipotens Filius, omnipotens Spiritus sanctus, et tamen non tres omnipotentes set unus omnipotens: Ha Deus Pater; Deus Filius, Deus Spiritus sanctus, et tamen non tres dii set unus est Deus. (8) Ha Dominus Pater; Dominus Filius, Dominus Spiritus sanctus, et tamen non tres domini set unus est Dominus, qui a sicut singillatim unamquamque personam Deum ac Dominum confiteri Christiana veritate compellimur; ita tres deos aut dominos dicere catholica religione prohibemur: (9)", "type": "mixed", "id": {"id": "f9f788ed-19c3-48aa-a5bc-14da0fa40871"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mix of religious language and Latin phrases, with clear structure for span segmentation; however, it may not be representative enough due to its specialized nature. / The segment contains a mix of theological statements with clear structure and identifiable spans, such as phrases denoting divine entities (\"Deus Pater\", \"Dominus Filius\"). It is coherent for training purposes but may require additional context to fully understand the domain-specific language. / The segment contains a mix of religious text and Latin phrases, with clear structure for span segmentation; however, its specialized content may limit generalizability across domains. / The segment contains a mixture of religious text and Latin phrases, with clear structure for span segmentation; it represents valuable patterns in both language composition (natural) and liturgical context (code). / Contains a mixture of religious text and Latin phrases with clear structure, useful for learning span segmentation in both linguistic patterns and specific terminology."}}
 {"raw": "194 SWORN BOOK OF HONORIUS meritum, ubi racio prebet exemplum; ideo homini difficillimum est cog- noscere divinam maiestatem. (54) In hoc opere non debet homo cogitare de morte, quia virtute ora- cionum et virtute Dei,in quo totaliter habet fiduciam, corpus efficitur quasi spirituale et saturatur cibo celesti, (55) sicut fiunt pulli corvorum, quibus misericordia succurrit Dei testante David et dicente: \"Qui dat iumentis escam ipsorum et pullis corvorum invocantibus eum\" (56) Set cavendum est operanti, ne", "type": "mixed", "id": {"id": "03f96489-2068-49e8-8960-d7337fe54124"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin text and references to biblical verses, which can be segmented into meaningful spans for training purposes; however, the presence of non-standard characters (e.g., 'œ') may affect readability in some systems. / The segment contains a mixture of Latin text and references to biblical verses, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to its archaic language structure. / The segment contains a mix of Latin phrases and prose, with clear sentence structures that can be segmented into meaningful spans for training purposes in both language understanding (natural) and historical text recognition contexts. / Contains a mix of Latin phrases and references, with clear demarcations suitable for span segmentation; however, the text's archaic language may limit its utility in modern contexts. / Contains a mixture of Latin text and references to religious scripture, which may have structured patterns useful for span segmentation in both historical texts (natural) and scriptural citations or quotations that could be considered code-like due to their precise structure."}}
 {"raw": "Sworn Book of Honorius\n285\nthere are five treatises, (2) the first concerning the divine vision, the second concerning the good angels, the third concerning the aerial (spirits) , the fourth concerning the terrestrial (spirits) , the fifth concerning the exposition of these.", "type": "natural", "id": {"id": "eb51cdb2-d540-436c-b2c9-8bce7661763d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear division into treatises; meaningful spans identified for learning context and thematic segmentation. / Clear segmentation into treatises; coherent and representative of thematic patterns in historical texts. / Clear division into treatises; spans can be identified as individual topics, though less structured than ideal for training purposes. / Clear division into treatises; spans represent structured knowledge segments in a historical context. / Clear division into treatises; spans are phrases or sentences with distinct meanings, suitable for learning span segmentation in prose."}}
 {"raw": "Pater a nullo est factus nec creatus nec genitus. Filius a Patre solo non factus nec creatus set genitus.", "type": "code", "id": {"id": "a8344285-d90a-4f53-b7f3-30b337794158"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and English words, with clear structure that can be segmented into meaningful spans for learning purposes. It represents valuable patterns in both linguistic composition (Latin) and code-like syntax/structure (English). / Contains a mix of Latin phrases with clear, identifiable spans that can be useful for learning span segmentation in both linguistic and programming contexts. / Clear, structured patterns typical of programming languages; useful for learning span segmentation in a coding context. / Clear structured patterns resembling programming constants or variables, suitable for learning span segmentation in a tokenizer-free context. / Clear structured patterns resembling programming constructs; suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "XCVI: PRAYER 34. Hofbor; O Lord, who dwells on high, invisible God, inestimable God, ineffable God, unchangeable God, incorruptible God, O God most-pious, God most sweet; God most high, O glorious God, O immeasurable God, almighty God, God, the Father of all mercy, (2) I, although unworthy and full of iniquity, deceit, and vice; I come humbly begging and pleading for your mercy, that you don't consider all my countless sins, but just as you are accustomed to show mercy on sinners, and to hear the prayers of", "type": "natural", "id": {"id": "80b1342b-9860-4e30-8522-33e33b6a92f2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; suitable for learning prayer structure and sentiment expression in text. / Clear, structured prayer text with identifiable phrases and religious terms suitable for learning span segmentation in a spiritual context. / Clear, coherent prose with identifiable thematic spans; suitable for learning compassion and prayerful language patterns. / Clear, coherent prose with identifiable phrases and religious context; good for learning span composition in prayer texts. / Clear, structured prayer text with identifiable phrases and religious terms suitable for learning span composition in a spiritual context."}}
 {"raw": "272 SWORN BOOK OF HONORIUS (35) non ledentes, non frementes, non furientes75 nec me sociosque meos vel aliquam creaturam terrentes, neminem offendentes set peticionibus meis [obedientes] pocius et que precepero diligenter adimplentes\" (36) Tunc stans pedibus sibila sepcies percuciat et tunc semel circueat circulum dicens \"Bethala\" et cetera usque ad \"occurrite\" (37) Tunc stans in medio circuli aperta manu super aerem eis signum ostendat dicens: \"Sigillo Salomonis veniant advocati et dent michi responsum", "type": "mixed", "id": {"id": "17327241-c757-43c8-b747-1505095db272"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and instructions, with clear structure for span segmentation; however, it lacks context which might be necessary to fully understand the patterns in use. / Contains a mixture of Latin phrases and instructions, which can be segmented into meaningful spans for learning span composition in both linguistic patterns (natural language) and structured commands or sequences that resemble programming logic. The text segment is clean but may require additional context to fully understand its purpose as training data. / Contains a mixture of Latin phrases and instructions, with clear structure for span segmentation; represents valuable patterns in historical texts. / Contains a mixture of Latin phrases and instructions, with clear structured elements suitable for span segmentation; represents valuable patterns across different contexts (legal/instructional). / Contains both structured language and potential symbolic elements (e.g., numerals, Latin phrases) that can be segmented into meaningful spans for a span-aware model to learn from. 
The text includes clear markers like numbers indicating verses or sections which are valuable patterns in training data."}}
 {"raw": "1494) the colophon in Italian magic manuscript Florence, Laurent; Plut 89 sup. 38; ~(I518) Trithemius references the magic of Honorius of Thebes in Polygraphia; ~(1582) references in the secret writings of John Dee; ~(1583-158s) possible allusion by Giordano Bruno in his Cabala of Pegasus To this can be added ~(1623) a condemned sorcerer; Jean Michel Menuisier used a book titled Philippus Attinius onorius, likely Liber Iuratus Honorii to judge by the details ofits contents.5 In spite of its importance;", "type": "mixed", "id": {"id": "2fea8f8b-3233-4421-91e8-eba07c04f0cb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of historical references and textual citations, which can be segmented into meaningful spans such as dates (1494), names (Trithemius, Honorius of Thebes, John Dee, Giordano Bruno) and titles or works. Despite some unclear abbreviations like \"~\", it retains compositional value for learning span segmentation in mixed content contexts. / The segment contains a mix of historical references and textual citations, which can help the model learn span segmentation across different contexts like dates (1494), names (Giordano Bruno), titles (\"Philippus Attinius onorius\"), making it structurally clear for training purposes. / The segment contains a mix of historical references and bibliographic citations, which can help the model learn span segmentation in both text (natural language) and structured data formats like dates or titles. However, it lacks clear delimiters for spans; thus some ambiguity remains that could be addressed during training to improve clarity on what constitutes meaningful segments within mixed content types. 
/ The segment contains a mixture of historical references and textual content with identifiable spans such as dates, names (e.g., Trithemius), titles (\"Polygraphia\"), authors (\"Giordano Bruno\", \"Jean Michel Menuisier\"), books (\"Liber Iuratus Honorii\"). It is coherent but lacks clear structural clarity for meaningful span segmentation. / The segment contains a mix of historical references and citations, which can be segmented into meaningful spans such as dates (1494), names (Giordano Bruno), titles (\"Philippus Attinius onorius\"), indicating clear structural elements for training purposes in span segmentation."}}
 {"raw": "\"24 I have not been able to identify any such passage in the many voluminous works of De Abano (c. 1257-1316). Since pseudo-de Abano writings are not unknown, the evidence this provides for a historical Honorius is shaky at best: The only manuscript of Honorius which actually includes the alphabet is Sloane 3853, where it is clearly identi- fied as having been taken from Trithemiuss student Agrippa,25 thus making a remarkable round-trip back to Honorius This so-called Theban alphabet is clearly based on the", "type": "natural", "id": {"id": "de17d10e-db99-4f7b-ba69-95ecbae1e17a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable spans; well-formed for training purposes, though could benefit from more contextually rich examples. / The segment contains clear, coherent sentences with identifiable spans of meaningful phrases suitable for training a span-aware model in the context of historical texts or literature analysis. / Clear prose with identifiable spans; well-formed and coherent for training purposes, though could benefit from more context or examples of span segmentation in similar texts. / The segment contains clear, coherent prose with identifiable spans like \"many voluminous works,\" which can help the model learn span segmentation in a literary context. Despite being somewhat fragmented and lacking punctuation at times (e.g., missing periods), it is clean enough for training purposes without any code elements to distract from its natural language content. / The text segment contains clear, coherent sentences with identifiable phrases and spans that represent meaningful patterns in English prose; it is clean for training purposes but lacks explicit programming or markup elements typical of \"code\" content types."}}
 {"raw": "172 SWORN BOOK OF HONORIUS (4) ut virtutem et graciam, quam pro tanta visione habere debeo, habeam, scilicet puritatem et innocenciam et claritatem, sapienciam et sanctitatem (5) caritatem et sinceritatem et humilitatem et firmitatem et bonam volun- tatem, te ipso prestante, qui sedes in altissimis, cui laus est atque gloria et honor per infinita secula seculorum: Amen: XCVII Si seriem harum oracionum scire vis, respice seriem 100 nominum Dei huius libri, quia per illa semper incipiunt oraciones. (2) Et", "type": "mixed", "id": {"id": "1c020510-c1fd-463a-9d27-bb6ddeade6e0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of religious text and structured numerical references, which can help the model learn span segmentation in both prose (natural language) and lists or sequences typical for coding structures. / Contains a mixture of Latin phrases and numerical references, indicating potential for learning span segmentation in both structured text (Latin) and numeric patterns. / The segment contains a mix of Latin phrases and numerical references, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to its historical language context. / Contains a mix of religious text and numerical references, with clear structure for span segmentation; however, lacks coherence in modern contexts. / The segment contains a mixture of Latin phrases and numerical references, which can be segmented into meaningful spans for training purposes; however, it lacks context that would make its compositional value clearer to the model."}}
 {"raw": "Sworn Book of Honorius 195 value, if human reasoning can provide experimental proof;191 for that reason it is most difficult for one to perceive the Divine Majesty (54) In this work one must not think about death; because with the virtue of prayer and the virtue of God, in whom he has placed all his trust; the body becomes like a spiritual body and is nourished by heavenly food, (55) as young ravens are, when nourished by the mercy of God, as David testified when he said; \"he gives his food to beasts, and", "type": "natural", "id": {"id": "66f49458-3956-4d1a-b471-49d293a2c903"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases; useful for learning span segmentation in narrative text. / Clear prose with identifiable phrases; useful for learning span segmentation in narrative text. / Clear prose with identifiable thematic spans; useful for learning context and composition in natural language processing. / The text lacks clear span boundaries and contains fragmented phrases, making it difficult to identify meaningful spans for training purposes. Additionally, the presence of numbers disrupts continuity in a way that could confuse learning patterns related to sentence structure or thematic segmentation within this type of content. / Clear prose structure with identifiable phrases and sentences suitable for training a span-aware model in recognizing sentence boundaries, though it lacks explicit coding constructs or domain-specific patterns."}}
 {"raw": "nota, quod illa sacra Dei nomina predicta: Hosel, Iesel et cetera, debent dici paulo post principium orandi eciam in principio cuiuslibet oracionis. XCVIII INCIPIT 24 MUNDACIO IN VISIONE DIVINA_ Mundato igitur et macerato corpore volentis videre celeste palacium ipsum mundissimum esse iubemus et in omnibus virtutibus esse vestitum, (2) et semper cogitet et deprecetur Dominum de suorum absolucione pec- catorum, quia iustus eciam debet timere, quia qui non timet, non diligit, testante Salomone et dicente:", "type": "mixed", "id": {"id": "b408036b-c6e1-4d33-adfa-7d46b33256c6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear religious and philosophical language with identifiable phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of spiritual texts or theological discussions. / Contains a mix of Latin phrases and references to religious texts, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to its archaic language structure. / The segment contains a mix of Latin phrases and references to religious texts, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to its archaic language structure. / Contains both religious text (natural language) and Latin phrases, which may have structured patterns useful for training a span-aware model in handling multilingual or domain-specific content. / Contains a mixture of Latin phrases and references to religious texts, which can be segmented into meaningful spans for training purposes; however, it lacks clear compositional patterns due to its archaic language structure."}}
 {"raw": "to young ravens who cry out: \"192 (56) But the operator must beware, lest he is in mortal sin, because if so, he will go insane, and the cause is that, comprehension comes from part of the soul, which strives to see what it delights in, (57) and when he doesn't see it because of the hindrance of sin, henceforth will think of nothing else, and so the body will have no human reasoning: (58) And similarly we see many - who become inanimate thereby, because the comprehension coming from part of the soul has not", "type": "natural", "id": {"id": "c1edbabd-64f5-4ead-83ff-aaba1ce53c08"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear, meaningful spans of phrases and sentences that can be used to train a span-aware model on understanding complex sentence structures in English prose. It is coherent but lacks technical jargon or code constructs which makes it purely 'natural'. / The text segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on human language processing. / The text segment contains clear, structured sentences with identifiable phrases and clauses that can be segmented into meaningful spans for a span-aware model to learn from. It is coherent in its narrative style but lacks programming code elements or mixed content types; thus it fits best as 'natural' language training data. / Clear prose structure with identifiable meaningful spans; well-suited for learning span composition in a tokenizer-free context. / Clear, coherent prose with identifiable thematic spans; useful for learning context and composition in language processing tasks."}}
 {"raw": "286 SWORN BOOK OF HONORIUS CXXXVII 2m capitulum quinti tractatus de consecracione hincaustiso sigilli Dei eterni, vivi et veri operantis '{D}eus invisibilis, Deus inestimabilis, Deus ineffabilis, Deus incommutabi- lis, (2) Deus incoruptibilis, Deus piissime, Deus dulcissime, Deus excelse, Deus gloriose, Deus inmense, Deus tocius misericordie, ego, licet indig- nus vel: plenus iniquitate, dolo et malicia_ , (3) suplex ad tuam venio misericordiam orans ac deprecans, ut non respicias ad universa et innu-", "type": "mixed", "id": {"id": "ec46f2ec-44a5-4703-b1d0-aba6f30ed46c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable phrases and clauses suitable for learning span segmentation in a natural language context. / Contains a mixture of legal text and Latin phrases with clear structure; spans can be identified as meaningful units for training purposes, though some words are fragmented or unclear due to OCR errors. Cleanliness is compromised by potential transcription mistakes but retains compositional value in its original context. / The text segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, the presence of unusual characters like 'œ' may affect clarity slightly. / Contains a mix of legal text and Latin phrases, with clear spanable segments like \"Dius invisibilis\" or the entire sentence structure that can be useful for learning complex linguistic patterns in both code-like structures (Latin) and natural language prose. 
/ The segment contains a mixture of Latin phrases and prose, which can be segmented into meaningful spans for learning span composition in both language recognition tasks (natural) and understanding structured text elements like titles or headings that could resemble code-like structures with their own syntax rules."}}
 {"raw": "Basingstoke: Palgrave Macmillan, 2015.", "type": "natural", "id": {"id": "b013e575-7d4c-4d51-828b-fd5627d76582"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear citation structure with identifiable spans for author, title, and publication year. Well-suited to learn pattern recognition in bibliographic references. / Clear citation structure with identifiable spans (title, location, year). Well-suited for learning span segmentation in academic contexts. / Clear citation structure with meaningful spans (title, location, year). Well-suited for learning span segmentation in academic references. / Clear structured elements (title, location) suitable for span segmentation; represents valuable patterns in bibliographic entries. / Clear citation structure with identifiable spans (author, title). Well-suited for learning span segmentation in bibliographic contexts."}}
 {"raw": "certainly hav- ing seen the references only in Waite s books: Intrigued by these references, Itook a summer in the Seventies to examine the manuscripts firsthand, and order microfilms for further study: It proved to be every bit as fascinating as hoped. The other purpose of my visit was to study the manuscripts and magic artifacts of Elizabethan polymath John Dee, and [ was surprised to discover his connection to the Sworn Book (on which see below) .", "type": "natural", "id": {"id": "4c282cee-07ef-4c4c-afdc-724c6dd0ab7f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable meaningful spans like phrases and sentences; useful for learning span segmentation in a purely linguistic context. / Clear, coherent prose with identifiable spans; useful for learning sentence structure and context in NLP tasks. / Clear, coherent prose with identifiable meaningful spans; well-suited for training a span-aware model in the context of natural language processing. / Clear sentences with identifiable phrases and clauses suitable for span segmentation; coherent prose representative of the target domain. / Clear sentence structure with identifiable spans like \"references,\" \"manuscripts firsthand,\" and \"[ was surprised to discover his connection.\" Suitable for learning span segmentation in prose."}}
 {"raw": "Sworn Book of Honorius 89 filed, without doubt shall perish everlastingly: (2) And the Catholic faith is this: That we worship one God in Trinicy and Trinity in Unity, neither con- founding the persons, nor dividing the substance: (3) For there is one person of the Father; another of the Son, and another of the Holy Spirit: But the Godhead of the Father; of the Son, and ofthe Holy Spirit is all one, the glory equal, the majesty co-eternal. (4) As is the Father; so is the Son, and so is the Holy Spirit: The", "type": "natural", "id": {"id": "9ce25a0a-9e95-4b31-98fe-1840cbed78ba"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear religious text with identifiable phrases and theological concepts suitable for training a span-aware model on complex sentence structures in English prose. / Clear religious text with identifiable phrases and concepts, suitable for learning span segmentation in theological contexts. / Clear religious text with identifiable phrases and theological concepts suitable for learning span segmentation in a tokenizer-free context. / Clear religious text structure with identifiable phrases and theological concepts suitable for span segmentation learning. / Clear structure with religious text; spans can be identified as phrases or sentences, representing valuable patterns for learning span composition in a theological context."}}
 {"raw": "Sworn Book of Honorius 173 with prayers and embellished with holy sacraments, which are Hosel, Iesel, etc. (4) that I may have the virtue and the grace which are needed to attain such a great vision, namely purity, innocence; claricy wisdom, and piecy, (5) love, sincerity, humility determination, and good will, through you, who sits on high, to whom be praise, glory and honor; forever and ever: Amen. XCVII If you wish to know the sequence of these prayers, consider the sequence of the one hundred names of", "type": "natural", "id": {"id": "2e3a76e9-4680-4cc6-baf4-0c625f8bd27d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear religious and spiritual phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of devotional texts or prayers. It is clean but lacks direct relevance to programming code structures; however, it maintains compositional value as a standalone piece of writing with thematic coherence suitable for natural language processing tasks focused on text segmentation and understanding religious content. / The text segment is structurally clear with identifiable spans such as phrases, prayers and holy sacraments; it contains compositional patterns like sequences of names which are valuable for learning span segmentation in a tokenizer-free context. Despite being somewhat archaic language that may present challenges to modern readers or models trained on contemporary data, its coherence makes it suitable training material within the scope defined by X-Spanformer requirements. 
/ The text segment contains clear religious and spiritual language with identifiable phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on natural language processing tasks. / The text segment contains structured prayers and references to virtues, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to archaic language complexity. / The text segment contains structured prayers and attributes with clear separations, representing valuable patterns for learning span composition in religious or spiritual contexts. However, it lacks coherence as a standalone piece due to its fragmented nature but retains compositional value within the context of prayer sequences."}}
 {"raw": "invocacionem nominis trinitatis, Patris et Filii et Spiritus sancti, (9) ut acceptas habeas, et tibi placeant, oraciones et verba oris mei, per invocacionem tuorum sanctorum 100 nominum, 86 i.e. encausti:", "type": "mixed", "id": {"id": "8614c4db-7393-43d4-8159-d225b7edbea1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and numerical annotations, which can be segmented into meaningful spans for training purposes; it is clean but lacks contextual coherence due to its religious text nature. / Contains a mix of religious invocation text and numerical references, showing clear structure for span segmentation; however, the presence of Latin phrases may pose challenges in learning patterns due to its specialized vocabulary. The segment is clean but less common than natural language or code segments. / Contains a mix of religious invocation text and numerical references, offering diverse span patterns for learning. / Contains a mixture of religious phrases and numerical references, which can be segmented into meaningful spans for training purposes; however, it lacks clear compositional patterns due to its specialized content. / The text segment contains a mixture of Latin phrases and numerical references, which can be segmented into meaningful spans for training purposes; however, it lacks clear compositional patterns due to the specialized language used in religious contexts."}}
 {"raw": "malaparos. malapatas helatay helahenay methay meray LXXXVIII: LATIN PRAYER.130 holy God, merciful God, 0 gentle God, 0 almighcy God, giver of all, make all things possible today, to me, a believer; and help my disbelief today; and have mercy on me today; (2) just as you had mercy on Adam when he repented, whereby you gave to him the grace of many virtues, through the mercy of your omnipotence, in an instant: (3) Grant to me this day the grace which I desire, through your omnipotence, that I, delighted in", "type": "natural", "id": {"id": "f656870e-0d84-4397-866a-819f3570dc11"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose with identifiable phrases and sentences suitable for learning span segmentation in a religious context. / Clear and coherent prose with identifiable phrases suitable for span segmentation; represents valuable patterns in religious texts. / Clear structure with identifiable spans of religious text, suitable for learning span segmentation in a non-tokenized context. / Clear prose with identifiable phrases and sentences suitable for learning span composition in a tokenizer-free context."}}
 {"raw": "142 Thereupon, with the body purified and softened,143 he that wishes to see the heavenly palace, we command him to keep himself most clean to be clothed with all virtues, (2) and he should always contemplate the Lord, and pray for the forgiveness of sins, for the Lord is just and must be feared, for ifyou dont fear him, you dont love him; as Solomon testified when he said: The begin- ning of wisdom is the fear of the Lord;144 (3) Therefore, everyone should fear him, because nobody attains glory or", "type": "natural", "id": {"id": "c1d5fdb2-7e35-46ac-b188-f91248ceb74b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable spiritual and moral themes suitable for learning span segmentation in religious texts. / Clear, coherent prose with identifiable spiritual themes and moral teachings suitable for training a span-aware model in religious or philosophical contexts. / Clear narrative structure with identifiable spans like verses, commands for actions (e.g., \"keep himself most clean\"), and references to religious texts which are coherent but may lack diverse patterns due to its repetitive nature. / Clear narrative structure with identifiable spans for religious context; coherent and clean, though repetitive phrases may need variation in a larger dataset. / Clear, coherent prose with identifiable spiritual themes and moral teachings suitable for learning span composition in religious texts."}}
 {"raw": "[Tunc in consol dicat:] \"Ubi est Formione rex, ubi sunt Guth, Maguth, Guthrin eius ministri? (11) Ubi est Iammax rex, ubi sunt Carmox, Ycanol, Pasfran eius ministri? Ubi est Sarabocres rex, ubi sunt Nassar; Cinassa eius ministri?", "type": "mixed", "id": {"id": "43067b9a-8580-49eb-a881-2ea3534fec98"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and structured queries that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and formal query construction (code-like). / The text segment contains a mixture of Latin phrases and structured queries, which can help the model learn span segmentation for both language patterns (natural) and potential coding constructs or annotations if used in code documentation contexts. / The segment contains a mixture of Latin phrases and potential references to characters or entities, which could be useful for learning span segmentation in both linguistic patterns (natural language) and structured formats that resemble code-like elements with labels like \"[Tunc]\" indicating possible markup constructs. / The segment contains a mixture of Latin phrases and potential references to characters or entities, which could be segmented into meaningful spans for learning purposes in an encoder-decoder model that handles both natural language text and code-like structures. However, the archaic nature may pose some challenges but still holds value as training data due to its unique composition. / The segment contains a mixture of Latin phrases and references to characters, which may help the model learn span segmentation for historical texts or similar structured content. 
However, it lacks clarity in modern contexts due to archaic language usage."}}
 {"raw": "below, chap: CXL.2-5. 278", "type": "code", "id": {"id": "b12eb8bd-0278-4379-b9a1-c4d2ab46761b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear reference to a chapter and section, useful for learning structured patterns in documentation or educational materials. / Clear structure with a chapter reference, suitable for learning span segmentation in programming documentation or configuration files. / Clear reference to a chapter and section, likely from documentation or academic text; useful for learning span segmentation in structured content like programming references. / Clear reference to a chapter and section, useful for learning span segmentation in structured documents like programming manuals or documentation. / lacks meaningful spans, not representative of typical programming constructs or patterns for learning span composition in the context of X-Spanformer training data."}}
 {"raw": "Sworn Book of Honorius 287 CXXXVII Second Chapter of the Fifth Treatise, concerning the consecration of the ink2s7 used for the Seal of the Eternal, Living, and True God.288 invisible God, inestimable God, ineffable God, unchangeable God, (2) 0 incorruptible God, 0 pious God, sweetest God, 0 exalted God, O glorious God, 0 immeasurable God, 0 God ofall mercy, I, although unworthy__O; full of iniquity, deceit; and wickedness (3) I come seeking your mercy beseeching and entreating; that you not look back on", "type": "natural", "id": {"id": "99ff4103-dcad-465b-b37d-475d78ad6da9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable phrases and religious context, suitable for learning span composition in a spiritual or historical text. / Clear prose with identifiable thematic spans; useful for learning sentence structure and religious context. / Clear narrative structure with identifiable spans like titles, phrases denoting attributes of God (e.g., \"invisible,\" \"ineffable\"), and a prayer-like format that can be segmented into meaningful parts for learning span composition in religious texts. / Clear prose with identifiable thematic spans; useful for learning context and sentiment in religious texts. / Clear structure with identifiable spans of religious text, suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "able divergence from the earliest ones, and its barely recognizable state in La Veritable Magie Noire (1750)7 testifies to a long and elaborate transmission.", "type": "natural", "id": {"id": "df407668-68b2-4723-b34a-2331a4c7c52d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases and concepts suitable for span segmentation; well-formed text representing valuable patterns in language structure. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / The segment contains clear, coherent sentences with identifiable phrases and clauses suitable for learning span composition in a tokenizer-free context. However, it lacks explicit coding constructs or domain-specific patterns that would make the text more representative of code-related content types. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in the context of historical text analysis."}}
 {"raw": "and the Holy Spirit is God, and yet they are not three Gods, but one God: (8) So likewise the Father is Lord, the Son is Lord, and the Holy Spirit is Lord, and yet they - are not three Lords, but one Lord.", "type": "natural", "id": {"id": "89472d9a-29f6-4c3e-bc14-9d3fe1766680"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear religious text with distinct phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context. / Clear religious text with identifiable phrases and concepts, suitable for learning span composition in a spiritual context. / Clear religious text with identifiable spans; useful for learning span segmentation in theological contexts. / The text segment is structurally clear with identifiable spans such as \"the Holy Spirit\", \"God\", and phrases like \"(8) So likewise\". It contains religious discourse which can be useful for training a span-aware model on complex sentence structures in the context of theological texts. / Clear religious text with distinct phrases and concepts that can be segmented into meaningful spans for learning, despite the repetition of similar structures."}}
 {"raw": "196\nSWORN BOOK OF HONORIUS\n(4_ Et nota, quod, si vis, in omnibus predictis potes qualibet die dicere predictos 8 terminos cum illis 10 oracionibus, que sunt 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, (5) quia in eis non oportet respicere diem neque lunacio- nem neque de necessitate horam, licet presertim circa mane, circa terciam circaque meridiem proferantur; (6) unde iste oraciones reducunt in bonum quicquid homo erravit per fragilitatem in operacione, et quanto plus et fre- quencius dicuntur;", "type": "mixed", "id": {"id": "d5e7f2ae-54b4-45f1-bd7e-795329a786ce"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans of legal text and references to specific terms, useful for learning span segmentation in a mix of language types. / The segment contains a mix of legal text and structured elements like numbered terms, which can help the model learn span segmentation in both contexts. Despite being somewhat archaic language that might pose challenges for modern NLP systems due to its historical nature or potential OCR errors (e.g., \"lunacio- nem\"), it still presents clear structural patterns useful for training purposes such as legal documents and code-like structures with numbered terms indicating sections, subsections, etc. / The segment contains a mix of legal or formal language and structured elements like numbered terms, which can be segmented into meaningful spans for training purposes. It is clean but may require domain-specific knowledge to fully understand its context as it appears related to historical documents rather than natural conversation topics. 
/ Contains structured legal phrases and terms that can be segmented into meaningful spans; represents valuable patterns for learning span composition in both language structure (natural) and specific terminology usage (code). / Contains a mixture of legal text and structured phrases, with clear boundaries for span segmentation; useful patterns present in historical document formatting."}}
 {"raw": "(12) Ubi est Harthan rex, ubi sunt Bileth, Mylalu, Abucaba eius ministri? Ubi est Abaa rex, ubi sunt Hyici, Quiron, Zach, Eladeb eius min - istri? (13) Ubi est Maymon rex, ubi sunt Hassaybi, Albunalich, Haibalidech, Yasfla [eius ministri]? Ubi est Barthan rex, ubi sunt Thaadas, Caudas, Yalcal", "type": "mixed", "id": {"id": "791f55fc-d4ba-46d2-b90a-11480d7691c1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of names and phrases that could be segmented into meaningful spans, though the context is unclear. It shows potential for learning span composition in both structured lists (names) and descriptive queries (\"ubi est...\"). However, it lacks coherence as it's not clear if these are related entities or separate entries; thus some ambiguity remains which might affect training utility slightly. / The segment contains a mixture of structured queries with identifiable spans (e.g., names, titles), which can help the model learn span segmentation in both linguistic and pseudo-code contexts. Despite some fragmented phrases (\"Ubi est Maymon rex,\" \"Yalcal\"), it maintains clear structural patterns that are valuable for training purposes. / Contains structured queries with identifiable spans; however, the presence of non-standard characters and potential OCR errors may affect clarity for training purposes. / The text segment contains a mixture of Latin phrases and names, which could be useful for learning span segmentation in both linguistic patterns (natural language) and structured elements like lists or sequences that resemble code constructs. However, the clarity is somewhat compromised by potential transcription errors (\"Abucaba eius ministri?\" etc.). 
/ Contains a mixture of structured queries with identifiable entities and relationships, suitable for learning span segmentation in both language context and potential coding-like structures."}}
 {"raw": "So there is one Father; not three Fathers; one Son, not three Sons; one Holy Spirit; not three Holy Spirits: (I1) And in this Trinity none is before Or after another; none is greater O less than another But the whole three Persons are co-eternal together; and co-equal. So that in all things, as is aforesaid, the Unity in Trinicy and the Trinity in Unity is to be worshiped. (12)", "type": "natural", "id": {"id": "551c9e04-44d0-41df-a55e-838d79b3072d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; useful for learning sentence-level composition and coherence in religious texts. / Clear, coherent religious text with identifiable spans of phrases and concepts suitable for training a span-aware model on complex sentence structures in English prose. / Clear, coherent prose with identifiable thematic spans; useful for learning sentence structure and coherence in religious texts. / Clear religious text with identifiable phrases and concepts, suitable for learning span segmentation in theological contexts. / Clear, coherent prose with identifiable thematic spans; useful for learning sentence-level structure and coherence in religious texts."}}
 {"raw": "Consolida hodie opus meum et doce me, ut ambulem in innocencia tui ipsius Dei gloriosi et glorier in multitudine effluentis gracie tue, (6) et impetus flu- minis sanctissimi Spiritus civitatem cordis mei letificet et depuret in fide visionis sancte et in spe efficacie et innocencie, pro qua laboro, (7) et cor meum caritatis largitate repleat et instauret et radiis Spiritus sancti vivifi-", "type": "natural", "id": {"id": "e9f0a75f-102b-4cde-8a54-d0233a56c213"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and sentences suitable for span segmentation; represents valuable patterns in religious text structure. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / The segment contains a mixture of poetic language and religious text, which presents clear structured phrases that can be segmented into meaningful spans for learning purposes. Despite its unusual structure compared to standard training data, it offers valuable patterns in terms of punctuation use (parentheses) indicating separable segments within the sentence-like constructs found here. / Clear, coherent prose with identifiable thematic spans; useful for learning span composition in narrative text. / Clear poetic structure with identifiable phrases and verses, suitable for learning span segmentation in literary text."}}
 {"raw": "Expleto primo tractatu huius libri sacri et Domini secreti subditur secun- dus, qui; sicut primus 6 capitula habebat, de quibus Hely gracia est deser- tum, ita iste 27 habet, scilicet hec: (2) De cognicione celorum; De cognicione angelorum cuiuslibet celi; De cognicione cuiuslibet angeli et nominis et potestatis eius; De cognicione sigillorum cuiuslibet angeli et virtutis eorum; (3) De cognicione superiorum cuiuslibet angeli; De cognicione officii cuiuslibet angeli; De invocacione et associacione cuiuslibet", "type": "mixed", "id": {"id": "8a1aa7c1-1827-4a44-a72e-d0030c932c56"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of structured elements (chapters and subsections) that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both religious texts or scholarly documents with code-like structuring. However, the content is not purely natural language nor strictly programming-related; it combines thematic organization akin to documentation structure but lacks explicit coding syntax which might limit its utility solely as mixed-content training data without additional context clarification. / The text segment contains a mixture of structured elements (biblical chapter titles and descriptions), which can be segmented into meaningful spans for learning span composition in both religious texts and programming-like structures. It is clean, coherent but lacks contextual clarity due to its specialized language usage. / Contains a mixture of structured elements (chapters, verses) and religious terminology that can be segmented into meaningful spans for learning span composition in both text-based patterns and domain-specific language. 
/ Clear structured segments with meaningful spans; well-suited for learning span composition in religious texts. / The segment contains a mix of structured headings and descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both religious texts (natural language) and hierarchical listings or categorizations typical to code-like structures."}}
 {"raw": "He therefore that will be saved, and have the divine vision, must thus think ofthe Trinity: (13) Furthermore, it is necessary to everlasting salvation that he also believe rightly the incarnation of our Lord Jesus Christ: (14) For the right faith is, that we believe and confess that our Lord Jesus Christ; the Son of God; is God and Man.", "type": "natural", "id": {"id": "2d2b8821-ab63-4b81-91c0-fad61be6281d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning sentence structure and religious terminology composition. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with identifiable spans for faith concepts; coherent and representative of religious discourse. / The text segment contains clear religious and theological phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on structured language like this."}}
 {"raw": "Domine Iesu Christe, recipiens, sciens et confitens te Dominum meum et creatorem meum, quem in carne mea visurus sum ego ipse et non alius, quem expecto iudicem meum venturum, (8) concede michi propicius et in virtute huius sacri misterii, quod sicut corporeis oculis tuam spiritu- alem et corporalem potenciam ac eciam divinitatem visibiliter confiteor et agnosco per redempcionem huius sacratissimi corporis et sanguinis tui, (9) sic corpus meum clarificare et mundare digneris, ut abluto corpore te visi-", "type": "mixed", "id": {"id": "e115c086-79d7-458e-a444-6b3128efd603"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of religious text and Latin phrases, with clear structured elements like verses (indicated by numbers) that can be segmented into meaningful spans for training purposes. It is coherent but may require domain-specific knowledge to fully understand the context or patterns involved in span segmentation between natural language expressions and code-like constructs such as verse numbering. / The segment contains a mixture of religious text and Latin phrases, which can be segmented into meaningful spans such as \"Domine Iesu Christe,\" \"(8) concede michi propicius,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) tasks related to code-switching or multilingual contexts. / The segment contains a mixture of religious text and Latin phrases, which have clear structures that can be segmented into meaningful spans for learning purposes; however, the presence of non-standard characters (like æ) may affect readability but not necessarily its structural clarity or training utility. 
/ Clear liturgical phrases with identifiable spans; well-formed and coherent for training purposes, though somewhat repetitive. / The segment contains a mixture of religious text and Latin phrases, with clear structure for span segmentation; however, it may not be representative enough due to its specific domain (Christian liturgy)."}}
 {"raw": "angeli; De impetracione voluntatis per quemlibet angelum; (4) De impetracione omnium scienciarum; De hora mortis scienda; De omnibus presentibus, preteritis et futuris sciendis; De cognicione planetarum et stellarum; (5) De cognicione virtutum planetarum et stellarum et quid habent influere; De influenciis planetarum [et stellarum] mutandis; De mutacione noctis in diem et diei in noctem; (6) De cognicione spirituum ignis et nominum et superiorum et sigil- lorum et potestatum et virtutum eorum;", "type": "mixed", "id": {"id": "4786b857-1d56-485a-ae06-39abef96413b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains structured elements that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both religious texts and celestial phenomena descriptions. It is clean but may require domain-specific knowledge to fully understand the context of angelic references or astrological terms. / The segment contains a mixture of Latin phrases and structured lists, which can be segmented into meaningful spans representing both language patterns (Latin) and numerical sequences for learning span composition in multilingual contexts. / Clear segmentation into structured phrases and concepts, representing both linguistic patterns (natural language) and technical terms related to celestial bodies; suitable for learning span composition in a multilingual context. / The text contains structured elements with clear segmentation into meaningful spans, such as titles and descriptions of celestial bodies' knowledge domains; it is clean for training purposes but lacks context to fully understand the content's domain relevance. 
/ Contains structured segments with clear patterns for span segmentation, representing both conceptual and categorical elements suitable across domains."}}
 {"raw": "274 SWORN BOOK OF HONORIUS (43) Quo facto statim apparebunt visiones infinite et illusiones sicut choros, organa, cithare et omnia instrumenta dulcissima, ut possint socios ad exitum provocare, quia supra magistrum nichil possunt: (44) Illis vero transactis venient exercitus militum et ballivorum, ut debeant pro timore de circulo fugere. (45) Post hec venient sagittarii cum omnium ferarum genere, ac si eos crederent devorare: Set operans providus loquatur sociis dicens: (46) \"Nolite timere. Ecce signum", "type": "mixed", "id": {"id": "a976524b-7fbd-41cd-b440-1f6f8b211370"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of narrative and symbolic language with clear structured elements like verses, which can be segmented into meaningful spans for learning purposes. It is coherent but lacks context that would make it more representative as training data. / The segment contains a mixture of Latin text and what appears to be an ancient manuscript or codex notation, with clear structured elements like verses (274 SWORN BOOK OF HONORIUS) that can serve as meaningful spans for learning span segmentation in both natural language processing tasks. / Contains a mixture of narrative and pseudo-code-like phrases, with clear structured elements that can be segmented into meaningful spans for training purposes. The text alternates between descriptive language (natural) and commands or instructions resembling code syntax (\"Operans providus loquatur sociis dicens\"), making it representative enough to learn span composition in both domains. / The segment contains a mixture of Latin phrases and narrative structure, providing diverse patterns for span segmentation in both language context and potential historical or literary analysis. 
/ The segment contains a mixture of narrative prose and potential symbolic or religious text, with clear sentence structures that can be segmented into meaningful spans for training purposes."}}
 {"raw": "The title Sworn book of Honorius came to be adopted in English lit- erature based on the catalog entry for the I6th century English translation in London, brought to a wider audience s attention by influential occultist A_ E. Waite (1898). 8 Waite judged the text important; but \"unaccountably over- looked by writers on ceremonial magic,\"9 Ironically, he himself didnt go into details on the contents of the text, only describing the prologue, which gives the texts own account ofits origin. Waite was aware", "type": "natural", "id": {"id": "d4cf5abc-e033-4b00-aa52-17b976626781"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans; useful for learning context and composition in English prose. / The segment contains clear narrative structure with identifiable spans such as titles, names of people and texts (e.g., \"Sworn book\", \"A_ E. Waite\"), dates (\"I6th century English translation in London\"), which are suitable for learning span segmentation from a tokenizer-free model like X-Spanformer. / Clear prose with identifiable spans; however, lacks explicit span boundaries for direct training use. Needs additional annotations or structured format to improve utility as X-Spanformer input. / Clear narrative structure with identifiable spans like titles, names, and dates; well-suited for learning span composition in English prose. / Clear narrative structure with identifiable spans such as titles, names (e.g., A.E. Waite), and references to historical texts; well-formed for training purposes."}}
 {"raw": "288 SWORN BOOK OF HONORIUS scilicet Agla, Monhon' et cetera 'humiliter et fideliter deprecans, (10) licet ego indignus, tamen in te confidens, ut sanctifices et benedicas cruorem istum per sanctissima nomina tua predicta et per nomen \"Semenpho- ras 72 literarum; (11) quatinus per virtutem et sanctitatem et potestatem eorundem nominum et per virtutem et potestatem tuam divinam sit cruor iste consecratus 1, benedictus 1, confirmatus per virtutem sacratis- simi corporis et sanguinis tui, (12) ut virtutem, quam", "type": "mixed", "id": {"id": "6bb06975-0ad6-4fab-a064-a43dddccd3ca"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mix of Latin phrases and numerical references, indicating potential patterns in span segmentation for historical or religious texts that could be valuable training data. However, it lacks clarity due to its archaic language structure which might pose challenges during the learning phase but still holds compositional value with identifiable spans like \"288 SWORN BOOK OF HONORIUS\" as a coherent phrase and numerical references such as \"(10)\" indicating structured segments within text. / Contains a mix of Latin phrases and numerals, suggesting structured patterns suitable for span segmentation in historical texts or religious documents. The segment is coherent but may require domain-specific knowledge to fully understand its context as it appears related to ecclesiastical law (e.g., \"SWORN BOOK OF HONORIUS\"). / Contains a mixture of Latin phrases and references that could be useful for learning span segmentation in historical or religious texts, though the context is not entirely clear without further background knowledge. 
/ Contains a mixture of legal text and Latin phrases, with clear structured elements like numbered clauses that can be segmented into meaningful spans for training purposes. / Contains structured elements with clear spans, including references to a book and religious phrases; well-suited for learning span segmentation in both text and coded-like constructs."}}
 {"raw": "PLACACIO DIVINE MAIESTATIS Dices illa die Iovis semel psalterium et letaniam cum propriis eam sequenti- bus oracionibus. Post dices 25, 26,31 et ibi addes: (2) \"Ut tu, Domine, per annunciacionem, concepcionem nativitatem circumcisionem predicacionem, baptismum resurreccionem, ascensio- nem beatissimi filii tui, Domini nostri Iesu Christi, (3) corpus meum clari- ficare et mundare digneris, ut abluto corpore te visibiliter cum tuis novem dictis angelorum ordinibus me vivente mea possit anima collaudare;\" (4)", "type": "mixed", "id": {"id": "ba9a8a1f-c210-484d-9f0f-b47ef443557e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of Latin phrases and religious text, with clear structure for span segmentation; however, it lacks modern language patterns useful in contemporary NLP tasks. / The segment contains a mixture of Latin phrases and references to religious texts, which can be segmented into meaningful spans for training purposes; however, the presence of non-standard characters may affect clarity slightly. / The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans for learning purposes; however, it lacks clarity due to the presence of non-standard characters (e.g., \"MAIESTATIS\"). / The segment contains a mixture of Latin phrases and religious text, which are structurally clear for span segmentation; however, it lacks coherence in English context making its utility limited without additional processing or domain-specific knowledge. / Contains a mixture of Latin phrases and religious text, with clear structured elements like verses (e.g., \"Post dices 25, 26,31\") that can be segmented into meaningful spans for learning span composition in both natural language processing tasks related to historical texts."}}
 {"raw": "century Of course the persecution passage could be a later insertion, as we now know the text has been redacted. It could also refer to another; less noto- rious papal action.", "type": "natural", "id": {"id": "830ea7cf-de9b-49a3-b4c9-b30c91f2ce78"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in the context of language processing. / Clear, coherent prose with identifiable phrases suitable for span segmentation; represents valuable patterns in language structure and composition. / Clear sentence structure with identifiable spans; represents coherent language patterns for training. / Clear sentence structure with potential for meaningful span segmentation; represents patterns in English prose. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in the context of language understanding."}}
 {"raw": "This view is supported by the fact that this persecution/ origin myth is missing from Ganells text, as noted by Veenstra.19 The possibility that these were later insertions also fits nicely with a the- ory raised by Boudet; that an earlier version circulated in the time of William, and was known as Liber Sacratus: This would also explain the use of the title Liber Sacratus by Ganell, and not by the northern manuscripts. It would also explain the absence of persecution statements in the Summa Sacre Magice.", "type": "natural", "id": {"id": "bbac0407-2508-4fe7-a6aa-e0d22276e35f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; useful for learning context and relationships in text. / The segment contains clear, structured sentences with identifiable spans such as phrases and clauses that can be used to train a span-aware model on recognizing complex sentence structures in English text. It is coherent for training purposes but lacks code elements or mixed content types which makes it purely 'natural'. / The segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"persecution myth,\" \"Ganells text,\" etc., which are useful for learning span composition in a tokenizer-free context. It is coherent but lacks direct code elements or mixed content types; thus it fits best under the 'natural' category with good structural clarity suitable for training purposes. / Clear narrative structure with identifiable spans; useful for learning context and span segmentation in prose. / Clear prose with identifiable thematic spans; useful for learning context and narrative structure in NLP tasks."}}
 {"raw": "Sworn Book of Honorius 175 (s) PRAYER BEFORE RECEIVING CHRIST: Cc You, 0 Lord Jesus Christ, the savior ofall, who were willing to sacrifice your body, on behalf of me, a wretched sinner; and others living in the world, (6_ you who on the fifth day; namely Thursday, the Last Supper; fed your blessed apostles with your precious body and blood, teaching them that they should consecrate your most holy body and blood in the name of the holy mother church,in order that it might be the salvation and life ofthe", "type": "natural", "id": {"id": "483de0af-34a9-4ebe-af25-d5f739397a47"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans (e.g., phrases, sentences), clean and coherent text suitable for learning span composition in religious context. / The text segment contains clear religious prose with identifiable phrases and sentences that can be segmented into meaningful spans, such as \"Sworn Book of Honorius,\" prayer components (\"Lord Jesus Christ\"), references to the Last Supper (Thursday), etc., which are coherent for training purposes. / Clear prose with identifiable thematic spans; however, contains archaic language that may not be representative of modern text structures for training purposes. / The text segment contains clear religious prose with identifiable phrases and sentences that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on natural language processing tasks related to historical or theological texts. / The segment contains a mixture of religious text and references to Christian practices, which can be segmented into meaningful spans such as phrases or sentences that convey specific concepts related to prayer before receiving Christ in Christianity. 
Despite some archaic language (\"Sworn Book,\" \"wretched sinner\"), it is coherent for training purposes with clear structural elements suitable for span segmentation learning tasks."}}
 {"raw": "Having completed the first treatise of this Sacred Book and secrets of the Lord, here follows the second, which, just as the first had six topics completed with the grace of HELY,so this one has cwenty-seven topics as follows: I. (2) Concerning the knowledge ofthe Heavens; 2 _ Concerning the knowledge ofthe angels of each of the Heavens; 3. Concerning the knowledge of each angels name and powers; 4. Concerning the knowledge ofthe sealsofeach angel, and their virtues; 5. (3) Concerning the knowledge of the", "type": "mixed", "id": {"id": "05aa845c-e8d8-4844-8cff-d4929324946e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear segmentation into topics and subtopics, representing a mix of structured data (code-like formatting) with meaningful spans in both prose descriptions (\"Sacred Book\", \"knowledge\") and technical terms (\"Heavens\", \"angels\"). Well-formed for training purposes. / The segment contains a mixture of structured religious text and enumerated topics, which can be segmented into meaningful spans for training purposes; however, it lacks clarity in some parts due to potential transcription errors (\"HELY\" instead of \"Hallelujah\"). / Clear structured elements with identifiable spans; well-formatted and coherent text suitable for learning span composition in a religious context. / The segment contains a mixture of structured headings and descriptions, which can be segmented into meaningful spans for training purposes; it is clean but lacks context or content beyond the structure itself. / Clear structured elements with identifiable spans (topics, knowledge areas) and a mix of numerical references indicative for span segmentation; well-formed text segment representative of religious texts containing both prose structure and enumerated lists."}}
 {"raw": "(17) Angeli omnes et archangeli, vir- tutes, principatus, potestates, troni, dominaciones, cherubyn et seraphin ex auctoritate et licencia Dei te benedicant: (18) Per merita et oraciones omnium sanctorum tuorum; Domine Iesu Christe, benedicas et sancti- fices H et consecres H cruorem istum sigilli Dei et confirmes per omnipo-", "type": "mixed", "id": {"id": "83253d96-0891-4e29-b628-45ebbbb55060"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both religious text and structured elements like verses, which can be segmented into meaningful spans for training a span-aware model. / Contains a mixture of religious text (natural language) and references to scriptural entities, which can help the model learn span segmentation in both contexts. / Contains a mixture of religious text and structured elements like verses, which can be segmented into meaningful spans for training purposes. The content is clean but may require further preprocessing to isolate the natural language from code-like structures (e.g., verse numbers). / The segment contains a mix of religious text and Latin phrases, with clear structure for span segmentation; however, it may not be representative enough to generalize across all domains due to its specific context. / Contains a mixture of religious text and Latin phrases, with clear structure for span segmentation; however, it lacks coherence in modern contexts which may limit its utility as training data."}}
 {"raw": "tenciam tuam, (19) et virtutem et potestatem optineat sigillum tuum de eo scribendum, quam debet; et ad quam est institutum et confirmatum, prestante Domino nostro Iesu Christo, cuius regnum et imperium sine fine manet in secula seculorum: Amen' (20) Antequam iste 3 oraciones supra cruorem dicantur procedenter versus Ierusalem, dicatur supra eum exorcismus salis, quod ponitur in aqua, ter; nisi quod nomina sic debent mutari: (21) \"Exorcizo te, creatura cruoris\" loco de creatura salis' et 'qui per Salomonem", "type": "mixed", "id": {"id": "6926da79-19f6-464c-a5c1-84362b4d8dc9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and instructions, which may help the model learn span segmentation in multilingual contexts or for processing religious texts. However, its domain-specific nature might limit generalizability across different types of code or natural language text. / The segment contains a mixture of Latin phrases and references to religious texts, which may have clear structured elements for learning span segmentation in historical or liturgical contexts. However, the lack of modern language context might limit its utility outside specialized domains. / The segment contains a mixture of biblical text and instructions for an exorcism ritual, with clear phrases that could be segmented into meaningful spans such as \"tenciam tuam,\" \"virtutem et potestatem optineat sigillum tuum de eo scribendum,\" etc. It is clean but may require domain-specific knowledge to fully understand the context of religious rituals and language used in biblical texts for effective training data utilization. 
/ The segment contains a mixture of Latin phrases and instructions, with clear demarcations for potential span segmentation (e.g., \"tenciam tuam,\" \"virtutem et potestatem\"). It is coherent but lacks context to fully understand its compositional value as training data. / Contains both structured religious text and potential coding-like elements (e.g., numbered verses, Latin phrases). Clear spans for training a span-aware model on diverse content types."}}
 {"raw": "f: See also Bremmer and Veenstra: Veronese, Julien. LArs notoria au Moyen Age: introduction et edition critique. Firenze: SISMEL edizioni del Galluzzo, 2007. LAlmandal et IAlmadel latins au Moyen Age: introduction et editions critiques. Firenze: SISMEL edizioni del Galluzzo, 2012.", "type": "mixed", "id": {"id": "c39bacf8-0c13-49ea-baa3-987f11f5a800"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains structured bibliographic entries with clear spans for authors, titles, and publication details; useful patterns in span segmentation are present. / The segment contains structured bibliographic entries with clear span segmentation opportunities for both author names and publication details, representing valuable patterns in a scholarly context. / Contains structured citations with clear span segmentation opportunities, representing valuable patterns for learning both citation structure and content composition in a scholarly context. / The segment mixes citation formats and lacks clear, consistent spans for training a span-aware model; it combines elements that are not uniformly structured or representative of single content types. / The segment contains structured bibliographic entries with clear spanable elements like author names, titles of works (natural language), and publication details which can help the model learn to distinguish between different types of spans in a mixture context. However, it lacks variety for robust training due to its repetitive nature."}}
 {"raw": "Sworn Book of Honorius 275 (43) With that done, immediately infinite visions and illusions will appear; such as choirs, organs, lutes, and all the sorts of the sweetest instru- ments; in order to provoke the associates to flee, because they are able to exert no such influence over the master: (44) After this armies of soldiers and bai- liffs will come, in order to frighten them to flee from the circle: (45) After these archers with all cypes ofwild beasts will come,andact asif- they \" intended to devour", "type": "natural", "id": {"id": "8e80602d-7972-4d83-8f6c-9e1a7cc989ae"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable thematic spans; useful for learning narrative composition and context understanding in a tokenizer-free model. / The text segment contains clear narrative structure with identifiable spans such as sentences and phrases, representing valuable patterns for learning span composition in a storytelling context. It is clean but lacks domain-specific terminology that might be present if it were code-related content. / Clear narrative structure with identifiable spans (sentences, phrases). Well-suited for learning span segmentation in prose context. / The text segment contains clear narrative structure with identifiable spans such as \"Sworn Book of Honorius,\" numbers indicating verses, and descriptive phrases that can be segmented into meaningful units for learning span composition in a tokenizer-free context. It is coherent but lacks direct code elements or mixed content types; thus it fits the natural language category well. / Clear narrative structure with identifiable spans of meaningful phrases; well-suited for learning span composition in a language context."}}
 {"raw": "It was apparently part of a magical tra- dition that must have spread from the Mediterranean region across Europe \"6 Besides examples on metal, it can be found in manuscripts and at least one printed edition of the Key of Solomon.", "type": "natural", "id": {"id": "a7123ab9-1061-4107-abd2-38ee1d63b087"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable spans; useful for learning sentence structure and context in NLP tasks. / Clear prose with identifiable phrases and sentences suitable for span segmentation; coherent text representing a cultural reference context. / Clear prose with identifiable spans; useful for learning sentence and phrase boundaries in text. / Clear prose with identifiable spans; useful for learning sentence structure and context in NLP tasks. / Clear prose with identifiable spans; useful for learning sentence structure and context in NLP tasks."}}
 {"raw": "To know the virtues of the planets and stars and their influences; I4. To change the influences of the planets [and stars]; I5. To change night into day and day into night; (6) 16. To know the spirits of the fire, their names, superiors, seals, poW- ers, and virtues;", "type": "natural", "id": {"id": "ffeed00b-3427-409f-8fca-2e6d14887c9f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear thematic structure with identifiable segments related to astrological concepts; coherent and representative of complex, descriptive text. / Clear thematic structure with identifiable phrases; good for learning span segmentation in narrative text. / Clear sentence structure with identifiable phrases and concepts related to astrology, suitable for learning span segmentation in a tokenizer-free context. / Clear structure with identifiable phrases and thematic elements suitable for learning span segmentation in a non-code context. / Clear thematic structure with identifiable spans; useful for learning about celestial entities and their attributes in prose."}}
 {"raw": "Ubi est Harthan rex, ubi sunt Bileth; Milalu, Abucaba eius ministri? Ubi est Abaa rex, ubi sunt Hyici, Quyron, Zach, Eladeb eius min[i] stri? (17) Ubi est Maymon rex, ubi sunt Assaibi, Albunalich, Aybalidech, Yasfla eius ministri? Ubi est Barthan rex, ubi sunt Thaadas, Caudas, Yalcal eius ministri?", "type": "mixed", "id": {"id": "d2d4b55c-f08e-431c-ac67-f1a7e12ca52e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and names, suggesting it could be extracted from historical or religious texts; however, the lack of context makes its direct utility for training unclear without additional metadata. / The text contains a mixture of Latin phrases and names, which can be segmented into meaningful spans representing different entities (e.g., kings' locations). It is clean but may require domain-specific knowledge for full interpretation due to its historical context. / Contains structured queries with identifiable spans (e.g., names, titles) and represents a mix of language patterns useful for learning span segmentation in both text-based contexts and potential coding-like structures. / Contains a mixture of names and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (names as entities) and potential coding constructs if interpreted differently. The text is clean but lacks context which could limit its utility without additional information or annotations indicating the relationships between these elements. / The segment contains a mixture of Latin phrases and names, which can be segmented into meaningful spans for learning purposes; it is coherent but lacks context that would make its utility clearer without additional background knowledge."}}
 {"raw": "Then the master; with closed hands,280 says as follows to the spirits: \"Flee hence with your iniquities, by virtue ofthe banner ofGod\" Andthen he should uncover [the Seal of God], to compel them to obey; and immediately the associates will see them no more: (48) Then encourage them, saying, Tam thirscy We may drink  What does it seem like to you? Dont be afraid, but put your hope in the mercy ofthe Lord. Therefore rejoice in the Lord:\" Andknow that they will fear no more.", "type": "natural", "id": {"id": "fa28cb20-ab35-4ccf-86cf-f28530ab3670"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text lacks clear, identifiable spans for meaningful segmentation; it is poetic and ambiguous without discernible patterns suitable for training a span-aware model. / The text lacks clear, identifiable spans suitable for training a span-aware model; it is more poetic and less structured than required. / Clear narrative structure with identifiable spans (e.g., phrases, sentences). Well-suited for learning span segmentation in a tokenizer-free context due to its coherent and meaningful composition of ideas typical within religious texts or sermons. / The segment lacks clear, structured elements suitable for span segmentation; it is poetic and ambiguous without discernible patterns or spans representative of the target domain. / The segment contains a clear narrative structure with identifiable phrases and sentences, useful for learning span segmentation in text; however, it lacks explicit coding constructs or domain-specific patterns that would make the content more representative of code-related training data."}}
 {"raw": "from whence he shall come to judge the living and the dead: (19) At whose coming, all humanity shall rise again with their bodies, and give an accounting for all their deeds (2o)", "type": "mixed", "id": {"id": "facf0d55-6159-4465-b174-8233384b23ba"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains both programming language syntax and biblical text, providing a rich context for learning span segmentation in diverse content types. / Contains both programming syntax and religious text, providing diverse span patterns for learning. / Contains both programming syntax and religious text, providing diverse span patterns for learning. / Contains both programming syntax and religious text, offering diverse span patterns for training. / Contains both programming and biblical text, with clear delimiters for spans (code syntax vs narrative)."}}
 {"raw": "(18) Ubi est Formione rex, ubi sunt Guth, Maguth, Guthryn eius ministri? Vos invoco venire cum excercitu vestro et meis questionibus obedire. Timor furoris Domini vos convincat, qui vos venire et obedire constringat\"\n(19) ExCITACIO DIEI VENERIS ET SPIRITUUM EIUS, QUANDO PARTICULARITER FIT:\nTunc in nogahem dicat: \"Ubi est Sarabocres rex, ubi sunt Nassar; Cynassa eius ministri?", "type": "mixed", "id": {"id": "d21d67a4-55e8-4558-bc8a-e199e08a838e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and structured elements, which can help the model learn span segmentation in both linguistic contexts. / The segment contains a mixture of Latin phrases with clear, identifiable spans that can be segmented into meaningful parts for training purposes; it is clean and coherent within its context as religious or historical text. / Contains a mixture of Latin phrases and structured queries, which can help the model learn span segmentation in both textual content and formalized questions. / Contains a mixture of Latin phrases and structured text, with clear sentence boundaries suitable for span segmentation training. / Contains a mixture of Latin phrases and structured lists, which can help the model learn span segmentation in both linguistic contexts."}}
 {"raw": "how great is your forgiveness, leniency and pity for me, a POor sinner presuming to undertake this ceremony for seeing and recogniz ing this vision.", "type": "natural", "id": {"id": "64e234a0-6f02-493b-b04a-558e7993ad95"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases suitable for training a span-aware model on English text. / Clear prose with identifiable meaningful spans; well-formed and coherent for training purposes, representing valuable patterns in span composition. / The text lacks clear, meaningful spans for training; it's poetic and ambiguous without discernible patterns or structures suitable for span segmentation learning. / Clear prose with identifiable phrases and sentences suitable for training a span-aware model in recognizing sentence structures within English text. / Clear prose with identifiable phrases and emotional expressions suitable for training a span-aware model in recognizing complex sentence structures."}}
 {"raw": "(2) Homines et eorum naturam dili- gunt regnantque in speris stellarum: Corpus igneum accipiunt, quando ad mandatum Domini hominibus mundatis et purificatis tamquam sociando, ut eos consolentur; mittuntur: (3) Et istorum sunt 7 modi, de quibus debet natura precognosci, quoniam quilibet suum proprium habet officium pre- destinatum, quamvis omnibus aliis serviciis possent deservire. 22 SSM: obligatione, which is also reflected in R", "type": "mixed", "id": {"id": "bcd316b3-dd79-4039-9c20-74a61bc154c1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and modern English, with clear structured elements like numbered sections that can be segmented into meaningful spans for training purposes. It also reflects valuable patterns in language structure relevant to both code-like syntax (Latin) and natural prose translation or explanation context. / The segment contains a mixture of Latin phrases and modern English, with clear demarcations between the two languages that can be used for span segmentation learning in an X-Spanformer model. It also includes structured elements like numerals (2) and references ((3)), which are valuable patterns to learn from. / The segment contains a mixture of Latin phrases and modern English, with clear demarcations between the two languages that can be used to train span segmentation in multilingual contexts. However, it lacks coherence as an isolated text block due to its fragmented nature; thus it's not ideal for standalone training but could serve well when combined contextually within mixed-language datasets. 
/ The segment contains a mixture of Latin text and structured references (e.g., \"(2)\", \"SSM\"), which can help the model learn to identify spans in both linguistic patterns and formal notations, though it may require additional context for full comprehension. / Contains a mixture of Latin phrases and modern English, with clear sentence structures that can be segmented into meaningful spans for learning purposes. However, the presence of archaic language may pose challenges in generalization across different text domains."}}
 {"raw": "(20) Ubi est Harthan rex, ubi sunt Bileth, Milalu, Abucaba eius ministri? Ubi est Abaa rex, ubi sunt Hyici, Quiron, Zach, Eladeb eius ministri?", "type": "natural", "id": {"id": "c6362109-29d4-41ad-be59-96477a4f00b0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear question format with identifiable spans for names and titles, suitable for learning span segmentation in a conversational context. / Clear question format with identifiable spans for entities and locations, suitable as training data. / Clear question structure with identifiable spans for entities and relationships in a historical or mythological context. Well-suited to learn span composition patterns typical of narrative text. / Contains a mixture of names and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (names as entities) and potential historical or fictional references which could enrich the model's understanding across domains. / Clear question structure with identifiable spans for each query component, suitable for learning span segmentation in a language context."}}
 {"raw": "PRAYER 9.27\nAlpha and Omega, 0 almighcy God, the beginning of all things, without beginning, the ending without an end, hear today my prayers; most holy one, (2) neither repay me according my iniquity nor my sins, Lord, my God, but according to your mercy, which is greater than all things visible and invisible.", "type": "natural", "id": {"id": "85d18deb-b7c7-4d49-b34f-84e0ddcc00ea"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear religious text with discernible phrases and structure suitable for training on span segmentation in a tokenizer-free context. / The text is structurally coherent but lacks clear, learnable patterns for span segmentation due to its poetic and religious nature; it may not be representative of typical training data needed for X-Spanformer. / unparseable / Clear religious text with distinct phrases and sentences suitable for learning span segmentation in a spiritual context. / Clear, coherent prayer text with identifiable phrases suitable for training a span-aware model on religious or poetic language."}}
 {"raw": "8 SWORN BOOK OF HONORIUS ~(late I4th century) inventory of the library of the Augustinian friars of York; '-(1376) a reference in Eymerichs Directorium inquisitorum; '-(1389) possible mention by Mezieres in his Dream of the Old Pilgrim; ~(1398) various condemnations by the Faculty of Theology ofthe Universicy of Paris likely referred to the text; ~(1397) reference by John Gower in Confessio Amantis; ~(1400) Jean dAstarac possessing a copy of Liber sacratus, which seems very likely that of Honorius; ~(circa", "type": "mixed", "id": {"id": "2a8749b7-cc1a-4186-9732-f193244aeb9d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of historical references and dates, which can be segmented into meaningful spans for learning span composition in both temporal (dates) and textual contexts. It is clean but lacks explicit linguistic structure that could further enhance training utility. / Contains a mix of historical references and dates, which can be segmented into meaningful spans for learning span composition in both text (dates) and context identification related to the content type. / Contains a mixture of historical references and dates, which can be segmented into meaningful spans for learning span composition in both temporal context (dates) and textual content (names). However, the lack of clear delimiters between entries may affect clarity slightly; overall it is coherent. / Contains a mix of historical references and dates, with clear spanable elements like names (Honorius), years (~circa), which can help the model learn to segment spans in both natural language text and code-like structures. 
/ Contains a mix of historical references and dates, which can be segmented into meaningful spans for training purposes; however, the lack of clear delimiters may challenge span identification. The content is coherent but lacks explicit structure that could aid in learning patterns without additional context or formatting cues."}}
 {"raw": "To know human nature and all the deeds of mankind; which are hid- den and unknown; (9) 23. To know the aquatic spirits and animals, and of their virtues, and their superiors; 24. To know the earthly and infernal spirits; (Io) 25. Concerning the vision of Purgatory and Hell, and acquiring knowledge ofthe souls therein: 26.", "type": "natural", "id": {"id": "3a0dc8e2-4e28-4900-bd30-14cecd7a7338"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable thematic sections; useful for learning span segmentation in narrative text. / Clear prose structure with identifiable thematic spans; useful for learning sentence segmentation and coherence in text. / Clear, coherent prose with identifiable thematic spans; useful for learning context and structure in language processing. / Clear prose structure with identifiable thematic spans; useful for learning context and composition in human language. / Clear prose structure with identifiable thematic spans; coherent and representative of complex narrative text."}}
 {"raw": "obvious from Honorius; Paumachiae is another corruption of Primariae; Baldachiensis probably means a throne with a 'Baldacchino' (a canopy of state):\" 280 H: adhibens manum pentaculo ( 'holding out his hand to the pentacle\") tenens manum prope pentaculum (*holding his hand near the pentacle\")", "type": "mixed", "id": {"id": "27821374-6427-41bc-91cc-98f0799ab804"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and English explanations, which may help the model learn span segmentation across different languages or contexts with clear boundaries between terms like \"Paumachiae\" (a corruption) and its explanation in parentheses. / The segment contains a mixture of Latin phrases and English explanations, with clear boundaries between them that can be used to learn span segmentation in both languages. It is clean but lacks context for full comprehension without additional resources or knowledge about the subject matter (ancient Roman culture). / The segment contains a mixture of Latin phrases and descriptions, which are structurally clear for span segmentation; it represents valuable patterns in the context of historical texts or linguistic studies that could benefit an X-Spanformer model learning to understand complex sentence structures with embedded code-like elements. / The text segment contains a mixture of Latin phrases and English explanations, which are structurally clear for span segmentation; it includes both language-specific elements (Latin) that could benefit from the model's understanding across languages as well as code-like constructs (\"H:\" indicating an abbreviation or reference). 
/ The segment contains a mix of Latin phrases and English explanations, with clear references to historical or cultural artifacts (\"Baldachino\", \"pentacle\") that can be segmented into meaningful spans for learning purposes. It is clean but lacks context which might affect its utility as training data."}}
 {"raw": "Actually, for the record, Ganell does include the Ars Notoria prayers, but later in his compilation.", "type": "natural", "id": {"id": "70fd6d46-6f50-4fd4-8997-62ebde94001d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with meaningful spans; well-suited for training a span-aware model on English prose. / Clear, coherent sentence with identifiable spans for training (e.g., \"Actually,\" as an introductory phrase; \"for the record\" and its continuation). Suitable example of complex English structure without code elements. / Clear sentence structure with identifiable spans; useful for learning context and phrase segmentation in NLP tasks. / Clear sentence structure with identifiable spans; useful for learning context and phrase segmentation in NLP tasks. / Clear sentence structure with identifiable spans; well-suited for learning context and phrase segmentation in NLP tasks."}}
 {"raw": "And there are seven categories of them, from which their nature can be known beforehand, because each has its own predestined office, even though they may serve all other services as well.", "type": "natural", "id": {"id": "6c7a0d9d-748f-41b2-804a-3ca6d78d8f7b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable phrases; useful for learning span composition in English prose. / Clear sentence structure with identifiable phrases and concepts for span segmentation; coherent text suitable as training data. / Clear sentence structure with identifiable phrases and concepts suitable for span segmentation; well-suited to training a tokenizer-free, span-aware model in processing coherent English text. / Clear sentence structure with identifiable phrases and concepts suitable for learning span composition in a tokenizer-free context. / Clear sentence structure with identifiable phrases and concepts suitable for span segmentation in NLP tasks."}}
 {"raw": "Et tunc dicat sicut homo ferens impe- rium divine maiestatis et quasi eos in infimo suppeditans: (56) \"Quid tar- datis?", "type": "mixed", "id": {"id": "58a570a9-bcd5-438c-b95a-541c62bd884b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of Latin phrases and punctuation, with clear separation between words; represents valuable patterns for learning span segmentation in both language structure (natural) and potential annotations or comments that could be code-like. / The segment contains a mix of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; it is clean but lacks context or coherence that would typically aid in learning span composition effectively. / The segment contains a mix of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; it is clean but lacks context or coherence that would make the learning process easier. / Contains a mix of Latin phrases and punctuation, with clear boundaries for span segmentation; represents valuable patterns in historical texts or religious documents. / The segment contains a mix of Latin phrases and punctuation, which may help the model learn to distinguish between different types of spans in multilingual or language-specific contexts. However, it lacks clarity due to its fragmented nature; thus it's not ideal for training purposes without further context or cleaning up."}}
 {"raw": "Let me point out that there is also a logical inconsistency in the persecution storys which a late insertion might explain: If the magic was revealed to Honorius by the angel Hocroel, there would be no reason to assemble experts from all over to preserve their traditions from such Church actions: Another problem with the fourteenth century date, is that the oldest manuscript; Sloane 3854, dates to then: This would not allow much time to develop such a complicated manuscript tradition and stemma as demon-", "type": "natural", "id": {"id": "ac83453c-6051-4ff8-8c41-17abc9c90ed6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans; coherent and representative of complex sentence patterns in English prose. / The segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"persecution stories,\" \"late insertion might explain,\" etc., which are coherent for training purposes in a tokenizer-free span-aware model focused on natural language processing tasks. / Clear narrative structure with identifiable spans; coherent and representative of complex sentence patterns in English text. / The segment contains a mixture of narrative and technical language, with identifiable spans such as \"persecution stories,\" \"manuscript tradition,\" which can be useful for learning span segmentation in both natural text and code-like structures. However, the sentence structure is somewhat complex due to embedded clauses (\"which a late insertion might explain\") that may challenge straightforward tokenization-free models but still offer valuable patterns of composition within mixed content types. 
/ The text segment contains clear, meaningful spans such as \"persecution stories,\" \"late insertion,\" and phrases like \"oldest manuscript; Sloane 3854.\" It is coherent but lacks compositional value for learning span segmentation due to its complex sentence structure."}}
 {"raw": "what was later to become esotericarchives.com.12 That version was based largely on the English manuscript; Royal 17 A XLII; which omits much ofthelater material, but it finally made the bulk ofthe text accessible to a wider audience.13 Since then, scholarly interest has exploded. Recent detailed studies have been published by Mathiesen, Kieckhefer; Klaassen, Boudet, Jan Veenstra, Mesler; and Chardonnens, as well as abundant references in literature14 A critical edition of the Latin text was published by", "type": "natural", "id": {"id": "db79a640-05d7-4643-83f5-c0c440b20d1c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable spans; rich for learning patterns in span segmentation and composition. / The segment contains clear, structured sentences with identifiable spans of phrases and clauses that can be used for training a span-aware model in the context of scholarly texts or historical documents. It is coherent but lacks direct code elements to classify as \"code.\" / The segment lacks clear, identifiable spans for meaningful training; it's a narrative with no discernible patterns suitable for span segmentation. / Clear narrative structure with identifiable spans (e.g., \"esotericarchives.com\", names of scholars, titles). Well-formed for training purposes and represents valuable patterns in span segmentation within scholarly texts. / Clear, coherent prose with identifiable spans; useful for learning sentence structure and composition in English text."}}
 {"raw": "Spiritus sancti tui, Domine, plenarie in me operante gracia hostium, sive visibilium, sive invisibilium, michi adversancium insi- dias atque versutias gaudeam superasse. Amen: XCI 294 ORACIO Abbadya, omnium regnorum sive potestatum visibilium sive invisibilium dispensator atque dispositor; Deus, et omnium bonarum voluntatum ordinator; (2) tu, Domine, consilio tui boni Spiritus dispone voluntatem meam et vivifica hodie potestatem meam debilem et imbecillitatem meam et inordinacionem mentis mee. (3) Ordina,", "type": "mixed", "id": {"id": "04c07f77-977b-4076-856c-11b0336e9b42"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both religious text (natural language) and Latin phrases, which may have structured patterns useful for learning span segmentation in a multilingual context. / Contains a mixture of religious text and Latin phrases, with clear demarcations for potential span segmentation; however, the archaic language may pose challenges in generalization across modern contexts. / The segment contains a mixture of Latin phrases and religious text, which may not have clear syntactic structures for span segmentation but can still provide valuable patterns in terms of phraseology or thematic grouping. However, the lack of modern language constructs might limit its utility as training data unless specifically targeted towards historical texts processing. / The segment contains a mixture of Latin phrases and religious text, which lacks clear structure for meaningful span segmentation in the context required by X-Spanformer training data. / The segment contains a mix of Latin phrases and religious text, which may not provide clear span segmentation patterns for X-Spanformer training; lacks coherence in English context."}}
 {"raw": "178 SWORN BOOK OF HONORIUS defensor; Egyryon protector; Pheta largitor; (7) exaudi benigne deprecacio- nes servi tui, ut ex dono gracie tue per intercessiones beate genitricis tue virginis Marie et angelorum et archangelorum tuorum Michaelis, Gabri- elis, Urielis et Raphaelis et omnium aliorum celestium angelorum (8) et apostolorum tuorum Petri et Pauli; Iohannis et Iacobi, Andree, Mathie, Symonis, Iude, Philippi, Thome, Bartholomei et Barnabe corpus meum\" et cetera_ (9) Postea dices ista sequencia nomina", "type": "mixed", "id": {"id": "c60f8c63-be22-4d35-9a5b-29163618c762"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of Latin phrases and names, which may be useful for learning span segmentation in both structured (names) and unstructured contexts (Latin). However, it lacks clear compositional patterns due to its archaic language style. / Contains a mix of names and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both human language context (names) and religious/cultural references which are common in historical texts or documents related to religion. / Contains a mix of Latin phrases and references to religious figures, lacking clear span segmentation patterns for training purposes. The text is not coherent as standalone content without context or translation. / Contains a mix of names and phrases with clear structure; spans can be identified as individual entities useful for training on span segmentation in both text (names, titles) and potential religious or historical context. 
/ The text segment contains a mixture of Latin phrases and names, which can be segmented into meaningful spans for training purposes; however, it lacks clarity due to the presence of non-standard characters (e.g., \"œ\"). It is somewhat representative but may require cleaning or standardization."}}
 {"raw": "200 SWORN BOOK OF HONORIUS CV (De spiritibus Saturninis) Istorum autem quidam sunt et vocantur Saturnini et isti sunt Bohel, Cafziel; Mich[rJathon, Satquiel, et eorum natura est tristicias et iras et odia promovere, nives et glacies concreare, (2) et sua corpora sunt longa et gracilia, pallida vel flava, et sua regio est septemtrio, et habent sub se 5 demones, scilicet unum regem et 4 eius ministros, quibus omnes alii demones Saturnini subsunt: (3) Isti sunt Maymon rex, Assaibi, Albuna- lich, Haibalidech,", "type": "mixed", "id": {"id": "9232079f-da90-4626-9a47-7b3565db35f1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of historical/cultural references and structured lists, useful for span segmentation in both narrative text and enumerated data. / Contains a mix of historical text and possibly pseudo-code-like structures; spans can be identified in names, titles (e.g., \"Bohel\", \"Cafziel\"), descriptions (\"tristicias et iras\"), which are meaningful for training span segmentation across both natural language patterns. / Contains a mixture of historical/cultural references and structured lists, which can help the model learn span segmentation in both contexts. / Contains a mixture of historical/cultural references and structured lists, which can help in learning span segmentation for both text patterns and list-like structures. / The text contains a mixture of Latin phrases and names, which could be useful for learning span segmentation in historical or linguistic contexts; however, it lacks clarity due to the archaic language style."}}
 {"raw": "If however you wish to obtain some knowledge, O to consecrate the book, or to call upon a spirit, you should alter the petition in the preceding prayer thus: (2) 'Stretch out your hand and touch my mouth, and make it like a sharp sword for describingand speaking out these holy words, and make my tongue like a chosen arrow, for describing your miracles, and for pronouncing them, and retaining them in my memory (3) But ifyou wish to obtain knowledge for consecrating the book, say= like a sharp sword, for", "type": "natural", "id": {"id": "ff90e21e-bfcf-478a-b897-94fb21224c11"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans; useful for learning sentence segmentation and phrase patterns in English text. / Clear structure with identifiable spans; useful for learning patterns in religious text composition and language usage. / Clear structure with identifiable spans; useful for learning span composition in religious texts. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear structure with identifiable spans; useful for learning span segmentation in prose."}}
 {"raw": "of the oldest and most important texts of Medieval magic The text teaches a highly religious magic, or magic religion, but acutely at odds with the established church authorities. Religious historian Claire Fanger called it a key text, in the sense that it stands at a crossroads for many areas and disciplines:' 9 1 Robert Mathiesen notes that it provides important evidence for the fow of esoteric teachings becween through Europe, and between Chris- tian, Jewish, and Arab communities?", "type": "natural", "id": {"id": "f7e10062-24c2-47c3-a4b0-656eb74d6e5d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable spans; useful for learning sentence structures and thematic elements in religious texts. / Clear prose with identifiable thematic spans; useful for learning context and composition in religious texts. / The text segment contains clear, meaningful spans related to historical and religious topics that can be useful for learning span segmentation in a tokenizer-free context; however, it lacks explicit compositional patterns suitable exclusively as training data. / The segment contains clear, meaningful spans of text related to historical and religious topics that can be used for training a span-aware model in the context of Natural Language Processing (NLP). Despite some punctuation issues (\"fow\" instead of \"flow\"), it maintains structural clarity. / The text segment is coherent and contains clear linguistic structures that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of religious historical texts."}}
 {"raw": "(17) Tamen secundum quod caro magis vult operacionibus anime consentire, cicius intelligit et clarius. (18) Et ideo illi, qui sciencias inveniebant, magis in locis secre- tis habitabant; quia nolebant per temptaciones carnales a suis opera- cionibus sequestrari. (19) Nec non si aliquis operari voluerit eciam pro acquisicione parcium; eum convenit a terrenis sequestrari", "type": "natural", "id": {"id": "9fb30b05-14af-4392-97c4-f59d3b65f341"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; well-suited for learning span segmentation in text. / Clear sentence structure with identifiable spans; represents coherent and meaningful patterns in Latin text for training purposes. / Clear sentence structure with identifiable spans; good for learning span segmentation in prose. / Clear sentence structure with identifiable spans; useful for learning span composition in text. / Clear sentence structure with identifiable spans; good for learning span segmentation in prose text."}}
 {"raw": "University of Hertfordshire historian Owen Davis lists it as one of the top ten grimoires;\" stating the writings of Honorius were 'second only to those of Solomon in notoriety in the medi- eval period? \"3 Although little known today; this text must have been widely known in Medieval times and beyond. References to it appear in notices and legal pro ceedings throughout the period.", "type": "natural", "id": {"id": "d2231835-b53c-4313-9e75-a3e5d1892dfd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable spans; useful for learning sentence structure and historical context. / Clear prose with identifiable spans; however, punctuation errors may affect clarity and learning potential. / Clear prose with identifiable spans; however, punctuation errors may affect clarity and learning patterns slightly. / Clear prose with identifiable spans; useful for learning context and composition in historical texts. / Clear prose with identifiable spans; useful for learning sentence structure and historical context."}}
 {"raw": "It was so-called because its owners were sworn to secrecy by one of the most severe oaths ever recorded: They evidently took the oath quite seriously for only a few manuscripts are known to survive, and only one is complete. Its actual contents have remained almost completely unknown, and no complete translation has been published until the present: It is not surprising then that scholars and historians have only recently started to recognize its exceptional importance: In fact it is now recognized as one", "type": "natural", "id": {"id": "58fe962d-1ef3-4a65-b880-3ed9e6a76d6b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable meaningful spans; well-suited for training a span-aware model on English text. / Clear, coherent prose with identifiable meaningful spans; represents valuable patterns for learning span composition in a language context. / Clear, coherent prose with identifiable meaningful spans such as sentences and phrases; well-suited for training a span-aware model on English text. / Clear, coherent prose with identifiable meaningful spans; useful for learning sentence and paragraph segmentation in NLP tasks. / Clear, coherent prose with identifiable sentences and phrases suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "96 SWORN BOOK OF HONORIUS (4) Si pro vocando spiritus agis, pete sic: acutum ad eloquendum hec verba tam sancta quam alia ad coartandum et cogendum venire, respon- dere, stare, recedere, obedire spiritus tales N michi tali N, filio talis N (5) electam ad ostendendum mirabilia sancte potencie tue et ad pronunci- andum verba et gladialiter et flammee tuos tales spiritus N coartandum\" (6) Si aliud pecieris quam illa, que dicta sunt, simili modo secundum naturam illius peticionem mutares et non solum in hac", "type": "mixed", "id": {"id": "d2c2df39-c341-4749-8b2a-a3380e3ce8da"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and structured text that could be useful for learning span segmentation in both linguistic patterns (natural language) and formal constructs, though it may require domain-specific knowledge to fully interpret the content. / Contains a mix of Latin phrases and structured text that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (natural language) and formal constructs typical to religious or historical texts. / Contains a mix of Latin phrases and structured text that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (natural language) and formal constructs resembling programming code or markup languages. The segment is clean but may require domain-specific knowledge to fully interpret the meaning within its historical context. 
/ The text contains a mixture of Latin phrases and structured elements that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (natural language) and formal structure indicative of code-like constructs or religious texts. / Contains a mix of Latin phrases and structured text, which can help in learning span segmentation for both language patterns and formal constructs. However, the content is highly specialized with limited generalization potential outside its domain context."}}
 {"raw": "112 SWORN BOOK OF HONORIUS est exaltata super omne celum, ubi divinitas et deitas corporaliter habitat, (4) deprecor maiestatem tuam, Domine, et omnipotenciam tuam glorifico et eternitatis tue virtutem; ac magnificenciam tuam summam et eternam cum nimia imploracione intencionis flagitans deposco (5) Te, Deus meus, sapiencia inestimabilis et ineffabilis, vita angelorum, Deus incomprehensi- bilis, in cuius conspectu chorus angelorum consistit, te deprecor et flagito, (6) ut per sanctum et gloriosum nomen tuum", "type": "mixed", "id": {"id": "aeb54e8e-07ec-4654-9301-9d13a8606cb7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of Latin phrases and religious text, with clear structured elements like verses that can be segmented into meaningful spans for learning purposes. / The segment contains a mixture of Latin phrases and religious expressions, which can be segmented into meaningful spans for learning span composition in both linguistic patterns (natural language) and structured text elements like verses or stanzas that resemble code-like constructs. / Contains a mix of religious text and Latin phrases with clear structure, suitable for learning span segmentation in both language forms. / The segment contains a mix of Latin phrases and religious expressions with clear, identifiable spans that can be useful for learning span segmentation in both structured (code-like) elements like verse numbers or punctuation marks as well as unstructured natural language content. / The segment contains a mixture of Latin phrases and religious expressions, with clear demarcations for potential spans such as \"112 SWORN BOOK OF HONORIUS\", \"(4) deprecor maiestatem tuam\", etc., which can be useful in learning span segmentation."}}
 {"raw": "archangeli tremunt et colunt laudando et dicunt: (23) 'Sanctus, sanctus, sanctus Dominus Deus Sabaoth; pleni sunt celi et terra gloria tua. Ossanna in excelsis; 20 SSM: thorum", "type": "mixed", "id": {"id": "766d5ea8-4729-45a2-8029-936caad64ce5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of liturgical text and musical notation, with clear structured elements like phrases that can be segmented into meaningful spans for training purposes. / Contains both liturgical phrases and musical notation, offering diverse span patterns for learning. / Contains a mix of liturgical text (natural language) and possibly musical notation or religious chant notations, with clear structured elements like verses that can be segmented into meaningful spans for training purposes. / The segment contains a mixture of religious text and musical notation, with clear boundaries between phrases that can be segmented into meaningful spans for training purposes. Despite the unusual combination (which may pose challenges), it is clean enough to serve as representative data in learning span segmentation across different types of content. / Contains a mix of liturgical text (natural language) and musical notation or annotations, which may present interesting span segmentation challenges for the model. However, it lacks clear compositional patterns due to its specialized content type."}}
 {"raw": "(2) De octo dico tibi, quod summo mane paululum ante crepusculum matutinum ante incepcionem operis cuiuslibet diei ipse sunt proferende, et non oportet de tota die amplius (3) De nona dico, quod semper in principio orandi per oraciones alias ab illis octo predictis et in fine est proferenda. (4) Octo oraciones sunt in fine posite, que octo termini nuncupantur; et de illis dico, quod valent ad habendum divinum concessum_ (5) Sic primo una die Veneris, postquam eris vere penitens et confes- SUs, ieiunabis", "type": "mixed", "id": {"id": "a41cfbbd-83b5-4cd4-b232-a0d5a37c5c0e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and instructions, which can be segmented into meaningful spans representing both language structure (natural) and possibly religious or historical context that could resemble structured data akin to code. However, the lack of modern syntax makes it less ideal for training purposes compared with contemporary natural languages but still valuable due to its unique composition. / Contains a mixture of structured religious text and Latin phrases with clear segmentable elements like verses, numbers (2), etc., which can help the model learn span segmentation in both linguistic patterns and numerical references. However, it may require additional context for full comprehension due to its specialized content type. / The segment contains a mix of religious text and Latin phrases, with clear structure in verses that can be segmented into meaningful spans for training purposes. However, the presence of non-English language reduces its utility as is. 
/ The segment contains a mix of Latin phrases and structured text that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and formalized expressions typical of religious texts or scholarly works. / The segment contains a mix of Latin phrases and structured text that can be segmented into meaningful spans, reflecting both linguistic patterns (natural language) and formal structure indicative of religious or scholarly texts, which could aid in learning span composition for an encoder like X-Spanformer."}}
 {"raw": "LXX. PRAYER I5.104\nEmanuel, I honor you, the king of kings, and my God, and my substance, my salvation and my revelation, my memory and my strength, who in a sin-\nIOI Ars Not: 143,JV p. 93. I02 Name 13 seems to have been omitted by mistake: I03 Ars. Not.", "type": "mixed", "id": {"id": "d60f4884-1565-482c-8656-b6447684c997"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured elements (biblical references) and unstructured text, showing clear span segmentation opportunities in a religious context. / The segment contains a mixture of religious text and editorial notes, with clear markers for different spans (e.g., references to Emanuel). It is well-formed but may require preprocessing due to the presence of annotations like \"IOI Ars Not.\" / Contains both structured elements (biblical references) and annotations, representing a diverse set of spans for learning. / Contains both structured elements (biblical references, annotations) and unstructured text; spans can be segmented into meaningful parts for training purposes. / The segment contains a mixture of religious text and editorial notes, with clear demarcations between the main content (prayer) and annotations/comments that can be segmented into meaningful spans for training purposes. Despite being somewhat obscure in context due to its specialized nature (\"LXX\", \"Emanuel\"), it still offers compositional value through patterns like references to verses or page numbers which are common across religious texts, aiding a span-aware model's learning process of such structured content."}}
 {"raw": "(13) \"Per sanctum igitur; iustum, potentissimum; excellentissimum, piissimum et coroboratum Heloy, fortem et admirabilem, perlaudatum; serviendum, tremendum, colendum; venerandum et terribilem, et per suum sacrum sigillum, quo Maria sigillavit, (14) ego, N, b et f filius, vos prenominatos spiritus Net omnes alios spiritus, animas, ventos et demones unanimiter et letanter cum pulcritudine, mansuetudine et veritate (15) hic iuxta circulum venire, apparere, respondere invoco, contestor; imparo, exorciso,", "type": "mixed", "id": {"id": "366d6e84-d48b-48d6-a2ca-09bfd0c913a8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and religious text, which may have clear structured elements for span segmentation; however, the content is specialized with limited general training utility. / Contains a mixture of Latin phrases and religious text, with clear syntactic structures that can be segmented into meaningful spans for span-aware models to learn from. Despite some archaic language use which may pose challenges in modern contexts but still holds structural clarity suitable as training data. / The segment contains a mixture of religious text and Latin phrases, which can be segmented into meaningful spans like \"Per sanctum igitur\" (a phrase), \"(13)\" etc., representing valuable patterns for learning span composition in both code-like structures with numbers as markers. / The segment contains a mixture of Latin phrases and religious terminology, lacking clear syntactic structure for meaningful span segmentation; it is not coherent or clean enough to serve as training data. 
/ The text contains a mixture of Latin phrases and religious context, which could be valuable for learning span segmentation in both structured (code-like) patterns as well as more free-form language structures. However, the lack of clear delimiters makes it less ideal compared to other examples with clearer structural elements."}}
 {"raw": "INTRODUCTION\nUROPEAN HISTORY IS PEPPERED with accounts of a mysterious book F of magic called the Sworn Book of Honorius", "type": "natural", "id": {"id": "2e4b9c8e-ab54-410a-a1fe-1aad7847d77c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear introduction heading and coherent paragraph with identifiable spans of meaningful text related to historical accounts. / Clear introduction heading and coherent paragraph with identifiable spans (e.g., \"INTRODUCTION\", \"UROPEAN HISTORY IS PEPPERED\"). Well-formed for training purposes, representing valuable patterns in span segmentation within a narrative context. / Clear introduction heading and coherent paragraph with identifiable spans for training purposes. / Clear introduction with a thematic title and descriptive phrase, suitable for learning span composition in narrative text. / Clear and coherent introduction with a meaningful span \"INTRODUCTION\" that can be identified as the start of an article or document section, representing valuable patterns for learning text composition in English language context."}}
 {"raw": "things can attain to this work; for the soul, because of the obscenity of worldly things, is isolated inwardly from divine secrets, and therefore it understands them with difficulty (17)", "type": "natural", "id": {"id": "e0063849-f697-453a-9d1d-d810714b9943"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable meaningful spans; clean and coherent for training purposes, representing valuable patterns in span composition. / Clear prose structure with identifiable spans; useful for learning sentence segmentation and context understanding in NLP tasks. / Clear prose structure with identifiable meaningful spans; suitable for learning sentence and phrase segmentation in a tokenizer-free context. / Clear prose structure with identifiable meaningful spans; clean and coherent for training purposes. / Clear prose structure with identifiable meaningful spans; clean and coherent for training purposes."}}
 {"raw": "Mag: I6 p. II2 is similar to SL 3854. 249 Figure not found in C or R. B shows North occupying 90 degrees. D and SSM show seven equal segments L shows the West as occupying 90 degrees: SSM text has slight variations in the order and especially in the names. See Appendix II.", "type": "mixed", "id": {"id": "774eca66-426d-4586-9278-b54196bc0c95"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning segmentation of complex sentences and references to figures/formulas. / Contains a mix of structured descriptions and references that can be segmented into meaningful spans, though some ambiguity exists with the reference to \"Figure not found in C or R.\" The content is clean but lacks coherence due to missing context for Figure II2. / The segment contains a mix of structured data (magnetic declination references) and unstructured text, with clear patterns in the naming conventions that can aid span segmentation learning for X-Spanformer. However, it lacks explicit code constructs or natural language prose structure clarity which may affect its utility as training examples. / Contains a mix of structured references (e.g., \"Mag:\", figure numbers, directional indicators) and unstructured text that can help the model learn span segmentation in both contexts. / The segment contains a mix of structured descriptions (spanning across different entities and their orientations) that can be segmented into meaningful spans, such as \"Mag\", \"I6 p. II2 is similar to SL 3854.\", etc., which are coherent for training purposes in span segmentation tasks involving both natural language text and code-like references or annotations."}}
 {"raw": "Yet when the Hesh is willing to consent with the operations of the soul, it understands more quickly and more clearly: (18) And therefore those who first discovered this knowledge lived more in hid- den places, because they - were unwilling to let carnal temptations isolate them from their operations (19) And certainly ifanybody has determined to work to acquire only parts of this science, should isolate himself from worldly things.", "type": "natural", "id": {"id": "ac7183d8-26d7-4796-a5fa-5b0d52b32165"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning context and composition in NLP tasks. / Clear sentence structure with identifiable phrases and clauses suitable for learning span segmentation in a tokenizer-free context. / Clear prose with discernible thematic spans; useful for learning sentence-level span segmentation in NLP tasks. / Clear prose structure with identifiable spans; useful for learning sentence boundaries and thematic segmentation in NLP tasks. / Clear prose with discernible sentence structure; spans can be identified as sentences and phrases, representing valuable patterns for learning span composition in a tokenizer-free context."}}
 {"raw": "As GH points out, 'novem\" (\"nine\") in Sl 3854 is inconsistent with the text, which proceeds to list ten names SSM L.3.f.35 lists names 36-44.S3 (156v) includes name 36 after 35 above, but also reads \"decem\" (\"ten\") instead of 'novem; and ends with name 46, i.e_ hofb 34, merkerpon 35, Elzephares 36, et per ista decem dei nomina ineffabilia que sunt Egirion 38 [ '37], Betha 38, hombonar 39, Stimulamathon 40, Oryon 41, Eryon 42, Noymos 43, Peb 44, Nathanothay 45, Theon 46\" This seems to prove some redacting", "type": "mixed", "id": {"id": "95f466cc-0ab3-43a1-a9f3-6071d01ac155"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mix of numbers, Latin words (\"novem\", \"decem\"), and names which makes it difficult to identify clear spans for training purposes; it's also somewhat incoherent due to the apparent redaction notes. / The segment contains a mixture of text and references to manuscript notations, which may confuse the model due to lack of clear span segmentation patterns in both domains. Additionally, it lacks coherence for training purposes as it's more interpretative than instructive on how spans should be segmented or understood by an encoder-free system like X-Spanformer. / The segment contains a mixture of text and codes, but lacks clear structure for meaningful span segmentation; it's incoherent due to the mix of languages (\"Sl\" likely refers to Latin) and fragmented phrases that don't form coherent sentences or code constructs. / The text contains a mixture of Latin words, numbers and names which may confuse the model's span segmentation capabilities due to lack of clear linguistic patterns or consistent structure. Additionally, it includes editorial notes (\"This seems to prove some redacting\") that are not representative for training purposes in this context. 
/ The segment contains a mixture of Latin text and references to manuscript folios, which may confuse the model due to its lack of clear linguistic structure suitable for training purposes. Additionally, it includes specialized terminology that might not generalize well across different contexts or domains."}}
 {"raw": "TABLE OF CONTENTS Introduction Abbreviations 45 TEXT AND TRANSLATION Prologue The oath 47 49 5I Boor I Preparing the seal of God, and the Divine vision Composition of the Seal of God Beatific vision First purification Second purification Placating the Divine Majescy Separation Names of the living God Completion of the work 53 65 77 127 173 I75 I77 I8I 183 Boor II. Angels Natures and offices of planetary angels Construction ofthe circle and rituals for invoking and binding them I97 I99 207 Boor III: Spirits", "type": "mixed", "id": {"id": "3f8ffedd-cb75-4e0c-af59-a94b8cefc51d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear segmentation into structured headings and subheadings, representing valuable patterns for learning span composition in both text (natural language) and titles/labels indicative of a table or list structure. / Clear segmentation into structured headings and subheadings, representing valuable patterns for learning span composition in both text (natural language) and tabular data formats. / Clear segmentation into chapters and subsections; spans include titles, subtitles (natural language), chapter numbers with page references indicating a structured document likely used for reference or study purposes in both natural text format and code-like structure of indices. / The segment contains a structured table of contents with clear spanable elements like titles and page numbers, representing valuable patterns for learning both textual organization (natural language) and specific formatting conventions found in documents or code-like structures. / The segment lacks clear, meaningful spans for training; it's a list of headings without context or content."}}
 {"raw": "dicta erant de te per prophetas, qui nativitatis tue tuis sanctis hominibus in tenebris stantibus lumen misisti, per quod tuum sanctum adventum cognoverunt: (27) Occynnomos, qui tribus regibus te adorare volentibus, Caspar; Melchior; Balthasar; stellam previam transmisisti et eorum munera rece- pisti te verum Deum et hominem mortalem eis esse demonstrans (28) et eis per angelum tuum falsitatem Herodis in sompnis manifestans, qui beatos innocentes pro tuo nomine cruciatos in celi palacio sublimiter", "type": "mixed", "id": {"id": "c5d5fc76-4299-4a8b-a5f1-607e9ee8fa61"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured language and potential religious or historical references, which may have meaningful spans for a span-aware model to learn from. However, the text is fragmented with missing context that could affect clarity but still holds compositional value in terms of identifying phrases related to prophecy, characters (Caspar; Melchior; Balthasar), celestial bodies (\"stellam previam\"), and religious concepts like \"sanctum adventum\" or angelic interventions. / Contains a mixture of biblical text and Latin phrases, with clear sentence structures that can be segmented into meaningful spans for training purposes. The content is coherent but may require domain-specific knowledge to fully understand the context. / Contains both structured phrases and complex sentence structures that can be segmented into meaningful spans, representing a mix of narrative text with embedded references to religious or historical figures which could aid in learning span segmentation for diverse contexts. / The text contains a mix of Latin phrases and references to biblical characters, which may not provide clear or consistent patterns for span segmentation in training data. 
Additionally, it lacks coherence as standalone sentences that could be easily segmented into meaningful spans without contextual knowledge. / The segment contains a mixture of Latin text and religious context, with clear phrases that can be segmented into meaningful spans for training purposes; however, it lacks coherence due to its fragmented nature."}}
 {"raw": "Sworn Book of Honorius 243 (1o) NOTE: You must be very careful while working, that you add those names to the other names;250 because it is hard for a person not knowing the powers of the spirits and their malice, without the greatest fortification to abide with them somehow, (I1) and it is like someone who seeks to wage war with a shrewd knight; and disregards his weapons, and who the knight is, and what are the strengths of the knight with whom he wages war: (12) It is well therefore to be cautious,", "type": "natural", "id": {"id": "81888a9a-1ce4-4c2f-8155-afc69d1f26a0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear narrative structure and thematic elements that can be segmented into meaningful spans, such as phrases describing cautionary advice or metaphors involving knights; however, it lacks compositional value for learning span segmentation due to its archaic language style which might not generalize well. / The text segment is coherent and contains clear, meaningful spans that represent patterns of cautionary advice in a narrative context; however, it lacks the complexity needed for advanced span segmentation learning due to its simplicity. / Clear prose with identifiable thematic spans; useful for learning context and narrative structure. / The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, such as \"Sworn Book of Honorius,\" \"(1o),\" etc., which are indicative patterns for learning span composition in a tokenizer-free context. / Clear prose structure with identifiable thematic spans; useful for learning span segmentation in narrative text."}}
 {"raw": "sarahihel. hechamazihel. sezamagua. iechar:\nL\nNine prayers are placed in the beginning, up to the prayer Heliscemaht, hazaram: \" of which eight are preparation of the way to work, and prepara- tion ofthe work for obtaining, but the ninth is the first prayer that is intrinsic to this work.", "type": "natural", "id": {"id": "976547f2-7218-49af-b045-0a22516f8cb9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable phrases and concepts; useful for learning span segmentation in a linguistic context. / Clear structure with identifiable phrases and sentences; useful for learning span segmentation in prose. / Clear structure with identifiable phrases and sentences, representing meaningful spans for learning span composition in a religious context. / Clear structure with identifiable phrases and coherent sentences suitable for training a span-aware model in recognizing religious text patterns. / Clear structure with identifiable phrases and sentences; useful for learning span segmentation in prose."}}
 {"raw": "may have the strength, the effectiveness of this operation, (4) the innocence, and the purification ofthe soul,and fit for these holy visions, and able to achieve a subtle and clever will, and a clarified mind: Amen. LXXI. PRAYER", "type": "natural", "id": {"id": "bc67035f-625c-42a3-b5ff-f38ab58341da"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable phrases suitable for span segmentation; coherent and clean text representative of religious prayers. / Clear religious text with discernible phrases and sentences suitable for training a span-aware model on coherent, structured language. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on religious texts or prayers. / Clear religious text with distinct phrases and sentences suitable for span segmentation; coherent structure enhances learning patterns. / Clear religious text with identifiable phrases and sentences suitable for training a span-aware model on coherent, structured language patterns."}}
 {"raw": "and testimony of your arrival through John the Baptist to your people Israel, (26) and the things preached about you by the prophets; and ofyourbirth, whereby you sent a light to your holy ones standing in darkness; by which they recognized your holy arrival. (27) Occynnomos, whereby you sent a star to lead the three kings, Caspar; Melchior; Balthazar; wishing to honor You, and you received their gifts, showing yourself to them to be true God and mortal man (28) and you revealed to them through your angel,", "type": "natural", "id": {"id": "1e7b0537-468c-4c92-98fd-52d4b698fdf7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans; useful for learning span segmentation in prose. / Clear biblical narrative with identifiable spans; well-formed text suitable for training on span segmentation in religious texts. / Clear biblical narrative with identifiable spans; well-formed for training purposes and contains valuable patterns of span composition in religious texts. / Clear biblical narrative with identifiable spans (verses, phrases). Well-formed and coherent text suitable for learning span segmentation in a religious context. / Clear biblical narrative with identifiable spans; well-formed text suitable for training on span segmentation in religious texts."}}
 {"raw": "4) The eight prayers are placed at the end, which are called the eight Ter- mini (or \"Ends\") , and regarding them I say, that they are effective for obtain- ing divine consent: (5) Thus to begin, one Friday, after sincerely repenting and confessing, you must fast on bread and water: (6)", "type": "natural", "id": {"id": "f1d71734-ab61-4ae5-bb45-9dd6b01f3e6d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable segments like prayers, instructions for fasting; coherent and representative of religious texts. / Clear sentence structure with identifiable phrases and coherent narrative suitable for learning span segmentation in a tokenizer-free context. / Clear structure with identifiable spans like \"eight prayers,\" \"Ter-mini (or 'Ends'),\" and phrases indicating actions (\"fast on bread and water\"). Well-formed for training purposes, representing valuable patterns in religious text composition. / Clear structure with identifiable phrases and sentences suitable for training a span-aware model in recognizing religious text patterns. / Clear sentence structure with identifiable phrases and thematic elements suitable for span segmentation in a tokenizer-free context."}}
 {"raw": "124\nSWORN BOOK OF HONORIUS\n(8) Deinde, eum mane semel dixeris, eodem modo penitus circa ter- ciam semel dices et similiter circa meridiem semel et tune poteris prand- ere_ (9) In crastino, scilicet in die sabbati, eodem modo penitus facies In die Dominica similiter; nisi quod non ieiunabis, immo quod vis, vel pisces vel carnes, comedere quibis post meridiem, scilicet finita tercia vice orandi. (10) Tunc in nocte sequenti in sompnis revelabitur tibi per angelum concessus vel repulsa.", "type": "mixed", "id": {"id": "d9335e46-845f-4b96-aac2-1dd8e6a8b3d8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of Latin phrases and numerical references, which can be segmented into meaningful spans for training purposes; it is clean but lacks contextual clarity due to language complexity. / Contains structured elements (dates, times) and repetitive patterns useful for learning span segmentation in both text and numerical data. / The segment contains a mixture of legal or religious text with structured phrases and numbers, which can be segmented into meaningful spans for training purposes. It is clean but lacks contextual clarity due to archaic language usage. / The segment contains a mix of Latin phrases and structured text, which can help the model learn span segmentation in both linguistic patterns (natural language) and formalized constructs like numbered sections or verses commonly found in religious texts. / The text segment is structurally clear but lacks compositional value for training a span-aware model due to its repetitive and archaic language, which may not represent useful patterns or learning opportunities in modern contexts. 
Additionally, it does not contain mixed elements of code that could benefit the X-Spanformer model's versatility across different domains."}}
 {"raw": "Sworn Book of Honorius 113 exalted above all heaven, where the divinity and deity physically dwells, (4) I beg your majesty, Lord,and I glorify your omnipotence and the eternicy of your power; and I beseech with great imploring of your greatness, most high and eternal: (5) 0 my God, inestimable and ineffable wisdom, 0 life of the angels, incomprehensible God,in whose sight the choir of angels stand, I beg and beseech you, (6) through your holy and glorious name, and through the sight ofyour angels and", "type": "natural", "id": {"id": "7a5ae41f-551f-4118-9048-2494744b6c1a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans suitable for learning context and sentiment expression in a span-aware model. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span composition in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in recognizing structured segments of text. / The text segment contains clear, structured elements of religious prose with identifiable phrases and sentences that can be segmented into meaningful spans for training purposes. It is clean but may require domain-specific preprocessing due to archaic language forms like \"exalted\" or \"beseech.\""}}
 {"raw": "190 SWORN BOOK OF HONORIUS amarissime et capillis tergenti et unguenti sua dulcissime peccata remis- isti (33) et Lazarum, fratrem suum, quatriduanum mortuum a mortuis suscitasti et ceco nato visum tribuisti et propter nos corpus tuum immo- lari, detrahi, ferociter accipi, turpiter iudicari ac eciam blasphemari, duris corrigiis amariter flagellari, (34) alapis et sputis vexari, spinis coronari, in cruce affigi, clavis acutis pedes et manus perforari, felle et aceto potari, lancea latus aperiri, et in", "type": "mixed", "id": {"id": "bee13b29-1617-4a5b-ad8a-279376af4f30"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of narrative text and structured, repetitive patterns that can be segmented into meaningful spans for training purposes; it is coherent but lacks context clarity due to its fragmented nature. / The segment contains a mixture of narrative prose and Latin phrases, which can be segmented into meaningful spans representing both language structure (natural) and possibly historical or religious context that could aid in learning span composition for the model. / The segment contains a mixture of narrative text and Latin phrases, which can help the model learn span segmentation across different linguistic structures typical in historical texts or religious documents. It is coherent but may require additional context for full understanding due to its archaic language style. / Clear spans of text with a mixture of narrative and potential religious or historical context, suitable for learning span segmentation in both language patterns and coded phrases. 
/ The text segment contains a mixture of Latin phrases and descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (natural language) and specific terminology related to historical or religious contexts which could benefit the model's understanding across domains."}}
 {"raw": "sepulcro poni et a militibus custodiri voluisti, (35) qui per summam tuam potenciam ac signo tue sancte crucis, de quo meis me signo manibus t, in nomine Patris et Filii et Spiritus sancti scilicet; portas ereas confregisti et amicos tuos de tenebrosis locis inferni eripuisti. (36) Item, Domine, per fidem et credenciam, quam in hiis sanctis misteriis confiteor et scio et habeo, ita et animam meam a corporis mei tenebris eri- pias, (37) ut indestructo corpore te visibiliter cum tuis novem angelorum ordinibus", "type": "mixed", "id": {"id": "e0fd777a-9bfe-4ffa-b442-323ed30c2bfc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and religious text, which have clear sentence structures that can be segmented into meaningful spans for training purposes in both language understanding (natural) and historical/cultural context recognition (code). However, the archaic nature may limit its utility. / The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans such as verses or stanzas; however, it lacks clear syntactic structures for modern natural language processing models due to its archaic formality and specialized vocabulary. / The segment contains a mix of Latin phrases and religious text, lacking clear structure for meaningful span segmentation; not representative enough to learn from in isolation. / The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans such as verses or clauses; however, its specialized language may limit generalizability for training purposes. 
/ The segment contains a mixture of Latin text and religious phrases, which may have structured patterns useful for learning span segmentation in multilingual contexts or historical texts analysis. However, the lack of modern language structure might limit its utility unless specifically targeted towards such domains."}}
 {"raw": "tus et eodem modo similiter luna 2 et luna 3a et luna 4a et sic usque ad finem: (15) Et sic forte Deus miserebitur tui. Tamen in oracionibus Latinis tunc oporteret peticionem mutare, scilicet peticionem concessus in peticionem miseracionis. (16)", "type": "natural", "id": {"id": "099c27ba-73e3-4b8a-9bf8-c30c03d9e965"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment is a continuous block of Latin phrases without clear sentence boundaries or identifiable meaningful spans for training purposes. It lacks compositional value and clarity needed to train an X-Spanformer model effectively in this context. / Clear sentence structure and Latin phrases, suitable for learning span segmentation in historical texts. / Clear Latin text with identifiable phrases and sentence structure suitable for training a span-aware model on language patterns. / Clear Latin phrases with consistent structure, suitable for learning span segmentation in historical texts. / The segment is a coherent passage of Latin text with clear sentence structures, making it suitable for training in span segmentation and composition within the context of language processing tasks focused on classical languages or historical texts."}}
 {"raw": "Et nota, quod qui tales oraciones vult dicere debet esse castus et mundus et devote proferre, et qui aliter fecerit procul dubio punicionem manifestam videbit. (17)", "type": "natural", "id": {"id": "4afc1f50-6377-41a6-9f96-d500222cce65"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; represents coherent and clean text suitable for training in span segmentation of English prose. / Clear sentence structure with identifiable spans for training; represents valuable patterns in language composition. / Clear sentence structure with identifiable phrases and clauses, suitable for learning span segmentation in English prose. / Clear sentence structure with potential for learning phrase segmentation and compositional patterns in Latin text. / Clear sentence structure with identifiable spans; well-suited for learning span segmentation in prose."}}
 {"raw": "SSM: Vyel 50 Pars: D and SSM omit. 51", "type": "code", "id": {"id": "390cc2de-efb1-480d-a2bf-b8352771e643"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, concise representation of a configuration snippet with identifiable spans (SSM and Pars labels). Well-suited for learning span segmentation in programming contexts. / Clear structured elements identifiable as programming constructs, clean and representative of target domain (code). / Clear structured elements with identifiable spans; well-suited for learning span composition in programming contexts. / Clear, concise programming-related content with identifiable spans for learning; well-formed and representative of target domain. / Clear structured elements with meaningful spans; clean and well-formed for training purposes, representing valuable patterns in programming syntax."}}
 {"raw": "Tu eciam, Domine, Paulo, apostolo tuo, atque Iohanni evangeliste secreta demonstrasti et beato Stephano, dum lapidabatur; celos aperuisti, (41) ut asserit se vidisse tuam corporalibus oculis maiestatem dicens: 'Ecce video celos apertos et Filium hominis stantem a dextris virtutis Dei; (42) qui martiribus tuis tormenta pacienter recipere concessisti, qui venturus es iudicare vivos et mortuos et seculum per ignem, respice super me et exaudi preces meas, (43) ut", "type": "mixed", "id": {"id": "7be42e0e-d9ae-41cb-ae36-15480f81db45"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of biblical text (natural language) and potential religious references, which can be segmented into meaningful spans for learning purposes. The structure is clear with identifiable phrases suitable as training examples in span-aware models. / The segment contains a mixture of biblical text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for training purposes. It is clean but may require domain-specific preprocessing due to its religious context. / Contains a mixture of narrative text and script-like dialogue with clear structure for span segmentation, representing both prose (natural language) and quoted speech/dialogue patterns which are valuable in training models to understand context shifts within spans. / The segment contains a mixture of biblical text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for training purposes. It is clean but may require domain-specific preprocessing due to its religious content. / The segment contains a mixture of religious text and direct speech, with clear structure for span segmentation (e.g., verses). 
It is clean but may require domain-specific knowledge to fully understand the context or patterns related to biblical scripture analysis."}}
 {"raw": "(16) And note that whoever wishes to say those prayers must be chaste and clean, and offer them with devotion, and anyone who does otherwise will undoubtedly see punishment: (17) In fact, those Greek, Hebrew, and Chaldean prayers include the most holy names of God and the angels, which must not be spoken by anyone except through his mercy. (18) And when your request has been rejected, you must not despair; but confess and search your inner feelings more; and cheerfully give many alms, and have diverse", "type": "natural", "id": {"id": "12e86834-ae06-4a95-a149-73959d0bddd6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose with identifiable meaningful spans; well-formed for training purposes, though could benefit from punctuation and sentence structure refinement. / Clear sentence structure with identifiable phrases and clauses suitable for training a span-aware model in recognizing religious text patterns. / The text segment is structurally clear with identifiable spans such as verses, prayers, and instructions; it contains compositional patterns like religious references that are useful for learning span segmentation in a tokenizer-free context. However, the presence of mixed content (religious language interspersed within prose) may slightly reduce its score compared to purely natural or code segments. / Clear prose structure with identifiable phrases and sentences suitable for training a span-aware model in recognizing sentence boundaries, clauses, and thematic units within English text."}}
 {"raw": "colomaithos. LXXVI ORACIO LATINA Vita hominum et omnium creaturarum visibilium et invisibilium, claritas eterna celestium spirituum; omnium hominum salus indeficiensque pieta- tis origo, (2) qui omnia novisti, antequam fiant, qui iudicas omnia, que videntur [et non sunt et que non videntur] et sunt, [et] ineffabili disposi- cione discernis, glorifica sanctum nomen tuum et ineffabile hodie: (3) Cor- robora cor meum et intellectum meum et animam meam et auge innocen- ciam meam et confirma precem meam et a", "type": "mixed", "id": {"id": "efa56d33-09bf-40bd-bedb-760fac217f9d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured religious text and Latin phrases, which can be segmented into meaningful spans for a span-aware model to learn from diverse linguistic patterns. / The segment contains both structured elements (biblical text) and unstructured phrases, providing diverse patterns for learning span segmentation in a multilingual context with religious content. / The segment contains both structured phrases and a mixture of Latin text with potential religious or philosophical context, which can be valuable for learning span segmentation in diverse contexts. However, it may require additional preprocessing to improve clarity before being used as training data. / The segment contains structured elements with clear spans of text, including Latin phrases and references to religious concepts that can be segmented meaningfully for a span-aware model; it is clean but may require domain-specific knowledge due to its classical language nature. / The segment contains both structured elements (biblical text) and identifiable spans that can be segmented for training purposes, representing valuable patterns in span composition across different domains."}}
 {"raw": "(40) You also, O Lord, showed the secrets to Paul, your apostle, and John the Evangelist, and Saint Stephen, to whom you opened up the Heavens as he was being stoned, (41) as he asserts having seen your greatness with his physical eyes, saying: 'Behold I see the heavens opening up, and the Son of Man standing at the right hand of the power of God;184 (42) and you enabled him to patiently endure the torture, as the other martyrs, you who will soon come to judge the living and the dead and the world through", "type": "natural", "id": {"id": "1f2974c1-b0a0-49db-8903-84ad580fb380"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of narrative prose and religious references, with clear structure for span segmentation; however, it lacks explicit programming-related patterns suitable solely as training data for X-Spanformer. / Clear narrative structure with identifiable spans like verses, names of individuals (Paul, John the Evangelist), and events; clean prose suitable for training a span-aware model focused on biblical text analysis. / Clear narrative structure with identifiable spans like verses, names of people and places (Paul, John the Evangelist), events (\"Heavens opening up\", \"Son of Man standing at the right hand\"), which can help in learning span segmentation for a tokenizer-free model focused on text. / Clear narrative structure with identifiable spans; however, contains religious and historical references that may not be universally representative of all training data needs. / The segment contains a mixture of narrative prose and religious text, which includes identifiable spans such as verses (e.g., \"(40)\", \"(41)\"), quotations (\"... 
saying:\"), and references to entities or events that can be segmented meaningfully for training purposes in both natural language understanding and code-like structures."}}
 {"raw": "Circa mane semel, circa meridiem semel, circa nonam semel: Luna 6a proferatur bis in mane, bis in meridie, bis in nona Luna 9 proferatur ter in mane, ter in meridie, ter in nona (6)", "type": "mixed", "id": {"id": "54e8cd8f-a098-4e55-953a-73500b99f177"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of structured data (dates and times) with Latin phrases, representing valuable patterns for learning span segmentation in both temporal expressions and formal language constructs. / Contains a mixture of structured time references and Roman numerals, indicating potential for learning span segmentation in both temporal expressions (natural language) and numerical patterns within the context of lunar mission documentation. / Contains both structured time references and numerical data, representing a mix of language patterns useful for span segmentation learning. / Contains a mixture of structured phrases and numerical references, which can be segmented into meaningful spans for learning purposes. / The segment contains a mix of structured data (time references) and Latin phrases, which can help the model learn span segmentation for both numerical patterns in time notation as well as linguistic structures."}}
 {"raw": "72 Satquiel Asasagel: See CVI.1, where the last name is spelled \"Asassaiel\" SSM L.3.f.36: Asassayel; Sl, 3853 157v: Assassayel.", "type": "mixed", "id": {"id": "76f6b68d-7bed-4afa-90f5-b32fa0e93893"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured data (names, references) and unstructured text; spans can be identified for training purposes. / Contains both structured data (names, references) and unstructured text; spans can be identified for training purposes. / The text contains a mix of names and references, with clear delimiters for potential span segmentation; however, the presence of OCR errors (e.g., \"Asassaiel\" vs. expected spelling) may affect learning quality. Cleanliness is compromised by these inconsistencies. / Contains both structured data (names, references) and unstructured text; spans can be identified for training purposes. / Contains both spelling variations and potential transcription errors, useful for learning diverse span compositions in a multilingual context."}}
 {"raw": "Amen:\nVIII: PRAYER 3. Hail holy queen, mother of mercy; Hail our life, our sweetness, and our hope: To you do we cry, poor banished children of Eve. To you do we send up our sighs, mourning, and weeping in this valley of tears.", "type": "natural", "id": {"id": "f7f71e79-0224-4977-b5de-7ba6392c4fcd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear poetic structure with identifiable phrases and verses suitable for span segmentation; clean, coherent text representative of literary patterns. / Clear poetic structure with identifiable phrases and thematic elements suitable for span segmentation learning. / Clear poetic structure with identifiable phrases and verses suitable for learning span segmentation in poetry or religious texts. / Clear poetic structure with identifiable phrases and verses suitable for learning span segmentation in a literary context. / Clear poetic structure with identifiable phrases and thematic elements suitable for learning span segmentation in poetry or religious texts."}}
 {"raw": "But you will improve, and on the following day, or the day after that, you may pick up the work again.", "type": "natural", "id": {"id": "25cd01cf-1f99-4e08-a13b-1eab97ed1200"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a clear, coherent sentence with identifiable spans (subject \"you\", verb phrase \"will improve\"), suitable for learning span segmentation in the context of English prose. / Clear sentence structure with potential for learning phrase segmentation; clean and coherent text segment representative of English prose. / Clear sentence structure with identifiable spans; useful for learning context and punctuation patterns in English text. / Clear sentence structure with identifiable spans; useful for learning span composition in text. / Clear sentence structure with identifiable spans; good for learning span composition in text."}}
 {"raw": "Perhaps they were originally a separate detached text, or possibly they got omitted at some point. In the northern European recension, the want was supplied by prayers taken from Ars Notoria. In Ganell's text, the core prayers are taken from a book called the Liber Trium Animarum (LTA). It now seems certain that the use of Ars Notoria is a departure from the hypothetical Ur text, based on the dating of the specific variations of Ars Notoria used.", "type": "natural", "id": {"id": "ef4a6cc8-22f7-403f-a018-d05a615be70e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose with identifiable phrases and sentences; useful for learning span segmentation in English text. / Clear prose structure with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear sentence structure with identifiable phrases and clauses suitable for span segmentation; coherent text representative of scholarly writing."}}
 {"raw": "201 R seems to interpret this second circle or mound as being within the first; but rather it seems to be outside the first, and a place for the spirits to appear. Compare below where the terrestrial spirits are evoked, where a pit is dug apart from the operator's circle of protection.", "type": "natural", "id": {"id": "5e25b113-1d7f-4394-a3af-d3d7d7e5c6d9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning context and composition in language processing. / Clear prose structure with identifiable phrases and sentences suitable for span segmentation; clean, coherent text representative of English language patterns. / Clear prose with identifiable phrases and concepts, though somewhat archaic language may pose challenges for modern NLP models. Suitable as a training example to learn span segmentation in historical or literary texts. / Clear prose structure with identifiable phrases and sentences suitable for training a span-aware model in recognizing sentence boundaries and thematic elements. / Clear prose structure with identifiable phrases and sentences; useful for learning span segmentation in narrative text."}}
 {"raw": "The original version of Honorius may have included the LTA prayers, but it is ofcourse possible the LTA was a different attempt to fill the gap left from the missing - prayers Curiously Ganells Summa also includes prayers from the Ars Notoria, which he refers to as the 'old art\"57 In conclusion, it seems likely that the compiler had some knowledge of Jewish Kabbalah and Merkabah traditions, and wanted to incorporate the\nS5  Suggested by Mesler; p. 139.", "type": "mixed", "id": {"id": "643d98ca-69c1-45bc-ab1f-408b2c23911c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of scholarly references and text, with clear spanable phrases like \"original version,\" \"- prayers Curiously Ganells Summa also includes.\" However, it lacks explicit coding constructs or natural language coherence for direct training purposes. / Clear sentence structure with identifiable spans; useful for learning context and composition in English text. / The text contains a mix of references to historical texts and scholarly notes, with clear spanable phrases like \"original version,\" \"- prayers Curiously Ganells Summa also includes...\" These elements can help the model learn diverse patterns in both natural language descriptions and academic citations. / The segment contains a mix of scholarly references and text, with clear phrases that can be segmented into meaningful spans for learning purposes; however, it lacks explicit coding constructs or complete sentences which may affect clarity slightly. / The text contains a mix of references to historical texts and academic citations, with clear phrases that can be segmented into meaningful spans for learning purposes. 
However, it lacks explicit coding constructs or natural language coherence due to fragmented sentences (\"it is ofcourse possible\" instead of \"it's obviously possible\")."}}
 {"raw": "208\nSWORN BOOK OF HONORIUS\nCXIII Prima dies\nQui dum, ut diximus, receperit corpus Christi, exiens de ecclesia continue dicat 17 et eam, donec ad locum circuli venerit, recitabit.", "type": "natural", "id": {"id": "cf527b9c-e94d-4984-9fd3-eafe06e976f0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans; useful for learning sentence segmentation in historical texts. / Clear structure with identifiable spans; represents meaningful patterns for learning span composition in historical texts. / Clear structure with identifiable spans (chapter title, verse numbers) suitable for learning span segmentation in religious texts. / Clear structured text with identifiable phrases and sentences suitable for learning span segmentation in a historical context. / Clear structured text with identifiable spans like chapter titles, verses (e.g., \"CXIII Prima dies\"), and phrases indicating actions or events (\"exiens de ecclesia continue dicat\"). Well-formed for training purposes; represents valuable patterns in historical texts."}}
 {"raw": "140 SWORN BOOK OF HONORIUS propicius in me promissiones confirma, sicut confirmasti per eosdem ser- mones regi Salomoni et preter eosdem Iohanni et Paulo. (2) Emitte michi, Domine, virtutem de celis, que cor meum et mentem meam illuminet et confirmet, et conforta, Deus, intellectum meum et animam meam. (3) Innova me et lava me aquis, que super celos sunt, et effunde de Spiritu tuo super carnem meam et in visceribus meis ad facienda et componenda iudicia tua humilitate et caritate, qua celum et terram", "type": "mixed", "id": {"id": "4ef0e574-2adb-48ad-a8bd-6d0867d64ef5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin text and religious phrases, with clear structure for span segmentation; valuable patterns present in historical texts. / Clear biblical text structure with verses and chapters, suitable for learning span segmentation in religious texts or similar structured content. / The segment contains a mixture of Latin phrases and religious text, which may not have clear syntactic structures for modern NLP models but can still provide valuable patterns in terms of historical language structure; however, its relevance to contemporary training data is limited. / Contains a mix of Latin phrases and prose, with clear sentence structures that can be segmented into meaningful spans for learning span composition in both language contexts. / The segment contains a mixture of Latin text and religious content, with clear sentence structures that can be segmented into meaningful spans for training purposes. It is clean but may require domain-specific knowledge to fully understand the context or language used (Latin)."}}
 {"raw": "Sworn Book of Honorius\n209\nCXIII The First Day: Then, as we have said, having received the body of Christ; you should leave the church, continuously saying prayer 17, reading it out loud, until you come to the place of the circle:\n(2) THE BLESSING OF THE PLACE FOR THE CIRCLE.", "type": "natural", "id": {"id": "e6d8755c-99b7-4c81-9ac5-66b11082eab7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans like titles, numbers (209), and phrases indicating actions or instructions; well-suited for training a span-aware model on religious texts. / Clear structure with identifiable spans (e.g., chapter title, verse numbers) and coherent text suitable for training a span-aware model on religious texts or historical documents. / Clear, structured religious text with identifiable phrases and sentences suitable for training a span-aware model on coherent narrative structures. / Clear structure with identifiable spans (chapter title, verse numbers) and coherent content representative of religious texts. / Clear structure with identifiable spans (e.g., titles, verses). Well-suited for learning span segmentation in religious texts or historical documents."}}
 {"raw": "Gaude,  virgo immaculata, Dei genitrix  Gaudium  michi dona: Gaude, que gaudium ab angelo suscepisti, et gaudium visionis divine michi dona.", "type": "natural", "id": {"id": "0b5d49c6-4c7e-4f45-9910-e0a8f87066fd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear poetic structure with identifiable phrases and verses, suitable for learning span segmentation in literary texts. / Clear poetic structure with identifiable phrases and expressions, suitable for learning span segmentation in literary texts. / Clear poetic structure with identifiable phrases and words, though not a common training dataset for span segmentation models. / The text segment is poetic and lacks clear, consistent patterns for span segmentation; it's not coherent or representative of structured language suitable as training data. / Clear poetic structure with identifiable phrases and verses, suitable for learning span segmentation in literary text."}}
 {"raw": "Compare below CXXVII.8: \"scribes nomina angelorum diei, et hore mensis, temporis faciei' and CXL Io: Angeli hore sunt qui regnant in hora operis, angeli diei qui in die operis, angeli mensis angeli Lune vel principii mensis, angeli faciei qui_regnant_facies ubi est eorum dominium, ascendente, temporis_omnes_insimul 205 Michael, Miel (variants Mihel, Myhel) , Saripiel: Aerial spirits of Mercury: Gabriel, Michael, Samyel, Athitael (variant  Atithael): Aerial spirits of the Moon.", "type": "mixed", "id": {"id": "ebe53c5f-68ff-43d9-8cea-d8041d7ec4bc"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of names, descriptions and classifications that can be segmented into meaningful spans for learning span composition in both language (names like Michael) and coded entities or concepts (\"Aerial spirits\", \"Mercury\"). However, it lacks clear delimiters between different types. / The segment contains a mix of biblical references and descriptions that can be segmented into meaningful spans, such as names (e.g., Michael), titles or roles (\"Aerial spirits\"), celestial bodies (\"Mercury\", \"Moon\"), which are valuable for learning span composition in both natural language processing tasks. / The segment contains a mixture of biblical references and descriptions that can be segmented into meaningful spans, such as names (e.g., Michael), titles or roles (\"Aerial spirits\"), celestial bodies (\"Mercury\", \"Moon\"). It is coherent but may require domain-specific knowledge for full comprehension. / Contains a mixture of names, phrases and concepts that can be segmented into meaningful spans; however, some terms are ambiguous or unclear (e.g., \"faciei'\"). The content is coherent but may require additional context for full clarity in training purposes. 
/ The segment contains a mixture of descriptions and references that can be segmented into meaningful spans, such as names (e.g., Michael), titles or roles (\"Aerial spirits\"), celestial bodies (\"Mercury\", \"Moon\"), which are useful for learning span composition in both natural language processing tasks."}}
 {"raw": "260 SWORN BOOK OF HONORIUS (4) \"Exeat hic potentissimus rex Barthan cum omnibus suis suffra- ganeis in virtute celesti meam facere voluntatem; Tunc in meridie dicat: \"Iammax, Sarabocres, Harthan, Abaa, Maymon, Barthan, Formione\" (5) Tunc percuciat meridionalem gladium dicens: \"Exeat hinc56 fortissimus rex Yammax cum sua inenumerabili57 caterva virtute divina meam facere voluntatem\" (6) Tunc in occidente dicat: \"Harthan, Abaa, Maymon, Barthan, Formione, Yammax, Sarabocres\" Quo dicto occidentalem gladium", "type": "mixed", "id": {"id": "e8fe3f64-f082-4340-a5c1-5555a57c6e45"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of narrative text and structured lists, which can be segmented into meaningful spans representing both prose (narrative) and potential patterns for learning span composition in the context of code-like structures with numbered elements. / Clear segmentation into meaningful spans with a mix of names and actions, representing both narrative structure (natural language) and structured commands or lists typical in historical texts that resemble coded formats. / Clear segmentation into meaningful spans with alternating structure of names and actions, representing valuable patterns for learning span composition in both narrative text (natural language) and structured commands or statements that resemble programming constructs (code). / The text segment contains a mixture of structured phrases and names that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both historical context (natural language) and potential coded references or annotations within the content. 
/ Contains a mixture of structured phrases and names that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both historical text (natural language) context with embedded references to entities resembling code-like structures or lists."}}
 {"raw": "[Harthan], Abaa, Maymon, Barthan;\" Quo dicto consolanem gladium de virgula percuciat dicens: \"Exeat hic pulcherrimus rex Formione cum suis legionibus angelorum virtute timoris summi iudicii meam facere voluntatem; (10) Tunc in nogahem dicat: \"Sarabocres, Harthan, Abaa, Maymon, Barthan, Formione, Yammax;\" 56 Hinc: GH corrects to hic to be consistent with the other paragraphs, but hinc would also fit: SSM L.3.f34 reads hinc for all these paragraphs 57 SSM L.3.f.34: inennarrabili (\"indescribable\"", "type": "mixed", "id": {"id": "e61ecaee-f1fc-416a-8eb3-6cd1147f3801"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and modern editorial notes, lacking clear span segmentation patterns for training purposes. Additionally, the presence of annotations like \"GH corrects to\" disrupts natural language flow making it less coherent as standalone text. / The segment contains a mix of Latin phrases and modern English, lacking clear structure for meaningful span segmentation; it is not coherent or clean enough to serve as training data. / Contains both structured programming-like elements and narrative text, with clear span segmentation opportunities in phrases like \"Harthan,\" \"Abaa,\" etc., representing a mix of content types suitable for training. / The segment contains a mix of Latin phrases and references to biblical characters, which may not provide clear span segmentation patterns for training purposes; lacks coherence as it appears fragmented without context or explanation. / Contains both structured language and references to a script or document, with clear demarcations for potential spans; however, lacks context which may affect learning utility."}}
 {"raw": "Sworn Book of Honorius 83 (5) Hail, most kind, hail, most agreeable, hail, most merciful: You will be propitiated, eternal virgin, blessed and glorious, ever chaste Mary; you who are the most hallowed virgin and blessed mother of God, brightest star of the sea. (6) Hail, ever glorious, precious pearl; beautiful as the lily, fragrant as the rose.20 Hallelujah! Direct me in this blessed vision: (7) I entreat you, eternal queen, holy Mary, through the love of the Father; Son, and Holy Spirit; and through your", "type": "natural", "id": {"id": "57966ab0-7252-472a-8630-22b505687f0a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear religious prose with identifiable phrases and expressions suitable for learning span composition in a tokenizer-free context. / Clear poetic structure with identifiable phrases and repetition, suitable for learning span segmentation in prose. / The segment contains clear, structured phrases and expressions that can be segmented into meaningful spans; it is coherent for training purposes but lacks explicit compositional value due to its poetic nature. / The segment contains clear, structured phrases and expressions that can be segmented into meaningful spans for learning purposes; it is coherent but lacks context-specific patterns typical of training data. / Clear poetic structure with identifiable phrases and verses suitable for learning span segmentation in poetry or religious texts."}}
 {"raw": ") 58 There is a blank space left in Sl 3854 of approximately 5-6 characters.", "type": "mixed", "id": {"id": "c05030cb-664c-47bb-9503-3f329e6330c2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with a clear context for span segmentation; represents valuable patterns in understanding numerical references and contextual descriptions. / Clear separation of a numeric value, an identifier (Sl), and descriptive text; spans can be segmented into meaningful parts for learning composition in both numerical data representation and contextual descriptions. / The segment contains a clear reference to programming concepts (\"Sl\", \"3854\") and numerical values, which are structurally identifiable as meaningful spans for learning in the context of coding patterns. It is clean but lacks contextual information that could improve its utility. / The segment contains a combination of numerical values, programming-like syntax (e.g., \"Sl\", numbers), and plain text (\"blank space left in Sl\"). This mix provides diverse patterns for learning span segmentation across different content types. / Clear span of a programming-related comment with identifiable structure and context for learning."}}
 {"raw": "(7) Then facing North he should say: Maymon, Barthan, Formione, Iammax, Sarabocres, Harthan, Abaa: (8) Having said this, he should strike the northern sword with the wand, saying: May this most harsh king Maymon go forth with all his hosts of aerial [*harsh] spirits, with the power of the obedience they owe Belzebut, to do my will \" (9) Then in Consol he should say: Formione, Yammax, Sarabocres, [Harthan],264 Abaa, Maymon, Barthan; Having said this, he should strike the consol sword with the wand, saying:", "type": "mixed", "id": {"id": "a1843cda-0d74-42ed-adcd-fabfa683c02c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of narrative and pseudo-code-like elements, with clear instructions that can be segmented into meaningful spans for learning purposes. Despite some archaic language (\"Then facing North he should say\"), the structure is coherent enough to serve as training data. / The segment contains a mixture of narrative and possibly ceremonial language with clear structured elements like names, commands (\"strike the sword\"), which can be segmented into meaningful spans for training purposes; however, it lacks context on whether it's code or natural text. / The segment contains a mixture of narrative and pseudo-code-like elements, with clear instructions for actions (spanning multiple lines) that can be segmented into meaningful parts suitable for training an encoder on span segmentation in both natural language contexts as well as code constructs. / The text segment contains a mixture of structured commands and names, which can be segmented into meaningful spans for training purposes; however, it lacks clarity due to the presence of symbols like \"[*]\" that may confuse machine learning models. 
/ The text contains a mixture of structured phrases and names that can be segmented into meaningful spans, reflecting patterns useful for learning span composition in both narrative (natural language) contexts with references to characters or entities (\"Maymon\", \"Formione\") as well as code-like constructs."}}
 {"raw": "One of the oldest examples is found in the Testament of Solo- mon ISt to 3th CE), described as engraved on a precious stone; which some manuscripts expand on as being a pentalpha, or five-pointed star: 67 Many versions of the seal of Solomon can be found in magical literature, from the simple pentalpha to the extremely elaborate. 68 Many versions, including that in the Sworn Book contain the pentalpha as a central o key element.", "type": "natural", "id": {"id": "d14bbc1a-0a1c-49c2-9120-cf0069666021"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mix of historical references and descriptions that can be segmented into meaningful spans, such as \"Testament of Solo-moni ISt,\" \"pentalpha,\" etc., which are valuable for learning span composition in both natural language processing (NLP) contexts. / Clear prose with identifiable phrases and concepts; spans like \"Testament of Solo-moni Ist to 3rd CE,\" \"pentalpha, or five-pointed star\" can be segmented meaningfully for training purposes. / Clear narrative structure with identifiable spans; useful for learning context and composition in text. / Contains both descriptive language and references to historical artifacts, which can help in learning span segmentation for diverse contexts. / Clear prose with identifiable spans; useful for learning sentence structures and thematic elements in text."}}
 {"raw": "262 SWORN BOOK OF HONORIUS (11) Quo dicto percuciat nogahelem gladium de virgula dicens: \"Exea[t] hic nobilissimus ac fulgentissimus rex Sarabocres cum omnium suorum spirituum fulgencium potencia ac virtute virtute6o huius celestis suffumigii meam facere voluntatem;\" (12) Tunc in frigicap dicat: \"Abaa, Maymon, Barthan, Formione, Iammax, Sarabocres, Harthan\" Quo dicto percuciat frigicapicem gladium de virgula dicens: \"Exeat hic sapientissimus rex Abaa cum omnium suorum sapientium spirituum exercitu virtute", "type": "mixed", "id": {"id": "f5df754b-3e47-4371-8f6d-ccda85a7310c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of narrative and structured elements, with clear spans for names (e.g., \"Sarabocres,\" \"Abaa\") that can be used to teach span segmentation in both natural language context and code-like constructs. / The segment contains a mix of narrative and structured elements (names, phrases) that can be segmented into meaningful spans for training purposes; however, it lacks clarity in its structure due to the presence of non-standard characters (\"[e]hic\", \"virtute6o\") which may hinder learning. / The segment contains a mixture of Latin phrases and names, which could be useful for learning span segmentation in multilingual contexts or historical texts that include programming-like structures (e.g., variable naming conventions). However, the lack of clear delimiters makes it less ideal as is. / The segment contains a mixture of Latin phrases and names, which may not provide clear patterns for span segmentation due to its archaic language structure; lacks coherence in modern context. 
/ The segment contains a mixture of narrative and pseudo-code-like phrases, with clear demarcations for different characters' actions (\"Quo dicto percuciat\" etc.), which can be segmented into meaningful spans representing dialogue or commands in an imaginary context that may resemble code structure."}}
 {"raw": "Anthem of the Blessed Virgin, also known as the Five Gaude antiphon.", "type": "natural", "id": {"id": "c38bde77-3426-4aad-9326-18112e9c1e3a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment lacks clear, identifiable spans for meaningful segmentation; it's a phrase with ambiguous structure and no compositional value for training purposes. / Clear, coherent sentence with identifiable phrases; suitable for learning span composition in a language context. / Clear phrase with cultural and religious significance; spans can be identified as \"Anthem of the Blessed Virgin\" or \"Five Gaude antiphon.\" Suitable for learning span segmentation in contextually rich text. / Clear, coherent phrase representing a cultural reference with potential for learning context and idiomatic expressions in span segmentation. / Clear phrase with a title and description, suitable for learning span segmentation in text."}}
 {"raw": "84 SWORN BOOK OF HONORIUS intercedas pro me peccatore ad visionem Dei eterni me vivente habendam et succurras michi in omnibus angustiis et necessitatibus meis et ne dere- linquas me, (6) neque sim sine adiutorio in hac visione beata neque in illo tremendo die, cum exierit anima mea de corpore meo, aut in illa mirabili hora, cum rapta fuerit ad videndum me vivente Deum eternum: Postulo, graciosa, me ad portas paradisi facere venire, ut merear videre ibi filium tuum et merear habere leticiam sempiternam", "type": "natural", "id": {"id": "0fdde727-3059-490e-afa0-19e2b57d56f9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment is structurally clear with identifiable spans such as phrases and sentences, representing valuable patterns for learning span composition in a religious or historical context. It contains coherent content suitable for training purposes without code elements. / The text segment is a coherent, well-formed passage from an ancient document that contains clear sentence structures and phrases suitable for training on span segmentation in historical texts or religious documents. It has meaningful spans such as \"84 SWORN BOOK OF HONORIUS,\" which can be used to learn about the structure of similar passages. / The text segment is structurally clear with identifiable spans such as phrases and sentences, representing valuable patterns for learning span composition in a religious or poetic context. It lacks programming constructs but maintains coherence suitable for training purposes. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a religious context. / Clear prose with identifiable thematic spans; useful for learning sentence structure and composition in English text."}}
 {"raw": "142 SWORN BOOK OF HONORIUS promissionem toto corde desiderans et possidens in omnibus tam virtuti- bus quam puritatibus et viciorum absolucionibus precipue per hec sancta misteria videar et cognoscar adipisci et bene in ista arte perficiar penitus; laudabilis ac pro sancta visione mundus. LXV 104 ORACIO Hely, reverende, potens et dominans superioribus angelis et archangelis omnibusque celestibus creaturis [et] tam infernalibus quam terrestribus, de cuius magnificencia plenitudinis venit, (2) ut tibi a nobis", "type": "mixed", "id": {"id": "39e6a4cc-267e-4c67-89cd-1f33faeaae45"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains both Latin phrases and a mixture of religious or historical references, which can be segmented into meaningful spans for learning purposes; however, the presence of non-standard characters may pose challenges in processing. / The text contains both structured elements (like numerals and Latin phrases) that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context; however, it lacks coherence as an isolated segment due to its fragmentary nature. / The segment contains both Latin phrases and a mixture of prose, which may present challenges for span segmentation but offers diverse patterns useful in training an encoder that can handle complex text structures. / The text segment contains both Latin phrases and numerical references, indicating a mix of historical or religious content with potential coding-like structure (e.g., \"142 SWORN BOOK OF HONORIUS\"). It has clear spans that can be segmented into meaningful parts for training purposes. 
/ Contains both structured phrases and a mix of Latin text with numerical references, which can help the model learn span segmentation in historical or religious texts that combine language elements."}}
 {"raw": "cifigi, ut in ipso tua mors mortem nostram destrueret, exaudi clemens et benignus preces servi tui, (15) ut in virtute tua nobis sanctum angelum mit- tere digneris, qui hoc sacro misterio tuo ac signo nos confirmet, protegat ac defendat; a temptacionibus malis eripiat et in tuo sacro servicio perma- nere nos doceat, (16) quia tu es pius et misericors Deus, qui vivis et regnas Deus per omnia secula seculorum: Amen:\" CXXXIII Hoc dicto stans pedibus in medio circuli respiciens undique celum taliter invocabit:", "type": "mixed", "id": {"id": "a1509a9f-b96a-4727-8988-824ec128e01b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin text and instructions for physical actions, which can be segmented into meaningful spans representing both language structure (Latin phrases) and action sequences (physical movements). It is clean but lacks context that would typically aid in training span-aware models without additional metadata. / Clear structure with distinct spans for Latin text and a poetic description, representing valuable patterns in span segmentation across languages and formats. / The segment contains a mixture of Latin text and instructions for physical actions, which can be segmented into meaningful spans representing both language structure (Latin phrases) and action sequences (\"stans pedibus in medio circuli\"). It is clean but lacks context to fully understand the purpose. / The segment contains a mix of religious text and Latin phrases, with clear structured elements like verses (CXXXIII) that can be segmented into meaningful spans for training purposes. It is clean but may require domain-specific preprocessing due to its specialized vocabulary. 
/ Contains a mixture of Latin text and instructions for actions, with clear structure suitable for span segmentation; however, it lacks context which may affect training utility."}}
 {"raw": "The Lord: Perhaps a corruption for 'Solomon.'", "type": "natural", "id": {"id": "11b68a05-5006-4944-b0e0-d84564a26297"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with a potential span for 'Solomon' as an abbreviation or corruption, representing valuable patterns in language segmentation. / Clear sentence structure with a potential span for 'Solomon' as an abbreviation or corruption, representing valuable patterns in language understanding and text processing. / Clear sentence structure with a potential span for the name \"Solomon\" and an identified phrase indicating uncertainty (\"Perhaps\"). Well-formed text suitable as training data. / Clear sentence structure with a potential span for 'Solomon' as an entity, though the context is unclear and could be expanded upon for richer training data. / Clear sentence structure with a potential span for 'Solomon' as an abbreviation or corruption, representing valuable patterns in text segmentation and composition."}}
 {"raw": "Amen: LXVI XIA ORACIO Horlon, Deus, qui omnia numero, pondere et mensura fecisti, de cuius munere omne capud hominis desiderans elevabitur; in cuius ordine omnium momentorum sive dierum patens est et aperta dimencio, (2) qui eciam solus stellarum nomina numeras et nominas, menti mee constantem tribue visionis tue efficaciam, ut in huius artis cognicione et operacione te diligam et videam et tue pietatis munus agnoscam facialis visionis. Amen. LXVII 124 ORACIO Porrenthimon, mediator omnium operacionum et", "type": "mixed", "id": {"id": "56817a9f-d79c-49d4-a28a-5b634195fc5f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of liturgical text and numerical references, with clear structure for span segmentation; useful patterns present in religious texts combined with structured data elements like numbers. / Contains both structured phrases and numerical references, representing a blend of linguistic patterns suitable for span segmentation learning in X-Spanformer training data. / The segment contains a mixture of Latin phrases and numerical references, which may not be directly useful for training but could provide interesting patterns in span segmentation due to its unique structure. / The segment contains a mixture of Latin text and numerical references, which can be segmented into meaningful spans for training purposes; however, it lacks clear compositional value due to its specialized language structure. / The segment contains a mixture of Latin text and numerical references, which may help the model learn span segmentation in both structured (numerical) and unstructured contexts. However, its domain-specific nature might limit generalizability across diverse datasets."}}
 {"raw": "Sworn Book of Honorius 85 assist me in all my difficulties and needs, and do not abandon me,22 (6) so I may not be without help in this blessed vision nor on that terrible day when my soul departs from my body or in that wonderful hour; when it will be dragged away, for seeing the living eternal God. (7) I graciously ask you to make me come to the gates of Paradise, that I may be worthy to see your son there, and that I may be worthy to have the wondrous eternal divine vision, with your most glorious son", "type": "natural", "id": {"id": "db37f0ca-6e22-4d62-b4af-6141f058931b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans suitable for training a span-aware model on religious or poetic texts. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a non-tokenized context. / Clear, coherent prose with identifiable meaningful spans; well-suited for learning span composition in a tokenizer-free context. / The text segment is structurally clear with identifiable spans such as phrases and sentences, representing valuable patterns for learning span composition in a religious context; however, it lacks coherence due to its archaic language style which might not be representative of modern usage. / The segment is a coherent and structured piece of religious text, with clear phrases that can be segmented into meaningful spans for training purposes. It lacks technical jargon but offers rich linguistic patterns suitable for learning span segmentation in the context of sacred texts or prayers."}}
 {"raw": "(13) Then he should put the censer near the eastern sword. With head bowed, gazing at the cross of the sword, he should say this prayer:\n(14) PRAYER:", "type": "natural", "id": {"id": "525f060a-3a76-4edd-a3c6-5563ea66fe2e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prayer text with identifiable spans; clean and coherent for training purposes. / Contains structured elements like numbered verses and a prayer, representing valuable patterns for learning span segmentation in both textual content (natural language) and religious/cultural context references (code-like constructs). / Clear prayer format with identifiable spans; clean and coherent for training purposes. / Clear prayer format with identifiable spans; clean and coherent for training purposes. / Clear prayer structure with identifiable spans; clean and coherent for training purposes."}}
 {"raw": "28 SWORN BOOK OF HONORIUS Jan Veenstra has shown that the example found in London manuscript Sloane 313 (see appendix II) does not actually agree with che description in the text; but the drawing in SSM is much closer72 For example; a passage in LIH IV.I2 matches the drawing in SSM,but not Sloane 313.7 For this edition, Ihave reconstructed the Seal based on the elaborate description in the text, as well as Ganells drawing: According to Veenstra, this seal was also \"known independently as an amulet in", "type": "mixed", "id": {"id": "7cbc7043-6a92-4465-b804-05c880d5046c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of historical references, descriptions and specific examples (e.g., London manuscript Sloane 313) that can be segmented into meaningful spans for training purposes; however, it lacks clear delimiters between different types of content which might affect clarity slightly. / Contains a mix of historical references, descriptions and scholarly notes which can help in learning span segmentation for both textual content (natural language) and specific terms or phrases related to the subject matter (code-like structures). / The segment contains a mixture of historical references, descriptions and scholarly notes which can help in learning span segmentation for both textual content (natural language) as well as specific terms or phrases related to code-like structures such as manuscript numbers (\"Sloane 313\", \"SSM\"). / The segment contains a mixture of narrative prose and references to specific manuscripts, drawings, editions; clear spans can be identified for both textual descriptions (natural language) and technical citations or labels (\"Sloane 313\", \"SSM\"). 
/ Contains a mixture of historical references, descriptions and scholarly citations that can be segmented into meaningful spans for learning span composition in both textual context (natural language) and specific referencing to manuscripts/sources which is common in academic writing."}}
 {"raw": "(2) O glorious mother of God, O eternal virgin Mary, do not deem me unworthy because of my great wickedness and innumerable iniquities, but mercifully and favorably accepting that which I, although unworthy; offer and desire for your honor: (3) And so I wish to clearly name and exalt your holy names most conscientiously with my heart; with my mouth, and with my labor: So you are named Mary, Creator; Mother; Bride, Daughter; Theotan; Rod, Vessel, Balsamus, Cloud, Dew, (4) Peace Maker; the First, Queen, Dawn,", "type": "natural", "id": {"id": "9d4284d3-3ede-47e3-95e4-a2761481298d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with distinct phrases and terms; rich for learning span composition in religious texts. / Clear religious text with distinct phrases and terms that can be segmented into meaningful spans; well-formed for training purposes. / Clear and coherent prose with identifiable thematic spans; useful for learning span composition in religious texts. / Clear religious text with distinct phrases and terms that can be segmented into meaningful spans; clean, coherent content suitable for training a span-aware model in the context of spiritual or devotional literature. / Clear structure with religious phrases and titles; rich for learning span composition in devotional texts."}}
 {"raw": "Sworn Book of Honorius\n213\n(2) PREPARATION FOR THE INVOCATION. You must have the Sign of the Lord in your right hand, neatly composed and consecrated, and then fumigate the circle, doing and saying as I have said.", "type": "natural", "id": {"id": "e6bdb0ca-3283-4539-820f-9e3d9d700cce"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable phrases suitable for span segmentation; clean and coherent text representative of religious texts. / Clear prose structure with identifiable phrases suitable for training a span-aware model on religious texts. / Clear prose structure with identifiable phrases suitable for training on span segmentation in religious texts. / Clear, structured religious text with identifiable phrases and instructions for a ritual process. Suitable patterns exist in the language structure that can aid span segmentation learning. / Clear, structured text with identifiable phrases suitable for training a span-aware model in recognizing religious or historical texts."}}
 {"raw": "This might account for its use in protecting buildings. MS. Mich: 276, fol 13r describes its use similarly: A version found in La veritable magie noire\"8 is barely recognizable, testifying to a long and complex transmission.", "type": "natural", "id": {"id": "659808bf-1f33-4f49-9346-34b8378cbed1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span composition in English prose. / Clear prose with identifiable spans; useful for learning sentence-level span segmentation in NLP tasks. / Clear prose with identifiable phrases; useful for learning span segmentation in narrative text. / Clear prose with identifiable spans; useful for learning sentence structure and context in NLP tasks. / Clear prose with identifiable phrases and sentences suitable for span segmentation; represents valuable patterns in English text composition."}}
 {"raw": "86 SWORN BOOK OF HONORIUS Tuam deprecor sanctissimam misericordiam, ut per hec divina tua nomina, que ego nunc tibi plenus immundicia coram altari tuo de te pre- sumendo optuli, (8) ut in hac hora me audias et insaciabiliter digneris me facere videre atque laudare te et tuum filium gloriosum corpusculo meo vivente. (9) Teque interpello, gloriosa, per tuum filium quem concepisti, quem genuisti, quem peperisti, quem in carne lactasti, quem in balneo misisti, quem pannis involvisti, quem in templo presentasti,", "type": "natural", "id": {"id": "5239ad95-1764-4ac1-93a4-fed27047d39c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear structure with identifiable spans; coherent prose suitable for training a span-aware model on religious or historical texts. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in recognizing structured text segments. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear prose structure with identifiable phrases and sentences suitable for training a span-aware model in recognizing linguistic patterns."}}
 {"raw": "This is the correct name for the last angel of Saturn; but it has been accidentally omitted here because of the similarity with the Satquiel, the first angel of Jupiter: See CXV.44. SL 313 22r has this correction in the margin.", "type": "natural", "id": {"id": "51c1b698-2e55-4f90-997a-4674656a3621"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans of interest (names, references). Suitable for learning context and entity recognition in text. / Clear prose with identifiable phrases and references, suitable for learning span composition in a tokenizer-free context. / The text segment contains clear references to celestial bodies and a marginal note, which can be segmented into meaningful spans for learning contextually rich patterns in span segmentation tasks related to astronomy or literature documentation. / Clear sentence structure with identifiable spans; useful for learning context and relationships in text. / Clear prose with identifiable phrases and references, suitable for learning span segmentation in a non-code context."}}
 {"raw": "tremendum 14, per colendum 15, per reverendum 16, per piissimum 17, per ineffabile 18, per incommutabile 19, per sempiternum 20, (8) quatinus +ab omnibus mundi partibus unanimes et letantes hic iuxta circulum in forma N non nocentes alicui creature, non ledentes, non frementes, non furientes nec me sociosque meos vel aliquam creaturam terrentes, neminem offendentes set veniatis+ peticionibus meis consulti et providi statim obedire: (9) Et omnia precepta mea absque omni fallacia penitus adimplere per", "type": "mixed", "id": {"id": "e693c166-51a1-4569-b567-ddef59721dad"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and punctuation, which could help the model learn span segmentation in multilingual contexts or for classical languages processing. However, it lacks clear syntactic structures typical to natural language text that would be beneficial as standalone training data. / The segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, it lacks coherence as standalone text due to its fragmented nature. / Contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for learning span composition in both linguistic patterns (natural language) and structured text elements like lists or enumerations common to code-like structures. The segment is clean but may require additional context-specific preprocessing due to its historical nature. / Contains a mixture of Latin phrases and punctuation that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and formal notation context (code-like elements). / Contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, the content is highly specialized with limited generalization potential."}}
 {"raw": "[in] intellectum bonum construe ad perficiendum hec tanta tam[que] excellentissima misteria; 8 huius artis, sancte tue visionis et istorum sacramentorum perfectum consequar effectum. Amen. LXVIII 134 ORACIO [Ihelur], iudex omnipotens, Pater; qui notum nobis fecisti salutare tuum et in conspectu gencium revelasti iusticiam tuam, revela oculos meos et cor meum illustra salutari iusticia tua, ut mirabilia de tuis tam gloriosissimis videam [sacramentis], (2) quatinus per ea tantam in hac arte consequar", "type": "mixed", "id": {"id": "076b05cb-b4b1-4f46-8771-9564f007ddfb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and religious text, which have clear structured elements like sentences or verses that can be segmented into meaningful spans for learning purposes. Despite being archaic language with potential OCR errors (\"[in] intellectum bonum construe ad perficiendum hec tanta tamque excellentissima misteria; 8 huius artis,\" etc.), it retains structural clarity and compositional value suitable as mixed-type training data. / Contains a mix of Latin phrases and structured text that could help in learning span segmentation for historical or religious texts, though it may require domain-specific knowledge to fully understand the context. / The segment contains a mixture of Latin phrases and religious text, which may have structured elements like verses or prayers that can be segmented into meaningful spans for learning purposes. However, the language barrier might limit its utility unless paired with translations in more commonly used languages. / The segment contains a mixture of Latin phrases and religious text, which may not have clear syntactic structures for modern NLP models but could be valuable in learning historical or liturgical language patterns. However, the lack of contemporary context might limit its utility without additional annotations specifying span boundaries within this type of content. / Contains a mix of Latin phrases and religious text, with clear structured elements that can be segmented into meaningful spans for learning span composition in both language understanding contexts."}}
 {"raw": "Sworn Book of Honorius\n87\n(7) I beg for your most holy compassion, that through these divine names of yours, which I, though full of filth, have now dared to offer before your altar; (8) that you hear me in this hour; and deign to make me insatiably worthy to see and praise you and your glorious son, while my insignificant body is still living: (9)", "type": "natural", "id": {"id": "069e6f3b-177f-4638-94ab-64c987afb56e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose structure with identifiable meaningful spans; clean and coherent for training purposes. / Clear prose structure with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose structure; meaningful spans for training on religious or poetic text composition. / The text is coherent but lacks clear, identifiable spans for training a tokenizer-free span-aware model due to its poetic and archaic style; it doesn't contain structured patterns suitable as learning examples."}}
 {"raw": "Sl 3854 repeats \"et in ipso\" obviously by mistake. Not found in Sl 3853 or SSM.", "type": "natural", "id": {"id": "d9d47653-b6ca-45db-9a21-09e6077158a2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent sentence with identifiable spans; useful for learning context and error identification in text. / Clear sentence structure with identifiable phrases and a coherent statement about textual repetition, suitable for learning span segmentation in English text. / Clear sentence structure with identifiable phrases; useful for learning span segmentation in English text. / Clear, concise statement with identifiable spans; useful for learning context and negation patterns in text. / The text contains a mix of numerical references and phrases, with clear delimiters (\"Sl\", \"3854 repeats\", etc.) that can be segmented into meaningful spans for learning purposes. It is clean but lacks context or coherence as it appears to reference specific lines in documents without further explanation."}}
 {"raw": "And I address you, O glorious one, through your son, whom you conceived, whom you begat; whom you have borne, whose body you nursed, whom you bathed, whom you wrapped in cloths; whom you presented at the temple; (10) whose preaching you heard, whose suspension from the cross on our behalf you saw, whose death and burial you witnessed, whose rising from the dead you observed, whose ascension to the Father in heaven you saw, (11) and who will soon return from there to judge the living and the dead and the", "type": "natural", "id": {"id": "a6ae4d95-8451-4a2b-81f2-b08da66560e2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear biblical-style prose with identifiable phrases and structured segments suitable for learning span composition in a tokenizer-free context. / Clear biblical narrative with identifiable spans (e.g., \"whom you conceived,\" \"whose preaching you heard\"). Well-formed and coherent for training purposes, representing valuable patterns in span segmentation within a religious context. / Clear, coherent prose with identifiable thematic spans; well-suited for training a span-aware model on religious or philosophical texts. / Clear biblical narrative with identifiable spans; well-formed and coherent for training purposes. / Clear biblical narrative with identifiable spans; well-structured for learning patterns in span segmentation and composition."}}
 {"raw": "world by fire, likewise through him I dare to name [you] and beg for help, with impure lips, with impure flesh, with impure body, with impure mind, (12) that through this work you will enable me to look at and see yourself, and the holy Trinity, with your holy angels, and in the end at the Great Judgement you will snatch me away from eternal punishment; through Christ our Lord. Amen.", "type": "natural", "id": {"id": "c73d90a0-d4e1-40fb-9503-d8d795afac02"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and religious context suitable for training a span-aware model on complex sentence structures. / The text lacks clear, identifiable spans for meaningful segmentation; it's poetic and ambiguous without discernible patterns suitable for training a span-aware model. / Clear, coherent prose with identifiable phrases and thematic structure suitable for span segmentation training. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in recognizing religious text structure. / Clear, coherent prose with identifiable phrases and thematic structure suitable for training a span-aware model in recognizing religious text patterns."}}
 {"raw": "[dominaciones], potestates, principatus et virtutes, per cherubin et seraphin, per 24 seniores, per omnem miliciam celestis excercitus (3) adoro, invoco, flagito, vereor; glorifico et exalto nomen tuum sanctissimum; terribile et mitissimum et te queso, Domine, ut hodie cor meum Spiritus sancti lumine et gracia tue visitacionis fecundatum, clarificatum et caritate coroboratum illustres, tu, qui es trinus et unus. Amen. LXX 154 ORACIO Emanuel, adoro te, rex regum et Deus meus et substancia mea, salus et", "type": "mixed", "id": {"id": "504d6106-c3b1-4c03-803b-0432fc4c1828"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of religious phrases and Latin, with clear spiritual terms that can be segmented into meaningful spans for training purposes. The text is coherent but may require domain-specific knowledge to fully understand the context. / The segment contains a mix of religious phrases and Latin, with clear structure for span segmentation; however, it lacks coherence as standalone text due to its fragmented nature. / The segment contains a mix of religious phrases and Latin, with clear structures like \"adoro te\" (I adore you) that can be segmented into meaningful spans for training purposes; however, it lacks coherence as an isolated example due to its fragmented nature. / Contains a mix of religious phrases and Latin text, with clear spiritual references that can be segmented into meaningful spans for learning purposes. The structure is coherent despite the archaic language style. / The segment contains a mix of religious phrases and Latin text, which can be segmented into meaningful spans like \"dominaciones,\" \"principatus et virtutes,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) contexts involving code-like structures."}}
 {"raw": "30\nSWORN BOOK OF HONORIUS\nmanuscripts (Sloane 3854 133v and Sloane 313 241) both show North as occupying 90 degrees of the compass. Leip. Cod. Mag. 16 (pp. 98 and 112, circa 1750) shows the West as occupying 90 degrees. Other manuscripts show the circle divided into seven equal segments (SSM L.3.f29, Sloane 3853 fol. 150v).", "type": "mixed", "id": {"id": "5e021639-7151-4654-b148-a6f8554c20f1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear division into meaningful spans (manuscripts, compass directions) with coherent structure suitable for learning span composition in a tokenizer-free context. / The text segment contains a mix of numerical data, historical references (manuscripts), and compass directions which can be segmented into meaningful spans for training purposes; it is clean but lacks explicit span boundaries that would aid in learning precise segmentation rules. / Contains a mix of historical references and compass directions, with clear structured segments that can be segmented into meaningful spans for learning span composition in both textual descriptions (natural language) and numerical data related to directional degrees on the compass (code-like elements). / The text contains a mix of numerical data, references to manuscripts and compass directions which can be segmented into meaningful spans for learning span composition in both textual descriptions (natural language) and specific terms or phrases related to historical documents/codes. / Clear division into meaningful spans; well-formed prose with identifiable patterns in compass directions and manuscript references."}}
 {"raw": "to the circles here 214 SSM lists out the names, here spelled thus: Agla, monon tetragramathon, glydeus; ocleyste; gphymeton lamyara, Jnenues sadyon, hely, olon porrentymon, yelur; gofgamel, hemanuel, on, admyel, honzmorb, yoth gfeb resamarathon amethy eryona, yuestre, saday; meloth setthe, elscha, abbadya, alpha & , leyste, orystyon, yeremon, hospr mesquerpon elzephares, egyryon, pectha: 2I5 From the Sanctus prayer; adapted from Isaiah 6.3.", "type": "natural", "id": {"id": "00459227-9f66-481c-b80d-e925d592238c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text lacks clear, meaningful spans for training; it's a mix of names and phrases without discernible patterns or structure suitable for span segmentation learning. / The segment lacks clear, meaningful spans for training; it is incoherent and not representative of structured patterns needed for learning span segmentation in X-Spanformer. / The text lacks clear, structured elements suitable for meaningful span segmentation; it appears to be a random collection of words and phrases without coherent patterns or context. / The segment contains a mix of religious text and numerical references, lacking clear structure for meaningful span segmentation; it is not coherent or clean enough to serve as training data. / The text lacks clear, structured elements suitable for meaningful span segmentation; it's a mixture of names and phrases without discernible patterns or coherence."}}
 {"raw": "The correspondence of the planets with the elements and humors seems to be based ultimately on the Ptolomaic model of the universe, as found in chapter 4 of Tetrabiblos. This explains the influences of the planets through their associations with the humors, namely the productive qualities of warmth and moisture, and the reductive qualities of cold and dryness. The magic circle is required for calling spirits, and is described as providing the \"greatest fortification\" against the spirits. This seems to be", "type": "natural", "id": {"id": "210546fd-9288-42f2-a8c0-154121697d2f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning context and relationships in text. / The segment contains clear, structured sentences with identifiable phrases and concepts related to historical astronomy; however, it lacks explicit span boundaries suitable for direct training without additional context or annotations. / The segment contains a mixture of descriptive text and references to historical texts, which can help the model learn span segmentation for both prose (natural language) elements like \"The correspondence,\" as well as code-like constructs (\"Tetrabiblos chapter 4\"). Despite some grammatical issues in punctuation that could be improved upon cleaning up. / Clear prose with identifiable thematic spans; useful for learning span segmentation in narrative text. / The segment contains clear references to historical and philosophical concepts, which can be segmented into meaningful spans like \"Ptolomaic model,\" \"Tetrabiblos chapter 4,\" etc., providing good compositional value for learning span segmentation in a natural language context."}}
 {"raw": "mainly precautionary because certain of them could potentially respond with malice (CXXVII.10). In the manuscripts, the magic circle diagram is surrounded by descriptive passages, such as \"East, warm and moist, where the angels of the Sun dominate.\" These passages are probably not intended to be part of the magic circle itself. In Sloane 3854, the descriptions are outside the three concentric circles. In Sloane 3853, SSM, and some of the others, the circles extend outside the descriptive material, likely in", "type": "mixed", "id": {"id": "6ca8ad8d-8c4b-4214-901b-8b5b6e752f46"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning context and descriptive passages in literature or historical texts. / The segment contains a mixture of descriptive passages and references to manuscripts, which can help the model learn span segmentation in both narrative text (natural language) and structured descriptions related to diagrams or configurations that resemble code-like elements. However, it lacks clear delimiters for spans; thus some ambiguity remains but is still valuable training data. / Contains a mix of descriptive passages and references to manuscripts, which may help in learning span segmentation for both text descriptions (natural language) and specific citations or codes related to manuscript identifiers like \"Sloane 3854\". The structural clarity allows meaningful spans around phrases describing the magic circle. / The segment contains a mixture of descriptive passages and references to manuscripts, which can help the model learn how spans relate in both contexts (natural language descriptions with potential annotations or metadata). However, it lacks clear delimiters for span segmentation due to its narrative style; this could be improved by adding punctuation. / The segment contains a mix of descriptive text and references to manuscripts, which can help the model learn span segmentation in both contexts. However, it lacks clear delimiters for spans due to its narrative style; thus some ambiguity remains that could be addressed with additional context or examples during training."}}
 {"raw": "254\nSWORN BOOK OF HONORIUS\nEladeb eius ministri? Ubi est Maymon rex, ubi sunt Assaibi, Albunalich, Haibalidech, Yasfla eius ministri? (25) Ubi est Barthan rex, ubi sunt Taadas, Caudas, Yalcal eius ministri? Ubi est Formione rex, ubi sunt Guth, Maguth, Guthrin eius ministri? (26) Ubi est Iammax rex, ubi sunt Carmox, Ycanol, Pasfran eius ministri?", "type": "mixed", "id": {"id": "098e8a4e-0e51-4a9f-a56a-29c1995c2ecd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of structured data (names, titles) and unstructured narrative elements; it presents clear patterns for span segmentation in the context of historical or fictional records with identifiable entities like names and positions that can be useful training examples for an encoder focusing on spans. / The text segment contains a mixture of structured phrases and names that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (names as entities) and potential code-like structures with numbered references or identifiers. / The text segment contains a structured list with clear spans of phrases and words, representing valuable patterns for learning span composition in both coding (list-like structures) and historical or fictional contexts (\"natural\" language). It is clean but lacks context which may affect training utility slightly. / The text contains structured patterns with identifiable spans, such as names and phrases that can be segmented meaningfully for training purposes; however, it lacks clarity due to potential transcription errors or archaic language forms. 
/ The text segment contains structured queries with clear, identifiable spans of words and phrases that represent meaningful patterns for learning span segmentation in a tokenizer-free context; it is coherent but lacks contextual clarity due to its archaic language style."}}
 {"raw": "Another passage shows even more variations on these names:\nThe angels who have power over (the Moon) are Gabriel, Szamahel", "type": "natural", "id": {"id": "dd34c6e1-f3d7-4650-a981-ce3c3d914db9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with identifiable spans (names and titles). Suitable for learning span segmentation in a natural language context, though it lacks complexity or variety that could enhance training utility further. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with identifiable spans; useful for learning span segmentation in English prose. / Clear sentence structure with identifiable spans like names and phrases; clean, coherent text suitable for training a span-aware model in the context of language processing."}}
 {"raw": "This is reminiscent of de Abano, but unfortunately, the text does not identify the corresponding names.\nNames of God\nThe use of the names of God is central to the method of Honorius, as well as Solomonic and Jewish magic in general.", "type": "natural", "id": {"id": "1bccd5cb-cafc-4640-86da-2e2d92860ee2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases and concepts suitable for span segmentation; coherent text representing the target domain of religious or magical texts. / Clear, coherent prose with identifiable meaningful spans; represents valuable patterns for learning span composition in a language context. / Clear sentence structure with identifiable phrases and terms relevant to the topic, suitable for learning span segmentation in a natural language context. / Clear, coherent prose with identifiable phrases and concepts suitable for learning span composition in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and concepts suitable for training a span-aware model in the context of religious texts or magical practices."}}
 {"raw": "Ubi est Sarabocres rex, ubi sunt Nassar; Cynassa eius ministri? (27) < Ubi est harthan rex: Ubi sunt bileth: mylalu: abucaba eius ministri.", "type": "mixed", "id": {"id": "83a6b3c7-dcc9-4131-9c4d-d6c06b2bacec"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured phrases and a mixture of recognizable words, though some are unclear or non-standard; useful for learning diverse span compositions in multilingual contexts. / Contains both structured phrases and unstructured text, with identifiable spans in a mix of languages or codes. / Contains a mix of Latin phrases and potential Hebrew script, indicating both linguistic elements that could be useful for span segmentation in multilingual contexts. However, the clarity is compromised due to possible transcription errors or unclear context (\"mylalu\" may not correspond directly with known words). / Contains both structured phrases and a mix of words that could be useful for learning span segmentation in diverse contexts. / Contains both structured phrases and a mix of words that could represent meaningful spans in context, though the unusual combination may require additional contextual understanding for effective learning."}}
 {"raw": "The correspondence of the planets and elements is also recounted in the eleventh-century Arabic magic text now generally known as Picatrix; see 1.4, Greer and Warnock 2011, p. 31.", "type": "natural", "id": {"id": "020a3b35-5e69-4de7-a7aa-c33e52c7e843"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable spans; useful for learning context and composition in NLP tasks. / Clear prose with potential for identifying thematic spans; however, lacks direct compositional patterns suitable solely as training data. Could benefit from additional context or examples to improve learning outcomes. / Clear prose with identifiable phrases and references; useful for learning span segmentation in historical texts. / Clear prose with identifiable phrases; useful for learning span composition in historical texts. / Clear prose structure with identifiable spans; useful for learning sentence-level span segmentation in a historical context."}}
 {"raw": "Introduction 21 unlikely that Honorius was informed by Alfonso's Greek and Arabic translation activities. We can also see from the above that Honorius is sometimes closer to the Byzantine text than are the other texts. For example, H in Harthan, tth in Bileth, and 'ou in Abouzaba and Misabou reflect the Greek spellings more closely than the other texts. Incorporation of Byzantine material would also explain the puzzling variety of spellings by Honorius, for example, the seemingly arbitrary insertion of the", "type": "natural", "id": {"id": "cce2782e-5031-4b81-aa46-4c17cb21ced3"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear linguistic structures and patterns, such as Greek spellings in a Latin context; however, it lacks coherence for training purposes due to fragmented sentences (\"Introduction\", \"unlikely that Honorius was informed by Alfonso's...\"). / Clear prose structure with identifiable spans; useful for learning span segmentation in English text. / The text segment contains clear linguistic patterns and historical references that can be segmented into meaningful spans, such as names of texts (Harthan, Bileth) or Greek spellings ('ou in Abouzaba). It is coherent but lacks context for deeper learning. / The segment contains clear linguistic patterns and references to historical texts, which can help the model learn span segmentation in a scholarly context; however, it lacks coherence due to fragmented sentences that may confuse training processes. / The segment contains clear linguistic patterns and historical references that can be segmented into meaningful spans, such as names of texts (Harthan, Bileth) or Greek spellings ('ou in Abouzaba). It is clean but lacks coherence due to fragmented sentences."}}
 {"raw": "Introduction 31 The text also uses the Hebrew term Shem ha-Meforash, probably meaning something like \"explicit name\" of God. Honorius uses the term to refer to the seventy-two letter name of God given in several places in the text, including the border of the Seal of God. Although the use of this term might lead us to suppose a Jewish connection, the name itself turns out to be derived from the initial letters of seventy-two names of God, a subset of the longer list. They are as follows: Seventy-two names", "type": "mixed", "id": {"id": "be02d7bc-b0d0-4fa1-827a-2706d099fb4e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Hebrew terms, references to religious texts (natural language), and structured lists with numerical data that can help the model learn span segmentation in both contexts. / Clear structure with identifiable spans (e.g., \"The seventy-two letter name of God,\" \"Seventy-two names\"). Well-formed and coherent, representing valuable patterns for learning span composition in religious or historical texts. / The segment contains a mixture of Hebrew terms, references to religious texts (natural language), and structured lists that can be segmented into meaningful spans for learning purposes. It is clean but lacks context which may affect training utility slightly. / The segment contains a mix of Hebrew terms and references to religious texts, which can be segmented into meaningful spans such as \"Shem ha-Meforash,\" \"seventy-two letter name of God,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) tasks related to code-mixed text. / The segment contains a mixture of Hebrew terms and references to religious texts, which can be segmented into meaningful spans like \"Shem ha-Meforash,\" \"seventy-two letter name of God,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) tasks related to code-mixed text."}}
 {"raw": "In other cases, Honorius differs where Petrus and the Greek agree. Honorius generally includes more names than the others. The ruling angel of Mercury is also interesting: In Petrus (similarly Cardanus and Michael Scotus), the ruling archangel is identified as Raphael, while Liber Semiphoras and Agrippa identify him as Michael.42 Hence Honorius is closer to Liber Semiphoras. In the great exorcism in the Byzantine manual only seals of four archangels are preserved, owing to damage at the bottom of folio", "type": "natural", "id": {"id": "d8232c70-1e81-4ea1-9970-6a04799ff24b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases and concepts; useful for learning span segmentation in text. / Clear prose structure with identifiable spans of names and phrases; well-suited for learning span composition in a tokenizer-free context. / Clear prose with identifiable spans; useful for learning sentence structure and context in NLP tasks. / Clear prose structure with identifiable spans; useful for learning sentence segmentation and context understanding in NLP tasks. / Clear prose with identifiable phrases and terms related to historical texts; useful for learning span segmentation in non-code contexts."}}
 {"raw": "Ubi est Sarabocres rex, ubi sunt Nassa[r], Cynassa eius ministri? (31) Ubi est Harthan rex, ubi sunt Bileth, Mylalu, Abucaba eius ministri?", "type": "mixed", "id": {"id": "688bc8da-6098-4035-a387-18929f28877c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and names, which may have clear linguistic patterns for span segmentation; however, the lack of context makes it less ideal as training data. / The segment contains a mixture of Latin phrases and names, which can be segmented into meaningful spans for learning purposes; it is clean but lacks context or coherence that would make the training more effective. / Contains a mix of Latin phrases and names, with clear structure for span segmentation; represents valuable patterns in historical or literary text analysis. / Contains a mixture of Latin phrases with potential for span segmentation; however, lacks coherent structure and context necessary to form meaningful patterns for training purposes. / Contains a mix of Latin phrases and names, which may represent meaningful spans for learning span segmentation in both linguistic patterns (natural language) and structured elements like proper nouns or titles that resemble coding constructs. The text is coherent but lacks context to fully assess its utility as training data; however, it could still provide valuable insights into the structure when paired with additional information."}}
 {"raw": "Ioz. These four are similar to those in Heptameron, which has all seven: It has generally been assumed that the seals of the archangels found in the Royal manuscript are an insertion directly from Heptameron, but perhaps they - were actually pres- ent in the original Sworn Book after all.", "type": "natural", "id": {"id": "1faf1f7d-374c-4cb1-bed3-1a081ce5e660"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning context and relationships in text. / Clear sentence structure with identifiable spans; useful for learning context and relationships in text. / Clear prose with identifiable spans; however, the presence of archaic language and potential OCR errors may affect training utility. / Clear sentence structure with identifiable spans; useful for learning context and span segmentation in prose. / Clear sentence structure with identifiable spans; useful for learning context and phrase segmentation in NLP tasks."}}
 {"raw": "Summum Dei tremendum ac reverendum iudi- cium vos apparere et michi in omnibus obedire constringat\"\n(32) ExCITACIO DIEI SABATI ET SPIRITUUM EIUS, QUANDO PARTICULARITER FIT:\nTunc in septemtrione dicat: \"Ubi est Maymon rex, ubi sunt Assaibi, Albu- nalich, Haibalidech, Iasfla eius ministri? (33) Ubi est Barthan rex, ubi sunt Taadas, Caudas, Yalcal eius ministri? Ubi est Formione rex, ubi sunt Guth, Maguth, Guthryn eius ministri? (34) Ubi est Iammax rex, ubi sunt Car- mox, Ycanol, Pasfran eius ministri?", "type": "mixed", "id": {"id": "2a2261dd-d922-48ba-bba6-25c7f594beb4"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and references to biblical figures, which could provide diverse span patterns for training purposes in both language understanding (natural) and historical/cultural context recognition (code). However, the text is not coherent as it appears fragmented or extracted from larger passages. / Contains a mixture of structured phrases and names, representing both linguistic patterns (natural language) with potential for learning span segmentation in historical or religious texts. / Contains a mixture of Latin phrases and structured lists, with clear segmentation opportunities for spans representing words or groups thereof; well-formed text suitable as training data. / Contains structured phrases with clear boundaries, representing a mix of historical or religious text and potential coding-like syntax for span segmentation learning. / The segment contains a mixture of Latin phrases and references to biblical figures, which can be segmented into meaningful spans for training purposes in both language understanding (natural) and historical/cultural context recognition (code). It is clean but may require domain-specific knowledge due to its archaic nature."}}
 {"raw": "voluisti, (12) ut per misterium sacratissimi corporis tui flumina omnium aquarum tua benediccione crescerent; ut sicut per eas exteriora lavantur; (13) ita inte- riora per ipsas et Spiritum sanctum lavarentur peccata testante propheta David et dicente: 'Asparges me, Domine, Ysopo, et mundabor: Lavabis me, et super nivem dealbabor' (14) qui beatis apostolis tuis in huius sacri com- memoracione misterii pedes eorum lavasti; qui nobis ad saturacionem", "type": "mixed", "id": {"id": "d4d60018-ce5f-4f43-a920-19a88f4b99c9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of biblical text and Latin phrases, with clear verse structure suitable for span segmentation; represents valuable patterns in religious texts. / The segment contains a mixture of Latin phrases and references to religious texts, which can be segmented into meaningful spans for training purposes; however, the presence of numbers may introduce ambiguity in span identification. / Contains both religious text and references to biblical verses, which can be segmented into meaningful spans for a span-aware model; however, it lacks clear compositional patterns due to its poetic structure. / The segment contains a mixture of Latin phrases and references to religious texts, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns due to its specialized content. / Contains a mix of religious text and references to biblical verses, which can be segmented into meaningful spans like phrases or sentences; however, the presence of verse numbers may confuse span segmentation for certain models. The content is clean but might require additional preprocessing due to its specific domain (religious scripture)."}}
 {"raw": "manuscripts, only one (Sloane 3854) appears to contain the complete text.", "type": "natural", "id": {"id": "a62f5c0a-3f0d-42a2-a35d-98e2427baf83"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent sentence with a single meaningful span; represents good pattern for training in identifying complete sentences and related phrases. / Clear sentence structure with a potential span for \"manuscripts\" and another for \"(Sloane 3854).\" The segment is clean, coherent, but lacks complexity that could be beneficial in training data. / Clear sentence structure with identifiable spans (manuscripts, Sloane 3854). Well-suited for learning span segmentation in a tokenizer-free context. / Clear, coherent sentence with a single meaningful span related to manuscripts and Sloane number; suitable for learning context in NLP tasks. / Clear sentence structure with a discernible subject (\"manuscripts\"), predicate (\"only one appears to contain the complete text.\"), and object (\"Sloane 3854\"). The segment is clean, coherent for training purposes; it represents valuable patterns in span composition."}}
 {"raw": "For examples from Jewish magic see Bohak 2008 Pp. 315, 318, and Casanowicz 1976 p. 162. Barachiel (also spelled Baruchiel; Barakiel, or Baraqiel) is one of the seven archangels in orthodox tradition (along with Michael, Gabriel, Raphael, Uriel, Salathiel Jegudiel)", "type": "natural", "id": {"id": "4ee4d7bc-8d55-448c-aabc-123bfc03bae1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear references to literature with structured citations, useful for learning span segmentation in academic texts. / Clear references to literature with identifiable spans (author names, titles). Well-formed for training purposes and represents valuable patterns in span segmentation of academic citations. / Clear references to literature with identifiable spans (authors, titles). Well-formed and coherent for training purposes in identifying span segments related to citations or bibliographic entries. / Clear text with references and names that can be segmented into meaningful spans; represents valuable patterns for learning span composition in the context of religious texts. / Clear references and citations, though not ideal for span segmentation due to lack of explicit spans. Suitable as a starting point with minor adjustments needed."}}
 {"raw": "256\nSWORN BOOK OF HONORIUS\nAbucaba eius ministri? (35) Ubi est Abaa rex, ubi sunt Hyici, Quyron, Zach, Eladeb ministri eius?\" Hic debet claudere manum et eis pugnum clausum ostendere cum sigillis. Tunc dicat: (36) \"Virtus istorum sanctorum nominum Dei et sigil- lorum vestrorum vos convincat, que vos congregare, venire, apparere, respondere et michi in omnibus obedire constringant\" (37)", "type": "mixed", "id": {"id": "34da5d1b-7a55-4d72-b75d-e7823d9847b6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and numerical references, which can be segmented into meaningful spans for training purposes; however, it lacks context or clarity in modern language understanding. / The segment contains a mix of numerical values, Latin phrases (which may be relevant to historical or legal contexts), and structured text that could help the model learn span segmentation in both natural language processing tasks related to code-like structures as well as linguistic patterns typical for ancient texts. / Contains a mixture of legal text and Latin phrases with clear structure for span segmentation; however, it may require domain-specific knowledge to fully understand the context. / The segment contains a mixture of numerical values, Latin phrases (potentially representing historical or religious text), and structured formatting that could be beneficial for learning span segmentation in both language processing tasks involving code-like structures as well as natural languages with embedded codes. / Contains a mixture of numerical references, Latin phrases indicative of legal or religious texts (potentially historical documents), and structured formatting that can be segmented into meaningful spans for training purposes in both natural language understanding and code-like structures."}}
 {"raw": "(2) Tunc intres circulum per partem inter frigicap et occidentem pro meta positam, et tunc socii stantes pedibus in circulo stent; donec recluse- ris circulos dicens 18. (3) Tunc situa socios et enses in circulo tali modo, set antequam intraverunt; 7 predicta nomina deleantur; quia non possent ali- ter apparere. (4) Tunc versus quamlibet parcium unus ponatur gladius, et debent in altitudine adequari: Tunc, si solus fueris, versus orientem primo invocabis. (5) Si autem duo,55 secundus sedeat versus partem", "type": "mixed", "id": {"id": "5d03ab3e-1bab-4466-a4d5-eedf625b772a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of structured phrases and archaic language, with clear boundaries for potential span segmentation like \"circulum per partem inter frigicap et occidentem\" which could be useful in learning complex text structures. / Contains structured patterns with clear segmentation opportunities between Latin phrases and instructions, useful for learning span composition in a multilingual context. / Contains a mix of structured phrases and instructions, with clear separations for meaningful spans like \"circulus per partem inter frigicap et occidentem\" which can be useful in learning span segmentation patterns across natural language descriptions intertwined with code-like syntax. / The segment contains a mix of Latin phrases and instructions that can be segmented into meaningful spans, such as \"circulus per partem inter frigicap et occidentem pro meta positam\" (a circle through the middle between Frigicap and Occident for an appointed place) which shows clear structural patterns. It is clean but may require domain-specific knowledge to fully understand its context in training a span-aware model like X-Spanformer, especially since it combines natural language with potential code-like instructions or annotations typical of ancient texts that might be used as mixed content examples. / The segment contains a mixture of Latin phrases and structured instructions, which can be segmented into meaningful spans for learning span composition in both linguistic patterns (natural language) and syntactic structures typical to code-like constructs."}}
 {"raw": "Analyzing these differences, researchers have found evidence that the better-known group of manuscripts has been redacted at some point", "type": "natural", "id": {"id": "ab2a6546-cf22-470e-8d20-15e04435485e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment is coherent, clean and contains clear sentence structure suitable for training a span-aware model in the context of analyzing textual patterns related to manuscript analysis. / Clear, coherent prose with identifiable phrases suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases suitable for span segmentation; no complex structures like nested clauses or technical jargon that could confuse the model. / Clear, coherent sentence with identifiable spans; useful for learning context and span composition in NLP tasks. / Clear, coherent prose with identifiable spans for training; no extraneous elements."}}
 {"raw": "With the availability ofbetter photographs ofthe manuscript; Ican propose two minor corrections to their transcript: 83 = Christus. 84 JV: Mamyas 85 Skemer 2006 p. 1zo notes the use of this name in thirteen century manuscript Sloane I717.", "type": "natural", "id": {"id": "7aa6d834-1574-4a3f-9e85-e3208d8a6474"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with identifiable spans like \"manuscript,\" \"corrections,\" and references to specific pages or sources, suitable for training a span-aware model on English prose. / Clear sentence structure with identifiable spans; useful for learning phrase segmentation in a purely textual context. / Clear structure with identifiable spans (manuscript references, page numbers). Well-suited for learning span segmentation in historical texts or documentation. / Clear and coherent prose with identifiable spans for training, such as author names, publication details, manuscript references; well-formed content suitable for learning span composition in a tokenizer-free context."}}
 {"raw": "These may be mistakes, evidence of the com- piler' S unfamiliaricy with Hebrew, a possibilicy suggested by both Mesler and Veenstra.53 However this seems unlikely to me given the fact that the mis- take recurs. Moreover; the form also occurs in other texts, such as Vincu- lum Salomonis, a text which does not appear to derive from Honorius.54 The substitution of Deleth could also have been intentional to avoid writing the\nSI Leipzig Cod. Mag: I6.", "type": "natural", "id": {"id": "70f385b2-fab4-4f97-b2ba-3f1d9ebb5a27"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, such as \"These may be mistakes,\" or \"the substitution of Deleth could also have been intentional.\" It is clean for training purposes but lacks explicit code constructs to qualify it strictly under the 'code' type. / Clear sentences with identifiable phrases and potential recurring patterns for training in span segmentation; well-formed text suitable as a learning example. / Clear sentence structure with identifiable spans; useful for learning context and compositional patterns in text. / Clear sentences with identifiable phrases and potential for learning span segmentation in a scholarly context. / Clear sentence structure with identifiable spans; useful for learning context and phrase segmentation in English text."}}
 {"raw": "32 SWORN BOOK OF HONORIUS SToexoraba SLayqtiyst JAlgaonosu SLaryceksp JFyomemana SRenugarel Atedatono JNaoyleyot The result is not only an abbreviation, but in fact an acronym, pronounce- able in its own right: This may be the motive for rearranging the list as found in CIZ In addition to the Seal of God; this 72-letter name is used on the bed of ashes, and according- to Ganell, should be written on the swords as well. An additional list of seven divine names are used in the Seal of God (IV14-47, CXXVIIIs):", "type": "natural", "id": {"id": "7f7bfdc1-0deb-45df-9be0-acc1862a5a4f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains clear, meaningful spans of text that represent a structured list and explanation suitable for learning span composition in the context of religious or historical texts. It is clean but lacks contextual clarity due to its specialized content type (historical references). / Clear and coherent prose with identifiable meaningful spans (e.g., names, phrases). Well-suited for learning span composition in a tokenizer-free context. / The text lacks clear, identifiable spans for meaningful segmentation; it is incoherent and not representative of structured patterns suitable for training a span-aware model. / The text lacks clear, meaningful spans for training; it's a mix of unrelated phrases and terms without coherent structure or compositional value suitable as an example. / While the text contains structured elements like lists and phrases, it lacks clear span segmentation patterns suitable for training a tokenizer-free model; it's more of an excerpt with complex sentences rather than clean examples."}}
 {"raw": "226 Is this an indication that the biblical quote has been misunderstood?", "type": "natural", "id": {"id": "462ef8dc-eeca-47fb-ad3b-e12f0350c783"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent question with a clear indication of misunderstanding in context; suitable for learning span composition related to discourse and interpretation. / The segment lacks context and clarity, making it difficult to identify meaningful spans for training purposes. It is too vague as a standalone sentence without surrounding text or explanation of the biblical quote in question. / Clear question format with a potential for learning span segmentation between numbers, words and punctuation. / The segment contains a clear question with potential for span segmentation into \"Is this an indication\" and the rest of the sentence, which is coherent in its context as part of religious or philosophical discourse. / Clear, coherent question with a potential span for \"biblical quote\" and an interrogative structure suitable for learning context understanding in NLP tasks."}}
 {"raw": "Magic Occultism Esoteric History The Sworn Book of HONORIUS Liber Iuratus Honorii Tbhe _ Sworn Book of Honorius is one ofthe oldest and most influential texts on medieval magic (dating to the fourteenth-century). It is here translated for the first time since partial translation was done in the sixteenth-century This grimoire represents complete system of magic: including how to attain the divine vision, communicate with holy angels, and control aerial, earthly; and infernal spirits for practical gain: It", "type": "natural", "id": {"id": "52c892dc-eaa0-4b5a-bf7d-3c2422e74754"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; useful for learning span segmentation in historical texts. / Clear, coherent prose with identifiable spans of meaningful content related to historical and esoteric topics; well-suited for training a span-aware model on complex sentence structures in English text. / Clear, coherent prose with identifiable spans of meaningful content; well-suited for learning span composition in a non-code context. / Clear, coherent prose with identifiable spans; rich historical context for learning span composition in a non-code domain. / Clear prose structure with identifiable spans; useful for learning sentence segmentation and thematic patterns in historical texts."}}
 {"raw": "Rabur: (9) When this has been done; he should eat, and after eating, he should begin the work as follows: He should cake clear cold spring water; and wash himself, saying: (Io) c( 0 Lord, holy Father; almighty and eternal God, whose spirit moved over the waters before the creation of the world; who in the creation of the world, made water blessed with the other elements, (I1) who gave water to the thirsty people for their nourishment, and in order that they could wash their outward stains clean with it, who", "type": "natural", "id": {"id": "6b8c3964-7438-4f83-9528-1b0f37c0ad69"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure of religious text with identifiable phrases and meaningful spans; clean, coherent content representative for training purposes. / Clear structure of religious text with identifiable phrases and thematic elements suitable for learning span segmentation in a tokenizer-free context. / Clear narrative structure with identifiable spans like phrases and sentences; well-suited for learning span composition in a religious or poetic context. / Clear structure of religious text with identifiable phrases and thematic segments suitable for learning span composition in a tokenizer-free context. / Clear structure of religious text with identifiable phrases and thematic elements suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "was key text used by John Dee who owned two ofthemost important manuscripts and isaknown influence on his Enochian magic and its modern derivatives: Although largely ignored by historians until recently; this text is an important witness to the transmission of Kabbalah and Jewish mysticism to Christians.", "type": "natural", "id": {"id": "cce481fa-91ff-4d76-8a6b-ebdd4880977f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning context and composition in English text. / Clear prose with identifiable phrases and concepts suitable for learning span segmentation in a non-code context. / Clear prose with identifiable phrases and historical context, though less structured for direct span segmentation. Suitable as a training example to understand complex sentence structures in English text. / The segment contains clear, meaningful spans of text related to historical and mystical content; it is coherent for learning patterns in span segmentation within the context of religious studies or cultural history. / Clear prose with identifiable spans; useful for learning sentence structure and thematic elements in English text."}}
 {"raw": "wished to be baptized in the river Jordan by John the Baptist; (12) in order that through the mystery ofyour most sacred body the rivers of all waters will increase with your blessing, so that; even as we are washed on the outside by it, and by the Holy Spirit, (13) so too will we be washed inside and cleansed of our sins, as the prophet David testified when he said: 'Sprinkle me, 0 Lord, with hyssop, and I will be clean; wash me, and [ will be whiter than snow; (14) which in remembrance ofthis sacred", "type": "natural", "id": {"id": "5b271651-8fdc-48dd-a2d3-a7577b8a40b0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains clear biblical references and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in religious texts or similar contexts. It is clean but lacks explicit coding elements to classify as code content type. / Clear biblical narrative with identifiable spans; well-formed prose suitable for learning patterns in span segmentation and composition. / Clear biblical narrative with identifiable phrases and sentences; well-suited for learning span composition in religious texts. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear biblical narrative with identifiable spans; well-formed and coherent for training purposes."}}
 {"raw": "Mag, 16 p. II3: Lagaly, Vellim, Narach, Lyaeh, Yalgal, Librare, Librares 90 45v: 'scribantur ista septem nomina dei hay + byalg + vehem + yasgal + Narath + libaree+ et ponantur in septem mundi partibus iuxta circulum, et cum operari volueris, remove; quia in istorum presentia nullus spiritus potest operari, et ideo deleantur 91 Greenfield 1995 Pp. 134-135.", "type": "mixed", "id": {"id": "fed624da-8959-4a39-9d10-496ef43517d3"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases, names and numerals that can be segmented into meaningful spans for training purposes; however, it lacks clear contextual coherence due to its fragmented nature. / The segment contains a mixture of Latin phrases and references to scholarly work, but lacks clear structure for meaningful span segmentation; it's not coherent or clean enough as training data. / Contains a mixture of Latin phrases and references, with clear structured elements like names that can be segmented into meaningful spans for training purposes. However, the text is not entirely coherent or clean due to its archaic language style which may pose challenges in learning span composition effectively. / The text contains a mix of Latin phrases and modern English, making it difficult to identify clear spans for training purposes without further context or preprocessing. Additionally, the presence of archaic language may not be representative enough across different domains needed by X-Spanformer. / The segment contains a mixture of Latin phrases and references to various entities, which can be segmented into meaningful spans for learning span composition in both linguistic context (Latin) and referential structure (\"Mag\", \"Vellim\"). Despite being somewhat archaic or obscure without additional background knowledge on the subject matter."}}
 {"raw": "18:20. 78 Matt: 18:19. This concludes the First Purification; the instructions for the second are found in chapter XCVIII", "type": "natural", "id": {"id": "1cf85da6-3711-40e0-b0bd-b05682d0932f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with a clear beginning, middle (main clause), and end; spans can be identified as individual words or phrases for training purposes. / Clear sentence structure with a temporal reference and chapter citation, suitable for learning span segmentation in historical or religious texts. / Clear sentence structure with a timestamp, speaker identification and reference to another chapter; suitable for learning temporal context in span segmentation. / Clear textual structure with identifiable spans such as chapter references and speaker annotations, suitable for learning span segmentation in a tokenizer-free context. / Clear textual structure with identifiable spans; useful for learning sentence segmentation in a historical document context."}}
 {"raw": "Sworn Book of Honorius 277 son ofthe exorcist in the midst ofthe exorcism, who has been well fortifiedby God, undaunted, prepared with powers, who has powerfully called you, and calls you with exorcising: (53) Come therefore with all haste, 0 Aye, Samye, come without delay: Through the eternal names of the living and true God, 98, 99, and by this most holy work, and by the Holy Seal, (54) which com- mands power over YOU, and by the virtue of the heavenly spirits, and by the person of the exorcist who is", "type": "mixed", "id": {"id": "b03bcc52-2cc6-4e5f-86af-99f225a0f89c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mix of religious language and formal structure, which can be segmented into meaningful spans for training purposes; however, it lacks clarity in modern context making its utility limited. / Contains a mix of religious text and potential incantation-like phrases, with clear structure for span segmentation; however archaic language may pose challenges but is coherent enough to be useful in training. / The segment contains a mix of religious text and Latin phrases, which can be segmented into meaningful spans for training purposes; however, it may require additional context or preprocessing to fully leverage its compositional value in learning span segmentation patterns. / Contains both structured language and potential religious/spiritual terminology that could be useful for span segmentation in a tokenizer-free model, though it may require context-specific understanding due to its unique vocabulary. / The text segment contains a mixture of religious language and possibly liturgical phrases, which can be segmented into meaningful spans for training purposes; however, the archaic style may pose challenges in generalization to modern contexts."}}
 {"raw": "predictum latus et crucem secundi anguli eiusdem (19) Deinde in latere illo, quod tendit ab angulo primo eiusdem secundi eptagoni ad tercium angulum eiusdem, scribatur hoc nomen sanctum Dei: \"Narath; (20) ita quod hec sillaba: na scribatur in illo loco eiusdem lat- eris, qui est supra primam sillabam de \"Satquiel; (21) et hec sillaba: 'ra\" in illo loco, qui est supra ultimam eiusdem, et hec due litere: \"t; \"h\" in illo loco, qui est in eodem latere inter latus intersecans ipsum et crucem terciam_ (22) Deinde", "type": "mixed", "id": {"id": "8568b992-bf3f-4efb-8540-c343cbb2db62"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and structured notations that can be segmented into meaningful spans, useful for learning span composition in both linguistic patterns (natural language) and formal structures (code-like). However, the text is fragmented which may affect coherence slightly but retains clear structural elements. / Contains a mix of Latin phrases and structured text that can be segmented into meaningful spans, reflecting patterns useful for learning span composition in both linguistic structures (natural language) and formal notation systems like those found in programming or markup languages. / Contains a mixture of Latin phrases and structured text, which can help in learning span segmentation for both linguistic patterns (natural language) and formal constructs (code-like structure). However, the lack of clear delimiters makes it less ideal than fully segmented examples. / Contains a mix of Latin phrases and structured text, with identifiable spans for training purposes. However, the presence of numbers may confuse tokenization models not designed to handle numerical data within textual context. / The segment contains a mixture of Latin phrases and references to geometric terms, but lacks clear structure for meaningful span segmentation; it is not coherent or clean enough as training data."}}
 {"raw": "in illo latere eiusdem secundi eptagoni quod tendit a ter- cio angulo eiusdem ad quintum eiusdem scribatur hoc creatoris nomen sanctum, quod dicitur \"Libarre\" (23) ita quod hec sillaba: \"ly\" scribatur supra primam sillabam de \"Raphael\" et hec sillaba: \"bar\" supra ultimam sillabam eiusdem (24) et hec sillaba (( 're\" in illo loco eiusdem lateris, qui est inter latus intersecans ipsum et quintum angulum eiusdem secundi eptagoni. (25) Deinde in illo latere eiusdem secundi eptagoni, quod est a quinta cruce usque", "type": "mixed", "id": {"id": "ca4edef1-80e4-4205-a17e-a01198e17cf0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of structured notations (possibly from diagrams or schematics) and textual descriptions, which can help in learning span segmentation for both types of content. However, the clarity is somewhat compromised by potential transcription errors (\"illo\" instead of \"illore\", etc.). / The text segment mixes Latin phrases with instructions for writing, lacking clear and consistent patterns suitable as training data. It contains both structured elements (like coordinates) but also arbitrary notations that don't form coherent spans in a learning context. / Contains a mix of structured elements (e.g., references to angles and lines) that can be segmented into meaningful spans, though the context is unclear without domain knowledge. / Contains a mixture of Latin phrases and descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and specific formatting or notation used within the text segment. / The text contains a mix of Latin phrases and instructions that are not well-defined for meaningful span segmentation, lacking clarity in structure and compositional value necessary to train an X-Spanformer effectively."}}
 {"raw": "guy phasamat: samar: saleht: salym. salmeht: (4) sameht: saloht: sillezaleht: sadayne: neothatic: neodamy hadozamyr: zozena. belymoht hazat: helyhot: XXXVI: PRAYER.F 49 The truth, the light,; the way; and the life of all creatures, 0 just God, vivify me; visit me and strengthen my comprehension, and my soul, and renew my conscience, and glorify and purge it, (2) as [you promised] John and Paul, when you snatched them away to show them the vision of you, so too, Lord, while my body is yet living, may I be", "type": "natural", "id": {"id": "07b0375b-f13b-4a95-8db4-3940a7bc07d8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text lacks clear, identifiable spans for meaningful segmentation and is not coherent enough to represent valuable patterns in span composition; it appears more like a transcription error or corrupted data. / The text lacks clear, consistent structure and meaningful spans for training purposes; it appears to be a random collection of words without discernible patterns or coherence in the context of span segmentation learning. / The text lacks clear, identifiable spans for meaningful segmentation; it's a continuous prose with no discernible patterns or structures that would be useful in training an X-Spanformer model focused on span-aware encoding without tokenization cues. / The text segment contains clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English language structure; however, it lacks explicit coding constructs or domain-specific terminology that would indicate mixed content types. / The text segment contains clear religious and poetic language with identifiable phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context. However, it lacks explicit code constructs or mixed content elements."}}
 {"raw": "[visibilium et] invisibilium creatura- rum Pater piissime, qui incircumscripto lumine habitas eternaliter <et> ante principium mundi omnia ineffabiliter disponens atque gubernans, (2) eternitatem tuam atque incomprehensibilem pietatem verbis supplican-", "type": "natural", "id": {"id": "19b205dc-48e3-42e3-8490-8713ada1859f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans; however, contains archaic language that may not be representative of modern text patterns. Mixed content type due to presence of Latin phrases commonly found in religious texts or historical documents. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / The text segment contains clear religious phrases and structured sentences that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on natural language processing tasks. / The segment contains a mixture of Latin phrases and punctuation, with clear boundaries for spans that could be useful in training an encoder to understand complex structures combining both language elements. However, the presence of non-standard characters (like \"œ\") may affect clarity slightly but still retains compositional value. / The segment contains clear religious and philosophical language with identifiable phrases, though it lacks explicit compositional patterns for span segmentation due to its poetic nature. It is coherent but may not represent a wide variety of training examples needed by X-Spanformer."}}
 {"raw": "But we have removed those two, because they were against the Lords will, namely, to raise up the dead, and to appear to create living crea- tures from the earth: The end ofthe topics of the fourth treatise. 28I The parallel text in H reads 'ecce conclusionem vestram (\"behold your conclusion or perhaps better behold your confinement\").", "type": "natural", "id": {"id": "e5d2d97e-9b08-4106-92f2-e199ec179907"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable phrases and clauses suitable for span segmentation; coherent text representative of English prose. / The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of English prose or legal documents. It is clean but lacks domain-specific jargon which might limit its representativeness across different domains within \"natural\" language content. / Clear sentence structure with identifiable spans; well-suited for learning span segmentation in English prose. / Clear sentence structure with identifiable spans; useful for learning span segmentation in English prose. / Clear sentence structure with identifiable spans; useful for learning span segmentation in English prose."}}
 {"raw": "\"Arise and contemplate the grace and vir- tue of God. Ask,and it will be granted to you, because the mercy of the Lord has visited you.", "type": "natural", "id": {"id": "a03afb93-9f50-4a38-97b7-f28e1a523b7b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with meaningful phrases suitable for training a span-aware model in recognizing religious or philosophical texts. / Clear, coherent prose with identifiable phrases suitable for training a span-aware model on religious or inspirational texts. / Clear, coherent prose with identifiable phrases suitable for training a span-aware model focused on religious texts or inspirational content. / Clear, coherent prose with identifiable phrases suitable for training a span-aware model on religious texts or inspirational content. / The text segment is structurally clear, with identifiable phrases and sentences that can be segmented into meaningful spans for a tokenizer-free model to learn from. It represents valuable patterns in sentence structure typical of religious texts or inspirational writing which are useful training examples despite being less diverse than code-related content."}}
 {"raw": "Corniger rex meridionalis, et habet 4 ministros in 4 mundi partibus, Trocornifer in oriente, Malafer in occidente, Euiraber in meridie, Mulcifer in septemtrione. (9) Et quilibet habet legiones centum, et in qualibet sunt demones 4500, qui omnes istis 4 obediunt et subduntur; et isti 4 sunt, qui possunt omnes alios spiritus a thesauris absconditis fugare, ligare et constringere, et sunt ministri infernales. (10) Princeps eorum est Labadau: Eius coadiutor est Asmodeus, qui dat thesaurum indestructibilem", "type": "mixed", "id": {"id": "9bd07220-ca5d-4770-8885-7ba6dfa6b492"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of structured phrases and entities (e.g., names, cardinal directions) that can be segmented into meaningful spans for learning span composition in both linguistic context and potential symbolic representation. / Contains a mix of structured phrases and entities that can be segmented into meaningful spans, representing both linguistic patterns (e.g., names with roles) and numerical data for learning span composition in diverse contexts. / The segment contains a mixture of Latin phrases and structured descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and formalized content with specific terms like \"ministri,\" which could help the model understand contextually rich segments. / The segment contains a mixture of structured phrases and terms that can be segmented into meaningful spans, such as \"Corniger rex meridionalis,\" which could represent entities or concepts in the text; it is clean but lacks context for training purposes due to its abstract nature. / The text segment contains a mixture of structured phrases and names that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (names like Corniger rex meridionalis) and code-like constructs ((9), etc.)."}}
 {"raw": "stabiliri voluisti, (2) in cuius conspectu omnis racio, sermo, opus et sanctitas subsistit, per hec preciosa sacramenta angelorum tuorum da michi ea, que desidero et credo, visionis huius absque malignitatis intencione gloriam et graciam. Amen. LVIII 64 ORACIO Hamphynethon, Heloy, clementissime creator et inspirator et reformator omnium animarum viciatarum et omnium bonarum voluntatum approba- tor et ordinator; (2) deprecacionem gloriosus intende et mentem meam respice benignus, ut quod ex humilitate", "type": "mixed", "id": {"id": "d20726a9-d025-497e-b681-1c54ae989938"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and religious text, which have clear structures that can be segmented into meaningful spans for learning purposes. Despite the archaic language style being less common in modern datasets, it provides valuable patterns related to historical texts or liturgical compositions. / The text contains a mixture of Latin phrases and religious expressions, which may not have clear syntactic structures for tokenization but could provide unique patterns in span segmentation due to their distinctiveness from natural language syntax. However, the lack of coherence as an English sentence reduces its utility slightly. / The segment contains a mix of Latin phrases and religious text, which have clear structure but may not be directly relevant to X-Spanformer's training domains focused on modern languages or programming constructs. However, it offers unique patterns in span segmentation due to its historical language context. / Contains a mix of Latin phrases and poetic structure, with clear separations between verses that can be segmented into meaningful spans for learning purposes. The content is coherent but may require domain-specific knowledge to fully understand the context (religious or historical). / The segment contains a mix of Latin phrases and religious text, which can be segmented into meaningful spans such as individual words or short phrases that may represent compositional patterns for learning span segmentation in both natural language processing (NLP) contexts related to historical texts and code-like structures."}}
 {"raw": "110 SWORN BOOK OF HONORIUS XXXVIII Ego in conspectu tuo, Domine, Deus meus, in cuius nutu omnia nuda sunt et aperta, et in cuius manu omnia sunt munda et pura S mundifica et dep- ura me, Deus omnipotens -, hec enim loquor; (2) ut ablato infidelitatis et infeccionis errore et labe adiuvet me Spiritus tuus bonus, sanctus, vivifi- cans omnia, et +vivificett omnem incredulitatem et labem meam, (3) ut visionem tuam sanctissimam, licet indignus, propter tuam misericordiam valeam efficaciter et absque defectu iam", "type": "mixed", "id": {"id": "dedbaeca-e160-4de0-a0d2-0bf636af43cb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and numbers, suggesting it could be from an ancient text or legal document; however, the lack of context makes its utility for training uncertain. It has clear structured elements but may not represent valuable patterns due to language obscurity. / The segment contains a mixture of Latin phrases and punctuation, which may not be directly useful for training but shows potential in learning complex span structures involving language constructs. / The segment contains a mixture of Latin phrases and punctuation, which may not be directly useful for training but could provide interesting patterns in span segmentation due to its unique structure. However, it lacks clarity on how spans should align with the text's meaning or syntax rules that are typically learned from natural language processing tasks. / The segment contains a mixture of Latin phrases and punctuation, which could be useful for learning span segmentation in multilingual contexts or historical texts. However, its complexity may limit immediate utility without further context-specific preprocessing. / The segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, the content is not coherent in English or any other modern language context."}}
 {"raw": "And when they have come and asked you: what do you want? whereupon you should respond \"Peace, and your friendship: ask one of the twenty-seven, which you have sought in the prayers of purification, (51) knowledge of the heavens, ifyou have asked for this, even changing day into night and the opposite, if you have desired this, or consecrating a book, if you wish it; o likewise the others, (52) ifyou have been worthy to such an extent in one operation, you will be able to accomplish any of the twenty- seven,", "type": "natural", "id": {"id": "35fbae8c-7072-44c7-945d-42e0adbad80c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear sentence structure with identifiable spans like \"what do you want?\" and phrases such as \"(51) knowledge of the heavens,\" which are useful for learning span segmentation in a tokenizer-free context. The text is coherent, cleanly written prose that represents valuable patterns typical to religious or philosophical texts. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose with identifiable phrases and sentences; good for learning span segmentation in text."}}
 {"raw": "Deinde in illo latere eiusdem eptagoni tendente a quarta cruce ad sextam scribatur hoc aliud sacrum Dei nomen: \"Ueham\" (35) ita quod hec sillaba: ue' scribatur in illo loco eiusdem lateris, qui est supra primam sillabam de \"Anael; et hec litera: h\" supra ultimam sillabam (36) et hec sillaba: am in illo loco eiusdem lateris, qui est inter latus [inter]secans ipsum et sextam crucem: (37) Deinde in illo latere, quod tendit a sexto angulo eiusdem secundi eptagoni ad primum angulum, scribatur hoc aliud sacrum", "type": "mixed", "id": {"id": "d468a635-abb3-49fc-8e56-61617b607b64"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and instructions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and specific formatting or notation indicative of code-like structures. However, the clarity might require further context to fully understand its utility as training data. / The text contains a mixture of Latin phrases and instructions that can be segmented into meaningful spans, such as \"Deinde in illo latere,\" which could represent structured patterns for learning span composition within the context of historical or religious texts combined with code-like annotations (e.g., referencing specific positions like 'prima sillabam de Anael'). / The segment contains a mixture of Latin phrases and instructions that can be segmented into meaningful spans, such as \"Deinde in illo latere,\" which could represent structured patterns for learning span composition within the context of historical or religious texts combined with directional notation (e.g., angles). / Contains a mixture of structured text with identifiable spans (e.g., phrases, references to positions like \"prima sillabam\" and specific locations). However, it lacks clarity in modern language context which may affect training utility slightly. / Contains both structured phrases and a mix of Latin text with directional instructions, representing valuable patterns for learning span composition in multilingual contexts."}}
 {"raw": "destroy the foundations of cities or castles, 4) to drag down people into pits Or caves, to tempt those who are imprisoned, to destroy people, to give precious stones hidden in the earth as desired, and to harm anything: 5)", "type": "natural", "id": {"id": "4094e80e-0296-4675-aaf9-ac4c61525fac"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span segmentation in narrative text. / Clear sentence structure with identifiable spans for training, though repetitive phrases may need variation in actual datasets. / Clear sentence structure with identifiable spans; useful for learning span segmentation in narrative text. / Clear sentence structure with identifiable spans; coherent and representative of complex narrative text. / Clear sentence structure with identifiable phrases and clauses suitable for span segmentation; coherent text representative of descriptive language patterns."}}
 {"raw": "determined to establish Heaven and Earth, the sea, and the abysses and everything that is in them, (2) in whose sight subsists the plans, words, deeds, and piecy of all, by these precious sacraments of your angels grant unto me those things which I desire and I believe, the glory and grace of this vision, without any ill intent: Amen. LVIII: PRAYER 6.85 Hamphynethon, Heloy; most merciful creator; inspirer and reformer of all corrupted souls and approver and arranger of all good wills, (2) look merci- fully", "type": "natural", "id": {"id": "a5b65ffd-62f7-4aad-8dc7-1ae2cee1645c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear religious text with discernible phrases and sentences suitable for span segmentation; however, it lacks diverse patterns due to its repetitive nature. / Clear and coherent prose with identifiable meaningful spans; well-suited for training a span-aware model on religious texts or prayers. / Clear spiritual and prayerful language with discernible phrases; however, it lacks clear compositional patterns for span segmentation due to its poetic nature. Suitable as a unique example of religious text but not ideal for learning general spans in diverse contexts. / Clear religious text with identifiable phrases and structured prayers, suitable for learning span segmentation in a spiritual context. / Clear religious prose with discernible phrases and sentences suitable for learning span composition; however, it lacks diverse linguistic structures that could enrich training data."}}
 {"raw": "[The Third Work or Treatise]\nCXVI Here begins the topics %f the Third Treatise of this Works wbich is about the Spirits ofthe Air Concerning the constraint of spirits through words.", "type": "natural", "id": {"id": "3f6db1cb-aec2-4ab2-b904-046b6d685c90"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of prose and markup elements, with clear structure for both textual content (\"The Third Work or Treatise\") and metadata (\"%f the Third Treatise\"). It represents valuable patterns in span segmentation across different types of text (natural language vs code-like syntax). / Clear prose structure with identifiable thematic spans; suitable for learning context and composition in a tokenizer-free model. / The segment contains a mixture of prose and markup (indicated by \"%f\"), with clear structure for span segmentation, including titles (\"The Third Work or Treatise\") and thematic descriptions that are relevant to learning patterns in both natural language text and code-like elements. / Clear prose structure with identifiable spans for thematic elements; suitable training example despite archaic language. / Clear prose structure with identifiable spans like titles, phrases indicating content topics (e.g., \"The Third Work or Treatise,\" \"%f the Third Treatise of this Works\"), and thematic elements (\"spirit[s] ofthe Air\"). Suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "Then in the middle of the first side and the third heptagon on the right should be written Vos and in the next side to the right of the same third heptagon this name: \"Duynas\" (41) and in the next Gyra' and in the next Gram' and in the next Aysaram' and in the next \"Alpha' and in the next 0\" (42) Then in that small space which is under the second and the third angle of the first heptagon, should be written this name of God: El\" 43) and in that small space which is to the right under the second and third", "type": "mixed", "id": {"id": "33ab80dc-5f13-41f6-982b-4c804639409a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of structured instructions (likely from an ancient manuscript or religious artifact) and identifiable patterns that can be segmented into meaningful spans, such as names (\"Vos\", \"Duynas\"), terms (\"Heptagon\", \"Alpha'\") which are useful for learning span composition. / The text segment contains clear structured elements such as names, numbers (heptagons), and descriptions that can be segmented into meaningful spans for a tokenizer-free span-aware model to learn from. It is coherent but lacks context which may affect its utility in training the X-Spanformer directly; however, it still provides valuable patterns related to natural language structure. / The text segment contains a mixture of structured descriptions (natural language) and specific instructions that resemble programming-like syntax, which can be useful for learning span segmentation in both domains. However, the clarity is somewhat compromised by its complexity; thus it receives an intermediate score. / The segment contains a mix of structured instructions (spanning across multiple heptagons) and names, which can be segmented into meaningful spans for training purposes; however, it lacks clarity in its current form due to the absence of punctuation or clear delimiters between different parts. / The text segment contains clear, structured elements with identifiable spans such as names and phrases that can be segmented meaningfully for training purposes in a tokenizer-free context. It is coherent but lacks contextual clarity due to its abstract nature which may not directly translate into useful patterns without additional domain-specific knowledge or annotations."}}
 {"raw": "infirmitatem cuilibet et qualemcumque placuerit operanti; De interficiendo quemlibet; De tem- pestate et periculo terre et maris fuganda; (9) De nave retenta in mari per adamantem vel aliter rehabenda; De omni periculo evitando; De congrega- cione et accepcione avium; De piscibus congregandis et accipiendis; (10) De animalibus silvestribus et domesticis congregandis et accipiendis; De bello faciendo inter aves vel homines vel pisces vel animalia; (11)", "type": "mixed", "id": {"id": "6adb95f0-93d1-4664-b47e-efff16a6f39d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of legal text and Latin phrases, with clear structure for span segmentation; however, it lacks coherence as an isolated example due to its fragmented nature. / The segment contains a mixture of legal or religious text with structured enumeration, which can be segmented into meaningful spans representing different clauses and their respective numbers; however, it lacks contextual clarity for training purposes due to its specialized language. / The text segment contains structured legal clauses with clear boundaries, representing valuable patterns for learning span composition in both language and legislative contexts. / The segment contains structured legal or procedural text with identifiable spans such as clauses and numbered items, which can be useful for learning span segmentation in a tokenizer-free context. However, it lacks coherence due to fragmented sentences without clear punctuation marks separating them into meaningful units suitable solely within the given domain of law-related texts. / The segment contains a mix of legal or religious text with structured lists (spans) that can be segmented into meaningful parts, representing valuable patterns for learning span composition in both textual and enumerative contexts."}}
 {"raw": "To make all pleasures appear:\n(27) The Topics of the Fourth Work 89. To release someone who is imprisoned. 9o. To unlock bars and prisons: 91.", "type": "natural", "id": {"id": "ed0ad4b4-caf5-4bec-bc22-8bd9f3025549"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of prose and structured references (e.g., \"(27)\"), which can help the model learn to segment spans that include both narrative content and formal citations or annotations commonly found in academic texts. / Clear prose with identifiable thematic spans; well-suited for learning span composition in text. / Clear prose structure with identifiable thematic spans; useful for learning span segmentation in narrative text. / The text lacks clear, meaningful spans for training; it's a fragmented quote without context or structure suitable for learning span segmentation. / Clear prose structure with identifiable phrases; suitable for learning span segmentation in a language context."}}
 {"raw": "72 SWORN BOOK OF HONORIUS (46) Et in bucca superiori a leva crucis scribatur hec litera: 'a\" et super buccam crucis secundam a dextris hec litera: \"g\" (47) et sub bucca inferiori a dextris scribatur hec alia litera: 'a\" et sub quarta bucca hec alia litera: 1\" (48) Deinde in alio spaciolo sequenti a dextris in medio scribatur hoc nomen: \"Ely\" et in alio hoc nomen: \"Eloy' et in alio \"Xpc\"3 et in alio \"Sother\" et in alio \"Adonay' et in alio \"Saday\" (49) Deinde scias, quod in exemplaribus communiter pentagonus", "type": "mixed", "id": {"id": "22a0b02b-2a7e-42c4-8534-7307330b9975"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of Latin phrases and references to religious texts, which can be segmented into meaningful spans such as names (\"Ely\", \"Eloy\") or specific instructions for writing symbols (e.g., 'a', 'g'). It is well-formed with clear structural elements that are representative of the target domain. / Contains structured elements with clear patterns for span segmentation, including names and sequences that are indicative of a coded format intertwined within text descriptions. / The text segment contains a mixture of Latin phrases and references to religious texts, which may have structured patterns useful for learning span segmentation in both language processing tasks related to historical documents or multilingual contexts. / Contains a mixture of structured elements (names, sequences) that can be segmented into meaningful spans for training purposes. Despite some archaic language and lack of modern context clarity, it has clear compositional patterns suitable as diverse examples in span segmentation tasks. / Contains structured patterns with clear segmentation opportunities; spans include names, phrases in Latin script indicating a religious or historical context which could be valuable for learning span composition and contextual understanding."}}
 {"raw": "60 SWORN BOOK OF HONORIUS (28) 92um de apparencia corporum mortuorum quod loquantur et resuscitata appareant: 93um ut animalia de terra creari appareant: Set ista duo capitula subtraximus, quoniam erant contra Domini vol- untatem: III Incipit libero Nos igitur cum divino adiutorio precepta Salomonis et vestigia sequi volentes tanteque subtilitatis vires recipere unum principium necessarium esse profitemur: (2) Nota, quod primum principium est divina maiestas, et est invocacio vera a fide cordis procedens et", "type": "natural", "id": {"id": "de24b9ee-a999-49e0-bfe9-3b69927b0a05"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and religious text, which could be valuable for learning span segmentation in historical or liturgical contexts; however, it lacks clear modern language structure that might hinder immediate comprehension without additional context. / Clear structure with meaningful spans; well-suited for learning span segmentation in religious or philosophical texts. / Clear structure with identifiable spans; well-suited for learning span composition in religious texts. / Clear prose structure with identifiable thematic spans; useful for learning span segmentation in a purely linguistic context. / Contains both structured religious text and philosophical statements, providing diverse patterns for span segmentation in a mixed context."}}
 {"raw": "Sworn Book of Honorius 111 XXXVIII.S1 [ speak these things in your presence, 0 Lord my God, at whose command all things are naked and open, and in whose hand all things are clean and pure; cleanse and purify me, almighty God, (2) so that the error and stain of unbelief and deception being taken away, your good holy Spirit may help me, making me live, and *living, help my unbeliefs2 andall my faults, (3) that I may be permitted your most holy vision, although unworthy, through your mercy may [ prevail", "type": "natural", "id": {"id": "329eddd8-6ed2-4355-ab95-ed02b384c37d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear religious prose with identifiable phrases and sentences suitable for learning span segmentation in a non-tokenized context. / Clear, coherent prose with identifiable thematic spans; useful for learning context and composition in religious texts. / Clear, coherent prose with identifiable thematic spans; useful for learning context and structure in language processing tasks. / Clear structure with identifiable spans; coherent and representative of religious prose, though it lacks diverse patterns for comprehensive training. / The text segment is structurally clear with identifiable spans such as phrases and sentences, representing valuable patterns for learning span composition in a religious context. It’s clean but lacks coherence due to the presence of corrupted characters (\"unbeliefs2\")."}}
 {"raw": "Celestium duo sunt modi, quorum quidam serviunt Deo soli, (10) etisti sunt 9 ordines angelorum, videlicet cherubyn, seraphin, troni, dominaciones, virtutes, principatus, potestates, archangeli et angeli, (11) de quibus nec ex coacta virtute nec ex artificiali potencia inter mortales est loquendum, et isti nul- latenus invocantur; (12) quia magestati divine continue laudantes assistunt et nuncquam ab eius presencia separantur:", "type": "mixed", "id": {"id": "a7716b10-90dd-4f00-bedc-258cb7996e33"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and theological concepts, which can be segmented into meaningful spans such as \"Celestium duo sunt modi,\" \"(10) etisti sunt 9 ordines angelorum,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) contexts related to religious texts. / The segment contains a mixture of structured elements (biblical references and angelic orders) that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both religious texts and code-like structures. It is clean but may require domain-specific knowledge to fully understand the context. / The text contains a mixture of religious or philosophical discourse and structured enumeration (e.g., \"quorum quidam serviunt Deo soli, etisti sunt 9 ordines angelorum\"), which can help the model learn span segmentation in both natural language contexts with enumerative structures. / The segment contains a mixture of structured religious text and enumerated lists, with clear spans for enumeration (e.g., \"9 ordines angelorum\"). It is clean but may require domain-specific knowledge to fully understand the context; however, it represents valuable patterns in span segmentation. / The segment contains a mixture of Latin phrases and structured enumeration, which can be segmented into meaningful spans representing both linguistic patterns (Latin) and syntactic structures related to angelic orders in Christian theology. This is coherent for training purposes as it demonstrates complex span segmentation across different content types within the same text block."}}
 {"raw": "226 SWORN BOOK OF HONORIUS CXVII Finitis capitulis incipit prohemium in spiritibus aereis Cum igitur ignoratis superioribus angelorum illos constringere sit impos- sibile, (2) nomina spirituum aeris et ventorum in precedenti posuimus capitulo; ut sui superiores clarius viderentur; et a quibus poterat quilibet subiugari. (3) Nunc autem de natura aeris et omnium spirituum in ipso residencium hic faciemus tractatum: (4) Aer est elementum corruptibile, liquidum et subtile inter cetera nobilius passibiles", "type": "mixed", "id": {"id": "602b94c9-8194-4114-a135-c4b2752b59be"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin text and structured phrases that can be segmented into meaningful spans, reflecting both linguistic patterns (natural language) and formal constructs typical in historical texts or legal documents which are valuable for learning span composition across different domains. / The text contains a mix of Latin phrases and descriptions, with clear sentence structures that can be segmented into meaningful spans for learning purposes. It includes both linguistic elements (natural language) as well as technical terms related to historical or philosophical concepts (\"elementum corruptibile,\" \"liquidum et subtile\"), which are valuable patterns in mixed content training data. / The segment contains a mixture of Latin text and philosophical exposition, which can be segmented into meaningful spans such as phrases or sentences that reflect complex structures suitable for training purposes. However, the archaic language may pose some challenges in terms of clarity but still offers valuable patterns related to historical texts mixing with natural discourse elements. / The segment contains a mixture of Latin text and philosophical statements, with clear sentence structures that can be segmented into meaningful spans for learning purposes. It is coherent but lacks modern context which may limit its utility in contemporary applications. However, the structural clarity makes it valuable as training data despite being historical language content. / The segment contains a mix of Latin text and structured formatting, with clear divisions between phrases that can be segmented into meaningful spans for learning purposes."}}
 {"raw": "recipiens qualitates et est simpliciter invisibilis set ipso composito videtur: (5) In quo sunt spiritus, quos sancta mater ecclesia dampnatos appellat, set ipsi oppositum asserunt esse verum, et ideo eos neque bonos neque malos volumus appellare.", "type": "natural", "id": {"id": "8f08c2aa-7ed7-475f-bbb3-0cd8e932314a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable phrases and clauses suitable for span segmentation; however, contains archaic language that may not be common in modern training data. / Clear sentence structure with identifiable spans; useful for learning complex linguistic patterns and compositional structures in text. / The text contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context. It is clean but lacks coherence due to its complex syntax which might not directly translate well without further preprocessing or contextual understanding. / Clear sentence structure with identifiable phrases; useful for learning span composition in prose. / Clear sentence structure with identifiable phrases and clauses suitable for span segmentation; well-written prose representative of complex linguistic patterns."}}
 {"raw": "(6) Et illi spiritus in aere reguntur secundum ipsius aeris qualitates, et ideo eius qualitates videa- mus_ (7) Aer igitur in quantum elementum a planetarum influenciis guber- natur: Bene igitur accipit diversas complexiones, quas nunc dicemus, (8) quia quidam sunt demones ad tribulacionem aeris constituti, quos ven- tos Salomon appellavit, quoniam ventos excitant, et secundum quemlibet mutatur aer: (9) Et penatur spiritus illius partis, unde quilibet debet aspi- cere ventum sue operacioni competentem, quia", "type": "mixed", "id": {"id": "43e0782b-1dba-43bf-bb45-e5f945d0e166"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and English text, with clear demarcations between segments that could be useful for learning span segmentation in multilingual contexts. However, the presence of untranslated terms may affect clarity slightly. / Contains both structured phrases and complex sentences with clear semantic boundaries suitable for span segmentation learning. / The segment contains a mixture of Latin phrases and English text, with clear sentence structures that can be segmented into meaningful spans for learning purposes. It represents valuable patterns in both language structure (natural) and potential code-like constructs due to its historical context related to alchemy or early scientific texts. / Contains a mixture of Latin phrases and English text, with clear demarcations between segments that can be segmented into meaningful spans for training purposes. The content is coherent but may require domain-specific knowledge to fully understand the context (Latin). / Contains a mixture of Latin phrases and prose, with clear sentence structures that can be segmented into meaningful spans for learning span composition in both language processing contexts."}}
 {"raw": "CXXIII [Concerning the intermediate spirits] Having  treated of the spirits which are either fully good or fully evil, we will now talk about the intermediate ones (2) But it should be noted in operating with them, that their actions are neither fully for good nor fully for evil.", "type": "natural", "id": {"id": "23303d8b-3914-46dc-b088-c3653b54dac0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear prose with identifiable thematic spans; useful for learning context and semantic segmentation in NLP tasks. / Clear prose structure with identifiable thematic spans; useful for learning span segmentation in narrative text. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with identifiable spans for intermediate spirits; coherent and representative of philosophical discourse."}}
 {"raw": "who only serve God, and those are the nine orders of angels, namely; the Cherubim, Seraphim, Thrones, Dominations, Virtues, Principalities, Powers, Archangels, and Angels, concerning whom it is spoken among mortals neither by forced power nor by artificial force, and therefore in nowise should they be invoked, because they always stand praising the divine majesty and never separated from his presence.", "type": "natural", "id": {"id": "c9152fac-0a00-4c6c-b166-8d3e43855d73"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment is structurally clear with identifiable spans such as \"the nine orders of angels\" and individual angelic names, representing valuable patterns for learning span segmentation in religious or mythological contexts. It contains coherent sentences suitable for training purposes without code elements. / Clear, coherent prose with identifiable thematic spans; useful for learning span segmentation in religious or philosophical texts. / Clear, coherent prose with identifiable phrases and concepts suitable for training a span-aware model on religious or philosophical texts. / Clear, coherent prose with identifiable thematic spans; useful for learning context and compositional patterns in language. / Clear and coherent prose with identifiable thematic spans; useful for learning sentence segmentation in a religious context."}}
 {"raw": "The process for calling the spirits commences similarly, followed by a four-part ritual: the invocation\" the 'seal and binding;\" the 'conjuration; and the \"placation. 109 Placating involves offering the spirits a small gift: 9110", "type": "mixed", "id": {"id": "ca4094e5-8f7a-4d0a-b5d0-9c0e203b99a8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of narrative text and ritual instructions, with clear structured elements like phrases (\"the invocation\", \"seal and binding\") that can be segmented into meaningful spans for training purposes. It is clean but lacks coherence due to the abrupt ending; however, it still represents valuable patterns in span composition within mixed content types. / The segment contains a mixture of narrative text and ritual instructions, with clear phrases that can be segmented into meaningful spans for learning purposes. However, the presence of numbers (9110) may require further context to ensure accurate span segmentation in training data. / Clear narrative structure with identifiable phrases and actions suitable for span segmentation; coherent prose representative of cultural or religious texts. / The text contains a mix of narrative and ritual instructions with clear, identifiable spans such as \"the invocation,\" \"seal and binding; the 'conjuration;\" and \"'placation.\" These segments are structurally coherent for training purposes in recognizing both natural language patterns related to rituals or ceremonies. / The segment contains a mixture of narrative and ritual instructions with clear separations between phrases, which can be useful for learning span segmentation in both domains. However, the presence of numbers (9110) without context may confuse non-code models; thus it is rated slightly lower than ideal training data but still valuable due to its mixed nature."}}
 {"raw": "Abbreviations: Abbreviations in the   manuscript have   generally been expanded. Chi Rho: The manuscripts generally uses this common practice of abbrevi- ating the name Iesus Christus with Greek letters IHS XP (Ins Xo); these have been expanded to Iesus Christus.", "type": "mixed", "id": {"id": "f6e84842-072c-4460-ab40-2619ea52ba00"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both abbreviations and expanded forms, showing clear span segmentation patterns useful for training. Mixed content type includes elements of language processing relevant to X-Spanformer. / Contains both abbreviations and expanded forms, showing clear span segmentation patterns in a document context. / Contains both abbreviations and expanded forms, showing clear span segmentation between abbreviated terms (e.g., \"Abbreviations\") and their expansions (\"Iesus Christus\"). The text is clean but lacks context for deeper learning patterns; however, it represents a valuable pattern in handling spans of natural language with code-like elements. / Contains both abbreviations and expanded forms, useful for learning span segmentation in a mix of language types. / Contains both abbreviations and expanded forms, representing a pattern useful for span segmentation in text with embedded codes or annotations."}}
 {"raw": "those are the nine orders of angels, namely the Cherubim, Seraphim, Thrones, Dominations, Virtues, Principalities, Powers, Archangels, and Angels, (II) concerning whom it is spoken among mortals neither by forced power nor by artificial force, and therefore in nowise should they be invoked, (12) because they always stand praising the divine majesty, and never separated from his presence: 6 From the mass_", "type": "natural", "id": {"id": "10d52319-eec4-4143-b994-59d43c2895dd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans; represents meaningful patterns for learning span segmentation in religious texts. / Clear structure with identifiable spans; represents meaningful patterns for learning span composition in religious texts. / Clear and coherent prose with identifiable thematic spans; suitable for learning span segmentation in a tokenizer-free context. / Clear structure with identifiable spans (e.g., names of angelic orders, quoted text), clean and coherent for learning patterns in span segmentation within religious or poetic texts. / Clear segmentation into phrases and sentences; represents structured patterns in religious texts."}}
 {"raw": "(III.9) Thus the spirits are divided into four categories- celestial or higher angels, planetary o lower angels, aerial spirits or daemons, and terrestrial spirits (or daemons) The aerial spirits are also listed alongside with the plan- etary spirits, with the explanation that the planetary angels are used to control them.9 The list is repeated later (with some variations) in the catalog of aerial spirits According to Honorius; whenever good angels are called, you must not have the whistle, wand, nor the", "type": "natural", "id": {"id": "2be2ac70-746e-401f-8dbf-ae29691705f9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear division into categories with meaningful spans; well-formed prose suitable for training a span-aware model in recognizing structured text segments. / Clear division into categories and descriptions; useful for learning span segmentation in prose. / The text segment contains clear, structured elements with meaningful spans such as \"spirits,\" \"angels,\" and categories like celestial or lower angels; it is coherent for training purposes but lacks explicit code constructs to classify strictly under 'code'. / Clear division into thematic sections; spans like \"spirits\", \"angels\" can be identified for training purposes. However, it lacks explicit span markers which may affect learning efficiency slightly. Overall clean and coherent text segment representative of mixed content with a strong foundation in structural clarity suitable as an example for X-Spanformer. / Clear division into categories and descriptions of spirits; well-suited for learning span segmentation in narrative text."}}
 {"raw": "Note particularly the frequent use of Greek \"XP\" or \"Xo\" for \"Christ\" In textual studies this is commonly referred to as chi rho. Thus XPus' (abbreviation for (( 'Christus) looks like Latin \"Xpus;\" Similarly the Greek 'IHS\" or 'ins;\" standing for the Latin Iesus (\"Jesus\") is found throughout the text.124\n124 GH expanded this as Ihesus Christus", "type": "natural", "id": {"id": "21143807-9610-4719-8769-488e7a5544eb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear textual references and abbreviations, useful for learning span segmentation in historical or religious texts. / The text segment contains clear references to Greek phrases and their Latin equivalents, which are meaningful spans for a tokenizer-free span-aware model focused on textual studies of religious texts. It is clean but lacks compositional value due to its specialized content type (religious terminology). / Clear textual references and abbreviations, useful for learning span segmentation in religious texts. / The segment contains clear references to Greek letters and abbreviations, which can be segmented into meaningful spans for training a span-aware model focused on textual studies of religious texts or historical documents. It is clean but lacks diverse patterns that could enhance learning further. / Clear linguistic structure with identifiable spans of Greek terms and Latin abbreviations, useful for learning span segmentation in historical texts."}}
 {"raw": "Et qui per talia experimenta operari voluerit, Dominum Deum suum dimittat et derelinquat et spiritibus sacrificet et ydolis fidem adhibeat; (19) quia fides operatur in homine, sive bona fuerit sive mala, unde in Evangelio: 'Fides tua te salvam fecit\" (20) Iudei in hac visione nullatenus operantur; quia per adventum Christi donum amiserunt, nec possunt in celis collocari testante Domino, (21) qui dicit: \"Qui baptizatus non fuerit condempnabitur'\" et sic in omni- bus angelis operantur imperfecte.", "type": "mixed", "id": {"id": "cdb19e66-98b1-4cdd-a796-1e5e65893ed8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mix of Latin phrases and English, which may confuse the model's learning process due to language mixing; lacks clear structure for meaningful span segmentation in this context. / The text contains a mix of Latin phrases and scriptural references, which may not provide clear span segmentation patterns for training purposes due to its specialized language structure. Additionally, the context is religious rather than technical or natural conversational English that could be more beneficial in learning diverse linguistic structures. / The text contains a mixture of Latin phrases and scriptural references, which have clear structure for span segmentation; however, it lacks coherence in English context making its utility limited without additional translation or contextualization. / The segment mixes religious text with Latin phrases and lacks clear, consistent structure for meaningful span segmentation; it is not coherent enough as a standalone training example. / The segment contains a mix of Latin text and references to religious scripture, which may not provide clear span segmentation patterns for training purposes; lacks coherence in English context."}}
 {"raw": "(22) Nec per invocaciones suas veniunt ad effectum, nisi Christo fidem adhibeant; quia dictum est eis per prophe- tam: (23) \"Quando venit rex regum et dominus dominancium, cessabit unccio vestra\" que nuncquam cessaret, Si per hanc artem haberet effica- ciam veram, et sic opera eorum nulla: (24) Et quamvis Iudei, in quantum Iudei; a Deo sunt condempnati, tamen summum adorant creatorem set indebito modo_ (25) Tamen virtute sanctorum Dei nominum coguntur venire spiritus, set quia Iudei non signantur signo", "type": "mixed", "id": {"id": "a81da3cb-d06a-4c73-bcf0-57a37e2aae8c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of religious text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for training purposes. Despite some archaic language which might pose challenges in modern contexts but still valuable historically or academically. / The text segment contains a mixture of biblical references and Latin phrases, which can be segmented into meaningful spans for training purposes; however, the presence of non-standard characters (e.g., \"_\") may affect clarity slightly. Overall clean with clear structural elements that represent valuable patterns in span composition across natural language segments intertwined with code-like constructs. / The segment contains a mix of Latin phrases and religious text, which can provide diverse patterns for span segmentation in both structured (Latin) and unstructured contexts. However, the presence of non-standard characters like \"œ\" may affect clarity slightly but still retains significant compositional value. / The segment contains a mixture of religious text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for learning span composition in both language types. However, it may require additional context or preprocessing to fully leverage its compositional value due to the presence of archaic terms like \"nec per invocaciones suas\" which could pose challenges during training without further linguistic analysis tools. / The segment contains a mix of religious text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for training purposes. However, the presence of untranslated or unclear elements may affect its utility as is."}}
 {"raw": "In 1582, John Dee, one ofthe leading scientists and occultists ofhis age, undertook a series of Mystical Experiments. 9113 He quickly became con- vinced that he was communicating with supernatural creatures. One of the first instructions he received was to construct a Seal of God, based on one \"already perfected\" in his books. Dee consulted several manuscripts, one of which was apparently Sloane 313.\"", "type": "natural", "id": {"id": "fe2a8b69-09b9-4348-9b76-7d9104ae81b6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans (dates, names, events) suitable for learning span segmentation in a tokenizer-free context. / Clear narrative structure with identifiable spans like names, dates (1582), and references (\"Sloane 313\"). Well-formed for training purposes; however, it lacks technical jargon or complex sentence structures that could challenge the model further. / Clear narrative structure with identifiable spans (e.g., names, dates), coherent and clean for training purposes; represents valuable patterns in historical text composition. / Clear narrative structure with identifiable spans like names, dates, and references to manuscripts; clean for training purposes. / Clear narrative structure with identifiable spans (e.g., names, dates). Well-written and coherent for training purposes."}}
 {"raw": "But this should principally be observed, that there are three cypes of people who perform this art: pagans, Jews, and Christians. The pagans sacrifice to the aerial and earthly spirits; and do not bind them, (17) but the spirits pretend themselves to be confined by the words of their law, in order that they have faith in idols, and never be converted to the true faith: (18)", "type": "natural", "id": {"id": "2e0fad13-50ab-49c5-8c08-6332ecff4f2c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; however, archaic language may pose challenges for modern NLP models. / Clear prose with identifiable thematic spans; however, archaic language may pose challenges for modern training data. / Clear prose with identifiable spans; however, it lacks compositional value for span segmentation learning due to archaic language and context-specific references. / Clear prose with identifiable thematic spans; useful for learning context and composition in span-aware models. / Clear sentence structure and identifiable spans; useful for learning span segmentation in prose."}}
 {"raw": "198 Regarding the problematic word \"penantur (here translated \"compelled to serve\") , see introduction, p 41 and GH,p So.", "type": "natural", "id": {"id": "65d9c165-0923-4a1c-856d-c729a80982a4"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with a reference to an introduction and GH,p, indicating potential for learning span segmentation in academic or technical writing. / The text segment contains a citation with references to pages and sections, which can be segmented into meaningful spans such as \"198,\" \"(here translated 'compelled to serve')\", \"see introduction,\", \"p 41\", etc., representing valuable patterns for learning span composition in the context of academic or literary texts. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / Clear sentence structure with a reference to an introduction and GH, indicating meaningful spans for training in span segmentation of academic or technical writing. / The text segment contains a citation with references to pages and abbreviations, which can be segmented into meaningful spans for learning context in scholarly texts. However, it lacks complexity that could challenge an advanced model like X-Spanformer; thus the score is not perfect but sufficient as training data."}}
 {"raw": "And because they adhere to a false faith; their works are invalid. And he that wishes to perform such experiments must abandon and forsake the Lord their God and sacrifice to the spirits and put faith in idols, (19) because faith works in man, whether good or evil, from which the Gospel says: Your faith has made you well\"7 (20) Jews can in nowise work to obtain this vision, because with the arrival of Christ they have lost the gift, nor can they be stationed in heaven as the Lord testified when he said:", "type": "natural", "id": {"id": "cbc86300-54b7-4eac-8786-81dec4bab7b3"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of religious texts or sermons. It is clean but lacks domain-specific jargon which might limit its utility slightly compared to specialized code examples. / The segment contains a clear narrative structure with identifiable spans such as verses and quotations, which can be useful for learning span segmentation in religious texts or similar contexts. It is coherent but lacks technical complexity that might benefit from mixed content inclusion. / Clear sentence structure and meaningful phrases; however, contains religious text which may not be universally representative of all training data needs. / Clear sentence structure and meaningful phrases, though religious context may limit generalizability. / Clear, coherent prose with identifiable thematic spans; useful for learning context and sentiment segmentation in NLP tasks."}}
 {"raw": "Sworn Book of Honorius 235 CXXIV Concerning the Spirits between the East and the South. We therefore declare, that between the east and south is a single region, which is called Consol242 and in it are angels, which are called \"equinoctial , and they are these four: (2) Formione the king, and his ministers Guth, Maguth, and Gutrhyn [*Guthryn]; 243 and all other daemons of this region are placed under these, and they are subordinate to Jupiter and its winds, which are called Borean (\"Northerly\") and Subsolar", "type": "mixed", "id": {"id": "f69d8a46-d814-4cae-8c29-4a7901381047"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of historical and possibly mythological language with references to celestial bodies, which could be useful for learning span segmentation in both structured (code-like) patterns or natural linguistic constructs. However, the presence of archaic terms may pose some challenges but still offers valuable compositional insights into ancient texts mixed with descriptive phrases about regions and entities. / The text contains a mix of historical and mythical content with clear references to entities (angels, daemons) that can be segmented into meaningful spans for training purposes; however, it lacks explicit programming code constructs or natural language sentences in isolation which may limit its utility as standalone examples. / The text segment contains a mixture of historical and possibly mythological content with structured elements like titles, names (angels), regions (\"Consol242\"), which can be segmented into meaningful spans for learning purposes. However, the presence of potential OCR errors or archaic language may affect clarity slightly but still retains compositional value as mixed type text. 
/ Contains a mix of historical text and possibly archaic references to celestial bodies, which could be useful for learning span segmentation in both language processing (natural) and domain-specific contexts like ancient texts or mythological studies (code). However, the clarity is somewhat compromised by unusual spellings (\"Consol242\", \"Guthryn\") that may confuse a model. / The text segment contains a mixture of historical and possibly mythological content with structured elements like titles, names (e.g., \"Consol242,\" \"Formione\"), and descriptions (\"angels called 'equinoctial'\"). These can help the model learn span segmentation in both natural language contexts."}}
 {"raw": "48 SWORN BOOK OF HONORIUS Liber Iuratus Honorii Text [Prologus] Cum convenissent maligni spiritus demonia in cordibus hominum intonantes, cogitantes utilitatem fragilitatis humane +posse suo corrum- pere et+ totam mundi machinam volentes suis viribus superare, (2) ypocri- sim cum invidia seminantes, pontifices et prelatos in superbia radicantes, dominum papam cum cardinalibus in unum venire fecerunt, dicentes adinvicem que secuntur: (3) \"Salus, quam dedit Dominus plebi sue, modo per magos et nigro- manticos", "type": "mixed", "id": {"id": "b3ab9363-409d-4e95-88a5-119cf3d4b78d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and modern English, with clear demarcations between segments that could be useful for learning span segmentation in both languages. However, the presence of non-standard characters (e.g., \"ypocri- sim\") may affect clarity slightly but still retains compositional value as it reflects historical texts often used to train such models. / Contains a mixture of Latin phrases and English, with clear demarcations for potential span segmentation; however, the text's archaic language may pose challenges in learning modern context spans. / The segment contains a mixture of Latin phrases and English text, with clear demarcations for potential span segmentation (e.g., \"Liber Iuratus Honorii,\" \"(2) ypocri-sim\"). It is coherent but lacks context or clarity in its current form. / The text segment contains a mixture of prose and Latin phrases, which may help the model learn span segmentation in both contexts; however, it lacks clarity due to archaic language that could hinder learning patterns for modern applications. 
/ Contains a mix of Latin phrases and modern English, with clear demarcations between them that can be used for span segmentation; however, the archaic language may pose challenges in terms of training utility without additional context or preprocessing steps to handle historical text nuances."}}
 {"raw": "Whomever has been baptized will not be condemned,8 and so they work imperfectly with all angels: (22) Nor will their invocations be effective, unless they put their faith in Christ, because it was said through the prophet: (23) \"When the king of kings and lord of lords comes, your anointing will cease,\"9 which should never have ceased, if they could have true effectiveness through this art; and thus their works are null: (24)", "type": "natural", "id": {"id": "dae8a135-3f3e-452e-be34-c1cf2c369e35"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; however, it lacks coherence and contains archaic language that may hinder learning patterns for modern contexts. / The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of religious or philosophical texts. It is clean but lacks coherence due to fragmented sentences which may pose a challenge during training; however, its thematic consistency aids model understanding. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / The text lacks clear, identifiable spans for meaningful segmentation; it is incoherent and not representative of structured patterns suitable for training purposes. / Clear sentence structure with identifiable spans; useful for learning span segmentation in English prose."}}
 {"raw": "And although the Jews, as they are Jews, have been condemned by God, yet they honor the most high Creator; but in an improper manner: (25) Yet with the power of the holy names of God, the spirits are compelled to come; but because the Jews are marked not with the Sign of the Lord, namely of the cross and of the faith, the spirits are unwilling to answer them truly\n7 Luke 17:19. In the parallel text in SSM L.4T f.49, the main target of derision is Islam. See Veenstra 2012 Pp.", "type": "natural", "id": {"id": "9b04a028-eecf-4c36-b3ed-086e58744181"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, such as \"the Jews,\" \"God condemned them,\" etc., which are useful for learning span composition in a tokenizer-free context. It is clean but lacks domain-specific patterns due to its religious content nature. / The segment contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of religious texts or discussions about faiths (natural language). However, it lacks code constructs which may limit its mixed content type utility. / The text contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a religious context with some intertextual references which could add complexity to the model's training data. / The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, such as \"the Jews,\" \"God,\" etc., which are useful for learning span composition in a tokenizer-free context. 
It is clean but lacks direct code elements or mixed content types; it represents valuable patterns of religious discourse suitable for natural language processing tasks focused on text comprehension rather than tokenization challenges. / The text contains complex religious and historical references that may not have clear, consistent span patterns for training purposes; lacks compositional clarity in identifying meaningful spans."}}
 {"raw": "(4) Nam et ipsi magi potu diabolico inebriati et eciam excecati contra statuta sancte matris ecclesie procedentes ac preceptum Dominicum transgredientes, sic dicens: (5) 'Non temptabis Dominum Deum tuum' set 'ei soli servies;' ipsi Deo sac- rificium abnegando et temptando nomina creatoris, demones invocando et eis sacrificia tribuendo, quod est contra baptismatis sacri preceptum, (6) nam ibi dicitur: 'Abrenuncio Sathane et omnibus pompis eius; qui non tantum pom- pas et opera Sathane prosecuntur set", "type": "mixed", "id": {"id": "4ca72a22-006f-4d9e-9069-648fb6bf7c3d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains both religious language and Latin phrases, which can be segmented into meaningful spans for a span-aware model to learn from the structure of such texts. However, it may require additional context or preprocessing due to its historical nature. / The segment contains a mix of religious text and Latin phrases, which can be segmented into meaningful spans like verses or clauses; however, the presence of non-standard characters (like 'œ') may affect clarity for training purposes. / The text segment contains a mixture of Latin phrases and religious context, which can be segmented into meaningful spans for learning span composition in both language processing tasks involving historical texts or multilingual contexts. However, the presence of code-like structures (Latin) might require additional preprocessing to ensure clarity before being used as training data. / The segment contains a mixture of Latin text and religious context, which may have structured phrases suitable for span segmentation; however, the archaic language could pose challenges in learning patterns without additional contextual data or annotations. 
/ Contains both religious text and Latin phrases, showing clear structure for span segmentation; however, it may require domain-specific knowledge to fully understand the context."}}
 {"raw": "188-189 n62. 8 Compare Mark 16:16: He who believes and is baptised shall be saved, but he who believes not shall be condemned.", "type": "natural", "id": {"id": "6061c94b-15df-4498-a504-7b5425efc6c7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear biblical verse structure with identifiable spans for verses and phrases; clean, coherent text suitable as training data. / Clear biblical verse structure with identifiable spans for verses and phrases; clean, coherent text suitable as training data. / Clear biblical reference with structured phrases suitable for span segmentation; clean and coherent text segment representative of religious literature. / Clear religious text with identifiable phrases and structure suitable for span segmentation training. / Clear religious text with identifiable phrases and structure suitable for training on span segmentation in a tokenizer-free context."}}
 {"raw": "(2) Corpora eorum magna et ampla, omnis benivolencie plena, color eorum lucidus vel citrinus sicut Sol vel aurum, et sua regio est oriens, et habent 4 demones sub se, scilicet unum regem et tres eius ministros, quibus omnes alii demones Solis subiugantur; (3) et isti sunt Barthan rex, Thaadas, Chaudas, Ialchal, qui demones in ventis boree subdi- tis, qui sunt 4, Baxhathau, Gahathus, Caudes, Iarabal, penantur vel requiescunt: CIX De spiritibus Veneris Alii sunt Veneris et sunt isti Hanahel, Raquiel,", "type": "mixed", "id": {"id": "91f0a683-f1e1-4677-99fa-d748447105d6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mix of structured elements (names, numbers) and unstructured narrative that can be segmented into meaningful spans for training purposes. It includes identifiable patterns like names followed by descriptions or attributes which are useful in span segmentation tasks. / Contains structured elements with clear demarcation of entities (e.g., names, titles) and thematic sections that can be segmented into meaningful spans for learning purposes. / The segment contains a mix of symbolic references (e.g., \"Sol\", \"Barthan\") and lacks clear, consistent patterns for meaningful span segmentation; it is not coherent or clean enough to serve as effective training data. / Contains a mixture of structured elements (names, numbers) and unstructured text that could help the model learn span segmentation in both contexts. / Contains a mixture of structured phrases and names that can be segmented into meaningful spans, representing both linguistic patterns (names) and hierarchical structures (demons under Barthan). However, it lacks context for full comprehension which might limit its utility slightly but still offers valuable span composition examples."}}
 {"raw": "9 Widely cited by church writers, and attributed to Daniel, the actual source of this quotation seems to be the pseudo-Augustinian sermon \"Against the Jews, Pagans, and Arians.\" See Fanger 2012 pp. 203-204, p. 215 n29.", "type": "natural", "id": {"id": "cce0ed79-25f8-4bf9-93d7-7f84ea49079e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear citation structure with identifiable spans for authors, sources, and page references; well-suited to learn span segmentation in academic texts. / Clear citation structure with identifiable spans for authors, sources, and page references; well-suited to learn span segmentation in academic texts. / Clear citation structure with identifiable spans for authors, titles, and page references; well-suited to learn span segmentation in academic contexts. / Clear citation structure with identifiable spans for authors, titles, and references; well-suited to learn span segmentation in academic texts. / Clear citation structure with identifiable spans for author, title, and source reference; well-suited to learn span segmentation in academic texts."}}
 {"raw": "This is perhaps not a very satisfactory reading of the text however; as Hedegird 2002 p. 50 points out Manuscript C reads ponantur on two occasions (\" they may be placed\"), and paenantur twice, also not a convincing reading: Hedegard with much hesitation proposes a Latinized form of Greek TEvopal toil\") on the theory that it might be based on a Greek source. Another possibility is that it is a unique verbal (subjunctive) form ofpenator one who carries provisions\" meaning they may carry provisions A slightly", "type": "natural", "id": {"id": "ad5aaee0-778a-47fe-bc15-3b5cc9872226"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains complex sentences with clear linguistic structures that can be segmented into meaningful spans, such as phrases and clauses related to textual analysis (\"reading of the text\", \"Latinized form of Greek\"). Despite some grammatical issues (e.g., missing punctuation), it is coherent enough for training purposes. / Clear and coherent prose with identifiable meaningful spans; useful for learning span composition in a language context. / The segment contains clear linguistic structures with potential for meaningful span segmentation, though it lacks coherence and completeness which might affect its utility as training data. However, the presence of quoted text (\"they may be placed\"), references (Hedegird), Latinized Greek terms, and a discussion about possible interpretations provide valuable patterns that can aid in learning complex sentence compositions typical to natural language processing tasks. / Clear narrative structure with identifiable spans; useful for learning context and sentence composition in NLP tasks. 
/ The segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"This is perhaps not a very satisfactory reading of the text,\" which helps in learning span segmentation for English prose. Despite some informal language (\"however;\"), it remains coherent enough to serve its purpose effectively."}}
 {"raw": "Sworn Book of Honorius\n203\nCVII Concerning the Spirits of Mars\nOthers are of Mars and are these: Samahel, Satihel, Yturahihel, Amabiel, and their nature is to provoke wars, murder, destruction, and mortality of people and all earthly things, and their bodies are of medium stature, dry and thin. (2) The color of their material is red, such as red-hot coals kindled well, and their region is the South.", "type": "natural", "id": {"id": "abd3dca5-9195-4cc2-8e8d-7a571b02c546"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning context and structure in language processing. / Clear prose with identifiable thematic spans; useful for learning sentence-level span segmentation in a historical context. / Clear prose with identifiable thematic spans; useful for learning context and composition in text. / Clear prose with identifiable thematic spans; coherent and representative of historical texts, though not directly related to programming or markup languages. / Clear prose with identifiable thematic spans; useful for learning context and composition in language processing."}}
 {"raw": "64\nSWORN BOOK OF HONORIUS\n(26) Solus igitur Christianus potest in hac visione et in omnibus aliis veraciter operari. (27)", "type": "natural", "id": {"id": "7c9571a4-285e-4c4d-ab27-3fbd9b4626e5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, structured text with identifiable spans; suitable for learning patterns in religious texts. / Clear separation of verses with consistent structure, representing valuable patterns for learning span composition in both text and numerical data. / Clear, coherent text with identifiable sentence structure suitable for training a span-aware model on English prose. / Clear, coherent prose with identifiable sentence structure suitable for training a span-aware model on text segmentation tasks. / The text lacks clear, meaningful spans for training; it's a single line with numbers and fragmented phrases without context or compositional value."}}
 {"raw": "Sworn Book of Honorius\n119\nXLV. FOURTH TERMINUS.59\nGeuathores sanamathotos: guanatores zanothoros: genomos.", "type": "mixed", "id": {"id": "5f6dbb38-785f-4f13-a708-f1e6f59f19ff"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structured elements with identifiable spans; represents valuable patterns for learning span composition in programming context. / Clear mix of Roman numerals, Latin phrases indicating a historical document with identifiable spans like titles and terms. Well-formed for training purposes. / Clear separation of terms and phrases, representing a mix of structured data (names) with potential historical or linguistic significance. / The text contains a mix of Latin phrases and numbers, but lacks clear structure for meaningful span segmentation; not coherent or representative enough as training data. / Clear separation of phrases and terms, representing both structured language (names) and potential coding-like constructs; useful for learning span segmentation in a multilingual context."}}
 {"raw": "[Prologue] When evil spirits had convened, intending to invoke demons into the hearts of men, thinking it possible to use human frailty <wishing? to spoil <their> and overcome the whole world orderby force, (2) planting the seeds ofhypoc risy and hatred, so that arrogance takes root in the bishops and prelates, they caused the Pope and cardinals to gather together; who said to each other as follows: 3) The salvation which the Lord has given to his people, has now been turned to their damnation, through", "type": "natural", "id": {"id": "c7224418-d70a-4679-8b6f-3ca5be03f8d6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable meaningful spans; however, contains archaic language and lacks modern context for comprehensive training. / The text segment contains clear narrative structure and thematic elements that can be segmented into meaningful spans, such as character actions (\"evil spirits convened\"), intentions (\"invoking demons,\" \"spoil their hearts\"), outcomes (\"arrogance takes root in bishops\"). It is clean for training purposes. / Clear prose with identifiable phrases and thematic elements suitable for span segmentation; well-formed text segment representative of narrative structure. / Clear prose with identifiable thematic spans; useful for learning context and sentence structure in NLP tasks. / Clear prose structure with identifiable thematic spans; useful for learning context and composition in text."}}
 {"raw": "Primum opus vel tractatus: IV De composicione sigilli Dei vivi et veri. Primo fac unum circulum, cuius diameter sit trium digitorum propter tres clavos Domini vel 5 propter quinque plagas vel 7 propter 7 sacramenta vel 9 propter 9 ordines angelorum; set communiter 5 digitorum fieri solet: (2) Deinde infra illum circulum fac alium circulum a primo distantem duobus granis ordei propter duas tabulas Moysi vel distantem a primo tri- bus granis propter trinitatem personarum: (3) Deinde infra illos duos circulos", "type": "mixed", "id": {"id": "e11e6bc7-5999-4a3b-845c-48213c8af5e7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of symbolic notation and structured descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both mathematical diagrams (code-like) and textual explanations (natural language). / Contains a mixture of structured descriptions (potentially useful for span segmentation) and symbolic representations that could be beneficial in learning complex patterns across different domains. However, it is somewhat fragmented which may affect coherence slightly. / The segment contains a mixture of structured, symbolic elements (e.g., circles with specific diameters related to theological concepts) and textual descriptions that can be segmented into meaningful spans for learning span composition in both natural language processing tasks involving code-like structures or religious texts. / The segment contains a mixture of structured, symbolic language with clear references to geometric shapes and numerical values that can be segmented into meaningful spans for learning purposes. Despite some archaic phrasing (\"Primum opus vel tractatus\"), the content is coherent enough as training data. 
/ Contains a mixture of symbolic notation and structured descriptions, which can be useful for learning span segmentation in both mathematical diagrams (code-like) and textual explanations (natural language). However, the clarity could improve with better formatting or additional context to enhance training utility."}}
 {"raw": "Their colors are bright or citrus, or like the Sun or gold, and their region is the East: And four daemons are under them, namely one king - and three ofhis min- isters, to whom all other daemons of the Sun are subjugated, (3) and they are these: Barthan the king; Thaadas, Chaudas, Ialchal; and those daemons are subject to the North winds, which are four: Baxhathau, Gahathus, Caudes, Iarabal, they may be compelled to serve, or they rest: CIX Concerning the Spirits of Venus. Others are of Venus; and they are", "type": "natural", "id": {"id": "48f1ee26-f289-48a0-912c-0fc772a80fa1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of descriptive language and structured references (names, entities), which can help the model learn span segmentation in both contexts. However, it lacks clear delimiters for spans that are crucial to training effectiveness; thus it's not ideal but still valuable with minor adjustments. / The text segment contains clear, structured phrases and sentences that can be segmented into meaningful spans representing a coherent narrative about mythical entities; however, it lacks explicit coding constructs or domain-specific patterns for span segmentation training purposes in X-Spanformer models. / Clear narrative structure with identifiable characters and thematic elements; suitable for learning span composition in a literary context. / Clear narrative structure with identifiable characters and thematic elements, suitable for learning span segmentation in a text-based context. / The text segment contains clear, structured elements with identifiable spans such as names and descriptions that can be useful for learning span segmentation in a tokenizer-free context. 
It is coherent but lacks explicit domain-specific patterns relevant to X-Spanformer training needs; however, its general compositional value makes it valuable nonetheless."}}
 {"raw": "magic and nigromancy: (4) For even the magicians themselves have intoxicated themselves with the devilish drink, and even blinded against the holy statutes of the mother Church, and trans- gressed against the Lords teachings which say: (5) 'You shall not test the Lord your God;' but you shall serve him alone 2 thus they deny the sacrifice due to God himself, and testing the names of the Creator; invoking demons, and giving them sacrifices, contrary to their baptismal vows, (6) for there it is said: I", "type": "natural", "id": {"id": "052f38f3-09ac-4718-b8f5-de0fcf84611c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans like phrases and sentences, though some archaic language may pose challenges for modern training data. / Clear prose with identifiable thematic spans; well-formed for training purposes, though repetitive phrases may need further cleaning. / Clear prose with identifiable thematic spans; however, contains archaic language that may not be representative of modern text structures. Suitable for learning historical or literary span segmentation patterns. / The text segment contains clear religious and moral themes with structured phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on natural language processing tasks related to code or mixed content analysis. / The segment contains clear, structured elements of religious text with identifiable phrases and references that can be segmented into meaningful spans for learning purposes. It is clean but may require additional context or domain-specific knowledge to fully understand its compositional value in training a span-aware model like X-Spanformer."}}
 {"raw": "qui- bus omnes alii illius regionis demones obediunt et subduntur; et sunt sub- diti Mercurio et ventis eius, qui zephirus et Affricus dicuntur: (3) Et exci- tantes eos sunt isti 4: Zobha <rex>, Drohas, Palas, Zambas, et habent hos 5 demones excitare, congregare, dispergere, constringere ac in loco proprio ligare. (4) Natura eorum est omnia metalla de mundo, quecumque fuerint, literata vel sculpta cum auro et argento ad libitum dare, omnia preterita, presencia et futura terrena revelare, (5) iudices placare", "type": "mixed", "id": {"id": "abab15b2-21a5-4825-8e8b-b8abbc4a473d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin text and what appears to be religious or mythological references, lacking clear structure for meaningful span segmentation; it is not coherent enough as training data. / The text contains a mixture of Latin phrases and descriptions that could be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structure (natural language) and potential symbolic meaning or code-like constructs. However, the content is not entirely coherent as it appears to come from an ancient source with archaic terms which might pose challenges but still offers a rich dataset due to its mixed nature of structured phrases interspersed within descriptive text segments. / Contains a mixture of Latin phrases and structured text that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both language structure (natural) and potential symbolic meaning or references to code-like constructs. 
/ The segment contains a mixture of Latin phrases and references to mythological figures, which can be segmented into meaningful spans for learning span composition in both language processing (natural) and historical/cultural context recognition (code). However, the text is not coherent as an English sentence or code block. / The text contains a mixture of Latin phrases and references to mythological figures, which may not provide clear or consistent patterns for span segmentation in training data. Additionally, the content is highly specialized with little context outside its specific domain (mythology)."}}
 {"raw": "O use of expensive ritual garments, etc is rejected because neither God nor the blessed angel care anything about material things From this it is seen that the poor labor more quickly and truly (in this art) than the rich:\" (Chapter V.4) Requisites include:\" 101 Candle of virgin wax, It should be noted that wax was more expensive than tallow candles, but provided a much cleaner flame. (CXXIX.I, CXXXIX.z)", "type": "mixed", "id": {"id": "2099469e-b955-46ce-aebd-fd6047577806"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of narrative text and instructional content, with clear references to chapters that can be segmented into meaningful spans (e.g., \"Chapter V.4\", \"CXXIX.I\"). It is clean but lacks coherence in its structure as it jumps between different topics without transitions or context clues for the model to learn from effectively. / The segment mixes religious text with technical candle-making instructions, lacking clear span segmentation patterns for training purposes. / The segment contains a mix of narrative text and technical descriptions, with clear spans for both religious context (natural language) and candle-making instructions/code-like elements that can be segmented meaningfully. It is clean but lacks coherence between the two distinct parts which may affect training utility slightly. / The segment contains both textual content and references to specific items (e.g., candle of virgin wax) that can be segmented into meaningful spans, representing a mix of narrative prose with technical descriptions suitable for training purposes. 
/ The segment contains a mixture of narrative text and technical descriptions, with a clear delineation between a quotation (natural language) from a religious text (\"O use...rich:\") and the instructional content on candle-making that follows, which includes a prose explanation (“It should be noted…cleaner flame.”). This combination provides diverse examples for learning span segmentation in mixed contexts."}}
 {"raw": "42 SWORN BOOK OF HONORIUS exactness as possible:\" Solomon is mentioned repeatedly in the text, and is quoted repeatedly In some cases the quotes are otherwise unknown. 47 Nevertheless, Honorius has only a few elements in common with other Solomonic magic texts Aside from the prayers drawn from Ars Notoria, these include ink made from blood, a Seal of Solomon, names of God and angels (only some well known in Solomonic literature), hazel wand, seals of spirits, incense offerings, magic circles, swords, and", "type": "mixed", "id": {"id": "ccfb06e2-f24b-436f-8add-638fc2686517"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of both structured elements (like quotes and references to Solomonic magic texts) that can be segmented into meaningful spans, as well as unstructured text which may still provide valuable context for learning span composition in the model. / The text contains a mixture of references to Solomonic magic texts and elements like ink made from blood, seals, etc., which can be segmented into meaningful spans for learning span composition in both natural language descriptions (e.g., \"Solomon is mentioned repeatedly\") and code-like constructs (\"Seal of Solomon\"). / Contains a mix of references to Solomonic magic texts and elements like ink, seals, etc., which can be segmented into meaningful spans for learning span composition in both language context (natural) and magical practices (code). / Contains a mix of cultural/historical references and magical/symbolic elements, which can be segmented into meaningful spans for learning diverse span compositions in both text (natural language) and symbolic constructs (code-like). 
/ Contains a mix of both structured elements (like names, items) and unstructured text that can be segmented into meaningful spans for learning purposes. The presence of various Solomonic magic terms suggests valuable patterns in span composition across different domains."}}
 {"raw": "Sworn Book of Honorius 65 (26) Therefore only Christians are able to attain this vision, and in all other things operate truly: (27) And although three cypes of people oper- ate in this art of magic, it should not be thought that; in this name magus should be implied any evil. (28) For in Greek 'magus\" signifies a philosophet; in Hebrew a scribe, and in Latin a wise man: Thus the \"magic art\" is the art of the magi, which is to say the wise men; and ~ycoS; which is \"knowledge' (29) thus \"the knowledge ofthe", "type": "mixed", "id": {"id": "99d81a24-1616-4148-97e1-408ce55c42a4"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of religious and philosophical content with clear references to historical figures (\"magus\", \"wise men\") that can be segmented into meaningful spans for learning span composition in both natural language processing (NLP) contexts as well as code-like structures. / The text segment contains clear, structured sentences with identifiable spans of meaningful phrases related to historical and philosophical concepts; it is clean for training purposes but lacks direct programming-related patterns. / The text contains a mixture of historical and philosophical language with references to ancient texts, which can help in learning span segmentation for both linguistic patterns (natural) and specific terminology or phrases related to the subject matter (code-like). However, it lacks clear code constructs that would be beneficial solely as programming examples. / Clear prose with identifiable phrases and sentences; good for learning span segmentation in a narrative context. 
/ The text contains a mix of historical and philosophical language with references to ancient texts, providing diverse linguistic structures for span segmentation training in both prose (natural) and specialized terminology related to the magic arts, which could be considered code-like because of its context-specific usage."}}
 {"raw": "50 SWORN BOOK OF HONORIUS (12) Nos autem permissione divina illud iudicium prescientes, scien- tes eciam, quod inde possent accidere multa mala, (13) quoniam impossi- bile erat nos congregacionis populi corporis viribus manus evadere, nisi a spiritibus cepissemus auxilium, dubitantes inde maius periculum evenire, (14) quoniam hostilis demonum potencia per precepta nostra sola hora eos integre destruxisset, ob hoc unum consilium fecimus magistrorum generale, (15) in quo ex 89 magistris a Neapoli, Athenis et", "type": "mixed", "id": {"id": "3968762e-5b0a-4213-a433-342a1a79d2d0"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of structured text with numerical references and Latin phrases, which could help the model learn span segmentation in both language patterns and formal structures. However, it lacks context for full comprehension without additional data. / The text segment contains a mixture of Latin phrases and numbers, indicating potential patterns in span segmentation for historical or academic texts that are not purely English language content. It is coherent but lacks context to fully understand the meaning behind each phrase; however, it could still be valuable as an example due to its unique composition involving both code-like elements (numbers) and natural language text segments. / Contains a mixture of Latin phrases and numbers, suggesting potential for learning span segmentation in historical texts or documents with similar structure. However, the lack of context may limit its utility as is. 
/ The segment contains a mix of numbers, Latin phrases (suggesting historical or legal context), and structured text that could be useful for learning span segmentation in both language processing tasks involving natural languages as well as code-like structures with formalized elements like numerals and abbreviations. / The text segment contains a mix of numerical references and Latin phrases, indicating potential for learning span segmentation in both structured (numerical) and unstructured content domains like historical or legal texts. However, the presence of numbers may confuse models trained on purely natural language data without context clues about their significance as spans."}}
 {"raw": "The First Work or Treatise. IV Concerning the composition of the Seal of tbe True and Living God. First make a single circle, whose diameter is three fingers (on account of the three nails of the Lord), or else 5 (for the five plagues), or 7 (for the 7 sacra- ments), Or even 9 (for the 9 orders of angels); but generally it is made of $ fingers.", "type": "mixed", "id": {"id": "8b089de0-8247-4b16-9b3e-bf0ae805320e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear symbolic representation with identifiable patterns; spans can be segmented into phrases and numbers for meaningful learning. / The segment contains a mixture of symbolic notation and prose, with clear references to religious symbolism (e.g., \"three fingers,\" \"$ fingers\"). It presents structured patterns that can be segmented into meaningful spans for learning purposes despite some ambiguity in the symbols used (\"$\" could represent currency or an unknown symbol). / The segment contains a mixture of numerical values and symbolic expressions, which can be segmented into meaningful spans for learning purposes; however, it lacks clarity due to the presence of ambiguous symbols like \"$\". Clean-up may improve its utility as training data. / The segment contains a mix of numerical values and symbolic expressions that can be segmented into meaningful spans, such as \"three fingers,\" \"$ fingers,\" etc., which are relevant for learning span composition in both natural language context (numerical symbolism) and code-like structures ($ symbol). / The text contains a mix of symbolic references (e.g., \"three fingers,\" \"$\") and religious/spiritual concepts (\"Seal of the True God\"), which can be segmented into meaningful spans for learning purposes, though it may require additional context to fully understand its compositional value."}}
 {"raw": "(20) Tunc placatis principibus et prelatis contentis de combustione fabularum et destruccione scolarum- -et credebant hanc artem penitus destruxisse-_nos moti furore et iracundia ista fecimus iuramenta: (21) Primo, quod nulli dabitur iste liber; donec magister fuerit in extre- mis; et quod nisi tribus tantum copietur; et quod nulli dabitur mulieri nec homini nisi maturo actu tantum et probissimo ac fideli; (22) et qui cog - noverit per annum mores et condiciones; et quod de cetero non destruetur sed danti", "type": "mixed", "id": {"id": "ef8a919d-e4f2-4815-97cb-b4b959eff600"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured legal text and fragmented phrases suitable for span segmentation; however, readability issues may affect training quality. / The segment contains a mixture of legal text and Latin phrases, with clear sentence structures that can be segmented into meaningful spans for training purposes. However, the presence of archaic language may pose some challenges in generalization to modern contexts. / The segment contains a mixture of structured text (legal or formal language) and potential coding-like syntax, which can help the model learn span segmentation in both contexts. However, it lacks clarity due to irregular punctuation and spacing issues that may hinder learning effectiveness. / The segment contains a mixture of legal or formal language with structured clauses that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both textual and quasi-structured contexts. However, the archaic syntax may pose some challenges to modern NLP systems but still holds potential training value due to its unique structure. 
/ The segment contains a mixture of Latin phrases and structured text, which can be segmented into meaningful spans for learning span composition both in language processing tasks involving historical texts or legal documents (natural) and in code-like structures with specific formatting rules that could benefit from tokenizer-free models."}}
 {"raw": "Sua forma est obscura et alba sicut cristallus vel ensis furbitus vel sicut glacies vel nubes obscura: (3) Sua regio est occidens, et habent 4 demones sub se, unum regem et tres eius ministros, quibus omnes alii demones Lune obe- diunt et eciam supponuntur; (4) et isti sunt Harthan rex, Bileth, Milalu, Abucaba, qui demones in ventis zephiro subditis, [qui] sunt 5, Hebethel, Amochap, Oylol, Milau, Abuchaba, penantur vel requiescunt:", "type": "mixed", "id": {"id": "5ea18750-3c57-406c-8b5a-a423c876aa76"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of Latin phrases and structured formatting that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (natural language) and formatted content indicative of code-like elements. / The text contains a mixture of Latin phrases and structured descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (natural language) and formal structure description (code-like syntax). However, it lacks clarity due to its archaic nature. / The segment contains a mix of poetic or philosophical text with references to demonic entities and structured lists, which can help the model learn complex span segmentation patterns in both prose (natural language) and code-like structures (list enumeration). However, it lacks clarity for direct training due to its abstract nature. / Clear division into verses with identifiable spans; contains both poetic structure and references to entities (demons, kings), useful for learning span segmentation in a multilingual context. 
/ The text contains a mixture of Latin phrases and structured descriptions that can be segmented into meaningful spans, such as \"Sua forma est obscura et alba sicut cristallus\" or \"(3) Sua regio est occidens\". It is clean but may require domain-specific knowledge for full comprehension."}}
 {"raw": "In another review, Juris Lidaka checked thirteen samples from Hedegards edi- tion, and found very few major errors. The only concrete example he gives is that 4.1 \"diameter should be \"diametrus' on which I disagree: The manu- script supports GHs reading, and the usage is not unknown. \"21 Boudet also provided a critical edition of much of the text in his 2002 article Magie theurgique' etc. His readings vary slightly with Hedegards, and his critical apparatus is conveniently placed with the text rather than", "type": "natural", "id": {"id": "9871f60f-29a4-41a7-a9de-477ab509106d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains clear sentences and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in the context of literary text analysis or historical document review. It is clean but lacks explicit code constructs to warrant a higher score towards mixed content classification. / The segment contains clear sentences and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a natural language context. It is clean but lacks explicit coding constructs or domain-specific terminology which makes it more suitable as training data focused on general English text rather than code-related content. / Clear sentence structure with identifiable spans; useful for learning span segmentation in English prose. / Clear prose with identifiable spans (sentences, phrases) and coherent structure suitable for learning span segmentation in a tokenizer-free context. / Clear sentence structure with identifiable spans of phrases and clauses; well-suited for learning span segmentation in a tokenizer-free context."}}
 {"raw": "Ambra, musco, aloe, lapdano albo et rubeo, mastice, olibano, margaritis et thure (IVsz). The last word, tus, can be either generic for incense, or frankincense specifically, but since the more specific word for frankincense, olibanum (from the Hebrew 77317) is already listed, it is not clear what ninth ingredient is intended.", "type": "natural", "id": {"id": "a727f645-c843-458e-8ddf-ea8f304ee0a6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases and potential for learning span segmentation in contextually rich text. / Clear prose with identifiable phrases and terms, suitable for learning span segmentation in a tokenizer-free context. / Clear prose structure with identifiable phrases and potential for learning span segmentation in contextually rich sentences. / Clear sentence structure with identifiable phrases; ambiguity present but does not hinder learning of span segmentation patterns. / Clear, coherent prose with identifiable phrases and potential for learning span segmentation in a language context."}}
 {"raw": "But we with divine permission, knowingabout thatjudgmentbefore- hand, knowing also that it would be the cause ofgreat evils, (13) because it was impossible for us to evade the hands ofa large body of_ people, except through the help of spirits we had captured, and fearing that a greater danger would result; (14) because the hostile force of the demons at our command would have destroyed them entirely in only an hour; on account ofthis we convened a general council of masters; (15) from which out of 89", "type": "natural", "id": {"id": "5504b0a1-7d49-48a9-9644-e971a1b51a84"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable spans; useful for learning span segmentation in prose. / The segment contains a mixture of narrative and structured elements (e.g., numbered references), which can help the model learn to identify spans in both contexts, though it lacks explicit coding constructs or natural language coherence for isolated training examples. / The text segment contains a clear narrative structure with identifiable spans such as sentences and phrases, which can be useful for learning span segmentation in the context of religious or historical texts. It is coherent but lacks explicit code elements to classify it strictly under \"code\". / Clear sentence structure with identifiable spans; useful for learning span segmentation in narrative text. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose."}}
 {"raw": "The only attempt at publishing any significant pOr- tion of the English translation was Daniel Driscolls The Sworn Book of Hon- ourius (sic) the Magician, Heptangle Books, 1977.", "type": "natural", "id": {"id": "290c4698-f9f5-4e17-b0d8-85291e227c9f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains typographical errors and lacks clear, consistent structure for meaningful span segmentation; not clean enough for training purposes. / Clear prose with identifiable phrases and potential for meaningful span segmentation, though punctuation errors may affect clarity slightly. / Clear sentence structure with identifiable spans; well-suited for learning span segmentation in English prose. / Clear sentence structure with identifiable spans of phrases and proper nouns; clean for training purposes. / Clear sentence structure with identifiable spans like \"The only attempt at publishing any significant pOr- portion\" and proper nouns such as 'Daniel Driscolls'. Well-suited for learning span segmentation in English prose."}}
 {"raw": "masters ' gathered from Naples, Athens, and Toledo, we selected one named Honorius, son of Euclid, master of Thebes, where that art at the time was established, in order that he would labor on our behalf to this end. (16)", "type": "natural", "id": {"id": "19fe58f2-9062-4383-861c-124e0f677124"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable spans like names, places and titles; clean for training purposes. / Clear prose with identifiable spans; useful for learning sentence structure and composition in English text. / Clear prose structure with identifiable spans; useful for learning sentence composition and context understanding in a tokenizer-free model. / Clear prose structure with identifiable spans; useful for learning sentence segmentation and context understanding in NLP tasks. / Clear prose with identifiable spans; good for learning sentence structure and composition in English text."}}
 {"raw": "Unfortunately Driscoll did not make use of the best manuscripts, omitted large portions, misread and misplaced angelic names and words from the prayers, and otherwise took serious liberties with the text.", "type": "natural", "id": {"id": "ce1549b2-eac1-4b4d-85d4-f0d6a9fe712e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent sentence with identifiable spans like \"Driscoll\", phrases such as \"best manuscripts\" and complex structures that can be segmented for training purposes. / Clear, coherent sentences with identifiable phrases and entities for span segmentation; well-suited to training a tokenizer-free model on English prose. / Clear, coherent prose with identifiable meaningful spans (e.g., phrases like \"best manuscripts,\" \"angelic names and words\"). Suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent sentences with identifiable phrases and terms relevant for training a span-aware model in processing English text. / Clear, coherent sentence with identifiable phrases and concepts suitable for training a span-aware model in processing English text."}}
 {"raw": "And he; with the consulting angel Hocrohel, named seven books of the magic art, plucking for us the flower and dismissing all the rest as bark: (17) From those books he extracted ninecy-three chapters with all the worth of this art, which are succinctly captured, (18) from which he composed the short book which we call \" Sacred\" or Sworn;\" so-called because the hundred sacred names of God are included in this book; (19) and therefore Sacred\" as it acts through the sacred, or because sacred things emerge", "type": "natural", "id": {"id": "b101128c-1754-432a-89d4-9c113020292b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable spans; useful for learning sentence segmentation and thematic patterns in text. / Clear prose with identifiable phrases and sentences; useful for learning span segmentation in narrative text. / Clear prose with identifiable segments like titles, quotes; useful for learning span segmentation in narrative text. / Clear prose with identifiable thematic spans; useful for learning sentence structure and semantic relationships in text. / The segment contains clear narrative structure with identifiable spans like book titles, chapter numbers and thematic phrases (\"Sacred\" or \"Sworn\"), which can help the model learn span segmentation in a coherent story context. However, it lacks explicit code-like patterns that would be beneficial for mixed content types."}}
 {"raw": "66 SWORN BOOK OF HONORIUS Deinde infra angulum superiorem pentagoni scribe istas duas lit- eras: 1, \"x\" et infra alium angulum dextrum istas duas: \"a; \"1\" et in alio post istum istas duas: \"1 \"a\" et in alio post istum: \"1 \"c\" et in alio post istum ((_ u, m\" (8) Deinde circa pentagonum fac unum eptagonum, cuius latus super- ius + secundum sui +1 medium contingat angulum superiorem pentagoni, ubi \"1 \"x\" scribebatur; (9) et in eodem latere eptagoni scribe hoc nomen sancti angeli, quod est Casziel; et in alio", "type": "mixed", "id": {"id": "2c442b08-0442-4c17-96c9-901124fe761c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of structured patterns (e.g., geometric descriptions, sequences) that can be segmented into meaningful spans for training purposes; however, it lacks context and coherence which may affect its utility as is. / Contains a mix of structured elements (e.g., geometric descriptions, Latin phrases) that can be segmented into meaningful spans for training purposes. Despite some archaic language and potential OCR errors (\"_ u,\" \"m\" instead of expected characters), it retains clear structural patterns useful in learning span composition between natural text segments interspersed with code-like notations. / The segment contains a mixture of structured patterns (e.g., geometric descriptions, sequences) that can be segmented into meaningful spans for training purposes in both coding and descriptive contexts. Despite some archaic language (\"Deinde,\" \"scribebatur\"), it maintains structural clarity suitable as diverse input data. / Contains a mix of structured elements (e.g., geometric descriptions, Latin phrases) that can be segmented into meaningful spans for learning span composition in both textual and symbolic contexts. 
/ Contains a mix of structured text and numerical references, with clear patterns for span segmentation like \"SWORN BOOK OF HONORIUS\" or alphanumeric sequences that can be useful in training the model to understand contextually segmented spans. However, some parts may require further cleaning due to potential OCR errors (e.g., \"(8)\")."}}
 {"raw": "(3) Their region is the West: And there are four daemons under them: a king and his three ministers, and all the other daemons of the moon are obedient to those, and placed under them, (4) and they are these: Harthan, the king, Bileth, Milalu, Abu- caba, which rule the daemons of the West winds, which are five: Hebethel; Amochap, Oylol, Milau, Abuchaba; they may be compelled to serve, or they rest.", "type": "natural", "id": {"id": "faeb2192-971b-4c81-8ea7-5d5cfeca3275"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable characters and entities; useful for learning span segmentation in storytelling context. / Clear narrative structure with identifiable spans (characters, locations). Well-suited for learning span segmentation in a tokenizer-free context due to its coherent and representative storytelling format. / Clear narrative structure with identifiable characters and locations, suitable for learning span segmentation in a storytelling context. / Clear narrative structure with identifiable characters and locations, suitable for learning span segmentation in storytelling context. / The text lacks clear, identifiable spans for meaningful segmentation; it is a narrative passage without discernible patterns suitable as training data."}}
 {"raw": "latere a dextris istud nomen alterius sancti angeli, (10) quod est \"Satquiel; deinde in alio \"Samael\" et in alio \"Raphael; postea \"Anael; postea (( ` 'Michael; postea \"Gabriel; et sic septem latera eptagoni erunt adimpleta.", "type": "mixed", "id": {"id": "acf8a407-3eff-48b2-9f7d-9d59232fee13"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and names, which can be segmented into meaningful spans representing different entities or concepts; however, it lacks context for effective training on span composition due to its specialized language use. / Contains a mixture of names with clear, identifiable spans; however lacks context and coherence for effective training. / Contains a mix of Latin phrases and names, which are structurally clear but may not be directly useful for span segmentation without additional context or annotations. / Contains a mixture of Latin phrases and names, which may help in learning span segmentation for both language patterns (natural) and specific terms or entities that could be relevant to certain domains like religious texts or historical documents. The sequence is clear but lacks context; however, its compositional value lies within the structured naming pattern it presents. / Contains a mixture of names with clear, identifiable spans that represent meaningful patterns for learning span segmentation in both Latin and English text elements. The structure is coherent but lacks context or domain-specific content to fully evaluate its training utility across different domains."}}
 {"raw": "We, being moved by their madness and rage, have made these oaths: (21) First; that nobody should be given this book, unless the master is at the point of death; [2] And that he should provide himself no more than three; [3] And that it will be given to no woman, nor to a man unless he is mature, and most upright and faithful, (22) both of which should be assessed by observing his mannerisms throughout an entire year;", "type": "natural", "id": {"id": "249eab49-8462-4454-8a17-18bf82021552"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans like \"oaths,\" \"[2],\" and \"(22).\" Well-suited for learning span segmentation in prose text. / Clear and coherent prose with identifiable meaningful spans (e.g., \"these oaths,\" \"[2],\" [3]). Well-formed for training purposes, representing valuable patterns in span segmentation within a narrative context. / Clear narrative structure with identifiable spans; useful for learning sentence segmentation and thematic coherence in text. / Clear sentence structure with identifiable spans (e.g., clauses, phrases). Well-suited for learning span segmentation in narrative text. / Clear narrative structure with identifiable spans; useful for learning span segmentation in prose."}}
 {"raw": "(11) Deinde circa istum eptagonum predictum fac alium eptagonum non quomodo primus factum set taliter quod unum latus ipsius intercedet alterum latus eiusdem (12) Deinde fac alium eptagonum talem qualis primus fuit, cuius anguli 7 contingant angulos 7 eptagoni secundi, qui binus esse videtur: Hic tamen eptagonus infra predictum secundum concludetur + (13) Unum latus secundi eptagoni supereundo et aliud subeundo set latus primo angulo succedens subeundo ibit, et que secuntur serie supereuntis et subeuntis", "type": "mixed", "id": {"id": "656b25ec-201e-43d7-b6db-13ab3b9604df"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of geometric descriptions and Latin phrases, showing clear structured patterns for span segmentation in both mathematical language (code-like) and descriptive text (natural). / The segment contains a mix of structured descriptions likely from geometric diagrams or mathematical proofs, with clear references to entities like \"eptagonum\" and relationships between them (\"latus\", angles). It has identifiable patterns for span segmentation related to geometry concepts which are valuable training examples. / Contains a mixture of structured descriptions and references to geometric figures, which could be useful for learning span segmentation in both mathematical language (code-like) and descriptive text (natural). However, the lack of clear delimiters makes it less ideal than fully segmented examples. / Contains a mix of geometric descriptions and Latin phrases, with clear spanable segments like \"eptagonum predictum\" or \"anguli 7 contingant angulos 7 eptagoni secundi.\" However, the presence of archaic language may pose challenges for training. 
/ Contains a mix of structured descriptions and mathematical notation, with clear spanable patterns like \"eptagonum,\" which can be useful for learning spans in both language context and geometric terms. However, the text is somewhat archaic or unusual (\"supereundo\" instead of modern English), potentially affecting clarity but still valuable due to its mixed nature."}}
 {"raw": "(3) Operans vero sit a pollucione purgatus et habeat calcem et arenam litoris mixtam, cum quibus lapides vel tegule coniun- gantur: (4) Tunc fiet ex eis locus, in quo protrahetur circulus, et iste locus tali- ter formabitur: Primo fiet circulus equalis terre habens in longitudine et latitudine 9 pedes, (5) infra quem fiat circulus gibbosus ad modum semicir- culi, alcior quam sit alter; in longitudine et latitudine continens 7 pedes et in altitudine tres pedes cum dimidio. (6) Tamen paupertatis oppressio", "type": "mixed", "id": {"id": "53cf895c-eff1-4c95-b3c2-2e69806dd2ab"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains structured elements with clear segmentation opportunities, including numbered sections and descriptions of geometric shapes that can be useful for learning span composition in a mixed context involving both descriptive text (natural language) and technical specifications resembling code constructs. / The text contains a mixture of structured descriptions (potentially from legal or technical documents) with clear numerical and spatial references that can be segmented into meaningful spans for learning purposes, despite some archaic language usage. / The segment contains a mixture of structured descriptions (potentially from technical documentation or legal text) with clear numerical and spatial references that can be segmented into meaningful spans for learning purposes, despite some archaic language (\"paupertatis oppressio\"). / The text contains a mix of structured descriptions (possibly from legal or formal documents) with clear numerical and spatial references, which can be segmented into meaningful spans for learning span composition in both natural language processing tasks related to code-like structures as well as general document analysis. 
/ Contains a mixture of structured descriptions (potentially legal or formal document) with numerical and spatial references that can be segmented into meaningful spans for learning purposes. The content is clean, coherent but may require domain-specific knowledge to fully understand span composition in context."}}
 {"raw": "Sworn Book of Honorius\n239\n(9) Their bodies are of medium stature; cold, humid, venerable; attrac- tive, and their manner of speaking is hoarse. They have human form, bear- ing arms and hooded: Their color is like a bright cloud. Their movements are like a silvery cloud. (1o)", "type": "natural", "id": {"id": "a3360309-744c-4c7a-998a-810be8d9880f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on text segmentation. / Clear prose with identifiable meaningful spans; well-formed for training purposes, representing valuable patterns in span composition and context understanding. / Clear prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / The text is coherent but lacks clear, identifiable spans for training a span-aware model due to its poetic and descriptive nature without distinct phrases or sentences that can be easily segmented into meaningful units. / Clear prose with identifiable phrases and sentences suitable for training a span-aware model on English text."}}
 {"raw": "52\nSWORN BOOK OF HONORIUS\net honeste nec locum alicui per aliquas circumstancias revelabit; (25) et si magister ex discipulis aliquam necessitatem habeat aut velit eos aliqualiter probare, quod pro preceptis suis complendis mortem pati, si necesse fuerit; non timebunt; (26) et quod habens non inquiret de dictis vel factis magistri sui, nec ipsum magistrum suum talia scire alicui revelabit, nec dabit ad hoc circumstancias declarantes; (27)", "type": "natural", "id": {"id": "1c9efabb-6dd2-46e4-ac26-63f8f2c89219"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, structured text with identifiable spans; well-suited for learning span segmentation in legal or formal documents. / The text segment lacks clear, meaningful spans and is not coherent or representative of structured patterns suitable for training a span-aware model in the absence of tokenization cues. It appears to be an excerpt from legal documents with complex sentence structures that may pose challenges without proper context understanding mechanisms like tokenizers. / The text segment contains a mixture of legal or formal language and structured elements like numbers, which can be segmented into meaningful spans for training purposes in both domains (natural language with numbered sections). It is clean but lacks context to fully understand the content's domain-specific patterns. / Clear prose structure with identifiable sentences; useful for learning sentence boundaries and span segmentation in text. / The segment contains a mixture of structured text (possibly legal or historical) with numbered lines, which can help the model learn span segmentation for both prose and formatted content. However, it lacks context that could improve its utility as training data."}}
 {"raw": "Sacerdos, dum conficit corpus Christi, dicat:\n(8) ORACIO\n(( Tu, Domine Iesu Christe, Deus et homo, qui voluisti per te ipsum fidelem populum tuum medicabiliter visitare, te suppliciter exoro, precor et pos- tulo temet ipsum, (9) quem nunc hic in manibus meis teneo pro famulo tuo N, ut ex dono ac permissione gracie tue omnes illos angelos, quos invo- caverit, (10) ut per eos benigniter consulatur; sibi mittere ac constringere digneris, ut te mediante possit cum ipsis misericorditer consociari\"", "type": "mixed", "id": {"id": "2fe2892d-6d90-4e3b-81f5-af7d2f97ed83"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment is a mixture of Latin and religious phrases, lacking clear syntactic structures for meaningful span segmentation in the context of X-Spanformer training data. It also lacks compositional value as it does not contain learnable patterns typical to natural language or code that would benefit an encoder-free model focused on spans. / The text segment contains a mixture of Latin phrases and religious expressions, which have clear structure but are not easily segmented into meaningful spans without domain knowledge; however, it represents valuable patterns for learning span composition in the context of historical or liturgical texts. / The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans such as individual words or short phrases that are structurally clear for training purposes. It is clean but may require domain-specific knowledge to fully understand the context due to its historical language usage. 
/ The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans such as \"Sacerdos,\" \"conficit corpus Christi,\" etc., representing valuable patterns for learning span composition in both code-like structures (Latin) and natural language. / The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans for training purposes; however, it may require domain-specific knowledge to fully understand the context."}}
 {"raw": "Christ (Eucharist), he should say prayers 19 and 20 (LXXVII-LXXIX); as we have said, when the priest is holding up the body ofChrist (i.e: the wafer), to reveal it to the congregation, he should pray on behalf ofthe operator; saying thus: (4) PRAYER.", "type": "mixed", "id": {"id": "0dc17f79-a946-4bdd-b42f-73f2e62b9d69"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of religious text and instructions for prayer, with clear structure in phrases like \"Christ (Eucharist)\" which can be segmented into meaningful spans; however, it lacks code constructs or programming elements that would make the content purely 'code'. / Clear prayer structure and liturgical references; useful for learning span segmentation in religious texts. / The segment contains a mix of religious text and instructions for prayers, with clear indications (like \"(4) PRAYER\") that can be segmented into meaningful spans representing both the narrative context (\"Christ...prayer on behalf of the operator\") as well as specific actions or commands. / Clear prayer structure with identifiable spans; well-suited for learning span segmentation in religious texts. / The segment contains a mix of religious text and structured prayer instructions, which can help the model learn span segmentation for both narrative prose (natural language) and formalized expressions or commands typical in code-like structures. However, it lacks clear delimiters between spans that are common to programming languages; thus it's not ideal but still valuable due to its mixed nature."}}
 {"raw": "taliter ordinavi, quod premisi capitula, ut pateant clarius que secuntur: (2) Capitula_ primi operis Primum capitulum de composicione magni nominis Dei; quod apud Hebreos dicitur Semenphoras et est 72 literarum, quod est principium in hac arte.", "type": "mixed", "id": {"id": "5fbe45da-2f86-4ba3-98ba-8a9a46ef7f23"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mix of Latin phrases and references to biblical texts, which can be segmented into meaningful spans for learning purposes; however, it lacks clear compositional patterns that are easily generalizable across different domains. / Contains a mixture of Latin phrases and references to biblical texts, with clear structure for span segmentation; however, it lacks coherence in English which may affect training utility. / The text contains a mixture of Latin phrases and references to biblical scripture, which could be valuable for learning span segmentation in both linguistic patterns (natural language) and specific structured elements like chapter titles or verse numbers that resemble code-like constructs. / The text contains a mix of Latin phrases and references to biblical chapters, which may not have clear span segmentation patterns for training purposes; however, it is structurally coherent as an excerpt from religious or scholarly work. / Contains a mix of Latin phrases and references to biblical texts, which may have clear span structures for learning purposes; however, the domain is quite specific (biblical studies) that might limit generalizability."}}
 {"raw": "of God (also called the Seal of Solomon) is needed in the other rituals, its preparation is described in great detail (LII,XCVIII-CI)", "type": "natural", "id": {"id": "1a1de9dc-72e4-48dc-9925-42ce1f892dbd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable phrases suitable for span segmentation and training. / Clear prose with identifiable phrases and sentences suitable for span segmentation training. / lacks clear span segmentation, not well-suited for training purposes. / Clear prose with identifiable phrases and terms that can be segmented into meaningful spans for learning purposes. Well-formed text suitable as training data. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose."}}
 {"raw": "tran- scribed, or critical apparatus provided, by either of those publications, so I have taken the opportunity of providingit in this edition: In order to facilitate reference to Hedegards critical apparatus, I have included his paragraph numbers For other corrections to his text, see Appen- dix I Hedegard used italics, and the notations a - and a--a to indicate where LIH and Ars Notoria coincided; since this has been obsoleted by the critical edition of Julien Veronese 2007, I have not tried to maintain", "type": "natural", "id": {"id": "47eca958-fe4c-41f8-a9f0-686c85b3ccb6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans (paragraphs, citations), clean and coherent text suitable for training a span-aware model in the context of scholarly editing or textual analysis. / The text lacks clear, meaningful spans for training; it's a fragmented excerpt with unclear references and annotations that don't provide consistent patterns or structures suitable as isolated examples. / Clear structure with identifiable spans like publication names, paragraph numbers; well-suited for learning span segmentation in text. / The text lacks clear, meaningful spans for training; it's a fragmented excerpt with unclear references and annotations that don't form coherent patterns suitable for learning span segmentation in X-Spanformer context. / Clear structure with identifiable spans (publication names, paragraph numbers), clean and coherent text suitable for learning patterns in span segmentation within a scholarly context."}}
 {"raw": "colomaithos. LXXVI: LATIN PRAYER . 111 O life of men and all creatures visible and invisible, the eternal clarity of the heavenly spirits, the salvation of all men, and the unfailing origin of piety; (2) who knows all things before they happen; who judges all things visible [and invisible], and you see with indescribable disposition, glorify your holy and ineffable name today: (3) Strengthen my heart; and my understanding, and my soul, and increase my innocence, and strengthen my prayer; and release my soul", "type": "natural", "id": {"id": "984d7e09-57ca-47a2-a067-3d3e29836179"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; useful for learning context and composition in religious texts. / Clear segmentation into phrases and sentences; represents meaningful spans in religious text, useful for learning contextually rich span composition. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model focused on religious texts. / Clear, coherent prose with identifiable phrases suitable for span segmentation; represents valuable patterns in religious text composition. / Clear, coherent prose with identifiable thematic spans suitable for training a span-aware model focused on religious texts."}}
 {"raw": "But that most holy prayer: Lameht ragua_ with its part Semeht segaht [*segheahlt] and with its prologue should be said on the Luna Prima66 four times, namely once very early in the morning, once around Terce, three times around noon, three times around None (5) On the third day of the moon it should be recited three times: Once in the morning, once around noon, and once around None: On the sixth day - ofthe moon it should be recited twice in the morning, twice at noon, and twice at None: On the ninth day of", "type": "natural", "id": {"id": "1eb9347b-4bd7-42f5-8f1e-04e409dcfda1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans (prayers, recitation times) and compositional value for learning patterns in religious texts. / Clear structure with identifiable spans (prayers, times of day), coherent and representative for learning temporal patterns in religious texts. / Clear structure with identifiable spans like prayers, times of day (morning/Terce/None), and days; useful for learning temporal patterns in text. / Clear structure with identifiable spans such as prayers and times; useful for learning temporal patterns in religious texts. / The text segment contains clear, structured elements like prayers and their recitation times which can be segmented into meaningful spans for learning purposes; it is coherent but lacks contextual clarity due to archaic language usage."}}
 {"raw": "Sworn Book of Honorius\n165\nXC. PRAYER 28.133\nElscha, lift up this day the senses of my body [and soul], 0", "type": "mixed", "id": {"id": "706ef325-f109-4e0c-8660-170466bde987"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text lacks clear, meaningful spans for training; it's a title with fragmented phrases unsuitable as is. / Clear structure with identifiable spans; represents meaningful patterns for learning span composition in religious texts. / Clear structured elements with identifiable spans; combines both religious text (natural language) and a reference to an ancient document, which could be valuable for learning span segmentation in diverse contexts. / The segment contains a mixture of structured elements (title, numerical reference) and unstructured text that can be segmented into meaningful spans for training purposes. It is clean but lacks context or coherence typically needed in natural language processing tasks; however, its structure provides valuable patterns related to span segmentation across different content types. / Contains a mix of structured elements (title, chapter number) and unstructured text; spans can be identified for training purposes."}}
 {"raw": "Lord my God, and lift up my heart to You, so that they are pleasing to you, 0 Lord, let my lamenting be manifest and known to your sight today; (2) and may my words and deeds be pleasing in your sight, and may your omnipotence and mercy shine brightly within me this day; and may my mind be effectually made clear in your works, (3) and may the glorification become more rooted in my soul, and your grace sprout forth in my heart and mouth, in order that whatever wrong I may have done O committed will thus be", "type": "natural", "id": {"id": "20b69004-85e7-4830-8a8f-db5ee794a6cd"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in understanding contextually rich segments of text. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "(6) And those spirits that are goV- erned by air act according to the nature of air itself; and therefore we can understand their nature. The air therefore, insofar as an element; is governed by the influences of the planets.", "type": "natural", "id": {"id": "3a00751d-67ce-43b7-b750-65c4d6b58bd2"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable phrases and concepts suitable for training a span-aware model on English text. / Clear prose with identifiable thematic spans; useful for learning sentence-level span segmentation in a purely linguistic context. / Clear prose structure with identifiable spans; useful for learning sentence composition and semantic relationships in text. / Clear sentence structure with identifiable phrases and concepts that can be segmented into meaningful spans for learning purposes. / Clear sentence structure with identifiable phrases; useful for learning span composition in English prose."}}
 {"raw": "Thus it has pleased the Creator; and the Lord himself orders it to be conse- crated in such a manner: (56) First; the worker must be clean, not impure, and should do so with devotion, not cunningly He must not eat or drink until the work is completed, and the blood, with which the writing will be done, must first be blessed, as will be declared afterward.", "type": "natural", "id": {"id": "492fc093-54e7-4865-be8a-bdab704c9083"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans (e.g., phrases, sentences). Well-suited for learning span segmentation in a tokenizer-free context due to its coherent and clean composition reflecting meaningful patterns of language use. / Clear prose with identifiable meaningful spans; clean and coherent for training purposes. / Clear narrative structure with identifiable spans like \"the Creator,\" \"Lord himself orders it to be consecrated in such a manner:\" and others, representing coherent prose suitable for training on span segmentation of continuous text. / Clear narrative structure with identifiable spans; coherent for training purposes. / Clear prose with identifiable thematic spans; useful for learning sentence structure and coherence in text."}}
 {"raw": "(9) And a spirit of that part may be compelled to serve, hence each one should consider which wind is suitable for the operation, because the daemons of that part are awakened then: (1o) But the wind for the invocation is not always easily discovered.", "type": "natural", "id": {"id": "1e3cc8ac-d364-43a7-83be-c1bd793a6618"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, such as \"spirit of that part,\" which could help the model learn span composition in a narrative context. However, it lacks code elements or mixed content types for this particular evaluation. / The text segment contains clear linguistic structures and phrases that can be segmented into meaningful spans, such as \"spirit,\" \"wind for the invocation,\" which are indicative of compositional patterns in language; it is clean but lacks domain-specific terminology or context clues typical to code. / The text segment contains clear sentence structures and phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in a tokenizer-free context focused on English prose. / Clear sentence structure with meaningful phrases; however, archaic language may pose challenges for modern training data. / Clear sentence structure with identifiable spans; useful for learning context and phrase segmentation in NLP tasks."}}
 {"raw": "Ger- man translation of Ganells Summa Sacre Magice (SSM). Circe 158o s. Leipzig Cod. Mag: I6: (Pp: I-176), titled Die alleredelste und allerbochste Kunst und Wissenschaft, das ist: Magia universalis divina angelica ac diabolica", "type": "mixed", "id": {"id": "347200a0-8a9e-4a04-ba10-56de37411306"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mix of Latin phrases and German translations, which presents meaningful spans for learning span segmentation in both languages; however, it lacks clarity due to the presence of abbreviations like \"SSM\" without context or explanation. / Contains a mix of historical reference and bibliographic citation, with clear delimiters for potential span segmentation (e.g., author names, titles). / Contains a mix of historical text and bibliographic references, with clear spanable phrases like \"Ger-man translation,\" which can help the model learn complex spans in both language contexts. / Contains a mix of historical reference and bibliographic citation, which can help in learning span segmentation for both language elements (author names) and structured data (publication details). / The segment contains a mix of historical text and bibliographic notation, which can be segmented into meaningful spans such as titles (\"Ger-man translation\"), author names or sources (\"Ganells Summa Sacre Magice (SSM)\"), publication details (\"Circe 158o s. Leipzig Cod.\"), and descriptions (\"Die alleredelste und allerbochste Kunst und Wissenschaft\"). These elements are well-formed for training purposes, though the text's archaic language may pose some challenges in span segmentation learning due to its unique vocabulary or syntax structures not commonly found today."}}
 {"raw": "dis - pergere, ligare ac ipsos innocuos reddere, (61) homines placare et ab eis suas peticiones graciosius habere, inimicos pacificare, pacificatos disiun- gere, sanos in sanitate custodire vel infirmare, infirmos curare, (62) homi- nes bonos a malis custodire et distinguere et cognoscere, omne corporale periculum evadere, iudices in placito placatos reddere, victoriam in omni- bus optinere, (63) peccata carnalia mortificare et spiritualia fugare, vincere et evitare, divicias in bonis augmentare, et dum in", "type": "mixed", "id": {"id": "bc4f3191-d5f4-4869-870e-6d648295a364"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and numbers, suggesting it could be extracted from legal or religious texts (natural language), but also includes structured elements like references to verses (\"(61)\", \"(62)\", \"(63)\") that resemble code-like constructs for easy identification. / The segment contains a mixture of Latin phrases and structured lists, which can help the model learn span segmentation in both linguistic patterns (natural language) and formal constructs like enumerations or bullet points typical for code-like structures. However, it lacks coherence as an isolated example due to its fragmented nature; thus it's not ideal training data on its own but could be useful when combined with other coherent examples. / The segment contains a mixture of Latin phrases and numbers, suggesting potential for learning span segmentation in historical texts or legal documents that mix prose with structured elements like citations (e.g., \"(61) homines placare\"). However, the lack of context may limit immediate utility. / The segment contains a mixture of Latin phrases and numbers, suggesting it could be part of legal or religious text (likely historical). 
While not directly useful for modern NLP tasks without context on the language's structure, its structured nature with clear boundaries between segments can help in learning span segmentation. / The segment contains a mixture of Latin phrases and numbers, which could be useful for learning span segmentation in both linguistic patterns (natural language) and structured elements like references or citations often found alongside code comments or documentation. However, the lack of context makes it less ideal as standalone training data but valuable when combined with other segments to improve understanding across mixed content types."}}
 {"raw": "228 SWORN BOOK OF HONORIUS illi orientales et occidentales et dicuntur boni, quia operaciones eorum iuvant in bono, et vix nocent alicui, nisi ad hoc cogantur divina virtute. (3) Mali sunt et cum superbia feroces australes et septemtrionales et dicuntur mali; quia opera eorum sunt mala in omnibus, et nocent libenter omnibus et vix aliquid, quod sequatur; ad bonum faciunt, nisi ad hoc supe- riori virtute cogantur: (4) Set inter istos sunt alii collaterales istis, qui neque boni neque mali dicuntur; quoniam", "type": "mixed", "id": {"id": "320cdce0-83c1-4e9c-b2f1-64409b21e52e"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains clear, structured Latin prose with identifiable phrases and sentences that can be segmented into meaningful spans for a span-aware model to learn from; it is clean but lacks modern English context which may limit its utility in certain NLP tasks. / The text segment contains a mixture of Latin phrases and structured enumeration, which can help the model learn span segmentation in both language structure (Latin) and enumerative patterns typical for legal or religious texts. / The text contains a mixture of Latin phrases and structured enumeration, which can help the model learn span segmentation in both linguistic patterns (natural language) and formal structures like lists or enumerations common to code-like constructs. Despite being archaic with potential OCR errors (\"SWORN BOOK OF HONORIUS\" may be misinterpreted), it maintains clear structural elements that are beneficial for training purposes, especially given its historical context which could enrich the model's understanding of diverse text formats and languages. 
/ The text segment contains a mixture of Latin phrases and structured formatting, which can help the model learn span segmentation in both linguistic patterns (natural language) and formal structures often found within historical texts or documents that may resemble code-like constructs due to their rigid structure. / The text contains a mix of Latin phrases and structured formatting, which can help the model learn span segmentation in both linguistic patterns (natural language) and formal structures like numbered lists or sections often found alongside code documentation."}}
 {"raw": "166 SWORN BOOK OF HONORIUS quod ex nativitate aut ex peccati labe contraxi divina tua [illa] ineffabi- lis pietas aboleat, (6) qua in principio celum et terram creare voluisti, illa spiritualis magna misericordia tua restauret; (7) qua hominem perditum ad gracie pristinum statum amissum revocare dignatus es, cui iudicium Sathane facultatem visionis abstulit et intellectus. (8) Tu, Domine, cuius sensus atque sapiencia et claritas est attingens a fine usque ad finem fortiter et disponens omnia suaviter et", "type": "mixed", "id": {"id": "066d885b-9d00-4ad3-acb6-a621e6a61da1"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a mixture of Latin phrases and prose, which can be segmented into meaningful spans representing both language structure (natural) and potential poetic or liturgical constructs that may not directly translate to code but are structurally complex for training purposes. / The segment contains a mixture of Latin text and religious context, which can be segmented into meaningful spans like phrases or sentences for training purposes; however, the language barrier may limit its utility without additional contextual information. / The segment contains a mixture of Latin text and religious context, which can be segmented into meaningful spans like phrases or sentences that are coherent in their structure for training purposes. / The segment contains a mix of Latin phrases and prose, which can be segmented into meaningful spans representing both language structure (Latin) and thematic elements related to religious or philosophical texts. It is coherent for training purposes in recognizing span segmentation across different linguistic structures within the same text type. 
/ The segment contains a mixture of Latin text and religious phrases, which have clear structure suitable for span segmentation; however, it lacks context that could be beneficial in training data."}}
 {"raw": "in omnibus obediunt invocanti, sive in bono fuerit sive in malo. De quibus hic est cognicio cuiuscumque. CXIX De spiritibus orientalibus Istorum autem 4 sunt in oriente regnantes et sunt subditi Soli et vento eius, qui boreas dicitur Et excitantes eum sunt isti 4: Baxhatau <rex> , Gahatus, Caudes, Iarabal; (2) et habent hos 4 demones et eorum subditos excitare, congregare, dispergere, constringere et in loco proprio ligare, quorum Barthan est rex, Taadas, Caudas, Yalcal sunt ministri; (3) et eorum natura", "type": "mixed", "id": {"id": "bef2496e-a507-49b1-a159-9ecd416e0882"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and references to entities that could be useful for span segmentation in historical or religious texts, though the context is not entirely clear without additional background knowledge. / The segment contains a mixture of Latin phrases and references to entities, which may be useful for learning span segmentation in multilingual or historical texts. However, it lacks clear structure due to the mix of languages/patterns that could confuse training without further context. / The text contains a mix of Latin phrases and references to entities that could be relevant for training on span segmentation in historical or religious texts, though it lacks clear modern context which may limit its utility. / The text contains a mixture of Latin phrases and references to historical or mythological entities, which can be segmented into meaningful spans for learning purposes; however, it lacks clarity in modern context making its utility limited without additional annotations. 
/ The segment contains a mixture of Latin phrases and references to historical or mythological entities, which could be useful for learning span segmentation in both linguistic patterns (natural language) and structured expressions related to code-like annotations (\"rex\", \"demons\")."}}
 {"raw": "Sworn Book of Honorius 129 But note, that this same prayer offered in chastity and cleanness and faith has power over dangers from fire, wild beasts, or daemons, and then no specific time of day or month need be observed. (8) But that most holy prayer: Hazaram hihel (XXXI) with its four parts, which are Hihelma helma' etc: (XXXII), \"Agloros theomythos' etc: (XXXIV), \"Megal agal\" etc: (XXXV) \"Hamicchiahel\" etc: (XXXVII), similarly with its prologues, (9) namely \"Strengthen, solidify etc. (XXXIII); *0", "type": "mixed", "id": {"id": "e10b9d8b-678c-42cb-a417-ebe30280da8b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of historical text and possibly coded references, with clear spanable phrases like \"Sworn Book of Honorius\" or specific prayers that can be segmented for training purposes. However, the presence of non-standard characters (XXXI etc.) may affect clarity slightly but still retains compositional value. / Contains a mix of historical text and possibly coded references (e.g., \"XXXI\", \"XXXII\"). Clear spans for meaningful segments like phrases, prayers, or parts can be identified despite some archaic language usage. / Contains a mixture of religious text and Latin phrases, with clear structured elements like prayers that can be segmented into meaningful spans for learning purposes. / Contains a mix of historical text and possibly Latin phrases, with clear structured elements like titles (e.g., \"Sworn Book\") that can be segmented into meaningful spans for learning span composition in both natural language processing tasks related to code or documentation analysis. 
/ The text segment contains a mix of historical religious language and references to specific prayers, which can be segmented into meaningful spans for learning purposes; however, the presence of abbreviations (e.g., \"XXXI\", \"(9)\") may pose challenges in training data quality."}}
 {"raw": "Bodleian Library: Aubrey 24, Michael 276. Printed sources, primary Boudet, Jean-Patrice, 'Magie théurgique, angélologie et vision béatifique dans le Liber sacratus sive juratus attribué à Honorius de Thèbes', Mélanges de l'École française de Rome: Moyen Âge, Volume 114, Issue 2, L'École, 2002. Includes a critical edition of much of the Latin text.", "type": "mixed", "id": {"id": "fff9b1cd-7227-45bc-ad1a-9795fd85b7fb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains structured elements like library references and publication details, though not ideal for training due to lack of clear span segmentation patterns. / Contains a mix of bibliographic references and academic citations, with clear spans for names (Bodleian Library), numbers (Aubrey 24, Michael 276) that can be used to train the model on span segmentation in both natural language text and structured data. / The segment contains a mixture of bibliographic references and academic citations, which can be segmented into meaningful spans like author names, titles, publication details; however it lacks clear compositional patterns for training purposes due to its specialized nature. / The segment contains a mixture of bibliographic references and academic citations, which can be segmented into meaningful spans such as author names, titles, publication details; however it lacks clear delimiters for span segmentation in its current form. Cleaned up version could improve training utility by adding punctuation or separating elements with commas to enhance clarity. / The segment contains a mixture of bibliographic references and academic citations, which can be segmented into meaningful spans such as author names, titles, publication details; however it lacks clear delimiters for span segmentation. 
It is clean but may not represent diverse patterns needed for robust training due to its specific domain context."}}
 {"raw": "Mafteah Shelomoh Clavicula Salomonis: a Hebrew manuscript.", "type": "natural", "id": {"id": "aeac8c29-6c60-4288-960c-2260317ba0aa"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent sentence with identifiable meaningful spans (words and phrases). Suitable for training a span-aware model on Hebrew text structure. / Clear sentence structure with identifiable spans (words, phrases). Well-suited for training a span-aware model on Hebrew text segmentation. / Clear, coherent sentence with identifiable meaningful spans (words and phrases). Well-suited for training a span-aware model on Hebrew text. / Clear, coherent sentence with identifiable meaningful spans (words and phrases). Suitable for learning span composition in Hebrew text. / Clear, coherent sentence with identifiable meaningful spans (words and phrases). Suitable for learning span composition in Hebrew text."}}
 {"raw": "(4) I beg and entreat You, most holy one, that you, being one with the Father, will illuminate my soul with the ray of your most Holy Spirit, (5) that I may be able to progress in this most sacred art, so that I might be worthy to achieve the vision of your face, 0", "type": "natural", "id": {"id": "6215b773-3b76-4a90-a893-bef2bba419a8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable meaningful spans; well-formed for training purposes and represents valuable patterns in span segmentation of religious or poetic text. / Clear prose with identifiable phrases and sentences suitable for span segmentation in a tokenizer-free context. / Clear prose with identifiable meaningful spans; clean and coherent for training purposes, though lacks technical jargon or domain-specific patterns. / Clear prose with identifiable phrases suitable for span segmentation; coherent and clean text representative of spiritual or religious content. / Clear sentence structure with identifiable spans; useful for learning span segmentation in prose."}}
 {"raw": "Frankfurt a.M.: J. Kauffmann. 1903. Sepher Maphteah Shelomo (book of the Key of Solomon)", "type": "natural", "id": {"id": "c85be4ed-ddd8-454e-873f-5aba28c54c15"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of names, locations (Frankfurt), and titles in Hebrew script which could be useful for learning span segmentation across different languages or scripts within the same text block. However, it lacks clear delimiters between spans that would typically aid training purposes; thus it's not ideal but still holds some value as mixed content. / Clear title with structured elements (author, year) suitable for training a span-aware model in recognizing document metadata and titles. / Clear title with author and publication year, suitable for training a tokenizer-free span-aware model on structured text segments. / Clear title and author names, suitable for learning span segmentation in historical texts. / Clear title and author names, suitable for training a tokenizer-free span-aware model on structured text segments."}}
 {"raw": "To appease people and favorably obtain from them their petitions, to pacify enemies, to disunite those pacified, to protect the health of those who are healthy, or to sicken them, and to cure the sick: (62)", "type": "natural", "id": {"id": "324c7305-4c56-4ed2-be6c-e7685ce259da"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; useful for learning context and compositional patterns in text. / Clear prose with identifiable phrases and sentences suitable for span segmentation; coherent, clean text representative of the target domain. / Clear prose structure with identifiable meaningful spans; clean and coherent for training purposes. / Clear prose structure with identifiable phrases and sentences suitable for span segmentation training. / Clear prose structure with thematic progression; spans can be identified for training purposes."}}
 {"raw": "Sworn Book of Honorius 167 which I have because of my flesh, or because of my birth, or incurred through sin, may be wiped out through your divine ineffable kindness, (6) whereby in the beginning you willed to create Heaven and Earth, that your spiritual and great mercy may deign to restore (7) to its former state of grace, that which mankind has lost; which ability of seeing and comprehending the judgment of Satan has stolen away: (8) You, O Lord, whose understanding, wisdom, and clarity powerfully reach", "type": "natural", "id": {"id": "60c1991a-b445-46ef-8733-b753bda96409"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose with identifiable thematic spans; useful for learning context and structure in language. / Clear prose with identifiable phrases and sentences; good for learning span segmentation in text. / Clear prose structure with identifiable thematic spans; however, it lacks clear compositional patterns for span segmentation training. / The text segment contains clear, structured elements of religious prose that can be segmented into meaningful spans for training purposes; it is clean and coherent but lacks compositional value due to its repetitive nature. / The text segment is structurally clear with identifiable spans such as verses and phrases, representing valuable patterns for learning span composition in religious or poetic texts. It’s clean but lacks explicit programming constructs that would classify it strictly under \"code\"."}}
 {"raw": "determined, he should fast on bread and water the day before he is to plead his case, and afterwards he should recite quietly those three prayers, two times:", "type": "natural", "id": {"id": "87346dbc-840b-4265-bc89-19089fad656d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text segment contains a clear narrative structure with identifiable spans such as \"determined,\" phrases like \"fast on bread and water the day before he is to plead his case,\" which can be segmented into meaningful parts for learning span composition in an encoder model focused on understanding context without tokenization. / The segment contains a clear narrative structure with identifiable phrases and instructions that can be segmented into meaningful spans, such as \"determined,\" \"fast on bread and water the day before he is to plead his case,\" etc., which are useful for learning span composition in context. / The text segment contains a clear narrative structure with identifiable phrases and actions that can be segmented into meaningful spans, such as \"determined,\" \"fast on bread and water the day before he is to plead his case,\" etc., which are useful for learning span composition. / The text segment contains clear instructions and a sequence of actions, which can be segmented into meaningful spans such as \"determined,\" \"fast on bread and water the day before he is to plead his case,\" etc., representing valuable patterns for learning span composition in natural language. / The segment contains a clear narrative structure with identifiable spans such as \"determined,\" \"bread and water,\" which can be useful for learning span segmentation in the context of religious or historical texts, though it lacks compositional value due to its brevity."}}
 {"raw": "of angels from Rasiel 40b-41b; names of angels that minister before Boal (Boel) 49b; general conjuration 52b (2); seal of terrestrial spirits 64b (?); construction of the whistle 64b-65a. Apparently some versions of Raziel/Rasiel included seals of the angels: Hedegård, Gösta.", "type": "mixed", "id": {"id": "bfec3834-1637-4997-aeb7-a653cdac87e6"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of names, references to angelic entities and related concepts that can be segmented into meaningful spans for learning purposes. The text is coherent but lacks context which might affect its utility as training data. / The segment contains a mix of names, references to angelic beings and their roles (natural language), along with specific codes or identifiers that could be useful for learning span segmentation in both natural text and code-like structures. It is coherent but lacks context which may affect its utility as training data. / Contains a mixture of names, references to angelic beings and their roles (natural language), alongside specific codes or labels that can be segmented into meaningful spans for learning purposes. The text is clean but lacks coherence as it appears fragmented without context. / Contains a mix of names, references to angelic beings and related concepts which can be segmented into meaningful spans for learning purposes. The text is coherent but lacks context that could improve its utility as training data. / The text contains a mixture of names, references to angelic beings and biblical verses which are not clearly structured for meaningful span segmentation; lacks coherence as training data."}}
 {"raw": "130 SWORN BOOK OF HONORIUS sapiencia et eloquencia dabitur sibi in proponendo causam suam, quod breviter optinebit, nisi quod oportet <te> esse bene mundum et castum: (18) Similiter hanc oracionem valet dicere, quando aliquis spiritus vocatus venit, cum illa: \"Lameht ragua pro evitando periculum et acquirendo sapienciam et eloquenciam affandi audacter spiritu advocato. LII Cum igitur nichil aliud fecerit vel aliquod peccatum cogitaverit, roget con- tinue Deum aliis horis a primis 6\",10\",12\" oracione et in", "type": "mixed", "id": {"id": "0ab0e551-1f96-4ed1-b0f0-cb83c6b7ce77"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and references to religious texts, which may have structured patterns useful for training on span segmentation in historical or liturgical contexts. However, the text is fragmented with unclear sentence boundaries that could pose challenges during preprocessing stages before actual model training begins. / The segment contains a mixture of Latin phrases and references to religious texts, which can provide valuable patterns for learning span segmentation in both structured (code-like) elements like numbers (\"130 SWORN BOOK OF HONORIUS\") as well as unstructured natural language text segments. / The segment contains a mixture of Latin phrases and references to religious texts, which can be segmented into meaningful spans for learning purposes; however, the archaic language may pose challenges in terms of clarity but still offers valuable patterns for span composition. / The segment contains a mix of Latin phrases and references to religious texts, which can be segmented into meaningful spans for learning purposes; however, the presence of numbers suggests it may also contain coded elements that need further investigation. 
/ Contains a mix of Latin phrases and potential religious text, with clear sentence structures that can be segmented into meaningful spans for learning purposes. The presence of both language elements makes it valuable as training data in the context of span-aware models dealing with code-mixed content."}}
 {"raw": "230\nSWORN BOOK OF HONORIUS\nCXX De spiritibus occidentalibus Occidentales sunt illi 4, quibus omnes alii regionis demones subduntur; quorum Harthan est rex, Bileth, Milalu, Habuchaba eius ministri, et sunt subditi Lune et vento eius, qui zephirus dicitur: (2) Et excitantes [eum] sunt isti:", "type": "mixed", "id": {"id": "d1cca45f-e1c1-44c1-b0a2-a96a3d0d88f9"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of numerical references, Latin phrases (suggesting historical or legal text), and possibly archaic language that could provide diverse patterns for span segmentation in both structured data formats like code comments/documentation as well as natural languages with unique syntactic structures. / Contains a mix of numerical data, Latin phrases (potentially historical or legal text), and structured formatting that can be segmented into meaningful spans for training purposes. The combination reflects both natural language elements with potential code-like structure in the form of numbered references which could help learn span segmentation across different content types. / Contains a mix of numerical data, Latin phrases (likely from historical texts), and potentially archaic language structures that can be segmented into meaningful spans for learning purposes; clean but less coherent due to the mixture of elements. / The text contains a mix of numerical data, Latin phrases (which could be relevant for historical or linguistic models), and structured lists that can help the model learn span segmentation in both natural language contexts as well as code-like structures with numbered items. 
/ The segment contains a mixture of numerical references, Latin phrases (likely from an ancient text), and possibly archaic language elements that can be segmented into meaningful spans for training purposes in both natural language processing tasks related to historical texts or code parsing involving annotations/metadata referencing."}}
 {"raw": "Bibliography\n309\nPrinted sources, secondary Agrippa, Heinrich, De occulta philosophia Libri Tres, [Köln,] 1533. Critical edition V Perrone Compagni. Leiden. Leiden and London: Brill, 1992. English translation: Three Books of Occult Philosophy, translated by J[ohn] F[rench], London, 1641.", "type": "natural", "id": {"id": "521b971e-95b3-404d-9d7b-8a1ecbf6e85f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains structured bibliographic entries with clear spans for authors, titles, and publication details; represents valuable patterns in span segmentation across different languages and formats. / Clear bibliographic entries with identifiable spans (authors, titles, publication details) suitable for training on span segmentation in scholarly texts. / Clear bibliographic entries with identifiable spans (titles, authors, publication details) suitable for training a span-aware model on structured text like references or citations. / Clear bibliographic entries with discernible spans (authors, titles, publication details) suitable for learning span segmentation in academic texts. / The text lacks clear, meaningful spans for training; it's a bibliographic entry with minimal structure and context."}}
 {"raw": "[in] inte- riora mea sicut aqua fluens de celo et sicut oleum in ossibus meis per te, Deus, salvator omnium, qui es fons bonitatis et tocius pietatis origo. (3) Dirige me et promove me in ista sancta faciali visione, quam deposco, tu, qui es trinus et unus. Amen. LXXXII 224.oracio Hanethi, Deus, tocius pietatis auctor et fundamentum; omnium salus eterna et redempcio populorum, inspirator omnium graciarum et sanc- titatum omnium purarum operacionum largitor inmense, (2) de cuius munere et misericordia venit,", "type": "mixed", "id": {"id": "806802a8-1c0e-4ad0-95e7-27e8c912ca1d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of Latin phrases and religious text, with clear sentence structure suitable for span segmentation; however, the specialized language may limit generalizability to broader training data. / The segment contains a mixture of Latin text and religious phrases, which can be segmented into meaningful spans like \"inte- riora mea,\" \"Deus, salvator omnium,\" etc., representing valuable patterns for learning span composition in both natural language processing (NLP) tasks related to multilingual texts or specific domains such as liturgical studies. / Contains a mixture of Latin phrases and prayers, with clear structure for span segmentation; however, the domain is specialized religious text which may limit generalizability. / Contains a mixture of Latin phrases and prayers, with clear structure for span segmentation; however, it is specialized religious text which may limit generalization to other domains. / The segment contains a mix of Latin text and religious phrases, lacking clear structure for meaningful span segmentation; not representative enough to learn patterns in either domain effectively."}}
 {"raw": "168 SWORN BOOK OF HONORIUS XCIII 314 ORACIO Leyste, profiteor tibi hodie, Deus, Pater omnium, qui secreta celestia ostendisti. Te deprecor suppliciter et maiestatem tuam precor et exoro, (2) ut, sicut tu es rex et princeps cogitacionum, voluntatum et animarum et omnium virtutum aliarum, hodie exaudi preces meas, (3) et dirigantur operaciones mee in conspectu tuo, et acciones mee in conspectu celes- tium virtutum prevaleant: (4) Clamo hodie ad te, Deus meus; nunc exaudi clamorem meum: Ingemisco ad te; hodie", "type": "natural", "id": {"id": "0ae9116e-8f2a-46cb-a3b8-d516f751742b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and poetic structure, which can be segmented into meaningful spans for training purposes; however, its archaic language may limit generalizability to modern contexts. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a non-tokenized context. / Clear structure with distinct phrases and sentences suitable for learning span segmentation in prose text. / Clear structure with identifiable phrases and sentences suitable for span segmentation; coherent text representing valuable patterns in language composition. / Clear, coherent prose with identifiable thematic spans; suitable for learning span composition in a non-code context."}}
 {"raw": "cessisti scire sacramenta, (3) tuere, Domine, defende et clarifica animam meam et libera cor meum de pravis huius mundi cogitacionibus et incen- tiva libidinis voluptate et omnis fornicacionis desideri a in me potenter extingue et reprime, (4) ut puritatibus tuis et actibus misticis ac virtutibus? delecter in eis, et des michi peticionem cordis mei, ut in glorificacione tua confirmatus et delectatus diligam te, (5) quod valeam efficaciter tuam faci- alem visionem et sanctam meo vivente corpusculo optinere,", "type": "natural", "id": {"id": "9aa569a7-ac70-4eb2-8264-4afe0982e0e5"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of religious text and Latin phrases, with clear structure for span segmentation; however, its domain-specific nature may limit generalizability. / Clear poetic structure with identifiable phrases and verses suitable for learning span segmentation in a literary context. / Clear poetic structure with identifiable phrases and verses suitable for learning span segmentation in a literary context. / Clear poetic structure with identifiable spans; rich patterns for learning span segmentation in literary texts. / The segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans that reflect the structure (e.g., \"cessisti scire\", \"(3) tuere\") useful for learning span segmentation in both natural language processing tasks involving code-like structures."}}
 {"raw": "exaudi gemitus cordis mei: (5) Et ego commendo hodie spiritum meum, corpus meum et animam meam et cogitaciones meas in manus tuas, Pater mi et Deus meus, (6) et ne me a te senciam derelictum set misericordia in tuam in me [senciam], et exaltetur nomen tuum in me, clementissime Spiritus sancte, Deus, (7) cuius bonitas est eterna, cuius misericordia est incomprehensibilis, cuius perpetua clari- tas, cuius possessione pleni sunt celi et terra, (8) aspira et respice in me, Domine, et", "type": "mixed", "id": {"id": "d93b889f-bc68-4bcc-a7b4-9d25772e55f3"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of religious text and Latin phrases, with clear verse structure suitable for span segmentation training. / Contains a mix of Latin phrases and punctuation, with clear separations between lines; useful for learning span segmentation in religious or historical texts. / Contains a mix of Latin phrases and punctuation, which may not be ideal for training but shows clear structure in religious text segments. / The segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, it lacks clear compositional patterns due to its poetic structure. / The segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, the presence of non-English text may affect generalization to other languages or domains."}}
 {"raw": "Sworn Book of Honorius 231 CXX: Concerning the Spirits of the West. The western ones are four, and all other daemons of the region are under them; of which Harthan is the king, Bileth, Milalu, and Habuchaba are his ministers, and they are subordinate to the Moon and its wind, which is called Zephyr: (2) And raising it up are these: Hebethel, Amocap, Oilol, Mylau, and Abuchaba, and they have these four daemons and their subordinates to raise up, congregate, scatter, constrain, and bind to their proper place:", "type": "mixed", "id": {"id": "659b9b73-6956-42db-9356-cf297b99e36c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The text contains a mixture of historical/spiritual language and structured descriptions that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both domains. / The text contains a mixture of historical/cultural references and structured descriptions that can be segmented into meaningful spans, such as names (\"Harthan\", \"Bileth\"), roles (\"ministers\"), or entities (\"the Moon\"). Despite being somewhat archaic in language style (which might affect clarity), it is clean for training purposes. / Contains a mixture of historical text and references to entities (spirits, daemons) with clear hierarchical relationships that can be segmented into meaningful spans for learning purposes. The content is coherent but may require additional context or domain knowledge due to its archaic language style. / The text contains a mixture of historical language and references to mythical entities, which can be segmented into meaningful spans for learning span composition in both narrative context (natural) and structured descriptions that resemble code-like patterns with hierarchical relationships (\"concerning the Spirits...\"). 
/ The text segment contains a mixture of historical/cultural references and structured lists, which can help the model learn span segmentation in both narrative prose (natural language) and enumerative constructs typical for code-like structures. However, it lacks clear delimiters that are common to programming languages or markup formats; thus while it's somewhat useful as mixed content training data due to its varied structure, clarity could be improved with additional context markers."}}
 {"raw": "tifica me, Domine, quia in te pono me innocentificandum: Glorifica me, Domine, quia in te pono me glorificandum: (11) Rege me, Domine, quia in te pono me regendum, et in me gracie tue fidem infunde et fige, ut Spiritus tuus sanctus in me veniat, regnet et imperet pro hac sancta visione divina. Amen. XCIV 324 ORACIO Horistion, Domine, quia ego servus tuus sum, servio tibi hodie et con- fiteor coram maiestate glorie tue, in cuius conspectu omnis magnificencia et sanctimonia est, (2) et deprecor sanctum et", "type": "mixed", "id": {"id": "6b7fd15b-5f1b-40f9-a834-32da12997e7c"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and religious text, with clear structure for span segmentation; however, it lacks coherence in English context which may limit its utility as is. Consider translating or providing additional annotations to improve training effectiveness. / Contains both structured religious text and a numerical reference, indicating potential for learning span segmentation in diverse contexts. / Contains both religious text and Latin phrases, showing clear structure for span segmentation; however, it may not be representative enough of the target domain due to its specialized nature. / The segment contains a mix of Latin text and what appears to be religious or scriptural content, lacking clear structure for meaningful span segmentation; it is not coherent as training data. / Contains both structured phrases and a mix of languages, providing diverse span patterns for training."}}
 {"raw": "And I entrust today my spirit; my body and my soul, and my thoughts into your hands, my Father and my God, (6) and let me never feel abandoned by you, but rather sense your compassion in me, and may your name be exalted in me, 0 most merciful Holy Spirit 0 God, (7) whose goodness is eternal, whose mercy is incompre- hensible; whose brightness is everlasting; whose substance fills Heaven and Earth, (8) breathe into me and look upon me, 0 Lord, and this operation of mine, and for your praise and honor grant", "type": "natural", "id": {"id": "92f4723a-9fea-4ba3-839f-167e79228a86"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable spiritual and religious phrases suitable for training a span-aware model on structured text segments. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model in understanding religious text structure. / Clear, coherent prose with identifiable spiritual phrases and expressions suitable for learning compositional patterns in religious texts. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a spiritual context."}}
 {"raw": "Vol. I New York, NY: Doubleday, 1983 Dan, Joseph, Rachel Elior; and Peter Schafer: Creation and re-creation in Jewisb thought: festschrift in honor of Joseph Dan on the occasion of bis 'seventieth birthday: Tibingen: Mohr Siebeck, 2oo5.", "type": "natural", "id": {"id": "bfca2e95-c415-41de-a933-3b0be6153daf"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans (authors, titles) suitable for training on span segmentation in prose. / Clear sentence structure with identifiable spans (authors, titles). Well-suited for learning span segmentation in prose text. / Clear structure with identifiable spans (authors, titles). Well-formatted for training purposes and represents valuable patterns in span segmentation of academic texts. / Clear structure with identifiable spans (authors, titles) and coherent content representative of scholarly texts. / Clear structure with identifiable spans (authors, titles) and coherent content suitable for learning span composition in a tokenizer-free context."}}
 {"raw": "(11) Et caveat, ne corpus Christi accipiat pro effectu malo, quia non esset salus immo mors, unde quidam intitulaverunt librum istum sic: \"Incipit mors anime\" (12) Et hoc est verum male operantibus et propter effectum malum, et non propter scienciam: Nam ait Dominus: \"Petite, et dabitur vobis. Que- rite et invenietis\" (13) Et alibi dicit Dominus: \"Ubi duo vel tres congregati fuerint in nomine meo, ibi sum in medio\" et \"De omni re, quam pecierint in nomine meo, fiet illis a patre meo:\" LIII Oraciones", "type": "mixed", "id": {"id": "a32e69a5-0e1b-415d-b0f4-c4264793869b"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and references to biblical text, which have clear linguistic structures suitable for span segmentation; however, it lacks contextual clarity due to its specialized language use. / Contains a mixture of biblical text (natural language) and references to chapters/verses, which can help the model learn span segmentation for both prose structure and numbered elements. However, it lacks clear code constructs or programming syntax that would be beneficial in training solely on one type. / Contains both structured phrases and a mix of Latin text with numerical references, representing valuable patterns for learning span segmentation in multilingual contexts. / The segment contains a mix of biblical text (natural language) and Latin phrases, which can help the model learn span segmentation in both contexts; however, it lacks clear delimiters for spans within code constructs or structured programming elements. / Contains a mixture of religious text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for training purposes. 
However, the domain-specific language may limit generalizability across different contexts."}}
 {"raw": "R inserts a section on suffumigations and names of angels, largely drawn from Agrippa. I8 This prayer appears in the Liturgia Horarum (Liturgy ofthe Hours) and other medieval collections of prayers A slightly different version is found in the Rituale Romanum (Roman Ritual) Other magic texts incorporate it as well, including the Book of Oberon, P: SI; Sloane 3851, fol. 21; and Kieckhefer, Forbidden Rites, p. 252.", "type": "natural", "id": {"id": "9033f2e5-15ba-447a-96d3-7be1095bb094"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both historical references and citations, which can help the model learn span segmentation for diverse content types. / The text lacks clear, consistent patterns for meaningful span segmentation; it is a mixture of references and descriptions without coherent structure suitable as training data. / The text lacks clear, meaningful spans for training a span-aware model; it is mostly prose with no discernible patterns suitable for learning. / The segment contains a mix of historical references and citations, which may not provide clear span segmentation patterns for training purposes due to its diverse content types. / The text lacks clear, identifiable spans for meaningful segmentation; it's a continuous prose segment without discernible patterns suitable as training data."}}
 {"raw": "harayn: XVIII Lux mundi, Deus inmense, pater eternitatis, largitor sapiencie et totius gra- cie spiritualis pie et inestimabilis dispensator noscens omnia, priusquam fiant, faciens tenebras et lucem, (2) mitte manum tuam et tange animam meam et corpus meum et pone illam ut gladium furbitum ad visionem tuam habendam et fac eam ut sagittam electam et granum tritici reconditum ad contemplandum tuam mirabilem faciem (3) et emitte Spiritum sanctum tuum, Domine, in cor meum ad istud donum percipiendum et in", "type": "mixed", "id": {"id": "553c3dad-a566-4fb7-863c-d2eddc122efa"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mix of religious text and structured phrases, with clear spans for segmentation; however, the archaic language may pose challenges in learning patterns. / Contains a mixture of religious text and structured phrases, with clear spans for potential learning in spiritual context. / The segment contains a mixture of religious text and structured phrases that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (natural language) and liturgical structure (code-like constructs). / The segment contains a mixture of religious text and poetic structure, with clear phrases that can be segmented into meaningful spans for learning purposes. It is coherent but may require domain-specific knowledge to fully understand the context (spiritual or theological). / Contains a mix of religious text and structured phrases with clear spiritual references, suitable for learning span segmentation in both language context and coded expressions."}}
 {"raw": "LXXXII PRAYER 22.117 Hanethi, 0 God, the author and foundation of all piety the eternal health and redemption ofall people, inspirer ofall graces and sanctity the generous immeasurable giver ofall pure operations, (2) from whose gift and mercy your servants are granted such great indulgences, your servants, which you have even permitted me, a miserable sinner to know your sacraments, (3) 0 Lord, watch, defend, and clarify my soul, and free my heart from faulty thinking of this world, and the allure of the", "type": "natural", "id": {"id": "78d37467-caaf-426b-b8e8-7d1e5a93496a"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent prose with identifiable thematic spans; useful for learning context and structure in text. / Clear religious text with identifiable phrases and structured prayers suitable for learning span segmentation in a tokenizer-free context. / Clear prose structure with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear, coherent prose with identifiable phrases and sentences suitable for training a span-aware model on English text."}}
 {"raw": "(9) Teach me, 0 Lord, in whom I entrust myself for teaching: Purify me, 0 Lord; in whom I entrust myself to be purified.", "type": "natural", "id": {"id": "3b0c7c4f-a7ac-4fac-9ef6-8637b6654551"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear, coherent sentences with potential for meaningful span segmentation; representative of religious or philosophical discourse. / Clear, coherent prose with identifiable phrases suitable for span segmentation and composition learning. / Clear, coherent prose with identifiable phrases suitable for span segmentation and composition learning. / Clear, coherent prose with identifiable phrases suitable for span segmentation and training purposes. / Clear, coherent prose with meaningful phrases suitable for training a span-aware model on sentence structures and thematic elements."}}
 {"raw": "232 SWORN BOOK OF HONORIUS CXXII De spiritibus septemtrionalibus Septemtrionales sunt isti: Maymon rex, Albunalich, Assaibi, Haibal- idech, Yasfla, quibus omnes alii demones regionis subduntur; et sunt sub- diti Saturno et vento eius, qui Affricus dicitur: (2) Et excitantes eum sunt isti 3: Mextyura, Alcybany, Alflas, ethabent hos 5 demones et eorum sub- ditos congregare, dispergere, constringere ac in loco proprio ligare: (3) Sua natura est seminare discordias, odia generare, malas cogitaciones, furta et", "type": "mixed", "id": {"id": "07e42874-df5f-4624-9b87-384bb8750d8f"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of structured elements (names, titles) and unstructured text that could be useful for learning span segmentation in both domains. However, the lack of clear delimiters makes it less ideal as is; additional preprocessing may improve its utility. / The segment contains a mixture of structured phrases and terms that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic context (natural language) and specific references to entities or concepts (\"Maymon rex,\" \"Assaibi\"). / The segment contains a mixture of Latin phrases and structured lists, which can be segmented into meaningful spans for learning span composition in both language processing tasks related to historical texts (natural) as well as code-like structures that resemble programming constructs or markup languages used historically. / Contains a mix of structured elements (names, titles) and unstructured text; spans can be identified for learning purposes. 
/ The segment contains a mix of Latin phrases and structured lists that can be segmented into meaningful spans, representing valuable patterns for learning span composition in both linguistic structures (natural language) and coded formats like enumerations or listings commonly found within historical texts or documents related to ancient languages."}}
 {"raw": "prenominate et post nominande numero sunt hec, scilicet: PRIMA ORACIO Agla, lux, veritas, vita, via, iudex misericors, misericordia, fortitudo, paciencia, conserva et iuva me in hac sancta visione et miserere mei (2) propter misericordiam tuam et servicium huius sancti suffumigii et sancti sacrificii Domini nostri Iesu Christi et propter meritum gloriose semper virginis Marie, matris Domini nostri Iesu Christi, (3) et meritum apostolo-", "type": "mixed", "id": {"id": "b836d199-99cc-4fc9-b95b-d4aa3c2e55e7"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.74, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of Latin phrases and references to religious figures, which can be segmented into meaningful spans for training purposes; however repetitive nature may affect learning efficiency. / Contains a mixture of Latin phrases and religious text, with clear structured elements like titles (e.g., \"PRIMA ORACIO\") that can be segmented into meaningful spans for learning purposes. The content is clean but may require domain-specific knowledge to fully understand the context or patterns within this type of data. / Contains both structured phrases and religious text, with clear demarcations for potential spans like \"PRIMA ORACIO\" or \"(2) propter misericordiam tuam.\" However, it lacks coherence in English grammar which may affect training utility slightly. / Contains a mixture of religious text and Latin phrases, with clear demarcations for potential spans; however, the presence of non-standard characters (e.g., \"œ\") may affect readability in training data. / Contains a mixture of Latin phrases and religious text, with clear structure for span segmentation; however repetitive nature may limit learning diversity."}}
 {"raw": "animam meam ad emundan- dum et in conscienciam meam ad speculandum: (4) Per iuramentum cor- dis4 [*coheredis] tui, id est per dexteram pie sciencie tue, misericorditer; clementer et leniter in me graciam tuam inspira et doce et instrue (5) et instaura introitum et exitum sensuum meorum, et doceas me et clarifices me et mundifices me et corrigas me cum disciplina tua usque in finem (6, ut visionem tuam facialiter optineam, et adiuvet me consilium altissimum per infinitam sapienciam tuam et misericordiam", "type": "mixed", "id": {"id": "052c6bc4-03e1-4de4-a7cc-dbb83ca851ee"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains both structured religious text and Latin phrases, providing diverse span patterns for learning. / Contains both structured phrases and a mix of Latin text with potential for meaningful span segmentation, though not entirely coherent in English context. / The segment contains a mix of Latin phrases and punctuation, indicating structured elements that can be segmented into meaningful spans for learning purposes in both linguistic patterns (natural language) and formatting rules (code-like structures). / The segment contains a mixture of Latin phrases and punctuation, which can be segmented into meaningful spans for training purposes; however, the language barrier may reduce its utility without additional context or translation tools integrated during model development. / Contains a mixture of Latin phrases and punctuation, indicating clear spans for training in both language structure recognition and handling multilingual text."}}
 {"raw": "desire for physical pleasure, and extinguish and potently restrain in me all lust for fornication, (4) that with your purifi- cations and mystical acts and powers, I may delight in them, and grant to me the petition of my heart, that may be strengthened in your glorification, and love you, (5) and that I may have the ability to attain the vision of your holy face while my body lives, and may the virtue ofthe Holy Spirit increase in me, through your deliverance and the reward of the faithful, for the health", "type": "natural", "id": {"id": "e6ba7597-f93f-4d68-87da-3d5114526203"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear sentence structure with identifiable phrases and clauses suitable for training a span-aware model in understanding complex sentences of religious text. / Clear and coherent prose with identifiable phrases suitable for learning span segmentation in a tokenizer-free context. / Clear and coherent prose with identifiable thematic spans; suitable for learning span segmentation in a tokenizer-free context. / Clear and coherent prose with identifiable thematic spans, suitable for training on span segmentation in a tokenizer-free context. / The text lacks clear, identifiable spans for meaningful segmentation; it is too abstract and poetic without discernible patterns or structures suitable as training data."}}
 {"raw": "Sworn Book of Honorius\n133\nbody of Christ; the priest should petition on behalf of the operator; that he may obtain success in his petition through divine grace: (9) And so too you should understand regarding all the prayers, which pertains to the priest and for the operation, because in general they are required for all pecitions. (1o) But nothing else should be added.", "type": "natural", "id": {"id": "7d7506c8-df3a-4199-a2e5-168cfcc11ffb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.8, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear prose structure with identifiable phrases suitable for span segmentation; coherent and representative of religious text composition. / Clear prose structure with identifiable phrases and sentences suitable for training a span-aware model on English text. / Clear prose structure with identifiable phrases and sentences suitable for span segmentation; coherent text representative of religious texts. / Clear prose structure with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context. / Clear prose structure with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "Sworn Book of Honorius\n233\nCXXII Concerning the Spirits ofthe North. 239\nThe Northern ones are these: Maymon the king, Albunalich, Assaibi; Haibalidech, and Yasfla, and all other daemons of the region are placed under these, and they are subordinate to Saturn and its wind, which is called Africus (or the 'southwest wind\"  ).", "type": "natural", "id": {"id": "890fc6dd-3719-4c57-bb5a-a471eb605deb"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.78, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear narrative structure with identifiable spans (e.g., names, titles). Well-suited for learning span segmentation in historical texts or documents related to mythology and ancient beliefs. / Clear prose with identifiable entities and thematic structure suitable for span segmentation learning. / Clear prose structure with identifiable thematic spans; useful for learning span segmentation in historical texts. / Clear prose structure with identifiable thematic spans; useful for learning span segmentation in historical texts. / Clear prose with identifiable thematic spans; useful for learning context and entity recognition in historical texts."}}
 {"raw": "indivisibilis Deus, adoro hodie nomen sanctum tuum ego, indignus et miserimus peccator; (2) extollens oracionem meam et intellectum meum et racionem meam ad templum sanctum tuum celestis Ierusalem et assisto tibi hodie, Deus meus, ostendens te Deum meum, creatorem meum et salvatorem meum. (3) Et ego, creatura racionabilis, invoco hodie gloriosam clemenciam tuam, ut visitet\" hodie Spiritus sanctus infirmitatem meam (4) Ettu, Domine, Deus meus, qui Moysiet Abrahe,servis tuis, per fidem et puritatem visionis", "type": "mixed", "id": {"id": "3a825666-0577-487a-8d9e-29189ae7539d"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.7, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "The segment contains a mixture of Latin phrases and religious text, which can be segmented into meaningful spans such as verses or individual words/phrases for training purposes; however, it lacks clear code constructs that would make the content more suitable for X-Spanformer specifically trained on programming languages. / The segment contains a mixture of Latin phrases and English text, with clear demarcations between them that can be used for span segmentation training in multilingual contexts. However, the presence of both languages may pose challenges depending on X-Spanformer's design goals; thus it is kept but noted as potentially challenging due to language mixing. / The segment contains a mixture of Latin phrases and religious text, which may not have clear syntactic structures for tokenization but can be valuable in learning span segmentation due to the presence of distinct words or phrases that are contextually related. 
/ The segment contains a mixture of Latin phrases and references to religious texts, which have clear structure but may not be directly useful for span segmentation in X-Spanformer without additional context or preprocessing tailored specifically towards such content. / The segment contains a mixture of Latin text (likely from religious scripture) and phrases that could be translated into English, showing clear structure for span segmentation in both languages; however, the presence of code-like elements is minimal or non-existent."}}
 {"raw": "170\nSWORN BOOK OF HONORIUS\ncelum fundasti et terram, te, Pater piissime, largiente, qui vivis et regnas solus per omnia secula seculorum. Amen:\nXCV 334 ORACIO\nJeremon, clementissime Domine, Deus meus, et miserere mei et parce malis meis. Sana animam meam, quia peccavi   tibi. Non abneges  uni quod pluribus contulisti.", "type": "natural", "id": {"id": "e76546ee-7f01-43bd-a87e-d0e941e7b578"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.76, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Clear structure with identifiable spans such as verses, prayers or phrases; clean and coherent text suitable for learning patterns in religious texts. / Clear structure with identifiable spans (e.g., verses, prayers). Clean and coherent text suitable for learning patterns in religious or poetic texts. / Clear structure with identifiable phrases and sentences suitable for span segmentation; coherent text representing religious prayers, valuable patterns present. / Clear, coherent prose with identifiable phrases and sentences suitable for span segmentation training. / Clear, coherent prose with identifiable phrases and sentences suitable for learning span segmentation in a tokenizer-free context."}}
 {"raw": "(2) Exaudi, Deus, oracionem famuli tui N, et in quacumque die invocavero te. Velociter exaudi me, sicut exaudisti sanctam Mariam Magdalenam: (3) Suscipe, Domine, clamorem confitentis ad te, audi vocem precantis et per oraciones beatissime Marie virginis, matris tue, atque omnium sanctorum tuorum, (4) ut oraciones et preces perve- niant ad aures pietatis tue, quas ego, N, pro hac sancta visione effundo coram te in hac hora, ut per tua sanctissima nomina et sacramenta, (5) que sunt Hosel: Iesel. Hazaiacol.", "type": "mixed", "id": {"id": "182f45a4-772c-4bca-8604-41368cef65e8"}, "meta": {"status": "keep", "tags": [], "doc_language": "en", "extracted_by": "pdf2seg", "confidence": 0.72, "source_file": "Sworn_Book_of_Honorius_Liber_-_Juratus_Honorii_-_Joseph_Peterson.pdf", "notes": "Contains a mixture of religious text (natural language) and references to biblical verses, which can help the model learn span segmentation for both prose structure and specific phrases or codes like verse numbers. / The segment contains a mixture of religious text and Latin phrases, with clear sentence structures that can be segmented into meaningful spans for training purposes. It represents valuable patterns in both language structure (natural) and specific linguistic constructs found within liturgical texts or historical documents which could benefit the model's understanding across different domains. / The segment contains a mixture of religious text and Latin phrases, with clear verse structure that can be segmented into meaningful spans for training purposes. However, the presence of archaic language may pose challenges in generalization to modern contexts. / Contains a mixture of religious text and Latin phrases, with clear structure for span segmentation; however, it may not be representative enough due to its specialized nature. 
/ The segment contains a mixture of religious text and Latin phrases, with clear structure for span segmentation; however, it may require domain-specific knowledge to fully understand the context."}}