Created
January 14, 2014 02:19
-
-
Save ashfurrow/8411942 to your computer and use it in GitHub Desktop.
Docbook parser I wrote in my universities days.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import Text.XML.Light.Input | |
import Text.XML.Light.Types | |
{- | |
Dear Markers: | |
Some known issues: | |
1. if an ordered list item contains a sublist, that list item doesn't get numbered | |
2. given an itemized list, the asterisks align with the start of the letters, not two spaces before them | |
3. there's an extra newline inserted at the end of the file | |
Those are all the issues I can think of and they are illustrated in the test cases, which are included at the end of this file. After 10 hours of working on this assignment, I'm calling it "done enough". | |
Cheers, | |
Ash | |
-} | |
--sets up the rest of the program | |
formatFile name = do | |
str <- readFile name | |
let xml = parseXML str | |
putStrLn $ printElements "" 0 xml | |
main = formatFile "xml" | |
--prefix is a list marker like "* " or "2 " | |
--level is the indentation level | |
--this method loops through a Content list and calls printElement on each item, if it's not empty. | |
printElements :: String -> Int -> [Content] -> String | |
printElements prefix level (n:xs) | isEmpty (printElement prefix level n) = printElements prefix level xs | |
| otherwise = printElement prefix level n ++ printElements "" level xs | |
printElements _ _ [] = "" | |
--ditto on this for prefix and level, they cannonical throughout my program | |
--sorts into different cases depending on the tag's qualified name and calls appropriate functions to display the tag's contents | |
printElement :: String -> Int -> Content -> String | |
printElement prefix level (Elem (Element (QName "para" qURI qPrefix) elAttribs ns elLine)) = printElements prefix level ns ++ "\n" | |
printElement prefix level (Elem (Element (QName "itemizedlist" qURI qPrefix) elAttribs ns elLine)) = printItemizedListElements prefix (level +1) ns | |
printElement prefix level (Elem (Element (QName "orderedlist" qURI qPrefix) elAttribs ns elLine)) = printOrderedListElements prefix (level +1) 1 ns | |
printElement prefix level (Elem (Element (QName "listitem" qURI qPrefix) elAttribs ns elLine)) = printElements prefix level ns -- ++ "\n" | |
printElement prefix level (Text (CData cdVerbatim (n) cdLine)) | isEmpty n = "" | |
| otherwise = gimmeSomeTabs level ++ prefix ++ removeLeadingTabs n ++ "\n" | |
printElement _ _ _ = "Something went wrong" | |
printItemizedListElements :: String -> Int -> [Content] -> String | |
printItemizedListElements prefix level (n:xs) = printItemizedListElement prefix level n ++ printItemizedListElements prefix level xs | |
printItemizedListElements _ _ [] = "" --don't know why it's sometimes called with an empty [Content], but whatever | |
printItemizedListElement :: String -> Int -> Content -> String | |
printItemizedListElement prefix level (Elem (Element (QName "listitem" qURI qPrefix) elAttribs ns elLine)) = printElements "* " level ns -- ++ "\n" | |
printItemizedListElement prefix level (Text (CData cdVerbatim (n) cdLine)) = "" | |
printOrderedListElements :: String -> Int -> Int -> [Content] -> String | |
printOrderedListElements prefix level k (n:xs) | isOrderedListElementEmpty n = (printOrderedListElements "" level k xs) | |
| otherwise = (printOrderedListElement prefix level k n) ++ (printOrderedListElements prefix level (k+1) xs) | |
printOrderedListElements _ _ k [] = "" --don't know why it's sometimes called with an empty [Content], but whatever | |
printOrderedListElement :: String -> Int -> Int -> Content -> String | |
printOrderedListElement prefix level k (Elem (Element (QName "listitem" qURI qPrefix) elAttribs ns elLine)) = printElements (show k ++ " ") level ns -- ++ "\n" | |
printOrderedListElement _ _ k (Text (CData cdVerbatim (n) cdLine)) = "" | |
--returns if a Content item has empty text | |
isOrderedListElementEmpty :: Content -> Bool | |
isOrderedListElementEmpty (Elem (Element (QName "listitem" qURI qPrefix) elAttribs ns elLine)) = isEmpty (printElements "" 0 ns) | |
isOrderedListElementEmpty (Text (CData cdVerbatim (n) cdLine)) = isEmpty n | |
--returns if a String is empty (contains only newlines, tabs, or spaces) | |
isEmpty :: String -> Bool | |
isEmpty (n:xs) | n == '\n' = isEmpty xs | |
| n == '\t' = isEmpty xs | |
| n == ' ' = isEmpty xs | |
| otherwise = False | |
isEmpty [] = True | |
--removes leading tabs that are present in non-flat XML files | |
removeLeadingTabs :: String -> String | |
removeLeadingTabs (n:xs) | n == '\t' = removeLeadingTabs xs | |
| n == '\n' = (removeLeadingTabs xs) | |
| n == ' ' = removeLeadingTabs xs | |
| otherwise = (n:xs) | |
removeLeadingTabs a = a | |
--for indenting to a specified level, we pass an integer in and it gives us that many tabs in a String | |
gimmeSomeTabs :: Int -> String | |
gimmeSomeTabs 0 = "" | |
gimmeSomeTabs n = "\t" ++ gimmeSomeTabs (n-1) | |
{- | |
Your Test Cases: | |
*Main> formatFile "xml0" | |
The component suffered from three failings: | |
* It was slow | |
* It ran hot | |
* It didn't actually work | |
Of these three, the last was probably the most important. | |
*Main> formatFile "xml1" | |
Level 0 | |
* Level 1 | |
*Main> formatFile "xml2" | |
Level 0 | |
* Level 1 | |
* Level 2 | |
*Main> formatFile "xml3" | |
Level 0 | |
* Level 1 | |
* Level 2 | |
Level 2 | |
* Level 1 | |
*Main> formatFile "xml4" | |
Level 0 | |
* Level 1 | |
1 Level 2 | |
*Main> formatFile "xml5" | |
Level 0 | |
1 Level 1 | |
* Level 2 | |
Level 2 | |
* Even more Level 2 | |
3 Level 1 | |
Level 0 | |
*Main> formatFile "xml6" | |
The component suffered from three failings: | |
* It was slow | |
1 It made molasses look fast. | |
2 I mean, so slow, we hired geologists to watch it. | |
3 Fundamentally, this thing was not fast. | |
* It ran hot. | |
* It didn't actually work. | |
Of these three, the last was probably the most important. | |
*Main> | |
*Main> formatFile "myxml0" | |
Ash | |
*Main> formatFile "myxml1" | |
Ash | |
*Main> formatFile "myxml2" | |
* it seems superfluous to have all these test cases | |
*Main> formatFile "myxml3" | |
Ash | |
is my name | |
*Main> | |
BadWolf:~ ash$ cat myxml* | |
<para><para>Ash</para></para> | |
<para>Ash</para> | |
<para> | |
<itemizedlist> | |
<listitem><para>it seems superfluous to have all these test cases</para></listitem> | |
</itemizedlist> | |
</para> | |
<para>Ash</para> | |
<para>is my name</para> | |
BadWolf:~ ash$ | |
-} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment