パース処理
parsec
For parsing in Haskell it is quite common to use a family of libraries known as Parser Combinators which let us write code to generate parsers which themselves looks very similar to the parser grammar itself!
Combinators
<|>
The choice operator tries to parse the first argument before proceeding to the second. Can be chained sequentially to a generate a sequence of options.
many
Consumes an arbitrary number of patterns matching the given pattern and returns them as a list.
many1
Like many but requires at least one match.
optional
Optionally parses a given pattern returning it's value as a Maybe.
try
Backtracking operator will let us parse ambiguous matching expressions and restart with a different pattern.
There are two styles of writing Parsec, one can choose to write with monads or with applicatives.
parseM :: Parser Expr
parseM = do
a <- identifier
char '+'
b <- identifier
return $ Add a b
The same code written with applicatives uses the applicative combinators:
-- | Sequential application.
(<*>) :: f (a -> b) -> f a -> f b
-- | Sequence actions, discarding the value of the first argument.
(*>) :: f a -> f b -> f b
(*>) = liftA2 (const id)
-- | Sequence actions, discarding the value of the second argument.
(<*) :: f a -> f b -> f a
(<*) = liftA2 const
parseA :: Parser Expr
parseA = Add <$> identifier <* char '+' <*> identifier
Now for instance if we want to parse simple lambda expressions we can encode the parser logic as compositions
of these combinators which yield the string parser when evaluated under with the parse
.
import Text.Parsec
import Text.Parsec.String
data Expr
= Var Char
| Lam Char Expr
| App Expr Expr
deriving Show
lam :: Parser Expr
lam = do
char '\\'
n <- letter
string "->"
e <- expr
return $ Lam n e
app :: Parser Expr
app = do
apps <- many1 term
return $ foldl1 App apps
var :: Parser Expr
var = do
n <- letter
return $ Var n
parens :: Parser Expr -> Parser Expr
parens p = do
char '('
e <- p
char ')'
return e
term :: Parser Expr
term = var <|> parens expr
expr :: Parser Expr
expr = lam <|> app
decl :: Parser Expr
decl = do
e <- expr
eof
return e
test :: IO ()
test = parseTest decl "\\y->y(\\x->x)y"
main :: IO ()
main = test >>= print
カスタムレクサ
In our previous example lexing pass was not necessary because each lexeme mapped to a sequential collection of characters in the stream type. If we wanted to extend this parser with a non-trivial set of tokens, then Parsec provides us with a set of functions for defining lexers and integrating these with the parser combinators. The simplest example builds on top of the builtin Parsec language definitions which define a set of most common lexical schemes.
haskellDef :: LanguageDef st
emptyDef :: LanguageDef st
haskellStyle :: LanguageDef st
javaStyle :: LanguageDef st
For instance we'll build on top of the empty language grammar.
import Text.Parsec
import Text.Parsec.Expr
import Text.Parsec.String
import qualified Text.Parsec.Token as Token
lexerStyle :: Token.LanguageDef ()
lexerStyle = Token.LanguageDef
{ Token.commentStart = "{-"
, Token.commentEnd = "-}"
, Token.commentLine = "--"
, Token.nestedComments = True
, Token.identStart = letter
, Token.identLetter = alphaNum <|> oneOf "_"
, Token.opStart = Token.opLetter lexerStyle
, Token.opLetter = oneOf "`~!@$%^&*-+=;:<>./?"
, Token.reservedOpNames= []
, Token.reservedNames = ["if", "then", "else", "def"]
, Token.caseSensitive = True
}
lexer :: Token.TokenParser ()
lexer = Token.makeTokenParser lexerStyle
parens :: Parser a -> Parser a
parens = Token.parens lexer
natural :: Parser Integer
natural = Token.natural lexer
identifier :: Parser String
identifier = Token.identifier lexer
reservedOp :: String -> Parser ()
reservedOp = Token.reservedOp lexer
reserved :: String -> Parser ()
reserved = Token.reserved lexer
whiteSpace :: Parser ()
whiteSpace = Token.whiteSpace lexer
comma :: Parser String
comma = Token.comma lexer
See: Text.ParserCombinators.Parsec.Language
単純なパース処理
Putting our lexer and parser together we can write down a more robust parser for our little lambda calculus syntax.
module Parser (parseExpr) where
import Text.Parsec
import Text.Parsec.String (Parser)
import Text.Parsec.Language (haskellStyle)
import qualified Text.Parsec.Expr as Ex
import qualified Text.Parsec.Token as Tok
type Id = String
data Expr
= Lam Id Expr
| App Expr Expr
| Var Id
| Num Int
| Op Binop Expr Expr
deriving (Show)
data Binop = Add | Sub | Mul deriving Show
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser style
where ops = ["->","\\","+","*","-","="]
style = haskellStyle {Tok.reservedOpNames = ops }
reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer
identifier :: Parser String
identifier = Tok.identifier lexer
parens :: Parser a -> Parser a
parens = Tok.parens lexer
contents :: Parser a -> Parser a
contents p = do
Tok.whiteSpace lexer
r <- p
eof
return r
natural :: Parser Integer
natural = Tok.natural lexer
variable :: Parser Expr
variable = do
x <- identifier
return (Var x)
number :: Parser Expr
number = do
n <- natural
return (Num (fromIntegral n))
lambda :: Parser Expr
lambda = do
reservedOp "\\"
x <- identifier
reservedOp "->"
e <- expr
return (Lam x e)
aexp :: Parser Expr
aexp = parens expr
<|> variable
<|> number
<|> lambda
term :: Parser Expr
term = Ex.buildExpressionParser table aexp
where infixOp x f = Ex.Infix (reservedOp x >> return f)
table = [[infixOp "*" (Op Mul) Ex.AssocLeft],
[infixOp "+" (Op Add) Ex.AssocLeft]]
expr :: Parser Expr
expr = do
es <- many1 term
return (foldl1 App es)
parseExpr :: String -> Expr
parseExpr input =
case parse (contents expr) "<stdin>" input of
Left err -> error (show err)
Right ast -> ast
main :: IO ()
main = getLine >>= print . parseExpr >> main
Trying it out:
λ: runhaskell simpleparser.hs
1+2
Op Add (Num 1) (Num 2)
\i -> \x -> x
Lam "i" (Lam "x" (Var "x"))
\s -> \f -> \g -> \x -> f x (g x)
Lam "s" (Lam "f" (Lam "g" (Lam "x" (App (App (Var "f") (Var "x")) (App (Var "g") (Var "x"))))))
状態遷移のあるパース処理
For a more complex use, consider parser that are internally stateful, for example adding operators that can
defined at parse-time and are dynamically added to the expressionParser
table upon definition.
module Main where
import qualified Text.Parsec.Expr as Ex
import qualified Text.Parsec.Token as Tok
import Text.Parsec.Language (haskellStyle)
import Data.List
import Data.Function
import Control.Monad.Identity (Identity)
import Text.Parsec
import qualified Text.Parsec as P
type Name = String
data Expr
= Var Name
| Lam Name Expr
| App Expr Expr
| Let Name Expr Expr
| BinOp Name Expr Expr
| UnOp Name Expr
deriving (Show)
data Assoc
= OpLeft
| OpRight
| OpNone
| OpPrefix
| OpPostfix
deriving Show
data Decl
= LetDecl Expr
| OpDecl OperatorDef
deriving (Show)
type Op x = Ex.Operator String ParseState Identity x
type Parser a = Parsec String ParseState a
data ParseState = ParseState [OperatorDef] deriving Show
data OperatorDef = OperatorDef {
oassoc :: Assoc
, oprec :: Integer
, otok :: Name
} deriving Show
lexer :: Tok.GenTokenParser String u Identity
lexer = Tok.makeTokenParser style
where ops = ["->","\\","+","*","<","=","[","]","_"]
names = ["let","in","infixl", "infixr", "infix", "postfix", "prefix"]
style = haskellStyle { Tok.reservedOpNames = ops
, Tok.reservedNames = names
, Tok.identLetter = alphaNum <|> oneOf "#'_"
, Tok.commentLine = "--"
}
reserved = Tok.reserved lexer
reservedOp = Tok.reservedOp lexer
identifier = Tok.identifier lexer
parens = Tok.parens lexer
brackets = Tok.brackets lexer
braces = Tok.braces lexer
commaSep = Tok.commaSep lexer
semi = Tok.semi lexer
integer = Tok.integer lexer
chr = Tok.charLiteral lexer
str = Tok.stringLiteral lexer
operator = Tok.operator lexer
contents :: Parser a -> Parser a
contents p = do
Tok.whiteSpace lexer
r <- p
eof
return r
expr :: Parser Expr
expr = do
es <- many1 term
return (foldl1 App es)
lambda :: Parser Expr
lambda = do
reservedOp "\\"
args <- identifier
reservedOp "->"
body <- expr
return $ Lam args body
letin :: Parser Expr
letin = do
reserved "let"
x <- identifier
reservedOp "="
e1 <- expr
reserved "in"
e2 <- expr
return (Let x e1 e2)
variable :: Parser Expr
variable = do
x <- identifier
return (Var x)
addOperator :: OperatorDef -> Parser ()
addOperator a = P.modifyState $ \(ParseState ops) -> ParseState (a : ops)
mkTable :: ParseState -> [[Op Expr]]
mkTable (ParseState ops) =
map (map toParser) $
groupBy ((==) `on` oprec) $
reverse $ sortBy (compare `on` oprec) $ ops
toParser :: OperatorDef -> Op Expr
toParser (OperatorDef ass _ tok) = case ass of
OpLeft -> infixOp tok (BinOp tok) (toAssoc ass)
OpRight -> infixOp tok (BinOp tok) (toAssoc ass)
OpNone -> infixOp tok (BinOp tok) (toAssoc ass)
OpPrefix -> prefixOp tok (UnOp tok)
OpPostfix -> postfixOp tok (UnOp tok)
where
toAssoc OpLeft = Ex.AssocLeft
toAssoc OpRight = Ex.AssocRight
toAssoc OpNone = Ex.AssocNone
toAssoc _ = error "no associativity"
infixOp :: String -> (a -> a -> a) -> Ex.Assoc -> Op a
infixOp x f = Ex.Infix (reservedOp x >> return f)
prefixOp :: String -> (a -> a) -> Ex.Operator String u Identity a
prefixOp name f = Ex.Prefix (reservedOp name >> return f)
postfixOp :: String -> (a -> a) -> Ex.Operator String u Identity a
postfixOp name f = Ex.Postfix (reservedOp name >> return f)
term :: Parser Expr
term = do
tbl <- getState
let table = mkTable tbl
Ex.buildExpressionParser table aexp
aexp :: Parser Expr
aexp = letin
<|> lambda
<|> variable
<|> parens expr
letdecl :: Parser Decl
letdecl = do
e <- expr
return $ LetDecl e
opleft :: Parser Decl
opleft = do
reserved "infixl"
prec <- integer
sym <- parens operator
let op = (OperatorDef OpLeft prec sym)
addOperator op
return $ OpDecl op
opright :: Parser Decl
opright = do
reserved "infixr"
prec <- integer
sym <- parens operator
let op = (OperatorDef OpRight prec sym)
addOperator op
return $ OpDecl op
opnone :: Parser Decl
opnone = do
reserved "infix"
prec <- integer
sym <- parens operator
let op = (OperatorDef OpNone prec sym)
addOperator op
return $ OpDecl op
opprefix :: Parser Decl
opprefix = do
reserved "prefix"
prec <- integer
sym <- parens operator
let op = OperatorDef OpPrefix prec sym
addOperator op
return $ OpDecl op
oppostfix :: Parser Decl
oppostfix = do
reserved "postfix"
prec <- integer
sym <- parens operator
let op = OperatorDef OpPostfix prec sym
addOperator op
return $ OpDecl op
decl :: Parser Decl
decl =
try letdecl
<|> opleft
<|> opright
<|> opnone
<|> opprefix
<|> oppostfix
top :: Parser Decl
top = do
x <- decl
P.optional semi
return x
modl :: Parser [Decl]
modl = many top
parseModule :: SourceName -> String -> Either ParseError [Decl]
parseModule filePath = P.runParser (contents modl) (ParseState []) filePath
main :: IO ()
main = do
input <- readFile "test.in"
let res = parseModule "<stdin>" input
case res of
Left err -> print err
Right ast -> mapM_ print ast
For example input try:
infixl 3 ($);
infixr 4 (#);
infix 4 (.);
prefix 10 (-);
postfix 10 (!);
let z = y in a $ a $ (-a)!;
let z = y in a # a # a $ b; let z = y in a # a # a # b;
ジェネリックなパース処理
Previously we defined generic operations for pretty printing and this begs the question of whether we can write a parser on top of Generics. The answer is generally yes, so long as there is a direct mapping between the specific lexemes and sum and products types. Consider the simplest case where we just read off the names of the constructors using the regular Generics machinery and then build a Parsec parser terms of them.
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}
import Text.Parsec
import Text.Parsec.Text.Lazy
import Control.Applicative ((<*), (<*>), (<$>))
import GHC.Generics
class GParse f where
gParse :: Parser (f a)
-- Type synonym metadata for constructors
instance (GParse f, Constructor c) => GParse (C1 c f) where
gParse =
let con = conName (undefined :: t c f a) in
(fmap M1 gParse) <* string con
-- Constructor names
instance GParse f => GParse (D1 c f) where
gParse = fmap M1 gParse
-- XXX
{-instance GParse f => GParse (S1 c f) where-}
{-gParse = fmap M1 gParse-}
-- Sum types
instance (GParse a, GParse b) => GParse (a :+: b) where
gParse = try (fmap L1 gParse <|> fmap R1 gParse)
-- Product types
instance (GParse f, GParse g) => GParse (f :*: g) where
gParse = (:*:) <$> gParse <*> gParse
-- Nullary constructors
instance GParse U1 where
gParse = return U1
data Scientist
= Newton
| Einstein
| Schrodinger
| Feynman
deriving (Show, Generic)
data Musician
= Vivaldi
| Bach
| Mozart
| Beethoven
deriving (Show, Generic)
gparse :: (Generic g, GParse (Rep g)) => Parser g
gparse = fmap to gParse
scientist :: Parser Scientist
scientist = gparse
musician :: Parser Musician
musician = gparse
λ: parseTest parseMusician "Bach"
Bach
λ: parseTest parseScientist "Feynman"
Feynman
attoparsec
Attoparsec is a parser combinator like Parsec but more suited for bulk parsing of large text and binary files instead of parsing language syntax to ASTs. When written properly Attoparsec parsers can be extremely efficient.
One notable distinction between Parsec and Attoparsec is that backtracking
operator (try
) is not present and reflects on attoparsec's different
underlying parser model.
For a simple little lambda calculus language we can use attoparsec much in the same we used parsec:
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -fno-warn-unused-do-bind #-}
import Control.Applicative
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Data.List (foldl1')
data Name
= Gen Int
| Name T.Text
deriving (Eq, Show, Ord)
data Expr
= Var Name
| App Expr Expr
| Lam [Name] Expr
| Lit Int
| Prim PrimOp
deriving (Eq, Show)
data PrimOp
= Add
| Sub
| Mul
| Div
deriving (Eq, Show)
data Defn = Defn Name Expr
deriving (Eq, Show)
name :: Parser Name
name = Name . T.pack <$> many1 letter
num :: Parser Expr
num = Lit <$> signed decimal
var :: Parser Expr
var = Var <$> name
lam :: Parser Expr
lam = do
string "\\"
vars <- many1 (skipSpace *> name)
skipSpace *> string "->"
body <- expr
return (Lam vars body)
eparen :: Parser Expr
eparen = char '(' *> expr <* skipSpace <* char ')'
prim :: Parser Expr
prim = Prim <$> (
char '+' *> return Add
<|> char '-' *> return Sub
<|> char '*' *> return Mul
<|> char '/' *> return Div)
expr :: Parser Expr
expr = foldl1' App <$> many1 (skipSpace *> atom)
atom :: Parser Expr
atom = try lam
<|> eparen
<|> prim
<|> var
<|> num
def :: Parser Defn
def = do
skipSpace
nm <- name
skipSpace *> char '=' *> skipSpace
ex <- expr
skipSpace <* char ';'
return $ Defn nm ex
file :: T.Text -> Either String [Defn]
file = parseOnly (many def <* skipSpace)
parseFile :: FilePath -> IO (Either T.Text [Defn])
parseFile path = do
contents <- T.readFile path
case file contents of
Left a -> return $ Left (T.pack a)
Right b -> return $ Right b
main :: IO (Either T.Text [Defn])
main = parseFile "simple.ml"
For an example try the above parser with the following simple lambda expression.
f = g (x - 1);
g = f (x + 1);
h = \x y -> (f x) + (g y);
Attoparsec adapts very well to binary and network protocol style parsing as well, this is extracted from a small implementation of a distributed consensus network protocol:
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad
import Data.Attoparsec
import Data.Attoparsec.Char8 as A
import Data.ByteString.Char8
data Action
= Success
| KeepAlive
| NoResource
| Hangup
| NewLeader
| Election
deriving Show
type Sender = ByteString
type Payload = ByteString
data Message = Message
{ action :: Action
, sender :: Sender
, payload :: Payload
} deriving Show
proto :: Parser Message
proto = do
act <- paction
send <- A.takeTill (== '.')
body <- A.takeTill (A.isSpace)
endOfLine
return $ Message act send body
paction :: Parser Action
paction = do
c <- anyWord8
case c of
1 -> return Success
2 -> return KeepAlive
3 -> return NoResource
4 -> return Hangup
5 -> return NewLeader
6 -> return Election
_ -> mzero
main :: IO ()
main = do
let msgtext = "\x01\x6c\x61\x70\x74\x6f\x70\x2e\x33\x2e\x31\x34\x31\x35\x39\x32\x36\x35\x33\x35\x0A"
let msg = parseOnly proto msgtext
print msg