Onboarding Parsec Library from Self-Defined Parser¶

Previously, I wrote my own parser for the Solidity language syntax and finished the expression part. It combines ExceptT and State to maintain the parsing state and report errors.

type ErrMsg = Text

type Parser a = ExceptT ErrMsg (State Text) a

The original parser has no real error reporting, which is definitely not the right approach. Moreover, I believe an open-source parser library works better than mine, as I'm still new to Haskell. Hence, I chose to replace my custom parser with Parsec.

This post covers my investigation of Parsec and the problems I encountered during the migration. Besides these, I will compare some patterns that differ between Parsec and my own parser.

Investigation¶

Define Our Parser Using Parsec¶

Parsec provides a transformer, ParsecT, that lets users define their own parsers.

ParsecT s u m a is a parser with stream type s, user state type u, underlying monad m and return type a. Parsec is strict in the user state.

This is a bit complex, especially the underlying monad and the user state type. Don't worry, let's start with Parsec. It uses Identity as the underlying monad, so the monad parameter doesn't matter for simple usage.

data ParsecT s u m a

type Parsec s u = ParsecT s u Identity

Hence, we can redefine the parser like this:

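{-# LANGUAGE OverloadedStrings #-}  -- lets string literals be used as Text input

import Text.Parsec
import Data.Text (Text, pack, unpack)  -- explicit list avoids name clashes

type MyParser a = Parsec Text () a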
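The user state parameter u is easier to understand with a small example. The following is a minimal sketch, not code from my parser: the alias CountingParser and the parsers pCountedWord and pWords are hypothetical names, and the Int counter is only there to show getState and modifyState in action.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Parsec

-- Hypothetical alias: the user state u is an Int counter instead of ().
type CountingParser a = Parsec Text Int a

-- Parse one word, bump the counter in the user state, skip trailing spaces.
pCountedWord :: CountingParser String
pCountedWord = many1 letter <* modifyState (+ 1) <* spaces

-- Parse all words and return them together with the final counter.
pWords :: CountingParser ([String], Int)
pWords = do
  ws <- many pCountedWord
  n <- getState
  return (ws, n)

main :: IO ()
main =
  -- runParser takes the initial user state (0) before the source name.
  print $ runParser pWords 0 "" "foo bar baz"
  -- Right (["foo","bar","baz"],3)
```

With () as the user state, as in MyParser, none of this bookkeeping is needed, which is why it is the usual choice for simple parsers.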

Define and Run Parser¶

Then, we can define some parsers to parse a word and a line.

type MyParser a = Parsec Text () a

pReadline :: MyParser Text
pReadline = pack <$> manyTill anyChar newline

pWord :: MyParser Text
pWord = pack <$> many1 (noneOf " \n")

Then we can run them via parse, which is similar to evalState:

main :: IO ()
main = do
  print $ parse pReadline "" "hello world\n"  -- Right "hello world"
  print $ parse pWord "" "hello world"        -- Right "hello"

Moreover, if you would like to inspect the remaining input, call getInput inside the parser to retrieve it and return it alongside the result. The example below uses runParser, which works like parse but also takes the initial user state (here ()):

main :: IO ()
main = do
  print $
    runParser
      ( pReadline
          >>= \result ->
            getInput
              >>= \rest -> return (result, rest)
      )
      ()
      ""
      "hello world\n left"

getInput works much like State's get method, which is what the code above relies on. It is almost the same as working in the State monad, so you can also use it to trace the input before and after a parser runs:

import Debug.Trace (trace)

pWord :: MyParser Text
pWord = do
    s <- getInput                  -- input before parsing the word
    trace (unpack s) $ pure ()
    r <- pack <$> many1 (noneOf " \n")
    s' <- getInput                 -- input left after parsing the word
    trace (unpack s') $ pure ()
    return r

Error Report and Alternative¶

Error reporting is also important in a parser. The previous parser used throwError from ExceptT to report errors; with Parsec we use fail instead:

pErr :: MyParser Text
pErr = fail "the error is intended"

The output contains the error message along with the position where the error was raised.

Left (line 1, column 1):
the error is intended
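Besides fail, Parsec also lets you label what a parser expected with the <?> operator. This is not something my original parser had, so treat the following as a small sketch: pFunctionKeyword and its label text are hypothetical names chosen for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (isInfixOf)
import Data.Text (Text)
import Text.Parsec

type MyParser a = Parsec Text () a

-- Hypothetical keyword parser; the label replaces the default
-- "expecting ..." part of the error message when the parser fails
-- without consuming input.
pFunctionKeyword :: MyParser String
pFunctionKeyword = try (string "function") <?> "the keyword 'function'"

main :: IO ()
main =
  case parse pFunctionKeyword "" "contract" of
    Left err -> do
      print err
      -- The rendered error mentions the label given to <?>.
      print ("the keyword 'function'" `isInfixOf` show err)
    Right kw -> putStrLn kw
```

Labels like this keep the position information shown above while making the "expecting" part of the message readable for users of the parser.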

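To choose between alternatives with <|>, the first parser is wrapped in try because we need to restore the original state if it fails. By default, a parser that fails after consuming input stops the parse and reports an error, so <|> never gets to run the second alternative.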

For example, to parse either a whole line or a single word, we can write the code below, which outputs Right "hello".

main :: IO ()
main = do
  let p = try pReadline <|> pWord
  print $ parse p "" "hello world"
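To see why try matters here, the sketch below runs the same alternative with and without it. It reuses pReadline and pWord from above; only the main function is new.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Either (isLeft)
import Data.Text (Text, pack)
import Text.Parsec

type MyParser a = Parsec Text () a

pReadline :: MyParser Text
pReadline = pack <$> manyTill anyChar newline

pWord :: MyParser Text
pWord = pack <$> many1 (noneOf " \n")

main :: IO ()
main = do
  -- Without try: pReadline consumes "hello world" before failing at the
  -- end of input, so <|> never runs pWord and the result is a Left.
  print $ isLeft (parse (pReadline <|> pWord) "" "hello world")
  -- True

  -- With try: the input is rewound on failure and pWord gets its turn.
  print $ parse (try pReadline <|> pWord) "" "hello world"
  -- Right "hello"
```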

Try¶

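The function try restores the stream state if the parser fails. For example, when pOneKeyword "string" (implemented below) runs on the stream str1, it matches the prefix str and then fails on 1, so the parser fails after consuming input and leaves 1 as the new state. Here Parser plays the role of MyParser above, and T is Data.Text imported qualified.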

pOneKeyword :: Text -> Parser Text
pOneKeyword s = T.pack <$> string (T.unpack s)

This behavior causes a problem because we usually want another alternative parser to consume the same input again, and <|> does not run the alternative once input has been consumed. To prevent this, use try (pOneKeyword "string") so that the stream is rewound on failure.
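The difference can be checked with a small program. This is a minimal sketch that reuses pOneKeyword as defined above; the second keyword "str1" is just an illustrative alternative.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Either (isLeft)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Parsec

type Parser a = Parsec Text () a

pOneKeyword :: Text -> Parser Text
pOneKeyword s = T.pack <$> string (T.unpack s)

main :: IO ()
main = do
  -- "string" matches the prefix "str" and fails on '1'. Input was
  -- consumed, so <|> does not try the second keyword.
  print $ isLeft (parse (pOneKeyword "string" <|> pOneKeyword "str1") "" "str1")
  -- True

  -- try rewinds the stream, so the second keyword starts from "str1".
  print $ parse (try (pOneKeyword "string") <|> pOneKeyword "str1") "" "str1"
  -- Right "str1"
```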

Pitfalls¶

Separator Parser Should Be Concise¶

Using sepBy to parse a pattern like a, b ,c ,d is a good idea, but note that the separator should be as concise as possible. In our case, the separator should be a single comma, not a comma surrounded by optional spaces.

do: char ','
don't: pManySpaces *> char ',' <* pManySpaces

The reason to avoid the latter is that once the separator parser consumes any input, sepBy believes it has found a separator and commits to parsing it. That is not always true, because the leading characters can be misleading. For example, when parsing a, new b(), c, sepBy sees the space after a, new, the separator parser consumes it, and sepBy then insists that a comma must follow, so the parse fails instead of ending the list.
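The two variants can be compared on a small input. The sketch below is not taken from my Solidity parser: pListGreedySep and pListConciseSep are hypothetical names, items are plain runs of letters, and letting each item skip its own surrounding spaces is my assumption about how to handle whitespace once the separator is a bare comma.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Either (isLeft)
import Data.Text (Text)
import Text.Parsec

type Parser a = Parsec Text () a

-- don't: the separator itself skips spaces around the comma.
pListGreedySep :: Parser [String]
pListGreedySep = sepBy (many1 letter) (spaces *> char ',' <* spaces)

-- do: the separator is only the comma; each item skips its own spaces.
pListConciseSep :: Parser [String]
pListConciseSep = sepBy (spaces *> many1 letter <* spaces) (char ',')

main :: IO ()
main = do
  -- After "b" the separator consumes the space, then fails on 'c'.
  -- Input was consumed, so the whole sepBy fails.
  print $ isLeft (parse pListGreedySep "" "a, b c")
  -- True

  -- The bare comma fails on 'c' without consuming input, so the list
  -- simply ends after "b".
  print $ parse pListConciseSep "" "a, b c"
  -- Right ["a","b"]
```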

ManyTill + LookAhead¶

When we parse the decorators of a function, parsing should stop at the returns keyword or at { without consuming it. To do so, we use manyTill together with lookAhead, which returns the result of a parser without consuming the stream:

pFunctionDecorators :: Parser [FnDecorator]
pFunctionDecorators = do
  pManySpaces
    *> manyTill
      ( ( FnDecV <$> try pFnDeclVisibility
            <|> pFnDeclVirtual
            <|> (FnDecS <$> try pFnDeclStateMutability)
            <|> (FnDecOs <$> try pOverrideSpecifier)
            -- The modifier invocation must come last; otherwise 'override'
            -- and 'virtual' would be parsed as modifier invocations,
            -- which is definitely wrong.
            <|> (FnDecMI <$> try pFnDeclModifierInvocation)
        )
          <* pMany1Spaces
      )
      ( lookAhead $
          try (pOneKeyword "returns")
            <|> try (pOneKeyword "{")
            <|> eof $> ""
      )
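Since pFunctionDecorators depends on the rest of my parser, here is a smaller self-contained sketch of the same manyTill plus lookAhead pattern. The names pWordToken, pStop and pDecorators are hypothetical, and the decorators are simplified to plain words.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Functor (($>))
import Data.Text (Text)
import Text.Parsec

type Parser a = Parsec Text () a

-- Hypothetical stand-in for the decorator parsers: a run of letters
-- followed by optional spaces.
pWordToken :: Parser String
pWordToken = many1 letter <* spaces

-- Succeeds in front of "returns", "{" or end of input, consuming nothing.
-- try is needed because a word such as "revert" shares a prefix with
-- "returns" and would otherwise fail after consuming input.
pStop :: Parser String
pStop = lookAhead $ try (string "returns") <|> string "{" <|> (eof $> "")

-- Collect the words before the stop marker and return the untouched rest.
pDecorators :: Parser ([String], Text)
pDecorators = do
  ds <- manyTill pWordToken pStop
  rest <- getInput
  return (ds, rest)

main :: IO ()
main = print $ parse pDecorators "" "public view returns (uint)"
-- Right (["public","view"],"returns (uint)")
```

The remaining input still starts with returns, so the next parser in the chain can consume the keyword itself.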