Skip to content

Simple parser combinators for Common Lisp.

License

Notifications You must be signed in to change notification settings

fosskers/parcom

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

parcom - Parser Combinators

parcom is a consise Parser Combinator library in the style of Haskell’s parsec and Rust’s nom.

(in-package :parcom)
(parse (*> (string "Tempus") #'space (string "fugit")) "Tempus fugit.")
fugit

parcom operates strictly on strings, not streamed byte data, but is otherwise “zero copy” in that extracted substrings of the original input are not reallocated.

parcom has no dependencies.

Table of Contents

Compatibility

CompilerStatus
SBCL
ECL
Clasp
ABCL
CCL
Clisp
Allegro
LispWorks

Performance

parcom operates over simple-string input, accessing its elements with schar. The parcom/json component has been measured to parse JSON at ~30mb/s on a 2016 ThinkPad, which should be sufficient for ordinary usage. Its performance is generally competitive across compilers, while remaining expressive and light in implementation.

See also this article about some of the optimization techniques used within parcom.

JSON Benchmarks

In parsing a 25mb JSON file from an in-memory string.

SBCL:

MethodTime (ms)Memory (bytes)
jsown150135m
jzon300132m
shasht600400m
parcom/json750260m
yason1700400m

ECL:

MethodTime (s)Memory (bytes)
shasht3.11.8b
jzon4.11.6b
parcom/json4.7980m
yason5.71.9b
jsown6.2165m

ABCL:

MethodTime (s)Memory (cons cells)
parcom/json6.92m
jsown8.025m
jzon12.0864k
shasht91.9110m
yason328.7366m

API

The examples below use (in-package :parcom) for brevity, but it’s assumed that you’ll use a local nickname like p or pc in your actual code. Further, most examples run the parsers with parse, but occasionally funcall is used instead to demonstrate what the remaining input offset would be after the parse succeeded. You will generally be using parse in your own code.

Types and Running Parsers

All parsers have the function signature offset -> (value, offset), where offset is the current parsing offset. The new value and offset must be yielded via values as multiple return values, as this is the most memory-efficient.

(in-package :parcom)
(funcall (string "Hello") (in "Hello there"))
"Hello", 5

Of course, a parser might fail, in which case a failure message and the offset are returned:

(in-package :parcom)
(funcall (string "Hello") (in "Bye!"))
:FAIL, 0

In general though, we call parse to fully run some combined parsers and yield the final output:

(in-package :parcom)
(apply #'+ (parse (sep (char #\.) #'unsigned) "123.456.789!"))
1368

parse otherwise ignores any final, unconsumed input. It will also raise a Condition if the parsing failed.

Parsers

A “parser” is a function that consumes some specific input and yields a single result.

Characters and Strings

char

Parse a given character.

(in-package :parcom)
(parse (char #\a) "apple")
#\a

string

Parse the given string. The parsed string is a slice into the original input.

(in-package :parcom)
(parse (string "Hello") "Hello there!")
Hello

any

Parse any character.

(in-package :parcom)
(parse #'any "Hello there!")
#\H

any-but

Parse any character except the one you don’t want.

(in-package :parcom)
(parse (any-but #\!) "Hello there!")
#\H
(in-package :parcom)
(funcall (any-but #\H) (in "Hello there!"))
:FAIL, 0

any-if

Any character that passes the predicate.

(in-package :parcom)
(parse (any-if #'digit?) "8a")
#\8

hex

Parse a hex character of any case.

(in-package :parcom)
(parse (many #'hex) "abcd0efgh")
(#\a #\b #\c #\d #\0 #\e #\f)

sneak

Yield the given char if it’s the next one, but don’t advance the offset. Like peek, but character-based and thus more performant.

(in-package :parcom)
(funcall (sneak #\a) (in "aaabcd"))
#\a, 0

eof

Recognize the end of the input.

(in-package :parcom)
(parse #'eof "")
T
(in-package :parcom)
(parse (*> (string "Mālum") #'eof) "Mālum")
T
(in-package :parcom)
(funcall (*> (string "Mālum") #'eof) (in "Mālum rubrum"))
:FAIL, 5

Numbers

unsigned

Parse a positive integer into a fixnum.

(in-package :parcom)
(parse #'unsigned "44")
44

integer

Parse a positive or negative integer into a fixnum.

(in-package :parcom)
(parse #'integer "-44")
-44

float

Parse a positive or negative floating point number into a double-float.

(in-package :parcom)
(parse #'float "123.0456")
123.0456d0

Whitespace

newline

Matches a single newline character.

(in-package :parcom)
(let ((s (concatenate 'simple-string '(#\newline #\a #\b #\c)))) ; "\nabc"
(parse #'newline s))
#\Newline

space, space1

Parse 0 or more ASCII whitespace and tab characters.

(in-package :parcom)
(length (parse #'space "   Salvē!"))
3

Parse 1 or more ASCII whitespace and tab characters.

(in-package :parcom)
(length (parse #'space1 "   Salvē!"))
3
(in-package :parcom)
(funcall #'space1 (in "Salvē!"))
:FAIL, 0

multispace, multispace1

Parse 0 or more ASCII whitespace, tabs, newlines, and carriage returns.

(in-package :parcom)
(length (parse #'multispace (concatenate 'simple-string '(#\tab #\newline #\tab))))
3

Parse 1 or more ASCII whitespace, tabs, newlines, and carriage returns.

(in-package :parcom)
(length (parse #'multispace1 (concatenate 'simple-string '(#\tab #\newline #\tab))))
3
(in-package :parcom)
(funcall #'multispace1 (in "Ārcus"))
:FAIL, 0

Taking in Bulk

These always yield a substring borrowed directly from the original input.

take

Take n characters from the input.

(in-package :parcom)
(parse (take 3) "Arbor")
Arb

It’s okay for n to be too large:

(in-package :parcom)
(parse (take 100) "Arbor")
Arbor

take-while, take-while1

Take characters while some predicate holds.

(in-package :parcom)
(parse (take-while (lambda (c) (equal #\a c))) "aaabbb")
aaa

take-while1 is like take-while, but must yield at least one character.

(in-package :parcom)
(funcall (take-while1 (lambda (c) (equal #\a c))) (in "bbb"))
:FAIL, 0

consume

A faster version of take-while and skip when you know you’re character-based and don’t need the parsed output.

(in-package :parcom)
(funcall (consume (lambda (c) (equal #\a c))) (in "aaabbb"))
T, 3

rest

Consume the rest of the input. Always succeeds.

(in-package :parcom)
(parse (<*> (string "Salvē") (*> #'space #'rest)) "Salvē domine!")
("Salvē" "domine!")

Other

pure

Consume no input and just yield a given value.

(in-package :parcom)
(parse (pure :pāx) "Bellum")
:PĀX

Useful for chaining with other compound parsers to inject values into the results.

(in-package :parcom)
(parse (<*> (<*> (pure :pāx) (string "PĀX"))
            #'multispace
            (<*> (pure :bellum) (string "BELLUM")))
       "PĀX BELLUM")
((:PĀX "PĀX") " " (:BELLUM "BELLUM"))

Combinators

“Combinators” combine child parsers together to form compound results. They allow us to express intent like “parse this then that” and “parse this, then maybe that, but only if…” etc.

*>, right

Run multiple parsers one after another, but yield the value of the rightmost one. right is an alias.

(in-package :parcom)
(parse (*> (char #\!) #'unsigned) "!123?")
123

<*, left

Run multiple parsers one after another, but yield the value of the leftmost one. left is an alias.

(in-package :parcom)
(parse (<* (char #\!) #'unsigned) "!123?")
#\!

<*>, all

Combination of parsers yielding all results as a list. all is an alias.

(in-package :parcom)
(parse (<*> #'unsigned (char #\!) #'unsigned) "123!456")
(123 #\! 456)

This library does not offer a currying mechanism, so the technique usually available in Haskell of fmap’ing a function over chain of <*> must be done instead with apply:

(in-package :parcom)
(apply #'+ (parse (<*> #'unsigned (*> (char #\!) #'unsigned)) "123!456"))
579

<$, instead

Run some parser, but substitute its inner value with something else if parsing was successful. instead is an alias.

(in-package :parcom)
(parse (<$ :roma (string "Roma")) "Roma!")
:ROMA

alt

Accept the results of the first parser from a group to succeed. Can combine as many parsers as you want.

(in-package :parcom)
(parse (alt (string "dog") (string "cat")) "cat")
cat

opt

Yield nil if the parser failed, but don’t fail the whole process nor consume any input.

(in-package :parcom)
(parse (opt (string "Ex")) "Exercitus")
Ex
(in-package :parcom)
(parse (opt (string "Ex")) "Facēre")
NIL

between

A main parser flanked by two other ones. Only the value of the main parser is kept. Good for parsing backets, parentheses, etc.

(in-package :parcom)
(parse (between (char #\!) (string "Salvē") (char #\!)) "!Salvē!")
Salvē

many, many1

many parses 0 or more occurrences of a parser. many1 demands that at least one parse succeeds or a Condition will be raised.

(in-package :parcom)
(parse (many (alt (string "ovēs") (string "avis"))) "ovēsovēsavis!")
("ovēs" "ovēs" "avis")

sep, sep1

sep parses 0 or more instances of a parser separated by some sep parser. sep1 demands that at least one parse succeeds or a Condition will be raised.

(in-package :parcom)
(parse (sep (char #\!) (string "pilum")) "pilum!pilum!pilum.")
("pilum" "pilum" "pilum")

Critically, if a separator is detected, the parent parser must also then succeed or the entire combination fails. For example, this will not parse due to the ! on the end:

(in-package :parcom)
(funcall (sep (char #\!) (string "pilum")) (in "pilum!pilum!pilum!"))
:FAIL, 18

For more lenient behaviour regarding the separator, see sep-end.

sep-end, sep-end1

The same as sep, but the separator may appear at the end of the final “parent”. Likewise, sep-end1 demands that at least one parse of the parent succeeds.

(in-package :parcom)
(parse (sep-end (char #\!) (string "pilum")) "pilum!pilum!pilum!scūtum")
("pilum" "pilum" "pilum")

skip

Parse some parser 0 or more times, but throw away all the results.

(in-package :parcom)
(parse (*> (skip (char #\!)) #'unsigned) "!!!123")
123

peek

Yield the value of a parser, but don’t consume the input.

(in-package :parcom)
(funcall (peek (string "he")) (in "hello"))
he

count

Apply a parser a given number of times and collect the results as a list.

(in-package :parcom)
(funcall (count 3 (char #\a)) (in "aaaaaa"))
(#\a #\a #\a), 3

take-until

Take characters until another parser succeeds. Does not advance the offset by the subparser.

(in-package :parcom)
(funcall (take-until (char #\')) (in "abcd'"))
"abcd", 4

If the subparser is just looking for a single char like the above, use take-while or consume instead. take-until is intended for more complex halting conditions that can’t easily be detected by a char-by-char predicate function.

recognize

If the given parser was successful, return the consumed input as a string instead.

(in-package :parcom)
(funcall (recognize (<*> (string "hi") #'unsigned)) (in "hi123there"))
"hi123", 5

Utilities

empty?

Is a given string empty?

(in-package :parcom)
(empty? "")
T

digit?

Is a given character a number from 0 to 9?

(in-package :parcom)
(digit? #\7)
T

fmap

Apply a pure function to the result of a successful parse.

(in-package :parcom)
(fmap #'1+ (funcall #'unsigned (in "1")))
2, 1

pmap

Similar to fmap, but this transforms a parser into another one, altering its inner result if it happened to be successful.

(in-package :parcom)
(parse (pmap #'1+ #'unsigned) "123")
124

const

Yield a function that ignores its input and returns some original seed.

(in-package :parcom)
(funcall (const 1) 5)
1

JSON

By depending on the optional parcom/json system, you can parse JSON strings or include parcom-compatible JSON parsers into your own custom parsing code.

(in-package :parcom/json) is used below for brevity, but it’s assumed that in your own code you will use a nickname, perhaps pj.

If you don’t care about the individual parsers per se and just want to simply parse some JSON, use pj:parse.

Conversions:

JSONLisp
trueT
falseNIL
ArrayVector
ObjectHash Table
Numberdouble-float
StringString
null:NULL

Performance Note

As with the parent parcom library, parcom/json works strictly off of strings. With SBCL it parses JSON at about 30mb/s on my 2016 ThinkPad, which should be sufficient for ordinary usage. For a more “industrial strength” JSON parsing library, see jzon which parses about 3x faster than parcom/json.

parse

Attempt to parse any JSON value. Analogous to parse from the main library.

(in-package :parcom/json)
(parse "{\"x\": 1, \"y\": 2, \"z\": [1, {\"a\":true}]}")
#<HASH-TABLE :TEST EQUAL :COUNT 3 {1004C0B293}>
(in-package :parcom/json)
(parse "[1.9,true,3e+7,\"hi\",[4],null]")
#(1.9d0 T 3.0d7 "hi" #(4.0d0) :NULL)

Non-ascii and unicode characters are supported:

(in-package :parcom/json)
(parse "\"hēllお🐂\\u03B1\"")
hēllお🐂α

json

Parse any kind of JSON (the actual parser).

(in-package :parcom/json)
(json (parcom:in "{\"x\": 1, \"y\": 2, \"z\": [1, {\"a\":true}]}  "))
#<HASH-TABLE :TEST EQUAL :COUNT 3 {1004CA4C63}>, 38

There are other subparsers exposed, but they are left out here for brevity. Please consult the source code if you need them.

Dates and Times

The parcom/datetime system provides types and parsers for RFC3339 timestamps.

(in-package :parcom/datetime) is used below for brevity, but it’s assumed that in your own code you will use a nickname, perhaps pd.

As with the other parcom libraries, this has no external dependencies, which is an advantage over the otherwise excellent local-time library, which depends on heavy uiop.

parse

parse is lenient, and will parse any kind of date or time you give it.

(in-package :parcom/datetime)
(parse "1975-04-05")
#S(LOCAL-DATE :YEAR 1975 :MONTH 4 :DAY 5)
(in-package :parcom/datetime)
(parse "1975-04-05T04:05:06+03:00")
#S(OFFSET-DATE-TIME
   :DATE #S(LOCAL-DATE :YEAR 1975 :MONTH 4 :DAY 5)
   :TIME #S(LOCAL-TIME :HOUR 4 :MINUTE 5 :SECOND 6 :MILLIS 0)
   :OFFSET #S(OFFSET :HOUR 3 :MINUTE 0))

It’s up to you to handle the concrete type that you’re returned. See the date and time generic functions below.

now

Right now!

(in-package :parcom/datetime)
(now)
#S(OFFSET-DATE-TIME
   :DATE #S(LOCAL-DATE :YEAR 2025 :MONTH 5 :DAY 5)
   :TIME #S(LOCAL-TIME :HOUR 10 :MINUTE 0 :SECOND 28 :MILLIS 0)
   :OFFSET #S(OFFSET :HOUR 9 :MINUTE 0))

It’s a cloudy May morning.

date

Regardless of what parsed, you can usually pull a local-date out of it.

(in-package :parcom/datetime)
(date (parse "1975-04-05T04:05:06+03:00"))
#S(LOCAL-DATE :YEAR 1975 :MONTH 4 :DAY 5)

time

Regardless of what parsed, you can usually pull a local-time out of it.

(in-package :parcom/datetime)
(time (parse "1975-04-05T04:05:06+03:00"))
#S(LOCAL-TIME :HOUR 4 :MINUTE 5 :SECOND 6 :MILLIS 0)

format

To convert your object back into something human-readable. Note that this is different from cl:format!

(in-package :parcom/datetime)
(format nil (date (parse "1975-04-05T04:05:06+03:00")))
1975-04-05

Writing your own Parsers

Basics

The whole point of Parser Combinators is that it becomes simple to write your own parsing functions. Recall that a “fully realized” parser has the signature offset -> (value, offset). In the simplest case, a parser of yours could look like:

(in-package :parcom)

(defun excited-apple (offset)
  (funcall (<* (string "Mālum") (char #\!)) offset))

(funcall #'excited-apple (in "Mālum! Ō!"))
"Mālum", 6

Wherein you utilize the combinators provided by this library to build up composite parsers that are useful to you.

Parameterized Parsers

You can also parameterize your parsers, similar to parsers like take or combinators like count:

(in-package :parcom)

(defun excited-apple (offset)
  (funcall (<* (string "Mālum") (char #\!)) offset))

(defun excited-apples (n)
  "Parse a certain number of excited apples."
  (lambda (offset)
    (funcall (count n #'excited-apple) offset)))

(funcall (excited-apples 3) (in "Mālum!Mālum!Mālum!Mālum!"))
("Mālum" "Mālum" "Mālum"), 18

So, if your parser is parameterized by some initial argument, it has to return a lambda that accepts an offset.

Failure

You can use ok? and failure? within more complex hand-written parsers to explicitly test for sub-parser failure, and then react accordingly. Yielding :fail signals that parsing has failed overall.

(in-package :parcom)

(defun three-sad-pears (offset)
  (multiple-value-bind (res next) (funcall (many (string "Pirum trīste")) offset)
    (if (or (failure? res)
            (< (length res) 3)
            (> (length res) 3))
        (fail next)
        (values res next))))

(three-sad-pears (in "Pirum trīste"))
:FAIL, 12