17. Text Processing III

This module covers text parsing. It may be considered optional and skipped if you are speedrunning Hoon School.

A parser is a tool that accepts a tape of characters and turns it into something else, something a program can compute with.

For instance, a calculator could accept an input like 3+4 and return 7. A command-line interface may look for a program to evaluate (like Bash and ls). A search bar may apply logic to the query (like Google and - for NOT).

The basic problem all parsers face is this:

  1. You need to accept a character string.

  2. You need to ingest one or more characters and decide what they “mean”, including storing the result of this meaning.

  3. You need to repeat step 2 again and again until you are out of characters.

The Hoon Parser

We could build a simple parser out of a trap and +snag, but it would be brittle and difficult to extend. The Hoon parser is very sophisticated, since it has to take a file of ASCII characters (and some UTF-8 strings) and turn it via an AST into Nock code. What makes parsing challenging is that we have to wade directly into a sea of new types and processes. To wit:

  • A $tape is the string to be parsed.

  • A $hair is the position in the text the parser is at, as a cell of line & column, [p=@ud q=@ud].

  • A $nail is parser input, a cell of $hair and $tape.

  • An $edge is parser output, a pair of a $hair and a +unit containing a pair of the result and a $nail. (There are some subtleties around failure-to-parse here that we'll defer a moment.)

  • A $rule is a parser, a gate which takes a $nail and yields an $edge.

Basically, one uses a $rule on [hair tape] to yield an $edge.
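
The failure case mentioned above can be seen by applying a $rule directly to a $nail whose $tape does not match. A minimal sketch, using +just (introduced below) on an input chosen only for illustration: the resulting $edge carries an empty unit (~) instead of a result, and its $hair marks where parsing stopped.

> ((just 'a') [[1 1] "xyz"])
[p=[p=1 q=1] q=~]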

A substantial swath of the standard library is built around parsing for various scenarios, and there's a lot to know to effectively use these tools. If you can parse arbitrary input using Hoon after this lesson, you're in fantastic shape for building things later. It's worth spending extra effort to understand how these programs work.

There is a full guide on parsing which goes into more detail than this quick overview.

Scanning Through a $tape

+scan takes a $tape and a $rule, and either produces the parsed result or crashes, simple enough. It will be our workhorse. All we really need to know in order to use it is how to build a $rule.

Here we preview +shim, which matches characters within a given range, in this case lower-case letters. Changing the range changes what is accepted: for example, (shim ' ' 'z') spans from ASCII 32 (' ') to ASCII 122 ('z'), so it also accepts the space that makes the second example fail.

> `(list)`(scan "after" (star (shim 'a' 'z')))  
~[97 102 116 101 114]  

> `(list)`(scan "after the" (star (shim 'a' 'z')))
{1 6}  
syntax error  
dojo: hoon expression failed
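
As a sketch of the wider range, the same input parses once the +shim starts at the space character. The cast to $tape is only there to make the dojo print the result as a string.

> `tape`(scan "after the" (star (shim ' ' 'z')))
"after the"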

$rule Building

The $rule-building system is vast, and achieving the desired effect often requires combining several components.

$rules to parse fixed strings

+just takes in a single $char and produces a $rule that attempts to match that $char to the first character in the $tape of the input $nail.

> ((just 'a') [[1 1] "abc"])
[p=[p=1 q=2] q=[~ [p='a' q=[p=[p=1 q=2] q="bc"]]]]

+jest matches a $cord. It takes an input $cord and produces a $rule that attempts to match that $cord against the beginning of the input.

> ((jest 'abc') [[1 1] "abc"])
[p=[p=1 q=4] q=[~ [p='abc' q=[p=[p=1 q=4] q=""]]]]

> ((jest 'abc') [[1 1] "abcabc"])
[p=[p=1 q=4] q=[~ [p='abc' q=[p=[p=1 q=4] q="abc"]]]]

> ((jest 'abc') [[1 1] "abcdef"])
[p=[p=1 q=4] q=[~ [p='abc' q=[p=[p=1 q=4] q="def"]]]]

(Keep an eye on the structure of the return $edge there.)

+shim parses characters within a given range. It takes in two atoms and returns a $rule.

> ((shim 'a' 'z') [[1 1] "abc"])
[p=[p=1 q=2] q=[~ [p='a' q=[p=[p=1 q=2] q="bc"]]]]

+next is a simple $rule that takes in the next character and returns it as the parsing result.

> (next [[1 1] "abc"])
[p=[p=1 q=2] q=[~ [p='a' q=[p=[p=1 q=2] q="bc"]]]]

$rules to parse flexible strings

So far we can only parse one character at a time, which isn't much better than just using +snag in a trap.

> (scan "a" (shim 'a' 'z'))  
'a'  

> (scan "ab" (shim 'a' 'z'))  
{1 2}  
syntax error  
dojo: hoon expression failed

How do we parse multiple characters in order to break things up sensibly?

+star applies a $rule zero or more times and collects the results into a list.

> (scan "a" (just 'a'))
'a'

> (scan "aaaaa" (just 'a'))
! {1 2}
! 'syntax-error'
! exit

> (scan "aaaaa" (star (just 'a')))
"aaaaa"
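
Because +star accepts zero matches, it also succeeds on empty input. A small sketch, cast to $tape so the empty list prints as an empty string:

> `tape`(scan "" (star (just 'a')))
""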

+plug takes the $nail in the $edge produced by one rule and passes it to the next $rule, forming a cell of the results as it proceeds.

> (scan "starship" ;~(plug (jest 'star') (jest 'ship')))
['star' 'ship']

+pose tries each $rule you hand it successively until it finds one that works.

> (scan "a" ;~(pose (just 'a') (just 'b')))
'a'

> (scan "b" ;~(pose (just 'a') (just 'b')))
'b'

> (;~(pose (just 'a') (just 'b')) [1 1] "ab")
[p=[p=1 q=2] q=[~ u=[p='a' q=[p=[p=1 q=2] q=[i='b' t=""]]]]]

+glue parses a delimiter (a $rule) in between each $rule and forms a cell of the results of each non-delimiter $rule. Delimiters representing each symbol used in Hoon are named according to their aural ASCII pronunciation (for example, ace for a space and com for a comma). Sets of characters can also be used as delimiters, such as prn for printable characters.

> (scan "a b" ;~((glue ace) (just 'a') (just 'b')))  
['a' 'b']

> (scan "a,b" ;~((glue com) (just 'a') (just 'b')))
['a' 'b']

> (scan "a,b,a" ;~((glue com) (just 'a') (just 'b')))
{1 4}
syntax error

> (scan "a,b,a" ;~((glue com) (just 'a') (just 'b') (just 'a')))
['a' 'b' 'a']

The ;~ micsig rune has the form ;~(combinator rule-1 rule-2 ...): it uses the combinator to join a list of $rules into a single $rule.

> (scan "after the" ;~((glue ace) (star (shim 'a' 'z')) (star (shim 'a' 'z'))))  
[[i='a' t=<|f t e r|>] [i='t' t=<|h e|>]]

At this point we have two problems: we are just getting raw @t atoms back, and we can't iteratively process arbitrarily long strings. +cook will help us with the first of these:

+cook takes a gate and a $rule, and applies the gate to the result of a successful parse.

> ((cook ,@ud (just 'a')) [[1 1] "abc"])
[p=[p=1 q=2] q=[~ u=[p=97 q=[p=[p=1 q=2] q="bc"]]]]

> ((cook ,@tas (just 'a')) [[1 1] "abc"])
[p=[p=1 q=2] q=[~ u=[p=%a q=[p=[p=1 q=2] q="bc"]]]]

> ((cook |=(a=@ +(a)) (just 'a')) [[1 1] "abc"])
[p=[p=1 q=2] q=[~ u=[p=98 q=[p=[p=1 q=2] q="bc"]]]]

> ((cook |=(a=@ `@t`+(a)) (just 'a')) [[1 1] "abc"])
[p=[p=1 q=2] q=[~ u=[p='b' q=[p=[p=1 q=2] q="bc"]]]]
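
+cook also composes with the multi-character $rules above. As a sketch, assuming the standard-library gate +crip (which converts a $tape to a $cord), the list produced by +star can be turned into a single $cord:

> (scan "after" (cook crip (star (shim 'a' 'z'))))
'after'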

However, to parse iteratively, we need to use the +knee function, which takes a noun as the bunt of the type the $rule produces, and produces a $rule that recurses properly. (You'll probably want to treat this as a recipe for now and just copy it when necessary.)

|-(;~(plug prn ;~(pose (knee *tape |.(^$)) (easy ~))))
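
A hedged usage sketch: the recipe is an ordinary $rule, so it can be handed to +scan like any other. At each step it parses one printable character with prn, then either recurses on the rest of the input through +knee or, once nothing more matches, finishes with the empty list via (easy ~). The input "abcd" is only an illustration; the result pairs the first parsed character with the list parsed from the remainder.

> (scan "abcd" |-(;~(plug prn ;~(pose (knee *tape |.(^$)) (easy ~)))))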

There is an example of a calculator in the parsing guide that's worth a read at this point. It uses +knee to scan in a set of numbers at a time.

Example: Parse a String of Numbers

A simple +shim-based parser:

> (scan "1234567890" (star (shim '0' '9')))  
[i='1' t=<|2 3 4 5 6 7 8 9 0|>]

A refined +cook/+cury/+jest parser:

> ((cook (cury slaw %ud) (jest '1')) [[1 1] "123"])  
[p=[p=1 q=2] q=[~ u=[p=[~ 1] q=[p=[p=1 q=2] q="23"]]]]  

> ((cook (cury slaw %ud) (jest '12')) [[1 1] "123"])
[p=[p=1 q=3] q=[~ u=[p=[~ 12] q=[p=[p=1 q=3] q="3"]]]]
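
To turn a whole run of digits into a single number, one option is the standard-library pair +dit (which parses one decimal digit to its numeric value) and +bass (which folds a list of digits into an atom in the given base). This is a sketch of one approach rather than the only way to do it; the @ud cast just normalizes how the dojo prints the result.

> `@ud`(scan "123" (bass 10 (star dit)))
123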

Example: Hoon Workbook

More examples demonstrating parser usage are available in the Hoon Workbook, such as the Roman Numeral tutorial.
