Convert Formatted Text Into a Data Structure Using Parsing

Parsing is a process of converting formatted text into a data structure. A data structure type can be any suitable representation of the information engraved in the source text.

Tree type is a common and standard choice for XML parsing, HTML parsing, JSON parsing, and any programming language parsing. The output tree is called Parse Tree or Abstract Syntax Tree. In an HTML context, it is called the Document Object Model (DOM).

A CSV file parsing can result in a

List

List of values

or a

List of Record objects

Graph Type is a choice for natural language parsing.

A piece of program that does parsing is called Parser.

How it works

Parser analyses source text against the format* prescribed. If source text does not match against format error is thrown or returned.

If source text does not match against format error is thrown or returned.
If matches then “data structure” is returned.

*format is coded inside the parser.

Format is the DNA of a parser.

Small Case Study

Consider an example of Date parsing from a string (source) in format

DD-MM-YYYY

to Date object:

class Date {
  int day;
  int month;
  int year;
}

Implementation Note

For Date parsing, I would be using Regular Expression¹ (regex for short). Regex can be matched against a string. It also helps in extracting part of the source text if it is matched.

Note: This is going to be a small-scale illustration of parsing with “Regex.” This might be a fair approach for one-two lines input text but not in every case. You may have to write a parser by hand (most complex) or use Parser Generator tools (moderately complex).

Code

The parsing and extracting the Date element is as:

String date = "20-05-2012";
// 1. defining a regular expression
Pattern dateMatcher = Pattern.compile("(\\d{2})-(\\d{2})-(\\d{4})");
// 2. matching a regular expression
Matcher matcher = dateMatcher.matcher(date);
// 3. if pattern matches 
if (matcher.find()) { 
  // extract day 
  int day = Integer.parseInt(matcher.group(1));
  int month = Integer.parseInt(matcher.group(2));
  int year = Integer.parseInt(matcher.group(3));
  // 4. validate date : skipped code
  return new Date(day, month, year);
} else {
 // throw Error
 throw new DateParseException(date +" is not valid")
}

The code is written in Java.

How does this code work:

1. Date formatted as DD-MM-YYYY can be matched using regex:

(\d{2})-(\d{2})-(\d{4})
# where \d matches any digit between 0 to 9, 
# {n} means how many times the last character type is expected
# () is called capturing group and enclosed part of string that would be extracted.
# Each group is numbered, the first group is assigned "1", the second one is given number "2", and so on

2. The given date as the string is matched against the defined regex.

3. If the match is successful, then the day, month, and year are extracted from the source string using Group Construct provided by Regular Expression API. It is a standard construct in any Regular Expression API in any language.

4. The extracted date parts can be validated; related code is skipped for you to exercise.

This is an illustration of Regular Expression based parsing, which is limited to format, which can be defined by regular expressions. It is a basic example of parsing. A programming language or format like XML parsing is complex in nature. You can refer to the book named “Crafting Interpreters” to get an idea about a full-scaled parsing.

You can also read about “Lezer: Code Mirror’s Parsing System”

Phases of Parsing

Parsing can be scoped as a composition of Scanning and Syntactic Analysis or just Syntactic Analysis.

Scanning

Scanning is a process of converting a stream of characters into tokens.

A token represents a “concept” introduced by format. Logically, a token can be considered as a label assigned to one or more characters. From a processing point of view: A token is an object, can contain lexeme³, location information, and more.

In Java language: if, while, int are examples of tokens. In date parsing, tokens are defined by Regex;

\d{2}

(day, month),

—

(separator),

\d{4}

(year) are tokens. Note: Day and month are the same token type. A token is defined by “the pattern of characters” not by locations of characters.

Scanning is also called Tokenization or Lexical Analysis. A part of the program which does scanning is called Lexer or Scanner or Tokenizer.

Syntactic Analysis

Syntactic Analysis analyses the structure formed as keeping tokens in order as their positions. It also validates and extracts engraved data to create the preferred Data Structure.

In date parsing example: “day is followed by month and year.” The order is checked by Regex Engine. Extraction is done based on order and matches.

Errors reported by this phase are called Syntactic Errors. Going back to Date parsing example,

99-JAN-2021

is an invalid Date; however,

99–99–9999

is a valid Date because rule

(\d{2})-(\d{2})-(\d{4})

says so. It may seem absurd, but parsers generally validate Syntactic Correctness.

Closing Notes

The scale of parsing determines the inclusion or exclusion of Scanning as part of it:

It is beneficial for big-scale parsing like language (natural or programming) parsing to keep Scanning, and Syntactic Analysis separated. In this case, Syntactic Analysis is called parsing.
For small-scale parsing like Date, it may not be beneficial to distinct Lexer and Syntactic Analysis. In this case, it is called Scannerless Parsing.

Many times, parsing tasks are delegated to parser code generators like Antlr or Lex. These tools require a set of rules or grammar and generate parser code.

The generated parsers produce a tree upon parsing, which may not be the desired data structure; however, these libraries provide sufficient APIs to convert the constructed tree to the desired data structure.

There is more in Parser world; the type of parsing styles: top-down or bottom-up, Leftmost or Rightmost derivation. These design styles limit parsing capabilities of the parser. You may visit the following link to get a synoptic view of these design choices and their comparison:

https://tomassetti.me/guide-parsing-algorithms-terminology/#tablesParsingAlgorithms

Footnotes

Regular Expression is an algebraic expression² representing a group of strings.
An algebraic expression utilizes symbols and operators.
A Lexeme is a raw string version of Token.

Previously published at https://themightyprogrammer.dev/article/parsing

Convert Formatted Text Into a Data Structure Using Parsing

Too Long; Didn't Read

Company Mentioned

How it works

Implementation Note

Phases of Parsing

Closing Notes

Footnotes

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

Categories

Trending Topics

Convert Formatted Text Into a Data Structure Using Parsing

Too Long; Didn't Read

Company Mentioned

How it works

Implementation Note

Phases of Parsing

Closing Notes

Footnotes

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES

Categories

Trending Topics