PowerShell’s Tokenizer
The PowerShell language is made up of atomic units called tokens.
The Tokenizer class is responsible for scanning all text given to PowerShell, whether directly via the command line or from a script file, and creating tokens from the text.
This chapter peeks into how the tokenizer works and how knowledge of the tokenizer’s functionality and limitations can help you gain a more effective understanding of the PowerShell language.
This will merely scratch the surface, giving you an overview of how the tokenizer operates, how to recognize the different parsing modes, and when each applies. Further study is highly recommended; the source code is openly available on GitHub for you to examine. Should you wish to have a really thorough understanding, take a moment to clone the PowerShell repository. This will allow you to easily follow references throughout the PowerShell engine internals as the chapter progresses.
Context
The tokenizer needs something to work with. Here is a brief overview of the context it primarily operates within, so that when one of these things is referenced later you have a basic idea of what's being discussed.
It’s recommended that you follow along as this chapter goes through the code, as line numbers are referenced frequently to keep things as succinct and clear as possible. Enough chit-chat; time to look at how the tokenizer works!
Helper Classes and Enums
DynamicKeyword (Ln. 19–481)
The first thing you’ll notice when you pull up tokenizer.cs in the PowerShell repo is the DynamicKeyword class, and a few associated enums related to the necessary parsing of these tokens.
Very few people have really heard of — much less used — a DynamicKeyword.
They’re quite old in terms of PowerShell features, but they’e rather tricky to work with, and perhaps because of this (at least in part) you won’t see them used frequently.
DynamicKeyword code will crop up throughout the tokenizer, but this chapter won’t get too deep into that.
All you need to worry about here is that these things exist and can be used, and the tokenizer needs to be able to recognize them.
They work very similarly to regular keywords, just with a little extra complexity to allow some room for customization, thus the enums and the relative complexity of the DynamicKeyword type.
TokenizerMode (Ln. 483–489)
The TokenizerMode enum is very simple and to the point.
The tokenizer has 4 core modes that it can be in as it steps through text:
- Command
- Expression
- TypeName
- Signature
The default mode is Command mode, where the tokenizer will be looking for command names, language keywords, and the subsequent tokens that define those language features or command parameters.
However, if it recognizes other input types (for example, the class keyword, a variable or literal value, the opening [ bracket of a type declaration), it can immediately switch out of Command mode.
The rules by which it switches to each mode are reasonably straightforward, and each mode has several stages it will cycle through as it creates tokens.
These are covered in more detail as you step through the methods themselves.
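While the modes themselves are internal, you can watch their effects from the outside by asking the public Parser API for the token stream. A quick illustrative sketch:

$tokens = $errors = $null
# The opening '[' of a type literal switches the tokenizer into TypeName mode,
# so 'int' is scanned as a type name rather than a command name.
[System.Management.Automation.Language.Parser]::ParseInput(
    '[int]$x = 42', [ref]$tokens, [ref]$errors) | Out-Null
$tokens | Select-Object Kind, Text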
NumberSuffixFlags (Ln. 491–549)
The NumberSuffixFlags enum is used when the tokenizer is parsing numerical input.
PowerShell actually recognizes several type suffixes when it reads in a number, which affect the final data type of the value.
A number in PowerShell may be suffixed with:
- Nothing - no suffix may indicate either a typical 32-bit integer or a floating-point double value (depending on whether decimal points are present).
- u — Unsigned — the u suffix indicates an unsigned integer (.NET has no unsigned floating-point types).
- y — SignedByte — the y suffix indicates an sbyte (signed byte) value.
- uy — UnsignedByte — the uy suffix indicates a regular (unsigned) byte value.
- s — Short — the s suffix indicates a short value, a 16-bit integer.
- us — UnsignedShort — the us suffix indicates a ushort, an unsigned 16-bit integer.
- l — Long — the l suffix indicates a long value, a 64-bit integer.
- ul — UnsignedLong — the ul suffix indicates a ulong value, an unsigned 64-bit integer.
- d — Decimal — the d suffix indicates a decimal value, a 128-bit high-precision floating point number.
- n — BigInteger — the n suffix indicates an arbitrarily large signed integer.
All suffixes are case-insensitive.
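A quick demonstration of a few of them (note that most of these suffixes are only recognized by PowerShell 7 and later; Windows PowerShell only supports l and d):

PS> (100l).GetType().Name    # Int64
PS> (100d).GetType().Name    # Decimal
PS> (100u).GetType().Name    # UInt32
PS> (100y).GetType().Name    # SByte
PS> (100n).GetType().Name    # BigInteger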
NumberFormat (Ln. 554–570)
Another enum utilized when parsing various forms of numbers; this one classifies the format of a number.
Numbers can come in 3 basic forms: decimal (base 10, no prefix), hexadecimal (base 16, 0x prefix), or binary (base 2, 0b prefix).
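All three of these literals evaluate to the same value (binary literals require PowerShell 7 or later):

PS> 16       # decimal
PS> 0x10     # hexadecimal
PS> 0b10000  # binary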
TokenizerState (Ln. 576–586)
This mini class is effectively in-memory storage for the state of the tokenizer. Finding nested scriptblocks within the script it has been asked to parse is pretty common, and in order to parse these, it will store its state using this class. Then, it proceeds to reset its own state and perform a whole new tokenizing operation on the scriptblock. Once that’s completed it can then restore its prior state, skip to the end of the now-tokenized scriptblock, and resume where it left off in the main script.
The Tokenizer Class (Ln. 588)
The tokenizer itself takes up the remainder of the file, around 4500 lines.
Core Private Members (Ln. 591–613)
The following runs through some of the key private members and outlines their jobs so you can keep your bearings as you go.
Static Members
private static readonly Dictionary<string, TokenKind> s_keywordTable // Ln. 591
private static readonly Dictionary<string, TokenKind> s_operatorTable // Ln. 593
These two dictionaries handle the mapping between text input and language keyword TokenKind values, and the similar mapping between operator strings and operator TokenKind values.
These are actually populated during the tokenizer’s constructor with all the values in these two arrays:
private static readonly string[] s_keywordText // Ln. 618
private static readonly TokenKind[] s_keywordTokenKind // Ln. 634

internal static readonly string[] _operatorText // Ln. 650
private static readonly TokenKind[] s_operatorTokenKind // Ln. 669
As you can see, each keyword and operator is stored initially in two arrays; one containing the text, and the other containing the corresponding TokenKind values.
These have a 1:1 relationship, as the values at each position correspond to the same token type.
Instance Members
private string _script; // Ln. 611
private int _tokenStart; // Ln. 612
private int _currentIndex; // Ln. 613
As you might imagine, _script stores the full text of the script the tokenizer is currently examining, _currentIndex stores exactly what point it’s at in the script, and _tokenStart is used to remember the starting position of the current token it’s examining.
These last two are extremely helpful when there could be a few possible TokenKind values as it scans a token; sometimes you do need to backtrack a little if the initial assumption didn’t pan out.
For example, if you enter 300_spartans into PowerShell as a bare string, at first it will see the number 3 and start scanning for more numbers.
Then, once it reaches the _ it will immediately realize something isn’t right, stop scanning for numbers, and fall back to scanning what it calls a generic token.
Generic tokens are most often parsed into command names or unquoted command arguments by the Parser.
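You can see this fallback reflected in the final token stream; here's a small sketch using the public Parser API:

$tokens = $errors = $null
[System.Management.Automation.Language.Parser]::ParseInput(
    '300_spartans', [ref]$tokens, [ref]$errors) | Out-Null
$tokens[0].Kind    # Generic - the '_' ended the attempt to scan a number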
Constructor (Ln. 690–720)
The tokenizer has a static constructor which does 2 main things:
- Verifies that the keyword and operator tables mentioned above are both in sync, and the same size.
- Generates a very basic hash to be used when checking for a script signature block.
The only instance constructor takes a single argument — a Parser instance, which is the instance of the parser that the tokenizer will provide with the generated tokens.
This is the only way to create an instance of the tokenizer.
In other words, you must have a Parser instance to pair with it — you can’t instantiate the tokenizer on its own.
Auxiliary Private & Internal Members (Ln. 722–739)
Quickfire round! Most of these are relatively self-explanatory. For the ones you don’t understand yet, don’t worry; their meaning will become more apparent in context.
internal TokenizerMode Mode // Stores the current tokenizer mode

internal bool AllowSignedNumbers // Are signed numbers valid in the current context?

internal bool WantSimpleName // Restrict allowed characters in an identifier

internal bool InWorkflowContext // Are we in a PS Workflow context? (Deprecated)

internal List<Token> TokenList // The list of tokens generated

internal Token FirstToken // The first token in the script

internal Token LastToken // The last token in the script

private List<Token> RequiresTokens // The tokens in the script's #Requires statement
There are also a few mode-checking methods which mainly serve to make code more expressive. For example:
private bool InCommandMode() { return Mode == TokenizerMode.Command; }
There’s one method corresponding to each tokenizer mode.
Initialize Method (Ln. 741–776)
internal void Initialize(string fileName, string input, List<Token> tokenList)
This method sets the stage, making sure everything the tokenizer needs is in place. It does 2 main things:
- Ensures that any stored tokens are discarded, and the token list is set to the input list.
- Runs through the entire input very quickly to determine where each line starts.
Nested Scanning Methods (Ln. 778–800)
internal TokenizerState StartNestedScan(UnscannedSubExprToken nestedText);

internal void FinishNestedScan(TokenizerState ts);
I mentioned TokenizerState earlier; this is where it’s used.
The StartNestedScan() method takes a snapshot of the current tokenizer state, then proceeds to reconfigure the tokenizer to scan the nested script section.
Once this scan is completed, the FinishNestedScan() method will be called to restore the tokenizer to the stored state.
This pattern might seem a little baffling at first, but this method of approaching nested scanning is quite clever.
It allows the tokenizer to scan nested scriptblocks to effectively any depth as needed, without ever needing to spin up a second tokenizer instance.
One is enough.
This helps immensely with speed and memory usage, and also very neatly sidesteps what could be a sticky question: how multiple tokenizer instances might interact with the one Parser instance.
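For example, the subexpression below contains a nested scriptblock; the single tokenizer instance handles it all with this save-and-restore dance, and from the user's perspective it just works:

PS> "Doubled: $( (1..3 | ForEach-Object { $_ * 2 }) -join ',' )"
Doubled: 2,4,6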
Utility Methods
In addition to the InCommandMode() method and its cousins, the tokenizer has a slew of helper methods that avoid an otherwise significant amount of duplicate code.
These are also extremely helpful in terms of maintaining readability in most cases.
This is a quick summary of their purposes, but a few rarely-used methods may be skipped to avoid overwhelming the reader.
Navigation and Scanning (Ln. 816–855)
These methods are used for navigation in the script input. It’s stored as one huge block of text, and the tokenizer uses these methods to navigate within it.
private char GetChar(); // Ln. 816
GetChar() has one pretty simple job — increment the current index value (in other words, step forward one character) and then return that character.
It also performs checks to ensure you’re not trying to read outside the bounds of the input.
private void UngetChar(); // Ln. 831
Even simpler than its pair, UngetChar() simply decrements the current index value (in other words, it steps backwards one character).
private char PeekChar(); // Ln. 838
PeekChar() checks the character at the current index without modifying the index value.
It’s a quick “what’s it currently looking at” so that the tokenizer can conditionally step forwards, or check the same character against a few possible cases, or simply recognize something’s wrong and either report an error or try another possibility.
PowerShell’s flexibility as a language is partly made possible by these tiny helper methods and how they’re used.
private void SkipChar(); // Ln. 850
You’ll often find this paired with PeekChar(), and its role is extremely simple.
It simply increments the current index value, without returning anything.
internal static bool IsKeyword(string str); // Ln. 862
The IsKeyword() method simply checks a given string to determine if it’s a valid keyword.
It will return true for both regular language keywords and dynamic keywords; it simply searches the keyword and DynamicKeyword tables for a match.
internal void SkipNewlines(bool skipSemis); // Ln. 877

private void SkipWhiteSpace(); // Ln. 955

private void ScanNewline(char c); // Ln. 969
These methods all handle skipping whitespace or generating newline tokens.
SkipNewlines() may seem a bit convoluted, but in a nutshell all it does is skip whitespace characters (including newlines) until it finds something more interesting to look at.
It also skips comments, though it scans them and stores the tokens as it finds them.
This process loops until either the script ends or another character is recognized.
SkipWhiteSpace() is a little stricter than SkipNewlines(), and only skips whitespace.
Finally, ScanNewline() simply returns a token representing a newline character.
It also normalizes CRLF (carriage return, line feed) sequences by stripping out the CR character, as PowerShell typically uses only LF internally to represent newlines.
private void ScanSemicolon(); // Ln. 981
There is a specific method to scan semicolons.
The tokenizer needs to be as performant and lightweight as it can be, so this method verifies whether TokenList is available to store tokens in before opting to create the token.
private void ScanLineContinuation(char c); // Ln. 992
If you’re not familiar with line continuations, in PowerShell a line continuation can be performed with a backtick character ( ` ) followed immediately by a newline.
This method is almost identical to the ScanNewline() method above; the only difference you'll note is that this method expects a two-character sequence: the backtick, then the line feed character.
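As a refresher, a line continuation looks like this:

PS> Get-ChildItem -Path $HOME `
>>     -Directory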
internal void Resync(Token token); // Ln. 1010

internal void Resync(int start); // Ln. 1018
The Resync() method is called by the Parser when it determines that it needs to backtrack.
This can be a fairly expensive operation for the tokenizer, so uses are kept to a minimum.
When this method is called, the tokenizer discards any tokens generated after the token given to this method, and then resumes scanning from this point again.
internal void CheckAstIsBeforeSignature(Ast ast); // Ln. 1112
In PowerShell, signature blocks must be at the bottom of a file. If this isn’t the case, this method reports a parse-time error and stops tokenizing the file.
Error Reporting (Ln. 1125–1153)
private void ReportError(IScriptExtent extent, string errorId, string errorMsg);

private void ReportIncompleteInput(int errorOffset, string errorId, string errorMsg);
The ReportError() and ReportIncompleteInput() methods, which have a few overloads, call back to the similarly-named methods on the Parser instance.
This will cause a parsing error to be created, and parsing of the script or input will halt.
Creating Extents (Ln. 1155–1191)
All AST objects created by the parser have an Extent property which details both position in the overall script, and the literal text that was turned into the token.
These methods are necessary in order to generate this information, which can be later used in the Resync() method, for example.
Extents are also used in reporting parse errors, which helps a great deal with locating the problem!
Generating Tokens (Ln. 1193–1309)
There are a wide variety of tokens used in PowerShell.
The methods here are all very similar.
Each one calls the private SaveToken() method after creating the kind of token it's named after.
Depending on the kind of token, the method and token creation may require different input values.
Here are a few of the more commonly-used token-generating methods:
private Token NewNumberToken(object value);

private Token NewParameterToken(string name, bool sawColon);

private VariableToken NewVariableToken(VariablePath path, bool splatted);

private StringToken NewStringLiteralToken(
    string value,
    TokenKind tokenKind,
    TokenFlags flags);

private StringToken NewStringExpandableToken(
    string value,
    string formatString,
    TokenKind tokenKind,
    List<Token> nestedTokens,
    TokenFlags flags);

private Token NewGenericToken(string value);
Special Line Continuations (Ln. 1345–1416)
There are a few recent additions to the tokenizer that allow for additional line continuations, which will be included for the first time in PowerShell 7.0.
internal bool IsPipeContinuance(IScriptExtent extent);

This method allows the tokenizer to read pipelines written like this, which wasn't valid syntax in PowerShell 6.x and below:
PS> Get-ChildItem -Path $HOME -Directory
>> | ForEach-Object -MemberName Name
>> | Get-Random -Count 5
>> | Sort-Object -Descending
At the time of writing, there are several other proposals for additional line-continuation behaviours that may end up being incorporated into the tokenizer in a later PowerShell version.
Skipping Comments (Ln. 1418–1446)
private int SkipLineComment(int i);

private int SkipBlockComment(int i);
These methods simply step along through the comments in a script, looking for the character(s) that indicate their end. Once they find it, they return the index just after the end of the comment.
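The comments aren't lost entirely, though; they surface as Comment tokens in the token stream, which you can confirm via the public Parser API:

$tokens = $errors = $null
[System.Management.Automation.Language.Parser]::ParseInput(
    'Get-Date # fetch the current time', [ref]$tokens, [ref]$errors) | Out-Null
($tokens | Where-Object Kind -eq Comment).Text
# fetch the current time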
Special Characters (Ln. 1448–1577)
private char Backtick(char c, out char surrogateCharacter);

private char ScanUnicodeEscape(out char surrogateCharacter);

private static char GetCharsFromUtf32(uint codepoint, out char lowSurrogate);
These methods all parse backtick-escaped characters and give back the appropriate character.
For example, "`n" is transformed into a literal newline, and "`u{25}" becomes a % sign.
Backtick() specifically handles all the standard characters like a tab stop, newline, etc., and then the `u case handles any and all Unicode characters by calling out to the ScanUnicodeEscape() method, which turns the input value into the appropriate Unicode character.
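A quick demonstration (the `u{} escape is only available in newer PowerShell versions):

PS> "col1`tcol2"
col1    col2
PS> "`u{25}"
%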
Signed Scripts and #Requires (Ln. 1579–1714)
private void ScanToEndOfCommentLine(out bool sawBeginSig, out bool matchedRequires)
This method is mainly just for reading to the end of a single-line comment.
However, both #Requires and a signature block begin the same way as a single-line comment, so this method has a fair bit of additional logic in order to be able to determine whether it’s something important or just a comment.
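For reference, here are a couple of #Requires statements; they look like comments, but carry meaning:

#Requires -Version 7.0
#Requires -Modules Pester
#Requires -RunAsAdministrator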
Reusable StringBuilders (Ln. 1718–1745)
The tokenizer maintains its own private Queue of StringBuilder objects, as you can see here.
private readonly Queue<StringBuilder> _stringBuilders = new Queue<StringBuilder>();

private StringBuilder GetStringBuilder();

private void Release(StringBuilder sb);

private string GetStringAndRelease(StringBuilder sb);
This allows the tokenizer to avoid a lot of unnecessary allocations.
Many of its parsing methods require the use of StringBuilders in order to efficiently create tokens.
This setup creates a framework where a StringBuilder is only created if it’s needed; StringBuilders can be cleared out and reused instead of expending resources instantiating a new one.
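Here is a minimal sketch of the same pooling pattern, translated into PowerShell terms; the function names are hypothetical, purely for illustration:

$pool = [System.Collections.Generic.Queue[System.Text.StringBuilder]]::new()

function Get-PooledBuilder {
    # Reuse a cached StringBuilder when one is available; allocate only when the pool is empty
    if ($pool.Count -gt 0) { return $pool.Dequeue() }
    [System.Text.StringBuilder]::new()
}

function Release-PooledBuilder([System.Text.StringBuilder]$sb) {
    [void]$sb.Clear()    # wipe the contents, but keep the allocated buffer for reuse
    $pool.Enqueue($sb)
}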
Scanning Comments (Ln. 1747–2244)
private void ScanLineComment();

private void ScanBlockComment();
What’s that, you thought you were done with comments? There is still a bit more here.
While some simple recognition of #Requires and signature blocks has already been covered, the methods already seen don't cover actually parsing a #Requires statement, much less evaluating the validity of a signature block.
internal ScriptRequirements GetScriptRequirements();

private void HandleRequiresParameter(CommandParameterAst parameter,
    ReadOnlyCollection<CommandElementAst> commandElements,
    bool snapinSpecified,
    ref int index,
    ref string snapinName,
    ref Version snapinVersion,
    ref string requiredShellId,
    ref Version requiredVersion,
    ref List<string> requiredEditions,
    ref List<ModuleSpecification> requiredModules,
    ref List<string> requiredAssemblies,
    ref bool requiresElevation);

private List<string> HandleRequiresAssemblyArgument(
    Ast argumentAst,
    object arg,
    List<string> requiredAssemblies);

private List<string> HandleRequiresPSEditionArgument(
    Ast argumentAst,
    object arg,
    ref List<string> requiredEditions)
The code contained in these methods is pretty dry, but does about what you’d expect.
It uses the methods you’ve already seen to determine if a comment is a #Requires or a signature, and then it evaluates them appropriately.
If you’re particularly interested, looking at the #Requires code is recommended, though you will end up scrolling by a lot of ReportError() calls for all the failure cases.
Tokenizing Strings (Ln. 2246–2834)
PowerShell has a few neat tricks up its sleeve when it comes to strings, but by and large the biggest difference you’ll see will come down to whether it starts with single or double quotes.
Strings in Command Arguments
Sometimes, however, you can have a string value that doesn't have quotes. These are only available when entering command arguments. When you enter a command and then pass an argument to it that's a simple string, that string will remain a string token, instead of a generic token (generic tokens are what command names start as).
# Despite not enclosing this in quotes, it remains a string value
PS> Get-ChildItem -Path .\pwsh
Here is the method which is called upon to handle this:
internal StringToken GetVerbatimCommandArgument(); // Ln. 2250
This is pretty straightforward for a tokenizer method. It checks for disallowed characters, stops on a space (spaces aren’t included in an unquoted command argument string), and has a bit of extra handling for the stranger edge cases where you want to put quotation marks in the middle of the unquoted argument.
Single-Quoted Strings
As for single-quoted strings, those of you familiar with PowerShell are probably aware, but a 'single-quoted string' is assumed to be literal.
The engine won’t bother checking it for variables or subexpressions to expand.
private TokenFlags ScanStringLiteral(StringBuilder sb); // Ln. 2284

private Token ScanStringLiteral(); // Ln. 2322
These are pretty straightforward, simply stepping through and adding every character it sees to the StringBuilder that’s being used to create the final string value. The only additional handling here is checking for an escaped quote in the middle of the string. As a quick example:
PS> 'this string contains ''escaped quotes,'' you see?'
this string contains 'escaped quotes,' you see?
Double-Quoted Strings
Double-quoted strings have a lot more going on: expanding variables, entire subexpressions, etc. In the interest of keeping it relatively easy to digest, this chapter doesn’t get too in-depth with the actual code itself.
private Token ScanSubExpression(bool hereString); // Ln. 2329

private TokenFlags ScanStringExpandable( // Ln. 2516
    StringBuilder sb,
    StringBuilder formatSb,
    List<Token> nestedTokens);

private bool ScanDollarInStringExpandable( // Ln. 2485
    StringBuilder sb,
    StringBuilder formatSb,
    bool hereString,
    List<Token> nestedTokens);

private Token ScanStringExpandable(); // Ln. 2535
Essentially what’s going on here is:
- The tokenizer recognizes an expandable string (starts with ").
- ScanStringExpandable() is called.
- TokenFlags are determined, any subexpressions are examined, and sub-tokens are retrieved.
- An expandable string token is generated, with all this information.
Once the token is created, the Parser and Compiler can use it to evaluate the final string content.
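The end result, once the Parser and Compiler have had their turn:

PS> $name = 'world'
PS> "Hello, $name! Sum: $(2 + 2)"
Hello, world! Sum: 4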
Here-Strings
Here-strings in PowerShell look like this:
$Literal = @'
String contents
'@

$Expandable = @"
String $contents
"@
For the tokenizer, a here-string simply changes what it looks for. It behaves similarly to scanning other strings, except that it's now searching for a valid ending sequence for the string, and no longer cares about quotation marks that would otherwise end a normal string (unless they form part of the ending sequence).
private bool ScanAfterHereStringHeader(string header);

private bool ScanPossibleHereStringFooter(
    Func<char, bool> test, Action<char> appendChar,
    ref int falseFooterOffset);

private Token ScanHereStringLiteral();

private Token ScanHereStringExpandable();
As you can see, there are specific methods to scan for the header or footer of the here-string.
In addition, here-strings have their own versions of ScanStringLiteral() / ScanStringExpandable(), which mainly differ only in the handling of quotation marks, as mentioned a moment ago.
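This is why quotation marks can appear freely inside a here-string:

PS> @"
She said "hi" without ending the string
"@
She said "hi" without ending the string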
Tokenizing Variables (Ln. 2836–3149)
Variables in PowerShell are very powerful, but tokenizing them correctly is no easy feat. The intricate details are well worth a look in the source, but the overview of the primary method here has been kept deliberately simple.
// Scan a variable - the first character ($ or @) has been consumed already.
private Token ScanVariable(bool splatted, bool inStringExpandable);
As noted by the comment here, by the time it gets to this method, the tokenizer’s already consumed the $ or @ character.
As a result, this information is passed by the method parameters instead.
This method checks for braced variables as well, as braced variables are permitted to use just about any possible character in their names. Unbraced variables are limited to basic alphanumeric names, with only a few allowed symbols.
${my braced\\variable} = "ted"
$unbraced_variable = "not ted"
You’ll notice that, like many of the tokenizer’s Scan methods, this method actually uses a while (true) loop — a deliberately infinite loop.
As ill-advised as this can be, arguably it’s one of the only straightforward ways to get this done.
What often makes these methods a lot more difficult to follow, however, is their use of the goto keyword to break out of the loop and move on once the initial scanning is completed.
Some of the methods use goto in order to re-start their scans when necessary as well.
It can sometimes be clearer if you follow it one character at a time; trying to figure out how it'll operate across even one whole word can be a little tricky.
The long and the short of it for this method is:
- It checks if the variable is braced or not.
- It loops through all the characters, until it reaches the end (or an invalid character).
- It checks for errors, and reports any found.
- If no errors are found, it creates the variable token and returns it.
Note that this is the tokenizer — it’s just reading the input text. The tokenizer hasn’t the faintest idea what’s in the variable, if anything at all. It just knows there’s a variable token here.
It’s also worth noting that PowerShell has some very cleverly reserved variables (namely, $$ and $^) which would normally be considered actually invalid as variable names.
These variables are hard-coded into the tokenizer as an additional case, ignoring the standard variable name rules.
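$^ holds the first token of the previous input line, and $$ holds the last:

PS> Write-Output Hello
Hello
PS> $^; $$
Write-Output
Hello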
Generic Tokens (Ln. 3330–3446)
private Token ScanGenericToken(char firstChar); // Ln. 3330

private Token ScanGenericToken(char firstChar, char surrogateCharacter); // Ln. 3337

private Token ScanGenericToken(StringBuilder sb); // Ln. 3349
These methods are the tokenizer’s “catch-all” bucket. If it looked like it was going to be some other token, and then an unexpected character was found, it will end up being passed to these methods.
It’s actually rather difficult to summarize it better than the original code comments in the method here convey it:
On entry, it’s already scanned an unknown number of characters and found a character that didn’t end the token, but made the token something other than what it thought it was.
Examples:
- 77z — it looks like a number, but the 'z' makes it an argument.
- $+ — it looks like a variable, but the '+' makes it an argument.

A generic token is typically either a command name or command argument (though a generic token is accepted in other places, such as a hash key or function name). A generic token can be taken literally or treated as an expandable string, depending on the context. A command argument would treat a generic token as an expandable string whereas a command name would not. It optimizes for the command argument case - if it finds anything expandable, it continues processing assuming the string is expandable, so it'll tokenize sub-expressions and variable names. This would be considered extra work if the token was a command name, so it assumes that will happen rarely, and indeed, '$' isn't commonly used in command names.
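You can verify the first of those examples yourself; a rough sketch via the public Parser API:

$tokens = $errors = $null
[System.Management.Automation.Language.Parser]::ParseInput(
    'Write-Output 77z', [ref]$tokens, [ref]$errors) | Out-Null
$tokens[1].Kind    # Generic - the 'z' turned a would-be number into an argument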
Tokenizing Numbers (Ln. 3448–4118)
The tokenizer has a few different methods for parsing numbers. First of all, it has the basic scanning methods:
private void ScanHexDigits(StringBuilder sb); // Ln. 3450

private int ScanDecimalDigits(StringBuilder sb); // Ln. 3461

private void ScanBinaryDigits(StringBuilder sb); // Ln. 3476

private void ScanExponent( // Ln. 3487
    StringBuilder sb,
    ref int signIndex,
    ref bool notNumber);

private void ScanNumberAfterDot( // Ln. 3506
    StringBuilder sb,
    ref int signIndex,
    ref bool notNumber);
These methods handle the string input directly, making sure that the subsequent methods only receive useful data.
ScanHexDigits(), ScanDecimalDigits(), and ScanBinaryDigits() all operate similarly, looping over the input and adding valid digits to the StringBuilder.
Numbers can also have suffixes in PowerShell, so they don’t immediately assume that an unwanted character means an error.
ScanExponent() is called when a number like 2e6 is passed in; ScanDecimalDigits() handles it normally until the e is hit, then returns.
Then, ScanExponent() is called to finish up and get the exponent value on the end.
These numbers will shortly be evaluated, and 2e6 becomes 2,000,000 before the final data type is decided.
Note that for exponents, it’s normal and acceptable to have a negative sign in the exponent, so 2e-6 is also a valid number token for PowerShell’s tokenizer to accept.
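For example:

PS> 2e6
2000000
PS> 2e-6
2E-06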
ScanNumberAfterDot() does exactly what it says on the tin; when you give the tokenizer a number with a decimal point, ScanDecimalDigits() stops at that point.
Then, ScanNumberAfterDot() is called to finish the scan.
private static bool TryGetNumberValue(
    ReadOnlySpan<char> strNum,
    NumberFormat format,
    NumberSuffixFlags suffix,
    bool real,
    long multiplier,
    out object result);
This method is called after the scans are done, and the type is decided. It bears the brunt of the logic to determine the actual type of the resulting value.
If the d decimal suffix is provided, it can proceed to parse it as decimal and return early.
Otherwise, it will either attempt to parse it as a double value or a BigInteger depending on whether or not the number is considered real.
All numbers input with a decimal point or an exponent are automatically considered real numbers.
Before checking types, any multiplier suffixes (for example, 10gb, 5mb) are applied first to ensure that the target type can actually hold the requested value.
Then, any provided suffix is checked at this point to determine if the value can be contained in the specified type.
If no suffixes are provided, the default type is double for real values, and int or long for non-real values, depending on whether the value is small enough to be stored in an int or not.
If all goes well, the number can be returned and stored in the resulting number token. Below is a brief example of the process of determining number value in the tokenizer with a few example cases.
10d -> decimal
11.4d -> decimal
12e4 -> real -> double
12 -> non-real -> BigInteger -> int
10gb -> non-real -> 10 * 1GB -> BigInteger -> long
12.0 -> real -> double
1.5sKb -> real -> 1.5 * 1KB -> double -> short
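A couple of these are easy to confirm interactively:

PS> (10gb).GetType().Name
Int64
PS> (12e4).GetType().Name
Double
PS> (12).GetType().Name
Int32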
The methods that do a lot of the actual decision making to get to that point, however, are ScanNumber() and ScanNumberHelper().
private Token ScanNumber(char firstChar); // Ln. 3838

private ReadOnlySpan<char> ScanNumberHelper( // Ln. 3886
    char firstChar,
    out NumberFormat format,
    out NumberSuffixFlags suffix,
    out bool real,
    out long multiplier);
ScanNumberHelper() is responsible for:
- Determining the format of the number (hex, binary, decimal).
- Handling the number strings from the input.
- Identifying decimal points, exponents, and suffixes.
- Determining the type and multiplier suffix values.
At this point, the information is fed out to ScanNumber() and then back into TryGetNumberValue() to determine the final values.
Member Access / Invocations / Necessary Characters (Ln. 4120–4216)
internal Token GetMemberAccessOperator(bool allowLBracket); // Ln. 4120

internal Token GetInvokeMemberOpenParen(); // Ln. 4197
These two methods handle syntax like this:
PS> $variable.Property
PS> $type::StaticMethod()
PS> $variable.Method()
They actually only handle the ., ::, and ( items in the above examples, all of which are key.
When calling a member of an instance or a type, PowerShell needs to know whether the access method is static or not, and it also needs to know if you’re actually calling the method or requesting information about it.
A method in PowerShell called without its parentheses will actually list out the metadata for the method, including the required parameters to invoke it.
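For example:

PS> 'text'.ToUpper
OverloadDefinitions
-------------------
string ToUpper()
string ToUpper(cultureinfo culture)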
Additionally, the distinction between access with . and :: is extremely important.
All these methods need to do is check that the context they’re being queried in is valid, and then examine the separator characters in order to generate the correct tokens for the Parser to utilize later on.
The actual member names themselves are separately scanned as generic tokens.
Identifiers (Ln. 4335–4391)
The tokenizer uses Identifiers for a few different purposes:
- Keywords
- Command Names
- Command Arguments
- Method / Property Names
The basic rule here is that identifiers are pretty much universally alphanumeric, with only a scant few symbols permitted.
private Token ScanIdentifier(char firstChar);
Depending on whether the tokenizer is currently in command or expression mode, the generated token here can be a generic token instead of a true Identifier, but this method will handle both.
Type Names (Ln. 4393–4505)
private Token ScanTypeName(); // Ln. 4395

private void ScanAssemblyNameSpecToken(StringBuilder sb); // Ln. 4428

internal string GetAssemblyNameSpec(); // Ln. 4452
These methods are called once you've already got an open [ bracket and a type name of some form is expected. The permitted characters are much more restricted here, as the .NET CLR only allows letters, digits, and a few special characters in a type name.
Label Names (Ln. 4507–4543)
Labels are rarely used in PowerShell, and only valid in the context of labelling a specific loop in a script.
They follow the same rules as any other identifier / generic token in PowerShell in the tokenizer, except that they start with a : character.
For example:
:loop1 foreach ($a in 1..10) { if ($a -in 3, 6, 7) { break } $a }
The tokenizer doesn’t retain sufficient context awareness to determine if a label is actually valid in context, however — for example, the loop that’s meant to come after it could be missing.
NextToken (Ln. 4545–4992)
This is the primary method through which the parser interacts with the tokenizer.
The parser simply calls _tokenizer.NextToken() and sees what it gets, and that's fantastic when you consider the relative complexity of this method.
internal Token NextToken();
It’s essentially a giant switch statement, with a lot of bells and whistles, plus a sprinkling of goto statements here and there.
These are mainly here so that the tokenizer can recognize there’s whitespace, handle it, and then continue scanning until it gets the next token that the Parser actually wants to look at.
It saves an additional series of checks and messy code in the Parser itself by simply tucking that away here at this point.
The long and the short of it: this method determines what the tokenizer should look for based on its current position in the script and what it finds there. If you were to try adding a completely new style of syntax to PowerShell, this is certainly one place where you might need to add a few extra special cases.
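As a parting demonstration, here's the complete token stream the parser effectively receives from successive NextToken() calls for a tiny script, retrieved via the public Parser API (output shown as comments):

$tokens = $errors = $null
[System.Management.Automation.Language.Parser]::ParseInput(
    '$x = 1 + 2', [ref]$tokens, [ref]$errors) | Out-Null
$tokens | ForEach-Object { '{0,-12} {1}' -f $_.Kind, $_.Text }
# Variable     $x
# Equals       =
# Number       1
# Plus         +
# Number       2
# EndOfInput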