Tuesday, 1 May 2018

egg Strings

Currently, strings are a fundamental type in egg. In many languages they are not. I'm still undecided how much compiler and language support to provide. In particular, because egg is supposed to be a stable language, if strings are a fundamental type, any decisions I make now will need to be "baked" into the specification. So I've been looking into what strings actually are.

In languages where strings are not fundamental, such as C++, there may be many representations of text:

  char*
  char[N]
  std::string
  std::wstring
  std::u16string
  std::u32string

This has the advantage that the different representations and implementations can be tailored for different criteria, but it can be a bit overwhelming for learners.

Strings in egg are simply sequences of Unicode characters. They are immutable (like Java) and do not directly expose their internal representation. They are also independent of your locale.

So what can you do with strings like these? The Wikipedia pages leave a bit to be desired on this subject, so I had to start from first principles.

Assignment

Strings can be assigned from string literals or from other strings:

  string greeting = "hello";
  string word = greeting;

Concatenation

Strings can be created by concatenating other strings using the type-name constructor:

  string greeting = string("hello"); // "hello"
  greeting = string(greeting," world"); // "hello world"

Property: int length

Strings have a single property, "length", which is the count of the Unicode code points (not code units) that make up the string:

  string greeting = "hello";
  print(greeting.length); // prints '5'

Operator: string[int]

Strings have an indexing operator to extract single-character substrings at the relevant integer code point index:

  string greeting = "hello";
  string letter = greeting[0]; // sets letter to "h"

If the index is negative, or beyond the end of the string, an exception is thrown.

Operators: == and !=

Strings have equality/inequality operators:

  print("hello" == "hello"); // prints 'true'
  print("hello" == "goodbye"); // prints 'false'
  print("hello" != "hello"); // prints 'false'
  print("hello" != "goodbye"); // prints 'true'

Equality is case-sensitive and based on the equality of the constituent code point sequence. Strings are never equal to non-string values.

Iteration

The code points that make up a string can be iterated:

  string greeting = "hello";
  for (string i : greeting) print(i); // prints 'h' 'e' 'l' 'l' 'o'


Note that the iteration returns single-character strings.

Member: int hash()

Computes a 64-bit signed hash integer for use in containers and algorithms.

Member: int compare(string other)

Compares the string to another in an invariant manner (code point by code point). Returns negative/zero/position depending on whether the string is less/equal.greater than the other.

  "hello".compare("goodbye") // returns a positive integer

Case-insensitive and locale-specific comparisons are outside the remit of the basic string methods.

To prevent accident assumptions about collation ordering, the relational operators "<", "<=", ">=" and ">" are not directly supported for strings.

Member: bool contains(string needle)

Searches for a substring within the string.

  "beggar".contains("egg") // returns true

This is an optimisation of

  haystack.indexOf(needle) != null

Member: bool startsWith(string needle)

Determines if a string starts with substring.

  "beggar".startsWith("beg") // returns true

This is an optimisation of

  haystack.indexOf(needle) == 0

Member: bool endsWith(string needle)

Determines if a string starts with substring.

  "beggar".endsWith("beg") // returns false

This is an optimisation of

  haystack.indexOf(needle) == (haystack.length - needle.length)

Member: int? indexOf(string needle)

Returns the zero-based index of the first occurrence of the substring, or "null" if the substring is not present.

  "beggar".indexOf("egg") // returns 1

The full signature is

  int? indexOf(string needle,
               int? start_index = null,
               int? limit = null,
               bool? negate = null)

Member: int? lastIndexOf(string needle)

Returns the zero-based index of the last occurrence of the substring or "null" if the substring is not present.

  "beggar".lastIndexOf("g") // returns 3

The full signature is

  int? lastIndexOf(string needle,
                   int? start_index = null,
                   int? limit = null,
                   bool? negate = null)

Member: string replace(string needle, string replacement)

Replaces occurrences of a substring with another string.

  "beggar".replace("egg", "e") // returns "bear"

Remember, the result is returned from this function; the original string is unaltered. The full signature is

  string replace(string needle,
                 string replacement,
                 int? max_occurrences = null)

If "max_occurrences" is negative, replacements are made for the last occurrences of the substring.

Member: string slice(int? begin = null, int? end = null)

Extracts a substring from the string. The parameters are both optional indices. Negative indices count back from the end of the string:

  "beggar".slice(1) // returns "eggar"
  "beggar".slice(1, 4) // returns "egg"
  "beggar".slice(1, -2) // returns "egg"

Member: string[] split(string separator, int? limit = null)

Splits a string into an array of substrings:

  "beggar".split("g") // returns ["be","","ar"]

If "limit" is negative, splits are limited to the last occurrences of the separator.

Member: string join(any[] ...params)

Joins strings together:

  "a".join("b", "n", "n", "") // returns "banana"

Member: string repeat(int count)

Creates a new string by concatenating repetitions of a source string:

  "beggar".repeat(0) // returns ""


  "beggar".repeat(3) // returns "beggarbeggarbeggar"

And that's pretty much it. There are some obvious (and deliberate) omissions:

MISSING Member: string format(any[] ...params)

The functionality for formatting general text will live in separate modules. This will allow for more than one formatting scheme and also allow the algorithms to evolve over time.

MISSING Member: int codePointAt(int index)

This is really a function of the (Unicode) encoding specification, so should be defined there:

  import unicode = /*...*/;
  print(unicode.getCodePointAt("hello", 1)); // prints 101

MISSING Members: string lowercase() and string uppercase()

These are omitted because they rely on the Unicode character database, which changes over time. See here. The functionality will probably be made available via a versioned module:

  import unicode = /*...*/;
  print(unicode.toUpper("hello")); // prints "HELLO"

MISSING Members: string trim(), string trimLeft() and string trimRight()

Removing white-space (or any class of character) from the ends of strings is a common operation. The Unicode standards committee designated twenty-odd code points as "white-space". This includes one of my favourite characters: U+1680: OGHAM SPACE MARK. Alas, the categorisation is somewhat fluid, so one requires access to the Unicode character database to get it right.

MISSING Member: string reverse()

Although characters in egg strings are Unicode, reversing a Unicode string may not produce the results you expect. To do it properly again requires access to the Unicode character database, so the reverse function should belong to a versioned module.

Also, there some members that I'm still dithering about:

POSSIBLE Members: string padLeft() and string padRight()

These members can be synthesised using the existing members. However, they are surprisingly easy to get wrong, so maybe they're a useful addition:

  string padLeft(int min_length, string? padding = null)
  string padRight(int min_length, string? padding = null)

Computers Don't Lie T-Shirt

Computer Don't Lie - They're Just Misinformed

Tuesday, 17 April 2018

egg Functions

My first test script for functions in egg was everyone's favourite chestnut: the Fibonacci series.

  int fibonacci(int n) {
    if (n < 2) {
      return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
  }
  print(fibonacci(10));

It does indeed print out "55"!

In egg, functions defined like this are actually special cases of callable objects. The script declares an identifier "fibonacci" which is initialised with an instance of an object that supports the "call" operation: in this case, taking a single integer parameter and returning an integer.

At present, there is no notion of read-only variables in egg, so it's possible to subsequently assign a different function to "fibonacci":

  int fibonacci(int n) {
    if (n < 2) {
      return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
  }
  int zero(int n) {
    return 0;
  }
  fibonacci = zero;
  print(fibonacci(10));

This prints out "0". It's a good demonstration that functions in egg are considered first-class entities, but it might violate the principle of "least surprise" for newcomers.

egg "For" Statements

Recently, I got my first egg script running. Here it is:

  var s = 0;
  for (var i = 0; i < 100; ++i) {
    s += i;
  }
  print(s);

You'll be unsurprised to hear that this printed out "4950" (it was a pleasant surprise to me when it first happened, though, I can tell you!)

There are three things worth mentioning in this script.

Firstly, the "var" keyword initiates type inference for variables "s" and "i". In both cases, "int" is inferred (there are no unsigned integers in egg).

Secondly, the "print" function is a built-in method that will probably be removed, but is useful for testing, at the moment.

Finally, the syntax of "for" statements turns out to be non-trivial. Like C++, I adopted two forms:

  for (before ; condition ; step) { ... }
  for (iterator : collection) { ... }

The latter form is for iterating around collections, which we won't discuss here.

My main concern with the former form of the "for" loop is:
Which syntactic elements are valid for "before", "condition" and "step"?
The easiest of the three clauses is "condition". It's optional, but I only allow expressions that evaluate to Boolean values. This is similar to all the main curly-brace languages (C/C++/Java/JavaScript).

The "before" statement is also optional, but it must be a statement. This includes variable declarations: the scope of such variables is limited to the "for" statement, including the "condition" and "step" clauses. Again this is similar to the other languages (if we ignore JavaScript's problematic scoping rules).

The "step" statement is optional too, but cannot be a declaration. As mentioned previously, the increment and decrement operators are supported in egg purely to allow the classic for-loop idiom seen in this example.

But then I got a bit confused with which egg statements should be valid for "before" and "step". Can you use "break" and "continue"? If so, what do they do?

So I asked my C++ compiler, and it says that I cannot use "break" or "continue", but I can "throw" exceptions. The reason you cannot "break" or "continue" in "before" and "step" clauses is because those statements are just that: statements. The "before" and "step" clauses in C++ expect C++ expressions.

But why can you "throw" exceptions in those clauses in C++?

  for (auto i = 0; i < 100; throw "Bang!") {} // Valid C++

Well, it turns out that what I think of as throw statements are actually throw expressions (of type "void") in the formal syntax. It's a cul-de-sac that others have found themselves in too!

As egg is meant to be an easy-to-learn language, with few surprises, I decided to classify "throw" as a statement and explicitly forbid it in expressions and "before" and "step" clauses of "for" statements.

Tuesday, 3 April 2018

ZX Spectrum Flags of the European Union

Someone, somewhere, out there on the Internet, is looking for the flags of the twenty-eight (current) member countries of the European Union  ... in the original Sinclair ZX Spectrum SCREEN$ format. Surely...

Sunday, 1 April 2018

Mappa Edmundi de Waal

Personally, I think Edmund missed a trick...


(original by Edmund de Waal)

Friday, 23 March 2018

A Prime Example of JavaScript Golf

This is why you shouldn't play JavaScript golf:

for(a=[1];!a[999];)/^(11+)\1+$/.test(a+=1)||print(a.length)