Techno Blender
Digitally Yours.

1:length(A) considered harmful — or how to make Julia code “generic safe” | by Roland Schätzle | Jun, 2022

0 75


Photo by Erik Witsoe on Unsplash

Julia allows code to be written on a rather abstract and generic level, thus offering great expressive power. However, this advantage also poses special challenges for the programmer. This article shows how to overcome them.

Several of Julia’s language concepts like the user-extendable type system, parametric types, optional type declarations and multiple dispatch make it possible to write code which is quite generic and can be applied in many situations — situations even the authors of the code often cannot anticipate. These possibilities are also the pillar of Julia’s high degree of composability.

This is a clear advantage and one of the reasons why many people prefer Julia over other programming languages. But it comes at some cost as it opens up the range of possible errors. So special care must be taken when writing such code.

If you are writing Julia code just for your own use and are therefore able to control the parameters passed and how the code is composed with other packages, then you do not need to worry about these issues so much. But if you are writing code which is used by other people (especially publicly available packages like the ones in the Julia registry), then you need to keep these aspects in mind.

I will present three common examples to show how these challenges look like and how to overcome them. Most of this information is readily available in the Julia documentation, but I doubt that many newcomers to the language have read them, just because in many other programming languages these situations don’t exist. So, why should you search for an answer, if you don’t know that there is a question? Furthermore, recent discussion on Discourse showed that even experienced Julia programmers sometimes fall in these traps.

Let’s start with one of the most common data structures in Julia: Arrays. (One-dimensional) arrays are on first glimpse quite similar to arrays found in other programming languages (perhaps with the exception that the indexing is 1-based).

There are a bunch of predefined functions you can use on Arrays, e.g. length(A) gives you the number of elements contained in array A. That offers the possibility to iterate over an array and print each element in the following way:

for i in 1:length(A)
println(A[i])
end

You may find another way to do that, using the function eachindex(A) which returns an iterator for A:

for i in eachindex(A)
println(A[i])
end

The second variant looks a bit more elegant, but apart from that both versions seem to be interchangeable. … and this is the point, where the problem begins.

In fact, both variants are only interchangeable, if you are using arrays of (concrete) type Array. Array is a subtype of AbstractArray. If you allow A being any other subtype of AbstractArray, you may run into trouble. AbstractArray does not guarantee that the first element is at index 1, nor that the last element is at index length(A).

Therefore 1:length(A) has to be considered harmful.

If you are using e.g. OffsetArrays (another subtype of AbstractArray ), you are free to define the index boundaries. The following statement creates an array with 11 (random) elements, indexed from -5 to 5:

B = OffsetArray(rand(11), -5:5)--> 11-element OffsetArray(::Vector{Float64}, -5:5) 
with eltype Float64 with indices -5:5

If you would apply the 1:length(B)-variant for indexing, an error would occur, since index values > 5 will be used in this case. Apart from this, indexing would not start on the first, but on the 7th element.

The eachindex(B)-variant on the other hand would be safe, since it guarantees that you get an iterator which covers all elements from the first to the last one, no matter which way of indexing the data structure uses.

If you don’t need the index values (as is the case in the example), the following variant would also be safe (and preferred because it’s more concise):

for a in A
println(a)
end

Another common data structure are Strings, which are defined as a sequence of Chars in Julia. At first glance String s can be used like an array of chars (similar to Java or C). So to access the second character in the string s = "Hello" you can write c = s[2]. The variable c contains the letter ‘e’ after this operation.

And you can access the character following the ‘e’ in s using s[3] (a ‘l’ in this case).

This works, as long as only ASCII-characters are used. But in Julia a Char may be any Unicode-character (UTF-8 encoded). These characters may need more than one byte of storage (in contrast to pure ASCII-characters).

With Unicode characters the example above doesn’t work anymore:

u = “α = alpha”    # string with a Unicode-characteru[1]     --> 'α': Unicode U+03B1 (category Ll: Letter, lowercase)u[2]     --> ERROR: StringIndexError: invalid index [2], 
valid nearby indices [1]=>'α', [3]=>' '

The first letter in u (‘α’) is a (2-byte) Unicode-character. Therefore u[2] is not a printable character (but part of the first one).

To safely index arbitrary Unicode-strings you need the following functions:

  • firstindex(u) — the minimal index of u
  • lastindex(u) — the maximal index of u
  • nextind(u, i) — the next valid index after index i
  • prevind(u, i) — the previous valid index before index i

So to get the first character in u (‘α’) and then the next one following ‘α’, you should use:

i = firstindex(u)   --> 1
u[i] --> 'α': Unicode U+03B1
(category Ll: Letter, lowercase)
i = nextind(u,i) --> 3
u[i] --> ' ': ASCII/Unicode U+0020
(category Zs: Separator, space)

In order to iterate safely over all elements of a string, the following expression does the job (the above-mentioned functions are only necessary if you want to move back and forth within a string):

for c in u
println(c)
end

The two examples we have seen so far are dealing with the situation that you should access data structures on the appropriate abstraction level. I.e. access to data structures should not rely on knowledge or (worse) assumptions about their implementation.

The next example presents another topic: It’s about the use of type declarations.

Type declarations are optional in Julia, but they help the compiler to produce optimal code. The compiler needs to know at least the specific type of the input parameters to produce efficient code for an algorithm. Based on this starting point it can then infer the types of all other variables in most situations.

On the other hand, you want your code to be as generic as possible to make it applicable to a broad range of situations. These two requirements conflict.

Let’s have a look at an example to make this situation clearer. The type definition for a point in two-dimensional space could look like this:

struct Point
x
y
end

This is the most generic way of such a definition. But typically we want to define functions on our Point s which rely on some arithmetic functions as in the following example:

function distance(a::Point, b::Point)
sqrt((b.x - a.x)^2 + (b.y - a.y)^2)
end

In order to guarantee, that these arithmetic functions are available for x and y, we have to declare these fields as being of type Number. This is the most generic (i.e. least specific) declaration possible:

struct Point
x :: Number
y :: Number
end

So the requirement to keep the code as generic as possible is met by this definition of Point. But it is not really useful for the compiler.

When creating instances of Points, we can use any subtype of Number like Int8, Int64, Int128, Float16, Float32, Complex{Float32}, Rational{Int64} etc. If we want e.g. represent pixels on a screen with our Point type, an Int32 -value for x and y may be appropriate for this task. In order to model points for some mathematical application, we may use Float64 to represent real values. The definition of a polygon (represented as a vector of points) for these two use-cases would then be:

pixel_poly = Vector{Point}()real_poly  = Vector{Point}()

Both expressions are identical. That means the compiler gets no clue about what memory representation is needed in each case and thus cannot produce optimized code for that specific type. Instead it has to provide a very generic structure capable of storing each of the above mentioned subtypes. This is not efficient.

The solution to this conflict is the use of a (constrained) parametric type as follows:

struct Point{T <: Number}
x :: T
y :: T
end

Using a parametric type T still gives us the freedom to use any concrete type for T we like when creating an instance of a Point. The declaration <: Number restricts the possible types to subtypes of Number (just as above).

But if we need a specific instance, we can now also be specific about its type and write e.g. the two definitions for a polygon as:

pixel_poly = Vector{Point{Int32}}()real_poly  = Vector{Point{Float64}}()

Using a (constrained) parametric type instead of an abstract type meets both of our requirements: A broad applicability of the Point type and being able to inform the compiler about the concrete type needed in a specific use-case.

We have seen some of the most common challenges for writing generic Julia code and how to make it safe.

The first two examples showed that you have to understand the abstraction a data type provides and access it with functions matching this abstraction level:

  • AbstractArray represents (in the one-dimensional case) a sequence of elements which can be accessed using an index. But it makes no assertions about the minimum and the maximum index.
    Only the concrete subtype Array guarantees that the first index is 1 and the last index is length(A).
  • String represents a sequence of Chars. But it makes no assertions that they are of equal size.
    Only if you restrict the application to ASCII-characters you may assume equal size (of one byte).

In all these cases Julia provides a rich set of functions to treat the data structures in an adequate (and safe) way.

The last example demonstrated how you can use type declarations in conjunction with parametric types to keep your code as generic as possible and at the same time provide the compiler with sufficient information to produce optimal code.

Of course these are not the only situations where you have to be careful about these issues. But I think the examples are quite representative and hopefully I could raise awareness for these challenges.


Photo by Erik Witsoe on Unsplash

Julia allows code to be written on a rather abstract and generic level, thus offering great expressive power. However, this advantage also poses special challenges for the programmer. This article shows how to overcome them.

Several of Julia’s language concepts like the user-extendable type system, parametric types, optional type declarations and multiple dispatch make it possible to write code which is quite generic and can be applied in many situations — situations even the authors of the code often cannot anticipate. These possibilities are also the pillar of Julia’s high degree of composability.

This is a clear advantage and one of the reasons why many people prefer Julia over other programming languages. But it comes at some cost as it opens up the range of possible errors. So special care must be taken when writing such code.

If you are writing Julia code just for your own use and are therefore able to control the parameters passed and how the code is composed with other packages, then you do not need to worry about these issues so much. But if you are writing code which is used by other people (especially publicly available packages like the ones in the Julia registry), then you need to keep these aspects in mind.

I will present three common examples to show how these challenges look like and how to overcome them. Most of this information is readily available in the Julia documentation, but I doubt that many newcomers to the language have read them, just because in many other programming languages these situations don’t exist. So, why should you search for an answer, if you don’t know that there is a question? Furthermore, recent discussion on Discourse showed that even experienced Julia programmers sometimes fall in these traps.

Let’s start with one of the most common data structures in Julia: Arrays. (One-dimensional) arrays are on first glimpse quite similar to arrays found in other programming languages (perhaps with the exception that the indexing is 1-based).

There are a bunch of predefined functions you can use on Arrays, e.g. length(A) gives you the number of elements contained in array A. That offers the possibility to iterate over an array and print each element in the following way:

for i in 1:length(A)
println(A[i])
end

You may find another way to do that, using the function eachindex(A) which returns an iterator for A:

for i in eachindex(A)
println(A[i])
end

The second variant looks a bit more elegant, but apart from that both versions seem to be interchangeable. … and this is the point, where the problem begins.

In fact, both variants are only interchangeable, if you are using arrays of (concrete) type Array. Array is a subtype of AbstractArray. If you allow A being any other subtype of AbstractArray, you may run into trouble. AbstractArray does not guarantee that the first element is at index 1, nor that the last element is at index length(A).

Therefore 1:length(A) has to be considered harmful.

If you are using e.g. OffsetArrays (another subtype of AbstractArray ), you are free to define the index boundaries. The following statement creates an array with 11 (random) elements, indexed from -5 to 5:

B = OffsetArray(rand(11), -5:5)--> 11-element OffsetArray(::Vector{Float64}, -5:5) 
with eltype Float64 with indices -5:5

If you would apply the 1:length(B)-variant for indexing, an error would occur, since index values > 5 will be used in this case. Apart from this, indexing would not start on the first, but on the 7th element.

The eachindex(B)-variant on the other hand would be safe, since it guarantees that you get an iterator which covers all elements from the first to the last one, no matter which way of indexing the data structure uses.

If you don’t need the index values (as is the case in the example), the following variant would also be safe (and preferred because it’s more concise):

for a in A
println(a)
end

Another common data structure are Strings, which are defined as a sequence of Chars in Julia. At first glance String s can be used like an array of chars (similar to Java or C). So to access the second character in the string s = "Hello" you can write c = s[2]. The variable c contains the letter ‘e’ after this operation.

And you can access the character following the ‘e’ in s using s[3] (a ‘l’ in this case).

This works, as long as only ASCII-characters are used. But in Julia a Char may be any Unicode-character (UTF-8 encoded). These characters may need more than one byte of storage (in contrast to pure ASCII-characters).

With Unicode characters the example above doesn’t work anymore:

u = “α = alpha”    # string with a Unicode-characteru[1]     --> 'α': Unicode U+03B1 (category Ll: Letter, lowercase)u[2]     --> ERROR: StringIndexError: invalid index [2], 
valid nearby indices [1]=>'α', [3]=>' '

The first letter in u (‘α’) is a (2-byte) Unicode-character. Therefore u[2] is not a printable character (but part of the first one).

To safely index arbitrary Unicode-strings you need the following functions:

  • firstindex(u) — the minimal index of u
  • lastindex(u) — the maximal index of u
  • nextind(u, i) — the next valid index after index i
  • prevind(u, i) — the previous valid index before index i

So to get the first character in u (‘α’) and then the next one following ‘α’, you should use:

i = firstindex(u)   --> 1
u[i] --> 'α': Unicode U+03B1
(category Ll: Letter, lowercase)
i = nextind(u,i) --> 3
u[i] --> ' ': ASCII/Unicode U+0020
(category Zs: Separator, space)

In order to iterate safely over all elements of a string, the following expression does the job (the above-mentioned functions are only necessary if you want to move back and forth within a string):

for c in u
println(c)
end

The two examples we have seen so far are dealing with the situation that you should access data structures on the appropriate abstraction level. I.e. access to data structures should not rely on knowledge or (worse) assumptions about their implementation.

The next example presents another topic: It’s about the use of type declarations.

Type declarations are optional in Julia, but they help the compiler to produce optimal code. The compiler needs to know at least the specific type of the input parameters to produce efficient code for an algorithm. Based on this starting point it can then infer the types of all other variables in most situations.

On the other hand, you want your code to be as generic as possible to make it applicable to a broad range of situations. These two requirements conflict.

Let’s have a look at an example to make this situation clearer. The type definition for a point in two-dimensional space could look like this:

struct Point
x
y
end

This is the most generic way of such a definition. But typically we want to define functions on our Point s which rely on some arithmetic functions as in the following example:

function distance(a::Point, b::Point)
sqrt((b.x - a.x)^2 + (b.y - a.y)^2)
end

In order to guarantee, that these arithmetic functions are available for x and y, we have to declare these fields as being of type Number. This is the most generic (i.e. least specific) declaration possible:

struct Point
x :: Number
y :: Number
end

So the requirement to keep the code as generic as possible is met by this definition of Point. But it is not really useful for the compiler.

When creating instances of Points, we can use any subtype of Number like Int8, Int64, Int128, Float16, Float32, Complex{Float32}, Rational{Int64} etc. If we want e.g. represent pixels on a screen with our Point type, an Int32 -value for x and y may be appropriate for this task. In order to model points for some mathematical application, we may use Float64 to represent real values. The definition of a polygon (represented as a vector of points) for these two use-cases would then be:

pixel_poly = Vector{Point}()real_poly  = Vector{Point}()

Both expressions are identical. That means the compiler gets no clue about what memory representation is needed in each case and thus cannot produce optimized code for that specific type. Instead it has to provide a very generic structure capable of storing each of the above mentioned subtypes. This is not efficient.

The solution to this conflict is the use of a (constrained) parametric type as follows:

struct Point{T <: Number}
x :: T
y :: T
end

Using a parametric type T still gives us the freedom to use any concrete type for T we like when creating an instance of a Point. The declaration <: Number restricts the possible types to subtypes of Number (just as above).

But if we need a specific instance, we can now also be specific about its type and write e.g. the two definitions for a polygon as:

pixel_poly = Vector{Point{Int32}}()real_poly  = Vector{Point{Float64}}()

Using a (constrained) parametric type instead of an abstract type meets both of our requirements: A broad applicability of the Point type and being able to inform the compiler about the concrete type needed in a specific use-case.

We have seen some of the most common challenges for writing generic Julia code and how to make it safe.

The first two examples showed that you have to understand the abstraction a data type provides and access it with functions matching this abstraction level:

  • AbstractArray represents (in the one-dimensional case) a sequence of elements which can be accessed using an index. But it makes no assertions about the minimum and the maximum index.
    Only the concrete subtype Array guarantees that the first index is 1 and the last index is length(A).
  • String represents a sequence of Chars. But it makes no assertions that they are of equal size.
    Only if you restrict the application to ASCII-characters you may assume equal size (of one byte).

In all these cases Julia provides a rich set of functions to treat the data structures in an adequate (and safe) way.

The last example demonstrated how you can use type declarations in conjunction with parametric types to keep your code as generic as possible and at the same time provide the compiler with sufficient information to produce optimal code.

Of course these are not the only situations where you have to be careful about these issues. But I think the examples are quite representative and hopefully I could raise awareness for these challenges.

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment