The Scientific Coder

User-defined Show Method in Julia

Matthijs Cox — Tue, 18 Jul 2023 13:24:13 GMT

I often find myself looking for a way to write custom display methods for Julia types on the REPL. Time to write it down in a short pragmatic blog post, for you and my future self.

What's the issue? When you explore on the Julia REPL or in notebooks and display your own custom type, the output is not always the most informative. Let's say you have some type:

struct MyType
    some_number::Float64
    some_dict::Dict
end

You can quickly make an object and display it.

julia> obj = MyType(4.0, Dict(:x => 5))
MyType(4.0, Dict(:x => 5))

Okay... Julia basically shows the constructor of the object. I would like to see the field names, or maybe other information. Sometimes I want to see statistical properties for example, instead of the raw data.

As an alternative, to quickly see the field names, you can dump the content of an object. Which is nice for simple objects, but I explicitly put a Dict in there, because it'll dump the dictionary internals, which you don't want to see:

julia> dump(obj)
MyType
  some_number: Float64 4.0
  some_dict: Dict{Symbol, Int64}
    slots: Array{UInt8}((16,)) UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x82]
    keys: Array{Symbol}((16,))
      1: #undef
      2: #undef
      ...
      15: #undef
      16: Symbol x
    vals: Array{Int64}((16,)) [5065505441550857052, 465637893754, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]
    ndel: Int64 0
    count: Int64 1
    age: UInt64 0x0000000000000001
    idxfloor: Int64 16
    maxprobe: Int64 0

Not pretty. How to improve this developer experience?

Compact Mode

Before I go to the solution: it turns out there is a "compact" mode for displaying objects. You can notice this behavior when you place dictionaries inside an array, for example:

julia> d = Dict(:a => 1, :b => 2, :c => 3)
Dict{Symbol, Int64} with 3 entries:
  :a => 1
  :b => 2
  :c => 3

julia> [d, Dict(:d => 4)]
2-element Vector{Dict{Symbol, Int64}}:
 Dict(:a => 1, :b => 2, :c => 3)
 Dict(:d => 4)

You see that the dictionary is displayed differently in the two cases above. Inside the array we prefer a more compact display, since you may have many objects. I often forget about this compact mode, and then I get ugly array printing.
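To make the two display paths visible outside the REPL, here is a small sketch that uses only Base functions. The two-argument `show` gives the single-line form that containers use for their elements, the three-argument `show` with `MIME("text/plain")` is what `display` calls on the REPL, and the `:compact` property travels along in an `IOContext`. The variable names are only for illustration.

```julia
d = Dict(:a => 1)

# Two-argument show: the single-line form, also used for elements inside containers.
short = sprint(show, d)

# Three-argument show with text/plain: the multi-line form shown on the REPL.
long = sprint(show, MIME("text/plain"), d)

# The :compact property is passed via an IOContext; for a float it trims the digits.
compact_float = sprint(show, 66.66666; context = :compact => true)

println(short)
println(long)
println(compact_float)
```

A custom `show` method can read the same property with `get(io, :compact, false)`.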

Here's some discussion on the topic on the Julia discourse.

Custom show

In the end, this is a typical approach I take. You can make it a lot more fancy if you like, but this is a good starting point:

struct MyType
    some_number::Float64
    some_dict::Dict
end

# default show used by Array show
function Base.show(io::IO, obj::MyType)
    compact = get(io, :compact, true)
    print_object(io, obj, compact)
end

# default show used by display() on the REPL
function Base.show(io::IO, mime::MIME"text/plain", obj::MyType)
    compact = get(io, :compact, false)
    print_object(io, obj, compact)
end

function print_object(io::IO, obj::MyType, compact::Bool)
    if compact
        # write something short, or go back to default mode
        Base.show_default(io, obj)
    else
        print(io, "MyType")
        print(io, "\n  ")
        print(io, "some_number: $(obj.some_number)")
        print(io, "\n  ")
        print(io, "some_dict: $(obj.some_dict)")
    end
end

This works fine:

julia> t = MyType(5.0, Dict(:a => 1, :b => 2))
MyType
  some_number: 5.0
  some_dict: Dict(:a => 1, :b => 2)

julia> [t]
1-element Vector{MyType}:
 MyType(5.0, Dict(:a => 1, :b => 2))

You can make your type printing as fancy as you desire.

One additional trick: to keep the code concise when you have a lot of properties with special types, you can loop over propertynames(obj) and use, for example, getproperty(obj, :name). In the code above I hardcoded the property names.
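A minimal sketch of that looping variant could look as follows. It redefines the same `MyType` so the block runs on its own, and it only covers the multi-line `text/plain` method; treat it as one possible layout, not the only way to do it.

```julia
struct MyType
    some_number::Float64
    some_dict::Dict
end

# Multi-line display: print the type name, then one "name: value" line per property.
function Base.show(io::IO, ::MIME"text/plain", obj::MyType)
    print(io, nameof(typeof(obj)))
    for name in propertynames(obj)
        print(io, "\n  ", name, ": ")
        # two-argument show keeps each value on a single line
        show(io, getproperty(obj, name))
    end
end

display(MyType(5.0, Dict(:a => 1)))
```

Adding or renaming a field now changes the display automatically, at the cost of less control over how each individual field is printed.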

Here's where I found stuff in the base language:

many show methods in show.jl, including the Dict show.
the Dict show called by the array.
the Array show internals.

If you read the code, please note the other IO options, like :limit. Can you guess what it's used for?

Conclusion

Well that's it, hope it helps as a reference to you and future me ;)

JuliaCon Local Eindhoven 2023

Matthijs Cox — Tue, 11 Jul 2023 07:23:12 GMT

I am very happy to announce that I am an organizer of the first city-level JuliaCon conference. This will be a one-day event in Eindhoven on December 1st, organized together with the PyData Eindhoven conference on November 30th (the day before).

The website is live: https://juliacon.org/local/eindhoven2023/. You can submit proposals, book early-bird tickets and consider joining as a volunteer.

We named it "JuliaCon Local" to avoid any confusion with the yearly Global JuliaCon, which is typically also associated with a city name. The date is also positioned in the winter, to be out of sync with the summer schedule of the Global JuliaCon conferences. People who could not attend the Global JuliaCon now have another opportunity to meet like-minded Julians and computational scientists in the industry and academia.

My apologies if I notify you via multiple channels, including my blog, but we are really excited about growing our scientific computing community in the area. Please consider sharing the news with your network. Of course everyone on the planet is welcome to join our conference! Hopefully we are paving the path to more city-level JuliaCon conferences.

How to deploy algorithms anywhere?

Matthijs Cox — Sun, 09 Jul 2023 12:41:35 GMT

Let's say you are an incredible scientific programmer. You've got some pretty math, machine learning model or scientific computing code. And you want to give it to other users. Maybe even turn it into a real product and make a profit from your work. How do you "deploy" that piece of code? Most scientists do not think much about this problem at all, but it can have a great influence on how you should develop your code.

Basically, we need to take what you developed, turn it into something which can be given to the user, so they can install and use it in their computing environment. What to provide depends entirely on the environment of the user. So you'll first need to understand that: the so called "production environment", the environment in which your "product" or service will operate.

The easiest way to make sure the code works, is to write the code inside the production environment and run it there. Boom! Everything works. Some startups operate like that, but it's not very common. It's quite a risk to mess up your production environment accidentally. It's also possible you have no direct access to your production environment, for example if you are writing code that needs to be installed on millions of cars around the planet.

If you want to know about the deployment processes adopted by many companies, I recommend the Pragmatic Engineer article Shipping to Production. Unfortunately, that focuses mainly on procedures and already assumes quite some software knowledge.

I think there are roughly three options here that we need to consider:

You fully understand and control the production environment. For example, if you work for a car manufacturer and you write the firmware, then you deploy the code into an environment that you control (or at least your employer does). You might be able to prepare the production environment to best suit your chosen algorithm technology.
You understand the production environment, but you do not control it. In the previous example, let's say you are a vendor selling software to the car manufacturer. You probably need to restrict yourself to the production environment of your customer.
You neither know the production environment, nor do you control it. Let's say you are selling software that might run on any laptop with any operating system (macOS, Linux, Windows), or even on mobile devices. You have no clue what to expect. This can be tough, but is quite common for consumer software.

For the last option, the modern era has tried to work around the issue by deploying to servers (or "clouds"). In that case you fully control and understand the production environment, and you merely provide the user with access to your service. This does assume your user has internet access, which seems reasonable these days, but is not true in environments like super-secure semiconductor factories (where I may have some experience).

Assuming you understand your production environment, you are still looking for a balance between how much you share yourself and how much you re-use. If you can re-use software components in the production environment, you can distribute a smaller deployment package/artifact. But you may have to conform to things you do not control, which can be unpleasant.

Let's move on to typical deployment options. What is this "thing", this "artifact", that we send to the production environment? Here are the general options I work with:

Deploy the raw source code files and make sure the interpreter/compiler is available in the production environment. Python and Java typically work like this.
Compile the source code to something "standalone". More C and Rust style.
Package everything together and ship it. Docker containers are the most extreme version of this approach, as they include even the operating system.

But there are plenty of options in between, including all combinations of possible production environments and their restrictions. Some combinations are not possible, for example when deploying on an Arduino you are severely limited by computational capabilities and you will probably have to compile a tiny standalone solution. If you've just written some massive Python AI monstrosity, you'll have to rewrite it to something much leaner. That can be very painful to find out at the end of your project.

That's why it's important to have some end-product in mind and work backwards from that vision in your development. Scientists and business people like to keep the behavior of the software in mind, what the software will do and such, but forget about where it will operate.

Source code deployment examples with Julia

The Julia language is currently my favorite language, as it tries to unite multiple programming worlds: those of science and of software engineering. It focuses on being easy to use and fast to execute. In theory Julia can be deployed anywhere, but being developed primarily by numerical computing professionals, it lacks some ease of use in that deployment area. I think that highlights some of the blind spots of typical scientists. I'll use Julia, and its pain points, to highlight deployment considerations, while trying to keep everything generic to other languages.

The most basic deployment happens when you, as a developer, begin your journey into the programming language. You install the language, you type some code in some editor (or directly on the REPL), and you run the code. That's it. Note that when you installed the language, you use the deployment mechanism from someone else.

The second most basic deployment, is to give your code to a fellow developer. That developer will understand their own environment (to a certain extent). They probably have already installed the programming language. If not, they can follow the same installation instructions.

Now if the code you write depends only on the installed language, everything should work. But in the modern era, you typically depend on plenty of other people's code. You'll be importing open-source packages left and right. That's really nice, since it saves you a lot of effort. But now you need to share those extra packages with your fellow developer. Note that packages may include pure source code, but also compiled libraries.

You can either:

Create a "bundle" of all those open-source packages and share it, or...
Share a reproducible way to install all that code. See my previous article on that.

So if you would like to share a piece of code with someone, you need to consider how to share everything that code depends on.
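For the second option, a minimal sketch in Julia could rely on the built-in package manager. It assumes you share a folder containing the `Project.toml` and `Manifest.toml` files of your environment; the helper name `instantiate_project` and the path are hypothetical.

```julia
using Pkg

# Hypothetical helper: `project_dir` is assumed to contain the shared
# Project.toml and Manifest.toml files.
function instantiate_project(project_dir::AbstractString)
    Pkg.activate(project_dir)  # switch to the shared environment
    Pkg.instantiate()          # install the versions recorded in Manifest.toml
end

# Usage (path is a placeholder, and downloading needs internet access):
# instantiate_project("path/to/shared/project")
```

Your colleague then gets the same package versions you used, as long as the package servers are reachable from their environment.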

These scenarios I described so far are simple (installing for yourself or sharing with a colleague), but they already show the concepts we have to take into consideration when sharing:

The core language features.
The default operating system (OS) libraries which the language depends on.
The code you wrote.
The code others wrote for you.
Any libraries created by others.

You can choose which parts you share directly, and which parts you allow to be installed/downloaded.

For installing source code, Julia depends on the package manager to install everything for you, by downloading it from the internet. This all runs with an existing Julia installation. However, Julia doesn't have a good source code "bundler", where you quickly create an installer with your code in one "bundle" or "distributable" (for example an executable on Windows) and you give that to a person. I think that's missing in the Julia ecosystem.

Note that such solutions are operating system dependent. For Python, you've got py2exe for Windows, py2app for macOS, pex for Unix.

Compiling libraries

A computer doesn't directly execute your source code, it needs low-level instructions. Turning your source code into machine instructions is called "compiling". In my previous article How to Solve the Two Language Problem, I roughly explained how technologies like Julia work. There are lots of steps, but on a high-level, you go from 1) written characters to 2) an LLVM representation to 3) machine instructions, a.k.a. native code.

When you gather all that native code and place it in a library (.dll in Windows, .so on Unix), then you can share that library directly with an end-user. Assuming you know which operating system they are working on. This process of turning the machine instructions into a distributable library is often considered part of the compilation process.

And you will still have to "bundle" any external libraries together with your compiled library. This may include certain libraries from your chosen language. Libraries can be linked statically or dynamically, but I don't want to go into those details here. I do want to make you aware, to always, ALWAYS, consider ALL your dependencies. If you forget to consider a dependency, and it's missing or mis-located in the production environment, your program will not run and your deployment has failed!

The Julia language community provides the PackageCompiler package. If you want to make everything fully standalone, you are looking at creating an "app". This will:

Compile your code, and the dependent code, into one library.
Gather all Julia language libraries.
Gather all dependent third-party libraries.
Place all of those together in a folder, and make sure the dependencies are linked correctly.
Optional: filter out unnecessary libraries (at your own risk).

Note that the default operating system libraries, such as libc, are not included in this "bundle" of libraries.
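As a rough sketch of what that looks like in practice: an app built with PackageCompiler needs a package with a `julia_main` entry point that returns a C integer exit code. The module name `MyApp` and the paths below are placeholders, and the build call is left as a comment because it needs PackageCompiler installed and a real package folder.

```julia
module MyApp  # hypothetical package name

# Entry point of the compiled app; the return value becomes the exit code.
function julia_main()::Cint
    try
        println("Hello from a standalone app, arguments: ", ARGS)
    catch err
        showerror(stderr, err)
        return 1
    end
    return 0
end

end # module

# Build step, run from a separate Julia session (paths are placeholders):
#   using PackageCompiler
#   create_app("path/to/MyApp", "MyAppCompiled"; filter_stdlibs = true)
```

The `filter_stdlibs = true` option corresponds to the optional filtering step in the list above, with the same "at your own risk" caveat.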

It's possible in Julia to remove as many dependencies as possible, to go to a very small distributable library, and even become independent of any core Julia language libraries. For example, you can run Julia on an Arduino. But it's far from trivial. Keep your eyes on StaticTools.jl to follow the developments.

Languages like Rust are geared fully towards statically compiling and deploying small independent libraries. That results in very good tooling for the library deployment use-case.

Docker: just deploy everything

Docker tries to be the software technology to solve all deployment. It wraps everything you need into a "container": code, runtime, system tools, system libraries and settings. It's all about portability: to make sure you can share your software with others, as standalone as possible. I won't go into details, Docker has solid documentation.

You will still need to install Docker itself in the production environment. This means that if you do not control the environment, you may never be able to run Docker containers there.

You will have to decide how to deploy everything inside the Docker container, either with source code or with compiled libraries or anything else, but at least you know you have full control over what you place inside.

The container size can be a problem in some production environments. There exist layered containers, to re-use parts among multiple containers, but that just returns the dependency problem, right?

Containerization is an amazing software technology that can solve many deployment difficulties, but I'd like you to balance it against other deployment options and keep the production environment restrictions in mind.

Integrating and interfacing

Once you figured out what "artifact" you will send to your production environment, you will also have to consider how that artifact will operate there. In other words, what happens after deployment?

This is mostly a matter of communication. You have to decide on a communication mechanism, a data format and the contents of the data. You'll also have to think about how to handle and communicate errors and other exceptional aspects.

The simplest common approach in today's webservice era is to deploy a Docker container, turn it into a REST server (that's a communication mechanism using HTTP), then send JSON strings or ProtoBuf objects (the data format). If it's a computational backend service, say some fitting algorithm, then you can put vectors inside the JSON and maybe some settings (that's the content of the data).

But there are many more options, all depending on the restrictions of your production environment. This probably deserves a separate blog post.

Deploy anything with Julia

Want more detailed information and tutorials?

Build entire Julia web apps? See the Genie framework.
Roll your own simple REST server? See HTTP.jl (used by Genie.jl).
Deploy Julia bare-metal on Arduino? Blog here.
Embed Julia libraries into C/C++ systems? Tutorial here.
Make a standalone app? See PackageCompiler docs.
Just want to share a script? Good, but make it reproducible!

I probably missed many others, feel free to add more links in the comments.

Conclusion

As I grow older and gain experience in deploying in more and more environments, I admit I appreciate fully statically compiled languages more. Plain-old C or Rust. You know you will be able to deploy anywhere if needed. Of course, you may not have such restrictions in your current production environment, but it's nice to work with a technology where you know you will not be blocked when the time comes.

However, such technologies are often tedious to use for scientific exploration or data analysis. Immediately from the beginning they add a lot of restrictions to your software development. Why can't there be a language that does it all? Where you slowly add the necessary restrictions as you progress in your project. I'm hoping we can tune Julia further in that direction, so that we have a language that's easy to write, performant when needed AND easy to deploy anywhere.

I hope this article helps to explain the concepts involved in deploying algorithms (or any type of code) in production environments. Understanding those concepts at the start of your project will make the entire process much smoother. It's essential to consider all dependencies and choose the right deployment method based on the production environment and the language you're using.

Fruity Composable Design Patterns in Julia

Matthijs Cox — Fri, 23 Jun 2023 12:18:08 GMT

A design pattern is a repeatable solution to a common coding problem. Design patterns are not something beginner programmers typically think about a lot (that includes most scientists), they are probably focused on making their code work. At least that's what I did when I was a young programmer. At the other extreme such patterns can become a religion for people, where everything has to be a design pattern, or else the code is not considered good enough. However, people who make this mistake are not senior programmers either in my opinion. Senior programmers look for a balance between pure abstraction and simplicity (and many other requirements).

The Julia community has a special stance on design patterns: people don't really like them. In general the Julia community believes that design patterns expose a mistake in the language, because we should be able to automate any pattern away. I like that philosophy and I prefer not to focus on design patterns too much, but it's inevitable to encounter them while coding. Even if you do not consciously write design patterns, you may accidentally use them. For example I've used the Factory Method design pattern multiple times, specifically one that takes strings as input and outputs types/classes. This is quite a typical pattern to find in Python as well.

Therefore it's still valuable to think about design patterns. You can see them as best practices that you can learn from. Or you can see them as fun little puzzles, where you take some code out of context and ask "what is the best way to code X?".

Composable Factory Method

Let's write a short example with fruits. Don't ask me why, but sometimes you get strings on the input from a user, or another data source, and you want to turn those into specific (factory) types for your internal code. These "factory types" can later be used to create something else. To be honest, I'm not focusing this article on the entire factory pattern, but only on a composable way to retrieve the type from a string. This also relates to a question about enums as types. Maybe this part of the pattern actually has another name? Who cares, I want to do the following:

module NaiveFruitFactory
    abstract type Fruit end
    struct Apple <: Fruit end
    struct Orange <: Fruit end

    function fruit(str::String)
        if str == "apple"
            result = Apple()
        elseif str == "orange"
            result = Orange()
        else
            error("Unknown fruit $str")
        end
        return result
    end
end

This works fine, right. I can turn a fruity string into a fruit type now.

julia> NaiveFruitFactory.fruit("apple")
Main.NaiveFruitFactory.Apple()

In this naive example, the hardcoded string comparison makes the pattern especially difficult to extend for an outside user: you have to go into the module and add another elseif branch. By the way, there are reasons to avoid this factory pattern altogether. For one, the code is type unstable: the compiler cannot predict the output type from the input type. But as I said, sometimes it's unavoidable. So I am looking for a better alternative that is still readable and performant, yet also easily extendable. I know, software engineering always involves the most insane requirements.
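To make the type instability concrete, here is a small sketch that asks the compiler what it can infer. It uses `Base.return_types`, an unexported reflection helper, and the function names `fruit_from_value` and `fruit_from_type` are made up for this illustration.

```julia
abstract type Fruit end
struct Apple <: Fruit end
struct Orange <: Fruit end

# Value-based factory: the returned type depends on the value of the string.
fruit_from_value(str::String) = str == "apple" ? Apple() : Orange()

# Type-based factory: the returned type follows from the argument type alone.
fruit_from_type(::Type{Apple}) = Apple()

inferred_value = first(Base.return_types(fruit_from_value, (String,)))
inferred_type = first(Base.return_types(fruit_from_type, (Type{Apple},)))

println("value-based factory infers: ", inferred_value)
println("type-based factory infers:  ", inferred_type)
println(isconcretetype(inferred_value), " ", isconcretetype(inferred_type))
```

The value-based version can only be narrowed down to a non-concrete type, while the type-based version yields a single concrete type, which is what the compiler prefers.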

I've read the book Hands-On Design Patterns and Best Practices with Julia by Tom Kwong again for reference. The factory pattern in his Creational Patterns chapter is not exactly what I am looking for, as it doesn't use strings as input. His output factory depends on the input type (not the value), which is preferable. His example is a formatter used for printing certain types in different ways:

abstract type Formatter end
struct IntegerFormatter <: Formatter end
struct FloatFormatter <: Formatter end

formatter(::Type{T}) where {T <: Integer} = IntegerFormatter()
formatter(::Type{T}) where {T <: AbstractFloat} = FloatFormatter()
formatter(::Type{T}) where T = error("No formatter defined for type $T")

So maybe we should have a separate name for a "type-based factory method" and a "value-based factory method"?

I have three options for a composable "value-based factory method" (please leave a comment if you see a better option):

Interactive subtype looping (don't do this!)
Registration mechanism
Value-based dispatching

The first one I considered long ago, is simply to loop over the subtypes of the abstract type. I'll show this was a performance mistake. The fact that you need to import InteractiveUtils.jl in your code is always a big warning sign.

We can do one with a collection like a dictionary and a register! function, but I personally prefer one with automatic registration/subscription of the new type. This pattern is probably something you'd do in Python.

Finally, we can do a Val dispatch, it's a bit slower than the if-else/switch statement. This is what we can use if performance isn't a main issue, like on a public interface function. You may want to reconsider in a deep inner loop that is performance critical for your code.

Let's get into the details.

Subtype Looping

I will show a very straightforward solution that's very difficult for the compiler. I am showing this approach because I made this mistake once. Here's the code. It's very similar to the naive example, except now we ask every type of fruit to provide a fruitname function and we loop over subtypes(Fruit) until we find the string.

module SubtypeFruitFactory
    import InteractiveUtils: subtypes

    abstract type Fruit end
    struct Apple <: Fruit end
    struct Orange <: Fruit end

    fruitname(::Type{Apple}) = "apple"
    fruitname(::Type{Orange}) = "orange"

    function fruit(str::String)
        for type in subtypes(Fruit)
            if str == fruitname(type)
                return type()
            end
        end
        error("Unknown fruit $str")
    end
end

The benefit is that I can let anyone extend this module with their own fruit types with very little code:

module SubtypeFruitExtension
    import ..SubtypeFruitFactory

    struct Banana <: SubtypeFruitFactory.Fruit end
    SubtypeFruitFactory.fruitname(::Type{Banana}) = "banana"
end

It works fine, but the catch is that subtypes does all of its work at run time. The compiler cannot resolve the loop ahead of time, because at any moment a new Fruit subtype can be added. You can see the drastic difference in timing on my computer (I don't even have to do proper benchmarking):

julia> @time NaiveFruitFactory.fruit("orange");
  0.000002 seconds

julia> @time SubtypeFruitFactory.fruit("orange");
  0.014683 seconds (1.01 k allocations: 814.500 KiB)

So let's avoid this one, shall we?

Registration Mechanism

Another straightforward approach. Instead of hardcoding the names of the types that we want to check, we store them in a mutable collection, like a dictionary.

module RegisterFruitFactory
    abstract type Fruit end

    const FRUIT_MAP = Dict{String, DataType}()

    function register!(fruit::Type{<:Fruit}, name::String)
        FRUIT_MAP[name] = fruit
    end

    struct Apple <: Fruit end
    register!(Apple, "apple")

    struct Orange <: Fruit end
    register!(Orange, "orange")

    function fruit(str::String)
        fruit_type = get(FRUIT_MAP, str, nothing)
        if isnothing(fruit_type)
            error("Unknown fruit $str")
        else
            return fruit_type()
        end
    end
end

Similar to the previous example, we can easily extend this one:

module RegisterFruitExtension
    import ..RegisterFruitFactory

    struct Banana <: RegisterFruitFactory.Fruit end
    RegisterFruitFactory.register!(Banana, "banana")
end

Performance is good in my opinion, though slower than the hardcoded if-else statement from the start, due to the dictionary lookup. Let's check the minimum time with BenchmarkTools.jl. (And we always have to be careful that we are not looking at compiler optimizations.)

julia> using BenchmarkTools

julia> @btime NaiveFruitFactory.fruit($"orange");
  10.911 ns (0 allocations: 0 bytes)

julia> @btime RegisterFruitFactory.fruit($"orange");
  146.007 ns (0 allocations: 0 bytes)

Looks okay. Downside is that we are using a global variable in a module to store the registered types. We may have to put locks around that for multi-threading purposes. That would be a topic for another blog post.
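As a rough idea of what such locking could look like, here is a sketch that guards the dictionary with a `ReentrantLock` from Base. The module name `LockedFruitRegistry` and the constant `FRUIT_LOCK` are hypothetical, and I have not benchmarked the extra cost of taking the lock.

```julia
module LockedFruitRegistry  # hypothetical name

abstract type Fruit end
struct Apple <: Fruit end

const FRUIT_MAP = Dict{String, DataType}()
const FRUIT_LOCK = ReentrantLock()

# Writers take the lock, so concurrent registrations cannot corrupt the Dict.
function register!(T::Type{<:Fruit}, name::String)
    lock(FRUIT_LOCK) do
        FRUIT_MAP[name] = T
    end
    return T
end

# Readers take the same lock, because a Dict is not safe to read while it is being modified.
function fruit(str::String)
    fruit_type = lock(FRUIT_LOCK) do
        get(FRUIT_MAP, str, nothing)
    end
    isnothing(fruit_type) && error("Unknown fruit $str")
    return fruit_type()
end

register!(Apple, "apple")

end # module

println(LockedFruitRegistry.fruit("apple"))
```

If all registration happens while modules are loaded and lookups only start afterwards, the lock may be unnecessary; it matters when types can be registered while other threads are already calling `fruit`.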

Value-based Dispatching

Let's have a swing at another Julia solution. In Julia it is possible to dispatch on values, by wrapping them into parametric Val{} types. Note that this works only for plain data values, for example check isbitstype(Int64). Strings are not bits types (isbitstype(String) is false), so they are not allowed as type parameters. However, we can first convert them to symbols and then dispatch on those. Let's have a look at the implementation.
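Before the full implementation, a quick sketch of the restriction itself, using only Base functions. The variable names are just for illustration.

```julia
# A symbol can be lifted into the type domain with Val.
symbol_ok = Val(:apple) isa Val{:apple}

# Plain bits types such as Int64 qualify as type parameters, strings do not.
int_is_bits = isbitstype(Int64)
string_is_bits = isbitstype(String)

# Trying to use a string as a type parameter throws a TypeError.
string_rejected = try
    Val("apple")
    false
catch err
    err isa TypeError
end

println((symbol_ok, int_is_bits, string_is_bits, string_rejected))
```

Converting the string with `Symbol(str)` first sidesteps the restriction, which is the approach taken in the module that follows.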

module ValueFruitFactory
    abstract type Fruit end
    struct Apple <: Fruit end
    struct Orange <: Fruit end

    fruit(str::String) = fruit(Symbol(str))
    fruit(sym::Symbol) = fruit(Val(sym))

    fruit(::Val{:apple}) = Apple()
    fruit(::Val{:orange}) = Orange()

    # default error
    fruit(::Val{T}) where T = error("Unknown fruit $T")
end

The smallest implementation so far! And as always the extension package is 3 lines of code:

module ValueFruitExtension
    import ..ValueFruitFactory

    struct Banana <: ValueFruitFactory.Fruit end
    ValueFruitFactory.fruit(::Val{:banana}) = Banana()
end

How are we doing in performance?

julia> @btime ValueFruitFactory.fruit($"orange");
  236.941 ns (0 allocations: 0 bytes)

Slightly slower than the registration method with a dictionary, but significantly more pleasing to read in my opinion.

Conclusion

In summary, I wanted this behavior in a simple, yet performant manner:

using ValueFruitFactory
fruit("apple") == Apple()
fruit("orange") == Orange()
fruit("banana") # throws error

using SomeFruitExtension
fruit("banana") == Banana()

(I am ignoring namespaces for a moment here, but we can always export those symbols in Julia.)

In the end, a simple switch statement (an if-elseif-...-elseif) is best for performance when you want to construct types from values, such as strings. But that means you cannot extend the constructor with another type, because it's hardcoded in the switch statement. If you want a decently performing, composable solution that is pleasant to read, then the value-based dispatching seems to be the way to go.

I should probably wrap up with a final conclusion about design patterns. First of all, solving little puzzles is fun and when you enjoy your work, you generally do better, so please tinker with design patterns if you find them fun. Next to that it's a matter of balancing the requirements of your code, look for what works best in your case, while keeping less obvious non-functional requirements in mind, such as readability, decent performance and composability. With that pragmatic mindset you can look at design patterns for inspiration.

Software Testing for Scientists

Matthijs Cox — Sun, 11 Jun 2023 12:35:26 GMT

I am currently reading the book "Software Engineering for Science." It is one giant complaint about how scientists are terrible at writing maintainable code for themselves. I won't go into all the pain, but I do recognize that pain deeply and have written about it elsewhere. Right now I am reading this book hoping to find solutions. So, what's the proposed solution? The book doesn't provide a simple answer, but one recurring topic is "testing, testing, TESTING!" So, let's talk about testing!

Why don't scientists test their code? Well, it turns out that most scientists do not have a software engineering background, yet they find themselves writing code and software for their work. Alternatively, they may collaborate heavily with software engineers, either in academia or in the industry. If you find yourself in this scenario, you probably don't have the time to suddenly obtain a computer science degree, but it's beneficial to learn a few tricks from the professional software development community. As mentioned, the primary skill to acquire is software testing.

(As always I want to repeat that when I talk about "scientists", this may refer to anyone who uses modeling techniques and computational thinking in their daily job. This can include data scientists, requirements engineers, business analysts, financial quants or anyone else. See My Target Audience article.)

I know from experience that scientists often use manual testing strategies. In this article I will share a few simple steps that you can take to improve the correctness of your code, thus making you and all your colleagues trust the results of your work. Without trust, people will stop using the software you develop. That's a shame.

Tests also help you "refactor" the code: you can change the code, quickly check that all tests pass, and be confident about your changes. Since your code will change a lot during your work, the peace of mind you gain from knowing everything is working is absolutely worth the effort of learning about a few software test strategies.

Some people even believe that writing tests will make the code more usable, more modular and improve the overall software architecture. So plenty of reasons to test.

There was a time that I didn't write any tests for my code. Several times it was pointed out to me that software testing is a good practice. I never really started, or started only half-heartedly. At some point I was working with a lot of software engineers, and they all got a course in test-driven development (TDD), so I promised to try TDD in a new project. At the start it was painful to change my way of working, but after a while I got the hang of it and I have been writing tests ever since. The main benefit for me is that I do not have to keep the whole codebase in my mind anymore. I can just focus on a small portion, make it work, and then check that I didn't break anything by running the tests. Before, when I didn't have any tests, I would have to consider all dependent code in my mind, and start checking manually whether those other pieces of code still worked. The tests help me relax and save literal headaches.

Semi-Automated Testing

So the first advice is simple: if you ever find yourself running the same manual tests, for example by executing a function with example inputs and checking that the output matches your expectation, then it's time to automate that manual work by writing explicit tests!
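As a minimal sketch of that step, here is a manual REPL check turned into an explicit test with Julia's built-in Test standard library. The function `celsius_to_kelvin` is a hypothetical stand-in for your own code, not something from a real package.

```julia
using Test

# Hypothetical function under test; it stands in for your own code.
celsius_to_kelvin(t) = t + 273.15

@testset "celsius_to_kelvin" begin
    # The same checks you would otherwise type by hand on the REPL.
    @test celsius_to_kelvin(0.0) == 273.15
    @test celsius_to_kelvin(-273.15) == 0.0
    @test celsius_to_kelvin(100) isa Float64
end
```

Once the checks live in a file, re-running them costs one command instead of a round of typing from memory.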

There are many other benefits to automated testing, but the primary reason to get started as an individual is simply to save yourself the effort of running endless scenarios by hand from memory. Manual testing is un-scalable as the code base grows. How can you be sure you didn't break something a colleague of yours is using?

Fully automated testing

Writing tests and running your tests manually is a big step up from having no tests at all. Unfortunately people can forget to run the tests. To increase the confidence in your code, you can automate the tests for every change that is made to your software, and only allow changes that pass the tests.

In our modern age, a junior scientist can single-handedly set up an automated testing system, at least for open source projects on GitHub. In the Julia ecosystem, which is mostly written by scientists, 89% of packages have automated tests.

Unless you are forced to set up infrastructure inside your own organization, automating your test suite should be relatively low effort, yet high reward.

To get you started, you can read my previous article about how to automate your tests and code quality.

Regression Testing

The rest of this article will mostly be about the types of tests you can write. Consider them as best practices if you like.

Regression testing is probably the simplest form of testing, and typically what people do intuitively already in manual testing. It's all about checking that the results of your functions reproduce. Run the code with known inputs and check that the outputs match with expected values. If you have no well known reference, these expected values can come from historical runs of your own code.

Basically all you do is this: f(x,y) == expected_value

Stochastic processes are harder to test this way. You may set a seed to keep the code deterministic. Or check that the output falls within some expected distribution. Or focus on testing the non-stochastic components of your code.
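A small sketch of these ideas, assuming two hypothetical functions: a deterministic `simulate` and a stochastic `noisy_mean` that takes its random number generator as an argument so the test can control the seed.

```julia
using Test, Random

# Hypothetical deterministic model function standing in for your own code.
simulate(x, y) = x^2 + 3y

# Hypothetical stochastic function; the RNG is passed in so tests can seed it.
noisy_mean(rng, n) = sum(randn(rng, n)) / n

@testset "regression" begin
    # Expected value recorded from an earlier, trusted run.
    @test simulate(2, 1) == 7

    # Same seed, same result: the stochastic code becomes reproducible.
    @test noisy_mean(MersenneTwister(42), 1000) == noisy_mean(MersenneTwister(42), 1000)

    # Distribution check: the mean of 10_000 standard normal draws should be near zero.
    @test abs(noisy_mean(MersenneTwister(42), 10_000)) < 0.1
end
```

I deliberately did not hard-code a seeded random value as the expected result, because the random streams are not guaranteed to stay identical across Julia versions.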

Boundary Testing

Difficult, or rare, uses of the code are often called "corner cases" or "boundary cases", as they exist somewhere on the boundary of what your code can do. People often forget to test these cases, focusing all effort on verifying typical use cases.

Sometimes people call this "good weather" versus "bad weather" testing. Good weather is the typical use of your code, with input data in some normal operating range. Bad weather happens when less expected input data leads to less expected behavior in your code.

Errors are very common corner cases, or "bad weather". Don't forget to test errors. Errors and their messages are extremely important for users and developers of your software to figure out what went wrong and how to fix the mistakes. Junior developers always underestimate the importance of good error messages.

Other corner cases will depend on the domain you are modeling with your code. If you are simulating fluid dynamics in metal pipes typically ranging from 10 cm to 50 cm, but the user may input 500 cm, then you have to consider that corner case. Do you throw an error beyond a certain range, or provide a warning that the behavior may be incorrect, or test the behavior properly even though most users will never go there? These are all decisions to be made by you, the programmer.
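To make the pipe example concrete, here is a hedged sketch. The function `flow_rate`, its validated range and its placeholder "physics" are all made up for illustration; the point is only how errors and warnings can be tested with the Test standard library.

```julia
using Test

# Hypothetical model with a validated input range (diameter in cm).
function flow_rate(diameter_cm)
    diameter_cm > 0 || throw(DomainError(diameter_cm, "pipe diameter must be positive"))
    if !(10 <= diameter_cm <= 50)
        @warn "diameter outside the validated range of 10-50 cm" diameter_cm
    end
    return π * (diameter_cm / 2)^2  # placeholder physics: cross-sectional area
end

@testset "boundary cases" begin
    @test flow_rate(10) > 0                                # lower edge of the normal range
    @test flow_rate(50) > flow_rate(10)                    # upper edge of the normal range
    @test_throws DomainError flow_rate(-1)                 # bad weather: invalid input
    @test_logs (:warn, r"validated range") flow_rate(500)  # bad weather: warn, but still run
end
```

Whether 500 cm should warn, throw or silently work is exactly the kind of decision the test forces you to write down.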

Extreme input values can also lead to numerical instabilities, which brings us to the next section.

Numerical instabilities

You may have the most beautiful math and science, but when you write code, you'll have to understand some of the limitations of computer hardware. Mathematical problems may be poorly conditioned, but the numerical algorithms can also be a source of errors and mistakes. Numerical instability is about poorly conditioned computer algorithms, even though the math behind it is well conditioned.

For example, a common source of mistakes happens with floating point arithmetic. Be careful with math that uses very big and very small numbers. For example when using 64-bit floating points in Julia, we can get:

    >>> 10^10 + 10^-6 - 10^10
    1.9073486328125e-6
    >>> 10^10 + 10^-7 - 10^10
    0.0

In both cases we expect to get back the small value in the middle, since x + y - x = y, but that's not what we get. We can find the wrong value of y or even obtain a zero. These kinds of issues happen because numbers in computers are represented with finite accuracy, as a trade-off to limit the amount of allocated memory.

This example may seem silly, but if you do any kind of linear algebra with matrices that contain a wide range of values, you may quickly run into such problems without noticing.

For our testing strategy, one simple take-away from floating point arithmetic is to use approximate equalities instead of identical equality checks, so test that x ≈ 5.0 instead of x == 5.0. What tolerances you find acceptable in your comparisons is another big decision you will have to make.
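A short sketch of approximate comparisons in Julia, where `≈` is the infix form of `isapprox`. The tolerance values below are arbitrary choices for illustration, not recommendations.

```julia
using Test

x = 0.1 + 0.2

@test x != 0.3                          # exact comparison fails due to rounding
@test x ≈ 0.3                           # default relative tolerance
@test isapprox(x, 0.3; rtol=1e-12)      # explicit relative tolerance
@test isapprox(1e-10, 0.0; atol=1e-8)   # comparing against zero needs an absolute tolerance
@test !(1e-10 ≈ 0.0)                    # the default relative tolerance never matches exact zero
```

The last two lines show a common trap: a relative tolerance scales with the size of the numbers, so comparisons against zero need `atol`.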

In general, read a good book like Fundamentals of Numerical Computation to get an idea of the interplay between math and computers.

Toy examples

If you have some complicated, multi-dimensional, multi-physics simulation software, you do not really know how it behaves. Actually you are using the software to figure out how your system behaves. So how can you test the behavior?

Well, you can probably compare your code to simpler problems that are well known, like toy models or analytical solutions. Cases where you do know the answer, your code should behave accordingly. If it doesn't match, then you know you have a fundamental error in your code somewhere.
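As a toy illustration, assume your "complicated" code is a numerical integrator. The `trapezoid` function below is a hypothetical stand-in, and the tests compare it against integrals we can solve by hand.

```julia
using Test

# Hypothetical numerical routine: composite trapezoidal rule on n equal intervals.
function trapezoid(f, a, b, n)
    h = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in 1:n-1
        total += f(a + i * h)
    end
    return total * h
end

@testset "analytical references" begin
    # The integral of sin(x) over [0, π] is exactly 2.
    @test trapezoid(sin, 0, π, 1000) ≈ 2 atol=1e-5
    # The trapezoidal rule is exact for straight lines: the integral of 2x over [0, 3] is 9.
    @test trapezoid(x -> 2x, 0, 3, 10) ≈ 9
end
```

If either test fails, the mistake is in the integrator itself and not in the harder problem you actually care about.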

For the more complex cases you are researching, you cannot check the end result, but you can test all the smaller components of your code. The unknown, untestable behavior probably resides in the interplay between all kinds of known smaller parts. As long as you know the smaller components behave according to known physical and mathematical principles, you have more trust in the aggregate.

Reference datasets

Instead of finding simple toy examples, you can also look for reference code and datasets. Either by looking in the literature or by testing against alternative software packages. Your code should do something novel, else you would be using existing software, but there is probably overlap in behavior with other software packages. That overlap in functionality is the part you can check automatically to look for errors in the behavior of your code.

Coverage metrics

Measuring how well your tests cover your source code is not really a testing strategy, but it is really helpful to learn where you can improve your testing strategy. Code that has no corresponding tests yet is low hanging fruit. And while code coverage is no guarantee that your testing strategy is perfect, it is a good first indication for others about how serious you are in your testing. This increases the trust they (and you yourself) may have in your code.

Once you get into code coverage metrics, you can slowly expand to other code quality metrics and tools, to further increase that trust.

Common Sense

A simple way to invent tests is to use your common sense. Let's say you were given a piece of code from a colleague. How would you verify that the code is working properly? What would make you trust that code? Now figure out a way to codify that common sense check into an automated test code. Done!

Most testing strategies are simply common sense. They are best practices found by legions of software developers around the world over the last decades. Stand on the shoulders of all that experience, but don't forget to use your own mind.

Objections to testing

A common objection from scientists, against writing tests, is that their code evolves too fast. They do not really know what to test, because they are using the code to figure out the physics and science. So they say that there is no need to test.

This is not a valid argument I am afraid. Scientists are not really that special. Most professional software developers do not know exactly what their users want. They expect their code to evolve. In The Lean Startup, the whole process of building a (software) company is described as a scientific cycle. Build a product based on assumptions, measure if users want it, learn from that, build some more and continue onwards. It's like a social science experiment, trying to figure out what humans want by building the code. Yet in professional software companies there is always a heavy focus on testing the code.

So I believe that the uncertainty and evolution of the code is no reason against testing. I believe the main reason scientists forgo testing is simply because they are never trained to think about the benefits of software testing. They code as a side project. But if the code is critical to your scientific results, you will be very happy to have the tests to prove the correctness of the code.

Conclusion

I strongly advise you to adopt software testing strategies to improve the correctness and reliability of your code. By starting to use some of the techniques I described, researchers can build trust in their software and ensure its quality. Embracing these best practices will not only save time and effort but also enhance the long term research process.

The Nebulous Mysteries of Scientific Coding

Matthijs Cox — Sat, 03 Jun 2023 11:56:48 GMT

There is a concept in meta-rationality called nebulosity. I will look up the definition later, but in my own words nebulosity means the following:

Nebulosity: a concept or problem is ill-defined. You cannot describe it perfectly. The boundaries of the concept are unclear.

Nebulosity drives rational people crazy; it's worse than NP-hard. Rational people need well defined problems. Even if you can prove that the problem cannot be solved, at least the problem itself should be known. But is this always possible?

You may have a problem that you can barely describe to yourself. You may feel some shape of it, intuitively in your mind, but you cannot explain it perfectly. You notice that it is especially difficult to explain the problem to people unfamiliar with the domain around the problem. There is only some vague shape you can gesture at. After wrestling with the problem for a long time, you may even begin to wonder whether there is a problem at all. This can be challenging if you built an identity or career around such a nebulous concept.

My nebulous problem

The problem I have been wrestling with the last years has such nebulosity. It started simple. Software development is slow in our organization and many organizations around us. One part of the problem, that many people complained about, is that the organizations contain many scientists who do not know how to develop software properly, and the professional software engineers do not understand the scientific domain. This causes lots of errors, both in the communication and in the software itself.

(This is a nebulous problem that can be generalized to any profession that involves people who focus on learning the domain, instead of building the products, say a business analyst, a financial quant or a mechatronics designer. Generalizing a nebulous problem makes it more nebulous and even harder to solve. More people will feel the shape of the problem, but it applies less to their exact case. This is a nebulous problem faced by high level thinkers in general, leading to proposed solutions that do not apply to the context. We have a potentially recursive nebulosity growth here.)

The scientist vs engineer problem seems easy enough to fix. Simply teach the scientists the good practices of software engineering. Give them the right tools for their domain. Then they write better code and create better scientific software, or at least they learn to communicate better with software engineers. But this is a nebulous problem. It turns out many of the scientists do not want to learn software engineering skills. It will take too much time away from their real science work. They are also not rewarded for getting better at coding; they are rewarded for finding insights and writing articles. This simple problem just became some kind of complex resource allocation problem: how much time should scientists spend on software skills so that it pays off in their career, without becoming a non-scientist? Then there is the fact that all their scientist friends around them are not great coders either. Why should they change first? Is it a peer-pressure problem? Or maybe they believe they are actually amazing coders, never having met better coders in professional settings. Look at how quickly I wrote these thousands of lines! I have been successfully working like this for 20 years! You cannot teach me anything. Never mind that their colleagues cannot understand the code, nor reproduce any of the results. Maybe it's even a status thing: unlike engineers, the scientists may look down upon building and coding? There are so many possible root causes.

Virtually all these problems are interpersonal human problems, not hard science puzzles like we find in math or physics. Interpersonal problems are virtually always nebulous and multi-faceted. Many rational-oriented people shy away from interpersonal problems, thus enhancing the problem instead of tackling it head-on. This is another nebulous problem. You cannot see the cloud from the inside. Sure, it's a little foggy around here, but that's always been the case. Yet the frustration remains.

Or is there really a problem? It's good to question your own beliefs from time to time. Can we argue the problem away?

Scientists focus on understanding the universe, and occasionally build something for that reason. Engineers focus on building stuff, and use their understanding of the universe for that. Perhaps these activities should be kept separate? Or perhaps separation of these types of people happens naturally in large organizations and we should accept that fact of life? Or maybe we should allow a third group to arise, scientific coders, an elite group of people who help bridge the gap between the two cultures? Problems can become opportunities, right?

I have spoken with managers who believe there is no problem. They are quite satisfied with the two culture separation. They prefer the scientists to only communicate their findings to the software engineers via another medium than code. Maybe math-like pseudocode, written in ambiguous text documents, or haphazardly explained in a few meetings. Or perhaps the scientists share the incomprehensible throw-away example code. "Ambiguous", "incomprehensible" and "irreproducible" are keywords here, because the documents are never clear to the engineers, the example code is complex and doesn't reproduce. The software engineers are quickly confused and give up on understanding altogether. The scientists become frustrated with the miscommunication and perceived apathy of the software engineers. The product development is delayed and the resulting code behaves incorrectly.

This doesn't seem like an acceptable situation to me. Yet the proponents of improving scientific software engineering also seem confused. (That includes me.) No one knows the exact solution that can finally resolve the matter effectively. After many years of wrestling with this cloudy issue myself, I have learned a great deal, but have not succeeded in pinpointing the exact problem. Most of my success has come from finding other people who also experience this nebulous problem. People who cannot accurately articulate the root causes either, yet feel the pain and want to solve the matter. I started calling them scientific coders, but even that is nebulous; finding the right words to name these people.

This cloudy-ness has become a growing part of my career and professional curiosity. With this blog I hope to clarify my thoughts, to better describe the shape of the problem, and identify possible solution directions. The uncertainty around the problem definition does not reduce my confidence in moving forward.

Nebulous conclusion

So, for now: breathe in, breathe out. Embrace this journey through the cloud. We can neither define nor solve the problem quickly. There is no shortcut that I know of.

If you are interested, here is the original definition of nebulosity that I referred to: metarationality.com/nebulosity. It describes nebulosity far more in-depth than I did. Actually the entire meta-rationality blog seems to revolve around nebulosity.

The concept of nebulosity is fascinating in itself. A big step in your personal development may come from the conscious choice to stare nebulosity in the face. To accept its existence. A lot of that personal development is dealing with uncertainty, because many people struggle with uncertainty in life. Once you see nebulosity, you cannot un-see it. You may notice that all concepts are a little nebulous. Nothing is perfectly defined.

Edsger W. Dijkstra, famous in many ways, seems to defy nebulosity by noting that "The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." This is interesting on several levels. First of all, I slightly disagree since abstractions are leaky, so their precision will fail under the right circumstances. Secondly, you should read the context of his thoughts. This quote comes from a lengthy lecture where he discusses all the misconceptions around programming. While the quote itself is about code, I can already see the two culture problem emerging in his talk as he laments about scientists who do not appreciate computers and programming. Observe how great thinkers struggle with this nebulosity, even as they confidently announce precision in some intellectual areas.

Here we come to the end of my introspection. I questioned whether to publish this blog post here on The Scientific Coder or on my personal website Functional Noise. Since I've applied nebulosity to scientific coding, this blog seemed like the right place. I believe it can help any of you deal with the stress and difficulties of being stuck inside this nebulous problem. Know that you are not alone and that it is no shame to struggle within this field of work.

Scientific Software Institutes

Matthijs Cox — Wed, 31 May 2023 12:34:28 GMT

Have you ever gone through life completely oblivious to something? I recently experienced that sensation when I stumbled upon an entire ecosystem of institutions, only learning about them after starting this blog. These organizations are dedicated to promoting better scientific software, which aligns with the mission of my blog. I wanted to know what's going on, so let's have a look at what's out there.

I noticed the names of the fields of "scientific software" vary a little, but I consider all of these roughly equivalent:

Scientific Software
Research Software Engineering (RSE)
Scientific Computing
Numerical Computing

Yes, there are differences between them, but all of them involve turning scientific knowledge into algorithms and software, and writing software to do scientific research or other exploratory research. My apologies if someone has a strong feeling about a name meaning something entirely different from the others.

This article may not be my most interesting one, but I'd like to curate and store everything I've found for future reference. People on LinkedIn have already been kind enough to assist me when I asked nicely.

How many institutes are there?

My goodness, so many!

My journey started by encountering a post about Better Scientific Software on LinkedIn. I was impressed that this institute gives away $25000 fellowship grants to people helping to improve scientific software.

But after some searching and asking around, we can quickly find many more:

BE-RSE - Belgium Research Software Engineers community
DE-RSE - Society for Research Software in Germany
NL-RSE - The community of Research Software Engineers in the Netherlands
NORDIC-RSE - Nordic Research Software Engineers Community
RSE-AUNZ - The RSE Association of Australia and New Zealand
SocRSE - Society of Research Software Engineering - UK
US-RSE - The US Research Software Engineer Association
Danish RSE - Danish Research Software Engineers Community
RSE Asia - You get the idea

The list doesn't stop here, there are all kinds of more creatively named organizations:

Netherlands eScience Center
SURF in the Netherlands, has research-oriented IT
Hardware Acceleration Network in the Netherlands
Digital Research Alliance of Canada
Australian Research Data Commons (ARDC)
National Center for Supercomputing Applications (NCSA)
Software Sustainability Institute (SSI)
Le Groupe Calcul, French obviously
A bunch of FAIR initiatives (Findable, Accessible, Interoperable, Reusable data principles in science) seem related, like fairpoints.org and go-fair.org
IDEAS initiative of the US Department of Energy

Some of these groups provide grants to researchers. Others provide paid consulting services. Most of them seem to blog and try to create a "community", which is typically a Slack channel to chat, but sometimes includes dedicated conferences.

International Institutes

Most of these organizations focus on the interests of a single nation, probably because most funding comes from governments. But there exist a few global institutes for scientific software.

This Research Software Engineers International organization claims to be an umbrella for many other RSE organizations across the globe.
But wait, there is another one, the Research Software Alliance (ReSA) that claims to be a worldwide RSE institute.
There is a UK-centric Society of Research Software Engineering, but someone mentioned they have a very active international Slack channel. And this society organizes a global conference called RSECon.
When it comes to conferences, there is the Society for Industrial and Applied Mathematics (SIAM) which I know from their conference recently in Amsterdam.
There's a Research Software Directory that tries to make an overview of ... you guessed it: research software. It tries to index known software packages, but also has an overview of all contributing organizations.

The only organization I knew before all this is NumFocus, which has a slightly different goal of promoting open-source numerical computing software, such as NumPy and Julia, and which sponsors conferences such as PyData and JuliaCon. Because of their visibility at conferences and heavily used packages, they are much better known.

Competing with NumFocus, or maybe complementing it, is the Essential Open Source Software for Science program by the Chan Zuckerberg Initiative. It seems to fund lots of open source package improvements, many of them for NumFocus projects.

Industry

Very few of these institutes are focused on industry or industry collaboration. According to the Research Software Alliance (ReSA) on this webpage, only 1/12th of their funding is from the industry. I have also noticed that most of the websites focus on academic research.

Who is doing numerical and scientific computing in the industry?

I bet a lot of companies. In our JuliaLang Eindhoven Meetup we would like to find everyone doing numerical computing in our area. Generalizing bluntly, I believe the Julia meetups attract numerical computing and scientific software enthusiasts, while PyData meetups attract more data science and AI enthusiasts.

Another trick to finding industry users could be by looking at Mathworks and JuliaHub customers. You'll find examples from fields in automotive, semiconductors, finance, pharmaceutical and many more. Successful numerical computing service providers are good at finding their industry users.

Maybe I will write down a good industry overview in another article. I am interested in learning how scientists do numerical computing and write software at places I haven't heard of yet.

Conclusion

There are many organizations promoting and improving scientific software practices. I've only done a quick sweep through the field and found plenty. I may expand this list in the future with updated findings. Whether you are looking for grants or a community of like-minded individuals, you can get started with this overview. Or just be amazed like me that such organizations exist at all.

Clean Code Tips for Scientists #1 - Reproducible Environments

Matthijs Cox — Wed, 24 May 2023 12:03:44 GMT

Author commentary: I am starting a "clean code" blog series with simple tips that you can integrate into your workflow. I often write long, complicated articles that try to teach a lot at once. This is an attempt to chop things up in bite-sized chunks. Note that the Clean Code books by Robert Martin are great, you should read them if you have time! If not, you can follow these short articles :)

If you've written a lot of scripts and shared some of those scripts with colleagues or others, then you probably encountered the problem that the code doesn't always work on their device, or produces different results. When this happens, people may quickly lose trust in your results and begin to ignore your work entirely. So making code reproducible is extremely important! Even if you are a scientist and not a professional software developer. I'll explain a simple strategy you can take to make your code more reproducible.

Code Environments

First we must take a small step back from your code. Because when you write your script, it is not standalone. It exists in a certain "environment". Besides the hardware of your computer and your operating system, this involves your programming language version and all the (open-source) packages you used to run your code.

When sharing the environment with someone else, you do not want to give them your computer, right? Nor do you want to send all the dependent package code on your computer, because that can easily become gigabytes of packages and dependencies. The environment may not even work exactly on their computer. All kinds of issues may make relocating the environment difficult, for example if they use a different operating system (Linux instead of Windows).

Instead, you want to share a way to install an exact copy of your environment, by sharing the exact configuration of packages you used.

Python Environments

In Python you typically share your dependencies with a requirements.txt file. You can find plenty of blog posts online about this approach, like here. There are also alternatives like Poetry that try to make Python environment management easier for you.

I won't go into the details of Python environments here, but please know it's possible. Instead I'd like to show how this problem is tackled in the Julia language. If you prefer another language, then you can consider this an example.

Julia Environments

In Julia everything can be done with the built-in package manager.

Let's say you have your very important script file. It looks something like:

    using DataFrames, LinearAlgebra
    # much important code for your colleagues

What you want to share is the exact same versions of the packages you are using to run this script, including all the package dependencies (for example DataFrames v1.5 is using DataAPI v1.14 under the hood). If you can easily send that knowledge to your colleague, then you can be sure they will get the same results.

Start with an empty environment. Add all the packages you use for your script. You can use the Julia Pkg mode on the REPL with ], or write something like this:

    using Pkg
    Pkg.activate("ExperimentNinetyFive")
    Pkg.add(["DataFrames", "LinearAlgebra"])

You will now have a folder called ExperimentNinetyFive on your device, with two files inside: a Project.toml and a Manifest.toml. The Project.toml simply lists the packages. The Manifest.toml is what describes your exact environment:

The Julia version
All packages you added with their version, such as DataFrames version 1.5.0
For each package: lists all their dependent packages. Such as DataAPI for DataFrames.
For each dependent package it specifies the version, such as version 1.14.0 for the DataAPI package.

Here's a picture showing a snippet of the Manifest.toml (it's 234 lines in total for me):

To share a reproducible environment with a colleague, all you need to do is put the script inside the same folder, and then zip it, or push it to a repository, or whatever way you prefer, and send it to your colleague. After receiving your code, all your colleague now needs to do is this:

    using Pkg
    cd("path/to/ExperimentNinetyFive")
    Pkg.activate(".")
    Pkg.instantiate()
    # and then they can run the script
    include("another_script.jl")

The function Pkg.instantiate will install all the packages exactly according to the Manifest.toml. So your colleague will use the exact same versions as you did.

That's it! Modern programming languages come with a simple package manager for the purpose of sharing reproducible code.

If your code is meant to be re-used inside other people's code, the next step would be to make a package that can be installed and updated automatically (instead of emailing your script). Packages are essentially installable code, including a reproducible environment and preferably things like documentation and tests. But that's for another blog post.

In general: never only share your code. Share a reproducible way to set up your coding environment as well!

Appendix

Warning: You inherit the global shared environment!

What do I mean by this? Let me briefly explain. When you start a Julia REPL you typically start in the global environment like @v1.8. If you install packages in @v1.8 and then switch to another environment, those packages are still available. This means you may accidentally forget to add those packages to your new environment, because your script just works. But the environment you share with the Manifest.toml is still not reproducible for someone else! It's missing some dependencies.

To avoid this problem, and other issues, I typically keep my global environment as clean as possible, with only a few utility packages that I only use on the REPL, such as Revise and OhMyREPL and LocalRegistry. This way I keep all my environments separate.

Similarly be careful when switching environments within a single Julia REPL session. I would advise to test your script once in a fresh REPL, before you send it to others.
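One way to check for this leak, sketched here as an assumption rather than an official workflow: in a fresh session, remove the global environment from Julia's `LOAD_PATH` before running your script. By default `LOAD_PATH` contains "@" (the active project), "@v#.#" (the global environment) and "@stdlib" (the standard library). `Base.active_project` and `Base.load_path` are unexported Base functions that I use here only for inspection.

```julia
# Restrict package lookup to the active project and the standard library, so
# packages installed only in the global environment (like @v1.8) are not visible.
empty!(LOAD_PATH)
push!(LOAD_PATH, "@", "@stdlib")

# Inspect which project is active and where Julia will now look for packages.
println(Base.active_project())
println(Base.load_path())
```

If you now activate your environment and include your script, any `using` of a package that is missing from the Project.toml should fail with an error instead of silently falling back on the global environment.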

Pluto does it all

Pluto notebooks are designed to be reproducible. Under the hood they contain the package environment inside them (check by viewing the Pluto .jl files in your favorite text editor). This can make it easier to share a Pluto notebook instead of a script or package.

Other programming languages probably have other solutions for easy sharing of environments and scripts (though Jupyter notebooks do not do this well). Or you can try online editors like Replit, which maintain the environment for you. I would still advise you to understand how package environments work in your favorite programming language, because you cannot use notebooks for everything. And leaky abstractions are always a good reason to occasionally look under the hood.

Building a Scalable Inner-Source Ecosystem For Collaborative Development

Matthijs Cox — Wed, 17 May 2023 12:29:08 GMT

Three years ago, we decided to embrace the Julia programming language to solve the two language problem at our organization. We want our scientists to join forces with software engineers so that they can work on the same problems together. In our journey, I could have used more books or blogs to guide us on the following topics:

1. How to build and deploy software products with the Julia language?
2. How to create the seeds for an effective scientific software ecosystem?

This article is here to help you with the second topic, but I warn you that we had to figure out 1 and 2 at the same time. I intend to write more blog posts about the Julia productization aspects. Yet in the long term, I am betting on the ecosystem to radically improve our organization, so I consider that more important to blog about.

One thing I must continually emphasize is that the technology alone, regardless of how wonderful Julia is, cannot change people. What I needed was an environment where our scientists could contribute to product development in a rewarding way, while upholding the quality standards of modern software engineering. Additionally, we required a setup that could eventually scale to thousands of engineers.

An Ecosystem Blueprint

We had to figure out everything from scratch. I hope this article will help fledgling ecosystem architects to gain a head start in their organization. Consider it a guide, or a blueprint, but be mindful of the unique needs of your own organization. I will use my own experience as an example. See my JuliaCon presentation for more information.

As architects of this ecosystem, our main design choice was to have a development workflow that feels similar to being an open-source developer, to help onboard scientists and engineers with little friction. You should be able to install internal packages via a standard package manager, Pkg in the case of Julia. You can use your favorite IDE, though we advised VS Code due to the maturity of its Julia plugin. You work with Git and code reviews. You have automated testing pipelines to check your commits and pull requests. Ideally, all code and tools are available to all engineers; that's what I call "inner-source". Everything should feel instantly recognizable, even if we use slightly different tools and practices than the open-source community.

Types of Repository Structures

To begin, we will have to choose how we kickstart our codebases. As mentioned I wanted a full-fledged inner-source ecosystem. However, different organizations may have different desires. On a high level, I can imagine the following scenarios for your development organization:

1. A single repository, with a single monolithic package. Probably with internal submodules as it grows bigger.
2. A single repository, but with multiple packages inside. With or without a registry.
3. A multi-repository, multi-package setup. Similar to the public open-source ecosystem you observe on GitHub, including a separate registry.

What is a package registry? A registry is merely a lookup table with links to all the packages in your organization. A package manager uses this registry to find and install packages for the users, including all the package dependencies. For example, see the Julia General Registry, or the Python Package Index. I advise setting up a separate local registry in your organization for your internal packages.
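To make the "lookup table" idea concrete, here is a minimal sketch that parses a registry-style TOML table and looks up a package by name. The registry name, URL, UUIDs and package names are all made up for illustration, and find_package_path is a hypothetical helper, not something Pkg provides. A real registry stores more per package (versions, dependencies, compatibility bounds) in separate files.

```julia
using TOML

# A made-up registry index: every entry maps a package UUID to a name and a path.
registry_toml = """
name = "MyCompanyRegistry"
repo = "https://git.example.com/julia/MyCompanyRegistry.git"

[packages]
11111111-1111-1111-1111-111111111111 = { name = "SignalTools", path = "S/SignalTools" }
22222222-2222-2222-2222-222222222222 = { name = "PlotHelpers", path = "P/PlotHelpers" }
"""

registry = TOML.parse(registry_toml)

# Hypothetical helper: find the registry entry for a package name, or `nothing`.
function find_package_path(registry, name)
    for (uuid, entry) in registry["packages"]
        entry["name"] == name && return (uuid = uuid, path = entry["path"])
    end
    return nothing
end

println(find_package_path(registry, "SignalTools"))
println(find_package_path(registry, "DoesNotExist"))
```

The package manager performs this kind of lookup for you, and then follows the entry to the repository that holds the actual code.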

A mono-repo with multiple packages seems common among startups. See this discussion about the Beacon Biosignals approach and the responses from others. But I have also heard about startups that use option 1: a mono-package setup.

The advantage of the first option, a mono-package, is that you need no serious package management. No local registry is needed to install dependencies. You clone and go. The downside is that this single package can quickly grow big and clunky, slowing down pre-compile times and maybe coupling internal interfaces. If you want to develop quickly and deploy separate modules to separate products, how will you disentangle everything? I also don't know how difficult it is to move from option 1 to option 2 once you are over-invested. Overall, I would advise option 2 to get started, if option 3 (multi-repo) seems too scary. If you immediately set up a registry and make it part of the workflow, then splitting off packages into multiple repositories should be easy in the long run.

At the beginning of our journey, I immediately aimed for option 3, the multi-repository setup, because I wanted to mimic an open-source ecosystem. However, the main package architecture was still rapidly evolving and the dependencies between the packages were not entirely clear. In the multi-repo scenario this forced us to open many pull requests at once, into multiple repositories, for every change. So after a few months of working with 3 developers on our first product, we restructured the packages into a hybrid approach: one big repository with all our main product-related packages, and a bunch of satellite packages that were well-defined and reusable by future projects. I've seen open-source projects like Makie migrate to a similar hybrid setup.

Today we still work with this approach in our department, sometimes spinning off packages out of the big repository into a separate repository whenever we think it's a common facility useful for other teams or departments. If we know at the start that a package is common, we typically immediately start it in a separate repository. Other departments sometimes follow our hybrid approach, or start with a full multi-repository setup, depending on their development needs.

Types of Packages

Next to setting up a repository structure in a version control system, I needed to distinguish between different types of packages, which require different ways of handling them.

Open-source packages, which may need to be checked and approved.
Inner-source packages, which are useful for multiple groups besides your own, or even common for the entire organization. These packages may depend on open-source packages.
Domain-specific or product-specific packages, that only apply to a single group in your organization, where access might be restricted to a need-to-know basis. These packages may depend on inner-source and open-source packages.
Integration packages. These are end-points from the development ecosystem perspective. For example, a Julia REST server which provides an API around a set of domain-specific packages and gets deployed into a cloud application. Or a package that gets compiled and integrated into C++. Multiple domains may collaborate and deploy together, or independently, that depends on your product environment.

I will not go into the deployment considerations in this article. But I often had to explain to managers these different types of packages and their relationship to the final product.

We are also considering adding tags to certain package versions, to distinguish maturity levels or use cases:

Prototyping packages for personal projects, or very early research explorations among a few scientists.
Research packages, used by many scientists, but not used (yet) in commercial products. Plotting and data analysis packages typically fall in this category.
Tooling packages, used for testing or deployment. Important for developers, but not shipped to production.
Production-grade packages, shipped to customers. These should not break!

We're still working on the exact details. Typically it's pretty clear which package is what: a package that is still at version 0.x.y is probably a prototype or research package. But there is mobility between the package types. A research package can suddenly become integrated into a new product, at which point we need to address the quality and reliability of the code, and make it clear to the researchers how to continue working with this more mature package.

Typical Developer Workflow

As an inner-source ecosystem architect, the developers are your customers. You should design the system such that it supports an ideal developer workflow. For your developers it should feel low-effort, and rewarding, to make high quality deliveries.

We typically consider two types of profiles:

Package users that do not develop packages, such as data analysts.
Package developers that write the package code.

Often the package user and the package developer are the same person, especially at the beginning of your ecosystem when there are no packages yet. Therefore I focused most of my effort on making the life of the developer easier, hoping that the developers will make the user's life easier.

The developer workflow is not linear, but if I have to linearize it for the sake of this article, I would divide it into the following steps, all of which need software infrastructure.

Explore Packages - Anytime you start, you'll probably have to explore existing packages, to figure out whether something already exists out there in the world before inventing it yourself. You want infrastructure that makes it easy to search and discover packages. And it should be easy to read the package documentation.

Prototyping - Once you have found the interesting packages, you probably want to do some prototyping or data analysis, or whatever is necessary to figure out your new requirements. In this phase you are still a passive user of the ecosystem, merely installing packages. But package installation should be easy with a registry and package manager in place. Simply run Pkg.add("InternalPackage") and use the package in your development environment.

Developing - Once you contribute to an existing package, or develop a new package, you'll have to use standard Git tooling to clone the code and commit new changes. You write tests, and when you commit changes to a package, the continuous integration systems get triggered, automatically running all tests and other checks, similar to an open-source contribution.

Monitoring - During development, any contribution is already qualified, but as developer or package owner you want to monitor the code quality over time, with metrics such as code coverage, to make sure your packages are continuously improving. (To be fair, we enabled this step last.)

Sharing - After creating a new package version, you want to update the internal registry to share this new version with others via the package manager. You also want to create and host the updated package documentation. This step may or may not be fully automated in your organization.

Deployment - Packages that are released as formal products should be deployed, as libraries or microservices or otherwise. We automated this in the main branch of the product integration packages, once a new version is released.
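For the Prototyping step above, the user side can be as small as the following sketch. It uses Pkg's registry API to add an internal registry once per machine, after which internal packages install like any open-source package. The registry URL and the function name install_internal_package are placeholders for illustration, not our actual setup.

```julia
using Pkg

# Placeholder URL: replace with the Git URL of your own internal registry.
const INTERNAL_REGISTRY_URL = "https://git.example.com/julia/InternalRegistry.git"

# Hypothetical convenience function: make sure the internal registry is known,
# then install a package by name into the active environment.
function install_internal_package(name::AbstractString;
                                  registry_url::AbstractString = INTERNAL_REGISTRY_URL)
    # Needed only once per machine.
    Pkg.Registry.add(Pkg.RegistrySpec(url = registry_url))
    Pkg.add(name)
end

# Example call (requires network access to a real registry):
# install_internal_package("InternalPackage")
```

In practice you would add the registry once, for example in an onboarding script, so that users only ever type Pkg.add.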

Package Development Infrastructure

We need many tools to support a developer-friendly workflow inside a large ecosystem. We grew all this infrastructure organically over the years, solving one bottleneck at a time. Here are some of the many tools and practices we worked on, roughly in chronological order:

I assume you already have a Git repository hosting system in place, such as GitLab, GitHub or Bitbucket.
I also assume you have some Identity and Access Management (IAM) layer in place, so users can connect to the infrastructure tools with your company credentials. I'm adding this layer explicitly since access rights are often a source of bureaucracy and frustration. The "inner-source" concept can be at odds with IT security people who want to restrict all access by default.
With basic IT tools in place, the first thing I did was set up a local registry. With Julia this is easy since the registry is just another repository. I did it in an afternoon and it saved us endless effort. With Python I know that it's a bit more tricky to set up a local PyPI.
Create your first packages and figure out their dependencies, together with a high level strategy for repository ownership among the different projects and groups within your organization. This is a complex topic, and the structure will evolve, but a solid start based on real domain experience helps a lot here.
The value of workshops and courses to teach multi-package development should not be underestimated, especially for scientists with limited software experience. Be ready to endlessly explain Git, SSH keys, test-driven development, CI/CD, language fundamentals, and much more.
Automated pipelines for each package are crucial. This enables scientists and developers to automate their quality checks, instead of running tests and checks manually. If this step is easy, scientists will use it for their research packages, giving them early DevOps training. This makes the later handover to software engineers more pleasant and reproducible.
Documentation hosting for each package, to explain the API and provide examples for package users. I hope the benefit of good documentation is clear to everyone. Maybe in the future an AI can automatically explain how your code works, but not today.
Production-grade build pipelines, as an extension to the existing automated testing pipelines, to integrate packages into bigger systems. Otherwise, your scientists will be spending endless time manually compiling and delivering code. Is that a good use of their time?
Qualification pipelines, where we test all registered packages at once. This was first built to qualify new Julia language versions before rolling them out, but we run it more often for ecosystem-wide testing. Maybe we'll go to a nightly run like the open-source community performs here.
IT Security assessments of our open source code usage. For example, to enforce correct license usage and guard against supply chain attacks. In general, you probably want an internal mirror server where you store all (approved) open-source packages.
Formalized coding standards and style guides. If you don't have these, a lot of time in code reviews will be wasted on silly aesthetic arguments about the best way to define a function.
... and much more

It's important to quickly build a solid foundation that supports your developers from day one and nudges them towards delivering high quality. After that you can continuously improve the workflow of your developers, adding, removing or changing tools as required.

I am very happy that we now have a serious set of development tools in place, supported by amazing DevOps and IT engineers. This took considerable time and effort to get into place. As the architect you'll need long-term commitment to make this happen.

Package Architecture

A difficult topic, especially when working with many scientists with little software engineering background, is figuring out the best configuration of packages. What should you put in which package? How many packages do you need? When should you split a package? What should each package's API look like? How can packages work seamlessly together where necessary? How should packages depend on each other?

I advise studying Domain-Driven Design (DDD). Besides that, it's endless tinkering based on real-world experience of your business domain, continuously refactoring, and hoping Conway's law doesn't get in your way. Especially at the start, I was heavily involved in writing the code of many of our packages, building products while trying to avoid big bottlenecks for the long term. I am sorry to say that we have not found a shortcut for deciding on the right package architecture.

One trick that helps us to refactor safely is to define an "interface package" that your users can rely upon. Then you can refactor and restructure packages behind that interface while keeping the interface package itself backward compatible. Users do not enjoy constantly re-learning how your package works. In the previous section named "Types of Packages" I discussed the integration packages, which also serve this interface package purpose, if you consider the production systems as a user.
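As an illustration of the interface package idea, here is a minimal sketch with two modules in a single file. In a real ecosystem these would be two separate packages. All names (FastSolverImpl, SolverInterface, solve, solve_impl) are hypothetical.

```julia
# Implementation package: free to be renamed, split or rewritten over time.
module FastSolverImpl
solve_impl(x) = 2x
end

# Interface package: the only thing users depend on. Its exported API stays
# backward compatible, even when the implementation behind it changes.
module SolverInterface
import ..FastSolverImpl
export solve

# Stable entry point that forwards to whatever implementation is current.
solve(x::Real) = FastSolverImpl.solve_impl(x)
end

# User code only ever touches the interface.
using .SolverInterface
println(solve(3))
```

If you later replace FastSolverImpl with something else, only the forwarding line inside SolverInterface changes, and user code calling solve keeps working.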

Deviations From Open-Source

A business has different requirements than the open-source community, which results in some deviations from the open-source setup. Here are just a few differences that I learned to be mindful of and that I had to explain to junior developers:

Rapid development. Typically the pace of development is faster in a business, with multiple developers working full-time on multiple packages at once. If you are unable to separate concerns properly, multiple teams may even be working on the same package at once, with a lot of possible chaos.
Product integration. Open-source development is all about sharing packages with users and fellow programmers. A business is all about building products and services, so there is much more emphasis on integrating with cloud applications and embedding in devices, and everything that revolves around that.
Access restrictions. Due to the risk of disclosing sensitive information and other security concerns, companies will have much stricter access control. In an open-source world, everything is transparent and accessible to everyone. In your inner-source world, this may vary per package.

Conclusion

In conclusion, building an inner-source package ecosystem requires careful planning, a solid foundation, and continuous improvement to support developers in delivering high-quality software. By adopting best practices from open-source communities and adapting them to fit the unique needs of an organization, you can create a thriving ecosystem that fosters collaboration between scientists and software engineers, ultimately improving development effectiveness.

Continue Reading

How to solve the Two Language Problem? - An overview of software technologies to get speed and simplicity at once. Comparing Python, C++, Cython, Numba, Julia and more.
Automate your Code Quality in Julia - An overview of tools and methods that help improve your code.
My Target Audience - Where I explain what kind of people I have in mind while writing this blog. Includes the Two Culture Problem as I observe it.
Organizational Refactoring (on my previous blog) - About the human challenges of creating a better scientific development organization.

Extreme Multi-Threading: C++ and Julia 1.9 Integration

Matthijs Cox — Thu, 11 May 2023 14:10:09 GMT

In this tutorial we demonstrate how to call Julia libraries with multiple threads from C++. With the introduction of Julia 1.9 in May 2023, the runtime can dynamically "adopt" external threads, enabling the integration of Julia libraries into multi-threaded codebases written in other languages, such as C++. This article is written in collaboration with Evangelos Paradas, the maestro of algorithm deployment at ASML. Evangelos has been responsible for heavily testing and debugging this multi-threading feature. I humbly repeated the final results after his many trial-and-error attempts and summarized everything for you in this article.

Julia in production

Julia is a general-purpose language designed for scientific and numerical computing, striking a balance between speed and simplicity. The adoption of Julia in the industry is growing every year, but complex cases require enhanced deployment capabilities in the core of the language. One such crucial improvement we needed was the ability to call Julia libraries with multiple threads from another language. Fortunately, this is now possible in Julia version 1.9. Since we have been involved in testing this new feature extensively, we would like to share this tutorial with you to accelerate your journey with external threads in Julia.

Weaving threads across multiple programming languages is an extreme sport in software engineering. You do so at your own risk. Incorrect usage of this technology will crash your production systems. You have been duly warned.

Before starting, make sure you are working with Julia 1.9, either by using juliaup or by downloading Julia 1.9 manually and adding it to your path.

Introduction to C++ embedding

In the past, I have spent quite some time writing a tutorial about how to embed Julia libraries into C++. It's not trivial. At a high level, the steps involved are:

Create a Julia package with the Julia c-interface functions
Write the C++ code that will call those Julia functions
Compile the Julia code to a library with PackageCompiler.jl
Compile C++ and link it to the Julia library

I won't delve into all the specifics above, so if you wish to reproduce the results of this article, it's advisable to first read my previous article. Prior to embarking on a multi-threaded adventure, make sure that you are intimately familiar with embedding in a single-threaded manner. Having multiple C++ threads call into Julia is an exceptionally advanced subject, particularly if you have limited prior experience with C++ and multi-threading. Take your time to learn the ropes.

Julia code example

We wrote a very simple Julia function that throws an error depending on the input value. The exact Julia functionality doesn't matter in this article. As mentioned, you can read the extensive blog post on my previous blog for details, but here are the important highlights for making a Julia function ready for C/C++ embedding:

Use Base.@ccallable to make sure the Julia function can be called from C/C++
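Use C types on the interface. In this example I only use Cint types. Note that Cint is an alias for Int32 on common platforms, so a Cint in Julia and an int in C have the same memory layout. Julia's default Int is usually 64 bits wide, so be explicit about the types at the interface.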

Base.@ccallable function divide_function(input::Cint)::Cint
    if input > 10
        throw(ErrorException("You cannot divide by more than 10"))
    end
    outputValue::Cint = div(12, input)
    return outputValue
end
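Before compiling anything, you can exercise the C-facing types from plain Julia. The sketch below uses a stand-in function without @ccallable (the name divide_plain is hypothetical), obtains a C function pointer with @cfunction, and calls through it with ccall, roughly as C code would. It is only a quick sanity check of the interface types, not a replacement for testing the compiled library.

```julia
# Cint is Julia's name for the C `int` type.
@show Cint sizeof(Cint)

# Plain-Julia stand-in for the article's function (no @ccallable needed here).
function divide_plain(input::Cint)::Cint
    if input > 10
        throw(ErrorException("You cannot divide by more than 10"))
    end
    return div(Cint(12), input)
end

# Obtain a C-compatible function pointer and call through it.
divide_ptr = @cfunction(divide_plain, Cint, (Cint,))
result = ccall(divide_ptr, Cint, (Cint,), 4)  # the literal 4 is converted to Cint
println(result)
```

Because both sides of the call use Cint, no conversion surprises can hide at the language boundary.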

When you place this function inside a Julia package, you can compile it to a library with PackageCompiler. An example build script can be found in my GitHub repository that accompanies this article.

Initializing Julia

Here are some of the interfaces that are important for initializing the Julia library in the correct manner for accepting/adopting external threads from C++. We asked for advice on how to use many of these functions, as we're not experts in this either. The C API of the Julia runtime (those jl_* functions) could definitely use some more documentation.

The init_julia function comes from a header file that is created together with your compiled Julia library. Nothing special here.
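The code with jl_is_initialized has to go into a try/catch block, because in our experience calling it when Julia is not initialized can cause a segfault. A surprising gotcha. Be aware that on most platforms a C++ catch (...) does not intercept an actual segmentation fault, so treat this as a best-effort guard.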
Make sure to lock and unlock the initialization of Julia, so that no other thread can accidentally try to start Julia as well, while this thread is busy initializing Julia.
jl_adopt_thread enables this C++ thread to be used by Julia. This is the most important C API function to remember for external multi-threading. It's available since Julia 1.9.
The job of jl_gc_safe_enter is to mark the thread as safe, so that the garbage collector (GC) can run concurrently with that thread. By using this function, you make a promise not to do any GC-visible work, such as allocating new memory. The parentheses around the function name simply make sure the real function is called instead of the function-like macro with the same name.
jl_enter_threaded_region sets Julia to multi-threading mode, I believe. This function is also used for example by the Julia @threads macro, but lacks any documentation.

According to the linked news about thread adoption, a @ccallable Julia function will automatically adopt threads. This is true, but what if you execute a Julia function or macro before the @ccallable function? In that case you get a segmentation fault, because this thread is not yet adopted. For example, when you want to capture Julia errors, you need to call the JL_TRY macro before the @ccallable. In the next section, we will show how to use such macros within a multi-threaded environment. In this initialization section, we show that the safest way is to perform the thread adoption by calling jl_adopt_thread explicitly.

All together we use these functions to initialize the Julia compiled library as follows. I have kept the code example concise to highlight what matters.

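#include <mutex>
#include <julia.h>      // Julia C API (the jl_* functions)
#include "julia_init.h" // init_julia, generated with the compiled library

std::mutex mtx; // guards the one-time initialization of Julia

bool is_julia_initialized()
{
    try
    {
        return jl_is_initialized() != 0;
    }
    catch (...)
    {
        return false;
    }
}

void initialize_julia(int argc, char *argv[])
{
    mtx.lock(); // no other thread may start Julia while we are busy
    if (!is_julia_initialized())
    {
        init_julia(argc, argv);
        jl_adopt_thread();
        (jl_gc_safe_enter)();
        jl_enter_threaded_region();
    }
    mtx.unlock();
}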

The main C++ code

Let's write a simple wrapper around our lovely c-callable Julia function and show you how to catch any errors thrown by Julia. All in a multi-threaded way. Remember, the Julia function divide_function is a trivial function that uses integers and throws an exception when the input integer is larger than 10.

We use jl_get_pgcstack to check if a thread is already adopted by Julia. If you attempt to adopt a thread twice, you will encounter a segmentation fault. This is one way to avoid making that mistake accidentally.

The JL_TRY macro will check if an error occurred in the adopted thread. This macro only works if the thread is actually adopted, else you get yet another segmentation fault. Inside the macro we call the function from the Julia library.

If you want to retrieve the actual Julia error inside the JL_CATCH, you will need to call into the Julia runtime. I have some example code in a previous article about catching Julia exceptions from C++ on my personal blog. In the example here, we kept it simple and just printed a message.

void call_and_catch(int x)
{
    // to make sure every thread is adopted by Julia, and only once!
    if (jl_get_pgcstack() == NULL)
        jl_adopt_thread();

    // JL_TRY requires the thread to be adopted, else it won't work
    JL_TRY
    {
        divide_function(x); // may throw an error depending on your input
        std::cout << "Succeeded for x = " << x << std::endl;
    }
    JL_CATCH
    {
        std::cout << "Caught error for x = " << x << std::endl;
    }
}

We can now write a piece of multi-threaded C++ code and call our Julia function. The easiest way is to first create a pool of threads. If you want to make this example more complicated, you'll have to learn a bit more about C++, which is beyond the scope of this article. But this is a good example to get you started.

int main(int argc, char *argv[])
{
    const size_t n_of_threads(15);
    initialize_julia(argc, argv);

    // initialize all threads and assign them our function
    std::thread all_threads[n_of_threads];
    for (size_t i = 0; i < n_of_threads; i++)
        all_threads[i] = std::thread(call_and_catch, static_cast<int>(i + 1));

    // wait for all the threads to finish
    for (auto& thread : all_threads)
        thread.join();

    return 0;
}

Compiling

Make sure to add the -lpthread flag. It links the system threading library that is required for C++ threads. I've already added this flag to the Makefile in my repository. Other than that, compilation is identical to regular Julia embedding in C++.

After compiling with the makefile, I can run the generated executable, and we see 15 printed messages, as expected. They appear in somewhat random order, due to the nature of multi-threading, but the erroring threads appear last, probably because the error handling takes additional time.

If you ever manage to arrive at this same point, please congratulate yourself! This is tricky business.

Succeeded for x = 2
Succeeded for x = 1
Succeeded for x = 4
Succeeded for x = 3
Succeeded for x = 5
Succeeded for x = 8
Succeeded for x = 10
Succeeded for x = 9
Succeeded for x = 7
Succeeded for x = 6
Caught error for x = 15
Caught error for x = 12
Caught error for x = 13
Caught error for x = 14
Caught error for x = 11

Pitfalls to avoid

In general, multi-threading requires a lot of attention due to many possible pitfalls, such as thread-safety issues, deadlocks, race conditions and much more. Adding external multi-threading to the mix makes everything even more complicated. Consider carefully whether you really want to go down this route with multiple languages. If you want to continue, here are a few complexities we encountered along the way, but be aware that you may find many more.

We encountered some issues with BLAS and other libraries. It's best to set the number of threads to one via LinearAlgebra.BLAS.set_num_threads(1), else every thread in Julia spawns multiple threads in the BLAS library. Same for MKL and any other third party library you use. Things may work fine, but your performance might not be optimal. You probably don't want your 4 external C++ threads accidentally spawning 16 BLAS threads or more.
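A minimal sketch of the BLAS setting, using only the LinearAlgebra standard library. Where you place this call is up to you. Running it once when your library is loaded is one option, but that placement is my assumption and not something prescribed by Julia.

```julia
using LinearAlgebra

# Keep BLAS single-threaded, so each external thread does not spawn its own
# pool of BLAS threads on top of the threads you already manage in C++.
BLAS.set_num_threads(1)

# Confirm that the setting took effect.
println("BLAS threads: ", BLAS.get_num_threads())
```

Other libraries that manage their own threads have their own settings, so check each of them separately.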

In general, be sure to test every binary artifact you want to use in production and consider the implications for your multi-threading setup. This is good advice for any software development project you undertake, independent of Julia.

We encountered a pitfall with Java, when embedding our library into Spark. In this article, we will not go into the details of passing Java threads (via C++) to Julia, but we noticed some issues with the Java signal handler. Make sure that your library is explicitly aware of the Java signal handling library, for example via export LD_PRELOAD=/path/to/libjsig.so. Otherwise Julia will produce a segmentation fault and your application will crash. This is some kind of language interoperability issue that we had to circumvent.

Big lesson learned from the above: never ever disable the Julia signal handler, because otherwise only Java handles the signals. These are operating system signals, such as SIGSEGV (segfaults), SIGABRT, or the famous SIGINT (sent when you hit Ctrl+C to stop something). If Julia cannot handle those signals, you've got a serious problem. We made this mistake while figuring out the previous pitfall.

Conclusion

Integrating C++ and Julia with multiple threads can be a complex task, but it offers powerful capabilities for incorporating Julia libraries into multi-threaded C++ codebases. By carefully initializing the Julia runtime and handling potential pitfalls, developers can successfully combine these two languages for improved performance and functionality. However, it's crucial to be mindful of complicated multi-threading challenges to ensure the reliability of the final product.

Mastering Scientific Programming: Practical Tips and Tricks

Matthijs Cox — Wed, 10 May 2023 09:46:17 GMT

Scientific programming involves writing code to solve scientific problems. This can range from simulating complex physical phenomena to analyzing large datasets. While such software is incredibly important, it can be challenging for scientists to learn all the required software development skills. However, by gradually adding specific tricks into your workflow, you can enhance your coding efficiency and effectiveness.

Software skills are important for everyone these days, including scientists. I see certain common risks if you do not spend effort on your code quality:

Incorrect code leads to incorrect results, which means you may have to redo work or even risk damaging your reputation.
Unreproducible code means others, including your future self, cannot verify your work, nor build on top of it.
Both incorrect code and unreproducible code may lead people to stop trusting your software and the conclusions they draw from that code.
As your code grows, it may become unreadable and unmaintainable, making it harder for you and others to understand it and contribute further to the code.

Scientific code and scientific principles are also applied in startups and in the industry, for research and development of software products. All these aspects are even more important to learn if you ever want to join a professional software development organization.

Choose the Right Language

When it comes to scientific programming, choosing the right programming language is crucial. You want a language that is efficient, easy to use, and has good libraries for scientific computing. Some popular languages for scientific programming include Python, MATLAB, R, and Julia. If you need performance, you typically end up learning Fortran, C, C++ or Rust, though they are considered more difficult and take more time to master.

If you need both performance and simplicity (you probably do if you write complicated algorithms) then you quickly encounter the so-called "two language problem". This is the fact that you typically need to work with at least two programming languages. Read my recent article about how to solve this two language problem if you want to know more.

Choosing the right language can be a difficult task. Typically people pick the language that people around them are using, but it may pay off to investigate alternatives in order to avoid running into technical difficulties later on.

Write Clean and Readable Code

Scientific programming often involves writing complex algorithms and data structures. It's important to write code that is easy to read and understand. This will make it easier to debug and maintain your code in the long run. Some tips for writing clean code include using meaningful variable names, adding comments to explain your code, and breaking up long functions into smaller, more manageable pieces. I intend to elaborate on many of these topics on this blog.

Test Your Code

Testing is an important part of software development in general. You want to make sure that your code is working correctly before you use it to analyze data or simulate physical phenomena. One popular testing method is unit testing, which involves writing small tests for individual functions or methods. This can help you catch bugs early on and ensure that your code is working as expected. There is a lot of attention in the software development community regarding testing. But I believe this topic also deserves another blog post from me, to explain how to get started, but also how to address the iterative nature of scientific development.
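
To make the idea of a unit test concrete, here is a minimal sketch using Python's built-in unittest module. The function mean and the test class TestMean are made-up examples for illustration, not code from any real project; the point is only to show how small such tests can be.

```python
import math
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers (hypothetical example)."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return math.fsum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_known_value(self):
        # A case that is easy to verify by hand.
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_single_value(self):
        self.assertEqual(mean([4.5]), 4.5)

    def test_empty_input_is_rejected(self):
        # Corner cases deserve their own test.
        with self.assertRaises(ValueError):
            mean([])
```

If you save this in a file, you can run the tests from the command line with python -m unittest followed by the file name.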

Use Version Control

Version control is a system for managing changes to your code over time. It allows you to keep track of changes, revert to previous versions, and collaborate with others. One popular version control system is Git, which is widely used in scientific programming. In some organizations you may also encounter other systems, such as SVN.

Thanks to GitHub and other tools built around Git, it has become ever simpler to control your code and share changes with others.

Code Reviews

When you write scientific publications, you go through a rigorous reviewing process, starting with advice from your colleagues and finally a peer review procedure. Somehow the code doesn't always get such rigorous reviewing.

In typical software development environments, code reviews are common practice, for example to make sure the code is readable and correctly tested. Most version control platforms, such as GitHub, provide easy web interfaces to inspect code changes and leave comments. Another practice to improve the code is so-called "pair programming", where you code together side-by-side, essentially doing the reviewing in real time.

Learn from Others

Scientific programming is a rapidly evolving field, and there is always something new to learn. One of the best ways to learn is to collaborate with others who have more experience or knowledge. This can involve joining online communities or attending scientific programming conferences. There are also books to learn from, though I wish there were more targeting the challenges of scientific software. Currently I am reading Software Engineering for Science, which someone suggested to me recently.

Of course if you want to stay up to date, you can subscribe to this blog, where I intend to keep sharing my knowledge.

Conclusion

In conclusion, scientific programming is an important tool for solving complex scientific problems. By choosing the right language, writing clean and readable code, testing your code, using version control, and learning from others, you can write efficient and effective software that advances the field of science.

How to solve the two language problem?

Matthijs Cox — Mon, 08 May 2023 07:41:53 GMT

My professional obsession is solving the Two Culture Problem. How can scientists optimally join forces with software engineers and their principles, so that we can work on the same problems together? How to accelerate the cycle from idea to product? The Two Culture Problem requires a solution to the related Two Language Problem, which has a technical nature. A solution to the technical problem does not guarantee a solution to the organizational problem, but when it comes to engineering cultures you first need to prove the technical solution before you can even begin to tackle the social implications. I have a strong opinion on the best technical solution, but let's review all our options.

As far as I can tell, we have the following alternatives:

Accept the status quo: use a slow and a fast (usually harder) language
Code generation using a look-a-like framework inside the slow language
Using (LLVM-based) optimization frameworks that look like the slow language
Speed up the slow language itself, working around its limitations
Design a new language that is both easy and fast

There are many tutorials about all of these options. Here I'd like to write a short overview of all of these.

For another similar technical overview, see Martin Maas's blog posts about Julia vs Python vs Numba vs Cython.

The two language problem - Python and C++ as a primary example

Your scientists or domain experts write prototypes in a simple language, let's say Python, where they can rapidly explore, do dynamic data analysis, model desired behavior and gather requirements from users. When they find something valuable, software engineers convert the prototype into high-performance code, let's say C++, and integrate it into production systems to sell as professional services to those users. This is my assumption of the status quo.

(I will continue to refer to the modeling culture as "scientists" whether or not they are actual scientists, domain experts, requirements engineers, data analysts, quants or any other kind of expert whose primary job is modeling the behavior of your product without actually writing and deploying the final source code.)

Depending on the size of your organization and the skill level of your engineers, you may end up with several configurations:

Teams of highly skilled scientific engineers who can do all the work
Teams with a mix of scientists and software engineers
Separate teams of scientists and separate teams of software engineers

You may have any combination of the above. The first option is a team of unicorns, which I have seen the least, but is amazing to work with.

Perhaps you have accepted this status quo. As the organization grows, separate code bases may evolve for the two types of tasks. In my experience, the production code rarely gets re-integrated into the analysis code, because it's not worth the effort in the short term. Long term you may get inconsistencies and other issues, but that's typically for someone else to worry about. Or perhaps people notice the problems, but profit margins are good, so why worry?

If you integrate your fast code (C++) as embedded libraries into the slow language (Python), you typically need some intermediate glue code or language in between. This requires yet more technical expertise from your people. See for example this blog about How to Call C++ from Python.

One stated benefit of keeping the two-language culture intact is that your prototypes and your scientists never mess up your production systems. The production systems are brittle and valuable, so this is a valid concern, but I think there are better ways to teach people to write better code than by blocking them.

For this article, I assume you are looking for alternatives. Maybe the problems have grown too big, or you want to avoid them early on, or you simply cannot hire enough senior software engineers. Thus for one reason or another, you need your scientists to be deeply involved in the software development.

Learning C/C++ is still a good idea to grow your expertise or the competence of your scientists, but it can take a long time to develop. At a minimum, I advise learning what it means to compile and link libraries. And learn a bit about computers by reading great summaries such as What Scientists Should Know About Hardware to Write Fast Code.

Still want to find a technology that's easier to use, yet brings some of the hardcore software benefits? Let's see what's possible!

Code generation - Cython example

Generating low-level code, most likely C, from a high-level language, most likely Python, is typically done to try to avoid some of the disadvantages of the two-language problem. Maybe you want to compile static libraries to embed into devices. Or you have some other reason. Unless your examples are very simple, do not expect big performance boosts though, because the generated code still needs to make similar kinds of assumptions as the high-level language. Also, most code generators do not support the complete language semantics, so you will have to make sure your high-level code adheres to the capabilities of the generator.

In Python you can use the Cython 'compiler' to help you generate C code. On the surface it looks a lot like Python, yet with C types and certain decorators. This means you need to rewrite the parts of your codebase that you want to speed up. The process of turning Python into Cython is sometimes called "cythonizing". You get such examples in the Cython quickstart tutorial:

import cython

@cython.cfunc
@cython.exceptval(-2, check=True)
def f(x: cython.double) -> cython.double:
    return x ** 2 - x

def integrate_f(a: cython.double, b: cython.double, N: cython.int):
    i: cython.int
    s: cython.double
    dx: cython.double
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

You have to build this cythonized code. As I mentioned, this happens in two stages:

The .py or .pyx file is converted by Cython to a .c file.
The .c file is compiled by a C compiler to a .so file (or .pyd on Windows), which can be import-ed back into Python. The build itself is typically driven by setuptools.

The downside of Cython, and any similar C code generator, is that you obtain rather obscure C code. If you want code obfuscation, you can consider that a benefit, but trouble begins when you have to work with that C code. It can be hard to debug once deployed in the field. Make sure to add lots of clear error messages and logging. If you integrate the generated code inside existing C/C++ codebases, your software engineers may dislike writing the necessary glue code (I learned that from experience). Finally, naive Cython is not very performant and writing optimized Cython can be as difficult as writing regular C code. But the benefit is that you can move gradually up in complexity.
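
To make the starting point of that gradual path concrete, here is a sketch of the same integration routine in plain Python, without any Cython types or decorators. It is what you would begin with before cythonizing. The variable names approx and exact are my own additions, and the analytic value is only there as a sanity check.

```python
def f(x):
    """Integrand from the Cython example: f(x) = x^2 - x."""
    return x ** 2 - x


def integrate_f(a, b, n):
    """Left Riemann sum of f over [a, b] using n equal steps."""
    s = 0.0
    dx = (b - a) / n
    for i in range(n):
        s += f(a + i * dx)
    return s * dx


approx = integrate_f(0.0, 1.0, 1000)
exact = -1.0 / 6.0  # analytic integral of x^2 - x over [0, 1]
print(f"approx={approx:.8f}, exact={exact:.8f}")
```

Having a plain version like this around is also handy as a reference implementation: you can test the cythonized function against it.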

How to make a standalone-ish distribution? Cython generates code that interfaces with the Python runtime. You can create a binary executable, but you also need to distribute it with libpython.so, which is the Python runtime. Moreover, you also need to add all the Python dependencies and the .so/.dll files that those packages are using. This might be a bit tedious using Cython, but it is certainly possible. Other packages like Nuitka make this process a bit less painful by figuring out all your dependencies.

Fun fact: Code generation is sometimes referred to as "transpiling", since you translate your code to another language that's ready for compiling.

Interlude: LLVM

What if we do not want to write or generate C code? Do we have any other options? Yes, we can generate something else: LLVM code! Before we go into such frameworks, let's do a quick introduction into LLVM itself.

LLVM is a middleman between your source code and the compiled native code. Compilers typically consist of two stages: byte code and native code. The byte code is an intermediate representation that is agnostic of the CPU or GPU architecture. LLVM is an attempt to standardize the byte code definition, which will then be compiled for you to any architecture you want. In some frameworks or languages (like Julia) you can ask to see the LLVM code, and the eventual native code, the processor instructions, which is typically assembly code (that's just before it becomes those zeros and ones you always hear about).
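
If you have never looked at an intermediate representation, Python offers a loose analogy that you can try right away. The sketch below uses the standard dis module to show the bytecode that CPython generates for a small function. To be clear, CPython bytecode is not LLVM code and it is interpreted rather than compiled to native instructions; I only use it here to illustrate the idea of a layer between your source code and the machine. The function add is a made-up example.

```python
import dis


def add(x, y):
    return x + y


# Collect the names of the bytecode instructions CPython generated for add().
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)

# Print a human-readable listing of the same instructions.
dis.dis(add)
```

The exact instruction names differ between Python versions, which is a nice reminder that such intermediate formats are an implementation detail.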

Frameworks that use LLVM may store the compiled native code in memory. In that case, if you want to distribute the compiled code, you need the option to package the native code into a library (that's a .so on Linux or a .dll on Windows) together with all of its dependencies. This packaging option may be important to investigate for your deployment strategy.

Fun fact: the Clang compiler for C also compiles via LLVM. So if you write Cython and then compile via Clang, you are taking an interesting route.

Optimization Frameworks - Numba example

Numba is an LLVM code generator that integrates directly with Python. What you need to do is add decorators to every Python function you want to optimize. In principle it looks simple:

from numba import jit, int32

@jit(int32(int32, int32), nopython=True)
def f(x, y):
    return x + y

JIT stands for Just-In-Time: Numba compiles the function via LLVM at the moment you call it, inferring which types you used, just in time before executing. You can optionally provide the types yourself, as I did above. And there are lots of other settings for the @jit decorator, like the nopython mode to get faster performance.
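
The "compile on first call, per set of argument types" idea can be mimicked in a few lines of pure Python. The sketch below is a toy model only: the decorator toy_jit is a name I made up, it compiles nothing, and it is not how Numba is implemented. It merely records one "specialization" per tuple of argument types, which is the bookkeeping a JIT does before generating native code.

```python
import functools


def toy_jit(func):
    """Toy decorator (hypothetical): 'specialize' func once per argument-type tuple.

    Nothing is compiled here; the cache only mimics the bookkeeping of a JIT.
    """
    specializations = {}

    @functools.wraps(func)
    def wrapper(*args):
        signature = tuple(type(arg) for arg in args)
        if signature not in specializations:
            # A real JIT would generate native code for this signature here.
            specializations[signature] = func
        return specializations[signature](*args)

    # Expose the cache so we can inspect which signatures were seen.
    wrapper.specializations = specializations
    return wrapper


@toy_jit
def f(x, y):
    return x + y


f(1, 2)      # first call with (int, int): new specialization
f(3, 4)      # same signature: the cached specialization is reused
f(1.0, 2.0)  # (float, float): a second specialization
print(list(f.specializations))
```

This also hints at why the first call of a jitted function is slow: that is when the specialization for those types gets created.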

Only a subset of Python is supported with Numba. It works well with your NumPy code, they made sure of that. It doesn't work with other packages such as Pandas, because those work differently. Even ordinary Python dictionaries are not supported directly; Numba only offers a restricted, typed variant (numba.typed.Dict). Nobody writes a custom file format parser in Numba; it's purely for numerical code.

If you want to compile to GPU: there are other decorators, such as @cuda.jit. This suggests you need to edit your code for GPU.

If you want to compile ahead of time, you will again have to replace all your @jit decorators with the @cc.export decorator and be explicit about your types.

You cannot debug the jitted Numba code itself. You'll have to change the decorator setting to debug mode and use the gdb tool, so be careful there. That's another disadvantage of Numba.

We have never tried to make a standalone distribution of Numba compiled code, but I assume you ship the entire Python environment, with the ahead-of-time compiled code. If someone has experience with the nitty-gritty details of distributing Numba code, then let me know!

Codon

Codon is a recent attempt similar to Numba, except it claims zero-overhead; you do not necessarily have to decorate your code. Well, except if you want to use it inside larger Python codebases (you probably do), then you have the @codon.jit decorator and other decorators depending on your use-case.

Codon has only 9 contributors at the moment, and it has a non-permissive license, so you'll have to pay to use Codon commercially in production. It's interesting but looks more like a startup than a regular open-source project. Similar to Numba it only supports a subset of the Python language, which may get better over time (or worse, if Python evolves, yet the developers do not update Codon).

JAX, TensorFlow, PyTorch

Every scientific computing and machine learning framework in Python implements its own optimized numerical libraries it seems. Some of them, like JAX, have a @jit decorator like Numba. All these frameworks look like Python, but to get performant code you'll have to use their API, not Python itself. Often you write Python in a more complicated directed-acyclic-graph (DAG) structure that can be fed to the underlying libraries for execution. Don't ask me how to debug these things. Please consider whether you are really writing Python or another language.
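
To give a feeling for what "writing Python in a graph structure" means, here is a toy sketch of a deferred computation graph in pure Python. All names (Node, as_node, placeholder, evaluate) are hypothetical and this is not the API of JAX, TensorFlow or PyTorch. It only shows the pattern: operators build graph nodes instead of computing values, and a separate step executes the graph.

```python
class Node:
    """One vertex of a deferred computation graph (toy example)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "input", "add" or "mul"
        self.inputs = inputs  # child nodes this node depends on
        self.value = value    # constant value, or the name of an input

    def __add__(self, other):
        return Node("add", (self, as_node(other)))

    def __mul__(self, other):
        return Node("mul", (self, as_node(other)))


def as_node(x):
    """Wrap plain numbers as constant nodes; pass nodes through unchanged."""
    return x if isinstance(x, Node) else Node("const", value=x)


def placeholder(name):
    """Create a named input whose value is supplied at evaluation time."""
    return Node("input", value=name)


def evaluate(node, feed):
    """Recursively execute the graph using the input values in feed."""
    if node.op == "const":
        return node.value
    if node.op == "input":
        return feed[node.value]
    left, right = (evaluate(child, feed) for child in node.inputs)
    if node.op == "add":
        return left + right
    if node.op == "mul":
        return left * right
    raise ValueError(f"unknown operation: {node.op}")


x = placeholder("x")
y = placeholder("y")
graph = x * x + y * 3  # builds nodes only; nothing is computed yet
result = evaluate(graph, {"x": 2.0, "y": 1.0})
print(result)
```

Because the framework sees the whole graph before running it, it can optimize or offload it. The price is that a Python debugger steps through the graph-building code, not through the actual computation.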

Also see this section from the Mojo language comparing such Python improvements.

Boost the slow language - PyPy

There is a continuous effort to improve the performance of slow interpreted languages like Python and R. In a blog post called Python 3.14 Will be Faster than C++ the author joked that linear extrapolation of Python improvements will soon surpass C++ performance. Let's see how that graph evolves in the next Python versions.

PyPy

An alternative to waiting for Python to improve is PyPy. This is a replacement for CPython. Note, CPython is not Cython. CPython is essentially Python itself, as the Python interpreter is written in the C language. PyPy is an attempt to make the entire Python language faster with a better interpreter. PyPy ships its own tracing JIT compiler that is applied automatically to the whole program, so unlike Numba it needs no decorators and does not rely on LLVM.

In general, the Python language design creates limitations on the performance, see for example this video on How Python was Shaped by Leaky Internals. If you don't want to change language, you may hope that Python 4 someday comes around with syntax that can actually be optimized.

An optimized language - Julia as example

If you don't want to wait for Python 4, then there's Julia instead. Julia is a language that is optimized for talking to LLVM, while looking as similar as possible to high-level languages like Python and MATLAB. In short, it's an LLVM whisperer. This makes it fast and easy. See Why Was Julia created? to get an impression of the rationale.

Similar to Cython and Numba, you can optionally add type information to Julia, which can help the compiler, though Julia is good at type inference.

function f(x::Int, y::Int)
    return x + y
end

Is Julia better than Numba and Cython? For an opinionated and long blog post read Why Numba and Cython are not substitutes for Julia. There are also lengthy discussions in this discourse on Why weren't Numpy, Numba, SciPy good enough?. And I also like Martin Maas's blog post about Julia vs Numba and Cython.

I would summarize the benefits as: You don't have to decorate your code, Julia is the JIT decorator. You can write the same code for CPU and GPU. When compiling ahead-of-time, it's again the same code. The compiler can optimize across all Julia code, not just a single package like NumPy that you are currently using. Composability is often praised: Julia packages work easily together.

The downside of Julia, if you are coming from another language like Python, is obviously that you have to learn another language. Though I wonder how much more difficult Julia is compared to learning a complex framework like Numba. And writing optimized Cython can be considered similar to writing another language. Julia has many similarities with Python, check for example this cheat sheet to compare MATLAB to Python to Julia, except Julia bypasses the problems that make Python difficult to compile to LLVM.

Similar to Cython and Numba, naive Julia code is good, but not necessarily as performant as optimized C. Read the performance tips to get the most out of your code.

If you want to move gradually to Julia, you can embed Julia into Python via PyJulia or the more recent two-way package PythonCall. Or you can re-use existing Python code inside Julia, via PyCall.jl, or the aforementioned PythonCall. This way you can use Julia as if it's yet another Python framework, instead of a completely new language. You can even send Pandas dataframes to Julia and back.

Similar to Numba, Julia stores the compiled native code in memory. In the upcoming Julia 1.9 release, this code will also be automatically cached on disk per Julia package, so compilation happens only once, instead of the first call in every new Julia session. Ahead-of-time compilation was always possible with PackageCompiler.jl. The PackageCompiler is not actually compiling (Julia and LLVM do that), it simply gathers the in-memory compiled code and stores it in a .so library (or .dll on windows). This can be used for a standalone library of your compiled code, and will automatically include all dependent libraries. I have written a long tutorial on how to embed such Julia libraries inside C++ on my private website.

Static compilation of Julia, into tiny libraries fully independent of the runtime, is in an early stage with StaticCompiler.jl and StaticTools.jl, but needs more investment. Once you try out static compilation, you will notice that it enforces limitations on your code, because you cannot use all the fancy dynamic language features. I believe this is an unavoidable trade-off in any of the discussed technologies so far, but I'd love to be surprised on this point.

Other attempts to make a fast language easier to use are Zig, Swift and GoLang to a certain extent. Rust is very interesting, but I would not call it easy for scientists. None of them are targeting numerical computing as much as Julia.

Mojo

A new language that was revealed very recently is Mojo. In their article Why Mojo? they rephrase the two language problem as a Two World Problem, or even three world problem (Python, C++ and CUDA) for machine learning. From the code snippets on their website it looks like they want a Python compatible language that has features of Rust. Note that Mojo is not yet released to the public: the Mojo GitHub repository is empty at the time of writing, so we don't even know if Mojo will have a permissive license. Ambitious, but very young. We'll keep an eye on this one.

Final Comparison

What are good comparison criteria? I have chosen a few below. The development community size can be used as an estimate of how much effort goes into each project. Other than that I have tried to compare the usage and technology choices. I don't want to give quantitative performance comparisons here, they are heavily dependent on your use case, but from the benchmarks on complex examples that I have seen, Julia typically performs best. However, performance might not be your main criterion. Other unlisted aspects, such as debugging, profiling or cloud deployment may be more relevant for your use case.

	Cython	Numba	Julia
Contributors	430	298	1386
Github stars	7.9k	8.6k	42.2k
Backend tech	C transpiler	LLVM	LLVM
Usage	Decorators and Cython types	Decorators everywhere	Learn another language
Python interoperability	Import cythonized modules	Just-in-Time (JIT) decorators	PythonCall package
Performance	Decent	Good	Best
Distribution	Ship the .so with all dependencies	Ahead-of-Time compilation decorators	Ahead-of-Time compilation via PackageCompiler.jl

While I have a preference for Julia, I tried to stay as unbiased as possible in this blog post. All options are amazing open source projects and are maintained by mostly voluntary developer communities. Investigating them all is a humbling experience in the complexity of software technology.

There are other software engineering requirements that I have not yet included in this post, but might be important for your use case:

Package management. How easy is it to create and install a package, with all of its dependencies, in the chosen technology? I believe Julia's Pkg is superior here.
How does dependency management and distribution of binary artifacts work exactly? These nitty-gritty details can slow down your project. I have not yet tried this extensively for Numba. Cython involves some manual work. Julia has an artifact manager that works together with the package manager inside PackageCompiler.
Complex cases, like multi-threading inside the framework or external threads calling from another language. (Note: Julia doesn't have the global interpreter lock (GIL) like Python. Cython can release the GIL in certain cases.) I haven't gone into such topics yet, but there are many complex use cases that you may want to gradually add to your codebase. How far can you go with each technology before hitting a wall?

Finally, remember that a technical solution does not necessarily result in a cultural improvement. If you hired a lot of scientists or analysts or domain experts, and none of them have the necessary software skillset, it is difficult to improve collaboration with software engineers by forcing a 'better' technology onto them. You will have to empower your scientists to learn the necessary software development tools and processes, such as version control, test-driven development and continuous integration. Vice versa, your software engineers can learn the business domain and the tricks of numerical computing with the help of your scientists, to know exactly what code to write. By bridging these gaps, you can create a more effective team that can leverage the full potential of the technology investments you make.

Thanks to Jorge Vieyra and Jeroen van der Meer for reviewing and suggesting excellent improvements to the article.

These long posts take me quite some time and effort to write. If you like them, please encourage me by leaving a comment with suggestions or subscribe to my newsletter. With enough support, I intend to write a book about building and deploying professional numerical computing applications. With Julia examples.

Production-ready code for scientists: 3 lessons learned

Matthijs Cox — Thu, 04 May 2023 09:43:36 GMT

How do you become a great scientific coder? To understand this, I want to ask others about their journey and share their lessons with you. This post is a collaboration with Keith Myerscough, a mathematical consultant and senior engineer, who helped me with setting up our internal Julia language ecosystem.

Matthijs asked me to write a guest post about what skills are needed to become a great scientific coder. He asked me because I assisted a team of scientists in adopting Julia for their research and development work. I am keen to help: I empathize with people who come up with great ideas but have a hard time wrapping these up into something that can be turned into a product. Even under the assumption that code will be extensively improved upon by others with more software knowledge, that first delivery is either a big hurdle or the seed for a smooth project.

Before diving into what I think is most important for scientists developing code, let us take a small step back. A central theme of this scientific coding blog is that we must rid ourselves of hard boundaries between different groups working towards delivering the same product. But removing the divide(s) does not remove the inherent differences between ideation and productization, between divergent and convergent thinking. We will still have to bring something from an idea into a product; we just want to make this journey a continuous one. The reality is, however, that larger products require multiple people working on them. So the idea-product continuum will still have to be divided. I see this as creating a chain of several people working in stages. The important thing is that any boundary is permeable, in both directions: new ideas are presented in a way that can be turned into a product and improvements in the product can find their way back to the code used in ideation. This blog post addresses the implications of the first of those requirements on people not used to having their code end up in a production environment.

From my time working with a team of engineers with a physics background generating code that was intended for production, I have found the following suggestions to be most relevant:

Make your work reproducible
Keep everything as small as possible
Be relentless in asking for help

I will discuss these in more detail below.

Make your work reproducible

This is probably too obvious for many of you, but this is the start. The divide between ideation and productization is also one between it works now and it will work forever. As a scientific software developer, your code does not have to work forever, it does not have to cover all corner cases and it does not have to be optimized for performance. But if you want to hand it over to the next person in the chain, it will have to at least do (more or less) the same in most cases.

The best way to guarantee this is to have the intended use(s) of your idea encoded in tests and to have these tests run in an automated environment. It might even help to write the tests first. Unfortunately, I too often see people testing their code using REPL commands or script runs without including these as part of the delivery, so the commands and scripts to run the code are lost forever. Running tests in an automated environment, such as GitHub Actions, Bitbucket Pipelines, Jenkins etc., avoids any sneaky dependencies on your local machine configuration. This infrastructure is a prerequisite for scientific software development.

There are two skills required for this. The first is a solid understanding of the test framework in use for your project. You should feel comfortable in both modifying and creating tests, using the tools available. The second skill is version control. In particular, familiarize yourself with the command-line version. There is some irony in the fact that version control tools like Tortoise and SourceTree themselves lead to irreproducibility, as there's no way of tracing just what you clicked when; with the command line, you always have your history. Furthermore, the command line restricts you to using only the commands you know. You can get a long way with the basics.

Keep everything as small as possible

This is a super-linear advantage, to put it in nerdy terms. Reducing the size of a component makes all the work that has to be done on it easier. It makes every pull request less work, it makes every test run faster and it makes every bug easier to find. But more than that, it reduces the number of people who need to work on it, reducing the amount of communication required.

This is an important advantage of moving from a two-language situation to the one-language paradigm. In the two-language situation, the productized software will always have a modular structure, thanks to the software engineering attention that was spent on the architecture. This modularity would, however, not necessarily exist in the ideation code base. In the one-language paradigm, the same architecture is available to scientists. You, as a scientific software developer, can see this as one of the rewards of switching tools.

One of the easiest yet often overlooked ways to keep things small is to rely on existing (open-source, inner-source or even closed-source) tooling. This immediately reduces the complexity of your code. In Julia, it also aids in the composability of your code to rely on existing implementations.

It's harder to divine specific skills for this goal. You will need to know a little more about how a language works to know how to split a package into multiple packages, or better still, how to introduce new functionality in a separate package that interacts nicely with existing packages.

[Matthijs' comment:] I can think of a simple tip. If you notice that you often write long scripts or functions, say dozens or hundreds of lines of code inside one function, you can probably cut it into multiple functions with easy-to-understand names.
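
As a small illustration of that tip, here is a sketch of an analysis that could have been one long script, split into short functions with descriptive names. The task and every function name are invented for this example; the same idea applies in any language.

```python
import statistics


def parse_measurements(lines):
    """Convert raw text lines into floats, skipping blanks and comments."""
    values = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        values.append(float(text))
    return values


def remove_outliers(values, max_deviations=3.0):
    """Drop values further than max_deviations standard deviations from the mean."""
    if len(values) < 2:
        return list(values)
    center = statistics.fmean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return list(values)
    return [v for v in values if abs(v - center) <= max_deviations * spread]


def summarize(values):
    """Reduce the cleaned data to the numbers we want to report."""
    return {"count": len(values), "mean": statistics.fmean(values)}


def analyze(lines):
    """The top-level function now reads like a table of contents."""
    values = parse_measurements(lines)
    cleaned = remove_outliers(values)
    return summarize(cleaned)


report = analyze(["# sensor A", "1.0", "2.0", "", "3.0"])
print(report)
```

Each small function can now be tested and reviewed on its own, which ties in nicely with the first suggestion about reproducibility.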

Be relentless in asking for help

It is important to acknowledge that you cannot know everything. As a scientific software developer, you will probably need help from the more software-savvy people on your team. Do not hold back in asking them. Again, within the framework of an idea-product continuum, there should be no hard boundaries. Whoever is responsible for bringing your delivery to the next level has an interest in helping you deliver something better, to make your life easier.

Conclusion

Writing code for product development as a scientist can be a challenging task. Hopefully, my advice can help you find ways to improve your work. Remember, writing good code is not just about solving the problem at hand, but also about helping others, including your future self. By investing in your software skills now, you can set yourself and your team up for success in the long run.

Let me know if you enjoy this guest post and would like to read more of such posts in the future, by leaving a comment!

Automate Your Code Quality In Julia

Matthijs Cox — Wed, 26 Apr 2023 07:47:00 GMT

Code quality is a topic in Julia that I believe deserves more attention from both users and developers. The Julia language originated in academia and focused heavily on helping scientists write better code, which is going great and deserves much praise! However, to onboard more software engineers and professional organizations we're going to have to invest even further in code quality, automated code quality tools and other methods used in the field of "quality assurance". In this article I'll explore the current state in the Julia ecosystem.

At our workplace we have investigated the following tools and practices. I'll start from generic practices and then move on to more advanced tools.

Package structure
Unit testing with Pkg.jl
Automated testing and Continuous Integration (CI)
Code Coverage with Pkg.jl
Documentation testing with Documenter.jl
Style guides and JuliaFormatter.jl
Static Code Analysis with StaticLint.jl
Quality Assurance with Aqua.jl
Type stability with JET.jl

Let's have a look at all of them.

Packages

The Julia community uses a standardized package structure and has a single package manager called Pkg.jl. There's also a single documentation system and a single testing system. This consensus alone helps tremendously with automating any workflows in your projects and organizations.

Please make sure you share professional code with others via packages. It's straightforward to adhere to the package structure. To get you started with creating well-defined (open source) packages, you can look at PkgTemplates.jl.

When setting up my first open source Julia package, I enjoyed the documentation of the JuliaTemplateRepo.jl, which goes through all the basic steps and configurations for a Julia package.

Unit Testing

Nowadays unit testing is a common practice in professional software engineering. Developers in Julia should be no exception. Fortunately, according to Viral Instruction, 89% of all open source Julia packages have tests, including a lot of beginner packages. It's safe to say that the Julia community puts a lot of emphasis on testing, which I think is remarkable for a language that originated in academia. This really sets a good example.

All unit testing uses the Test.jl package, which is shipped with the base language. There are some extensions like TestSetExtensions.jl and ReTest.jl, but I believe you can do most of your work with Test.jl.

Getting started with testing is trivial in Julia, just add a test/runtests.jl file to your package and add code like this:

using Test
using MyPackage

@testset "MyPackage tests" begin
    @test 1==2
end

You can then run the tests with Pkg.test("MyPackage") which starts an isolated sandbox environment for the tests.

If you are creating a reproducible package for others, including your future self, then there is absolutely no reason to not write tests. However, writing good descriptive tests that cover all your bases is a more advanced art. Many books are written on this topic. I would advise to just get started and improve your testing strategy as you grow your codebase.
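To make "just get started" a bit more concrete, here is a minimal sketch of a test set that checks normal values, a boundary value and an error case. The function `celsius_to_kelvin` is a hypothetical example that I define inline; in a real package it would live in your `src` folder.

```julia
using Test

# Hypothetical function under test; in a real package this would live in src/.
function celsius_to_kelvin(c::Real)
    c < -273.15 && throw(DomainError(c, "temperature below absolute zero"))
    return c + 273.15
end

@testset "celsius_to_kelvin" begin
    # Ordinary values
    @test celsius_to_kelvin(0) == 273.15
    @test celsius_to_kelvin(100) ≈ 373.15
    # Boundary value
    @test celsius_to_kelvin(-273.15) == 0.0
    # Error path, which is easy to forget
    @test_throws DomainError celsius_to_kelvin(-300)
end
```

The `≈` comparison is used for the floating point result that may not be exactly representable, while `@test_throws` covers the error branch.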

Automated Testing & Continuous Integration

Once you have unit tests defined, this aspect is a low hanging fruit for automation. It's very easy to forget to run the unit tests before committing. Automatically testing the code will save you from simple mistakes.

You can set up GitHub Actions or use a tool like Jenkins to automatically run the tests upon every commit and block developers from merging if the unit tests do not pass. PkgTemplates.jl will typically already generate this GitHub Action for your open source package.

The practice of frequently and automatically checking your code during development is called Continuous Integration (CI). Inside software organizations this is often combined with Continuous Deployment (together: CI/CD). All CI tools that the Julia community uses can be found in the JuliaCI GitHub organization; I will address a few of those. If you want to configure your own GitHub Actions, you can be inspired by the examples in the Julia Actions GitHub organization.

Code Coverage

A straightforward metric for monitoring your code quality is to check the fraction of code covered by your tests. Similar to automating your tests, this is a very low hanging fruit in the Julia community.

The code coverage generation itself is embedded inside the package manager via Pkg.test("MyPackage", coverage=true) . This will generate .jl.cov files with information about how often a line of code is touched by the tests.

Analyzing the code coverage visually line-by-line, for example inside VS Code, can help you identify where you are lacking tests, or help you find unused code that is never called and can be deleted. You can automatically send the files to a service, like Coveralls.io or Codecov.io, and analyze them in the browser. Here's an example in Flux's functions.jl file that has one uncovered line (note that it's very common to forget to test errors or other corner cases):

You can calculate statistics on these coverage files, for example with Coverage.jl or with one of the services above. That way you can monitor coverage statistics over time. And if you use such a service for your open source package you can add a shiny badge to your readme to show-off your coverage.

Other commercial tools are busy adopting Julia's code coverage, check with your preferred supplier if they already support Julia. If not, please request them to do so.

Documentation Testing

Good documentation is incredibly important for the users of your package, both in the open source community and inside organizations. Unfortunately, documentation that includes code examples can run out of sync with your code if you forget to test those examples. But it's very easy to test them automatically with Documenter.jl doctesting.

Whenever you write a docstring or write code snippets in your docs folder, just add jldoctest and the expected output.

```jldoctest
a = 1
b = 2
a + b

# output

3
```

Now just add Documenter.doctest(MyPackage) to your automated tests, and you know immediately when your examples no longer work. Easy!

Style Guides & Code Formatting

One of the challenges when working with many people on a single codebase is to adhere to a consistent coding style that is pleasant and unambiguous for everyone. This is where style guides help, together with formatting tools that make it easy to adhere to such a style guide.

The primary open source Julia style guides at the moment are:

YAS (yet another style)
Blue style
SciML style

You can automatically format your Julia files in VS Code with the click of a button. If you are a command line user, or want to automate the formatting in your CI, you can use the underlying JuliaFormatter.jl package.

This should get you started with code formatting in no time. Discuss with your colleagues which style guide you prefer. Personally I have used BlueStyle so far, but the SciML style is relatively new, so I'm looking into that one as well.

Static Code Analysis

Static code analysis will look at the code without executing it. One package we found is StaticLint.jl, which is used primarily by the Julia VS Code plugin LanguageServer.jl to report on potential problems in your code. These are normally reported under the "Problems" tab. Here I found a few potential problems inside the DataFrames.jl package:

The VS Code plugin also reports on potential problems when hovering over code, such as a warning about this unused function argument. These are not reported in the "Problems" tab.

StaticLint still misses some documentation for users, but you can use the following script and read my discussion in an issue here. After some fiddling with the code and environments I am able to obtain the same "diagnostics" on my REPL for a given file:

julia> docs[3]
Document: file:///c%3A/Users/matcox/Documents/Julia/static_lint/src/abstractdataframe/selection.jl

julia> docs[3].diagnostics[1]
LanguageServer.Diagnostic(LanguageServer.Range(LanguageServer.Position(223, 15), LanguageServer.Position(223, 36)), 4, "UnusedFunctionArgument", missing, "Julia", "An argument is included in a function signature but not used within its body.", [1], missing)

So StaticLint.jl can be used, but it's not yet user friendly for integration into any command line interfaces or automated tooling.

Quality Assurance with Aqua.jl

The package Aqua.jl is developed to automate quality assurance for Julia. The readme is clear on what it checks:

There are no method ambiguities.
There are no undefined exports.
There are no unbound type parameters.
There are no stale dependencies listed in Project.toml.
Check that test target of the root project Project.toml and test project (test/Project.toml) are consistent.
Check that all external packages listed in deps have corresponding compat entry.
Project.toml formatting is compatible with Pkg.jl output.
There are no "obvious" type piracies

Aqua provides a function to use in your testing environment, which reports issues and throws an error, so your tests cannot pass unless all Aqua checks pass. Here's a snippet of what we get for DataFrames:

julia> using Aqua, DataFrames

julia> Aqua.test_all(DataFrames)
17 ambiguities found
Ambiguity #1
<=(a::Integer, b::SentinelArrays.ChainedVectorIndex) in SentinelArrays at C:\Users\matcox\.julia\packages\SentinelArrays\BcfVF\src\chainedvector.jl:208
<=(x::BigInt, i::Integer) in Base.GMP at gmp.jl:696
Possible fix, define
  <=(::BigInt, ::SentinelArrays.ChainedVectorIndex)
...
Test Summary:    | Fail  Total  Time
Method ambiguity |    1      1  8.3s
ERROR: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.

Unfortunately I only find ambiguities for DataFrames, maybe I should find a package with more problems. You can also run all the underlying checks independently if you read the Aqua documentation. Note that if you only want to check for ambiguities, you can also choose to call Test.detect_ambiguities directly from the standard Julia Test package.
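As a minimal sketch of that last point, the snippet below calls `Test.detect_ambiguities` on a small throwaway module. The module `AmbiguityDemo` and the function `combine` are hypothetical names invented for this illustration; they are not part of Aqua or DataFrames.

```julia
using Test

# Hypothetical module with two methods that overlap, neither being more specific.
module AmbiguityDemo
combine(x::Integer, y::Real) = "integer & real"
combine(x::Real, y::Integer) = "real & integer"
end

# Returns a vector of (Method, Method) pairs that are ambiguous with each other.
ambiguities = Test.detect_ambiguities(AmbiguityDemo)

for (m1, m2) in ambiguities
    println("Ambiguous pair:\n  ", m1, "\n  ", m2)
end
```

In a test suite you could, for example, assert that the returned vector is empty with `@test isempty(Test.detect_ambiguities(MyPackage))`, where `MyPackage` stands for your own package.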

A nice addition to Aqua would be a way to store the found issues in a standardized file format instead of printing them on the REPL. Similar to code coverage reporting, this can help to make overviews in automated systems. Now we would have to capture the printed output and parse that somehow.

Finding Type Instability with JET.jl

There is non-stop activity in the Julia community to analyze our own code for improvements. An advanced package is JET.jl, which uses the Julia compiler itself to detect potential bugs and type instabilities.

What is type instability? This happens when the type of a variable changes unexpectedly. Here's a simple example that outputs either an integer or a floating point variable:

function foo(value)
    if value < 1
        return 1
    else
        return 1.0
    end
end

Type instability is bad for performance because the compiler cannot infer the types and generate optimal native code. It may also point at bugs in your code, if you did not intend to have such instability. Julia does not enforce type stability like certain languages, because it wants to remain an easy language to use. Sometimes you don't care about performance and don't want to worry about types, in which cases it's easy if you are not forced to worry.

If you just want to check whether the output value can be inferred, you can use Test.@inferred in your tests:

julia> using Test

julia> Test.@inferred foo(0.5)
ERROR: return type Int64 does not match inferred return type Union{Float64, Int64}

julia> Test.@inferred foo(1.5)
ERROR: return type Float64 does not match inferred return type Union{Float64, Int64}
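For comparison, here is a small sketch of how a type-stable variant behaves. The function `foo_stable` is a hypothetical rewrite of `foo` that I made up for illustration, and `Base.return_types` is an unexported reflection helper, so treat its use here as a convenience for exploration rather than a recommended testing API.

```julia
using Test

# The type-unstable function from the text.
function foo(value)
    if value < 1
        return 1
    else
        return 1.0
    end
end

# Hypothetical type-stable variant: both branches return a Float64.
function foo_stable(value)
    if value < 1
        return float(1)
    else
        return 1.0
    end
end

# Ask inference which return types it deduces for a Float64 argument.
unstable_types = Base.return_types(foo, (Float64,))
stable_types = Base.return_types(foo_stable, (Float64,))
println("foo:        ", unstable_types)
println("foo_stable: ", stable_types)

# @inferred passes silently for the stable variant and returns the value.
result = Test.@inferred foo_stable(0.5)
```

The two functions return numerically equal values, but only the second one lets the compiler know the return type in advance.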

However, when you want more certainty about the internals of your code then you can turn to JET. Most of JET is doing specific method analysis with @report_opt and @report_call . JET can do some limited static analysis of your entire package with the report_package function. Unlike @report_call , this means JET doesn't know what types you want to input into your methods, so it has to make some assumptions.

I do warn that the output of JET can be rather intimidating. Here's what you get when executing the example @report_call sum("julia") :

And that's just the example from the Quick Start page of JET.

We're still investigating how to use JET, because it is pretty advanced tooling. If you just started with types and Julia, I wouldn't dive right into this. Take your time to think about what type inference really means, and read the documentation of JET if you want to know more.

Reducing Compile Time with SnoopCompile.jl

Optimizing your code such that the compilation times are reduced is maybe not the first thing that comes to mind when thinking about "code quality", but it can improve the user experience of your package. Nobody likes to wait long to import your package. The SnoopCompile.jl package helps you with analyzing your code for such improvements. It "snoops" on the compiler and reports on its findings.

There is a lengthy blog post from the SciML community on how they improved their compilation times with SnoopCompile and other tools, called How Julia ODE Solve Compile Time Was Reduced From 30 Seconds to 0.1. Definitely read that one for more information, I will not go into details here, but I do think SnoopCompile is worth a mention.

Conclusion

You have plenty of options to check the code quality of your Julia packages and improve the quality over time. If this feels like a daunting task as a beginning (Julia) developer, don't worry, you can slowly add these tools to your workflow over time. The most important thing is to start with a good package structure and basic testing. The fact that the Julia ecosystem is so focused on making quality easy for beginners is truly praise-worthy and will help us all in the long run.

For senior developers and managers looking into these tools, one thing to remember is that lots of code quality tooling in Julia is written with the human developer in mind. This currently limits some of the integration in automated CI tools. I believe this topic deserves some more attention in the Julia community and more support from commercial code quality tooling vendors. The good thing is that due to the standardization of the Julia package management it is very easy to get started with a uniform automation system in your organization. As the tools improve for these systems, it will be easy to incrementally add such tools to any open source or internal CI workflows.

Thanks to my colleague Matthijs den Otter for helping with the investigation. If we find better ways to monitor your Julia code quality, I intend to share that here, so don't forget to subscribe to the blog.

The Art of Multiple Dispatch

Matthijs Cox — Thu, 20 Apr 2023 09:01:49 GMT

I love thinking visually by drawing doodles and schematics for my work. It's one of my favorite things to do, next to coding. When working with the Julia language, one visualization I enjoy is seeing the type space of a method that you are dispatching on. Normally I do this in my mind's eye, but let me clarify this by drawing some actual figures.

To start with the basics: Julia has functions and methods. A function is simply the name, like push! or read . Methods are specific definitions of a function, for certain types of arguments. Take for example push!(s::Set, x) or read(io::IO) . From an object-oriented perspective you could say that methods are instances of functions.

For any given method you can consider the dispatching as slicing a part of the entire possible type space of that given function. For a given set of arguments of course. If you increase the number of arguments in the function definition, then more dimensions get added to the type space. I don't even know how to find the best written words for this, the visualization above just feels intuitive to me.

Let's take the function f and imagine for a moment that there are only 3 types in the whole Julia type universe: the Float64, the Int64 and the String. The Float64 and the Int64 are subtypes of Number, which is obvious I hope. By default in Julia, if you specify no type for a function argument, it is assumed you mean the Any type, of which every other type is a subtype.

A method f(::Any, ::Any) thus describes the entire space of all possible types for the function named f . On the other hand, a method like f(::Int64, ::String) is super concrete, it's a singular point in the type space.

You can use abstract types like Number or unions like Union{Float64, Int64} to capture a subset of the discrete type space. This way you can choose which part you want to define for your function, with the chosen set of types you will be dispatching on at runtime. Abstract types in Julia exist only for this dispatching purpose, to dispatch on a set of subtypes; they have no other influence on their subtypes whatsoever. They are not dictators like classes in other languages.
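Here is a small sketch of these regions in code. The function `describe` is a hypothetical name that I use only for this illustration, and of course the real Julia type universe contains many more than three types.

```julia
# Both numeric types from the text sit below Number; String does not.
@show Int64 <: Number
@show Float64 <: Number
@show String <: Number
@show String <: Any

# Hypothetical function `describe`, defined on four different regions of the type space.
describe(::Any, ::Any) = "anything"                                # the whole space
describe(::Number, ::Number) = "two numbers"                       # an abstract region
describe(::Union{Float64, Int64}, ::String) = "number & string"    # a union on one axis
describe(::Int64, ::String) = "int & string"                       # a single point

describe(1, 2.0)      # "two numbers"
describe(1.5, "a")    # "number & string"
describe(1, "a")      # "int & string"
describe("a", "b")    # "anything"
```

Each call lands in the smallest region that contains its argument types.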

I like these visuals. Some junior engineers wonder what "diagonal dispatch" is. I don't have any other way of explaining the concept than by just drawing it. The figure is immediately obvious. Diagonal dispatch happens when the types of all arguments are forced to be equal with f(::T, ::T) where T . This truly represents a diagonal through the type space. You can see it in the example above. You can also limit the diagonal dispatch to a subset with f(::T, ::T) where T<:Number and in higher dimensions you can be fancy like f(::T, ::T, ::S) where {T<:Number, S<:AbstractString} by adding multiple of these parametric types.
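For readers who prefer code over drawings, here is a minimal sketch of diagonal dispatch. The function `g` is a hypothetical example, chosen so it does not clash with the `f` used elsewhere in this post.

```julia
# Hypothetical function `g` with a fallback and two diagonal methods.
g(::Any, ::Any) = "any & any"
g(::T, ::T) where {T} = "same type"
g(::T, ::T) where {T<:Number} = "same number type"

g(1, 2)        # "same number type": both Int64, and Int64 <: Number
g("a", "b")    # "same type": both String, but String is not a Number
g(1, 2.0)      # "any & any": Int64 and Float64 are off the diagonal
g(1, "a")      # "any & any"
```

Note that `g(1, 2.0)` does not hit a diagonal method even though both arguments are numbers, because the two concrete types differ.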

Note that when you define multiple overlapping methods for the same function, you have to take care that it is clear which method gets dispatched on. The compiler will prioritize the one that is most concrete, so the one that is most specific about the types. In the figure above, I ordered them from most specific to least specific. You can try for yourself to see if I ordered them correctly.

For example if you define the following:

f(::Any, ::Any) = println("any & any")
f(::Int64, ::Int64) = println("int & int")

Then most function calls will run the broadest method because that one is defined for the whole type space, but when you input two integers you will call the very specific method f(::Int64, ::Int64) . Let's give it a go:

julia> f("string", 5)
any & any

julia> f(4, 5)
int & int

From a visual perspective, we have created an overlapping dispatch, where one method is specifically defined for the integer case f(::Int64, ::Int64) and will be called when only integers are used as arguments.
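If you want to check which method will be selected without actually calling it, you can ask Julia directly. The sketch below uses a hypothetical function `k` that mirrors the two methods of `f`, together with the `which` and `methods` functions from Base.

```julia
# Hypothetical function `k`, mirroring the two methods of `f` from the text.
k(::Any, ::Any) = "any & any"
k(::Int64, ::Int64) = "int & int"

# `which` reports the method that dispatch would select for the given argument types.
println(which(k, (String, Int64)))
println(which(k, (Int64, Int64)))

# `methods` lists every method defined for the function.
println(length(methods(k)), " methods defined for k")
```

On the REPL, the `@which k(4, 5)` macro gives the same answer for a concrete call.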

There are some caveats here. If you are not careful the methods can become ambiguous and Julia won't like that. For example if you define the following:

f(::Any, ::String) = println("any & string")
f(::String, ::Any) = println("string & any")

Which one of these methods should be called with f("string", "string") ?

julia> f("string", 5)
string & any

julia> f(5, "string")
any & string

julia> f("string", "string")
ERROR: MethodError: f(::String, ::String) is ambiguous.  Candidates:
  f(::Any, ::String) in Main at REPL[8]:1
  f(::String, ::Any) in Main at REPL[9]:1
Possible fix, define
  f(::String, ::String)

Yikes, that's impossible! Fortunately there is a fix proposed, by explicitly defining the ambiguous case. Though perhaps you should reconsider what your actual intentions are in this design. The visual representation below hopefully makes the mistake more clear. At first there is confusion because the two dispatches overlap and neither is more specific than the other, but we can fix it by defining a more concrete method in the conflicting area.
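The proposed fix can be sketched as follows. I use a hypothetical function `p` that returns strings instead of printing, so the behavior before and after the fix is easy to check.

```julia
# Hypothetical function `p`, reproducing the ambiguous pair from the text.
p(::Any, ::String) = "any & string"
p(::String, ::Any) = "string & any"

# Before the fix, calling p with two strings raises a MethodError.
ambiguous_before = try
    p("string", "string")
    false
catch err
    err isa MethodError
end

# The fix: define the overlapping corner explicitly.
p(::String, ::String) = "string & string"

p("string", "string")   # "string & string"
p(5, "string")          # "any & string", unchanged by the fix
```

Whether the new method should do something different from the other two, or simply forward to one of them, depends on what you intended in the first place.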

When you define a lot of methods, you are creating a colorful patchwork in the type space of your function. You can come up with the craziest designs in your methods, but be careful. Finding the right balance of a few big broad abstract methods versus multiple tiny concrete methods is a true art in Julia.

People do not often share how they visualize the code design in their mind, while I believe this really shapes the creative process. The closest visual representation in Julia I have seen is the article about Julia's dispatch with Pokemon types. You can read that for more detailed examples with Julia's multiple dispatch.

That concludes this short artsy post, but I hope it helps the visual thinkers in the programming community! Let me know if you use different kinds of visualizations in your coding work.

My Target Audience

Matthijs Cox — Tue, 18 Apr 2023 11:46:15 GMT

What kind of people do I have in mind while writing this blog? People who share my professional mission of course! What is that mission you ask? Let me elaborate.

From Research to Engineering

I am a scientist who danced with startups and moved into industry. I've spent years doing research, and years doing software product development. During those years I have gotten slightly frustrated with the process of moving ideas from research to production, primarily in my favorite field of scientific computing.

The typical problem is that the time from research to production can be long, often spanning multiple years per project, and the process can be error prone. Some common smaller issues I see arising in many of our projects:

Unreproducible, unexplainable scripts or notebooks in research
Effort to translate from one technology to another
Difficulty optimizing the idea for direct product integration

One typical constraint problem I observed is the following. The researchers quickly write a prototype, management wants it turned into a product, the software engineers then spend ages deciphering the prototype and converting it for the specifics of the production systems. Because the researchers are not the bottleneck they continue tinkering with the prototype and bothering the software engineers with new insights, which further reduces the overall efficiency. Meanwhile other researchers find new prototypes and hand them over to these software engineering teams. This can result in a never ending cycle of downwards productivity.

While many improvements in this scenario are possible and needed, one aspect attracts a lot of attention: the technology differences. For many years it was taken for granted that we need different tools in the separate phases between research and product development. For quick research and exploration you use an easy to use programming language, for example Python. For high performance, reliable, scalable production environments, you use a language like C++, sacrificing ease-of-use in favor of these important utilities. This two language problem immediately creates a bottleneck in handing over ideas smoothly for development.

The Two Culture Problem

The two language problem was globally accepted for several decades. You just learn to live with it. Until one day it was challenged by the Julia language, a programming language that promises both speed and ease of use. As I feel the pain of the two language problem deeply, I wanted to try out this new solution. So together with several allies I went on a mission to adopt this new technology at work, and remove the bottleneck.

While we had initial success and attention, we quickly stumbled into resistance from the existing groups of researchers/scientists and developers. Over time I have named this the two culture problem. In the beginning I didn't see the cultures clearly, which limited our success. I was too focused on the technological problem itself.

I will refer to the two cultures as scientists versus developers. However, the scientists group generalizes to anyone who codes quick and dirty to explore, such as domain experts, data analysts and others like that. I do hope everyone is doing their exploration somewhat scientifically, so the generalization should make sense. Scientists typically want to get their stuff done, perhaps with code, but they don't care about the code. Software developers care deeply about the code craftsmanship, sometimes obsessively so, but often developers barely understand the business domain or science. There are people near the middle, trying to balance both, but they are a rare breed.

The benefit of these two names "scientist" versus "developer" is that I can call the middle ground "scientific developers", or "scientific coders", or "scientific software engineers", which is the kind of people I have in mind to form the bridge between these two cultures. I want more of these super engineers and give them the tools they need to succeed.

Bridging The Two Cultures

I have discussed some of my attempted changes in my blog post Organizational Refactoring. People who walk between cultures feel friction, their skills not always acknowledged, yet they get to see the problems and the opportunities from multiple perspectives. Often an individual can become immensely valuable by learning the skills that people stuck in a single mindset are unwilling to pick up.

So let's start a new culture right there in the middle, by connecting people who share this passion for improvement. The Julia language is a tool designed for scientific coders, though not necessarily available to use in their current organization which may force upon them the tools of the dominant cultures. I have spoken with many who see the bottleneck between science and production, but are frustrated that they are not allowed to solve it. This is a good sign! There is a desire for bottom-up change!

I am not saying the Julia language is the answer to everything. You can do fine with a combination of tools, especially if you do not write your own high performance algorithms. Many data scientists I worked with only use off-the-shelf Python packages with a C/Fortran/C++ backend. They barely look under the hood of their favorite packages, and never try to write a performant algorithm themselves. So they are dependent on the small group of interested software developers who can juggle multiple technologies.

Numerical computing professionals are a different breed: they want to write their own high-performance code. They are often unsatisfied with the features of the libraries embedded in Python. These professionals will run into the problems I described, especially in larger organizations. They are the same struggles I have run into over the years. So there you have my target audience: scientific software engineers. In the end it's not about betting on a single technology, but about betting on the right kind of people.

This audience is a smaller group of people than either scientists or pure software engineers. Yet I believe that if we band together there are plenty of us. Unfortunately, it can be difficult to find each other.

Do you feel the pains I describe? Do you feel a desire for improvement? Do you share my professional mission for better numerical product development? Then this blog is for you! Subscribe to the newsletter and let me know what kind of knowledge would help you in your personal struggles!

Why this blog?

Matthijs Cox — Sun, 16 Apr 2023 17:27:18 GMT

My goal in life is to follow my curiosity and help others with the knowledge I uncover. A solid way to help people is to reflect on the last years and think about what would have helped your past self. In my case, I have been busy trying to develop and deploy numerical computing applications with the Julia language. One thing that I found lacking was good material on how to do numerical computing professionally in the industry, something I learned is sometimes called "scientific software engineering". A book or blog about large-scale development and deployment into production systems would have helped me tremendously. I had to learn most things the hard way through trial and error.

Since what I need doesn't exist, we will have to make it exist! That's this blog!

I have been blogging about my journey on my private website functionalnoise.com, but I will separate the professional software content from my private musings there. That's why I am starting this new blog for you, including a newsletter to easily subscribe to my content (something people have been asking me for a while now).

In the future, I would like to write a book, but I will first freely share my current knowledge here and get your feedback. I am also interested in interviewing people in the industry or doing collaborative blogging with people (Let me know if that's you! Don't underestimate yourself, few people are experts in this field). Once I have a good amount of posts, I'd like to polish those and bundle them into a book.

Side note: if you prefer podcasts about numerical computing software, head over to inspiringcomputing.com. This is an initiative from a fellow scientific traveler, which I encourage even if I prefer blogs and books over audio content.

The blog name

I came up with a few options and asked for feedback on LinkedIn. These are the poll results:

I was personally doubting between "The Numerical Mind" and "The Scientific Coder". They both sound more personal than the other two. "The Numerical Mind" had my slight personal preference because there is a hint to my open-source work with brain waves and biosensors. The Scientific Coder matches better with the target audience I have in mind: my kind of people. So I opted for the latter.

Who are my people?

I will write a longer post about who I consider my ideal audience, but in short: scientific software engineers. Scientists who love to code and want to learn professional software development. Software engineers with a scientific background or interest in scientific computing. And the magical unicorns who can already do both, but want to improve their skills further.

What do my people want?

I asked in a LinkedIn post what people wanted to read in a book about my work.

I summarize the requests as:

Pure Julia deployment options.
Deploying into old/reliable systems such as C++/Python.
Compare development and deployment options to Python/Matlab/R deployment.
Environment setup
The best way to write the code
(Large scale) code organization
Development workflow (alone, or with multiple people and teams?)
Examples of applications from start to finish, and in all stages of development.
How to overcome the two-culture problem (scientists vs developers). I want to blog about this in more detail.

Plenty of material for a blog and book!

How is this blog created?

If anyone is interested, or for my future reference, this blog is created and hosted on hashnode.com. I chose HashNode because it promises optimal tech writer tools, unlike other blogging sites like SubStack or Ghost.

Here's a quick experiment with a code block and Julia syntax highlighting:

function example(x::String)
    return x * "foo"
end

Everything works fine!

I bought the domain on namecheap.com and set up the DNS redirect to HashNode.

HashNode blogs come with an automatic email newsletter. I am curious how that will play out. Let me know if you (dis)like the newsletter.

Don't forget to subscribe! That way you will be informed about new blog posts. And you'll encourage me to write my book when I know I have a decent audience.