<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Scientific Coder]]></title><description><![CDATA[A scientific software developer with over a decade of experience in academia, startups and industry. My mission is to turn you into an elite numerical computing]]></description><link>https://scientificcoder.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1683717871482/hyBTSYFqt.JPG</url><title>The Scientific Coder</title><link>https://scientificcoder.com</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 04:01:03 GMT</lastBuildDate><atom:link href="https://scientificcoder.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[When should you turn your ugly research script into a reproducible package?]]></title><description><![CDATA[Blog status: I haven’t been spending much time on the blog. I admit I don’t have much to say at the moment, we’ll see how it goes in the future. This current article is mostly a journal-like question for myself, to reflect a bit, but maybe it helps o...]]></description><link>https://scientificcoder.com/when-should-you-turn-your-ugly-research-script-into-a-reproducible-package</link><guid isPermaLink="true">https://scientificcoder.com/when-should-you-turn-your-ugly-research-script-into-a-reproducible-package</guid><category><![CDATA[General Advice]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Mon, 13 Jan 2025 10:24:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1736763701370/17fbe5f4-10d6-4d4a-9f77-0ebad18ff0eb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Blog status: I haven’t been spending much time on the blog. 
I admit I don’t have much to say at the moment, we’ll see how it goes in the future. This current article is mostly a journal-like question for myself, to reflect a bit, but maybe it helps others as well.</em></p>
<p>Lately I’ve been re-assigned to a research project, and instead of writing ‘production’ software I spend my days preparing experiments and analyzing data with ‘throw-away’ scripts. Often these scripts are little code snippets that I quickly write to investigate something and then discard again. But just as often I do need to re-use code, or want to share the code with others, and then the question becomes whether I should improve the quality. So when is it the right time to turn a script into something more reproducible and easier to maintain?</p>
<p>When you are writing ‘real’ software this question doesn’t pop up. Your code is important and will be deployed to some automated system, so you add all the required engineering quality, like unit tests and documentation and everything else. But in a research setting there’s a lot more gray area. On the one extreme there’s that script that you are certain will be used only once. And on the other extreme there’s code that you are certain will be re-used by your future self and colleagues and should thus be turned into a high quality package. In between there is a whole spectrum of other cases. Maybe you have a script that you used a couple of times now. Or maybe a co-worker asked how to reproduce your data analysis and asks for the code you used for that. You are not very comfortable with this code, you didn’t really test it well, you didn’t put much effort into it, you are not proud of it, you are not certain it’s correct and it’s still rapidly evolving. What to do with this code?</p>
<p>Some people say you should immediately turn any code into a unit tested package. But I personally do not believe that is feasible. That first script is an experiment in finding out what code you even need to write. You don’t even know what results to expect yet. You just have to fiddle with code, data and plots until things start to make sense. There’s nothing wrong with that (unreproducible) exploration.</p>
<p>Other people never ever write packages, or unit tests, or documentation. They may refuse to share their code with others. Maybe out of fear they’ll be judged by others, or because it’s too much work to share the code, or any other reason. I disagree with this approach as well. If you are a scientist/researcher working with data and code, and you stumble upon a presentable insight, or find yourself repeating the same tasks, then part of the job is to make your code legible and reproducible.</p>
<p>(Nowadays I notice people are often insecure about sharing their code with me. They will do so with a lot of comments like “I’m not a good coder”, “this is not very good code, please don’t expect too much”, “please don’t share this code with anyone else”. This happened less when I was a junior coder myself. I understand the sentiment. I should probably spend more time reassuring people up front that I’m not there to judge their code; I just want to understand how they did their analysis and figure out how to continue together. There’s some lesson in here for senior programmers reviewing scripts of researchers.)</p>
<p>Currently I have some code that’s in the gray area. I have a few scripts that I keep re-using. I already made the scripts <a target="_blank" href="https://scientificcoder.com/clean-code-tips-for-scientists-1-reproducible-environments">fully reproducible</a> (they are in a git repo and have a well defined environment). Every time I run the scripts on new data I have to update the functions to work on that new case, so they are becoming more and more generic and abstract. This is actually quite nice, those functions are becoming very useful to me! But at the same time, I’m also beginning to forget what each function is actually doing. And when I change one of these functions, I might actually break a previous analysis I performed (and I’m lazy, I don’t want to re-run all previous analyses to find out). Some code that I expected to be re-used by others I already placed inside a package. But the code that’s still left in the scripts is just so specific to <em>my</em> analysis. No one else wants to use that code yet. I really feel this barrier, this question of whether I want to put in the effort to write a package, with the risk that no one will use it while I will have to keep updating failing unit tests just to do my research.</p>
<p>I just want to say that I understand the desire to not write high quality code for your research. This is the source of the <a target="_blank" href="https://scientificcoder.com/my-target-audience#heading-the-two-culture-problem">two culture problem</a> between scientists and engineers.</p>
<p>At some point I will pass the threshold. I will keep re-doing the same analysis for months and just want quick reproducible functions. Or others will want to run my code and I don’t want to explain the functions to them all the time. Then the answer is clear, the time has come, I will turn my scripts into a package. (Even at this point some people will still refuse to turn their code into an easily installable, reproducible, well-documented package. Please don’t be like those people.)</p>
<p>In the meantime, it’s good to recognize this gray area between ‘ugly script’ and ‘high quality code’. It’s good to ask yourself often whether it’s time to turn that script into a package. If you find yourself asking this question then that’s not a sign of insecurity, it’s a sign you are growing as a professional scientific coder.</p>
]]></content:encoded></item><item><title><![CDATA[Straightforward Functional Programming Examples in Julia]]></title><description><![CDATA[Functional programming has gained quite some popularity in recent years. Yet if you code with the Julia language you probably already used a lot of functional programming concepts without really thinking about it. In it's essence functional programmi...]]></description><link>https://scientificcoder.com/straightforward-functional-programming-examples-in-julia</link><guid isPermaLink="true">https://scientificcoder.com/straightforward-functional-programming-examples-in-julia</guid><category><![CDATA[Julia]]></category><category><![CDATA[Functional Programming]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Wed, 18 Sep 2024 09:36:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1721213651461/f4ec729d-1c33-479e-842e-bd2fc3d0f124.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Functional programming has gained quite some popularity in recent years. Yet if you code with the Julia language you probably already used a lot of functional programming concepts without really thinking about it. In essence, functional programming simply means that functions can be used as arguments in other functions.</p>
<p>I noticed recently that I have been using more functional programming concepts in my daily coding. Mostly I am moving away from vectorized code to using "higher order functions". This might sound fancy, but it's pretty straightforward. Let me explain with some examples.</p>
<p>Here are a few simple ways to check whether an array has any values less than 3.</p>
<pre><code class="lang-julia">numbers = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]

<span class="hljs-comment"># vectorized</span>
any(numbers .&lt; <span class="hljs-number">3</span>)

<span class="hljs-comment"># functional, using a pre-defined function</span>
less_than_three(x) = x &lt; <span class="hljs-number">3</span>
any(less_than_three, numbers)

<span class="hljs-comment"># functional, using a lambda/anonymous function</span>
any(x -&gt; x &lt; <span class="hljs-number">3</span>, numbers)

<span class="hljs-comment"># functional, using a 'currying' function</span>
any(&lt;(<span class="hljs-number">3</span>), numbers)
</code></pre>
<p>You can see that the <code>any</code> function can either take a single (boolean) vector as input, or it can take a function and a vector as input. Passing a function into a function is a form of <em>functional</em> programming. Functions that take other functions as arguments are called <em>higher order functions</em>.</p>
<p>You can pass any function that returns a boolean into <code>any</code>. It can just be the name of an existing function, an "anonymous" function like <code>x -&gt; x &lt; 3</code> or, as shown above, a "curried" function. The functional programming people love inventing new names for concepts. Currying just means that a function can return a function with some arguments already filled in. So <code>f(1, 2, 3)</code> can become <code>f(1)(2)(3)</code>. (Strictly speaking, filling in only some of the arguments is called partial application, a close cousin of currying.) This is what happened with the function call <code>&lt;(3)</code>: it returns a function similar to <code>x -&gt; x &lt; 3</code>. Essentially you called something like <code>&lt;(y) = x -&gt; x &lt; y</code>, so all of these examples are equivalent:</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-number">2</span> &lt; <span class="hljs-number">3</span>
<span class="hljs-literal">true</span>

julia&gt; &lt;(<span class="hljs-number">2</span>,<span class="hljs-number">3</span>)
<span class="hljs-literal">true</span>

julia&gt; &lt;(<span class="hljs-number">3</span>)(<span class="hljs-number">2</span>)
<span class="hljs-literal">true</span>
</code></pre>
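<p>To make the idea concrete, here is a minimal sketch of a hand-written version of the same trick. The name <code>less_than</code> is hypothetical and only used for illustration, it is not part of Julia's <code>Base</code>:</p>
<pre><code class="lang-julia"># a hand-written "curried" comparison (hypothetical helper, not part of Base)
less_than(y) = x -&gt; x &lt; y

less_than_three = less_than(3)   # a function that is waiting for its `x`
less_than_three(2)               # true
less_than_three(5)               # false

# it can be passed to a higher order function just like &lt;(3)
any(less_than(3), [1, 2, 3])     # true
</code></pre>
<p>The built-in <code>&lt;(3)</code> gives the same results, you just don't have to write the helper yourself.</p>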
<p>I nowadays always choose the functional style of programming like <code>any(&lt;(3), numbers)</code>, since the vectorized form will first allocate the boolean vector <code>numbers .&lt; 3</code> in memory before calling <code>any</code>. The functional form does not need to create this vector in memory. So the functional style is typically more performant, especially if the <code>any</code> function can stop early:</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-keyword">using</span> BenchmarkTools

julia&gt; numbers = <span class="hljs-number">5</span> .* ones(<span class="hljs-number">10_000</span>);

julia&gt; <span class="hljs-meta">@btime</span> any($numbers .&lt; <span class="hljs-number">3</span>);
  <span class="hljs-number">4.271</span> μs (<span class="hljs-number">3</span> allocations: <span class="hljs-number">5.55</span> KiB)

julia&gt; <span class="hljs-meta">@btime</span> any(&lt;(<span class="hljs-number">3</span>), $numbers);
  <span class="hljs-number">3.750</span> μs (<span class="hljs-number">0</span> allocations: <span class="hljs-number">0</span> bytes)

julia&gt; numbers[<span class="hljs-number">5</span>] = <span class="hljs-number">0.0</span>;

julia&gt; <span class="hljs-meta">@btime</span> any($numbers .&lt; <span class="hljs-number">3</span>);
  <span class="hljs-number">4.229</span> μs (<span class="hljs-number">3</span> allocations: <span class="hljs-number">5.55</span> KiB)

julia&gt; <span class="hljs-meta">@btime</span> any(&lt;(<span class="hljs-number">3</span>), $numbers);
  <span class="hljs-number">3.400</span> ns (<span class="hljs-number">0</span> allocations: <span class="hljs-number">0</span> bytes)
</code></pre>
<p>Next to <code>any</code> I mostly use <code>all</code>, <code>filter</code> (and <code>filter!</code>), <code>map</code>, <code>reduce</code> and <code>mapreduce</code> in my daily coding. The functions <code>any</code>, <code>all</code> and <code>filter</code> seem obvious in their behavior:</p>
<pre><code class="lang-julia">julia&gt; numbers = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>];

julia&gt; any(&lt;(<span class="hljs-number">3</span>), numbers)
<span class="hljs-literal">true</span>

julia&gt; all(isequal(<span class="hljs-number">3</span>), numbers)
<span class="hljs-literal">false</span>

julia&gt; filter(&lt;(<span class="hljs-number">3</span>), numbers)
<span class="hljs-number">2</span>-element <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Int64</span>}:
 <span class="hljs-number">1</span>
 <span class="hljs-number">2</span>
</code></pre>
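<p>A few more variations, as a small sketch with made-up data. Note the difference between <code>filter</code>, which returns a new vector, and <code>filter!</code>, which modifies its input:</p>
<pre><code class="lang-julia">numbers = [1, 2, 3, 4, 5]

all(&gt;(0), numbers)               # true, every element is positive
count(iseven, numbers)           # 2, `count` also accepts a function

small = filter(&lt;(3), numbers)    # new vector [1, 2], `numbers` is unchanged
filter!(isodd, numbers)          # modifies `numbers` in place, now [1, 3, 5]
</code></pre>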
<p>The <code>map</code> function is typically similar to a simple broadcast: it just applies (in other words <em>maps</em>) a function to each element in a collection. <code>map(f, x)</code> is equivalent to <code>f.(x)</code> in many cases, so you can choose whichever you like:</p>
<pre><code class="lang-julia">julia&gt; numbers = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>];

julia&gt; map(x -&gt; x^<span class="hljs-number">2</span>, numbers)
<span class="hljs-number">3</span>-element <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Int64</span>}:
  <span class="hljs-number">1</span>
  <span class="hljs-number">4</span>
  <span class="hljs-number">9</span>

julia&gt; numbers.^<span class="hljs-number">2</span>
<span class="hljs-number">3</span>-element <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Int64</span>}:
  <span class="hljs-number">1</span>
  <span class="hljs-number">4</span>
  <span class="hljs-number">9</span>
</code></pre>
<p>However, in some cases <code>map</code> is more efficient, see this <a target="_blank" href="https://discourse.julialang.org/t/when-to-use-broadcasting-with-vs-map/58078">discussion here</a>.</p>
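<p>Two other forms of <code>map</code> that can be handy, shown as a small sketch with made-up data: mapping over several collections at once, and the <code>do</code> block syntax for longer anonymous functions:</p>
<pre><code class="lang-julia">numbers = [1, 2, 3]

# map over two collections at once, element by element
sums = map(+, numbers, [10, 20, 30])    # [11, 22, 33]

# the do-block passes a multi-line anonymous function as the first argument
labels = map(numbers) do x
    isodd(x) ? "odd" : "even"
end                                     # ["odd", "even", "odd"]
</code></pre>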
<p>What I find more interesting are <code>reduce</code> and <code>mapreduce</code>. The <code>reduce</code> function repeatedly combines the elements of a collection with a two-argument function, feeding the result so far back in together with the next element, until a single value remains. I think a simple example makes this clearer:</p>
<pre><code class="lang-julia">julia&gt; numbers = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>];

julia&gt; reduce(+, numbers)
<span class="hljs-number">15</span>

julia&gt; sum(numbers)
<span class="hljs-number">15</span>
</code></pre>
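<p>The same pattern works with any two-argument function. Here is a short sketch with a few other operators, plus the <code>init</code> keyword and <code>foldl</code>, which is the variant to reach for when the order of evaluation matters:</p>
<pre><code class="lang-julia">numbers = [1, 2, 3, 4, 5]

reduce(*, numbers)            # 120, same as prod(numbers)
reduce(max, numbers)          # 5, same as maximum(numbers)

# `init` provides a starting value, which also makes empty collections work
reduce(+, Int[]; init = 0)    # 0

# reduce does not promise an evaluation order, foldl always goes left to right
foldl(-, numbers)             # ((((1 - 2) - 3) - 4) - 5) == -13
</code></pre>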
<p>More powerful is the <code>mapreduce</code> function, which as the name suggests, combines both a map and a reduce:</p>
<pre><code class="lang-julia">julia&gt; numbers = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>];

julia&gt; mapreduce(x -&gt; x^<span class="hljs-number">2</span>, +, numbers)
<span class="hljs-number">55</span>

julia&gt; sum(numbers.^<span class="hljs-number">2</span>)
<span class="hljs-number">55</span>
</code></pre>
<p>As before the broadcast/vectorized code will create another vector in memory (the <code>numbers.^2</code>) and only then does the summing, while the <code>mapreduce</code> doesn't need to do this, so that's a big advantage for <code>mapreduce</code>. To be fair, Julia also allows a mapreduce with <code>sum(x -&gt; x^2, numbers)</code>, which might be more readable in this case.</p>
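<p>As another small sketch (the word list is made up), <code>mapreduce</code> is not limited to numbers or to <code>+</code>, and several <code>Base</code> reductions accept a function as first argument in the same way as <code>sum</code>:</p>
<pre><code class="lang-julia">words = ["functional", "julia", "code"]

total_length = mapreduce(length, +, words)    # 10 + 5 + 4 == 19
longest = mapreduce(length, max, words)       # 10

# the specialised shortcuts accept a function as first argument too
sum(length, words) == total_length            # true
maximum(length, words) == longest             # true
sum(abs2, [1, 2, 3, 4, 5])                    # 55, abs2(x) is abs(x)^2
</code></pre>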
<p>Wow, so we actually discussed a lot of functional programming concepts here, without going into the details:</p>
<ul>
<li><p>higher order functions like <code>any(f, collection)</code></p>
</li>
<li><p>anonymous functions like <code>x -&gt; x^2</code></p>
</li>
<li><p>curried functions like <code>&lt;(3)</code></p>
</li>
<li><p>reduce functions like <code>reduce</code> and <code>mapreduce</code></p>
</li>
</ul>
<p>Another concept that functional programmers love, but which I barely use in Julia is <em>function composition</em>. Here's an example:</p>
<pre><code class="lang-julia"><span class="hljs-comment"># let's say we have two functions</span>
add_one(x) = x + <span class="hljs-number">1</span>
double(x) = <span class="hljs-number">2</span>x

<span class="hljs-comment"># we can define our own compose function</span>
compose(f,g) = x -&gt; f(g(x))
add_one_and_double = compose(double, add_one)
add_one_and_double(<span class="hljs-number">5</span>) <span class="hljs-comment"># returns 12</span>

<span class="hljs-comment"># or using the compose operator ∘, which does the above</span>
add_one_and_double = double ∘ add_one
add_one_and_double(<span class="hljs-number">5</span>) <span class="hljs-comment"># returns 12</span>
</code></pre>
<p>It may look very elegant, but I only occasionally see a use for such composition. And it's unintuitive to many programmers unfamiliar with the concept. You can already compose functions the old-fashioned way: <code>add_one_and_double(x) = double(add_one(x))</code> and that serves most purposes in my opinion.</p>
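<p>One place where composition can be convenient is as the argument of a higher order function. A minimal sketch, re-using the two functions from above, together with the related pipe operator <code>|&gt;</code>:</p>
<pre><code class="lang-julia">add_one(x) = x + 1
double(x) = 2x

# composition as the argument of a higher order function
map(double ∘ add_one, [1, 2, 3])           # [4, 6, 8]

# the same thing with an anonymous function
map(x -&gt; double(add_one(x)), [1, 2, 3])    # [4, 6, 8]

# the pipe operator applies functions from left to right
5 |&gt; add_one |&gt; double                     # 12
</code></pre>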
<p>So that's it! These are all functional programming concepts that I use on a daily basis in my Julia programming. Especially the use of "higher order functions", like <code>any(&lt;(3), [1,2,3,4])</code> I use a lot and actively try to favor over any vectorized broadcasting. If you've been coding in Julia for a while now, I bet you've secretly already been doing lots of functional programming.</p>
]]></content:encoded></item><item><title><![CDATA[Julia Type Annotations]]></title><description><![CDATA[The Julia language allows type annotation in multiple ways, with different behaviors, in order to improve performance and readability of the code. Types annotations always use the :: syntax, for example in function declarations such as f(variable::In...]]></description><link>https://scientificcoder.com/julia-type-annotations</link><guid isPermaLink="true">https://scientificcoder.com/julia-type-annotations</guid><category><![CDATA[Julia]]></category><category><![CDATA[Type Annotation]]></category><category><![CDATA[Types]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Tue, 16 Jul 2024 08:19:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1721115946719/caea6f6c-497a-410a-816e-1624c9174180.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Julia language allows type annotation in multiple ways, with different behaviors, in order to improve performance and readability of the code. Type annotations always use the <code>::</code> syntax, for example in function declarations such as <code>f(variable::Integer)</code>. When you are new to Julia, it might not be clear what the possible type annotations are, nor what their expected behavior is.</p>
<p>The behavior of type annotation is well-documented, but a little scattered across the Julia manual. So I decided to write an overview here. In short, these are the kinds of type annotation I know of:</p>
<ol>
<li><p>Field declarations in composite types:</p>
<pre><code class="lang-julia"> <span class="hljs-keyword">struct</span> MyType
     field::<span class="hljs-built_in">String</span>
 <span class="hljs-keyword">end</span>
</code></pre>
</li>
<li><p>Method definitions may contain type annotation:</p>
<pre><code class="lang-julia"> my_function(input::<span class="hljs-built_in">Integer</span>) = input + <span class="hljs-number">1</span>
</code></pre>
</li>
<li><p>Type assertion of variables</p>
<pre><code class="lang-julia"> x = my_function(<span class="hljs-number">5</span>)::<span class="hljs-built_in">String</span>
</code></pre>
</li>
<li><p>Automatic type conversion</p>
<pre><code class="lang-julia"> x::<span class="hljs-built_in">Int8</span> = <span class="hljs-number">5</span>
 <span class="hljs-comment"># or as function output type</span>
 my_other_function(input)::<span class="hljs-built_in">Float64</span> = input + <span class="hljs-number">1</span>
</code></pre>
</li>
</ol>
<p>Additionally, we could add a fifth, overarching function of type annotation:</p>
<ol start="5">
<li>Documentation and code clarity</li>
</ol>
<p>While this is not strictly a technical behavior, code clarity can be a crucial reason to annotate types in any programming language.</p>
<p>This blog post is inspired by:</p>
<ul>
<li><p><a target="_blank" href="https://discourse.julialang.org/t/on-type-annotations/116305">Julia Discourse On type annotations</a></p>
</li>
<li><p><a target="_blank" href="https://discourse.julialang.org/t/annotating-types-best-practice-for-beginners/50521/2">Julia Discourse on Annotating Types: Best practice for beginners</a></p>
</li>
</ul>
<h2 id="heading-composite-types">Composite types</h2>
<p>You can define your own types in Julia very easily with the <code>struct</code> declaration. This is well documented in the <a target="_blank" href="https://docs.julialang.org/en/v1/manual/types/#Composite-Types">manual</a>. You could do this without any type annotation if you want to:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">struct</span> MyType
    x
    y
<span class="hljs-keyword">end</span>
</code></pre>
<p>However, this means the fields <code>x</code> and <code>y</code> can be any type, and the Julia compiler cannot optimize the memory layout for your type. Ideally the memory size of your type is known at compile time, and it's contiguous in memory (all bytes are stored directly after one another). A better memory layout will in turn lead to faster access to your type's fields.</p>
<p>The better practice is therefore to add the types of the fields (if you know them), and it's best if these field types are also <em>concrete</em>. That means they are not abstract types or unions. Then the compiler can better optimize your code. Here's an example:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">struct</span> MyConcreteType
    x::<span class="hljs-built_in">Int64</span>
    y::<span class="hljs-built_in">String</span>
<span class="hljs-keyword">end</span>
</code></pre>
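<p>If you are unsure whether your field types are concrete, you can ask Julia. A minimal sketch, where the struct names <code>UntypedPair</code> and <code>TypedPair</code> are hypothetical stand-ins for the two types above:</p>
<pre><code class="lang-julia">struct UntypedPair      # hypothetical name, mirrors the untyped struct above
    x
    y
end

struct TypedPair        # hypothetical name, mirrors the annotated struct above
    x::Int64
    y::String
end

fieldtypes(UntypedPair)                       # (Any, Any)
fieldtypes(TypedPair)                         # (Int64, String)

isconcretetype(Any)                           # false
isconcretetype(Integer)                       # false, an abstract type
isconcretetype(Union{Int64, String})          # false, a union
all(isconcretetype, fieldtypes(TypedPair))    # true
</code></pre>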
<p>You can also create <a target="_blank" href="https://docs.julialang.org/en/v1/manual/types/#Parametric-Types"><em>parametric</em> types</a>, which keep the field types flexible in the type definition, yet still give you a concrete type once an object is constructed.</p>
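<p>A minimal sketch of what that looks like. The type name <code>Point</code> is hypothetical and only serves as an illustration:</p>
<pre><code class="lang-julia"># a hypothetical parametric type, T can be any subtype of Real
struct Point{T&lt;:Real}
    x::T
    y::T
end

p_int = Point(1, 2)          # T is inferred as the integer type
p_float = Point(1.0, 2.0)    # T is inferred as Float64

typeof(p_float)                      # Point{Float64}
isconcretetype(typeof(p_float))      # true, the constructed object is concrete
isconcretetype(Point)                # false, T is not specified yet
fieldtype(typeof(p_float), :x)       # Float64
</code></pre>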
<h2 id="heading-method-definitions">Method definitions</h2>
<p>A method in Julia is simply a specific definition of a function, where every method has a different set of input types. Type annotations are used to define a method:</p>
<pre><code class="lang-julia"><span class="hljs-comment"># a method of `f` with floats</span>
f(x::<span class="hljs-built_in">Float64</span>, y::<span class="hljs-built_in">Float64</span>) = <span class="hljs-number">2</span>x + y
<span class="hljs-comment"># a method of `f` with strings</span>
f(x::<span class="hljs-built_in">String</span>, y::<span class="hljs-built_in">String</span>) = x * y
</code></pre>
<p>The Julia manual extensively discusses <a target="_blank" href="https://docs.julialang.org/en/v1/manual/methods/#Methods">methods and their behavior</a>.</p>
<p>You could use the method definition as some kind of type assertion, because calling a function with argument types for which no method exists will throw an error. But this is not guaranteed, because there might be a generic method defined, for example for <code>Any</code> input type, and then that method will be called. For the example above, where we have only 2 methods defined for <code>f</code>, we get an error for integer input:</p>
<pre><code class="lang-julia">julia&gt; f
f (generic <span class="hljs-keyword">function</span> with <span class="hljs-number">2</span> methods)

julia&gt; f(<span class="hljs-number">1</span>,<span class="hljs-number">2</span>)
ERROR: <span class="hljs-built_in">MethodError</span>: no method matching f(::<span class="hljs-built_in">Int64</span>, ::<span class="hljs-built_in">Int64</span>)
</code></pre>
<p>So this might be considered a kind of type assertion. But we can also define a generic function and then everything will work:</p>
<pre><code class="lang-julia">julia&gt; f(x, y) = <span class="hljs-string">"f will always work now"</span>
f (generic <span class="hljs-keyword">function</span> with <span class="hljs-number">3</span> methods)

julia&gt; f(<span class="hljs-number">1</span>,<span class="hljs-number">2</span>)
<span class="hljs-string">"f will always work now"</span>
</code></pre>
<p>If you want to use the method definitions themselves as a kind of type assertion for your own functions, you'll have to be careful to not declare a method that's very generic. Yet you will probably want your methods to work for a variety of input types, and not be too specific. Finding this balance is an art in Julia.</p>
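<p>One common middle ground is to annotate with an abstract type. A minimal sketch, where the function name <code>g</code> is hypothetical:</p>
<pre><code class="lang-julia"># hypothetical function `g`, annotated with the abstract type Real
g(x::Real, y::Real) = 2x + y

g(1, 2)        # 4, works for integers
g(1.5, 2)      # 5.0, and for mixed real numbers

# strings are not Real, so this call still throws a MethodError
got_method_error = try
    g("a", "b")
    false
catch err
    err isa MethodError
end
got_method_error   # true
</code></pre>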
<h2 id="heading-type-assertion">Type assertion</h2>
<p>Type assertion means that your code will fail if you encounter the wrong type. This helps you check that code works as expected and/or helps inform other developers what type is expected in that piece of code.</p>
<p>In Julia you trigger such an assertion by annotating a value on the right-hand side of an assignment, either on the REPL or inside a function.</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> my_assertion(x, y)
    z = f(x, y)::<span class="hljs-built_in">Float64</span>
    <span class="hljs-keyword">return</span> z
<span class="hljs-keyword">end</span>
</code></pre>
<p>Using the function <code>f</code> from the previous section on method definitions we have the following behavior:</p>
<pre><code class="lang-julia">julia&gt; my_assertion(<span class="hljs-number">1.0</span>,<span class="hljs-number">2.0</span>)
<span class="hljs-number">4.0</span>

julia&gt; my_assertion(<span class="hljs-string">"a"</span>, <span class="hljs-string">"b"</span>)
ERROR: <span class="hljs-built_in">TypeError</span>: <span class="hljs-keyword">in</span> typeassert, expected <span class="hljs-built_in">Float64</span>, got a value of <span class="hljs-keyword">type</span> <span class="hljs-built_in">String</span>
</code></pre>
<p>You can also annotate directly on the REPL in recent Julia versions:</p>
<pre><code class="lang-julia">julia&gt; x = <span class="hljs-number">5</span>::<span class="hljs-built_in">Int64</span>
<span class="hljs-number">5</span>

julia&gt; x = <span class="hljs-number">5</span>::<span class="hljs-built_in">String</span>
ERROR: <span class="hljs-built_in">TypeError</span>: <span class="hljs-keyword">in</span> typeassert, expected <span class="hljs-built_in">String</span>, got a value of <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int64</span>
</code></pre>
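<p>The asserted type does not have to be concrete, any supertype of the value's type passes as well. A small sketch, which also shows <code>isa</code> for the case where you want a boolean instead of an error:</p>
<pre><code class="lang-julia">5::Integer      # returns 5, since Int64 is a subtype of Integer
5::Real         # returns 5 as well

# `isa` performs the same check, but returns a Bool instead of throwing
5 isa Integer          # true
5 isa AbstractFloat    # false

# a failing assertion throws a TypeError, which you can catch
failed = try
    5::AbstractFloat
    false
catch err
    err isa TypeError
end
failed                 # true
</code></pre>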
<h2 id="heading-type-conversion">Type conversion</h2>
<p>There's a tricky difference in Julia depending on whether you place your type annotation on the left or right hand side of the assignment. As explained in the previous section, we get type assertion on the right hand side. But automatic type conversion happens when it's on the left hand side:</p>
<pre><code class="lang-julia">julia&gt; x = <span class="hljs-number">5</span>::<span class="hljs-built_in">Int8</span> <span class="hljs-comment"># type assertion</span>
ERROR: <span class="hljs-built_in">TypeError</span>: <span class="hljs-keyword">in</span> typeassert, expected <span class="hljs-built_in">Int8</span>, got a value of <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int64</span>

julia&gt; x::<span class="hljs-built_in">Int8</span> = <span class="hljs-number">5</span> <span class="hljs-comment"># type conversion that succeeds</span>
<span class="hljs-number">5</span>

julia&gt; x = <span class="hljs-string">"a"</span> <span class="hljs-comment"># note: the type of x is remembered now</span>
ERROR: <span class="hljs-built_in">MethodError</span>: Cannot <span class="hljs-string">`convert`</span> an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">String</span> to an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int8</span>

julia&gt; foo::<span class="hljs-built_in">String</span> = <span class="hljs-number">5</span> <span class="hljs-comment"># type conversion that fails</span>
ERROR: <span class="hljs-built_in">MethodError</span>: Cannot <span class="hljs-string">`convert`</span> an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int64</span> to an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">String</span>
</code></pre>
<p>This annotation will simply call <code>convert</code>, so <code>x::Int8 = 5</code> is equivalent to <code>x = convert(Int8, 5)</code>, except that the declared type of <code>x</code> is also remembered: every later assignment to <code>x</code> is converted to <code>Int8</code> as well.</p>
<p>Type conversion is very handy behavior, but may be less expected by new Julia developers. So be careful with these annotations.</p>
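<p>The same thing happens for local variables inside a function. A minimal sketch, with the hypothetical function name <code>typed_local</code>:</p>
<pre><code class="lang-julia">function typed_local()
    x::Int8 = 5    # declares x as Int8 for the whole function body
    x = 7.0        # later assignments are converted to Int8 too
    return x
end

typed_local()            # 7
typeof(typed_local())    # Int8
</code></pre>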
<p>Type conversion also happens automatically when you declare the output type of a function:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> convert_to_int8(x)::<span class="hljs-built_in">Int8</span>
    <span class="hljs-keyword">return</span> x
<span class="hljs-keyword">end</span>
</code></pre>
<p>Similar to the previous behavior, this will convert the return value to <code>Int8</code>, and fail if no suitable <code>convert</code> method exists:</p>
<pre><code class="lang-julia">julia&gt; convert_to_int8(<span class="hljs-number">5</span>) |&gt; typeof
<span class="hljs-built_in">Int8</span>

julia&gt; convert_to_int8(<span class="hljs-string">"a"</span>)
ERROR: <span class="hljs-built_in">MethodError</span>: Cannot <span class="hljs-string">`convert`</span> an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">String</span> to an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int8</span>
</code></pre>
<p>This means that if you want your code to truly assert your output variable, instead of convert, you need to do this in the return statement:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> assert_int8(x)
    <span class="hljs-keyword">return</span> x::<span class="hljs-built_in">Int8</span>
<span class="hljs-keyword">end</span>
</code></pre>
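<p>To see both behaviors side by side, here is a small sketch. The helper <code>error_type</code> is hypothetical and only there to show which error is thrown:</p>
<pre><code class="lang-julia">convert_to_int8(x)::Int8 = x    # annotation on the return type: converts
assert_int8(x) = x::Int8        # annotation on the returned value: asserts

typeof(convert_to_int8(5))      # Int8, the Int64 was converted
typeof(assert_int8(Int8(5)))    # Int8, it already had the right type

# hypothetical helper: returns the type of the thrown error, or nothing
function error_type(f, x)
    try
        f(x)
        return nothing
    catch err
        return typeof(err)
    end
end

error_type(assert_int8, 5)          # TypeError, no conversion is attempted
error_type(convert_to_int8, 300)    # InexactError, 300 does not fit in an Int8
error_type(convert_to_int8, "a")    # MethodError, no convert method exists
</code></pre>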
<p>Automatic type conversion also happens on the default struct constructor. So for the type we defined previously this would work, even though we specified <code>Int64</code> as type of the first field:</p>
<pre><code class="lang-julia">julia&gt; obj = MyConcreteType(<span class="hljs-built_in">Int8</span>(<span class="hljs-number">5</span>), <span class="hljs-string">"a"</span>) <span class="hljs-comment"># this will convert the Int8</span>
MyConcreteType(<span class="hljs-number">5</span>, <span class="hljs-string">"a"</span>)

julia&gt; typeof(obj.x) <span class="hljs-comment"># see it's an Int64 now, not Int8</span>
<span class="hljs-built_in">Int64</span>

julia&gt; MyConcreteType(<span class="hljs-string">"a"</span>, <span class="hljs-string">"a"</span>) <span class="hljs-comment"># only fail if conversion is not possible</span>
ERROR: <span class="hljs-built_in">MethodError</span>: Cannot <span class="hljs-string">`convert`</span> an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">String</span> to an object of <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int64</span>
</code></pre>
<p>I've heard the type assertion might add some runtime overhead, while type conversion can often be compiled away. I haven't personally investigated this yet.</p>
<h2 id="heading-assertion-vs-conversion">Assertion vs Conversion</h2>
<p>A quick overview of the differences between assertion and conversion of types, because I often forget about them.</p>
<pre><code class="lang-julia"><span class="hljs-comment"># assertion, in right-hand definitions:</span>
x = <span class="hljs-number">5</span>::<span class="hljs-built_in">Int</span>
f(x) = <span class="hljs-number">5</span>x::<span class="hljs-built_in">Int</span>
x = f(<span class="hljs-number">5</span>::<span class="hljs-built_in">Int</span>)
<span class="hljs-comment"># conversion, in left-hand definitions:</span>
x::<span class="hljs-built_in">Int</span> = <span class="hljs-number">5</span>
f(x)::<span class="hljs-built_in">Int</span> = <span class="hljs-number">5</span>x

<span class="hljs-comment"># not to be confused with type dispatching in method definitions:</span>
f(x::<span class="hljs-built_in">Int</span>) = <span class="hljs-number">5</span>x
</code></pre>
<p>Note that I also added the assertion example that may happen inside a function call like <code>x = f(5::Int)</code>, which I didn't discuss yet. And it's good to remember this is distinct from type annotation in the actual method definition.</p>
<h2 id="heading-type-annotation-done-wrong">Type annotation done wrong?</h2>
<p>I think there are two cases of "overengineered type annotation":</p>
<ul>
<li><p>too many annotations</p>
</li>
<li><p>too strict annotations</p>
</li>
</ul>
<p>You can go overboard and annotate everything in your code, for example:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> f(x::<span class="hljs-built_in">Int64</span>, y::T)::<span class="hljs-built_in">Int64</span> <span class="hljs-keyword">where</span> T&lt;:<span class="hljs-built_in">Real</span>
    z::<span class="hljs-built_in">Int64</span> = <span class="hljs-built_in">Int64</span>(y::T)::<span class="hljs-built_in">Int64</span>
    result::<span class="hljs-built_in">Int64</span> = x::<span class="hljs-built_in">Int64</span> + z::<span class="hljs-built_in">Int64</span>
    <span class="hljs-keyword">return</span> result::<span class="hljs-built_in">Int64</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>This is extreme and redundant. It doesn't help readability, and it probably just adds work for the compiler, which has to remove all these type annotations again.</p>
<p>You can also make your types too strict, especially in method definitions, for example:</p>
<pre><code class="lang-julia">f(x::<span class="hljs-built_in">Int64</span>, y::<span class="hljs-built_in">Int64</span>) = x + y
<span class="hljs-comment"># while you can be more generic:</span>
f(x::<span class="hljs-built_in">Real</span>, y::<span class="hljs-built_in">Real</span>) = x + y
</code></pre>
<p>A general heuristic some people follow is that it's good to keep your composite types as concrete as possible, while keeping your methods as abstract as possible.</p>
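<p>As a rough sketch of that heuristic, with made-up names (<code>Measurement</code>, <code>add_strict</code>, <code>add_generic</code> and <code>scale</code> are only for illustration):</p>
<pre><code class="lang-julia"># concrete field types in the composite type
struct Measurement
    value::Float64
    label::String
end

# a strict method only accepts Int64 arguments
add_strict(x::Int64, y::Int64) = x + y
# an abstract signature accepts any mix of real numbers
add_generic(x::Real, y::Real) = x + y

add_generic(1, 2.5)             # 3.5
applicable(add_strict, 1, 2.5)  # false, calling it would throw a MethodError

# abstract method arguments combined with a concrete struct
scale(m::Measurement, factor::Real) = Measurement(m.value * factor, m.label)
scale(Measurement(2.0, "a"), 3)  # Measurement(6.0, "a")
</code></pre>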
<h2 id="heading-conclusion">Conclusion</h2>
<p>There seem to be at least 4 different behaviors of type annotation, with documentation as a 5th reason to add type annotation. I have given a short overview of all of these in this blog post, which may help you compare all different type annotations side by side.</p>
]]></content:encoded></item><item><title><![CDATA[To Dict Or Not To Dict: Comparing Data Structure Sizes]]></title><description><![CDATA[Searching for the best data structure for your problem can be a tricky business. The chosen data structure should be easy to understand for other developers, run fast in the algorithms where it's used and be memory efficient. You can't always optimiz...]]></description><link>https://scientificcoder.com/to-dict-or-not-to-dict-comparing-data-structure-sizes</link><guid isPermaLink="true">https://scientificcoder.com/to-dict-or-not-to-dict-comparing-data-structure-sizes</guid><category><![CDATA[Julia]]></category><category><![CDATA[data structures]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Thu, 30 May 2024 07:47:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1716888816353/31878401-568f-4f95-b67e-dc8bf3df4240.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Searching for the best data structure for your problem can be a tricky business. The chosen data structure should be easy to understand for other developers, run fast in the algorithms where it's used and be memory efficient. You can't always optimize at the start, but you can learn from mistakes. I made such a mistake recently regarding the sizes in memory and on disk, so I'd like to walk you through a simple example.</p>
<p>Basically I was storing keys and values, so obviously a dictionary is the first choice of data structure. But it turned out that was not the most memory efficient solution to choose.</p>
<p>So let's look at the 4 options that I tried: a <code>Dict</code>, an <code>OrderedDict</code>, a <code>Vector{Pair}</code> and an <code>AxisArray</code>. We'll start with a simple example of 1000 string keys with integer values:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> OrderedCollections, AxisArrays

<span class="hljs-comment"># our types</span>
dict = <span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">String</span>,<span class="hljs-built_in">Int</span>}()
dict_ordered = OrderedDict{<span class="hljs-built_in">String</span>,<span class="hljs-built_in">Int</span>}()
vector_pair = <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Pair</span>{<span class="hljs-built_in">String</span>,<span class="hljs-built_in">Int</span>}}()

<span class="hljs-comment"># create some key-value objects</span>
<span class="hljs-keyword">for</span> value <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:<span class="hljs-number">1000</span>
    dict[<span class="hljs-string">"key<span class="hljs-variable">$value</span>"</span>] = value
    dict_ordered[<span class="hljs-string">"key<span class="hljs-variable">$value</span>"</span>] = value
    push!(vector_pair, <span class="hljs-string">"key<span class="hljs-variable">$value</span>"</span> =&gt; value)
<span class="hljs-keyword">end</span>

<span class="hljs-comment"># and an axis array</span>
axis_array = AxisArray(collect(values(dict)), key = collect(keys(dict)))
</code></pre>
<p>If we look at the data structure implementations, we see below that a <code>Dict</code> contains three arrays: <code>slots</code>, <code>keys</code> and <code>vals</code>. I assume the <code>slots</code> are used as a kind of hash to quickly index the keys. Apparently the <code>Dict</code> has allocated 4096 entries for each of these arrays, even though we only stored 1000 keys and values.</p>
<pre><code class="lang-julia">julia&gt; dump(dict)
<span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>}
  slots: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">UInt8</span>}((<span class="hljs-number">4096</span>,)) <span class="hljs-built_in">UInt8</span>[<span class="hljs-number">0xd9</span>, <span class="hljs-number">0xef</span>, <span class="hljs-number">0xfb</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0xba</span>  …  <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x87</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x94</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>]      
  keys: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">String</span>}((<span class="hljs-number">4096</span>,))
    <span class="hljs-number">1</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key239"</span>
    <span class="hljs-number">2</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key390"</span>
    ...
    <span class="hljs-number">4095</span>: <span class="hljs-comment">#undef</span>
    <span class="hljs-number">4096</span>: <span class="hljs-comment">#undef</span>
  vals: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">Int64</span>}((<span class="hljs-number">4096</span>,)) [<span class="hljs-number">239</span>, <span class="hljs-number">390</span>, <span class="hljs-number">722</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">480</span>  …  <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">532</span>, <span class="hljs-number">0</span>, <span class="hljs-number">568</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>]
  ndel: <span class="hljs-built_in">Int64</span> <span class="hljs-number">0</span>
  count: <span class="hljs-built_in">Int64</span> <span class="hljs-number">1000</span>
  age: <span class="hljs-built_in">UInt64</span> <span class="hljs-number">0x00000000000003f0</span>
  idxfloor: <span class="hljs-built_in">Int64</span> <span class="hljs-number">1</span>
  maxprobe: <span class="hljs-built_in">Int64</span> <span class="hljs-number">4</span>
</code></pre>
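<p>The <code>OrderedDict</code> is similar, but doesn't pre-allocate the keys and values: only its <code>slots</code> array has 4096 entries, while <code>keys</code> and <code>vals</code> hold exactly 1000 elements.</p>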
<pre><code class="lang-julia">julia&gt; dump(dict_ordered)
OrderedDict{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>}
  slots: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">Int32</span>}((<span class="hljs-number">4096</span>,)) <span class="hljs-built_in">Int32</span>[<span class="hljs-number">239</span>, <span class="hljs-number">390</span>, <span class="hljs-number">722</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">480</span>  …  <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">532</span>, <span class="hljs-number">0</span>, <span class="hljs-number">568</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>]
  keys: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">String</span>}((<span class="hljs-number">1000</span>,))
    <span class="hljs-number">1</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key1"</span>
    <span class="hljs-number">2</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key2"</span>
    ...
    <span class="hljs-number">999</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key999"</span>
    <span class="hljs-number">1000</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key1000"</span>
  vals: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">Int64</span>}((<span class="hljs-number">1000</span>,)) [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>, <span class="hljs-number">10</span>  …  <span class="hljs-number">991</span>, <span class="hljs-number">992</span>, <span class="hljs-number">993</span>, <span class="hljs-number">994</span>, <span class="hljs-number">995</span>, <span class="hljs-number">996</span>, <span class="hljs-number">997</span>, <span class="hljs-number">998</span>, <span class="hljs-number">999</span>, <span class="hljs-number">1000</span>]
  ndel: <span class="hljs-built_in">Int64</span> <span class="hljs-number">0</span>
  maxprobe: <span class="hljs-built_in">Int64</span> <span class="hljs-number">4</span>
  dirty: <span class="hljs-built_in">Bool</span> <span class="hljs-literal">true</span>
</code></pre>
<p>A <code>Pair</code> is just a key (<code>first</code>) and a value (<code>second</code>) in a struct. And then we can store an array of those.</p>
<pre><code class="lang-julia">julia&gt; dump(vector_pair[<span class="hljs-number">1</span>])
<span class="hljs-built_in">Pair</span>{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>}
  first: <span class="hljs-built_in">String</span> <span class="hljs-string">"key1"</span>
  second: <span class="hljs-built_in">Int64</span> <span class="hljs-number">1</span>
</code></pre>
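<p>Unlike a dictionary, a vector of pairs has no key lookup built in. A minimal sketch of how you could search it by key, where <code>lookup</code> is a hypothetical helper and not part of Base:</p>
<pre><code class="lang-julia">vector_pair = ["key$value" =&gt; value for value in 1:1000]

# hypothetical helper: linear search for a key in a vector of pairs
function lookup(pairs::AbstractVector{&lt;:Pair}, key)
    index = findfirst(pair -&gt; first(pair) == key, pairs)
    index === nothing &amp;&amp; throw(KeyError(key))
    return last(pairs[index])
end

lookup(vector_pair, "key5")  # 5
</code></pre>
<p>This walks through the vector element by element, so it trades lookup speed for a compact layout.</p>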
<p>An <code>AxisArray</code> is just an array with named axes, where each value in an axis contains our key.</p>
<pre><code class="lang-julia">julia&gt; dump(axis_array)
AxisVector{<span class="hljs-built_in">Int64</span>, <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Int64</span>}, <span class="hljs-built_in">Tuple</span>{Axis{:key, <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">String</span>}}}}
  data: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">Int64</span>}((<span class="hljs-number">1000</span>,)) [<span class="hljs-number">239</span>, <span class="hljs-number">390</span>, <span class="hljs-number">722</span>, <span class="hljs-number">480</span>, <span class="hljs-number">679</span>, <span class="hljs-number">798</span>, <span class="hljs-number">841</span>, <span class="hljs-number">877</span>, <span class="hljs-number">21</span>, <span class="hljs-number">636</span>  …  <span class="hljs-number">413</span>, <span class="hljs-number">667</span>, <span class="hljs-number">717</span>, <span class="hljs-number">697</span>, <span class="hljs-number">398</span>, <span class="hljs-number">256</span>, <span class="hljs-number">334</span>, <span class="hljs-number">823</span>, <span class="hljs-number">532</span>, <span class="hljs-number">568</span>]
  axes: <span class="hljs-built_in">Tuple</span>{Axis{:key, <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">String</span>}}}
    <span class="hljs-number">1</span>: Axis{:key, <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">String</span>}}
      val: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">String</span>}((<span class="hljs-number">1000</span>,))
        <span class="hljs-number">1</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key239"</span>
        <span class="hljs-number">2</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key390"</span>
        ...
        <span class="hljs-number">999</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key532"</span>
        <span class="hljs-number">1000</span>: <span class="hljs-built_in">String</span> <span class="hljs-string">"key568"</span>
</code></pre>
<p>You can also index an AxisArray similar to a dictionary:</p>
<pre><code class="lang-julia">julia&gt; axis_array[<span class="hljs-string">"key5"</span>]
<span class="hljs-number">5</span>
</code></pre>
<h2 id="heading-1d-sizes">1D sizes</h2>
<p>Let's look at the memory size of these 4 beasts:</p>
<pre><code class="lang-julia">julia&gt; varinfo(<span class="hljs-string">r"dict|vector|axis"</span>)
  name               size summary     
  –––––––––––– –––––––––– ––––––––––––––––––––––––––––––––––––––––––––
  axis_array   <span class="hljs-number">29.302</span> KiB <span class="hljs-number">1</span>-dimensional AxisArray{<span class="hljs-built_in">Int64</span>,<span class="hljs-number">1</span>,...} 
  dict         <span class="hljs-number">81.747</span> KiB <span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>} with <span class="hljs-number">1000</span> entries             
  dict_ordered <span class="hljs-number">45.356</span> KiB OrderedDict{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>} with <span class="hljs-number">1000</span> entries    
  vector_pair  <span class="hljs-number">44.856</span> KiB <span class="hljs-number">1000</span>-element <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Pair</span>{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>}}
</code></pre>
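<p>If you want these numbers inside a script rather than in a printed table, <code>Base.summarysize</code> returns the size of an object in bytes. A small sketch using only Base, so without the <code>OrderedDict</code> and <code>AxisArray</code>:</p>
<pre><code class="lang-julia">dict = Dict{String,Int}()
vector_pair = Vector{Pair{String,Int}}()
for value in 1:1000
    dict["key$value"] = value
    push!(vector_pair, "key$value" =&gt; value)
end

# print the size in bytes of each structure
for (name, object) in ("dict" =&gt; dict, "vector_pair" =&gt; vector_pair)
    println(name, ": ", Base.summarysize(object), " bytes")
end
</code></pre>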
<p>According to <code>varinfo</code>, the <code>AxisArray</code> wins in terms of memory size. I'm also curious about the size on disk with <code>JLD2</code>:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> JLD2
<span class="hljs-keyword">for</span> <span class="hljs-keyword">type</span> <span class="hljs-keyword">in</span> (:dict, :dict_ordered, :vector_pair, :axis_array)
    jldsave(<span class="hljs-string">"<span class="hljs-variable">$type</span>.jld2"</span>, x=eval(<span class="hljs-keyword">type</span>))
    println(<span class="hljs-string">"<span class="hljs-variable">$type</span>.jld2 size: <span class="hljs-subst">$(filesize(<span class="hljs-string">"<span class="hljs-variable">$type</span>.jld2"</span>)</span>)"</span>)
<span class="hljs-keyword">end</span>
</code></pre>
<p>This gives us the following JLD2 file sizes in bytes:</p>
<pre><code class="lang-julia">dict.jld2 size: <span class="hljs-number">105398</span>
dict_ordered.jld2 size: <span class="hljs-number">66303</span>
vector_pair.jld2 size: <span class="hljs-number">105162</span>
axis_array.jld2 size: <span class="hljs-number">50949</span>
</code></pre>
<p>Again the <code>AxisArray</code> wins, though the difference has shrunk a bit, and a 2x difference doesn't feel very significant anyway.</p>
<h2 id="heading-2d-sizes">2D sizes</h2>
<p>In my case I was actually storing a lot of dictionaries that all shared the same keys. This was my biggest mistake, because you can store such data as a matrix and share the keys of the rows and columns. Let's look at a small example again:</p>
<pre><code class="lang-julia"><span class="hljs-comment"># let's very naively store a vector of dictionaries, to get a feel for the size</span>
dict_vector = [deepcopy(dict) <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:<span class="hljs-number">100</span>]

<span class="hljs-comment"># let's also store an AxisMatrix with shared rows and columns</span>
axis_matrix = AxisArray(
    rand(<span class="hljs-built_in">Int</span>, <span class="hljs-number">1000</span>, <span class="hljs-number">100</span>),
    row = [<span class="hljs-string">"row<span class="hljs-variable">$r</span>"</span> <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:<span class="hljs-number">1000</span>],
    col = [<span class="hljs-string">"col<span class="hljs-variable">$c</span>"</span> <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:<span class="hljs-number">100</span>]
)
</code></pre>
<p>Now the sizes are significantly different:</p>
<pre><code class="lang-julia">julia&gt; varinfo(<span class="hljs-string">r"dict_vector|axis_matrix"</span>)
  name         size           summary     
  –––––––––––– –––––––––––––– ––––––––––––––––––––––––––––––––––––––––––––
  axis_matrix  <span class="hljs-number">804.845</span> KiB    <span class="hljs-number">2</span>-dimensional AxisArray{<span class="hljs-built_in">Int64</span>,<span class="hljs-number">2</span>,...} 
  dict_vector  <span class="hljs-number">7.984</span> MiB      <span class="hljs-number">100</span>-element <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">Int64</span>}}
</code></pre>
<p>You see a roughly 10x difference in memory size. The difference is about 12x in the JLD2 files:</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-keyword">for</span> <span class="hljs-keyword">type</span> <span class="hljs-keyword">in</span> (:dict_vector, :axis_matrix)
    jldsave(<span class="hljs-string">"<span class="hljs-variable">$type</span>.jld2"</span>, x=eval(<span class="hljs-keyword">type</span>))
    println(<span class="hljs-string">"<span class="hljs-variable">$type</span>.jld2 size: <span class="hljs-subst">$(filesize(<span class="hljs-string">"<span class="hljs-variable">$type</span>.jld2"</span>)</span>)"</span>)
<span class="hljs-keyword">end</span>

dict_vector.jld2 size: <span class="hljs-number">10352637</span>
axis_matrix.jld2 size:   <span class="hljs-number">849101</span>
</code></pre>
<p>The differences are so big because the values are integers, which are small, while the keys are strings, which can occupy quite some memory. Each of the 100 dictionaries stores its own copy of all 1000 keys, whereas the matrix stores every row and column key only once.</p>
<p>In the end I opted to switch to AxisArrays for my 2D problem, though I could only do this by assuming all columns had the same keys.</p>
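<p>The shared-key idea does not depend on AxisArrays. Here is a rough sketch using only Base, where the named tuple <code>shared</code> and the helper <code>lookup</code> are made up for illustration:</p>
<pre><code class="lang-julia">row_keys = ["key$i" for i in 1:1000]
dict = Dict(key =&gt; i for (i, key) in enumerate(row_keys))

# naive layout: 100 separate dictionaries
dict_vector = [deepcopy(dict) for _ in 1:100]

# shared layout: every key is stored once, next to a plain matrix of values
shared = (keys = row_keys, data = rand(Int, 1000, 100))

# hypothetical helper: look up a value by key and column number
lookup(table, key, column) = table.data[findfirst(==(key), table.keys), column]

println(Base.summarysize(dict_vector), " bytes as a vector of dictionaries")
println(Base.summarysize(shared), " bytes with shared keys")
</code></pre>
<p>The lookup is a linear search over the keys here, so this is only meant to show the memory layout, not a replacement for a proper indexed array type.</p>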
<h2 id="heading-conclusion">Conclusion</h2>
<p>There's a difference in memory usage between data structures that store keys and values. Dictionaries do not seem to be optimal, but for the simple case I looked at here the difference is not as large as I imagined. A bigger difference only appears when you have extra information, such as repeating keys, and you can use that information to choose a more suitable data structure, such as a matrix with the keys on the axes.</p>
]]></content:encoded></item><item><title><![CDATA[Comparing Package Management in Python, R, Julia, and Rust]]></title><description><![CDATA[When switching between programming languages, people often start with comparing syntax differences, and many overviews exist on this topic. However, a large part of programming revolves around package management, especially if you want to develop your...]]></description><link>https://scientificcoder.com/comparing-package-management-in-python-r-julia-and-rust</link><guid isPermaLink="true">https://scientificcoder.com/comparing-package-management-in-python-r-julia-and-rust</guid><category><![CDATA[Python]]></category><category><![CDATA[Julia]]></category><category><![CDATA[Rust]]></category><category><![CDATA[R Language]]></category><category><![CDATA[package manager]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Thu, 23 May 2024 09:23:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1716453066068/3cc0f5a8-83e5-455e-80a4-761753dea076.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When switching between programming languages, people often start with comparing syntax differences, and many overviews exist on this topic. However, a large part of programming revolves around package management, especially if you want to develop your own packages, and I have not encountered many overviews comparing programming languages on this topic. A package is essentially how code is shared between programmers. Understanding package management and package development is vital when you want to get good at a language.</p>
<p>To satisfy my curiosity, and help myself and others, I decided to write this package management overview myself. I have chosen to compare the popular languages Python and R, my personal favorite language Julia, and the rising star Rust, which has notoriously good package management.</p>
<h2 id="heading-overview-table">Overview Table</h2>
<p>I created an overview table below comparing various aspects of package management between the languages. I'll go into the details in the remainder of the blog post.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Python</td><td>R</td><td>Julia</td><td>Rust</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Package Manager</strong></td><td><code>pip</code> or <code>conda</code></td><td><code>install.packages()</code> (base R)</td><td><code>Pkg</code></td><td><code>cargo</code></td></tr>
<tr>
<td><strong>Development Tools</strong></td><td><code>setuptools</code>, <code>poetry</code></td><td><code>devtools</code></td><td><code>Pkg</code></td><td><code>cargo</code></td></tr>
<tr>
<td><strong>Package Template Tools</strong></td><td><code>cookiecutter</code>, <code>pyscaffold</code>, <code>flit</code></td><td><code>usethis</code>, <code>devtools</code></td><td><code>Pkg.generate()</code> , <code>PkgTemplates.jl</code></td><td><code>cargo new</code>, <code>cargo init</code>, <code>cargo-generate</code></td></tr>
<tr>
<td><strong>Package Repository</strong></td><td>PyPI (Python Package Index)</td><td>CRAN (Comprehensive R Archive Network)</td><td>General registry</td><td>crates.io</td></tr>
<tr>
<td><strong>Virtual Environment</strong></td><td><code>venv</code>, <code>virtualenv</code></td><td><code>renv</code></td><td>Built-in in the <code>Pkg</code> module</td><td>Built-in with <code>cargo</code></td></tr>
<tr>
<td><strong>Distribution Format</strong></td><td><code>.whl</code> (wheel, incl. binaries) or <code>.tar.gz</code> (source)</td><td><code>.tar.gz</code> (source), <code>.zip</code> or <code>.tgz</code> (binary)</td><td><code>Pkg</code> will git clone from source, and download (binary) artifacts</td><td><code>.crate</code> (source archive, for both library and binary crates)</td></tr>
<tr>
<td><strong>Dependency Management</strong></td><td><code>requirements.txt</code>, or <code>Pipfile</code>, or <code>pyproject.toml</code> (poetry)</td><td><code>DESCRIPTION</code>, <code>NAMESPACE</code></td><td><code>Project.toml</code>, <code>Manifest.toml</code>, <code>Artifacts.toml</code></td><td><code>Cargo.toml</code>, <code>Cargo.lock</code></td></tr>
<tr>
<td><strong>Tutorial</strong></td><td><a target="_blank" href="https://py-pkgs.org/">Python Packages book</a> (uses poetry)</td><td><a target="_blank" href="https://r-pkgs.org/">R packages book</a></td><td><a target="_blank" href="https://pkgdocs.julialang.org/v1/">Pkg docs</a> and this <a target="_blank" href="https://julialang.org/contribute/developing_package/">howto</a></td><td><a target="_blank" href="https://doc.rust-lang.org/cargo/guide/index.html">Cargo Guide</a></td></tr>
</tbody>
</table>
</div><h2 id="heading-package-manager">Package manager</h2>
<p>In modern open source programming, package managers are vital to help you install the code you need, and all of its dependencies, which includes other (source code) packages and compiled binary libraries.</p>
<ul>
<li><p><strong>Python</strong>: <code>pip</code> is the standard tool for installing and managing Python packages. Use <code>pip install &lt;package&gt;</code> to install a package via the command line. A secondary package manager is <code>conda</code>, which tries to be language-agnostic, but is mostly used for Python.</p>
</li>
<li><p><strong>R</strong>: <code>install.packages("&lt;package&gt;")</code> is the base function for installing packages from CRAN.</p>
</li>
<li><p><strong>Julia</strong>: <code>Pkg.jl</code> is the built-in package manager. <code>Pkg.add("&lt;package&gt;")</code> will install a package.</p>
</li>
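<li><p><strong>Rust</strong>: <code>cargo</code> is the build system and package manager for Rust. Packages are called "crates". You can install a crate's executables via <code>cargo install &lt;crate&gt;</code> on the command line, or add a library crate as a dependency of your project with <code>cargo add &lt;crate&gt;</code>.</p>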
</li>
</ul>
<p>An interesting difference is that some languages, like Python and Rust, have a package manager that is called from outside the language, so from your operating system's command line, while others, like those of Julia and R, are called from inside the programming language itself.</p>
<h2 id="heading-package-development-tools">Package Development Tools</h2>
<p>The package manager is often only targeted at helping users install packages. Package developers may need additional tools, for example to handle dependencies.</p>
<ul>
<li><p><strong>Python</strong>: <code>setuptools</code> can help you with building and distributing packages. <code>poetry</code> is a more recent, elegant way to help you with packaging and dependencies (but only for pure Python code, not for binary dependencies).</p>
</li>
<li><p><strong>R</strong>: <code>devtools</code> is the go-to tool for developing R packages.</p>
</li>
<li><p><strong>Julia</strong>: <code>Pkg.jl</code> can help with most of your development.</p>
</li>
<li><p><strong>Rust</strong>: <code>cargo</code> can help with most of your development.</p>
</li>
</ul>
<h2 id="heading-package-template-tools">Package Template Tools</h2>
<p>Templates are predefined layouts of a package's folder structure and files, typically including documentation, testing and automation (for example with GitHub Actions). This helps you get up and running quickly with a professional package.</p>
<p><strong>Python</strong>:</p>
<ul>
<li><p><code>cookiecutter</code>: A popular tool to create projects from cookiecutters (project templates).</p>
</li>
<li><p><code>pyscaffold</code>: A tool to set up the scaffolding for new Python projects with sensible defaults.</p>
</li>
<li><p><code>flit</code>: Simplifies the process of packaging simple Python projects, focusing on pyproject.toml.</p>
</li>
</ul>
<p><strong>R:</strong></p>
<ul>
<li><p><code>usethis</code>: Facilitates package development by setting up structure and common files.</p>
</li>
<li><p><code>devtools</code>: Provides functions like <code>create()</code> to help create and manage R packages. Base R also ships <code>package.skeleton()</code> in the <code>utils</code> package.</p>
</li>
</ul>
<p><strong>Julia:</strong></p>
<ul>
<li><p><code>Pkg.generate()</code>: Built-in function in Julia’s Pkg module to generate a new package with a very minimal template.</p>
</li>
<li><p><code>PkgTemplates.jl</code>: A Julia package that generates new Julia package projects with customizable templates.</p>
</li>
</ul>
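<p>To get a feel for how minimal the built-in Julia template is, here is a small sketch that generates a package in a temporary directory and lists the files. The package name <code>MyDemoPackage</code> is made up for this example:</p>
<pre><code class="lang-julia">using Pkg

# work in a temporary directory so nothing is written into your own project
generated_files = cd(mktempdir()) do
    Pkg.generate("MyDemoPackage")  # hypothetical package name
    # collect all generated files, relative to the package folder
    [relpath(joinpath(root, file), "MyDemoPackage")
     for (root, _, files) in walkdir("MyDemoPackage") for file in files]
end

println(sort(generated_files))
</code></pre>
<p>You should only see a <code>Project.toml</code> and a single source file in <code>src</code>. Tests, documentation and automation are what <code>PkgTemplates.jl</code> adds on top.</p>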
<p><strong>Rust:</strong></p>
<ul>
<li><p><code>cargo new</code>: Initializes a new project with a basic template.</p>
</li>
<li><p><code>cargo init</code>: Initializes a new package in an existing directory.</p>
</li>
<li><p><code>cargo-generate</code>: A tool to generate new Rust projects based on existing templates.</p>
</li>
</ul>
<h2 id="heading-package-repository">Package Repository</h2>
<p>When you install a package, the source code and all of its dependencies need to be downloaded from somewhere. Most programming languages use a central location that stores copies of the source code and/or compiled binaries, for every version of a package. Julia is slightly different, using a registry that contains links to the source code.</p>
<ul>
<li><p><strong>Python</strong>: Packages are hosted on <a target="_blank" href="https://pypi.org/">PyPI</a>, the Python Package Index.</p>
</li>
<li><p><strong>R</strong>: <a target="_blank" href="https://cran.r-project.org/">CRAN</a> is the primary repository for R packages.</p>
</li>
<li><p><strong>Julia</strong>: Packages are registered in the <a target="_blank" href="https://github.com/JuliaRegistries/General">General registry</a>. Note these are only links to the (Github) source code. Binary artifacts are built with <a target="_blank" href="https://github.com/JuliaPackaging/Yggdrasil">Yggdrasil</a> and <a target="_blank" href="https://docs.binarybuilder.org/stable/">BinaryBuilder.jl</a>.</p>
</li>
<li><p><strong>Rust</strong>: <a target="_blank" href="http://crates.io">crates.io</a> is the official package registry.</p>
</li>
</ul>
<h2 id="heading-virtual-environments">Virtual Environments</h2>
<p>Virtual environments are crucial when you need to handle different versions of dependencies across your different projects. You could try to use one environment for all your projects, but that may quickly lead to conflicts in your dependencies.</p>
<p>Often a virtual environment simply switches the folder from which packages are installed and loaded, and keeps its own separate record of dependencies.</p>
<ul>
<li><p><strong>Python</strong>: Tools like <code>venv</code> and <code>virtualenv</code> create isolated environments for projects. Create an environment with <code>python -m venv /path/to/environment</code>.</p>
</li>
<li><p><strong>R</strong>: <code>renv</code> manages project-specific environments. Create an environment with <code>renv::init(project = "path/to/environment")</code>.</p>
</li>
<li><p><strong>Julia</strong>: Environments are managed within the <code>Pkg</code> module. I have a <a target="_blank" href="https://scientificcoder.com/clean-code-tips-for-scientists-1-reproducible-environments">blog post about Julia environments</a>. Create an environment with <code>Pkg.activate("path/to/environment")</code>.</p>
</li>
<li><p><strong>Rust</strong>: Environments are handled within <code>cargo</code> projects, use <code>cargo new my_project</code>.</p>
</li>
</ul>
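<p>As a small illustration of the "it's mostly a folder" idea, the sketch below creates a Python environment programmatically with the standard library <code>venv</code> module and looks at what ends up on disk. The folder name <code>demo-env</code> and the helper <code>create_demo_env</code> are assumptions for this example, and I skip installing <code>pip</code> to keep it quick.</p>
<pre><code class="lang-python">import tempfile
import venv
from pathlib import Path

def create_demo_env(root):
    """Create a bare environment under root; return its config text and site-packages folders."""
    env_dir = Path(root) / "demo-env"
    # with_pip=False keeps the sketch fast; a real environment usually wants pip.
    venv.create(env_dir, with_pip=False)
    # pyvenv.cfg marks the folder as an environment and records the base interpreter.
    config = (env_dir / "pyvenv.cfg").read_text()
    # Packages installed into the environment end up in its own site-packages folder.
    site_packages = sorted(
        p.relative_to(env_dir).as_posix() for p in env_dir.rglob("site-packages")
    )
    return config, site_packages

with tempfile.TemporaryDirectory() as tmp:
    config, site_packages = create_demo_env(tmp)
    print(config)
    print(site_packages)
</code></pre>
<p>The exact location of <code>site-packages</code> differs per platform and Python version, which is why the sketch searches for it instead of hardcoding a path.</p>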
<h2 id="heading-distribution-formats">Distribution Formats</h2>
<p>When you release and distribute your package it's good to be aware of how it's handled by the package repository and package manager.</p>
<ul>
<li><p><strong>Python</strong>: Uses <code>.whl</code> (wheel) files for pre-built binary distributions and <code>.tar.gz</code> for source distributions (aka <code>sdist</code>). These distributions are stored on PyPI.</p>
</li>
<li><p><strong>R</strong>: Packages are distributed as <code>.tar.gz</code> or <code>.zip</code>, stored in CRAN.</p>
</li>
<li><p><strong>Julia</strong>: Source distributions are downloaded directly from their repositories by <code>Pkg</code>.</p>
</li>
<li><p><strong>Rust</strong>: Packages (crates) are distributed as <code>.crate</code> files, which are by default located at <code>crates.io</code>.</p>
</li>
</ul>
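<p>For Python wheels, the file name itself tells the installer which interpreter, ABI and platform a build targets. Here is a rough sketch of pulling those tags apart, based on my understanding of the wheel naming convention (<code>name-version[-build]-python-abi-platform.whl</code>). The function <code>parse_wheel_filename</code> and the example file names are hypothetical, and a real installer does considerably more validation.</p>
<pre><code class="lang-python">def parse_wheel_filename(filename):
    """Split a wheel file name into its components (simplified sketch)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name: {filename}")
    # The last three parts are always the python, ABI and platform tags
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "build": parts[2] if len(parts) == 6 else None,  # optional build tag
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

# A pure-Python wheel runs anywhere; a compiled wheel is tied to one platform
pure = parse_wheel_filename("mypackage-0.1.0-py3-none-any.whl")
compiled = parse_wheel_filename("mypackage-0.1.0-cp312-cp312-manylinux_2_17_x86_64.whl")
print(pure["platform"])
print(compiled["python"], compiled["platform"])
</code></pre>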
<h2 id="heading-dependency-management">Dependency Management</h2>
<p>When a package is installed, the manager needs to know which dependencies to install, and which versions. Every package developer needs to write this down in some predefined format that the package manager can parse.</p>
<ul>
<li><p><strong>Python</strong>: <code>requirements.txt</code> is used for listing dependencies, or a <code>Pipfile</code> for advanced dependency management. <code>pyproject.toml</code> is the standardized file for project metadata and dependencies, used by tools such as <code>poetry</code> and <code>flit</code>.</p>
</li>
<li><p><strong>R</strong>: <code>DESCRIPTION</code> and <code>NAMESPACE</code> files manage dependencies.</p>
</li>
<li><p><strong>Julia</strong>: Use <code>Project.toml</code> to handle source code dependencies. A <code>Manifest.toml</code> file can be generated to specify the exact versions used in a project. And <code>Artifacts.toml</code> is used to handle (binary) artifacts.</p>
</li>
<li><p><strong>Rust</strong>: <code>Cargo.toml</code> lists dependencies, and <code>Cargo.lock</code> locks them.</p>
</li>
</ul>
<p>For example, a simple <code>Cargo.toml</code> may look like this to specify your package name, version and dependencies. Julia's <code>Project.toml</code> and Python poetry's <code>pyproject.toml</code> look similar.</p>
<pre><code class="lang-ini"><span class="hljs-section">[package]</span>
<span class="hljs-attr">name</span> = <span class="hljs-string">"mypackage"</span>
<span class="hljs-attr">version</span> = <span class="hljs-string">"0.1.0"</span>

<span class="hljs-section">[dependencies]</span>
<span class="hljs-attr">time</span> = <span class="hljs-string">"0.1.12"</span>
</code></pre>
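<p>These files are all TOML, so any TOML parser can read them. As a sketch of what a package manager sees, the snippet below parses the same <code>Cargo.toml</code> content with Python's <code>tomllib</code>. I am assuming Python 3.11 or newer here, since that is when <code>tomllib</code> entered the standard library.</p>
<pre><code class="lang-python">import tomllib  # standard library from Python 3.11 onwards

# The same manifest as shown above, embedded as a string for the example
CARGO_TOML = """
[package]
name = "mypackage"
version = "0.1.0"

[dependencies]
time = "0.1.12"
"""

manifest = tomllib.loads(CARGO_TOML)

# Each TOML table becomes a nested dictionary
print(manifest["package"]["name"], manifest["package"]["version"])
for dependency, requirement in manifest["dependencies"].items():
    print(dependency, requirement)
</code></pre>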
<h2 id="heading-binary-dependency-management">Binary Dependency Management</h2>
<p>I'm curious how the different languages handle binary dependencies, for example libraries compiled from C code. This is a more advanced topic that most package developers don't need to worry about, but it may interest people (such as myself) who have encountered this topic in one or more programming languages. I've personally encountered this challenge when I contributed to the <a target="_blank" href="https://github.com/brainflow-dev/brainflow">BrainFlow</a> project, which distributes a C++ library with bindings in many programming languages.</p>
<p>There are multiple aspects to binary dependencies:</p>
<ol>
<li><p><strong>Write Binary Code</strong>: Write or include existing C/C++/Fortran code within your package.</p>
</li>
<li><p><strong>Build Configuration</strong>: Configure the build process to compile the binary code (e.g., using <code>setup.py</code>, <code>Cargo.toml</code>, <code>Makevars</code>).</p>
</li>
<li><p><strong>Build</strong>: Run the build and compilation tool specific to your language (e.g., <code>python setup.py build_ext</code>).</p>
</li>
<li><p><strong>Use in Code</strong>: Import and use the compiled binaries within your main language.</p>
</li>
</ol>
<p>Each programming language handles this differently.</p>
<p>Let's say we have the following very simple C program, with a header:</p>
<pre><code class="lang-c"><span class="hljs-comment">// myclib.h</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">ifndef</span> MYCLIB_H</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> MYCLIB_H</span>

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">hello_from_c</span><span class="hljs-params">()</span></span>;

<span class="hljs-meta">#<span class="hljs-meta-keyword">endif</span> <span class="hljs-comment">// MYCLIB_H</span></span>
</code></pre>
<p>and the C code:</p>
<pre><code class="lang-c"><span class="hljs-comment">// myclib.c</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdio.h&gt;</span></span>

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">hello_from_c</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Hello from C!\n"</span>);
}
</code></pre>
<p>How would we start embedding that in each language?</p>
<h3 id="heading-python-binaries">Python binaries</h3>
<ol>
<li><p><strong>Creating Binary Extensions</strong>:</p>
<ul>
<li><p><strong>C Extensions</strong>: Python allows you to write C extensions, which can be compiled and used within Python code. This is typically done using the Python C API or using Cython.</p>
</li>
<li><p><strong>Tools</strong>:</p>
<ul>
<li><p><code>setuptools</code>: Includes support for compiling C extensions. You can specify extensions in <code>setup.py</code>.</p>
</li>
<li><p><code>Cython</code>: A superset of Python that additionally supports C language features. See their tutorial on <a target="_blank" href="https://docs.cython.org/en/latest/src/tutorial/clibraries.html">using C libraries</a>. Cython will act as a kind of glue between C and your regular Python code, in the form of a <code>.pxd</code> and/or <code>.pyx</code> file.</p>
</li>
</ul>
</li>
<li><p><strong>Example with Cython</strong>:</p>
<pre><code class="lang-python">  <span class="hljs-comment"># myclib.pxd</span>
  cdef extern <span class="hljs-keyword">from</span> <span class="hljs-string">"myclib.h"</span>:
      void hello_from_c()
</code></pre>
<pre><code class="lang-python">  <span class="hljs-comment"># myextension.pyx</span>

  <span class="hljs-comment"># Import the declarations from the .pxd file</span>
  <span class="hljs-keyword">from</span> myclib cimport hello_from_c

  <span class="hljs-comment"># Create a Python wrapper function</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">py_hello_from_c</span>():</span>
      hello_from_c()
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Building</strong>:</p>
<ul>
<li><p>Use <code>setuptools</code> with extensions in the <code>setup.py</code> file:</p>
<pre><code class="lang-python">  <span class="hljs-keyword">from</span> setuptools <span class="hljs-keyword">import</span> Extension, setup
  <span class="hljs-keyword">from</span> Cython.Build <span class="hljs-keyword">import</span> cythonize

  setup(
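      <span class="hljs-comment"># myclib.c is listed as a source so hello_from_c gets compiled and linked in</span>
      ext_modules = cythonize([Extension(<span class="hljs-string">"myextension"</span>, [<span class="hljs-string">"myextension.pyx"</span>, <span class="hljs-string">"myclib.c"</span>])])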
  )
</code></pre>
</li>
<li><p>Running <code>python setup.py build_ext --inplace</code> compiles the extension for you, if a C compiler is configured properly. Alternatively you can compile the C code yourself and <a target="_blank" href="https://docs.cython.org/en/latest/src/tutorial/clibraries.html#dynamic-linking">dynamically link to it</a>.</p>
</li>
<li><p>To automatically compile for every platform, look into <a target="_blank" href="https://cibuildwheel.pypa.io/en/stable/">cibuildwheel</a>.</p>
</li>
</ul>
</li>
<li><p><strong>Using Binary Extensions</strong>:</p>
<ul>
<li><p>Once compiled, these extensions can be imported and used in Python code just like any other module.</p>
</li>
<li><p><strong>Installation</strong>: Use <code>pip</code> to install binary packages (wheels) from PyPI or directly from a source distribution. Note that there should be a wheel file per platform, see for example the <a target="_blank" href="https://pypi.org/project/numpy/#files">NumPy built distributions</a>.</p>
</li>
</ul>
</li>
</ol>
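<p>If the library is already compiled, Python can also call into it without any Cython glue, using the standard library's <code>ctypes</code>. The sketch below demonstrates this on the C standard library's <code>strlen</code>, so that it runs without a compiler. It assumes a POSIX system where the C library can be found. The commented lines at the end show how the same pattern might look for our example library, where the file name <code>libmyclib.so</code> is an assumption.</p>
<pre><code class="lang-python">import ctypes
import ctypes.util

# Assumes a POSIX system; find_library returns the name of the C standard library
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"Hello from C!")
print(length)

# For the example library, after compiling myclib.c into a shared library
# yourself (file name and location are assumptions):
# mylib = ctypes.CDLL("./libmyclib.so")
# mylib.hello_from_c()
</code></pre>
<p>This is closer in spirit to Julia's <code>ccall</code>: you declare the signature on the Python side and load the library at runtime, at the cost of losing compile-time checks.</p>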
<h3 id="heading-r-binaries"><strong>R binaries</strong></h3>
<ol>
<li><p><strong>Creating Binary Packages</strong>:</p>
<ul>
<li><p>R packages can include source code written in C, C++, or Fortran. These are compiled when the package is built.</p>
</li>
<li><p><strong>Tools</strong>:</p>
<ul>
<li><p><code>R CMD INSTALL</code>: The command-line tool to install packages and compile their binary components. Alternatively you can use <code>devtools</code>.</p>
</li>
<li><p><code>Rcpp</code>: A package that makes it easier to integrate R with C or C++ code.</p>
</li>
</ul>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-cpp">  <span class="hljs-comment">// myextension.cpp</span>
  <span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;Rcpp.h&gt;</span></span>
  <span class="hljs-keyword">extern</span> <span class="hljs-string">"C"</span> {
      <span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"myclib.h"</span></span>
  }

  <span class="hljs-comment">// [[Rcpp::export]]</span>
  <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">hello_from_c_wrapper</span><span class="hljs-params">()</span> </span>{
      hello_from_c();
  }
</code></pre>
<pre><code class="lang-r">  <span class="hljs-comment"># hello.R</span>

  <span class="hljs-comment"># Import the Rcpp function</span>
  Rcpp::sourceCpp(<span class="hljs-string">"src/myextension.cpp"</span>)

  hello_from_c &lt;- <span class="hljs-keyword">function</span>() {
    hello_from_c_wrapper()
  }
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Building</strong>:</p>
<ul>
<li><p>Update the <code>DESCRIPTION</code> and <code>NAMESPACE</code> files to add Rcpp and your C function. Here's an example <code>NAMESPACE</code>:</p>
<pre><code class="lang-r">  useDynLib(MyPackage)
  importFrom(Rcpp, evalCpp)
  export(hello_from_c)
</code></pre>
</li>
<li><p>Then use <code>R CMD build</code> to create a package tarball and <code>R CMD INSTALL</code> to install it, which compiles the code. Alternatively you can use <code>devtools</code> to build and install inside your R session:</p>
<pre><code class="lang-r">  setwd(<span class="hljs-string">"path/to/MyPackage"</span>)
  devtools::document()
  devtools::build()
  devtools::install()
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Using Binary Packages</strong>:</p>
<ul>
<li><p>After installation, functions from the binary components can be called from R scripts or the console.</p>
</li>
<li><p><strong>Installation</strong>: Binary packages can be installed from CRAN or other repositories using <code>install.packages()</code>.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-julia-binaries"><strong>Julia binaries</strong></h3>
<ol>
<li><p><strong>Creating Binary Dependencies</strong>:</p>
<ul>
<li><p>Julia allows direct calling of C functions using its <code>ccall</code> interface. No wrapper code is needed.</p>
</li>
<li><p><strong>Tools</strong>:</p>
<ul>
<li><code>BinaryBuilder.jl</code>: A tool for building binaries that can be used across different platforms.</li>
</ul>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-julia">  <span class="hljs-comment"># Calling a C function</span>
  <span class="hljs-keyword">function</span> my_c_function()
      <span class="hljs-keyword">ccall</span>((:hello_from_c, <span class="hljs-string">"libmyclib"</span>), Cvoid, ())
  <span class="hljs-keyword">end</span>
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Building</strong>:</p>
<ul>
<li><p>Use <a target="_blank" href="https://docs.binarybuilder.org/stable/"><code>BinaryBuilder.jl</code></a> to create binaries for every platform and distribute them automatically, including the Julia wrapper code. This way you do not have to compile anything yourself. You will have to put the C code into a separate repository and then provide a build script to <a target="_blank" href="https://github.com/JuliaPackaging/Yggdrasil">Yggdrasil</a>.</p>
</li>
<li><p>Alternatively you can compile the code yourself and dynamically open the library in Julia with <code>Libdl.dlopen()</code>.</p>
</li>
</ul>
</li>
<li><p><strong>Using Binary Dependencies</strong>:</p>
<ul>
<li><p><strong>Artifacts</strong>: Julia uses a system of artifacts to handle binary dependencies, which can be declared in a package's <code>Artifacts.toml</code> file. The wrapper package generated by <code>BinaryBuilder.jl</code> will already have this <code>Artifacts.toml</code> file. The wrapper package will also have regular Julia functions automatically generated for all the <code>ccall</code> functions, which you can use in your Julia code.</p>
</li>
<li><p>If you manually compiled the C library, you'll have to upload it somewhere and add the link to the <code>Artifacts.toml</code> file. <code>ArtifactUtils.jl</code> is a package that can help with that. (Note: I used this approach for <a target="_blank" href="https://github.com/brainflow-dev/brainflow/tree/master/julia_package/brainflow">brainflow</a>.)</p>
</li>
<li><p><strong>Installation</strong>: Julia's package manager <code>Pkg</code> downloads and installs the required binaries automatically.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-rust-binaries"><strong>Rust binaries</strong></h3>
<ol>
<li><p><strong>Creating Binary Dependencies</strong>:</p>
<ul>
<li><p>Rust can interface with C libraries using the <code>extern</code> keyword and FFI (Foreign Function Interface). See this <a target="_blank" href="https://docs.rust-embedded.org/book/interoperability/c-with-rust.html">tutorial</a> for example.</p>
</li>
<li><p><strong>Tools</strong>:</p>
<ul>
<li><p><code>cargo</code>: Manages dependencies and builds projects.</p>
</li>
<li><p><code>bindgen</code>: (Optional) generates Rust FFI bindings to C libraries.</p>
</li>
</ul>
</li>
<li><p><strong>Example</strong>:</p>
<pre><code class="lang-rust">  <span class="hljs-comment">// src/extension.rs</span>

  <span class="hljs-keyword">extern</span> <span class="hljs-string">"C"</span> {
      <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">hello_from_c</span></span>();
  }

  <span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">call_hello_from_c</span></span>() {
      <span class="hljs-keyword">unsafe</span> {
          hello_from_c();
      }
  }
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Building</strong>:</p>
<ul>
<li><p>Add the <code>cc</code> crate to <code>Cargo.toml</code> to help with compiling the C code:</p>
<pre><code class="lang-ini">  <span class="hljs-section">[package]</span>
  <span class="hljs-attr">name</span> = <span class="hljs-string">"myextension"</span>
  <span class="hljs-attr">version</span> = <span class="hljs-string">"0.1.0"</span>
  <span class="hljs-attr">edition</span> = <span class="hljs-string">"2018"</span>
  <span class="hljs-attr">build</span> = <span class="hljs-string">"build.rs"</span>

  <span class="hljs-section">[build-dependencies]</span>
  <span class="hljs-attr">cc</span> = <span class="hljs-string">"1.0"</span>
</code></pre>
</li>
<li><p>Specify the paths to the C code in a <code>build.rs</code> file of a Rust project.</p>
<pre><code class="lang-rust">  <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {
      cc::Build::new()
          .file(<span class="hljs-string">"src/myclib.c"</span>)
          .compile(<span class="hljs-string">"myextension"</span>);
  }
</code></pre>
</li>
<li><p>Then build the project using <code>cargo build --release</code>.</p>
</li>
</ul>
</li>
<li><p><strong>Using Binary Dependencies</strong>:</p>
<ul>
<li><p>The example above compiles the C code for you. If your C code is already available as a static library, it's also possible to link against that.</p>
</li>
<li><p><strong>Installation</strong>: Rust's <code>cargo</code> handles fetching and compiling the necessary binary dependencies.</p>
</li>
</ul>
</li>
</ol>
<p>Note that these are very simple, and incomplete, examples of embedding C in the respective languages, yet they give you a taste of what's involved when working with binary dependencies.</p>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>Package management and binary dependency handling differ across Python, R, Julia, and Rust, but mastering them is essential if you want to be a proficient developer in any of these languages. I hope this overview helps you with your package development whenever you need to switch between these programming languages.</p>
]]></content:encoded></item><item><title><![CDATA[User-defined Show Method in Julia]]></title><description><![CDATA[I often find myself looking for a way to write custom display methods for Julia types on the REPL. Time to write it down in a short pragmatic blog post, for you and my future self.
What's the issue? When exploring on the Julia REPL or in notebooks, y...]]></description><link>https://scientificcoder.com/user-defined-show-method-in-julia</link><guid isPermaLink="true">https://scientificcoder.com/user-defined-show-method-in-julia</guid><category><![CDATA[Julia]]></category><category><![CDATA[coding]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Tue, 18 Jul 2023 13:24:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1689686215391/469dd124-564a-4ccf-a118-c1c9f885f6b6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I often find myself looking for a way to write custom display methods for Julia types on the REPL. Time to write it down in a short pragmatic blog post, for you and my future self.</p>
<p>What's the issue? When you are exploring on the Julia REPL or in notebooks and you display your own custom type, the output isn't always very informative. Let's say you have some type:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">struct</span> MyType
    some_number::<span class="hljs-built_in">Float64</span>
    some_dict::<span class="hljs-built_in">Dict</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>You can quickly make an object and display it.</p>
<pre><code class="lang-julia">julia&gt; obj = MyType(<span class="hljs-number">4.0</span>, <span class="hljs-built_in">Dict</span>(:x =&gt; <span class="hljs-number">5</span>))
MyType(<span class="hljs-number">4.0</span>, <span class="hljs-built_in">Dict</span>(:x =&gt; <span class="hljs-number">5</span>))
</code></pre>
<p>Okay... Julia basically shows the constructor of the object. I would like to see the field names, or maybe other information. Sometimes I want to see statistical properties for example, instead of the raw data.</p>
<p>As an alternative, to quickly see the field names, you can <code>dump</code> the content of an object. Which is nice for simple objects, but I explicitly put a <code>Dict</code> in there to mess it up, because it'll dump the dictionary internals, which you don't want to see:</p>
<pre><code class="lang-julia">julia&gt; dump(obj)
MyType
  some_number: <span class="hljs-built_in">Float64</span> <span class="hljs-number">4.0</span>
  some_dict: <span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">Symbol</span>, <span class="hljs-built_in">Int64</span>}
    slots: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">UInt8</span>}((<span class="hljs-number">16</span>,)) <span class="hljs-built_in">UInt8</span>[<span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x00</span>, <span class="hljs-number">0x82</span>]
    keys: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">Symbol</span>}((<span class="hljs-number">16</span>,))
      <span class="hljs-number">1</span>: <span class="hljs-comment">#undef</span>
      <span class="hljs-number">2</span>: <span class="hljs-comment">#undef</span>
      ...
      <span class="hljs-number">15</span>: <span class="hljs-comment">#undef</span>
      <span class="hljs-number">16</span>: <span class="hljs-built_in">Symbol</span> x
    vals: <span class="hljs-built_in">Array</span>{<span class="hljs-built_in">Int64</span>}((<span class="hljs-number">16</span>,)) [<span class="hljs-number">5065505441550857052</span>, <span class="hljs-number">465637893754</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">5</span>]
    ndel: <span class="hljs-built_in">Int64</span> <span class="hljs-number">0</span>
    count: <span class="hljs-built_in">Int64</span> <span class="hljs-number">1</span>
    age: <span class="hljs-built_in">UInt64</span> <span class="hljs-number">0x0000000000000001</span>
    idxfloor: <span class="hljs-built_in">Int64</span> <span class="hljs-number">16</span>
    maxprobe: <span class="hljs-built_in">Int64</span> <span class="hljs-number">0</span>
</code></pre>
<p>Not pretty. How to improve this developer experience?</p>
<h2 id="heading-switching-display-mode">Switching display mode</h2>
<p>Before I go to the solution, it turns out there are different "modes" of printing an object. You can notice this behavior when you place dictionaries inside an array for example:</p>
<pre><code class="lang-julia">julia&gt; d = <span class="hljs-built_in">Dict</span>(:a =&gt; <span class="hljs-number">1</span>, :b =&gt; <span class="hljs-number">2</span>, :c =&gt; <span class="hljs-number">3</span>)
<span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">Symbol</span>, <span class="hljs-built_in">Int64</span>} with <span class="hljs-number">3</span> entries:
  :a =&gt; <span class="hljs-number">1</span>
  :b =&gt; <span class="hljs-number">2</span>
  :c =&gt; <span class="hljs-number">3</span>

julia&gt; [d, <span class="hljs-built_in">Dict</span>(:d =&gt; <span class="hljs-number">4</span>)]
<span class="hljs-number">2</span>-element <span class="hljs-built_in">Vector</span>{<span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">Symbol</span>, <span class="hljs-built_in">Int64</span>}}:
 <span class="hljs-built_in">Dict</span>(:a =&gt; <span class="hljs-number">1</span>, :b =&gt; <span class="hljs-number">2</span>, :c =&gt; <span class="hljs-number">3</span>)
 <span class="hljs-built_in">Dict</span>(:d =&gt; <span class="hljs-number">4</span>)
</code></pre>
<p>You see that the dictionary is displayed differently in the two cases above. Inside the array we prefer a single line display, since you may have many objects. I sometimes forget to properly implement this shorter mode, and then I get ugly array printing.</p>
<p>Here's some <a target="_blank" href="https://discourse.julialang.org/t/show-and-showcompact-on-custom-types/8493">discussion on the topic</a> on the Julia Discourse.</p>
<h2 id="heading-custom-show">Custom show</h2>
<p>In the end, this is a typical approach I take. You can make it a lot more fancy if you like, but this is a good starting point:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">struct</span> MyType
    some_number::<span class="hljs-built_in">Float64</span>
    some_dict::<span class="hljs-built_in">Dict</span>
<span class="hljs-keyword">end</span>

<span class="hljs-comment"># 2-argument show, used by Array show, print(obj) and repr(obj), keep it short</span>
<span class="hljs-keyword">function</span> Base.show(io::<span class="hljs-built_in">IO</span>, obj::MyType)
    print_object(io, obj, multiline = <span class="hljs-literal">false</span>)
<span class="hljs-keyword">end</span>

<span class="hljs-comment"># the 3-argument show used by display(obj) on the REPL</span>
<span class="hljs-keyword">function</span> Base.show(io::<span class="hljs-built_in">IO</span>, mime::<span class="hljs-string">MIME"text/plain"</span>, obj::MyType)
    <span class="hljs-comment"># you can add IO options if you want</span>
    multiline = get(io, :multiline, <span class="hljs-literal">true</span>)
    print_object(io, obj, multiline = multiline)
<span class="hljs-keyword">end</span>

<span class="hljs-keyword">function</span> print_object(io::<span class="hljs-built_in">IO</span>, obj::MyType; multiline::<span class="hljs-built_in">Bool</span>)
    <span class="hljs-keyword">if</span> multiline
        print(io, <span class="hljs-string">"MyType"</span>) <span class="hljs-comment"># or call summary(io, obj)</span>
        print(io, <span class="hljs-string">"\n  "</span>)
        print(io, <span class="hljs-string">"some_number: <span class="hljs-subst">$(obj.some_number)</span>"</span>)
        print(io, <span class="hljs-string">"\n  "</span>)
        print(io, <span class="hljs-string">"some_dict: <span class="hljs-subst">$(obj.some_dict)</span>"</span>)
    <span class="hljs-keyword">else</span>
        <span class="hljs-comment"># write something short, or go back to default mode</span>
        Base.show_default(io, obj)
    <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>This works fine:</p>
<pre><code class="lang-julia">julia&gt; t = MyType(<span class="hljs-number">5.0</span>, <span class="hljs-built_in">Dict</span>(:a =&gt; <span class="hljs-number">1</span>, :b =&gt; <span class="hljs-number">2</span>))
MyType
  some_number: <span class="hljs-number">5.0</span>
  some_dict: <span class="hljs-built_in">Dict</span>(:a =&gt; <span class="hljs-number">1</span>, :b =&gt; <span class="hljs-number">2</span>)

julia&gt; [t]
<span class="hljs-number">1</span>-element <span class="hljs-built_in">Vector</span>{MyType}:
 MyType(<span class="hljs-number">5.0</span>, <span class="hljs-built_in">Dict</span>(:a =&gt; <span class="hljs-number">1</span>, :b =&gt; <span class="hljs-number">2</span>))
</code></pre>
<p>You can test the IO context options as follows:</p>
<pre><code class="lang-julia">julia&gt; show(<span class="hljs-built_in">IOContext</span>(stdout, :compact =&gt; <span class="hljs-literal">true</span>), <span class="hljs-string">MIME"text/plain"</span>(), t)
MyType
  some_number: <span class="hljs-number">5.0</span>
  some_dict: <span class="hljs-built_in">Dict</span>(:a=&gt;<span class="hljs-number">1</span>, :b=&gt;<span class="hljs-number">2</span>)

julia&gt; show(<span class="hljs-built_in">IOContext</span>(stdout, :compact =&gt; <span class="hljs-literal">true</span>, :multiline =&gt; <span class="hljs-literal">false</span>), <span class="hljs-string">MIME"text/plain"</span>(), t)
MyType(<span class="hljs-number">5.0</span>, <span class="hljs-built_in">Dict</span>(:a=&gt;<span class="hljs-number">1</span>, :b=&gt;<span class="hljs-number">2</span>))
</code></pre>
<p>You can make your type printing as fancy as you desire.</p>
<p>One additional trick: to make the code more concise when you have a lot of properties with special types, you can loop over the <code>propertynames(obj)</code> and use for example <code>getproperty(obj, :name)</code>. In the example above I hardcoded the property names instead, such as in the line <code>print(io, "some_number: $(obj.some_number)")</code>.</p>
<p>Here's where I found stuff in the base language for inspiration:</p>
<ul>
<li><p>many show methods in <a target="_blank" href="https://github.com/JuliaLang/julia/blob/master/base/show.jl">show.jl</a>, including the <a target="_blank" href="https://github.com/JuliaLang/julia/blob/master/base/show.jl#L147">Dict show</a>.</p>
</li>
<li><p>the <a target="_blank" href="https://github.com/JuliaLang/julia/blob/master/base/dict.jl#L3">Dict show</a> called by the array.</p>
</li>
<li><p>the <a target="_blank" href="https://github.com/JuliaLang/julia/blob/master/base/arrayshow.jl">Array show</a> internals.</p>
</li>
</ul>
<p>In these examples above I found out that there are commonly used IO options, like <code>:compact</code> which are described in the <a target="_blank" href="https://docs.julialang.org/en/v1/base/io-network/#Base.IOContext-Tuple%7BIO,%20Pair%7D">Base.IOContext documentation</a>. You can choose to implement such options in your custom show methods, to provide more user configuration to the printing.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Well that's it, hope it helps as a reference to you and future me ;)</p>
]]></content:encoded></item><item><title><![CDATA[JuliaCon Local Eindhoven 2023]]></title><description><![CDATA[I am very happy to announce that I am an organizer of the first city-level JuliaCon conference. This will be a one-day event in Eindhoven on December 1st, organized together with the PyData Eindhoven conference on November 30th (the day before).
The ...]]></description><link>https://scientificcoder.com/juliacon-local-eindhoven-2023</link><guid isPermaLink="true">https://scientificcoder.com/juliacon-local-eindhoven-2023</guid><category><![CDATA[conference]]></category><category><![CDATA[Julia]]></category><category><![CDATA[scientific-computing]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Tue, 11 Jul 2023 07:23:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1689059863447/f5218ac5-7cf2-4036-bbd4-b6f01ff19866.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am very happy to announce that I am an organizer of the first city-level JuliaCon conference. This will be a one-day event in Eindhoven on December 1st, organized together with the PyData Eindhoven conference on November 30th (the day before).</p>
<p>The website is live: <a target="_blank" href="https://juliacon.org/local/eindhoven2023/">https://juliacon.org/local/eindhoven2023/</a>. You can submit proposals, book early-bird tickets and consider joining as a volunteer.</p>
<p>We named it "JuliaCon Local" to avoid any confusion with the yearly Global JuliaCon, which is typically also associated with a city name. The date is deliberately in the winter, to be out of sync with the summer schedule of the Global JuliaCon conferences. People who could not attend the Global JuliaCon now have another opportunity to meet like-minded Julians and computational scientists from industry and academia.</p>
<p>My apologies if I notify you via multiple channels, including my blog, but we are really excited about growing our scientific computing community in the area. Please consider sharing the news with your network. Of course everyone on the planet is welcome to join our conference! Hopefully we are paving the path to more city-level JuliaCon conferences.</p>
]]></content:encoded></item><item><title><![CDATA[How to deploy algorithms anywhere?]]></title><description><![CDATA[Let's say you are an incredible scientific programmer. You've got some pretty math, machine learning model or scientific computing code. And you want to give it to other users. Maybe even turn it into a real product and make a profit from your work. ...]]></description><link>https://scientificcoder.com/how-to-deploy-algorithms-anywhere</link><guid isPermaLink="true">https://scientificcoder.com/how-to-deploy-algorithms-anywhere</guid><category><![CDATA[Julia]]></category><category><![CDATA[deployment]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Sun, 09 Jul 2023 12:41:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1688895403637/ca806aa8-49ad-41dd-ab25-b3ba2db8ffe6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let's say you are an incredible scientific programmer. You've got some pretty math, machine learning model or scientific computing code. And you want to give it to other users. Maybe even turn it into a real product and make a profit from your work. How do you "deploy" that piece of code? Most scientists do not think much about this problem at all, but it can have a great influence on how you should develop your code.</p>
<p>Basically, we need to take what you developed and turn it into something that can be given to the user, so they can install and use it in their computing environment. What to provide depends entirely on the environment of the user. So you'll first need to understand that environment: the so-called "production environment", in which your "product" or service will operate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688370403797/7f59581a-f9b6-46b3-80f2-f69fc29c19f4.png" alt class="image--center mx-auto" /></p>
<p>The easiest way to make sure the code works, is to write the code inside the production environment and run it there. Boom! Everything works. Some startups operate like that, but it's not very common. It's quite a risk to mess up your production environment accidentally. It's also possible you have no direct access to your production environment, for example if you are writing code that needs to be installed on millions of cars around the planet.</p>
<p>If you want to know about the deployment processes adopted by many companies, I recommend the <a target="_blank" href="https://blog.pragmaticengineer.com/shipping-to-production/">Pragmatic Engineer - Shipping to Production</a>. Unfortunately, that focuses mainly on procedures and assumes quite some software knowledge already.</p>
<p>I think there are roughly three options here that we need to consider:</p>
<ol>
<li><p>You fully understand and control the production environment. For example, if you work for a car manufacturer and you write the firmware, then you deploy the code into an environment that you control (or at least your employer does). You might be able to prepare the production environment to best suit your chosen algorithm technology.</p>
</li>
<li><p>You understand the production environment, but you do not control it. In the previous example, let's say you are a vendor selling software to the car manufacturer. You probably need to restrict yourself to the production environment of your customer.</p>
</li>
<li><p>You neither know the production environment, nor do you control it. Let's say you are selling software that might run on any laptop with any operating system (macOS, Linux, Windows), or even on mobile devices. You have no clue what to expect. This can be tough, but is quite common for consumer software.</p>
</li>
</ol>
<p>For the last option, the modern era has tried to work around the issue by deploying to servers (or "clouds"). In that case you fully control and understand the production environment, and you merely provide the user with access to your service. This does assume your user has internet access, which seems reasonable these days, but is not true in environments like super-secure semiconductor factories (where I may have some experience).</p>
<p>Assuming you understand your production environment, you are still looking for a balance between how much you share yourself and how much you re-use. If you can re-use software components in the production environment, you can distribute a smaller deployment package/artifact. But you may have to conform to things you do not control, which can be unpleasant.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688391935401/575bd509-cfee-4a45-a26e-56ee2d8f6ac0.png" alt class="image--center mx-auto" /></p>
<p>Let's move on to typical deployment options. What is this "thing", this "artifact", that we send to the production environment? Here's the general options I work with:</p>
<ul>
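<li><p>Deploy the raw source code files and make sure the interpreter/compiler is available in the production environment. Python typically works like this, and Java is similar: it ships bytecode rather than raw source, but still needs its runtime (the JVM) in the production environment.</p>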
</li>
<li><p>Compile the source code to something "standalone". More C and Rust style.</p>
</li>
<li><p>Package everything together and ship it. Docker containers are the most extreme version of this approach, as they include even the operating system.</p>
</li>
</ul>
<p>But there are plenty of options in between, including all combinations of possible production environments and their restrictions. Some combinations are not possible, for example when deploying on an Arduino you are severely limited by computational capabilities and you will probably have to compile a tiny standalone solution. If you've just written some massive Python AI monstrosity, you'll have to rewrite it to something much leaner. That can be very painful to find out at the end of your project.</p>
<p>That's why it's important to have some end-product in mind and work backwards from that vision in your development. Scientists and business people like to keep the behavior of the software in mind, what the software will do and such, but forget about where it will operate.</p>
<h2 id="heading-source-code-deployment-examples-with-julia">Source code deployment examples with Julia</h2>
<p>The Julia language is currently my favorite language, as it tries to unite multiple programming worlds: those of science and of software engineering. It focuses on being easy to use and fast to execute. In theory Julia can be deployed anywhere, but being developed primarily by numerical computing professionals, it lacks some ease of use in the deployment area. I think that highlights some of the blind spots of typical scientists. I'll use Julia, and its pain points, to highlight deployment considerations, while trying to keep everything generic enough to apply to other languages.</p>
<p>The most basic deployment happens when you, as a developer, begin your journey into the programming language. You install the language, you type some code in some editor (or directly on the REPL), and you run the code. That's it. Note that when you installed the language, you used someone else's deployment mechanism.</p>
<p>The second most basic deployment is to give your code to a fellow developer. That developer will understand their own environment (to a certain extent). They have probably already installed the programming language. If not, they can follow the same installation instructions.</p>
<p>Now if the code you write depends only on the installed language, everything should work. But in the modern era, you typically depend on plenty of other people's code. You'll be importing open-source packages left and right. That's really nice, since it saves you a lot of effort. But now you need to share those extra packages with your fellow developer. Note that packages may include pure source code, but also compiled libraries.</p>
<p>You can either:</p>
<ul>
<li><p>Create a "bundle" of all those open-source packages and share it, or...</p>
</li>
<li><p>Share a reproducible way to install all that code. See my <a target="_blank" href="https://scientificcoder.com/clean-code-tips-for-scientists-1-reproducible-environments">previous article on that</a>.</p>
</li>
</ul>
<p>So if you would like to share a piece of code with someone, you need to consider how to share everything that code depends on.</p>
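<p>As a rough sketch of the second option in Julia, assuming your project folder already contains a <code>Project.toml</code> and <code>Manifest.toml</code> (the function name <code>setup_environment</code> is made up for this illustration):</p>
<pre><code class="lang-julia">using Pkg

function setup_environment(project_dir::AbstractString)
    Pkg.activate(project_dir)   # switch to the project's own environment
    Pkg.instantiate()           # install the package versions recorded in Manifest.toml
end

# Your colleague would run this once, for example from a script in the project folder:
#   setup_environment(@__DIR__)
</code></pre>
<p>This only covers the packages. The Julia installation itself, and an internet connection for the downloads, are still assumed to be available on the other side.</p>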
<p>These scenarios I described so far are simple (installing for yourself or sharing with a colleague), but they already show the concepts we have to take into consideration when sharing:</p>
<ul>
<li><p>The core language features.</p>
</li>
<li><p>The default operating system (OS) libraries which the language depends on.</p>
</li>
<li><p>The code you wrote.</p>
</li>
<li><p>The code others wrote for you.</p>
</li>
<li><p>Any libraries created by others.</p>
</li>
</ul>
<p>You can choose which parts you share directly, and which parts you allow to be installed/downloaded.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688377918978/0a0f134b-c213-43bb-b212-d6424a7f2357.png" alt class="image--center mx-auto" /></p>
<p>For installing source code, Julia depends on the package manager to install everything for you, by downloading it from the internet. This all runs with an existing Julia installation. However, Julia doesn't have a good source code "bundler", where you quickly create an installer with your code in one "bundle" or "distributable" (for example an executable on Windows) and you give that to a person. I think that's missing in the Julia ecosystem.</p>
<p>Note that such solutions are operating system dependent. For Python, you've got py2exe for Windows, py2app for macOS, pex for Unix.</p>
<h2 id="heading-compiling-libraries">Compiling libraries</h2>
<p>A computer doesn't directly execute your source code, it needs low-level instructions. Turning your source code into machine instructions is called "compiling". In my previous article <a target="_blank" href="https://scientificcoder.com/how-to-solve-the-two-language-problem">How to Solve the Two Language Problem</a>, I roughly explained how technologies like Julia work. There are lots of steps, but on a high-level, you go from 1) written characters to 2) an LLVM representation to 3) machine instructions, a.k.a. native code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688395717429/53c419c0-22ce-4c74-83ef-7400a1ddad8a.png" alt class="image--center mx-auto" /></p>
<p>When you gather all that native code and place it in a library (<code>.dll</code> on Windows, <code>.so</code> on Unix), you can share that library directly with an end-user, assuming you know which operating system they are working on. This process of turning the machine instructions into a distributable library is often considered part of the compilation process.</p>
<p>You will still have to "bundle" any external libraries together with your compiled library. This may include certain libraries from your chosen language. Libraries can be linked statically or dynamically, but I don't want to go into those details here. I do want you to always, ALWAYS, consider ALL your dependencies. If you forget to consider a dependency, and it's missing or mis-located in the production environment, your program will not run and your deployment has failed!</p>
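<p>To get a feeling for what a shared library looks like from the receiving side, here is a small sketch that calls into an existing native library from Julia. It assumes the C standard library's <code>strlen</code> is reachable from the Julia process, which is normally the case on Linux and macOS; the wrapper name <code>c_strlen</code> is only for illustration.</p>
<pre><code class="lang-julia">using Libdl

# The file extension of shared libraries depends on the operating system.
println("shared libraries on this system end in .", Libdl.dlext)

# Call the native strlen function directly; the callee is machine code, not Julia source.
c_strlen(s::String) = Int(ccall(:strlen, Csize_t, (Cstring,), s))

println(c_strlen("hello"))
</code></pre>
<p>If the library behind such a call is missing in the production environment, the call fails at run time, which is exactly the dependency problem described above.</p>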
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688394352976/d59637ad-fafb-4073-99af-990ea95dd76f.png" alt class="image--center mx-auto" /></p>
<p>The Julia language community provides the <a target="_blank" href="https://github.com/JuliaLang/PackageCompiler.jl">PackageCompiler</a> package. If you want to make everything fully standalone, you are looking at creating an "app". This will:</p>
<ol>
<li><p>Compile your code, and the dependent code, into one library.</p>
</li>
<li><p>Gather all Julia language libraries.</p>
</li>
<li><p>Gather all dependent third-party libraries.</p>
</li>
<li><p>Place all of those together in a folder, and make sure the dependencies are linked correctly.</p>
</li>
<li><p>Optional: filter out unnecessary libraries (at your own risk).</p>
</li>
</ol>
<p>Note that the default operating system libraries, such as <code>libc</code>, are not included in this "bundle" of libraries.</p>
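<p>For reference, a sketch of what the entry point of such an app can look like. PackageCompiler expects the package to define a <code>julia_main</code> function that returns a <code>Cint</code>. The package name <code>MyApp</code> and the folder names are placeholders, and the build call is only shown as a comment, because it needs PackageCompiler installed and a real package folder.</p>
<pre><code class="lang-julia">module MyApp  # placeholder package name

# Entry point that PackageCompiler looks for; the return value becomes the exit code.
function julia_main()::Cint
    try
        println("MyApp received ", length(ARGS), " argument(s)")
    catch err
        showerror(stderr, err)
        return 1   # non-zero exit code signals failure
    end
    return 0
end

end # module

# Build step, run separately with PackageCompiler installed:
#   using PackageCompiler
#   create_app("path/to/MyApp", "MyAppCompiled"; filter_stdlibs = true)
</code></pre>
<p>The <code>filter_stdlibs</code> option corresponds to the optional last step in the list above, so treat it with the same caution.</p>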
<p>It's possible in Julia to remove as many dependencies as possible, to go to a very small distributable library, and even become independent of any core Julia language libraries. For example, you can run <a target="_blank" href="https://seelengrab.github.io/articles/Running%20Julia%20baremetal%20on%20an%20Arduino/">Julia on an Arduino</a>. But it's far from trivial. Keep your eyes on StaticTools.jl to follow the developments.</p>
<p>Languages like Rust are geared fully towards statically compiling and deploying small independent libraries. That results in very good tooling for the library deployment use-case.</p>
<h2 id="heading-docker-just-deploy-everything">Docker: just deploy everything</h2>
<p><a target="_blank" href="https://www.docker.com/resources/what-container/">Docker</a> tries to be the software technology to solve all deployment. It wraps everything you need into a "container": code, runtime, system tools, system libraries and settings. It's all about portability: to make sure you can share your software with others, as standalone as possible. I won't go into details, Docker has solid documentation.</p>
<p>You will still need to install Docker itself in the production environment. This means that if you do not control the environment, you may never be able to run Docker containers there.</p>
<p>You will have to decide how to deploy everything inside the Docker container, either with source code or with compiled libraries or anything else, but at least you know you have full control over what you place inside.</p>
<p>The container size can be a problem in some production environments. There exist layered containers, to re-use parts among multiple containers, but that just brings back the dependency problem, right?</p>
<p>Containerization is an amazing software technology that can solve many deployment difficulties, but I'd like you to balance it against other deployment options and keep the production environment restrictions in mind.</p>
<h2 id="heading-integrating-and-interfacing">Integrating and interfacing</h2>
<p>Once you have figured out what "artifact" you will send to your production environment, you will also have to consider how that artifact will operate there. In other words, what happens after deployment?</p>
<p>This is mostly a matter of communication. You have to decide on a communication mechanism, a data format and the contents of the data. You'll also have to think about how to handle and communicate errors and other exceptional aspects.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1688722414015/2af4b943-dae7-4792-b1cc-59e5352f2ede.png" alt class="image--center mx-auto" /></p>
<p>The simplest common approach in today's webservice era is to deploy a Docker container, turn it into a REST server (that's a communication mechanism using HTTP), then send JSON strings or ProtoBuf objects (the data format). If it's a computational backend service, say some fitting algorithm, then you can put vectors inside the JSON and maybe some settings (that's the content of the data).</p>
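<p>To make the "content of the data" part a bit more concrete, here is a sketch of a request handler for a hypothetical straight-line fitting service. The transport (HTTP) and the JSON parsing are left out on purpose; the handler assumes the request has already been parsed into a dictionary. The function name and the keys <code>"x"</code>, <code>"y"</code>, <code>"status"</code> and so on are my own assumptions, not an existing API.</p>
<pre><code class="lang-julia"># Takes a parsed request, returns a dictionary that can be serialized as the response.
function handle_fit_request(payload::AbstractDict)
    try
        x = Float64.(payload["x"])
        y = Float64.(payload["y"])
        length(x) == length(y) || throw(ArgumentError("x and y must have equal length"))
        length(x) &gt;= 2 || throw(ArgumentError("need at least two points"))

        # Ordinary least squares for y = slope * x + intercept.
        xm, ym = sum(x) / length(x), sum(y) / length(y)
        slope = sum((x .- xm) .* (y .- ym)) / sum((x .- xm) .^ 2)
        return Dict("status" =&gt; "ok", "slope" =&gt; slope, "intercept" =&gt; ym - slope * xm)
    catch err
        # Errors are part of the interface too: report them as data instead of crashing.
        return Dict("status" =&gt; "error", "message" =&gt; sprint(showerror, err))
    end
end

response = handle_fit_request(Dict("x" =&gt; [0, 1, 2], "y" =&gt; [1, 3, 5]))
println(response)
</code></pre>
<p>Keeping the handler free of any HTTP or JSON code means the same function can sit behind a REST server, a command line tool or a library call, whichever the production environment allows.</p>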
<p>But there are many more options, all depending on the restrictions of your production environment. This probably deserves a separate blog post.</p>
<h2 id="heading-deploy-anything-with-julia">Deploy anything with Julia</h2>
<p>Want more detailed information and tutorials?</p>
<ul>
<li><p>Build entire Julia web apps? See the <a target="_blank" href="https://www.genieframework.com/index.html">Genie framework</a>.</p>
</li>
<li><p>Roll your own simple REST server? <a target="_blank" href="https://github.com/JuliaWeb/HTTP.jl">See HTTP.jl</a> (used by Genie.jl).</p>
</li>
<li><p>Deploy Julia bare-metal on Arduino? <a target="_blank" href="https://seelengrab.github.io/articles/Running%20Julia%20baremetal%20on%20an%20Arduino/">Blog here</a>.</p>
</li>
<li><p>Embed Julia libraries into C/C++ systems? <a target="_blank" href="https://www.functionalnoise.com/pages/2022-07-21-embedding/">Tutorial here</a>.</p>
</li>
<li><p>Make a standalone app? See <a target="_blank" href="https://julialang.github.io/PackageCompiler.jl/stable/apps.html">PackageCompiler docs</a>.</p>
</li>
<li><p>Just want to share a script? Good, but make it <a target="_blank" href="https://scientificcoder.com/clean-code-tips-for-scientists-1-reproducible-environments">reproducible</a>!</p>
</li>
</ul>
<p>I probably missed many others, feel free to add more links in the comments.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>As I grow older and gain experience in deploying in more and more environments, I admit I appreciate fully statically compiled languages more. Plain-old C or Rust. You know you will be able to deploy anywhere if needed. Of course, you may not have such restrictions in your current production environment, but it's nice to work with a technology where you know you will not be blocked when the time comes.</p>
<p>However, such technologies are often tedious to use for scientific exploration or data analysis. Immediately from the beginning they add a lot of restrictions to your software development. Why can't there be a language that does it all? Where you slowly add the necessary restrictions as you progress in your project. I'm hoping we can tune Julia further in that direction, so that we have a language that's easy to write, performant when needed AND easy to deploy anywhere.</p>
<p>I hope this article helps to explain the concepts involved in deploying algorithms (or any type of code) in production environments. Understanding those concepts at the start of your project will make the entire process much smoother. It's essential to consider all dependencies and choose the right deployment method based on the production environment and the language you're using.</p>
]]></content:encoded></item><item><title><![CDATA[Fruity Composable Design Patterns in Julia]]></title><description><![CDATA[A design pattern is a repeatable solution to a common coding problem. Design patterns are not something beginner programmers typically think about a lot (that includes most scientists), they are probably focused on making their code work. At least th...]]></description><link>https://scientificcoder.com/fruity-composable-design-patterns-in-julia</link><guid isPermaLink="true">https://scientificcoder.com/fruity-composable-design-patterns-in-julia</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Julia]]></category><category><![CDATA[software development]]></category><category><![CDATA[coding]]></category><category><![CDATA[Factory Design Pattern]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Fri, 23 Jun 2023 12:18:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1687522411556/c781ad27-554c-48bd-b6fc-f42d5f629591.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A design pattern is a repeatable solution to a common coding problem. Design patterns are not something beginner programmers typically think about a lot (that includes most scientists), they are probably focused on making their code work. At least that's what I did when I was a young programmer. At the other extreme such patterns can become a religion for people, where everything has to be a design pattern, or else the code is not considered good enough. However, people who make this mistake are not senior programmers either in my opinion. Senior programmers look for a balance between pure abstraction and simplicity (and many other requirements).</p>
<p>The Julia community has a special stance on design patterns: people don't really like them. In general the Julia community believes that design patterns expose a mistake in the language, because we should be able to automate any pattern away. I like that philosophy and I prefer not to focus on design patterns too much, but it's inevitable to encounter them while coding. Even if you do not consciously write design patterns, you may accidentally use them. For example, I've used the Factory Method design pattern multiple times, specifically one that takes strings as input and outputs types/classes. This is quite a typical pattern to find in Python as well.</p>
<p>Therefore it's still valuable to think about design patterns. You can see them as best practices that you can learn from. Or you can see them as fun little puzzles, where you take some code out of context and ask "what is the best way to code X?".</p>
<h2 id="heading-composable-factory-method">Composable Factory Method</h2>
<p>Let's write a short example with fruits. Don't ask me why, but sometimes you get strings on the input from a user, or another data source, and you want to turn those into specific (factory) types for your internal code. These "factory types" can later be used to create something else. To be honest, I'm not focusing this article on the entire factory pattern, but only on a composable way to retrieve the type from a string. This also relates to a question about <a target="_blank" href="https://discourse.julialang.org/t/style-recommendation-for-enum-as-type/46078/4">enums as types</a>. Maybe this part of the pattern actually has another name? Who cares, I want to do the following:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> NaiveFruitFactory
    <span class="hljs-keyword">abstract type</span> Fruit <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">struct</span> Apple &lt;: Fruit <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">struct</span> Orange &lt;: Fruit <span class="hljs-keyword">end</span>

    <span class="hljs-keyword">function</span> fruit(str::<span class="hljs-built_in">String</span>)
        <span class="hljs-keyword">if</span> str == <span class="hljs-string">"apple"</span>
            result = Apple()
        <span class="hljs-keyword">elseif</span> str == <span class="hljs-string">"orange"</span>
            result = Orange()
        <span class="hljs-keyword">else</span>
            error(<span class="hljs-string">"Unknown fruit <span class="hljs-variable">$str</span>"</span>)
        <span class="hljs-keyword">end</span>
        <span class="hljs-keyword">return</span> result
    <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>This works fine, right? I can turn a fruity string into a fruit type now.</p>
<pre><code class="lang-julia">julia&gt; NaiveFruitFactory.fruit(<span class="hljs-string">"apple"</span>)
Main.NaiveFruitFactory.Apple()
</code></pre>
<p>In this naïve example the mapping from string values to types is hardcoded, which makes the pattern especially difficult for an outside user to extend: you have to go into the module and add another <code>elseif</code> branch. By the way, there is a good reason to avoid this factory pattern altogether: the code is type unstable, meaning the compiler cannot predict the output type from the input type. There are more reasons to avoid this factory pattern, but as I said, sometimes it's unavoidable. However, I am looking for a better alternative that is still readable and performant, yet also easily extendable. I know, software engineering always involves the most insane requirements.</p>
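<p>If you want to see the type instability for yourself, here is a small sketch. It uses <code>Base.return_types</code>, an internal reflection helper rather than an official public API, to ask the compiler what it can infer; interactively you would normally reach for <code>@code_warntype</code>. The module and function names are made up for this check.</p>
<pre><code class="lang-julia">module InstabilityDemo
    abstract type Fruit end
    struct Apple &lt;: Fruit end
    struct Orange &lt;: Fruit end

    # Value-based: the returned type depends on the contents of the string.
    fruit_from_value(str::String) = str == "apple" ? Apple() : Orange()

    # Type-based: the returned type follows from the argument type alone.
    fruit_from_type(::Type{Apple}) = Apple()
end

# Inferred return types, given only the argument types.
value_based = first(Base.return_types(InstabilityDemo.fruit_from_value, (String,)))
type_based = first(Base.return_types(InstabilityDemo.fruit_from_type, (Type{InstabilityDemo.Apple},)))

println(value_based, " is concrete: ", isconcretetype(value_based))
println(type_based, " is concrete: ", isconcretetype(type_based))
</code></pre>
<p>For the value-based function the compiler can at best narrow the result down to a union of the possible fruits, so any code that consumes the result has to decide at run time which fruit it received.</p>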
<p>I've read the book <a target="_blank" href="https://www.google.com/search?rlz=1C1GCEU_enDE853DE853&amp;q=Hands-On+Design+Patterns+and+Best+Practices+with+Julia:+Proven+Solutions+to+Common+Problems+in+Software+Design+for+Julia+1.x&amp;si=AMnBZoEZ8aFftZu792frFYrnK9KQYGXRL3UTeDeHB9-uc0sfFeepDAVw_FxX4OtyqI1BQ8YRZRbli_Bwn0DaOA9TunvXQVZADtby617YcRQUdTbvWm-huT4HDHGkx6_eBkRdjtYN44bIvizA7J6wc7xex-HzqalhDEQCmQA6WtA4Wj_fzpnT2B9bMrGNq4mFuF2bZOJhajW_dAZ7gsn_q1oaG3JOw8Pi65BQuCMHx2UNnQpB0rsm-c9xgB0FyEyycCzoi2-CZco41_KXppcywjbX7puLwFCq-jdwADM3csPsF7xEQtANRS8f6hHav7e7vKHfZuf79NIIx52-t6nBMa9J_3KNgc0g_W6XGp44j5RrIoVS-qh7plg%3D&amp;sa=X&amp;ved=2ahUKEwiggciJgbb-AhVaxAIHHZrICI0QmxMoAHoECA8QAg">Hands-On Design Patterns and Best Practices with Julia</a> from Tom Kwong again for reference. The factory pattern in his Creational Patterns chapter is not exactly what I am looking for, as it doesn't use strings as input. His output factory depends on the input type (not the value), which is more preferable. His example is a formatter used for printing certain types in different ways:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">abstract type</span> Formatter <span class="hljs-keyword">end</span>
<span class="hljs-keyword">struct</span> IntegerFormatter &lt;: Formatter <span class="hljs-keyword">end</span>
<span class="hljs-keyword">struct</span> FloatFormatter &lt;: Formatter <span class="hljs-keyword">end</span>

formatter(::<span class="hljs-built_in">Type</span>{T}) <span class="hljs-keyword">where</span> {T &lt;: <span class="hljs-built_in">Integer</span>} = IntegerFormatter()
formatter(::<span class="hljs-built_in">Type</span>{T}) <span class="hljs-keyword">where</span> {T &lt;: <span class="hljs-built_in">AbstractFloat</span>} = FloatFormatter()
formatter(::<span class="hljs-built_in">Type</span>{T}) <span class="hljs-keyword">where</span> T = error(<span class="hljs-string">"No formatter defined for type <span class="hljs-variable">$T</span>"</span>)
</code></pre>
<p>So maybe we should have a separate name for a "type-based factory method" and a "value-based factory method"?</p>
<p>I have three options for a composable "value-based factory method" (please leave a comment if you see a better option):</p>
<ul>
<li><p>Interactive subtype looping (don't do this!)</p>
</li>
<li><p>Registration mechanism</p>
</li>
<li><p>Value-based dispatching</p>
</li>
</ul>
<p>The first one, which I considered long ago, is simply to loop over the <code>subtypes</code> of the abstract type. I'll show that this was a performance mistake. The fact that you need to import <code>InteractiveUtils.jl</code> in your code is always a big warning sign.</p>
<p>We can do one with a collection like a dictionary and a <code>register!</code> function, but I personally prefer one with automatic registration/subscription of the new type. This pattern is probably something you'd do in Python.</p>
<p>Finally, we can do a <code>Val</code> dispatch, which is a bit slower than the if-else/switch statement. This is what we can use if performance isn't a main issue, like on a public interface function. You may want to reconsider in a deep inner loop that is performance critical for your code.</p>
<p>Let's get into the details.</p>
<h2 id="heading-subtype-looping">Subtype Looping</h2>
<p>I will show a very straightforward solution, that's very difficult for the compiler. I am showing this approach, because I made this mistake once. Here's the code. It's very similar to the naïve example, except now we ask every type of fruit to provide a <code>fruitname</code> function and we loop over <code>subtypes(Fruit)</code> until we find the string.</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> SubtypeFruitFactory
    <span class="hljs-keyword">import</span> InteractiveUtils: subtypes

    <span class="hljs-keyword">abstract type</span> Fruit <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">struct</span> Apple &lt;: Fruit <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">struct</span> Orange &lt;: Fruit <span class="hljs-keyword">end</span>

    fruitname(::<span class="hljs-built_in">Type</span>{Apple}) = <span class="hljs-string">"apple"</span>
    fruitname(::<span class="hljs-built_in">Type</span>{Orange}) = <span class="hljs-string">"orange"</span>

    <span class="hljs-keyword">function</span> fruit(str::<span class="hljs-built_in">String</span>)
        <span class="hljs-keyword">for</span> <span class="hljs-keyword">type</span> <span class="hljs-keyword">in</span> subtypes(Fruit)
            <span class="hljs-keyword">if</span> str == fruitname(<span class="hljs-keyword">type</span>)
                <span class="hljs-keyword">return</span> <span class="hljs-keyword">type</span>()
            <span class="hljs-keyword">end</span>
        <span class="hljs-keyword">end</span>
        error(<span class="hljs-string">"Unknown fruit <span class="hljs-variable">$str</span>"</span>)
    <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>The benefit is that I can let anyone extend this module with their own fruit types with very little code:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> SubtypeFruitExtension
    <span class="hljs-keyword">import</span> ..SubtypeFruitFactory
    <span class="hljs-keyword">struct</span> Banana &lt;: SubtypeFruitFactory.Fruit <span class="hljs-keyword">end</span>
    SubtypeFruitFactory.fruitname(::<span class="hljs-built_in">Type</span>{Banana}) = <span class="hljs-string">"banana"</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>It works fine, but the catch is that <code>subtypes</code> is a reflection function that searches through the loaded modules at run time. Its result cannot be worked out at compile time, because at any moment a new Fruit subtype can be added. You can see the drastic difference in timing on my computer (I don't even have to do proper benchmarking):</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-meta">@time</span> NaiveFruitFactory.fruit(<span class="hljs-string">"orange"</span>);
  <span class="hljs-number">0.000002</span> seconds

julia&gt; <span class="hljs-meta">@time</span> SubtypeFruitFactory.fruit(<span class="hljs-string">"orange"</span>);
  <span class="hljs-number">0.014683</span> seconds (<span class="hljs-number">1.01</span> k allocations: <span class="hljs-number">814.500</span> KiB)
</code></pre>
<p>So let's avoid this one, shall we?</p>
<h2 id="heading-registration-mechanism">Registration Mechanism</h2>
<p>Another straightforward approach. Instead of hardcoding the names of the types that we want to check, we store them in a mutable collection, like a dictionary.</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> RegisterFruitFactory

    <span class="hljs-keyword">abstract type</span> Fruit <span class="hljs-keyword">end</span>

    <span class="hljs-keyword">const</span> FRUIT_MAP = <span class="hljs-built_in">Dict</span>{<span class="hljs-built_in">String</span>, <span class="hljs-built_in">DataType</span>}()

    <span class="hljs-keyword">function</span> register!(fruit::<span class="hljs-built_in">Type</span>{&lt;:Fruit}, name::<span class="hljs-built_in">String</span>)
        FRUIT_MAP[name] = fruit
    <span class="hljs-keyword">end</span>

    <span class="hljs-keyword">struct</span> Apple &lt;: Fruit <span class="hljs-keyword">end</span>
    register!(Apple, <span class="hljs-string">"apple"</span>)
    <span class="hljs-keyword">struct</span> Orange &lt;: Fruit <span class="hljs-keyword">end</span>
    register!(Orange, <span class="hljs-string">"orange"</span>)

    <span class="hljs-keyword">function</span> fruit(str::<span class="hljs-built_in">String</span>)
        fruit_type = get(FRUIT_MAP, str, <span class="hljs-literal">nothing</span>)
        <span class="hljs-keyword">if</span> isnothing(fruit_type)
            error(<span class="hljs-string">"Unknown fruit <span class="hljs-variable">$str</span>"</span>)
        <span class="hljs-keyword">else</span>
            <span class="hljs-keyword">return</span> fruit_type()
        <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>Similar to the previous example, we can easily extend this one:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> RegisterFruitExtension
    <span class="hljs-keyword">import</span> ..RegisterFruitFactory
    <span class="hljs-keyword">struct</span> Banana &lt;: RegisterFruitFactory.Fruit <span class="hljs-keyword">end</span>
    RegisterFruitFactory.register!(Banana, <span class="hljs-string">"banana"</span>)
<span class="hljs-keyword">end</span>
</code></pre>
<p>Performance is good in my opinion, though slower than the hardcoded if-else statement at the start, due to the dictionary lookup. Let's check the minimum time with <code>BenchmarkTools.jl</code>. (And we always have to be careful that we are not looking at <a target="_blank" href="https://juliaci.github.io/BenchmarkTools.jl/stable/manual/#Understanding-compiler-optimizations">compiler optimizations</a>.)</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-keyword">using</span> BenchmarkTools

julia&gt; <span class="hljs-meta">@btime</span> NaiveFruitFactory.fruit($<span class="hljs-string">"orange"</span>);
  <span class="hljs-number">10.911</span> ns (<span class="hljs-number">0</span> allocations: <span class="hljs-number">0</span> bytes)

julia&gt; <span class="hljs-meta">@btime</span> RegisterFruitFactory.fruit($<span class="hljs-string">"orange"</span>);
  <span class="hljs-number">146.007</span> ns (<span class="hljs-number">0</span> allocations: <span class="hljs-number">0</span> bytes)
</code></pre>
<p>Looks okay. The downside is that we are using a global variable in a module to store the registered types. We may have to put locks around that for multi-threading purposes. That would be a topic for another blog post.</p>
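<p>As a rough idea of what such locking could look like, here is a minimal sketch, assuming a single <code>ReentrantLock</code> guarding the dictionary. The module name <code>LockedFruitFactory</code> and the constant <code>FRUIT_LOCK</code> are hypothetical names I made up for illustration, and I have not benchmarked this variant:</p>
<pre><code class="lang-julia">module LockedFruitFactory

    abstract type Fruit end

    # Hypothetical names: FRUIT_LOCK and this module are for illustration only.
    const FRUIT_MAP = Dict{String, DataType}()
    const FRUIT_LOCK = ReentrantLock()

    function register!(fruit::Type{&lt;:Fruit}, name::String)
        # Hold the lock while mutating the shared dictionary.
        lock(FRUIT_LOCK) do
            FRUIT_MAP[name] = fruit
        end
    end

    struct Apple &lt;: Fruit end
    register!(Apple, "apple")

    function fruit(str::String)
        # Reads take the lock too, since a concurrent write may resize the Dict.
        fruit_type = lock(FRUIT_LOCK) do
            get(FRUIT_MAP, str, nothing)
        end
        if isnothing(fruit_type)
            error("Unknown fruit $str")
        end
        return fruit_type()
    end
end
</code></pre>
<p>The lock adds some overhead to every lookup, so whether it is worth it depends on whether types are really registered from multiple threads.</p>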
<h2 id="heading-value-based-dispatching">Value-based Dispatching</h2>
<p>Let's have a swing at another Julia solution. In Julia it is possible to dispatch on values, by wrapping them into parametric <code>Val{}</code> types. Note that this works only for plain data types, for example check <code>isbitstype(Int64)</code>. Strings are not plain bits types (<code>isbitstype(String)</code> returns <code>false</code>), so they are not allowed as parametric values. However, we can first convert them to symbols, which are allowed, and then dispatch on those. Let's have a look at the implementation.</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> ValueFruitFactory
    <span class="hljs-keyword">abstract type</span> Fruit <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">struct</span> Apple &lt;: Fruit <span class="hljs-keyword">end</span>
    <span class="hljs-keyword">struct</span> Orange &lt;: Fruit <span class="hljs-keyword">end</span>

    fruit(str::<span class="hljs-built_in">String</span>) = fruit(<span class="hljs-built_in">Symbol</span>(str))
    fruit(sym::<span class="hljs-built_in">Symbol</span>) = fruit(<span class="hljs-built_in">Val</span>(sym))
    fruit(::<span class="hljs-built_in">Val</span>{:apple}) = Apple()
    fruit(::<span class="hljs-built_in">Val</span>{:orange}) = Orange()

    <span class="hljs-comment"># default error</span>
    fruit(::<span class="hljs-built_in">Val</span>{T}) <span class="hljs-keyword">where</span> T = error(<span class="hljs-string">"Unknown fruit <span class="hljs-variable">$T</span>"</span>)
<span class="hljs-keyword">end</span>
</code></pre>
<p>The smallest implementation so far! And as always the extension package is 3 lines of code:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">module</span> ValueFruitExtension
    <span class="hljs-keyword">import</span> ..ValueFruitFactory
    <span class="hljs-keyword">struct</span> Banana &lt;: ValueFruitFactory.Fruit <span class="hljs-keyword">end</span>
    ValueFruitFactory.fruit(::<span class="hljs-built_in">Val</span>{:banana}) = Banana()
<span class="hljs-keyword">end</span>
</code></pre>
<p>How are we doing in performance?</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-meta">@btime</span> ValueFruitFactory.fruit($<span class="hljs-string">"orange"</span>);
  <span class="hljs-number">236.941</span> ns (<span class="hljs-number">0</span> allocations: <span class="hljs-number">0</span> bytes)
</code></pre>
<p>Slightly slower than the registration method with a dictionary, but significantly more pleasing to read in my opinion.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In summary, I wanted this behavior in a simple, yet performant manner:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> ValueFruitFactory
fruit(<span class="hljs-string">"apple"</span>) == Apple()
fruit(<span class="hljs-string">"orange"</span>) == Orange()
fruit(<span class="hljs-string">"banana"</span>) <span class="hljs-comment"># throws error</span>

<span class="hljs-keyword">using</span> SomeFruitExtension
fruit(<span class="hljs-string">"banana"</span>) == Banana()
</code></pre>
<p>(I am ignoring namespaces for a moment here, but we can always <code>export</code> those symbols in Julia.)</p>
<p>In the end, a simple switch statement (an if-elseif-...-elseif) is best for performance when you want to construct types from values, such as strings. But that means you cannot extend the constructor with another type, because it's hardcoded in the switch statement. If you want a decently performing, composable solution that is pleasant to read, then the value-based dispatching seems to be the way to go.</p>
<p>I should probably wrap up with a final conclusion about design patterns. First of all, solving little puzzles is fun and when you enjoy your work, you generally do better, so please tinker with design patterns if you find them fun. Beyond that, it's a matter of balancing the requirements of your code: look for what works best in your case, while keeping less obvious non-functional requirements in mind, such as readability, decent performance and composability. With that pragmatic mindset you can look at design patterns for inspiration.</p>
]]></content:encoded></item><item><title><![CDATA[Software Testing for Scientists]]></title><description><![CDATA[I am currently reading the book "Software Engineering for Science." It is one giant complaint about how scientists are terrible at writing maintainable code for themselves. I won't go into all the pain, but I do recognize that pain deeply and have wr...]]></description><link>https://scientificcoder.com/software-testing-for-scientists</link><guid isPermaLink="true">https://scientificcoder.com/software-testing-for-scientists</guid><category><![CDATA[Testing]]></category><category><![CDATA[software development]]></category><category><![CDATA[scientificsoftware ]]></category><category><![CDATA[best practices]]></category><category><![CDATA[tips]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Sun, 11 Jun 2023 12:35:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1685970744744/3005ed53-27a0-4856-92d6-26ece2b8fe52.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am currently reading the book "Software Engineering for Science." It is one giant complaint about how scientists are terrible at writing maintainable code for themselves. I won't go into all the pain, but I do recognize that pain deeply and have written about it elsewhere. Right now I am reading this book hoping to find solutions. So, what's the proposed solution? The book doesn't provide a simple answer, but one recurring topic is "testing, testing, TESTING!" So, let's talk about testing!</p>
<p>Why don't scientists test their code? Well, it turns out that most scientists do not have a software engineering background, yet they find themselves writing code and software for their work. Alternatively, they may collaborate heavily with software engineers, either in academia or in the industry. If you find yourself in this scenario, you probably don't have the time to suddenly obtain a computer science degree, but it's beneficial to learn a few tricks from the professional software development community. As mentioned, the primary skill to acquire is software testing.</p>
<p>(As always I want to repeat that when I talk about "scientists", this may refer to anyone who uses modeling techniques and computational thinking in their daily job. This can include data scientists, requirements engineers, business analysts, financial quants or anyone else. See <a target="_blank" href="https://scientificcoder.com/my-target-audience">My Target Audience</a> article.)</p>
<p>I know from experience that scientists often use manual testing strategies. In this article I will share a few simple steps that you can take to improve the correctness of your code, thus making you and all your colleagues trust the results of your work. Without trust, people will stop using the software you develop. That's a shame.</p>
<p>Tests also help to "refactor" the code: you can change the code, quickly check that all tests pass, and be confident about your changes. Since your code will change a lot during your work, the peace of mind you gain from knowing everything is working is absolutely worth the effort of learning about a few software test strategies.</p>
<p>Some people even believe that writing tests will make the code more usable, more modular and improve the overall software architecture. So plenty of reasons to test.</p>
<p>There was a time that I didn't write any tests for my code. Several times it was pointed out to me that software testing is a good practice. I never really started, or started only half-heartedly. At some point I was working with a lot of software engineers, and they all got a course in test-driven development (TDD), so I promised to try TDD in a new project. At the start it was painful to change my way of working, but after a while I got the hang of it and I have been writing tests ever since. The main benefit for me is that I do not have to keep the whole codebase in my mind anymore. I can just focus on a small portion, make it work, and then see if I didn't break anything by running the tests. Before, when I didn't have any tests, I would have to consider all dependent code in my mind, and start checking manually whether those other pieces of code still worked. The tests help me relax and save literal headaches.</p>
<h2 id="heading-semi-automated-testing">Semi-Automated Testing</h2>
<p>So the first advice is simple: if you ever find yourself running the same manual tests, for example by executing a function with example inputs and checking that the output matches your expectation, then it's time to automate that manual work by writing explicit tests!</p>
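<p>As a minimal sketch of what that automation can look like in Julia, using the <code>Test</code> standard library: the function <code>celsius_to_kelvin</code> is a hypothetical stand-in for your own code, and the checks are the ones you might otherwise type by hand in the REPL.</p>
<pre><code class="lang-julia">using Test

# Hypothetical function under test; it stands in for your own code.
celsius_to_kelvin(c) = c + 273.15

@testset "celsius_to_kelvin" begin
    # The same checks you would otherwise run by hand, now written down once.
    @test celsius_to_kelvin(0.0) == 273.15
    @test celsius_to_kelvin(-273.15) == 0.0
end
</code></pre>
<p>Once the checks live in a file, rerunning all of them is a single command instead of a memory exercise.</p>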
<p>There are many other benefits to automated testing, but the primary reason to get started as an individual is simply to save yourself the effort of running endless scenarios by hand from memory. Manual testing is un-scalable as the code base grows. How can you be sure you didn't break something a colleague of yours is using?</p>
<h2 id="heading-fully-automated-testing">Fully automated testing</h2>
<p>Writing tests and running your tests manually is a big step up from having no tests at all. Unfortunately people can forget to run the tests. To increase the confidence in your code, you can automate the tests for every change that is made to your software, and only allow changes that pass the tests.</p>
<p>In our modern age, a junior scientist can single-handedly set up an automated testing system, at least for open source projects on GitHub. In the Julia ecosystem, which is mostly written by scientists, 89% of packages have automated tests.</p>
<p>Unless you are forced to set up infrastructure inside your own organization, automating your test suite should be relatively low effort, yet high reward.</p>
<p>To get you started, you can read my previous article about <a target="_blank" href="https://scientificcoder.com/automate-your-code-quality-in-julia">how to automate your tests and code quality</a>.</p>
<h2 id="heading-regression-testing">Regression Testing</h2>
<p>The rest of this article will mostly be about the types of tests you can write. Consider them as best practices if you like.</p>
<p>Regression testing is probably the simplest form of testing, and typically what people do intuitively already in manual testing. It's all about checking that the results of your functions reproduce. Run the code with known inputs and check that the outputs match with expected values. If you have no well known reference, these expected values can come from historical runs of your own code.</p>
<p>Basically all you do is this: <code>f(x,y) == expected_value</code></p>
<p>Stochastic processes are harder to test this way. You may set a seed to keep the code deterministic. Or check that the output falls within some expected distribution. Or focus on testing the non-stochastic components of your code.</p>
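<p>Here is a small sketch of the seeding and range-checking options, assuming a hypothetical stochastic function <code>noisy_mean</code>. Note that Julia's built-in random streams are not guaranteed to stay identical between Julia versions, so if you store historical values from a seeded run, a package such as StableRNGs.jl may be a safer choice than the default generators.</p>
<pre><code class="lang-julia">using Test
using Random

# Hypothetical stochastic function: estimates a mean from noisy samples.
noisy_mean(rng, n) = sum(randn(rng, n)) / n

@testset "regression tests for stochastic code" begin
    # The same seed gives the same output, so the result reproduces run to run.
    @test noisy_mean(MersenneTwister(42), 1000) == noisy_mean(MersenneTwister(42), 1000)

    # Alternatively, check that the output falls within an expected range.
    # The standard error here is about 1/sqrt(1000) ≈ 0.03, so 0.2 is a wide margin.
    @test abs(noisy_mean(MersenneTwister(42), 1000)) &lt; 0.2
end
</code></pre>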
<h2 id="heading-boundary-testing">Boundary Testing</h2>
<p>Difficult, or rare, uses of the code are often called "corner cases" or "boundary cases", as they exist somewhere on the boundary of what your code can do. People often forget to test these cases, focusing all effort on verifying typical use cases.</p>
<p>Sometimes people call this "good weather" versus "bad weather" testing. Good weather is the typical use of your code, with input data in some normal operating range. Bad weather happens when less expected input data leads to less expected behavior in your code.</p>
<p>Errors are very common corner cases, or "bad weather". Don't forget to test errors. Errors and their messages are extremely important for users and developers of your software to figure out what went wrong and how to fix the mistakes. Junior developers always underestimate the importance of good error messages.</p>
<p>Other corner cases will depend on the domain you are modeling with your code. If you are simulating fluid dynamics in metal pipes typically ranging from 10 cm to 50 cm, but the user may input 500 cm, then you have to consider that corner case. Do you throw an error beyond a certain range, or provide a warning that the behavior may be incorrect, or test the behavior properly even though most users will never go there? These are all decisions to be made by you, the programmer.</p>
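<p>To make the pipe example concrete, here is a sketch in which I chose to throw an error outside the supported range. The function <code>pipe_area</code> and the exact limits are assumptions for illustration; the point is that both the edges of the range and the error itself get a test.</p>
<pre><code class="lang-julia">using Test

# Hypothetical model function; the valid range of 0.1 m to 0.5 m is an
# assumption taken from the pipe example above.
function pipe_area(diameter_m)
    if !(0.1 &lt;= diameter_m &lt;= 0.5)
        throw(DomainError(diameter_m, "pipe diameter must be between 0.1 m and 0.5 m"))
    end
    return pi * (diameter_m / 2)^2
end

@testset "pipe_area boundaries" begin
    @test pipe_area(0.1) ≈ pi * 0.05^2        # lower edge of the range
    @test pipe_area(0.5) ≈ pi * 0.25^2        # upper edge of the range
    @test_throws DomainError pipe_area(5.0)   # 500 cm is out of range
    @test_throws DomainError pipe_area(-0.1)  # nonsensical input
end
</code></pre>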
<p>Extreme input values can also lead to numerical instabilities, which brings us to the next section.</p>
<h2 id="heading-numerical-instabilities">Numerical instabilities</h2>
<p>You may have the most beautiful math and science, but when you write code, you'll have to understand some of the limitations of computer hardware. Mathematical problems may be poorly conditioned, but the numerical algorithms can also be a source of errors and mistakes. Numerical instability is about poorly conditioned computer algorithms, even though the math behind it is well conditioned.</p>
<p>For example, a common source of mistakes is floating point arithmetic. Be careful with math that combines very big and very small numbers. For example, when using 64-bit floating point numbers in Julia, we can get:</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-number">10</span>^<span class="hljs-number">10</span> + <span class="hljs-number">10</span>^<span class="hljs-number">-6</span> - <span class="hljs-number">10</span>^<span class="hljs-number">10</span>
1.9073486328125e-6

julia&gt; <span class="hljs-number">10</span>^<span class="hljs-number">10</span> + <span class="hljs-number">10</span>^<span class="hljs-number">-7</span> - <span class="hljs-number">10</span>^<span class="hljs-number">10</span>
0.0
</code></pre>
<p>In both cases we expect to get back the small value in the middle, since <code>x + y - x = y</code>, but that's not what we get. We can find the wrong value of <code>y</code> or even obtain a zero. These kinds of issues happen because numbers in computers are represented with finite precision, as a trade-off to limit the amount of allocated memory.</p>
<p>This example may seem silly, but if you do any kind of linear algebra with matrices that contain a wide range of values, you may quickly run into such problems without noticing.</p>
<p>For our testing strategy, one simple take-away from floating point arithmetic is to use approximate equalities instead of identical equality checks, so test that <code>x ≈ 5.0</code> instead of <code>x == 5.0</code>. What tolerances you find acceptable in your comparisons is another big decision you will have to make.</p>
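<p>A short sketch of approximate comparisons with Julia's <code>isapprox</code> (the function behind <code>≈</code>). The tolerance values here are arbitrary choices for illustration, not recommendations:</p>
<pre><code class="lang-julia">using Test

x = 0.1 + 0.2

@testset "approximate equality" begin
    @test x != 0.3                         # exact comparison fails due to rounding
    @test x ≈ 0.3                          # default relative tolerance, about 1.5e-8 for Float64
    @test isapprox(x, 0.3; rtol=1e-12)     # explicit relative tolerance
    @test isapprox(1e-10, 0.0; atol=1e-8)  # comparing against zero needs an absolute tolerance
end
</code></pre>
<p>The last line matters more often than you might think: a relative tolerance is meaningless when the expected value is exactly zero, so <code>1e-10 ≈ 0.0</code> is false without <code>atol</code>.</p>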
<p>In general, read a good book like <a target="_blank" href="https://tobydriscoll.net/fnc-julia/frontmatter.html">Fundamentals of Numerical Computation</a> to get an idea of the interplay between math and computers.</p>
<h2 id="heading-toy-examples">Toy examples</h2>
<p>If you have some complicated, multi-dimensional, multi-physics simulation software, you do not really know how it behaves. Actually you are using the software to figure out how your system behaves. So how can you test the behavior?</p>
<p>Well, you can probably compare your code to simpler problems that are well known, like toy models or analytical solutions. In cases where you do know the answer, your code should behave accordingly. If it doesn't match, then you know you have a fundamental error in your code somewhere.</p>
<p>For the more complex cases you are researching, you cannot check the end result, but you can test all the smaller components of your code. The unknown, untestable behavior probably resides in the interplay between all kinds of known smaller parts. As long as you know the smaller components behave according to known physical and mathematical principles, you have more trust in the aggregate.</p>
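<p>As an illustration of testing against analytical solutions, here is a sketch in which a simple trapezoidal integrator stands in for your real simulation code. The function <code>trapezoid</code> and the chosen tolerances are assumptions for the example:</p>
<pre><code class="lang-julia">using Test

# Hypothetical solver: a simple trapezoidal rule standing in for real simulation code.
function trapezoid(f, a, b; n=1000)
    h = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in 1:n-1
        total += f(a + i * h)
    end
    return total * h
end

@testset "analytical reference cases" begin
    # The integral of sin(x) from 0 to pi is exactly 2.
    @test isapprox(trapezoid(sin, 0, pi), 2.0; atol=1e-5)
    # The integral of x^2 from 0 to 1 is exactly 1/3.
    @test isapprox(trapezoid(x -&gt; x^2, 0, 1), 1/3; atol=1e-5)
end
</code></pre>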
<h2 id="heading-reference-datasets">Reference datasets</h2>
<p>Instead of finding simple toy examples, you can also look for reference code and datasets, either by searching the literature or by testing against alternative software packages. Your code should do something novel, else you would be using existing software, but there is probably overlap in behavior with other software packages. That overlap in functionality is the part you can check automatically to look for errors in the behavior of your code.</p>
<h2 id="heading-coverage-metrics">Coverage metrics</h2>
<p>Measuring how well your tests cover your source code is not really a testing strategy, but it is really helpful to learn where you can improve your testing strategy. Code that has no corresponding tests yet is low hanging fruit. And while code coverage is no guarantee that your testing strategy is perfect, it is a good first indication for others about how serious you are in your testing. This increases the trust they (and you yourself) may have in your code.</p>
<p>Once you get into code coverage metrics, you can slowly expand to other <a target="_blank" href="https://scientificcoder.com/automate-your-code-quality-in-julia">code quality metrics and tools</a>, to further increase that trust.</p>
<h2 id="heading-common-sense">Common Sense</h2>
<p>A simple way to invent tests is to use your common sense. Let's say you were given a piece of code from a colleague. How would you verify that the code is working properly? What would make you trust that code? Now figure out a way to codify that common sense check into an automated test. Done!</p>
<p>Most testing strategies are simply common sense. They are best practices found by legions of software developers around the world over the last decades. Stand on the shoulders of all that experience, but don't forget to use your own mind.</p>
<h2 id="heading-objections-to-testing">Objections to testing</h2>
<p>A common objection from scientists, against writing tests, is that their code evolves too fast. They do not really know what to test, because they are using the code to figure out the physics and science. So they say that there is no need to test.</p>
<p>This is not a valid argument I am afraid. Scientists are not really that special. Most professional software developers do not know exactly what their users want. They expect their code to evolve. In <a target="_blank" href="https://theleanstartup.com/">The Lean Startup</a>, the whole process of building a (software) company is described as a scientific cycle. Build a product based on assumptions, measure if users want it, learn from that, build some more and continue onwards. It's like a social science experiment, trying to figure out what humans want by building the code. Yet in professional software companies there is always a heavy focus on testing the code.</p>
<p>So I believe that the uncertainty and evolution of the code is no reason against testing. I believe the main reason scientists forgo testing is simply because they are never trained to think about the benefits of software testing. They code as a side project. But if the code is critical to your scientific results, you will be very happy to have the tests to prove the correctness of the code.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I strongly advise you to adopt software testing strategies to improve the correctness and reliability of your code. By starting to use some of the techniques I described, researchers can build trust in their software and ensure its quality. Embracing these best practices will not only save time and effort but also enhance the long-term research process.</p>
]]></content:encoded></item><item><title><![CDATA[The Nebulous Mysteries of Scientific Coding]]></title><description><![CDATA[There is a concept in meta-rationality called “nebulosity”. I will look up the definition later, but in my own words nebulosity means the following:

Nebulosity: a concept or problem is ill-defined. You cannot describe it perfectly. The boundaries of...]]></description><link>https://scientificcoder.com/the-nebulous-mysteries-of-scientific-coding</link><guid isPermaLink="true">https://scientificcoder.com/the-nebulous-mysteries-of-scientific-coding</guid><category><![CDATA[software development]]></category><category><![CDATA[Philosophy]]></category><category><![CDATA[reflection]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Sat, 03 Jun 2023 11:56:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1685191473723/77306972-8aa3-41d0-85f4-9efef520f341.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a concept in meta-rationality called “nebulosity”. I will look up the definition later, but in my own words nebulosity means the following:</p>
<blockquote>
<p>Nebulosity: a concept or problem is ill-defined. You cannot describe it perfectly. The boundaries of the concept are unclear.</p>
</blockquote>
<p>Nebulosity drives rational people crazy, it’s worse than NP-hard. Rational people <em>need</em> well defined problems. Even if you can prove that the problem cannot be solved, at least the problem itself should be <em>known</em>. But is this always possible?</p>
<p>You may have a problem that you can barely describe to yourself. You may feel some shape of it, intuitively in your mind, but you cannot explain it perfectly. You notice that it is especially difficult to explain the problem to people unfamiliar with the domain around the problem. There is only some vague shape you can gesture at. After wrestling with the problem for a long time, you may even begin to wonder whether there is a problem at all. This can be challenging if you built an identity or career around such a nebulous concept.</p>
<h2 id="heading-my-nebulous-problem">My nebulous problem</h2>
<p>The problem I have been wrestling with the last years has such nebulosity. It started simple. Software development is slow in our organization and many organizations around us. One part of the problem, that many people complained about, is that the organizations contain many scientists who do not know how to develop software properly, and the professional software engineers do not understand the scientific domain. This causes lots of errors, both in the communication and in the software itself.</p>
<p>(This is a nebulous problem that can be generalized to any profession that involves people who focus on learning the domain, instead of building the products, say a business analyst, a financial quant or a mechatronics designer. Generalizing a nebulous problem makes it more nebulous and even harder to solve. More people will feel the shape of the problem, but it applies less to their exact case. This is a nebulous problem faced by high level thinkers in general, leading to proposed solutions that do not apply to the context. We have a potentially recursive nebulosity growth here.)</p>
<p>The scientist vs engineer problem seems easy enough to fix. Simply teach the scientists the good practices of software engineering. Give them the right tools for their domain. Then they write better code and create better scientific software, or at least they learn to communicate better with software engineers. But this is a nebulous problem. It turns out many of the scientists do not want to learn software engineering skills. It will take too much time away from their real ‘science’ work. They are also not rewarded for getting better at coding, they are rewarded for finding insights and writing articles. This simple problem just became some kind of complex resource allocation problem; how much time should scientists spend on software skills so that it pays off in their career, without becoming a non-scientist? Then there is the fact that all their scientist friends around them are not great coders either. Why should they change first? Is it a peer-pressure problem? Or maybe they believe they are actually amazing coders, never having met better coders in professional settings. <em>“Look at how quickly I wrote these thousands of lines! I have been successfully working like this for 20 years! You cannot teach me anything.”</em> Never mind that their colleagues cannot understand the code, nor reproduce any of the results. Maybe it’s even a status thing, unlike engineers the scientists may look down upon building and coding? There are so many possible root causes.</p>
<p>Virtually all these problems are interpersonal human problems, not hard science puzzles like we find in math or physics. Interpersonal problems are virtually always nebulous and multi-faceted. Many rational-oriented people shy away from interpersonal problems, thus enhancing the problem instead of tackling it head-on. This is another nebulous problem. You cannot see the cloud from the inside. Sure, it’s a little foggy around here, but that’s always been the case. Yet the frustration remains.</p>
<p>Or is there really a problem? It's good to question your own beliefs from time to time. Can we argue the problem away?</p>
<p>Scientists focus on understanding the universe, and occasionally build something for that reason. Engineers focus on building stuff, and use their understanding of the universe for that. Perhaps these activities should be kept separate? Or perhaps separation of these types of people happens naturally in large organizations and we should accept that fact of life? Or maybe we should allow a third group to arise, scientific coders, an elite group of people who help bridge the gap between the two cultures? Problems can become opportunities, right?</p>
<p>I have spoken with managers who believe there is no problem. They are quite satisfied with the two culture separation. They prefer the scientists to only communicate their findings to the software engineers via another medium than code. Maybe math-like pseudocode, written in ambiguous text documents, or haphazardly explained in a few meetings. Or perhaps the scientists share the incomprehensible throw-away example code. "Ambiguous", "incomprehensible" and "irreproducible" are keywords here, because the documents are never clear to the engineers, the example code is complex and doesn't reproduce. The software engineers are quickly confused and give up on understanding all together. The scientists become frustrated with the miscommunication and perceived apathy of the software engineers. The product development is delayed and the resulting code behaves incorrectly.</p>
<p>This doesn't seem like an acceptable situation for me. Yet the proponents of improving scientific software engineering also seem confused. (That includes me.) No one knows the exact solution that can finally resolve the matter effectively. After many years of wrestling with this cloudy issue myself, I have learned a great deal, but have not succeeded in pinpointing the exact problem. Most of my success has come from finding other people who also experience this nebulous problem. People who cannot accurately articulate the root causes either, yet feel the pain and want to solve the matter. I started calling them “scientific coders”, but even that is nebulous; finding the right words to name these people.</p>
<p>This cloudy-ness has become a growing part of my career and professional curiosity. With this blog I hope to clarify my thoughts, to better describe the shape of the problem, and identify possible solution directions. The uncertainty around the problem definition does not reduce my confidence in moving forward.</p>
<h2 id="heading-nebulous-conclusion">Nebulous conclusion</h2>
<p>So, for now: breathe in, breathe out. Embrace this journey through the cloud. We can neither define nor solve the problem quickly. There is no shortcut that I know of.</p>
<p>If you are interested, here is the original definition of nebulosity that I referred to: <a target="_blank" href="https://metarationality.com/nebulosity">metarationality.com/nebulosity</a>. It describes nebulosity far more in-depth than I did. Actually the entire meta-rationality blog seems to revolve around nebulosity.</p>
<p>The concept of nebulosity is fascinating in itself. A big step in your personal development may come from the conscious choice to stare nebulosity in the face. To accept its existence. A lot of that personal development is dealing with uncertainty, because many people struggle with uncertainty in life. Once you see nebulosity, you cannot un-see it. You may notice that all concepts are a little nebulous. Nothing is perfectly defined.</p>
<p>Edsger W. Dijkstra, famous in many ways, seems to defy nebulosity by noting that <em>"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."</em> This is interesting on several levels. First of all, I slightly disagree since abstractions are <a target="_blank" href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">leaky</a>, so their precision will fail under the right circumstances. Secondly, you should read the context of his thoughts. This quote comes from a lengthy <a target="_blank" href="https://www.cs.utexas.edu/users/EWD/transcriptions/EWD03xx/EWD340.html">lecture</a> where he discusses all the misconceptions around programming. While the quote itself is about code, I can already see the two culture problem emerging in his talk as he laments about scientists who do not appreciate computers and programming. Observe how great thinkers struggle with this nebulosity, even as they confidently announce precision in some intellectual areas.</p>
<p>Here we come to the end of my introspection. I questioned whether to publish this blog post here on The Scientific Coder or on my personal website Functional Noise. Since I've applied nebulosity to scientific coding, this blog seemed like the right place. I believe it can help any of you deal with the stress and difficulties of being stuck inside this nebulous problem. Know that you are not alone and that it is no shame to struggle within this field of work.</p>
]]></content:encoded></item><item><title><![CDATA[Scientific Software Institutes]]></title><description><![CDATA[Have you ever gone through life completely oblivious to something? I recently experienced that sensation when I stumbled upon an entire ecosystem of institutions, only learning about them after starting this blog. These organizations are dedicated to...]]></description><link>https://scientificcoder.com/scientific-software-institutes</link><guid isPermaLink="true">https://scientificcoder.com/scientific-software-institutes</guid><category><![CDATA[institutes]]></category><category><![CDATA[organization]]></category><category><![CDATA[scientificsoftware ]]></category><category><![CDATA[numerical computing]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Wed, 31 May 2023 12:34:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1682844193827/1b0a1c9d-fb03-4a75-9ca0-d3cbcb52d52b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever gone through life completely oblivious to something? I recently experienced that sensation when I stumbled upon an entire ecosystem of institutions, only learning about them after starting this blog. These organizations are dedicated to promoting better scientific software, which aligns with the mission of my blog. I wanted to know what's going on, so let's have a look at what's out there.</p>
<p>I noticed the names of the fields of "scientific software" vary a little, but I consider all of these roughly equivalent:</p>
<ul>
<li><p>Scientific Software</p>
</li>
<li><p>Research Software Engineering (RSE)</p>
</li>
<li><p>Scientific Computing</p>
</li>
<li><p>Numerical Computing</p>
</li>
</ul>
<p>Yes, there are differences between them, but all of them involve turning scientific knowledge into algorithms and software, and writing software to do scientific research or other exploratory research. My apologies if someone has a strong feeling about a name meaning something entirely different from the others.</p>
<p>This article may not be my most interesting one, but I'd like to curate and store everything I've found for future reference. People on LinkedIn have already been kind enough to assist me when I <a target="_blank" href="https://www.linkedin.com/posts/matthijscox_i-just-learned-that-there-are-institutes-activity-7055607793361260544-pFzV">asked nicely</a>.</p>
<h2 id="heading-how-many-institutes-are-there">How many institutes are there?</h2>
<p>My goodness, so many!</p>
<p>My journey started by encountering a post about <a target="_blank" href="http://bssw.io">Better Scientific Software</a> on LinkedIn. I was impressed that this institute gives away $25,000 <a target="_blank" href="https://bssw.io/pages/bssw-fellowship-program">fellowship grants</a> to people helping to improve scientific software.</p>
<p>But after some searching and asking around, we can quickly find many more:</p>
<ul>
<li><p>BE-RSE - Belgium Research Software Engineers community</p>
</li>
<li><p>DE-RSE - Society for Research Software in Germany</p>
</li>
<li><p>NL-RSE - The community of Research Software Engineers in the Netherlands</p>
</li>
<li><p>NORDIC-RSE - Nordic Research Software Engineers Community</p>
</li>
<li><p>RSE-AUNZ - The RSE Association of Australia and New Zealand</p>
</li>
<li><p>SocRSE - Society of Research Software Engineering - UK</p>
</li>
<li><p>US-RSE - The US Research Software Engineer Association</p>
</li>
<li><p>Danish RSE - Danish Research Software Engineers Community</p>
</li>
<li><p>RSE Asia - You get the idea</p>
</li>
</ul>
<p>The list doesn't stop here, there are all kinds of more creatively named organizations:</p>
<ul>
<li><p><a target="_blank" href="https://www.esciencecenter.nl/">Netherlands eScience Center</a></p>
</li>
<li><p><a target="_blank" href="https://www.surf.nl/en/research-it">SURF</a> in the Netherlands, has research-oriented IT</p>
</li>
<li><p><a target="_blank" href="http://www.hwacc.nl/">Hardware Acceleration Network</a> in the Netherlands</p>
</li>
<li><p><a target="_blank" href="https://alliancecan.ca/en">Digital Research Alliance</a> of Canada</p>
</li>
<li><p>the <a target="_blank" href="https://ardc.edu.au/">Australian Research Data Commons</a> (ARDC)</p>
</li>
<li><p>the <a target="_blank" href="https://www.ncsa.illinois.edu/">National Center for Supercomputing Applications</a> (NCSA)</p>
</li>
<li><p>the <a target="_blank" href="https://www.software.ac.uk/">Software Sustainability Institute</a> (SSI)</p>
</li>
<li><p><a target="_blank" href="https://calcul.math.cnrs.fr/">Le Groupe Calcul</a>, French obviously</p>
</li>
<li><p>A bunch of FAIR initiatives seem related (the Findable, Accessible, Interoperable, Reusable data principles in science), like <a target="_blank" href="https://www.fairpoints.org/">fairpoints.org</a> and <a target="_blank" href="https://www.go-fair.org/">go-fair.org</a></p>
</li>
<li><p><a target="_blank" href="https://ideas-productivity.org/">IDEAS initiative</a> of the US Department of Energy</p>
</li>
</ul>
<p>Some of these groups provide grants to researchers. Others provide paid consulting services. Most of them seem to blog and try to create a "community", which is typically a Slack channel to chat, but sometimes includes dedicated conferences.</p>
<h2 id="heading-international-institutes">International Institutes</h2>
<p>Most of these organizations focus on the interests of a single nation, probably because most funding comes from governments. But there exist a few global institutes for scientific software.</p>
<ul>
<li><p>This <a target="_blank" href="https://researchsoftware.org/">Research Software Engineers International</a> organization claims to be an umbrella for many other RSE organizations across the globe.</p>
</li>
<li><p>But wait, there is another one, the <a target="_blank" href="https://www.researchsoft.org/">Research Software Alliance</a> (ReSA) that claims to be a worldwide RSE institute.</p>
</li>
<li><p>There is a UK-centric <a target="_blank" href="https://society-rse.org/join-us/">Society of Research Software Engineering</a>, but someone mentioned they have a very active international Slack channel. And this society organizes a global conference called <a target="_blank" href="https://rsecon23.society-rse.org/">RSECon</a>.</p>
</li>
<li><p>When it comes to conferences, there is the <a target="_blank" href="https://www.siam.org/">Society for Industrial and Applied Mathematics</a> (SIAM) which I know from their conference recently in Amsterdam.</p>
</li>
<li><p>There's a <a target="_blank" href="https://research-software-directory.org/">Research Software Directory</a> that tries to make an overview of ... you guessed it: research software. It tries to index known software packages, but also has an overview of all <a target="_blank" href="https://research-software-directory.org/organisations">contributing organizations</a>.</p>
</li>
</ul>
<p>The only organization I knew before all this is <a target="_blank" href="https://numfocus.org/">NumFOCUS</a>, which has a slightly different goal: promoting open-source numerical computing software, such as NumPy and Julia, and sponsoring conferences such as PyData and JuliaCon. Because of their visibility at conferences and their heavily used packages, they are much better known.</p>
<p>Competing with NumFOCUS, or maybe complementing it, is the <a target="_blank" href="https://chanzuckerberg.com/eoss/">Essential Open Source Software for Science</a> program by the Chan Zuckerberg Initiative. It seems to fund lots of open-source package improvements, many of them for NumFOCUS projects.</p>
<h2 id="heading-industry">Industry</h2>
<p>Very few of these institutes are focused on industry or industry collaboration. According to the Research Software Alliance (ReSA) on <a target="_blank" href="https://upstream-force11-org.cdn.ampproject.org/c/s/upstream.force11.org/the-research-software-alliance-resa/amp/">this webpage</a>, only 1/12th of their funding is from the industry. I have also noticed that most of the websites focus on academic research.</p>
<p>Who is doing numerical and scientific computing in the industry?</p>
<p>I bet a lot of companies. In our <a target="_blank" href="https://www.meetup.com/julialang-eindhoven/">JuliaLang Eindhoven Meetup</a> we would like to find everyone doing numerical computing in our area. Generalizing bluntly, I believe the Julia meetups attract numerical computing and scientific software enthusiasts, while PyData meetups attract more data science and AI enthusiasts.</p>
<p>Another trick to finding industry users could be looking at MathWorks and JuliaHub customers. You'll find examples from fields like automotive, semiconductors, finance, pharmaceuticals and many more. Successful numerical computing service providers are good at finding their industry users.</p>
<p>Maybe I will write down a good industry overview in another article. I am interested in learning how scientists do numerical computing and write software at places I haven't heard of yet.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>There are many organizations promoting and improving scientific software practices. I've only done a quick sweep through the field and found plenty. I may expand this list in the future with updated findings. Whether you are looking for grants or a community of like-minded individuals, you can get started with this overview. Or just be amazed like me that such organizations exist at all.</p>
]]></content:encoded></item><item><title><![CDATA[Clean Code Tips for Scientists #1 - Reproducible Environments]]></title><description><![CDATA[Author commentary: I am starting a "clean code" blog series with simple tips that you can integrate into your workflow. I often write long, complicated articles that try to teach a lot at once. This is an attempt to chop things up in bite-sized chunk...]]></description><link>https://scientificcoder.com/clean-code-tips-for-scientists-1-reproducible-environments</link><guid isPermaLink="true">https://scientificcoder.com/clean-code-tips-for-scientists-1-reproducible-environments</guid><category><![CDATA[Julia]]></category><category><![CDATA[clean code]]></category><category><![CDATA[software development]]></category><category><![CDATA[Science ]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Wed, 24 May 2023 12:03:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1684413997401/753bc424-1698-4e8a-a951-cc7727a09e68.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Author commentary: I am starting a "clean code" blog series with simple tips that you can integrate into your workflow. I often write long, complicated articles that try to teach a lot at once. This is an attempt to chop things up in bite-sized chunks. Note that the Clean Code books by Robert Martin are great, you should read them if you have time! If not, you can follow these short articles :)</em></p>
<p>If you've written a lot of scripts and shared some of those scripts with colleagues or others, then you probably encountered the problem that the code doesn't always work on their device, or produces different results. When this happens, people may quickly lose trust in your results and begin to ignore your work entirely. So making code reproducible is extremely important! Even if you are a scientist and not a professional software developer. I'll explain a simple strategy you can take to make your code more reproducible.</p>
<h2 id="heading-code-environments">Code Environments</h2>
<p>First we must take a small step back from your code. Because when you write your script, it is not standalone. It exists in a certain "environment". Besides the hardware of your computer and your operating system, this involves your programming language version and all the (open-source) packages you used to run your code.</p>
<p>When sharing the environment with someone else, you do not want to give them your computer, right? Nor do you want to send all the dependent package code on your computer, because that can easily become gigabytes of packages and dependencies. The environment may not even work exactly on their computer. All kinds of issues may make relocating the environment difficult, for example if they use a different operating system (Linux instead of Windows).</p>
<p>Instead, you want to share a way to install an exact copy of your environment, by sharing the exact <em>configuration</em> of packages you used.</p>
<h2 id="heading-python-environments">Python Environments</h2>
<p>In Python you typically share your dependencies with a <code>requirements.txt</code> file. You can find plenty of blog posts online about this approach, like <a target="_blank" href="https://note.nkmk.me/en/python-pip-install-requirements/">here</a>. There are also alternatives like <a target="_blank" href="https://python-poetry.org/docs/managing-environments/">Poetry</a> that try to make Python environment management easier for you.</p>
<p>I won't go into the details of Python environments here, but please know it's possible. Instead I'd like to show how this problem is tackled in the Julia language. If you prefer another language, then you can consider this an example.</p>
<h2 id="heading-julia-environments">Julia Environments</h2>
<p>In Julia everything can be done with the built-in package manager.</p>
<p>Let's say you have your very important script file. It looks something like:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> DataFrames, LinearAlgebra
<span class="hljs-comment"># much important code for your colleagues</span>
</code></pre>
<p>What you want to share is the <em>exact</em> same versions of the packages you are using to run this script, including all the package dependencies (for example DataFrames v1.5 is using DataAPI v1.14 under the hood). If you can easily send that knowledge to your colleague, then you can be sure they will get the same results.</p>
<p>Start with an empty environment. Add all the packages you use for your script. You can use the <a target="_blank" href="https://docs.julialang.org/en/v1/stdlib/Pkg/">Julia Pkg mode</a> on the REPL with <code>]</code>, or write something like this:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> Pkg
Pkg.activate(<span class="hljs-string">"ExperimentNinetyFive"</span>)
Pkg.add([<span class="hljs-string">"DataFrames"</span>, <span class="hljs-string">"LinearAlgebra"</span>])
</code></pre>
<p>You will now have a folder called <code>ExperimentNinetyFive</code> on your device, with two files inside: a <code>Project.toml</code> and a <code>Manifest.toml</code>. The <code>Project.toml</code> simply lists the packages. The <code>Manifest.toml</code> is what describes your exact environment:</p>
<ul>
<li><p>The Julia version</p>
</li>
<li><p>All packages you added with their version, such as DataFrames version 1.5.0</p>
</li>
<li><p>For each package, a list of its own dependencies, such as DataAPI for DataFrames.</p>
</li>
<li><p>For each dependent package it specifies the version, such as version 1.14.0 for the DataAPI package.</p>
</li>
</ul>
<p>Here's a picture showing a snippet of the Manifest.toml (it's 234 lines in total for me):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683624043938/00abd1ce-ead8-432d-b710-524e60e44b46.png" alt class="image--center mx-auto" /></p>
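<p>If you prefer to inspect the environment from Julia instead of opening the file, here is a minimal sketch. The helper name <code>environment_versions</code> is my own invention for illustration; it only relies on <code>Pkg.dependencies()</code> from the built-in package manager, and it assumes you have already activated the environment you want to look at.</p>
<pre><code class="lang-julia">using Pkg

# Hypothetical helper: list every package in the active environment
# (direct and indirect dependencies) with its resolved version.
function environment_versions()
    infos = values(Pkg.dependencies())
    # Some standard libraries may report no version, so fall back to a label.
    lines = String["$(info.name) $(something(info.version, "stdlib"))" for info in infos]
    return sort(lines)
end

foreach(println, environment_versions())
</code></pre>
<p>The printed list should match what you see in the <code>Manifest.toml</code>, which makes it a quick sanity check before you send the folder to someone.</p>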
<p>To share a reproducible environment with a colleague, all you need to do is put the script inside the same folder, and then zip it, or push it to a repository, or whatever way you prefer, and send it to your colleague. After receiving your code, all your colleague now needs to do is this:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> Pkg
cd(<span class="hljs-string">"path/to/ExperimentNinetyFive"</span>)
Pkg.activate(<span class="hljs-string">"."</span>)
Pkg.instantiate()

<span class="hljs-comment"># and then they can run the script</span>
include(<span class="hljs-string">"another_script.jl"</span>)
</code></pre>
<p>The function <code>Pkg.instantiate</code> will install all the packages exactly according to the <code>Manifest.toml</code>. So your colleague will use the exact same versions as you did.</p>
<p>That's it! Modern programming languages come with a simple package manager for the purpose of sharing reproducible code.</p>
<p>If your code is meant to be re-used inside other people's code, the next step would be to make a package that can be installed and updated automatically (instead of emailing your script). Packages are essentially installable code, including a reproducible environment and preferably things like documentation and tests. But that's for another blog post.</p>
<p>In general: never only share your code. Share a reproducible way to set up your coding environment as well!</p>
<h2 id="heading-appendix">Appendix</h2>
<h4 id="heading-warning-you-inherit-the-global-shared-environment">Warning: You inherit the global shared environment!</h4>
<p>What do I mean by this? Let me briefly explain. When you start a Julia REPL you typically start in the global environment, like <code>@v1.8</code>. If you install packages in <code>@v1.8</code> and then switch to another environment, those packages are still available. This means you may accidentally forget to add those packages to your new environment, because your script just works. But the environment you share with the <code>Manifest.toml</code> is still not reproducible for someone else! It's missing some dependencies.</p>
<p>To avoid this problem, and other issues, I typically keep my global environment as clean as possible, with only a few utility packages that I only use on the REPL, such as <code>Revise</code> and <code>OhMyREPL</code> and <code>LocalRegistry</code>. This way I keep all my environments separate.</p>
<p>Similarly, be careful when switching environments within a single Julia REPL session. I would advise testing your script once in a fresh REPL before you send it to others.</p>
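<p>One way to catch this early is to compare the packages your script loads against what the active <code>Project.toml</code> actually declares. This is only a sketch: <code>missing_from_project</code> is a hypothetical helper of mine, and you still have to type the list of package names yourself.</p>
<pre><code class="lang-julia">using Pkg

# Hypothetical helper: which of the packages my script uses are NOT declared
# in the active Project.toml? Those only load thanks to the global environment.
function missing_from_project(used::Vector{String})
    declared = keys(Pkg.project().dependencies)
    return [name for name in used if !(name in declared)]
end

# The packages from the `using` line of the script
missing_from_project(["DataFrames", "LinearAlgebra"])
</code></pre>
<p>An empty result means every listed package is declared in the active project. Anything that is returned should be added with <code>Pkg.add</code> before you share the environment.</p>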
<h4 id="heading-pluto-does-it-all">Pluto does it all</h4>
<p><a target="_blank" href="https://github.com/fonsp/Pluto.jl">Pluto notebooks</a> are designed to be reproducible. Under the hood they contain the package environment inside them (check by viewing the Pluto <code>.jl</code> files in your favorite text editor). This can make it easier to share a Pluto notebook instead of a script or package.</p>
<p>Other programming languages probably have other solutions for easy sharing of environments and scripts (though Jupyter notebooks do not do this well). Or you can try online editors like <a target="_blank" href="https://replit.com/">Replit</a>, which maintain the environment for you. I would still advise to understand how package environments work in your favorite programming language, because you cannot use notebooks for everything. And <a target="_blank" href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">leaky abstractions</a> are always a good reason to occasionally look under the hood.</p>
]]></content:encoded></item><item><title><![CDATA[Building a Scalable Inner-Source Ecosystem For Collaborative Development]]></title><description><![CDATA[Three years ago, we decided to embrace the Julia programming language to solve the two language problem at our organization. We want our scientists to join forces with software engineers so that they can work on the same problems together. In our jou...]]></description><link>https://scientificcoder.com/building-a-scalable-inner-source-ecosystem-for-collaborative-development</link><guid isPermaLink="true">https://scientificcoder.com/building-a-scalable-inner-source-ecosystem-for-collaborative-development</guid><category><![CDATA[Julia]]></category><category><![CDATA[Ecosystem Building]]></category><category><![CDATA[management]]></category><category><![CDATA[software development]]></category><category><![CDATA[inner-source]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Wed, 17 May 2023 12:29:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1684239844476/70964039-8d34-410b-8507-79bfd01dcbc0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Three years ago, we decided to embrace the Julia programming language to solve the <a target="_blank" href="https://scientificcoder.com/how-to-solve-the-two-language-problem">two language problem</a> at our organization. We want our scientists to join forces with software engineers so that they can work on the same problems <em>together</em>. In our journey, I could have used more books or blogs to guide us on the following topics:</p>
<ol>
<li><p>How to build and deploy software products with the Julia language?</p>
</li>
<li><p>How to create the seeds for an effective scientific software ecosystem?</p>
</li>
</ol>
<p>This article is here to help you with the second topic, but I warn you that we had to figure out 1 and 2 at the same time. I intend to write more blog posts about the Julia productization aspects. Yet in the long term, I am betting on the ecosystem to radically improve our organization, so I consider that more important to blog about.</p>
<p>One thing I must continually emphasize is that the technology alone, regardless of how wonderful Julia is, cannot change people. What I needed was an environment where our scientists could contribute to product development in a rewarding way, while upholding the quality standards of modern software engineering. Additionally, we required a setup that could eventually scale to thousands of engineers.</p>
<h2 id="heading-an-ecosystem-blueprint">An Ecosystem Blueprint</h2>
<p>We had to figure out everything from scratch. I hope this article will help fledgling ecosystem architects to gain a head start in their organization. Consider it a guide, or a blueprint, but be mindful of the unique needs of your own organization. I will use my own experience as an example, see my <a target="_blank" href="https://live.juliacon.org/talk/EKZHPS">JuliaCon presentation</a> for more information.</p>
<p>As architects of this ecosystem, our main design choice was to have a development workflow that feels similar to being an open-source developer, to help onboard scientists and engineers with little friction. You should be able to install internal packages via a standard package manager, <code>Pkg</code> in the case of Julia. You can use your favorite IDE, though we advised VS Code due to the Julia plugin maturity. You work with Git and code reviews. You have automated testing pipelines to check your commits and pull requests. Ideally, all code and tools are available to all engineers, that's what I call "inner-source". Everything should feel instantly recognizable, even if we use slightly different tools and practices than the open-source community.</p>
<h2 id="heading-types-of-repository-structures">Types of Repository Structures</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684134094157/43eee365-c40e-48c5-b99e-17e872db8fd7.jpeg" alt="a graphic showing the different options of placing packages inside one or more repositories" class="image--center mx-auto" /></p>
<p>To begin, we will have to choose how we kickstart our codebases. As mentioned I wanted a full-fledged inner-source ecosystem. However, different organizations may have different desires. On a high level, I can imagine the following scenarios for your development organization:</p>
<ol>
<li><p>A single repository, with a single monolithic package. Probably with internal submodules as it grows bigger.</p>
</li>
<li><p>A single repository, but with multiple packages inside. With or without a registry.</p>
</li>
<li><p>A multi-repository, multi-package setup. Similar to the public open-source ecosystem you observe on GitHub, including a separate registry.</p>
</li>
</ol>
<blockquote>
<p>What is a package registry? A registry is merely a lookup table with links to all the packages in your organization. A package manager uses this registry to find and install packages for the users, including all the package dependencies. For example, see the Julia <a target="_blank" href="https://github.com/JuliaRegistries/General">General Registry</a>, or the <a target="_blank" href="https://pypi.org/">Python Package Index</a>. I advise setting up a separate <a target="_blank" href="https://github.com/GunnarFarneback/LocalRegistry.jl">local registry</a> in your organization for your internal packages.</p>
</blockquote>
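<p>To make the "lookup table" idea concrete, here is a toy model in Julia. It is deliberately simplified and not how <code>Pkg</code> stores things: the package names and URLs are made up, and a real registry also tracks UUIDs, versions and compatibility bounds.</p>
<pre><code class="lang-julia"># Toy registry: package name => repository link and direct dependencies.
# All names and URLs are hypothetical.
registry = Dict(
    "PlotTools"  => (repo = "https://git.example.com/PlotTools.jl",  deps = ["Units"]),
    "Units"      => (repo = "https://git.example.com/Units.jl",      deps = String[]),
    "ProductAPI" => (repo = "https://git.example.com/ProductAPI.jl", deps = ["PlotTools", "Units"]),
)

# Collect a package plus everything it depends on, directly or indirectly.
function install_list(registry, name, found = String[])
    if name in found
        return found
    end
    push!(found, name)
    for dep in registry[name].deps
        install_list(registry, dep, found)
    end
    return found
end

install_list(registry, "ProductAPI")
</code></pre>
<p>Asking for <code>ProductAPI</code> also pulls in <code>PlotTools</code> and <code>Units</code>. That is essentially the service a registry provides to the package manager: one name in, a full list of things to fetch out.</p>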
<p>A mono-repo with multiple packages seems common among startups, see this discussion about <a target="_blank" href="https://discourse.julialang.org/t/how-beacon-packages-julia-code-in-a-monorepo/90822">Beacon Biosignals' approach</a> and the responses from others. But I have also heard about startups who use option 1: a mono-package setup.</p>
<p>The advantage of the first option, a mono-package, is that you need no serious package management. No local registry is needed to install dependencies. You clone and go. The downside is that this single package can quickly grow big and clunky, slowing down pre-compile times and maybe coupling internal interfaces. If you want to develop quickly and deploy separate modules to separate products, how will you disentangle everything? I also don't know how difficult it is to move from option 1 to option 2 once you are over-invested. Overall, I would advise option 2 to get started, if option 3 (multi-repo) seems too scary. If you immediately set up a registry and make it part of the workflow, then splitting off packages into multiple repositories should be easy in the long run.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684139537457/4af00d8c-973d-41cd-aefa-7f8147958535.jpeg" alt="Our initial repository structure was a blend between a big repository with multiple packages, and multiple 'common facility' packages in their own repository." class="image--center mx-auto" /></p>
<p>At the beginning of our journey, I immediately aimed for option 3, the multi-repository setup, because I wanted to mimic an open-source ecosystem. After a few months of working with 3 developers on our first product, we restructured the packages into a hybrid approach with one big repository with all our main product-related packages, and a bunch of satellite packages that were well-defined and reusable by future projects. The main package architecture was still rapidly evolving and the dependencies between them were not entirely clear. In the multi-repo scenario this forced us to open a lot of pull requests at once into multiple repositories for every change. To find the balance, we went with a hybrid approach. I've seen open-source projects like <a target="_blank" href="https://github.com/MakieOrg/Makie.jl">Makie</a> migrate to a similar hybrid setup.</p>
<p>Today we still work with this approach in our department, sometimes spinning off packages out of the big repository into a separate repository whenever we think it's a common facility useful for other teams or departments. If we know at the start that a package is common, we typically immediately start it in a separate repository. Other departments sometimes follow our hybrid approach, or start with a full multi-repository setup, depending on their development needs.</p>
<h2 id="heading-types-of-packages">Types of Packages</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684137582238/1b7c439a-cf6e-45a1-9f8f-7c1213a99e7a.jpeg" alt="a diagram describing the different types of packages and their relations" class="image--center mx-auto" /></p>
<p>Next to setting up a repository structure in a version control system, I needed to distinguish between different types of packages, which require different ways of handling them.</p>
<ul>
<li><p>Open-source packages, which may need to be checked and approved.</p>
</li>
<li><p>Inner-source packages, which are useful for multiple groups besides your own, or even common for the entire organization. These packages may depend on open-source packages.</p>
</li>
<li><p>Domain-specific or product-specific packages, that only apply to a single group in your organization, where access might be restricted to a need-to-know basis. These packages may depend on inner-source and open-source packages.</p>
</li>
<li><p>Integration packages. These are end-points from the development ecosystem perspective. For example, a Julia REST server which provides an API around a set of domain-specific packages and gets deployed into a cloud application. Or a package that gets compiled and integrated into C++. Multiple domains may collaborate and deploy together, or independently, that depends on your product environment.</p>
</li>
</ul>
<p>I will not go into the deployment considerations in this article. But I often had to explain to managers these different types of packages and their relationship to the final product.</p>
<p>We are also currently considering adding some kind of tags to certain package versions, to distinguish maturity levels or use-cases:</p>
<ul>
<li><p>Prototyping packages for personal projects, or very early research explorations among a few scientists.</p>
</li>
<li><p>Research packages, used by many scientists, but not used (yet) in commercial products. Plotting and data analysis packages typically fall in this category.</p>
</li>
<li><p>Tooling packages, used for testing or deployment. Important for developers, but not shipped to production.</p>
</li>
<li><p>Production-grade packages, shipped to customers. These should not break!</p>
</li>
</ul>
<p>We're still working on the exact details. Typically it's pretty clear which package is what: if a package is still at version 0.x.y, it's probably a prototype or research package. But there is mobility between the package types. A research package can suddenly become integrated into a new product, at which point we need to address the quality and reliability of the code, and make it clear to the researchers how to continue working with this more mature package.</p>
<h2 id="heading-typical-developer-workflow">Typical Developer Workflow</h2>
<p>As an inner-source ecosystem architect, the developers are your customers. You should design the system such that it supports an ideal developer workflow. For your developers it should feel low-effort, and rewarding, to make high quality deliveries.</p>
<p>We typically consider two types of profiles:</p>
<ul>
<li><p>Package users that do not develop packages, such as data analysts.</p>
</li>
<li><p>Package developers that write the package code.</p>
</li>
</ul>
<p>Often the package user and the package developer are the same person, especially at the beginning of your ecosystem when there are no packages yet. Therefore I focused most of my effort on making the life of the developer easier, hoping that the developers will make the user's life easier.</p>
<p>The developer workflow is not linear, but if I have to linearize it for the sake of this article, I would divide it into the following steps, all of which need software infrastructure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684144831439/e3e3c860-62e5-4dcd-8fc4-98c0bb16b74f.jpeg" alt="a diagram showing the steps in the developer workflow" class="image--center mx-auto" /></p>
<p><strong>Explore Packages</strong> - Anytime you start, you'll probably have to explore existing packages, to figure out if something already exists out there in the world before inventing it yourself. You want infrastructure that makes it easy to search and discover packages. And it should be easy to read the package documentation.</p>
<p><strong>Prototyping</strong> - Once you found the interesting packages, you probably want to do some prototyping or data analysis, or whatever is necessary to figure out your new requirements. In this phase you are still a passive user of the ecosystem, merely installing packages. But package installation should be easy with a registry and package manager in place. Simply call <code>Pkg.add("InternalPackage")</code> and use it in your development environment.</p>
<p><strong>Developing</strong> - Once you contribute to an existing package, or develop a new package, you'll have to use standard GIT tooling to clone the code and commit new changes. You write tests and when committing changes to a package, the continuous integration systems get triggered, automatically running all tests and other checks, similar to an open-source contribution.</p>
<p><strong>Monitoring</strong> - During development, any contribution is already qualified, but as developer or package owner you want to monitor the code quality over time, with metrics such as code coverage, to make sure your packages are continuously improving. (To be fair, we enabled this step last.)</p>
<p><strong>Sharing</strong> - After creating a new package version, you want to update the internal registry to share this new version with others via the package manager. You also want to create and host the updated package documentation. This step may or may not be fully automated in your organization.</p>
<p><strong>Deployment</strong> - Packages that are released as formal products should be deployed, as libraries or microservices or otherwise. We automated this in the main branch of the product integration packages, once a new version is released.</p>
<h2 id="heading-package-development-infrastructure">Package Development Infrastructure</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684162705610/51e9f5e8-c8cf-426b-bf42-f9f643253533.jpeg" alt="a diagram showing the various parts of the development infrastructure" class="image--center mx-auto" /></p>
<p>We need many tools to support a developer-friendly workflow inside a large ecosystem. We grew all this infrastructure organically over the years, solving one bottleneck at a time. Here are some of the many tools and practices we worked on, roughly in chronological order:</p>
<ul>
<li><p>I assume you already have a Git repository hosting system in place, such as GitLab, GitHub or Bitbucket.</p>
</li>
<li><p>I also assume you have some Identity and Access Management (IAM) layer in place, so users can connect to the infrastructure tools with your company credentials. I'm adding this layer explicitly since access rights are often a source of bureaucracy and frustration. The "inner-source" concept can be at odds with IT security people who want to restrict all access by default.</p>
</li>
<li><p>With basic IT tools in place, the first thing I did was set up a <a target="_blank" href="https://github.com/GunnarFarneback/LocalRegistry.jl">local registry</a>. With Julia this is easy since the registry is just another repository. I did it in an afternoon and it saved us endless effort. With Python I know that it's a bit more tricky to set up a local PyPI.</p>
</li>
<li><p>Create your first packages and figure out their dependencies, together with a high level strategy for repository ownership among the different projects and groups within your organization. This is a complex topic, and the structure will evolve, but a solid start based on real domain experience helps a lot here.</p>
</li>
<li><p>The value of workshops and courses that teach multi-package development should not be underestimated, especially for scientists with limited software experience. Be ready to endlessly explain Git, SSH keys, test-driven development, CI/CD, language fundamentals, and much more.</p>
</li>
<li><p>Automated pipelines for each package are crucial. This enables scientists and developers to automate their quality checks, instead of running tests and checks manually. If this step is easy, scientists will use it for their research packages, giving them early DevOps training. This makes the later handover to software engineers more pleasant and reproducible.</p>
</li>
<li><p>Documentation hosting for each package, to explain the API and provide examples for package users. I hope the benefit of good documentation is clear to everyone. Maybe in the future an AI can automatically explain how your code works, but not today.</p>
</li>
<li><p>Production-grade build pipelines, as an extension to the existing automated testing pipelines, to integrate packages into bigger systems. Otherwise, your scientists will be spending endless time manually compiling and delivering code. Is that a good use of their time?</p>
</li>
<li><p>Qualification pipelines, where we test all registered packages at once. This was first built to qualify new Julia language versions before rolling them out, but we run it more often for ecosystem-wide testing. Maybe we'll go to a nightly run like the open-source community performs <a target="_blank" href="https://github.com/JuliaCI/NanosoldierReports/tree/master/pkgeval/by_date">here</a>.</p>
</li>
<li><p>IT Security assessments of our open source code usage. For example, to enforce correct license usage and guard against supply chain attacks. In general, you probably want an internal mirror server where you store all (approved) open-source packages.</p>
</li>
<li><p>Formalized coding standards and style guides. If you don't have these, a lot of time in code reviews will be wasted on silly aesthetic arguments about the best way to define a function.</p>
</li>
<li><p>... and much more</p>
</li>
</ul>
<p>It's important to quickly build a solid foundation that supports your developers from day one and nudges them towards delivering high quality. After that you can continuously improve the workflow of your developers, adding, removing or changing tools as required.</p>
<p>I am very happy that we now have a serious set of development tools in place, supported by amazing DevOps and IT engineers. This took considerable time and effort to get into place. As the architect, you'll need long-term commitment to make this happen.</p>
<h2 id="heading-package-architecture">Package Architecture</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684248543354/d064a18a-67ee-4d2f-b128-3b4d3f55beb6.jpeg" alt class="image--center mx-auto" /></p>
<p>A difficult topic, especially when working with many scientists with little software engineering background, is figuring out the best configuration of packages. What should you put in which package? How many packages do you need? When should you split a package? What should each package's API look like? How can packages work seamlessly together where necessary? How should packages depend on each other?</p>
<p>I advise studying <a target="_blank" href="https://martinfowler.com/bliki/DomainDrivenDesign.html">Domain-Driven Design</a> (DDD). Beyond that, it's endless tinkering based on real-world experience of your business domain, continuously refactoring, and hoping <a target="_blank" href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway's law</a> doesn't get in your way. Especially at the start, I was heavily involved in writing the code of many of our packages, building products while trying to avoid big bottlenecks for the long term. I am sorry to say that we have not found a shortcut for deciding on the right package architecture.</p>
<p>One trick that helps us to refactor safely is to define an "interface package" that your users can rely upon. Then you can refactor and restructure packages behind that interface while keeping the interface package itself backward compatible. Users do not enjoy constantly re-learning how your package works. In the previous section named "Types of Packages" I discussed the integration packages, which also serve this interface package purpose, if you consider the production systems as a user.</p>
<h2 id="heading-deviations-from-open-source">Deviations From Open-Source</h2>
<p>A business has different requirements than the open-source community, which results in some deviations from the open-source setup. Here are just a few differences that I learned to be mindful of and that I had to explain to junior developers:</p>
<ul>
<li><p><strong>Rapid development.</strong> Typically the pace of development is faster in a business, with multiple developers working full-time on multiple packages at once. If you are unable to separate concerns properly, multiple teams may even be working on the same package at once, with a lot of possible chaos.</p>
</li>
<li><p><strong>Product integration.</strong> Open-source development is all about sharing packages with users and fellow programmers. A business is all about building products and services, so there is much more emphasis on integrating with cloud applications and embedding in devices, and everything that revolves around that.</p>
</li>
<li><p><strong>Access restrictions.</strong> Due to the risk of disclosing sensitive information and other security concerns, companies will have much stricter access control. In an open-source world, everything is transparent and accessible to everyone. In your inner-source world, this may vary per package.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, building an inner-source package ecosystem requires careful planning, a solid foundation, and continuous improvement to support developers in delivering high-quality software. By adopting best practices from open-source communities and adapting them to fit the unique needs of an organization, you can create a thriving ecosystem that fosters collaboration between scientists and software engineers, ultimately improving development effectiveness.</p>
<h3 id="heading-continue-reading">Continue Reading</h3>
<ul>
<li><p><a target="_blank" href="https://scientificcoder.com/how-to-solve-the-two-language-problem">How to solve the Two Language Problem?</a> - An overview of software technologies to get speed and simplicity at once. Comparing Python, C++, Cython, Numba, Julia and more.</p>
</li>
<li><p><a target="_blank" href="https://scientificcoder.com/automate-your-code-quality-in-julia">Automate your Code Quality in Julia</a> - An overview of tools and methods that help improve your code.</p>
</li>
<li><p><a target="_blank" href="https://scientificcoder.com/my-target-audience">My Target Audience</a> - Where I explain what kind of people I have in mind while writing this blog. Includes the Two Culture Problem as I observe it.</p>
</li>
<li><p><a target="_blank" href="https://www.functionalnoise.com/pages/2022-12-29-org-refactor/">Organizational Refactoring</a> (on my previous blog) - About the human challenges of creating a better scientific development organization.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Extreme Multi-Threading: C++ and Julia 1.9 Integration]]></title><description><![CDATA[In this tutorial we demonstrate how to call Julia libraries with multiple threads from C++. With the introduction of Julia 1.9 in May 2023, the runtime can dynamically "adopt" external threads, enabling the integration of Julia libraries into multi-t...]]></description><link>https://scientificcoder.com/extreme-multi-threading-c-and-julia-19-integration</link><guid isPermaLink="true">https://scientificcoder.com/extreme-multi-threading-c-and-julia-19-integration</guid><category><![CDATA[cpp]]></category><category><![CDATA[Julia]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[Tutorial]]></category><category><![CDATA[integration]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Thu, 11 May 2023 14:10:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683635341431/5562e663-f9ea-4cec-a0a8-e6c9da7578ef.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this tutorial we demonstrate how to call Julia libraries with multiple threads from C++. With the introduction of Julia 1.9 in May 2023, the runtime can dynamically "adopt" external threads, enabling the integration of Julia libraries into multi-threaded codebases written in other languages, such as C++. This article is written in collaboration with <a target="_blank" href="https://www.linkedin.com/in/evangelos-paradas/">Evangelos Paradas</a>, the maestro of algorithm deployment at ASML. Evangelos has been responsible for heavily testing and debugging this multi-threading feature. I humbly repeated the final results after his many trial-and-error attempts and summarized everything for you in this article.</p>
<h2 id="heading-julia-in-production">Julia in production</h2>
<p>Julia is a general-purpose language designed for scientific and numerical computing, striking a balance between speed and simplicity. The adoption of Julia in the industry is growing every year, but complex cases require enhanced deployment capabilities in the core of the language. One such crucial improvement we needed was the ability to call Julia libraries with multiple threads from another language. Fortunately, this is now possible in Julia version 1.9. Since we have been involved in testing this new feature extensively, we would like to share this tutorial with you to accelerate your journey with external threads in Julia.</p>
<p>Weaving threads across multiple programming languages is an extreme sport in software engineering. You do so at your own risk. Incorrect usage of this technology will crash your production systems. You have been duly warned.</p>
<p>Before starting, make sure you are working with Julia 1.9, either by using <a target="_blank" href="https://github.com/JuliaLang/juliaup">juliaup</a> or <a target="_blank" href="https://julialang.org/downloads/">downloading Julia 1.9</a> manually and adding it to your path.</p>
<h2 id="heading-introduction-to-c-embedding">Introduction to C++ embedding</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683794913068/8f233cbc-7651-4944-8083-852f62cf3a4a.jpeg" alt class="image--center mx-auto" /></p>
<p>In the past, I have spent quite some time writing a tutorial about <a target="_blank" href="https://www.functionalnoise.com/pages/2022-07-21-embedding/">how to embed Julia libraries into C++</a>. It's not trivial. At a high level, the steps involved are:</p>
<ul>
<li><p>Create a Julia package with the Julia c-interface functions</p>
</li>
<li><p>Write the C++ code that will call those Julia functions</p>
</li>
<li><p>Compile the Julia code to a library with PackageCompiler.jl</p>
</li>
<li><p>Compile C++ and link it to the Julia library</p>
</li>
</ul>
<p>I won't delve into all the specifics above, so if you wish to reproduce the results of this article, it's advisable to first read my previous article. Prior to embarking on a multi-threaded adventure, make sure that you are intimately familiar with embedding in a single-threaded manner. Having multiple C++ threads call into Julia is an exceptionally advanced subject, particularly if you have limited prior experience with C++ and multi-threading. Take your time to learn the ropes.</p>
<h2 id="heading-julia-code-example">Julia code example</h2>
<p>We wrote a very simple Julia function that throws an error depending on the input value. The exact Julia functionality doesn't matter in this article. As mentioned, you can read the extensive <a target="_blank" href="https://www.functionalnoise.com/pages/2022-07-21-embedding/">blog post on my previous blog</a> for details, but here are the important highlights for making a Julia function ready for C/C++ embedding:</p>
<ul>
<li><p>Use <code>Base.@ccallable</code> to make sure the Julia function can be called from C/C++</p>
</li>
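<li><p>Use C types on the interface. In this example I only use <code>Cint</code> types. Note that <code>Cint</code> is an alias for <code>Int32</code>, so a Julia <code>Int32</code> and a C <code>int</code> have the same memory layout. Julia's default <code>Int</code> is 64-bit on most systems, so it is not interchangeable with a C <code>int</code>.</p>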
</li>
</ul>
<pre><code class="lang-julia">Base.<span class="hljs-meta">@ccallable</span> <span class="hljs-keyword">function</span> divide_function(input::<span class="hljs-built_in">Cint</span>)::<span class="hljs-built_in">Cint</span>
    <span class="hljs-keyword">if</span> input &gt; <span class="hljs-number">10</span>
        throw(<span class="hljs-built_in">ErrorException</span>(<span class="hljs-string">"You cannot divide by more than 10"</span>))
    <span class="hljs-keyword">end</span>
    outputValue::<span class="hljs-built_in">Cint</span> = div(<span class="hljs-number">12</span>, input)
    <span class="hljs-keyword">return</span> outputValue
<span class="hljs-keyword">end</span>
</code></pre>
<p>When you place this function inside a Julia package, you can compile it to a library with <a target="_blank" href="https://github.com/JuliaLang/PackageCompiler.jl">PackageCompiler</a>. An example build script can be found in my <a target="_blank" href="https://github.com/matthijscox/embedjuliainc/tree/main/threads/ExternalThreads/build">github repository</a> that accompanies this article.</p>
<h2 id="heading-initializing-julia">Initializing Julia</h2>
<p>Here are some of the interfaces that are important for initializing the Julia library in the correct manner for accepting/adopting external threads from C++. We asked for advice on how to use many of these functions, as we're not experts in this either. The C API of the Julia runtime (those <code>jl_*</code> functions) could definitely use some more documentation.</p>
<ul>
<li><p>The <code>init_julia</code> function comes from a header file that is created together with your compiled Julia library. Nothing special here.</p>
</li>
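<li><p>The call to <code>jl_is_initialized</code> goes into a try/catch block, because when Julia is not initialized this symbol is not available in memory and the call can segfault. A surprising gotcha. Be aware that a standard C++ <code>catch (...)</code> does not intercept a segmentation fault on most platforms, so treat this guard as a best-effort precaution.</p>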
</li>
<li><p>Make sure to <code>lock</code> and <code>unlock</code> the initialization of Julia, so that no other thread can accidentally try to start Julia as well, while this thread is busy initializing Julia.</p>
</li>
<li><p><code>jl_adopt_thread</code> enables this C++ thread to be used by Julia. This is the most important C API function to remember for external multi-threading. <a target="_blank" href="https://github.com/JuliaLang/julia/blob/v1.9.0/NEWS.md#multi-threading-changes">It's available since Julia 1.9</a>.</p>
</li>
<li><p>The job of <code>jl_gc_safe_enter</code> is to mark the thread as GC-safe, so that the garbage collector (GC) can run concurrently with that thread. By using this function, you make a promise not to do any GC-visible work, such as allocating new memory. The use of <a target="_blank" href="https://stackoverflow.com/questions/13600790/what-do-the-parentheses-around-a-function-name-mean">parentheses around the function</a> is simply to avoid confusion with a function-like macro of the same name.</p>
</li>
<li><p><code>jl_enter_threaded_region</code> sets Julia to multi-threading mode, I believe. This function is also used for example by the Julia <code>@threads</code> macro, but lacks any documentation.</p>
</li>
</ul>
<p>The <a target="_blank" href="https://github.com/JuliaLang/julia/blob/v1.9.0/NEWS.md#multi-threading-changes">release notes on thread adoption</a> say that <code>@ccallable</code> Julia functions automatically adopt threads. This is true, but what if you execute a Julia function or macro before the <code>@ccallable</code> function? In that case you get a segmentation fault, because the thread is not yet adopted. For example, when you want to capture Julia errors, you need to call the <code>JL_TRY</code> macro before the <code>@ccallable</code> function. In the next section, we will show how to use such macros within a multi-threaded environment. In this initialization section, we show that the safest way is to perform the thread adoption explicitly by calling <code>jl_adopt_thread</code>.</p>
<p>All together we use these functions to initialize the Julia compiled library as follows. I have kept the code example concise to highlight what matters.</p>
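<pre><code class="lang-cpp"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;mutex&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;julia.h&gt;</span></span> <span class="hljs-comment">// Julia C API (the jl_* functions)</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"julia_init.h"</span></span> <span class="hljs-comment">// init_julia, ships with the compiled library</span>

<span class="hljs-built_in">std</span>::mutex mtx; <span class="hljs-comment">// guards the one-time initialization below</span>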

<span class="hljs-function"><span class="hljs-keyword">bool</span> <span class="hljs-title">is_julia_initialized</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-keyword">try</span>
    {
        <span class="hljs-keyword">return</span> jl_is_initialized() != <span class="hljs-number">0</span>;
    }
    <span class="hljs-keyword">catch</span> (...)
    {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }
}

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">initialize_julia</span><span class="hljs-params">(<span class="hljs-keyword">int</span> argc, <span class="hljs-keyword">char</span> *argv[])</span>
</span>{
    mtx.lock();

    <span class="hljs-keyword">if</span> (!is_julia_initialized())
    {
        init_julia(argc, argv);
        jl_adopt_thread();
        (jl_gc_safe_enter)();
        jl_enter_threaded_region();
    }

    mtx.unlock();
}
</code></pre>
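<p>One possible refinement of the locking pattern above is to let a <code>std::lock_guard</code> release the mutex automatically, so the lock is also released when the initialization throws. The sketch below is self-contained and does not touch Julia at all: <code>fake_runtime_init</code>, <code>initialize_once</code> and <code>count_initializations</code> are hypothetical names, and the stub merely stands in for the <code>init_julia</code> and <code>jl_*</code> calls.</p>
<pre><code class="lang-cpp">#include &lt;atomic&gt;
#include &lt;mutex&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

// Hypothetical stand-in for init_julia + jl_adopt_thread and friends.
// It only counts how often it is called.
static std::atomic&lt;int&gt; init_calls{0};
void fake_runtime_init() { ++init_calls; }

static std::mutex init_mutex;
static bool runtime_ready = false; // only read or written while holding init_mutex

void initialize_once()
{
    // lock_guard unlocks on every exit path, including exceptions
    std::lock_guard&lt;std::mutex&gt; guard(init_mutex);
    if (!runtime_ready)
    {
        fake_runtime_init();
        runtime_ready = true;
    }
}

// Let several threads race to initialize and report how many
// initializations actually happened.
int count_initializations(int n_threads)
{
    std::vector&lt;std::thread&gt; threads;
    for (int i = 0; i &lt; n_threads; ++i)
        threads.emplace_back(initialize_once);
    for (auto&amp; t : threads)
        t.join();
    return init_calls.load();
}
</code></pre>
<p>However many threads call <code>initialize_once</code>, the stub runs a single time. Whether this pattern is preferable to explicit <code>lock</code> and <code>unlock</code> calls in your codebase is a judgment call.</p>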
<h2 id="heading-the-main-c-code">The main C++ code</h2>
<p>Let's write a simple wrapper around our lovely c-callable Julia function and show you how to catch any errors thrown by Julia. All in a multi-threaded way. Remember, the Julia function <code>divide_function</code> is a trivial function that uses integers and throws an exception when the input integer is larger than 10.</p>
<p>We use <code>jl_get_pgcstack</code> to check if a thread is already adopted by Julia. If you attempt to adopt a thread twice, you will encounter a segmentation fault. This is one way to avoid making that mistake accidentally.</p>
<p>The <code>JL_TRY</code> macro will check if an error occurred in the adopted thread. This macro only works if the thread is actually adopted, else you get yet another segmentation fault. Inside the macro we call the function from the Julia library.</p>
<p>If you want to retrieve the actual Julia error inside the <code>JL_CATCH</code>, you will need to call into the Julia runtime. I have some example code in a previous article about <a target="_blank" href="https://www.functionalnoise.com/pages/2022-07-21-embedding/#catch_those_exceptions">catching Julia exceptions from C++</a> on my personal blog. In the example here, we kept it simple and just printed a message.</p>
<pre><code class="lang-cpp">
<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">call_and_catch</span><span class="hljs-params">(<span class="hljs-keyword">int</span> x)</span>
</span>{
    <span class="hljs-comment">// to make sure every thread is adopted by Julia, and only once!</span>
    <span class="hljs-keyword">if</span> (jl_get_pgcstack() == <span class="hljs-literal">NULL</span>)
        jl_adopt_thread();    

    <span class="hljs-comment">// JL_TRY requires the thread to be adopted, else it won't work</span>
    JL_TRY
    {
        divide_function(x); <span class="hljs-comment">// may throw an error depending on your input</span>
        <span class="hljs-built_in">std</span>::<span class="hljs-built_in">cout</span> &lt;&lt; <span class="hljs-string">"Succeeded for x = "</span> &lt;&lt; x &lt;&lt; <span class="hljs-built_in">std</span>::<span class="hljs-built_in">endl</span>;
    }
    JL_CATCH
    {
        <span class="hljs-built_in">std</span>::<span class="hljs-built_in">cout</span> &lt;&lt; <span class="hljs-string">"Caught error for x = "</span> &lt;&lt; x &lt;&lt; <span class="hljs-built_in">std</span>::<span class="hljs-built_in">endl</span>;
    }
}
</code></pre>
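<p>The "adopt every thread, but only once" idea can also be illustrated without the Julia runtime. The sketch below swaps the <code>jl_get_pgcstack</code> check for a plain C++ <code>thread_local</code> flag, which is a different technique than the one used above; <code>fake_adopt_thread</code>, <code>ensure_adopted</code> and <code>count_adoptions</code> are hypothetical names, and the stub only counts calls instead of calling <code>jl_adopt_thread</code>.</p>
<pre><code class="lang-cpp">#include &lt;atomic&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

static std::atomic&lt;int&gt; adoption_calls{0};

// Hypothetical stand-in for jl_adopt_thread: only counts calls.
void fake_adopt_thread() { ++adoption_calls; }

void ensure_adopted()
{
    // Every thread gets its own copy of this flag.
    thread_local bool adopted = false;
    if (!adopted)
    {
        fake_adopt_thread();
        adopted = true;
    }
}

// Each thread calls ensure_adopted several times; returns the
// total number of adoptions that were actually performed.
int count_adoptions(int n_threads, int calls_per_thread)
{
    adoption_calls = 0;
    std::vector&lt;std::thread&gt; threads;
    for (int t = 0; t &lt; n_threads; ++t)
        threads.emplace_back([calls_per_thread] {
            for (int c = 0; c &lt; calls_per_thread; ++c)
                ensure_adopted();
        });
    for (auto&amp; th : threads)
        th.join();
    return adoption_calls.load();
}
</code></pre>
<p>A flag like this only knows about adoptions made through <code>ensure_adopted</code>. A thread that was adopted elsewhere, for example during initialization, would not be detected, which is one reason to ask the runtime itself as the example above does.</p>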
<p>We can now write a piece of multi-threaded C++ code and call our Julia function. The easiest way is to first create a pool of threads. If you want to make this example more complicated, you'll have to learn a bit more about C++, which is beyond the scope of this article. But this is a good example to get you started.</p>
<pre><code class="lang-cpp">
<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">main</span><span class="hljs-params">(<span class="hljs-keyword">int</span> argc, <span class="hljs-keyword">char</span> *argv[])</span>
</span>{
    <span class="hljs-function"><span class="hljs-keyword">const</span> <span class="hljs-keyword">size_t</span> <span class="hljs-title">n_of_threads</span><span class="hljs-params">(<span class="hljs-number">15</span>)</span></span>;
    initialize_julia(argc, argv); <span class="hljs-comment">// forward the command-line arguments to init_julia</span>

    <span class="hljs-comment">// create all threads and assign them our function; each thread starts running immediately</span>
    <span class="hljs-built_in">std</span>::thread all_threads[n_of_threads];
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i=<span class="hljs-number">0</span>; i&lt;n_of_threads; i++)
        all_threads[i] = <span class="hljs-built_in">std</span>::thread(call_and_catch, i+<span class="hljs-number">1</span>);

    <span class="hljs-comment">// wait for all the threads to finish</span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">auto</span>&amp; thread : all_threads)
        thread.join();

    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
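<p>If the number of threads is only known at runtime, a <code>std::vector</code> of threads is a common alternative to the fixed-size array. The following sketch is runnable without Julia: <code>divide_or_throw</code> is a hypothetical C++ stand-in that mirrors the behaviour of <code>divide_function</code>, and <code>run_workers</code> is an illustrative name. Instead of printing, each thread records whether its call succeeded, using a mutex to protect the shared counters.</p>
<pre><code class="lang-cpp">#include &lt;mutex&gt;
#include &lt;stdexcept&gt;
#include &lt;thread&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

// Hypothetical C++ stand-in that mirrors the Julia divide_function.
int divide_or_throw(int input)
{
    if (input &gt; 10)
        throw std::runtime_error("You cannot divide by more than 10");
    return 12 / input;
}

// Runs one thread per input 1..n_threads and
// returns {number of successes, number of failures}.
std::pair&lt;int, int&gt; run_workers(int n_threads)
{
    int successes = 0;
    int failures = 0;
    std::mutex counter_mutex; // protects both counters

    std::vector&lt;std::thread&gt; threads;
    threads.reserve(n_threads);
    for (int i = 1; i &lt;= n_threads; ++i)
    {
        threads.emplace_back([i, &amp;successes, &amp;failures, &amp;counter_mutex] {
            bool ok = true;
            try { divide_or_throw(i); }
            catch (const std::runtime_error&amp;) { ok = false; }

            std::lock_guard&lt;std::mutex&gt; guard(counter_mutex);
            (ok ? successes : failures) += 1;
        });
    }

    // join blocks until the corresponding thread has finished
    for (auto&amp; t : threads)
        t.join();

    return {successes, failures};
}
</code></pre>
<p>With 15 threads, inputs 1 to 10 succeed and inputs 11 to 15 fail, matching the split between succeeding and erroring calls in the Julia example.</p>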
<h2 id="heading-compiling">Compiling</h2>
<p>Make sure to add the <code>-lpthread</code> flag, which links the system library required for C++ threads. I've already added this flag to the <a target="_blank" href="https://github.com/matthijscox/embedjuliainc/blob/main/threads/Makefile">Makefile in my repository</a>. Other than that, compilation is identical to <a target="_blank" href="https://www.functionalnoise.com/pages/2022-07-21-embedding/">regular Julia embedding in C++</a>.</p>
<p>After compiling with the makefile, I can run the generated executable, and we see 15 printed messages, as expected. They appear in somewhat random order, due to the nature of multi-threading, but the erroring threads appear last, probably because the error handling takes additional time.</p>
<p>If you ever manage to arrive at this same point, please congratulate yourself! This is tricky business.</p>
<pre><code class="lang-bash">Succeeded <span class="hljs-keyword">for</span> x = 2 
Succeeded <span class="hljs-keyword">for</span> x = 1
Succeeded <span class="hljs-keyword">for</span> x = 4
Succeeded <span class="hljs-keyword">for</span> x = 3
Succeeded <span class="hljs-keyword">for</span> x = 5
Succeeded <span class="hljs-keyword">for</span> x = 8
Succeeded <span class="hljs-keyword">for</span> x = 10
Succeeded <span class="hljs-keyword">for</span> x = 9
Succeeded <span class="hljs-keyword">for</span> x = 7
Succeeded <span class="hljs-keyword">for</span> x = 6
Caught error <span class="hljs-keyword">for</span> x = 15
Caught error <span class="hljs-keyword">for</span> x = 12
Caught error <span class="hljs-keyword">for</span> x = 13
Caught error <span class="hljs-keyword">for</span> x = 14
Caught error <span class="hljs-keyword">for</span> x = 11
</code></pre>
<h2 id="heading-pitfalls-to-avoid">Pitfalls to avoid</h2>
<p>In general, multi-threading requires a lot of attention due to many possible pitfalls, such as thread-safety issues, deadlocks, race conditions and much more. Adding external multi-threading to the mix makes everything even more complicated. Consider carefully whether you really want to go down this route with multiple languages. If you want to continue, here are a few complexities we encountered along the way, but be aware that you may find many more.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683718993689/c847351d-4756-4c7c-9817-f128ad401cbf.jpeg" alt class="image--center mx-auto" /></p>
<p>We encountered some issues with BLAS and other libraries. It's best to set the number of threads to one via <code>LinearAlgebra.BLAS.set_num_threads(1)</code>, else every thread in Julia spawns multiple threads in the BLAS library. Same for MKL and any other third party library you use. Things may work fine, but your performance might not be optimal. You probably don't want your 4 external C++ threads accidentally spawning 16 BLAS threads or more.</p>
<p>In general, be sure to test every binary artifact you want to use in production and consider the implications for your multi-threading setup. This is good advice for any software development project you undertake, independent of Julia.</p>
<p>We encountered a pitfall with Java, when embedding our library into Spark. In this article, we will not go into the details of passing Java threads (via C++) to Julia, but we noticed some issues with the Java signal handler. Make sure that your library is explicitly aware of the Java signal handling library, for example via <code>export LD_PRELOAD=/path/to/libjsig.so</code>. Otherwise Julia will produce a segmentation fault and your application will crash. This is some kind of language interoperability issue that we had to circumvent.</p>
<p>Big lesson learned from the above: never ever disable the Julia signal handler, because otherwise only Java handles the signals. These are operating system signals, such as SIGSEGV (segfaults), SIGABRT, or the well-known SIGINT (sent when you hit Ctrl+C to stop something). If Julia cannot handle those signals, you've got a serious problem. We made this mistake while figuring out the previous pitfall.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Integrating C++ and Julia with multiple threads can be a complex task, but it offers powerful capabilities for incorporating Julia libraries into multi-threaded C++ codebases. By carefully initializing the Julia runtime and handling potential pitfalls, developers can successfully combine these two languages for improved performance and functionality. However, it's crucial to be mindful of complicated multi-threading challenges to ensure the reliability of the final product.</p>
]]></content:encoded></item><item><title><![CDATA[Mastering Scientific Programming: Practical Tips and Tricks]]></title><description><![CDATA[Scientific programming involves writing code to solve scientific problems. This can range from simulating complex physical phenomena to analyzing large datasets. While such software is incredibly important, it can be challenging for scientists to lea...]]></description><link>https://scientificcoder.com/mastering-scientific-programming-practical-tips-and-tricks</link><guid isPermaLink="true">https://scientificcoder.com/mastering-scientific-programming-practical-tips-and-tricks</guid><category><![CDATA[software development]]></category><category><![CDATA[tips]]></category><category><![CDATA[tricks]]></category><category><![CDATA[scientificsoftware ]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Wed, 10 May 2023 09:46:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683568993032/f57416c5-3a85-4420-9074-0559b752625b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Scientific programming involves writing code to solve scientific problems. This can range from simulating complex physical phenomena to analyzing large datasets. While such software is incredibly important, it can be challenging for scientists to learn all the required software development skills. However, by gradually adding specific tricks into your workflow, you can enhance your coding efficiency and effectiveness.</p>
<p>Software skills are important for everyone these days, including scientists. I see certain common risks if you do not spend effort on your code quality:</p>
<ul>
<li><p>Incorrect code leads to incorrect results, which means you may have to redo work or even risk damaging your reputation.</p>
</li>
<li><p>Unreproducible code means others, including your future self, cannot verify your work, nor build on top of it.</p>
</li>
<li><p>Both incorrect code and unreproducible code may lead people to stop trusting your software and the conclusions they draw from that code.</p>
</li>
<li><p>As your code grows, it may become unreadable and unmaintainable, making it harder for you and others to understand it and contribute further to the code.</p>
</li>
</ul>
<p>Scientific code and scientific principles are also applied in startups and in the industry, for research and development of software products. All these aspects are even more important to learn if you ever want to join a professional software development organization.</p>
<h2 id="heading-choose-the-right-language"><strong>Choose the Right Language</strong></h2>
<p>When it comes to scientific programming, choosing the right programming language is crucial. You want a language that is efficient, easy to use, and has good libraries for scientific computing. Some popular languages for scientific programming include Python, MATLAB, R, and Julia. If you need performance, you typically end up learning Fortran, C, C++ or Rust, though they are considered more difficult and take more time to master.</p>
<p>If you need both performance and simplicity (you probably do if you write complicated algorithms) then you quickly encounter the so-called "two language problem". This is the fact that you typically need to work with at least two programming languages. Read my recent article about how to <a target="_blank" href="https://scientificcoder.com/how-to-solve-the-two-language-problem">solve this two language problem</a> if you want to know more.</p>
<p>Choosing the right language can be a difficult task. Typically people pick the language that people around them are using, but it may pay off to investigate alternatives in order to avoid running into technical difficulties later on.</p>
<h2 id="heading-write-clean-and-readable-code">Write Clean and Readable Code</h2>
<p>Scientific programming often involves writing complex algorithms and data structures. It's important to write code that is easy to read and understand. This will make it easier to debug and maintain your code in the long run. Some tips for writing clean code include using meaningful variable names, adding comments to explain your code, and breaking up long functions into smaller, more manageable pieces. I intend to elaborate on many of these topics on this blog.</p>
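<p>As a small illustration of these tips, here is a sketch of the same calculation written twice. The pendulum example and all names in it are my own assumptions, chosen only to show the difference that naming, units and a docstring make.</p>

```python
import math


# Hard to read: cryptic names, a magic number, no units, no explanation.
def p(l, g=9.81):
    return 6.28 * (l / g) ** 0.5


# Easier to read: descriptive names, units in the names, a docstring,
# and an explicit check on the input.
def pendulum_period(length_m: float, gravity_m_s2: float = 9.81) -> float:
    """Return the small-angle period (in seconds) of a simple pendulum."""
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    return 2 * math.pi * math.sqrt(length_m / gravity_m_s2)
```

<p>Both functions compute roughly the same number, but only the second one tells a reader what goes in, what comes out, and in which units.</p>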
<h2 id="heading-test-your-code">Test Your Code</h2>
<p>Testing is an important part of software development in general. You want to make sure that your code is working correctly before you use it to analyze data or simulate physical phenomena. One popular testing method is unit testing, which involves writing small tests for individual functions or methods. This can help you catch bugs early on and ensure that your code is working as expected. There is a lot of attention in the software development community regarding testing. But I believe this topic also deserves another blog post from me, to explain how to get started, but also how to address the iterative nature of scientific development.</p>
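<p>To make unit testing a bit more concrete, here is a minimal sketch using Python's built-in <code>unittest</code> module. The <code>trapezoid</code> function is a hypothetical example of numerical code, not something from a specific project. Note that numerical tests usually compare against a tolerance instead of exact equality.</p>

```python
import math
import unittest


def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    dx = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * dx)
    return total * dx


class TestTrapezoid(unittest.TestCase):
    def test_linear_function_is_exact(self):
        # The trapezoid rule is exact for straight lines.
        self.assertAlmostEqual(trapezoid(lambda x: 2 * x, 0.0, 1.0, 4), 1.0)

    def test_sine_within_tolerance(self):
        # The integral of sin over [0, pi] is 2; allow a small relative error.
        result = trapezoid(math.sin, 0.0, math.pi, 1000)
        self.assertTrue(math.isclose(result, 2.0, rel_tol=1e-5))
```

<p>If you save this in a file, for example <code>test_trapezoid.py</code> (a name I made up), you can run the tests with <code>python -m unittest test_trapezoid</code>.</p>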
<h2 id="heading-use-version-control">Use Version Control</h2>
<p>Version control is a system for managing changes to your code over time. It allows you to keep track of changes, revert to previous versions, and collaborate with others. One popular version control system is Git, which is widely used in scientific programming. In some organizations you may also encounter other systems, such as SVN.</p>
<p>Thanks to <a target="_blank" href="https://github.com/">GitHub</a> and other tools using Git, it has become ever simpler to control your code and share changes with others.</p>
<h2 id="heading-code-reviews">Code Reviews</h2>
<p>When you write scientific publications, you go through a rigorous reviewing process, starting with advice from your colleagues and finally a peer review procedure. Somehow the code doesn't always get such rigorous reviewing.</p>
<p>In typical software development environments, code reviews are common practice, for example to make sure the code is readable and correctly tested. Most version control platforms, such as GitHub, provide easy web interfaces to inspect code changes and leave comments. Another practice to improve the code is so-called "pair programming", where you code together side-by-side, essentially doing the reviewing in real time.</p>
<h2 id="heading-learn-from-others">Learn from Others</h2>
<p>Scientific programming is a rapidly evolving field, and there is always something new to learn. One of the best ways to learn is to collaborate with others who have more experience or knowledge. This can involve joining online communities or attending scientific programming conferences. There are also books to learn from, though I wish there were more targeting the challenges of scientific software. Currently I am reading <a target="_blank" href="https://www.amazon.com/Software-Engineering-Science-Chapman-Computational/dp/1498743854">Software Engineering for Science</a>, which someone suggested to me recently.</p>
<p>Of course if you want to stay up to date, you can subscribe to this blog, where I intend to keep sharing my knowledge.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, scientific programming is an important tool for solving complex scientific problems. By choosing the right language, writing clean and readable code, testing your code, using version control, and learning from others, you can write efficient and effective software that advances the field of science.</p>
]]></content:encoded></item><item><title><![CDATA[How to solve the two language problem?]]></title><description><![CDATA[My professional obsession is solving the Two Culture Problem. How can scientists optimally join forces with software engineers and their principles, so that we can work on the same problems together? How to accelerate the cycle from idea to product? ...]]></description><link>https://scientificcoder.com/how-to-solve-the-two-language-problem</link><guid isPermaLink="true">https://scientificcoder.com/how-to-solve-the-two-language-problem</guid><category><![CDATA[Julia]]></category><category><![CDATA[Python]]></category><category><![CDATA[C++]]></category><category><![CDATA[software development]]></category><category><![CDATA[two language problem]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Mon, 08 May 2023 07:41:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683112298365/e6ad9ad6-a363-45cc-a04f-a6f38510bf2f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My professional obsession is solving the <a target="_blank" href="https://scientificcoder.com/my-target-audience#heading-the-two-culture-problem">Two Culture Problem</a>. How can scientists optimally join forces with software engineers and their principles, so that we can work on the same problems <em>together</em>? How to accelerate the cycle from idea to product? The Two Culture Problem requires a solution to the related Two Language Problem, which has a technical nature. A solution to the technical problem does not guarantee a solution to the organizational problem, but when it comes to engineering cultures you first need to prove the technical solution before you can even begin to tackle the social implications. I have a strong opinion on the best technical solution, but let's review all our options.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683119498848/0a07083c-0d76-4f94-8657-ecfa64c53f63.png" alt class="image--center mx-auto" /></p>
<p>As far as I can tell, we have the following alternatives:</p>
<ul>
<li><p>Accept the status quo: use a slow and a fast (usually harder) language</p>
</li>
<li><p>Code generation using a look-a-like framework inside the slow language</p>
</li>
<li><p>Using (LLVM-based) optimization frameworks that look like the slow language</p>
</li>
<li><p>Speed up the slow language itself, working around its limitations</p>
</li>
<li><p>Design a new language that is both easy and fast</p>
</li>
</ul>
<p>There are many tutorials about all of these options. Here I'd like to write a short overview of all of these.</p>
<p>For another similar technical overview, see Martin Maas's blog posts about <a target="_blank" href="https://www.matecdev.com/posts/julia-python-numba-cython.html">Julia vs Python vs Numba vs Cython</a>.</p>
<h2 id="heading-the-two-language-problem-python-and-c-as-a-primary-example">The two language problem - Python and C++ as a primary example</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683037582592/500c6c55-108b-4283-b330-d7d3f4c5ef86.png" alt class="image--center mx-auto" /></p>
<p>Your scientists or domain experts write prototypes in a simple language, let's say Python, where they can rapidly explore, do dynamic data analysis, model desired behavior and gather requirements from users. When they find something valuable, software engineers convert the prototype into high-performance code, let's say C++, and integrate it into production systems to sell as professional services to those users. This is my assumption of the status quo.</p>
<p>(I will continue to refer to the modeling culture as "scientists" whether or not they are actual scientists, domain experts, requirements engineers, data analysts, quants or any other kind of expert whose primary job is modeling the behavior of your product without actually writing and deploying the final source code.)</p>
<p>Depending on the size of your organization and the skill level of your engineers, you may end up with several configurations:</p>
<ul>
<li><p>Teams of highly skilled scientific engineers who can do all the work</p>
</li>
<li><p>Teams with a mix of scientists and software engineers</p>
</li>
<li><p>Separate teams of scientists and separate teams of software engineers</p>
</li>
</ul>
<p>You may have any combination of the above. The first option is a team of unicorns, which I have seen the least, but is amazing to work with.</p>
<p>Perhaps you have accepted this status quo. As the organization grows, separate code bases may evolve for the two types of tasks. In my experience, the production code rarely gets re-integrated into the analysis code, because it's not worth the effort in the short term. Long term you may get inconsistencies and other issues, but that's typically for someone else to worry about. Or perhaps people notice the problems, but profit margins are good, so why worry?</p>
<p>If you integrate your fast code (C++) as embedded libraries into the slow language (Python), you typically need some intermediate glue code or language in between. This requires yet more technical expertise from your people. See for example this blog about <a target="_blank" href="https://www.matecdev.com/posts/cpp-call-from-python.html">How to Call C++ from Python</a>.</p>
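<p>To give a feeling for what such glue code looks like, here is a minimal sketch that calls a compiled C function from Python with the standard <code>ctypes</code> module. It uses plain C (the <code>cos</code> function from the system math library) instead of C++, and it assumes a Linux or macOS system where that library can be found. The wrapper name <code>c_cos</code> is my own.</p>

```python
import ctypes
import ctypes.util

# Locate the C math library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# The glue: declare the C signature `double cos(double)` by hand, so that
# ctypes knows how to convert Python floats to C doubles and back.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double


def c_cos(x: float) -> float:
    """Call the compiled C implementation of cos."""
    return libm.cos(x)


print(c_cos(0.0))
```

<p>Even in this tiny case you have to repeat the C type signature on the Python side. For real C++ libraries, with classes, templates and memory ownership, the glue layer grows accordingly.</p>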
<p>One stated benefit of keeping the two-language culture intact is that your prototypes and your scientists never mess up your production systems. The production systems are brittle and valuable, so this is a valid concern, but I think there are better ways to teach people to write better code than by blocking them.</p>
<p>For this article, I assume you are looking for alternatives. Maybe the problems have grown too big, or you want to avoid them early on, or you simply cannot hire enough senior software engineers. Thus for one reason or another, you need your scientists to be deeply involved in the software development.</p>
<p>Learning C/C++ is still a good idea to grow your expertise or the competence of your scientists, but it can take a long time to develop. At a minimum, I advise learning what it means to compile and link libraries. And learn a bit about computers by reading great summaries such as <a target="_blank" href="https://viralinstruction.com/posts/hardware/">What Scientists Should Know About Hardware to Write Fast Code</a>.</p>
<p>Still want to find a technology that's easier to use, yet brings some of the hardcore software benefits? Let's see what's possible!</p>
<h2 id="heading-code-generation-cython-example">Code generation - Cython example</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683037599047/9dd5b2a2-88c2-4aea-bb09-21fd261e95f2.png" alt class="image--center mx-auto" /></p>
<p>Generating low-level code, most likely C, from a high-level language, most likely Python, is typically done to try to avoid some of the disadvantages of the two-language problem. Maybe you want to compile static libraries to embed into devices. Or you have some other reason. Unless your examples are very simple, do not expect big performance boosts though, because the generated code still needs to make similar kinds of assumptions as the high-level language. Also, most code generators do not support the complete language semantics, so you will have to make sure your high-level code adheres to the capabilities of the generator.</p>
<p>In Python you can use the <a target="_blank" href="https://cython.readthedocs.io/en/latest/">Cython</a> 'compiler' to help you generate C code. On the surface it looks a lot like Python, yet with C types and certain decorators. This means you need to rewrite the parts of your codebase that you want to speed up. The process of turning Python into Cython is sometimes called "cythonizing". You get such examples in the <a target="_blank" href="https://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html">Cython quickstart tutorial</a>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> cython

<span class="hljs-meta">@cython.cfunc</span>
<span class="hljs-meta">@cython.exceptval(-2, check=True)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">f</span>(<span class="hljs-params">x: cython.double</span>) -&gt; cython.double:</span>
    <span class="hljs-keyword">return</span> x ** <span class="hljs-number">2</span> - x

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">integrate_f</span>(<span class="hljs-params">a: cython.double, b: cython.double, N: cython.int</span>):</span>
    i: cython.int
    s: cython.double
    dx: cython.double
    s = <span class="hljs-number">0</span>
    dx = (b - a) / N
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(N):
        s += f(a + i * dx)
    <span class="hljs-keyword">return</span> s * dx
</code></pre>
<p>You have to <a target="_blank" href="https://cython.readthedocs.io/en/latest/src/quickstart/build.html">build this cythonized code</a>. This happens in two stages:</p>
<ul>
<li><p>The <code>.py</code> or <code>.pyx</code> file is converted by Cython to a <code>.c</code> file.</p>
</li>
<li><p>The <code>.c</code> file is compiled by a C compiler to a <code>.so</code> file (or <code>.pyd</code> on Windows), which can be <code>import</code>-ed back into Python. The compilation step is typically driven by <a target="_blank" href="https://setuptools.pypa.io/en/latest/">setuptools</a>.</p>
</li>
</ul>
<p>The downside of Cython, and any similar C code generator, is that you obtain rather obscure C code. If you want code obfuscation, you can consider that a benefit, but trouble begins when you have to work with that C code. It can be hard to debug once deployed in the field. Make sure to add lots of clear error messages and logging. If you integrate the generated code inside existing C/C++ codebases, your software engineers may dislike writing the necessary glue-code (I learned that from experience). Finally, naive Cython is not very performant and writing <a target="_blank" href="https://notes-on-cython.readthedocs.io/en/latest/std_dev.html">optimized Cython</a> can be as difficult as writing regular C code. But the benefit is that you can move gradually up in complexity.</p>
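<p>Because you move up in complexity gradually, it helps to keep a plain Python reference of the function you are cythonizing. The sketch below is my own addition: it is the same integration routine without any Cython annotations, plus a simple timing with the standard <code>timeit</code> module. You can use it to check that a compiled version returns the same result, and to see whether the rewrite actually paid off.</p>

```python
import timeit


def f(x):
    return x ** 2 - x


def integrate_f(a, b, N):
    """Left Riemann sum of f over [a, b] with N steps (plain Python)."""
    s = 0.0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


# The exact integral of x**2 - x over [0, 1] is -1/6.
reference = integrate_f(0.0, 1.0, 10_000)

# Time the baseline so that a cythonized build can be compared against it.
seconds = timeit.timeit(lambda: integrate_f(0.0, 1.0, 10_000), number=20)
print(f"result={reference:.6f}, 20 runs took {seconds:.3f}s")
```

<p>The timing depends on your machine, so treat it only as a relative baseline.</p>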
<p>How to make a standalone-ish distribution? Cython generates code that interfaces with the Python runtime. You can create a binary executable, but you also need to distribute it with <code>libpython.so</code>, which is the Python runtime. Moreover, you also need to add all the Python dependencies and the .so/.dll files that those packages are using. This might be a bit tedious using Cython, but it is certainly possible. Other packages like Nuitka make this process a bit less painful by figuring out all your dependencies.</p>
<p>Fun fact: Code generation is sometimes referred to as "transpiling", since you <em>translate</em> your code to another language that's ready for <em>compiling</em>.</p>
<h2 id="heading-interlude-llvm">Interlude: LLVM</h2>
<p>What if we do not want to write or generate C code? Do we have any other options? Yes, we can generate something else: LLVM code! Before we go into such frameworks, let's do a quick introduction to LLVM itself.</p>
<p>LLVM is a middleman between your source code and the compiled native code. Compilers typically work in two stages: first they translate your source code into byte code, then they translate that byte code into native code. The byte code is an intermediate representation that is agnostic of the CPU or GPU architecture. LLVM is an attempt to standardize this intermediate representation, which will then be compiled for you to any architecture you want. In some frameworks or languages (like Julia) you can ask to see the LLVM code and the eventual native code. The native code consists of processor instructions, typically shown as assembly code (that's just before it becomes those zeros and ones you always hear about).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683033957344/f7c9dc65-4a28-4699-b400-2bded8c92a89.png" alt class="image--center mx-auto" /></p>
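<p>If you have never looked at an intermediate representation, Python itself offers a small analogy. The sketch below uses the standard <code>dis</code> module to show the byte code that CPython generates for a function. This is my own illustration and it is not LLVM code: CPython byte code is interpreted rather than compiled to native code, but it shows the idea of an architecture-agnostic layer between your source and the machine.</p>

```python
import dis


def add(x, y):
    return x + y


# Print the byte code instructions that CPython generated for `add`.
dis.dis(add)

# The same instructions as data: a list of operation names.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```

<p>The exact instruction names differ between Python versions, which is one reason such intermediate code is normally not something you write by hand.</p>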
<p>Frameworks that use LLVM may store the compiled native code in memory. In that case, if you want to distribute the compiled code, you need the option to package the native code into a library (that's a .so on Linux or a .dll on Windows) together with all of its dependencies. This packaging option may be important to investigate for your deployment strategy.</p>
<p>Fun fact: the <a target="_blank" href="https://clang.llvm.org/">clang</a> compiler for C also compiles via LLVM. So if you write Cython and then compile via clang, you are taking an interesting route.</p>
<h2 id="heading-optimization-frameworks-numba-example">Optimization Frameworks - Numba example</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683037618888/f615721e-9a2f-499f-af64-a8102a58c4eb.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://numba.pydata.org/">Numba</a> is an LLVM code generator that integrates directly with Python. What you need to do is add decorators to every Python function you want to optimize. In principle it looks simple:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> numba <span class="hljs-keyword">import</span> jit, int32

<span class="hljs-meta">@jit(int32(int32, int32), nopython=True)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">f</span>(<span class="hljs-params">x, y</span>):</span>
    <span class="hljs-keyword">return</span> x + y
</code></pre>
<p>JIT stands for Just-In-Time: by default Numba compiles the function via LLVM the first time you call it, inferring which types you used, just in time before executing. You can optionally provide the types yourself, as I did above, in which case that version is compiled when the function is defined. And there are lots of other settings for the <code>@jit</code> decorator, like the <a target="_blank" href="https://numba.readthedocs.io/en/stable/glossary.html#term-nopython-mode">nopython mode</a> to get faster performance.</p>
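<p>To illustrate the idea of compiling one version per combination of argument types, here is a toy sketch in plain Python. It is not how Numba is implemented and it does not compile anything. The <code>toy_jit</code> decorator is a made-up name that only records when a real JIT would have to create a new specialization.</p>

```python
import functools


def toy_jit(func):
    """Keep one "compiled" version of func per tuple of argument types."""
    specializations = {}

    @functools.wraps(func)
    def wrapper(*args):
        signature = tuple(type(arg) for arg in args)
        if signature not in specializations:
            # A real JIT would generate and compile typed machine code here.
            # This toy only stores the original function under the signature.
            specializations[signature] = func
        return specializations[signature](*args)

    wrapper.specializations = specializations
    return wrapper


@toy_jit
def add(x, y):
    return x + y


add(1, 2)      # first call with (int, int): a new specialization
add(3, 4)      # same types: the cached version is reused
add(1.0, 2.0)  # new types (float, float): another specialization
print(len(add.specializations))  # prints 2
```

<p>This is also why the first call of a jitted function is slow: the compilation cost is paid once per type signature, and later calls with the same types reuse the result.</p>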
<p>Only a subset of Python is supported with Numba. It works well with your NumPy code, they made sure of that. It doesn't work with other packages such as Pandas, because those work differently. Even plain Python dictionaries are not supported in nopython mode; you need Numba's own typed dictionary instead. Nobody writes a custom file format parser in Numba; it's purely for numerical code.</p>
<p>If you want to compile to GPU: there are other decorators, such as <code>@cuda.jit</code>. This suggests you need to edit your code for GPU.</p>
<p>If you want to compile ahead of time, you will again have to replace all your <code>@jit</code> decorators with the <code>@cc.export</code> decorator and be explicit about your types.</p>
<p>You cannot debug the jitted Numba code itself. Instead, you'll have to change the decorator setting to <a target="_blank" href="https://numba.readthedocs.io/en/stable/user/troubleshoot.html#debugging-jit-compiled-code">debug mode</a> and use the gdb tool, so be careful there. That's another disadvantage of Numba.</p>
<p>We have never tried to make a standalone distribution of Numba compiled code, but I assume you ship the entire Python environment, with the ahead-of-time compiled code. If someone has experience with the nitty-gritty details of distributing Numba code, then let me know!</p>
<h3 id="heading-codon">Codon</h3>
<p><a target="_blank" href="https://github.com/exaloop/codon">Codon</a> is a recent attempt similar to Numba, except it claims zero-overhead; you do not necessarily have to decorate your code. Well, except if you want to use it inside larger Python codebases (you probably do), then you have the <code>@codon.jit</code> decorator and other decorators depending on your use-case.</p>
<p>Codon has only 9 contributors at the moment, and it has a non-permissive license, so you'll have to pay to use Codon commercially in production. It's interesting but looks more like a startup than a regular open-source project. Similar to Numba it only supports a subset of the Python language, which may get better over time (or worse, if Python evolves, yet the developers do not update Codon).</p>
<h3 id="heading-jax-tensorflow-pytorch">JAX, TensorFlow, PyTorch</h3>
<p>Every scientific computing and machine learning framework in Python implements its own optimized numerical libraries it seems. Some of them, like <a target="_blank" href="https://jax.readthedocs.io/en/latest/notebooks/quickstart.html#using-jit-to-speed-up-functions">JAX</a>, have a <code>@jit</code> decorator like Numba. All these frameworks look like Python, but to get performant code you'll have to use their API, not Python itself. Often you write Python in a more complicated directed-acyclic-graph (DAG) structure that can be fed to the underlying libraries for execution. Don't ask me how to debug these things. Please consider whether you are really writing Python or another language.</p>
<p>Also see this section from the Mojo language comparing such <a target="_blank" href="https://docs.modular.com/mojo/why-mojo.html#related-work-other-approaches-to-improve-python">Python improvements</a>.</p>
<h2 id="heading-boost-the-slow-language-pypy">Boost the slow language - PyPy</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683115939288/a6319081-3b7d-47b7-9ab0-79504b54a702.png" alt class="image--center mx-auto" /></p>
<p>There is a continuous effort to improve the performance of slow interpreted languages like Python and R. In a blog post called <a target="_blank" href="https://towardsdatascience.com/python-3-14-will-be-faster-than-c-a97edd01d65d">Python 3.14 Will be Faster than C++</a> the author joked that linear extrapolation of Python improvements will soon surpass C++ performance. Let's see how that graph evolves in the next Python versions.</p>
<h3 id="heading-pypy">PyPy</h3>
<p>An alternative to waiting for Python to improve is <a target="_blank" href="https://www.pypy.org/features.html">PyPy</a>. This is a replacement for CPython. Note, CPython is not Cython. <a target="_blank" href="https://github.com/python/cpython">CPython</a> is essentially Python itself, as the reference Python interpreter is written in the C language. PyPy is an attempt to make the entire Python language faster with a different interpreter that has its own built-in JIT compiler. Unlike Numba, the JIT is applied automatically while your program runs, so you do not have to decorate your functions.</p>
<p>In general, the Python language design creates limitations on the performance, see this video for example on <a target="_blank" href="https://www.youtube.com/watch?v=qCGofLIzX6g">How Python was Shaped by Leaky Internals</a>. If you don't want to change language, you may hope that Python 4 ever comes around with syntax that can actually be optimized.</p>
<h2 id="heading-an-optimized-language-julia-as-example">An optimized language - Julia as example</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683037634944/a8634473-0686-4172-ba4d-e6fd79dd1390.png" alt class="image--center mx-auto" /></p>
<p>If you don't want to wait for Python 4, then there's <a target="_blank" href="https://julialang.org/">Julia</a> instead. Julia is a language that is optimized for talking to LLVM, while looking as similar as possible to high-level languages like Python and MATLAB. In short, it's an LLVM whisperer. This makes it fast and easy. See <a target="_blank" href="https://julialang.org/blog/2012/02/why-we-created-julia/">Why Was Julia created?</a> to get an impression of the rationale.</p>
<p>Similar to Cython and Numba, you can optionally add type information to Julia, which can help the compiler, though Julia is good at type inference.</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> f(x::<span class="hljs-built_in">Int</span>, y::<span class="hljs-built_in">Int</span>)
    <span class="hljs-keyword">return</span> x + y
<span class="hljs-keyword">end</span>
</code></pre>
<p>Is Julia better than Numba and Cython? For an opinionated and long blog post read <a target="_blank" href="https://www.stochasticlifestyle.com/why-numba-and-cython-are-not-substitutes-for-julia/">Why Numba and Cython are not substitutes for Julia</a>. There are also lengthy discussions in this Discourse thread on <a target="_blank" href="https://discourse.julialang.org/t/julia-motivation-why-werent-numpy-scipy-numba-good-enough/">Why weren't Numpy, Numba, SciPy good enough?</a>. And I also like Martin Maas's blog post about <a target="_blank" href="https://www.matecdev.com/posts/julia-python-numba-cython.html">Julia vs Numba and Cython</a>.</p>
<p>I would summarize the benefits as: You don't have to decorate your code, Julia <em>is</em> the JIT decorator. You can write the same code for CPU and GPU. When compiling ahead-of-time, it's again the same code. The compiler can optimize across all Julia code, not just a single package like NumPy that you are currently using. Composability is often praised: Julia packages work easily together.</p>
<p>The downside of Julia, if you are coming from another language like Python, is obviously that you have to learn another language. Though I wonder how much more difficult Julia is compared to learning a complex framework like Numba. And writing optimized Cython can be considered similar to writing another language. Julia has many similarities with Python, check for example this <a target="_blank" href="https://cheatsheets.quantecon.org/">cheat sheet to compare MATLAB to Python to Julia</a>, except Julia bypasses the problems that make Python difficult to compile to LLVM.</p>
<p>Similar to Cython and Numba, naive Julia code is good, but not necessarily as performant as optimized C. Read the <a target="_blank" href="https://docs.julialang.org/en/v1/manual/performance-tips/">performance tips</a> to get the most out of your code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683114204062/9be8d8f7-192f-4616-8e77-46c6e24c9e7d.png" alt class="image--center mx-auto" /></p>
<p>If you want to move gradually to Julia, you can embed Julia into Python via <a target="_blank" href="https://github.com/JuliaPy/pyjulia">PyJulia</a> or the more recent two-way package <a target="_blank" href="https://github.com/cjdoris/PythonCall.jl">PythonCall</a>. Or you can re-use existing Python code inside Julia, via <a target="_blank" href="https://github.com/JuliaPy/PyCall.jl">PyCall.jl</a>, or the aforementioned PythonCall. This way you can use Julia as if it's yet another Python framework, instead of a completely new language. You can even send <a target="_blank" href="https://cjdoris.github.io/PythonCall.jl/stable/compat/#Tabular-data-/-Pandas">Pandas dataframes</a> to Julia and back.</p>
<p>Similar to Numba, Julia stores the compiled native code in memory. In the upcoming Julia 1.9 release, this code will also be automatically cached on disk per Julia package, so compilation happens only once, instead of on the first call in every new Julia session. Ahead-of-time compilation was always possible with <a target="_blank" href="https://github.com/JuliaLang/PackageCompiler.jl">PackageCompiler.jl</a>. The PackageCompiler is not actually compiling (Julia and LLVM do that), it simply gathers the in-memory compiled code and stores it in a <code>.so</code> library (or <code>.dll</code> on Windows). This can be used for a standalone library of your compiled code, and will automatically include all dependent libraries. I have written a long tutorial on how to <a target="_blank" href="https://www.functionalnoise.com/pages/2022-07-21-embedding/">embed such Julia libraries inside C++</a> on my private website.</p>
<p>Static compilation of Julia, into tiny libraries fully independent of the runtime, is in an early stage with <a target="_blank" href="https://github.com/tshort/StaticCompiler.jl">StaticCompiler.jl</a> and <a target="_blank" href="https://github.com/brenhinkeller/StaticTools.jl">StaticTools.jl</a>, but needs more investment. Once you try out static compilation, you will notice that it enforces limitations on your code, because you cannot use all the fancy dynamic language features. I believe this is an unavoidable trade-off in any of the discussed technologies so far, but I'd love to be surprised on this point.</p>
<p>Other attempts to make a fast language easier to use are Zig, Swift and GoLang to a certain extent. Rust is very interesting, but I would not call it easy for scientists. None of them are targeting numerical computing as much as Julia.</p>
<h3 id="heading-mojo">Mojo</h3>
<p>A new language that was revealed very recently is Mojo. In their article <a target="_blank" href="https://docs.modular.com/mojo/why-mojo.html">Why Mojo?</a> they rephrase the two language problem as a Two World Problem, or even three world problem (Python, C++ and CUDA) for machine learning. From the code snippets on their website it looks like they want a Python compatible language that has features of Rust. Note that Mojo is not yet released to the public. The <a target="_blank" href="https://github.com/modularml/mojo">Mojo GitHub repository</a> is empty at the time of writing, so we don't even know if Mojo will have a permissive license. Ambitious, but very young. We'll keep an eye on this one.</p>
<h2 id="heading-final-comparison">Final Comparison</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683112314101/c2d307c0-929c-4dcd-8f2e-73cd84ba64ca.png" alt class="image--center mx-auto" /></p>
<p>What are good comparison criteria? I have chosen a few below. The development community size can be used as an estimate of how much effort goes into each project. Other than that I have tried to compare the usage and technology choices. I don't want to give quantitative performance comparisons here, they are heavily dependent on your use case, but from the benchmarks on complex examples that I have seen, Julia typically performs best. However, performance might not be your main criterion. Other unlisted aspects, such as debugging, profiling or cloud deployment may be more relevant for your use case.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Cython</td><td>Numba</td><td>Julia</td></tr>
</thead>
<tbody>
<tr>
<td>Contributors</td><td>430</td><td>298</td><td>1386</td></tr>
<tr>
<td>GitHub stars</td><td>7.9k</td><td>8.6k</td><td>42.2k</td></tr>
<tr>
<td>Backend tech</td><td>C transpiler</td><td>LLVM</td><td>LLVM</td></tr>
<tr>
<td>Usage</td><td>Decorators and Cython types</td><td>Decorators everywhere</td><td>Learn another language</td></tr>
<tr>
<td>Python interoperability</td><td>Import cythonized modules</td><td>Just-in-Time (JIT) decorators</td><td>PythonCall package</td></tr>
<tr>
<td>Performance</td><td>Decent</td><td>Good</td><td>Best</td></tr>
<tr>
<td>Distribution</td><td>Ship the .so with all dependencies</td><td>Ahead-of-Time compilation decorators</td><td>Ahead-of-Time compilation via PackageCompiler.jl</td></tr>
</tbody>
</table>
</div><p>While I have a preference for Julia, I tried to stay as unbiased as possible in this blog post. All options are amazing open source projects and are maintained by mostly voluntary developer communities. Investigating them all is a humbling experience in the complexity of software technology.</p>
<p>There are other software engineering requirements that I have not yet included in this post, but might be important for your use case:</p>
<ul>
<li><p>Package management. How easy is it to create and install a package, with all of its dependencies, in the chosen technology? I believe Julia's <a target="_blank" href="https://pkgdocs.julialang.org/v1/">Pkg</a> is superior here.</p>
</li>
<li><p>How does dependency management and distribution of binary artifacts work exactly? These nitty-gritty details can slow down your project. I have not yet tried this extensively for Numba. Cython involves some manual work. Julia has an artifact manager that works together with the package manager inside PackageCompiler.</p>
</li>
<li><p>Complex cases, like multi-threading inside the framework or external threads calling from another language. (Note: Julia doesn't have the global interpreter lock (GIL) like Python. Cython can release the GIL in certain cases.) I haven't gone into such topics yet, but there are many complex use cases that you may want to gradually add to your codebase. How far can you go with each technology before hitting a wall?</p>
</li>
</ul>
<p>Finally, remember that a technical solution does not necessarily result in a cultural improvement. If you hired a lot of scientists or analysts or domain experts, and none of them have the necessary software skillset, it is difficult to improve collaboration with software engineers by forcing a 'better' technology onto them. You will have to empower your scientists to learn the necessary software development tools and processes, such as version control, test-driven development and continuous integration. Vice versa, your software engineers can learn the business domain and the tricks of numerical computing with the help of your scientists, to know exactly what code to write. By bridging these gaps, you can create a more effective team that can leverage the full potential of the technology investments you make.</p>
<p><em>Thanks to Jorge Vieyra and Jeroen van der Meer for reviewing and suggesting excellent improvements to the article.</em></p>
<p><em>These long posts take me quite some time and effort to write. If you like them, please encourage me by leaving a comment with suggestions or subscribe to my newsletter. With enough support, I intend to write a book about building and deploying professional numerical computing applications. With Julia examples.</em></p>
]]></content:encoded></item><item><title><![CDATA[Production-ready code for scientists:  3 lessons learned]]></title><description><![CDATA[How do you become a great scientific coder? To understand this, I want to ask others about their journey and share their lessons with you. This post is a collaboration with Keith Myerscough, a mathematical consultant and senior engineer, who helped m...]]></description><link>https://scientificcoder.com/production-ready-code-for-scientists-3-lessons-learned</link><guid isPermaLink="true">https://scientificcoder.com/production-ready-code-for-scientists-3-lessons-learned</guid><category><![CDATA[production]]></category><category><![CDATA[code]]></category><category><![CDATA[advice]]></category><category><![CDATA[LessonsLearned]]></category><category><![CDATA[General Programming]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Thu, 04 May 2023 09:43:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683010031069/a33212ab-1534-4842-92cd-a94b4e89785c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>How do you become a great scientific coder? To understand this, I want to ask others about their journey and share their lessons with you. This post is a collaboration with</em> <a target="_blank" href="https://www.linkedin.com/in/keith-myerscough/">Keith Myerscough</a><em>, a mathematical consultant and senior engineer, who helped me with setting up our internal Julia language ecosystem</em>.</p>
<p>Matthijs asked me to write a guest post about what skills are needed to become a great scientific coder. He asked me because I assisted a team of scientists in adopting Julia for their research and development work. I am keen to help: I empathize with people who come up with great ideas but have a hard time wrapping these up into something that can be turned into a product. Even under the assumption that code will be extensively improved upon by others with more software knowledge, that first delivery is either a big hurdle or the seed for a smooth project.</p>
<p>Before diving into what I think is most important for scientists developing code, let us take a small step back. A central theme of this scientific coding blog is that we must rid ourselves of hard boundaries between different groups working towards delivering the same product. But removing the divide(s) does not remove the inherent differences between ideation and productization, between divergent and convergent thinking. We will still have to bring “something” from an idea into a product; we just want to make this journey a continuous one. The reality is, however, that larger products require multiple people working on them. So the idea-product continuum will still have to be divided. I see this as creating a chain of several people working in stages. The important thing is that any boundary is permeable, in both directions: new ideas are presented in a way that can be turned into a product and improvements in the product can find their way back to the code used in ideation. This blog post addresses the implications of the first of those requirements on people not used to having their code end up in a production environment.</p>
<p>From my time working with a team of engineers with a physics background generating code that was intended for production, I have found the following suggestions to be most relevant:</p>
<ul>
<li><p>Make your work reproducible</p>
</li>
<li><p>Keep everything as small as possible</p>
</li>
<li><p>Be relentless in asking for help</p>
</li>
</ul>
<p>I will discuss these in more detail below.</p>
<h2 id="heading-make-your-work-reproducible">Make your work reproducible</h2>
<p>This is probably too obvious for many of you, but this is the start. The divide between ideation and productization is also one between “it works now” and “it will work forever”. As a scientific software developer, your code does not have to work forever, it does not have to cover all corner cases and it does not have to be optimized for performance. But if you want to hand it over to the next person in the chain, it will have to at least do (more or less) the same in most cases.</p>
<p>The best way to guarantee this is to have the intended use(s) of your idea encoded in tests and to have these tests run in an automated environment. It might even help to write the tests first. Unfortunately, I too often see people “testing” their code using REPL commands or script runs without including these in the delivery, so the commands and scripts needed to run the code are lost forever. Running tests in an automated environment, such as GitHub Actions, Bitbucket Pipelines or Jenkins, avoids any sneaky dependencies on your local machine configuration. This infrastructure is a prerequisite for scientific software development.</p>
<p>There are two skills required for this. The first is a solid understanding of the test framework in use for your project. You should feel comfortable in both modifying and creating tests, using the tools available. The second skill is version control. In particular, familiarize yourself with the command-line version. There is some irony in the fact that graphical version control tools like Tortoise and SourceTree can themselves lead to irreproducibility, as there’s no way of tracing just what you clicked and when; with the command line, you always have your history. Furthermore, the command line restricts you to using only the commands you know. You can get a long way with the basics.</p>
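<p>As a minimal sketch of what this can look like in Julia: the function <code>area</code> below is purely hypothetical, and the checks are the kind of thing you might otherwise type into the REPL once and then lose. Written down as a test, they become part of the delivery.</p>
<pre><code class="lang-julia">using Test

# Hypothetical function under development, used only for illustration.
area(radius) = pi * radius^2

# The same checks you might type into the REPL, now stored with the code.
@testset "area" begin
    @test area(0) == 0
    @test area(1) ≈ pi
    @test area(2) ≈ 4pi
end
</code></pre>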
<h2 id="heading-keep-everything-as-small-as-possible">Keep everything as small as possible</h2>
<p>This is a super-linear advantage, to put it in nerdy terms. Reducing the size of a component makes all the work that has to be done on it easier. It makes every pull request less work, it makes every test run faster and it makes every bug easier to find. But more than that, it reduces the number of people who need to work on it, reducing the amount of communication required.</p>
<p>This is an important advantage of moving from a two-language situation to the one-language paradigm. In the two-language situation, the productized software will always have a modular structure, thanks to the software engineering attention that was spent on the architecture. This modularity would, however, not necessarily exist in the ideation code base. In the one-language paradigm, the same architecture is available to scientists. You, as a scientific software developer, can see this as one of the rewards of switching tools.</p>
<p>One of the easiest yet often overlooked ways to keep things small is to rely on existing (open-source, inner-source or even closed-source) tooling. This immediately reduces the complexity of your code. In Julia, it also aids in the composability of your code to rely on existing implementations.</p>
<p>It’s harder to divine specific skills for this goal. You will need to know a little more about how a language works to know how to split a package into multiple packages, or better still, how to introduce new functionality in a separate package that interacts nicely with existing packages.</p>
<p><em>[Matthijs' comment:] I can think of a simple tip. If you notice that you often write long scripts or functions, say dozens or hundreds of lines of code inside one function, you can probably cut it into multiple functions with easy-to-understand names.</em></p>
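<p>A small sketch of that tip, assuming a made-up analysis pipeline (all function names here are hypothetical): each step gets its own short function, and the top-level function reads like a table of contents.</p>
<pre><code class="lang-julia">using Statistics: mean, std

# Hypothetical analysis steps; the names are illustrative only.
remove_missing(data) = collect(skipmissing(data))

standardize(data) = (data .- mean(data)) ./ std(data)

summarize(data) = (mean=mean(data), maximum=maximum(data))

# The top-level function only chains the small, named steps.
function analyze(raw)
    cleaned = remove_missing(raw)
    scaled = standardize(cleaned)
    return summarize(scaled)
end

result = analyze([1.0, missing, 2.0, 3.0])
</code></pre>
<p>Each small function can now be tested and reused on its own, which is much harder with one long script.</p>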
<h2 id="heading-be-relentless-in-asking-for-help">Be relentless in asking for help</h2>
<p>It is important to acknowledge that you cannot know everything. As a scientific software developer, you will probably need help from the more software-savvy people on your team. Do not hold back in asking them. Again within the framework of an idea-product continuum, there should be no hard boundaries. Whoever is responsible for bringing your delivery to the next level has an interest in helping you deliver something better, to make your life easier.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Writing code for product development as a scientist can be a challenging task. Hopefully, my advice can help you find ways to improve your work. Remember, writing good code is not just about solving the problem at hand, but also about helping others, including your future self. By investing in your software skills now, you can set yourself and your team up for success in the long run.</p>
<p><em>Let me know if you enjoy this guest post and would like to read more of such posts in the future, by leaving a comment!</em></p>
]]></content:encoded></item><item><title><![CDATA[Automate Your Code Quality In Julia]]></title><description><![CDATA[Code quality is a topic in Julia that I believe deserves more attention from both users and developers. The Julia language originated in academia and focused heavily on helping scientists write better code, which is going great and deserves much prai...]]></description><link>https://scientificcoder.com/automate-your-code-quality-in-julia</link><guid isPermaLink="true">https://scientificcoder.com/automate-your-code-quality-in-julia</guid><category><![CDATA[Julia]]></category><category><![CDATA[Code Quality]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[tools]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Wed, 26 Apr 2023 07:47:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1682430557530/69f045a8-635c-4ec6-b836-dbdcf7f0dd23.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Code quality is a topic in Julia that I believe deserves more attention from both users and developers. The Julia language originated in academia and focused heavily on helping scientists write better code, which is going great and deserves much praise! However, to onboard more software engineers and professional organizations we're going to have to invest even further into code quality and automated code quality tools and other methods such as used in the field of "quality assurance". In this article I'll explore the current state in the Julia ecosystem.</p>
<p>At our workplace we have investigated the following tools and practices. I'll start from generic practices and then move on to more advanced tools.</p>
<ul>
<li><p>Package structure</p>
</li>
<li><p>Unit testing with Pkg.jl</p>
</li>
<li><p>Automated testing and Continuous Integration (CI)</p>
</li>
<li><p>Code Coverage with Pkg.jl</p>
</li>
<li><p>Documentation testing with Documenter.jl</p>
</li>
<li><p>Style guides and JuliaFormatter.jl</p>
</li>
<li><p>Static Code Analysis with StaticLint.jl</p>
</li>
<li><p>Quality Assurance with Aqua.jl</p>
</li>
<li><p>Type stability with JET.jl</p>
</li>
</ul>
<p>Let's have a look at all of them.</p>
<h2 id="heading-packages">Packages</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682494969545/cff83dbe-327f-419f-ae55-d3441e2115e2.png" alt class="image--center mx-auto" /></p>
<p>The Julia community uses a standardized package structure and has a single package manager called <a target="_blank" href="https://pkgdocs.julialang.org/v1/">Pkg.jl</a>. There's also a single documentation system and a single testing system. This consensus alone helps tremendously with automating any workflows in your projects and organizations.</p>
<p>Please make sure you share professional code with others via packages. It's straightforward to adhere to the package structure. To get you started with creating well-defined (open source) packages, you can look at <a target="_blank" href="https://github.com/JuliaCI/PkgTemplates.jl">PkgTemplates.jl</a>.</p>
<p>When setting up my first open source Julia package, I enjoyed the documentation of the <a target="_blank" href="https://bjack205.github.io/JuliaTemplateRepo.jl/dev/index.html">JuliaTemplateRepo.jl</a>, which goes through all the basic steps and configurations for a Julia package.</p>
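<p>As a rough sketch of how PkgTemplates.jl can be used: the user name below is a placeholder, the plugin selection is just one possible choice, and PkgTemplates expects your git user name and email to be configured. Check the PkgTemplates documentation for the options that fit your organization.</p>
<pre><code class="lang-julia">using PkgTemplates

# "your-github-username" is a placeholder; the plugin list is an example selection.
template = Template(;
    user="your-github-username",
    plugins=[
        Git(),                        # initialize a git repository
        GitHubActions(),              # CI workflow that runs the tests
        Codecov(),                    # upload coverage reports
        Documenter{GitHubActions}(),  # build the documentation in CI
    ],
)

# Generate the package skeleton in the default development directory.
template("MyPackage")
</code></pre>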
<h2 id="heading-unit-testing">Unit Testing</h2>
<p>Nowadays unit testing is a common practice in professional software engineering. Developers in Julia should be no exception. Fortunately, according to <a target="_blank" href="https://viralinstruction.com/posts/goodjulia/#strong_ecosystem_tooling_consensus">Viral Instruction</a>, 89% of all open source Julia packages have tests, including a lot of beginner packages. It's safe to say that the Julia community puts a lot of emphasis on testing, which I think is remarkable for a language that originated in academia. This really sets a good example.</p>
<p>All unit testing uses the <a target="_blank" href="https://docs.julialang.org/en/v1/stdlib/Test/">Test.jl</a> package, which ships with Julia as part of the standard library. There are some extensions like <a target="_blank" href="https://github.com/ssfrr/TestSetExtensions.jl">TestSetExtensions.jl</a> and <a target="_blank" href="https://github.com/JuliaTesting/ReTest.jl">ReTest.jl</a>, but I believe you can do most of your work with Test.jl.</p>
<p>Getting started with testing is trivial in Julia, just add a <code>test/runtests.jl</code> file to your package and add code like this:</p>
<pre><code class="lang-julia"><span class="hljs-keyword">using</span> Test
<span class="hljs-keyword">using</span> MyPackage

<span class="hljs-meta">@testset</span> <span class="hljs-string">"MyPackage tests"</span> <span class="hljs-keyword">begin</span>
    <span class="hljs-meta">@test</span> <span class="hljs-number">1</span> + <span class="hljs-number">1</span> == <span class="hljs-number">2</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>You can then run the tests with <code>Pkg.test("MyPackage")</code> which starts an isolated sandbox environment for the tests.</p>
<p>If you are creating a reproducible package for others, including your future self, then there is absolutely no reason not to write tests. However, writing good descriptive tests that cover all your bases is a more advanced art. Many books have been written on this topic. I would advise you to just get started and improve your testing strategy as you grow your codebase.</p>
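<p>To give an idea of what slightly more descriptive tests can look like, here is a small sketch. The function <code>normalize_weights</code> is hypothetical and defined inline to keep the example self-contained; in a real package it would come from <code>using MyPackage</code>. Nested test sets group related checks, and <code>@test_throws</code> covers the error path.</p>
<pre><code class="lang-julia">using Test

# Hypothetical function under test, defined here only for illustration.
function normalize_weights(weights)
    if isempty(weights)
        throw(ArgumentError("weights must not be empty"))
    end
    return weights ./ sum(weights)
end

@testset "normalize_weights" begin
    @testset "typical input" begin
        normalized = normalize_weights([1.0, 3.0])
        @test normalized ≈ [0.25, 0.75]
        @test sum(normalized) ≈ 1.0
    end

    @testset "corner cases" begin
        # Errors are part of the behavior, so they deserve a test too.
        @test_throws ArgumentError normalize_weights(Float64[])
        @test normalize_weights([5]) == [1.0]
    end
end
</code></pre>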
<h2 id="heading-automated-testing-amp-continuous-integration">Automated Testing &amp; Continuous Integration</h2>
<p>Once you have unit tests defined, this aspect is low-hanging fruit for automation. It's very easy to forget to run the unit tests before committing. Automatically testing the code will save you from simple mistakes.</p>
<p>You can <a target="_blank" href="https://bjack205.github.io/JuliaTemplateRepo.jl/dev/CI.html">set up GitHub Actions</a>, or use a tool like Jenkins, to automatically run the tests upon every commit and block developers from merging if the unit tests do not pass. <a target="_blank" href="https://github.com/JuliaCI/PkgTemplates.jl">PkgTemplates.jl</a> will typically already generate this GitHub Action for your open source package.</p>
<p>The tools and practice of frequently and automatically checking your code during development are called Continuous Integration (CI). Inside software organizations this is often combined with Continuous Deployment (together: CI/CD). All CI tools that the Julia community uses can be found in <a target="_blank" href="https://github.com/JuliaCI">JuliaCI · GitHub</a>; I will address a few of those. If you want to configure your own GitHub Actions, you can be inspired by the examples in <a target="_blank" href="https://github.com/julia-actions">Julia Actions · GitHub</a>.</p>
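<p>If your CI system has no ready-made Julia integration, a small script is often enough. The sketch below assumes the CI job starts Julia in the root of your package; the file name <code>ci/run_tests.jl</code> is only a suggestion.</p>
<pre><code class="lang-julia"># ci/run_tests.jl (hypothetical file name)
# Invoke from the package root with: julia --project=. ci/run_tests.jl
using Pkg

# Install the dependencies recorded for the active project.
Pkg.instantiate()

# Run test/runtests.jl of the active project. A failing test throws an error,
# so the Julia process exits with a non-zero status and the CI job fails.
Pkg.test(; coverage=true)
</code></pre>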
<h2 id="heading-code-coverage">Code Coverage</h2>
<p>A straightforward metric for monitoring your code quality is to check the fraction of code covered by your tests. Similar to automating your tests, this is very low-hanging fruit in the Julia community.</p>
<p>The code coverage generation itself is embedded inside the package manager via <code>Pkg.test("MyPackage", coverage=true)</code>. This will generate <code>.cov</code> files next to your source files, with information about how often each line of code is touched by the tests.</p>
<p>Analyzing the code coverage visually line-by-line, for example inside VS Code, can help you identify where you are lacking tests, or help you find unused code that is never called by your functions and can be deleted. You can automatically send the files to a service, like <a target="_blank" href="http://Coveralls.io">Coveralls.io</a> or <a target="_blank" href="http://Codecov.io">Codecov.io</a>, and analyze them in the browser. Here's an example in <a target="_blank" href="https://app.codecov.io/gh/FluxML/Flux.jl/blob/master/src/losses/functions.jl#L155">Flux's functions.jl file</a> that has one uncovered line (note that it's very common to forget to test errors or other corner cases):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682335565744/6849d64f-4ab0-409f-980b-336c4c82dc97.png" alt class="image--center mx-auto" /></p>
<p>You can calculate statistics on these coverage files, for example with <a target="_blank" href="https://github.com/JuliaCI/Coverage.jl">Coverage.jl</a> or with one of the services above. That way you can monitor coverage statistics over time. And if you use such a service for your open source package, you can add a shiny badge to your readme to show off your coverage.</p>
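<p>A minimal sketch of computing such statistics locally with Coverage.jl, assuming you run it from the root of your package after a test run with coverage enabled (the output file name is arbitrary):</p>
<pre><code class="lang-julia">using Coverage

# Collect the coverage files that the test run produced for everything under src/.
coverage = process_folder("src")

covered_lines, total_lines = get_summary(coverage)
percentage = 100 * covered_lines / total_lines
println("Coverage: ", round(percentage; digits=1), "%")

# Optionally write an LCOV file, a format that many external tools can read.
LCOV.writefile("lcov.info", coverage)

# Remove the generated .cov files again.
clean_folder("src")
</code></pre>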
<p>Other commercial tools are busy adopting Julia's code coverage, check with your preferred supplier if they already support Julia. If not, please request them to do so.</p>
<h2 id="heading-documentation-testing">Documentation Testing</h2>
<p>Good documentation is incredibly important for the users of your package, both in the open source community and inside organizations. Unfortunately, documentation that includes code examples can run out of sync with your code if you forget to test those. But it's very easy to automatically test those code examples with <a target="_blank" href="https://documenter.juliadocs.org/stable/man/doctests/index.html">Documenter.jl doctesting</a>.</p>
<p>Whenever you write a docstring or write code snippets in your docs folder, just add <code>jldoctest</code> and the expected output.</p>
<pre><code class="lang-julia"><span class="hljs-string">``</span><span class="hljs-string">`jldoctest
a = 1
b = 2
a + b

# output

3
`</span><span class="hljs-string">``</span>
</code></pre>
<p>Now just add <code>Documenter.doctest(MyPackage)</code> to your automated tests, and you know immediately when your examples no longer work. Easy!</p>
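<p>A rough sketch of how this could look in <code>test/runtests.jl</code>. Here <code>MyPackage</code> is a placeholder for your own package, and the <code>DocMeta.setdocmeta!</code> call is only needed when your examples rely on <code>using MyPackage</code> without writing it in every snippet.</p>
<pre><code class="lang-julia">using Test
using Documenter
using MyPackage  # placeholder for your own package

# Make `using MyPackage` available inside every doctest of the package.
DocMeta.setdocmeta!(MyPackage, :DocTestSetup, :(using MyPackage); recursive=true)

# Check the examples in the docstrings; manual=false skips the pages in docs/src.
doctest(MyPackage; manual=false)
</code></pre>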
<h2 id="heading-style-guides-amp-code-formatting">Style Guides &amp; Code Formatting</h2>
<p>One of the challenges when working with many people on a single codebase is to adhere to a consistent coding style that is pleasant and unambiguous for everyone. This is where style guides help, together with formatting tools that make it easy to adhere to such a style guide.</p>
<p>The primary open source Julia style guides at the moment are:</p>
<ul>
<li><p><a target="_blank" href="https://github.com/jrevels/YASGuide">YAS</a> (yet another style)</p>
</li>
<li><p><a target="_blank" href="https://github.com/invenia/BlueStyle">Blue</a> style</p>
</li>
<li><p><a target="_blank" href="https://github.com/SciML/SciMLStyle">SciML</a> style</p>
</li>
</ul>
<p>You can <a target="_blank" href="https://www.julia-vscode.org/docs/stable/userguide/formatter/">automatically format</a> your Julia files in VS Code with the click of a button. If you are a command line user, or want to automate the formatting in your CI, you can use the underlying <a target="_blank" href="https://github.com/domluna/JuliaFormatter.jl">JuliaFormatter.jl</a> package.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682337461140/08742d45-0db9-4a1f-92fc-3dc1778f1f94.png" alt class="image--center mx-auto" /></p>
<p>This should get you started with code formatting in no time. Discuss with your colleagues which style guide you prefer. Personally I have used <a target="_blank" href="https://github.com/invenia/BlueStyle">BlueStyle</a> so far, but the SciML style is relatively new, so I am looking into that one as well.</p>
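<p>For command line or CI usage, a minimal sketch with JuliaFormatter.jl could look like this. The choice of the Blue style is just an example, and the error message is my own wording.</p>
<pre><code class="lang-julia">using JuliaFormatter

# Format every Julia file below the current directory in place, following the Blue style.
format("."; style=BlueStyle())

# In CI you may prefer a check that leaves the files untouched:
# `format` returns true only if everything was already formatted.
is_formatted = format("."; style=BlueStyle(), overwrite=false)
is_formatted || error("Some files are not formatted, please run JuliaFormatter locally.")
</code></pre>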
<h2 id="heading-static-code-analysis">Static Code Analysis</h2>
<p>Static code analysis will look at the code without executing it. One package we found is <a target="_blank" href="https://github.com/julia-vscode/StaticLint.jl">StaticLint.jl</a>, which is used primarily by the Julia VS Code plugin <a target="_blank" href="https://github.com/julia-vscode/LanguageServer.jl">LanguageServer.jl</a> to report on potential problems in your code. These are normally reported under the "Problems" tab. Here I found a few potential problems inside the <a target="_blank" href="https://github.com/JuliaData/DataFrames.jl">DataFrames.jl</a> package:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682407779863/4faa1e3e-25f2-4e9f-afc7-471f77de467a.png" alt class="image--center mx-auto" /></p>
<p>The VS Code plugin also reports on potential problems when hovering over code, such as a warning about this unused function argument. These are not reported in the "Problems" tab.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682416576798/b916156a-a8d4-4738-81a1-bafdb5a94b53.png" alt class="image--center mx-auto" /></p>
<p>StaticLint still lacks some documentation for users, but you can use the following <a target="_blank" href="https://gist.github.com/pfitzseb/22493b0214276d3b65833232aa94bf11">script</a> and read my discussion in an <a target="_blank" href="https://github.com/julia-vscode/StaticLint.jl/issues/14">issue here</a>. After some fiddling with the code and environments I am able to obtain the same "diagnostics" in my REPL for a given file:</p>
<pre><code class="lang-julia">julia&gt; docs[<span class="hljs-number">3</span>]
Document: file:///c%<span class="hljs-number">3</span>A/Users/matcox/Documents/Julia/static_lint/src/abstractdataframe/selection.jl

julia&gt; docs[<span class="hljs-number">3</span>].diagnostics[<span class="hljs-number">1</span>]
LanguageServer.Diagnostic(LanguageServer.<span class="hljs-built_in">Range</span>(LanguageServer.Position(<span class="hljs-number">223</span>, <span class="hljs-number">15</span>), LanguageServer.Position(<span class="hljs-number">223</span>, <span class="hljs-number">36</span>)), <span class="hljs-number">4</span>, <span class="hljs-string">"UnusedFunctionArgument"</span>, missing, <span class="hljs-string">"Julia"</span>, <span class="hljs-string">"An argument is included in a function signature but 
not used within its body."</span>, [<span class="hljs-number">1</span>], missing)
</code></pre>
<p>So StaticLint.jl can be used, but it's not yet user friendly for integration into any command line interfaces or automated tooling.</p>
<h2 id="heading-quality-assurance-with-aquajl">Quality Assurance with Aqua.jl</h2>
<p>The package <a target="_blank" href="https://github.com/JuliaTesting/Aqua.jl">Aqua.jl</a> is developed to automate quality assurance for Julia. The readme is clear on what it checks:</p>
<ul>
<li><p>There are no method ambiguities.</p>
</li>
<li><p>There are no undefined <code>export</code>s.</p>
</li>
<li><p>There are no unbound type parameters.</p>
</li>
<li><p>There are no stale dependencies listed in <code>Project.toml</code>.</p>
</li>
<li><p>Check that test target of the root project <code>Project.toml</code> and test project (<code>test/Project.toml</code>) are consistent.</p>
</li>
<li><p>Check that all external packages listed in <code>deps</code> have corresponding <code>compat</code> entry.</p>
</li>
<li><p><code>Project.toml</code> formatting is compatible with Pkg.jl output.</p>
</li>
<li><p>There are no "obvious" type piracies</p>
</li>
</ul>
<p>Aqua provides a function to use in your testing environment, which reports issues and throws an error, so your tests cannot pass unless all Aqua checks pass. Here's a snippet of what we get for DataFrames:</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-keyword">using</span> Aqua, DataFrames

julia&gt; Aqua.test_all(DataFrames)
<span class="hljs-number">17</span> ambiguities found
Ambiguity <span class="hljs-comment">#1</span>
&lt;=(a::<span class="hljs-built_in">Integer</span>, b::SentinelArrays.ChainedVectorIndex) <span class="hljs-keyword">in</span> SentinelArrays at C:\Users\matcox\.julia\packages\SentinelArrays\BcfVF\src\chainedvector.jl:<span class="hljs-number">208</span>
&lt;=(x::<span class="hljs-built_in">BigInt</span>, i::<span class="hljs-built_in">Integer</span>) <span class="hljs-keyword">in</span> Base.GMP at gmp.jl:<span class="hljs-number">696</span>

Possible fix, define
  &lt;=(::<span class="hljs-built_in">BigInt</span>, ::SentinelArrays.ChainedVectorIndex)

... 

Test Summary:    | Fail  Total  Time
<span class="hljs-built_in">Method</span> ambiguity |    <span class="hljs-number">1</span>      <span class="hljs-number">1</span>  <span class="hljs-number">8.3</span>s
ERROR: Some tests did not pass: <span class="hljs-number">0</span> passed, <span class="hljs-number">1</span> failed, <span class="hljs-number">0</span> errored, <span class="hljs-number">0</span> broken.
</code></pre>
<p>Unfortunately I only find ambiguities for DataFrames, maybe I should find a package with more problems. You can also run all the underlying checks independently if you read the <a target="_blank" href="https://juliatesting.github.io/Aqua.jl/stable/#Aqua.test_all-Tuple%7BModule%7D">Aqua documentation</a>. Note that if you only want to check for ambiguities, you can also choose to call <code>Test.detect_ambiguities</code> directly from the standard Julia Test package.</p>
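<p>A sketch of how these checks could be wired into <code>test/runtests.jl</code>, with <code>MyPackage</code> as a placeholder for your own package. Running the ambiguity check separately from the other checks is one possible setup, not a requirement.</p>
<pre><code class="lang-julia">using Test
using Aqua
using MyPackage  # placeholder for your own package

@testset "Aqua quality checks" begin
    # Run every Aqua check except the ambiguity check...
    Aqua.test_all(MyPackage; ambiguities=false)
    # ...and run that one on its own.
    Aqua.test_ambiguities(MyPackage)
end

@testset "Ambiguities via the Test standard library" begin
    # detect_ambiguities returns the pairs of ambiguous methods it found.
    @test isempty(Test.detect_ambiguities(MyPackage))
end
</code></pre>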
<p>A nice addition to Aqua would be a way to store the found issues in a standardized file format instead of printing them on the REPL. Similar to code coverage reporting, this can help to make overviews in automated systems. Now we would have to capture the printed output and parse that somehow.</p>
<h2 id="heading-finding-type-instability-with-jetjl">Finding Type Instability with JET.jl</h2>
<p>There is non-stop activity in the Julia community to analyze our own code for improvements. An advanced package is <a target="_blank" href="https://github.com/aviatesk/JET.jl">JET.jl</a>, which uses the Julia compiler itself to detect potential bugs and type instabilities.</p>
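<p>What is type instability? It occurs when the compiler cannot infer a single concrete type for a variable or return value from the types of the inputs alone. Here's a simple example that returns either an integer or a floating-point number, depending on the value of the input:</p>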
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> foo(value)
    <span class="hljs-keyword">if</span> value &lt; <span class="hljs-number">1</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
    <span class="hljs-keyword">else</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1.0</span>
    <span class="hljs-keyword">end</span>
<span class="hljs-keyword">end</span>
</code></pre>
<p>Type instability is bad for performance because the compiler cannot infer the types and generate optimal native code. It may also point at bugs in your code, if you did not intend to have such instability. Julia does not enforce type stability like certain languages, because it wants to remain an easy language to use. Sometimes you don't care about performance and don't want to worry about types, and in those cases it helps that the language does not force you to.</p>
<p>If you just want to check whether the output value can be inferred, you can use <code>Test.@inferred</code> in your tests:</p>
<pre><code class="lang-julia">julia&gt; <span class="hljs-keyword">using</span> Test

julia&gt; Test.<span class="hljs-meta">@inferred</span> foo(<span class="hljs-number">0.5</span>)
ERROR: <span class="hljs-keyword">return</span> <span class="hljs-keyword">type</span> <span class="hljs-built_in">Int64</span> does not match inferred <span class="hljs-keyword">return</span> <span class="hljs-keyword">type</span> <span class="hljs-built_in">Union</span>{<span class="hljs-built_in">Float64</span>, <span class="hljs-built_in">Int64</span>}

julia&gt; Test.<span class="hljs-meta">@inferred</span> foo(<span class="hljs-number">1.5</span>)
ERROR: <span class="hljs-keyword">return</span> <span class="hljs-keyword">type</span> <span class="hljs-built_in">Float64</span> does not match inferred <span class="hljs-keyword">return</span> <span class="hljs-keyword">type</span> <span class="hljs-built_in">Union</span>{<span class="hljs-built_in">Float64</span>, <span class="hljs-built_in">Int64</span>}
</code></pre>
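<p>Fixing an instability usually means making all branches return the same type. The sketch below uses two hypothetical helper functions: the unstable one returns the integer <code>0</code> for an empty vector, while the stable one uses <code>zero(eltype(values))</code> so the return type follows the element type.</p>
<pre><code class="lang-julia">using Test

# Type-unstable: an empty Float64 vector gives the Int 0, a non-empty one gives a Float64.
first_or_zero_unstable(values) = isempty(values) ? 0 : first(values)

# Type-stable: zero(eltype(values)) has the same type as the elements.
first_or_zero(values) = isempty(values) ? zero(eltype(values)) : first(values)

@testset "first_or_zero is inferable" begin
    @test (@inferred first_or_zero(Float64[])) === 0.0
    @test (@inferred first_or_zero([2.5, 1.0])) === 2.5
    # The unstable variant makes @inferred throw, which can be asserted as well.
    @test_throws ErrorException @inferred first_or_zero_unstable(Float64[])
end
</code></pre>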
<p>However, when you want more certainty about the internals of your code, you can turn to JET. Most of JET's functionality analyzes specific method calls with <code>@report_opt</code> and <code>@report_call</code>. JET can also do some limited static analysis of your entire package with the <code>report_package</code> function. Unlike <code>@report_call</code>, <code>report_package</code> doesn't know what types you want to pass into your methods, so it has to make some assumptions.</p>
<p>I do warn that the output of JET can be rather intimidating. Here's what you get when executing the example <code>@report_call sum("julia")</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682424595139/69f43545-9fa5-4ee5-bd98-afa928c11a55.png" alt class="image--center mx-auto" /></p>
<p>And that's just the example from the <a target="_blank" href="https://aviatesk.github.io/JET.jl/stable/jetanalysis/#jetanalysis-quick-start">Quick Start</a> page of JET.</p>
<p>We're still investigating how to use JET, because it is pretty advanced tooling. If you just started with types and Julia, I wouldn't dive right into this. Take your time to think about what <a target="_blank" href="https://en.wikipedia.org/wiki/Type_inference">type inference</a> really means, and read the <a target="_blank" href="https://aviatesk.github.io/JET.jl/stable/tutorial/">documentation of JET</a> if you want to know more.</p>
<h2 id="heading-reducing-compile-time-with-snoopcompilejl">Reducing Compile Time with SnoopCompile.jl</h2>
<p>Optimizing your code to reduce compilation times is maybe not the first thing that comes to mind when thinking about "code quality", but it can improve the user experience of your package. Nobody likes to wait long to import your package. The <a target="_blank" href="https://github.com/timholy/SnoopCompile.jl">SnoopCompile.jl</a> package helps you analyze your code for such improvements. It "snoops" on the compiler and reports on its findings.</p>
<p>There is a lengthy blog post from the SciML community on how they improved their compilation times with SnoopCompile and other tools, called <a target="_blank" href="https://sciml.ai/news/2022/09/21/compile_time/">How Julia ODE Solve Compile Time Was Reduced From 30 Seconds to 0.1</a>. Definitely read that one for more information. I will not go into details here, but I do think SnoopCompile is worth a mention.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You have plenty of options to check the code quality of your Julia packages and improve the quality over time. If this feels like a daunting task as a beginning (Julia) developer, don't worry, you can slowly add these tools to your workflow over time. The most important thing is to start with a good package structure and basic testing. The fact that the Julia ecosystem is so focused on making quality easy for beginners is truly praise-worthy and will help us all in the long run.</p>
<p>For senior developers and managers looking into these tools, one thing to remember is that lots of code quality tooling in Julia is written with the human developer in mind. This currently limits some of the integration in automated CI tools. I believe this topic deserves some more attention in the Julia community and more support from commercial code quality tooling vendors. The good thing is that, due to the standardization of Julia's package management, it is very easy to get started with a uniform automation system in your organization. As the tools improve for these systems, it will be easy to incrementally add such tools to any open source or internal CI workflows.</p>
<p>Thanks to my colleague Matthijs den Otter for helping with the investigation. If we find better ways to monitor your Julia code quality, I intend to share that here, so don't forget to subscribe to the blog.</p>
]]></content:encoded></item><item><title><![CDATA[The Art of Multiple Dispatch]]></title><description><![CDATA[I love thinking visually by drawing doodles and schematics for my work. It's one of my favorite things to do, next to coding. When working with the Julia language, one visualization I enjoy is seeing the type space of a method that you are dispatchin...]]></description><link>https://scientificcoder.com/the-art-of-multiple-dispatch</link><guid isPermaLink="true">https://scientificcoder.com/the-art-of-multiple-dispatch</guid><category><![CDATA[Julia]]></category><category><![CDATA[Art]]></category><category><![CDATA[design patterns]]></category><dc:creator><![CDATA[Matthijs Cox]]></dc:creator><pubDate>Thu, 20 Apr 2023 09:01:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1681976569505/e92f2398-d079-449e-ac6a-4b194fe226ac.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I love thinking visually by drawing doodles and schematics for my work. It's one of my favorite things to do, next to coding. When working with the Julia language, one visualization I enjoy is seeing the type space of a method that you are dispatching on. Normally I do this in my mind's eye, but let me clarify this by drawing some actual figures.</p>
<p>To start with the basics: Julia has functions and methods. A function is simply the name, like <code>push!</code> or <code>read</code>. Methods are specific definitions of a function, for certain types of arguments. Take for example <code>push!(s::Set, x)</code> or <code>read(io::IO)</code>. From an object-oriented perspective you could say that methods are instances of functions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681976585586/63ee230c-90c2-41b1-b568-6eafa3b466e9.png" alt class="image--center mx-auto" /></p>
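<p>To make the distinction concrete, here is a minimal sketch with one function that has two methods. The name <code>area</code> is a hypothetical example made up for this illustration. The built-in <code>methods</code> and <code>hasmethod</code> let you inspect what is defined.</p>
<pre><code class="lang-julia"># `area` is a hypothetical function name, used only for illustration
area(r::Float64) = pi * r^2            # method 1: circle with radius r
area(w::Float64, h::Float64) = w * h   # method 2: rectangle with width and height

# One function, two methods
length(methods(area))              # 2

# hasmethod checks whether a method exists for the given argument types
hasmethod(area, Tuple{Float64})    # true
hasmethod(area, Tuple{String})     # false
</code></pre>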
<p>For any given method you can consider the dispatching as slicing a part out of the entire possible type space of that given function, for a given number of arguments of course. If you increase the number of arguments in the function definition, then more dimensions get added to the type space. I don't even know how to find the best written words for this, the visualization above just feels intuitive to me.</p>
<p>Let's take the function <code>f</code> and imagine for a moment that there are only 3 types in the whole Julia type universe: the <code>Float64</code>, <code>Int64</code> and the <code>String</code>. The <code>Float64</code> and the <code>Int64</code> are subtypes of <code>Number</code>, which is obvious I hope. By default in Julia, if you specify no type for a function argument, it is assumed you mean the <code>Any</code> type, of which every other type is a subtype.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681980418289/8da099db-a243-4908-8a55-b4e2a999fc0d.png" alt class="image--center mx-auto" /></p>
<p>A method <code>f(::Any, ::Any)</code> thus describes the entire space of all possible types for the function named <code>f</code>. On the other hand, a method like <code>f(::Int64, ::String)</code> is super concrete: it's a singular point in the type space.</p>
<p>You can use abstract types like <code>Number</code> or unions like <code>Union{Float64, Int64}</code> to capture a subset of the discrete type space. This way you can choose which part you want to define for your function, with the chosen set of types you will be dispatching on at runtime. Abstract types in Julia exist only for this dispatching purpose, to dispatch on a set of subtypes. They have no other influence on their subtypes whatsoever. They are not dictators like classes in other languages.</p>
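<p>A small sketch of how this looks in code, using a hypothetical function <code>describe</code> that is made up for this illustration. Each method claims a different region of the type space, and the smallest region that contains the argument wins. The last call assumes a 64-bit machine, where the literal <code>1</code> is an <code>Int64</code>.</p>
<pre><code class="lang-julia"># `describe` is a hypothetical function, used only for illustration
describe(::Any) = "anything"                                  # the whole type space
describe(::Number) = "a number"                               # all subtypes of Number
describe(::Union{Float64, Int64}) = "a Float64 or an Int64"   # exactly two types

describe("text")   # "anything"
describe(1 // 2)   # "a number" (a Rational is a Number, but not part of the Union)
describe(1.0)      # "a Float64 or an Int64"
describe(1)        # "a Float64 or an Int64"
</code></pre>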
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681977217318/3aaf2783-ca62-46f7-9854-99515f0c3818.png" alt class="image--center mx-auto" /></p>
<p>I like these visuals. Some junior engineers wonder what "diagonal dispatch" is. I don't have any other way of explaining the concept than by just drawing it. The figure is immediately obvious. Diagonal dispatch happens when the types of all arguments are forced to be equal with <code>f(::T, ::T) where T</code>. This truly represents a diagonal through the type space. You can see it in the example above. You can also limit the diagonal dispatch to a subset with <code>f(::T, ::T) where T&lt;:Number</code> and in higher dimensions you can be fancy like <code>f(::T, ::T, ::S) where {T&lt;:Number, S&lt;:AbstractString}</code> by adding multiple of these parametric types.</p>
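<p>Here is a minimal sketch of the diagonal in code. The function <code>same</code> is a hypothetical name made up for this illustration, and the fallback method on <code>Any</code> is only there to show what happens off the diagonal.</p>
<pre><code class="lang-julia"># `same` is a hypothetical function, used only for illustration
same(::Any, ::Any) = "different types"   # the whole two-dimensional type space
same(::T, ::T) where T = "same type"     # only the diagonal

same(1, 2)       # "same type"
same(1, 2.0)     # "different types"
same("a", "b")   # "same type"
</code></pre>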
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681980462724/38a9832e-1633-4457-864b-7c49b10d76bd.png" alt class="image--center mx-auto" /></p>
<p>Note that when you define multiple methods that overlap in the type space, you have to take care that it is clear which method gets dispatched on. Julia will pick the one that is most specific about the types of the given arguments. In the figure above, I ordered them from most specific to least specific. You can try for yourself to see if I ordered them correctly.</p>
<p>For example if you define the following:</p>
<pre><code class="lang-julia">f(::<span class="hljs-built_in">Any</span>, ::<span class="hljs-built_in">Any</span>) = println(<span class="hljs-string">"any &amp; any"</span>)
f(::<span class="hljs-built_in">Int64</span>, ::<span class="hljs-built_in">Int64</span>) = println(<span class="hljs-string">"int &amp; int"</span>)
</code></pre>
<p>Then most function calls will run the broadest method because that one is defined for the whole type space, but when you input two integers you will call the very specific method <code>f(::Int64, ::Int64)</code>. Let's give it a go:</p>
<pre><code class="lang-julia">julia&gt; f(<span class="hljs-string">"string"</span>, <span class="hljs-number">5</span>)
any &amp; any

julia&gt; f(<span class="hljs-number">4</span>, <span class="hljs-number">5</span>)
int &amp; int
</code></pre>
<p>From a visual perspective, we have created an overlapping dispatch, where one method is specifically defined for the integer case <code>f(::Int64, ::Int64)</code> and will be called when only integers are used as arguments.</p>
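<p>If you are unsure which method a call will end up in, you can ask Julia with the built-in <code>which</code>. The sketch below uses a hypothetical function <code>pick</code> with the same two signatures as above, so it doesn't clash with <code>f</code>. In the REPL, the macro form <code>@which pick(4, 5)</code> gives the same answer.</p>
<pre><code class="lang-julia"># `pick` is a hypothetical function with the same two signatures as above
pick(::Any, ::Any) = "any and any"
pick(::Int64, ::Int64) = "int and int"

# which returns the method that dispatch selects for the given argument types
which(pick, (Int64, Int64))    # the pick(::Int64, ::Int64) method
which(pick, (String, Int64))   # the pick(::Any, ::Any) method
</code></pre>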
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681976623361/6e67812f-5ae8-44a5-9d14-a26b8bffe6f8.png" alt class="image--center mx-auto" /></p>
<p>There are some caveats here. If you are not careful the methods can become ambiguous and Julia won't like that. For example if you define the following:</p>
<pre><code class="lang-julia">f(::<span class="hljs-built_in">Any</span>, ::<span class="hljs-built_in">String</span>) = println(<span class="hljs-string">"any &amp; string"</span>)
f(::<span class="hljs-built_in">String</span>, ::<span class="hljs-built_in">Any</span>) = println(<span class="hljs-string">"string &amp; any"</span>)
</code></pre>
<p>Which one of these methods should be called with <code>f("string", "string")</code>?</p>
<pre><code class="lang-julia">julia&gt; f(<span class="hljs-string">"string"</span>, <span class="hljs-number">5</span>)
string &amp; any

julia&gt; f(<span class="hljs-number">5</span>, <span class="hljs-string">"string"</span>)
any &amp; string

julia&gt; f(<span class="hljs-string">"string"</span>, <span class="hljs-string">"string"</span>)
ERROR: <span class="hljs-built_in">MethodError</span>: f(::<span class="hljs-built_in">String</span>, ::<span class="hljs-built_in">String</span>) is ambiguous.  Candidates:
  f(::<span class="hljs-built_in">Any</span>, ::<span class="hljs-built_in">String</span>) <span class="hljs-keyword">in</span> Main at REPL[<span class="hljs-number">8</span>]:<span class="hljs-number">1</span>
  f(::<span class="hljs-built_in">String</span>, ::<span class="hljs-built_in">Any</span>) <span class="hljs-keyword">in</span> Main at REPL[<span class="hljs-number">9</span>]:<span class="hljs-number">1</span>
Possible fix, define
  f(::<span class="hljs-built_in">String</span>, ::<span class="hljs-built_in">String</span>)
</code></pre>
<p>Yikes, that's impossible! Fortunately a fix is proposed: explicitly define the ambiguous case. Though perhaps you should reconsider what your actual intentions are in this design. The visual representation below hopefully makes the mistake clearer. At first there is confusion because the two dispatches overlap and neither is more specific than the other, but we can fix it by defining a more concrete method in the conflicting area.</p>
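<p>As a sketch of that fix, here is the same situation with a hypothetical function <code>g</code>, so that it can be run in a fresh session without touching <code>f</code>. The <code>try</code> block only records that the ambiguous call throws a <code>MethodError</code> before the third method exists.</p>
<pre><code class="lang-julia"># `g` is a hypothetical function mirroring the ambiguous methods above
g(::Any, ::String) = "any and string"
g(::String, ::Any) = "string and any"

# Two strings match both methods equally well, so this call throws a MethodError
ambiguous_before = try
    g("string", "string")
    false
catch err
    err isa MethodError
end

# Define the overlapping corner explicitly, as the error message suggests
g(::String, ::String) = "string and string"

g("string", "string")   # "string and string"
</code></pre>
<p>The two original methods keep working for their own regions. Only the overlapping corner is now handled by the new, most specific method.</p>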
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681980801630/59886127-3d95-4bce-aff0-bd436fb2a9b2.png" alt class="image--center mx-auto" /></p>
<p>When you define a lot of methods, you are creating a colorful patchwork in the type space of your function. You can come up with the craziest designs in your methods, but be careful. Finding the right balance of a few big broad abstract methods versus multiple tiny concrete methods is a true art in Julia.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681976654572/db41ed23-bc30-41aa-8370-23ad2ebe75fd.png" alt class="image--center mx-auto" /></p>
<p>People do not often share how they visualize the code design in their mind, while I believe this really shapes the creative process. The closest visual representation in Julia I have seen is the article about <a target="_blank" href="https://www.moll.dev/projects/effective-multi-dispatch/">Julia's dispatch with Pokemon types</a>. You can read that for more detailed examples with Julia's multiple dispatch.</p>
<p>That concludes this short artsy post, but I hope it helps the visual thinkers in the programming community! Let me know if you use different kinds of visualizations in your coding work.</p>
]]></content:encoded></item></channel></rss>