Compiled languages
Before starting this session please review the prerequisites. You should at least have the rust toolchain installed.
In this module you will be introduced to a compiled language called rust. There are three parts:
a. Introduction (this session). A description of basics in rust. No main exercise.
b. Data structures. A description of some more complex types in rust. Exercise: k-mer counter.
c. Object-oriented programming. Using objects (structs) to write your code. Exercise: Gaussian blur.
Some other examples of compiled languages include C, C++ and Fortran. I've chosen rust mostly because of its great build system. It's much easier to set up than other compilers, which would be a multi-hour session in its own right. The syntax is not massively dissimilar from python and there are good third party packages available. I expect that more research software projects will use it in the future.
Languages like R, python and Julia are interpreted languages. Roughly, this means that every time you run the code it is first translated into machine code, which is what the computer actually runs, and then run. You can think of the translation and the run happening each time you reach a new line. (NB: We've also looked at packages like numba, which can compile a single function in python.)
In a compiled language, these two steps are split up. Before running the program, you run a compiler on all of your code (known as compile-time), which converts it into machine code and creates an executable. Each time you run the code (known as run-time), you then directly run the executable without having to compile the code again. Clearly this can be more efficient, but why else might you want to use a compiled language?
- Speed. As well as skipping the interpreting step, the compiler being able to see all the code at once can automatically optimise your code. Knowing the types of all the variables also avoids other expensive checks. For CPU intensive tasks, it's not unreasonable to expect 10x or more faster code by default.
- Control of memory. Compiled languages allow you to pass objects by copying them, moving them, or by passing a reference. This more direct interface to shared memory gives much more control of data, which can be crucial when working with large arrays.
- Easier/better parallelisation. Similar to memory, the extra control typically allows multiprocessing to be a lot more efficient, and it is often easier to implement.
- Type checking. As we will see, all variables must have a type, for example integer, string, or float. Beyond just speed, ensuring these are compatible can catch errors at compile-time.
- Shipping a binary. It is possible to include everything you need to run your code in a single file. No need for conda, and updated libraries with different APIs won't break your code.
Sounds great! So why isn't everything written in compiled languages? I would say the two main reasons are that:
- It is fundamentally a bit more complicated to learn and write code in a compiled language, so it takes longer to develop in.
- R and python are really popular and have loads of great packages to use -- often you end up implementing more yourself in a compiled language.
In this session we will:
- Introduce the basics of rust and its build system, and write some simple example code.
- Review some of the practical aspects of writing rust code: debugging, profiling and optimising.
The next part of the session will then take your rust skills further, and we will look at some common data structures you can use to accelerate your compiled code. Finally, we will work through an example using object oriented programming to implement Gaussian blur on images.
We only have time for a short introduction, but if this piques your interest the 'Rust By Practice' book is a good place to continue.
Please note that you can actually run the rust code in the examples here if you press the play button.
General introduction
Compiling code
To make a new rust project, run:
cargo new iprog
This will create a new project called 'iprog'. Change into that directory and you will see Cargo.toml, which contains the metadata about your project, and the src/ directory for the code. This currently contains main.rs, which is the source for your program. It is of course possible to split this over multiple files or make libraries rather than executables, but this is beyond the scope of this course, and we will just use this simple structure throughout.
The Cargo.toml lets you set the version of your code and other information such as the authors. It's also where you list any dependencies, which we will use later on.
The default main.rs contains a simple program which writes 'Hello, world!' to the terminal when run. The main() function is what is run when the code starts, and println means 'print line'. The ! marks println as a macro, a special shorthand for some common functions:
fn main() {
    println!("Hello, world!");
}
(try running this with the 'play' button)
Some other commands you'll need for running your code are:
cargo run
Try this in your code directory. You'll see that the code is compiled, and then run.
If you want to compile the code without running it, run:
cargo build
You'll then get an executable in target/debug/iprog. Try running that directly.
Why is it under debug/? As noted above, when we compile this code the compiler itself can make various optimisations which make the code run faster. For example, eliminating unused sections of code, combining instructions, or guessing which branch of a conditional is likely to be executed. However, this means the executed code doesn't always correspond to lines in your source. When compiled in debug mode (which is the default) most optimisations are turned off to help you step through the code line by line. But when you are ready to test the 'proper' version of your code, you instead add the --release flag:
cargo build --release
Finished release [optimized] target(s) in 0.22s
You'll see this says optimized, and you can run it from target/release/iprog.
Have a look at https://godbolt.org/ if you want to see how source code translates
to machine code under different levels of optimisation.
Finally, if you want to install your program, so you can run it just by typing iprog at the command prompt, run:
cargo install --path .
Note that the default is to compile the optimised code for the installed version.
More useful commands:
- cargo fmt will format your code nicely.
- cargo clippy runs the linter.
- cargo test runs any tests you have written.
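As a minimal sketch of what cargo test picks up: any function marked with #[test] is compiled and run as a test. The double function and the test name here are hypothetical, chosen only to have something to check (functions themselves are covered later in this session).

```rust
/// Hypothetical helper, used only so there is something to test.
fn double(value: i32) -> i32 {
    value * 2
}

fn main() {
    println!("{}", double(21));
}

// Functions marked #[test] are only compiled and run by `cargo test`.
#[test]
fn double_handles_positive_and_negative() {
    assert_eq!(double(2), 4);
    assert_eq!(double(-3), -6);
}
```

If an assertion fails, cargo test reports which test failed along with the two values that were compared.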
Types
One of the most important distinctions between languages is how they deal with the types of their variables. Rust is 'strongly typed' and 'statically typed', which means that all variables have to have known types at compile time, and trying to assign between different types will result in a compiler error.
In rust, you use the let keyword to define a variable:
#![allow(unused)]
fn main() {
    let number = 64;
    println!("{number}");
}
Here, rust has automatically inferred the type of number, but we can set it explicitly to the integer type i32 (which we will explain in a moment) as follows:
#![allow(unused)]
fn main() {
    let number: i32 = 64;
    println!("{number}");
}
Let's try comparing an integer to a floating point f64:
#![allow(unused)]
fn main() {
    let int: i32 = 64;
    let decimal: f64 = 64.0;
    if int == decimal {
        println!("Equal");
    } else {
        println!("Unequal");
    }
}
This gives a compiler error:
5 | if int == decimal {
  |    ---    ^^^^^^^ expected `i32`, found `f64`
  |    |
  |    expected because this is `i32`
We cannot assign or compare between two different types. But if we convert the integer to a float using as, this will work:
#![allow(unused)]
fn main() {
    let int: i32 = 64;
    let floaty_int = int as f64;
    let decimal: f64 = 64.0;
    if floaty_int == decimal {
        println!("Equal");
    } else {
        println!("Unequal");
    }
}
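One thing worth knowing about as is that it never fails: going from a float to an integer it drops the fractional part, and values that are too large are clamped to the limit of the integer type. The sketch below contrasts it with the standard library's f64::from and i32::try_from conversions. The three function names are hypothetical and only there for illustration.

```rust
use std::convert::TryFrom;

/// Widening an i32 to an f64 never loses information, so `f64::from` exists.
fn int_to_float(value: i32) -> f64 {
    f64::from(value)
}

/// `as` from float to integer drops the fractional part and
/// clamps to the limits of the integer type.
fn float_to_int(value: f64) -> i32 {
    value as i32
}

/// Narrowing an i64 to an i32 can fail, so `try_from` reports whether it worked.
fn fits_in_i32(value: i64) -> bool {
    i32::try_from(value).is_ok()
}

fn main() {
    println!("{}", int_to_float(64));
    println!("{}", float_to_int(64.9));
    println!("{}", float_to_int(1e20));
    println!("{}", fits_in_i32(3_000_000_000));
}
```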
A brief overview of the types in rust which are most likely to be useful for you:
Type | Description |
---|---|
i32 | Signed integer using 32-bits, range -2,147,483,648 to 2,147,483,647. Equivalent to int in C++ on most platforms, and similar to int in python and int in R (though python integers have no fixed size). |
u32 | Unsigned (i.e. non-negative) integer using 32-bits, range 0 to 4,294,967,295. |
usize | Used for sizes/lengths (e.g. length of a list), non-negative numbers only. |
f64 | Floating point (decimal) number stored using 64-bits. Equivalent to double in C++, float in python, num in R. |
char | A Unicode character such as 'a', 'é' or '🫠'. |
bool | A boolean, true or false. |
Vec | A vector, which is the name for a list/array where the size can change. |
String | A string, which is effectively a list of unicode characters (not quite a Vec of char, but similar). You may also see str passed to functions (usually as &str), which is the same, except it isn't 'owned' so the length can't be changed. |
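The ranges quoted in the table are available as constants on each type, and the sketch below prints them. It also shows the most common place you meet usize: the length of a list, which needs converting before it can be mixed with other number types. The mean function is a hypothetical helper written just for this example.

```rust
/// Mean of a list of floats. `len()` returns a usize, which has to be
/// converted to f64 before it can be used in the division.
fn mean(values: &Vec<f64>) -> f64 {
    let mut total = 0.0;
    for value in values {
        total += *value;
    }
    total / values.len() as f64
}

fn main() {
    // The limits of each integer type are built in.
    println!("i32: {} to {}", i32::MIN, i32::MAX);
    println!("u32: {} to {}", u32::MIN, u32::MAX);

    // checked_add gives None instead of overflowing past the limit.
    println!("{:?}", i32::MAX.checked_add(1));

    println!("{}", mean(&vec![1.5, 2.5, 5.0]));
}
```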
Tuples can be written and accessed as follows:
#![allow(unused)]
fn main() {
    let tuple: (i32, char, bool) = (500, 'b', true);
    println!("{} {} {}", tuple.0, tuple.1, tuple.2);
}
The .0 means the 1st element (rust is 0-indexed, same as python, different to R). Note also that in the println! any empty brackets {} get filled, in order, with the arguments after the format string.
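Tuples are also a handy way to return more than one value from a function, and they can be unpacked into named variables in a single let. A small sketch, where min_max is a hypothetical helper:

```rust
/// Returns the smallest and largest of three values as a tuple.
fn min_max(a: i32, b: i32, c: i32) -> (i32, i32) {
    (a.min(b).min(c), a.max(b).max(c))
}

fn main() {
    // Unpack the tuple into two named variables instead of using .0 and .1
    let (smallest, largest) = min_max(7, -2, 5);
    println!("{smallest} {largest}");
}
```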
Rarely, you may want to use an array, which has a known and fixed size at compile time i.e. you have to know the size when you write the code:
#![allow(unused)]
fn main() {
    let a: [i32; 5] = [1, 2, 3, 4, 5];
    println!("{:?}", a);
}
Here also note the use of :? in the print, which uses the debug rather than display method to print for that type. You usually need to use this with lists or anything more complex than a basic type.
One final thing to note here is that variables are immutable by default (much like declaring everything const in other languages), so if you want to be able to change a variable in rust you need to add the mut keyword (for mutable).
#![allow(unused)]
fn main() {
    let mut number = 80;
    number += 10;
    println!("{number}");
}
Conditionals
We've already seen how to use an if statement above. Another useful construct in rust is the match statement, which lets you write multiple if statements easily:
#![allow(unused)]
fn main() {
    let number = 25;
    match number {
        1..=24 => println!("One to 24"),
        _ => println!("Something else"),
    }
    let result = match number {
        1 | 3 => "One or three",
        24..=26 => "24, 25 or 26",
        _ => "Anything else",
    };
    println!("{result}");
}
Each arm is an expression. In the first match (top) the arms just print something; in the second (bottom) they return a value, which can be assigned to a variable. The _ arm catches 'else'.
These are particularly useful with enum (enumerated) types, where you name all the possible values for a variable:
#![allow(unused)]
fn main() {
    enum EMBLSites {
        Heidelberg,
        Ebi,
        Hamburg,
        Barcelona,
        Grenoble,
        Rome,
    }
    let my_site = EMBLSites::Ebi;
    let country = match my_site {
        EMBLSites::Heidelberg | EMBLSites::Hamburg => "Germany",
        EMBLSites::Ebi => "UK",
        EMBLSites::Barcelona => "Spain",
        EMBLSites::Grenoble => "France",
        EMBLSites::Rome => "Italy",
    };
    println!("I work in {}", if country == "UK" { "the UK" } else { country });
}
The final line here demonstrates the equivalent of a ternary, which is a one-line if/else expression which returns a value (in C++ these are written as condition ? val if true : val if false).
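Arms can also carry an extra condition, known as a match guard, written as `if` after the pattern. The sketch below mixes a guard with the range patterns shown above. The function name and the temperature bands are made up for illustration, and the return type &'static str just means 'a string literal', which lives for the whole program.

```rust
/// Hypothetical helper: describe a temperature given in degrees Celsius.
fn describe_temperature(celsius: i32) -> &'static str {
    match celsius {
        // A guard: this arm matches any value, but only if the condition holds.
        t if t < 0 => "freezing",
        0..=15 => "cold",
        16..=25 => "mild",
        _ => "hot",
    }
}

fn main() {
    for celsius in [-5, 10, 20, 35] {
        println!("{celsius}C is {}", describe_temperature(celsius));
    }
}
```

Arms are tried from top to bottom and the first one that matches wins, so the order matters when patterns overlap.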
Lists, loops and iterators
First let's make something to loop over. Lists are known as vectors and can be created with the vec! macro in two ways:
#![allow(unused)]
fn main() {
    let list = vec![1, 2, 4, 5, 6]; // Give the full list
    for number in list {
        println!("{number}");
    }
}
#![allow(unused)]
fn main() {
    let list = vec![2; 10]; // Set the value 2, ten times
    println!("{list:?}");
    for (idx, number) in list.iter().enumerate() {
        println!("{}", number * idx);
    }
}
By using .iter() on the vector you can access a lot of the python-like loop methods like .enumerate() to get the loop index, and .zip() to loop over two things at once. You might see some fancy single line operations made by chaining these methods:
#![allow(unused)]
fn main() {
    let list = vec![2; 10];
    let out: i32 = list.iter().map(|x| *x * 2).sum();
    println!("{out}");
}
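To illustrate .zip(), which is mentioned above but not shown, here is a sketch of two chained one-liners. Both function names are hypothetical; .filter() is another standard iterator method, which keeps only the items for which the closure returns true.

```rust
/// Dot product of two lists, pairing up elements with zip.
/// zip stops at the end of the shorter list.
fn dot_product(a: &Vec<f64>, b: &Vec<f64>) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// Sum of only the even numbers in a list, selected with filter.
fn sum_even(list: &Vec<i32>) -> i32 {
    list.iter().filter(|x| *x % 2 == 0).sum()
}

fn main() {
    let a = vec![1.0, 2.0, 3.0];
    let b = vec![4.0, 5.0, 6.0];
    println!("{}", dot_product(&a, &b));
    println!("{}", sum_even(&vec![1, 2, 3, 4]));
}
```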
You can also use while loops:
#![allow(unused)]
fn main() {
    let stop: f64 = 10000.0;
    let mut value: f64 = 2.0;
    let mut list: Vec<f64> = Vec::new();
    while value < stop {
        value *= 2.0;
        list.push(value);
    }
    println!("{list:?} {}", list[2]);
}
Some other new things here are using Vec::new() to create an empty list, then the .push() method to add values to it, and list[2] to access the third value.
Also note that we gave the list an explicit type Vec<f64>. The angle brackets denote a generic (template) type. Within a Vec every item must be the same type, but they can be any type. We need to let the compiler know what this type is. The reason for the angle brackets will become clear if you define generic functions where the same function can accept different types, but that is beyond the scope of this course.
The rust compiler typically gives very helpful error messages, and the vscode extension can fill a lot of types in for you. So if the exact syntax of these is a bit unclear at this point don't worry, as you will be guided into how to fix these in your code.
Functions; reference and value
Let's look at defining a simple function in rust. The types of each parameter must be given explicitly, and the return type given too. Here's an example:
#![allow(unused)]
fn main() {
    fn linear_interpolation(x: f64, slope: f64, intercept: f64) -> f64 {
        let y = x * slope + intercept;
        y // same as writing `return y;`
    }
    let y = linear_interpolation(6.5, -1.2, 0.0);
    println!("{y}");
}
Here the return type is a decimal number f64. Note that to return a value from a function it is preferred to write the value without an ending semicolon, similar to R functions.
The function above passes the parameters x, slope and intercept by value, which copies them into new memory accessible by the function before running the function. These can then be modified without affecting the original variables.
It is possible to pass by reference, where instead a pointer to the original variable is passed. This has two effects. Firstly, the variable is not copied. Secondly, if the reference is mutable (see below) and the variable is changed in the function, it will be changed in the original instance.
#![allow(unused)]
fn main() {
    fn array_double(list: &Vec<i32>) -> Vec<i32> {
        let mut return_list = Vec::with_capacity(list.len());
        for value in list {
            return_list.push(*value * 2);
        }
        return_list
    }
    let list = vec![1, 2, 3, 4, 5];
    println!("{:?}", array_double(&list));
}
We have used the ampersand & to tell the function we want a reference to the list, not the list itself. When we call the function we also use an ampersand to take the reference to the list that needs passing. (Another trick shown above: if you know the size of a Vec you will push to when you create it, using Vec::with_capacity() is more efficient than Vec::new().)
Note here we use an asterisk * to dereference the values in the list. We need to do this as the reference is a memory address, and we want to operate on the value stored there rather than on the address itself. So each & increases the reference level by one, each * decreases it by one (until you get to the original value, which cannot be dereferenced further).
The rust compiler is pretty good at helping you get this right and will suggest when you need to reference or dereference a variable. But when should you use references? As a rule of thumb, any 'small' data types, such as single integers or characters (e.g. i32, f64, char), can be passed by value. Anything 'larger' like a Vec, String or dictionary should be passed by reference to avoid a copy. If you want to mutate the original variable, regardless of type, you'll want to use a mutable reference.
If we want our function to operate on the original list, we can use a mutable reference &mut and remove the return:
#![allow(unused)]
fn main() {
    fn array_double(list: &mut Vec<i32>) {
        for value in list {
            *value = *value * 2;
        }
    }
    let mut list = vec![1, 2, 3, 4, 5];
    array_double(&mut list);
    println!("{:?}", list);
}
One final point. You'll often see the above function written more like this:
#![allow(unused)]
fn main() {
    fn array_double(list: &[i32]) -> Vec<i32> {
        let mut return_list = Vec::new();
        for value in list {
            return_list.push(*value * 2);
        }
        return_list
    }
}
So &Vec<i32> has been replaced with &[i32]. This is a slice, which is no longer an 'owned' type, which means the size can't be modified. You can do this to be more flexible with the types the function can take, for example you can also operate on static arrays and slices (ranges) of vectors:
#![allow(unused)]
fn main() {
    fn array_double(list: &[i32]) -> Vec<i32> {
        let mut return_list = Vec::new();
        for value in list {
            return_list.push(value * 2);
        }
        return_list
    }
    let a: [i32; 5] = [1, 2, 3, 4, 5];
    println!("{:?}", array_double(&a));
    let list = vec![1, 2, 3, 4, 5];
    println!("{:?}", array_double(&list[1..=3]));
}
(The slice syntax is [start..end] to not include the end value, or [start..=end] to include the end value.)
In this manner you may want to use the following replacements in function types:
- &Vec<T> -> &[T]
- &String -> &str
- Box<T> -> &T
Box is a smart pointer. If you already know what a smart pointer is, you can read the rust guide on its versions of these. But otherwise it's not necessary to know about these straight away.
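The same idea applied to strings can be sketched as follows. Writing the parameter as &str rather than &String means the function accepts a string literal, a borrowed String, or part of either. The count_gc function is a hypothetical example, not something used elsewhere in the course.

```rust
/// Counts the G and C bases in a sequence.
fn count_gc(sequence: &str) -> usize {
    sequence
        .chars()
        .filter(|base| *base == 'G' || *base == 'C')
        .count()
}

fn main() {
    let owned: String = String::from("ACGTGGCA");
    println!("{}", count_gc("GATTACA"));    // a string literal is already a &str
    println!("{}", count_gc(&owned));       // &String is converted automatically
    println!("{}", count_gc(&owned[0..4])); // a slice of the String
}
```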
Structs
Let's go back to our linear interpolation example above. If we had a single line and wanted to call this repeatedly on different x values but using the same line, it would be good just to store the slope and intercept once. We can do this with a struct, which is the foundation of object oriented programming:
#![allow(unused)]
fn main() {
    struct Line {
        slope: f64,
        intercept: f64,
    }
    impl Line {
        fn interpolate(&self, x: f64) -> f64 {
            x * self.slope + self.intercept
        }
    }
    let linear = Line { slope: 6.5, intercept: -1.2 };
    let y1 = linear.interpolate(0.0);
    let y2 = linear.interpolate(1.0);
    let linear2 = Line { slope: 0.0, intercept: 0.0 };
    let y3 = linear2.interpolate(10000.0);
    println!("{y1} {y2} {y3}");
}
The struct definition describes the variables contained by the new type you are defining, in this case Line. The impl block then contains function definitions which that type implements. These may use the variables within the struct, accessed by self (which is always the first argument, if it is needed). The code itself then creates an instance of Line, which is a specific variable of this type with specific values of all the fields. You can then use this to call the functions in impl.
Above the struct is created directly using curly brackets. You'll often want to add a function which creates the struct for you (conventionally called new), which returns the Self type (which still works even if you change the name of the struct):
#![allow(unused)]
fn main() {
    use std::f64::consts::PI;
    struct Circle {
        radius: f64,
        diameter: f64,
        circumference: f64,
        area: f64,
    }
    impl Circle {
        fn new(radius: f64) -> Self {
            Self {
                radius,
                diameter: 2.0 * radius,
                circumference: 2.0 * radius * PI,
                area: PI * radius * radius,
            }
        }
    }
}
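The example above defines new() but never calls it. As a sketch of how that looks, here is a variant of Circle which (as a design choice made for this example, not the version above) stores only the radius and computes the other quantities on demand. This also means they stay correct if the radius is later changed through a &mut self method such as the hypothetical scale().

```rust
use std::f64::consts::PI;

struct Circle {
    radius: f64,
}

impl Circle {
    /// Constructor: called as Circle::new(...), it has no self argument.
    fn new(radius: f64) -> Self {
        Self { radius }
    }

    fn area(&self) -> f64 {
        PI * self.radius * self.radius
    }

    fn circumference(&self) -> f64 {
        2.0 * PI * self.radius
    }

    /// Changes the struct, so it needs a mutable reference to self.
    fn scale(&mut self, factor: f64) {
        self.radius *= factor;
    }
}

fn main() {
    let mut circle = Circle::new(1.0);
    println!("{} {}", circle.area(), circle.circumference());
    circle.scale(2.0);
    println!("{} {}", circle.area(), circle.circumference());
}
```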
You can add the pub keyword to make fields of the struct accessible from outside the module where the struct is defined. But it is generally advised to make helper functions to do this instead, which can give more control, for example returning a reference rather than a value:
#![allow(unused)]
fn main() {
    struct Train {
        name: String,
        pub pantograph_lowered: bool,
    }
    impl Train {
        fn get_name(&self) -> &String {
            &self.name
        }
        fn set_name(&mut self, new_name: &str) {
            self.name = new_name.to_string();
        }
        fn raise_pantograph(&mut self) {
            self.pantograph_lowered = false;
        }
    }
    let mut choo_choo = Train { name: "Amtrak".to_string(), pantograph_lowered: true };
    println!("The {} train's pantograph is {}", choo_choo.get_name(), if choo_choo.pantograph_lowered { "lowered" } else { "up" });
    choo_choo.set_name("LIRR");
    choo_choo.raise_pantograph();
    println!("The {} train's pantograph is {}", choo_choo.get_name(), if choo_choo.pantograph_lowered { "lowered" } else { "up" });
}
You can also implement some of the standard traits yourself (see rust docs on traits for more info on exactly what those are), the most useful typically being Display and Debug:
#![allow(unused)]
fn main() {
    use std::fmt;
    struct Train {
        name: String,
        pub pantograph_lowered: bool,
    }
    impl Train {
        fn get_name(&self) -> &String {
            &self.name
        }
        fn set_name(&mut self, new_name: &str) {
            self.name = new_name.to_string();
        }
        fn raise_pantograph(&mut self) {
            self.pantograph_lowered = false;
        }
    }
    impl fmt::Display for Train {
        fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
            write!(
                f,
                "The {} train's pantograph is {}",
                self.get_name(),
                if self.pantograph_lowered { "lowered" } else { "up" }
            )
        }
    }
    impl fmt::Debug for Train {
        fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
            write!(
                f,
                "name:{}\npantograph_lowered:{}",
                self.name, self.pantograph_lowered
            )
        }
    }
    let mut choo_choo = Train { name: "Amtrak".to_string(), pantograph_lowered: true };
    println!("{choo_choo}");
    println!("{choo_choo:?}");
}
A handy thing about doing this is that you can easily write the struct (formatted however you like) to the terminal with println! or a file with write!, neatly from the main function.
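If you don't need a custom layout, the Debug trait can be generated for you by putting #[derive(Debug)] above the struct, which is usually enough for println! debugging. A minimal sketch, where the Station struct and the describe function are hypothetical:

```rust
// derive asks the compiler to write the Debug implementation for us.
#[derive(Debug)]
struct Station {
    name: String,
    platforms: u32,
}

/// Returns the derived debug text for a station as a String.
fn describe(station: &Station) -> String {
    format!("{:?}", station)
}

fn main() {
    let station = Station { name: "Cambridge".to_string(), platforms: 8 };
    println!("{}", describe(&station));
    // {:#?} pretty-prints the same information over several lines.
    println!("{station:#?}");
}
```

Display cannot be derived in this way, so you still write that one by hand when you want user-facing output.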
Option and Result types
Two special types you'll see in a lot of library code are Option<T> and Result<T, E>, so it helps to know what to do with them. Option<T> means that the value can either be of the expected type T, or empty. These are known as Some and None respectively:
#![allow(unused)]
fn main() {
    let value_four: Option<i32> = Some(4);
    let empty_value: Option<i32> = None;
}
So if you want to return nothing you use None, if there is a value you wrap it in Some().
As an example of how you might use this in a function:
#![allow(unused)]
fn main() {
    fn parse_list(input: &str) -> Option<Vec<u32>> {
        let v: Vec<&str> = input.split(',').collect();
        if v.len() == 1 {
            None
        } else {
            let num_v: Vec<u32> = v
                .iter()
                .map(|x| u32::from_str_radix(x, 10).unwrap())
                .collect();
            Some(num_v)
        }
    }
    let p1 = parse_list("4,8,15,16,23,42");
    if p1.is_some() {
        println!("{:?}", p1.unwrap());
    }
    let word = "numbers";
    let p2 = parse_list(word);
    if p2.is_none() {
        println!("You call this a list? '{word}'");
    }
}
If you don't care about empty values or don't expect them, just call .unwrap() and you'll get the value back out:
#![allow(unused)]
fn main() {
    let value_four = Some(4);
    println!("{}", value_four.unwrap());
}
This errors (the program 'panics' and stops) if you have an empty value:
#![allow(unused)]
fn main() {
    let empty_value: Option<i32> = None;
    println!("{}", empty_value.unwrap());
}
You can use .expect("Empty value") to give a custom error message. There are various other methods such as .is_none(), .unwrap_or_default() etc. A common pattern is to check an Option in an if statement, which you can do with if let as follows:
#![allow(unused)]
fn main() {
    fn print_opt(opt_var: Option<i32>) {
        if let Some(x) = opt_var {
            println!("{x}");
        } else {
            println!("empty");
        }
    }
    print_opt(Some(4));
    print_opt(None);
}
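Since Option is really just an enum with two values, you can also use match on it, and .unwrap_or() lets you supply a fallback value rather than stopping the program. A short sketch, where first_even and describe are hypothetical helpers:

```rust
/// Returns the first even number in the list, if there is one.
fn first_even(list: &[i32]) -> Option<i32> {
    for value in list {
        if value % 2 == 0 {
            return Some(*value);
        }
    }
    None
}

/// Handles both possibilities with match.
fn describe(found: Option<i32>) -> String {
    match found {
        Some(x) => format!("found {x}"),
        None => "nothing found".to_string(),
    }
}

fn main() {
    println!("{}", describe(first_even(&[1, 3, 4, 5])));
    println!("{}", describe(first_even(&[1, 3, 5])));
    // unwrap_or gives a fallback instead of an error when the value is None.
    println!("{}", first_even(&[1, 3, 5]).unwrap_or(0));
}
```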
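Result is similar to Option, but the equivalent of Some is Ok and the equivalent of None is Err. The key difference is that rather than just being empty, Err carries a value describing what went wrong, usually containing a helpful message of why there is no result (for example 'file not found'). This is why the full type is written Result<T, E>, where E is the type of the error.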
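As a sketch of Result in practice, the function below sums a comma-separated list of numbers and hands any parsing error back to the caller. The sum_list name is made up for this example; .parse() and the ParseIntError type come from the standard library. The ? after a Result means 'if this is an Err, return it from the function straight away, otherwise give me the value inside the Ok'.

```rust
use std::num::ParseIntError;

/// Parses a comma-separated list of numbers and adds them up.
fn sum_list(input: &str) -> Result<u32, ParseIntError> {
    let mut total = 0;
    for item in input.split(',') {
        // Returns early with the Err if this item is not a number.
        let number: u32 = item.trim().parse()?;
        total += number;
    }
    Ok(total)
}

fn main() {
    for input in ["4,8,15,16,23,42", "4,eight,15"] {
        match sum_list(input) {
            Ok(total) => println!("Sum of '{input}' is {total}"),
            Err(e) => println!("Could not parse '{input}': {e}"),
        }
    }
}
```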
Input and output
The final thing to mention is how to read and write from files. It's typically best to use BufReader and BufWriter respectively:
#![allow(unused)]
fn main() {
    use std::fs::File;
    use std::io::{BufRead, BufReader, BufWriter, Write};
    let file_in = BufReader::new(File::open("input.txt").expect("Could not open file"));
    // File::create opens the file for writing (File::open is read-only)
    let mut file_out = BufWriter::new(File::create("output.txt").expect("Could not write to file"));
    for line in file_in.lines() {
        // lines() strips the newline from each line, so writeln! puts it back
        writeln!(file_out, "{}", line.unwrap()).expect("Could not write line");
    }
}
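File operations all return a Result, so instead of calling .expect() or .unwrap() at every step you can pass errors back to the caller with ?. The sketch below wraps the same line-by-line copy in a function. The copy_lines name and the file names are placeholders for illustration; io::Result<T> is the standard library's shorthand for a Result whose error type is an I/O error.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};

/// Copies a text file line by line, returning the number of lines copied.
/// Any I/O error is handed back to the caller by the ? operator.
fn copy_lines(in_path: &str, out_path: &str) -> io::Result<usize> {
    let reader = BufReader::new(File::open(in_path)?);
    let mut writer = BufWriter::new(File::create(out_path)?);
    let mut n_lines = 0;
    for line in reader.lines() {
        writeln!(writer, "{}", line?)?;
        n_lines += 1;
    }
    // Make sure everything held in the buffer reaches the file.
    writer.flush()?;
    Ok(n_lines)
}

fn main() {
    match copy_lines("input.txt", "output.txt") {
        Ok(n) => println!("Copied {n} lines"),
        Err(e) => println!("Copy failed: {e}"),
    }
}
```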
More on the build system
Using packages
To use packages from https://crates.io/ you just add the package name and its version to Cargo.toml under the [dependencies] header. You can use:
- A specific version e.g. bmp = "0.5.0" (cargo treats this as 'any version compatible with 0.5.0'; write "=0.5.0" if you need exactly that version)
- Pinning to a major version e.g. bmp = "0.*"
- Allowing any version e.g. bmp = "*"
and various other strategies to choose a version. Unlike in python, using an exact version is fairly robust over time.
Some packages have optional 'features' you can enable, for example:
ndarray = { version = "*", features = ["serde", "rayon"] }
allows the ndarray package to save/load matrices with serde and run in parallel with rayon. These are off by default to reduce dependencies and compilation time.
If you have dependencies which are only used during testing (continuous integration), i.e. when you run cargo test, you can add these under a new header of [dev-dependencies].
Debugging
The default when you run cargo build or cargo run is to use the 'debug' version of the code, which is unoptimised so that it can be stepped through line by line. You can debug in VSCode by clicking the 'Debug' button above the main() function or Run->Start debugging. This lets you run through the code line by line, see all the currently defined variables and move up and down the call stack. All in the GUI, which is easy to use.
Adding println!() statements containing your variables is always useful too.
You may have gotten some out-of-bounds errors when you ran your code above. In C/C++ if you access the wrong index of a vector this usually results in a segmentation fault ('segfault') or even no error at all, and does not give a line number! Rust does check for correct access by default, which is a lot more informative (you can turn this off for speed, but it is unlikely to be worth it). If you run with RUST_BACKTRACE=1 cargo run you will also get the entire call stack leading up to such errors.
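If an index might legitimately be out of range, the .get() method on vectors and slices returns an Option rather than stopping the program. A small sketch, where value_or_default is a hypothetical helper:

```rust
/// Looks up an index, returning a fallback instead of panicking
/// when the index is past the end of the list.
fn value_or_default(list: &[i32], index: usize, fallback: i32) -> i32 {
    match list.get(index) {
        Some(value) => *value,
        None => fallback,
    }
}

fn main() {
    let list = vec![10, 20, 30];
    println!("{:?}", list.get(1)); // within bounds
    println!("{:?}", list.get(7)); // out of bounds, but no panic
    println!("{}", value_or_default(&list, 7, -1));
    // By contrast, list[7] would stop the program with an out-of-bounds error.
}
```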
Optimising code
This is pretty easy in rust. Run with cargo run --release or cargo build --release to get an optimised version. You can also try adding RUSTFLAGS="-C target-cpu=native", which lets the compiler use all the instructions available on your CPU, though in my experience this doesn't always make the code faster.
Profiling
I like the flamegraph package which is really easy to use and gives most of the info you need. You'll want to run on the optimised code which is much more representative of your actual run times. By default however the function names are lost during optimisation so see this section of the manual for a tip on how to fix this.
After adding:
[profile.release]
debug = true
to your Cargo.toml, run:
cargo install flamegraph
cargo flamegraph --root
Then open flamegraph.svg in a web browser. We don't get much information for the program above other than some time is taken for the image save and load. The implementation above could definitely be improved and would need deeper profiling with e.g. vtune to uncover this (if you didn't guess why already). But flamegraph is a useful starting point, especially when you have more complex programs with more functions.
Cargo.toml
A lot can be managed through Cargo.toml. As an example of a more involved set of metadata:
[package]
name = "ska"
version = "0.3.4"
authors = [
"John Lees <jlees@ebi.ac.uk>",
"Simon Harris <simon.harris@gatesfoundation.org>",
"Johanna von Wachsmann <wachsmannj@ebi.ac.uk>",
"Tommi Maklin <tommi.maklin@helsinki.fi>",
"Joel Hellewell <joel@ebi.ac.uk>",
"Timothy Russell <timothy.russell@lshtm.ac.uk>",
"Romain Derelle <r.derelle@imperial.ac.uk>",
"Nicholas Croucher <n.croucher@imperial.ac.uk>"
]
edition = "2021"
description = "Split k-mer analysis"
repository = "https://github.com/bacpop/ska.rust/"
homepage = "https://bacpop.org/software/"
license = "Apache-2.0"
readme = "README.md"
include = [
"/Cargo.toml",
"/LICENSE",
"/NOTICE",
"/src",
"/tests"
]
keywords = ["bioinformatics", "genomics", "sequencing", "k-mer", "alignment"]
categories = ["command-line-utilities", "science"]
[dependencies]
# i/o
needletail = { version = "0.4.1", features = ["compression"] }
serde = { version = "1.0.147", features = ["derive"] }
ciborium = "0.2.0"
noodles-vcf = "0.22.0"
snap = "1.1.0"
# logging
log = "0.4.17"
simple_logger = { version = "4.0.0", features = ["stderr"] }
indicatif = {version = "0.17.2", features = ["rayon"]}
# cli
clap = { version = "4.0.27", features = ["derive"] }
regex = "1.7.0"
# parallelisation
rayon = "1.5.3"
num_cpus = "1.0"
# data structures
hashbrown = "0.12"
ahash = "0.8.2"
ndarray = { version = "0.15.6", features = ["serde", "rayon"] }
num-traits = "0.2.15"
# coverage model
libm = "0.2.7"
argmin = { version = "0.8.1", features = ["slog-logger"] }
argmin-math = "0.3.0"
[dev-dependencies]
# testing
snapbox = "0.4.3"
predicates = "2.1.5"
assert_fs = "1.0.10"
pretty_assertions = "1.3.0"