Rust Strings
If you've seen any Rust code, you've probably seen two string types: &str and String. These can be a little confusing to get used to, but they're actually simple.
You should think of the String type like any other structure. All of the ownership rules we previously discussed apply as-is to the String type. The String type does not implement the Copy trait, meaning that assignments move the data to a new owner.
The &str type is a slice that references data owned by something else. This means that the &str type cannot outlive its owner.
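As a minimal sketch of these two rules, the snippet below moves a String between bindings and borrows a &str from it. The helpers consume and first_word are hypothetical names made up for this illustration.

```rust
// Hypothetical helper: takes ownership of a String and returns its length in bytes.
fn consume(s: String) -> usize {
    s.len()
}

// Hypothetical helper: returns a slice of the first word. The returned &str
// borrows from the data behind `s` and cannot outlive it.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

fn main() {
    let a = String::from("over 9000");
    let b = a; // move: `a` can no longer be used
    // println!("{}", a); // would not compile: borrow of moved value

    let c = b.clone(); // explicit copy, allocates a second String
    let word = first_word(&b); // `word` borrows from `b`
    println!("{} / {}", word, c);

    // `b` is moved into the function; `word` must not be used after this line.
    let len = consume(b);
    println!("{}", len);
}
```

Uncommenting the println! of a, or using word after the call to consume, should make the compiler reject the program.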
To better understand strings, we need to look at some examples. However, there's one important detail we need to address: string literals are of type &str:
fn main() {
    let power_level = "9000!!!";
    println!("It's over {}", power_level);
}
In the above snippet, power_level is a &str. This hopefully makes you ask: who owns the data that power_level references? For string literals, the data is baked into the executable's data section. We'll talk a little more about this later. For now, knowing that string literals are of type &str is enough to start understanding how the two types interact with each other and the ownership model.
Let's write code that keeps a count of the words inputted into our program. First, let's look at the skeleton:
fn main() {
    loop {
        let mut input = String::new();
        std::io::stdin().read_line(&mut input).unwrap();
        let words: Vec<&str> = input
            .split_whitespace()
            .collect();
        println!("{:?}", words);
    }
}
Since input is a String, it owns what we typed, say "it's over 9000!!!", and words contains a list of slices referencing input. (We call split_whitespace to create an iterator and use collect to loop through the iterator and put the values into a list.) This all works because our &str slices don't outlive the owner of the data they point to (input); it all falls out of scope, and thus gets freed, at the end of each loop iteration. To track the count of words across multiple inputs, you might try:
use std::collections::HashMap;

fn main() {
    // word => count
    let mut words: HashMap<&str, u32> = HashMap::new();
    loop {
        let mut input = String::new();
        std::io::stdin().read_line(&mut input).unwrap();
        for word in input.split_whitespace() {
            let count = match words.get(word) {
                None => 0,
                Some(count) => *count,
            };
            words.insert(word, count + 1);
        }
        println!("{:?}", words);
    }
}
(If you're wondering why we need to dereference count, i.e. Some(count) => *count, it's because the get method of the HashMap returns a reference, i.e. Option<&V>, which makes sense, as the HashMap still owns the value. In this case, we're ok with "moving" this out of the HashMap since u32 implements the Copy trait, so the value is simply copied.)
The above snippet will not compile. It'll complain that input is dropped while still borrowed. If you walk through the code, you should come to the same conclusion. We're trying to store word in our words HashMap, which outlives the data being referenced by word (i.e. input).
To prove to ourselves that the issue is with input's scope, we can "solve" this by moving words inside the loop:
fn main() {
    loop {
        let mut input = String::new();
        let mut words: HashMap<&str, u32> = HashMap::new();
        ...
    }
}
Now everything lives in the same scope, our loop, so everything works. But this "fix" doesn't give us the behaviour we want: we're now only counting words per input, not across multiple inputs.
The real fix is to store Strings inside of our words counter, not &str:
use std::collections::HashMap;

fn main() {
    // Changed: HashMap<&str, u32> -> HashMap<String, u32>
    let mut words: HashMap<String, u32> = HashMap::new();
    loop {
        let mut input = String::new();
        std::io::stdin().read_line(&mut input).unwrap();
        for word in input.split_whitespace() {
            let count = match words.get(word) {
                None => 0,
                Some(count) => *count,
            };
            // Changed: word -> word.to_owned()
            words.insert(word.to_owned(), count + 1);
        }
        println!("{:?}", words);
    }
}
We changed two lines, highlighted by the two comments. Namely, instead of storing &str we store String, and to turn our word &str into a String we use the to_owned() method.
You'll probably find yourself using to_owned() frequently when dealing with strings. It's by far the simplest way to resolve any ownership and lifetime issues with string slices, but it's also, more often than not, semantically correct. In the above code, it's "right" that our words counter owns the Strings: the existence of the keys in our map should be tied to the map itself.
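To make the counting logic easy to try without typing into stdin, here is a sketch that pulls it into a function. The name count_words is hypothetical, and the entry API is a swapped-in alternative to the get/insert pair used above, not a change in behaviour: it still stores an owned String per word.

```rust
use std::collections::HashMap;

// Hypothetical helper: adds the words of one input line to the running counts.
fn count_words(words: &mut HashMap<String, u32>, input: &str) {
    for word in input.split_whitespace() {
        // entry() needs an owned key, so we still call to_owned() for each word.
        *words.entry(word.to_owned()).or_insert(0) += 1;
    }
}

fn main() {
    let mut words = HashMap::new();
    // Each `input` could be dropped right after the call: the map owns its keys.
    count_words(&mut words, "it's over 9000!!!");
    count_words(&mut words, "over and over");
    println!("{:?}", words);
}
```

Because the map owns its keys, the counts survive across calls even though each input string is gone by then.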
Performance / Allocations
A String represents allocated memory on the heap. When the owning variable falls out of scope, the memory is freed. A &str references all or part of the memory allocated by a String or, in the case of a string literal, a part of our executable's data section. When we call to_owned() on a string slice (&str), a String is created by allocating memory on the heap.
This means that the above code allocates memory for each word that we type. A language with a garbage collector, such as Go, could implement the above more efficiently. But that efficiency would come with two significant costs: a garbage collector to track what data is and isn't being used, and a lack of transparency around memory allocation. Specifically, a slice in Go prevents the underlying memory from being garbage collected, which isn't always obvious and certainly isn't always efficient (you could pin gigabytes worth of data for a single small slice).
Rust is very flexible. We could write an implementation similar to Go's, but it would require considerably more code.
Greater transparency helps explain why string literals are represented as &str instead of String. Imagine a string literal in a loop:
fn main() {
    loop {
        let mut input = String::new();
        println!("> ");
        std::io::stdin().read_line(&mut input).unwrap();
        ...
    }
}
Representing "> " as a String would require allocating it on the heap for each iteration. This might not be obvious, and it certainly isn't necessary. Treating string literals as &str means that allocation only happens when we explicitly require it (via to_owned()).
Mutability
From an implementation and mutability point of view, the String type behaves like Java's StringBuffer, .NET's StringBuilder and Go's strings.Builder. The push() and push_str() methods are used to append values to the string. Like any other data, these mutations require the binding to be declared as mutable:
fn main() {
    // note same as: String::from("hello")
    let fail = "hello".to_owned();
    fail.push_str(" world"); // Not mutable, won't compile

    // note same as: String::from("hello")
    let mut ok = "hello".to_owned();
    ok.push_str(" world"); // Mutable, will work
}
A &mut str, on the other hand, is something you'll rarely, if ever, use. It doesn't own the underlying data, so it can't grow or shrink it; at most it can change bytes in place (for example, with make_ascii_uppercase()).
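For the rare case where a &mut str does show up, here is a small sketch of what it can and can't do. The function shout is a hypothetical name; make_ascii_uppercase() is a standard method on str that rewrites bytes in place without changing the length.

```rust
// Hypothetical helper: upper-cases ASCII letters in place through a &mut str.
// It cannot append, because a &mut str doesn't own (or resize) the buffer.
fn shout(s: &mut str) {
    s.make_ascii_uppercase();
}

fn main() {
    let mut owned = String::from("hello");
    shout(&mut owned); // a &mut String is accepted where a &mut str is expected
    owned.push_str(" world"); // growing the data needs the String itself
    println!("{}", owned);
}
```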
Just like you'll commonly use to_owned() to ensure the ownership/lifetime of the value, you'll also commonly use to_owned() to mutate (often in the form of appending) the string. Fundamentally, both of these concepts are tied to the fact that String owns its data.
String -> &str
We saw how to_owned() (or the identical String::from) can be used to turn a &str into a String. To go the other way, we use the [start..end] slice syntax:
fn main() {
    let hi = "Hello World".to_owned();
    let slice = &hi[0..5];
    println!("{}", slice);
}
Notice that we did &hi[0..5] and not hi[0..5]. This is because there is a str type, but it isn't particularly useful on its own. Technically, str is the slice itself, a sequence of bytes whose size isn't known at compile time, and &str is a reference to that slice: a pointer with an added length value. But str is so infrequently used directly that people just refer to &str as a slice.
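One detail worth sketching: the numbers in [start..end] are byte offsets, and indexing panics if a boundary lands inside a multi-byte character. The get() method on str takes the same range and returns an Option instead. The helper name prefix is hypothetical.

```rust
// Hypothetical helper: returns the first `n` bytes of `s` as a slice, or None
// if `n` is out of range or falls in the middle of a multi-byte character.
fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(0..n)
}

fn main() {
    let hi = "Hello World".to_owned();
    println!("{:?}", prefix(&hi, 5));

    // "\u{e9}" (e with an acute accent) takes two bytes in UTF-8,
    // so byte offset 2 is not a character boundary here.
    let accented = "h\u{e9}llo".to_owned();
    println!("{:?}", prefix(&accented, 2));
    println!("{:?}", prefix(&accented, 3));
}
```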
You'll often write or use functions which don't need ownership of the string. Logically, these functions should accept a &str. For example, consider the following function:
fn is_https(url: &str) -> bool {
    url.starts_with("https://")
}
Since it doesn't need ownership of the parameter, &str is the correct choice. We can, obviously, call this function with a &str either directly or by slicing a String. But since this is so common, the Rust compiler will also let us call this function with a &String:
// called with a &str
is_https("https://www.openmyind.net");
// called with a &String
let mut input = String::new();
std::io::stdin().read_line(&mut input).unwrap();
is_https(&input);
// exact same as previous line
is_https(&input[..]);
Common String Tasks
Here are a few common things you'll likely need to do with strings.
To create a new String from other String or &str values (you can mix and match), use the format! macro:
let key = format!("user.{}", self.id);
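To see that line in a complete program, here is a sketch with a stand-in for self. The User struct and the namespaced function are hypothetical, made up for this example. Note that format! only borrows its arguments, so the inputs remain usable afterwards.

```rust
// Hypothetical type, standing in for whatever `self` is in the line above.
struct User {
    id: u32,
}

impl User {
    // Builds an owned String from a literal and a number.
    fn key(&self) -> String {
        format!("user.{}", self.id)
    }
}

// Hypothetical helper: combines two string slices into a new String.
fn namespaced(prefix: &str, key: &str) -> String {
    format!("{}:{}", prefix, key)
}

fn main() {
    let key: String = User { id: 42 }.key();
    // Mixing a literal (&str) with a borrowed String.
    let full = namespaced("cache", &key);
    println!("{} {}", key, full); // `key` is still usable here
}
```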
To create a String from a Vec<u8> use String::from_utf8 (for a borrowed &[u8], std::str::from_utf8 gives you a &str instead). Note that this returns a Result as it will check to make sure the provided bytes are valid UTF-8. A str is really just a [u8], so a &str is really a &[u8], both with the added restriction that the underlying slice must be a valid UTF-8 string. Similarly, a String is a Vec<u8>, also with the same additional requirement.
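Here is a sketch of handling that Result rather than unwrapping it. The helper bytes_to_string is a hypothetical name, and falling back to String::from_utf8_lossy is just one possible policy for invalid input, chosen here for illustration.

```rust
// Hypothetical helper: converts owned bytes into a String. If the bytes are not
// valid UTF-8, it falls back to a lossy conversion, which substitutes U+FFFD
// for each invalid sequence.
fn bytes_to_string(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(s) => s,
        // The error still holds the original bytes, so nothing is lost here.
        Err(e) => String::from_utf8_lossy(e.as_bytes()).into_owned(),
    }
}

fn main() {
    // 104 and 105 are the ASCII (and UTF-8) codes for 'h' and 'i'.
    println!("{}", bytes_to_string(vec![104, 105]));

    // For a borrowed byte slice, std::str::from_utf8 yields a &str without copying.
    let borrowed: &[u8] = b"9000";
    match std::str::from_utf8(borrowed) {
        Ok(s) => println!("{}", s),
        Err(e) => println!("invalid UTF-8: {}", e),
    }
}
```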
Because String wraps a Vec<u8>, the String::len() method returns the number of bytes, not characters. The chars() method returns an iterator over the characters, so chars().count() will return the number of characters (and is O(N)). Note that chars() returns an iterator over Unicode Scalar Values, not graphemes. For graphemes, you'll want to use an external crate.
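A short sketch of how the two counts differ, using a hypothetical helper named byte_and_char_len. The last input is the letter e followed by a combining accent (U+0301): it displays as one character but is made of two scalar values.

```rust
// Hypothetical helper: returns (number of bytes, number of Unicode scalar values).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Plain ASCII: one byte per character.
    println!("{:?}", byte_and_char_len("hello"));
    // Precomposed e-acute (U+00E9) takes two bytes but is one scalar value.
    println!("{:?}", byte_and_char_len("h\u{e9}llo"));
    // 'e' plus combining acute accent: two scalar values, a single grapheme.
    println!("{:?}", byte_and_char_len("e\u{301}"));
}
```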
There's a FromStr trait (think interface) which many types implement. This is used to parse a string into a given type. The implementation for bool is easy to understand:
fn from_str(s: &str) -> Result<bool, ParseBoolError> {
    match s {
        "true" => Ok(true),
        "false" => Ok(false),
        _ => Err(ParseBoolError),
    }
}
To convert a string to a boolean, or a string to an integer, use:
// unwrapping will panic if the parsing fails!
let b: bool = "true".parse().unwrap();
let i: u32 = "9001".parse().unwrap();
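The same pattern works for your own types. Below is a sketch that implements FromStr for a made-up enum and handles a parse failure without unwrapping. Level and ParseLevelError are hypothetical names invented for this example.

```rust
use std::str::FromStr;

// Hypothetical type used only for this illustration.
#[derive(Debug, PartialEq)]
enum Level {
    Low,
    High,
}

// Hypothetical error type returned when the input matches no variant.
#[derive(Debug, PartialEq)]
struct ParseLevelError;

impl FromStr for Level {
    type Err = ParseLevelError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "low" => Ok(Level::Low),
            "high" => Ok(Level::High),
            _ => Err(ParseLevelError),
        }
    }
}

fn main() {
    // Implementing FromStr is what makes parse() available for Level.
    let level: Level = "high".parse().unwrap();
    println!("{:?}", level);

    // Matching on the Result avoids a panic on bad input.
    match "9001x".parse::<u32>() {
        Ok(n) => println!("parsed {}", n),
        Err(e) => println!("not a number: {}", e),
    }
}
```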
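Finally, as we already discussed, String has a push() method (for characters) and a push_str() method (for strings) to append values onto a String. You'd also be right to expect other mutating methods such as remove(), insert(), truncate() and many more. Note that trim() and replace() don't mutate: trim() returns a &str that borrows from the original, and replace() returns a new String.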
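When reaching for these methods, it helps to check which ones change the String in place and which ones hand back a new value. The sketch below uses a hypothetical helper named tidy to show a few of them side by side.

```rust
// Hypothetical helper: trims, replaces a word, capitalizes the first letter
// and appends a '!'.
fn tidy(input: &str) -> String {
    // trim() returns a &str pointing into `input`; nothing is copied or changed.
    let trimmed: &str = input.trim();
    // replace() leaves `trimmed` untouched and returns a newly allocated String.
    let mut out: String = trimmed.replace("world", "Rust");
    // remove(), insert() and push() mutate `out` in place, so it must be `mut`.
    if !out.is_empty() {
        let first = out.remove(0); // removes and returns the first char
        out.insert(0, first.to_ascii_uppercase());
    }
    out.push('!');
    out
}

fn main() {
    println!("{}", tidy("  hello world "));
}
```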