The string type is one of the most commonly used types in Python. Many built-in functions and methods accept strings as input and produce new strings as output.
Because of this, you will often need to convert between text and raw bytes. In Python 3, you convert a str to bytes with its encode() method (or with bytes(text, encoding)), and you convert bytes back to a str with decode() (or with str(data, encoding)).
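As a minimal sketch of that round trip, the snippet below uses str.encode() and bytes.decode(). The sample text and variable names are only illustrative.

```python
# Round trip between text (str) and raw data (bytes).
text = "na\u00efve caf\u00e9"          # 'naïve café', 10 characters

# str -> bytes: name the encoding explicitly.
data = text.encode("utf-8")

# bytes -> str: decode with the same encoding.
restored = data.decode("utf-8")

# The two non-ASCII characters take two bytes each in UTF-8,
# so the byte string is longer than the text.
print(len(text), len(data))
print(restored == text)
```

The round trip is only lossless when both directions use the same encoding.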
Unfortunately, these conversions can lead to errors that are very difficult to track down, especially when the encoding is left implicit. Because of this, it is recommended that you open text files with the built-in open() function in text mode and name the encoding explicitly, instead of reading raw bytes and converting them by hand.
The problem with bytes
Let’s look at a simple example. Say you want to iterate over all the characters in a string, and do something with each one.
In Python 3, the built-in string type (str) is immutable. That means a string can’t be changed after it is created. You can inspect and compare its characters, but any operation that seems to modify a string, such as concatenation, slicing, or replace(), actually returns a new string.
Because strings are immutable, an iterator over a string is simple to reason about: each step returns a new one-character string, and nothing you do with it can change the original.
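But what if our iterator returns bytes? Then we have a problem: iterating over a bytes object yields integers from 0 to 255, not characters. Any character outside ASCII is encoded as several bytes, so it arrives as several separate integers, and none of them can be mapped back to the character on its own.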
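The difference is easy to see in a short experiment. This is a sketch with an arbitrary sample word; the variable names are illustrative.

```python
word = "h\u00e9llo"                     # 'héllo'

# Iterating over a str yields one-character strings.
chars = list(word)
print(chars)                            # ['h', 'é', 'l', 'l', 'o']

# Iterating over the UTF-8 encoded bytes yields integers,
# and 'é' is spread across two of them (195 and 169).
byte_values = list(word.encode("utf-8"))
print(byte_values)                      # [104, 195, 169, 108, 108, 111]
```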
This isn’t just a theoretical problem: I once wrote code that depended on character iterators returning characters, but encountered a bug because of this assumption.
The problem with text files
A major problem in programming is understanding when to use text files and when to use binary files. As you have learned, strings are stored in files as bytes.
That is part of the problem with using text files: you have to make sure you open the file in the right mode. In text mode, Python encodes each str to bytes when writing and decodes bytes back to str when reading. In binary mode, no conversion happens: read() returns bytes, and write() accepts only bytes, so passing a str raises a TypeError.
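Here is a small sketch of the two modes, using a temporary file so it can be run anywhere. The file name is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Text mode: pass a str and Python encodes it for you.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

# Binary mode: read() returns the raw bytes that are on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text mode again: read() returns a decoded str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Writing a str to a binary-mode file is an error, not a silent conversion.
error = None
try:
    with open(path, "wb") as f:
        f.write("caf\u00e9")
except TypeError as exc:
    error = exc

print(raw, text, type(error).__name__)
```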
The other issue with text files is that not every character fits in a single byte. ASCII characters such as $, &, * and / each take one byte, but characters such as é, € or most emoji take two to four bytes in UTF-8 and cannot be represented at all in ASCII. If a file is read with a different encoding than it was written with, these characters come out garbled or raise an error.
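A quick way to check how many bytes a character occupies is to encode it and measure the result. The character list below is just an illustrative sample.

```python
samples = ["$", "&", "\u00e9", "\u20ac", "\U0001F600"]   # $, &, é, €, and an emoji

# Map each character to the number of bytes it needs in UTF-8.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, size in sizes.items():
    print(repr(ch), size)

# A 7-bit encoding such as ASCII cannot represent 'é' at all.
ascii_failed = False
try:
    "\u00e9".encode("ascii")
except UnicodeEncodeError:
    ascii_failed = True
```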
By using binary files, you avoid the decoding step entirely. However, you are then working with raw bytes instead of characters, and the question of which encoding applies is only postponed, which can cause issues later on.
The solution: return strings
The solution to this problem is simple: iterator methods that produce text should always return strings, not bytes. Decode once, at the point where the bytes enter your program, so the surrounding code can rely on dealing with strings.
Bytes are also harder to debug because they don’t display as readable text. When you print a bytes object, you get its repr, such as b'caf\xc3\xa9': ASCII bytes appear as letters and everything else as \x hexadecimal escapes.
By returning a string instead, an iterator method makes it clear that it produces strings, and therefore makes it easier to use it correctly in the surrounding code.
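One possible way to build such an iterator is with an incremental decoder from the standard codecs module. The function name iter_chars and the sample chunks are hypothetical; this is a sketch, not the only way to do it.

```python
import codecs

def iter_chars(chunks, encoding="utf-8"):
    """Yield one-character strings from an iterable of bytes chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # The incremental decoder buffers an incomplete multi-byte
        # sequence until the rest arrives in a later chunk.
        for ch in decoder.decode(chunk):
            yield ch
    # Flush the decoder; this raises UnicodeDecodeError if the
    # input ended in the middle of a character.
    for ch in decoder.decode(b"", final=True):
        yield ch

# 'é' (0xC3 0xA9) is deliberately split across two chunks.
chunks = [b"caf\xc3", b"\xa9!"]
result = list(iter_chars(chunks))
print(result)                           # ['c', 'a', 'f', 'é', '!']
```

Callers of iter_chars only ever see str values, even though the data arrived as bytes in arbitrary pieces.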
Using bytes instead of strings can be an easy mistake to make, especially if you are not familiar with how Python represents text internally. Mistaking bytes for a string can cause subtle bugs down the road.
Write to a binary file
When you write a binary file, you should not store your data as loosely formatted strings. Instead, pack your values into a fixed layout, for example with Python’s struct module. A struct in this sense is a record: a fixed sequence of fields that together describe something.
You can then use these records to represent something, like a person with fields like their name and age. Structs are commonly used in programming languages as a way to organize and represent data.
By packing a record instead of writing a string, you make sure every field is stored with a known size and byte order. A binary file is just a stream of octets (an octet is an 8-bit byte), so the layout you choose, such as a format string for Python’s struct module, is what tells a reader how to interpret those bytes.
This prevents confusion about what the data represents, because the writer and the reader agree on the format.
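As a sketch, here is how a person record could be written to and read from a binary file with the standard struct module. The layout (a 20-byte name followed by a 16-bit age) and the helper names are assumptions made for the example.

```python
import os
import struct
import tempfile

# Hypothetical record layout: a 20-byte UTF-8 name, then an unsigned
# 16-bit age. "<" means little-endian with no padding between fields.
RECORD = struct.Struct("<20sH")

def pack_person(name, age):
    # The "20s" field pads short names with NUL bytes. Longer names are
    # truncated, which could cut a multi-byte character in half.
    return RECORD.pack(name.encode("utf-8"), age)

def unpack_person(blob):
    raw_name, age = RECORD.unpack(blob)
    return raw_name.rstrip(b"\x00").decode("utf-8"), age

path = os.path.join(tempfile.mkdtemp(), "people.bin")
with open(path, "wb") as f:
    f.write(pack_person("Zo\u00eb", 31))

with open(path, "rb") as f:
    person = unpack_person(f.read(RECORD.size))

print(RECORD.size, person)
```

Every record occupies exactly RECORD.size bytes, so a reader can seek straight to the n-th record.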
Use utf-8 encoding
When you create your files, make sure you use the UTF-8 encoding. If you already have a file saved in another encoding, you can use a text editor such as TextEdit on Mac or Notepad on Windows to re-save it as UTF-8.
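The same conversion can also be done in Python. The sketch below assumes you know the original encoding (Latin-1 here, chosen only for illustration); convert_to_utf8 is a hypothetical helper name.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding):
    # newline="" keeps the file's line endings exactly as they were.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, "old.txt")
new_path = os.path.join(workdir, "new.txt")

# Create a small Latin-1 sample file to convert.
with open(old_path, "wb") as f:
    f.write("Se\u00f1or\n".encode("latin-1"))

convert_to_utf8(old_path, new_path, "latin-1")

with open(new_path, "rb") as f:
    converted = f.read()

print(converted)
```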
Older encodings such as ASCII or Latin-1 cover only a limited set of characters. UTF-8 can encode every Unicode character, so text in any language can be used in your programs.
But if you read a file using an encoding other than the one it was written in, Python will either raise a UnicodeDecodeError or silently produce garbled characters. So make sure you double check which encoding a file actually uses!
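Both kinds of encoding mismatch are easy to reproduce. This is a minimal sketch with an arbitrary sample word.

```python
latin1_data = "caf\u00e9".encode("latin-1")      # b'caf\xe9'

# Decoding Latin-1 bytes as UTF-8 fails: 0xE9 announces a multi-byte
# sequence, but the data ends before the sequence is complete.
failed = False
try:
    latin1_data.decode("utf-8")
except UnicodeDecodeError:
    failed = True

# The opposite mistake does not fail. It silently produces garbled text.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")
print(mojibake)                                  # cafÃ©

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
lossy = latin1_data.decode("utf-8", errors="replace")
```

The silent case is the more dangerous one, because nothing tells you the text is wrong until someone reads it.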
When writing your programs, it is good to assume that the user could have any name, including one with characters outside ASCII. Python has no separate char type, so read names from files as str values decoded from UTF-8, rather than handling them byte by byte.
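Handle special characters (\, quotes, braces) in your strings with care
When you write a string literal, a few characters have special meaning to Python. The backslash (\) starts an escape sequence such as \n or \t, a quote character that matches the opening quote ends the literal, and curly braces ({ and }) mark placeholders in f-strings and str.format() templates. To include one of these literally, escape it (\\, \', {{ and }}) or, for backslashes, use a raw string.
Every other character is stored exactly as written, and a str carries no formatting information such as bold or italics. For example, both of these literals contain only ordinary characters and need no escaping:
string = 'This is a bold statement!'
string = 'this is a bold statement!'
Python displays each one character for character; the only difference between them is the capital T. Any bold or italic styling has to come from the program that renders the text, not from the string itself.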
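The following sketch shows the usual ways to write the characters that do have special meaning in a Python string literal. The path and the names are made up for the example.

```python
# Ordinary characters need no special treatment.
plain = 'This is a bold statement!'

# A literal backslash: double it, or use a raw string.
escaped = 'C:\\temp\\notes.txt'
raw = r'C:\temp\notes.txt'

# A quote inside the literal: escape it, or switch the outer quotes.
quoted = 'It\'s fine'
also_quoted = "It's fine"

# Literal braces in an f-string are written doubled.
name = "Ada"
braces = f"{{literal}} and {name}"

print(escaped == raw, quoted == also_quoted, braces)
```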
Use str objects instead of byte arrays
A common mistake when working with text is to keep it in bytes or bytearray objects instead of str objects. This may work while the data is pure ASCII, but it causes errors that are difficult to track down as soon as other characters appear.
By using str objects, you ensure that the data you’re working with is decoded text, where each element is a character. Strings are also immutable, which means code you pass a string to cannot alter it; a bytearray, by contrast, can be changed in place. Note that immutability concerns the object in memory only and says nothing about whether a file on disk changes.
Performance is rarely a reason to choose between them: str, bytes and bytearray are all built-in types with efficient implementations. The real advantage of str is correctness, because len(), slicing and methods such as upper() operate on characters rather than on individual bytes.
When working with text, use str, and convert to or from bytes only at the boundaries of your program, such as file and network I/O.
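The comparison below is a sketch of how the same text behaves as a str and as UTF-8 encoded bytes. The sample word is arbitrary.

```python
text = "\u00e9cole"                     # 'école'
data = text.encode("utf-8")

# len() counts characters for str, but bytes for bytes.
lengths = (len(text), len(data))

# Indexing gives a character for str, but an integer for bytes.
first = (text[0], data[0])

# str.upper() understands 'é'; bytes.upper() only touches ASCII letters.
upper_text = text.upper()
upper_data = data.upper()

print(lengths, first, upper_text, upper_data)
```

The bytes result decodes to 'éCOLE', which is exactly the kind of bug that stays hidden while the input is plain ASCII.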
Use Python 3 + future features
As Python developers, we are constantly updating our skills. As the language evolves, new features are added that make our lives easier.
Python 3 was a large update that brought many new features to the language. One of the most important changes was the strict separation of text and binary data: str now always holds Unicode text, and bytes holds raw data.
Because the two types are kept apart, Python 3 never converts between them implicitly. Mixing them raises a TypeError, which prevents unexpected behavior when reading files in different modes (binary, text).
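A short sketch of what that strictness looks like in practice in Python 3 (the values are arbitrary):

```python
# Concatenating str and bytes is rejected outright.
mixed_failed = False
try:
    "abc" + b"def"
except TypeError:
    mixed_failed = True

# The fix is to convert explicitly, stating the encoding.
joined = "abc" + b"def".decode("ascii")

print(mixed_failed, joined)
```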
As of March 2019, Python 3.7 is the current version of Python. Many developers are still using older versions of Python, however. If you are working with a team and some people still have Python 2 installed on their computer, it may be helpful to add __future__ imports to shared code, such as from __future__ import unicode_literals and print_function, so that it behaves more like Python 3 on both versions.