
Lab 3 - Information Theory and Tools Help

This lab has two purposes:

  1. Continue to get set up with the environment we need for class
  2. Have you explore Shannon’s information theory, if you were unable to complete this in Lab 1

Getting Set Up With Tools

We were assured the systems would work so that ssh to portal would solve all our future needs. However, we’ll need to ask for some additional software to run the tools we need. The second portion of this lab is designed to make sure each of you has access to the tools, or at least an alternative setup you could use.

Exercise: Create, compile, and run a C program.

  1. Make and enter a directory named tools_test.

  2. Create a file there¹ named hello.c which contains the following C code.

     #include <stdio.h>
     int main() {
         puts("This file shows your C toolchain is working");
     }
    

    We will explain this code later, but we want to make sure you have C working by the time we get there.

  3. Run the following and show the output to a TA. Several of these commands will display things, and the last two (run and quit) will have a different-looking prompt than the others. Both ./a.out and run should have, as part of their output, “This file shows your C toolchain is working”.

    pwd
    clang hello.c
    ./a.out
    lldb a.out
    run
    quit
    
  4. Show the output of these commands to a TA for check-off.

How-to on portal

You’ve seen how to ssh into portal.cs.virginia.edu before. Some things you should know for this:

You have to load modules

The CS servers hide most programs until you ask for them, so that you are guaranteed to get the versions you want. You ask for them by running module load followed by the name of the module containing the programs you want. The modules we’ll need in this class are

  • module load clang-llvm to compile your C code
  • module load nano (or module load emacs if you prefer emacs, or no module needed if you prefer vim)
  • module load git to get an updated version of git

Other than loading these modules after you ssh in to portal (each time you ssh into portal), you should be able to use the skills you learned in Lab 00 and Lab 01 to complete the exercise above.

Aside: Each time you log into portal, you get a new BASH shell, so you’ll need to load the modules each time you connect. For convenience, you may wish to have them automatically run when you connect. Each time a new BASH shell is started, it runs all the commands in a file named .bashrc. If you would like the three modules listed above (clang-llvm, nano, and git) to load automatically, you can edit your .bashrc file and add the following code to the very bottom:

case "$-" in
*i*)    module load clang-llvm; module load nano; module load git;;
*)      ;;
esac

We won’t ask you about this code, so don’t worry about what each line does. Broadly, it checks to see if you have an interactive BASH shell (i.e., you connected by SSH to run commands or edit your code), and if so, it will load the modules for you. If you connect another way (which we haven’t discussed), then it will “do the right thing” and not load the modules.

How-to on your own

After you have portal working, you are welcome to also try setting things up on your own computer.

If you run a mostly POSIX-compliant operating system (Linux, FreeBSD, OpenBSD, AIX, Solaris, and almost all the others, with one notable exception), you can probably get clang, git, and lldb to work with no extra effort.

As a special case, macOS is POSIX-compliant but by default does not include most of the tools we’ll need, has a slightly different take on some parts of C than usual, and often hides commands under non-standard names or the like. If you install clang and lldb through the macOS developer tools, you can probably use your macOS machine directly with no virtual machine. You are welcome to do this, but note that we may not be able to help if something goes wrong. Also note that macOS will not be able to do Lab06 and HW06; everyone will have to use portal for those.

If you run Windows, there’s a thing called the Windows Subsystem for Linux which can let you make Windows (almost) act like a POSIX-compliant OS. You are welcome to do this, but note that we may not be able to help if something goes wrong.

Information Theory

Introduction

Claude Shannon founded the field of information theory. A core fact in information theory is that there is a basic unit of information, called a “bit^[a portmanteau of “binary digit”]” or a “bit of entropy.” Roughly speaking, a “bit” is an amount of information that is about as surprising as the result of a single coin flip. In the sentence “Hello, my name is Claude” the word “is” has low entropy; most of the time someone says “Hello, my name” the next word they say is “is,” so adding that word provides very little more information. On the other hand, the word “Claude” in that same sentence has many bits of entropy; a huge number of words could reasonably follow “Hello, my name is” and any given one we pick is quite surprising.
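Shannon made this precise: an event with probability \(p\) carries \(-\log_2 p\) bits of information, so a fair coin flip carries exactly one bit. A quick Python check (the 0.9 probability for “is” is a made-up number, purely for illustration):

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

# A fair coin flip carries exactly one bit of information.
print(surprisal(0.5))                # 1.0

# A word that follows 90% of the time (made-up probability) carries very little.
print(round(surprisal(0.9), 3))      # 0.152
```

The rarer the event, the more bits it carries, which is why “Claude” is worth far more information than “is.”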

In computers, we rarely encode things anywhere near as efficiently as their bits of entropy suggest. For example, most common encodings of strings use 8 bits per character. In this lab, you’ll replicate an experiment Claude Shannon published in 1950^[Shannon, Claude E. (1950), “Prediction and Entropy of Printed English”, Bell System Technical Journal (3) pp. 50–64.] to see just how inefficient that encoding is.

Preparation

First, you’ll write a program in either Python or Java. Then, you’ll use that program to perform an experiment and reflect on the results.

You may either work alone or with a buddy in this lab. Buddy programming is where two people work side-by-side, each creating similar programs while talking together to help one another out. In well-running buddy programming each buddy is speaking about equally, describing what they are writing next or how they are testing what they have written. Buddy programming usually results in similarly-designed but non-identical programs.

If you use a buddy, you should sit next to your buddy and use the same programming language they use.

Create the program

Your program should do the following:

  1. Read a text file into a string in memory. You should be able to specify different file names each time you run the program.
  2. Repeatedly
    • Pick a random index in the middle of the string
    • Display to the user the 50 characters preceding that index (in such a way that they can tell if what you displayed ended in a space character or not)
    • Have the user type a single character
    • Record if that typing was correct
  3. After some fixed number of iterations (20 might make sense), display
    • The ratio of correct guesses (e.g., “You got 14 out of 20 guesses correct!”)
    • The estimated bits of entropy per letter of the text, which is \(\log_2(g \div r)\) where \(g\) is the total number of guesses made and \(r\) is the number that were correct (e.g., 0.5145731728297582 for 14 of 20 correct).
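The steps above can be sketched in Python roughly as follows; the bracketed display, the function names, and the command-line filename handling are our own choices, not requirements:

```python
import math
import random
import sys

def entropy_estimate(guesses, correct):
    """Estimated bits of entropy per letter: log2(g / r)."""
    return math.log2(guesses / correct)

def run(path, context=50, rounds=20):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    correct = 0
    for _ in range(rounds):
        # Pick an index with at least `context` characters before it.
        i = random.randrange(context, len(text))
        # Brackets make a trailing space visible to the user.
        print("[" + text[i - context:i] + "]")
        if input("Next character? ") == text[i]:
            correct += 1
    print(f"You got {correct} out of {rounds} guesses correct!")
    if correct > 0:
        print(f"Estimated entropy: {entropy_estimate(rounds, correct)} bits per letter")

if __name__ == "__main__" and len(sys.argv) > 1:
    run(sys.argv[1])
```

Run it as, e.g., python3 guess.py somefile.txt. Note that entropy_estimate(20, 14) reproduces the 0.5145731728297582 example above.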

What is the entropy of…

Once your program seems to be working, try it on a few different texts. For example, you might try

Add a comment to the top of your code that includes at least the following:

  • Who your buddy was, if any
  • What files you tested (if other than the above, with their full URL or a description of what they contained) and what the results were for each
  • An additional experiment you did and how it came out. For example, you might try to answer
    • is language X more or less entropic than language Y?
    • does it matter how many characters you display as context for their guess?
    • is the answer different if you display the characters after, not before, the one they guess?
    • if you re-run the test on the same file repeatedly, how consistent are the answers?
    • if you compress the file (e.g., into a .zip file or the like), how much smaller does it get? How (and why) is this related to its bits of entropy?
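For the last experiment, Python’s standard zlib module (the same DEFLATE compression used inside .zip files) gives one quick way to measure compressibility without leaving your program; the function name here is our own:

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means fewer bits of entropy per byte."""
    return len(zlib.compress(data, 9)) / len(data)

# Highly repetitive text compresses to almost nothing.
print(compression_ratio(b"the cat sat on the mat. " * 1000) < 0.05)   # True
```

Comparing this ratio against your measured bits of entropy per letter (out of the 8 bits each stored character uses) is exactly the last question above.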

Check-off

To check-off this lab, show a TA

  • Your working C code (that is, the output of the commands we asked you to run on the C code we provided)
  • Your working code for the Information Theory section

For most students, this should happen in lab; if you have completed the lab exercise before lab occurs, you are welcome to check off during TA office hours.

  1. You learned how to do this in Lab 00 using one of Nano, Emacs, or VIM. If you did not, you need to; please go back and do that part of Lab 00 now! 


Copyright © 2022 John Hott, portions Luther Tychonievich.
Released under the CC-BY-NC-SA 4.0 license.