Steganography

(from the Greek stegein, “to cover,” and grafein, “to write”)

Jonathan Erdman

Eric Hutchins

Stephen Liang

Portman Wills

CS 588/Fall ’01

Steganography

Steganography, like cryptography, is a means of securing message transfer. Cryptography relies on obfuscating the message to such an extent that an eavesdropper cannot reconstruct the original; steganography instead hides the message inside an innocuous cover object so that the eavesdropper does not even recognize that a communication has occurred.

Steganography itself is thousands of years old, first developed by the Greeks. In Greek, steganography literally means “covered writing” as evidenced by two well documented Greek techniques: writing on one’s bald scalp which returning hair will eventually cover, and carving text into wooden tablets and subsequently coating them in a film of wax.

Today more advanced techniques exist, and it is widely believed that Osama bin Laden and his al Qaeda terrorist network have transmitted messages over the Internet steganographically. In fact, as early as February 2001, a full seven months before the appalling attacks of September 11, Wired magazine posted an article on its website warning of the threat such transmissions pose (McCullagh). According to Bruce Schneier, steganography is an ideal tool for terrorists because stego-software is widely available and because it is extraordinarily difficult for counter-intelligence agents to uncover the messages unless they know where to look (Schneier).

With steganography at the forefront of the national consciousness, we present in this paper an investigation of modern methods of concealing messages in plaintext, audio files, and images. Furthermore, we examine current software tools that attempt to detect steganographic content and discuss the challenges of such a task.

Text Steganography

We established early on in the semester that English is redundant. Indeed, it can be calculated that English encodes only about 0.28 letters of information for every letter written; in other words, English is more than 70% redundant. As a result, we can devise myriad algorithms for hiding messages within a text file.

Word Offsets

The most basic steganographic method using plain text is that of hiding one letter from the message in each word of the stego. Based on a predetermined offset, we take each letter of the message and select a word whose nth letter is the letter from our message. If we take the effort to make our choices of words make sense, we have a message that fairly effectively hides the message we are trying to convey. We will demonstrate with an actual message sent by a German spy during World War II.

Apparently neutral’s protest is thoroughly discounted and ignored. Isman hard hit. Blockade issue affects pretext for embargo on by-products, ejecting suets and vegetable oils.

Upon first glance, this message seems innocuous enough, if somewhat awkward. However, if we take only the second letter of each word (capitalized here):

aPparently nEutral’s pRotest iS tHoroughly dIscounted aNd iGnored. iSman hArd hIt. bLockade iSsue aFfects pRetext fOr eMbargo oN bY-products, eJecting sUets aNd vEgetable oIls.

We get the following sequence of letters:

p e r s h i n g s a i l s f r o m n y j u n e i

Reading the final i as the numeral 1, we then note that this forms the message

Pershing sails from NY June 1

This piece of critical information has been hidden in an otherwise unimpressive message.
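The decoding step above can be sketched in a few lines of Python. The function name `extract_by_offset` and the choice to skip words that are too short are our own assumptions for illustration; the only thing taken from the example is the rule "second letter of each word."

```python
def extract_by_offset(cover_text: str, offset: int = 1) -> str:
    """Return the letter at 0-based index `offset` of every word.

    Words shorter than the offset are skipped (an assumption; the
    sender and receiver would have to agree on this rule in advance).
    """
    letters = []
    for word in cover_text.split():
        if len(word) > offset:
            letters.append(word[offset].lower())
    return "".join(letters)


COVER = (
    "Apparently neutral's protest is thoroughly discounted and ignored. "
    "Isman hard hit. Blockade issue affects pretext for embargo on "
    "by-products, ejecting suets and vegetable oils."
)

print(extract_by_offset(COVER))  # prints "pershingsailsfromnyjunei"
```

Encoding is the hard direction: a human (or a program with a large vocabulary) must find words with the right letter at the right position that still read as plausible prose.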

Ones and zeroes

There are many ways to hide ones and zeroes within a text-only message. With that ability, we can hide any kind of information in a stego, not just text. The effective throughput is lower than with letter-based schemes, but any sort of information can be conveyed in this way. There are two separate categories of methods for hiding ones and zeroes: those which exploit properties of the English language, and those which exploit the way computers represent text.

1) When you write a list in the English language, you can choose whether or not to put a comma before the word “and”. Neither choice would attract much attention, since both are grammatically correct. Therefore, one could encode ones and zeroes in the lists that occur throughout a passage of text. For example:

1 = guns, butter, and brownies

0 = guns, butter and brownies

Then decoding the message merely requires that we go through and examine each list in the passage in order.

2) On a computer, there are at least two ways to represent the end of a line of text. The ASCII characters 015 octal (carriage return) and 012 octal (line feed, or new line) can look identical on a computer screen, but an examination of the underlying ASCII codes quickly reveals which one was used. Therefore, we can hide a message by using ASCII 015 and ASCII 012 at the end of lines to represent ones and zeroes.
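Both schemes can be sketched in Python as follows. The regular expression, the function names, and the mapping of carriage return to 1 and line feed to 0 are assumptions made for illustration; the text above only fixes the comma convention (comma before “and” means 1).

```python
import re

# Matches three-item lists such as "X, Y, and Z" or "X, Y and Z".
# Group 1 captures the optional comma before "and".
LIST_PATTERN = re.compile(r"\w+, \w+(,?) and \w+")


def bits_from_commas(text: str) -> str:
    """Read one bit per list: 1 if a comma precedes "and", else 0."""
    return "".join("1" if m.group(1) else "0"
                   for m in LIST_PATTERN.finditer(text))


def hide_in_line_endings(lines, bits: str) -> str:
    """End each line with CR (bit 1) or LF (bit 0); the mapping is assumed."""
    if len(lines) != len(bits):
        raise ValueError("need exactly one bit per line")
    return "".join(line + ("\r" if bit == "1" else "\n")
                   for line, bit in zip(lines, bits))


def bits_from_line_endings(text: str) -> str:
    """Recover the bits by inspecting every CR or LF character."""
    return "".join("1" if ch == "\r" else "0"
                   for ch in text if ch in "\r\n")


passage = ("We packed guns, butter, and brownies. "
           "They sent apples, pears and plums.")
print(bits_from_commas(passage))  # prints "10"

stego = hide_in_line_endings(["alpha", "beta", "gamma"], "101")
print(bits_from_line_endings(stego))  # prints "101"
```

Both sketches carry one bit per list or per line, which illustrates the low throughput of these methods.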

The previous two examples are very simple to understand and implement. However, they have relatively low throughput rates and are fairly simple to detect. To produce stegos that contain harder-to-find messages, we need algorithms that are slightly more complicated. There are myriad programs and web sites on the Internet that implement intriguing algorithms that are much harder to detect and therefore decode. We will take an in-depth look at two such algorithms.

Spam Mimic

www.spammimic.com

Spam Mimic is a website that takes a message and generates as a stego a piece of text that looks very similar to the spam mail that anyone with an e-mail account is familiar with. Through a specific algorithm, though, Spam Mimic generates the “spam” directly from the message text that you input. This procedure can then be reversed – if we take the spam mail and paste it back into Spam Mimic, it calculates our original message.

Note that the difference between “A-” and “A+” in the message manifests itself in the capitalization of the words “loved ones” in the generated spam mail.

TextHide

www.texthide.com

TextHide is a program that takes a passage of text and, based on a message as the key, chooses synonyms and rephrases the language to produce a new passage of text that has the same meaning but is very different. For example:

Message: “Meeting: 9 o’clock at my home”

Original: “The auto drives fast on a slippery road over the hill.”

Stego: “Over the slope the car travels quickly on an ice-covered street.”

Note the many changes between these two messages: “auto” becomes “car”, “fast” becomes “quickly”, “hill” becomes “slope”, and “road” becomes “street”. The program also rearranges the order of the phrases within the sentence. We see how the numerous nuances and word choices in English provide a way to hide a large amount of data. The web site for the software claims the program can hide 100kB of text in a minute. Furthermore, it claims that TextHide encodes data at a data-to-text ratio of about 1.10.
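The general idea of hiding bits in synonym choices can be sketched as below. This is not TextHide's actual algorithm, which is not described here; the synonym table, the function names, and the rule "first word of a pair means 0, second means 1" are all assumptions for illustration.

```python
import re

# Hypothetical synonym table: the first word of each pair encodes 0,
# the second encodes 1.
SYNONYM_PAIRS = [("auto", "car"), ("fast", "quickly"),
                 ("road", "street"), ("hill", "slope")]
SLOT = {word: (pair, bit)
        for pair in SYNONYM_PAIRS
        for bit, word in enumerate(pair)}


def embed_synonyms(cover: str, bits: str) -> str:
    """Rewrite each word that has a synonym so that it encodes the next bit."""
    remaining = iter(bits)

    def choose(match):
        word = match.group(0)
        slot = SLOT.get(word.lower())
        if slot is None:
            return word              # word carries no information
        bit = next(remaining, None)
        if bit is None:
            return word              # message exhausted: leave word as is
        return slot[0][int(bit)]

    return re.sub(r"[A-Za-z]+", choose, cover)


def extract_synonyms(stego: str) -> str:
    """Read one bit from every word that appears in the synonym table."""
    return "".join(str(SLOT[w.lower()][1])
                   for w in re.findall(r"[A-Za-z]+", stego)
                   if w.lower() in SLOT)


cover = "The auto drives fast on a slippery road over the hill."
stego = embed_synonyms(cover, "1101")
print(stego)  # prints "The car drives quickly on a slippery road over the slope."
print(extract_synonyms(stego))  # prints "1101"
```

A real system would need a far larger dictionary and some way to signal the message length, since the receiver reads a bit from every eligible word.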

Audio Steganography

Digital audio files, with their relative heft in data size, provide greater flexibility and opportunity as steganographic covers. Many of the most popular audio compression codecs can be coerced to transmit message data easily.

As with image files, a common technique for hiding a message in an audio file is to replace the low order bits of the samples. At a sample rate of 44 kHz, each channel carries about 44,000 samples per second, so altering only the lowest bit of each sample offers roughly 44 kilobits per second in which a short message could be hidden. Changing the low order bits can, however, introduce noise into the sound file that is detectable by the human ear. Furthermore, the message is easily destroyed or garbled through resampling or additional noise superimposed on the audio (Sellers).
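A minimal sketch of low-order-bit replacement follows, assuming the audio has already been decoded into a list of signed integer PCM samples. The function names are hypothetical, and a real tool would also have to read and write an actual audio container format.

```python
def embed_lsb(samples, bits: str):
    """Overwrite the least significant bit of the first len(bits) samples."""
    if len(bits) > len(samples):
        raise ValueError("cover is too short for the message")
    stego = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[i] = (stego[i] & ~1) | int(bit)
    return stego


def extract_lsb(samples, count: int) -> str:
    """Read the least significant bit of the first `count` samples."""
    return "".join(str(s & 1) for s in samples[:count])


cover = [1000, -1001, 32, 7, -8]
stego = embed_lsb(cover, "0110")
print(stego)                  # prints [1000, -1001, 33, 6, -8]
print(extract_lsb(stego, 4))  # prints "0110"
```

Each sample changes by at most one quantization step, which is why the distortion is small, and also why any resampling or added noise wipes the message out.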

Another, more robust technique for masking data within audio files uses a series of echoes in the sound to transmit binary 1s and 0s. If the original signal and its echo are separated by a small enough amount of time, humans cannot distinguish the two sounds. Data can be encoded in these echoes by representing 0 and 1 as two different echo offset values, such that both offsets are below the human level of perception (Sellers). Compared to writing to low order bits, echo bit encoding is superior given it does not alter the sound in a perceptible manner and echoes can be of a variety of different frequencies, reducing the message’s susceptibility to resampling.
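Echo encoding can be sketched as follows. This is a heavily simplified illustration under stated assumptions: the cover is synthetic white noise, each bit occupies its own fixed-length segment, the delays, gain, and segment length are arbitrary values we chose, and the decoder compares plain autocorrelation at the two candidate delays. Practical decoders for real audio typically use more elaborate analysis, and the gain here is far too large to be inaudible.

```python
import random

DELAY_ZERO = 50   # echo delay in samples for a 0 bit (assumed value)
DELAY_ONE = 80    # echo delay in samples for a 1 bit (assumed value)
ECHO_GAIN = 0.5   # echo amplitude relative to the signal (assumed value)
SEGMENT = 2000    # samples used to carry each bit (assumed value)


def add_echo(segment, delay):
    """Add a delayed, attenuated copy of the segment to itself."""
    return [s + (ECHO_GAIN * segment[n - delay] if n >= delay else 0.0)
            for n, s in enumerate(segment)]


def autocorrelation(segment, lag):
    """Sum of products of samples that are `lag` apart."""
    return sum(segment[n] * segment[n - lag]
               for n in range(lag, len(segment)))


def embed_echo(cover, bits: str):
    """Give each segment an echo whose delay depends on the bit."""
    stego = []
    for i, bit in enumerate(bits):
        segment = cover[i * SEGMENT:(i + 1) * SEGMENT]
        delay = DELAY_ONE if bit == "1" else DELAY_ZERO
        stego.extend(add_echo(segment, delay))
    return stego


def extract_echo(stego, count: int) -> str:
    """Pick, per segment, the delay with the stronger autocorrelation."""
    bits = []
    for i in range(count):
        segment = stego[i * SEGMENT:(i + 1) * SEGMENT]
        one = autocorrelation(segment, DELAY_ONE)
        zero = autocorrelation(segment, DELAY_ZERO)
        bits.append("1" if one > zero else "0")
    return "".join(bits)


rng = random.Random(588)  # fixed seed so the sketch is repeatable
cover = [rng.uniform(-1.0, 1.0) for _ in range(8 * SEGMENT)]
message = "10110010"
recovered = extract_echo(embed_echo(cover, message), len(message))
print(recovered == message)
```

Because the bit is carried by the timing of the echo rather than by individual sample values, small changes to the samples do not erase it the way they erase low-order bits.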

Image Steganography

Image files make very good candidates for steganography due to their size and common use. We investigated the GIF and JPEG formats specifically.

GIF images

GIF images include a palette of up to 256 colors, consisting of the colors the image uses. Each pixel in the actual image refers to one of the colors from the palette. Steganographic algorithms for GIF images take advantage of similarities between colors in the palette.

We looked specifically at a GIF steganography program called EzStego v2.0 (available at http://www.stego.com). It can hide one bit of data in every pixel of a given image. EzStego begins by pairing up similar colors in the palette. For each pair, one color represents 1 while the other represents 0. EzStego then compares the bits it wants to hide to the color of each pixel. If the pixel already represents the correct bit, it remains unchanged. If it represents the incorrect bit, its color is changed to the other color of the pair. These changes are usually so small that they are undetectable to the human eye.
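The pairing scheme can be sketched as below. This is a simplification, not EzStego's code: we assume colors are paired by sorting the palette by luminance and pairing neighbors, that the position within the pair is the bit, and that the palette has an even number of entries. All function names are hypothetical.

```python
def luminance(rgb):
    """Approximate perceived brightness of an (r, g, b) color."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def build_ranking(palette):
    """Sort palette indices by luminance so neighbors are similar colors."""
    order = sorted(range(len(palette)), key=lambda i: luminance(palette[i]))
    rank_of = {index: rank for rank, index in enumerate(order)}
    return order, rank_of


def embed_palette(palette, pixels, bits: str):
    """Move each pixel to the member of its color pair that encodes the bit."""
    order, rank_of = build_ranking(palette)
    stego = list(pixels)
    for i, bit in enumerate(bits):
        rank = rank_of[stego[i]]
        # Ranks 2k and 2k+1 form a pair; the low bit of the rank is the data.
        stego[i] = order[(rank & ~1) | int(bit)]
    return stego


def extract_palette(palette, pixels, count: int) -> str:
    """Read the low bit of each pixel's rank in the sorted palette."""
    _, rank_of = build_ranking(palette)
    return "".join(str(rank_of[p] & 1) for p in pixels[:count])


palette = [(0, 0, 0), (255, 255, 255), (10, 10, 10), (250, 250, 250)]
pixels = [0, 1, 2, 3]
stego = embed_palette(palette, pixels, "1010")
print(stego)                               # prints [2, 3, 2, 3]
print(extract_palette(palette, stego, 4))  # prints "1010"
```

Here near-black is paired with black and near-white with white, so every pixel stays within its pair and the picture looks essentially unchanged.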

In the paper Attacks on Steganographic Systems, Andreas Westfeld and Andreas Pfitzmann describe two attacks on EzStego’s algorithm. The first involves filtering an image and then comparing the patterns in the filtered output to those in the original image. The second takes advantage of the distortions of a GIF’s statistical properties caused by the embedding.

Most GIF stego algorithms assume that the bits they overwrite are distributed pseudo-randomly throughout the picture. Westfeld and Pfitzmann showed that this is not necessarily true by running the extraction algorithm on various pictures, whether or not they contained a message. The resulting bits are then translated back into a picture with a one representing black and a zero representing white. Since the same color tends to appear in chunks, the original image may still be apparent if no message is hidden, while an image with a hidden message will appear jumbled. This method is called a visual attack since the filtered picture must actually be viewed by the searcher to determine anything. Figure 1 shows an image with a hidden message in the upper half and the image after it has been put through the filter.

Fig. 1: A picture with a hidden message in the upper half (left) and the same image filtered using Westfeld and Pfitzmann’s visual attack (right).

The visual attack is unreliable and can be slow due to the human action required. An image that has little structure to begin with may be unrecognizable after being passed through the filter and so result in a false positive. Westfeld and Pfitzmann propose a better strategy that looks at the statistical properties of the colors in the image.

In an average GIF the colors will have uneven distributions, with some favored heavily over others, even if they are similar. The EzStego algorithm actually evens out the distribution of the two colors in a pair. By looking at the discrepancies within each pair in the palette, it becomes obvious whether a picture has a hidden message.
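The pair-discrepancy idea can be sketched with a chi-square style statistic. We assume each pixel has already been mapped to an integer value in which values 2k and 2k+1 form a pair (for example, its rank in a sorted palette); the function name is hypothetical, and turning the statistic into a probability, as a full test would, is omitted.

```python
from collections import Counter


def pair_chi_square(values) -> float:
    """Measure how far each pair (2k, 2k+1) is from an even split.

    A value near zero means the two members of every pair occur about
    equally often, which is what embedding tends to produce.
    """
    counts = Counter(values)
    statistic = 0.0
    for even in {v & ~1 for v in counts}:
        n_even = counts.get(even, 0)
        n_odd = counts.get(even | 1, 0)
        expected = (n_even + n_odd) / 2
        if expected > 0:
            statistic += (n_even - expected) ** 2 / expected
    return statistic


# Made-up histograms for illustration only.
natural = [0] * 90 + [1] * 10 + [2] * 30 + [3] * 70
embedded = [0] * 50 + [1] * 50 + [2] * 50 + [3] * 50

print(pair_chi_square(natural))   # prints 40.0
print(pair_chi_square(embedded))  # prints 0.0
```

The large value for the uneven histogram and the zero for the equalized one mirror the contrast between an untouched image and one whose pairs have been balanced by embedding.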

JPEG images

Increasingly, pictures transmitted over the Internet are saved in the JPEG format, named after the standards-setting body that created it, the Joint Photographic Experts Group. The steganographer must make sure that the cover is as innocuous as possible. This means that she will conform to Internet conventions and endeavor to hide the message in a JPEG file. In order to understand exactly how she can create a JPEG stego, it is important to understand the JPEG file format.

JPEG is an image compression standard, meaning that it takes raw data and discards information to create a smaller file that looks nearly identical. In a way, this is exactly the opposite of the steganographer’s task: compression removes the imperceptible detail in which a message could otherwise hide. Therefore, it is most efficient to embed the message during the compression process.

A picture is originally stored as a bitmap, meaning that each pixel is given a single color value. This can be thought of as a single 8-bit number in grayscale, or a single 24-bit number in full color. The image is then an m × n matrix of numbers. The actual JPEG format (ISO/IEC 10918-1) divides the image into 8 by 8 blocks, but this is unimportant for our discussion.

The key to the compression is the discrete cosine transform (DCT), a mathematical operator that changes the nature of the underlying matrix. In layman’s terms, the DCT changes the matrix so that the “important” data is in the top left. Mathematically, the DCT, much like the discrete Fourier transform, converts a signal from the spatial domain to the frequency domain through the following equation:
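The form below is the standard 8 × 8 forward DCT used in JPEG, written with the symbols A and B of this section. The level shift by 128 and the normalization constants C(k) are assumptions on our part; with them, the formula reproduces the first coefficients (186 and −18) of the example block given later in this section.

```latex
B(k_1,k_2) \;=\; \frac{1}{4}\,C(k_1)\,C(k_2)
  \sum_{i=0}^{7}\sum_{j=0}^{7}
  \bigl(A(i,j)-128\bigr)\,
  \cos\!\left[\frac{(2i+1)\,k_1\,\pi}{16}\right]
  \cos\!\left[\frac{(2j+1)\,k_2\,\pi}{16}\right],
\qquad
C(k)=\begin{cases}
  1/\sqrt{2}, & k=0\\[2pt]
  1, & k>0
\end{cases}
```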

A(i,j) is the raw image, and B(k₁,k₂) is the DCT output. If the coefficients of A are 8-bit, the coefficients of B will be 11-bit. The detailed mathematics are beyond the scope of this paper; what matters are the properties of the output. The absolute value of the DCT coefficients tends toward zero as you move diagonally down and to the right across the matrix.

Ellen Chang presents an example that makes it easy to visually comprehend the outcome. Given raw input image:

140	144	147	140	140	155	179	175
144	152	140	147	140	148	167	179
152	155	136	167	163	162	152	172
168	145	156	160	152	155	136	160
162	148	156	148	140	136	147	162
147	167	140	155	155	140	136	162
136	156	123	167	162	144	140	147
148	155	136	155	152	147	147	136

The equation will output the following DCT coefficients:

186	-18	15	-9	23	-9	-14	19
21	-34	26	-9	-11	11	14	7
-10	-24	-2	6	-18	3	-20	-1
-8	-5	14	-15	-8	-3	-3	8
-3	10	8	1	-11	18	18	15
4	-2	-18	8	8	-4	1	-7
9	1	-3	4	-1	-7	-1	-2
0	-8	-2	2	1	4	-6	0

As the reader can quickly see, the largest number is the first, indicating that the most information is contained in that coefficient. The JPEG compression scheme then quantizes each coefficient by dividing it by a step size that increases with frequency and rounding the result, and then discards the zeros. The compression scheme works brilliantly, and well over half the coefficients can often be discarded without any noticeable change to the picture. Similarly, the steganographer can hide her message in any of the lower-valued DCT coefficients.
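The coefficient table above can be checked with a direct, unoptimized implementation of the 8 × 8 DCT. The sketch assumes the standard JPEG normalization and a level shift of 128, which is our assumption about how the example was computed; the function name is hypothetical.

```python
import math

BLOCK = [
    [140, 144, 147, 140, 140, 155, 179, 175],
    [144, 152, 140, 147, 140, 148, 167, 179],
    [152, 155, 136, 167, 163, 162, 152, 172],
    [168, 145, 156, 160, 152, 155, 136, 160],
    [162, 148, 156, 148, 140, 136, 147, 162],
    [147, 167, 140, 155, 155, 140, 136, 162],
    [136, 156, 123, 167, 162, 144, 140, 147],
    [148, 155, 136, 155, 152, 147, 147, 136],
]


def dct_8x8(block):
    """Forward 8x8 DCT with a level shift of 128 (assumed convention)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for k1 in range(8):
        for k2 in range(8):
            total = 0.0
            for i in range(8):
                for j in range(8):
                    total += ((block[i][j] - 128)
                              * math.cos((2 * i + 1) * k1 * math.pi / 16)
                              * math.cos((2 * j + 1) * k2 * math.pi / 16))
            out[k1][k2] = 0.25 * c(k1) * c(k2) * total
    return out


coefficients = dct_8x8(BLOCK)
print(round(coefficients[0][0]))  # prints 186
print(round(coefficients[0][1]))  # prints -18
```

The first coefficient is eight times the block's average deviation from 128, which is why it dominates the table.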

We investigated a JPEG stego program called JSteg (available at http://www.tiac.net/users/korejwa/jsteg.htm). JSteg uses the DCT coefficients to hide information. JSteg can hide one bit of data in every non-zero coefficient. The hiding process is very similar to that of the GIF. All possible coefficient values are paired up with similar numbers, and each member of a pair is assigned a one or a zero. The actual coefficient for each block is then changed to fit the target stego bit. A coefficient of zero is generally not touched in JPEG stego algorithms, since zeros occur so frequently that changing them could drastically change the size of the file.
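The coefficient pairing can be sketched as least-significant-bit replacement on a flat list of quantized coefficients. This is a simplified illustration, not JSteg's code: the function names are hypothetical, and we assume that coefficients equal to 0 or 1 are both skipped, since otherwise writing a 0 bit into a coefficient of 1 would create a zero that the receiver would skip.

```python
SKIPPED = (0, 1)  # assumed: these values carry no data


def embed_coefficients(coefficients, bits: str):
    """Write message bits into the low bit of each usable coefficient."""
    remaining = iter(bits)
    stego = []
    for value in coefficients:
        if value in SKIPPED:
            stego.append(value)
            continue
        bit = next(remaining, None)
        if bit is None:
            stego.append(value)      # message exhausted: leave unchanged
        else:
            # Values 2k and 2k+1 form a pair; the low bit selects one.
            stego.append((value & ~1) | int(bit))
    return stego


def extract_coefficients(coefficients) -> str:
    """Read the low bit of every usable coefficient."""
    return "".join(str(v & 1) for v in coefficients if v not in SKIPPED)


quantized = [186, -18, 15, 0, 1, -9, 23]
stego = embed_coefficients(quantized, "1011")
print(stego)                            # prints [187, -18, 15, 0, 1, -9, 23]
print(extract_coefficients(stego)[:4])  # prints "1011"
```

Because the receiver reads a bit from every usable coefficient, a real tool must also convey the message length; that step is omitted here.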

Although messages hidden with JSteg are not vulnerable to the same visual attacks that GIF files are, there are still statistical flaws in the hiding process that can be exploited. In an unmodified image, the frequency with which a coefficient value occurs tends to decrease rapidly as the value moves away from zero. JSteg suffers from the same problem as EzStego in that embedding evens out the counts of the two values in each pair. This leads to an easily detectable pattern that can be discovered with little work and high accuracy. Provos and Honeyman used these techniques to examine over 2 million images posted on eBay. No steganographic content was found.

There are other stego algorithms that attempt to eliminate the frequency problems from which JSteg suffers. The F5 algorithm developed by Westfeld uses a more complex mapping system between coefficients that leads to a much better distribution. Niels Provos’s OutGuess stego program addresses the problem by using only half of the possible coefficients to store the message and the other half to balance out the distributions. This reduces the amount of data an image can hold by half. Both approaches make hidden messages much more difficult to detect in JPEG files.

Conclusion

Detecting steganography is particularly difficult. There are too many different possible algorithms for too many different possible covers. So difficult is detection, in fact, that there has been only one documented case of discovering a stego in the “wild.” In October 2001, Niels Provos and Peter Honeyman determined that a particular image found in a crawl of the Internet was in fact a stego and contained another embedded image. After scanning in excess of 2 million images, the researchers found a stego image that was produced in conjunction with an ABC segment on steganography (Provos).

Until recently, steganography has been secondary in research to cryptography. Whether or not bin Laden and his conspirators actually utilized steganographic methods to coordinate their attacks, we predict that in the coming months and years researchers will develop more efficient methods of concealing messages as well as new, creative techniques for detecting those who do transmit covered information.

Works Cited

Anderson, Ross and Fabien Petitcolas. “On the Limits of Steganography.” IEEE Journal on Selected Areas in Communications, 16(4):474-481, May 1998.

Chang, Ellen and Udara Fernando. "Data Compression." Online. Internet. Available HTTP. http://www.stanford.edu/~udara/SOCO/lossy/jpeg/index.htm. Accessed 11/21/01.

Johnson, Neil F. and Sushil Jajodia. “Exploring Steganography: Seeing the Unseen.” IEEE Computer Magazine, 31(2):26-34, February 1998.

McCullagh, Declan. “Bin Laden: Steganography Master?” Online. Internet. Available HTTP. http://www.wired.com/news/politics/0,1283,41658,00.html. Accessed 12/3/01.

McCullagh, Declan. “Secret Messages Come in .Wavs.” Online. Internet. Available HTTP. http://www.wired.com/news/politic/0,1283,41861,00.html. Accessed 12/3/01.

Provos, Niels and Peter Honeyman. “Detecting Steganographic Content on the Internet.” Working Paper, Center for Information Technology Integration, University of Michigan.

Provos, Niels. “First Steganographic Image in the Wild.” Online. Internet. Available HTTP. http://www.citi.umich.edu/u/provos/stego/abc.html. Accessed 12/3/01.

Schneier, Bruce. “Terrorists and steganography.” Online. Internet. Available HTTP. http://www.zdnet.com/zdnn/stories/news/0,4586,2814256,00.html. Accessed 12/4/01.

Sellers, Duncan. “An Introduction to Steganography.” Online. Internet. Available HTTP. http://www.cs.uct.ac.za/courses/CS400W/NIS/papers99/dsellars/stego.html. Accessed 12/3/01.

Westfeld, Andreas and Andreas Pfitzmann. “Attacks on Steganographic Systems.” http://os.inf.tu-dresden.de/~westfeld/publikationen/ihw99.pdf
