Class 13 – Wednesday, March 3
A dataset by any other name is still a dataset (but it is not a set)
In a loop, a loop — Nesting, but not like a bird — Repeating again
Look both ways
Agenda
- Dataset processing
- Introduce web processing
Downloads
- Program column_grabbing.py
- Program lotta_books.py
- Program master_plan.py
Test 2
- Wednesday March 17
To do list
- Complete current homeworks
Datasets
- A dataset is a list whose elements are lists.
- Datasets are sometimes called tables or data sheets
- The elements of a two-dimensional dataset are called rows. The elements of a row are called data values or cells.
- Most of the datasets that we process will come from the web.
- The datasets acquired by programs are often CSV files; that is, the values are separated by commas.
- One of the CSV datasets we will consider lists the best-selling fiction books of all time.
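To make the terminology concrete, here is a minimal sketch of a dataset as a list of lists. The variable names and the sample CSV line are illustrative assumptions, not taken from the course programs.

```python
# A small illustrative dataset: a list whose elements are lists (rows).
dataset = [
    ['Name', 'Author', 'Date'],
    ['The Hobbit', 'Tolkien', 1937],
    ['Don Quixote', 'de Cervantes', 1605],
]

header = dataset[0]     # the first row
row = dataset[1]        # the second row
cell = dataset[1][2]    # the cell in row 1, column 2

print('number of rows:', len(dataset))
print('header:', header)
print('row:', row)
print('cell:', cell)

# One line of CSV text becomes a row by splitting on commas.
# Note that every cell produced this way is a string.
line = 'The Hobbit,Tolkien,1937'
cells = line.split(',')
print('cells:', cells)
```

Because split() always produces strings, a numeric cell such as '1937' needs int() before it can be used in arithmetic.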
Program column_grabbing.py
- For a user-specified column index produce a list of values for that column
Some program runs
table = [['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I'], ['J', 'K', 'L', 'M']]
Enter column of interest: 0
row ['A', 'B', 'C'] : column 0 cell: A
row ['D', 'E', 'F'] : column 0 cell: D
row ['G', 'H', 'I'] : column 0 cell: G
row ['J', 'K', 'L', 'M'] : column 0 cell: J
Column 0 : ['A', 'D', 'G', 'J']
table = [['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I'], ['J', 'K', 'L', 'M']]
Enter column of interest: 2
row ['A', 'B', 'C'] : column 2 cell: C
row ['D', 'E', 'F'] : column 2 cell: F
row ['G', 'H', 'I'] : column 2 cell: I
row ['J', 'K', 'L', 'M'] : column 2 cell: L
Column 2 : ['C', 'F', 'I', 'L']
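The runs above can be approximated with the sketch below. It is not the course's column_grabbing.py: the helper name get_column is hypothetical, the column index is passed as a parameter rather than read with input(), and skipping rows that are too short is an assumption made here because the last row of the table is longer than the others.

```python
def get_column(table, c):
    """Return a list of the column-c cells of table (hypothetical helper)."""
    column = []
    for row in table:
        # Assumption: a row with no cell at index c is skipped.
        if c < len(row):
            cell = row[c]
            column.append(cell)
    return column

table = [['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I'], ['J', 'K', 'L', 'M']]

print('Column 0 :', get_column(table, 0))
print('Column 2 :', get_column(table, 2))
print('Column 3 :', get_column(table, 3))   # only the last row has a column 3
```

Without the length check, asking for column 3 would raise an IndexError on the first row.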
Program lotta_books.py
- Examines a literal dataset based on the web dataset best_sellers.csv
Program run
header: ['Name', 'Author', 'Language', 'Date', 'Sales']
sales column: 4
name column: 0
date column: 3
books: [["Alice's Adventures in Wonderland", 'Carroll', 'English', 1865, 100000000], ['And Then There Were None', 'Christie', 'English', 1939, 100000000], ['Dream of the Red Chamber', 'Xueqin', 'Chinese', 1754, 100000000], ['Don Quixote', 'de Cervantes', 'Spanish', 1605, 500000000], ['Harry Potter', 'Rowling', 'English', 1997, 447000000], ['The Hobbit', 'Tolkien', 'English', 1937, 150000000], ['The Little Prince', 'de Saint-Exupery', 'French', 1943, 150000000], ['The Lord of the Rings', 'Tolkien', 'English', 1954, 150000000], ['A Tale of Two Cities', 'Dickens', 'English', 1859, 200000000]]
row: ["Alice's Adventures in Wonderland", 'Carroll', 'English', 1865, 100000000]
row: ['And Then There Were None', 'Christie', 'English', 1939, 100000000]
row: ['Dream of the Red Chamber', 'Xueqin', 'Chinese', 1754, 100000000]
row: ['Don Quixote', 'de Cervantes', 'Spanish', 1605, 500000000]
row: ['Harry Potter', 'Rowling', 'English', 1997, 447000000]
row: ['The Hobbit', 'Tolkien', 'English', 1937, 150000000]
row: ['The Little Prince', 'de Saint-Exupery', 'French', 1943, 150000000]
row: ['The Lord of the Rings', 'Tolkien', 'English', 1954, 150000000]
row: ['A Tale of Two Cities', 'Dickens', 'English', 1859, 200000000]
total sold: 1897000000
dates: [1865, 1939, 1754, 1605, 1997, 1937, 1943, 1954, 1859]
earliest: 1605
latest : 1997
average date: 1872
row with earliest book: 3
row with latest book : 4
info on earliest: ['Don Quixote', 'de Cervantes', 'Spanish', 1605, 500000000]
info on latest: ['Harry Potter', 'Rowling', 'English', 1997, 447000000]
name of earliest: Don Quixote
name of latest: Harry Potter
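One way the values in this run could be computed is sketched below. This is a guess at the approach, not the actual lotta_books.py: the variable names are assumptions, and the average is taken with integer division because the run shows a whole number.

```python
header = ['Name', 'Author', 'Language', 'Date', 'Sales']
books = [
    ["Alice's Adventures in Wonderland", 'Carroll', 'English', 1865, 100000000],
    ['And Then There Were None', 'Christie', 'English', 1939, 100000000],
    ['Dream of the Red Chamber', 'Xueqin', 'Chinese', 1754, 100000000],
    ['Don Quixote', 'de Cervantes', 'Spanish', 1605, 500000000],
    ['Harry Potter', 'Rowling', 'English', 1997, 447000000],
    ['The Hobbit', 'Tolkien', 'English', 1937, 150000000],
    ['The Little Prince', 'de Saint-Exupery', 'French', 1943, 150000000],
    ['The Lord of the Rings', 'Tolkien', 'English', 1954, 150000000],
    ['A Tale of Two Cities', 'Dickens', 'English', 1859, 200000000],
]

# Find the column indexes from the header instead of hard-coding them.
sales_column = header.index('Sales')
name_column = header.index('Name')
date_column = header.index('Date')

# Accumulate the total sales and collect the date column in one pass.
total_sold = 0
dates = []
for row in books:
    total_sold = total_sold + row[sales_column]
    dates.append(row[date_column])

earliest = min(dates)
latest = max(dates)
average_date = sum(dates) // len(dates)   # assumption: integer division

# dates[i] came from books[i], so an index into dates is also a row index.
earliest_row = dates.index(earliest)
latest_row = dates.index(latest)

print('total sold:', total_sold)
print('earliest:', earliest, ' latest:', latest, ' average date:', average_date)
print('name of earliest:', books[earliest_row][name_column])
print('name of latest:', books[latest_row][name_column])
```

The key idea is that the list of dates lines up with the rows of books, so the position of the earliest date identifies the row holding the earliest book.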
Web pages
Our introduction to interacting with the web in CS 1112 is intentionally simple. Industrial-strength web applications also require familiarity with other, more powerful URL modules. The external library requests is worth checking out if you have further interest.
For now the only thing we need is access to the module urllib.request. The module supports working with URLs.
import urllib.request
- The only thing we care about in the module is its function urllib.request.urlopen(), which returns a connector to a URL resource (think web page). Sample usage:
stream = urllib.request.urlopen( link )
- If you care (and I do not), officially the value returned by urlopen() is an http.client.HTTPResponse.
- All we care about is that a stream returned by urlopen() has a function read() to get the contents of the web resource indicated by link.
page = stream.read()
- The contents provided by read() are bytes in an encoded web format rather than a regular text string. We can decode them with the function decode().
text = page.decode( 'UTF-8' )
The above assignment sets text to be the decoded contents of the URL resource named by link; that is, text is a string equal to the contents of the URL resource indicated by link.
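The decoding step can be tried without any web access. In the sketch below, the bytes literal is a made-up stand-in for what read() might return. It holds the UTF-8 encoding of a name containing an accented letter.

```python
# Made-up stand-in for the value returned by stream.read().
page = b'de Saint-Exup\xc3\xa9ry'

print(type(page))     # a bytes object, not a str
print(len(page))      # the accented letter takes two bytes

text = page.decode('UTF-8')

print(type(text))     # now a regular str
print(len(text))      # one character per letter
print(text)
```

The undecoded bytes are one longer than the decoded string because UTF-8 stores the accented letter as two bytes.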
- The following four statements form a template for getting the contents of a URL resource in string format.
import urllib.request # get module access
stream = urllib.request.urlopen( link ) # open connector to the link web resource
page = stream.read() # read contents of the resource
text = page.decode( 'UTF-8' ) # decode contents as normal text string
- What happens next is problem-dependent.
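As one illustration of a possible next step, the sketch below wraps the template in a function and turns CSV text into a dataset. The function name get_contents and the sample text are assumptions. So that the example runs without a network connection, link is a data: URL that carries the sample text itself; in a real program it would be the address of a web file.

```python
import urllib.parse
import urllib.request

def get_contents(link):
    """Return the contents of the URL resource link as a string (hypothetical helper)."""
    stream = urllib.request.urlopen(link)   # open connector to the resource
    page = stream.read()                    # read contents of the resource
    text = page.decode('UTF-8')             # decode contents as a normal string
    return text

# Stand-in for a web CSV file: a data: URL embedding the sample text.
sample = 'Name,Date\nThe Hobbit,1937\nDon Quixote,1605\n'
link = 'data:text/plain;charset=utf-8,' + urllib.parse.quote(sample)

text = get_contents(link)

# Build the dataset: one row per line, one cell per comma-separated value.
dataset = []
for line in text.strip().split('\n'):
    row = line.split(',')
    dataset.append(row)

print(dataset)
```

Splitting on commas is enough for simple files like this one. CSV files whose values themselves contain commas need more care, for example with Python's csv module.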
Program master_plan.py
- Displays the word of the day from the CS 1112 web file word-of-the-day.
???
🦆 © 2022 Jim Cohoon | Resources from previous semesters are available. |