Third Assignment
Due 10 December 2004

The aim of the first two questions of this homework is to show you how to use R to analyze similarities data and to make plots. In the third part we will take a roll call matrix and run it through my Optimal Classification (OC) Program, insert the coordinates into the roll call matrix with Epsilon, and then paste the transformed matrix into Stata.

Feel free to send me e-mail if you get stuck and need help.

  1. In this problem we are going to do the U.S. Map Example under topic 2, Classical Scaling of Similarities Data, on the course webpage. To begin, download the R program and the driving distance matrix that it analyzes (these are also posted on the course webpage):

    double_center_drive_data.r -- R Program to Perform Double-Centering of U.S. Driving Distances Data

    Drive2.TXT -- U. S. Driving Distances Data

    and place the files in the same directory.

    Start R in the usual WINDOZE fashion by double-clicking on the R icon. You should see something like this:

    R IS NOT USER FRIENDLY!!!! However, it and Stata are rapidly establishing themselves as standards within Political Science -- especially with the "pocket-protector" and "propeller-head" types (like myself) -- and it is definitely worth your time to learn R!

    Our program double_center_drive_data.r uses three plug-in libraries -- MASS, pcurve, and stats -- and the first thing you need to do is check to see if they are installed on your computer. To do this, go to the Packages menu and select Load package:

    You will now see the packages that are already on your computer:

    In my case I already have MASS, pcurve, and stats, so they will appear in this menu. MASS and stats should be in this menu as they are default packages in R. If pcurve is not on the list you can download it from the Comprehensive R Archive Network (CRAN) website by selecting Install package(s) from CRAN under the Packages menu. R will go to the website and a select menu will appear showing all the packages that you can download:

    Scroll down until you see pcurve:

    Click "OK" and it downloads and installs itself automatically. It then asks you if you want to delete the ZIP files:

    Always answer N because the zip files do not take up much disk space and they are always around if you need them!
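
    The same check can also be made from the R console instead of the menus. The snippet below is a minimal sketch, assuming only the three package names given above; requireNamespace() reports whether a package is installed without attaching it, and the install.packages() call is left commented out so that nothing is downloaded unless you ask for it.

```r
# Packages that double_center_drive_data.r relies on.
needed <- c("MASS", "pcurve", "stats")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still have to be fetched from CRAN.
not_installed <- needed[!installed]
print(not_installed)

# Uncomment to install whatever is missing (this needs an internet connection).
# install.packages(not_installed)
```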

    R has a fairly good help menu with the PDF of the manual and a useful HTML help that has the complete manual in the form of cross-linked webpages:

    When you select HTML help you will get this:

    Note that the HTML help page comes up in MICROSHAFT IE. This can be a problem if you have installed the latest service pack from MICROSHAFT that includes a version of IE that has a "pop-up" blocker. When you use the search engine MICROSHAFT may screw it up (I have found that this varies from machine to machine)!!! Note that you can always simply type in the link

    C:\Program Files\R\rw1091\doc\html\rwin.html

    in a non-MICROSHAFT browser and it will work just fine.

    To run the double_center_drive_data.r R program, go to the File menu and select Source R Code:

    You will get the usual WINDOZE directory menu. Go to where you placed double_center_drive_data.r and select it:

    When you click Open R runs the program and you should get this:

    If you Right-Click on the image it brings up this very handy menu:

    If you copy it as a bitmap then you can paste it directly into MICROSHAFT WORD or paste it directly into a graphics package like Photoshop or Paint Shop Pro. If you paste it into a graphics package you can then save it as a JPEG or GIF file.

    E-Mail me this graph in either form for a PASS on this problem.
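
    The menu steps in this problem also have console equivalents: source() runs a program file, and a graphics device function writes the plot straight to a file. The sketch below is an illustration only. It writes a tiny stand-in script to a temporary file (a hypothetical placeholder for double_center_drive_data.r) and uses the pdf() device, which comes with a standard R installation; where your R supports them, png() or jpeg() are used the same way.

```r
# Stand-in script so the sketch runs anywhere; in the homework you would
# source double_center_drive_data.r from your own directory instead.
script <- tempfile(fileext = ".r")
writeLines(c("x <- c(1, 2, 3)",
             "y <- x^2",
             "plot(x, y, pch = 16, col = \"red\", main = \"Stand-in plot\")"),
           script)

# Open a file device, run the program, then close the device to finish the file.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
source(script)
dev.off()

print(file.exists(plot_file))
```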

  2. In this problem we are going to modify the R program so that the names of the cities are positioned on the graph in a nicer fashion. In particular, note that Los Angeles is partly cut off in the plot above. In this problem I will show you the R program and how to change the display of the city names.

    Below is a listing of the double_center_drive_data.r program. R is an Interpreter -- that is, it reads and executes one line of code at a time! In this regard R is a lot like the original BASIC programming language for microcomputers -- the computer quite literally runs the program one-line-at-a-time. You could type the program below into R one line at a time if you wanted to. Used this way, R is a very powerful calculator with many built-in commands like mean(x) and sd(x) where x is a vector of numbers.
    #  The cross-hatch is used as a comment marker -- R ignores the line
    # double_center_drive_data.r -- Double-Center Program
    # (Always put the name of the program at the top)
    # Data Must Be Transformed to Squared Distances Below
    # (I embed the data in the comments below for convenience)
    # ATLANTA      0000 2340 1084  715  481  826 1519 2252  662  641 2450
    # BOISE        2340 0000 2797 1789 2018 1661  891  908 2974 2480  680
    # BOSTON       1084 2797 0000  976  853 1868 2008 3130 1547  443 3160
    # CHICAGO       715 1789  976 0000  301  936 1017 2189 1386  696 2200
    # CINCINNATI    481 2018  853  301 0000  988 1245 2292 1143  498 2330
    # DALLAS        826 1661 1868  936  988 0000  797 1431 1394 1414 1720
    # DENVER       1519  891 2008 1017 1245  797 0000 1189 2126 1707 1290
    # LOS ANGELES  2252  908 3130 2189 2292 1431 1189 0000 2885 2754  370
    # MIAMI         662 2974 1547 1386 1143 1394 2126 2885 0000 1096 3110
    # WASHINGTON    641 2480  443  696  498 1414 1707 2754 1096 0000 2870
    # CASBS        2450  680 3160 2200 2330 1720 1290  370 3110 2870 0000
    library(MASS)      # These commands tell R to load the libraries
    library(pcurve)    # (stats, the third library, is loaded by default)
    #       The command below reads the driving distances into the matrix T
    #       Note that R uses the syntax "<-" for "=" 
    T <- matrix(scan("C:/ucsd_course/drive2.txt",0),ncol=11,byrow=TRUE)
    #    The c() command creates a vector
    names <- c("Atlanta ","Boise ","Boston ","Chicago ","Cincinnati ","Dallas ","Denver ",
               "Los Angeles ","Miami ","Washington ","CASBS ")
    nrow <- length(T[,1])     # The length command tells you the length of a vector
    ncol <- length(T[1,])     # A blank index acts as a wildcard: T[1,] is all of row 1
    TT <- rep(0,nrow*ncol)    # rep stands for repeat -- this creates a vector of zeroes
    dim(TT) <- c(nrow,ncol)   # dim stands for dimension -- this creates an nrow by ncol matrix of zeroes
    TTT <- rep(0,nrow*ncol)
    dim(TTT) <- c(nrow,ncol)
    xrow <- NULL       # The NULL is a convenient way to initialize a vector
    xcol <- NULL
    xcount <- NULL
    matrixmean <- 0    # Simple initialization to zero
    matrixmean2 <- 0
    # Transform the Matrix     This is a simple double DO-Loop that divides the Distances
    #                          by 1000 and squares them so that we have a set of squared
    #                          distances roughly in the range of 0 to 10
    # The programming language is essentially the same as C/C++ and Visual Basic
    i <- 0
    while (i < nrow) {
      i <- i + 1
      xcount[i] <- i
      j <- 0
      while (j < ncol) {
         j <- j + 1
    #  Square the Driving Distances
         TT[i,j] <- (T[i,j]/1000)**2
         }
      }
    #  Put it Back in T
    T <- TT
    #  Below is the old Long Method of Double-Centering
    # Compute Row and Column Means
    i <- 0
    while (i < nrow) {
      i <- i + 1
      xrow[i] <- mean(T[i,])  # Row Means of the Squared Distance Matrix
      }
    i <- 0
    while (i < ncol) {
      i <- i + 1
      xcol[i] <- mean(T[,i])  # Column Means of the Squared Distance Matrix
      }
    matrixmean <- mean(xcol)  # Matrix Mean
    matrixmean2 <- mean(xrow) # This is a safety check
    # Double-Center the Matrix Using old Long Method
    #  Compute comparison as safety check
    i <- 0
    while (i < nrow) {
      i <- i + 1
      j <- 0
      while (j < ncol) {
         j <- j + 1
    #  Basic Formula for Double-Centering a Squared Distance Matrix
         TT[i,j] <- (T[i,j]-xrow[i]-xcol[j]+matrixmean)/(-2)
         TTT[i,j] <- (T[i,j]-xrow[i]-xcol[j]+matrixmean)/(-2)
         }
      }
    #  Perform Eigenvalue-Eigenvector Decomposition of Double-Centered Matrix
    #                eigen performs an eigenvalue-eigenvector decomposition of the input matrix
    #                The n by n matrix of eigenvectors is returned in ev$vec[*,*]
    #                The n length vector of eigenvalues is returned in ev$val[*]
    ev <- eigen(TT)
    #  Find Point furthest from Center of Space
    #  The max(x) command finds the maximum value in the vector x
    #  Note that aaa and bbb should be exactly the same!
    #  This is overkill, but it is a handy check on your math!
    aaa <- sqrt(max((abs(ev$vec[,1]))**2 + (abs(ev$vec[,2]))**2))
    bbb <- sqrt(max(((ev$vec[,1]))**2 + ((ev$vec[,2]))**2))
    #  Weight the Eigenvectors to Scale Space to Unit Circle
    #  This is the classical Torgerson Solution
    torgerson1 <- ev$vec[,1]*(1/aaa)*sqrt(ev$val[1])
    #torgerson2 <- ev$vec[,2]*(1/aaa)
    #torgerson1 <- -ev$vec[,1]*(1/aaa)
    torgerson2 <- -ev$vec[,2]*(1/aaa)*sqrt(ev$val[2])
    #  The basic plot command.  The first two arguments are the x-axis and y-axis coordinates.
    plot(torgerson1,torgerson2,type="n",asp=1, # type="n" suppresses all plotting of points; asp=1 maintains the aspect ratio
           main="Double-Centered Driving Distance Matrix \n Torgerson Coordinates", # Title -- the \n is a "new-line" command
           xlab="West     East", # Label for the x-axis
           ylab="South   North", # Label for the y-axis
           xlim=c(-3.0,3.0),ylim=c(-3.0,3.0)) # Sets range for the axes -- these must be the same if asp=1
    points(torgerson1,torgerson2,pch=16,col="red")     # Places points in the plot -- pch=16 is a solid circle
    text(torgerson1,torgerson2,names,col="blue",adj=1) # Places the City names -- adj=1 puts names on left
    Go back to the R prompt by clicking on the R Console Window to bring it to the front. You can now see the values for any of the variables in the program by simply typing the variable name and hitting Enter. For example, to check to see if matrixmean and matrixmean2 produce the same number, simply type each in turn:

    To see the eigenvalues type ev$val:

    Note that some of the eigenvalues are negative!
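
    The long double-centering loops in the listing can also be written with matrix operations, and cmdscale() in the stats package carries out the same classical scaling in one call. The sketch below is not part of double_center_drive_data.r: it re-enters the driving distances from the comment block of the listing, and the names D, D2, J, B, and B_means are chosen here for illustration. It skips the rescaling by aaa, so the coordinates are the raw Torgerson coordinates. Printing the eigenvalues also shows the negative ones mentioned above.

```r
# Driving distances copied from the comment block in the listing.
cities <- c("Atlanta", "Boise", "Boston", "Chicago", "Cincinnati", "Dallas",
            "Denver", "Los Angeles", "Miami", "Washington", "CASBS")
D <- matrix(c(
     0, 2340, 1084,  715,  481,  826, 1519, 2252,  662,  641, 2450,
  2340,    0, 2797, 1789, 2018, 1661,  891,  908, 2974, 2480,  680,
  1084, 2797,    0,  976,  853, 1868, 2008, 3130, 1547,  443, 3160,
   715, 1789,  976,    0,  301,  936, 1017, 2189, 1386,  696, 2200,
   481, 2018,  853,  301,    0,  988, 1245, 2292, 1143,  498, 2330,
   826, 1661, 1868,  936,  988,    0,  797, 1431, 1394, 1414, 1720,
  1519,  891, 2008, 1017, 1245,  797,    0, 1189, 2126, 1707, 1290,
  2252,  908, 3130, 2189, 2292, 1431, 1189,    0, 2885, 2754,  370,
   662, 2974, 1547, 1386, 1143, 1394, 2126, 2885,    0, 1096, 3110,
   641, 2480,  443,  696,  498, 1414, 1707, 2754, 1096,    0, 2870,
  2450,  680, 3160, 2200, 2330, 1720, 1290,  370, 3110, 2870,    0),
  ncol = 11, byrow = TRUE, dimnames = list(cities, cities))

D2 <- (D / 1000)^2                     # squared distances, as in the listing
n  <- nrow(D2)
J  <- diag(n) - matrix(1 / n, n, n)    # centering matrix
B  <- -0.5 * J %*% D2 %*% J            # double-centered matrix

# The same matrix from the row, column, and matrix means used in the loops.
B_means <- (D2 - outer(rowMeans(D2), rep(1, n)) -
                 outer(rep(1, n), colMeans(D2)) + mean(D2)) / (-2)
print(all.equal(B, B_means, check.attributes = FALSE))

ev <- eigen(B, symmetric = TRUE)
print(round(ev$values, 4))             # sorted in decreasing order

# Coordinates: eigenvectors weighted by the square roots of the eigenvalues.
coords <- ev$vectors[, 1:2] %*% diag(sqrt(ev$values[1:2]))

# cmdscale() should agree up to a reflection of each axis.
cm <- cmdscale(D / 1000, k = 2)
print(all.equal(abs(coords), abs(cm), check.attributes = FALSE))
```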

    We are now going to fix the plotting of the city names. Bring double_center_drive_data.r up in Epsilon. We are going to modify the file but for safety's sake, let's do that in a copy of the file that we will call double_center_drive_data_2.r. To do this, use the write-file command:

    C-X C-W

    A banner will appear in the bottom of the window giving the current path:

    Now just type in the filename -- double_center_drive_data_2.r -- and hit Enter

    Now, go to the bottom of the file and enter the text below using Epsilon. Note that you are commenting out the text(...) command and then adding lines followed by a new text(...) command.
    # pos -- a position specifier for the text. Values of 1, 2, 3 and 4, 
    # respectively indicate positions below, to the left of, above and 
    # to the right of the specified coordinates 
    namepos <- NULL
    namepos[1] <- 2   # Atlanta
    namepos[2] <- 2   # Boise   
    namepos[3] <- 2   # Boston
    namepos[4] <- 2   # Chicago 
    namepos[5] <- 2   # Cincinnati
    namepos[6] <- 2   # Dallas     
    namepos[7] <- 2   # Denver
    namepos[8] <- 2   # Los Angeles
    namepos[9] <- 2   # Miami   
    namepos[10] <- 2  # Washington
    namepos[11] <- 2  # CASBS
    The vector namepos controls where the city names in the names vector are placed relative to the points. Run double_center_drive_data_2.r in R and you should get:

    The parameter offset controls how close the text is to the points. If you set it to 0.5 you will get this:

    Now, with offset=0.0 or offset=0.25 you can easily adjust the positions of the names of the cities by simply toggling back and forth between Epsilon and R. (Note that in R you can get the previous command by simply using the Up Arrow.) For example, with offset=0.25 and with namepos[8] <- 4 (Los Angeles) and namepos[4] <- 3 (Chicago) you will get:

    Note that I put a space after the city names in the vector names. Take these out and you get:

    We really do not need those spaces because we can achieve the same effect using the offset parameter.
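
    A call to text() that uses namepos and offset can be sketched as follows. This is an illustration under assumptions, not the final line of the program: it assumes namepos is handed to the pos argument and the spacing to the offset argument (both are standard arguments of text()), and the three coordinates are made up rather than taken from the Torgerson solution. The pdf(NULL) call opens a null graphics device so the sketch runs without opening a window.

```r
pdf(NULL)   # null device: nothing is written to disk

# Made-up coordinates for three cities (NOT the Torgerson coordinates).
x      <- c(-0.8, 0.1, 0.7)
y      <- c( 0.2, -0.3, 0.4)
labels <- c("Los Angeles", "Chicago", "Boston")   # no trailing spaces needed

# c() builds the whole position vector at once:
# 1 = below, 2 = left, 3 = above, 4 = right of the point.
namepos <- c(4, 3, 2)

plot(x, y, type = "n", asp = 1, xlim = c(-1, 1), ylim = c(-1, 1))
points(x, y, pch = 16, col = "red")
# pos takes one value per label; offset sets the gap between point and text.
text(x, y, labels, pos = namepos, offset = 0.25, col = "blue")

dev.off()
```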

    For a PASS on this problem 1) produce a city plot with all the names clearly visible (no overlapping of any names); and 2) send me the final version of your double_center_drive_data_2.r program.

  3. In this problem we are going to scale the 90th U. S. Senate with my Optimal Classification (OC) Program, insert the coordinates into the roll call matrix with Epsilon, and then paste the transformed matrix into Stata using Epsilon. To begin, download the 90th U. S. Senate Roll Call data:

    SEN90KH.ORD -- 90th (1967-68) Senate Roll Call Data

    and the Optimal Classification program and its "Control File":

    Optimal Classification (OC) Scaling Program Executable (Compiled for WINTEL Machines) -- PERFL.EXE

    Control File -- PERFSTRT.DAT

    and put all the files in the same directory.

    Bring SEN90KH.Ord up in Epsilon, split the window using the C-X 2 command and use the find-file C-X C-F command to bring up Perfstrt.dat. You should be here:

    Now, just as we did in part 3 of the second homework assignment, we need to change Perfstrt.dat so that it tells OC the correct file to read, the title of the data, the correct number of roll calls, the number of dimensions to estimate, and so on.

    To get the correct file name and title just change 107 to 90 in the first two lines. This U. S. Senate occurred during the period of history when two basic dimensions were required to account for voting in Congress. Accordingly, change the number of dimensions from "1" to "2". Use the method described in part 3 of the second homework assignment to find the number of roll call votes. You should get 596. You should be here:

    If you counted the number of roll calls correctly using the Show-Point Command -- C-X =, you should have found:

    Column 632, char 632 of 64566 is '^J'=10 decimal=0A hex
    Column 36, char 36 of 64566 is '9'=57 decimal=39 hex

    respectively. Therefore, the format statements are already set up correctly.
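
    The arithmetic behind the count of 596 is the line width minus the width of the header: Show-Point reports the end of the line at column 632 and the first vote at column 36, and 632 - 36 = 596. As a cross-check, the same count can be made in R. The sketch below assumes every record has the same width, with a 36-character header followed by one character per roll call; the two records are made-up stand-ins, not lines from SEN90KH.ORD, and count_roll_calls is a hypothetical helper.

```r
# Number of roll calls = record width minus the width of the header.
count_roll_calls <- function(lines, header_width = 36) {
  widths <- unique(nchar(lines))
  stopifnot(length(widths) == 1)   # all records must have the same width
  widths - header_width
}

# Two made-up records: a 36-character header followed by four votes.
fake <- c(sprintf("%-36s%s", "JOHNSON",  "1169"),
          sprintf("%-36s%s", "SPARKMAN", "6611"))
print(count_roll_calls(fake))      # 4 votes per record in this toy example

# With the real file the call would be:
#   count_roll_calls(readLines("SEN90KH.ORD"))
print(632 - 36)                    # the arithmetic from the Show-Point output
```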

    All that remains is to select which Senators control the polarity of the two dimensions. In this Senate we will use President Johnson and Senator Sparkman, respectively. That is, change the 0001000018 (recall that the spacing of the numbers must be maintained!) to 0000100002. You should now be here:

    Make sure that you save Perfstrt.dat -- C-X C-S or the usual WINDOZE file-save icon -- and exit Epsilon -- C-X C-C. At the WINDOZE command line type


    The program should only take about one minute to run and it produces three output files -- PERF21.DAT, PERF23.DAT, and PERF25.DAT. PERF21.DAT should look something like this:
     21 NOVEMBER  2004
     RANDOM NUMBER SEED     33700
        2  596   20   36    1    2   10 0.005
      1 ROLL CALLS   2    6244   48878  0.12775  0.87225  0.55333           0.00000
        LEGISLATORS  2    6210   48878  0.12705  0.87295  0.55576  0.00000
      2 ROLL CALLS   2    6143   48878  0.12568  0.87432  0.56056           0.99832
        LEGISLATORS  2    6129   48878  0.12539  0.87461  0.56156  0.99434
      3 ROLL CALLS   2    6079   48878  0.12437  0.87563  0.56513           0.99853
        LEGISLATORS  2    6072   48878  0.12423  0.87577  0.56563  0.99982
      4 ROLL CALLS   2    6067   48878  0.12413  0.87587  0.56599           0.99935
        LEGISLATORS  2    6063   48878  0.12404  0.87596  0.56628  0.99922
      5 ROLL CALLS   2    6053   48878  0.12384  0.87616  0.56699           0.99761
        LEGISLATORS  2    6053   48878  0.12384  0.87616  0.56699  0.99999
      6 ROLL CALLS   2    6050   48878  0.12378  0.87622  0.56721           0.99981
        LEGISLATORS  2    6049   48878  0.12376  0.87624  0.56728  0.99982
      7 ROLL CALLS   2    6048   48878  0.12374  0.87626  0.56735           0.99976
        LEGISLATORS  2    6048   48878  0.12374  0.87626  0.56735  1.00000
      8 ROLL CALLS   2    6048   48878  0.12374  0.87626  0.56735           0.99983
        LEGISLATORS  2    6047   48878  0.12372  0.87628  0.56742  0.99999
      9 ROLL CALLS   2    6045   48878  0.12368  0.87632  0.56757           0.99987
        LEGISLATORS  2    6045   48878  0.12368  0.87632  0.56757  1.00000
     10 ROLL CALLS   2    6045   48878  0.12368  0.87632  0.56757           0.99986
        LEGISLATORS  2    6044   48878  0.12365  0.87635  0.56764  0.99999
     11 ROLL CALLS   2    6042   48878  0.12361  0.87639  0.56778           0.99986
        LEGISLATORS  2    6041   48878  0.12359  0.87641  0.56785  0.99996
     12 ROLL CALLS   2    6039   48878  0.12355  0.87645  0.56799           0.99985
        LEGISLATORS  2    6035   48878  0.12347  0.87653  0.56828  0.99986
     13 ROLL CALLS   2    6034   48878  0.12345  0.87655  0.56835           0.99974
        LEGISLATORS  2    6034   48878  0.12345  0.87655  0.56835  1.00000
     14 ROLL CALLS   2    6033   48878  0.12343  0.87657  0.56842           0.99986
        LEGISLATORS  2    6033   48878  0.12343  0.87657  0.56842  1.00000
     15 ROLL CALLS   2    6029   48878  0.12335  0.87665  0.56871           0.99991
        LEGISLATORS  2    6029   48878  0.12335  0.87665  0.56871  1.00000
     16 ROLL CALLS   2    6028   48878  0.12333  0.87667  0.56878           0.99987
        LEGISLATORS  2    6028   48878  0.12333  0.87667  0.56878  1.00000
     17 ROLL CALLS   2    6027   48878  0.12331  0.87669  0.56885           0.99986
        LEGISLATORS  2    6027   48878  0.12331  0.87669  0.56885  1.00000
     18 ROLL CALLS   2    6027   48878  0.12331  0.87669  0.56885           0.99980
        LEGISLATORS  2    6027   48878  0.12331  0.87669  0.56885  1.00000
     19 ROLL CALLS   2    6026   48878  0.12329  0.87671  0.56892           0.99988
        LEGISLATORS  2    6026   48878  0.12329  0.87671  0.56892  1.00000
     20 ROLL CALLS   2    5882   48878  0.12034  0.87966  0.57923           0.99358
        LEGISLATORS  2    5882   48878  0.12034  0.87966  0.57923  0.99998
     MEAN VOLUME LEG.   0.0057   0.0163
     MACHINE PREC.   2    5882   48878  0.12034  0.87966  0.57923
     MACHINE PREC.   2    5882   48878  0.12034  0.87966  0.57923
    You should reproduce the above almost exactly subject to very small variations between CPU types and the random number draw that is used to do a final search on the legislator polytopes. Note that the correct classification is 87.966 percent with an APRE of 0.57923 (the output files are explained in detail on the Optimal Classification (OC) Page).
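
    The proportions on each line follow from the two counts that precede them. Assuming those counts are the classification errors and the total number of choices (an assumption based on the numbers themselves; the OC page has the authoritative description), the error rate is errors divided by choices and the correct classification is one minus that. A quick check in R with the numbers from the final line:

```r
errors  <- 5882     # classification errors on the final line
choices <- 48878    # total choices on the final line

error_rate      <- errors / choices
correct_classif <- 1 - error_rate
print(round(c(error_rate, correct_classif), 5))

# The first iteration line works the same way: 6244 errors out of 48878.
print(round(1 - 6244 / 48878, 5))
```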

    To place the legislator OC coordinates into the roll call file, bring SEN90KH.Ord up in Epsilon, split the window using the C-X 2 command and use the find-file C-X C-F command to bring up PERF25.DAT. You should be here:

    Place the cursor at the beginning of the line for President Johnson and start a keyboard macro with the start-keyboard-macro command:

    C-X (

    where "(" is the left parenthesis. We are going to move 68 spaces to the right in PERF25.DAT, put the coordinates in the kill buffer, yank them back out, move to the beginning of the line, and move down one line so the cursor is now in front of the record for Sparkman. To do this, enter:


    and you will now be at this point:

    Now we are going to move to the upper window with the C-X P command (see selecting-windows commands in the Epsilon manual), move 36 spaces to the right in SEN90KH.ORD, yank the coordinates from the buffer, hit the space-bar once to open up one space between the coordinates and the first roll call, move to the beginning of the line, and move down one line so the cursor is now in front of the record for Sparkman. To finish the macro we move down to the lower window with the C-X N command (see selecting-windows commands in the Epsilon manual), and close the macro with C-X ). To do the above, enter


    where SPACE means hit the space-bar one time. You should now be here:

    Execute the macro 100 times (C-U100C-XE) and this should produce:

    Execute the macro 1 more time, save PERF25.DAT, close the lower window with C-X 0 (Control-X "Zero"), and then write out SEN90KH.ORD as SEN90KH_STATA.TXT using the write-file command:

    C-X C-W
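
    The keyboard macro is the method for this assignment, but the same splice can be written in a few lines of R, which is a handy way to check your result. This is a sketch under assumptions: the coordinates are taken to start 68 characters into each PERF25.DAT record and to be inserted after the first 36 characters of each roll call record, followed by one space, as described above. The records are made-up stand-ins and splice_coords is a hypothetical helper.

```r
# Insert the coordinate text from each coordinate record into the matching
# roll call record, right after the header, followed by a single space.
splice_coords <- function(roll_lines, coord_lines,
                          header_width = 36, coord_skip = 68) {
  stopifnot(length(roll_lines) == length(coord_lines))
  coords <- substring(coord_lines, coord_skip + 1)   # everything after column 68
  paste0(substr(roll_lines, 1, header_width),
         coords, " ",
         substring(roll_lines, header_width + 1))
}

# Made-up records in the assumed layouts.
roll  <- c(sprintf("%-36s%s", "JOHNSON",  "1169"),
           sprintf("%-36s%s", "SPARKMAN", "6611"))
coord <- c(sprintf("%-68s%s", "JOHNSON",  " -0.250  0.100"),
           sprintf("%-68s%s", "SPARKMAN", "  0.300 -0.450"))

out <- splice_coords(roll, coord)
print(out)
```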

    We are now here:

    Go to the beginning of SEN90KH_STATA.TXT with the command:

    Alt-< (hold Alt Key down and type <)

    This will put you at the top of the file. What we need to do now is: 1) remove all the commas that exist in the name fields (you will see why in a moment); 2) run a macro that inserts commas between the variable fields (this allows us to paste into Stata); and 3) paste the file directly into the Stata data editor spreadsheet.

    To begin simply use the replace-string command to replace every comma with a space (that way the columns of the file will remain in alignment). The replace-string command is:

    Alt-X Replace-String

    The Alt-X brings up the Command: banner and then you simply type replace-string:

    Now hit Enter:

    Type , and hit Enter:

    At this point Epsilon remembers the last string you used in this situation and displays it in blue. Just hit Backspace if this occurs.

    Now hit the Space-Bar one time -- this tells Epsilon that you are replacing every occurrence of a "," with " " -- and hit Enter:

    Epsilon tells you how many replacements it made.
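
    For comparison, the same replacement in R is a single gsub() call with fixed = TRUE, and counting the commas beforehand gives the kind of tally that Epsilon reports. The records below are made-up examples, not lines from the file.

```r
# Made-up name fields, padded to a common width.
records <- c("KENNEDY, R    ", "BYRD, H       ", "SPARKMAN      ")

# Count the commas first, then swap each one for a single space.
n_commas <- sum(lengths(regmatches(records,
                                   gregexpr(",", records, fixed = TRUE))))
cleaned  <- gsub(",", " ", records, fixed = TRUE)

print(n_commas)
print(cleaned)

# A one-for-one replacement keeps every record the same width,
# so the columns of the file stay in alignment.
print(all(nchar(cleaned) == nchar(records)))
```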

    To finish off our project download these two macro files

    Homework-3-3.TXT -- Keyboard Macro as Text File

    Comma.TXT -- Keyboard Macro to Insert a Comma

    Split the window using the C-X 2 command, use the find-file C-X C-F command to bring up Homework-3-3.TXT, split the window again, and bring up Comma.TXT. You should see this:

    Because homework-3-3 calls commax (defined in Comma.txt), you must load the Comma.txt file first as we did in homework 2 problem 2. With the cursor in the bottom window type Alt-X (Hold Alt Key down and type X) and after the Command: banner comes up, type load-buffer and hit the Enter key. You should get 0 errors detected.

    Move the cursor up to homework-3-3.txt using C-X P and load this file as well using Alt-X and load-buffer. Finally, move the cursor up to SEN90KH_STATA.TXT using C-X P. To run the macro type Alt-X and at the Command: banner type homework-3-3 (the name in the quotes) and hit Enter. You should get:

    Run the macro 100 more times using C-U100Alt-X and at the 100 Command: banner type homework-3-3

    Execute the macro one more time and save SEN90KH_STATA.TXT, close the other two windows and go to the beginning of SEN90KH_STATA.TXT. Start Stata, set the memory to about 500 meg -- set mem 500m, and bring up the data editor spreadsheet:

    Finally, at the top of SEN90KH_STATA.TXT type

    C-@ (hold Control key down and type @)

    This sets a mark in Epsilon. Now go to the bottom of the file with Alt-> and it will highlight the whole file:

    Put the file on the clipboard:

    Go to Stata and paste the file into the data editor using the Edit drop-down menu:

    Save the file as ucsd_homework_3.dta. Turn on the logfile function in Stata (use .log file type!!!), issue the summ command, close the log file, and e-mail me the log file for a PASS on this problem.