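Variable Selection in Regression Analysis, and Exercises

Statistical Processing, Class 01: Session 14 (10/19/00)

Over the previous two sessions I introduced regression analysis and tried to stress how important residual analysis is. This week I explain how to decide which explanatory variables to keep and which to drop. In the second half, each of you will run a regression analysis on your own data.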

  1. Explanation of the handouts

  2. Variable selection:

    1. Program: les1401.sas

       /* Lesson 14-1 */
       /*    File Name = les1401.sas   10/19/00   */
      
      data air;
        infile 'usair2.prn';
        input id $ y x1 x2 x3 x4 x5 x6;
      /*
        label y='SO2 of air in micrograms per cubic metre'
              x1='Average annual temperature in F'
              x2='Number of manufacturing enterprises employing 20 or more workers'
              x3='Population size (1970 census); in thousands'
              x4='Average annual wind speed in miles per hour'
              x5='Average annual precipitation in inches'
              x6='Average number of days with precipitation per year'
      ;
      */
      
      proc print data=air(obs=10);
      run;
      
      proc corr data=air;
      run;
      
      proc reg data=air;                                       /* full model */
        model y=x1 x2 x3 x4 x5 x6;
        output out=outreg1 predicted=pred1 residual=resid1;
      run;
      
      proc plot data=outreg1;
        plot resid1*pred1;
        plot resid1*x1;                                        /* each plot written out one by one */
        plot resid1*x2;
        plot resid1*x3;
        plot resid1*x4;
        plot resid1*x5;
        plot resid1*x6;
        plot resid1*y;
      run;
      
      proc reg data=air;                                       /* stepwise selection */
        model y=x1-x6 / selection=stepwise;                    /* x1-x6: shorthand for consecutively numbered variables */
        output out=outreg1 predicted=pred1 residual=resid1;    /* replaces the full-model outreg1 */
      run;
      
      proc print data=outreg1(obs=15);
      run;
      
      proc plot data=outreg1;
        plot resid1*pred1;
        plot resid1*(x1 x2 x3 x4 x5 x6);                       /* shorthand (compare with the plots above) */
        plot resid1*(x1-x6);                                   /* shorter still (same meaning) */
        plot resid1*y;
      run;
      
      proc reg data=air;                                       /* all possible subsets */
        model y=x1-x6 / selection=rsquare;
      run;
      
    2. Output: les1401.lst
                                      The SAS System                                1
                                                    11:38 Wednesday, October 18, 2000
      
            OBS    ID           Y     X1      X2     X3     X4      X5      X6
      
              1    Phoenix     10    70.3    213    582    6.0     7.05     36
              2    Little_R    13    61.0     91    132    8.2    48.52    100
              3    San_Fran    12    56.7    453    716    8.7    20.66     67
              4    Denver      17    51.9    454    515    9.0    12.95     86
              5    Hartford    56    49.1    412    158    9.0    43.37    127
              6    Wilmingt    36    54.0     80     80    9.0    40.25    114
              7    Washingt    29    57.3    434    757    9.3    38.89    111
              8    Jacksonv    14    68.4    136    529    8.8    54.47    116
              9    Miami       10    75.5    207    335    9.0    59.80    128
             10    Atlanta     24    61.5    368    497    9.1    48.34    115
      
                                      The SAS System                                2
                                                    11:38 Wednesday, October 18, 2000
                                   Correlation Analysis
      
         7 'VAR' Variables:  Y        X1       X2       X3       X4       X5      
                             X6      
      
                                    Simple Statistics
       
        Variable          N       Mean    Std Dev        Sum    Minimum    Maximum
      
        Y                41    30.0488    23.4723       1232     8.0000   110.0000
        X1               41    55.7634     7.2277       2286    43.5000    75.5000
        X2               41   463.0976   563.4739      18987    35.0000       3344
        X3               41   608.6098   579.1130      24953    71.0000       3369
        X4               41     9.4439     1.4286   387.2000     6.0000    12.7000
        X5               41    36.7690    11.7715       1508     7.0500    59.8000
        X6               41   113.9024    26.5064       4670    36.0000   166.0000
      
                                      The SAS System                                3
                                                    11:38 Wednesday, October 18, 2000
      
                                   Correlation Analysis
      
         Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 41  
      
                  Y         X1         X2         X3         X4         X5         X6
      
      Y     1.00000   -0.43360    0.64477    0.49378    0.09469    0.05429    0.36956
             0.0        0.0046     0.0001     0.0010     0.5559     0.7360     0.0174
      
      X1   -0.43360    1.00000   -0.19004   -0.06268   -0.34974    0.38625   -0.43024
             0.0046     0.0        0.2340     0.6970     0.0250     0.0126     0.0050
      
      X2    0.64477   -0.19004    1.00000    0.95527    0.23795   -0.03242    0.13183
             0.0001     0.2340     0.0        0.0001     0.1341     0.8405     0.4113
      
      X3    0.49378   -0.06268    0.95527    1.00000    0.21264   -0.02612    0.04208
             0.0010     0.6970     0.0001     0.0        0.1819     0.8712     0.7939
      
      X4    0.09469   -0.34974    0.23795    0.21264    1.00000   -0.01299    0.16411
             0.5559     0.0250     0.1341     0.1819     0.0        0.9357     0.3052
      
      X5    0.05429    0.38625   -0.03242   -0.02612   -0.01299    1.00000    0.49610
             0.7360     0.0126     0.8405     0.8712     0.9357     0.0        0.0010
      
      X6    0.36956   -0.43024    0.13183    0.04208    0.16411    0.49610    1.00000
             0.0174     0.0050     0.4113     0.7939     0.3052     0.0010     0.0   
      
                                      The SAS System                                5
                                                    11:38 Wednesday, October 18, 2000
      Model: MODEL1  
      Dependent Variable: Y                                                  
                                   Analysis of Variance
      
                                      Sum of         Mean
             Source          DF      Squares       Square      F Value       Prob>F
      
             Model            6  14754.63603   2459.10601       11.480       0.0001
             Error           34   7283.26641    214.21372
             C Total         40  22037.90244
      
                 Root MSE      14.63604     R-square       0.6695
                 Dep Mean      30.04878     Adj R-sq       0.6112
                 C.V.          48.70761
      
                                      The SAS System                                6
                                                    11:38 Wednesday, October 18, 2000
                                    Parameter Estimates
      
                            Parameter      Standard    T for H0:               
           Variable  DF      Estimate         Error   Parameter=0    Prob > |T|
      
           INTERCEP   1    111.728481   47.31810073         2.361        0.0241
           X1         1     -1.267941    0.62117952        -2.041        0.0491
           X2         1      0.064918    0.01574825         4.122        0.0002
           X3         1     -0.039277    0.01513274        -2.595        0.0138
           X4         1     -3.181366    1.81501910        -1.753        0.0887
           X5         1      0.512359    0.36275507         1.412        0.1669
           X6         1     -0.052050    0.16201386        -0.321        0.7500
      
      
                                      The SAS System                               14
                                                    11:38 Wednesday, October 18, 2000
      
                   Plot of RESID1*Y.  Legend: A = 1 obs, B = 2 obs, etc.
      
               |
         R  50 +                                                 A
         e     |
         s     |                                 A
         i  25 +
         d     |       A          A      AA
         u     |        AA      AA  A         A    A A
         a   0 +      AB      AAABA A         A                          A
         l     |       CAA C   A
               |        ABA      A
           -25 +              A
               ---+---------+---------+---------+---------+---------+---------+--
                  0        20        40        60        80        100       120
                                                Y
      
                                      The SAS System                               15
                                                    11:38 Wednesday, October 18, 2000
      
                    Stepwise Procedure for Dependent Variable Y       
      
      Step 1   Variable X2 Entered        R-square = 0.41572671   C(p) = 23.10893175
      
                      DF         Sum of Squares      Mean Square          F   Prob>F
      
      Regression       1          9161.74469120    9161.74469120      27.75   0.0001
      Error           39         12876.15774782     330.15789097
      Total           40         22037.90243902
      
                      Parameter        Standard          Type II
      Variable         Estimate           Error   Sum of Squares          F   Prob>F
      
      INTERCEP      17.61057438      3.69158676    7513.50474182      22.76   0.0001
      X2             0.02685872      0.00509867    9161.74469120      27.75   0.0001
      
      Bounds on condition number:            1,            1
      
      -------------------------------------------------------------------------------
      
      Step 2   Variable X3 Entered        R-square = 0.58632019   C(p) =  7.55859687
      
                      DF         Sum of Squares      Mean Square          F   Prob>F
      
      Regression       2         12921.26717485    6460.63358743      26.93   0.0001
      Error           38          9116.63526417     239.91145432
      Total           40         22037.90243902
      
                      Parameter        Standard          Type II
      Variable         Estimate           Error   Sum of Squares          F   Prob>F
      
      INTERCEP      26.32508332      3.84043919   11272.71964000      46.99   0.0001
      X2             0.08243410      0.01469656    7548.02378137      31.46   0.0001
      X3            -0.05660660      0.01429968    3759.52248365      15.67   0.0003
      
      
      Bounds on condition number:     11.43374,     45.73494
      -------------------------------------------------------------------------------
      
      Step 3   Variable X6 Entered        R-square = 0.61740155   C(p) =  6.36100514
      
                      DF         Sum of Squares      Mean Square          F   Prob>F
      
      Regression       3         13606.23518823    4535.41172941      19.90   0.0001
      Error           37          8431.66725079     227.88289867
      Total           40         22037.90243902
      
                      Parameter        Standard          Type II
      Variable         Estimate           Error   Sum of Squares          F   Prob>F
      
      INTERCEP       6.96584888     11.77690656      79.72552238       0.35   0.5578
      X2             0.07433399      0.01506613    5547.32153619      24.34   0.0001
      X3            -0.04939437      0.01454421    2628.36952166      11.53   0.0016
      X6             0.16435940      0.09480151     684.96801338       3.01   0.0913
      
      Bounds on condition number:     12.65025,     78.63322
      -------------------------------------------------------------------------------
      
      All variables left in the model are significant at the 0.1500 level.
      No other variable met the 0.1500 significance level for entry into the model.
      
               Summary of Stepwise Procedure for Dependent Variable Y       
      
             Variable        Number   Partial    Model
      Step   Entered Removed     In      R**2     R**2      C(p)          F   Prob>F
      
         1   X2                   1    0.4157   0.4157   23.1089    27.7496   0.0001
         2   X3                   2    0.1706   0.5863    7.5586    15.6705   0.0003
         3   X6                   3    0.0311   0.6174    6.3610     3.0058   0.0913
      
                                      The SAS System                               19
                                                    11:38 Wednesday, October 18, 2000
      
        OBS  ID          Y   X1    X2    X3    X4     X5    X6     PRED1    RESID1
      
          1  Phoenix    10  70.3   213   582   6.0   7.05   36    -0.032   10.0316
          2  Little_R   13  61.0    91   132   8.2  48.52  100    23.646  -10.6461
          3  San_Fran   12  56.7   453   716   8.7  20.66   67    16.285   -4.2849
          4  Denver     17  51.9   454   515   9.0  12.95   86    29.410  -12.4103
          5  Hartford   56  49.1   412   158   9.0  43.37  127    50.661    5.3392
          6  Wilmingt   36  54.0    80    80   9.0  40.25  114    27.698    8.3020
          7  Washingt   29  57.3   434   757   9.3  38.89  111    20.079    8.9208
          8  Jacksonv   14  68.4   136   529   8.8  54.47  116    10.011    3.9887
          9  Miami      10  75.5   207   335   9.0  59.80  128    26.844  -16.8439
         10  Atlanta    24  61.5   368   497   9.1  48.34  115    28.673   -4.6731
         11  Chicago   110  50.6  3344  3369  10.4  34.44  122   109.181    0.8191
         12  Indianap   28  52.3   361   746   9.7  38.74  121    16.840   11.1603
         13  Des_Moin   17  49.0   104   201  11.2  30.85  103    21.697   -4.6973
         14  Wichita     8  56.6   125   277  12.7  30.58   82    16.053   -8.0528
         15  Louisvil   30  55.6   291   593   8.3  43.11  123    19.522   10.4776
      
      
                                      The SAS System                               27
                                                    11:38 Wednesday, October 18, 2000
      
                   Plot of RESID1*Y.  Legend: A = 1 obs, B = 2 obs, etc.
      
            50 +                                                 A
         R     |
         e     |                                 A
         s     |                         AA
         i     |       A        ABA A         A      A
         d   0 +        BA A  ABA A A         A                          A
         u     |      AC C B     A                 A
         a     |       B  A   A  A
         l     |        A
               |
           -50 +
               ---+---------+---------+---------+---------+---------+---------+--
                  0        20        40        60        80        100       120
                                                Y
      
                                      The SAS System                               28
                                                    11:38 Wednesday, October 18, 2000
      
                          N = 41     Regression Models for Dependent Variable: Y     
                            
                        Number in     R-square   Variables in Model
                          Model                   
      
                              1     0.41572671   X2 
                              1     0.24381828   X3 
                              1     0.18800913   X1 
                              1     0.13657727   X6 
                              1     0.00896628   X4 
                              1     0.00294788   X5 
                         --------------------------
                              2     0.58632019   X2 X3 
                              2     0.51611499   X1 X2 
                              2     0.49813569   X2 X6 
      
                              2     0.42138706   X2 X5 
                              2     0.41938296   X2 X4 
                              2     0.40658556   X1 X3 
                              2     0.36568424   X3 X6 
                              2     0.24833602   X3 X5 
                              2     0.24581729   X1 X5 
                              2     0.24392958   X3 X4 
                              2     0.22911013   X1 X6 
                              2     0.19170531   X1 X4 
                              2     0.15866623   X5 X6 
                              2     0.13776827   X4 X6 
                              2     0.01204980   X4 X5 
                         -----------------------------
                              3     0.61740155   X2 X3 X6 
      
                              3     0.61254683   X1 X2 X3 
                              3     0.59304760   X2 X3 X5 
                              3     0.59298732   X2 X3 X4 
                              3     0.56222293   X1 X2 X5 
                              3     0.54523587   X1 X2 X6 
                              3     0.54521259   X1 X2 X4 
                              3     0.50833841   X2 X4 X6 
                              3     0.50466594   X2 X5 X6 
                              3     0.46486113   X1 X3 X5 
                              3     0.44457816   X1 X3 X6 
                              3     0.43203067   X1 X3 X4 
                              3     0.42499409   X2 X4 X5 
                              3     0.38078212   X3 X5 X6 
                              3     0.37015732   X3 X4 X6 
      
                              3     0.25498097   X1 X4 X5 
                              3     0.24843679   X3 X4 X5 
                              3     0.24617714   X1 X5 X6 
                              3     0.23321543   X1 X4 X6 
                              3     0.15899893   X4 X5 X6 
                         --------------------------------
                              4     0.63964257   X1 X2 X3 X5 
                              4     0.63287070   X1 X2 X3 X4 
                              4     0.62909408   X1 X2 X3 X6 
                              4     0.62847667   X2 X3 X4 X6 
                              4     0.61759495   X2 X3 X5 X6 
                              4     0.60282531   X1 X2 X4 X5 
                              4     0.59965327   X2 X3 X4 X5 
                              4     0.57466704   X1 X2 X4 X6 
      
                              4     0.56222294   X1 X2 X5 X6 
                              4     0.51644339   X2 X4 X5 X6 
                              4     0.50348509   X1 X3 X4 X5 
                              4     0.47084076   X1 X3 X4 X6 
                              4     0.46488361   X1 X3 X5 X6 
                              4     0.38714016   X3 X4 X5 X6 
                              4     0.25499437   X1 X4 X5 X6 
                         -----------------------------------
                              5     0.66850854   X1 X2 X3 X4 X5 
                              5     0.65012088   X1 X2 X3 X4 X6 
                              5     0.63964824   X1 X2 X3 X5 X6 
                              5     0.62901313   X2 X3 X4 X5 X6 
                              5     0.60403117   X1 X2 X4 X5 X6 
                              5     0.50433666   X1 X3 X4 X5 X6 
      
                         --------------------------------------
      
                              6     0.66951181   X1 X2 X3 X4 X5 X6 
                         -----------------------------------------
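
The summary statistics in the listing can be cross-checked by hand. What follows is a worked sketch using the standard definitions of adjusted R-square and Mallows' C(p). The symbols n (number of observations, 41), p (number of explanatory variables), SSE_p (error sum of squares of the p-variable model) and MSE_full (error mean square of the full model, 214.21372 in the Analysis of Variance table) are notation introduced here, not taken from the lecture.

```latex
\begin{align*}
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n-1}{n-p-1}
  = 1 - (1 - 0.6695)\,\frac{40}{34} \approx 0.6112
  && \text{(full model, } p = 6\text{)} \\
C(p) &= \frac{\mathrm{SSE}_p}{\mathrm{MSE}_{\text{full}}} - \bigl(n - 2(p+1)\bigr)
  = \frac{8431.66725}{214.21372} - (41 - 8) \approx 6.361
  && \text{(stepwise step 3, } p = 3\text{)}
\end{align*}
```

Both values agree with the listing (Adj R-sq = 0.6112 on page 5, C(p) = 6.36100514 at Step 3). In the same way, each Partial R**2 in the stepwise summary is the increase in Model R**2 over the previous step, for example 0.5863 - 0.4157 = 0.1706 at Step 2.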
      

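    3. How to read the results
      • Full model
        • All six explanatory variables are entered at once (R-square = 0.6695, Adj R-sq = 0.6112).
      • Stepwise selection (stepwise)
        • Variables are added and removed one step at a time.
        • A variable that has been entered can still be removed later, depending on which other variables are in the model.
      • All possible subsets (rsquare)
        • Displays the coefficient of determination (R^2) for every combination of explanatory variables.
        • Intended for exploring candidate models.
      • Other methods include forward selection (forward), backward elimination (backward), ...
      • Keep in mind that the "numerically optimal model" and the "optimal model according to subject-matter knowledge" can differ.
      • Residual analysis is necessary in every case.
      • ...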
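
The other selection methods named in the list can be requested in the same way as stepwise. The following is a minimal sketch, assuming the `air` data set from les1401.sas has already been created. The significance levels passed to `slentry=` and `slstay=` are illustrative choices, not values from the lecture, and the `adjrsq`, `cp` and `best=` options are added here only to make the all-subsets listing easier to compare across model sizes.

```sas
/* Sketch only: assumes data set AIR from les1401.sas exists. */

/* Forward selection: variables are only added, never removed. */
proc reg data=air;
  model y=x1-x6 / selection=forward slentry=0.15;   /* entry level: illustrative choice */
run;

/* Backward elimination: start from the full model and drop variables. */
proc reg data=air;
  model y=x1-x6 / selection=backward slstay=0.15;   /* stay level: illustrative choice */
run;

/* All possible subsets, with adjusted R-square and Mallows' C(p) printed,
   keeping only the 3 best models of each size. */
proc reg data=air;
  model y=x1-x6 / selection=rsquare adjrsq cp best=3;
run;
quit;
```

R-square never decreases when a variable is added, so adjusted R-square or C(p) is usually a more useful basis than R-square alone when comparing models of different sizes.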

    4. SAS syntax: shorthand notation
      • Specifying consecutively numbered variables: x1-x6
      • Specifying several plots at once: plot resid1*(x1-x6);
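
The numbered range list is not specific to `proc reg` and `proc plot`; it can be used wherever a variable list is accepted. A minimal sketch, again assuming the `air` data set from les1401.sas exists. The `proc means` step and the `with` statement in `proc corr` are not part of the lecture program and are used here only to show the shorthand in other procedures.

```sas
/* Sketch only: assumes data set AIR from les1401.sas exists. */

/* x1-x6 expands to x1 x2 x3 x4 x5 x6 */
proc means data=air;
  var y x1-x6;
run;

/* Correlations of y with each of x1 ... x6 only,
   instead of the full 7 x 7 matrix */
proc corr data=air;
  var x1-x6;
  with y;
run;
```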

  3. Exercise: run a regression analysis on your own data

  4. Comparison with Excel (extra)

  5. Next session: ... : October 26, 14:45