Python数据挖掘实战代码

Python数据挖掘实战

课本代码

# ==== 代码2-1.py ====

import numpy as np
a = np.array([1,2,3,4,5], dtype = np.int64)
print(a)

# ==== 代码2-2.py ====

import numpy as np
a = np.arange(5) #只给定stop参数值
print("数组对象a:\n", a)
b = np.arange(2, 5.0) #给定start和stop参数值,生成一个浮点型数组
print("数组对象b:\n", b)
c = np.arange(2, 6, 2, dtype = np.int32) #给定start、stop、step和dtype参数值
print("数组对象c:\n", c)

# ==== 代码2-3.py ====

import numpy as np
a = np.linspace(0, 3, 4, endpoint = True) #数组包含截止值3
print("数组对象a:\n", a)


# ==== 代码2-4.py ====

import numpy as np
a = np.zeros((2, 3), dtype = np.int32) #生成2×3形状的全0数组
print("数组对象a:\n", a)
a = np.array([1,2,3,4])
b = np.zeros_like(a) #生成与数组a形状相同,数据类型也相同的全0数组
print("数组对象b:\n", b)


# ==== 代码2-5.py ====

import numpy as np
a = np.random.rand(4) #生成有4个元素的一维随机数组
print("数组对象a:\n", a)
b = np.random.randn(2, 3) #生成形状为2×3,符合正态分布的随机数组
print("数组的对象b:\n", b)
c = np.random.randint(1, 3, size = (2, 3))
# 生成形状为2×3,符合均匀分布的随机整数数组,取值区间为[1,3)
print("数组对象c:\n", c)


# ==== 代码2-6.py ====

import numpy as np
a = np.ones((2, 3), dtype = np.float32) #生成2x3形状的float32型数组
print("数组对象a的类型:\n", a.dtype)
a = a.astype(np.int32) #将float32类型的数据转化为整型数组
print("数组对象a的类型:\n", a.dtype)


# ==== 代码2-7.py ====

import numpy as np
a = np.ones((3, 4), dtype = np.int32)
print("数组a的维数:", a.ndim)
print("数组a的形状:", a.shape)


# ==== 代码2-8.py ====

import numpy as np
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9,10,11]])
b = a[0][1]
print("数组a的第0行第1列元素为:\n", b)


# ==== 代码2-9.py ====

import numpy as np
a = np.arange(24).reshape(2, 3, 4) #生成形状为(2,3,4)的数组
print("数组对象a:\n", a)
b = a[0:1:1, 0:2:1, ...] #第1次切片:在第0和1个维度上进行切片
print("\n第1次切片的结果:\n", b)
c = a[:1, :2] #第2次切片:更精简的切片方式
print("\n第2次切片的结果:\n", c)

# ==== 代码2-10.py ====

import numpy as np
a = np.random.randint(-5, 6, size = (3, 4))
print("数组对象a的原始值:\n", a)
index = (a <= 0) #单条件索引
print("单条件索引的布尔数组:\n", index)
a[index] = 0 # 将布尔索引取值为True的对应位置上的数据赋值为0
print("数组对象a的新值:\n", a)


# ==== 代码2-11.py ====

import numpy as np
a = np.random.randint(-5, 6, size = (3, 4))
print("排序前的数组对象a:\n", a)
b = np.sort(a, axis = 1) #对数组对象a按行排序
print("对数组对象a行排序后的结果:\n", b)


# ==== 代码2-12.py ====

import numpy as np
#1. Numpy数组与数值的算术运算的例子
a = np.array([1, 2, 3, 4, 5], dtype = np.int32)
b1 = a+2 #算术加
b2 = a*2 #算术乘
b3 = a**2 #算术乘方
#2. Numpy数组与数组的算术运算的例子
a = np.arange(24).reshape(2, 3, 4) #生成形状为(2,3,4)的3维数组
weight = np.random.random(size = (3, 4)) #生成2维的权重数组
b4 = a*weight #利用广播特性实现数组和数组相乘
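
The last multiplication relies on NumPy broadcasting: the (2, 3, 4) array and the (3, 4) weight array are compatible because their trailing dimensions match. A short sketch (an illustration, not textbook code) that makes the rule visible:

# supplementary sketch: broadcasting shape rule (not part of 代码2-12.py)
import numpy as np
a = np.arange(24).reshape(2, 3, 4)    # shape (2, 3, 4)
w = np.random.random(size = (3, 4))   # shape (3, 4) is treated as (1, 3, 4) and stretched to (2, 3, 4)
print((a * w).shape)                  # (2, 3, 4)
# mismatched trailing dimensions, e.g. a * np.ones((2, 3)), raise a ValueError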


# ==== 代码2-13.py ====

import numpy as np
import pandas as pd
#使用python列表创建Series对象,并指定索引
s1 = pd.Series([0, 1, 2, np.nan], index = ['a', 'b', 'c', 'd'])
print("使用列表创建的Series对象s1:\n", s1)
dic = {'张三': 97, '李四': 68, '王五': 88}
s2 = pd.Series(dic) #使用python字典创建Series对象
print("使用字典创建的Series对象s2:\n", s2)
arr = np.arange(4) #使用Numpy数组创建Series对象
s3 = pd.Series(arr)
print("使用Numpy数组创建的Series对象s3:\n", s3)


# ==== 代码2-14.py ====

import numpy as np
import pandas as pd
#使用二维列表创建
df1 = pd.DataFrame([['a', 1, 2], ['b', 3, 4], ['c', 7, 8]], columns = ['x', 'y', 'z'])
print("使用二维列表创建DataFrame对象:\n", df1)
#使用Numpy二维数组创建
df2 = pd.DataFrame(np.zeros((3, 3)), columns = ['x', 'y', 'z'])
print("使用Numpy二维数组创建DataFrame对象:\n", df2)
#使用字典创建
dic = { '语文': [98, 88, 78],
'数学': [89, 72, 93],
'英语': [84, 85, 77]}
df3 = pd.DataFrame(dic, index = ['张三', '李四', '王五'])
print("使用字典创建DataFrame对象:\n", df3)


# ==== 代码2-15.py ====

import pandas as pd
dic = {'语文': [98, 88, 78],
'数学': [89, 72, 93],
'英语': [84, 85, 77]}
df = pd.DataFrame(dic, index = ['张三', '李四', '王五'])
df1 = df['语文']
print("获取DataFrame对象的一列:\n", df1)
df2 = df[['语文', '英语']]
print("获取DataFrame对象的多列:\n", df2)
df3 = df.iloc[1]
print("使用iloc函数获得DataFrame对象的一行:\n", df3)
df4 = df.iloc[1:, 1:]
print("使用iloc函数获得DataFrame对象的多行多列(切片):\n", df4)
df5 = df.loc['王五', '英语']
print("使用loc函数获得DataFrame对象中的指定行列索引的一个数据:\n", df5)
df6 = df[df['语文'] > 85]
print("使用条件索引获得满足条件的行:\n", df6)


# ==== 代码2-16.py ====

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(6).reshape(2, 3), index = ['x', 'y'], columns = ['a', 'b', 'c'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), index = ['x', 'y', 'z'], columns = ['a', 'b', 'c2'])
df3 = df1 + df2
print('使用运算符相加的结果:\n', df3)
df4 = df1.add(df2)
print('使用add()函数相加的结果:\n', df4)
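
Because the row and column labels of df1 and df2 do not fully align, both results above contain NaN. If treating a value that is missing on one side as 0 is acceptable (an assumption, not stated in the textbook), add() can fill it:

# supplementary sketch: add() with fill_value (not part of 代码2-16.py)
df5 = df1.add(df2, fill_value = 0)    # positions missing in only one frame are filled with 0
print('使用fill_value=0相加的结果:\n', df5)  # positions missing in both frames remain NaN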


# ==== 代码2-17.py ====

import pandas as pd
df = pd.DataFrame([[98.2,79.3,28.7], [78.3,87.3,54.7], [77.7,65.9,34.2]],
index = ['2022-3-1', '2022-3-2', '2022-3-3'],
columns = ['商店A', '商店B', '商店C'])
print('三家商店三天的营业额数据为:\n', df)
s1 = df.sum()
print("每家商店在三天的总营业额:\n", s1)
s2 = df.mean(axis = 0)
print("每家商店每天的平均营业额:\n", s2)
s3 = df.sum(axis = 1)
print("每天三家商店的营业额之和:\n", s3)
s4 = df.idxmax(axis = 0)
print("每家商店销售额最高的日期是:\n", s4)
s5 = df.cumsum(axis = 0)
print("每家商店的销售额累计和:\n", s5)
s6 = df.describe()
print("销售数据的一般描述性统计情况(按商店):\n", s6)


# ==== 代码2-18.py ====

import numpy as np
import pandas as pd
s0 = pd.Series([98, 79, 67], index = ['语文', '数学', '英语'])
s1 = s0.reindex(index = ['数学', '语文', '英语', '计算机'], fill_value=60.0)
print("行索引重排后的Series对象:\n", s1)
s2 = pd.Series(['a', 'b', 'c'], index = [0, 2, 3])
s3 = s2.reindex(np.arange(5), method = 'ffill')
print("行索引重排后的Series对象:\n", s3)


# ==== 代码2-19.py ====

import numpy as np
import pandas as pd
dic = {'语文': [98, 88, 78],
'数学': [89, 72, 93],
'英语': [84, 85, 77]}
df = pd.DataFrame(dic, index = ['张三', '李四', '王五'])
print("DataFrame的原始数据对象:\n", df)
df1 = df.reindex(index = ['李四', '张三', '王五', '陈六'],
columns = ['数学', '语文', '英语', '计算机'])
print("对df对象进行行列索引重排后的结果:\n", df1)





# ==== 代码2-20.py ====

import numpy as np
import pandas as pd
dic = {'语文': [98, 88, 78],
'数学': [89, 72, 93],
'英语': [84, 85, 77]}
df = pd.DataFrame(dic, index = ['张三', '李四', '王五'])
print("DataFrame的原始对象:\n", df)
df1 = df.drop(labels = ['李四', '王五'], axis = 0)
print("删除指定行后的DataFrame对象:\n", df1)


# ==== 代码2-21.py ====

import numpy as np
import pandas as pd
dic = {'语文': [98, 88, 78],
'数学': [89, 72, 93],
'英语': [84, 85, 77]}
df = pd.DataFrame(dic, index = ['张三', '李四', '王五'])
print("DataFrame的原始对象df:\n", df)
df1 = df.sort_index(axis = 1, ascending = False)
print("使用sort_index函数对df对象沿水平轴降序排序的结果:\n", df1)
df2 = df.sort_values(by = ['语文', '英语'], ascending = True)
print("使用sort_values函数对df对象多列升序排序的结果:\n", df2)


# ==== 代码2-22.py ====

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-2*np.pi, 2*np.pi, 0.01)
y1, y2 = np.sin(x), np.cos(x)
plt.figure(figsize = (6, 4))
plt.plot(x, y1)
plt.plot(x, y2)
plt.xlim(-3, 3) #设置X轴和Y轴的显示范围
plt.ylim(-2, 2)
plt.xlabel("x") #设置X轴和Y轴的显示标签
plt.ylabel(u"函数值", fontproperties = 'SimHei')
#设置Y轴的刻度及显示的刻度值
plt.yticks([-1, 0.5, 1, 2], [u'最小值', u'中间值', u'最大值', '2'], fontproperties = 'SimHei')
#设置图例
plt.legend(prop = {'family': 'SimHei', 'size':16},
loc = 'lower right', labels = ['正弦', '余弦'])
#设置文本注释
plt.annotate('sin(x)', xy = (0.5, np.sin(0.5)), xytext = (0, 1.5), #注释文本按位置参数传入,兼容新旧版matplotlib
weight = 'bold', color = 'black',
arrowprops = dict(arrowstyle = '-|>', connectionstyle = 'arc3', color = 'red'),
bbox = dict(boxstyle = 'round, pad = 0.5'))
plt.text(-1, np.cos(-1), 'cos(x)', family = 'fantasy', fontsize = 14, style = 'italic', color = 'k')
plt.show()


# ==== 代码2-23.py ====

import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-10, 11, 1) #获得变量x和y的值
y = x**2
#绘制折线图
plt.figure(figsize = (6, 4), dpi = 200)
plt.plot(x, y, color = 'r', linewidth = 1.5, linestyle = '--', marker = 'o', markersize = 6)
plt.xlim(-11, 11)
plt.ylim(-3, 103)
plt.xlabel("x") #设置X轴和Y轴标签
plt.ylabel("y")
y_ticks = np.arange(0, 101, 10) #设置Y轴刻度
plt.yticks(y_ticks)
plt.title(u'折线图示例', fontproperties = 'SimHei') #设置Title
plt.grid(True, which = 'major', linestyle = '--', linewidth = 1)
plt.show()


# ==== 代码2-24.py ====

import matplotlib.pyplot as plt
import numpy as np
n = 400 #数据集的规模
point = np.random.randn(n, 2)
#绘制散点图
plt.figure(figsize = (6, 4), dpi = 200)
plt.scatter(point[:, 0], point[:, 1], s = 60, marker = 'o', alpha = 0.6)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("x") #设置X轴和Y轴标签
plt.ylabel("y")
plt.title(u'散点图示例', fontproperties = 'SimHei')
plt.grid(True, which = 'major', linestyle = '--', linewidth = 1)
plt.show()


# ==== 代码2-25.py ====

import matplotlib
import matplotlib.pyplot as plt
#获得柱状图数据和标签
names = ['张三', '李四', '王五', '陈六']
scores = [98, 67, 77, 56]
#绘制柱状图
plt.figure(figsize = (6, 4), dpi = 200)
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
plt.bar(x = names, height = scores, width = 0.5, color = 'blue',
edgecolor = 'black', label = '成绩')
for xx, yy in zip(names, scores): #绘制文本注释
    plt.text(xx, yy+1, str(yy))
plt.xlabel("姓名")
plt.ylabel("分数")
plt.title(u'柱状图示例', fontproperties = 'SimHei')
plt.grid(True, which = 'major', linestyle = '--', linewidth = 1)
plt.show()


# ==== 代码2-26.py ====

from sklearn import datasets
from sklearn import preprocessing
from sklearn import tree
#步骤1: 加载数据集
iris = datasets.load_iris()
n_samples, n_features = iris.data.shape
X = iris.data
Y = iris.target
print('步骤1:加载iris数据集')
print('iris数据集中有%d个样本,%d个特征。' % (n_samples, n_features))
print('iris的前5个样本为:\n', X[0:5])
#步骤2: 数据预处理
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)
print('步骤2:数据预处理')
print('规范化后iris的前5个样本:\n', X_scale[0:5])
#步骤3: 使用决策树算法构建分类器模型
classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(X, Y) #在训练集上训练
Y_predict = classifier.predict(X) #使用训练好的模型进行预测
print('步骤3:决策树模型构建…')
#步骤4:模型的评估
accuracy = (Y == Y_predict).sum() / Y.shape[0]
print('步骤4:模型评估')
print("决策树在训练集上的分类准确度为: %.3f" % (accuracy*100))





# ==== 代码3-1.py ====

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
iris = load_iris()
features = iris.data.T
plt.figure(figsize = (8,6), dpi=200)
plt.scatter(features[2], features[3]) #绘制散点图
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.show()


# ==== 代码3-2.py ====

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
iris = load_iris()
features = iris.data.T
figure, axes = plt.subplots(figsize = (8,6), dpi = 200) #得到画板、轴(直接在subplots中设置尺寸,避免多建一个空白figure)
axes.boxplot(features[1], patch_artist = True) #描点上色
plt.ylabel(iris.feature_names[1])
plt.show() #图形展示


# ==== 代码3-3.py ====

from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white') #使用matplotlib内置的seaborn-white样式设置白色背景(matplotlib 3.6+中样式名为'seaborn-v0_8-white')
iris = load_iris()
features = iris.data.T
data=np.rint(features[2]) # 四舍五入取整np.rint
# 也可以用其他取整方法
# 截取整数部分 np.trunc
# 向上取整 np.ceil
# 向下取整np.floor
plt.hist(data, bins = 14, density = True, color = 'steelblue');


# ==== 代码3-4.py ====

from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
iris = load_iris()
species = iris.target
cate_list = iris.target_names
labels, counts = np.unique(species, return_counts = True)
num_list = list(counts)
plt.bar(range(len(num_list)), num_list)
plt.xlabel("species") # 指定X轴描述信息
plt.ylabel("numbers") # 指定Y轴描述信息
plt.ylim(0,60) # 指定Y轴的高度
idx = np.arange(len(cate_list))
plt.xticks(idx,cate_list)
plt.show()


# ==== 代码3-5.py ====

from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
iris = load_iris()
species = iris.target
cate_list = iris.target_names
labels, counts = np.unique(species, return_counts = True)
explode = [0, 0.1, 0] # 用于突出显示一个品种
colors = ['#7FFFD4', '#458B74', '#FFE4C4'] #自定义颜色
plt.axes(aspect='equal') # 将X,Y坐标轴标准化处理,设置饼图是正圆
plt.xlim(0, 3.8) # 控制X轴和Y轴的范围
plt.ylim(0, 3.8)
plt.pie(x = counts, # 绘图数据
explode = explode, # 用于突出显示一个品种
labels = cate_list, # 添加鸢尾花品种标签
colors = colors, # 设置饼图的自定义填充色
autopct = '%0.1f%%' ) # 设置显示扇形所占的比例
plt.show() # 显示图形


# ==== 代码3-6.py ====

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0)
iris_dataframe = pd.DataFrame(X_train, columns = iris.feature_names)
grr = pd.plotting.scatter_matrix(iris_dataframe,
c = y_train, # 设置不同品种鸢尾花的颜色
alpha = .8,
figsize = (15,15),
marker = 'o',
hist_kwds = {'bins':20}) # 频率直方图上的箱体数量
plt.show()


# ==== 代码3-7.py ====

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
features = pd.DataFrame(iris.data, columns = iris.feature_names)
print('协方差的结果为:')
print(np.cov(features["petal length (cm)"], features["petal width (cm)"]))
print('pearson相关系数的结果为:')
print(features.iloc[:, [2, 3]].corr(method = "pearson"))
print('spearman相关系数的结果为:')
print(features.iloc[:, [2, 3]].corr(method = "spearman"))
print('kendall相关系数的结果为:')
print(features.iloc[:, [2, 3]].corr(method = "kendall"))


# ==== 代码4-1.py ====

import pandas as pd
import numpy as np
left = pd.DataFrame({"A":[0,0,1,2], "B":[0,1,0,1], "C":[0,0,1,1]})
right = pd.DataFrame({"A":[0,1,0,2], "B":[0,0,1,0], "D":[1,1,0,0]})
print("左数据框对象: \n", left)
print("右数据框对象: \n", right)
result1 = pd.merge(left, right, how = "left", on = ["A", "B"]) # 左连接
print("(1)左连接数据结果: \n", result1)
result2 = pd.merge(left, right, how = "right", on = ["A", "B"]) # 右连接
print("(2)右连接数据结果: \n", result2)
result3 = pd.merge(left, right, how = "inner", on = ["A", "B"]) # 内连接
print("(3)内连接数据结果: \n", result3)
result4 = pd.merge(left, right, how = "outer", on = ["A", "B"]) # 外连接
print("(4)外连接数据结果: \n", result4)

# ==== 代码4-2.py ====

import pandas as pd
scores = {'姓名': ['张三', '李四', '王五', '张三'],
'语文': [ 84, 92, 87, 84],
'数学': [ 89, 90, 95, 89],
'英语': [ 90, 81, 75, 92],
'计算机': [ 85, 92, 90, 85]}
df = pd.DataFrame(scores)
print("检验重复的记录:\n", df.duplicated(subset = ['姓名']))
df_drop = df.drop_duplicates(subset = ['姓名'], keep = 'first')
print("去重的数据为:\n", df_drop )


# ==== 代码4-3.py ====

import pandas as pd
scores = {'姓名': ['张三', '李四', '王五', '张三'],
'语文': [ 84, 92, 87, 84],
'数学': [ 89, 90, 95, 89],
'英语': [ 90, 81, 75, 92],
'计算机': [ 85, 92, 90, 85],
'计算机基础': [ 85, 92, 90, 85]}
df = pd.DataFrame(scores)
print(df.drop(columns = ['姓名']).corr(method = 'pearson')) #corr只能在数值列上计算,先去掉姓名列


# ==== 代码4-4.py ====

import pandas as pd
import numpy as np
scores = {'姓名': ['张三', '李四', '王五', '刘一'],
'语文': [ 84, 92, 87, 84],
'数学': [ 89, np.nan, 95, 89],
'英语': [ 90, 81, np.nan, 92],
'计算机': [ 85, 92, 90, 85]}
df = pd.DataFrame(scores)
print('成绩数据对象的特征缺失值情况:')
print(df.isnull().sum()) #判断每列是否有缺失值


# ==== 代码4-5.py ====

import pandas as pd
scores = {'姓名': ['张三', '李四', '王五', '刘一'],
'语文': [ 84, 92, 87, 84],
'数学': [ 89, pd.NA, 95, 89],
'英语': [ 90, 81, pd.NA, 92],
'计算机': [ 85, 92, 90, 85]}
df = pd.DataFrame(scores)
df.dropna(axis = 0, how = 'any', inplace = True) # 删除所有包含缺失值的行
print('删除包含缺失值记录后的数据为:\n', df)


# ==== 代码4-6.py ====

import pandas as pd
import numpy as np
#生成包含缺失值的数据
scores = {'姓名': ['张三', '李四', '王五', '刘一'],
'语文': [ 84, 92, 87, 84],
'数学': [ 89, pd.NA, 95, 89],
'英语': [ 90, 81, pd.NA, 92],
'计算机': [ 85, 92, 90, 85]}
df = pd.DataFrame(scores)
# 1.均值替换
df_mean = df['数学'].fillna(value = df['数学'].mean(), inplace = False)
print('使用均值替换: \n', df_mean)
# 2.中位数替换
df_median = df['数学'].fillna(df['数学'].median(), inplace = False)
print('使用中位数替换: \n', df_median)
# 3.使用固定值0替换
df_zero = df['数学'].fillna(value = 0, inplace = False)
print('使用0替换: \n',df_zero)
# 4.使用缺失值前一个值进行填充(按照相应index前后填充)
df_ffill = df['数学'].fillna(method = 'ffill', inplace = False, axis = 0)
print('使用缺失值前一个值替换: \n', df_ffill)
# 5.使用缺失值后一个值进行填充(按照相应index前后填充)
df_bfill = df['数学'].fillna(method = 'bfill', inplace = False, axis = 0)
print('使用缺失值后一个值替换: \n', df_bfill)
# 6.使用线性插值法进行填充
df['数学'] = pd.to_numeric(df['数学'], errors = 'coerce')
df_linear = df['数学'].interpolate(method = 'linear', inplace = False)
print('使用线性插值法进行填充: \n', df_linear)
# 7.使用多项式插值法进行填充
df_poly = df['数学'].interpolate(method = 'polynomial', order = 2, inplace = False)
print('使用多项式插值法进行填充: \n', df_poly)
# 8.使用样条插值法进行填充
df_spline = df['数学'].interpolate(method = 'spline', order = 2, inplace = False)
print('使用样条插值法进行填充: \n', df_spline)


# ==== 代码4-7.py ====

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#生成原始数据
scores = {'姓名': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'], '英语': [90, 81, 110, 92, 83, 85]}
df = pd.DataFrame(scores)
# 绘制箱线图
plt.figure(figsize = (8, 6), dpi = 200)
axes = plt.boxplot(df['英语'], notch = True, patch_artist = True) #箱线图
outlier = axes['fliers'][0].get_ydata() #获取异常值
plt.show() #图形展示
print('异常值为:\n', outlier)


# ==== 代码4-8.py ====

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
iris = load_iris().data
#使用离差标准化对数据进行预处理
m_scaler = MinMaxScaler() #创建一个min-max规范化对象
iris_scale = m_scaler.fit_transform(iris)
iris_scale = pd.DataFrame(data = iris_scale,
columns = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]) #iris特征顺序为花萼长、花萼宽、花瓣长、花瓣宽
print("规范化后的前5条iris数据:\n", iris_scale[0:5] )


# ==== 代码4-9.py ====

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd
iris = load_iris().data
#使用标准差规范化对数据进行处理
iris_scale = StandardScaler() #创建一个标准差规范化对象
iris_scale = iris_scale.fit_transform(iris)
iris_scale = pd.DataFrame(data = iris_scale,
columns = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]) #iris特征顺序为花萼长、花萼宽、花瓣长、花瓣宽
print("规范化后的前5条iris数据:\n", iris_scale[0:5] )

# ==== 代码4-10.py ====

import numpy as np
x = np.array([[ 0., -3., 1.], # 初始化数据
[ 3., 1., 2.],
[ 0., 1., -1.]])
j = np.ceil(np.log10(np.max(abs(x)))) # 获取小数点移动最大位数
sc_C = x/(10**j)
print('标准化后的数据为: \n', sc_C)


# ==== 代码4-11.py ====

from sklearn.preprocessing import Binarizer
import numpy as np
price= np.array([1000, 2530, 3500, 6000, 200, 8200])
b = Binarizer(threshold = 3000) #创建二值化对象,阈值为3000
b_price = b.fit_transform(price.reshape(1,-1))
print("二值化后的价格:\n", b_price)


# ==== 代码4-12.py ====

import pandas as pd
#生成销量数据
sale_df = pd.DataFrame({'sale': [400, 50, 100, 450, 500, 320, 160, 280,
320, 380, 200, 460]})
# 等宽离散化
sale_df['sale_fixedwid'] = pd.cut(sale_df["sale"], bins = 3)
# 等频离散化
sale_df['sale_fixedfreq'] = pd.qcut(sale_df["sale"], q = 4)
print(sale_df)
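
pd.cut can also take explicit interval boundaries and labels when the cut points are known in advance. A small sketch (the boundaries and labels below are assumed for illustration, not from the textbook):

# supplementary sketch: cut with custom bins (not part of 代码4-12.py)
sale_df['sale_custom'] = pd.cut(sale_df["sale"], bins = [0, 200, 400, 600],
                                labels = ['低', '中', '高'])
print(sale_df[['sale', 'sale_custom']])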


# ==== 代码4-13.py ====

import pandas as pd
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
#原始数据
weather_df = pd.DataFrame({'天气': ['晴天', '雨天', '阴天', '晴天'], '销量': [400, 50, 100, 450]})
#独热编码
oneHot_weather = OneHotEncoder().fit_transform(weather_df[["天气"]])
print('独热编码的结果为:')
print(oneHot_weather)
#哑变量编码
dummy_weather = pd.get_dummies(weather_df[["天气"]], drop_first = False)
print('哑变量的结果为:')
print(dummy_weather)
#标签编码
label_weather = LabelEncoder().fit_transform(weather_df["天气"]) #LabelEncoder要求一维输入
print('标签编码的结果为:')
print(label_weather)
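
OneHotEncoder returns a sparse matrix by default, which is why the printed result is a list of coordinates. A readability sketch (an addition; get_feature_names_out assumes scikit-learn >= 1.0):

# supplementary sketch: view the one-hot result as a dense DataFrame (not part of 代码4-13.py)
enc = OneHotEncoder()
dense = enc.fit_transform(weather_df[["天气"]]).toarray()
print(pd.DataFrame(dense, columns = enc.get_feature_names_out()))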


# ==== 代码4-14.py ====

import pandas as pd
sale_df = pd.DataFrame(
{'weather': ['晴天', '雨天', '阴天', '晴天', '晴天', '晴天', '晴天', '晴天', '晴天', '晴天', '晴天', '阴天', '雨天', '阴天', '晴天','阴天', '雨天', '阴天', '晴天', '晴天', '晴天', '晴天', '阴天', '晴天', '晴天', '晴天', '阴天', '晴天', '晴天', '晴天'],
'sale':[400, 50, 100, 450, 620, 325, 170, 280, 710, 330, 500, 320, 160, 280, 175, 240, 605, 270, 250, 510, 320, 380, 200, 460, 380, 420, 560, 80, 240, 630]}
)
#1.简单随机抽样
random_sample = sale_df.sample(10, random_state = 124)
print('简单随机抽样方法的结果:\n', random_sample)
#2.分层抽样
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1, train_size = 10, random_state = 124)
for train_index, test_index in split.split(sale_df, sale_df['weather']):
    strat_sample_set = sale_df.loc[train_index]
    strat_test_set = sale_df.loc[test_index]
print('分层抽样方法的结果:\n', strat_sample_set)


# ==== 代码4-15.py ====

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X = iris.data
y = iris.target
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
pca = PCA(n_components = 2)
X_pca = pca.fit_transform(X_scaled)
lda = LinearDiscriminantAnalysis(n_components = 2, solver = 'svd')
X_lda = lda.fit_transform(X, y)
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (13.5 ,4))
sns.scatterplot(x = X_pca[:, 0], y = X_pca[:, 1], hue = y, palette = 'Set1', ax = ax[0])
sns.scatterplot(x = X_lda[:, 0], y = X_lda[:, 1], hue = y, palette = 'Set1', ax = ax[1])
ax[0].set_title("PCA of IRIS dataset", fontsize = 15, pad = 15)
ax[1].set_title("LDA of IRIS dataset", fontsize = 15, pad = 15)
ax[0].set_xlabel("PC1", fontsize = 12)
ax[0].set_ylabel("PC2", fontsize = 12)
ax[1].set_xlabel("LD1", fontsize = 12)
ax[1].set_ylabel("LD2", fontsize = 12)
plt.savefig('PCA vs LDA.png', dpi = 80)


# ==== 代码5-1.py ====

import numpy as np
from sklearn.feature_selection import VarianceThreshold
#模拟数据集
X = np.array([[1,2,3,4], [1,6,7,9], [1,4,4,2], [1,4,6,1], [0,0,5,2], [1,7,4,7]])
selector = VarianceThreshold(1.0) #阈值设置为1
selector.fit(X) #训练
transformed_X = selector.transform(X) # 特征选择
print("特征的方差:", selector.variances_)
print("特征选择后的数据集", transformed_X)


# ==== 代码5-2.py ====

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
# 模拟数据集
X = np.array([[1,2,3,4], [1,4,7,9], [1,4,4,2], [1,4,6,1], [0,0,5,2], [1,7,2,7]])
Y = np.array([1, 0, 1, 1, 1, 0])
selector = SelectKBest(chi2, k = 2)
selector.fit(X, Y) # 训练
transformed_X = selector.transform(X) # 特征选择
print("特征的卡方统计量值:", selector.scores_)
print("特征选择后的数据集:", transformed_X)


# ==== 代码5-3.py ====

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest
X = np.array([[1,2,3,4], [1,6,7,9], [1,4,4,2], [1,4,6,1], [0,0,5,2], [1,7,4,7]])
Y = np.array([1, 0, 1, 1, 1, 0])
selector = SelectKBest(mutual_info_classif, k = 2)
selector.fit(X, Y) # 训练
transformed_X = selector.transform(X) # 特征选择
print("特征和目标变量的互信息值:", selector.scores_)
print("特征选择后的数据集:", transformed_X)


# ==== 代码5-4.py ====

import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest
X = np.array([[1,2,3,4], [1,6,7,9], [1,4,4,2], [1,4,6,1], [0,0,5,2], [1,7,4,7]])
Y = np.array([1, 0, 1, 1, 1, 0])
selector = SelectKBest(f_classif, k = 2)
selector.fit(X, Y) # 训练
transformed_X = selector.transform(X) # 特征选择
print("特征F-统计量值:", selector.scores_)
print("特征选择后的数据集:", transformed_X)


# ==== 代码5-5.py ====

import numpy as np
import tarfile
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest
import pandas as pd
with tarfile.open(mode="r:gz", name='cal_housing.tgz') as f:
cal_housing= np.loadtxt(f.extractfile('CaliforniaHousing/cal_housing.data'),delimiter=',')
cols=['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome']
X=cal_housing[:,0:8]
Y=cal_housing[: ,8]
#封装的皮尔森相关系数计算函数
def ud_pearsonr(X, y):
    result = np.array([pearsonr(x, y) for x in X.T]) #返回皮尔森相关系数, p值
    return np.absolute(result[:, 0]), result[:, 1]
selector = SelectKBest(ud_pearsonr, k = 4)
selector.fit(X,Y) # 训练
transformed_X = selector.transform(X)
print("特征的皮尔森相关系数值:\n", pd.Series(selector.scores_, index=cols))
print("选择的特征为:\n", np.array(cols)[selector.get_support(indices=True)])
print("特征选择后的数据集:\n", transformed_X)


# ==== 代码5-6.py ====

import pandas as pd
import numpy as np
from mrmr import mrmr_classif
X = np.array([[1,2,3,4], [1,6,7,9], [1,4,4,2], [1,4,6,1], [0,0,5,2], [1,7,4,7]])
Y = np.array([1, 0, 1, 1, 1, 0])
X = pd.DataFrame(X, columns=['0', '1', '2', '3'])
F = mrmr_classif(X = X, y = Y, K = 2) #特征选择
print("选择的特征索引为:", F)


# ==== 代码5-7.py ====

from sklearn.datasets import load_wine # 导入红酒数据集
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold, chi2, SelectKBest
from sklearn.feature_selection import mutual_info_classif,f_classif
from mrmr import mrmr_classif
from sklearn.tree import DecisionTreeClassifier as DTC
import pandas as pd
import numpy as np
# 1. 获得数据
wine = load_wine()
X, Y = wine.data, wine.target
num_class = 5 #待选取的特征子集的大小
# 2. 特征选择过程
vt_sel = VarianceThreshold(1.0) #方差阈值法(阈值为1)
vt_sel.fit(X)
vt_trans_X = vt_sel.transform(X)
print("方差阈值法选择的特征:", vt_sel.get_support(True))
chi_sel = SelectKBest(chi2, k=num_class) # 卡方统计量法
chi_sel.fit(X, Y)
chi_trans_X = chi_sel.transform(X)
print("卡方统计量方法选择的特征:", chi_sel.get_support(True))
mi_sel = SelectKBest(mutual_info_classif, k = num_class) # 互信息法
mi_sel.fit(X, Y)
mi_trans_X = mi_sel.transform(X)
print("互信息法选择的特征:", mi_sel.get_support(True))
F_sel = SelectKBest(f_classif, k = num_class) # F统计量法
F_sel.fit(X, Y)
F_trans_X = F_sel.transform(X)
print("F统计量法选择的特征:", F_sel.get_support(True))
dfX = pd.DataFrame(X, columns = [i for i in range(len(wine.feature_names))])
F = mrmr_classif(dfX, Y, num_class) #mRMR方法
mrmr_trans_X = X[:, F]
print("mRMR方法选择的特征:", np.sort(F).tolist( ))
# 3. 函数:调用统一的决策树分类模型
def ClassifyingModel(X, Y):
    # 分割数据集
    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 9)
    tree = DTC(criterion = "entropy", max_depth = 3, random_state = 9) # 决策树模型
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test, sample_weight = None) # 计算测试精度
    return score
# 4. 不同特征选择结果能达到的测试精度(accuracy)
print("决策树模型在不同的特征选择方法选取的子集上取得的测试精度:")
print("方差阈值法:", ClassifyingModel(vt_trans_X, Y))
print("卡方统计量法:", ClassifyingModel(chi_trans_X, Y))
print("互信息法:", ClassifyingModel(mi_trans_X, Y))
print("F统计量法:", ClassifyingModel(F_trans_X, Y))
print("mRMR法:", ClassifyingModel(mrmr_trans_X, Y))


# ==== 代码5-8.py ====

from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE, RFECV
# 1. 获得数据
wine = load_wine()
X,Y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 9)
#2. RFE特征选择结果
tree = DTC(criterion = "entropy", max_depth = 3, random_state = 9) #决策树模型
RFE_selector = RFE(estimator = tree, n_features_to_select = 5, step = 1)
RFE_selector.fit(X_train, y_train) #训练
print("RFE选择的特征", RFE_selector.get_support(True))
print("RFE方法选取特征所获得的测试精度", RFE_selector.score(X_test, y_test))
#3. RFECV特征选择结果
RFECV_selector = RFECV(estimator = tree, cv = 5, step = 1)
RFECV_selector.fit(X_train, y_train)
print("RFECV选择的特征", RFECV_selector.get_support(True))
print("RFECV方法选取特征所获得的测试精度", RFECV_selector.score(X_test, y_test))


# ==== 代码5-9.py ====

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.model_selection import train_test_split
#辅助函数:特征子集的性能评价函数
def evaluate_select_subset(X_train, y_train, X_test, y_test, feature_index):
    Xtrain = X_train[:, feature_index]
    Xtest = X_test[:, feature_index]
    tree = DTC(criterion = "entropy", max_depth = 3, random_state = 9)
    tree.fit(Xtrain, y_train)
    return tree.score(Xtest, y_test)
# 1. 获得数据
wine = load_wine()
X, Y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 9)
#2. SFS特征选择结果
tree =DTC(criterion = "entropy", max_depth = 3, random_state = 9) #决策树模型
SFS_selector = SequentialFeatureSelector(estimator = tree,
n_features_to_select = 5, direction = 'forward')
SFS_selector.fit(X_train, y_train) # 训练
sd_feat = SFS_selector.get_support(True)
print("SFS选择的特征", SFS_selector.get_support(True))
SFS_score = evaluate_select_subset(X_train, y_train, X_test, y_test, sd_feat)
print("SFS选择的特征子集上获得的测试精度:", SFS_score)
#3. SBS特征选择结果
tree = DTC(criterion = "entropy", max_depth = 3, random_state = 9) #决策树模型
SBS_selector = SequentialFeatureSelector(estimator = tree,
n_features_to_select = 5, direction = 'backward')
SBS_selector.fit(X_train, y_train) # 训练
sd_feat = SBS_selector.get_support(True)
print("SBS选择的特征", SBS_selector.get_support(True))
SBS_score = evaluate_select_subset(X_train, y_train, X_test, y_test, sd_feat)
print("SBS选择的特征子集上获得的测试精度:", SBS_score)

# ==== 代码5-10.py ====

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# 1. 获得数据
wine = load_wine()
X,Y = wine.data, wine.target
#对X进行规范化
normalize_model = StandardScaler().fit(X)
X=normalize_model.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 9)
#2. L1正则化Logistic回归模型进行特征选择
#logistic分类模型: 正则参数C控制正则效果的大小,C越大,正则效果越弱
logistic_model = LogisticRegression(penalty = 'l1', C = 0.5, solver = 'liblinear',
random_state = 1234)
#嵌入式特征选择模型
selector = SelectFromModel(estimator = logistic_model, max_features = 5)
selector.fit(X_train, y_train)
#特征选择结果
print("L1正则嵌入法选择的特征:", selector.get_support(True))
print("L1正则化Logistic回归模型获得的测试精度",
selector.estimator_.score(X_test, y_test))


# ==== 代码5-11.py ====

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier as DTC
import numpy as np
# 1. 获得数据
wine = load_wine()
X,Y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 9)
#2. 决策树模型进行嵌入式特征选择
tree = DTC(criterion = "entropy", max_depth = 3, random_state = 9) #决策树模型
#嵌入式特征选择
selector = SelectFromModel(estimator = tree, threshold = 'mean')
selector.fit(X_train, y_train)
#特征选择结果
print("决策树嵌入法选择的特征:", selector.get_support(True))
print("决策树输出的特征重要性系数",
np.round(selector.estimator_.feature_importances_, 3))
print("决策树嵌入法获得的测试精度", selector.estimator_.score(X_test, y_test))

# ==== 代码6-1.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
#1. 读入数据
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 0)
#2. 训练高斯朴素贝叶斯模型
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# 3. 评估模型
y_pred = gnb.predict(X_test)
acc = gnb.score(X_test, y_test)
print('GaussianNB模型的准确度: %s'%acc)
y_pred = gnb.predict_proba(X_test)
print('测试数据对象0的预测结果(概率):', y_pred[0])


# ==== 代码6-2.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# 1. 读入数据
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df[['Family', 'Education', 'Securities Account',
'CD Account', 'Online', 'CreditCard']] #只选用6个特征
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#2. 训练多项式朴素贝叶斯模型
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
acc = mnb.score(X_test, y_test)
print('MultinomialNB模型的准确度: %s'%acc)


# ==== 代码6-3.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
# 1. 读入数据,建立两个数据集
df = pd.read_csv('UniversalBank.csv')
df = df.drop(['ID', 'ZIP Code'], axis = 1)
ccol = ['Family', 'Education', 'Securities Account',
'CD Account', 'Online', 'CreditCard'] #类别特征的索引
y = df['Personal Loan']
X_mul = df[ccol] #多项式朴素贝叶斯使用的数据
X_gau = df.drop(ccol + ['Personal Loan'], axis = 1) #高斯朴素贝叶斯使用的数据
X_mul_train, X_mul_test,X_gau_train, X_gau_test, y_train, y_test =\
train_test_split(X_mul, X_gau, y, test_size=0.1, random_state = 0)
# 2. 使用类别特征训练多项式朴素贝叶斯分类器
mnb = MultinomialNB()
mnb.fit(X_mul_train, y_train)
m_train_pred = mnb.predict_proba(X_mul_train)
m_test_pred = mnb.predict_proba(X_mul_test)
acc=mnb.score(X_mul_test, y_test)
print('MultinomialNB模型的准确度: %s'%acc)
# 3. 使用数值特征训练高斯朴素贝叶斯模型
gnb = GaussianNB()
gnb.fit(X_gau_train, y_train)
g_train_pred = gnb.predict_proba(X_gau_train)
g_test_pred = gnb.predict_proba(X_gau_test)
acc = gnb.score(X_gau_test, y_test)
print('GaussianNB模型的准确度: %s'%acc)
# 4. 集成两个模型
acc=sum(((m_test_pred[: , 1] + g_test_pred[ : , 1]) >= 1) == (y_test == 1)) / len(y_test)
print('集成模型的准确度: %s'%acc)


# ==== 代码6-4.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# 1. 建立数据集
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
n_neighbors = 5 #K值
# 2. 采用两种weights参数建立KNN模型,并评估
for weights in ['uniform', 'distance']:
knn = KNeighborsClassifier(n_neighbors, weights = weights)
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print('%s 准确度: %s'%(weights, acc))


# ==== 代码6-5.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
np.random.seed(10)
# 1. 建立数据集
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#2. 使用默认参数训练CART模型
model1 = DecisionTreeClassifier()
model1.fit(X_train, y_train)
acc1 = model1.score(X_test, y_test)
print('默认参数的CART决策树的准确度: \n', acc1)
# 3. 设置sample_weight参数后训练CART模型
sample_weight = np.ones((y_train.shape[0],))
sample_weight[y_train == 1] = np.ceil(sum(y_train == 0) / sum(y_train == 1))
model2 = DecisionTreeClassifier(max_depth = 10) #设置模型的max_depth参数
model2 = model2.fit(X_train, y_train, sample_weight)
acc2 = model2.score(X_test, y_test)
print('设置参数后的CART决策树的准确度:\n', acc2)
#4. 可视化决策树
from sklearn.tree import export_graphviz
import graphviz
dot_data = export_graphviz(model2, out_file = None,
feature_names = X.columns,
class_names=["0","1"],
filled=True) #指定是否为节点上色
graph = graphviz.Source(dot_data)
graph.render(r'wine')
graph.view()


# ==== 代码6-6.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#构建BP神经网络模型
model = MLPClassifier(hidden_layer_sizes = (1000, 10), activation = 'logistic', verbose = 1)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print('BP神经网络的准确度:%s'%acc)


# ==== 代码6-7.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, NuSVC
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 0)
# 1. 在规范化数据集上训练SVC模型
model = make_pipeline(StandardScaler(), SVC(gamma = 'auto', C=3, class_weight={0:1,1:2}))
model.fit(X_train, y_train)
acc = model.score(X_test,y_test)
print('在规范化数据集上训练SVC模型的准确度: \n', acc)
# 2. 在未规范化数据集上训练SVC模型
model = SVC(gamma = 'auto', C = 3)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print('在未规范化数据集上训练SVC模型的准确度:\n', acc)
# 3. 在规范化数据集上训练Nu-SVC模型
model = make_pipeline(StandardScaler(), NuSVC(gamma = 'auto', nu = 0.07,
class_weight = 'balanced'))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print('在规范化数据集上训练Nu-SVC模型的准确度: \n', acc)


# ==== 代码6-8.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
# 1. 准备数据集
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df[['Age', 'Experience','Income', 'CCAvg', 'Mortgage']]
n_neighbors = 5
X2 = np.array(X)
y2 = np.array(y)
#2. 在5折交叉验证数据集上测试KNN的准确度
kf = KFold(n_splits = 5)
acc = 0
for train_index, test_index in kf.split(X2):
    knn = KNeighborsClassifier(n_neighbors) #构建KNN分类模型
    knn.fit(X2[train_index], y2[train_index])
    acc += knn.score(X2[test_index], y2[test_index])
print('使用KFold实现交叉验证计算KNN的准确度: %s'% (acc/kf.get_n_splits()))
#3. 使用cross_val_score函数实现5折交叉验证,计算KNN的准确度
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors)
acc = cross_val_score(knn, X, y, cv = 5)
print('使用cross_val_score实现交叉验证计算KNN的准确度:%s'% np.mean(acc))


# ==== 代码6-9.py ====

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
f_train = 'cs-training.csv'
f_test = 'cs-test.csv'
f_target = 'sampleEntry.csv'
df_train = pd.read_csv(f_train, header = 0)
df_test = pd.read_csv(f_test, header = 0)
df_target = pd.read_csv(f_target, header = 0)
df_train = df_train.iloc[:, 1:]
df_test = df_test.iloc[:, 1:]
# 类的分布情况
pos = sum(df_train['SeriousDlqin2yrs'] > 0.5)
neg = len(df_train) - pos
plt.figure(figsize=(14,10))
counts = {'POS': pos, 'NEG': neg} #避免使用内置名dict作为变量名
size = len(counts)
for i, key in enumerate(counts):
    plt.bar(i, counts[key], width = 0.2)
    plt.text(i - 0.05, counts[key] + 0.01, counts[key], fontsize = 24)
plt.xticks(np.arange(size), counts.keys(), fontsize = 24)
plt.yticks([10000, 70000, 130000],fontsize=24)

# ==== 代码6-10.py ====

#绘制缺失值柱状图
plt.figure(figsize=(14,10))
loc = []
s = pd.isnull(df_train).sum() / len(df_train)
for i in range(0, df_train.shape[1]):
    if s[i] != 0:
        plt.bar(i, s[i], width = 1)
        plt.text(i - 0.1, s[i] + 0.005, '%.3f' % s[i], fontsize = 24)
        loc.append(i)
plt.xticks(loc, s.index[loc],fontsize=24)
plt.yticks([0, 0.1,0.2],fontsize=24)
plt.ylim(0, 0.25)
# 处理缺失值
df_train = df_train.drop(['MonthlyIncome'], axis = 1)
df_test = df_test.drop(['MonthlyIncome'], axis = 1)
df_train['NumberOfDependents'].fillna(df_train['NumberOfDependents'].mean(),
inplace = True)
df_test['NumberOfDependents'].fillna(df_train['NumberOfDependents'].mean(), inplace = True)


# ==== 代码6-11.py ====

fig = plt.figure(figsize=(14,8))
for i in range(df_train.shape[1]):
    fig.add_subplot(5, 2, i+1)
    plt.title(df_train.columns[i], fontsize = 16)
    dat = df_train.iloc[:, i]
    plt.scatter(np.arange(len(dat)), dat, s = 1)
    plt.xticks([])
    plt.yticks(fontsize = 16)
fig.tight_layout()
# 删除异常值数据
index = df_train['RevolvingUtilizationOfUnsecuredLines'] <= 1
df_train2 = df_train[index]
index = df_train['age'] > 18
df_train2 = df_train2[index]


# ==== 代码6-12.py ====

from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
import numpy as np
np.random.seed(10)
X = np.array(df_train2.iloc[:, 1:])
y = np.array(df_train2.iloc[:, 0])
weight = sum(y == 0) / sum(y == 1)
class_weight ={0:1, 1:weight}
scoring = ['accuracy', 'balanced_accuracy', 'roc_auc']
# 创建CART决策树模型
cart = DecisionTreeClassifier(class_weight = class_weight,
min_samples_leaf = 80,
max_depth = 8)
scores = cross_validate(cart, X, y, cv = 10, scoring = scoring)
print('CART决策树模型的信用评分结果:')
s = np.mean(scores['test_accuracy'])
print('accuracy: %s'% s)
s = np.mean(scores['test_balanced_accuracy'])
print('balanced_accuracy: %s'% s)
s = np.mean(scores['test_roc_auc'])
print('AUC: %s'% s)


# ==== 代码6-13.py ====

svm = make_pipeline(StandardScaler(), SVC(gamma = 'auto', C = 100,
class_weight = class_weight))
scores = cross_validate(svm, X, y, cv =2, scoring = scoring)
print(' SVM模型的性能评价结果:')
s = np.mean(scores['test_accuracy'])
print('accuracy: %s'% s)
s = np.mean(scores['test_balanced_accuracy'])
print('balanced_accuracy: %s'% s)
s = np.mean(scores['test_roc_auc'])
print('AUC: %s'% s)


# ==== 代码6-14.py ====

from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import roc_auc_score
weight = sum(y == 0) / sum(y == 1)
class_weight ={0:1, 1:weight}
def evaluate(model, name, X_test, y_true): #自定义评价函数
    print(' %s模型的信用评分结果:'% name)
    y_pred = model.predict(X_test)
    score = accuracy_score(y_true, y_pred)
    print('accuracy: %s'%score)
    score = balanced_accuracy_score(y_true, y_pred)
    print('balanced accuracy: %s'%score)
    score = roc_auc_score(y_true, y_pred)
    print('AUC: %s'%score)
X_test= np.array(df_test.iloc[:,1:])
y_test= df_target['Probability'].gt(0.5).astype(np.short)
#使用最优参数训练CART决策树
cart = DecisionTreeClassifier(class_weight = class_weight,
min_samples_leaf = 80,
max_depth = 8)
cart.fit(X, y)
evaluate(cart, 'CART', X_test, y_test)
#使用最优参数训练SVM模型
svm = make_pipeline(StandardScaler(), SVC(gamma = 'auto', C = 100,
class_weight = class_weight))
svm.fit(X, y)
evaluate(svm, 'SVM', X_test, y_test)


# ==== 代码6-15.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('winequality-white.csv', delimiter = ';')
y = np.array(df['quality'])
X = df.drop(['quality'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
reg = LinearRegression().fit(X_train, y_train) #线性回归模型
pred = reg.predict(X_test)
mae = np.sum(np.abs(pred - y_test)) / len(y_test)
print('线性回归模型的MAE为:', mae)


# ==== 代码6-16.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
df = pd.read_csv('winequality-white.csv', delimiter = ';')
y = df['quality']
X = df.drop(['quality'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#建立CART决策回归树模型,训练并做性能评价
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)
pred = regressor.predict(X_test)
mae = np.sum(np.abs(pred - y_test)) / len(y_test)
print('CART决策回归树模型的MAE为:', mae)
r2 = regressor.score(X_test, y_test) #score()返回的是决定系数R²,而非MSE
print('CART决策回归树模型的R²为:', r2)
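
Since score() reports R² rather than a squared error, the explicit error metrics can be computed with sklearn.metrics; a short sketch (an addition to the textbook code):

# supplementary sketch: explicit MAE and MSE via sklearn.metrics (not part of 代码6-16.py)
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('CART决策回归树模型的MAE为:', mean_absolute_error(y_test, pred))
print('CART决策回归树模型的MSE为:', mean_squared_error(y_test, pred))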


# ==== 代码6-17.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
df = pd.read_csv('winequality-white.csv', delimiter = ';')
y = df['quality']
X = df.drop(['quality'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#建立MLP模型,训练并做性能评价
regressor = MLPRegressor(hidden_layer_sizes = (100,10) , solver = 'adam',
activation = 'logistic', random_state = 0)
regressor.fit(X_train, y_train)
pred = regressor.predict(X_test)
mae = np.sum(np.abs(pred - y_test)) / len(y_test)
print('BPNN模型的MAE为:', mae)
r2 = regressor.score(X_test, y_test) #score()返回的是决定系数R²,而非MSE
print('BPNN模型的R²为:', r2)


# ==== 代码6-18.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('winequality-white.csv', delimiter = ';')
y = np.array(df['quality'])
X = df.drop(['quality'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# 创建SVR模型,训练并预测
regressor = SVR(C = 100)
model = make_pipeline(StandardScaler(), regressor)
model.fit(X_train, y_train)
pred = model.predict(X_test)
mae = np.sum(np.abs(pred - y_test)) / len(y_test)
print('SVR模型的MAE为:', mae)
r2 = model.score(X_test, y_test) #score()返回的是决定系数R²,而非MSE
print('SVR模型的R²为:', r2)

# ==== 代码7-1.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#建立KNN模型和装袋模型
knn = KNeighborsClassifier(5, weights = 'distance')
bagging_model = BaggingClassifier(base_estimator = knn, n_estimators = 10)
# 模型训练和评估
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print('KNN模型的准确度: %s'%(acc))
bagging_model.fit(X_train, y_train)
acc = bagging_model.score(X_test, y_test)
print('Bagging模型的准确度: %s'%(acc))



# ==== 代码7-2.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#建立CART决策树基模型和提升模型
cart = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = 6)
ada_model = AdaBoostClassifier(base_estimator = cart, n_estimators = 50,
random_state = 10)
#模型训练和测试
cart.fit(X_train, y_train)
acc = cart.score(X_test, y_test)
print('CART决策树模型的准确度: %s'%(acc))
ada_model.fit(X_train, y_train)
acc = ada_model.score(X_test, y_test)
print('Adaboost模型的准确度: %s'%(acc))


# ==== 代码7-3.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
# 1. 读入数据,建立训练集和测试集
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# 2. 建立基模型、元模型和堆叠模型
cart = DecisionTreeClassifier()
svm = make_pipeline(StandardScaler(),
NuSVC(gamma = 'auto', nu = 0.07, class_weight = 'balanced'))
lr = LogisticRegression() #元模型
estimators = [('cart', cart), ('svm', svm)] #基模型
kf = KFold(n_splits = 10)
stacking_model = StackingClassifier(estimators = estimators, #堆叠模型
final_estimator = lr, cv = kf)
#3. 训练和测试模型
cart.fit(X_train, y_train)
acc = cart.score(X_test, y_test)
print('CART决策树模型的准确度: %s'%(acc))
svm.fit(X_train, y_train)
acc = svm.score(X_test, y_test)
print('支持向量机模型的准确度: %s'%(acc))
stacking_model.fit(X_train, y_train)
acc = stacking_model.score(X_test, y_test)
print('堆叠(Stacking)模型的准确度: %s'%(acc))


# ==== 代码7-4.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#1. 读数据,建立训练集和测试集
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code','Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#2. 计算样本权重
sample_weights = np.ones((y_train.shape[0],))
sample_weights[y_train == 1] = np.ceil(sum(y_train == 0) / sum(y_train == 1))
#3. 构建随机森林
model = RandomForestClassifier(n_estimators = 400, max_depth = 8,
min_samples_split = 3, random_state = 0)
#4. 训练和测试模型
model = model.fit(X_train, y_train, sample_weights)
acc = model.score(X_test, y_test)
print('随机森林模型的准确度: %s' % acc)


# ==== 代码7-5.py ====

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
#1. 读入数据,建立训练集和测试集
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#2. 计算样本权重
sample_weights = np.ones((y_train.shape[0],))
sample_weights[y_train == 1] = np.ceil(sum(y_train == 0) / sum(y_train == 1))
#3. 建立提升树模型
model = GradientBoostingClassifier(n_estimators = 200, learning_rate = 0.3,
max_depth = 5, min_samples_leaf = 4, random_state = 0)
#4. 训练和评估模型
model.fit(X_train, y_train, sample_weights)
acc = model.score(X_test, y_test)
print('提升树模型的准确度: %s' % acc)


# ==== 代码7-6.py ====

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('churn.csv')
#处理TotalCharges特征上的缺失值
idx = df['TotalCharges'] == ' '
df.loc[idx, 'TotalCharges'] = df.loc[idx, 'MonthlyCharges'] #用.loc赋值,避免链式索引带来的告警和失效
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], downcast = "float")
#对Churn特征进行编码
le = LabelEncoder()
le.fit(df['Churn'])
y = le.transform(df['Churn'])
df = df.drop(['customerID', 'Churn'], axis=1)


# ==== 代码7-7.py ====

excluded_cols = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
cates = list(set(df.columns) - set(excluded_cols))
encoder = OneHotEncoder(drop = 'first')
df2 = encoder.fit_transform(df[cates]).toarray()
X = np.concatenate((df2, df[excluded_cols]), axis = 1)


# ==== 代码7-8.py ====

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score
import numpy as np
np.random.seed(10)
skf = StratifiedKFold(n_splits =10, shuffle = True, random_state = 10)


# ==== 代码7-9.py ====

# step 1: 设置初始准确度和平衡准确度
acc_rf, acc_gbt, acc_ada = 0, 0, 0 #设置三种模型的初始准确度
bacc_rf, bacc_gbt, bacc_ada = 0, 0, 0 #设置三种模型的初始平衡准确度
for train_index, test_index in skf.split(df, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # step 2: 计算样本权重
    sample_weights = np.ones((len(y_train), ))
    sample_weights[y_train == 1] = np.ceil(sum(y_train == 0) / sum(y_train == 1))
    # step 3: 提升树模型的训练与评估
    gbt = GradientBoostingClassifier(n_estimators = 200,
                                     learning_rate = 0.3,
                                     max_depth = 5,
                                     min_samples_leaf = 4,
                                     random_state = 0)
    gbt.fit(X_train, y_train, sample_weights)
    y_pred = gbt.predict(X_test)
    acc_gbt += accuracy_score(y_test, y_pred)
    bacc_gbt += balanced_accuracy_score(y_test, y_pred)
    # step 4: 随机森林的训练与评估
    rf = RandomForestClassifier(n_estimators = 1000,
                                max_depth = 8,
                                min_samples_split = 3,
                                random_state = 0)
    rf.fit(X_train, y_train, sample_weights)
    y_pred = rf.predict(X_test)
    acc_rf += accuracy_score(y_test, y_pred)
    bacc_rf += balanced_accuracy_score(y_test, y_pred)
    # step 5: Adaboost的训练与评估
    cart = DecisionTreeClassifier(min_samples_leaf = 15, max_depth = 15)
    ada = AdaBoostClassifier(base_estimator = cart, n_estimators = 1000, random_state = 10)
    ada.fit(X_train, y_train, sample_weights)
    y_pred = ada.predict(X_test)
    acc_ada += accuracy_score(y_test, y_pred)
    bacc_ada += balanced_accuracy_score(y_test, y_pred)
# step 6: 显示分类结果
print('提升树的准确度:%s, 平衡准确度: %s'%(acc_gbt/10, bacc_gbt/10))
print('随机森林的准确度:%s, 平衡准确度: %s'% (acc_rf/10, bacc_rf/10))
print('Adaboost的准确度:%s, 平衡准确度: %s'%(acc_ada/10, bacc_ada/10))

# ==== 代码7-10.py ====

from numpy import mean
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
np.random.seed(10)
k = 5
df = pd.read_csv('UniversalBank.csv')
y = df['Personal Loan']
X = df.drop(['ID', 'ZIP Code', 'Personal Loan'], axis = 1)
scorings = ['accuracy', 'balanced_accuracy']


# ==== 代码7-11.py ====

model = DecisionTreeClassifier(min_samples_leaf = 7)
scores = cross_validate(model, X, y, cv = 10, scoring = scorings)
print('不处理不平衡问题的CART决策树模型:')
s = np.mean(scores['test_balanced_accuracy'])
print('平衡准确度: %s'% s)
s = np.mean(scores['test_accuracy'])
print('准确度: %s'% s)


# ==== 代码7-12.py ====

class_weight = {0:1, 1:sum(y == 0) / sum(y == 1)} #设置类别权重
model = DecisionTreeClassifier(class_weight = class_weight, min_samples_leaf = 7)
scores = cross_validate(model, X, y, cv = 10, scoring = scorings)
print('设置类别权重后的CART决策树模型: ')
s = np.mean(scores['test_balanced_accuracy'])
print('平衡准确度: %s'% s)
s = np.mean(scores['test_accuracy'])
print('准确度: %s'% s)


# ==== 代码7-13.py ====

smote = SMOTE(sampling_strategy = 'minority', k_neighbors = k)
X_res, y_res = smote.fit_resample(X, y)
model = DecisionTreeClassifier(min_samples_leaf = 7)
scores = cross_validate(model, X_res, y_res, cv = 10, scoring = scorings)
print('使用SMOTE过采样处理不平衡数据后的CART决策树模型:')
s = np.mean(scores['test_balanced_accuracy'])
print('平衡准确度: %s'% s)
s = np.mean(scores['test_accuracy'])
print('准确度: %s'% s)


# ==== 代码7-14.py ====

adasyn = ADASYN(sampling_strategy = 'minority')
model = DecisionTreeClassifier(min_samples_leaf = 7)
X_res, y_res = adasyn.fit_resample(X, y)
scores = cross_validate(model, X_res, y_res, cv = 10, scoring = scorings)
print('使用ADASYN过采样处理不平衡数据后的CART决策树模型:')
s = np.mean(scores['test_balanced_accuracy'])
print('平衡准确度: %s'% s)
s = np.mean(scores['test_accuracy'])
print('准确度: %s'% s)


# ==== 代码7-15.py ====

cart = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = 6)
ada = AdaBoostClassifier(base_estimator = cart, n_estimators = 100)
eec = EasyEnsembleClassifier(base_estimator = ada,
sampling_strategy = 'all', replacement = True)
scores = cross_validate(eec, X, y, cv = 10, scoring = scorings)
print('处理不平衡问题的Easy Ensemble集成模型:')
s = np.mean(scores['test_balanced_accuracy'])
print('平衡准确度: %s'% s)
s = np.mean(scores['test_accuracy'])
print('准确度: %s'% s)

# ==== 代码8-1.py ====

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
#1. 获得数据集
n_samples = 200 #样本数量
X, y = make_blobs(n_samples = n_samples,
random_state = 9, centers = 4, cluster_std = 1)
#2. KMeans模型创建和训练预测
model = KMeans(n_clusters = 4, random_state = 12345)
y_pred = model.fit_predict(X)
#3. 聚类结果及评价
print("聚类后的SSE值:", model.inertia_) # SSE值
print("聚类质心:", model.cluster_centers_)
#4. 绘图显示聚类结果
plt.figure(figsize = (5, 5))
plt.rcParams['font.sans-serif'] = ['SimHei'] #显示中文标签
plt.rcParams['axes.unicode_minus'] = False
plt.scatter(X[y_pred == 0][:, 0], X[y_pred == 0][:, 1], marker = 'D', color = 'g')
plt.scatter(X[y_pred == 1][:, 0], X[y_pred == 1][:, 1], marker = 'o', color = 'b')
plt.scatter(X[y_pred == 2][:, 0], X[y_pred == 2][:, 1], marker = 's', color = 'm')
plt.scatter(X[y_pred == 3][:, 0], X[y_pred == 3][:, 1], marker = 'v', color = 'r')
plt.title("k-means算法的聚类结果, k = 4")
plt.show()
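补充示例(非原书代码):可以用不同k值下的SSE(model.inertia_)绘制"肘部"曲线,辅助选择簇数k,以下为一个最小示意:
sse = []
k_range = range(2, 10)
for k in k_range:
    km = KMeans(n_clusters = k, random_state = 12345)
    km.fit(X)
    sse.append(km.inertia_) #记录每个k对应的SSE
plt.figure(figsize = (5, 4))
plt.plot(list(k_range), sse, marker = 'o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('肘部法辅助选择k')
plt.show()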


# ==== 代码8-2.py ====

from sklearn import metrics
print("1.内部度量指标")
print(" 轮廓系数: %0.3f" % metrics.silhouette_score(X, y_pred))
print(" CH指数: %0.3f" % metrics.calinski_harabasz_score(X, y_pred))
print("2.外部度量指标")
print(" ARI指数: %0.3f" % metrics.adjusted_rand_score(y, y_pred))
print(" NMI指数: %0.3f" % metrics.normalized_mutual_info_score(y, y_pred))


# ==== 代码8-3.py ====

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn import metrics
# 1. 获得数据集
n_samples = 200 #样本数量
X, y = make_moons(n_samples = n_samples, random_state = 9,noise = 0.1)
#添加噪声(若无需噪声,此步骤可删除)
X = np.insert(X, 0, values = np.array([[1.5, 0.5], [-0.5, 0]]), axis = 0)
y = np.insert(y, 0, [0, 0], axis = 0)
#2. DBSCAN模型创建和训练
model = DBSCAN( eps = 0.2, min_samples = 4)
y_pred = model.fit_predict(X) # -1代表噪声,其余值代表预测的簇标号,0,1
# 统计聚类后的簇数量
n_clusters_ = len(set(y_pred)) - (1 if -1 in y_pred else 0)
#3. 聚类模型评价
print('聚类的簇数: %d' % n_clusters_)
print('轮廓系数: %0.3f' % metrics.silhouette_score(X, y_pred))
print('调整兰德指数ARI: %0.3f' % metrics.adjusted_rand_score(y, y_pred))
# 4. 绘图显示聚类结果
core_samples_mask = np.zeros_like(model.labels_, dtype = bool) #获得核心对象的掩码
core_samples_mask[model.core_sample_indices_] = True
#绘制原始数据集
set_marker = ['o', 'v', 'x', 'D', '>', 'p', '<']
set_color = ['b', 'r', 'm', 'g', 'c', 'k', 'tan']
plt.figure(figsize = (5, 5))
for i in range(n_clusters_):
    plt.scatter(X[y == i][:, 0], X[y == i][:, 1], marker = set_marker[i],
                color = 'none', edgecolors = set_color[i])
plt.title(' Moons数据集(带2个噪声点)', fontsize = 14)
#绘制DBSCAN的聚类结果
plt.figure(figsize = (5, 5))
unique_labels = set(y_pred)
i = -1 #flag变量
for k, col in zip(unique_labels, set_color[0: len(unique_labels)]):
    if k == -1:
        col = 'k' # 黑色表示标记噪声点
    class_member_mask = (y_pred == k)
    i += 1
    if (i >= len(unique_labels)): i = 0
    #绘制核心对象
    xcore = X[class_member_mask & core_samples_mask]
    plt.plot(xcore[:, 0], xcore[:, 1], set_marker[i], markerfacecolor = col,
             markeredgecolor = 'k', markersize = 8)
    #绘制边界对象和噪声
    xncore = X[class_member_mask & ~core_samples_mask]
    plt.plot(xncore[:, 0], xncore[:, 1], set_marker[i], markerfacecolor = col,
             markeredgecolor = 'k', markersize = 4)
plt.title('DBSCAN算法的聚类结果: 识别的簇= %d' % n_clusters_, fontsize = 14)
plt.show()


# ==== 代码8-4.py ====

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs # 用于生成数据集的库
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
from utils import draw_ellipse, BIC # 引用辅助函数
# 1. 获得数据集
n_samples = 200 # 样本数量
X, y = make_blobs(n_samples = n_samples, random_state = 9, centers = 4, cluster_std = 1)
# 2. GMM模型的创建和训练
K = 4 # 簇的数量
model = GaussianMixture(n_components = K, covariance_type = 'full', random_state = 15)
y_pred = model.fit_predict(X)
# 3. 聚类模型评价
print(" 轮廓系数: %0.3f" % metrics.silhouette_score(X, y_pred))
print(" 调整兰德指数AMI: %0.3f" % metrics.adjusted_rand_score(y, y_pred))
# 4绘图显示GMM的聚类结果
plt.figure(figsize = (5, 5))
plt.rcParams['font.sans-serif'] = ['SimHei'] #显示中文标签
plt.rcParams['axes.unicode_minus'] = False
set_marker = ['D', 'o', 's', 'v'] #此处补充定义绘图标记与颜色(原脚本未定义,取值仅作示意)
set_color = ['g', 'b', 'm', 'r']
for i in range(K):
    plt.scatter(X[y_pred == i][:, 0], X[y_pred == i][:, 1],
                marker = set_marker[i], color = set_color[i])
# 为簇绘制椭圆阴影区域
for p, c, w in zip(model.means_, model.covariances_, model.weights_):
    draw_ellipse(p, c, alpha = 0.05)
plt.title(" GMM的聚类结果, K=%d"% K, fontsize = 14)
plt.show()


# ==== 代码8-5.py ====

from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
import numpy as np
from sklearn.mixture import GaussianMixture
# 函数: 给定的位置画一个椭圆
def draw_ellipse(position, covariance, ax = None, **kwargs):
    ax = ax or plt.gca()
    # 将协方差转换为主轴
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    ax.add_patch(Ellipse(position, 3 * width, 3 * height, angle, **kwargs))
# 函数: 计算BIC准则
def BIC(X):
    lowest_bic = np.infty
    bic = []
    n_components_range = range(1, 10)
    for n_components in n_components_range:
        gmm_model = GaussianMixture(n_components = n_components)
        gmm_model.fit(X)
        bic.append(gmm_model.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
    bic = np.array(bic)
    return bic.argmin() + 1
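补充示例(非原书代码):利用上面定义的BIC函数,可以在代码8-4的数据X上按BIC准则自动选择GMM的分量数K,示意如下:
best_K = BIC(X) #在1~9个分量中选择BIC最小的分量数
print('BIC准则选择的簇数K为:', best_K)
model = GaussianMixture(n_components = best_K, random_state = 15)
y_pred = model.fit_predict(X)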


# ==== 代码9-1.py ====

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder
itemSetList = [['A', 'C', 'D'],
['B', 'C', 'E'],
['A', 'B', 'C','E'],
['B', 'E']]
#数据预处理——编码
te = TransactionEncoder()
te_array = te.fit(itemSetList).transform(itemSetList)
df = pd.DataFrame(te_array, columns = te.columns_)
#挖掘频繁项集(最小支持度为0.5)
frequent_itemsets = apriori(df, min_support = 0.5, use_colnames = True)
print("发现的频繁项集包括:\n", frequent_itemsets)


# ==== 代码9-2.py ====

from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric = 'confidence',
min_threshold = 0.5,
support_only = False)
rules= rules[ rules['lift']>1]
print("生成的强关联规则为:\n", rules)


# ==== 代码9-3.py ====

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules
itemSetList = [['A', 'C', 'D'],
['B', 'C', 'E'],
['A', 'B', 'C','E'],
['B', 'E']]
#数据预处理——编码
te = TransactionEncoder()
te_array = te.fit(itemSetList).transform(itemSetList)
df = pd.DataFrame(te_array, columns = te.columns_)
#利用FP-Growth算法发现频繁项集,最小支持度为0.5
frequent_itemsets = fpgrowth(df, min_support = 0.5, use_colnames = True)
print("发现的频繁项集包括:\n", frequent_itemsets)
#生成强规则(最小置信度为0.5, 提升度>1)
rules = association_rules(frequent_itemsets, metric = 'confidence',
min_threshold = 0.5, support_only = False)
rules= rules[ rules['lift'] > 1]
print("生成的强关联规则为:\n", rules)


# ==== 代码9-4.py ====

#Eclat类的定义
class Eclat:
    def __init__(self, min_support = 3, min_confidence = 0.6, min_lift = 1):
        self.min_support = min_support
        self.min_confidence = min_confidence
        self.min_lift = min_lift
    #函数:倒排数据
    def invert(self, data):
        invert_data = {}
        fq_item = []
        sup = []
        for i in range(len(data)):
            for item in data[i]:
                if invert_data.get(item) is not None:
                    invert_data[item].append(i)
                else:
                    invert_data[item] = [i]
        for item in invert_data.keys():
            if len(invert_data[item]) >= self.min_support:
                fq_item.append([item])
                sup.append(invert_data[item])
        fq_item = list(map(frozenset, fq_item))
        return fq_item, sup
    #函数:取交集
    def getIntersection(self, fq_item, sup):
        sub_fq_item = []
        sub_sup = []
        k = len(fq_item[0]) + 1
        for i in range(len(fq_item)):
            for j in range(i+1, len(fq_item)):
                L1 = list(fq_item[i])[: k-2]
                L2 = list(fq_item[j])[: k-2]
                if L1 == L2:
                    flag = len(list(set(sup[i]).intersection(set(sup[j]))))
                    if flag >= self.min_support:
                        sub_fq_item.append(fq_item[i] | fq_item[j])
                        sub_sup.append(
                            list(set(sup[i]).intersection(set(sup[j]))))
        return sub_fq_item, sub_sup
    #函数:获得频繁项
    def findFrequentItem(self, fq_item, sup, fq_set, sup_set):
        fq_set.append(fq_item)
        sup_set.append(sup)
        while len(fq_item) >= 2:
            fq_item, sup = self.getIntersection(fq_item, sup)
            fq_set.append(fq_item)
            sup_set.append(sup)
    #函数:生成关联规则
    def generateRules(self, fq_set, rules, len_data):
        for fq_item in fq_set:
            if len(fq_item) > 1:
                self.getRules(fq_item, fq_item, fq_set, rules, len_data)
    #辅助函数:删除项目
    def removeItem(self, current_item, item):
        tempSet = []
        for elem in current_item:
            if elem != item:
                tempSet.append(elem)
        tempFrozenSet = frozenset(tempSet)
        return tempFrozenSet
    #辅助函数:生成关联规则
    def getRules(self, fq_item, cur_item, fq_set, rules, len_data):
        for item in cur_item:
            subset = self.removeItem(cur_item, item)
            confidence = fq_set[fq_item] / fq_set[subset]
            supp = fq_set[fq_item] / len_data
            lift = confidence / (fq_set[fq_item - subset] / len_data)
            if confidence >= self.min_confidence and lift > self.min_lift:
                flag = False
                for rule in rules:
                    if (rule[0] == subset) and (rule[1] == fq_item - subset):
                        flag = True
                if flag == False:
                    rules.append(("%s --> %s, support=%5.3f, confidence=%5.3f, lift = %5.3f"%(
                        list(subset), list(fq_item - subset), supp, confidence, lift)))
            if len(subset) >= 2:
                self.getRules(fq_item, subset, fq_set, rules, len_data)
    #函数:Eclat模型训练
    def fit(self, data, display = True):
        frequent_item, support = self.invert(data)
        frequent_set = []
        support_set = []
        len_data = len(data)
        self.findFrequentItem(frequent_item, support, frequent_set, support_set)
        data = {}
        for i in range(len(frequent_set)):
            for j in range(len(frequent_set[i])):
                data[frequent_set[i][j]] = len(support_set[i][j])
        rules = []
        self.generateRules(data, rules, len_data)
        if display:
            print("Association Rules:")
            for rule in rules:
                print(rule)
            print("发现的规则数量:", len(rules))
        return frequent_set, rules
#用Eclat类创建一个关联规则模型,训练后生成关联规则
itemSetList = [['A', 'C', 'D'],
['B', 'C', 'E'],
['A', 'B', 'C','E'],
['B', 'E']]
et = Eclat(min_support = 2, min_confidence = 0.5, min_lift = 1)
et.fit(itemSetList, True)


# ==== 代码9-5.py ====

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
inputfile = 'Online_Retail.xlsx' # 输入的数据文件
data = pd.read_excel(inputfile)
#步骤1:数据探索
data.info()
print("不同的国家名称:\n", data.Country.unique())
#步骤2:预处理
data['Description'] = data['Description'].str.strip() #去除空格
data.dropna(axis = 0,subset = ['CustomerID'], inplace = True) #删除含缺失值的行
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
data = data[~data['InvoiceNo'].str.contains('C')] #删除所有已取消交易
#步骤3:数据分割和转换
basket_France = (data[data['Country'] == "France"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
basket_Por = (data[data['Country'] == "Portugal"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
basket_Sweden = (data[data['Country'] == "Sweden"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
def hot_encode(x):
    if(x<= 0): return 0
    if(x>= 1): return 1
basket_France = basket_France.applymap(hot_encode) #0/1编码数据
basket_Por = basket_Por.applymap(hot_encode)
basket_Sweden = basket_Sweden.applymap(hot_encode)


# ==== 代码9-6.py ====

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
inputfile = './Online_Retail.xlsx' # 输入的数据文件
data = pd.read_excel(inputfile)
#步骤1:数据探索
data.info()
print("不同的国家名称:\n", data.Country.unique())
#步骤2:预处理
data['Description'] = data['Description'].str.strip() #去除空格
data.dropna(axis = 0,subset =['CustomerID'],inplace = True) #删除含缺失值的行
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
data = data[~data['InvoiceNo'].str.contains('C')] #删除所有已取消交易
print(data.head(5))
#步骤3:数据分割和转换
basket_France = (data[data['Country'] =="France"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
basket_Por = (data[data['Country'] =="Portugal"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
basket_Sweden = (data[data['Country'] =="Sweden"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
def hot_encode(x):
    if(x<= 0): return 0
    if(x>= 1): return 1
basket_France = basket_France.applymap(hot_encode) #0/1编码数据
basket_Por = basket_Por.applymap(hot_encode)
basket_Sweden = basket_Sweden.applymap(hot_encode)
# (1)法国数据集的关联规则挖掘
frq_items = apriori(basket_France, min_support = 0.1, use_colnames = True)
rules =association_rules(frq_items, metric ="confidence", min_threshold= 0.3)
rules= rules[ rules['lift']>=1.5] #设置最小提升度
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head()) #显示前5条强关联规则
#(2)葡萄牙数据集的关联规则挖掘
frq_items = apriori(basket_Por, min_support = 0.1, use_colnames = True)
rules =association_rules(frq_items, metric ="confidence", min_threshold= 0.3)
rules= rules[ rules['lift']>=1.5] #设置最小提升度
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head()) #显示前5条强关联规则
#(3) 瑞典数据集的关联规则挖掘
frq_items = apriori(basket_Sweden, min_support = 0.05, use_colnames = True)
rules =association_rules(frq_items, metric ="confidence", min_threshold= 0.3)
rules= rules[ rules['lift']>=1.5] #设置最小提升度
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head()) #显示前5条强关联规则



# ==== 代码10-1.py ====

import pandas as pd
from pandas import datetime # 注:较新版本的pandas已移除pandas.datetime,squeeze与date_parser参数也已弃用,必要时请按新接口改写
file = './data/shampoo.csv' #该数据放置在data文件夹下,读者可以自行定义
#函数:日期数据解析
def date_parser (x):
    return datetime.strptime('190'+x, '%Y-%m')
data = pd.read_csv(file, header = 0, parse_dates = [0], index_col = 0, squeeze = True,
date_parser = date_parser)


# ==== 代码10-2.py ====

import pandas as pd
from pandas import datetime
file = './data/shampoo.csv' #该数据放置在data文件夹下,读者可以自行定义
#函数:日期数据解析
def date_parser (x):
    return datetime.strptime('190'+x, '%Y-%m')
data = pd.read_csv(file, header = 0, parse_dates = [0], index_col = 0, squeeze = True,
date_parser = date_parser)
# 时序图
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号
plt.plot(data)
plt.legend()
plt.show()
# 自相关图
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data).show()
plt.show()
# ADF单位根检测方法
from statsmodels.tsa.stattools import adfuller as ADF
print('原始序列的ADF检验结果为:', ADF(data)) # 返回值依次为adf、p值等


# ==== 代码10-3.py ====

import pandas as pd
from pandas import datetime
file = './data/shampoo.csv' #该数据放置在data文件夹下,读者可以自行定义
#函数:日期数据解析
def date_parser (x):
    return datetime.strptime('190'+x, '%Y-%m')
data = pd.read_csv(file, header = 0, parse_dates = [0], index_col = 0, squeeze = True,
date_parser = date_parser)
# 时序图
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号
plt.plot(data)
plt.legend()
plt.show()
# 自相关图
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data).show()
plt.show()
# ADF单位根检测方法
from statsmodels.tsa.stattools import adfuller as ADF
print('原始序列的ADF检验结果为:', ADF(data)) # 返回值依次为adf、p值等
# 差分操作
Date_data = data.diff().dropna()
plt.plot(Date_data) # 差分序列的时序图
plt.show()
#差分序列的平稳性检验
plot_acf(Date_data).show() # 自相关图
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(Date_data).show() # 偏自相关图
plt.show()
print('差分序列的ADF检验结果为:', ADF(Date_data)) #ADF单位根检测方法


# ==== 代码10-4.py ====

import pandas as pd
from pandas import datetime
file = './data/shampoo.csv' #该数据放置在data文件夹下,读者可以自行定义
#函数:日期数据解析
def date_parser (x):
    return datetime.strptime('190'+x, '%Y-%m')
data = pd.read_csv(file, header = 0, parse_dates = [0], index_col = 0, squeeze = True,
date_parser = date_parser)
# 时序图
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号
plt.plot(data)
plt.legend()
plt.show()
# 自相关图
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data).show()
plt.show()
# ADF单位根检测方法
from statsmodels.tsa.stattools import adfuller as ADF
print('原始序列的ADF检验结果为:', ADF(data)) # 返回值依次为adf、p值等
# 差分操作
Date_data = data.diff().dropna()
plt.plot(Date_data) # 差分序列的时序图
plt.show()
#差分序列的平稳性检验
plot_acf(Date_data).show() # 自相关图
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(Date_data).show() # 偏自相关图
plt.show()
print('差分序列的ADF检验结果为:', ADF(Date_data)) #ADF单位根检测方法
# 纯随机性检验
from statsmodels.stats.diagnostic import acorr_ljungbox
print('差分序列的白噪声检验结果为:', acorr_ljungbox(Date_data, lags = 1))
# 分别返回LB统计量和p值


# ==== 代码10-5.py ====

import pandas as pd
from pandas import datetime
file = './data/shampoo.csv' #该数据放置在data文件夹下,读者可以自行定义
#函数:日期数据解析
def date_parser (x):
    return datetime.strptime('190'+x, '%Y-%m')
data = pd.read_csv(file, header = 0, parse_dates = [0], index_col = 0, squeeze = True,
date_parser = date_parser)
# 时序图
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号
plt.plot(data)
plt.legend()
plt.show()
# 自相关图
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data).show()
plt.show()
# ADF单位根检测方法
from statsmodels.tsa.stattools import adfuller as ADF
print('原始序列的ADF检验结果为:', ADF(data)) # 返回值依次为adf、p值等
# 差分操作
Date_data = data.diff().dropna()
plt.plot(Date_data) # 差分序列的时序图
plt.show()
#差分序列的平稳性检验
plot_acf(Date_data).show() # 自相关图
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(Date_data).show() # 偏自相关图
plt.show()
print('差分序列的ADF检验结果为:', ADF(Date_data)) #ADF单位根检测方法
# 纯随机性检验
from statsmodels.stats.diagnostic import acorr_ljungbox
print('差分序列的白噪声检验结果为:', acorr_ljungbox(Date_data, lags = 1))
# 分别返回LB统计量和p值
# 模型定阶:相对最优模型法
#from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_model import ARIMA
data = data.astype(float)
pmax = int(len(Date_data) /10) # 阶数p不超过序列长度的1/10
qmax = int(len(Date_data) /10) # 阶数q不超过序列长度的1/10
bic_matrix = [] # BIC矩阵
for x in range(pmax + 1):
    tmp = []
    for y in range(qmax + 1):
        try: # 错误处理块
            tmp.append(ARIMA(data.values, order=(x,1,y)).fit().bic)
        except:
            tmp.append(None)
    bic_matrix.append(tmp)
bic_matrix = pd.DataFrame(bic_matrix)
print(bic_matrix)
p,q = bic_matrix.stack().idxmin() #找出最小值位置
print('BIC最小的p值和q值为:%s、%s' % (p,q))
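补充示例(非原书代码):确定阶数(p, 1, q)后,可用新版statsmodels的ARIMA接口拟合最终模型并向后预测若干期,示意如下:
from statsmodels.tsa.arima.model import ARIMA as ARIMA_new
final_model = ARIMA_new(data.values, order = (p, 1, q)).fit()
print(final_model.summary()) #模型参数与检验信息
print('未来5期的预测值:', final_model.forecast(5)) #向后预测5期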


# ==== 代码11-1.py ====

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets._samples_generator import make_blobs
#函数:生成数据集
def generate_data(n_normal = 500, n_anomaly = 20):
    X_normal, Y_normal = make_blobs(n_samples = n_normal, centers = [[0, 0]],
                                    cluster_std = 0.8, random_state = 5)
    X_anomaly = np.random.rand(n_anomaly, 2) * 10 - 5
    Y_anomaly = np.zeros(n_anomaly)
    X = np.vstack([X_normal, X_anomaly])
    Y = np.hstack([Y_normal, [1 for _ in range(X_anomaly.shape[0])]])
    return X, Y
#函数:计算数据在正态分布上的概率值
def multivariate_Gaussian(X, mu, sigma):
    d = len(mu) #特征维度
    X = X - mu.T #使用减法的副本,避免原地修改传入的样本
    cov_mat_inv = np.linalg.pinv(sigma)
    cov_mat_det = np.linalg.det(sigma)
    p = (np.exp(-0.5 * np.dot(X, np.dot(cov_mat_inv, X.T)))
         / (2. * np.pi) ** (d/2.) / np.sqrt(cov_mat_det))
    return p
#获得人工合成数据集
X, Y = generate_data()
#计算均值和协方差,设置全局阈值(经验给定)
mu = X.mean(axis = 0)
sigma = np.cov(X.T)
threshold = 0.0025
#计算每个训练样本的概率
pro = []
for i, _ in enumerate(X):
    p = multivariate_Gaussian(X[i], mu, sigma)
    pro += [p]
pro = np.array(pro)
#识别异常对象,并绘图显示
anomaly_index = (pro <= threshold)
plt.figure(figsize = (6, 4))
predict_anomaly = X[anomaly_index]
predict_normal = X[~anomaly_index]
plt.scatter(predict_normal[:, 0], predict_normal[:, 1],
s = 60, marker = 'o', alpha = 0.6)
plt.scatter(predict_anomaly[:, 0], predict_anomaly[:, 1],
s = 60, marker = 'x', c = 'r')
plt.grid(True, which = 'major', linestyle = '--', linewidth = 1)
plt.show()


# ==== 代码11-2.py ====

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets._samples_generator import make_blobs
#函数:生成数据集
def generate_data(n_normal = 500, n_anomaly = 20):
    X_normal, Y_normal = make_blobs(n_samples = n_normal, centers = [[0, 0]],
                                    cluster_std = 0.8, random_state = 5)
    X_anomaly = np.random.rand(n_anomaly, 2) * 10 - 5
    Y_anomaly = np.zeros(n_anomaly)
    X = np.vstack([X_normal, X_anomaly])
    Y = np.hstack([Y_normal, [1 for _ in range(X_anomaly.shape[0])]])
    return X, Y
#函数:计算数据在正态分布上的概率值
def multivariate_Gaussian(X, mu, sigma):
    d = len(mu) #特征维度
    X = X - mu.T #使用减法的副本,避免原地修改传入的样本
    cov_mat_inv = np.linalg.pinv(sigma)
    cov_mat_det = np.linalg.det(sigma)
    p = (np.exp(-0.5 * np.dot(X, np.dot(cov_mat_inv, X.T)))
         / (2. * np.pi) ** (d/2.) / np.sqrt(cov_mat_det))
    return p
#获得人工合成数据集
X, Y = generate_data()
#计算均值和协方差,设置全局阈值(经验给定)
mu = X.mean(axis = 0)
sigma = np.cov(X.T)
threshold = 0.0025
#计算每个训练样本的概率
pro = []
for i, _ in enumerate(X):
    p = multivariate_Gaussian(X[i], mu, sigma)
    pro += [p]
pro = np.array(pro)
#识别异常对象,并绘图显示
anomaly_index = (pro <= threshold)
plt.figure(figsize = (6, 4))
predict_anomaly = X[anomaly_index]
predict_normal = X[~anomaly_index]
plt.scatter(predict_normal[:, 0], predict_normal[:, 1],
s = 60, marker = 'o', alpha = 0.6)
plt.scatter(predict_anomaly[:, 0], predict_anomaly[:, 1],
s = 60, marker = 'x', c = 'r')
plt.grid(True, which = 'major', linestyle = '--', linewidth = 1)
plt.show()
from sklearn.cluster import DBSCAN
model = DBSCAN(eps = 0.4, min_samples = 4) # DBSCAN聚类建模
# 根据DBSCAN的聚类结果识别异常点(DBSCAN将-1类作为异常点的)
Y_pred = model.fit_predict(X)
anomaly_index = (Y_pred == -1)
X_anomaly = X[anomaly_index]
X_normal = X[~anomaly_index]
#绘图
plt.figure(figsize = (6, 4))
plt.scatter(X_normal[:, 0], X_normal[:, 1], s=60, marker = 'o', alpha = 0.6)
plt.scatter(X_anomaly[:, 0], X_anomaly[:, 1], s = 60, marker = 'x', c = 'r')
plt.grid(True, which = 'major', linestyle = '--', linewidth = 1)
plt.show()


# ==== 代码11-3.py ====

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('./creditcard.csv',encoding='gbk')
# 绘制柱状图,查看两个类别的数量
plt.rcParams['font.sans-serif'] = ['SimHei']
count_classes = pd.value_counts(data['Class'], sort = False)
plt.figure(figsize=(12,8))
plt.bar([0,1], count_classes, width=0.6)
plt.xticks([0,1], ['0','1'], fontsize=20)
plt.yticks(fontsize=20)
plt.title ("不同类别的数量",fontsize=20)
plt.xlabel ("Class",fontsize=20)
plt.ylabel ("Frequency",fontsize=20)
plt.show()
#查看是否有缺失值
print(data.isnull().sum())
#查看数据的描述性统计信息
print(data.describe())


# ==== 代码11-4.py ====

from pyod.models.iforest import IForest #孤立森林
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
#划分训练集和测试
X = data.iloc[:, data.columns != 'Class']
y = data.iloc[:, data.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.3,
random_state= 123456, stratify = y)
#创建IForest模型
iforest = IForest(n_estimators = 300, contamination = 0.00172)
iforest.fit(X_train)
# 得到测试结果
y_test_pred = iforest.predict(X_test) # 预测的类别标签
y_test_scores = iforest.decision_function(X_test) #预测的属于异常的概率
# 混淆矩阵绘图函数
def plot_confusion_matrix(cm, title = "Confusion Matrix"):
sns.set()
f,ax=plt.subplots()
sns.heatmap(cm, annot = True, ax = ax, cmap = "Blues", fmt = "4d")
ax.set_title("confusion matrix")
ax.set_xlabel("predict")
ax.set_ylabel("true")
plt.show()
#绘制混淆矩阵
cm = confusion_matrix(y_test, y_test_pred, labels = [0,1]) #用IForest的预测标签计算混淆矩阵
plot_confusion_matrix(cm)


# ==== 代码12-1.py ====

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm
#评分数据读取
file_path = './ml-1m/'
file_rating = open(file_path+'ratings.dat','r',encoding = "ISO-8859-1")
data = file_rating.read()
data = data.split('\n')
file_rating.close()
train_data, test_data = train_test_split(data, test_size = 0.2)
print('训练数据数据量:'+ str(len(train_data)))
print('测试数据数据量:'+ str(len(test_data)))
CF_matrix = np.zeros((6040,3952)) #最大用户ID为6040 ,最大电影ID为3952
for each_data in train_data:
    if len(each_data) == 0:
        continue #跳过split('\n')产生的空行(打乱后可能出现在任意位置)
    str_temp = each_data.split('::') #分割数据
    user_id_temp = int(str_temp[0]) - 1 #将用户ID从0开始编码
    movies_id_temp = int(str_temp[1]) - 1 #将电影ID从0开始编码
    rating_temp = int(str_temp[2]) #读取评分
    CF_matrix[user_id_temp][movies_id_temp] = rating_temp #填充矩阵
#余弦相似度函数
def sim_cosine(x, y):
    if np.linalg.norm(x) == 0 or np.linalg.norm(y) == 0:
        return 0
    return np.sum(x*y) / np.linalg.norm(x) / np.linalg.norm(y)
#计算用户相似度
print("计算用户相似度矩阵:")
user_cross_sim = np.zeros((6040, 6040)) #矩阵初始化
for i in tqdm(range(CF_matrix.shape[0])):
    for j in range(i, CF_matrix.shape[0]):
        user_cross_sim[i, j] = sim_cosine(CF_matrix[i, :], CF_matrix[j, :]) #使用余弦相似度
        user_cross_sim[j, i] = user_cross_sim[i, j]
#评分预测函数
def predict_rating(user_id, movies_id):
    sum_w = 0
    sum_w_rating = 0
    for i in range(6040):
        if CF_matrix[i, movies_id] != 0:
            sum_w += user_cross_sim[user_id, i]
            sum_w_rating += user_cross_sim[user_id, i]*CF_matrix[i, movies_id]
    if sum_w == 0:
        return 0 #该电影没有任何评分时返回0,避免除零
    return sum_w_rating / sum_w
mae = 0
#在测试集上对推荐模型的性能进行评价(MAE)
for i in range(len(test_data)):
    temp_i = test_data[i] #在测试集样本上评估
    if len(temp_i) == 0:
        continue #跳过空行
    str_temp = temp_i.split('::')
    user_id_temp = int(str_temp[0]) - 1
    movies_id_temp = int(str_temp[1]) - 1
    rating_temp = int(str_temp[2])
    rating_pre = predict_rating(user_id_temp, movies_id_temp)
    rating_pre = int(rating_pre + 0.5)
    mae += abs(rating_temp - rating_pre)
print('推荐模型的平均绝对误差:' + str(mae / (i + 1)))
#对用户1进行电影推荐
rating_pre_list = []
user_id_1 = 0
#预测用户1对全部电影的评分,按评分排序后选取其中未观看的前五部电影
for i in range(0, 3952): #最大电影ID为3952
    rating_pre = predict_rating(user_id_1, i)
    rating_pre_list.append(rating_pre)
#输出评分top5的未观看电影(按预测评分从高到低遍历电影编号)
order = np.argsort(rating_pre_list)[::-1]
top5 = []
for movies_id_temp in order:
    if CF_matrix[user_id_1, movies_id_temp] == 0: #仅推荐未观看的电影
        top5.append(int(movies_id_temp))
    if len(top5) >= 5:
        break
print("为用户1推荐的评分前五未观看电影为:", top5)



问题

1.ud_pearsonr(X, y)

# 封装的皮尔森相关系数计算函数
def ud_pearsonr(X, y):
    result = np.array([pearsonr(x, y) for x in X.T]) # 返回皮尔森相关系数, p值
    return np.absolute(result[:, 0]), result[:, 1]

这段代码定义了一个名为 ud_pearsonr 的函数,用于计算特征矩阵 X 中的每一列与目标变量 y 之间的皮尔森相关系数。以下是对代码的详细解释:

代码解释

# 封装的皮尔森相关系数计算函数
def ud_pearsonr(X, y):
    result = np.array([pearsonr(x, y) for x in X.T]) # 返回皮尔森相关系数, p值
    return np.absolute(result[:, 0]), result[:, 1]
1. 函数定义
def ud_pearsonr(X, y):
  • ud_pearsonr 是函数的名称。
  • X 是特征矩阵,形状为 (n_samples, n_features),其中 n_samples 是样本数量,n_features 是特征数量。
  • y 是目标变量,形状为 (n_samples,)
2. 计算皮尔森相关系数
result = np.array([pearsonr(x, y) for x in X.T])
  • X.T 是特征矩阵 X 的转置,形状为 (n_features, n_samples)。这样做是为了方便对每一列(即每个特征)进行操作。
  • pearsonr(x, y) 是 scipy.stats 模块中的函数,用于计算两个变量之间的皮尔森相关系数及其 p 值。
  • [pearsonr(x, y) for x in X.T] 是一个列表推导式,对 X.T 中的每一列 x 计算其与 y 的皮尔森相关系数和 p 值。
  • np.array(...) 将列表转换为 NumPy 数组,形状为 (n_features, 2),其中每行包含一个特征的皮尔森相关系数和 p 值。
3. 返回结果
return np.absolute(result[:, 0]), result[:, 1]
  • result[:, 0] 提取皮尔森相关系数,形状为 (n_features,)
  • np.absolute(...) 计算相关系数的绝对值,因为相关系数的正负只表示方向,绝对值表示相关性的强度。
  • result[:, 1] 提取 p 值,形状为 (n_features,)
  • 函数返回两个数组:一个是皮尔森相关系数的绝对值,另一个是 p 值。

总结

这个函数的主要作用是计算特征矩阵 X 中的每一列与目标变量 y 之间的皮尔森相关系数,并返回相关系数的绝对值和对应的 p 值。这些值可以用于特征选择,帮助识别哪些特征与目标变量最相关。
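下面给出一个调用示例(示意,非原文代码,假设已在同一环境中定义了 ud_pearsonr),用随机生成的数据演示该函数的输入输出:

import numpy as np
from scipy.stats import pearsonr
rng = np.random.RandomState(0)
X = rng.rand(100, 3) #100个样本、3个特征
y = 2 * X[:, 0] + 0.1 * rng.rand(100) #y主要由第1个特征决定
corr, p_values = ud_pearsonr(X, y)
print('各特征与y的皮尔森相关系数绝对值:', corr)
print('对应的p值:', p_values)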

