Review Notes
This part is essentially unchanged. It carries over the content of the 2022 edition, so revisiting the key question types here is sufficient.
Especially recommended for review:
• Q5-1

In addition, there are new example questions:
• Q010-1. An internal credit scoring model, named ALPHA, was created by a former employee using 25 defined features; you are asked to review it and make recommendations on model performance improvements. The ALPHA model compared expected and actual defaults over the past 12 months.
Model Prediction: Default / No Default
Actual Result: Default / No Default
Prediction Result: 1/0 (1 is a correct prediction; 0 is an incorrect prediction)

(1) The best description of the ALPHA model is that it is an example of a(n):
A. logistic regression model.
B. unsupervised machine learning model.
C. classification and regression tree (CART) model.

Explanation: A. The output takes the 1/0 form, so this is a logit (logistic regression) model. Labeled inputs exist, so the model is supervised, ruling out B. CART applies to tree structures, whereas the stem describes a single-layer structure, ruling out C.
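To make the logic of answer A concrete, here is a minimal synthetic sketch (not the actual ALPHA model; the data and the 0.5 threshold are illustrative assumptions) showing how logistic regression turns feature inputs into a default probability and then a 1/0 class:

```python
# Minimal sketch: logistic regression produces P(default), which is
# thresholded into a binary 1/0 prediction, matching the question stem.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 25))                        # 25 features, as in the stem
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)  # synthetic default flag

model = LogisticRegression().fit(X, y)
prob_default = model.predict_proba(X)[:, 1]           # probability in [0, 1]
pred = (prob_default >= 0.5).astype(int)              # thresholded 1/0 output
print(pred[:10])
```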
(2) For records modeled with correct predictions and errors:
Actual Result: 7,018
Prediction Result: 5,851
Type I Error: 273
Type II Error: 894

The model was able to correctly predict a default in 5,290 instances of the model prediction dataset after the completed data wrangling. The precision of the model is closest to:
A. 75.4%.
B. 85.5%.
C. 95.1%.

Explanation: C. P = TP/(TP + FP) = 5,290/(5,290 + 273) ≈ 95.1%.
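The arithmetic, using only the figures from the stem (TP = 5,290 correct default predictions, FP = 273 Type I errors, FN = 894 Type II errors), can be checked directly; note that recall reproduces distractor B:

```python
# Precision/recall/F1 from the confusion-matrix counts in part (2).
TP, FP, FN = 5_290, 273, 894

precision = TP / (TP + FP)              # 5,290 / 5,563
recall    = TP / (TP + FN)              # 5,290 / 6,184
f1        = 2 / (1 / precision + 1 / recall)

print(f"precision = {precision:.1%}")   # 95.1% -> choice C
print(f"recall    = {recall:.1%}")      # 85.5% -> distractor B
print(f"F1        = {f1:.1%}")
```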
(3) A colleague mentions concerns about how long it takes the ALPHA model to complete its recommendations, and the team discusses several potential methods to reduce its computation time. The most appropriate method to resolve the computation problem is:
A. Use principal components analysis to reduce the number of dimensions.
B. Decrease the learning rate in the algorithm to reduce overall computational requirements in the model.
C. Apply winsorization to the existing data to remove extreme values and outliers and replace them with predetermined values for the minimum and maximum of known outliers.

Explanation: A. Principal components analysis reduces the number of variables needed to explain the variation in the data, cutting the computation time required to run the model, instead of using every parameter for every record. For B, decreasing the learning rate would actually increase the computational requirement, because it increases the number of iterations the model must run to learn the specified target. For C, winsorization is used to manage outlier scenarios by replacing individual outliers with the smallest or largest non-outlier data points, effectively raising the endpoints of the distribution curve; the data points still exist, so winsorization has no impact on the model's computational requirement.
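A minimal sketch of answer A, assuming synthetic correlated data and an illustrative 95% explained-variance target (neither is from the question): PCA compresses the 25 original features into a handful of composite components before model training, which is what cuts computation time.

```python
# Sketch: PCA dimensionality reduction ahead of model training.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(1_000, 5))                     # 5 hidden drivers
X = latent @ rng.normal(size=(5, 25)) \
    + 0.1 * rng.normal(size=(1_000, 25))                 # 25 correlated features

X_std = StandardScaler().fit_transform(X)                # PCA is scale-sensitive
pca = PCA(n_components=0.95)                             # keep 95% of the variance
X_reduced = pca.fit_transform(X_std)

print(X.shape, "->", X_reduced.shape)                    # far fewer columns to train on
```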
• Q010-2. A database from a large national weather provider contains detailed weather data (temperature, humidity, rainfall, atmospheric pressure, etc.) at a very localized geographic level, recorded by GPS coordinates for the past 36 months. The database contains a reference note that some geographic areas had their sensors upgraded to capture additional metrics, including a field identifying when each upgrade occurred.
(1) The type of error least likely to be generated by the weather dataset reference note is:
A. invalidity.
B. incompleteness.
C. non-uniformity error.

Explanation: A. An invalidity error means the data falls outside a meaningful range, rendering it invalid. Here, the sensors were upgraded to collect additional information, not to correct previous records. Incompleteness errors do exist (fields from before the upgrade contain no data), as do non-uniformity errors (the data granularity changes).
(2) Many of the included data fields would likely be highly irrelevant to the analysis, so the analyst begins the process of selecting a subset of data fields that he believes are applicable. The selection of a subset of data from the weather dataset is best described as:
A. trimming.
B. feature selection.
C. feature engineering.

Explanation: B. The process of identifying and removing unneeded, irrelevant, or redundant features from a dataset is called feature selection, which matches the question. For A, trimming is a process for handling outliers in a dataset by simply deleting extreme values, also known as truncation. For C, feature engineering is the process of combining, consolidating, or creating new features that do not exist in the current weather dataset.
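A minimal sketch of feature selection as described in answer B, i.e., keeping only the fields believed relevant and dropping the rest. The column names are hypothetical and not from the weather dataset:

```python
# Sketch: feature selection = choosing a subset of existing columns.
import pandas as pd

weather = pd.DataFrame({
    "temperature": [21.4, 19.8],
    "humidity": [0.63, 0.71],
    "rainfall": [0.0, 4.2],
    "sensor_serial_no": ["A-17", "B-02"],       # likely irrelevant to the analysis
    "station_paint_color": ["white", "white"],  # likely irrelevant to the analysis
})

relevant = ["temperature", "humidity", "rainfall"]  # analyst's chosen subset
subset = weather[relevant]                          # selection, not engineering
print(subset.columns.tolist())
```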
Knowledge Review
A few small knowledge points need to be memorized again:
• The 4 Vs of big data are Volume, Variety, Velocity, and Veracity. Veracity concerns the credibility and reliability of data sources.
• The workflows for structured-data ML and text ML are very similar, with a few subtle differences:
  – Structured: ① conceptualization; ② data collection; ③ data preparation and wrangling; ④ data exploration; ⑤ model training
  – Text: ① text problem formulation; ② data (text) curation; ③ text preparation and wrangling; ④ data exploration; ⑤ model training
  – The data (text) curation step is where data is collected with web crawlers (do not misfile content belonging to ④ under ②).
• Distinguish the six types of data errors. For example, a non-uniformity error means the data is not presented in an identical format, while an inconsistency error means the data conflicts with other data or with reality, such as a field filled with 0 that should not be 0.
• Distinguish the five types of transformations. For example, the transformation that derives a new variable through various calculations is called extraction.
• Distinguish normalization from standardization: normalization is (X − min)/(max − min); standardization is (X − μ)/σ. (A minimal sketch of both appears after this list.)
• The "four removals" of text preparation and wrangling: remove HTML tags, punctuation, numbers, and white space. Note that certain punctuation must be replaced with annotation characters, such as /percentSign/.
• Tokenization is the process of splitting a given text into separate tokens, breaking the data into collections of words.
• Text normalization includes lowercasing, stop-word removal, stemming, and lemmatization. The resulting tokens are concise and may carry underscores, e.g., sale_decreas.
• Once normalization is complete, build the bag-of-words (BOW): the set of distinct tokens across all texts in the sample dataset.
• Term frequency: TF of a word = count of that word / total count of words collected. A word cloud can display the most informative words in the dataset by their TF values. Words with very high or very low frequency are noise features: the very high ones are stop words and the very low ones are sparse terms; both should be removed.
• Feature selection mitigates overfitting; feature engineering mitigates underfitting.
• Confusion matrix: P = TP/(TP + FP) (focuses on Type I error); R = TP/(TP + FN) (focuses on Type II error); A = (TP + TN)/(TP + FP + TN + FN); F1 = 2/(1/P + 1/R) (harmonic mean). To weight FP and FN equally, use F1. When a table of scores with a threshold p is given, classify each observation against the threshold and compare with the actual class: a correctly predicted 1 is a TP, a correctly predicted 0 is a TN, and incorrect predictions are FP or FN respectively.
• For class imbalance, undersample the majority class and oversample the minority class.
• Cross-validation set: using K-fold cross-validation improves the base model's overall accuracy in predicting actual events. The higher the AUC on the cross-validation set, the stronger the model's generalization. If the cross-validation prediction error is far larger than the training prediction error, the model is overfitted and insufficiently regularized; adding a penalty term via LASSO can mitigate the overfitting. (A LASSO/K-fold sketch appears after this list.)
• TF-IDF (treated in more depth here):
  – TF at the sentence level is the count of a given word in a sentence divided by the total number of words in that sentence. DF, document frequency, is the number of documents (i.e., sentences) containing the given word divided by the total number of sentences.
  – IDF, inverse document frequency, is a relative measure of how unique a given word is across the entire corpus; it is not directly related to the size of the corpus. Formula: IDF = log(1/DF).
  – TF-IDF multiplies the sentence-level TF by the word's IDF: TF-IDF = TF × IDF. A higher TF-IDF value indicates a word appearing more frequently in fewer documents, i.e., a relatively more unique, important word; a low TF-IDF value indicates a word appearing across many documents. TF-IDF values are useful for measuring key terms across a compilation of documents and can serve as word feature values for training an ML model. (A from-scratch BOW/TF-IDF sketch appears after this list.)
  – TF-IDF values vary with the number of documents, so model performance may differ when the measure is applied to datasets with only a small number of documents.
• If a dataset has no ground truth, this indicates an unsupervised model, which needs no training set.
• Model fitting errors (bias error and variance error) are used for tuning; performance evaluation generally relies on error analysis (the confusion matrix), ROC/AUC, and RMSE.
• One-hot encoding: decompose a feature with multiple values into one feature per category, recorded as 1/0. (A one-hot sketch appears after this list.)

For other knowledge points, see:
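First, the two rescaling formulas from the list side by side; the sample values are illustrative:

```python
# Sketch: normalization (X - min)/(max - min) vs standardization (X - mu)/sigma.
import numpy as np

x = np.array([10.0, 12.0, 15.0, 20.0, 43.0])

normalized   = (x - x.min()) / (x.max() - x.min())  # squeezed into [0, 1]
standardized = (x - x.mean()) / x.std()             # mean 0, std 1

print(normalized.round(3))
print(standardized.round(3))
```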
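Next, a from-scratch sketch of the token-frequency ideas: tokenization, bag-of-words, sentence-level TF, DF, IDF = log(1/DF), and TF-IDF. The three toy sentences are illustrative assumptions, not curriculum data:

```python
# Sketch: BOW and TF-IDF computed by hand on three toy sentences.
import math

sentences = [
    "sales decreased this quarter",
    "sales increased last quarter",
    "profit increased",
]
tokens = [s.split() for s in sentences]           # tokenization
bow = set(w for t in tokens for w in t)           # bag-of-words: distinct tokens

def tf(word, sent_tokens):
    # word count in the sentence / total words in the sentence
    return sent_tokens.count(word) / len(sent_tokens)

def df(word):
    # sentences containing the word / total sentences
    return sum(word in t for t in tokens) / len(tokens)

for word in sorted(bow):
    idf = math.log(1 / df(word))                  # rarer across sentences -> higher IDF
    print(word, [round(tf(word, t) * idf, 3) for t in tokens])
```

Running this shows, for instance, that "quarter" (in 2 of 3 sentences) gets a lower IDF than "profit" (in 1 of 3), matching the point that high TF-IDF flags relatively unique words.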
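Then the cross-validation and LASSO bullets: a synthetic sketch comparing the train-versus-CV error gap of a weakly and a more strongly penalized LASSO fit. The data and alpha values are illustrative assumptions:

```python
# Sketch: 5-fold CV error vs training error, with an L1 (LASSO) penalty
# as the regularization lever against overfitting.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))                    # many features, few informative
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

for alpha in (0.001, 0.1):
    model = Lasso(alpha=alpha, max_iter=10_000)
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    train_mse = ((model.fit(X, y).predict(X) - y) ** 2).mean()
    print(f"alpha={alpha}: train MSE {train_mse:.3f}, 5-fold CV MSE {cv_mse:.3f}")
```

A training error far below the CV error signals overfitting; increasing the penalty narrows the gap, which is exactly the diagnosis described in the list.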
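Finally, the one-hot encoding bullet; the rating column is a hypothetical example:

```python
# Sketch: one multi-valued feature becomes one 1/0 column per category.
import pandas as pd

ratings = pd.DataFrame({"rating": ["AAA", "BB", "AAA", "C"]})
print(pd.get_dummies(ratings, columns=["rating"], dtype=int))
```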
(End)