Follow these steps to complete this recipe:
- Import the required libraries:
import pandas as pd
from sklearn import neighbors, metrics
from sklearn.metrics import (
    roc_auc_score, classification_report, precision_recall_fscore_support,
    confusion_matrix, precision_score, roc_curve,
    precision_recall_fscore_support as score)
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
- Import the data:
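# Query the BreastCancer table into a Spark DataFrame
df = spark.sql("select * from BreastCancer")
# Collect the result to the driver as a pandas DataFrame
pdf = df.toPandas()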
- Split the data:
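# Keep 'diagnosis' in X: the formula API reads the response
# from the same DataFrame that is passed as data
X = pdf
y = pdf['diagnosis']
# Hold out 30% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=40)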
- Create the formula:
cols = pdf.columns.drop('diagnosis')
formula = 'diagnosis ~ ' + ' + '.join(cols)
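To see what the formula string looks like, the following minimal sketch builds one from a plain list of column names. The helper name build_formula and the example column names are hypothetical and used only for illustration; they are not taken from the recipe's table. The sketch also wraps any name that is not a valid Python identifier (for example, one containing a space) in patsy's Q("...") quoting, on the assumption that such names may occur in your data:

```python
def build_formula(columns, response="diagnosis"):
    """Build a patsy-style formula: response ~ col1 + col2 + ..."""
    terms = []
    for name in columns:
        if name == response:
            continue  # the response belongs on the left-hand side only
        # Names with spaces or other non-identifier characters must be
        # quoted with Q("...") so the formula parser reads them as one term.
        terms.append(name if name.isidentifier() else f'Q("{name}")')
    return f"{response} ~ " + " + ".join(terms)


# Hypothetical column names, for illustration only
example_columns = ["diagnosis", "radius_mean", "concave points_mean"]
example_formula = build_formula(example_columns)
print(example_formula)
```

For a table whose column names are all valid identifiers, this produces the same string as the join used in the recipe.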
- Train the model:
model = smf.glm(formula=formula, data=X_train,
                family=sm.families.Binomial())
logistic_fit = model.fit()
- Test our model:
predictions...