Statistical Application Development with R and Python

Overview of this book

Statistical analysis involves collecting and examining data to describe its nature. It helps you explore relationships in the data and build models that support better decisions. This book develops statistical concepts alongside R and Python, which are integrated from the very start. Almost every concept is accompanied by R code that illustrates the strengths of R and its applications, and the R programs are complemented by equivalent Python programs. You will first study data characteristics, descriptive statistics, and the exploratory attitude, which give you a firm footing in data analysis. Statistical inference then completes the technical foundation of statistical methods. Linear regression, logistic regression, and CART make up the essential toolkit that will help you solve complex real-world problems. You will begin with a brief look at the nature of data and end with modern, advanced statistical models such as CART. Every step is taken with data and R code, and is further enhanced by Python. The data analysis journey begins with exploratory analysis, which is more than simple descriptive data summaries. You will then apply linear regression modeling and finish with logistic regression, CART, and spatial statistics. By the end of this book, you will be able to apply what you have learned about statistics in major domains at work or in your own projects.

What this book covers

Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and a dataset. The need for statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of how to install R and Python and their related packages. Discrete and continuous random variables are discussed through introductory programs. The programs are available in both languages; they are expository in nature, so you do not need to follow them in detail.

Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simple examples. Importing data from external files in CSV, XLS, and other formats is elaborated next. Writing data and objects from R for use in other languages is considered, and the R material concludes with a discussion of R session management. Python basics, mathematical operations, and other essential operations are then explained. Reading data from external files in different formats is also illustrated, along with the required session management.

Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques such as the bar chart, dot chart, spine plot, mosaic plot, and fourfold plot for categorical data, and the histogram, box plot, and scatter plot for continuous/numeric data. A very brief introduction to ggplot2 is also provided. Similar plots are generated in both R and Python.

Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for the preliminary analysis of data. The visualization techniques of EDA, such as stem-and-leaf plots and letter values, and the modeling techniques of the resistant line, data smoothing, and median polish provide rich insight as a preliminary analysis step. This chapter is developed mainly in R.
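
As a rough illustration of one technique named above, the following Python sketch applies median polish to a small two-way table. It is not taken from the book, which the summary says works mainly in R for this chapter. The function name median_polish, the fixed iteration count, and the toy table are all assumptions made for this example. The idea is to sweep out row medians and column medians in turn, so that each cell is written as overall effect + row effect + column effect + residual.

```python
from statistics import median


def median_polish(table, iterations=10):
    """Decompose a two-way table into overall, row, and column effects plus residuals."""
    rows, cols = len(table), len(table[0])
    residuals = [list(row) for row in table]
    overall = 0.0
    row_effects = [0.0] * rows
    col_effects = [0.0] * cols

    for _ in range(iterations):
        # Sweep the median out of every row.
        for i in range(rows):
            m = median(residuals[i])
            row_effects[i] += m
            residuals[i] = [value - m for value in residuals[i]]
        # Move the median of the column effects into the overall effect.
        m = median(col_effects)
        overall += m
        col_effects = [c - m for c in col_effects]

        # Sweep the median out of every column.
        for j in range(cols):
            m = median(residuals[i][j] for i in range(rows))
            col_effects[j] += m
            for i in range(rows):
                residuals[i][j] -= m
        # Move the median of the row effects into the overall effect.
        m = median(row_effects)
        overall += m
        row_effects = [r - m for r in row_effects]

    return overall, row_effects, col_effects, residuals


# Toy table that is exactly additive: 10 + row effect + column effect.
table = [[10 + r + c for c in (-2, 0, 2)] for r in (-1, 0, 1)]
overall, row_effects, col_effects, residuals = median_polish(table)
```

For a table that is exactly additive, as here, the residuals vanish; with real data, large residuals point to cells that the additive fit does not explain.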

Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for parameters of interest are developed using functions defined for specific problems. The chapter also considers important statistical tests: the z-test and t-test for comparison of means, and chi-square tests and the F-test for comparison of variances. The reader will learn how to create new R and Python functions.
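
To make the likelihood idea concrete, here is a minimal Python sketch; it is not code from the book. It writes the binomial log-likelihood for a success probability, approximates the maximum likelihood estimate by a grid search, and builds a Wald confidence interval around the closed-form estimate. The function names and the data (7 successes in 20 trials) are made up for illustration.

```python
import math
from statistics import NormalDist


def binomial_log_likelihood(p, successes, trials):
    """Log-likelihood of success probability p, dropping the constant term."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def mle_by_grid(successes, trials, steps=999):
    """Approximate the MLE by evaluating the log-likelihood on a grid in (0, 1)."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda p: binomial_log_likelihood(p, successes, trials))


def wald_interval(successes, trials, level=0.95):
    """Wald confidence interval around the closed-form MLE successes / trials."""
    p_hat = successes / trials
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width


# Hypothetical data: 7 successes in 20 trials.
p_grid = mle_by_grid(7, 20)
lower, upper = wald_interval(7, 20)
```

The grid search lands on the same value as the closed-form estimate 7/20; the grid is used here only to show that the estimate is the maximizer of the likelihood.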

Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions, and these are verified using validation techniques. A model may be affected by a single observation, a single output value, or an explanatory variable. Statistical metrics that help remove one or more types of anomalies are discussed in depth. Given a large number of covariates, an efficient model is developed using model selection techniques. While the core stats package suffices in R, the statsmodels package is very useful in Python.

Chapter 7, The Logistic Regression Model, covers a model that is useful for classification when the output is a binary variable. Diagnostics and model validation through residuals are used, leading to an improved model. ROC curves are discussed next; they help in identifying a better classification model. The R packages pscl and ROCR are useful, while pysal and sklearn are useful in Python.
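
The ROC idea mentioned above can be sketched in plain Python without any of the packages listed. The snippet below is an illustration only, not the book's code. It assumes you already have predicted probabilities from some fitted classifier; the labels and scores are invented, and roc_points and auc are hypothetical helper names.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step classifies more cases as positive.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented example: 1 = event, 0 = non-event, with predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

When comparing two classifiers on the same data, the one whose curve lies closer to the top-left corner, and so has the larger area, separates the two classes better.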

Chapter 8, Regression Models with Regularization, discusses the problem of overfitting, which arises from the use of the models developed in the previous two chapters. Ridge regression significantly reduces the probability of an overfit model, and the development of natural spline models also lays the basis for the models considered in the next chapter. Regularization in R is achieved using the packages ridge and MASS, while sklearn and statsmodels help in Python.
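
A small Python sketch can show how a ridge penalty shrinks a coefficient. It is a deliberately simplified case, assumed here for illustration and not drawn from the book: one centered predictor, no intercept, and invented data. In that case the slope that minimizes sum((y - b*x)^2) + penalty * b^2 has the closed form sum(x*y) / (sum(x*x) + penalty). The name ridge_slope is hypothetical.

```python
def ridge_slope(x, y, penalty):
    """Slope minimizing sum((y - b*x)^2) + penalty * b^2 for a no-intercept model."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


# Invented, centered data lying roughly on the line y = 2x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

# A penalty of 0 gives ordinary least squares; larger penalties pull the slope toward 0.
slopes = {penalty: ridge_slope(x, y, penalty) for penalty in (0.0, 1.0, 10.0)}
```

The same shrinkage happens with many predictors, where the penalty is added to the diagonal of the cross-product matrix before solving for the coefficients.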

Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using raw R functions, and the final trees are also reproduced using rudimentary code, leading to a clear understanding of the CART mechanism. The pruning procedure is illustrated in one of the languages, and the reader is encouraged to work out the equivalent in the other.

Chapter 10, CART and Beyond, considers two enhancements to CART: bagging and random forests. A consolidation of all the models from Chapter 6, Linear Regression Analysis, to Chapter 10, CART and Beyond, is also provided through a dataset. Ensemble methods are fast emerging as a very effective and popular machine learning technique, and implementing them in both languages will improve the user's confidence.
