Course Introduction

Canhong Wen

Course Description

  • This course covers aspects of numerical analysis for statisticians and data scientists (including matrix inversion, function optimization, cross validation, and bootstrap) with an emphasis on implementing these methods in R and Python.
  • Important language-specific tools and computation strategies, such as vectorization, code profiling, and data visualization, will also be covered.
  • Class examples, homework, and projects will be completed in R or Python using Rmarkdown or Jupyter notebooks.
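As a small illustration of the profiling idea mentioned above, the sketch below times an explicit Python loop against the built-in `sum` using the standard-library `timeit` module. The function names `loop_sum` and `builtin_sum` and the input size are assumptions made for this example, not course material; the same compare-then-time pattern applies to vectorized NumPy or R code.

```python
import timeit

def loop_sum(values):
    """Add the values one at a time in an explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total

def builtin_sum(values):
    """Delegate the loop to the built-in sum, which runs in compiled code."""
    return sum(values)

data = list(range(100_000))

# Both approaches must agree before their speed is compared.
assert loop_sum(data) == builtin_sum(data)

# timeit calls each function `number` times and returns the total seconds.
loop_time = timeit.timeit(lambda: loop_sum(data), number=20)
builtin_time = timeit.timeit(lambda: builtin_sum(data), number=20)
print(f"loop: {loop_time:.4f}s, built-in: {builtin_time:.4f}s")
```

Checking that the two versions return the same result before timing them is a useful habit: a faster implementation is only worth having if it is still correct.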


Learning Outcomes

By the end of this course, students will be able to:

  • Understand common data structures in R and Python (vectors, matrices, arrays, lists, data frames) and their strengths and weaknesses.
  • Design and implement simulation studies in R and Python.
  • Produce reproducible research reports with clear, well-documented R and Python code.
  • Work with data in R and Python, including importing, preprocessing, and plotting data, and performing basic statistical inference such as linear regression and hypothesis testing.
  • Implement some basic statistical algorithms, such as Newton's algorithm for optimization, the bootstrap, and cross validation.
  • Build your own R package or Python module.
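To give a flavor of the algorithms listed above, here is a minimal sketch of Newton's method for one-dimensional optimization. The function name `newton_minimize`, its parameters, and the test function f(x) = exp(x) - 2x are assumptions chosen for illustration; they are not taken from the course material.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a stationary point of a univariate function by Newton's method."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)   # Newton step: f'(x) / f''(x)
        x -= step
        if abs(step) < tol:        # stop once the update is negligible
            break
    return x

# Hypothetical test function f(x) = exp(x) - 2x, which is minimized at x = log(2).
x_min = newton_minimize(grad=lambda x: math.exp(x) - 2,
                        hess=lambda x: math.exp(x),
                        x0=0.0)
print(x_min, math.log(2))
```

The sketch assumes the second derivative is nonzero along the path; a practical implementation would guard against that and against divergence.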

Textbooks

There is no required textbook for this course. Lectures will be based on material from the following sources.

  1. Geof Givens and Jennifer Hoeting, Computational Statistics (Second Edition)
  2. Maria Rizzo, Statistical Computing with R
  3. Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
  4. Joseph Adler, R in a Nutshell (Second Edition)
  5. Eric Matthes, Python Crash Course: A Hands-On, Project-Based Introduction to Programming (Chinese edition: Python编程:从入门到实践)
  6. Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (Chinese edition: 利用Python进行数据分析)

Other online resources:

  1. The official intro, An Introduction to R, available online in PDF
  2. Quick-R
  3. Rmarkdown reference and cheatsheet
  4. A Byte of Python (English edition) or Python简明教程 (Chinese edition)
  5. Computational Statistics in Python

Course Mechanics and Grading(I)

There will be an optional biweekly in-class lab, homework nearly every week, a midterm project, and a final exam. Grades will be calculated as follows:

  • Homework: 40%
    • You must submit your assignment using Rmarkdown or Jupyter notebooks.
    • Every file you submit should have a file name that includes your first and last name, for example CanhongWen_Hw1.rmd or CanhongWen_Hw1.ipynb, accompanied by a PDF file CanhongWen_Hw1.pdf or an HTML file CanhongWen_Hw1.html.
  • Project: 30% (See next slide)
  • Final exam: 30%
    • The final exam is semi-open-book, but communicating with other people is strictly prohibited.
  • 5 bonus points if you enter a Kaggle competition and rank within the top 30%, with an additional 2 points for every further 10% improvement in rank.
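The weighting above amounts to a simple weighted sum, sketched below. The function name `course_grade`, the 0 to 100 score scale, and the cap at 100 are assumptions made for this example; only the weights come from the grading scheme.

```python
# Weights taken from the grading scheme above (they sum to 1).
WEIGHTS = {"homework": 0.40, "project": 0.30, "final_exam": 0.30}

def course_grade(scores, bonus=0.0):
    """Return the weighted total of component scores (each out of 100).

    `bonus` is added on top; capping the total at 100 is an assumption.
    """
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return min(total + bonus, 100.0)

# Hypothetical component scores for one student.
example = {"homework": 90, "project": 80, "final_exam": 70}
print(course_grade(example))           # weighted total without bonus
print(course_grade(example, bonus=5))  # with a 5-point Kaggle bonus
```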

Course Mechanics and Grading(II)

Project: 30%

  • Two or three students per group. State your own contribution clearly.
  • Each group will cooperate on writing code, documenting it, writing a report, and making a presentation on the project (within 10 mins) during the final exam period.
  • The topic of the report / presentation may be
    1. reproduction / extension of a simulation study
    2. development or implementation of a statistical model or computing algorithm
    3. analysis of a dataset from Kaggle or another source
    4. ...

Getting Started with R and Python

Why learn R and Python?(I)

Survey of Kagglers finds Python, R to be preferred tools

Why learn R and Python?(II)

The rankings varied according to the job title of the respondent.

  • R: Business Analyst, Data Analyst, Data Miner, Operations Researcher, Predictive Modeler, Statistician
  • Python: Computer Scientist, Data Scientist, Engineer, Machine Learning Engineer, Other, Programmer, Researcher, Scientist, Software Developer

What is R?

  • Software for Statistical Data Analysis
  • Free and Open Source Software (https://cran.r-project.org/)
  • Based on S
  • Programming Environment for Data Storage, Analysis, Graphing
  • Interpreted Language

The R Console

Basic interaction with R is by typing in the console, a.k.a. terminal or command-line

You type in commands, R gives back answers (or errors)

Menus and other graphical interfaces are extras built on top of the console

R

RStudio

R

R Packages

  • R packages are collections of functions and data sets developed by the community.
    • A package will include code (not only R code!), documentation for the package and the functions inside, some tests to check everything works as it should, and data sets.
  • The official repository (CRAN) had 14,670 published packages as of Aug. 2, 2019, and many more are publicly available on the internet.
getOption("defaultPackages")  # packages attached by default at startup
(.packages())                 # packages currently attached

Load an R package before you use it.

library(rpart)
fit <- rpart(Kyphosis ~  Age + Number + Start, data = kyphosis)
require(rpart)
fit <- rpart(Kyphosis ~  Age + Number + Start, data = kyphosis)

What is the difference between library and require?

Install R Packages

  • Install from CRAN
    {r}
    install.packages("BeSS")
  • Install from other resources through devtools

    {r}
    library("devtools")
    install_github("tidyverse/ggplot2") # install from GitHub (needs "user/repo")
    install_bioc("AnnotationDbi")       # install from Bioconductor
  • Install locally in some directory (downloaded from author's homepage)

    {r}
    install.packages("l0tf_0.1.0.tar.gz", repos = NULL, type = "source")

Getting Help and Learning More

  • If you know the function name, start with the help documentation within R.
    {r}
    help("t.test")
    ?t.test
  • If you get stuck, try Google.
    • Typically adding R to a query is enough to restrict it to relevant results
    • Particularly useful for error messages. (If the error message isn't in English, run Sys.setenv(LANGUAGE = "en") and re-run the code.)
  • If Google doesn't help, try Stack Overflow.

Python

Python is rapidly becoming the preferred language of data scientists in both industry and academia. It’s used by Google, Facebook and other tech giants to perform data analysis and run machine learning algorithms that can handle hundreds of thousands of terabytes of data per day.

Python can be used for:

  • Storing and analyzing large and small datasets.
  • Web scraping and data collection using APIs.
  • Beautiful data visualization.
  • Natural language processing and text analysis.
  • General machine learning.
  • Deep learning.
  • Image analysis and much, much more…

Important libraries

  • numpy: Numerical Python. Basic library for numerical analysis
  • pandas: panel data. Provides the DataFrame structure, similar to R's data.frame
  • matplotlib: data visualization
  • scikit-learn: machine learning, including classification, regression (Lasso), clustering, PCA, model selection
  • statsmodels: statistical modeling including regression, ANOVA, time series, density estimation

Install Python via anaconda distribution

After installing Anaconda, we have

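  • Anaconda Navigator: graphical interface for launching applications and managing environments and packages.
  • Anaconda Prompt: command-line shell with the conda environment activated.
  • Jupyter Notebook: starts the Jupyter server and opens a browser tab connected to it. Similar to Rmarkdown.
  • Spyder: integrated development environment. Similar to RStudio.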


Create a new environment in Anaconda


Spyder

  • Scientific PYthon Development EnviRonment.
  • Powerful Python IDE with advanced editing, interactive testing, debugging, and introspection features.


Jupyter Notebook

The Jupyter notebook is an interactive, web-based environment that allows one to combine code, text and graphics into one unified document.

  • Developed from IPython.
  • It also supports multiple kernels (different languages, including Julia, Python, and R).

The Jupyter notebook has three types of cells:

  • Code
  • Markdown
  • Raw NBConvert
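The three cell types can be seen in the notebook file itself, which is stored as JSON. The sketch below builds a tiny notebook as a plain dictionary; the layout follows the nbformat 4 schema as commonly documented, and the helper name `make_cell` and the cell contents are assumptions for illustration. The `nbformat` package is the more robust way to create real notebooks.

```python
import json

def make_cell(cell_type, source):
    """Build one notebook cell as a plain dict in the nbformat 4 layout."""
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        # Code cells also carry an execution counter and a list of outputs.
        cell["execution_count"] = None
        cell["outputs"] = []
    return cell

notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        make_cell("markdown", "# A title written in Markdown"),
        make_cell("code", "2 + 3"),
        make_cell("raw", "Text passed through unchanged by nbconvert"),
    ],
}

# Serialize to the JSON text that would be stored in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
print([cell["cell_type"] for cell in notebook["cells"]])
```

Only code cells have outputs, which is why the other two types carry no execution counter or output list.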

For more details, see this online Reference Guide


Code in Jupyter Notebook


In [15]:
2+3
Out[15]:
5
In [16]:
import time, sys    # Import packages time and sys
for i in range(8):  # for loop
    print(i)        # print the number of iteration
    time.sleep(0.5) # sleep 0.5 second before executing the procedure
0
1
2
3
4
5
6
7

Markdown in Jupyter Notebook


List in Markdown


  1. Ordered list 1
    1. Ordered list 1.1
    2. Ordered list 1.2
  2. Ordered list 2
  3. Ordered list 3
  • Bulleted Lists 1
    • Bulleted Lists 1.1
    • Bulleted Lists 1.2
  • Bulleted Lists 2

Style and Emphasis

*Italics*
_Italics_
**Bold**
__Bold__
***Bold and Italics***
___Bold and Italics___
~~strikeout~~

Italics

Italics

Bold

Bold

Bold and Italics

Bold and Italics

strikeout

Inserting Tables in Markdown

Header | Header | Header | Header
-|-|-|-
Cell | Cell | Cell | Cell
Cell | Cell | Cell | Cell
Cell | Cell | Cell | Cell
Cell | Cell | Cell | Cell

Centered, Right-Justified, and Regular Cells and Headers:

centered header | regular header | right-justified header | centered header | regular header
:-:|-|-:|:-:|-
centered cell|regular cell|right-justified cell|centered cell|regular cell
centered cell|regular cell|right-justified cell|centered cell|regular cell
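The alignment of each column is set by the separator row of a Markdown table: `:-:` for centered, `-:` for right-justified, and `-` for regular. The sketch below generates such a table from Python lists; the function name `markdown_table` and the `ALIGN` mapping are assumptions made for this example.

```python
# Alignment markers used in the separator row of a Markdown table.
ALIGN = {"center": ":-:", "right": "-:", "regular": "-"}

def markdown_table(headers, rows, aligns):
    """Return a pipe-delimited Markdown table as a single string."""
    lines = [
        " | ".join(headers),                 # header row
        "|".join(ALIGN[a] for a in aligns),  # separator row sets alignment
    ]
    lines += ["|".join(row) for row in rows]  # one line per data row
    return "\n".join(lines)

table = markdown_table(
    headers=["centered header", "regular header", "right-justified header"],
    rows=[["centered cell", "regular cell", "right-justified cell"]] * 2,
    aligns=["center", "regular", "right"],
)
print(table)
```

Pasting the printed text into a Markdown cell should render it as a three-column table with the requested alignments.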

Inserting Hyperlinks

Inserting Images

Inserting an image is almost identical to inserting a link. You just also type a ! before the first set of brackets:

![jupyterinMarkdown](./jupyterinMarkdown.png)
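The relationship between the two syntaxes can be sketched with two small helpers, where an image is simply a link with a leading exclamation mark. The function names `md_link` and `md_image` are hypothetical and introduced only for this illustration.

```python
def md_link(text, target):
    """Markdown hyperlink: [text](target)."""
    return f"[{text}]({target})"

def md_image(alt_text, path):
    """Markdown image: the link syntax with a leading exclamation mark."""
    return "!" + md_link(alt_text, path)

print(md_link("CRAN", "https://cran.r-project.org/"))
print(md_image("jupyterinMarkdown", "./jupyterinMarkdown.png"))
```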


Including Code Examples


lm()

import time, sys    # Import packages time and sys
for i in range(8):  # for loop
    print(i)        # print the iteration number
    time.sleep(0.5) # sleep 0.5 second before the next iteration

LaTeX Math

Jupyter Notebook's Markdown cells support LaTeX for formatting mathematical equations. To tell Markdown to interpret your text as LaTeX, surround your input with dollar signs like this:

$z=\dfrac{2x}{3y}$

$z=\dfrac{2x}{3y}$

$$2x+3y=z$$

$$2x+3y=z$$

Raw NBConvert in Jupyter Notebook

  • Unlike all other Jupyter Notebook cells, Raw NBConvert cells have no input-output distinction.
  • They are mainly used to create examples.
This is Raw NBConvert output:
centered header | regular header | right-justified header | centered header | regular header
:-:|-|-:|:-:|-
centered cell|regular cell|right-justified cell|centered cell|regular cell
centered cell|regular cell|right-justified cell|centered cell|regular cell

This is Raw NBConvert output:

centered header regular header right-justified header centered header regular header
centered cell regular cell right-justified cell centered cell regular cell
centered cell regular cell right-justified cell centered cell regular cell

R Markdown

R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both

  • save and execute code
  • generate high quality reports that can be shared with an audience

Installation

{r}
install.packages("rmarkdown")


R code chunks in R Markdown
