---
project:
  output-dir: docs
title: "Accelerator Crash Course - R"
subtitle: "Center of Excellence for Women & Technology, Indiana University Bloomington"
author: "Jeffery (Shih-Chieh) Wang"
date: "April 11, 2026"
format:
  html:
    toc: true
    toc-depth: 3
    toc-location: left
    page-layout: full
    number-sections: true
    theme: minty
    highlight-style: tango
    code-fold: show
    code-tools: true
    code-copy: true
    code-overflow: wrap
    smooth-scroll: true
    anchor-sections: true
    self-contained: true
  pdf:
    documentclass: article
execute:
  echo: true
  message: false
  warning: false
---
# Course Overview {.unnumbered}
- **Learning materials:** [jeffery-wang.com/r-basics/](https://jeffery-wang.com/r-basics/)
- **Duration:** 4 hours (10:30–12:00; 13:00–15:30)
- **Goal:** By the end of this course, you will know the basics of R and be able to import data, manipulate and tidy data sets with tidyverse tools, create visualizations, and use generative AI to accelerate your R learning.
- **Prerequisites:** None! This course is designed for true beginners.
- **What you need:**
  - A laptop with R and RStudio installed
  - An internet connection (for installing packages and AI tools)
## Schedule {.unnumbered}
**10:30–11:30**\
[Module 1: Getting Started with R & RStudio](#module1)
**11:30–12:00**\
[Module 2: Importing Data (readr & readxl)](#module2)
**12:00–13:00**\
*Lunch Break*
**13:00–14:00**\
[Module 3: Data Manipulation with dplyr](#module3)
**14:00–15:30**\
[Module 4: Data Tidying with tidyr](#module4)\
[Module 5: Visualization with Base R](#module5)\
[Module 6: Leveraging AI](#module6)
------------------------------------------------------------------------
# Module 1: Getting Started {#module1}
## Learning Objectives
- Navigate the RStudio interface (console, script editor, environment, files)
- Understand basic R concepts: objects, assignment, data types
- Install and load packages
- Understand what the tidyverse is
## The RStudio Interface
RStudio is the most popular IDE (integrated development environment: one application for editing, running, testing, and debugging code) for the R language. It has four main panes:
1. **Source Editor** (top-left) — where you write and save scripts
2. **Console** (bottom-left) — where R runs commands
3. **Environment** (top-right) — shows your data and objects
4. **Files/Plots/Help** (bottom-right) — file browser, plot viewer, documentation
- Note: You can customize the layout under `Tools` → `Global Options` → `Pane Layout`.
## R Basics
### Basic Calculations in R
- One of the easiest ways to start using R is to treat it like a calculator.
- R can perform basic arithmetic operations such as addition, subtraction, multiplication, division, and exponents.
```{r}
#| label: module1-basics1
# Some examples:
2 + 3 # addition
10 - 4 # subtraction
6 * 5 # multiplication
12 / 3 # division
2^3 # exponent
(2 + 3) * 4 # parentheses change the order of operations
```
### Common R Objects
Objects are the basic building blocks for storing data in R. Five common types of R objects are vectors, factors, matrices, lists, and data frames.
*Figure: Common R Objects*
```{r}
#| label: module1-basics2
# Assignment: use <- to store values in objects
my_name <- "World"
my_number <- 42
# Print values
my_name
my_number
# Vectors: a collection of values of the same type
ages <- c(25, 30, 35, 40, 45)
names <- c("Alice", "Bob", "Carol", "Dave", "Eve")
# Basic operations on vectors
mean(ages)
length(ages)
# Basic data types
class(my_name) # character
class(my_number) # numeric
class(1) # numeric
class(TRUE) # logical (can be TRUE or FALSE)
# Example about checking logical values
is_adult <- 20 >= 18
class(is_adult) # logical
is_adult # TRUE, because it stores the result of "20 >= 18"
is_tall <- 150 > 180
is_tall
```
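The chunk above only builds vectors. As a minimal sketch of the other four object types named at the start of this section, the example below uses small made-up values; the object names (`sizes`, `m`, `person`, `df`) are illustrative only.

```{r}
#| label: module1-objects-sketch
# Factor: categorical data with a fixed set of levels
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))
levels(sizes)

# Matrix: a two-dimensional grid of values of the same type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)

# List: a container whose elements can have different types
person <- list(name = "Alice", age = 25, is_student = TRUE)
person$age

# Data frame: a table whose columns are equal-length vectors
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
class(df)
```

Data frames get the most attention in the rest of this course because imported data almost always arrives in that shape.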
- Note: These comparison operators create logical results (`TRUE` or `FALSE`)
- `== # equal to`
- `!= # not equal to`
- `> # greater than`
- `< # less than`
- `>= # greater than or equal to`
- `<= # less than or equal to`
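As a short sketch of these operators in action, each comparison below is applied element by element to a vector and returns one logical value per element. The vector `ages` repeats the values used earlier in this module.

```{r}
#| label: module1-comparison-sketch
ages <- c(25, 30, 35, 40, 45)

ages == 30   # TRUE only where the value is exactly 30
ages != 30   # the opposite of the line above
ages >= 35   # TRUE for 35, 40, and 45

# Logical results can be counted or used to pick elements
sum(ages >= 35)   # counts the TRUE values
ages[ages < 35]   # keeps only the elements where the condition is TRUE
```

The same idea powers `filter()` in Module 3: a condition produces `TRUE`/`FALSE` for every row, and only the `TRUE` rows are kept.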
### Installing and Loading the Tidyverse
```{r}
#| label: module1-tidyverse
#| eval: false
# Install the tidyverse (only need to do this once)
install.packages("tidyverse")
```
```{r}
#| label: module1-load
# Load the tidyverse (do this every session)
library(tidyverse)
```
The tidyverse is a collection of packages that share a common design philosophy:
- **readr** — reading data files (CSV, TSV, etc.)
- **dplyr** — data manipulation (filter, select, mutate, summarize)
- **tidyr** — reshaping and tidying data
- **ggplot2** — data visualization
- **stringr** — string manipulation
- **forcats** — working with categorical data (factors)
- **purrr** — functional programming
- **tibble** — modern data frames
### What Is a Data Frame?
```{r}
#| label: module1-dataframe
# A data frame is a table — the fundamental data structure for analysis
# Let's look at a built-in dataset in ggplot2 (part of the tidyverse)
mpg
# Useful functions for exploring data frames
dim(mpg) # rows x columns
names(mpg) # column names
glimpse(mpg) # compact overview
head(mpg) # first 6 rows
```
- A more common approach is to store the data in your own object, because you will most likely import your own data:
```{r}
#| label: module1-dataframe2
# Store the dataset in your own object
data_mpg <- mpg
glimpse(data_mpg) # compact overview
summary(data_mpg) # summarize your data variables (basic stats, types, NAs)
dim(data_mpg) # rows x columns
names(data_mpg) # column names
head(data_mpg) # first 6 rows
# Let's see the car company list
length(data_mpg$manufacturer) # number of rows, not number of companies
unique(data_mpg$manufacturer) # each company listed once
length(unique(data_mpg$manufacturer)) # number of distinct companies
# And all the car models
unique(data_mpg$model)
sort(unique(data_mpg$model))
```
- `dbl` stands for double: a double-precision floating-point number, the type R uses for real numbers, including those with decimal points.
- Tibble (modern) vs Dataframe
- *Figure: Tibble vs Dataframe*\
Source: [www.r-bloggers.com](https://www.r-bloggers.com/2021/02/4-ways-to-make-data-frames-in-r/)
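One way to see the difference is to build the same small table both ways. This is a minimal sketch with made-up values; it assumes the tidyverse (which provides `tibble()`) is installed, and the names `df` and `tb` are illustrative.

```{r}
#| label: module1-tibble-sketch
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- tibble(x = 1:3, y = c("a", "b", "c"))

class(df)   # "data.frame"
class(tb)   # "tbl_df" "tbl" "data.frame": a tibble is still a data frame

# Printing a tibble shows its dimensions and each column's type
tb

# Convert between the two when a function expects one or the other
as_tibble(df)
as.data.frame(tb)
```

Because a tibble is still a data frame underneath, base R functions such as `dim()` and `names()` work on both.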
------------------------------------------------------------------------
## Exercise 1.1
```{r}
#| label: ex1-1
#| eval: false
# 1. Create a vector called 'fruits' containing: "apple", "banana", "cherry"
# 2. Create a vector called 'prices' with values: 1.20, 0.50, 2.75
# 3. Find the mean price
```
## Exercise 1.2
```{r}
#| label: ex1-2
#| eval: false
# 1. Load the tidyverse and explore the 'diamonds' dataset using glimpse()
# 2. How many rows and columns?
# 3. What are the variable names?
# 4. What does 'cut' contain?
```
# Module 2: Importing Data {#module2}
## Learning Objectives
- Read CSV files into R with `read_csv()`
- Understand file paths and working directories
- Inspect imported data for issues
- Know about other import functions (Excel, TSV, etc.)
## Key Concepts
### Your Working Directory
```{r}
#| label: module2-wd
#| eval: false
# Where is R looking for files?
getwd()
# You can set it in RStudio: Session > Set Working Directory > Choose Directory
# Or use code:
setwd("/path/to/your/folder") # Replace the path with the actual location of your file
list.files()
```
### Reading CSV Files
```{r}
#| label: module2-csv
# Reading a CSV file
# Replace the path with the actual location of your file
survey_data <- read_csv("sample_survey_data.csv")
# Look at the result
survey_data
glimpse(survey_data)
summary(survey_data)
```
### Understanding the Output
When `read_csv()` imports data, it tells you:
- How many rows and columns it found
- What type it guessed for each column (character, double, integer, etc.)
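To inspect the guessed types, or to override a guess, a sketch along the following lines may help. It writes a tiny made-up CSV to a temporary file so that it runs anywhere; the column names `id`, `score`, and `city` are hypothetical and not taken from the course data.

```{r}
#| label: module2-coltypes-sketch
library(readr)

# Write a small example file (made-up data) to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score,city", "101,4.5,Bloomington", "102,3.0,Indianapolis"), tmp)

# Default import: read_csv() guesses a type for each column
guessed <- read_csv(tmp, show_col_types = FALSE)
spec(guessed)   # shows the column types that were guessed

# Override a guess: treat id as a text label rather than a number
fixed <- read_csv(tmp, col_types = cols(id = col_character()))
class(fixed$id)
```

Overriding is mostly useful for identifiers such as ID numbers or ZIP codes, which look numeric but are never used in arithmetic.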
### Common Import Issues
```{r}
#| label: module2-issues
#| eval: false
# If your CSV uses semicolons instead of commas (common in Europe)
data <- read_csv2("european_file.csv")
# If your file uses tabs (Tab-Separated Values)
data <- read_delim("file.tsv", delim = "\t")
# Reading Excel files (need the readxl package)
library(readxl)
data <- read_excel("file.xlsx")
data <- read_excel("file.xlsx", sheet = "Sheet2") # specific sheet
```
- Example of a European semicolon-separated CSV (`read_csv2()`)
This format is the standard in many European countries where the **comma** is already used for decimals (e.g., `1,50` instead of `1.50`). To separate the columns, these files use a **semicolon** instead.
``` text
Country;GDP_Growth;Inflation_Rate
Germany;1,5;2,1
France;1,2;1,8
Italy;0,9;2,3
```
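As a minimal sketch, the example file above can be written to a temporary location and read back with `read_csv2()`. The object names `tmp` and `europe` are illustrative.

```{r}
#| label: module2-csv2-sketch
library(readr)

# Recreate the example file shown above
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Country;GDP_Growth;Inflation_Rate",
  "Germany;1,5;2,1",
  "France;1,2;1,8",
  "Italy;0,9;2,3"
), tmp)

# read_csv2() expects ';' between columns and ',' as the decimal mark
europe <- read_csv2(tmp, show_col_types = FALSE)
europe
europe$GDP_Growth   # numeric: 1,5 was read as 1.5
```

Reading the same file with `read_csv()` would put everything into a single text column, which is usually the first sign that the wrong function was used.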
- Example of Tab-Separated Values
In a "raw" view, tabs often look like big chunks of whitespace. In reality, each one is a single invisible character (`\t`). TSVs are convenient because real data rarely contains a tab character, so a stray delimiter is less likely to break the import.
``` text
ID Observation Status
101 High Pressure Stable
102 Low Temperature Critical
103 Ambient Stable
```
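The sketch below rebuilds the example with explicit `\t` characters so the column boundaries are unambiguous, then reads it with `read_tsv()`. The object names `tmp` and `readings` are illustrative.

```{r}
#| label: module2-tsv-sketch
library(readr)

# "\t" inside a string is one tab character
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "ID\tObservation\tStatus",
  "101\tHigh Pressure\tStable",
  "102\tLow Temperature\tCritical",
  "103\tAmbient\tStable"
), tmp)

readings <- read_tsv(tmp, show_col_types = FALSE)
readings
readings$Observation   # spaces inside a value do not split the column
```

`read_tsv()` is a shortcut for `read_delim(file, delim = "\t")`, so either form works.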
### Working with Built-in Datasets
```{r}
#| label: module2-builtin
# R comes with many practice datasets
# The tidyverse adds even more
data(mpg) # fuel economy data
data(diamonds) # diamond prices
data(starwars) # Star Wars characters
# View the starwars dataset
starwars
```
## Exercise 2.1
``` text
# 1. Download the sample_survey_data.csv file to your computer
```
- Download here: [Link](https://www.dropbox.com/scl/fi/s2eem5x1awd0kxnrof6s8/sample_survey_data.csv?rlkey=u53se53dad02ho50xd10wp3dh&dl=0)
``` text
# 2. Read it into R using read_csv() and save it as 'survey'
# 3. Use summary() to examine the structure
# 4. How many rows and columns does it have?
# 5. What are the column names?
# 6. What variables have missing values?
```
------------------------------------------------------------------------
# Module 3: Data Manipulation with dplyr {#module3}
## Learning Objectives
- Chain operations with the pipe `|>` (or `%>%`)
- Sort data with `arrange()` (note: `sort()` is the base R function for sorting vectors)
- Select columns with `select()`
- Filter rows with `filter()`
- Create new columns with `mutate()`
- Summarize data with `summarize()` and `group_by()`
## Key Concepts
### The Pipe: `|>` (or `%>%`)
The pipe takes the output of one function and feeds it as the first argument to the next function. Think of it as "and then."
```{r}
#| label: module3-pipe
# Without the pipe (nested, hard to read)
head(arrange(filter(mpg, manufacturer == "toyota"), desc(hwy)), 5)
# With the pipe (read left to right, top to bottom)
mpg |>
filter(manufacturer == "toyota") |>
arrange(desc(hwy)) |>
head(5)
# arrange() for sorting
arrange(mpg, hwy) # hwy (highway) ascending, small to large
arrange(mpg, desc(hwy)) # descending, large to small
```
### select() — Choose Columns
```{r}
#| label: module3-select
names(mpg)
glimpse(mpg)
mpg
# Select specific columns by name
mpg |>
select(manufacturer, model, year, hwy) # Note: a pipeline only prints its result; to keep the result, assign it with `<-`.
# For example:
mpg_small <- mpg %>% select(manufacturer, model, year, hwy) # %>% works the same way as |> here
# Or without a pipe:
mpg.s <- select(mpg, manufacturer, model, year, hwy)
# Remove columns with -
mpg |>
select(-fl, -class)
# Select a range of columns
mpg |>
select(manufacturer:year)
# Helper functions
mpg |>
select(starts_with("c")) # columns starting with "c"
mpg |>
select(where(is.numeric)) # only numeric columns
```
### filter() — Choose Rows
```{r}
#| label: module3-filter
# Filter rows that meet a condition
mpg |>
filter(manufacturer == "audi")
# Multiple conditions (AND)
mpg |>
filter(manufacturer == "audi", year == 2008)
# OR conditions
mpg |>
filter(manufacturer == "audi" | manufacturer == "bmw")
# A shortcut for multiple OR on the same column
mpg |>
filter(manufacturer %in% c("audi", "bmw", "toyota")) # This requires knowing your data well (exact spelling and case)
# Numeric comparisons
mpg |>
filter(hwy > 30)
mpg |>
filter(hwy >= 25, cty >= 20)
```
### mutate() — Create or Modify Columns
```{r}
#| label: module3-mutate
# Create a new column
mpg |>
mutate(avg_mpg = (cty + hwy) / 2) # adds avg_mpg as the last column
mpg |>
mutate(avg_mpg = (cty + hwy) / 2) |>
select(manufacturer, model, cty, hwy, avg_mpg) # select() controls which columns appear and in what order
# Modify an existing column
mpg |>
mutate(manufacturer = str_to_title(manufacturer)) |> # str_to_title() converts text to title case, e.g., "toyota" to "Toyota"
select(manufacturer, model)
# Create multiple columns at once
mpg |>
mutate(
avg_mpg = (cty + hwy) / 2,
fuel_efficiency = if_else(avg_mpg > 25, "Good", "Average") # if_else: if avg_mpg > 25, then "Good", otherwise "Average"
) |>
select(manufacturer, model, avg_mpg, fuel_efficiency)
mpg |>
mutate(
avg_mpg = (cty + hwy) / 2,
fuel_efficiency = if_else(
avg_mpg > 30, "Excellent",
if_else(avg_mpg > 25, "Good", "Average")
) # If avg_mpg > 30, then "Excellent"; otherwise, if avg_mpg > 25, then "Good"; otherwise "Average"
) |>
select(manufacturer, model, avg_mpg, fuel_efficiency) |>
arrange(desc(avg_mpg)) # sort by the numeric column; sorting the text labels would order them alphabetically
```
### summarize() and group_by() — Aggregate Data
```{r}
#| label: module3-summarize
# Summarize the entire dataset
mpg |>
summarize(
avg_hwy = mean(hwy), # avg_hwy, the column name created; mean() is the function
max_hwy = max(hwy), # maximum highway MPG
count = n() # n() in dplyr means the number of rows in the current data.
)
# Group by a variable, then summarize
mpg |>
group_by(manufacturer) |>
summarize(
avg_hwy = mean(hwy),
count = n()
) |>
arrange(desc(avg_hwy))
# This code groups the data by manufacturer, so all cars from the same
# manufacturer fall into one group, whether they are the same model or not.
# It then calculates the average highway MPG (`avg_hwy`) for each manufacturer
# and counts the rows in each group, so `count` shows how many rows in the
# dataset belong to each manufacturer (for example, how many Honda cars).
# Group by multiple variables
mpg |>
group_by(manufacturer, year) |> # one group for each manufacturer-and-year combination
summarize(
avg_hwy = mean(hwy),
count = n()
)
```
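When you group by more than one variable, `summarize()` drops only the last grouping level, so the result above is still grouped by `manufacturer` and dplyr prints a message saying so. A minimal sketch of one way to get a plain, ungrouped result is to add `.groups = "drop"`; the object name `by_make_year` is illustrative.

```{r}
#| label: module3-groups-sketch
library(dplyr)
library(ggplot2)   # provides the mpg dataset

by_make_year <- mpg |>
  group_by(manufacturer, year) |>
  summarize(avg_hwy = mean(hwy), count = n(), .groups = "drop")

by_make_year
is_grouped_df(by_make_year)   # FALSE: the grouping has been removed
```

Dropping the groups avoids surprises later, because any further `mutate()` or `summarize()` on a grouped result would still be computed per manufacturer.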
### count() — A Handy Shortcut
```{r}
#| label: module3-count
# Count occurrences of each value
mpg |>
count(manufacturer, sort = TRUE)
# or
mpg |>
count(manufacturer) |>
arrange(desc(n)) # for ascending order, use arrange(n)
# Count combinations
mpg |>
count(manufacturer, class, sort = TRUE)
```
## Exercise 3.1: dplyr Practice
```{r}
#| label: ex3-1
#| eval: false
# Using the starwars dataset, answer these questions:
# 1. How many characters are from Tatooine? (filter + count or nrow)
# 2. What are the names and heights of the 5 tallest characters?
# (select, arrange, head)
# 3. Create a new column called bmi = mass / (height/100)^2
# Who has the highest BMI? (mutate, arrange)
# 4. What is the average height of characters grouped by species?
# Only show species with more than 1 character. (group_by, summarize, filter)
# 5. How many characters have blue eyes? (filter, count)
```
------------------------------------------------------------------------
# Module 4: Data Tidying with tidyr {#module4}
## Learning Objectives
- Understand the concept of "tidy data"
- Reshape data from wide to long with `pivot_longer()`
- Reshape data from long to wide with `pivot_wider()`
- Handle missing values with `drop_na()` and `replace_na()`
- Separate and unite columns
## Key Concepts
### What is Tidy Data?
Tidy data has three rules:
1. Each **variable** has its own **column**
2. Each **observation** has its own **row**
3. Each **value** has its own **cell**
### pivot_longer() — Wide to Long
```{r}
#| label: module4-longer
# Create a wide dataset (common in spreadsheets)
grades_wide <- tibble(
student = c("Alice", "Bob", "Carol"),
math = c(85, 92, 78),
science = c(90, 88, 95),
english = c(88, 76, 82)
)
grades_wide
# or traditional dataframe
grades_wide.d <- data.frame(
student = c("Alice", "Bob", "Carol"),
math = c(85, 92, 78),
science = c(90, 88, 95),
english = c(88, 76, 82)
)
grades_wide.d
# Pivot to long format
grades_long <- grades_wide |>
pivot_longer(
cols = math:english, # columns to pivot (a range); to pick columns selectively, use c(), e.g., c(math, science)
# cols = c(math, science, english), # equivalent: name every column
names_to = "subject", # new column for the old column names
values_to = "score" # new column for the values
)
grades_long # This long format is the preferred ("tidy") structure
```
### pivot_wider() — Long to Wide
```{r}
#| label: module4-wider
# Convert back to wide format
grades_long |>
pivot_wider(
names_from = subject,
values_from = score
)
```
### Handling Missing Values - drop_na() and replace_na()
```{r}
#| label: module4-missing
names(starwars)
# The starwars dataset has many missing values
starwars |>
select(name, height, mass, hair_color) |>
head(10)
# Drop rows with ANY missing value
starwars |>
select(name, height, mass) |>
drop_na()
# Drop rows with missing values in specific columns only
starwars |>
select(name, height, mass) |>
drop_na(mass)
# Replace missing values
starwars |>
select(name, hair_color) |>
mutate(hair_color = replace_na(hair_color, "unknown")) |> # If `mutate()` uses the name of an existing column, it replaces that column with the updated values.
head(10)
```
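Before dropping or replacing anything, it often helps to count how many values are missing in each column. The sketch below shows one tidyverse way and one base R way; the object name `na_counts` is illustrative.

```{r}
#| label: module4-na-count-sketch
library(dplyr)

# Count the missing values in a few columns of starwars
na_counts <- starwars |>
  summarize(across(c(height, mass, hair_color), ~ sum(is.na(.x))))
na_counts

# Base R alternative for the same columns
colSums(is.na(starwars[c("height", "mass", "hair_color")]))
```

`is.na()` returns `TRUE` for each missing value, and `sum()` counts the `TRUE`s, so each result is the number of `NA`s in that column.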
### separate() and unite()
```{r}
#| label: module4-separate
# Create example data with combined columns
contacts <- tibble(
name = c("Alice Smith", "Bob Jones", "Carol Lee"),
phone = c("555-1234", "555-5678", "555-9012")
)
# Separate name into first and last
contacts |>
separate(name, into = c("first_name", "last_name"), sep = " ") # into = ... is part of the required syntax of separate() when you want to split one column into multiple new columns.
# Unite columns
contacts |>
separate(name, into = c("first_name", "last_name"), sep = " ") |>
unite("full_name", last_name, first_name, sep = ", ")
```
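In tidyr 1.3.0 and later (an assumption about your installed version), `separate()` still works but is marked as superseded in favor of `separate_wider_delim()`. A minimal sketch of the same split with the newer function, repeating the `contacts` data from above:

```{r}
#| label: module4-separate-wider-sketch
library(tidyr)
library(tibble)

contacts <- tibble(
  name  = c("Alice Smith", "Bob Jones", "Carol Lee"),
  phone = c("555-1234", "555-5678", "555-9012")
)

# delim is the text to split on; names are the new column names
contacts_split <- contacts |>
  separate_wider_delim(name, delim = " ", names = c("first_name", "last_name"))
contacts_split
```

Either function is fine for this course; the newer one mainly gives clearer errors when a value has too many or too few pieces.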
## Exercise 4.1: Tidying Data
```{r}
#| label: ex4-1
#| eval: false
# 1. Create this messy dataset:
sales <- tibble(
product = c("Widget", "Gadget", "Doohickey"),
Q1_2024 = c(150, 200, 80),
Q2_2024 = c(175, 180, 95),
Q3_2024 = c(200, 210, 110),
Q4_2024 = c(225, 190, 130)
)
# 2. Pivot it to long format with columns: product, quarter, revenue
# 3. Separate the 'quarter' column into 'quarter' and 'year'
# 4. Which product had the highest total revenue? (group_by + summarize)
# 5. Which quarter had the highest average revenue across all products?
```
------------------------------------------------------------------------
# Module 5: Visualization with Base R {#module5}
Base R plotting comes built into every R installation — no packages needed. The functions are simple and direct: `plot()`, `hist()`, `boxplot()`, `barplot()`. You call a function, you get a picture.
```{r}
#| label: setup
# Read and quick-clean the survey data
survey <- read.csv("sample_survey_data.csv")
summary(survey)
# Replace blank department with "Unknown"
survey$department[survey$department == ""] <- "Unknown" # [ ] is used to access or replace elements in a vector or column.
# Remove rows where years_experience is missing
survey <- survey[!is.na(survey$years_experience), ]
str(survey)
nrow(survey)
```
We have `r nrow(survey)` rows with clean departments and no missing values in `years_experience`. Let's plot.
------------------------------------------------------------------------
## Four Plots You'll Use All the Time
### Scatter Plot — `plot()`
The workhorse of base R graphics. Give it an x and a y and it draws points.
```{r}
#| label: scatter-basic
#| fig-width: 7
#| fig-height: 5
# Basic scatter plot (two quantitative variables)
plot(survey$years_experience, survey$annual_salary)
```
That works, but it's ugly. Let's add labels, color, and a trend line.
```{r}
#| label: scatter-polished
#| fig-width: 7
#| fig-height: 5
# Color points by department
unique(survey$department)
dept_colors <- c(
Engineering = "steelblue",
HR = "tomato",
Marketing = "forestgreen",
Sales = "darkorange",
Unknown = "gray50"
)
head(colors()) # preview the first few named colors; colors() alone lists all of them
point_cols <- dept_colors[survey$department]
# It looks at each department in survey$department, matches it to the corresponding named color in dept_colors, and returns a color vector in the same row order as survey.
# Create a scatter plot of annual salary against years of experience.
# Each point is colored according to department using point_cols.
# pch = 19 uses solid circles, and cex = 1.3 makes the points slightly larger.
# xlab and ylab set the axis labels, and main adds the plot title.
plot(
survey$years_experience, survey$annual_salary,
col = point_cols,
pch = 19, # solid circles (pch = 19)
cex = 1.3, # point size
xlab = "Years of Experience",
ylab = "Annual Salary ($)",
main = "Experience vs. Salary"
)
# Common ones:
# pch = 1: open circle
# pch = 15: filled square
# pch = 16: filled circle
# pch = 17: filled triangle
# pch = 19: solid circle
# Add a trend line
abline(
lm(annual_salary ~ years_experience, data = survey),
lty = 2, col = "gray30", lwd = 2
)
# lty = 1: solid line
# lty = 2: dashed line
# lty = 3: dotted line
# lty = 4: dot-dash line
# lty = 5: long-dash line
# lty = 6: two-dash line
# Add a legend
legend("topleft",
legend = names(dept_colors),
col = dept_colors,
pch = 19,
cex = 0.8,
bty = "n") # no legend box
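# bty values for legend(): "o" (box, the default) or "n" (no box)
# For the box around a whole plot, e.g. plot(..., bty = "l"),
# the box resembles the character you pass:
# bty = "o": full box
# bty = "l": left and bottom only
# bty = "7": top and right only
# bty = "c": right side removed
# bty = "u": top removed
# bty = "]": left side removed
# bty = "n": no box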
```
::: callout-tip
**Key arguments for `plot()`:** `pch` = point shape (19 = solid circle), `cex` = size multiplier, `col` = color, `lwd` = line width, `lty` = line type (1 = solid, 2 = dashed).
:::
### Histogram — `hist()`
Shows how a single numeric variable is distributed.
```{r}
#| label: hist
#| fig-width: 7
#| fig-height: 4.5
hist(survey$annual_salary,
breaks = 8, # suggested number of bins
col = "steelblue",
border = "white",
main = "Salary Distribution",
xlab = "Annual Salary ($)",
ylab = "Number of Employees")
```
::: callout-tip
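**Choosing `breaks`:** `breaks` controls roughly how many bins are drawn (R treats a single number as a suggestion and may pick a nearby round value). More bins show more detail but look noisier. Try `breaks = 5`, then `breaks = 15`, and see what tells the clearest story.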
:::
### Box Plot — `boxplot()`
Compares the distribution of a number across groups (Quantitative vs Categorical).
```{r}
#| label: boxplot
#| fig-width: 8
#| fig-height: 4.5
boxplot(annual_salary ~ department,
data = survey,
# col = c("steelblue", "tomato", "forestgreen",
# "darkorange", "gray80"),
main = "Salary Distribution by Department",
xlab = "",
ylab = "",
las = 1) # horizontal y-axis labels
```
The box is the middle 50% (IQR). The thick line is the median. Dots beyond the whiskers are potential outliers.
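To connect the picture to numbers, the sketch below computes the same quantities for a small made-up vector (`x` is hypothetical data, not the survey file).

```{r}
#| label: boxplot-stats-sketch
# Made-up salaries (in thousands), with one unusually high value
x <- c(42, 45, 48, 50, 52, 55, 58, 60, 95)

median(x)                    # the thick line inside the box
quantile(x, c(0.25, 0.75))   # approximately the bottom and top of the box
IQR(x)                       # the height of the box

# boxplot.stats() returns what boxplot() draws:
# $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
# $out   = values drawn as individual points (potential outliers)
boxplot.stats(x)
```

By default a value is drawn as a separate point when it lies more than 1.5 times the box height beyond the box, which is why 95 is flagged here.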
### Bar Plot — `barplot()`
Shows counts or pre-calculated values as bars.
```{r}
#| label: barplot-count
#| fig-width: 7
#| fig-height: 4.5
# Count respondents per department
dept_counts <- table(survey$department)
barplot(sort(dept_counts),
col = "steelblue",
main = "Respondents by Department",
ylab = "Count",
las = 1,
border = NA) # cleaner look without borders
```
For **pre-calculated values** (like averages), compute first then plot:
```{r}
#| label: barplot-avg
#| fig-width: 7
#| fig-height: 4.5
# Average salary by department
avg_sal <- tapply(survey$annual_salary, survey$department, mean)
avg_sal <- sort(avg_sal)
# Horizontal bar plot
barplot(avg_sal,
horiz = TRUE,
col = "steelblue",
main = "Average Salary by Department",
xlab = "Average Annual Salary ($)",
las = 1,
border = NA)
```
::: callout-tip
**`table()` and `tapply()`, your bar plot friends:** `table(x)` counts how many times each value occurs. `tapply(value, group, function)` applies a function (like `mean`) to each group, which makes its result a natural input for `barplot()`.
:::
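A minimal sketch of both functions on made-up data may make the difference concrete. The vectors `dept` and `salary` are hypothetical and stand in for two columns of a data frame.

```{r}
#| label: table-tapply-sketch
# Made-up data: one department label and one salary per person
dept   <- c("HR", "Sales", "HR", "Engineering", "Sales", "HR")
salary <- c(50000, 60000, 54000, 80000, 64000, 52000)

dept_counts <- table(dept)                  # how many people per department
dept_means  <- tapply(salary, dept, mean)   # average salary per department

dept_counts
dept_means
```

Both results are named by department, which is why `barplot()` can label the bars without any extra work.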
## Polish Your Plots
### Titles and Labels
Every base R plot function accepts these arguments:
| Argument | What it does | Example |
|----------|-------------------------|--------------------------|
| `main` | Title above the plot | `main = "My Title"` |
| `sub` | Subtitle below the plot | `sub = "Source: survey"` |
| `xlab` | X-axis label | `xlab = "Age"` |
| `ylab` | Y-axis label | `ylab = "Salary ($)"` |
### Colors
You can use color names (`"steelblue"`, `"tomato"`) or hex codes (`"#2C73D2"`). Run `colors()` in R to see all 657 built-in color names.
```{r}
#| label: color-demo
#| fig-width: 7
#| fig-height: 4
hist(survey$satisfaction_score,
breaks = 10,
col = "#2C73D2",
border = "white",
main = "Satisfaction Score Distribution",
xlab = "Score (1–5)",
ylab = "Count")
```
### Adding Extra Elements
```{r}
#| label: extras
#| fig-width: 7
#| fig-height: 5
plot(survey$age, survey$annual_salary,
pch = 19,
col = "steelblue",
cex = 1.2,
xlab = "Age",
ylab = "Annual Salary ($)",
main = "Age vs. Salary")
# Trend line
abline(lm(annual_salary ~ age, data = survey),
col = "tomato", lwd = 2, lty = 2)
# Horizontal reference line at the mean salary
abline(h = mean(survey$annual_salary),
col = "gray50", lty = 3)
# Label the reference line
text(x = 26, y = mean(survey$annual_salary) + 2500,
labels = paste("Mean:", round(mean(survey$annual_salary))),
cex = 0.8, col = "gray40")
# Add a grid for readability
grid(col = "gray90")
```
#### Multi-Panel Layouts
```{r}
#| label: multi-panel
#| fig-width: 8
#| fig-height: 6
par(mfrow = c(2, 2)) # 2 rows, 2 columns
hist(survey$annual_salary, col = "steelblue", border = "white",
main = "Salary Distribution", xlab = "Salary ($)")
hist(survey$satisfaction_score, col = "tomato", border = "white",
main = "Satisfaction Distribution", xlab = "Score")
boxplot(annual_salary ~ department, data = survey,
col = "steelblue", main = "Salary by Dept", las = 2, cex.axis = 0.7)
plot(survey$years_experience, survey$annual_salary,
pch = 19, col = "steelblue", cex = 0.9,
main = "Experience vs. Salary", xlab = "Years", ylab = "Salary ($)")
abline(lm(annual_salary ~ years_experience, data = survey), lty = 2)
par(mfrow = c(1, 1)) # reset to single panel
```
------------------------------------------------------------------------
## Saving Plots
```{r}
#| label: save
#| eval: false
getwd()
# Simplest method: use the Export button in the RStudio "Plots" tab
# Save to PNG
png("salary_scatter.png", width = 700, height = 450, res = 150)
plot(survey$years_experience, survey$annual_salary,
pch = 19, col = "steelblue",
xlab = "Experience", ylab = "Salary",
main = "Experience vs. Salary")
abline(lm(annual_salary ~ years_experience, data = survey), lty = 2)
dev.off() # IMPORTANT: closes the file
# You do not need an existing PNG file first. `png()` creates a new image file, and the plot is saved into it when `dev.off()` is called.
# Save to PDF (great for publications)
pdf("salary_scatter.pdf", width = 7, height = 4.5)
# ... same plot code ...
dev.off()
```
::: callout-warning
**Don't forget `dev.off()`!** When saving to a file, you must call `dev.off()` after your plot code. Otherwise the file may be left incomplete, and later plots will keep going to that file instead of showing up in RStudio.
:::
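One way to make forgetting `dev.off()` impossible is to wrap the open, draw, close steps in a small function. This is only a sketch: `save_pdf()` is a hypothetical helper written for this example, not a base R function.

```{r}
#| label: save-helper-sketch
# save_pdf() is a hypothetical helper, not part of base R
save_pdf <- function(path, draw) {
  pdf(path, width = 7, height = 4.5)
  on.exit(dev.off())   # runs when the function exits, even after an error
  draw()               # draw is a function that contains the plotting code
}

out_file <- tempfile(fileext = ".pdf")
save_pdf(out_file, function() plot(1:10, (1:10)^2, main = "Example"))
file.exists(out_file)
```

The `on.exit()` call is the design choice that matters: the device is closed even if the plotting code stops with an error.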
------------------------------------------------------------------------
## Cheat Sheet
### Which Function Do I Need?
| Question | Function | Example |
|-------------------------|-------------------------|-----------------------|
| How do two numbers relate? | `plot(x, y)` | `plot(age, salary)` |
| How is a number distributed? | `hist(x)` | `hist(salary)` |
| How do groups compare? | `boxplot(y ~ group)` | `boxplot(salary ~ dept)` |
| How many per group? | `barplot(table(x))` | `barplot(table(dept))` |
| What's the average per group? | `barplot(tapply(...))` | `barplot(tapply(sal, dept, mean))` |
### Common Appearance Arguments
| Argument | Controls | Values |
|--------------------------|--------------------------|---------------------|
| `col` | Fill color | `"steelblue"`, `"#FF6347"`, `c("red","blue")` |
| `border` | Border color | `"white"`, `NA` (no border) |
| `pch` | Point shape | 19 = solid circle, 1 = open circle, 17 = triangle |
| `cex` | Size multiplier | 1 = default, 1.5 = 50% bigger |
| `lwd` | Line width | 1 = default, 2 = double |
| `lty` | Line type | 1 = solid, 2 = dashed, 3 = dotted |
| `las` | Axis label direction | 1 = horizontal, 2 = perpendicular |
### Adding Extras to a Plot
| Function | What it adds |
|---------------------|-----------------------|
| `abline(lm(...))` | Trend line |
| `abline(h = value)` | Horizontal line |
| `abline(v = value)` | Vertical line |
| `legend(...)` | Legend |
| `text(x, y, label)` | Text label at a point |
| `grid()` | Grid lines |
| `points(x, y)` | More points on top |
| `lines(x, y)` | Lines on top |
------------------------------------------------------------------------
# Module 6: Leveraging AI Tools {#module6}
## Learning Objectives
- Understand how generative AI can assist your R workflow
- Learn effective prompting strategies for R code generation
- Use AI to debug errors and understand code
- Know the limitations and best practices
## How AI Can Help You Learn R
### 1. Writing Code from Natural Language
You can describe what you want to accomplish in plain English, and AI tools can generate the R code for you.
**Example prompt:**
> "Using the mpg dataset in R with the tidyverse, create a bar chart showing the average highway mpg for each car class, sorted from highest to lowest, with a clean minimal theme."
**What you might get back:**
```{r}
#| label: module7-ai-example
#| fig-width: 8
#| fig-height: 5
mpg |>
group_by(class) |>
summarize(avg_hwy = mean(hwy)) |>
ggplot(aes(x = reorder(class, avg_hwy), y = avg_hwy)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(x = "Vehicle Class", y = "Average Highway MPG",
title = "Average Highway Fuel Economy by Vehicle Class") +
theme_minimal()
```
### 2. Debugging Error Messages
When you get an error, you can paste it into an AI tool and ask for help.
**Example:** "I'm getting this error in R. What does it mean and how do I fix it?"
```
Error in select(., name, height) :
unused arguments (name, height)
```
**AI would explain:** This typically happens when another package (like MASS) has masked dplyr's `select()`, so R calls the other package's version instead. Fix it by being explicit: `dplyr::select()`.
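A minimal sketch of the explicit form, using the built-in `starwars` dataset; the object name `picked` is illustrative.

```{r}
#| label: module6-namespace-sketch
library(dplyr)

# Writing package::function() always calls that package's version,
# no matter which other packages are loaded
picked <- starwars |>
  dplyr::select(name, height) |>
  head(3)
picked
```

The same `package::function()` pattern works for any function, which also makes it a handy way to show readers where a function comes from.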
### 3. Explaining Code
Paste code you don't understand and ask AI to explain it line by line.
### 4. Discovering New Methods
Ask AI things like:
- "What's the best way to remove duplicate rows in R?"
- "How do I join two datasets in the tidyverse?"
- "What alternatives exist for creating interactive plots in R?"
## Tips for Effective AI Prompting for R
### Be Specific About Your Data
```{r}
#| label: module7-tips
#| eval: false
# BAD prompt: "Make a plot of my data"
# GOOD prompt: "Using ggplot2, create a scatter plot of the mpg dataset
# with displ on the x-axis and hwy on the y-axis,
# colored by the drv variable, with a smooth trend line."
```
### Include Context
Tell the AI:
- What packages you're using (tidyverse, specific packages)
- What your data looks like (column names, types, sample rows)
- What you've already tried
- What output format you want
### Ask for Explanations
Add "and explain each step" to your prompts to learn while coding.
### Iterate
Start with a basic request, then refine: "Now change the colors to a blue-to-red gradient" or "Add error bars."
## Practical AI Exercise
```{r}
#| label: module7-exercise
#| eval: false
# Try these prompts with your AI tool of choice (ChatGPT, Claude, etc.):
# 1. "Write tidyverse code to find the top 5 most expensive diamond cuts
# in the diamonds dataset, showing average price and count."
# 2. Paste this buggy code and ask AI to fix it:
# mpg %>%
# filter(class = "suv") %>%
# sumamrize(mean_hwy == mean(hwy))
# 3. Ask: "Explain what the following code does step by step:"
# diamonds |>
# filter(carat > 1) |>
# group_by(cut, color) |>
# summarize(avg_price = mean(price), .groups = "drop") |>
# pivot_wider(names_from = color, values_from = avg_price)
```
## Best Practices and Caveats
1. **Always test AI-generated code** — Run it yourself and verify it works
2. **Understand what the code does** — Don't just copy-paste blindly
3. **AI can make mistakes** — Especially with newer packages or niche functions
4. **Use AI as a learning accelerator** — Not a replacement for understanding
5. **Start simple, build up** — Ask for basic code first, then add complexity
6. **Cross-reference with documentation** — Use `?function_name` in R for official docs
------------------------------------------------------------------------
# Additional: Data Visualization with ggplot2 {#module7}
## Learning Objectives
- Understand the grammar of graphics (data + aesthetics + geometries)
- Create common plot types: scatter, bar, line, histogram, box plot
- Customize colors, labels, and themes
- Save plots to files
## Key Concepts
### The Grammar of Graphics
Every ggplot has three essential components:
1. **Data** — the dataset
2. **Aesthetics** (`aes()`) — what variables map to x, y, color, size, etc.
3. **Geometries** (`geom_*()`) — how to draw the data (points, bars, lines, etc.)
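To make the three components concrete, the sketch below declares the data and aesthetics once and then swaps the geometry. Storing the first two components in an object called `base` is just a convenience for this example.

```{r}
library(ggplot2)

# Data and aesthetics are declared once
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# The geometry decides how the same data and aesthetics are drawn
base + geom_point()    # points
base + geom_smooth()   # a smoothed trend line

# Geometries can be stacked as layers
layered <- base + geom_point() + geom_smooth()
layered
```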
### Scatter Plots
```{r}
#| label: module5-scatter
#| fig-width: 8
#| fig-height: 5
# Basic scatter plot
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point()
# Add color by a variable
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# Add size and transparency
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point(size = 3, alpha = 0.7)
```
### Bar Charts
```{r}
#| label: module5-bar
#| fig-width: 8
#| fig-height: 5
# Count-based bar chart (the default)
ggplot(mpg, aes(x = class)) +
geom_bar()
# Colored bars
ggplot(mpg, aes(x = class, fill = class)) +
geom_bar() +
theme(legend.position = "none")
# Bar chart with a pre-calculated value
mpg |>
group_by(manufacturer) |>
summarize(avg_hwy = mean(hwy)) |>
ggplot(aes(x = reorder(manufacturer, avg_hwy), y = avg_hwy)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(x = "Manufacturer", y = "Average Highway MPG")
```
### Histograms and Density Plots
```{r}
#| label: module5-hist
#| fig-width: 8
#| fig-height: 5
# Histogram
ggplot(mpg, aes(x = hwy)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white")
# Density plot
ggplot(mpg, aes(x = hwy, fill = drv)) +
geom_density(alpha = 0.5)
```
### Box Plots
```{r}
#| label: module5-box
#| fig-width: 8
#| fig-height: 5
# Box plot comparing groups
ggplot(mpg, aes(x = class, y = hwy, fill = class)) +
geom_boxplot() +
theme(legend.position = "none")
```
### Line Charts
```{r}
#| label: module5-line
#| fig-width: 8
#| fig-height: 5
# Create some time series data
monthly_sales <- tibble(
month = 1:12,
revenue = c(45, 52, 48, 61, 55, 67, 72, 69, 75, 80, 85, 92)
)
ggplot(monthly_sales, aes(x = month, y = revenue)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue", size = 3) +
scale_x_continuous(breaks = 1:12)
```
### Adding Labels and Themes
```{r}
#| label: module5-labels
#| fig-width: 8
#| fig-height: 5
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point(size = 2, alpha = 0.7) +
labs(
title = "Engine Size vs. Highway Fuel Economy",
subtitle = "Larger engines tend to have lower fuel economy",
x = "Engine Displacement (liters)",
y = "Highway MPG",
color = "Vehicle Class",
caption = "Source: EPA fuel economy data"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "gray40")
)
```
### Facets — Small Multiples
```{r}
#| label: module5-facet
#| fig-width: 10
#| fig-height: 6
# Split into panels by a variable
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color = "steelblue") +
facet_wrap(~ class, scales = "free_y") +
theme_minimal()
```
### Saving Plots
```{r}
#| label: module5-save
#| eval: false
# Save the last plot you created
ggsave("my_plot.png", width = 8, height = 5, dpi = 300)
# Save a specific plot
my_plot <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
ggsave("scatter.pdf", plot = my_plot, width = 8, height = 5)
```
------------------------------------------------------------------------
# Additional: Where R Packages Live (Three Common Sources)
Not everything is on CRAN. Here's where to look:
| Source | What's there | How to install |
|--------|-------------|----------------|
| **CRAN** | ~20,000 general-purpose packages | `install.packages("name")` |
| **Bioconductor** | ~2,200 packages for genomics, genetics, bioinformatics | `BiocManager::install("name")` |
| **GitHub** | Development versions, niche tools, unreleased packages | `devtools::install_github("user/repo")` |
---
## CRAN — The Default
```{r}
#| eval: false
# Most packages you'll encounter
install.packages("tidyverse")
install.packages("ape") # phylogenetics & population genetics
library(ape)
```
If `install.packages("name")` succeeds, the package is on CRAN and nothing else is needed.
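Reinstalling a package every time a script runs is slow, so a common pattern is to install only when the package is missing. The helper below is a minimal sketch; `install_if_missing()` is a hypothetical name invented for this example, not a function from any package.

```{r}
# Hypothetical helper: install a CRAN package only if it is not installed yet
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("stats")   # ships with R, so nothing is downloaded
# install_if_missing("ape")   # would download ape from CRAN on first use
```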
---
## Bioconductor — Genetics & Genomics
Many genetics and bioinformatics packages live on [Bioconductor](https://bioconductor.org/), not CRAN. You need `BiocManager` to install them.
```{r}
#| eval: false
# Step 1: Install BiocManager (one time, from CRAN)
install.packages("BiocManager")
# Step 2: Use it to install Bioconductor packages
BiocManager::install("Biostrings") # DNA/RNA sequence handling
BiocManager::install("VariantAnnotation") # VCF files & variant data
BiocManager::install("GenomicRanges") # genomic intervals
BiocManager::install("DESeq2") # differential gene expression
# Then load normally
library(Biostrings)
```
::: {.callout-tip title="Finding Bioconductor packages"}
Browse [bioconductor.org/packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) or search by topic (e.g., "Genetics", "Sequencing", "Variant"). Each package page has install instructions.
:::
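You can also search from the R console. The sketch below assumes `BiocManager` is already installed and that you have an internet connection; the search pattern `"Genomic"` is only an example.

```{r}
#| eval: false
# List installable package names that match a pattern
BiocManager::available("Genomic")

# Show which Bioconductor release your installation uses
BiocManager::version()
```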
---
## GitHub — devtools / remotes
Some packages aren't on CRAN or Bioconductor at all — they only exist on GitHub. Others have a development version on GitHub that's newer than the CRAN release.
```{r}
#| eval: false
# Step 1: Install devtools or remotes (one time)
install.packages("devtools")
# or the lighter alternative:
install.packages("remotes")
# Step 2: Install from GitHub using "username/repository"
devtools::install_github("tidyverse/dplyr") # dev version of dplyr
remotes::install_github("YuLab-SMU/ggtree") # phylogenetic tree plots
# Then load normally
library(ggtree)
```
::: {.callout-note title="devtools vs. remotes"}
`remotes` does just the installation part and is faster to install. `devtools` includes `remotes` plus tools for *building your own* packages. For just installing other people's packages, either works — `remotes` is lighter.
:::
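If you need a particular branch, tag, or commit rather than the default branch, `install_github()` accepts a `ref` argument. The sketch below uses the `main` branch of dplyr purely as an illustration; check the repository for the branch or tag you actually want.

```{r}
#| eval: false
# Install from a specific branch, tag, or commit with the ref argument
remotes::install_github("tidyverse/dplyr", ref = "main")

# Equivalent shorthand: append @ref to the repository name
remotes::install_github("tidyverse/dplyr@main")
```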
---
## Quick Decision Guide
```
Need a package?
│
├─ Try install.packages("name") first
│ └─ Works? → Done!
│
├─ "Package not available" error?
│ ├─ Is it a bio/genetics package? → BiocManager::install("name")
│ └─ Is it on GitHub? → devtools::install_github("user/repo")
│
└─ Still can't find it?
└─ Search: google "R package [what you need]"
or ask AI: "What R package does [task]?"
```
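The same decision logic can be sketched as a small function. Everything here is illustrative: `suggest_install()` is a hypothetical helper, and the two package lists are short hard-coded examples rather than the real CRAN and Bioconductor indexes.

```{r}
# Hypothetical helper: suggest an install command based on where a package lives
suggest_install <- function(pkg, cran_pkgs, bioc_pkgs) {
  if (pkg %in% cran_pkgs) {
    sprintf('install.packages("%s")', pkg)
  } else if (pkg %in% bioc_pkgs) {
    sprintf('BiocManager::install("%s")', pkg)
  } else {
    'Search GitHub, then remotes::install_github("user/repo")'
  }
}

# Small hard-coded lists, for illustration only
cran_pkgs <- c("ape", "pegas", "adegenet")
bioc_pkgs <- c("Biostrings", "DESeq2")

suggest_install("ape", cran_pkgs, bioc_pkgs)
suggest_install("DESeq2", cran_pkgs, bioc_pkgs)
suggest_install("mystery", cran_pkgs, bioc_pkgs)
```

With an internet connection, `rownames(available.packages())` returns the names of the packages currently on CRAN and could replace the hard-coded `cran_pkgs`.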
---
## Common Genetics / Bioinformatics Packages
| Package | Source | What it does |
|---------|--------|-------------|
| `ape` | CRAN | Phylogenetics, population genetics |
| `pegas` | CRAN | Population & evolutionary genetics |
| `adegenet` | CRAN | Multivariate genetics (PCA, DAPC) |
| `Biostrings` | Bioconductor | DNA/RNA/protein sequence manipulation |
| `GenomicRanges` | Bioconductor | Genomic intervals and annotations |
| `DESeq2` | Bioconductor | Differential gene expression (RNA-seq) |
| `VariantAnnotation` | Bioconductor | Read and annotate VCF files |
| `SNPRelate` | Bioconductor | SNP data analysis (PCA, relatedness) |
| `ggtree` | GitHub / Bioconductor | Phylogenetic tree visualization |
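
To see which of these packages are still missing on your machine, you can compare the names against your installed library. This is a minimal sketch; the vector `wanted` lists only a few packages from the table, and `not_installed` is a name chosen for this example.

```{r}
# A few package names taken from the table above
wanted <- c("ape", "pegas", "adegenet", "Biostrings", "GenomicRanges")

# Keep only the ones that are not in your library yet
not_installed <- setdiff(wanted, rownames(installed.packages()))
not_installed

# BiocManager::install() handles both CRAN and Bioconductor packages:
# BiocManager::install(not_installed)
```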