| Title: | Data for an Introduction to Statistical Learning with Applications in R |
|---|---|
| Description: | We provide the collection of data-sets used in the book 'An Introduction to Statistical Learning with Applications in R'. |
| Authors: | Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani |
| Maintainer: | Trevor Hastie <[email protected]> |
| License: | GPL-2 |
| Version: | 1.4 |
| Built: | 2026-05-19 06:52:41 UTC |
| Source: | https://github.com/cran/ISLR |

Gas mileage, horsepower, and other information for 392 vehicles.
`Auto`
A data frame with 392 observations on the following 9 variables.
- `mpg`: Miles per gallon
- `cylinders`: Number of cylinders between 4 and 8
- `displacement`: Engine displacement (cu. inches)
- `horsepower`: Engine horsepower
- `weight`: Vehicle weight (lbs.)
- `acceleration`: Time to accelerate from 0 to 60 mph (sec.)
- `year`: Model year (modulo 100)
- `origin`: Origin of car (1. American, 2. European, 3. Japanese)
- `name`: Vehicle name

The original data contained 408 observations, but 16 observations with missing values were removed.
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    pairs(Auto)
    attach(Auto)
    hist(mpg)
The data contains 5822 real customer records. Each record
consists of 86 variables, containing sociodemographic data (variables
1-43) and product ownership (variables 44-86). The sociodemographic
data is derived from zip codes. All customers living in areas with the
same zip code have the same sociodemographic attributes. Variable 86
(Purchase) indicates whether the customer purchased a caravan
insurance policy. Further information on the individual variables can
be obtained at http://www.liacs.nl/~putten/library/cc2000/data.html
`Caravan`
A data frame with 5822 observations on 86 variables.
The data was originally supplied by Sentient Machine Research and was used in the CoIL Challenge 2000.
P. van der Putten and M. van Someren (eds.). CoIL Challenge
2000: The Insurance Company Case. Published by Sentient Machine
Research, Amsterdam. Also a Leiden Institute of Advanced Computer
Science Technical Report 2000-09. June 22, 2000. See
http://www.liacs.nl/~putten/library/cc2000/
P. van der Putten and M. van Someren. A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000. Machine Learning, October 2004, vol. 57, iss. 1-2, pp. 177-195, Kluwer Academic Publishers
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013)
An Introduction to Statistical Learning with applications in R,
https://www.statlearning.com,
Springer-Verlag, New York

    summary(Caravan)
    plot(Caravan$Purchase)
A simulated data set containing sales of child car seats at 400 different stores.
`Carseats`
A data frame with 400 observations on the following 11 variables.
- `Sales`: Unit sales (in thousands) at each location
- `CompPrice`: Price charged by competitor at each location
- `Income`: Community income level (in thousands of dollars)
- `Advertising`: Local advertising budget for company at each location (in thousands of dollars)
- `Population`: Population size in region (in thousands)
- `Price`: Price company charges for car seats at each site
- `ShelveLoc`: A factor with levels `Bad`, `Good`, and `Medium` indicating the quality of the shelving location for the car seats at each site
- `Age`: Average age of the local population
- `Education`: Education level at each location
- `Urban`: A factor with levels `No` and `Yes` to indicate whether the store is in an urban or rural location
- `US`: A factor with levels `No` and `Yes` to indicate whether the store is in the US or not

Simulated data
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Carseats)
    lm.fit <- lm(Sales ~ Advertising + Price, data = Carseats)
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
`College`
A data frame with 777 observations on the following 18 variables.
- `Private`: A factor with levels `No` and `Yes` indicating private or public university
- `Apps`: Number of applications received
- `Accept`: Number of applications accepted
- `Enroll`: Number of new students enrolled
- `Top10perc`: Pct. new students from top 10% of H.S. class
- `Top25perc`: Pct. new students from top 25% of H.S. class
- `F.Undergrad`: Number of fulltime undergraduates
- `P.Undergrad`: Number of parttime undergraduates
- `Outstate`: Out-of-state tuition
- `Room.Board`: Room and board costs
- `Books`: Estimated book costs
- `Personal`: Estimated personal spending
- `PhD`: Pct. of faculty with Ph.D.'s
- `Terminal`: Pct. of faculty with terminal degree
- `S.F.Ratio`: Student/faculty ratio
- `perc.alumni`: Pct. alumni who donate
- `Expend`: Instructional expenditure per student
- `Grad.Rate`: Graduation rate

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section's 1995 Data Analysis Exposition.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(College)
    lm(Apps ~ Private + Accept, data = College)
A simulated data set containing information on 400 credit card customers, including each customer's credit limit, credit rating, and average credit card balance.
`Credit`
A data frame with 400 observations on the following 12 variables.
- `ID`: Identification
- `Income`: Income in $1,000's
- `Limit`: Credit limit
- `Rating`: Credit rating
- `Cards`: Number of credit cards
- `Age`: Age in years
- `Education`: Number of years of education
- `Gender`: A factor with levels `Male` and `Female`
- `Student`: A factor with levels `No` and `Yes` indicating whether the individual was a student
- `Married`: A factor with levels `No` and `Yes` indicating whether the individual was married
- `Ethnicity`: A factor with levels `African American`, `Asian`, and `Caucasian` indicating the individual's ethnicity
- `Balance`: Average credit card balance in $.

Simulated data, with thanks to Albert Kim for pointing out that this was omitted, and supplying the data and man documentation page on Oct 19, 2017
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Credit)
    lm(Balance ~ Student + Limit, data = Credit)
A simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
`Default`
A data frame with 10000 observations on the following 4 variables.
- `default`: A factor with levels `No` and `Yes` indicating whether the customer defaulted on their debt
- `student`: A factor with levels `No` and `Yes` indicating whether the customer is a student
- `balance`: The average balance that the customer has remaining on their credit card after making their monthly payment
- `income`: Income of customer

Simulated data
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Default)
    glm(default ~ student + balance + income, family = "binomial", data = Default)
Major League Baseball Data from the 1986 and 1987 seasons.
`Hitters`
A data frame with 322 observations of major league players on the following 20 variables.
- `AtBat`: Number of times at bat in 1986
- `Hits`: Number of hits in 1986
- `HmRun`: Number of home runs in 1986
- `Runs`: Number of runs in 1986
- `RBI`: Number of runs batted in in 1986
- `Walks`: Number of walks in 1986
- `Years`: Number of years in the major leagues
- `CAtBat`: Number of times at bat during his career
- `CHits`: Number of hits during his career
- `CHmRun`: Number of home runs during his career
- `CRuns`: Number of runs during his career
- `CRBI`: Number of runs batted in during his career
- `CWalks`: Number of walks during his career
- `League`: A factor with levels `A` and `N` indicating player's league at the end of 1986
- `Division`: A factor with levels `E` and `W` indicating player's division at the end of 1986
- `PutOuts`: Number of put outs in 1986
- `Assists`: Number of assists in 1986
- `Errors`: Number of errors in 1986
- `Salary`: 1987 annual salary on opening day in thousands of dollars
- `NewLeague`: A factor with levels `A` and `N` indicating player's league at the beginning of 1987

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Hitters)
    lm(Salary ~ AtBat + Hits, data = Hitters)
The data consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, 2308 gene expression measurements are available.
`Khan`
The format is a list containing four components: `xtrain`, `xtest`, `ytrain`, and `ytest`. `xtrain` contains the 2308 gene expression values for 63 subjects and `ytrain` records the corresponding tumor type. `xtest` and `ytest` contain the corresponding testing sample information for a further 20 subjects.
These data were originally reported in:
Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, and Meltzer P. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, v.7, pp.673-679, 2001.
The data were also used in:
Tibshirani RJ, Hastie T, Narasimhan B, and G. Chu. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proceedings of the National Academy of Sciences of the United States of America, v.99(10), pp.6567-6572, May 14, 2002.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    table(Khan$ytrain)
    table(Khan$ytest)
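Because `Khan` is a list rather than a data frame, it can help to confirm how its four components line up before fitting anything. The following is a small sketch of such a check, not part of the package. `split_summary` is a hypothetical helper, and the toy list only mimics the layout described above (one row per subject, one column per gene).

```r
# Summarize a list laid out like Khan: xtrain/xtest are matrices with one
# row per subject, ytrain/ytest hold one class label per subject.
split_summary <- function(d) {
  stopifnot(all(c("xtrain", "xtest", "ytrain", "ytest") %in% names(d)))
  stopifnot(nrow(d$xtrain) == length(d$ytrain),
            nrow(d$xtest) == length(d$ytest),
            ncol(d$xtrain) == ncol(d$xtest))
  c(n_train = nrow(d$xtrain),
    n_test = nrow(d$xtest),
    n_features = ncol(d$xtrain))
}

# Toy stand-in (an assumption for illustration): 6 training subjects,
# 2 test subjects, 5 features each.
toy <- list(xtrain = matrix(0, 6, 5), xtest = matrix(0, 2, 5),
            ytrain = c(1, 2, 3, 4, 1, 2), ytest = c(3, 4))
toy_summary <- split_summary(toy)
print(toy_summary)

# With the real data (only runs if ISLR is installed). Per the description,
# this should report 63 training subjects, 20 test subjects, and 2308 genes.
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(split_summary(ISLR::Khan))
}
```

The helper stops with an error if the predictor matrices and label vectors disagree in length, which is the most common slip when indexing list components by hand.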
NCI microarray data. The data contains expression levels on 6830 genes from 64 cancer cell lines. Cancer type is also recorded.
`NCI60`
The format is a list containing two elements: data and
labs.
data is a 64 by 6830 matrix of the expression values while
labs is a vector listing the cancer types for the 64 cell lines.
The data come from Ross et al. (Nat Genet., 2000). More information can be obtained at http://genome-www.stanford.edu/nci60/
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    table(NCI60$labs)
The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.
`OJ`
A data frame with 1070 observations on the following 18 variables.
- `Purchase`: A factor with levels `CH` and `MM` indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
- `WeekofPurchase`: Week of purchase
- `StoreID`: Store ID
- `PriceCH`: Price charged for CH
- `PriceMM`: Price charged for MM
- `DiscCH`: Discount offered for CH
- `DiscMM`: Discount offered for MM
- `SpecialCH`: Indicator of special on CH
- `SpecialMM`: Indicator of special on MM
- `LoyalCH`: Customer brand loyalty for CH
- `SalePriceMM`: Sale price for MM
- `SalePriceCH`: Sale price for CH
- `PriceDiff`: Sale price of MM less sale price of CH
- `Store7`: A factor with levels `No` and `Yes` indicating whether the sale is at Store 7
- `PctDiscMM`: Percentage discount for MM
- `PctDiscCH`: Percentage discount for CH
- `ListPriceDiff`: List price of MM less list price of CH
- `STORE`: Which of 5 possible stores the sale occurred at

Stine, Robert A., Foster, Dean P., Waterman, Richard P. Business Analysis Using Regression (1998). Published by Springer.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(OJ)
    plot(OJ$Purchase, OJ$PriceCH)
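Several `OJ` columns are described as differences of others, for example `PriceDiff` as the sale price of MM less the sale price of CH. The sketch below spells out those definitions on made-up rows. It rests on two assumptions that this page does not state outright: that "list price" refers to the `PriceCH` and `PriceMM` columns, and that a sale price is the price charged minus the discount offered. The toy values are invented for illustration.

```r
# Hypothetical rows using the same column names as OJ.
toy <- data.frame(PriceCH = c(1.75, 1.86), PriceMM = c(1.99, 2.09),
                  DiscCH = c(0.00, 0.10), DiscMM = c(0.24, 0.00))

# Assumed relationship: sale price = price charged - discount offered.
toy$SalePriceCH <- toy$PriceCH - toy$DiscCH
toy$SalePriceMM <- toy$PriceMM - toy$DiscMM

# Documented direction of the differences: MM less CH.
toy$PriceDiff <- toy$SalePriceMM - toy$SalePriceCH
toy$ListPriceDiff <- toy$PriceMM - toy$PriceCH
print(toy)

# Optional check of the documented PriceDiff definition against the real
# data (only runs if ISLR is installed).
if (requireNamespace("ISLR", quietly = TRUE)) {
  oj <- ISLR::OJ
  print(isTRUE(all.equal(oj$PriceDiff, oj$SalePriceMM - oj$SalePriceCH)))
}
```

A positive `PriceDiff` therefore means Minute Maid was the more expensive option at the time of purchase.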
A simple simulated data set containing 100 returns for each of two assets, X and Y. The data is used to estimate the optimal fraction to invest in each asset to minimize investment risk of the combined portfolio. One can then use the Bootstrap to estimate the standard error of this estimate.
`Portfolio`
A data frame with 100 observations on the following 2 variables.
- `X`: Returns for Asset X
- `Y`: Returns for Asset Y

Simulated data
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Portfolio)
    attach(Portfolio)
    plot(X, Y)
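The `Portfolio` description mentions estimating the fraction to invest in each asset and then bootstrapping the standard error of that estimate. The following is a minimal sketch of that idea, not the package's own code. It assumes the standard minimum-variance result: if a fraction alpha is invested in X and 1 - alpha in Y, then Var(alpha X + (1 - alpha) Y) is minimized at alpha = (Var(Y) - Cov(X, Y)) / (Var(X) + Var(Y) - 2 Cov(X, Y)). The helper names `alpha_hat` and `boot_se`, the seed, and the simulated stand-in data are illustrative assumptions.

```r
# Plug-in estimate of the risk-minimizing fraction invested in X,
# computed on the rows of `data` selected by `index`.
alpha_hat <- function(data, index = seq_len(nrow(data))) {
  x <- data$X[index]
  y <- data$Y[index]
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}

# Nonparametric bootstrap: resample rows with replacement B times and
# take the standard deviation of the resulting estimates.
boot_se <- function(data, B = 1000) {
  n <- nrow(data)
  estimates <- replicate(B, alpha_hat(data, sample(n, n, replace = TRUE)))
  sd(estimates)
}

set.seed(1)
# Stand-in data with the same shape as Portfolio (100 rows, columns X, Y).
toy <- data.frame(X = rnorm(100), Y = rnorm(100))
est <- alpha_hat(toy)
se <- boot_se(toy, B = 500)
print(c(alpha = est, se = se))

# With the real data (only runs if ISLR is installed):
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(alpha_hat(ISLR::Portfolio))
}
```

Giving `alpha_hat` an `index` argument keeps it compatible with resampling helpers that pass row indices, such as `boot::boot`, should that package be preferred over the hand-rolled loop.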
Daily percentage returns for the S&P 500 stock index between 2001 and 2005.
`Smarket`
A data frame with 1250 observations on the following 9 variables.
- `Year`: The year that the observation was recorded
- `Lag1`: Percentage return for previous day
- `Lag2`: Percentage return for 2 days previous
- `Lag3`: Percentage return for 3 days previous
- `Lag4`: Percentage return for 4 days previous
- `Lag5`: Percentage return for 5 days previous
- `Volume`: Volume of shares traded (number of daily shares traded in billions)
- `Today`: Percentage return for today
- `Direction`: A factor with levels `Down` and `Up` indicating whether the market had a negative or positive return on a given day

Raw values of the S&P 500 were obtained from Yahoo Finance and then converted to percentages and lagged.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Smarket)
    lm(Today ~ Lag1 + Lag2, data = Smarket)
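The source note for `Smarket` says that raw index values were converted to percentages and lagged. The sketch below shows one way such a transformation could be done on a made-up series. The price values, the helper name `make_lagged_returns`, the use of two lags, and the choice to label a zero return as `Up` are all assumptions for illustration. The exact procedure used to build `Smarket` is not documented on this page.

```r
# Build percentage returns and lagged copies from a vector of index levels.
make_lagged_returns <- function(prices, n_lags = 2) {
  # Percentage change from each period to the next.
  ret <- 100 * diff(prices) / head(prices, -1)
  # embed() puts the current value in column 1 and earlier values after it.
  out <- as.data.frame(embed(ret, n_lags + 1))
  names(out) <- c("Today", paste0("Lag", seq_len(n_lags)))
  # Label each period by the sign of its return (zero counted as Up here).
  out$Direction <- factor(ifelse(out$Today >= 0, "Up", "Down"),
                          levels = c("Down", "Up"))
  out
}

prices <- c(100, 102, 101, 103, 104)  # hypothetical index levels
returns <- make_lagged_returns(prices)
print(returns)
```

Each row needs `n_lags` earlier returns, so a series of five levels yields four returns and only two complete rows. The same construction applies to weekly data if the input holds weekly index levels.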
Wage and other data for a group of 3000 male workers in the Mid-Atlantic region.
`Wage`
A data frame with 3000 observations on the following 11 variables.
- `year`: Year that wage information was recorded
- `age`: Age of worker
- `maritl`: A factor with levels `1. Never Married`, `2. Married`, `3. Widowed`, `4. Divorced`, and `5. Separated` indicating marital status
- `race`: A factor with levels `1. White`, `2. Black`, `3. Asian`, and `4. Other` indicating race
- `education`: A factor with levels `1. < HS Grad`, `2. HS Grad`, `3. Some College`, `4. College Grad`, and `5. Advanced Degree` indicating education level
- `region`: Region of the country (mid-atlantic only)
- `jobclass`: A factor with levels `1. Industrial` and `2. Information` indicating type of job
- `health`: A factor with levels `1. <=Good` and `2. >=Very Good` indicating health level of worker
- `health_ins`: A factor with levels `1. Yes` and `2. No` indicating whether worker has health insurance
- `logwage`: Log of worker's wage
- `wage`: Worker's raw wage

Data was manually assembled by Steve Miller, of Inquidia Consulting (formerly Open BI). From the March 2011 Supplement to Current Population Survey data.
https://www.re3data.org/repository/r3d100011860
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Wage)
    lm(wage ~ year + age, data = Wage)
    ## maybe str(Wage) ; plot(Wage) ...
Weekly percentage returns for the S&P 500 stock index between 1990 and 2010.
`Weekly`
A data frame with 1089 observations on the following 9 variables.
- `Year`: The year that the observation was recorded
- `Lag1`: Percentage return for previous week
- `Lag2`: Percentage return for 2 weeks previous
- `Lag3`: Percentage return for 3 weeks previous
- `Lag4`: Percentage return for 4 weeks previous
- `Lag5`: Percentage return for 5 weeks previous
- `Volume`: Volume of shares traded (average number of daily shares traded in billions)
- `Today`: Percentage return for this week
- `Direction`: A factor with levels `Down` and `Up` indicating whether the market had a negative or positive return on a given week

Raw values of the S&P 500 were obtained from Yahoo Finance and then converted to percentages and lagged.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, https://www.statlearning.com, Springer-Verlag, New York

    summary(Weekly)
    lm(Today ~ Lag1 + Lag2, data = Weekly)