Stats Geog226 - Text From Flashcards
Important statistical concepts
*** Learn to ask questions about numbers: source, range of uncertainty, calculations to arrive at number
* Justify the use of numbers; use numbers to back up and justify arguments
* Be responsible to look at data appropriately
* Choose appropriate statistical method
*** Most time is spent looking at data, become intimate with the data, really need to know your data
*** Create distribution plot to become more familiar with data
* units are key
* Compare observations on the same scale
Statistics
* arithmetic
* dealing with collection, analysis, interpretation, presentation of masses of numerical data
Data
* The numbers represent the data
Discrete data
* Vector dimensions: e.g. Points, lines, polygons, spatially exists at one of these
* Versus numeric integer data, example one horse or three horses, not 1.5 horses
* Spatially represents statistical relationship
Continuous data
* Spatially exists everywhere
* e.g. air pollution, fire smoke at all locations, measuring different values
* Raster data, e.g. satellite data
* Numerically on a number scale, values
Dispersion
* Range, spread
Skewness
* Data shape distribution, depends what can be done to it
Graphing
* Represent data graphically
* Geographers view spatial images as place
* Visualize space with qualitative methods
Quantitative
* Measured
* Number of something
* Scientific method
Qualitative
* Counted
* Observations are assigned categories
* Text, symbols, numbers, model of the landscape quantitatively
Descriptive statistics
* Analyze data
* e.g. average house price
Probability
* based on a sample
Sampling
* can't test everything
Inferential statistics
* Be representative
* infer from data what's actually happening
Nonparametric stats
* Data doesn't fit mathematical expectation
* need to look at distribution
Correlation and simple linear regression
* relationship between variables
* e.g. temperature/rainfall
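A minimal sketch of how a correlation coefficient and a least-squares line could be computed by hand. The temperature and rainfall values are invented for illustration, and the function names `pearson_r` and `fit_line` are hypothetical helpers, not part of the course material.

```python
import math

# Hypothetical paired observations (not real data).
temperature_c = [1, 2, 3, 4, 5]
rainfall_mm = [2, 4, 5, 4, 5]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    slope = cross / ss_x
    return slope, mean_y - slope * mean_x

r = pearson_r(temperature_c, rainfall_mm)
slope, intercept = fit_line(temperature_c, rainfall_mm)
print(round(r, 3), round(slope, 2), round(intercept, 2))
```

A value of r near +1 or -1 suggests a strong linear relationship; a value near 0 suggests little linear relationship.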
Plots and charts
* graph, Present data
* diagram versus formula
RStudio
* uses Scripts
* Graphic interface for command line code
* Scroll through previously typed script code
Hypothesis testing
* state processes
* Collect, measure, present, plot
* e.g. relationship between sunlight amount and plant growth
* Report and present statistical analysis
* Export diagrams in JPEG
History
* Blaise Pascal: Probability and gaming theory
* William Sealy Gosset: Student's t-test, small samples
* Karl Pearson: r (correlation coefficient), large sample size, correlation, relation between two variables
NOIR
* Key to deciding statistical analysis
* Scales of measurement
* Nominal
* Ordinal
* Interval
* Ratio
Nominal
* has a name
* Presence / absence, yes / no
Ordinal
* Rank, order
* qualitative
* e.g. first, second, third, order, high, low, medium
Interval
* quantitative
* can only add or subtract, cannot multiply or divide data
* no absolute zero, e.g. no year zero, temperature Celsius or Fahrenheit zero is arbitrary
* Interval and ratio can both be generalized to both ordinal and nominal, but it's impossible to do the same in reverse
* e.g. cannot unclassify data
Ratio
* Quantitative
* there is a known zero
* can add, subtract, divide, multiply, can do anything with the data
* Interval and ratio can both be generalized to both ordinal and nominal, but it's impossible to do the same in reverse
* e.g. cannot unclassify data
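As a rough illustration of generalizing down the NOIR scale, the sketch below collapses ratio data into ordinal classes and then into nominal presence / absence. The rainfall values and the class limits (5 mm and 25 mm) are assumptions made up for this example.

```python
# Hypothetical rainfall measurements in millimetres (ratio scale: true zero).
rainfall_mm = [0.0, 2.5, 12.0, 48.0, 0.0, 30.5]

def to_ordinal(value_mm):
    """Collapse a ratio measurement into a ranked class (class limits are assumed)."""
    if value_mm < 5:
        return "low"
    if value_mm < 25:
        return "medium"
    return "high"

def to_nominal(value_mm):
    """Collapse a ratio measurement into presence / absence."""
    return "present" if value_mm > 0 else "absent"

ordinal = [to_ordinal(v) for v in rainfall_mm]
nominal = [to_nominal(v) for v in rainfall_mm]
print(ordinal)
print(nominal)
```

Going the other way is not possible: knowing only "high" does not recover 48.0 or 30.5, which is why classified data cannot be unclassified.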
Measures of central tendency
* Without making a histogram
* The average
* How far away are the tails?
* The peak point at midpoint of normal curve, 50% have less, 50% have more
* Mean
* Used for standard deviation
* Average
Outliers
* The bane of researchers
* A massive impact on the mean
* Good practice: report that an outlier exists and give the reason it was removed
* For good science, record outliers in the data, go back and re-sample, check for errors
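A small sketch of how much a single outlier can move the mean while barely moving the median. The house prices (in thousands of dollars) are invented for illustration.

```python
import statistics

# Hypothetical house prices, in thousands of dollars.
house_prices = [310, 325, 340, 355, 370]
with_outlier = house_prices + [2500]  # one extreme observation added

mean_before = statistics.mean(house_prices)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(house_prices)
median_after = statistics.median(with_outlier)

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

Here the mean roughly doubles (340 to 700) while the median shifts only from 340 to 347.5.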
Mean
* average
* Sample: x̄
* Population: μ
* used for Standard Deviation
* Most common statistic; often just accepted, but need to investigate further
* A.k.a. Arithmetic average
Median
* The actual middle number
* Used for Inter Quartile Range
* The middle
* Rank data first, find the middle, 50% greater, 50% less
* Ordered data set
* Central is in the center
* Middle is the median
* Not influenced by outliers, as it is just one observation
* When n is an even number of observations, select the two middle values, add these two values and divide by two to get their mean; here we use the mean to find the median
* What measure for average is used? Mean or median
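The ranking and even / odd rule above can be sketched as follows. The function name `median_of` is a hypothetical helper written for this example; Python's standard library offers `statistics.median` for the same purpose.

```python
def median_of(values):
    """Median by ranking first, then taking the middle value(s)."""
    ordered = sorted(values)          # rank the data from low to high
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: one actual middle observation
        return ordered[mid]
    # even n: mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([7, 1, 5]))      # odd number of observations
print(median_of([7, 1, 5, 3]))   # even number of observations
```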
Mode
* most common occurring value
* Used for range
* Can have more than one mode, or no modes, the most common is bimodal
* Extreme values have no effect
* Categorical data, class boundaries, class with the most observations, frequencies, modal class
* Look at histogram
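A brief sketch of finding the mode for categorical data and for a bimodal set of numbers. The land-cover labels and values are invented; `multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Hypothetical categorical observations.
land_cover = ["forest", "urban", "forest", "water", "urban", "forest"]
# Hypothetical numeric observations with two equally common values.
bimodal = [2, 3, 3, 5, 7, 7, 9]

modes_land_cover = multimode(land_cover)
modes_bimodal = multimode(bimodal)

# The modal class is the category with the highest frequency.
modal_class, modal_count = Counter(land_cover).most_common(1)[0]

print(modes_land_cover, modes_bimodal, modal_class, modal_count)
```

When every value occurs equally often, `multimode` returns all of them, which corresponds to the "no mode" case in these notes.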
Weighted means
* The number manipulated depending on the population
* used lots by geographers in spatial stats, for example census tracts
* Observations per area
* Weighted means on frequencies and categories, no original numbers are available, not individual members just categories
* Representative number of categories, treat as mean of category
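One way to sketch a weighted mean from a frequency table, assuming the class midpoint is used as the representative value of each category. The class limits and frequencies are invented for illustration.

```python
# Hypothetical frequency table: ((lower limit, upper limit), frequency).
classes = [((0, 10), 4), ((10, 20), 10), ((20, 30), 6)]

total = sum(freq for _, freq in classes)

# Treat each class midpoint as the mean of that category, weighted by frequency.
weighted_sum = sum(((low + high) / 2) * freq for (low, high), freq in classes)
weighted_mean = weighted_sum / total

print(total, weighted_mean)
```

The result is an estimate, since the original individual observations are not available.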
Population
* formulas use Greek notation
* Theorize - Canada - the population, all Douglas-fir trees not just in British Columbia, all students not just UVic
* Mean (μ)
* Variance (σ²)
* Standard deviation (σ)
* Number of observations (N)
* Used in calculations (N)
Sample
* formulas use Roman notation
* Infer from sample to population; results from a small sample can be used to infer for a whole population
* Mean (x̄)
* Variance (s²)
* Standard deviation (s)
* Number of observations (n)
* Used in calculations (n-1)
* as sample size increases, standard deviation of the means decreases
Number of observations
* Population: N
* Sample: n
Used in Calculations Number of observations
* Population: N
* Sample: n-1
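The difference between dividing by N and by n-1 can be seen with Python's `statistics` module, where `pvariance` / `pstdev` use N and `variance` / `stdev` use n-1. The tree heights are invented for illustration.

```python
import statistics

# Hypothetical tree heights in metres.
heights_m = [12, 15, 11, 14, 18]

population_variance = statistics.pvariance(heights_m)  # divides by N
sample_variance = statistics.variance(heights_m)       # divides by n - 1
population_sd = statistics.pstdev(heights_m)
sample_sd = statistics.stdev(heights_m)

print(population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```

The sample figures are always slightly larger for the same data, because the sum of squared deviations is divided by a smaller number.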
Frequency table
* Relative frequency
* Compare classes to each other, compare on the same scale
* Divide counts by the total
* standardizes data
* Percent: number per hundred
* How many in total?
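A minimal sketch of turning counts into relative frequencies and percentages so classes can be compared on the same scale. The land-cover counts are invented for illustration.

```python
# Hypothetical counts per class.
counts = {"forest": 12, "urban": 6, "water": 2}

total = sum(counts.values())  # how many in total?

# Divide each count by the total to standardize.
relative = {name: count / total for name, count in counts.items()}
percent = {name: 100 * count / total for name, count in counts.items()}

for name in counts:
    print(name, counts[name], round(relative[name], 2), percent[name])
```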
Histogram
* Summarize data distribution graphically
* Shows the frequency of data graphically: the number of times each measurement occurs
* X equals data categories, Y equals frequencies
* Allows analysis choice
Normal curve distribution
* percentage cumulative relative frequency histogram
* Theoretical, one peak, symmetrical, distributed around the peak, with less at each tail
* Middle is the theoretical representation of average
* Majority happens here in the middle
* Real data distributions can have multiple peaks, unlike the theoretical normal curve
Data distributions
* Each observation is a measurement
* Units are key
Categorize – Group – interval – ratio data
* Lots of data? Conveys the message
* Simplicity and data
* Groupings have no overlaps, intervals include all observations, check the impact of the number of groups for example many vs few, e.g. 0-100, >100-200
* Categories are mutually exclusive
* The impact of the number and width of class breaks, different categories for example 4 breaks that go from low to high versus 12 breaks with two peaks - what happens at the low between the two peaks?
* How to represent data to argue the message, can manipulate the message
* Hang on, wait a minute, think through the visualization, broad class boundaries
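To illustrate the impact of the number of class breaks, the sketch below bins the same invented data into two broad classes and then into eight narrow ones. The helper `bin_counts` is hypothetical and assumes equal-width, mutually exclusive classes that include their lower limit.

```python
def bin_counts(values, low, high, n_classes):
    """Count observations in equal-width classes covering low..high."""
    width = (high - low) / n_classes
    counts = [0] * n_classes
    for v in values:
        # Each value falls in exactly one class; the top limit joins the last class.
        index = min(int((v - low) / width), n_classes - 1)
        counts[index] += 1
    return counts

# Hypothetical observations with two clusters.
values = [5, 8, 12, 15, 18, 22, 61, 64, 68, 72, 75, 79]

few_classes = bin_counts(values, 0, 80, 2)
many_classes = bin_counts(values, 0, 80, 8)

print(few_classes)
print(many_classes)
```

With two broad classes the data look evenly spread; with eight classes the two peaks and the empty gap between them become visible.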
Measures of Dispersion
* Range
* Inter Quartile Range
* Variance
* Standard Deviation
* dispersion of data around a midpoint, spread of data, midpoint explains the spread of data
* Midpoint of a frequency table is the middle value of the table
* Medians are only used if you know all the observations
Range
* difference between minimum and maximum observations
* Limitation: be careful of extreme values and outliers
Inter Quartile Range (IQR)
* IQR
* Rank observations from low to high; Q1 and Q3 bound the middle 50%, with 25% of the observations on each side of the median
* Not affected by extreme values
* The 50% quartile (Q2) is the median
* Outliers lie beyond the top and bottom whiskers of a box plot
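A sketch comparing the range and the IQR on two invented data sets that differ only in one extreme value. Quartile conventions vary between textbooks and software; this example assumes the "inclusive" method of `statistics.quantiles` (Python 3.8 or later), and `range_and_iqr` is a hypothetical helper.

```python
from statistics import quantiles

def range_and_iqr(values):
    """Return (range, IQR) using the 'inclusive' quartile method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

# Hypothetical observations; the second set has one extreme value.
typical = [1, 3, 5, 7, 9, 11, 13, 15, 17]
with_extreme = [1, 3, 5, 7, 9, 11, 13, 15, 100]

print(range_and_iqr(typical))
print(range_and_iqr(with_extreme))
```

The range jumps from 16 to 99 while the IQR stays at 8, showing why the IQR is not affected by extreme values.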
Variance
* Data spread / dispersion
* Units are squared
* Not used in reports, only for theory
* Subtract the mean from each observation, square each difference, sum the squares, and divide by the total number of observations in the population
* The mean is the middle
* Some deviations will be positive, where the observation is greater than the mean
* Some deviations will be negative, where the observation is less than the mean
* Population: σ²
* Sample: s²
* $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
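The population variance can be worked through step by step as below. The observations are invented and treated as a complete population, so the divisor is N.

```python
# Hypothetical population of observations.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

N = len(observations)
mu = sum(observations) / N                      # population mean

deviations = [x - mu for x in observations]     # positive above the mean, negative below
squared = [d ** 2 for d in deviations]          # squaring removes the sign
population_variance = sum(squared) / N          # units are squared

print(mu, deviations, population_variance)
```

The raw deviations always sum to zero, which is why they are squared before being added together.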
Coefficient of variation
* Used to compare variability between different samples and different data sets
* Compare the dispersion
* Comparing standard deviations directly only works if the means are equal; if not, use the coefficient of variation
* The lowest standard deviation is the least dispersed
* The larger the coefficient of variation, the wider the dispersion relative to the mean
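A sketch of comparing dispersion between two samples with different means. It assumes the usual definition of the coefficient of variation as the standard deviation divided by the mean, which these notes do not spell out; the data and the helper name are invented.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean (assumed definition)."""
    return statistics.stdev(sample) / statistics.mean(sample)

# Hypothetical samples with very different means.
sample_a = [10, 12, 14]
sample_b = [100, 105, 110]

sd_a, sd_b = statistics.stdev(sample_a), statistics.stdev(sample_b)
cv_a = coefficient_of_variation(sample_a)
cv_b = coefficient_of_variation(sample_b)

print(sd_a, sd_b)
print(round(cv_a, 3), round(cv_b, 3))
```

Sample B has the larger standard deviation (5 versus 2), but relative to its mean it is less dispersed than sample A.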
Standard Deviation
* Used most to report
* No square units
* The square root of the variance
* Report in the same units as in the observations
* Compare every observation with the mean of that sample
* Population: σ
* Sample: s
* $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
* as sample size increases, standard deviation of mean decreases
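The last point can be illustrated with the standard error of the mean, assuming the usual formula s / sqrt(n), which is not written out in these notes. The standard deviation of 10 and the sample sizes are arbitrary values chosen for the example.

```python
import math

def standard_error(sample_sd, n):
    """Standard deviation of the sample mean: s / sqrt(n) (assumed formula)."""
    return sample_sd / math.sqrt(n)

sample_sd = 10.0                 # hypothetical sample standard deviation
sample_sizes = [4, 25, 100]

errors = [standard_error(sample_sd, n) for n in sample_sizes]
print(errors)
```

With the same spread in the observations, going from 4 to 100 observations shrinks the standard deviation of the mean from 5 to 1.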