Statistics resources for clinic users


Disclaimer and request: these links are intended to help potential statistics clinic users, either as an alternative to coming to the clinic (for example, if no convenient session is scheduled) or to follow up ideas after a visit. This is an early version, which we intend to keep improving. The links are provided in good faith, but the material they reference has not been carefully checked. If you find errors in the material, please email

and similarly use this address to tell us about additional resources that you think other users might find helpful. Please ignore the menus above; they access facilities only available to the site owner.

Basic ideas and techniques

Histograms

http://tinlizzie.org/histograms/

Statistical testing

Where to start with regard to statistical testing of data.

Power, effect size and sample size calculations

http://powerandsamplesize.com
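As a rough illustration of what such a calculation involves, the following is a minimal sketch in Python for comparing the means of two equal-sized groups with a two-sided test. It uses the textbook normal approximation, which is an assumption made here and not necessarily the method used by the site above; the function name n_per_group and its defaults are hypothetical. The effect size is the difference in means divided by the common standard deviation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation.

    effect_size is the standardised difference (difference in means
    divided by the common standard deviation). Hypothetical helper,
    for illustration only.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Illustrative call: a medium standardised effect of 0.5
n = n_per_group(0.5)
```

Because the effect size enters as a square in the denominator, halving the effect you hope to detect roughly quadruples the sample size needed. An exact calculation based on the t distribution gives slightly larger numbers than this approximation.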

The R system/environment/programming language/statistics package

R is an integrated suite of software facilities for data manipulation, graphical display, statistical analysis, calculation and simulation. It handles and analyses data very effectively and it contains a suite of operators for calculations on arrays and matrices. In addition, it has graphical capabilities for very sophisticated graphs and data displays. Finally, it is an elegant, object-oriented programming language.

It is freely available, and can be downloaded (for Windows, Mac and Linux platforms) from https://cran.r-project.org/
The same site also gives access to a huge library of free add-on packages.

Some introductions/tutorials for beginners and inexperienced users (best read while sitting at a computer so you can try the examples as you go):

R for Beginners, by Emmanuel Paradis: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Using R for Data Analysis and Graphics - Introduction, Code and Commentary, by J.H. Maindonald: http://cran.r-project.org/doc/contrib/usingR.pdf
Simple R - Using R for Introductory Statistics, by John Verzani: http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
The R Guide, by W.J. Owen: https://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
An introduction to R, by Longhow Lam: http://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf

CRAN contributed documentation: http://cran.r-project.org/other-docs.html - this page has a lot of useful material. Most of it is written for statisticians and/or programmers, so it is fairly technical.

Visualisation: ideas and discussion

https://medium.com/multiple-views-visualization-research-explained/multiple-views-on-how-to-choose-a-visualization-b3ffc99fcddc

Statsref: Statistical Analysis Handbook

A more comprehensive web-based handbook, a guide to much standard elementary and intermediate statistical methodology:
http://www.statsref.com/

'Five ways to fix statistics'

A fascinating commentary in Nature that should be read by all scientists: https://www.nature.com/articles/d41586-017-07522-z

As debate rumbles on about how and how much poor statistics is to blame for poor reproducibility, Nature asked influential statisticians to recommend one change to improve science. The common theme? The problem is not our maths, but ourselves.

More advanced techniques

Logistic regression

How to interpret logistic regression output. This might be useful here: https://www.youtube.com/watch?v=ckkiG-SDuV8
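As a small aid to reading such output, the sketch below shows the two conversions most often needed: a fitted coefficient on the log-odds scale turned into an odds ratio, and a linear predictor turned into a predicted probability. It is written in Python using only the standard library; the function names and the coefficient values are made up for illustration and do not come from the video.

```python
from math import exp


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a predictor, given its
    fitted coefficient on the log-odds scale."""
    return exp(coefficient)


def predicted_probability(intercept, coefficient, x):
    """Predicted probability for a single-predictor logistic model
    with the given (hypothetical) intercept and coefficient."""
    linear_predictor = intercept + coefficient * x
    return 1 / (1 + exp(-linear_predictor))


# Made-up coefficients, purely for illustration
ratio = odds_ratio(0.7)
probability = predicted_probability(-1.2, 0.7, 2.0)
```

A coefficient of zero corresponds to an odds ratio of one (no association), and a positive coefficient means the odds of the outcome rise as the predictor increases.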

Modern alternatives to ANOVA

Slides for a talk on this topic by Jonty Rougier in April 2018, and accompanying R code.

R for Data Science

R for Data Science - Garrett Grolemund & Hadley Wickham

An accessible introduction to creating a 'data science workflow' in R.

http://r4ds.had.co.nz/

Regression with errors in both variables

When your data has observational/measurement error or noise in both x (predictor/covariate/independent variable) and in y (response/dependent variable), ordinary regression techniques like simple linear regression are not really valid. See https://en.wikipedia.org/wiki/Errors-in-variables_models for some discussion. There you will see that you need to think carefully about your assumptions in this situation. For some purposes, so-called Deming regression is appropriate, and this (along with several generalisations you might also consider) is provided in the R package 'deming', see https://cran.r-project.org/web/packages/deming/index.html.
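To make the idea concrete, here is a minimal sketch of the closed-form Deming estimate for a straight line, written in Python with the standard library only. It is an illustration under stated assumptions, not the implementation in the 'deming' package: the function name deming_fit is hypothetical, delta is taken to be the ratio of the error variance in y to the error variance in x and is assumed known, and the data are made up.

```python
from math import sqrt


def deming_fit(x, y, delta=1.0):
    """Return (intercept, slope) of a Deming regression line.

    delta is the assumed ratio var(error in y) / var(error in x);
    delta = 1.0 gives orthogonal regression. Hypothetical helper,
    for illustration only.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample second moments about the means
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / n
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    if s_xy == 0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    diff = s_yy - delta * s_xx
    slope = (diff + sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    return y_bar - slope * x_bar, slope


# Made-up data lying exactly on the line y = 1 + 2x
intercept, slope = deming_fit([0, 1, 2, 3], [1, 3, 5, 7])
```

The choice of delta is exactly the kind of assumption that needs careful thought: the fitted slope depends on it, and it usually has to come from knowledge of the measurement process rather than from the data being fitted.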

Multilevel modelling

The LEMMA course from the Centre for Multilevel Modelling is a great resource for anyone who needs to work with this type of modelling:

http://www.bristol.ac.uk/cmm/learning/online-course/

Biostatistics

Frank Harrell's book Biostatistics for Biomedical Research: http://biostat.mc.vanderbilt.edu/tmp/bbr.pdf

In fact, Harrell has a lot of useful resources, notably in Biostatistics, perhaps starting with his blog: http://www.fharrell.com/

Statistical learning

The following MIT open course on artificial intelligence has some great lectures on statistical learning:

https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/

An accessible (i.e. for non-mathematically trained researchers) introduction to statistical learning:

An Introduction to Statistical Learning with applications in R - Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Link to website: http://www-bcf.usc.edu/~gareth/ISL/

Free pdf version here: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf

Videolectures.net

A huge library (over 23000 videos!) of generally well-produced, high-quality, advanced-level video lectures, concentrating on machine learning and applications. Use the search facility to locate material.

http://videolectures.net/

Reproducibility

A presentation on the importance of reproducibility in science, and the role of statistics in promoting this. The local lead of the Bristol Reproducibility Network is Hugo Pedder (hugo.pedder at b**l.ac.uk)

Getting more organised

Why you should be very careful if you use a spreadsheet like Excel for your scientific data

Housekeeping, and maintaining code and data

See http://www.brown.edu/Research/Shapiro/pdfs/CodeAndData.pdf

This is all good advice; I expect most of us are doing most of this already. The thing I deliberately don't do is use version control, which I tried for several years (svn, subversion, git) but didn't like; but then I'm old school. I use synchronised version numbers in file names. The thing I'm not doing but ought to is using a project management tool for my collaborations, instead of relying on email threads.

Misuse/misinterpretation of statistical data

Interesting examples

The following are from the October 2016 issue of 'Significance':

"Predictive policing systems are used increasingly by law enforcement to try to prevent crime before it occurs. But what happens when these systems are trained using biased data? …"

The article makes a strong and disturbing case:

http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2016.00960.x/full
(The article is available on open access.)

Misuse of statistics by the UK Department for Education features in: "According to the UK's Department for Education, “missing the equivalent of just one week a year from school can mean a child is significantly less likely to achieve good GCSE grades”. Can this really be true?"

http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2016.00959.x/full (One needs to be subscribed to see the full article.)

It is a bit ironic that the DfE has shown itself to be so badly educated in the use of statistical information.

The only trouble is that they took it to heart and made national policy around it, which is collecting masses of fines in England…
"Term-Time Holiday Fines Top £4m As ITV Reveals More Than 60,000 Fines Have Been Issued"
http://www.huffingtonpost.co.uk/entry/term-time-holiday-fines_uk_5808b640e4b07ebc072c56bd
