Stats explanation – boxplots

BOXPLOTS or ‘box-and-whisker’ plots


I will try to explain what a boxplot (a visual summary of data) means by showing actual data.

Occupation vs Tentative

This plots data for Occupation (e.g. Other, Student, Scribbler, etc.) against Sentiment scores for the emotion Tentative (using text analysis of their written feedback).

Looking at all of the different things on the boxplot, you can see:

    • a coloured (usually) box or rectangle
    • a vertical line somewhere in the middle of the box
    • horizontal lines (but not always) coming out left and right
    • a dot or dots (but not always) on the same level but not on the lines
    • the data might show as just one vertical line.

Boxplots show:

    • many features of the raw data in a simple way, and
    • the distribution of a continuous variable.

The box edges to left and right are also called hinges. The vertical line in the middle is the median value (middle value of all the numbers). The horizontal lines are also called whiskers. (This is the Tukey method, see references at bottom.)

The boxplot shows five summary statistics:

    • the median
    • two hinges or edges of the box, the lower and upper quartiles
    • up to two lines or whiskers, covering the lowest and highest quarters of the data (apart from outliers)
    • and all outlying (outlier) points individually as dots.

Consequently, it also shows any skewing of the data away from a symmetrical normal distribution.
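To make the list above concrete, here is a minimal sketch in Python of how those numbers can be worked out for one group. It is only an illustration: the function name `tukey_boxplot_stats` and the scores are my own inventions, not the study's data or code, and I am assuming Tukey's usual rule that whiskers stop at the last value within 1.5 times the box width, with anything further out drawn as a dot.

```python
from statistics import median

def tukey_boxplot_stats(values):
    """Return the numbers a Tukey-style boxplot draws for one group."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")

    # Hinges: medians of the lower and upper halves of the sorted data.
    # When n is odd, the middle value is counted in both halves.
    half = (n + 1) // 2
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])

    # Fences: 1.5 box-widths beyond each hinge (assumed rule, see text).
    box_width = upper_hinge - lower_hinge
    low_fence = lower_hinge - 1.5 * box_width
    high_fence = upper_hinge + 1.5 * box_width

    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "median": median(data),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "lower_whisker": inside[0],    # smallest value inside the fences
        "upper_whisker": inside[-1],   # largest value inside the fences
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical scores between 0 and 1.0, for illustration only.
example_scores = [0.70, 0.10, 0.62, 0.95, 0.55, 0.50, 0.75, 0.60, 0.65]
print(tukey_boxplot_stats(example_scores))
```

For these made-up scores the median is 0.62, the hinges are 0.55 and 0.70, the whiskers reach 0.50 and 0.75, and the values 0.10 and 0.95 are outliers. Plotting software may use a slightly different quartile rule, so the box edges it draws can differ a little from these hinges.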

Example
Now we will look at how one of these graphics is made from the raw data.

If you look at one of the horizontal graphics for occupation – Poet (sixth down):

Poet data Tentative boxplot

Raw data and graphic explanation
First the numbers are sorted (ranked). Look at the breakdown below of where the numbers are in relation to the median (middle) value, and then how this relates to the boxplot.
The data is the score from the sentiment analyser for tentative-related words. A higher score means more tentative language, and scores can range from 0 to 1.0.

Data poet tentative explanation
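The same sort-then-compare step can be sketched in a few lines of Python. The scores here are hypothetical (they are not the Poet data from the graphic), and the helper name `position` is just for this example.

```python
from statistics import median

# Hypothetical Tentative scores for one occupation (not the study's data).
scores = [0.45, 0.20, 0.80, 0.35, 0.60, 0.50, 0.70]

ranked = sorted(scores)      # step 1: sort (rank) the raw numbers
middle = median(ranked)      # step 2: find the middle value

def position(value, middle):
    """Say where a value sits relative to the median."""
    if value < middle:
        return "below median"
    if value > middle:
        return "above median"
    return "median"

# step 3: list each value with its rank and its position
breakdown = [(rank, value, position(value, middle))
             for rank, value in enumerate(ranked, start=1)]

for rank, value, where in breakdown:
    print(f"{rank:>2}  {value:.2f}  {where}")
```

With seven values, the fourth in rank order (0.50) is the median, with three values below it and three above.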

Other cases

The one below has just a single line instead of a box, because there is only one data point (0.87). A boxplot can look gappy like this, and that is OK.

tentative-scientist


Notes

A boxplot helps to visualise the distribution of the data by quartile and show any outliers.

The plot above visualises five summary statistics (the median, two hinges or edges, and two whiskers or lines), plus all outlier points individually as dots.

The box (coloured rectangle) always extends from the 25th to 75th percentiles. These are sometimes called the ‘hinges’ of the plot.

The line in the middle of the box is plotted at the median.

Quartile: a type of quantile which divides the number of data points into four more or less equal parts, or quarters.

Quantile: in statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.

Outliers: observations that are far removed from the mass of data. They are worth examining, as they could be caused by unrelated or distracting issues, or not.
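To make the quartile and quantile definitions above concrete, here is a small Python sketch using the standard library's `statistics.quantiles`. The eight numbers are made up for illustration. It also shows why "more or less equal parts" is the right wording: different rules for placing the cut points give slightly different quartiles.

```python
from bisect import bisect_right
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up values, already sorted

# Three cut points split the data into four quarters.
exclusive = quantiles(data, n=4)                      # default rule
inclusive = quantiles(data, n=4, method="inclusive")  # alternative rule

print(exclusive)   # [2.25, 4.5, 6.75]
print(inclusive)   # [2.75, 4.5, 6.25]

# Count how many data points land in each quarter (default cut points).
counts = [0, 0, 0, 0]
for x in data:
    counts[bisect_right(exclusive, x)] += 1

print(counts)      # [2, 2, 2, 2]
```

The middle cut point is the median under both rules. Only the outer two quartiles move, which is why box edges can differ slightly between software packages.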

Practical note: The data in the boxplot above is from the experiment. It was saved as CSV files and then imported into Excel for data cleaning (tidying up gaps etc. from the CSV format). From Excel it was then used in the R statistical package.
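For readers without Excel, the same kind of tidying can be sketched in Python. This is not the workflow used for the study (that was Excel, then R). The column names `occupation` and `tentative`, the sample rows and the function name `load_scores` are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Stand-in for an exported CSV file; column names and rows are hypothetical.
raw_csv = """occupation,tentative
Poet,0.62
Poet,
Student,0.40
,0.55
Scientist,0.87
Poet,0.70
"""

def load_scores(handle):
    """Group numeric scores by occupation, skipping rows with gaps."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        occupation = (row.get("occupation") or "").strip()
        score_text = (row.get("tentative") or "").strip()
        if not occupation or not score_text:
            continue                  # drop rows with a missing field
        try:
            score = float(score_text)
        except ValueError:
            continue                  # drop rows whose score is not a number
        if 0.0 <= score <= 1.0:       # scores run from 0 to 1.0
            groups[occupation].append(score)
    return dict(groups)

scores_by_occupation = load_scores(io.StringIO(raw_csv))
print(scores_by_occupation)
```

Each list of scores is then one row of the boxplot. A group with a single score, like Scientist here, is what produces the single-line case described earlier.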

References

General statistics calculators (great sites)

https://www.socscistatistics.com/tests/mannwhitney/

https://goodcalculators.com/statistics-calculators/

Boxplots (this is the best introduction)

Box Plot Explained: Interpretation, Examples, & Comparison

Wiki
https://en.wikipedia.org/wiki/Box_plot

R and boxplots
https://www.statmethods.net/graphs/boxplot.html

The box and whiskers plot was first introduced in 1970 by John Tukey, who later published on the subject in 1977.
John W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.

Computer-Human Hybrid AI Writing and Creative Ethics

Introduction

This blog is about my 2020 research into computer text generation and the effects on professional and amateur writers. I am working on this topic at the University of the Arts London (UAL CCI, Dir. Mick Grierson).

No-one has asked creatives or writers what they think of the new ‘AI’ systems that generate readable text and so directly threaten their jobs, and could change the way people work forever (or don’t work forever). This is a topic that directly impinges on self-worth and financial worth in more ways than anyone can imagine, although plenty are worrying.

STUDY – ONLINE EXPERIMENT
August-October 2020

I devised an online experiment about this topic, allowing respondents to experiment with creating hybrid stories using a text generator. The people were all professionals or serious amateurs (and a couple of small students) invited from my own creative writing software mailing list, a couple of writing forums, and a publisher’s writers’ forum, plus friends and relatives who generally use writing in their work. Credits are at the bottom.

Text generation

You might have heard of OpenAI’s GPT-2 and GPT-3. My experiment uses a generating system (Fabrice Bellard’s Text Synth, with permission) based on GPT-2, that anyone can use. GPT-2 was used here as the model works well for idea generation and was more generally available at the time than GPT-3, which is much larger.

Note: The text generation and editing system is now a free online tool (creativity support tool or CST) at

Story Live writing with AI free online

The experimental results will feed into this blog (see Index for different aspects) and later an academic paper, and also a new book for the general public on the whole subject of computers, creativity and writing.


Brief description of the Study

Below is a graphic of the entire online study. Each block is a page, and the journey ran left to right, from top to bottom. The three text generation and editing experiments used a similar setup to the Story Live tool.

Each writing experiment – Caption, News and Fiction – had a question afterwards, then there were more questions after the experiments (see diagram below). All this will be addressed in blogs here, along with other discussions.

The image writing prompt was the same for each experiment and for all respondents for uniformity (there is a blog on the man and dog here).

Prompt image man and dog
Flowchart of Study

Geoff Davis

The creativity support tool (CST) from this study is Story Live writing with AI free online

My other creativity tools are Notes Story Board and Story Lite from my Story Software. For my other activities please see the home page of this site.

Study

This study was devised and the site programmed by Geoff Davis for post-graduate research at the University of the Arts London Creative Computing Institute (UAL CCI), 2020. The Supervisor is Professor Mick Grierson, Research Leader, UAL Creative Computing Institute.

Text Synth

Text Synth, by Fabrice Bellard, is a publicly available text generator. It was used as this is the sort of system people might use outside of the study. It was also not practical to recreate (program, train, fine-tune, host) a large scale text generation system for this usability pre-study. Permission to use Text Synth in the study was granted by Fabrice Bellard on Jul 7 2020.

Fabrice Bellard, coder of Text Synth.
Fabrice is an all-round genius and writes a lot of OS. Text Synth was built using the GPT-2 language model released by OpenAI. It is a neural network of 1.5 billion parameters based on the Transformer architecture.