This project was undertaken as part of the CS6601 class by a team of four students. In the paper below, we compare standard classification techniques for detecting substitution errors with the detection used in the Automatic Whiteout algorithm, and we apply feature selection to reduce the dimensionality of the feature set and speed up the algorithms. We also introduce the Clawson dataset and describe its features and their relevance to current and future work in the project.
The QWERTY keyboard is the de facto standard keyboard layout and is used on desktops, laptops, mobile phones (keypad or touchscreen) and tablets. Each device has its own space constraints. For example, keys on a 15” laptop are large and widely spaced, while those on a 5” mobile phone are smaller and placed closer together. Although most people are accustomed to this layout, typing errors remain common, especially on mobile phones and smartphones. Auto-correct and spell check have been used to flag these typos, but as more and more people use smartphones as their primary devices, it has become increasingly important to be able to detect and correct typing errors.
Substitution errors are a class of typing errors in which a character in a string is replaced by a different character. For example, a user might type HELO when trying to type HELP, with O substituted for P. These errors account for a large share of the typing errors that occur on QWERTY keyboards: 62.92% of the errors on a full-QWERTY keyboard and 40.2% of the errors on a mini-QWERTY keyboard are substitution errors. Being able to correct substitution errors on QWERTY keyboards can dramatically improve the performance of auto-correct features on mobile phones.
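To make the definition concrete, the following minimal Python sketch compares an intended string with a typed string position by position and reports each substituted character. The function name `substitution_errors` is hypothetical and is not taken from the project; the sketch assumes both strings have the same length, which holds for pure substitution errors but not for insertions or deletions.

```python
from typing import List, Tuple


def substitution_errors(intended: str, typed: str) -> List[Tuple[int, str, str]]:
    """Return (index, intended_char, typed_char) for every mismatched position.

    Assumption: the strings are already aligned and of equal length, so every
    mismatch is a substitution rather than an insertion or a deletion.
    """
    if len(intended) != len(typed):
        raise ValueError("substitution comparison needs strings of equal length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(intended, typed))
        if a != b
    ]


print(substitution_errors("HELP", "HELO"))  # [(3, 'P', 'O')]
```

For the HELO/HELP example above, the single reported mismatch is at index 3, where O was typed in place of P.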
The Clawson dataset was collected from two different sizes of mini-QWERTY (RIM Blackberry style) keyboards. Fourteen participants typed in 20-minute sessions on the keyboard models, resulting in over 400 minutes of use per participant. Combining both studies, we have a dataset of 42,340 phrases typed by the participants and 1,261,791 key presses. Every instance is a single key press. However, to prevent the cascading of errors, the part of each phrase after the first error was discarded. After this step, 973,089 key presses remained, of which 6,549 are substitution errors, i.e., 0.67% of the complete set.
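The truncation step and the resulting class imbalance can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessing code used for the dataset: the helper `truncate_after_first_error` is hypothetical, it aligns the intended and typed phrases position by position, and it assumes the erroneous key press itself is kept while everything after it is dropped. The two counts are the ones reported above.

```python
from typing import List, Tuple


def truncate_after_first_error(intended: str, typed: str) -> List[Tuple[str, str, bool]]:
    """Keep key presses up to and including the first mismatch.

    Each kept item is (intended_char, typed_char, is_error).
    Assumption: a simple position-wise alignment of the two phrases.
    """
    kept = []
    for a, b in zip(intended, typed):
        is_error = a != b
        kept.append((a, b, is_error))
        if is_error:
            break  # discard the rest of the phrase to avoid cascading errors
    return kept


# Counts reported in the text above.
remaining_key_presses = 973_089
substitution_errors_count = 6_549
rate = substitution_errors_count / remaining_key_presses

print(truncate_after_first_error("hello world", "helpo world"))
print(f"{rate:.2%}")  # 0.67%
```

With fewer than 1% of instances in the positive class, a classifier that labels every key press as correct would already reach over 99% accuracy, so accuracy alone says little about how well substitution errors are detected.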
The dataset contains 95 features; details of the features are listed and described in Appendix I of the paper below. To correct substitution errors, we used bi-letter (prob1) and tri-letter (prob) frequencies to decide which off-by-one substitution letter was inserted and should be replaced.
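As a rough illustration of how letter frequencies can drive such a correction, the sketch below picks a replacement for a flagged key press from its horizontally adjacent keys using bi-letter probabilities. Everything here is an assumption made for the example: the helper names `row_neighbors` and `best_replacement`, the restriction of "off-by-one" to same-row neighbours on a standard QWERTY layout, and the toy probability values, which are invented and are not the prob1 or prob features of the dataset. A tri-letter variant would condition on the two preceding letters instead of one.

```python
from typing import Dict, List, Tuple

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def row_neighbors(key: str) -> List[str]:
    """Return the keys immediately left and right of `key` on its row."""
    for row in QWERTY_ROWS:
        i = row.find(key)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []


# Toy bi-letter probabilities P(letter | previous letter).
# These numbers are made up for illustration only.
TOY_BIGRAM: Dict[Tuple[str, str], float] = {
    ("l", "p"): 0.020,
    ("l", "i"): 0.015,
    ("l", "o"): 0.010,
}


def best_replacement(prev: str, typed: str,
                     bigram: Dict[Tuple[str, str], float] = TOY_BIGRAM) -> str:
    """Choose the neighbouring key that is most probable after `prev`.

    Assumes the key press has already been flagged as a substitution error,
    so the typed letter itself is not a candidate.
    """
    candidates = row_neighbors(typed)
    return max(candidates, key=lambda c: bigram.get((prev, c), 0.0), default=typed)


print(best_replacement("l", "o"))  # p
```

For the typed word HELO, the flagged letter O follows L; its row neighbours are I and P, and with these toy values P is the more probable continuation, giving HELP.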