<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Word Predicting App Documentation</title>
<script type="text/javascript">
window.onload = function() {
var imgs = document.getElementsByTagName('img'), i, img;
for (i = 0; i < imgs.length; i++) {
img = imgs[i];
// center an image if it is the only element of its parent
if (img.parentElement.childElementCount === 1)
img.parentElement.style.textAlign = 'center';
}
};
</script>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 13px;
}
body {
max-width: 800px;
margin: auto;
padding: 1em;
line-height: 20px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre, img {
max-width: 100%;
}
pre {
overflow-x: auto;
}
pre code {
display: block; padding: 0.5em;
}
code {
font-size: 92%;
border: 1px solid #ccc;
}
code[class] {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
</head>
<body>
<h1>Word Predicting App Documentation</h1>
<h4>Synopsis</h4>
<p>This document shows how to build a word-prediction application using n-gram models. The application behaves like the predictive keyboard (smart key) feature on smartphones. </p>
<p>This document covers: </p>
<ol>
<li>How to build (and clean) an n-gram model efficiently</li>
<li>How to query the n-gram model files with the most efficient search method</li>
<li>Two algorithms for word prediction<br/></li>
</ol>
<hr/>
<h3><strong>Preparing the Data</strong></h3>
<p>The data used in this project can be downloaded from this <a href="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"><strong>link.</strong></a> </p>
<p>The data set consists of 3 files: </p>
<ol>
<li><p>Blogs </p></li>
<li><p>Tweets </p></li>
<li><p>News </p></li>
</ol>
<hr/>
<p>Summary information for the 3 files: </p>
<p><img src="" alt="plot of chunk sample plot"/></p>
<hr/>
<h3><strong>Cleaning Data</strong></h3>
<p><strong>Machine specs used to perform these tasks:</strong> </p>
<p>Intel Core i5-4300U CPU @ 1.90 GHz 2.49 GHz </p>
<p>Memory: 8 GB </p>
<hr/>
<p>These are the approaches I tried for cleaning and tokenizing the data, in order: </p>
<ol>
<li><p>Read data -> clean data -> tokenize 2- to 5-grams (ngram package). The 2-, 3- and 4-grams were all quick, but on the 5-grams the <em>machine froze</em>. </p></li>
<li><p>Read data -> tokenize 2- to 5-grams and clean (quanteda). This worked until the 5-grams, where it <em>ran out of memory</em>. </p></li>
<li><p>Read data -> tokenize 2- to 5-grams and clean (quanteda) -> dfm -> dfm_trim with a minimum frequency of 4: <em>more than 35 minutes on the 5-grams</em>. </p></li>
<li><p>Read data -> clean data -> tokenize 2- to 5-grams (quanteda) -> dfm -> dfm_trim with a minimum frequency of 4: <em>between 30 and 35 minutes on the 5-grams</em>. </p></li>
<li><p><strong>Read data -> clean data -> tokenize 2- to 5-grams (quanteda) -> dfm (with tolower = FALSE) -> dfm_trim with a minimum frequency of 4: the fastest, less than 30 minutes.</strong> </p></li>
</ol>
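<p>The tokenize, count and trim steps of the fastest approach can be sketched as follows. This is a minimal illustration in Python, not the R code used for the project: the function names (ngrams, count_ngrams, trim) are hypothetical, plain whitespace splitting stands in for the quanteda tokenizer, and the input is assumed to be already cleaned and lower-cased so that no second lower-casing pass is needed. </p>
<pre>
```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams over already-cleaned lines (no lower-casing here)."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts


def trim(counts, min_freq=4):
    """Keep only the n-grams seen at least min_freq times."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy input: four identical lines plus one variant.
lines = ["i love new york"] * 4 + ["i love old york"]
bigrams = trim(count_ngrams(lines, 2))
# "love old" and "old york" occur once each, so they are trimmed away.
```
</pre>
<p>Trimming rare n-grams before saving is what keeps the model files small enough to search quickly later on. </p>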
<hr/>
<p>The data-cleaning function I used performs these steps: </p>
<ol>
<li><p>Concatenate the text (concatenate from the ngram package) </p></li>
<li><p>Convert to lower case and remove numbers (preprocess from the ngram package) </p></li>
<li><p>Remove curse words (tm package) </p></li>
<li><p>Remove punctuation, non-alphabetic characters, foreign characters and orphaned characters (gsub from base R) </p></li>
<li><p>Remove excess whitespace (tm package) </p>
<hr/></li>
</ol>
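<p>The cleaning steps above could look roughly like this. It is a sketch in Python rather than the R function used for the project; clean_text, BLOCKED_WORDS and KEEP_SINGLE are hypothetical names, the blocked-word list is a placeholder, and treating "orphaned characters" as stray single letters other than "a" and "i" is an assumption. </p>
<pre>
```python
import re

# Hypothetical placeholder list; the project used a curse-word list with tm.
BLOCKED_WORDS = {"darn", "heck"}
# Assumption: single letters other than these count as orphaned characters.
KEEP_SINGLE = {"a", "i"}


def clean_text(text, blocked=BLOCKED_WORDS):
    """Lower-case, strip numbers, blocked words, non-letters and orphans."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)              # remove numbers
    if blocked:
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in sorted(blocked)) + r")\b"
        text = re.sub(pattern, " ", text)            # remove blocked words
    text = re.sub(r"[^a-z\s]", " ", text)            # punctuation, foreign characters
    words = [w for w in text.split() if len(w) > 1 or w in KEEP_SINGLE]
    return " ".join(words)                           # single spaces only


cleaned = clean_text("Well, DARN it: I paid 20 dollars, it's 100% fine!!")
# cleaned == "well it i paid dollars it fine"
```
</pre>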
<p><strong>TIP 1</strong>: when cleaning, do not use piping from dplyr; memory will not be used efficiently. </p>
<p>Instead, I assigned the result of every step to a new variable, then removed the old variable with rm() and reclaimed its memory with gc(). </p>
<hr/>
<p><strong>TIP 2</strong>: since the input file has already been cleaned and converted to lower case, there is no need to do it again when running the dfm function. Using the same sample token file (object size 99.3 MB), here is what you gain: </p>
<ul>
<li>dfm with the default <em>tolower = TRUE</em> took 6.9 and 5.81 seconds.</li>
<li>dfm with <em>tolower = FALSE</em> took <strong>4.63</strong> and <strong>4.40</strong> seconds.</li>
</ul>
<p>Once the n-grams were processed, I converted the result to a data frame using the tidy package and saved it to a file. </p>
<hr/>
<h3><strong>Building Ngram Model</strong></h3>
<p>At this point I had 3 versions (blogs, news and tweets) of each of the 2- to 4-gram files. For each n-gram size, I loaded the 3 files and merged them, identified the word combinations common to the versions and summed their frequencies so that the counts are accurate. </p>
<p>I then converted the merged data from a data frame to a data table and saved the files. At this point, every word combination within each n-gram file is unique. The merged files are the final n-gram model. </p>
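<p>The merge step can be illustrated with a small sketch. This is Python, not the R code used for the project; merge_counts and the toy counts are hypothetical, and the sketch assumes that n-grams found in only one source are kept alongside the ones whose frequencies are summed. </p>
<pre>
```python
from collections import Counter


def merge_counts(*sources):
    """Sum n-gram frequencies across sources; each n-gram ends up in one row."""
    merged = Counter()
    for counts in sources:
        merged.update(counts)   # adds counts for n-grams already present
    return merged


# Toy bigram counts standing in for the blogs, news and tweets files.
blogs = {"of the": 10, "in the": 7}
news = {"of the": 12, "to be": 5}
tweets = {"in the": 3, "to be": 4, "so much": 6}

model = merge_counts(blogs, news, tweets)
# model["of the"] is 22 and every n-gram appears exactly once.
```
</pre>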
<hr/>
<h3><strong>The most efficient way to search and filter through the ngram model</strong></h3>
<p>Since the n-gram model files are megabytes in size, with millions of data rows, it is important to use a search method that is efficient and fast. I found that sqldf is the fastest way to search through millions of data rows. Below are the times it took to search the rows using different search methods.</p>
<p>Using the same sample file with 1,416,902 observations: </p>
<p>Dataframe using dplyr to search: <strong>3.55 | 3.37 sec</strong></p>
<p>Datatable using dplyr to search: <strong>3.41 | 3.41 sec</strong> </p>
<p>Datatable using sqldf to search: <strong>1.43 | 1.25 sec</strong> </p>
<p>Dataframe using sqldf to search: <strong>1.30 | 1.25 sec</strong> </p>
<p><strong>SQLDF</strong> is the fastest way to search through a large dataset.</p>
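<p>sqldf runs SQL statements against a data frame, using SQLite by default. The same kind of lookup can be sketched directly against SQLite. The example below uses Python's sqlite3 module rather than R; the table layout (prefix, next_word, freq), the toy rows and the function top_next_words are assumptions made for illustration and are not the project's actual schema. </p>
<pre>
```python
import sqlite3

# In-memory database with a hypothetical trigram table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigrams (prefix TEXT, next_word TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO trigrams VALUES (?, ?, ?)",
    [
        ("thank you", "for", 120),
        ("thank you", "so", 95),
        ("thank you", "very", 60),
        ("see you", "soon", 40),
    ],
)


def top_next_words(conn, prefix, limit=3):
    """Return the most frequent next words for a prefix, best first."""
    rows = conn.execute(
        "SELECT next_word FROM trigrams WHERE prefix = ? "
        "ORDER BY freq DESC LIMIT ?",
        (prefix, limit),
    ).fetchall()
    return [row[0] for row in rows]


suggestions = top_next_words(conn, "thank you", 2)
# suggestions == ["for", "so"]
```
</pre>
<p>Filtering and sorting happen inside the SQL engine, so only the few matching rows come back to the caller. </p>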
<hr/>
<h3><strong>Two Word Prediction Algorithms</strong></h3>
<p>1) <strong>Direct lookup: search the model one size larger than the number of words entered</strong> </p>
<ul>
<li><p>If 1 word is entered, search the 2-gram model file </p></li>
<li><p>If 2 words are entered, search the 3-gram model file </p></li>
<li><p>If 3 words are entered, search the 4-gram model file </p></li>
<li><p>If more than 3 words are entered, use the backoff algorithm </p></li>
</ul>
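<p>A minimal sketch of the direct lookup, in Python rather than R. MODELS is a hypothetical toy model in which MODELS[n] maps an (n-1)-word prefix to next-word counts, and predict_direct is an illustrative name; inputs longer than 3 words are rejected here because they belong to the backoff algorithm. </p>
<pre>
```python
from collections import Counter

# Hypothetical toy model: MODELS[n] maps an (n-1)-word prefix to next-word counts.
MODELS = {
    2: {"you": Counter({"are": 9, "can": 5})},
    3: {"thank you": Counter({"for": 12, "so": 7})},
    4: {"thank you so": Counter({"much": 15})},
}


def predict_direct(text, models=MODELS, k=3):
    """Look up all entered words in the model one size larger than the input."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) > 3:
        raise ValueError("more than 3 words: use the backoff algorithm")
    candidates = models[len(words) + 1].get(" ".join(words), Counter())
    return [word for word, _ in candidates.most_common(k)]


direct = predict_direct("Thank you")
# direct == ["for", "so"]
```
</pre>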
<p>2) <strong>Backoff Algorithm</strong> </p>
<ul>
<li><p>Count the words entered </p></li>
<li><p>Process the words entered and determine the last word, the last 2 words and the last 3 words </p></li>
<li><p>Using the last 3 words, search the 4-gram model file </p></li>
<li><p>Using the last 2 words, search the 3-gram model file </p></li>
<li><p>Using the last word, search the 2-gram model file </p></li>
</ul>
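<p>The backoff steps can be sketched as follows, in Python rather than R. MODELS is a hypothetical toy model in which MODELS[n] maps an (n-1)-word prefix to next-word counts, and predict_backoff is an illustrative name. The sketch assumes that the search stops at the first model that returns a match; the steps above do not say whether results from the smaller models are also combined. </p>
<pre>
```python
from collections import Counter

# Hypothetical toy model: MODELS[n] maps an (n-1)-word prefix to next-word counts.
MODELS = {
    2: {"you": Counter({"are": 9, "can": 5})},
    3: {"thank you": Counter({"for": 12, "so": 7})},
    4: {"thank you so": Counter({"much": 15})},
}


def predict_backoff(text, models=MODELS, k=3):
    """Try the last 3 words, then the last 2, then the last word."""
    words = text.lower().split()
    for size in (3, 2, 1):
        if len(words) >= size:
            prefix = " ".join(words[-size:])
            candidates = models[size + 1].get(prefix)
            if candidates:   # assumption: stop at the first match
                return [word for word, _ in candidates.most_common(k)]
    return []


long_input = predict_backoff("I want to thank you so")   # matches the 4-gram model
backed_off = predict_backoff("well thank you")           # falls back to the 3-gram model
```
</pre>
<p>Starting with the longest context gives the most specific prediction available, and shorter contexts are consulted only when the longer ones have no match. </p>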
<hr/>
<p>Link to <a href="https://rpubs.com/noeltemena/NextWordApp">Word Predicting Presentation</a> </p>
<p>Link to <a href="https://noeltemena.shinyapps.io/ShinyWord/">Shiny Word Predicting App</a> </p>
<hr/>
</body>
</html>