Commit cc74032

2 parents 1611580 + 5880557 commit cc74032

13 files changed: +533560 -233643 lines changed

README.md

Lines changed: 39 additions & 0 deletions
@@ -2,6 +2,45 @@

 A repository of source code corpora for text analysis.

+## Analysis scripts and pre-baked results
+
+For your convenience, some basic analysis scripts and their results are included in this repository; see /scripts. Each report in /scripts/results lists one result per line, followed by its absolute number of occurrences and the cumulative share of all occurrences (written as a fraction between 0 and 1). Each report is also divided into chunks of results within the same order of magnitude.
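As a rough illustration of that layout, the sketch below reads a report back into chunks. It assumes each row is `<key> <count> <cumulative>` and that a line of ten dashes separates chunks, which matches what `createDefaultReport` in scripts/lib/app.js writes. The name `parseReport` is hypothetical and not part of the repository, and keys that contain line breaks are not handled.

```javascript
// Hypothetical helper: parse report text into chunks of rows.
// Assumes rows of the form "<key> <count> <cumulative>" and a
// "----------" line between chunks of different magnitude.
function parseReport(text) {
  const chunks = [[]]
  for (const line of text.split('\n')) {
    if (line === '') continue
    if (line === '----------') {
      chunks.push([])
      continue
    }
    // The key itself may contain spaces, so split off the last two fields.
    const cumulativeAt = line.lastIndexOf(' ')
    const countAt = line.lastIndexOf(' ', cumulativeAt - 1)
    chunks[chunks.length - 1].push({
      key: line.slice(0, countAt),
      count: Number(line.slice(countAt + 1, cumulativeAt)),
      cumulative: Number(line.slice(cumulativeAt + 1))
    })
  }
  return chunks
}
```

Splitting from the right keeps keys such as a pair that includes a space intact, since only the last two fields are known to be numeric.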
+
+The following reports are available:
+
+**characters.txt**<br>
+Counts the number of occurrences of each character. ASCII characters represent about 99.9% of the code corpora.
+
+**pairs_alphanumeric.txt**<br>
+Counts the number of occurrences of two adjacent alphanumeric characters. The following list of 66 pairs cumulatively covers ~50% of the code corpora:
+
+`in re er st on te at th es en se ti le nt or et he de ar co ct tr al ed io me is ta it as ra ri ng nd ec an to ns ro ne li ur ce pe if ic ge ss ch ac il el ll pa si un om ma am ea fi ou ut ve lo la`
+
+**pairs_combinations.txt**<br>
+Counts the number of occurrences of _any_ two adjacent characters of which one is alphanumeric and the other is not. The following list of 119 pairs cumulatively covers ~50% of the code corpora:
+
+`\x t_ e_ e( e. _s t( _c _t d_ s. t. e, .c e) r_ _p _i 0, _r s_ (s s( _a _f _e r( _d n. t) s) n( n_ _l y( _n $t p_ t, s, _m e; e: 1, T_ (c r. s- d( .g l_ _b k_ _o E_ g_ y_ (i d) o_ m. d. .t _u e' t: .s l( 2, r, r) .h t; #i s[ d, h_ _S t- @p (t .p (r 0; m_ s: (a f. c_ _h 0\ .a ,0 n, s' 3, .e e- n) .r 0) g. (p $p y. _w f( 's t' n' (v f_ $c .0 ,1 p( _E w_ l.`
+
+**pairs.txt**<br>
+Counts the number of occurrences of _any_ two adjacent characters. The spread in this report is relatively large, so no summary is given.
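The three pair reports differ only in which adjacent pairs they keep. A minimal sketch of that distinction follows. It treats ASCII letters and digits as alphanumeric and walks the input one UTF-16 code unit at a time; both are assumptions, and the scripts in this repository may classify characters differently. `countPairs` and `isAlphanumeric` are hypothetical names.

```javascript
// Assumption: "alphanumeric" means ASCII letters and digits only.
const isAlphanumeric = ch => /^[A-Za-z0-9]$/.test(ch)

// Tally every pair of adjacent characters and sort it into the three
// categories described for pairs.txt, pairs_alphanumeric.txt and
// pairs_combinations.txt.
function countPairs(source) {
  const counts = {
    all: new Map(),
    alphanumeric: new Map(),
    combinations: new Map()
  }
  const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1)
  for (let i = 0; i + 1 < source.length; i++) {
    const pair = source.slice(i, i + 2)
    const first = isAlphanumeric(pair[0])
    const second = isAlphanumeric(pair[1])
    bump(counts.all, pair)
    if (first && second) bump(counts.alphanumeric, pair)
    else if (first !== second) bump(counts.combinations, pair)
  }
  return counts
}
```

For the input `if(a)` this records `if` as an alphanumeric pair and `f(`, `(a` and `a)` as combinations. A pair of two non-alphanumeric characters is counted only under `all`.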
+
+**punctuation.txt**<br>
+Counts the number of occurrences of runs of consecutive non-alphanumeric characters, naively interpreted as punctuation. Solitary occurrences of the following twelve characters cover ~60% of this list: `_ , . ( * = ) \ ; { } $`.
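One way to read this report is as a tally of maximal runs of punctuation, where a run of length one is a "solitary" occurrence. The sketch below assumes that whitespace ends a run and that anything other than an ASCII letter, digit or whitespace counts as punctuation; the actual script may draw these lines differently. `countPunctuationRuns` is a hypothetical name.

```javascript
// Tally maximal runs of characters that are neither ASCII alphanumerics
// nor whitespace (an assumption about what counts as punctuation here).
function countPunctuationRuns(source) {
  const counts = new Map()
  for (const run of source.match(/[^A-Za-z0-9\s]+/g) || []) {
    counts.set(run, (counts.get(run) || 0) + 1)
  }
  return counts
}
```

For `foo(a, b) => {}` this yields one occurrence each of `(`, `,`, `)`, `=>` and `{}`, so three of the five runs are solitary.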
+
+**triplets_alphanumeric.txt**<br>
+Counts the number of occurrences of three adjacent alphanumeric characters.
+
+**triplets_combinations.txt**<br>
+Counts the number of occurrences of any three adjacent characters that include both an alphanumeric and a non-alphanumeric character.
+
+**triplets.txt**<br>
+Counts the number of occurrences of any three adjacent characters.
+
+**words.txt**<br>
+Counts the number of occurrences of (key)words. The following list contains the first chunk of results: these 54 words make up about 23% of all words used in the code corpora.
+
+`the if return to is of this in for struct int name end and void array new public static self const data value function be class type string get file or with not The size that google path com id else it def char key param common object import list as from code set`
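Summaries such as "these 54 words make up about 23%" or "66 pairs cover ~50%" amount to asking how many of the most frequent entries are needed to reach a given share of the total. A small sketch of that calculation is shown here; `entriesNeededFor` is a hypothetical name and the input is simply an array of counts.

```javascript
// Return how many of the most frequent entries are needed to reach `share`
// (a fraction between 0 and 1) of the summed counts.
function entriesNeededFor(share, counts) {
  const sorted = [...counts].sort((a, b) => b - a)
  const total = sorted.reduce((sum, n) => sum + n, 0)
  let cumulative = 0
  for (let i = 0; i < sorted.length; i++) {
    cumulative += sorted[i]
    if (cumulative / total >= share) return i + 1
  }
  return sorted.length
}
```

With counts of 50, 30, 10 and 10, one entry is enough to reach a 0.5 share and three entries are needed to reach 0.9.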
+
 ## Licenses

 The Corpora license and the licenses of all included projects are available in [LICENSE.md](https://github.com/source-foundry/code-corpora/blob/master/LICENSE.md)

scripts/analyze.js

Lines changed: 3 additions & 1 deletion
@@ -21,13 +21,15 @@ const startedAt = new Date().getTime()
 const languages = [
   'c',
   'cc',
+  'go',
   'java',
   'javascript',
   'objective-c',
   'php',
   'python',
   'ruby',
-  'swift'
+  'swift',
+  // 'www'
 ]

 const app = new App('..')

scripts/lib/app.js

Lines changed: 12 additions & 1 deletion
@@ -226,14 +226,19 @@ module.exports = exports = class App {

   createDefaultReport(data, filename) {
     filename += '.txt'
+    const total = this.countTotal(data)
     let limit = data[0].value / 10
     let output = ''
+    let cumulative = 0
+    let percentage
     data.forEach(item => {
       if (item.value < limit) {
         output += '----------\n'
         limit /= 10
       }
-      output += `${item.key} ${item.value}\n`
+      cumulative += parseInt(item.value)
+      percentage = Math.round((cumulative / total) * 100000) / 100000
+      output += `${item.key} ${item.value} ${percentage}\n`
     })
     console.log(`Writing ${output.length} bytes to '${filename}'`)
     fs.writeFileSync(filename, output)
@@ -242,4 +247,10 @@ module.exports = exports = class App {
   onFileWriteErrorHandler(error) {
     console.error('WRITE ERROR', error)
   }
+
+  countTotal(data) {
+    return data.reduce(function(prev, curr) {
+      return parseInt(prev) + parseInt(curr.value)
+    }, 0)
+  }
 }
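To make the effect of this change concrete, here is a standalone sketch of the same formatting logic that returns the text instead of writing a file. `formatReport` is a hypothetical name and the sample data is made up for illustration. The input is assumed to be sorted by descending `value`, as the `data[0].value / 10` starting limit suggests.

```javascript
// Standalone sketch of the report formatting in createDefaultReport.
// Assumes `data` is non-empty and sorted by descending value.
function formatReport(data) {
  const total = data.reduce((sum, item) => sum + parseInt(item.value, 10), 0)
  let limit = data[0].value / 10
  let cumulative = 0
  let output = ''
  for (const item of data) {
    // Start a new chunk once values drop below the current magnitude limit.
    if (item.value < limit) {
      output += '----------\n'
      limit /= 10
    }
    cumulative += parseInt(item.value, 10)
    // Cumulative share of the total, rounded to five decimals.
    const fraction = Math.round((cumulative / total) * 100000) / 100000
    output += `${item.key} ${item.value} ${fraction}\n`
  }
  return output
}

// Made-up sample data, for illustration only.
console.log(formatReport([
  { key: 'e', value: 600 },
  { key: 't', value: 300 },
  { key: 'q', value: 50 },
  { key: 'z', value: 50 }
]))
// e 600 0.6
// t 300 0.9
// ----------
// q 50 0.95
// z 50 1
```

Note that the third column is a fraction between 0 and 1 rather than a value out of 100, and that the separator moves the limit down by only one order of magnitude per row.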

scripts/results/characters.txt

86.5 KB
Binary file not shown.

scripts/results/pairs.txt

795 KB
Binary file not shown.
