-
Notifications
You must be signed in to change notification settings - Fork 7
/
README.schema
149 lines (113 loc) · 6.34 KB
/
README.schema
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
A quick and approximate guide to the EFF SSL Observatory database schema.
The schema for this database is automatically generated by parsing the Web's
SSL certificates with OpenSSL. It may display idiosyncracies that result from
the complexity of the X.509 file format, from OpenSSL's handling of it, and
potentially even due to the improving nature of our scripts. If you're brave,
you're welcome to improve the code ;)
The dataset has two main tables: all_certs, and valid_certs.
all_certs is comprehensive. It has two unique keys: certid, which is a simple
integer, and fingerprint, which is a textual representation of a sha1 like
this:
"SHA1 Fingerprint=00:00:0A:BB:39:B6:8E:62:AA:1A:2D:24:56:51:AE:7D:1E:3B:0D:E7"
The fingerprint field is unique across scans of the SSL universe. But, when
practical collision attacks are one day found against SHA1, the fingerprint
field will cease to be unique.
Importing Data:
0] Data collection -> you get a bunch of .results files in N.x.x.x folders
1] Set constants in launch_parsers or launch_rescan_parsers to find your data
2] launch_(rescan_)parsers (calls Hack_parse.py) -->
creates certs* tables from the .results files using openssl -> certs*
3] stitch_tables.py --> combines certs tables initial validation -> valid_certs
3.5] *optional -- stich_tables.py --> all invalid & valid certs -> all_certs
4] transvalid.py --> finds valid certs that gave bad cert chains -> certs*
5] modify constants in rebuild.sh
6] rebuild.sh
stich_tables.py --> adds in transvalid certs -> valid_certs
namestractor.py --> creates & populates seen table -> seen
--> creates name records for certs
--> creates subject alt names to certs mapping
--> creates subject names to certs mapping
timestamps.sql --> adds startdate and enddate fields
hack_parse.py --> creates the roots table -> roots
rare_and_interesting_cas.sql--> subject key ids for CAs
--> ca_skids
The valid_certs table (sometimes refered to incorrectly as "vcerts") is the
table that the stitch_tables.py script creates, this table contains all the
certificates that have passed some form of validation, i.e. that under some
scheme appear to validate to a known trust root. The creating of this table is
done in phases. Initially stich_tables.py only adds obviously valid
certificates, i.e. certificates which were provided with a certificate chain
that validates. Some misconfigured servers don't provide the needed
intermediate certificates to validate, but browsers with those intermediate
certificates cached will accept them anyway, transvalid.py attempts to
validate certificates validity based on if they had the corrected intermediate
certificates presented with them. Note that chaning the certificates presented
with a cert does not require any additional private keys and can by done by
anyone who can already use the certificate for their site - it is just
correcting for very common misconfigurations.
For each certificate that is found, only one entry will be created in the
valid certs table, which might be more confusing than it initially seems. This
means that some fields in this table are misleading and probably should not be
used (although in practice they are useful for tracking down bugs in queries).
For example the "ip" field is an IPv4 address at which the cert appeared, but
it is potentially not the only IP address at which this cert was found.
Similarly the "path" field contains a path to a results file which contians
this cert, although many more may exist. The cert* tables can contain multiple
entries for a given cert, if it appears in multiple results files. In the
cert* tables the "ip" and "path" values are correct. The "seen" table provides
a mapping of where a given cert was seen.
The valid_certs table always has the following fields, which are synthetic
(i.e. not just a rendition of the data in the certs):
- certid
the numerical ID of the valid_cert, this is a key useful for joining and
unique to the valid_certs table
- id
ignore this in valid_certs, it correspons to the per certs* table id for
one of the certs this valid cert represents (as each valid cert appears
only once) and so duplicates appear once tables are stiched together
- fetchtime
The time the cert was fetched, this number (an int) is based on the file
system time of the .results file at the time of import. Note that moving
data without preserving time information can cause this data to be
invalid.
- fingerprint
The SHA1 fingerprint of the cert, oddly prefixed with the artifact "SHA1
Fingerprint=".
- ip
An IPv4 address at which this cert was seen. It may have been seen
elsewhere.
- moz_valid
The output of openssl validation against the mozilla certificate roots.
"Yes" means valid, but other values are "self-signed: OK" -- meaning
this is a self signed certificate, and often details about how the
failure manifests are presented. Some certificates that fail moz_valid
will pass ms_valid. Self-signed certificates that moz_valid, probably
are roots under ms_valid.
- ms_valid
The same as moz_valid, only under the full set of all root certificates
supported by IE 8 as of Summer 2009. Note that Microsoft's scheme for
dynamically adding root certificates is complex, but gives them the
flexibility to ban CAs who are comprimised. Microsoft does not just ship
CA certs that don't meet some safety minimums, making subsequent
disabling of those roots possible if breeches occur. See
http://technet.microsoft.com/en-us/library/cc751157.aspx for more
information.
- path
A path on the file system at which a results file which contains this
certificate was found.
- RSA_Modulus_Bits
A number written as a string - if this is an RSA key
- startdate
The date this certificate became valid. This is an actual DB date, and
only exists after you have run rebuild.sh
- enddate
The date this certificate becomes invalid. This is an actual DB date
and is only populated after you run rebuild.sh
Some useful non-synthetic fields (taken from openssl's parsing of the
certificate) are:
- RSA Public Key:Modulus
Hex bytes of the public keys Modulus
- Issuer
The value of the issuer of the certificate, taken from the certificate
and key to validation. i.e. "C=JP, O=Japanese Government,
OU=ApplicationCA"