PyCrawler.py: 30 additions & 8 deletions
@@ -1,3 +1,4 @@
+#!/usr/bin/python
 import sys
 import re
 import urllib2
@@ -13,7 +14,8 @@
 1) database file name
 2) start url
 3) crawl depth
-4) verbose (optional)
+4) domains to limit to, regex (optional)
+5) verbose (optional)
 Start out by checking to see if the args are there and
 set them to their variables
 """
@@ -23,12 +25,15 @@
 dbname = sys.argv[1]
 starturl = sys.argv[2]
 crawldepth = int(sys.argv[3])
-if len(sys.argv) == 5:
-    if (sys.argv[4].upper() == "TRUE"):
+if len(sys.argv) >= 5:
+    domains = sys.argv[4]
+if len(sys.argv) == 6:
+    if (sys.argv[5].upper() == "TRUE"):
         verbose = True
     else:
         verbose = False
 else:
+    domains = False
     verbose = False
 # urlparse the start url
 surlparsed = urlparse.urlparse(starturl)
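This hunk only reads domains from the command line; the code that applies the pattern is outside the excerpt. Presumably each candidate URL is checked against it with the re module, roughly along these lines (a sketch; the helper name url_allowed is hypothetical, not from the patch):

    import re

    def url_allowed(url, domains):
        # domains is False when no pattern was given on the command line,
        # in which case nothing is filtered out.
        if not domains:
            return True
        # Otherwise only URLs matching the user-supplied regex are kept.
        return re.search(domains, url) is not None

    # Example: url_allowed("http://www.example.com/about", r"example\.com") -> True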
@@ -37,7 +42,7 @@
 connection = sqlite.connect(dbname)
 cursor = connection.cursor()
 # crawl_index: holds all the information of the urls that have been crawled
-cursor.execute('CREATE TABLE IF NOT EXISTS crawl_index (crawlid INTEGER, parentid INTEGER, url VARCHAR(256), title VARCHAR(256), keywords VARCHAR(256) )')
+cursor.execute('CREATE TABLE IF NOT EXISTS crawl_index (crawlid INTEGER, parentid INTEGER, url VARCHAR(256), title VARCHAR(256), keywords VARCHAR(256), status INTEGER )')
 # queue: this should be obvious
 cursor.execute('CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, parent INTEGER, depth INTEGER, url VARCHAR(256))')
 # status: Contains a record of when crawling was started and stopped.
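The new status column presumably records an HTTP response code for each crawled page; the write side is not shown in this excerpt. A quick way to inspect it after a run could look like the following sketch (the database file name crawl.db is illustrative):

    import sqlite3  # stdlib binding; the script itself uses a module aliased as sqlite

    connection = sqlite3.connect("crawl.db")  # illustrative name, passed as argv[1] to the crawler
    cursor = connection.cursor()
    # Pull each crawled URL along with the newly added status column.
    cursor.execute("SELECT url, title, status FROM crawl_index")
    for row in cursor.fetchall():
        print(row)  # prints (url, title, status) tuples
    connection.close()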
README: 4 additions & 2 deletions

@@ -1,4 +1,4 @@
-PyCrawler is very simple to use. It takes 4 arguments:
+PyCrawler is very simple to use. It takes 5 arguments:
 
 1) database file name: The file that will be used to store information as a sqlite database. If the filename given does not exist, it will be created.
 
@@ -7,4 +7,6 @@ PyCrawler is very simple to use. It takes 4 arguments:
 
 3) crawl depth: This should be the number of pages deep the crawler should follow from the starting url before backing out.
 
-4) verbose (optional): If you want PyCrawler to spit out the urls it is looking at, this should be "true"; if it is missing or has any other value, it will be considered false.
+4) url regex (optional): A regex used to filter the URLs. If not set, all URLs will be logged.
+
+5) verbose (optional): If you want PyCrawler to spit out the urls it is looking at, this should be "true"; if it is missing or has any other value, it will be considered false.
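Putting the five arguments together, an invocation using the new filter might look like the following; the database name, start URL, and pattern are illustrative, not from the patch:

    python PyCrawler.py crawl.db http://www.example.com/ 3 "example\.com" true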