|
458 | 458 | "## 2a) Web scraping vs APIs - what's the difference?\n", |
459 | 459 | "Now that we've covered a simple model of how you might interact with the World Wide Web, let's go through the two main ways you may extract data from the web for research or analysis. \n", |
460 | 460 | "\n", |
461 | | - "As a quick recap - when you access data on the web, you typically download a resource. This can occur on a browser, or in your Python console. Because our interaction is primarily visual, information returned is in HTML, a markup language, that delivers both content and rules about how the content is to be presented (fonts, text size, bold, arrangement). By contrast, APIs typically are built to only return data. For this reason, the data is typically returned in XML or JSON formats. We've already seen an example of, and here is an example of a .json file. \n", |
| 461 | + "As a quick recap - when you access data through your browser, you download a resource. Because our interaction is primarily visual, information returned to browsers is in HTML, a markup language, that delivers both content and rules about how the content is to be presented (fonts, text size, bold, arrangement). By contrast, APIs typically are built to only return data. For this reason, the data is typically returned in XML or JSON formats. We've already seen an example of a HTML file, and here is an example of a .json file from the Spotify API. \n" |
| 462 | + ] |
| 463 | + }, |
| 464 | + { |
| 465 | + "cell_type": "raw", |
| 466 | + "metadata": {}, |
| 467 | + "source": [ |
| 468 | + "{\n", |
| 469 | + " \"artists\" : {\n", |
| 470 | + " \"href\" : \"https://api.spotify.com/v1/search?query=MIA&offset=0&limit=1&type=artist\",\n", |
| 471 | + " \"items\" : [ {\n", |
| 472 | + " \"external_urls\" : {\n", |
| 473 | + " \"spotify\" : \"https://open.spotify.com/artist/0QJIPDAEDILuo8AIq3pMuU\"\n", |
| 474 | + " },\n", |
| 475 | + " \"followers\" : {\n", |
| 476 | + " \"href\" : null,\n", |
| 477 | + " \"total\" : 392233\n", |
| 478 | + " },\n", |
| 479 | + " \"genres\" : [ ],\n", |
| 480 | + " \"href\" : \"https://api.spotify.com/v1/artists/0QJIPDAEDILuo8AIq3pMuU\",\n", |
| 481 | + " \"id\" : \"0QJIPDAEDILuo8AIq3pMuU\",\n", |
| 482 | + " \"name\" : \"M.I.A.\",\n", |
| 483 | + " \"popularity\" : 70,\n", |
| 484 | + " \"type\" : \"artist\",\n", |
| 485 | + " \"uri\" : \"spotify:artist:0QJIPDAEDILuo8AIq3pMuU\"\n", |
| 486 | + " } ],\n", |
| 487 | + " \"limit\" : 1,\n", |
| 488 | + " \"next\" : \"https://api.spotify.com/v1/search?query=MIA&offset=1&limit=1&type=artist\",\n", |
| 489 | + " \"offset\" : 0,\n", |
| 490 | + " \"previous\" : null,\n", |
| 491 | + " \"total\" : 510\n", |
| 492 | + " }" |
| 493 | + ] |
| 494 | + }, |
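| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a quick preview, here is a minimal sketch of how Python's built-in `json` library turns text like this into nested objects you can index into (we cover this properly later in the tutorial). The `response_text` string below is just a trimmed-down copy of the snippet above - with a real API you would get this text back from your request."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import json\n",
| | + "\n",
| | + "# a trimmed-down version of the Spotify response above, stored as a plain string\n",
| | + "response_text = '{\"artists\": {\"items\": [{\"name\": \"M.I.A.\", \"popularity\": 70}], \"total\": 510}}'\n",
| | + "\n",
| | + "# json.loads() parses JSON text into nested Python dicts and lists\n",
| | + "data = json.loads(response_text)\n",
| | + "\n",
| | + "print(data[\"artists\"][\"items\"][0][\"name\"])  # -> M.I.A.\n",
| | + "print(data[\"artists\"][\"total\"])              # -> 510"
| | + ]
| | + },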
| 495 | + { |
| 496 | + "cell_type": "markdown", |
| 497 | + "metadata": {}, |
| 498 | + "source": [ |
| 499 | + "The structure is similar to how we navigate nested Python objects (such as a list of lists), and we will see how you can navigate json objects using the python json library later in the tutorial. Notice the format of the data is highly structured, with no lines devoted to markup or how a page is to be displayed, like for HTML data to be displayed in the browser. \n", |
462 | 500 | "\n", |
463 | 501 | "In summary:\n", |
464 | 502 | "\n", |
465 | 503 | "__Web scraping__ typically involves the scraping of pages meant for human consumption. Hence you are more likely to work with __.html__ files. \n", |
466 | 504 | "\n", |
467 | | - "__Web APIs__ is a broad category, but in the context of data extraction for research. Here you are likely to work in __XML__ or __JSON__ formats, or whatever format the company or agency chooses to make the data available. There are typically fewer steps\n", |
| 505 | + "__Web APIs__ is a broad category, but in the context of data extraction for research. Here you are likely to work in __XML__ or __JSON__ formats, or whatever format the company or agency chooses to make the data available. There are typically fewer steps between extracting the data and parsing it into a form ready for analysis, as APIs are built to directly return data.\n", |
468 | 506 | "\n", |
469 | 507 | "### Note on Robots:\n", |
470 | | - "We also hear a lot about robots. A robot is ... to accomplish any kind of automated task. However, if you... Bots can be built to extract data from APIs or web scraping. Note that accessing APIs through the console does not necessarily mean it is a bot. If you manually send requests on the console to download specific resources, that is not a bot. Requires that it be automated. In the next section, we will discuss a text file called \"robots.txt\" that is typically contained in the root folder, that contains instructions to bots \n", |
| 508 | + "We also hear a lot about robots. A robot is a program designed to accomplish any kind of automated task. This, you can write an automated script to download data from an API, or to scrape pages. Sending requests manually on the console does not qualify as a bot - the key is that the task must be automated. For instance, a bot can click through every post on a forum, downloading pages that match a specific key word or phrase. \n", |
471 | 509 | "\n", |
472 | | - "For instance, a bot can click through every post on a forum, downloading files for each looking for a specific word or text. In our example, we show how to do this for Wikipedia. " |
| 510 | + "In the next section, we will discuss a text file called \"robots.txt\" (which applies to scaping only) that is typically contained in the root folder, that contains instructions to bots on what can or cannot be scraped or crawled on the site. " |
473 | 511 | ] |
474 | 512 | }, |
475 | 513 | { |
476 | 514 | "cell_type": "markdown", |
477 | 515 | "metadata": {}, |
478 | 516 | "source": [ |
479 | 517 | "## 2b) Menagerie of tools: crawlers, spiders, scrapers - what's the difference? \n", |
480 | | - "Web crawlers or spiders comes from. An image of long, spindly legs, traversing from hyperlink to hyperlink. It is these automated crawlers that continually traverse the web and index new or changed content, that search engines used to update and present the most relevant results to your search requests. \n", |
| 518 | + "Web crawlers or spiders are used by search engines to index the web. The metaphor is that of an automated bot with long, spindly legs, traversing from hyperlink to hyperlink. Search engines use these crawlers to continually traverse the web and index new or changed content, so that our search queries reflect the most recent and up-to-date content. \n", |
| 519 | + "\n", |
| 520 | + "Web scraping is a little different. While many of the tools used may be identical or similar, web scraping \"focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.\" (https://en.wikipedia.org/wiki/Web_scraping) In other words, web scraping focuses on translating data into a form ready for storage and analysis (versus just indexing). \n", |
481 | 521 | "\n", |
482 | | - "Web scraping is a little different. While many of the tools used may be identical or similar, web scraping \"focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.\" (https://en.wikipedia.org/wiki/Web_scraping) \n", |
| 522 | + "In many cases, to the server, these processes look somewhat identical. Resources are sent in response to requests. Rather, it is what is done to those resources after they are sent, and the overall goal, that differentiates web crawling and scraping. \n", |
483 | 523 | "\n", |
484 | | - "In many cases, to the server, these processes look somewhat identical. Resources are sent in response to requests. Rather, it is what is done to those resources after they are sent, and the overall goal, that differentiates web crawling and scraping. Most websites want crawlers to find them so their pages appear on popular search engines, but see no clear-cut benefit when their content is parsed and converted into usable data. Beyond research, many companies also use web scraping (in a legal grey area or illegally) to repurpose content, etc, a real estate website scraping data from Craigslist to re-post as listings on their website. " |
| 524 | + "Most websites want crawlers to find them so their pages appear on popular search engines, but see no clear-cut benefit when their content is parsed and converted into usable data. Beyond research, many companies also use web scraping (in a legal grey area or illegally) to repurpose content, etc, a real estate website scraping data from Craigslist to re-post as listings on their website. " |
485 | 525 | ] |
486 | 526 | }, |
487 | 527 | { |
|
490 | 530 | "source": [ |
491 | 531 | "## 4) Considerate robots and legality \n", |
492 | 532 | "\n", |
493 | | - "Typically, in starting a new web scraping project, you'll want to follow these steps: \n", |
| 533 | + "__Typically, in starting a new web scraping project, you'll want to follow these steps:__ \n", |
494 | 534 | "1) Find the websites' robots.txt and do not access those pages through your bot \n", |
495 | 535 | "2) Make sure your bot does not make too many requests in a specific period (etc. by using Python's sleep.wait function) \n", |
496 | 536 | "3) Look up the website's term of use or terms of service. \n", |
497 | 537 | "\n", |
498 | | - "We'll discuss each of these in turn. \n", |
| 538 | + "We'll discuss each of these briefly." |
| 539 | + ] |
| 540 | + }, |
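| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a rough sketch of what step 2) can look like in practice (the URLs and the one-second pause below are placeholders - adjust both to the site you are working with; this assumes the `requests` library is installed):"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import time\n",
| | + "import requests\n",
| | + "\n",
| | + "# placeholder list of pages to download - replace with your own\n",
| | + "urls = [\n",
| | + "    \"https://example.com/page1\",\n",
| | + "    \"https://example.com/page2\",\n",
| | + "]\n",
| | + "\n",
| | + "pages = []\n",
| | + "for url in urls:\n",
| | + "    response = requests.get(url)\n",
| | + "    pages.append(response.text)\n",
| | + "    time.sleep(1)  # pause between requests so we don't overload the server"
| | + ]
| | + },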
| 541 | + { |
| 542 | + "cell_type": "markdown", |
| 543 | + "metadata": {}, |
| 544 | + "source": [ |
| 545 | + "### What data owners care about\n", |
| 546 | + "\n", |
| 547 | + "__Data owners are concerned with:__ \n", |
| 548 | + "1) Keeping their website up \n", |
| 549 | + "2) Protecting the commercial value of their data \n", |
499 | 550 | "\n", |
500 | | - "Data owners are concerned with\n", |
501 | | - "- Keeping their website up\n", |
502 | | - "- Protecting the commercial value of their data \n", |
| 551 | + "Their policies and responses differ with respect to these two areas. You'll need to do some research to determine what is appropriate with regards to your research. \n", |
503 | 552 | "\n", |
504 | | - "Many instances of web scraping often occur in a legal grey area. Most commercial websites have strategies to throttle or block IPs that make too many requests within a fixed amount of time. Because a bot can make a large number of requests in a small amount of time (etc. entering 100 different terms into Google in one second), servers are able to determine if traffic is coming from a bot or a person (among many other methods). For companies that rely on advertising, like Google or Twitter, these requests do not represent \"human eyeballs\" and need to be filtered out from their bill to advertisers. \n", |
| 553 | + "#### 1) Keeping their website up\n", |
| 554 | + "Most commercial websites have strategies to throttle or block IPs that make too many requests within a fixed amount of time. Because a bot can make a large number of requests in a small amount of time (etc. entering 100 different terms into Google in one second), servers are able to determine if traffic is coming from a bot or a person (among many other methods). For companies that rely on advertising, like Google or Twitter, these requests do not represent \"human eyeballs\" and need to be filtered out from their bill to advertisers. \n", |
505 | 555 | "\n", |
506 | | - "__By contrast__, it is typically, since these APIs are defined by the companies, and they usually require registration and set a fixed (though often very large) number of requests through the APIs. From the company's perspective, they provide these APIs sometimes for commercial purposes (such as the Google Maps API, which is used and metered by many companies), but also to divert scrapers away from their site for human eyes. \n", |
| 556 | + "In order to keep their site up and running, companies may block your IP temporarily or permanently if they detect too many requests coming from your IP, or other signs that requests are being made by a bot instead of a person. If you systematically down a site (such as sending millions of requests to an official government site), there is the small chance your actions may be interpreted maliciously (and regarded as hacking), with risk of prosecution. \n", |
507 | 557 | "\n", |
508 | | - "Most frequent is getting your IP blocked temporarily or permanently. In addition, if you plan to publish your results for research, contacting the agency is probably a good idea. \n", |
| 558 | + "#### 2) Protecting the commercial value of their data\n", |
| 559 | + "Companies are also typically very protective of their data, especially data that ties directly into how they make money. A listings site (like Craigslist), for instance, would lose traffic if listings on its site were poached and transfered to a competitor, or if a rival company used scraping tools to derive lists of users to contact. For this reason, companies' term of use agreements are typically very restrictive of what you can do with their data. \n", |
509 | 560 | "\n", |
510 | | - "In summary:\n" |
| 561 | + "Different companies may have a range of responses to your scraping, depending on what you do with the data. Typically, repurposing the data for a rival application or business will trigger a strong response from the company (i.e. legal attention). Publishing any analysis or results, either in a formal academic journal or on a blog or webpage, may be of less concern, though legal attention is still possible. " |
511 | 562 | ] |
512 | 563 | }, |
513 | 564 | { |
514 | 565 | "cell_type": "markdown", |
515 | 566 | "metadata": {}, |
516 | 567 | "source": [ |
517 | | - "### Risks\n", |
518 | | - "In addition, \n", |
519 | | - "\n", |
520 | | - "Most often, you'll simply find your IP being temporarily blocked, or put on a permanent blacklist if you are a repeat offender. However, scraping data for repurposing can constitute . If you crash your favorite site, etc. by writing a simple scraping script and distributing \n", |
| 568 | + "### Where APIs fit\n", |
| 569 | + "Companies typically provide APIs to deal with 1) - to direct bots and scrapers away from their main site, as well as for commercial purposes (such as the Google Maps API, which is used by many companies on a pay-as-you-go basis). \n", |
521 | 570 | "\n", |
522 | | - "Most common consequences include having your IP blocked. \n", |
| 571 | + "Because APIs usually require registration and set a fixed (though often very large) number of requests, they are easier to manage and don't require companies to figure out whether requests are being made by their primary customers, versus scrapers and crawlers.\n", |
523 | 572 | "\n", |
524 | | - "\n", |
525 | | - "\n", |
526 | | - "The use of robots.txt file is a convention. \n", |
527 | | - "\n", |
528 | | - "identifies the links, and highlights specific crawlers. Let's take a look at reddit's [insert reddit's privacy policy] " |
| 573 | + "#### __In general, using APIs vs. web scraping offers more protections because:__ \n", |
| 574 | + "1) Because of the way APIs are designed, you are unlikely to affect the running of the main site and \n", |
| 575 | + "2) API data is data companies have explicitly chosen to make available (though terms of service still apply). By contrast, you may be scraping information companies want to protect if you do it through web scraping." |
| 576 | + ] |
| 577 | + }, |
| 578 | + { |
| 579 | + "cell_type": "markdown", |
| 580 | + "metadata": {}, |
| 581 | + "source": [ |
| 582 | + "### Risks in brief\n", |
| 583 | + "- In general, most often you'll simply find your IP being temporarily blocked if you are careless with the number of requests you make. \n", |
| 584 | + "- More serious consequences would include being put on a permanent blacklist or contacted for a cease-and-desist or legal action by the company (etc. if you create a new service using their data). Some of this falls in a legal grey area. \n", |
| 585 | + "- Finally, if you scale your requests and manage to send them in a sophisticated enough manner to crash the site, this may qualify as digital crime - similar to Distributed Denial-of-Service (DDOS) attacks. We probably don't have to worry about this at this stage." |
529 | 586 | ] |
530 | 587 | }, |
531 | 588 | { |
|
534 | 591 | "source": [ |
535 | 592 | "### robots.txt: internet convention\n", |
536 | 593 | "\n", |
537 | | - "The robots.txt file is usually more geared towards search engines than anything else.\n", |
538 | | - "The bot that calls itself 008 (apparently from 80legs) isn't allowed to access anything\n", |
539 | | - "bender is not allowed to visit my_shiny_metal_ass (it's a Futurama joke, the page doesn't actually exist)\n", |
540 | | - "Gort isn't allowed to visit Earth (another joke, from The Day the Earth Stood Still)\n", |
541 | | - "Other scrapers should avoid checking the API methods or \"compose message\" or 'search\" or the \"over 18?\" page (because those aren't something you really want showing up in Google), but they're allowed to visit anything else. \n", |
| 594 | + "The robots.txt file is typically located in the root folder of the site, with instructions to various services (User-agents) on what they are not allowed to scrape. \n", |
542 | 595 | "\n", |
| 596 | + "Typically, the robots.txt file is more geared towards search engines (and their crawlers) more than anything else. \n", |
| 597 | + "\n", |
| 598 | + "However, companies and agencies typically will not want you to scrape any pages that they disallow search engines from accessing. Scraping these pages makes it more likely for your IP to be detected and blocked (along with other possible actions.) \n", |
| 599 | + "\n", |
| 600 | + "Below is an example of reddit's robots.txt file: \n", |
543 | 601 | "https://www.reddit.com/robots.txt" |
544 | 602 | ] |
545 | 603 | }, |
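| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "You can open that link in a browser, or fetch it from Python. A minimal sketch (assuming the `requests` library is installed; the User-Agent string is only an illustrative placeholder):"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "import requests\n",
| | + "\n",
| | + "# identify your script politely - this User-Agent string is just a placeholder\n",
| | + "headers = {\"User-Agent\": \"my-research-bot (contact: you@example.com)\"}\n",
| | + "\n",
| | + "response = requests.get(\"https://www.reddit.com/robots.txt\", headers=headers)\n",
| | + "\n",
| | + "# print the start of the file to see the User-agent / Disallow rules\n",
| | + "print(response.text[:500])"
| | + ]
| | + },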
|
588 | 646 | "cell_type": "markdown", |
589 | 647 | "metadata": {}, |
590 | 648 | "source": [ |
591 | | - "### Privacy policy or terms of use\n", |
592 | | - "\n", |
593 | | - "Twitter's privacy policy. [insert Twitter privacy policy] \n", |
594 | | - "\n", |
595 | | - "Blogs, for instance, or many forum sites do not have APIs available. \n", |
596 | | - "\n", |
597 | | - "\n" |
| 649 | + "User blahblahblah provides a concise description of how to read the robots.txt file:\n", |
| 650 | + "https://www.reddit.com/r/learnprogramming/comments/3l1lcq/how_do_you_find_out_if_a_website_is_scrapable/" |
| 651 | + ] |
| 652 | + }, |
| 653 | + { |
| 654 | + "cell_type": "raw", |
| 655 | + "metadata": {}, |
| 656 | + "source": [ |
| 657 | + "- The bot that calls itself 008 (apparently from 80legs) isn't allowed to access anything\n", |
| 658 | + "- bender is not allowed to visit my_shiny_metal_ass (it's a Futurama joke, the page doesn't actually exist)\n", |
| 659 | + "- Gort isn't allowed to visit Earth (another joke, from The Day the Earth Stood Still)\n", |
| 660 | + "- Other scrapers should avoid checking the API methods or \"compose message\" or 'search\" or the \"over 18?\" page (because those aren't something you really want showing up in Google), but they're allowed to visit anything else." |
598 | 661 | ] |
599 | 662 | }, |
600 | 663 | { |
601 | 664 | "cell_type": "markdown", |
602 | 665 | "metadata": {}, |
603 | 666 | "source": [ |
604 | | - "\n" |
| 667 | + "In general, your bot will fall into the * wildcard category of what the site generally do not want bots to access. You should make sure your scraper does not access any of those pages, etc. www.reddit.com/login etc. " |
605 | 668 | ] |
606 | 669 | }, |
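| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "One way to build this check into a scraper is Python's built-in `urllib.robotparser` module. A minimal sketch (the two URLs checked here are just examples):"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "from urllib import robotparser\n",
| | + "\n",
| | + "# download and parse reddit's robots.txt\n",
| | + "rp = robotparser.RobotFileParser()\n",
| | + "rp.set_url(\"https://www.reddit.com/robots.txt\")\n",
| | + "rp.read()\n",
| | + "\n",
| | + "# can_fetch(user_agent, url) reports whether that agent may fetch the URL\n",
| | + "print(rp.can_fetch(\"*\", \"https://www.reddit.com/login\"))      # expect False if /login is disallowed\n",
| | + "print(rp.can_fetch(\"*\", \"https://www.reddit.com/r/python/\"))  # expect True for an ordinary page"
| | + ]
| | + },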
607 | 670 | { |
|
1263 | 1326 | "\n", |
1264 | 1327 | "A web API. The following are examples of each: [examples here] \n", |
1265 | 1328 | "\n", |
1266 | | - "What looks like gibberish. There is little to no spacing. Is not designed for us to read, but for the program (in this case the browser) to parse that content, and present it in a visual interface for us." |
| 1329 | + "What looks like gibberish. There is little to no spacing. Is not designed for us to read, but for the program (in this case the browser) to parse that content, and present it in a visual interface for us.\n", |
| 1330 | + "\n", |
| 1331 | + "\n", |
| 1332 | + "The robots.txt file is usually more geared towards search engines than anything else.\n", |
| 1333 | + "The bot that calls itself 008 (apparently from 80legs) isn't allowed to access anything\n", |
| 1334 | + "bender is not allowed to visit my_shiny_metal_ass (it's a Futurama joke, the page doesn't actually exist)\n", |
| 1335 | + "Gort isn't allowed to visit Earth (another joke, from The Day the Earth Stood Still)\n", |
| 1336 | + "Other scrapers should avoid checking the API methods or \"compose message\" or 'search\" or the \"over 18?\" page (because those aren't something you really want showing up in Google), but they're allowed to visit anything else. \n" |
1267 | 1337 | ] |
1268 | 1338 | }, |
1269 | 1339 | { |
|