$20 of free credit on Bright Data when you create an account with this link: https://brdta.com/docstring
Find Bright Data on their YouTube channel: @BrightData

Learn scraping from scratch with this complete training. The training uses Python, but all the theory of scraping and bypassing blocks applies to any language that supports scraping.

----------------------------------------------------------
PREREQUISITES:
To follow this training with Python, you must master the basics of the language:
• LEARN PYTHON FROM A to Z
----------------------------------------------------------
Source code for the scripts
https://github.com/DocstringFr/format...
Create your VPS on Infomaniak
https://www.infomaniak.com/fr/heberge...
My complete Python training on Udemy (+60h of training)
● https://bit.ly/3iGZu9a
Subscribe to Docstring
● https://www.docstring.fr/formules/?ut...
Join us on the Discord server
● https://www.docstring.fr/discord/?utm...
----------------------------------------------------------
===== CHAPTERS =====
00:00:00 Introduction
00:03:13 The training program
00:07:58 Definition of scraping
00:08:56 Prerequisites
00:11:06 Obstacles (and the solution)
00:13:20 PART 1: The basics of scraping
00:18:26 Retrieving the content of a page with requests
00:24:35 Analyzing the content of a page with BeautifulSoup
00:33:41 Retrieving information with BeautifulSoup
00:43:03 Analyzing the home page of books
00:54:56 It's your turn!
01:04:32 Simple exercises: Introduction
01:06:08 Retrieving categories with a single book
01:08:40 Solution
01:32:01 Retrieving books rated 1 star
01:35:44 Solution
02:08:18 Advanced exercise: Introduction
02:09:08 Statement of the exercise
02:10:23 Presentation of Selectolax and Loguru
02:18:04 Preparation of a specification
02:28:32 Creation of the body of the script
02:47:46 Retrieving the price of a book
03:12:41 Retrieving all URLs on a page
03:24:48 Retrieving the URL of the next page
03:30:54 Retrieving all URLs of the bookstore
03:38:44 Retrieving the total value of the bookstore
03:46:51 Optimizing our script with sessions
03:53:09 Conclusion
03:53:59 PART 2: Getting around obstacles
03:55:57 What the law says
03:56:38 The terms of service (CGU)
03:59:25 The GDPR
04:00:49 The entreparticuliers.com vs. Leboncoin case
04:01:58 Examples of legal and illegal scraping
04:04:59 The robots.txt file https://robots-txt.com/
04:09:10 Interview with Rony SHALIT https://brightdata.fr/trustcenter https://help.brightdata.com/hc/en-us/...
04:46:29 Technical blocks
04:50:43 Intentional blocks
04:52:04 Blocking by request limiting
04:59:18 Blocking with the user-agent
05:04:55 Presentation of Playwright
05:10:46 Using Playwright to render JavaScript
05:20:14 Interacting with the DOM
05:26:22 Essential methods to know
05:37:45 The Bright Data solution
05:38:43 Overview of the platform
05:45:04 Create your account on Bright Data
05:48:28 Use the residential proxy network
05:57:59 Use the web unlocker
06:02:12 Use the scraping browser
06:09:47 PART 3: Retrieving data from AirBnB
06:11:01 Preparation for ethical scraping
06:15:04 Analysis of the site to prepare for scraping
06:20:44 Create the project and install the libraries
06:24:21 Simple scraping with requests
06:29:15 Save HTML to disk
06:34:57 Fetch HTML from disk
06:42:39 Fetch price data
07:03:49 Run the script from command line
07:06:11 Advanced scraping with Playwright
07:15:46 Step through all pages
07:25:09 Use Bright Data's scraping browser
07:33:44 Automate opening the debugger
07:39:11 Minimize bandwidth
07:43:20 Navigate to the search page
07:52:09 Move to the next month
08:09:57 Scroll through months
08:22:14 Fetch price and finalize the script
08:34:01 PART 4: E-commerce alert system
08:35:16 The tools used
08:38:01 Prepare for ethical scraping
08:39:55 Retrieve HTML with requests
08:52:47 Add environment variables
08:54:57 Use Web Unlocker
09:00:09 Keep history of values on disk
09:04:45 Compare current value with previous
09:08:17 Add alert function with Pushover
09:11:27 Add logger
09:17:44 Finish the main function
09:28:02 Send files to VPS
09:32:41 Create Cron Job
09:39:17 Remove warning with urllib
09:40:45 Add Sentry alerts
09:50:22 Outro