
Vilnius.js

A presentation about scraping in general, its techniques, and my personal experience with it.

  1. 1. Attacking fire with fire, or how to get an API from any website
  2. 2. I am Danielius Visockas #givingBackToCommunity Salut!
  3. 3. Web harvesting
  4. 4. Web harvesting: go to a page, extract the data, download a document
  5. 5. Basic diagram of web harvesting
  6. 6. Fundamental metrics ◉ Freshness ◉ Age
  7. 7. Revisiting policy: constant, or based on freshness
  8. 8. “Edward Coffman et al. proposed that a crawler must minimize the fraction of time pages remain outdated.”
  9. 9. Aaah, easy
  10. 10. curl -i https://delfi.lt
  11. 11. No SSL...
  12. 12. curl -i http://delfi.lt
  13. 13. Doesn’t work....
  14. 14. Let’s try mobile curl -i http://m.delfi.lt
  15. 15. ………………. <script type="text/javascript">(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,'script','//www.google-analytics.com/analytics.js','ga');ga('create','UA-2428893-5','auto');var __ae=document.getElementsByClassName('delfi-author-name')[0]||document.getElementsByClassName('article-author-name')[0];if('undefined' !== typeof __ae){var au=__ae.textContent;au=au.replace(/[,;].*/g,'');au=au.replace(/^\s+|\s+$/g,'');au=au.toLowerCase();ga('set','dimension1',au);}if(m = navigator.userAgent.match(/Delfi\/([0-9.]+)/)){var ua='Other';if(/ip(hone|ad|od)/i.test(navigator.userAgent))ua='iOS App';else if(/android/i.test(navigator.userAgent))ua='Android App';else if(/(windows|msie)/i.test(navigator.userAgent))ua='Windows App';ga('set','dimension2',ua);}else if(/FBAV\//.test(navigator.userAgent))ga('set','dimension2','FBWV');else ga('set','dimension2','Browser');ga('set','dimension3',''+(window.__dabd && window.__dabd()));ga('send','pageview'); </script> <script type="text/javascript">var t=window.location.hostname.split('.').reverse();if(window._dct)_dct({s:'delfi/mobile',d:'t.'+t[1]+'.'+t[0]});</script> <script type="text/javascript"> var __ae=document.getElementsByClassName('delfi-author-name')[0]||document.getElementsByClassName('article-author-name')[0],au='',_sf_async_config = {}; if('undefined' !== typeof __ae){var au=__ae.textContent;au=au.replace(/,.*/g,'');au=au.replace(/^\s+|\s+$/g,'');au=au.toLowerCase();} _sf_async_config.uid=46335;_sf_async_config.domain='delfi.lt';_sf_async_config.sections='m.delfi';_sf_async_config.authors=au;_sf_async_config.useCanonical=true; (function(){function loadChartbeat(){window._sf_endpt=(new Date()).getTime();var e=document.createElement('script');e.setAttribute('language', 'javascript');e.setAttribute('type', 'text/javascript');e.setAttribute('src', '//static.chartbeat.com/js/chartbeat.js');document.body.appendChild(e);} var oldonload=window.onload; window.onload=(typeof window.onload != 'function') ? loadChartbeat : function() { oldonload(); loadChartbeat(); }; })(); </script> <script
  16. 16. This looks familiar. Let’s use regex and it should be fine
  17. 17. Overengineering
  18. 18. Basic techniques: pick the right tool for the job
  19. 19. Two ways to harvest. One-time: your computer is on. Automated: can be done on a server
  20. 20. Copy and paste
  21. 21. Client-side scripting
  22. 22. Extensions and bookmarks!
  23. 23. Online scrapers
  24. 24. The fun part: automated scraping
  25. 25. “Don’t forget to watch the network tab”
  26. 26. Fetching of websites
  27. 27. Extraction of data: Cheerio
  28. 28. But then it all changed when the Fire Nation attacked
  29. 29. I found a girl in Kaunas...
  30. 30. 7 seconds: Traukiniobilietas.lt response time
  31. 31. That’s
  32. 32. Five
  33. 33. Seconds
  34. 34. More
  35. 35. Than
  36. 36. It
  37. 37. Takes
  38. 38. To
  39. 39. Say
  40. 40. Seven
  41. 41. Seconds
  42. 42. Screenshot: Traukiniobilietas.lt didn’t load...
  43. 43. So I decided to learn React and built an app that helps you find trips
  44. 44. How do I get the data?!
  45. 45. Headless browsers
  46. 46. Brings together the best of two worlds
  47. 47. I used Casper.js ◉ Runs on PhantomJS ◉ Resource intensive ◉ Can replicate everything ◉ Takes a bit longer ◉ DoS’ed traukiniobilietas… ◉ Works
  48. 48. “So basically, you have to pick the right tool for the job” #noFreeLunchTheory
  49. 49. Legal stuff
  50. 50. Security: CAPTCHAs and friends...
  51. 51. Interesting ideas ◉ Visual scraping using Machine Learning ◉ Macros + Casper.js (github.com/dvisockas/scrape)
  52. 52. Please ask questions! Thank you! And if someone from TRAFI could help me with the traveling salesman problem...
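Slides 6 to 8 name freshness and age as the fundamental metrics and contrast a constant revisiting policy with one based on freshness. The slides do not show how these are computed, so the following is only a minimal sketch using the usual crawler-literature definitions: freshness is 1 while the stored copy matches the live page and 0 otherwise, and age is how long the stored copy has been outdated. The field names `fetchedAt` and `modifiedAt`, the helper `pickNextToRevisit`, and the example.com URLs are assumptions made for illustration, and "revisit the oldest copy first" is just one possible policy, not necessarily the one used in the talk.

```javascript
// Hypothetical page record: fetchedAt = when our copy was stored,
// modifiedAt = when the live page last changed (both in milliseconds).
// In a real crawler modifiedAt is usually unknown and has to be estimated.

// Freshness: 1 if our copy still matches the live page, otherwise 0.
function freshness(page) {
  return page.modifiedAt <= page.fetchedAt ? 1 : 0;
}

// Age: 0 while the copy is fresh, otherwise the time since the live page changed.
function age(page, now) {
  return freshness(page) === 1 ? 0 : now - page.modifiedAt;
}

// Average freshness of the whole collection at one moment.
function averageFreshness(pages) {
  if (pages.length === 0) return 0;
  return pages.reduce((sum, p) => sum + freshness(p), 0) / pages.length;
}

// Illustrative (assumed) revisit policy: refetch the page with the greatest age.
function pickNextToRevisit(pages, now) {
  return pages.reduce((worst, p) => (age(p, now) > age(worst, now) ? p : worst));
}

const pages = [
  { url: 'http://m.delfi.lt', fetchedAt: 1000, modifiedAt: 4000 },     // stale
  { url: 'https://example.com/a', fetchedAt: 5000, modifiedAt: 2000 }, // fresh
  { url: 'https://example.com/b', fetchedAt: 3000, modifiedAt: 6000 }, // stale
];
const now = 10000;

console.log(averageFreshness(pages));           // one of three pages is fresh
console.log(pickNextToRevisit(pages, now).url); // http://m.delfi.lt
```

A constant policy would ignore these numbers and revisit every page at the same interval; the sketch only shows what a freshness-driven scheduler would need to track.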

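Slide 47 admits that the Casper.js scraper "DoS’ed" the target site, which is an easy mistake to make when fetching in a loop. One common way to avoid it is to request pages one at a time with a pause in between. The sketch below is an assumption about how that could look, not the code from the talk: `harvestPolitely`, `sleep`, the one-second default delay, and the example.com URLs are all hypothetical. The page fetcher is passed in as a function so the same loop works with any HTTP client or with a stub.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs strictly one at a time, pausing between requests.
// fetchPage is injected: any async function that takes a URL and
// returns the page body will do.
async function harvestPolitely(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // never send requests back to back
    try {
      results.push({ url: urls[i], body: await fetchPage(urls[i]) });
    } catch (err) {
      // Record the failure and continue with the next URL.
      results.push({ url: urls[i], error: String(err) });
    }
  }
  return results;
}

// Usage with a stub instead of a real network call.
const fakeFetch = async (url) => `<html>${url}</html>`;
harvestPolitely(['https://example.com/a', 'https://example.com/b'], fakeFetch, 50)
  .then((pages) => console.log(pages.map((p) => p.url)));
```

With Node 18 or newer, the built-in `fetch` could be passed as `(url) => fetch(url).then((res) => res.text())`. The right delay depends on the site; a server that takes 7 seconds to answer probably deserves a generous one.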