I do not want to pull the whole page, just the first 40 KB of it, like the Facebook Debugger tool does.

My goal is to grab social media metadata, e.g. og:image.

The solution can be in any programming language, e.g. PHP or Python.

I already have phpQuery code that uses file_get_contents/cURL, and I know how to parse the received HTML. My question is: how do I fetch only the first n KB of a page without fetching the whole page?

Umair Ayub
  • Maybe this will help https://stackoverflow.com/a/12014561/661872 – Lawrence Cherone Sep 16 '17 at 11:10
  • @LawrenceCherone I do have code in phpQuery that uses file_get_contents/cURL and I know how to parse the received HTML, my question is **"How to fetch only first nKB of a page without fetching whole page"** – Umair Ayub Sep 16 '17 at 11:12
  • This seems already answered [here](https://stackoverflow.com/questions/2032924/how-to-partially-download-a-remote-file-with-curl). – Dardan Iljazi Sep 16 '17 at 11:15
  • The `--range` curl command-line option seems to be a good fit, but the man page doesn't say much about the specifics: https://curl.haxx.se/docs/manpage.html – Calimero Sep 16 '17 at 11:16
  • Fair enough, you could look into using curl with `CURLOPT_WRITEFUNCTION` which aborts after reading 40KB, you could also abort before once you hit `` – Lawrence Cherone Sep 16 '17 at 11:16
  • any idea how to `abort before once you hit ` – Umair Ayub Sep 16 '17 at 11:18

2 Answers

This is not specific to Facebook or any other social media site, but you can get the first 40 KB with Python like this:

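import urllib.request  # Python 3; the original used Python 2's urllib2
your_link = "https://example.com/"  # placeholder URL, replace with the page you want
# read(40000) returns at most 40,000 bytes instead of the whole body
start = urllib.request.urlopen(your_link).read(40000)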
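
Since the goal is Open Graph metadata, the truncated bytes still have to be parsed even though the document is cut off. Below is a minimal sketch, assuming the tags appear as `<meta property="og:..." content="...">` inside the fetched prefix. The names `OpenGraphParser` and `extract_og_tags` are illustrative and not part of the answer above. It relies on the standard library's `html.parser`, which does not require well-formed or complete markup:

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, content)


def extract_og_tags(partial_html: bytes) -> dict:
    """Parse a possibly truncated HTML prefix and return its og:* tags."""
    parser = OpenGraphParser()
    # errors="replace" because the byte cut may land in the middle of a
    # multi-byte character; UTF-8 itself is an assumption here.
    parser.feed(partial_html.decode("utf-8", errors="replace"))
    return parser.tags
```

With the snippet above, `extract_og_tags(start)` would return a dictionary keyed by property name. A tag that is cut in half at the 40,000-byte boundary is simply not reported.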
mdegis

This could be used:

curl -r 0-40000 -o 40k.raw https://www.keycdn.com/support/byte-range-requests/

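The -r option stands for range. It only helps if the server supports byte-range requests; a server that does not may ignore the range and send the whole document.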

From the curl man page:

-r, --range <range>
          (HTTP FTP SFTP FILE) Retrieve a byte range (i.e a partial document) from a HTTP/1.1, FTP or SFTP server or a local  FILE.  Ranges  can  be
          specified in a number of ways.

          0-499     specifies the first 500 bytes

          500-999   specifies the second 500 bytes

          -500      specifies the last 500 bytes

          9500-     specifies the bytes from offset 9500 and forward

          0-0,-1    specifies the first and last byte only(*)(HTTP)

More info can be found in this article: https://www.keycdn.com/support/byte-range-requests/
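
The same range request can be sent from Python, one of the languages the question asks about. This is a rough sketch using only the standard library. The function name `fetch_prefix` and its parameters are made up for illustration. Because a server is free to ignore the `Range` header and answer with the full document, the read is capped as well:

```python
import urllib.request


def fetch_prefix(url: str, limit: int = 40000, timeout: float = 10.0) -> bytes:
    """Return at most `limit` bytes from the start of `url` (hypothetical helper)."""
    # Byte ranges are inclusive, so 0-(limit-1) asks for exactly `limit` bytes.
    request = urllib.request.Request(
        url, headers={"Range": f"bytes=0-{limit - 1}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Status 206 means the range was honored; 200 means it was ignored
        # and the full body is on offer. Capping read() covers both cases,
        # and leaving the `with` block closes the connection without
        # reading the remainder.
        return response.read(limit)
```

A call such as `fetch_prefix("https://www.keycdn.com/support/byte-range-requests/")` would then return the first 40,000 bytes or fewer if the page is shorter.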

Just in case, here is a basic example of how to do it with Go:

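package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    response, err := http.Get("https://google.com")
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()
    // LimitReader stops after 40000 bytes, so the rest of the body is never read.
    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
    data, err := io.ReadAll(io.LimitReader(response.Body, 40000))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("data = %s\n", data)
}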
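
The capped read can also be done incrementally, which allows stopping before the byte limit once enough of the page has arrived. Here is a sketch in Python under stated assumptions: the function name `read_until` is hypothetical, and the default marker `b"</head>"` is my own guess at a sensible stopping point (Open Graph tags normally live in the document head), not something the thread specifies. The match is case-sensitive, so a page that writes `</HEAD>` would fall through to the byte limit.

```python
import io


def read_until(stream, marker: bytes = b"</head>", limit: int = 40000,
               chunk_size: int = 4096) -> bytes:
    """Read from `stream` until `marker` is seen or `limit` bytes are read."""
    buffer = bytearray()
    while len(buffer) < limit:
        chunk = stream.read(min(chunk_size, limit - len(buffer)))
        if not chunk:
            break  # end of the body
        # Start the search a little before the new chunk so a marker that
        # is split across two chunks is still found.
        search_from = max(0, len(buffer) - len(marker) + 1)
        buffer.extend(chunk)
        index = buffer.find(marker, search_from)
        if index != -1:
            return bytes(buffer[: index + len(marker)])
    return bytes(buffer)


# Offline demonstration with an in-memory stream; any object with a
# read(n) method, such as an HTTP response, can be passed instead.
demo = io.BytesIO(b"<head><meta property='og:title' content='Hi'></head><body>...")
print(read_until(demo))
```

The design choice here is to bound the download twice: by content (the marker) and by size (the limit), so a page without the marker still cannot cost more than the limit.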
nbari