I am trying to create a web scraper in Python that goes through all the products of an AliExpress supplier. My problem is that when I run it without logging in, I am eventually redirected to the login page. I added a login section to my code, but it does not help. I would appreciate any suggestions.

My code:

import requests
from bs4 import BeautifulSoup
import re
import sys
from lxml import html


def go_through_paginator(link):
    source_code = requests.get(link, data=payload,  headers = dict(referer = link))
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    print(soup)
    for page in soup.findAll ('div', {'class' : 'ui-pagination-navi util-left'}):
        for next_page in page.findAll ('a', {'class' : 'ui-pagination-next'}):
            next_page_link="https:" + next_page.get('href')
            print (next_page_link)
            gather_all_products (next_page_link)

def gather_all_products (url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item in soup.findAll ('a', {'class' : 'pic-rind'}):
        product_link=item.get('href')
    go_through_paginator(url)


payload = {
    "loginId": "EMAIL", 
    "password": "LOGIN",
}

LOGIN_URL='https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84'

session_requests = requests.session()
source_code = session_requests.get(LOGIN_URL)
source_code = session_requests.post(LOGIN_URL, data = payload)


URL='https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'

source_code = requests.get(URL, data=payload,  headers = dict(referer = URL))
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

for L1 in soup.findAll ('li', {'id' : 'product-nav'}):
    for L1_link in L1.findAll('a', {'class' : 'nav-link'}):
        link = "https:" + L1_link.get('href') 
        gather_all_products(link)

And this is the AliExpress login URL: https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84

MattDMo
  • Are you doing anything with the cookie they send back? Because they're probably authenticating off of that. Your cookie probably needs to be in the header, but it looks like your header is just the URL? – Alexander Kleinhans Jan 13 '17 at 22:34
  • I would probably diff the request headers when logged in and when logged out with something like this, and then set them however the site expects. https://stackoverflow.com/questions/4423061/view-http-headers-in-google-chrome – Alexander Kleinhans Jan 13 '17 at 22:37
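
A minimal sketch of the header comparison suggested in the comment above, assuming you have copied the request headers from the browser's developer tools into two plain dicts. The function name diff_headers and the sample values are hypothetical, for illustration only.

```python
def diff_headers(logged_in, logged_out):
    """Map each differing header name to (logged-in value, logged-out value).

    A value of None means the header is absent from that request.
    """
    # HTTP header names are case-insensitive, so compare them lowercased.
    a = {name.lower(): value for name, value in logged_in.items()}
    b = {name.lower(): value for name, value in logged_out.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Any header that shows up only in the logged-in request (typically Cookie) is a candidate for what the scraper needs to send.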

1 Answer

Try setting the xman_t and intl_common_forever cookies on your requests, using the values from the response cookies.

I tried this directly to grab all the product information. Before I set xman_t and intl_common_forever, AliExpress only allowed me to grab 7 products. After I set them, I successfully grabbed 50 products.
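
Roughly, this could look like the sketch below. It is only an illustration under some assumptions: the helper names pick_cookies and make_session are made up here, and the cookie domain ".aliexpress.com" is a guess that you should check against what your browser shows.

```python
# Cookie names taken from the answer text.
WANTED_COOKIES = ("xman_t", "intl_common_forever")


def pick_cookies(response_cookies, names=WANTED_COOKIES):
    """Return a dict holding only the named cookies that are present."""
    return {name: response_cookies[name]
            for name in names if name in response_cookies}


def make_session(cookies, domain=".aliexpress.com"):
    """Build a requests session that sends the given cookies.

    The default domain is an assumption; verify it in your browser.
    """
    # Imported here so pick_cookies can be used without requests installed.
    import requests  # third-party: pip install requests

    session = requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value, domain=domain)
    return session
```

You would then call something like make_session(pick_cookies(first_response.cookies)) and use session.get(...) for every page instead of requests.get(...), so the same cookies are sent each time.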

Hopefully this helps you scrape their products.