How to fetch all data from a paginated API with Python

In this tutorial, we will learn how to fetch all data from a paginated API using Python. We use the Lark Suite (Feishu) Bitable API as the example, but the concepts apply to many other paginated APIs.

1. Understanding the API {#hieu-ve-api}

Before writing any code, let's review the key aspects of the target API:

Endpoint: https://open.larksuite.com/open-apis/bitable/v1/apps/:app_token/tables/:table_id/records/search
Method: POST
Headers:
Authorization: Bearer token
Content-Type: application/json; charset=utf-8
Query parameters: page_size and page_token
Body: JSON with optional search conditions (it can be an empty object)
Pagination: the API returns a page_token for the next page and a has_more flag that indicates whether more data remains
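
To make the pagination fields concrete, here is a minimal sketch of what one parsed page of the response might look like as a Python dict. The record contents and the token value are invented for illustration; only the field names code, msg, items, has_more, and page_token come from the description and code in this tutorial. The helper next_page_token is a hypothetical name, not part of any library.

```python
# Illustrative response for one page; the values are invented, not real API output.
sample_response = {
    "code": 0,  # the tutorial's code treats any non-zero code as an error
    "msg": "success",
    "data": {
        "items": [{"record_id": "rec_example_1", "fields": {"Name": "Alice"}}],
        "has_more": True,
        "page_token": "token_for_page_2",
    },
}


def next_page_token(response):
    """Return the token for the next page, or None when nothing is left to fetch."""
    data = response.get("data", {})
    if data.get("has_more"):
        return data.get("page_token")
    return None


print(next_page_token(sample_response))
```

The whole pagination logic boils down to this check: keep requesting pages while has_more is true, passing the page_token from the previous response each time.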

2. Setting up the environment {#chuan-bi-moi-truong}

First, install the requests library:

pip install requests

3. Writing the API call function {#viet-ham-goi-api}

Now we will write a function that calls the API and fetches a single page:

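import requests

def call_api(app_token, table_id, access_token, page_token=None):
    """Fetch one page of records and return the parsed JSON response."""
    url = f"https://open.larksuite.com/open-apis/bitable/v1/apps/{app_token}/tables/{table_id}/records/search"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json; charset=utf-8"
    }
    # page_size and page_token are sent as query parameters; 500 is the maximum page size.
    params = {"page_size": 500}
    if page_token:
        # Only send page_token from the second page onward.
        params["page_token"] = page_token

    # The JSON body carries optional search conditions; an empty body applies no filter.
    response = requests.post(url, headers=headers, params=params, json={}, timeout=30)
    return response.json()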

4. Implementing pagination {#thuc-hien-phan-trang}

Next, we will write a function that handles pagination and collects all the data:

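def get_all_records(app_token, table_id, access_token):
    """Collect the records from every page into a single list."""
    all_records = []
    page_token = None

    while True:
        data = call_api(app_token, table_id, access_token, page_token)

        # A non-zero code signals an API error; raise instead of silently returning partial data.
        if data.get("code") != 0:
            raise RuntimeError(f"API error {data.get('code')}: {data.get('msg')}")

        # Guard against a missing or null "items" field (for example, an empty table).
        records = data["data"].get("items") or []
        all_records.extend(records)

        if not data["data"].get("has_more"):
            break

        # Carry the token forward so the next request continues where this page ended.
        page_token = data["data"].get("page_token")

    return all_records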
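
To see the loop above work without network access, here is a minimal sketch that runs the same pagination logic against an in-memory stand-in for the API. The names iter_records, FAKE_PAGES, and fake_fetch are hypothetical and exist only for this illustration; the generator form is a design variation, not the method used elsewhere in this tutorial.

```python
def iter_records(fetch_page):
    """Yield records page by page; fetch_page(page_token) must return the parsed JSON."""
    page_token = None
    while True:
        data = fetch_page(page_token)
        if data.get("code") != 0:
            raise RuntimeError(f"API error: {data.get('msg')}")
        payload = data["data"]
        yield from payload.get("items") or []
        if not payload.get("has_more"):
            return
        page_token = payload.get("page_token")


# Hypothetical stand-in for the API: three pages keyed by the token that requests them.
FAKE_PAGES = {
    None: {"items": [1, 2], "has_more": True, "page_token": "p2"},
    "p2": {"items": [3, 4], "has_more": True, "page_token": "p3"},
    "p3": {"items": [5], "has_more": False},
}


def fake_fetch(page_token):
    return {"code": 0, "msg": "success", "data": FAKE_PAGES[page_token]}


records = list(iter_records(fake_fetch))
print(records)  # [1, 2, 3, 4, 5]
```

With the real API, the same generator could be driven by passing `lambda token: call_api(app_token, table_id, access_token, token)` as fetch_page. A generator lets the caller process records as each page arrives instead of holding every record in memory first.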

5. Putting the code together {#tong-hop-ma-nguon}

Here is the complete source code combining the parts above:

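import requests

def call_api(app_token, table_id, access_token, page_token=None):
    """Fetch one page of records and return the parsed JSON response."""
    url = f"https://open.larksuite.com/open-apis/bitable/v1/apps/{app_token}/tables/{table_id}/records/search"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json; charset=utf-8"
    }
    # page_size and page_token are sent as query parameters; 500 is the maximum page size.
    params = {"page_size": 500}
    if page_token:
        # Only send page_token from the second page onward.
        params["page_token"] = page_token

    # The JSON body carries optional search conditions; an empty body applies no filter.
    response = requests.post(url, headers=headers, params=params, json={}, timeout=30)
    return response.json()

def get_all_records(app_token, table_id, access_token):
    """Collect the records from every page into a single list."""
    all_records = []
    page_token = None

    while True:
        data = call_api(app_token, table_id, access_token, page_token)

        # A non-zero code signals an API error; raise instead of silently returning partial data.
        if data.get("code") != 0:
            raise RuntimeError(f"API error {data.get('code')}: {data.get('msg')}")

        # Guard against a missing or null "items" field (for example, an empty table).
        records = data["data"].get("items") or []
        all_records.extend(records)

        if not data["data"].get("has_more"):
            break

        # Carry the token forward so the next request continues where this page ended.
        page_token = data["data"].get("page_token")

    return all_records

# Usage: replace the placeholders with your own values
app_token = "your_app_token"
table_id = "your_table_id"
access_token = "your_access_token"

all_records = get_all_records(app_token, table_id, access_token)
print(f"Total records: {len(all_records)}")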

6. Running the script {#chay-script}

To run the script:

Save the source code to a file, for example get_all_records.py
Replace your_app_token, your_table_id, and your_access_token with your actual values
Run the script with the command:

python get_all_records.py
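
As an alternative to editing the placeholders in the file, the three values could be read from environment variables so that the access token never ends up in the source code. This is only a sketch: the variable names LARK_APP_TOKEN, LARK_TABLE_ID, and LARK_ACCESS_TOKEN and the helper load_credentials are assumptions made for this example, not something the API or the script requires.

```python
import os

# Hypothetical variable names; pick whatever fits your setup.
ENV_NAMES = ("LARK_APP_TOKEN", "LARK_TABLE_ID", "LARK_ACCESS_TOKEN")


def load_credentials(env=None):
    """Return (app_token, table_id, access_token) read from the environment."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in ENV_NAMES)


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {
    "LARK_APP_TOKEN": "app123",
    "LARK_TABLE_ID": "tbl456",
    "LARK_ACCESS_TOKEN": "tok789",
}
app_token, table_id, access_token = load_credentials(demo_env)
print(app_token, table_id)  # app123 tbl456
```

In the script itself, calling load_credentials() with no argument would read the real environment, and the returned values can be passed straight to get_all_records.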

7. Conclusion {#ket-luan}

In this tutorial, we learned how to:

Understand the structure of a paginated API
Write a function that calls the API
Handle pagination to fetch all the data
Combine the code into a complete script

The same approach works for other APIs that paginate with a next-page token and a has_more style flag, not only the Lark Suite Bitable API. APIs that paginate by offset or page number need the same loop with a different way of computing the next request.

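Note that the API has a rate limit of 20 requests per second. The script does not enforce this limit itself; it only keeps the number of requests low by using the maximum page_size of 500 per request. Because the requests are made one after another, the limit is unlikely to be reached in practice, but if you process a large amount of data or run several scripts at once, consider adding a delay between requests to stay under the limit.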
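
The delay mentioned above can be sketched as a small throttle that enforces a minimum interval between consecutive requests. The Throttle class is a hypothetical helper written for this example, and the interval of 1/20 second is derived from the 20 requests per second limit stated above; adjust both to your situation.

```python
import time


class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 20 requests per second means at least 1/20 s between requests.
throttle = Throttle(min_interval=1 / 20)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # in the script, call this right before each call_api(...)
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f}s")
```

The first call passes immediately and each later call sleeps only for the time still missing, so a slow API response does not add extra delay on top of the network time.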

Published: 15/07/2024