
YouTube Data API to Crawl All Comments and Replies

I have been desperately seeking a solution to crawl all comments and corresponding replies for my research. I am having a very hard time creating a data frame that includes the comment data.

Solution 1:

According to the official doc, the property replies.comments[] of CommentThreads resource has the following specification:

replies.comments[] (list) A list of one or more replies to the top-level comment. Each item in the list is a comment resource.

The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.

Consequently, if you want to obtain all reply entries associated with a given top-level comment, you will have to query the Comments.list API endpoint appropriately.
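The pagination involved here can be sketched without the client library: keep requesting pages and following nextPageToken until none is returned. Below is a minimal offline sketch; `fetch_all` and the stubbed `_pages` dict are illustrative stand-ins for real `comments().list(parentId=...)` responses, not actual API calls:

```python
def fetch_all(fetch_page):
    """Collect items across all pages, following nextPageToken."""
    items, token = [], None
    while True:
        page = fetch_page(token)
        items.extend(page['items'])
        token = page.get('nextPageToken')
        if token is None:
            return items

# Stubbed pages standing in for comments().list(parentId=...) responses
_pages = {
    None: {'items': ['reply-1', 'reply-2'], 'nextPageToken': 'p2'},
    'p2': {'items': ['reply-3']},
}

replies = fetch_all(lambda token: _pages[token])
print(replies)  # ['reply-1', 'reply-2', 'reply-3']
```

The client library's `list_next(request, response)` used in the code below wraps exactly this token-threading loop.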

I recommend reading my answer to a closely related question; it has three sections:

  • Top-Level Comments and Associated Replies,
  • The property nextPageToken and the parameter pageToken, and
  • API Limitations Imposed by Design.

From the get-go, you'll have to acknowledge that the API (as currently implemented) does not allow you to obtain all top-level comments associated with a given video when the number of those comments exceeds a certain (unspecified) upper bound.


As for a Python implementation, I would suggest structuring the code as follows:

def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has in
            # 'replies.comments' an array of 'Comments Resource'.
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies

Note that the ellipsis dots above -- ... -- would have to be replaced with actual code that fills in the array of structures to be returned by get_video_comments to its caller.

The simplest way (useful for quick testing) would be to replace ... with comments.append(comment) and then have the caller of get_video_comments simply pretty-print (using json.dump) the object obtained from that function.
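For instance, the pretty-printing part of such a caller could look like the sketch below; since the real service object needs credentials, the snippet uses a trimmed sample thread in place of the live output of get_video_comments:

```python
import json

# A trimmed sample of what get_video_comments would append for each
# CommentThreads resource (illustrative, not a full API response)
sample_comments = [{
    'kind': 'youtube#commentThread',
    'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg',
    'snippet': {'totalReplyCount': 0},
}]

# Pretty-print the collected threads for quick inspection
print(json.dumps(sample_comments, indent=4))
```

In a real run you would pass `comments = get_video_comments(service, video_id)` instead of the sample list, and `json.dump(comments, sys.stdout, indent=4)` would write the same output.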

Solution 2:

Based on stvar's answer and the original publication here, I built this code:

import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# For more information on creating your credentials JSON, please visit
# https://python.gotrained.com/youtube-api-extracting-comments/
CLIENT_SECRETS_FILE = "client_secret.json"
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has in
            # 'replies.comments' an array of 'Comments Resource'.
            comments.append(comment)

        request = service.commentThreads().list_next(
            request, response)

    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100  # the API caps maxResults at 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies


if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    # video id here (the video id of https://www.youtube.com/watch?v=vedLpKXzZqE is vedLpKXzZqE)
    videoId = input('Enter Video id : ')
    comments = get_video_comments(service, videoId=videoId,
                                  part='id,snippet,replies',
                                  maxResults=100)  # the API caps maxResults at 100


with open('youtube_comments', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in comments:
        # write each comment thread as a single-column row
        writer.writerow([row])

It produces a file called youtube_comments with this format:

"{'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': {'value': 'UCKAa4FYftXsN7VKaPSlCivg'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z'}}, 'canReply': True, 'totalReplyCount': 0, 'isPublic': True}}""{'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': {'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z'}}, 'canReply': True, 'totalReplyCount': 3, 'isPublic': True}, 'replies': {'comments': [{'kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 
'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': {'value': 'UCrpB9oZZZfmBv1aQsxrk66w'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z'}}, {'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': {'value': 'UC04N8BM5aUwDJf-PNFxKI-g'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z'}}, {'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': {'value': 'UCvv6QMokO7KcJCDpK6qZg3Q'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z'}}]}}"

Now a second step is necessary in order to extract the required information. For this I used a set of Bash text-processing tools like cut, awk and sed:

cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/{print "------------------------****---------:::   Replies: "$6"  :::---------******--------------------------------"}!/replies/{print}' | sed '/^textOriginal:/,/^authorDisplayName:/{/^authorDisplayName/!d}' | sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' | sed 's/<[^>]*>//g' | sed 's/{textDisplay/{\ntextDisplay/' | sed '/^snippet:/d' | awk -F":" '(NF==1){print "========================================COMMENT==========================================="}(NF>1){a=0; print $0}' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file

The final result is a file called output_file with this format:

========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------:::   Replies:3,  :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
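As an alternative to the shell pipeline, which is fragile against quoting and colons inside comment text, the same fields could be pulled out in Python before anything is written to disk. A sketch, where `flatten_thread` is a hypothetical helper (the field names follow the CommentThreads resource; the sample thread below is illustrative):

```python
def flatten_thread(thread):
    """Yield (text, author, likes, published_date) for a top-level
    comment and each of its replies."""
    top = thread['snippet']['topLevelComment']['snippet']
    yield (top['textOriginal'], top['authorDisplayName'],
           top['likeCount'], top['publishedAt'][:10])
    for reply in thread.get('replies', {}).get('comments', []):
        s = reply['snippet']
        yield (s['textOriginal'], s['authorDisplayName'],
               s['likeCount'], s['publishedAt'][:10])

# Illustrative thread with one reply
thread = {
    'snippet': {'topLevelComment': {'snippet': {
        'textOriginal': 'This is another comment',
        'authorDisplayName': 'Mary Montes',
        'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z'}}},
    'replies': {'comments': [{'snippet': {
        'textOriginal': 'this is first reply',
        'authorDisplayName': 'JULIO EMPRESARIO',
        'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z'}}]},
}

rows = list(flatten_thread(thread))
print(rows[0])  # ('This is another comment', 'Mary Montes', 9, '2019-05-15')
```

The resulting tuples can then be handed to csv.writer.writerows to get a proper multi-column CSV instead of one dict per row.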

The Python script requires the file token.pickle to work; it is generated the first time the script runs, and when it expires it has to be deleted and generated again.
