[{"content":"\nChapter 1: RAG Fundamentals 1.1 What Is RAG RAG (Retrieval-Augmented Generation) is a technical architecture that combines information retrieval with large language model generation. The core idea is simple: instead of letting an LLM make up an answer from thin air, first retrieve relevant passages from an external knowledge base, then generate the answer based on those passages.\nTake a typical example. A user asks, \u0026ldquo;How do you calculate the D0 collection-entry rate?\u0026rdquo; A standard LLM might answer, \u0026ldquo;Roughly overdue orders divided by total orders,\u0026rdquo; which is vague and may be wrong. A RAG system first retrieves the exact definition from the metric dictionary, such as \u0026ldquo;D0 order collection entry = 1 - dpd0_repay / dpd0_cnt,\u0026rdquo; and then generates an accurate answer from that evidence. The answer becomes traceable, the definition can be verified, and hallucinations are reduced significantly.\n1.2 Core Components A RAG system has five core stages:\n1. Chunking: Split long documents into smaller passages suitable for retrieval. Chunk quality directly determines retrieval quality. Common strategies include splitting by heading hierarchy, by paragraph, or by a fixed character count. If chunks are too large, they introduce noise and may exceed context limits; if too small, they lose contextual relationships.\n2. Embedding: Map text into dense vectors so semantically similar text ends up close in vector space. Embeddings are the foundation of vector retrieval and define how well the system understands meaning. Common models include OpenAI\u0026rsquo;s text-embedding-3 family (strong quality but paid), BGE-small-zh (open-source and CPU-friendly), and m3e-base (good for Chinese).\n3. Retrieval: Given a user query, retrieve the most relevant top-k passages from a vector database or index. 
Retrieval is the heart of RAG because it determines what knowledge the LLM gets to see.\nMethod Principle Strengths Weaknesses BM25 Term frequency + inverse document frequency No embedding model required, runs locally, strong on domain-specific terminology Cannot handle synonyms or semantic similarity well Vector retrieval Semantic vector similarity Understands meaning, supports synonym matching Requires an embedding model and vector database Hybrid retrieval BM25 + vector fusion Combines both strengths for broader recall More complex to implement 4. Prompt Assembly: Combine retrieval results, the user question, and system instructions into the LLM input. Prompt design determines whether the model behaves as intended. Common strategies include enforcing answer structure (conclusion first, evidence second), confidence constraints (say \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo; when evidence is insufficient), scope boundaries (explicitly state what cannot be answered), and citation formatting (answers must include sources).\n5. LLM Generation: Produce the final answer from the assembled prompt. The generation model can be deployed locally (for example, Qwen, LLaMA, or ChatGLM) or called through an API (for example, GPT-4 or Claude).\n1.3 Chunking Strategies The right chunking strategy depends on document structure and business context:\nStrategy Principle Best For Potential Issues By heading hierarchy Split on # headings Markdown and structured documents A single heading may contain no meaningful content By paragraph Split on paragraph breaks Documents with clear paragraph structure Paragraphs may still be too long Fixed size Hard split by character count (for example, 900 chars) General-purpose scenarios May break semantic units Recursive splitting Recursively split by paragraph, then sentence General-purpose scenarios More complex to implement In practice, chunk quality matters more than algorithm choice. 
Common optimizations include filtering out meaningless heading-only fragments, cleaning Markdown noise (list markers, numbering, code markers), filtering very short content (fewer than 40 characters), and filtering by content density (for example, chunks where heading lines account for more than 70%).\n1.4 Choosing a Retrieval Strategy Vector retrieval and BM25 are not mutually exclusive. In real systems, you can keep both and switch between them through configuration.\nWhen BM25 works well: business documents full of domain-specific terms such as \u0026ldquo;collection-entry rate,\u0026rdquo; \u0026ldquo;spread,\u0026rdquo; or \u0026ldquo;average amount per case.\u0026rdquo; For exact-match recall, BM25 can perform as well as vector retrieval and does not require an extra embedding model, which makes it a good fit for resource-constrained environments. Its downside is weak handling of synonyms and conversational phrasing, such as \u0026ldquo;How much did they borrow?\u0026rdquo; versus \u0026ldquo;contract amount.\u0026rdquo;\nWhen vector retrieval works well: scenarios that require stronger semantic understanding, such as knowledge bases rich in synonyms, multilingual documents, or cases where semantic ranking matters. The cost is the need for an embedding model and vector database.\nA stronger setup is hybrid retrieval plus reranking: use BM25 or vector retrieval in stage one to recall the top 20 candidates quickly, then rerank them with a CrossEncoder or BGE-Rerank model and keep the top 4. 
This two-stage pipeline usually improves relevance substantially.\n1.5 Evaluation Methods RAG evaluation is usually split into retrieval quality and generation quality.\nMainstream Evaluation Frameworks Framework Core Idea RAGAS Uses an LLM to score faithfulness, answer relevance, and context relevance ARES Uses an LLM as a judge and scores answers against reference answers TruLens Provides dashboards that track both retrieval and generation metrics LangSmith Supports tracing and evaluation for production RAG systems, including human annotation workflows Retrieval-Side Evaluation Metric Meaning Evaluation Goal Recall@K Share of relevant passages that appear in the top-k results Whether recall is comprehensive MRR Mean reciprocal rank of the first correct result Whether the best result appears near the top NDCG Ranking quality with graded relevance, discounted by position Overall ranking quality Generation-Side Evaluation Method Description Limitation LLM-as-Judge Use GPT-4 or Claude to evaluate generated answer quality Sensitive to the judging prompt and somewhat subjective BLEU/ROUGE Automatic scoring based on n-gram overlap Cannot judge factual correctness Human evaluation Random sampling with manual quality review Expensive and hard to scale The hard part of generation evaluation is this: automated metrics such as BLEU and ROUGE only measure textual overlap. They cannot tell whether an answer is factually correct or semantically aligned. 
In business scenarios with high accuracy requirements, manual spot checks remain indispensable.\nMy RAG Evaluation Practice Combining industry practice with practical resource constraints, I use the following approach:\nRetrieval side: quantify retrieval quality with Recall@K and MRR, and regularly inspect the relevance of the top 6 results Generation side: rely mainly on manual review, sampling 20 to 30 real queries in each iteration End-to-end: collect bad cases from real meeting-query scenarios and keep iterating on the system 1.6 Optimization Directions RAG is a system that needs continuous iteration. Common optimization directions include:\nRetrieval side: query rewriting (make the query more retrieval-friendly), query expansion (add synonyms), hybrid retrieval plus reranking Generation side: Self-RAG to let the model decide whether retrieval is needed, and Corrective-RAG to detect hallucinations and retrieve again Knowledge-base side: layered knowledge (rank by importance or confidence), and incremental indexing (auto-update when new documents arrive) Chapter 2: Background and Motivation 2.1 Business Context I work as a business analyst in the lending industry. In my daily work, I rely on a multi-agent system for data analysis and report generation. Among those agents, the asset agent is the one I use most often, mainly for two core scenarios:\nMeeting-time data lookup: quickly query asset data by specific dimensions during meetings, such as collection-entry rate or spread for a given channel, package, or customer segment Report generation: repeatedly query data and summarize it under fixed definitions when producing daily or weekly reports 2.2 The Capabilities and Evolution of the Asset Agent The asset agent is responsible for full-spectrum asset analysis questions: disbursement scale, collection-entry rate, spread, channel distribution, package performance, and more. 
In meetings, when I need to answer questions like, \u0026ldquo;What was the D0 collection-entry rate for front-loaded new customers this week, and how did it change week over week?\u0026rdquo;, the agent has to pull data from a database or Excel files, filter dimensions, calculate metrics, and present a conclusion.\nThe previous implementation depended on asset-analysis skills. Those skills encapsulated all definitions, metric formulas, and analysis logic in advance, and the agent would call the relevant skill based on the user\u0026rsquo;s question.\nHowever, as the business expanded, the dimensions and metrics embedded in the skills kept growing:\nDimension / Metric Continuously Growing Content Channels Facebook / Google / TikTok / Organic / non-paid, plus 7 paid packages Packages 7 paid packages + APK packages Customer segments Existing customers / new customers (pure new / non-pure new) / multi-loan existing / multi-loan new Products Front-loaded / back-loaded × single-loan / multi-loan Metrics D-1 / D0 collection entry, D0 spread, cumulative spread, average amount per case, bad-debt rate, etc. RG grades A / B / C / D / E / F / G The limitations of the skills-based approach became increasingly obvious:\nPoor readability: definitions existed only as code, so callers could get results but could not directly inspect the reasoning process High maintenance cost: when dimensions and metrics grew, the skills code had to be updated in sync, and reuse across scenarios was hard Limited extensibility: when adding new analysis dimensions or handling exploratory questions, the skills structure was too rigid That is why I needed a RAG knowledge base as a complement to skills: move descriptive knowledge such as definitions, standards, and SOPs into the knowledge base and let RAG provide question-answer retrieval, while keeping structured computation logic in skills. 
Each part does what it is best at.\n2.3 Pressure on Knowledge Management Even after introducing a knowledge base, managing the knowledge itself remains difficult. Asset-analysis knowledge is scattered across multiple places:\nFeishu docs: metric dictionaries, dimension definitions, analysis SOPs Code comments: some business logic is embedded directly in Python scripts DM conversations: temporary definition agreements that exist only in message history This creates several problems:\nInconsistent definitions: the same term may be defined differently across documents, which can lead the agent to answer inconsistently Unsynced updates: code logic changes, but documentation does not, causing knowledge drift Poor reusability: it is hard to reuse definition knowledge quickly in new scenarios such as automated report generation High maintenance cost: it takes too long to understand the agent\u0026rsquo;s actual capability boundaries The number of asset-analysis dimensions and metrics will keep growing, and the complexity of knowledge management will keep rising with it. I needed a systematic knowledge-management mechanism, and that was the original driver behind RAG plus a knowledge vector store.\n2.4 Why RAG Across the industry, mainstream knowledge-management stacks have shifted toward RAG. 
In AI-native applications and enterprise knowledge-base Q\u0026amp;A scenarios, RAG has become the default architectural choice in practice:\nApproach Industry Status Pure document search Last-generation approach, lacks semantic understanding Hard-coded rules Expensive to maintain, difficult to scale Pure LLM memory Limited by context window, cannot handle large knowledge sets RAG Mainstream industry solution, suitable for almost all knowledge-Q\u0026amp;A scenarios RAG\u0026rsquo;s core strengths, natural-language Q\u0026amp;A, updatable knowledge, and traceable answers, align exactly with the needs of fast meeting-time lookup and report generation.\nAs an AI application builder, I also see hands-on RAG experience as essential. My learning goal was clear: chunking strategy, retrieval pipeline, prompt constraints, and evaluation can only really be understood by building them yourself and validating them in a real asset-agent workflow.\nChapter 3: System Build Process 3.1 Overview of the Build Flow Environment setup (install dependencies / Ollama / pull model) ↓ Configuration management (write config.yaml) ↓ Knowledge-base construction (organize markdown files) ↓ Index build (python build_index_v2.py) ↓ Start service (python app.py) ↓ Validation tests (/health → /search → /ask) 3.2 Environment Setup Python dependencies: fastapi uvicorn pydantic pyyaml jieba rank_bm25 chromadb requests\nOllama model: use ollama pull qwen2.5:0.5b. 
One model serves two purposes, both embedding and generation, which saves resources.\n3.3 Project Structure asset_rag/ ├── app.py # FastAPI service, core inference logic ├── build_index_v2.py # Index build script ├── config.yaml # All configurable parameters ├── prompt.md # System prompt template ├── index.json # BM25 index (generated after build) └── chroma_db/ # Vector database (when vector mode is enabled) asset_knowledge_base/ # Knowledge base, separate from the service ├── 01_metric_definitions/ ├── 02_dimension_definitions/ ├── 03_data_dictionary/ └── 04_analysis_sop/ Design idea: keep the knowledge base independently maintained, and let the service pick up new knowledge automatically after the index is rebuilt.\n3.4 Configuration Management Centralize all parameters in config.yaml:\nollama_base_url: http://127.0.0.1:11434 ollama_model: qwen2.5:0.5b top_k: 6 max_context_chunks: 4 chunk_max_chars: 900 use_vector: false # Switch retrieval mode 3.5 Core Code: Service Entry Point app = FastAPI(title=\u0026#39;asset-local-rag\u0026#39;) @app.post(\u0026#39;/ask\u0026#39;) def ask(req: AskRequest): hits = retrieve(req.question, req.top_k or TOP_K) answer = call_ollama(req.question, hits) return {\u0026#39;answer\u0026#39;: answer, \u0026#39;citations\u0026#39;: hits} 3.6 Core Code: Unified Retrieval Entry def retrieve(query: str, top_k: int): if USE_VECTOR: return retrieve_vector(query, top_k) # Chroma vectors else: return retrieve_bm25(query, top_k) # BM25 BM25 retrieval: uses rank_bm25 plus jieba tokenization. It performs well on domain-specific terms and does not require an embedding model.\nVector retrieval: enabled when use_vector: true. 
Ollama is used for embeddings, and the vectors are stored in Chroma.\n3.7 Core Code: Prompt Assembly def call_ollama(question: str, contexts: list): context_text = \u0026#39;\\n\\n\u0026#39;.join([ f\u0026#34;[Source: {c[\u0026#39;path\u0026#39;]}]\\n{c[\u0026#39;text\u0026#39;]}\u0026#34; for c in contexts[:MAX_CONTEXT] ]) prompt = f\u0026#34;{PROMPT}\\n\\nKnowledge Passages:\\n{context_text}\\n\\nUser Question: {question}\u0026#34; resp = requests.post(f\u0026#39;{OLLAMA_URL}/api/generate\u0026#39;, json={\u0026#39;model\u0026#39;: OLLAMA_MODEL, \u0026#39;prompt\u0026#39;: prompt}) return resp.json()[\u0026#39;response\u0026#39;].strip() Key design choice: pass at most 4 chunks into the LLM, and label each chunk with its source.\n3.8 Index Construction python3 build_index_v2.py build_index_v2.py reads all markdown files, splits them by # headings, filters meaningless fragments and Markdown noise, and outputs index.json.\n3.9 Startup and Validation python3 app.py # Health check curl http://127.0.0.1:8787/health # RAG QA test curl -X POST http://127.0.0.1:8787/ask \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{\u0026#34;question\u0026#34;: \u0026#34;How do you calculate the D0 collection-entry rate?\u0026#34;}\u0026#39; Chapter 4: Knowledge Base Construction 4.1 Design Principles The knowledge base is the \u0026ldquo;brain\u0026rdquo; of the RAG system, and its quality directly determines retrieval quality. I follow three principles:\nClear boundaries: the first version only covers asset-data-related content Layered structure: organize content by \u0026ldquo;definitions → dimensions → data dictionary → analysis SOP\u0026rdquo; Consistent format: use uniform Markdown, and give each concept its own # heading 4.2 Directory Structure asset_knowledge_base/ ├── 01_metric_definitions/metric_dictionary.md # Metric definitions and formulas ├── 02_dimension_definitions/dimension_dictionary.md # Channels, customer segments, packages, etc. 
├── 03_data_dictionary/asset_data.md # Asset data field descriptions └── 04_analysis_sop/ # Analysis approaches for lower disbursement, worsening first delinquency, worsening spread, etc. The layering serves different purposes: the definition layer explains \u0026ldquo;what it is,\u0026rdquo; the data layer explains \u0026ldquo;what fields exist,\u0026rdquo; and the application layer explains \u0026ldquo;how to use it.\u0026rdquo;\n4.3 Markdown Writing Guidelines Give each concept its own # heading so it does not get split across different chunks:\n## D0 Order Collection Entry Rate Definition: the share of orders that remain unpaid on the due date. Formula: D0 order collection entry = 1 - dpd0_repay / dpd0_cnt Avoid writing pure bullet lists. Each list item should include explanatory text.\n4.4 Defining Knowledge Boundaries Can Answer Cannot Answer Disbursement scale, average amount per case, collection-entry rate, spread Approval rate, approval workflow Definitions of channels, packages, and customer segments Collection strategy Metric definitions and data field explanations operation_data Boundary control is implemented in the prompt layer. The knowledge base itself does not need special handling.\n4.5 Maintenance Workflow Organize markdown files → commit to asset_knowledge_base/ → run build_index_v2.py → service picks up changes automatically Chapter 5: Chunking Strategy Iteration 5.1 Why Chunking Matters Chunking is one of the easiest parts of RAG to underestimate, even though it has an outsized effect. Chunk quality directly determines whether retrieved content is complete.\n5.2 v1: Heading-Based Splits Plus Hard Character Limits Split on # headings. 
If a section is within 900 characters, keep it as a single chunk; if it is longer, split it further by accumulating paragraphs.\nProblems exposed by v1:\nProblem Symptom Impact Heading-only fragments A heading like # Secondary Heading becomes a chunk even with no body text Retrieval returns empty content Markdown noise List markers and numbering interfere with embeddings Vector quality drops Broken context Definitions and formulas get split apart Retrieved content is incomplete 5.3 v2: Filtering Plus Cleanup Before building the index, filter meaningless fragments such as text under 40 characters or chunks where heading lines make up more than 70% of all lines, and clean Markdown noise such as list markers, numbering, and code markers. The excerpt below assumes that a heading line is any line starting with #.\ndef is_meaningful_chunk(text: str) -\u0026gt; bool: stripped = text.strip() lines = stripped.splitlines() title_lines = sum(1 for line in lines if line.lstrip().startswith(\u0026#39;#\u0026#39;)) if len(stripped) \u0026lt; 40: return False if title_lines / len(lines) \u0026gt; 0.7: return False return True 5.4 Reflection Even with v2, a hardcoded 900-character split still has limitations. Formulas and explanations may end up in different chunks, tables can get broken apart, and cross-section references can fail. v2 is a practical answer under resource constraints, not the optimal one.\nChapter 6: Retrieval Pipeline Design 6.1 BM25 Retrieval BM25 is a sparse retrieval algorithm and can be understood as an upgraded TF-IDF: tokenize the query, score each query term by its term frequency (with saturation), weight it by IDF so that common words count less, and normalize for document length.\nStrengths: no embedding model required, runs locally, and performs well on domain-specific terms such as \u0026ldquo;D0 collection entry\u0026rdquo; or \u0026ldquo;average amount per case.\u0026rdquo;\nWeaknesses: it only matches literal wording, so synonyms such as \u0026ldquo;average amount per case\u0026rdquo; versus \u0026ldquo;average contract amount\u0026rdquo; cannot be recalled well.\n6.2 Vector Retrieval Switch it on or off through use_vector: true/false in config.yaml. 
Ollama handles embedding generation, and Chroma stores the vectors.\nDownside: it requires an extra embedding model and vector database. In practice, it ran slowly on a 2 vCPU / 3.8 GB machine.\n6.3 Unified Retrieval Entry def retrieve(query: str, top_k: int): if USE_VECTOR: return retrieve_vector(query, top_k) else: return retrieve_bm25(query, top_k) The mode is selected through configuration, so the switch is transparent to upper layers.\n6.4 Example Retrieval Result { \u0026#34;id\u0026#34;: \u0026#34;01_metric_definitions/metric_dictionary.md#3\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;01_metric_definitions/metric_dictionary.md\u0026#34;, \u0026#34;text\u0026#34;: \u0026#34;## D0 Order Collection Entry\\n\\nDefinition: still unpaid on the due date...\\nFormula: D0 order collection entry = 1 - dpd0_repay / dpd0_cnt\u0026#34;, \u0026#34;score\u0026#34;: 8.67 } score is the BM25 relevance score. It is unnormalized, so it works as a rough confidence signal for ranking results within one query rather than as an absolute threshold across queries.\nChapter 7: Prompt Engineering and Safety Constraints 7.1 System Prompt Design Prompt design determines whether the model behaves the way you expect. I follow three principles:\nWhitelist the scope: explicitly list which topics can be answered Blacklist refusal cases: explicitly list out-of-scope domains and standard refusal language Constrain uncertainty: if evidence is insufficient, the model must not invent an answer and should say \u0026ldquo;I don\u0026rsquo;t know\u0026rdquo; 7.2 My Prompt You are the local RAG QA model for the asset agent. 
Strict rules: - You may answer only based on the provided knowledge passages - You may answer only within the first-version scope of nigeria_asset.asset_data - For questions about approval rate, collection execution, or other out-of-scope topics, you must answer: \u0026#34;The current knowledge base does not support this question\u0026#34; - Do not fabricate fields, definitions, or conclusions 7.3 User Prompt Assembly context_text = \u0026#39;\\n\\n\u0026#39;.join([ f\u0026#34;[Source: {c[\u0026#39;path\u0026#39;]}]\\n{c[\u0026#39;text\u0026#39;]}\u0026#34; for c in contexts[:MAX_CONTEXT] # At most 4 chunks ]) prompt = f\u0026#34;{PROMPT}\\n\\nKnowledge Passages:\\n{context_text}\\n\\nUser Question: {question}\u0026#34; Key design choice: capping MAX_CONTEXT at 4 helps keep the prompt within the model\u0026rsquo;s context limit, and each chunk includes a source label for traceability.\n7.4 Boundary Test Results Question Expected Behavior Actual \u0026ldquo;How do you calculate D0 collection entry?\u0026rdquo; Answer from the metric dictionary ✅ \u0026ldquo;What was this week\u0026rsquo;s approval rate?\u0026rdquo; \u0026ldquo;Not supported by the knowledge base\u0026rdquo; ✅ \u0026ldquo;Which orders are overdue?\u0026rdquo; Refuse and say database lookup is required ✅ Chapter 8: Evaluation Results and Bottleneck Analysis 8.1 Subjective Score: 6 / 10 Dimension Score Notes Retrieval recall 6/10 Weak synonym recall Generation quality 5/10 The 0.5B model has limited understanding Response speed 8/10 Local inference, no network latency Stability 7/10 Occasional OOM under tight resources Explainability 8/10 Clear sources, good traceability The two biggest weaknesses are that the model is too small and that synonym recall is weak.\n8.2 Bottleneck 1: The Model Is Too Small qwen2.5:0.5b struggles with more complex reasoning, such as multi-step calculation questions. 
It also sometimes misinterprets analytical terms such as \u0026ldquo;week-over-week\u0026rdquo; and \u0026ldquo;year-over-year,\u0026rdquo; and instruction following is not fully stable.\n8.3 Bottleneck 2: Low Retrieval Recall BM25 performs literal matching only, which leads to issues like:\n\u0026ldquo;Average contract amount\u0026rdquo; does not recall \u0026ldquo;average amount per case\u0026rdquo; \u0026ldquo;First delinquency rate\u0026rdquo; does not match \u0026ldquo;first-delinquency deterioration\u0026rdquo; Conversational phrasing does not align with the formal wording in the knowledge base 8.4 Bottleneck 3: No Reranking Current pipeline: query → BM25 top_k → LLM\nIdeal pipeline: query → BM25 top_20 → Rerank top_4 → LLM\nA two-stage retrieval pipeline would significantly improve relevance, but current machine resources do not support it.\nChapter 9: Pitfalls and Lessons Learned Pitfall 1: OOM Due to Unrealistic Resource Assumptions I originally tried vector retrieval with qwen2.5:3b, and 3.8 GB of RAM was immediately exhausted. The fix was to downgrade to qwen2.5:0.5b and disable vector retrieval, using BM25 only.\nLesson: do not overestimate hardware. Start by getting the smallest viable setup running.\nPitfall 2: Chunk Size Was Too Coarse In the early version, I used 1500-character chunks, which split \u0026ldquo;definition + formula + notes\u0026rdquo; into three pieces, so retrieval often came back incomplete. The fix was to reduce the chunk size to 900 characters and require the knowledge base to follow a \u0026ldquo;one concept, one paragraph\u0026rdquo; writing style.\nLesson: chunking strategy and knowledge-base writing guidelines need to be optimized together.\nPitfall 3: Prompt Injection If the user entered something like \u0026ldquo;Ignore the previous rules,\u0026rdquo; the model could skip constraints. 
The fix was to add a \u0026ldquo;strict rules\u0026rdquo; section in the system prompt and filter inputs in the application layer.\nLesson: small local models do not follow instructions as reliably as larger models. You cannot assume prompt constraints are naturally effective.\nPitfall 4: Ollama Hot-Reload Issues When switching use_vector, if Ollama was not started cleanly, the service would return HTTP 500 errors without crashing, which made debugging harder. The fix was to add a /health endpoint for active status checks.\nChapter 10: Optimization Directions and Closing Thoughts 10.1 Model Upgrade Path qwen2.5:0.5b (current) ↓ Upgrade first when memory allows qwen2.5:3b (+2 points) ↓ Add more RAM / GPU when possible qwen2.5:7b (+3 points) ↓ Further upgrade LLaMA3.1:8B / ChatGLM4:9B 10.2 Two-Stage Retrieval Plan candidates = retrieve_bm25(query, top_k=20) # Stage 1: coarse ranking reranked = cross_encoder_rerank(query, candidates, top_k=4) # Stage 2: reranking Expected benefit: a 15% to 20% improvement in recall.\n10.3 Synonym Expansion QUERY_EXPANSIONS = { \u0026#34;average amount per case\u0026#34;: [\u0026#34;average contract amount\u0026#34;, \u0026#34;average borrowed amount per person\u0026#34;], \u0026#34;collection entry\u0026#34;: [\u0026#34;overdue\u0026#34;, \u0026#34;unpaid\u0026#34;, \u0026#34;non-performing\u0026#34;], } 10.4 Summary This RAG system is not a \u0026ldquo;perfect solution.\u0026rdquo; It is the best compromise under current resource constraints:\n2 vCPU / 3.8 GB / no GPU → BM25 + 0.5B is the only realistic option It is usable, but the bottlenecks are obvious → a score of 6 is a starting point, not an ending point Every iteration is about doing the most worthwhile thing the current resource budget allows The core engineering principle is this: there is no universally optimal solution, only the solution that best fits the current 
stage.\n","date":"2026-05-20T09:10:00+08:00","image":"/uploads/cover-enterprise-circuit-board.png","permalink":"/en/p/building-a-rag-powered-asset-analysis-system/","title":"Building a RAG-Powered Asset Analysis System"},{"content":"\nBackground After deploying OpenClaw to a cloud server and connecting it to Feishu, a single agent was already creating real value in my day-to-day data work. Once the asset analysis Skill went live, I no longer needed to run SQL queries manually. I could simply send a query instruction to the AI in Feishu and receive formatted metric results. At the beginning, that experience significantly improved my efficiency.\nAs business needs expanded, however, I started connecting the conversion analysis Skill and the collection analysis Skill to the same agent. Problems followed quickly: when handling queries across multiple business lines, the AI began to show what I would call \u0026ldquo;memory confusion.\u0026rdquo; It would occasionally apply asset data definitions to conversion analysis by mistake, or lose track of the original task after several turns of conversation. Limits in context length, Skill scheduling conflicts, and reduced task focus all became increasingly obvious. The bottlenecks of a single-agent setup were now clear.\nThe root problem was not that the model itself was incapable. The limitation was architectural. When a single agent faces complex tasks spanning multiple business lines and dimensions, its context window and task scheduling capacity have a natural upper bound. That led me to a new question: could I split different business lines into independent agents, let each one focus on a single domain of analysis and decision-making, and then rely on collaboration between them to solve more complex multi-dimensional tasks?\nMoving from a single agent to a multi-agent system is not just adding more agents. It is a redesign at the architectural level. 
This article documents my full journey within the OpenClaw framework from a single-agent architecture to a multi-agent collaboration system, including the problems I ran into, the technical choices I made, the architecture design, the implementation plan, and the final results. I hope it can serve as a useful reference for business practitioners.\n1. Why Start with a Single Agent In the first phase of building an AI-powered data analysis assistant, I chose a single-agent architecture. That decision was not accidental. It was a pragmatic response to the business scenario and resource constraints at the time.\nQuickly validating the business scenario At the beginning of the project, the core goal was to get the asset analysis business line working as quickly as possible and verify whether the AI assistant could provide real value in day-to-day work. The advantage of a single agent is its simplicity: the architecture is straightforward, the debugging path is clear, every request flows through the same entry point, and every Skill is loaded into the same context. That let me go from zero to one in the shortest possible time, get the asset analysis Skill working, and put it into real use.\nThis path makes sense for any new business scenario: focus first on solving one core problem, rather than trying to design a perfect multi-agent system before the architecture has matured.\nOne business line was enough for daily use When the asset analysis Skill first went online, daily query needs were concentrated in a single dimension: asset quality monitoring. The questions I needed to ask were relatively stable, including key metrics such as the D0 collection-entry rate, D0 spread, cumulative spread-to-date, bad debt ratio, and breakdowns by channel, package, and customer segment. 
A single agent handled this kind of structured query reliably, and the quality of the responses matched expectations.\nAt that stage, introducing multiple agents would only have added unnecessary complexity. Before scale and demand were fully validated, a single agent was the highest-leverage choice.\nCost and debugging efficiency came first From a resource perspective, a single agent consumes far less context window capacity and fewer tokens than a multi-agent system running in parallel. In the early phase of the project, the data volume was limited and the number of conversation turns was manageable, so the runtime cost of a single agent stayed within an acceptable range.\nFrom a debugging perspective, a single agent also offers a clean logging path and a short path to root cause analysis. If a Skill returned an unexpected result, I only needed to inspect the context and call history of one agent. That made troubleshooting efficient and kept iteration speed high.\n2. Where the Single Agent Hit Its Ceiling As the asset analysis Skill became more stable, I gradually connected the conversion analysis Skill and the collection analysis Skill to the same agent, hoping to create a unified query entry point across all three major business lines. In practice, the problems introduced by that expansion were more complicated than I expected.\nContext-window memory confusion Once the Skills for all three business lines were loaded into the same agent, the burden on the context window increased substantially. Consider a typical cross-domain analysis scenario: I first ask about the D0 collection-entry rate on the asset side, then switch to approval rate on the conversion side, and then ask about repayment rate on the collection side. After handling three rounds of tasks with completely different business semantics, the AI began to lose context. 
It reused asset-side definitions in conversion-side analysis, or confused the definitions of metrics across the two business lines.\nThis kind of \u0026ldquo;memory confusion\u0026rdquo; was not an occasional edge case. It happened frequently. The underlying reason is simple: the context window of a single agent had to carry the concepts, terminology, and data dictionaries of all three business lines at once. As conversations grew longer, the density of useful information fell steadily, and valuable signals were gradually diluted or overwritten.\nSkill scheduling conflicts The Skills for different business lines differed in when they should be called and in their input and output formats. When I made multiple cross-domain requests in the same conversation, the agent had to keep switching between different Skill contexts. During those switches, parameters were sometimes lost and stale state sometimes leaked across calls. For example, results produced by the asset analysis Skill could be incorrectly passed into the next calculation step of the collection Skill, causing the final output to drift away from the actual business expectation.\nThese conflicts were rare when only one Skill was in play, but they became increasingly obvious as the number of Skills grew.\nReduced task focus When handling mixed tasks, a single agent had to keep switching role identities within one context. At one moment it acted as an asset analyst, then as a conversion analyst, then as a collection analyst. Those repeated identity switches consumed a large amount of context space and reduced the agent\u0026rsquo;s focus in each business line. 
In practice, that showed up as more generic analytical replies and less precise structured data when querying asset metrics.\nOnce business needs evolved from \u0026ldquo;single-line queries\u0026rdquo; to \u0026ldquo;multi-line collaborative analysis,\u0026rdquo; the architectural bottleneck of a single agent could no longer be solved by prompt tuning or minor Skill optimization. The answer had to come from the architecture itself.\n3. Design Thinking Behind the Multi-Agent Architecture To address the architectural bottlenecks of a single agent, I designed a master-worker agent architecture with a CEO Agent serving as the coordination entry point and three specialist child agents responsible for execution.\nOverall architecture The CEO Agent (coordination entry point) serves as the unified interface for user interaction and sits at the front line of Feishu conversations. When a user submits a task in Feishu, the CEO Agent is responsible for understanding the intent, determining which business line the task belongs to, dispatching the task to the appropriate specialist child agent, and finally collecting the returned results into one unified response.\nThe CEO Agent has its own scheduling rules in dispatch-rules.md, which explicitly define the responsibility boundaries of each child agent and the task routing logic. It acts as the coordination hub of the entire architecture.\nThe asset Agent is dedicated to asset quality monitoring. It is connected to the asset analysis Skill, has its own isolated workspace and workflow conventions, and focuses on analyzing and reporting metrics such as collection-entry rate, spread, and bad debt ratio.\nThe operation Agent is dedicated to operational conversion analysis. 
It is connected to the conversion analysis Skill and focuses on analyzing and reporting metrics across the full funnel from application to disbursement, including application creation rate, submission rate, approval rate, and disbursement rate.\nThe collection Agent is dedicated to collection team management. It is connected to the collection analysis Skill and focuses on analyzing and reporting metrics such as collection settlement rate, extension rate, and repayment rate.\nCollaboration mechanism The CEO Agent collaborates with the three child agents through task distribution and result aggregation:\nThe CEO Agent receives the user\u0026rsquo;s request in Feishu. It identifies the task type and routes it to the appropriate specialist child agent according to the scheduling rules (asset, operation, or collection). Each child agent performs its analysis inside its own isolated workspace and returns the result to the CEO Agent. The CEO Agent consolidates the child-agent outputs and returns them in a unified format: key conclusion, critical data, main anomalies, which link had the biggest impact, and next-step recommendations. Scheduling rules The CEO Agent\u0026rsquo;s scheduling rules cover two types of scenarios:\nSingle-domain questions: questions about one business line are sent directly to the corresponding child agent. For example, questions involving collection-entry rate or spread go to the asset Agent, questions involving conversion rate or drawdown rate go to the operation Agent, and questions involving collection-stage performance go to the collection Agent.\nCross-domain questions: complex questions involving multiple business lines are decomposed by the CEO Agent and then dispatched to multiple child agents in parallel. For example, daily and weekly reports require coordination across all three business lines, so the CEO Agent sends them to the asset, operation, and collection agents at the same time and then produces one combined summary.\n4. 
Technical Implementation Plan Below is the complete configuration approach for the current multi-agent system. It can be executed directly within OpenClaw to generate the full setup automatically.\nAgent configuration Register four agents under agents.list in openclaw.json, with MiniMax configured as the fallback model for each:\n{ \u0026#34;id\u0026#34;: \u0026#34;ceo\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;CEO Agent\u0026#34;, \u0026#34;workspace\u0026#34;: \u0026#34;/root/.openclaw/workspace/agents/ceo/workspace\u0026#34;, \u0026#34;model\u0026#34;: { \u0026#34;primary\u0026#34;: \u0026#34;openai-codex/gpt-5.4\u0026#34;, \u0026#34;fallbacks\u0026#34;: [\u0026#34;minimax-portal/MiniMax-M2.7\u0026#34;] } }, { \u0026#34;id\u0026#34;: \u0026#34;asset\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Asset Analyst\u0026#34;, \u0026#34;workspace\u0026#34;: \u0026#34;/root/.openclaw/workspace/agents/asset/workspace\u0026#34;, \u0026#34;model\u0026#34;: { \u0026#34;primary\u0026#34;: \u0026#34;openai-codex/gpt-5.4\u0026#34;, \u0026#34;fallbacks\u0026#34;: [\u0026#34;minimax-portal/MiniMax-M2.7\u0026#34;] } }, { \u0026#34;id\u0026#34;: \u0026#34;operation\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Operation Analyst\u0026#34;, \u0026#34;workspace\u0026#34;: \u0026#34;/root/.openclaw/workspace/agents/operation/workspace\u0026#34;, \u0026#34;model\u0026#34;: { \u0026#34;primary\u0026#34;: \u0026#34;openai-codex/gpt-5.4\u0026#34;, \u0026#34;fallbacks\u0026#34;: [\u0026#34;minimax-portal/MiniMax-M2.7\u0026#34;] } }, { \u0026#34;id\u0026#34;: \u0026#34;collection\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Collection Analyst\u0026#34;, \u0026#34;workspace\u0026#34;: \u0026#34;/root/.openclaw/workspace/agents/collection/workspace\u0026#34;, \u0026#34;model\u0026#34;: { \u0026#34;primary\u0026#34;: \u0026#34;openai-codex/gpt-5.4\u0026#34;, \u0026#34;fallbacks\u0026#34;: [\u0026#34;minimax-portal/MiniMax-M2.7\u0026#34;] } } Feishu entry point configuration The entire 
system has only one Feishu entry point: the CEO bot account. All user requests come in through that single account, are received by the CEO Agent, and are then routed and dispatched by it.\nDirectory structure Each agent has its own isolated workspace and includes the following files:\nagents/ ├── ceo/ │ ├── SOUL.md # CEO role definition and scheduling rules │ ├── dispatch-rules.md # Single-domain / cross-domain routing logic │ ├── workflow.md # CEO workflow │ ├── MEMORY.md # Cross-domain memory │ └── workspace/ # CEO working directory ├── asset/ │ ├── SOUL.md # Asset analyst role definition │ ├── workflow.md # Asset analysis workflow │ ├── skills.md # Binds asset-metrics │ └── workspace/ ├── operation/ │ ├── SOUL.md │ ├── workflow.md │ ├── skills.md # Binds apply-metrics │ └── workspace/ └── collection/ ├── SOUL.md ├── workflow.md ├── skills.md # Binds collection-metrics └── workspace/ CEO Agent scheduling rules dispatch-rules.md defines the task routing logic:\nSingle-domain questions are routed directly:\nCollection-entry rate, spread, bad debt ratio, and similar metrics → asset Agent Approval rate, drawdown rate, conversion funnel, and similar metrics → operation Agent Collection recovery rate, caseR, amtR, and similar metrics → collection Agent Cross-domain questions are routed in parallel:\nDaily reports / weekly reports → route to asset + operation + collection together Channel-level integrated analysis → route to asset + operation together Unified summary output format:\nKey conclusion → Critical data → Main anomalies → Which link had the biggest impact → Next-step recommendations Skill bindings Agent Bound Skill asset asset-metrics operation apply-metrics collection collection-metrics ceo No calculation Skill bound; only dispatching and summarization Task flow User → CEO Agent (single entry point) → Determine route → Dispatch to specialist agents → Collect results → Summarize output → User Single-domain flow: User → CEO → Corresponding specialist agent → 
CEO → User\nCross-domain flow: User → CEO → Multiple specialist agents in parallel → CEO summarizes → User\n5. Key Challenges During Implementation and How I Solved Them The advantages of a multi-agent architecture are easy to see in design documents, but moving from a single agent to coordinated multi-agent execution surfaced several concrete implementation challenges. I eventually found workable solutions for each of them.\nChallenge 1: The tension between context isolation and context sharing Problem\nEach child agent has its own independent workspace and context. That prevents interference between agents, but it also means that shared information must be passed explicitly. If the asset Agent detects an abnormal spike in the collection-entry rate for a specific channel, that information is not automatically synchronized to the operation Agent. When the operation Agent later analyzes the conversion rate for that same channel, it will not naturally associate the result with the asset-quality anomaly.\nSolution\nThe CEO Agent actively establishes cross-business-line correlations during the summarization step. In dispatch-rules.md, I explicitly defined the rule that \u0026ldquo;if multiple agents produce conflicting conclusions, the conflict must be called out and followed by further verification.\u0026rdquo; When a user asks a cross-business-line question, the CEO Agent proactively dispatches the task to multiple child agents and then creates the cross-domain linkage in the final summary, rather than simply concatenating independent conclusions from each agent.\nChallenge 2: Defining clean Skill boundaries Problem\nEach of the three child agents is bound to a different Skill, and those boundaries must stay clear. Otherwise, responsibilities become ambiguous and duplicate calls become likely. If the metric definitions inside one Skill change, every agent that relies on that Skill must be updated consistently. 
Any missed update can create data inconsistency.\nSolution\nEach child agent has its own skills.md file that explicitly defines the Skills it is allowed to use and the usage rules around them. I enforced a rule that each agent may call only the Skills declared in its own skills.md, and may not call across agent boundaries. Any definition change is maintained within the corresponding agent\u0026rsquo;s skills.md, while the CEO Agent\u0026rsquo;s dispatch-rules.md centrally coordinates routing logic. That keeps responsibility boundaries clean at the source.\nChallenge 3: Controlling the quality of child-agent outputs Problem\nIn the single-agent era, I only needed to evaluate the quality of one agent\u0026rsquo;s output. In a multi-agent setup, I had to monitor the output quality of three child agents at once and then determine, during CEO-side summarization, whether there were conflicts in metric definitions or contradictions in conclusions. Early in implementation, the output formats of the agents were inconsistent, so the CEO Agent had to do additional normalization work before it could produce a unified summary.\nSolution\nI introduced a mandatory unified output template in output-template.md. Every child agent is required to return results in the same structure: analysis topic → key conclusion → critical data → dimensional breakdown → anomaly points → cause assessment → next-step recommendations. When the CEO Agent receives results from child agents, it first checks whether the format is compliant. Non-compliant outputs are rejected and recomputed. That guarantees consistency in the final aggregated report.\n6. 
Results and Comparison After the multi-agent architecture went live, I ran a systematic evaluation of its performance in everyday business scenarios, focusing on the differences between the single-agent and multi-agent approaches across several core dimensions.\nContext memory capability With a single agent, the context window gradually accumulates the concepts and terminology of all three business lines across multiple conversation turns. As the conversation gets longer, the density of useful information drops, and the AI starts to lose context. It may incorrectly apply asset-side definitions to conversion scenarios, or forget the original analytical objective after a long exchange.\nWith multiple agents, each child agent has its own isolated context with no cross-interference. The asset Agent carries only asset-analysis-related content, while the operation Agent carries only conversion-analysis-related content. That isolation keeps each agent\u0026rsquo;s context dense and focused, rather than diluted as business lines are added. In actual testing, even after 20 consecutive rounds of cross-business-line conversation, each child agent could still accurately understand the goals within its own domain.\nTask focus With a single agent, mixed tasks require frequent identity switching: from asset analyst to conversion analyst to collection analyst. Those switches consume a large amount of context space and reduce the AI\u0026rsquo;s focus in each business line, leading to more generalized analytical replies and fewer precise data outputs.\nWith multiple agents, each child agent is locked to a single role, and focus improves significantly. The asset Agent remains an asset analyst, the operation Agent remains a conversion analyst, and the collection Agent remains a collection analyst. 
In actual testing, all three child agents produced more precise outputs than the single-agent system did, with better completeness and accuracy in structured data.\nCross-business-line analysis capability A single agent handling cross-domain questions such as \u0026ldquo;Why is conversion good in this channel while asset quality is poor?\u0026rdquo; must process multiple analytical logics in a single context, which makes definition confusion and data mismatches much more likely.\nIn a multi-agent setup, the CEO Agent receives the cross-domain question, dispatches it to multiple child agents in parallel, and lets each child agent complete its analysis in its own isolated context. The results are then merged back by the CEO Agent. In actual testing, the quality of cross-business-line analysis was clearly better than in the single-agent setup, and the CEO Agent was able to establish accurate cross-domain correlations during summarization.\nResponse time With a single agent, all tasks are processed inside one context, so the execution path is shorter and response time for a single question is slightly lower than in a multi-agent setup.\nWith multiple agents, the CEO dispatch step and result aggregation step add overhead, so single-domain questions are slightly slower. However, cross-domain questions become faster overall because multiple child agents can execute in parallel instead of being handled serially by one agent.\nScheduling reliability A single agent has simpler routing logic and a lower chance of routing errors, but it cannot handle complex collaborative tasks across business lines.\nWith multiple agents, dispatch-rules.md provides standardized task routing logic. Single-domain questions are routed directly to the corresponding child agent, while cross-domain questions are decomposed and dispatched centrally by the CEO Agent. Once the routing rules are codified in a file, the debugging and iteration path becomes clear. 
When routing errors occur, I can trace them directly to the relevant rule entry.\n7. Example Application Scenario The practical value of a multi-agent architecture ultimately has to be validated in a real business scenario. The following example shows a full weekly business report workflow and demonstrates how the CEO Agent coordinates three specialist child agents to complete the task together.\nDemo overview This scenario simulates an ordinary weekly report generation workflow. It requires the CEO Agent to coordinate the Asset, Operation, and Collection specialist agents to produce a joint result. All metric names, dimension names, and data sources come from the actual data warehouse fields used by each agent. The purpose is to demonstrate multi-agent division of labor, data warehouse mapping capability, and report generation capability.\nCollaboration flow The weekly report is generated through the following workflow:\nUser submits a weekly report request → CEO Agent decomposes and dispatches the task → Three child agents execute in parallel → CEO Agent validates and summarizes → Final weekly report is returned The responsibilities of each agent are as follows:\nThe CEO Agent confirms the reporting period, breaks the request into subtasks, defines the unified delivery structure, and performs final consistency checks on time definitions, field definitions, and metric naming.\nThe Asset Agent uses the asset-metrics Skill to query the nigeria_asset.asset_data data source and generate disbursement scale, average ticket size, collection-entry metrics, spread metrics, and dimensional breakdowns.\nThe Operation Agent uses the apply-metrics and operation-data-metrics Skills to query nigeria_asset.apply_data and nigeria_asset.operation_data, and generates metrics for applications, approvals, drawdowns, and conversion funnels.\nThe Collection Agent uses the collection-metrics Skill to query nigeria_asset.nigeria_collection_data, and generates post-loan metrics such as 
recoveries, extensions, settlements, caseR, and amtR.\nSample core conclusions from the weekly report The final weekly report includes the following key conclusions:\nDisbursement scale: This week\u0026rsquo;s recovery in disbursement scale was driven mainly by returning customers and deferred returning customers; pure new customers still did not become the primary growth engine.\nConversion: Approval rates for front-loaded and back-loaded pure new customers remained weak, while returning customers and repeat returning customers delivered much better conversion efficiency.\nAssets: Asset quality for new customers remained weaker than for returning customers. Differences across package groups and RG tiers were significant, and the shared-debt dimension had a particularly strong impact on repeat-loan assets.\nPost-loan: Collections in the D0 and S1 stages were relatively strong, but amount recovery at S3+ remained under pressure, and later-stage company performance showed clear divergence.\nFull demo document The full weekly report demo, including detailed data charts, definition notes, and validation records for each business line, is available here:\nhttps://github.com/QuantShawn/openclaw-multiAgents-project/blob/master/docs/06-multi-agent-weekly-report-demo.md\n8. Conclusion This article documents the full path of my evolution from a single-agent setup to a multi-agent collaboration system.\nStarting point: At the end of February this year, I completed the OpenClaw deployment and Feishu integration, and the asset analysis Skill officially went live. Every day, I could send query instructions to the AI in Feishu and receive formatted metric results, which significantly improved data-observation efficiency.\nBottleneck: As business needs expanded, I gradually connected the conversion analysis and collection analysis Skills to the same agent. Problems followed quickly: memory confusion in the context window, Skill scheduling conflicts, and declining task focus. 
The architectural bottleneck of a single agent in a multi-business-line scenario became unmistakable.\nSolution: I designed a master-worker architecture with the CEO Agent as the coordination entry point and three specialist child agents (asset, operation, and collection) responsible for execution. The CEO Agent receives Feishu requests through one unified interface, routes tasks to the appropriate child agents using the rules defined in dispatch-rules.md, and then aggregates the results back into a unified response according to the standardized output template.\nImplementation outcome: After the multi-agent architecture went live, context memory, task focus, and cross-business-line analytical capability all improved significantly compared with the single-agent setup. During implementation, I addressed three key challenges by standardizing dispatch-rules.md and output-template.md: context isolation versus sharing, clean Skill boundary definition, and output quality control.\n","date":"2026-05-05T09:00:00+08:00","image":"/uploads/cover-ai-brain-network.png","permalink":"/en/p/from-single-agent-to-multi-agent-a-practical-technical-implementation/","title":"From Single Agent to Multi-Agent: A Practical Technical Implementation"},{"content":"\nBackground As a data analyst in the lending industry, I work across three dimensions every day: asset analysis, conversion analysis, and team performance analysis. Behind each business area sits a full metrics system. Asset quality monitoring involves core indicators such as collection-entry rate, spread, and bad debt rate. Conversion funnel analysis requires tracking the full chain from application to disbursement. Collections team management focuses on metrics like collection efficiency, repayment rate, and recovered amount.\nIn practice, I used to open three large Excel files every morning just to review the data. Each file contained about 10 tabs, and together they covered nearly 30 business charts. 
A single review cycle took 1 to 2 hours. During that process, I had to log into databases repeatedly, run SQL queries, cross-check numbers across dimensions, and watch for anomalies. It was time-consuming and also easy to get wrong due to human fatigue.\nAs the business kept growing, both the data volume and the complexity of the metrics increased. Manual monitoring was no longer enough for day-to-day needs. That pushed me to ask a more practical question: could I hand part of the analytical framework to AI and turn it into a real data analysis partner that understands the business?\nFeishu is the only collaboration platform used inside my company. Nearly all internal communication, file management, and knowledge sharing live in the Feishu ecosystem. Integrating the AI assistant into Feishu means users do not need to change any existing habits. They can simply send a request in a Feishu chat window, and the AI can call tools and return an analysis result directly. That \u0026ldquo;ask in Feishu, get the result in Feishu\u0026rdquo; experience is the most natural fit for the actual business workflow.\nWith that in mind, I chose OpenClaw as the AI Agent framework, deployed it to a cloud server, and integrated it with a self-built Feishu bot. The AI assistant handles business intent understanding, tool invocation, and structured results. Feishu serves as the single interaction entry point for all requests and feedback.\nThis article documents the full process, from environment preparation and framework deployment to Feishu integration, business Skill customization, and memory system design, for developers facing similar requirements.\n1. Why This Stack Why I Chose OpenClaw When evaluating AI Agent frameworks, I focused on three dimensions: architectural flexibility, extensibility, and operational complexity.\nIn late February this year, I spent a full week researching the AI Agent frameworks available on the market. 
For the vertical use case of business data analysis, OpenClaw was the most mature solution I could find. It not only provides a multi-agent architecture and a Skill extension mechanism, but also ships with complete official support for Feishu integration. Compared side by side with alternatives, OpenClaw was the only framework that could satisfy my three core requirements at once: fast deployment, Feishu access, and custom business Skills. I did not find another real substitute.\nThe main reason OpenClaw made my shortlist was its multi-agent architecture. Unlike a single-agent chatbot that can only handle simple Q\u0026amp;A scenarios, OpenClaw can break a complex task into multiple sub-agents that execute separately. For example, \u0026ldquo;analyze yesterday\u0026rsquo;s asset data\u0026rdquo; can be split between one agent responsible for database querying and another responsible for metric calculation. That architecture has a clear advantage when you are dealing with multiple business lines and multi-dimensional data analysis.\nThe second key factor was the Skill extension mechanism. OpenClaw provides a standardized way to build Skills, which allowed me to package my own Python scripts and SQL query logic into tools the AI can invoke directly. That means the AI is not doing generic question answering. It is calling technical tools with actual business capabilities. In the asset analysis scenario, for example, I wrapped my SQL query scripts into an asset-metrics Skill. When the AI receives a request like \u0026ldquo;check this week\u0026rsquo;s D0 collection-entry rate,\u0026rdquo; it can understand the business meaning, call the right Skill, run the query, and return a structured result instead of a vague text reply.\nThe third factor was full official support for Feishu. OpenClaw provides a production-tested integration path covering bot creation, event subscription, message sending and receiving, and the rest of the full workflow. 
I did not need to start from scratch with the Feishu API. The official CLI was enough to complete the integration end to end.\nWhy Feishu The answer is tightly tied to my company\u0026rsquo;s internal stack. Nearly all internal communication, file management, and knowledge sharing run inside the Feishu ecosystem. Feishu Drive stores our Excel data files, Feishu Docs holds data dictionaries and analysis standards, and Feishu groups are used for day-to-day business communication. Plugging the AI assistant into Feishu was the lowest-cost and best-experience option.\nFrom the perspective of a daily user, this means no change in work habits and no extra software to install. I can simply open a private Feishu chat with the AI bot, ask a data question, and receive a structured result directly in the conversation. That \u0026ldquo;conversation as a service\u0026rdquo; experience is hard to replace with other entry points such as a web console or a standalone app.\nFrom a technical integration perspective, Feishu Drive and Feishu Docs can also act as the relay layer for the AI\u0026rsquo;s memory system. Short-term memory stays in local files, while long-term memory is synced to a Feishu document so it can be reviewed from a phone at any time. That deeper integration with the Feishu ecosystem makes the assistant much more usable in a real business setting.\n2. Preparation Before Deployment Server Configuration The AI Agent Gateway service runs on an ECS cloud server with the following configuration:\nItem Spec Instance Type ecs.e-c1m2.large CPU 2 vCPU Memory 4 GiB System Disk 40 GB SSD Operating System Ubuntu 20.04 LTS Public Bandwidth Pay-by-traffic Region Malaysia (Asia Pacific Southeast) Environment Requirements The following dependencies were installed on the server in advance:\nNode.js\nOpenClaw requires Node.js 18.x or later. 
I installed it with nvm:\ncurl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash\nnvm install 18\nnvm use 18\nnode -v # Verify the version is v18.x.x\nNetwork\nThe server needs outbound internet access. The installation process requires downloading OpenClaw-related packages, calling the Volcano Engine Coding Plan API, and establishing a WebSocket connection with the Feishu Open Platform. The following domains all needed to be reachable:\nnpm official registry\nVolcano Engine API domains\nFeishu Open Platform API domains\nFeishu Open Platform Account Setup I created an internal enterprise app on the Feishu Open Platform and obtained the following credentials:\nApp ID: The unique identifier of the application.\nApp Secret: The application secret, used to obtain API access credentials.\nModel Setup I used Coding Plan from Volcano Engine as the default model provider. Coding Plan is ByteDance\u0026rsquo;s AI service for coding and reasoning models. I had already enabled the Coding Plan service in the Volcano Engine console and generated an API key for the later configuration steps.\n3. Deploying OpenClaw OpenClaw was already preinstalled on the Volcano Engine cloud server, so I did not need to deploy it manually. Volcano Engine Coding Plan is a consolidated model provider. Using the integration script provided by Volcano Engine, I selected MiniMax-2.5 as the AI model and connected Coding Plan to OpenClaw:\ncurl -fsSL https://lf3-static.bytednsdoc.com/obj/eden-cn/ylwslo-yrh/ljhwZthlaukjlkulzlp/install.sh | sh\nThis script is used to configure the model integration between Volcano Engine Coding Plan and OpenClaw. 
It is not used to install OpenClaw itself.\nThe working directory and file structure had already been prepared during server initialization:\n./workspace/ ├── config.yaml # Gateway configuration ├── agents/ # Agent role definitions └── skills/ # Skill directory (used later for business Skills) After the configuration was complete, I started the Gateway service with:\nopenclaw gateway start Once the service was up, I used openclaw status to confirm that all components were running normally.\n4. Integrating Feishu I followed the official Volcano Engine documentation (https://www.volcengine.com/docs/6396/2189942) to complete the Feishu integration.\nThe key steps were as follows:\nStep 1: Create a Feishu App Create an internal enterprise app on the Feishu Open Platform, fill in the app name and description, then add the app capability for a bot.\nStep 2: Configure Permissions Under \u0026ldquo;Permission Management,\u0026rdquo; add the following permissions:\nim:message - Receive and send messages im:message:send_as_bot - Send messages as the bot im:chat:readonly - Read chat lists contact:user.id:readonly - Read user IDs Step 3: Configure Event Subscription Under \u0026ldquo;Events and Callbacks,\u0026rdquo; complete the following setup:\nCallback configuration -\u0026gt; Subscription method -\u0026gt; choose \u0026ldquo;Use long connection to receive callbacks\u0026rdquo; Add event -\u0026gt; choose \u0026ldquo;Receive messages (im.message.receive_v1)\u0026rdquo; Choosing long-connection mode is the key step in the entire integration. 
OpenClaw can receive Feishu messages without a public callback URL, which greatly reduces deployment complexity.\nStep 4: Publish the App Click \u0026ldquo;Create Version,\u0026rdquo; fill in the version information, and publish it so the app becomes available inside the organization.\nStep 5: Configure OpenClaw Feishu Credentials Get the app\u0026rsquo;s App ID and App Secret from the \u0026ldquo;Credentials and Basic Info\u0026rdquo; page, then bind them through the OpenClaw CLI:\nopenclaw config set channels.feishu.appId \u0026#39;cli_xxxx\u0026#39;\nopenclaw config set channels.feishu.appSecret \u0026#39;your_secret\u0026#39;\nopenclaw config set channels.feishu.enabled true\nRestart the Gateway so the configuration takes effect:\nopenclaw gateway restart\nStep 6: Verify Connectivity After startup, check the logs to confirm the Feishu connection status:\ntail -f ~/.openclaw/logs/*.log | grep -i feishu\nOnce the feishu connected log appears, send a test message to the bot in a private Feishu chat and confirm that messages can be received and sent normally.\nAdditional Notes on Feishu Docs Integration After the initial deployment, OpenClaw could chat normally, but I ran into a number of issues when I tried to integrate Feishu Docs. The bot frequently asked for reauthorization, and document permissions belonged to the AI bot rather than to the user, which meant the user could not modify the documents directly.\nIn early March, I found a post in the OpenClaw community that mentioned an official OpenClaw plugin released by Feishu specifically to solve the problem of insufficient bot permissions. I followed that document (https://bytedance.larkoffice.com/docx/MFK7dDFLFoVlOGxWCv5cTXKmnMh) and redeployed the integration.\nAfter switching to the official plugin, OpenClaw was able to integrate with Feishu Docs cleanly. 
The process required only a one-time initialization and authorization, after which I could add, delete, query, and edit Feishu documents smoothly without repeated reauthorization.\n5. Business Customization: From a General Assistant to a Domain Assistant Goal: Teach the AI the Asset Analysis Framework After the basic integration was complete, the AI was still just a general-purpose conversational assistant with no understanding of business logic. I needed to teach it the asset analysis framework step by step so it could understand business intent, call the right SQL queries, and return formatted metric results.\nI broke the customization process into five steps.\nStep 1 · Build the Data Source The asset analysis data is stored in MySQL. I first created a nigeria_asset database, imported the daily updated asset data into it, and built a complete data dictionary.\nThe data dictionary is the foundation of the entire analysis system. It contains each field\u0026rsquo;s naming rules, business meaning, metric definition, and dimension classification standards. 
I stored it in a Feishu document so it would be easy to maintain and share with the team.\nThe table below lists the core fields in the asset data table:\nDimension Fields Chinese Name Field Name Description Disbursement Date lending_date The actual disbursement date of the order Disbursement Week week_date The week in which the order was disbursed Disbursement Month month_date The month in which the order was disbursed Package Name package_name The specific package name Package Category package_name_type Paid acquisition package / APK package Product Type product_type Single-loan / multi-loan Multi-loan Segment multi_type Existing-customer multi-loan for existing customers / existing-customer multi-loan for new customers / new-customer multi-loan Channel media_source Facebook / Google / TikTok / Organic / non-paid Product Pricing product_info Format: term_pre-fee_post-fee, for example 7_35_7 Customer Type customer_type 5 = new customer, 10 = returning customer Customer Segment d_customer_type Pure new / not pure new (segmentation used only for new customers) Front/Back Fee Type pf_type Front-loaded / back-loaded Risk Control Grade user_score_level RG grade, from A to G with increasing risk SMS Permission sms_type Has SMS / historical SMS / no SMS Number of Concurrent Debts multi_cases_in60D Number of disbursed but unsettled orders in the past 60 days; values \u0026gt;= 6 are labeled as 6+ Loan Cycle Count double_times User borrowing cycle count; 0 = new customer, values \u0026gt;= 6 are labeled as 6+ Metric Fields Chinese Name Field Name Formula Description Disbursement Scale Order Count order_cnt — Total number of disbursed orders in the period Principal principal — Total principal amount disbursed in the period Average Ticket Size principal, order_cnt principal ÷ order_cnt Average amount per order Front-loaded Fee Rate pf_amount, principal pf_amount ÷ principal Service fee rate for front-loaded products Back-loaded Fee Rate interest_amount, principal 
interest_amount ÷ principal Service fee rate for back-loaded products Collection-Entry Metrics D-1 Order Collection Entry dpd_1_repay, dpd_1_cnt (1 − dpd_1_repay ÷ dpd_1_cnt) × 100% Counted one day before due date D-1 Amount Collection Entry dpd_1_repay_amount, dpd_1_cnt_amount (1 − dpd_1_repay_amount ÷ dpd_1_cnt_amount) × 100% Amount-based metric counted one day before due date D0 Order Collection Entry dpd0_repay, dpd0_cnt (1 − dpd0_repay ÷ dpd0_cnt) × 100% Counted on the due date D0 Amount Collection Entry dpd0_repay_amount, dpd0_cnt_amount (1 − dpd0_repay_amount ÷ dpd0_cnt_amount) × 100% Amount-based metric counted on the due date Spread Metrics D0 Spread D0_spread, principal D0_spread ÷ principal Spread on the due date D3 Spread D3_spread, principal D3_spread ÷ principal Spread at the D3 checkpoint D7 Spread D7_spread, principal D7_spread ÷ principal Spread at the D7 checkpoint D14 Spread D14_spread, principal D14_spread ÷ principal Spread at the D14 checkpoint D30 Spread D30_spread, principal D30_spread ÷ principal Spread at the D30 checkpoint Current Total Spread spread, principal spread ÷ principal Total spread accumulated to date Spread Growth Metrics D0-D3 Growth D0_spread, D3_spread, principal (D3_spread − D0_spread) ÷ principal Spread increase from D0 to D3 D3-D7 Growth D3_spread, D7_spread, principal (D7_spread − D3_spread) ÷ principal Spread increase from D3 to D7 D7-D14 Growth D7_spread, D14_spread, principal (D14_spread − D7_spread) ÷ principal Spread increase from D7 to D14 D14-D30 Growth D14_spread, D30_spread, principal (D30_spread − D14_spread) ÷ principal Spread increase from D14 to D30 D30+ Growth D30_spread, spread, principal (spread − D30_spread) ÷ principal Spread increase after D30 Bad Debt Metrics Order Bad Debt Rate outstanding_repay, outstanding_cnt (1 − outstanding_repay ÷ outstanding_cnt) × 100% Bad debt rate after all orders mature Amount Bad Debt Rate outstanding_repay_amount, outstanding_cnt_amount (1 − outstanding_repay_amount ÷ 
outstanding_cnt_amount) × 100% Amount-based bad debt rate after all orders mature Step 2 · Create the Analysis Skill Based on the data dictionary, I translated the core asset analysis metrics into SQL query logic and packaged them as an OpenClaw Skill.\nThe Skill development consisted of two parts:\nWriting SKILL.md\nIn the Skill\u0026rsquo;s SKILL.md, I defined the description, applicable scenarios, invocation method, and output format. The AI uses SKILL.md to understand the Skill\u0026rsquo;s capability boundaries and how it should be used.\nImplementing the SQL Scripts\nUnder the scripts/ directory, I wrote Python scripts that query MySQL with SQL, fetch raw data, calculate metrics according to the business definitions, and return structured results. For example, the SQL logic for the D0 collection-entry rate looked like this:\nSELECT COUNT(*) AS order_cnt, SUM(CASE WHEN dpd0_repay \u0026lt; dpd0_cnt THEN 1 ELSE 0 END) AS d0_overdue_cnt, ROUND(SUM(CASE WHEN dpd0_repay \u0026lt; dpd0_cnt THEN 1 ELSE 0 END) * 1.0 / COUNT(*), 4) AS d0_rate FROM asset_data WHERE week_date = \u0026#39;{target_week}\u0026#39; Step 3 · Validate Metric Calculations Once the Skill was created, I ran a systematic validation process:\nExecute the raw SQL directly in the database for the same metric and compare the result one by one with the Skill output Test edge cases, such as empty datasets or dimensions with no data Verify consistency across different dates and weekly dimensions After manually confirming that all metric definitions were accurate, I formally added the Skill to the AI\u0026rsquo;s working skillset.\nStep 4 · Build Scenario-Specific Analysis Frameworks In the first version of the asset analysis Skill, I started by organizing the three most common analysis scenarios, corresponding to three core needs: new-customer analysis, returning-customer analysis, and spread forecasting. 
The concrete frameworks are shown below.\nScenario 1: New Customer Analysis Framework\nBreakdown Dimension Specific Categories Priority Metrics Product Type Core strategy, multi-loan Collection-entry rate, spread Channel Facebook, Google, TikTok, Organic, non-paid Collection-entry rate, average ticket size Package Seven acquisition packages + APK package Collection-entry rate, spread RG Grade A, B, C, D, E, F, G Collection-entry rate, spread Scenario 2: Returning Customer Analysis Framework\nBreakdown Dimension Specific Categories Priority Metrics Product Type Core strategy, multi-loan Collection-entry rate, spread Asset Type Front-loaded, back-loaded Collection-entry rate, spread Loan Cycle Count 1 to 6+ cycles Collection-entry rate, bad debt rate Number of Concurrent Debts 1 to 6+ debts Collection-entry rate, spread RG Grade A, B, C, D, E, F, G Collection-entry rate, spread Package Seven acquisition packages + APK package Collection-entry rate, spread Scenario 3: Spread Forecasting Framework\nSlice Node Spread Growth Value Forecast Logic D0 — Same-day spread baseline D3 D3 − D0 Spread growth rate D7 D7 − D3 Spread growth rate D14 D14 − D7 Spread growth rate D30 D30 − D14 Spread growth rate To Date spread − D30 Cumulative spread growth rate Step 5 · Automate Daily Data Updates Data freshness is the foundation of effective analysis. I created a scheduled task (Cron Job) that runs every day to read the latest Excel data file provided by the business team, clean and transform the data, and then overwrite the corresponding date-partitioned tables in MySQL.\nThe job is scheduled for 6:00 every morning so the AI can query the latest data before the workday begins.\nNext Areas for Expansion After the asset analysis Skill went live, I followed the same approach to gradually build a conversion analysis Skill and a collections analysis Skill, eventually forming a complete AI analysis system covering all three business lines.\n6. 
Memory Management I implemented an automated memory management system through scheduled tasks. Every night at 23:59, the system runs a memory management workflow that archives the day\u0026rsquo;s conversation records and extracts key information into long-term memory.\nDesign Approach The memory system is split into two layers, \u0026ldquo;short-term memory\u0026rdquo; and \u0026ldquo;long-term memory,\u0026rdquo; which solve two separate problems: contextual completeness and persistence of key information.\nShort-Term Memory Each day\u0026rsquo;s conversation content and operation records are automatically written into that day\u0026rsquo;s .md file, for example memory/2026-05-19.md.\nThis design has several advantages:\nIt does not depend on external storage. Local files are enough to preserve the full context. The files are organized by date, which makes it very easy to trace back to conversations from a specific day. The files are in Markdown format, so they are pleasant to read. Long-Term Memory Short-term memory files keep accumulating, and storing all of them as long-term memory would make the file bloated. To avoid that, I designed a once-a-day \u0026ldquo;memory extraction\u0026rdquo; workflow:\nAt the end of each day, extract the key information worth remembering from that day\u0026rsquo;s conversations Key information includes business decisions, important definition changes, user preferences, and system architecture adjustments Write the extracted content into MEMORY.md, which acts as the single source of truth for long-term memory MEMORY.md uses a structured format organized by topic, making later retrieval fast.\nSyncing to Feishu To make long-term memory easy to review and manage from a phone, I configured MEMORY.md to sync automatically to a Feishu document. 
A scheduled task updates the Feishu document with the local file content.\nThis design provides several benefits:\nThe AI\u0026rsquo;s long-term memory can be reviewed from a phone at any time Feishu documents support full-text search, which makes locating information efficient Multi-device sync removes device limitations 7. Experience Using Different Models I have used the following models in practice. Here is a summary of how they felt in actual use:\nModel Use Cases Strengths Weaknesses Personal Take MiniMax-2.5 Daily conversation, data analysis Stable conversational understanding, accurate business intent parsing, good multi-turn context retention Slower on complex reasoning tasks Better at article-style output, with professional wording and a rigorous tone; well suited for formal content GLM-4.7 Daily conversation, data analysis Fast response speed, balanced performance on everyday Q\u0026amp;A tasks Limited long-text analysis ability Good for informal reporting; often adds charts and emojis, which makes content easier to read Doubao-Seed-2.0-Code Code generation, complex reasoning Strong at code-related tasks Slow response speed Used less often Kimi-K2.5 Daily conversation, task execution Balanced at conversational understanding and instruction following Slow response speed Used less often DeepSeek-V3.2 Daily conversation, complex reasoning Stable performance in complex reasoning scenarios Slow response speed and weak proactive problem-solving ability Not recommended GPT-5.4 Daily conversation, data analysis, task execution Fast response speed and very high accuracy in data calculation — Became my main everyday model after April this year, gradually replacing MiniMax 8. Conclusion This article records the full journey, from the framework research I did in late February, to the deployment I completed in March, to the model switch I made in April. 
The core steps can be summarized as follows:\nTechnology selection: I spent a week in late February researching options. OpenClaw was the most mature solution for business data analysis at the time. Volcano Engine Coding Plan provided the model layer, and Feishu was the only practical integration target because of the company\u0026rsquo;s ecosystem. Environment preparation: Malaysia-based cloud server (2 vCPU / 4 GiB / Ubuntu 20.04 LTS), Node.js 18.x, a self-built app on the Feishu Open Platform, and a Coding Plan API key. Deployment and integration: OpenClaw was preinstalled on the cloud server, an automated helper completed the Coding Plan model integration, the official CLI handled Feishu bot creation and credential binding, and the Volcano Engine documentation guided event subscription and permissions. In March, I found the official plugin through an OpenClaw community post, which resolved Feishu document permission limitations. Business customization: Build a MySQL database and import data -\u0026gt; establish a complete data dictionary -\u0026gt; package the asset analysis Skill -\u0026gt; build scenario-specific analysis frameworks (new customer / returning customer / spread forecasting) -\u0026gt; schedule daily automated data updates. Memory management: Run the memory workflow automatically every day at 23:59, write short-term memory into the day\u0026rsquo;s .md file, extract key content into MEMORY.md, and sync it to a Feishu document. After three months of use, the AI assistant can now understand the business logic behind asset analysis and provide real-time data queries and metric analysis inside Feishu, greatly improving the efficiency of daily data review. 
The same approach can be extended later to conversion analysis and collections analysis Skills, forming a complete AI-driven business analysis system.\nFuture Plans Next, I plan to expand the assistant in the following directions:\nConversion Analysis Skill: Package the full application -\u0026gt; approval -\u0026gt; disbursement analysis framework into a Skill to cover the conversion dimension that has not yet been integrated into the three business lines. Collections Analysis Skill: Use the same workflow as asset analysis to build a Skill for collections team management and complete coverage of all three business lines. Multi-agent collaboration: Explore splitting asset, conversion, and collections into separate agents and use OpenClaw\u0026rsquo;s multi-agent architecture to create clearer responsibility boundaries. Metric anomaly alerts: Build anomaly detection logic on top of the existing scheduled tasks so that when a core metric fluctuates abnormally, the system can automatically push an alert to Feishu. References Volcano Engine official documentation: Connecting Coding Plan to OpenClaw — https://www.volcengine.com/docs/82379/1928261?lang=zh Volcano Engine official documentation: Connecting OpenClaw to a Feishu assistant — https://www.volcengine.com/docs/6396/2189942?lang=zh Official Feishu plugin: OpenClaw Feishu integration configuration — https://bytedance.larkoffice.com/docx/MFK7dDFLFoVlOGxWCv5cTXKmnMh ","date":"2026-03-20T09:05:00+08:00","image":"/uploads/cover-knowledge-graph-neon.png","permalink":"/en/p/turning-an-ai-assistant-into-your-data-analysis-partner-a-hands-on-guide-to-openclaw-deployment-and-feishu-integration/","title":"Turning an AI Assistant into Your Data Analysis Partner: A Hands-On Guide to OpenClaw Deployment and Feishu Integration"},{"content":"\nUnderstanding LTV from a Business Perspective Running a credit business is, at its core, running an interest-for-risk trade.\nAcquisition costs money. Underwriting costs money. 
Collection costs money. Whether these investments pay off depends on whether customers are willing to borrow repeatedly and whether they repay on time. How much the business ultimately earns from a user is not determined by the interest income on any single loan; it is determined by how much net revenue the business can sustainably earn from that user before they churn.\nThis is the fundamental significance of LTV (Lifetime Value) for the credit business: the total net revenue a business can generate from a user throughout their entire borrowing lifecycle, from first loan to final churn.\nWhy LTV Is Not Just a Risk Metric Many assume LTV is the risk team\u0026rsquo;s responsibility, but in reality it is the business owner\u0026rsquo;s responsibility.\nThe risk team asks: \u0026ldquo;Will this loan go bad?\u0026rdquo; That\u0026rsquo;s a risk question. But the business owner has a different question: \u0026ldquo;How much can I afford to spend acquiring this user without losing money?\u0026rdquo; These two questions are closely related but not identical.\nFor example, a user with a low risk score theoretically has a higher probability of default. However, if that user comes from a high-interest-rate product where the interest income per loan is substantial, even a 20% default rate might still result in a positive LTV. Conversely, a high-quality user with an excellent risk score from a low-interest-rate product might generate lower LTV.\nLTV is the metric that combines risk and revenue into a single measure of success. 
It doesn\u0026rsquo;t answer \u0026ldquo;Is the risk model performing well?\u0026rdquo; It answers \u0026ldquo;Is this user worth acquiring?\u0026rdquo;\nA One-Sentence Definition LTV in the credit industry can be expressed with a simplified formula:\nLTV = Σ (Periodic Interest Income − Periodic Bad Debt Loss − Periodic Operating Cost), summed over every period in Duration\nWhere:\nInterest Income = Loan Amount × Interest Rate × Loan Term Bad Debt Loss = Loan Amount × Default Rate Operating Cost = Acquisition Cost + Underwriting Cost + Collection Cost Duration = Time from first loan to final user churn, which sets the number of periods in the sum This formula looks simple, but in practice every variable requires careful disaggregation. We will walk through each one in the sections that follow.\nThree Lifecycle Stages of Credit Users In the credit business, a user\u0026rsquo;s lifecycle can be divided into three core stages: Acquisition, Conversion, and Retention \u0026amp; Cycling. Each stage carries its own critical business questions and key performance indicators.\nAcquisition: The Balance Between Channel and Quality Acquisition is the starting point of the entire lifecycle, and the most expensive part.\nIn the digital credit landscape, the primary channels can be categorized as follows:\nFacebook (Meta). This channel offers broad demographic coverage, with a relatively high proportion of users aged 25–45. These users tend to have more stable income and above-average willingness and ability to repay. However, CPA on Facebook has been rising steadily, making it increasingly difficult to achieve profitability relying solely on this channel in a competitive market.\nGoogle. Search-intent users have clearer borrowing needs: someone who actively searches \u0026ldquo;loan\u0026rdquo; or \u0026ldquo;borrow money\u0026rdquo; is already showing strong borrowing intent. Google channels typically convert at higher rates than Facebook, but also carry higher CPA. 
A notable characteristic of Google-acquired users is a higher proportion of one-time applicants, meaning repeat-borrowing rates may be lower than on other channels.\nTikTok. The user base skews younger, with users aged 18–30 dominating. These users may have lower income stability but higher growth potential. TikTok CPA is typically the lowest among the three major platforms, but asset quality volatility is also higher, requiring more careful monitoring of early delinquency metrics.\nOrganic Traffic. Users arriving through app store searches or brand-name searches typically have basic product awareness and clear application intent. Their quality often exceeds that of paid-channel users, but the volume is limited and cannot serve as a primary acquisition source.\nAPK and Other Non-Paid Channels. Users acquired through APK installations, pre-installs, or offline promotions have heterogeneous quality and require channel-by-channel analysis. Some APK-acquired users may show signs of multi-borrowing, requiring special attention during underwriting.\nThe Core Logic of Channel Selection The core business question is: At the current CPA level, can users from this channel generate enough interest income over their lifecycle to cover acquisition costs and turn profitable?\nThe answer is not static. For high-interest-rate cash loan products, a higher default rate is tolerable because the interest income per loan is substantial. 
For lower-interest-rate consumer finance products, every unit of default is less tolerable, and channel quality requirements are higher accordingly.\nThis means that LTV-driven acquisition decisions are not about picking the cheapest channel or the highest-quality channel—they are about picking the channel with the highest overall return on investment.\nIn practice, we recommend tracking the following core metrics by channel:\nCPA (Cost Per Acquisition) First-loan conversion rate (proportion of approved applicants who actually receive funds) FPD7 (First Payment Default within 7 days of due date) M3 (Month-3 migration rate) 6-month repeat-borrowing rate Estimated composite LTV Only by tracking this full set of indicators can you truly judge whether a channel deserves sustained investment.\nConversion: The Funnel from Application to Disbursement A submitted application does not equal business revenue. From application to disbursement, two nodes in the conversion chain deserve particular attention.\nApproval Rate: The First Funnel Gate The approval rate is the first checkpoint in the conversion chain. The tightness or looseness of underwriting policy directly determines how many people enter the funnel and how good their quality is.\nLoose underwriting — pros and cons: High approval rates mean large volumes of applicants converting to disbursed loans, and rapid business scaling. The trade-off is that incoming user quality is mixed, increasing future bad debt pressure. Especially when acquisition costs are high, if the default rate among admitted users is excessive, LTV turns negative and acquisition costs become pure overhead.\nTight underwriting — pros and cons: Bad debt pressure decreases and asset quality improves. But low approval rates waste substantial acquisition spend—money spent driving users in only to reject them at the underwriting stage. 
Meanwhile, business scale is constrained, and in a competitive market, this can mean losing ground to rivals.\nThere is no universally correct answer here. The key is finding the optimal balance for the current stage of the business:\nRapid scaling phase: May need to accept higher defaults for volume, prioritizing user acquisition first Profitability pressure phase: Raise risk thresholds to ensure every disbursed loan generates positive returns New product launch phase: Use relatively strict policies first to establish stable asset quality data, then gradually relax First-Loan Conversion Rate: The Last Hurdle Before Disbursement In practice, it is very common for users to be approved but never draw funds.\nCommon reasons for \u0026ldquo;approved but not disbursed\u0026rdquo; include:\nUnsatisfactory credit limit. The limit granted by the system falls below the user\u0026rsquo;s expectations, and they deem the amount not worth borrowing. This is common in products with conservative limit-setting logic.\nRate too high. Users see the displayed interest rate and find it unacceptable. This is especially pronounced among rate-sensitive segments (e.g., white-collar workers), where pricing slightly above their threshold causes churn.\nPoor process experience. The application process is too long, requires too much documentation, and has a high mid-funnel drop-off rate. This is especially true for younger users acquired through TikTok and similar channels, who have higher expectations for process fluidity.\nInsufficient lender capacity. In some periods, funding-side disbursement capacity is constrained, preventing some approved users from receiving funds on time. 
This is a supply-side constraint, not active user abandonment—it requires coordination between business and funding teams.\nImproving first-loan conversion requires joint effort from business and risk teams:\nLimit design: Use income information, debt levels, and historical borrowing performance to design attractive initial limits while controlling risk Rate pricing: Find the user-acceptable price range based on risk cost + funding cost + operating cost + reasonable margin Process experience: Shorten the application flow, optimize identity verification, improve page load speed, and reduce mid-process abandonment An empirical data point: every 5-percentage-point improvement in first-loan conversion rate can increase disbursement volume by 10–15% at the same acquisition cost. This leverage effect is substantial.\nRetention \u0026amp; Cycling: Repeat Borrowing as an LTV Multiplier A single loan generates one-time income. When users come back for a second or third loan, revenue compounds and LTV grows.\nRepeat-borrowing rate is the metric that most amplifies LTV in the credit business—and also the one most easily overlooked in the early stages.\nMany business teams over-focus on acquisition volume and first-loan conversion during product launch, neglecting the cultivation of repeat borrowers. By the time they realize 6 months later that the existing user base is shrinking and they have become dependent on constant new acquisition, LTV has already been eroded by high acquisition costs.\nCharacteristics of Repeat Borrowers Users with high repeat-borrowing rates typically exhibit the following traits:\nGood repayment history. Every historical loan was repaid on time with no delinquencies. This indicates the product is attractive to the user and their financial situation is relatively stable.\nHigh credit limit utilization. 
Users who tend to max out their credit limit signal that the current limit may not meet their needs—there is room for limit increases.\nStable borrowing frequency. Users who apply for loans at regular intervals, forming habitual usage patterns.\nAfter identifying high-repeat-borrowing users, the following approaches can further boost their LTV:\nCredit Limit Management After a period of use, the business system needs to decide whether to offer the user a credit limit increase.\nLimit-increase decisions should consider multiple factors:\nRepayment history: Any delinquency records? Historical borrowing interval: Is the gap between loans stable? Application frequency: Is the frequency of loan applications increasing? Debt position: Has the user\u0026rsquo;s debt on other platforms increased? Limit-increase strategies generally fall into two categories:\nProactive limit increases: Based on user behavior, the system actively raises the limit without user initiative. Suitable for users with strong credit records and high activity. This approach provides a better user experience but carries higher risk—if users quickly default after a limit increase, losses are larger.\nReactive limit increases: Users proactively apply for limit increases, and the business decides based on risk assessment. This approach is more risk-controllable but may deliver a poorer user experience (application rejection or small increase).\nA reference cadence for limit increases: after 3 consecutive on-time repaid loans, consider a 10–20% increase; after 6 consecutive on-time repaid loans, further increase to 30–50%.\nCross-Selling When a business has multiple product lines (e.g., cash loans, consumer loans, SME loans), it can recommend other products based on the user\u0026rsquo;s performance in existing products.\nThe value of cross-selling: a high-value existing user has a far lower conversion cost than a new user. 
The user already has brand awareness and product trust—no need to go through the full acquisition and underwriting process again.\nHowever, cross-selling carries risk: if the recommended product is unsuitable and causes over-indebtedness, subsequent default losses may offset the cross-sell revenue. We recommend assessing the user\u0026rsquo;s comprehensive debt position before recommending additional products.\nFour Key Business Metrics With the LTV framework understood, the next step is to identify which metrics deserve the most day-to-day attention. These four metrics form the credit business foundation.\nCPA vs. LTV Balance CPA (Cost Per Acquisition) is a direct measure of acquisition efficiency. But CPA alone is just a cost number—meaningless without LTV context.\nThe correct evaluation framework is: LTV \u0026gt; CPA + Operating Cost + Funding Cost, or the business cannot sustain itself.\nThis inequality looks simple, but in practice many business teams have never carefully verified the actual values on the right side. Operating costs and funding costs are relatively fixed, but CPA and LTV are both variables—and they are often correlated. Lower-CPA channels may have inferior user quality, resulting in lower LTV; higher-CPA channels may have better user quality, supporting higher LTV.\nLTV differences across channels can be substantial. 
Consider this illustrative data:\nChannel CPA 3-Month LTV Est. 6-Month LTV Est. LTV/CPA (6-Month) Facebook 18 CNY 45 CNY 72 CNY 4.0 Google 22 CNY 52 CNY 80 CNY 3.6 TikTok 12 CNY 22 CNY 31 CNY 2.6 Organic 0 CNY 20 CNY 28 CNY ∞ Several key observations from this data:\nFirst, TikTok has the lowest CPA of the paid channels but also the lowest LTV and the weakest LTV/CPA ratio among them, indicating this channel\u0026rsquo;s user quality is inferior or product-market fit is poor. A channel cannot be judged as good simply because its CPA is low.\nSecond, Facebook and Google have higher CPA but stronger LTV performance, making them more suitable as primary acquisition channels.\nThird, while Organic has zero CPA, its LTV is the lowest in the table; its main value lies in low-cost volume supplementation, and it is not suitable as the primary growth driver.\nThe essence of channel optimization is not finding the cheapest channel; it is finding the channel with the highest LTV/CPA ratio. However, this ratio alone is not the sole decision criterion; the business\u0026rsquo;s current cash flow situation and scaling ambitions must also be considered.\nIf the business is in a phase requiring rapid scale expansion, it may deliberately choose channels with lower LTV/CPA but also lower CPA (e.g., TikTok), trading profit for volume. If the business is under profitability pressure, it should prioritize channels with high LTV/CPA and reduce or pause investment in underperforming channels.\nDefault Rate: The Value of Early Risk Identification Default rate is the most critical risk metric in the credit business and the variable that most directly impacts LTV. Defaults erode interest income and directly compress margins.\nDefault rates are measured through multiple lenses:\nFPD (First Payment Default). When a user misses their first repayment due date, this is the earliest and most direct risk signal. FPD is typically used as an early warning indicator for channel quality and borrower segment quality.\nM3/M6/M12 (Monthly Migration Rate). 
M3 represents what proportion of loans delinquent for 3 months migrate to more severe delinquency status. M3 is a core metric for asset quality stability. If M3 is significantly above industry average or historical comparables, asset quality is deteriorating.\nVintage Analysis. Grouping loans by their disbursement month (vintage) and tracking delinquency rates over time. Vintage analysis eliminates macro-economic interference and genuinely reflects differences in borrower quality across vintages and the effectiveness of risk policies.\nDefault rate analysis is not solely a risk team responsibility. Business leaders need to understand: current asset quality is the outcome of past decisions—which channel cohort, which pricing, which underwriting policy. Only by tracing this causal chain can better decisions be made going forward.\nSpecifically, default rate disaggregation can be conducted across the following dimensions:\nBy channel: What are the M3 rates for Facebook vs. TikTok users respectively? How large is the gap? Does this gap align with the CPA and LTV comparison?\nBy borrower segment: What are the default rates for new borrowers (first loan) vs. repeat borrowers? Does repeat-borrowing performance genuinely outperform new borrowers?\nBy product: Is asset quality consistent across cash loans and consumer loans? Can the interest rate pricing of different products cover corresponding risk costs?\nBy risk policy: When the same channel uses different risk rules at different times, does asset quality change noticeably?\nThe goal of default rate analysis is to identify \u0026ldquo;which dimensions have asset quality below expectations,\u0026rdquo; then analyze causes and adjust strategy. Looking only at a composite default rate number provides no actionable guidance.\nRepeat-Borrowing Cycle: Time Is Money Repeat-borrowing cycle refers to the interval between a user\u0026rsquo;s first disbursement and their second loan application. 
This metric matters because:\nThe shorter the repeat-borrowing interval, the more interest income a user generates within a fixed time period, and the faster LTV grows.\nFor example, two users each generate 100 CNY monthly interest. User A borrows a second loan within 30 days of their first; User B waits 90 days. After 6 months, User A has contributed 6 periods of interest income while User B has contributed only 2.\nA short repeat-borrowing interval also indicates user satisfaction with the product experience and willingness to return. If a user disappears after one loan, this is not a normal lifecycle conclusion—it is churn. Possible reasons include poor product experience, insufficient credit limit, uncompetitive rates, or the user simply treating the first loan as a one-time emergency fund.\nTracking the distribution of repeat-borrowing cycles helps the business set rational user maintenance touchpoints. For example, if data shows 70% of repeat borrowers reapply within 30 days of their first loan maturing, pushing limit-increase or promotional messages during this window will significantly outperform random outreach.\nRepeat-borrowing cycles also support user segmentation. Users with different cycle characteristics can be managed with different strategies:\nShort-cycle users (interval \u0026lt;30 days): Highly active users indicating strong product stickiness. Can receive limit-increase incentives to further boost per-transaction loan amounts.\nMedium-cycle users (interval 30–90 days): Normal repeat borrowers. Maintain status quo; consider proactive outreach near their previous loan maturity date.\nLong-cycle users (interval \u0026gt;90 days): Low-activity users at churn risk. Deserve focused win-back efforts through targeted discounts, time-limited limit increases, or similar incentives.\nNon-repeat users (no application record for 180+ days): Effectively churned. 
Reduce operational investment or attempt a final win-back before deprioritizing.\nAverage Loan Amount and Credit Limit Average loan amount determines the base for single-transaction interest income. Higher averages theoretically generate more interest income, but excessively high averages may also increase user repayment pressure and elevate default rates.\nCredit limit design must balance two objectives: letting users borrow an amount that meets their needs (demand side), while keeping that amount within their repayment capacity (risk side).\nA reference framework for limit design:\nFirst-loan limit. For new borrowers, the initial limit should not be excessive. A reasonable guideline is to cap it at 30–50% of the user\u0026rsquo;s monthly income (where income verification is available). Exceeding this ratio significantly increases repayment pressure and subsequent delinquency risk. For scenarios without income verification, other data points (social behavior, device information, historical credit records) must supplement the assessment.\nRepeat-loan limit. As users continue using the product and repay on time, limits can be gradually increased. Users with strong performance have greater limit upside—which is a lever the business uses to increase LTV. Hard caps on increases must be set to prevent user over-leveraging.\nAggregate limit management. In multi-product businesses, the combined debt position across all products must be monitored. 
If a user already carries significant debt in a cash loan product, recommending a consumer loan product requires caution.\nThere is also a trade-off between average loan amount and approval rate: higher amounts face stricter risk review, which lowers approval rates; lower amounts may achieve higher approval rates but generate less per-transaction interest income.\nA practical strategy: provide new borrowers with relatively low initial limits (reducing risk pressure and boosting approval rates), then gradually increase limits as they demonstrate good repeat-borrowing behavior. This approach controls risk while using the repeat-borrowing stage to compensate for initially lower average loan amounts through limit increases.\nBusiness Decision Scenarios\nNew Product Launch: Choosing the Target Segment\nBefore launching a new product, the most important decision is: Who is this product for?\nThe target segment determines credit limits, interest rate pricing, underwriting policies, and acquisition channels; every aspect of product design flows from this core choice.\nHigh-Risk-Appetite Products\nFor high-risk-appetite products (higher interest rates, targeting lower credit-score segments), the business logic is using higher interest income to offset higher default losses.\nUnder this model:\nPricing typically exceeds 100% annualized, sometimes much higher\nTarget segments are users unserved by traditional financial institutions\nUnderwriting policies are relatively loose, making up for quality with volume\nAcquisition channels may need to accept higher CPA\nThe key in this model is controlling loan amount and term: single-transaction amounts should not be too high, and loan terms should not be too long, ensuring that even if defaults occur, losses remain manageable. 
The core assumption of the LTV model is that users will borrow repeatedly, and after sufficient repeat cycles, total interest income will cover defaults and acquisition costs.\nLow-Risk-Appetite Products\nFor conventional consumer finance products (lower interest rates, higher-quality target segments), the business logic is achieving profitability through lower risk costs and lower default rates.\nUnder this model:\nPricing typically falls below 20% annualized\nTarget segments are users with stable income (e.g., salaried workers, users with social insurance)\nUnderwriting policies require stricter verification of income proof, social insurance contributions, and similar documentation\nAcquisition channels prioritize quality over volume\nThis model imposes higher requirements on acquisition cost and operational efficiency. LTV growth relies primarily on repeat-borrowing rates and user stickiness, not high interest income.\nRecommended Path for New Product Launch\nDuring the initial launch phase, we recommend testing 2–3 differentiated target segments with small samples and observing 3 months of actual asset performance before scaling.\nSpecific steps:\nSelect 2–3 differentiated segments (e.g., first-tier city white-collar vs. third/fourth-tier city blue-collar; social insurance holders vs. non-holders)\nDisburse 500–1,000 loans per segment\nTrack FPD, M3, and repeat-borrowing rates for 3 months\nCompare estimated LTV across segments and scale the best-performing direction\nDo not scale blindly with insufficient data. 
Delinquency in credit products is lagging: assets that look fine in early stages may reveal their true quality problems 6 months later.\nChannel Optimization: Where to Allocate Spend\nChannel optimization is one of the areas most in need of LTV thinking in the credit business.\nWhether to scale a channel cannot be determined by CPA alone; it depends on that channel\u0026rsquo;s LTV contribution.\nSpecifically, channel users can be segmented by monthly vintage; default rates and repeat-borrowing rates can be calculated by vintage cohort; and 3-month, 6-month, and 12-month LTV estimates can be projected and compared against CPA.\nA practical channel evaluation framework:\nMonthly Tracking Table:\nChannel | Monthly Disbursements | Monthly CPA | Monthly FPD | 3-Month M3 | 6-Month Repeat Rate | 6-Month LTV Est.\nFacebook | 5,000 | 18 CNY | 8% | 15% | 45% | 65 CNY\nGoogle | 3,000 | 22 CNY | 6% | 12% | 50% | 72 CNY\nTikTok | 8,000 | 12 CNY | 12% | 22% | 30% | 31 CNY\nBased on this table, the following judgments can be made:\nFacebook and Google 6-month LTV is significantly higher than TikTok\u0026rsquo;s, and M3 is lower; priority should be given to scaling these channels\nTikTok has low CPA but poor asset quality and uncompetitive LTV; maintain current level or consider slight reduction\nChannel optimization is not static. User quality fluctuates with seasons, competitive dynamics, and collection strategy changes. Holiday periods are typically peak demand seasons but may also carry elevated default risk. When competitors scale aggressively, your channel\u0026rsquo;s user quality may be diluted.\nWe recommend monthly reassessment of each channel\u0026rsquo;s LTV contribution and timely budget reallocation.\nExisting User Operations: Who Deserves Priority Attention\nOperating resources for existing users are limited. 
The core question is: Where should limited resources be concentrated?\nBased on LTV contribution capacity, existing users can be categorized as follows:\nHigh-Value Users\nUsers with good repayment history, stable borrowing frequency, and high credit limit utilization. These users are the core source of business profitability.\nThe operational goal for high-value users is extending their lifecycle and increasing repeat-borrowing frequency.\nSpecific tactics:\nProvide limit-increase incentives to grow per-transaction loan amounts\nOffer more competitive rate conditions to strengthen loyalty\nProvide dedicated customer service to elevate experience\nPrioritize new product recommendations (cross-selling)\nHigh-value users have zero acquisition cost (they are already in the existing base), so the marginal return on operational investment is far higher than for new users.\nPotential Users\nUsers whose historical performance is acceptable but whose recent activity has declined or borrowing intervals have lengthened. These users face churn risk and deserve focused attention and win-back efforts.\nRecognition signals:\nLast borrowing was more than 45 days ago (exceeding the product\u0026rsquo;s average repeat interval)\nCredit limit utilization dropped from 80% to below 30%\nClick-through rate on push notifications has declined\nWin-back tactics:\nTargeted discounts: Offer rates more competitive than those for new users\nTime-limited limit increases: Inform users their limit is about to increase, encouraging usage\nChurn surveys: Understand why users stopped borrowing and optimize accordingly\nLow-Value Users\nUsers with poor historical performance, frequent delinquencies, or extremely low credit limit utilization. 
These users\u0026rsquo; LTV contribution is marginal or negative.\nThe operational strategy for low-value users should be controlled investment with risk monitoring:\nNo additional operational resources for win-back campaigns\nContinuous monitoring of repayment performance with timely collection triggering\nMaintain conservative credit limit management to avoid over-lending\nCase Study\nA cash loan product underwent a comprehensive review of its channel and user performance after 6 months of operation.\nChannel-Level Data\nSix-month channel performance data is as follows:\nChannel | Cumulative Disbursements | CPA | Avg. Loan Amt. | 6-Month Default Rate | 6-Month Repeat Rate | 6-Month LTV Est.\nFacebook | 32,000 loans | 18 CNY | 950 CNY | 14% | 48% | 68 CNY\nGoogle | 18,000 loans | 22 CNY | 1,100 CNY | 11% | 52% | 76 CNY\nTikTok | 65,000 loans | 12 CNY | 720 CNY | 23% | 31% | 29 CNY\nOrganic | 8,000 loans | 0 CNY | 880 CNY | 16% | 42% | 34 CNY\nConclusions:\nFacebook and Google both have LTV/CPA exceeding 3.0 (68/18 ≈ 3.8 and 76/22 ≈ 3.5), making them the optimal channels; continued scaling is recommended\nTikTok has the lowest CPA among the paid channels, but its LTV is only 29 CNY, with LTV/CPA of 2.4 and a default rate as high as 23%; no additional investment is recommended\nOrganic has zero CPA but unremarkable LTV; its main value is low-cost volume supplementation\nUser Segment Operations Data\nThe first month\u0026rsquo;s 10,000 disbursed users, segmented by their performance after 6 months:\nUser Segment | Users | Share | Avg. Loans per User | Avg. Interest Contribution | Share of Interest Revenue\nHigh-Value | 1,200 | 12% | 5.8 loans | 2,850 CNY | 42%\nPotential | 2,300 | 23% | 2.4 loans | 980 CNY | 28%\nLow-Value | 1,000 | 10% | 1.2 loans | 320 CNY | 4%\nChurned | 5,500 | 55% | 1.0 loan | 470 CNY | 26%\nKey findings:\nHigh-value users, 12% of the base, contributed 42% of total interest revenue\n55% of users never borrowed again after their first loan; these are typical \u0026ldquo;one-time users\u0026rdquo;\nBusiness Decisions and Results\nBased on the data, the business team implemented the following strategies:\nChannel strategy: Increased monthly budgets for Facebook and Google by 30% each; TikTok maintained at current level with no additional investment.\nHigh-value user limit-increase strategy: Progressive limit increases for the 1,200 high-value users. First increase: 20% after 3 consecutive on-time repaid loans; second increase: another 30% after 6 consecutive on-time repaid loans. After 6 months, this group\u0026rsquo;s average loan amount grew from 950 CNY to 1,480 CNY, with per-user interest contribution increasing approximately 65%.\nPotential user win-back strategy: Targeted discount offers (first-period rate at 20% off) sent to 2,300 potential users. Win-back cost was approximately 45 CNY per user. The 3-month LTV contribution from won-back users averaged 180 CNY, an ROI of approximately 4x (180 / 45).\nChurned user strategy: No active win-back efforts for the 5,500 churned users; included only in unified promotions when new-user campaigns were launched.\nSix-month performance comparison:\nMetric | Before (Month 1) | After (Month 6)\nOverall Repeat Rate | 35% | 48%\nHigh-Value User Share | 12% | 18%\nOverall LTV/CPA | 2.8 | 3.6\nOperational Implementation Recommendations\nFirst Steps to Building LTV Thinking\nLTV is not built in a day. 
We recommend starting with the following steps:\nStep 1: Data Infrastructure.\nConnect end-to-end data across acquisition, approval, disbursement, repayment, and collection, ensuring every user\u0026rsquo;s lifecycle events can be tracked and attributed.\nSpecifically, the following data challenges must be addressed:\nChannel attribution: Every disbursed user can be traced to a specific channel and ad campaign\nUnique user identification: A user\u0026rsquo;s borrowing records across different time periods can be linked to the same individual\nCost accounting: Acquisition, underwriting, and collection costs can be allocated to specific users or orders\nData infrastructure is the foundation of all subsequent analysis. If data is inaccurate, LTV is built on sand.\nStep 2: Define Methodology.\nClarify the LTV calculation methodology. Different business models may use different LTV definitions:\nShould funding cost be included? Is funding cost calculated using internal funds transfer pricing (FTP) or actual financing rates?\nShould operating cost be included? Is operating cost allocated per head or per order?\nIs default provision calculated using actual losses or expected losses (EL)?\nWithout consistent methodology, data is incomparable. Business, risk, and finance teams must align on methodology before analysis begins.\nStep 3: Establish Regular Review Cadence.\nWe recommend monthly reviews, with LTV and its component variables (default rate, repeat-borrowing rate, average loan amount) disaggregated by channel, segment, and product.\nCore questions each review meeting should answer:\nDid each channel\u0026rsquo;s LTV meet targets this month?\nDid any channel show a significant LTV decline? What caused it?\nIs reallocation of channel budget necessary? 
Common Business Misconceptions Misconception 1: Focusing on acquisition volume while ignoring acquisition quality.\nScale expansion masking bad debt problems is the most common fatal path in the credit business.\nThe typical pattern: a business leader sees new user numbers growing month after month and feels the business is thriving. Upon closer inspection, however, new-user default rates are climbing and repeat-borrowing rates are declining—the LTV has already turned negative. Growth is only occurring because acquisition costs are sufficiently low, not because the business is actually healthy.\nMisconception 2: Treating first-loan revenue as final LTV.\nSingle-transaction interest income cannot cover acquisition costs—only repeat borrowing can truly generate profit.\nMany products design pricing strategy based solely on whether single-iteration loan interest can cover funding costs and risk costs, ignoring repeat borrowing entirely. If this assumption holds, the business\u0026rsquo;s core strategy becomes constantly acquiring new users—but as market acquisition costs continue rising, this path eventually becomes unsustainable.\nThe correct mindset: treat the first loan as an acquisition tool, and repeat borrowing as the profit source. Only when users are willing to return for a second and third loan can LTV truly deliver positive returns.\nMisconception 3: Treating LTV as a risk team responsibility.\nLTV thinking is an essential framework for business leaders. Risk provides data and model support, but business decisions must be made—and owned—by the business leader.\nIf a business leader doesn\u0026rsquo;t understand LTV, they lose their decision-making compass across every aspect of operations: Should a channel be scaled or cut? Should a user receive a limit increase or not? Should the product\u0026rsquo;s price be reduced or increased? 
All of these questions can be answered through the LTV lens.\nMisconception 4: Ignoring the impact of collection on LTV.\nWhen defaults occur, collection efforts can recover some losses. But collection itself has costs, and collection efficiency declines as accounts age.\nMore critically, collection methodology affects future user behavior. If collection methods are too aggressive (e.g., exposing contact lists, excessive harassment), users may permanently churn; even if they eventually repay, they will never borrow again. Collection strategy must balance \u0026ldquo;recovery rate\u0026rdquo; against \u0026ldquo;user relationship maintenance.\u0026rdquo;\nHow to Work with Risk and Product Teams\nLTV implementation requires cross-functional collaboration; no single team can do it alone.\nRisk team provides:\nUser credit scores, default rate predictions, and collection efficiency data\nSimulation of how risk policy changes impact LTV\nRisk assessments for limit and pricing strategies\nProduct team is responsible for:\nContinuous optimization of borrowing experience, limit display, and process pages\nImproving first-loan conversion and process flow\nEnd-user product design and interaction optimization\nBusiness/Operations team is responsible for:\nDeveloping channel strategy and user operation strategy based on LTV data\nCoordinating risk and product resource prioritization\nOwning end-to-end business results\nIf these three functions operate in silos, the LTV system cannot truly land. 
Common failure modes: the risk team says \u0026ldquo;my risk model is excellent and default rates are well-controlled,\u0026rdquo; but the business team says \u0026ldquo;approval rates are too low and users can\u0026rsquo;t get in\u0026rdquo;; the product team says \u0026ldquo;the borrowing flow is already very smooth,\u0026rdquo; but repeat-borrowing data shows users are not coming back.\nThe key is establishing shared metric language and a common business objective—maximizing user lifetime value within controlled default boundaries.\nOne recommended practice: hold monthly LTV review meetings with risk, product, and business teams together, using the same dataset to evaluate performance and decide next actions. This prevents the fragmentation and finger-pointing that occurs when teams operate in isolation.\nClosing Thoughts Understanding LTV is not the destination—it is the starting point.\nThe credit business is fundamentally a combination of risk pricing and user management. LTV provides the unified metric for judging whether both are being done well. In actual business operations, the most important ongoing work for any business leader is continuously tracking, dissecting, and optimizing LTV.\nA few final takeaways:\nFirst, LTV thinking is not about reading numbers—it is about making decisions. Knowing a channel\u0026rsquo;s LTV is step one. Deciding whether and how to scale based on that number is where LTV thinking actually delivers value.\nSecond, data quality determines the ceiling of your analysis. If channel attribution is inaccurate, users cannot be identified across periods, and cost allocation is unclear, LTV calculations will be distorted and business decisions will be wrong. Build the data foundation first.\nThird, cross-functional alignment is the key to LTV implementation. 
If risk, product, and business are not working from the same definitions and toward the same goals, the LTV system becomes each team\u0026rsquo;s own isolated data exercise.\nWe hope this article provides an actionable framework for your business. If you have specific implementation questions, we welcome further discussion.\n","date":"2025-12-15T21:30:00+08:00","image":"/uploads/cover-ltv-lifecycle.jpg","permalink":"/en/p/ltv-user-lifecycle-in-the-credit-industry/","title":"LTV User Lifecycle in the Credit Industry"},{"content":"\nChapter 0: Technology Stack and Python Libraries\n0.1 Technology stack overview\nThis project is a typical risk-control modeling solution that combines \u0026ldquo;offline feature engineering + machine learning modeling + business strategy implementation\u0026rdquo;. The core technology stack has three layers:\nData processing layer: Spark / PySpark handles large-scale sample processing, feature joining, time-window aggregation, and batch export, supporting the computation of massive feature sets in a distributed environment.\nAnalytical modeling layer: Python is the main modeling language, combined with pandas, numpy, scikit-learn, lightgbm, toad, and other libraries for sample processing, feature screening, model training, parameter tuning, and evaluation.\nBusiness application layer: Model outputs are turned into deployable collection and screening strategies, including first-round detection by the main model, dynamic recovery by the auxiliary model, and a final ROI-based resource allocation strategy.\n0.2 Main technologies this project relies on\nJudging from the code used throughout, the project relies mainly on the following components:\nPython: core development language\nSpark/PySpark: offline feature engineering and distributed data processing\nLightGBM: main model training and feature importance ranking\nscikit-learn: dataset partitioning, grid search, and model evaluation\ntoad: feature screening, IV calculation, and auxiliary analysis for risk-control modeling\npandas/numpy: structured data processing and numerical computation\nChapter 1: Project Background and Business Issues\n1.1 Definition of long-aging orders and industry pain points\nIn the credit business, post-loan management is the last link in the risk-control closed loop. Once an order becomes overdue, how collection effort is allocated directly determines the return on labor cost.\nIn this project, \u0026ldquo;long-aging orders\u0026rdquo; are defined as orders from users who are 15 or more days overdue and have not yet repaid. These users share the following characteristics:\nLow willingness to repay: they still refuse to repay after more than 15 days of continuous collection pressure, and their subjective willingness to repay has declined significantly;\nHigh collection difficulty: conventional text-message and telephone collection has limited effect on them, so collectors must invest more effort;\nExtremely sparse positive samples: only a small number of users eventually repay after more than 15 days of collection, so the data distribution is heavily skewed.\n1.2 Three major problems with the existing collection approach\nBefore machine learning modeling was introduced, the business side used a full-collection approach for long-aging orders, that is, uniformly assigning staff to collect on every long-aging order. This approach has three core problems:\nProblem 1: Huge labor costs\nThe volume of long-aging orders is large and keeps accumulating. If manual collection is applied to every order, labor costs grow linearly. 
The collection team has to keep expanding, but the growth in collected amounts cannot keep up with the growth in headcount.\nProblem 2: Extremely low ROI\nThe overall labor ROI (collected amount / labor cost) of the full-collection approach has long hovered between 1.1 and 1.5, and collection efficiency is only 5%. After deducting capital and operating costs, this leaves almost no profit margin and may even mean a loss.\nProblem 3: Damage to team morale\nCollection agents make a large number of calls every day, and most of the people they reach refuse to repay or are outright hostile, so the success rate is extremely low. Over time the team burns out, turnover rises, and overall collection efficiency declines further, forming a vicious cycle.\n1.3 Project Goal: From \u0026ldquo;Full Collection\u0026rdquo; to \u0026ldquo;Accurate Detection + Dynamic Retrieval\u0026rdquo;\nBased on the above pain points, the core question is:\n**Is it possible to automatically identify orders with high recovery potential from a large pool of long-aging orders, and to keep retrieving orders that gain recovery value as the stage advances, so that limited manpower can focus on the highest-value cases?**\nSpecific business goals are:\nMetric | Current collection state | Target value\nLabor ROI | 1.1~1.5 | \u0026gt; 2.0\nCollection efficiency | 5% | \u0026gt; 10%\nCollection strategy | Full collection | Accurate detection by main model + dynamic recovery by auxiliary model\nThe overall approach of the project is as follows:\nFeature construction: Integrate multi-dimensional data such as user login, APP behavior, device information, SMS content, and basic information, and build a massive feature library (approximately 10,000 dimensions) based on the five-dimensional cross combination of order type × behavior label × time dimension × indicator value × calculation method;\nMultiple rounds of feature screening: initial IV screening (empty=0.6, iv=0.05, corr=0.8) → LGBM gain stability re-screening (5 rounds of random modeling × Top100 intersection) → LGBM Top60 convergence, finally retaining about 60 high-value features;\nLGBM prediction model construction: Taking \u0026ldquo;whether the user repays during the stage\u0026rdquo; as the label (repayment = 1, non-repayment = 0), train the main model and perform the first round of ranking and detection on long-aging orders;\nAuxiliary model dynamic retrieval: Based on new behavioral signals within the stage, perform a secondary identification of orders not detected by the main model, focusing on orders whose signals have strengthened again, such as a new login, repayment of other orders, or partial repayment;\nROI-oriented implementation: Form a combined strategy of \u0026ldquo;first-round detection by the main model + dynamic supplement by the auxiliary model\u0026rdquo; to keep improving detection quality and recovery output while reducing manpower investment.\nChapter 2: Data Sources and Feature Engineering System\n2.1 Data source panorama\nThe data sources of this project cover four dimensions: user application behavior, credit performance, post-loan collection interaction, and full-link behavior within the APP. 
All data involving user privacy is legally authorized through the Google Store channel and is collected only with the user\u0026rsquo;s consent.\n(1) User information\nData captured with the user\u0026rsquo;s authorization when applying for an order, in four sub-modules:\nAPP list: the list of applications installed on the user\u0026rsquo;s phone, obtained through the Google Store channel, which reflects the user\u0026rsquo;s living and consumption habits, density of financial demand, etc.;\nDevice information: device parameters at application time, including device model, brand, and operating system version, used to judge the stability of the user\u0026rsquo;s device and whether there are risky behaviors such as switching between multiple devices;\nSMS messages: text message content extracted in structured form after user authorization, mainly bank repayment reminders and messages from external loan platforms, which reflect the user\u0026rsquo;s debt pressure and financial tightness;\nBasic personal information: identity information the user fills in during the application process, including age, occupation, education, and other demographic fields.\n(2) Risk dimension\nCredit indicators provided by external data sources when users apply for orders, including:\nMulti-head application risk: the user’s application records on multiple credit platforms, reflecting the risk of over-indebtedness;\nCredit rating: the comprehensive credit score grade given by external credit bureaus or data partners, directly reflecting the user\u0026rsquo;s historical credit performance.\n(3) Collection information\nThe user\u0026rsquo;s post-loan collection data before entering the long-aging stage, including historical collection records, collection methods (text messages/phone calls/field visits), historical collection results, etc. 
The collection information helps the model learn which collection behavior patterns before the long-aging stage are related to the final repayment outcome.\n(4) User behavior information\nThe user’s full-link operation log within the APP is the most granular behavioral data source for this project:\nRegistration behavior: registration time, registration channel, registration device, etc.;\nLogin behavior: login time, login frequency, login device changes, login region changes, etc.;\nButton clicks: distribution of click hot spots on each functional page, page dwell time, page jump path, nodes where the application process was interrupted, etc.\nAPP behavioral data captures the user\u0026rsquo;s active engagement with the product, and deeper behaviors (such as proactively checking the repayment plan) are often highly correlated with repayment willingness.\n2.2 Engineering construction of feature dimensions\nIf a single behavior field is fed into the model directly, without being cross-combined with time windows, the results are often very poor. For example, \u0026ldquo;number of logins in the past 7 days\u0026rdquo; discriminates much better than \u0026ldquo;total number of logins in history\u0026rdquo; in long-aging scenarios, because recent behavior better reflects the user\u0026rsquo;s current willingness to repay.\nThe feature construction of this project adopts an engineering scheme based on a five-dimensional cross combination of Order Type × Behavior Label × Time Dimension × Indicator Value × Calculation Method. 
Take Multi-head application feature as an example to fully explain the feature naming specifications and construction logic.\n2.2.1 Feature naming convention userMulti{订单类型}{行为标签}{指标值}{计算方式}{时间窗口}D Feature names directly carry business meaning, and calibers can be understood without additional documentation:\nFeature name meaning userMultiAllAllOrderCnt7D Number of requests for all orders in the past 7 days userMultiIsLoanAllAmountSum30D The total amount of disbursed orders in the past 30 days userMultiIsNoLoanPackageDistinctCnt14D The number of different products involved in undisbursed orders in the past 14 days 2.2.2 Three dimensions of feature construction Dimension one: time dimension\nDivided according to the number of days from the application time to the sample point: 1D, 3D, 7D, 14D, 30D, 60D, All.\nDimension 2: Order type\nFilter subsamples by user historical order status:\nOrder type meaning All Full historical orders IsLoan Orders that have been successfully loaned in history IsNoLoan Orders applied for but not released in history Dimension three: behavioral labeling\nFurther refined screening based on order type:\nbehavior tag meaning All No additional filtering Overdue1Day Orders overdue for ≥1 day Overdue3Day Orders overdue for ≥3 days PreRepay3Day Orders with repayment ≥3 days in advance 2.2.3 Indicator values ​​and calculation methods Indicator value (cal_field): Key fields covering the entire life cycle of historical orders:\nindicator value meaning Package Product applied for Order Order ID Amount loan amount ApplyLendingInterval / ApplyRepaidInterval Time interval of each node OverdueDay/PreRepayDay Number of days overdue/number of days in advance of repayment PartRepayAmt/PartRepayAmtPct Partial repayment amount and proportion Calculation method: 12 types in total, reused for all indicator values:\nCalculation method meaning Cnt times/number of transactions DistinctCnt number of different values RepeatCnt Number of repetitions (= Cnt - 
DistinctCnt) Sum Sum Avg Mean Max Maximum value Min Minimum value Std Population standard deviation Skew Skewness Median Median Pct Ratio of records meeting the specified condition to all records DistinctPct Ratio of distinct values under the specified condition to all distinct values Feature generation formula:\nFeature = Order type + Behavior label + Indicator value + Calculation method + Time window Example interpretation: userMultiAllAllOrderCnt7D\nOrder type All (all orders) Behavior label All (no additional filtering) Indicator value Order Calculation method Cnt (count) Time window 7D (last 7 days) 2.2.4 Combination size estimation Order types (N) × Behavior labels (M) × Time windows (T) × Indicator values (K) × Calculation methods (A) = N × M × T × K × A features Taking the multi-head application features as an example, and assuming 3 order types, 10 behavior labels, 7 time windows, 10 indicator values, and 12 calculation methods, a single multi-head application module can generate 3 × 10 × 7 × 10 × 12 = 25,200 features.\nThe same logic is applied to every data source (user information, risk dimensions, collection information, user behavior information, and so on), which together form a massive feature library.\n2.3 Missing value and outlier processing strategies Across this massive feature set, null values in the raw data come from many sources: the user may not have granted authorization, data collection may have failed, or the business may simply not have recorded a status. 
This project adopts the strategy of multi-level coding + exclusionary aggregation instead of a simple filling method.\n2.3.1 Multi-level null coding system An additional {indicator value}_none tag column is generated for each original field, and the following situations are uniformly recognized as null value tags through the get_is_none function:\ndef get_is_none(x): if x is None or str(x).replace(\u0026#39; \u0026#39;,\u0026#39;\u0026#39;).lower().strip() in [\u0026#39;-999999\u0026#39;,\u0026#39;-999976\u0026#39;,\u0026#39;-999977\u0026#39;,\u0026#39;-999978\u0026#39;, \u0026#39;-999999.0\u0026#39;,\u0026#39;-999999.0000\u0026#39;, \u0026#39;-999976.0\u0026#39;,\u0026#39;-999976.0000\u0026#39;, \u0026#39;-999977.0\u0026#39;,\u0026#39;-999977.0000\u0026#39;, \u0026#39;-999978.0\u0026#39;,\u0026#39;-999978.0000\u0026#39;, \u0026#39;none\u0026#39;,\u0026#39;nan\u0026#39;,\u0026#39;nat\u0026#39;,\u0026#39;\u0026#39;,\u0026#39;000000000000000\u0026#39;]: return 1 else: return 0 u_get_is_none = udf(get_is_none, IntegerType()) At the same time, different coding values ​​are assigned to different abnormal situations, and different business meanings can be distinguished during model training:\ncoding meaning -999976 Calculation exception (system error such as type conversion failure) -999977 The dividend is empty or the divisor is empty -999978 Divisor is 0 -999999 Null value at the business level (if the user did not actually perform any action) 2.3.2 Exclusive aggregation When generating aggregate features, force filter records with {metric value}_none == 1 in the conditions of each aggregation calculation:\nif condition_label == \u0026#39;All\u0026#39;: cal_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; (f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0 ) else: cal_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; 
(f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0 ) \u0026amp; (f.col(f\u0026#39;type_{condition_label}\u0026#39;) == 1) This means that every aggregate feature (Cnt, Sum, Avg, Max, etc.) is computed only from valid records, rather than having sentinel fill values leak into the statistics.\n2.3.3 Project implementation logic The actual process has two steps:\nStep 1: For each original indicator value field, generate the corresponding _none flag column:\nfor cal_field in cal_field_list: relative_df = relative_df.withColumn(f\u0026#39;{cal_field}_none\u0026#39;, u_get_is_none(cal_field)) Step 2: During aggregation, add the _none == 0 restriction to the condition of every aggregate so that null records never take part in any arithmetic, which protects feature quality at the source.\n2.4 Technical implementation: Spark-based feature engineering pipeline This section walks through the full process of extracting raw data from the database with Spark and generating features through UDF processing and multi-dimensional aggregation. 
The codes are all taken from the real project ipynb implementation.\n2.4.1 Core UDF definition The following are the core custom functions in ipynb, used for time processing, null value judgment and calculation of key business indicators:\n(1) Null value judgment and division safety\ndef get_is_none(x): if x is None or str(x).replace(\u0026#39; \u0026#39;,\u0026#39;\u0026#39;).lower().strip() in [\u0026#39;-999999\u0026#39;,\u0026#39;-999976\u0026#39;,\u0026#39;-999977\u0026#39;,\u0026#39;-999978\u0026#39;, \u0026#39;-999999.0\u0026#39;,\u0026#39;-999999.0000\u0026#39;, \u0026#39;-999976.0\u0026#39;,\u0026#39;-999976.0000\u0026#39;, \u0026#39;-999977.0\u0026#39;,\u0026#39;-999977.0000\u0026#39;, \u0026#39;-999978.0\u0026#39;,\u0026#39;-999978.0000\u0026#39;, \u0026#39;none\u0026#39;,\u0026#39;nan\u0026#39;,\u0026#39;nat\u0026#39;,\u0026#39;\u0026#39;,\u0026#39;000000000000000\u0026#39;]: return 1 else: return 0 u_get_is_none = udf(get_is_none, IntegerType()) def get_cal_division(numerator, denominator): try: if denominator == 0: return -999978 elif get_is_none(numerator) or get_is_none(denominator): return -999977 else: return round(numerator / denominator, 2) except Exception as e: return -999976 (2) Time zone conversion\ndef remove_tz(date_time): return datetime.strptime(date_time.strftime(\u0026#39;%Y-%m-%d %H:%M:%S\u0026#39;), \u0026#39;%Y-%m-%d %H:%M:%S\u0026#39;) def convert_to_nigeria_time(date_time): if type(date_time) == str: date_time_d = str(date_time)[0:19] else: date_time_d = date_time.strftime(\u0026#39;%Y-%m-%d %H:%M:%S\u0026#39;) time_stamp = time.mktime(time.strptime(date_time_d, \u0026#39;%Y-%m-%d %H:%M:%S\u0026#39;)) utc_dt = datetime.utcfromtimestamp(time_stamp).replace(tzinfo=pytz.utc) dest_tz = timezone(\u0026#39;Africa/Lagos\u0026#39;) dest_dt = dest_tz.normalize(utc_dt.astimezone(dest_tz)) return remove_tz(dest_dt) def change_to_nigeria(date_time): if get_is_none(date_time): return None else: return convert_to_nigeria_time(date_time) 
u_change_to_nigeria = udf(change_to_nigeria, TimestampType()) (3) Time interval calculation\ndef get_timedelta_strtime(time_end, time_start): try: if get_is_none(time_end) or get_is_none(time_start): return None if len(str(time_end)) \u0026lt; 15 and len(str(time_start)) \u0026lt; 15: time_start = datetime.strptime(str(time_start)[0:10], \u0026#34;%Y-%m-%d\u0026#34;) time_end = datetime.strptime(str(time_end)[0:10], \u0026#34;%Y-%m-%d\u0026#34;) elif len(str(time_end)) \u0026lt; 15 and len(str(time_start)) \u0026gt; 15: time_start = datetime.strptime(str(time_start)[0:19], \u0026#34;%Y-%m-%d %H:%M:%S\u0026#34;) time_end = datetime.strptime(str(time_end)[0:10], \u0026#34;%Y-%m-%d\u0026#34;) else: time_start = datetime.strptime(str(time_start)[0:19], \u0026#34;%Y-%m-%d %H:%M:%S\u0026#34;) time_end = datetime.strptime(str(time_end)[0:19], \u0026#34;%Y-%m-%d %H:%M:%S\u0026#34;) seconds = (time_end - time_start).total_seconds() if seconds \u0026lt; 0: seconds = None return seconds except Exception as e: return -999976 u_get_timedelta_strtime = udf(get_timedelta_strtime, FloatType()) (4) Calculation of days for early repayment\ndef get_pre_repay_day(real_repay_time, due_time, is_repay): if str(is_repay) in [\u0026#39;1\u0026#39;, \u0026#39;1.0\u0026#39;]: if get_is_none(real_repay_time): return -999999 elif get_is_none(due_time): return -999999 else: real_repay_time = datetime.strptime(str(real_repay_time)[0:10], \u0026#34;%Y-%m-%d\u0026#34;) due_time = datetime.strptime(str(due_time)[0:10], \u0026#34;%Y-%m-%d\u0026#34;) pre_repayday = int((due_time - real_repay_time).days) if pre_repayday \u0026lt; 0: pre_repayday = -999999 return pre_repayday else: return -999999 u_get_pre_repay_day = udf(get_pre_repay_day, IntegerType()) 2.4.2 Core aggregate function: get_agg_v2 This function encapsulates 12 types of aggregation calculation logic, automatically constructs Spark aggregation expressions after inputting five-dimensional parameters, and is the core of the entire feature 
engineering pipeline:\ndef get_agg_v2(time_field, dt_, condition_label, condition_type, cal_field, cal_type, fea_name): \u0026#34;\u0026#34;\u0026#34; Build one Spark aggregation expression for a single feature. :param time_field: column holding the time interval in seconds :param dt_: time window in days, or \u0026#39;All\u0026#39; :param condition_label: name of the first type_* filter column, or \u0026#39;All\u0026#39; :param condition_type: name of the second type_* filter column, or \u0026#39;All\u0026#39; :param cal_field: column to aggregate :param cal_type: aggregation method (Cnt, Sum, Avg, ...) :param fea_name: alias of the output feature :return: aliased aggregation Column, or None if cal_type is not recognized \u0026#34;\u0026#34;\u0026#34; print(f\u0026#34;export feature:{fea_name}\u0026#34;) if dt_ != \u0026#39;All\u0026#39;: time_windows_value = dt_ * 24 * 60 * 60 else: time_windows_value = 1000 * 24 * 60 * 60 if condition_label == \u0026#39;All\u0026#39;: cal_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; (f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0) else: cal_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; (f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0) \u0026amp; (f.col(f\u0026#39;type_{condition_label}\u0026#39;) == 1) if condition_type != \u0026#39;All\u0026#39;: cal_condition = cal_condition \u0026amp; (f.col(f\u0026#39;type_{condition_type}\u0026#39;) == 1) if cal_type == \u0026#39;Std\u0026#39;: return f.stddev_pop(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;Avg\u0026#39;: return f.mean(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;Sum\u0026#39;: return f.sum(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;sumDistinct\u0026#39;: return f.sumDistinct(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;Max\u0026#39;: return f.max(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;Skew\u0026#39;: return f.skewness(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;Median\u0026#39;: return f.percentile_approx(f.when(cal_condition, f.col(cal_field)), 0.5).alias(fea_name) elif cal_type == \u0026#39;Min\u0026#39;: return f.min(f.when(cal_condition, 
f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;Cnt\u0026#39;: return f.count(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;DistinctCnt\u0026#39;: return f.countDistinct(f.when(cal_condition, f.col(cal_field))).alias(fea_name) elif cal_type == \u0026#39;RepeatCnt\u0026#39;: return (f.count(f.when(cal_condition, f.col(cal_field))) - f.countDistinct(f.when(cal_condition, f.col(cal_field)))).alias(fea_name) elif cal_type == \u0026#39;DistinctPct\u0026#39;: all_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; (f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0) return (f.countDistinct(f.when(cal_condition, f.col(cal_field))) / f.countDistinct(f.when(all_condition, f.col(cal_field)))).alias(fea_name) elif cal_type == \u0026#39;Pct\u0026#39;: all_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; (f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0) return (f.count(f.when(cal_condition, f.col(cal_field))) / f.count(f.when(all_condition, f.col(cal_field)))).alias(fea_name) elif cal_type == \u0026#39;RepeatPct\u0026#39;: all_condition = (f.col(time_field) \u0026lt;= time_windows_value) \u0026amp; (f.col(time_field) \u0026gt;= 0) \u0026amp; (f.col(f\u0026#39;{cal_field}_none\u0026#39;) == 0) return ((f.count(f.when(cal_condition, f.col(cal_field))) - f.countDistinct(f.when(cal_condition, f.col(cal_field)))) / (f.count(f.when(all_condition, f.col(cal_field))) - f.countDistinct(f.when(all_condition, f.col(cal_field))))).alias(fea_name) return None 2.4.3 Data preprocessing + aggregate feature generation The core encapsulation class of the entire pipeline includes two methods: data preprocessing (prepare_df) and batch aggregation (feature_df):\nclass Features1(object): def __init__(self, spark: SparkSession): self.spark = spark def prepare_df(self, from_sample, prepare_table): # 加载订单表 
self.spark.sql(f\u0026#34;\u0026#34;\u0026#34; select * from risk.dw_order_info_slice_2 where country = \u0026#39;NIGERIA\u0026#39; and merchant_id = 1 and not isnull(APPLY_DAY) \u0026#34;\u0026#34;\u0026#34;) \\ .createOrReplaceTempView(\u0026#39;loan_df\u0026#39;) col_list = self.spark.sql(\u0026#34;select * from risk.dw_order_info_slice_2\u0026#34;).columns col_list.remove(\u0026#39;REPAY_TIME_FORMAT\u0026#39;) col_list = [f\u0026#34;o.{x} as SAMPLE_{x}\u0026#34; for x in col_list] # 关联样本表与订单表 self.spark.sql(f\u0026#34;\u0026#34;\u0026#34; select {\u0026#39;,\u0026#39;.join(col_list)}, s.due_day as SAMPLE_REPAY_TIME_FORMAT, to_date(s.d14_date) as SAMPLE_d14_date, date_add(to_date(s.d14_date), 1) as SAMPLE_APPLY_LIMIT_DAY from {from_sample} s left join risk.dw_order_info_slice_2 o on s.bis_order_id = o.bis_order_id where country = \u0026#39;NIGERIA\u0026#39; and merchant_id = 1 \u0026#34;\u0026#34;\u0026#34;).createOrReplaceTempView(\u0026#39;sample_order\u0026#39;) self.spark.sql(\u0026#34;\u0026#34;\u0026#34; select s.* , l.* from sample_order s left join loan_df l on s.SAMPLE_BIS_USER_ID = l.BIS_USER_ID where l.APPLY_DAY \u0026lt; S.SAMPLE_APPLY_LIMIT_DAY and S.SAMPLE_APPLY_TIME \u0026lt; l.APPLY_TIME and s.SAMPLE_BIS_ORDER_ID != l.BIS_ORDER_ID \u0026#34;\u0026#34;\u0026#34;).createOrReplaceTempView(\u0026#39;sample_df\u0026#39;) relative_df = self.spark.sql(\u0026#34;select * from sample_df\u0026#34;) # 时区转换 relative_df = relative_df.withColumn(\u0026#39;LENDING_TIME_FORMAT\u0026#39;, u_get_time_process(\u0026#39;LENDING_TIME_FORMAT\u0026#39;, \u0026#39;SAMPLE_APPLY_LIMIT_DAY\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;REPAIED_TIME_FORMAT\u0026#39;, u_get_time_process(\u0026#39;REPAIED_TIME_FORMAT\u0026#39;, \u0026#39;SAMPLE_APPLY_LIMIT_DAY\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;SAMPLE_APPLY_TIME_NI\u0026#39;, u_change_to_nigeria(\u0026#39;SAMPLE_APPLY_TIME\u0026#39;)) relative_df = 
relative_df.withColumn(\u0026#39;LENDING_TIME_TIMESTAMP_NI\u0026#39;, u_change_to_nigeria(\u0026#39;LENDING_TIME_FORMAT\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;REPAIED_TIME_TIMESTAMP_NI\u0026#39;, u_change_to_nigeria(\u0026#39;REPAIED_TIME_FORMAT\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;APPLY_TIME_NI\u0026#39;, u_change_to_nigeria(\u0026#39;APPLY_TIME\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;SAMPLE_APPLY_DAY_NI\u0026#39;, u_get_date_from_timestamp(\u0026#39;SAMPLE_APPLY_TIME_NI\u0026#39;)) # 生成时间间隔字段（秒） relative_df = relative_df.withColumn(\u0026#39;ApplyLendingInterval\u0026#39;, u_get_timedelta_strtime(\u0026#39;LENDING_TIME_TIMESTAMP_NI\u0026#39;, \u0026#39;SAMPLE_APPLY_TIME_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;ApplyApplyInterval\u0026#39;, u_get_timedelta_strtime(\u0026#39;APPLY_TIME_NI\u0026#39;, \u0026#39;SAMPLE_APPLY_TIME_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;ApplyRepaiedInterval\u0026#39;, u_get_timedelta_strtime(\u0026#39;REPAIED_TIME_TIMESTAMP_NI\u0026#39;, \u0026#39;SAMPLE_APPLY_TIME_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;LimitApplyInterval\u0026#39;, u_get_timedelta_strtime(\u0026#39;SAMPLE_APPLY_LIMIT_DAY\u0026#39;, \u0026#39;APPLY_TIME\u0026#39;)) # 生成业务标签字段 relative_df = relative_df.withColumn(\u0026#39;is_reject\u0026#39;, u_get_value_equals(30)(col(\u0026#39;STATUS\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;is_cancel\u0026#39;, u_get_value_equals(60)(col(\u0026#39;STATUS\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;is_loan\u0026#39;, u_get_is_loan(\u0026#39;LENDING_TIME_TIMESTAMP_NI\u0026#39;, \u0026#39;SAMPLE_APPLY_LIMIT_DAY\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;is_repay\u0026#39;, u_get_is_repay(\u0026#39;REPAIED_TIME_TIMESTAMP_NI\u0026#39;)) # 生成日期字段 relative_df = relative_df.withColumn(\u0026#39;ApplyDays\u0026#39;, 
u_get_message_day(\u0026#39;APPLY_TIME_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;LendingDays\u0026#39;, u_get_message_day(\u0026#39;LENDING_TIME_TIMESTAMP_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;RepayDays\u0026#39;, u_get_message_day(\u0026#39;REPAIED_TIME_TIMESTAMP_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;repay_date\u0026#39;, u_get_message_day(\u0026#39;REPAIED_TIME_TIMESTAMP_NI\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;due_date\u0026#39;, u_get_message_day(\u0026#39;ORG_REPAY_TIME\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;PreRepayDay\u0026#39;, u_get_pre_repay_day(\u0026#39;repay_date\u0026#39;, \u0026#39;due_date\u0026#39;, \u0026#39;is_repay\u0026#39;)) # 生成实还金额、部分还款相关字段 relative_df = relative_df.withColumn(\u0026#39;REPAIED_AMOUNT_FIX\u0026#39;, u_get_repaid_amount(\u0026#39;REPAY_AMOUNT\u0026#39;, \u0026#39;REPAIED_AMOUNT\u0026#39;, \u0026#39;is_repay\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;REPAIED_AMOUNT_FIX\u0026#39;, get_none_defalut_value(0)(col(\u0026#39;REPAIED_AMOUNT_FIX\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;REPAY_AMOUNT\u0026#39;, get_none_defalut_value(0)(col(\u0026#39;REPAY_AMOUNT\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;IsPartRepay\u0026#39;, u_is_part_repay(\u0026#39;is_repay\u0026#39;, \u0026#39;REPAIED_AMOUNT_FIX\u0026#39;, \u0026#39;REPAY_AMOUNT\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;PartRepayAmtPct\u0026#39;, u_get_part_repay_amount_pct(\u0026#39;is_repay\u0026#39;, \u0026#39;REPAIED_AMOUNT_FIX\u0026#39;, \u0026#39;REPAY_AMOUNT\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;PartRepayAmt\u0026#39;, u_get_part_repay_amount(\u0026#39;IsPartRepay\u0026#39;, \u0026#39;REPAIED_AMOUNT_FIX\u0026#39;, \u0026#39;REPAY_AMOUNT\u0026#39;)) # 生成布尔标签字段 relative_df = relative_df.withColumn(\u0026#39;IsRepay\u0026#39;, 
u_get_value_equals(1)(col(\u0026#39;is_repay\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;IsLoan\u0026#39;, u_get_value_equals(1)(col(\u0026#39;is_loan\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;IsNoLoan\u0026#39;, u_get_value_equals(0)(col(\u0026#39;is_loan\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;IsCancel\u0026#39;, u_get_value_equals(1)(col(\u0026#39;is_cancel\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;IsReject\u0026#39;, u_get_value_equals(1)(col(\u0026#39;is_reject\u0026#39;))) relative_df = relative_df.withColumn(\u0026#39;IsOnLoan\u0026#39;, u_is_on_loan(\u0026#39;is_loan\u0026#39;, \u0026#39;is_repay\u0026#39;)) relative_df = relative_df.withColumn(\u0026#39;IsOnTimeRepay\u0026#39;, u_is_on_time_repay(\u0026#39;is_repay\u0026#39;, \u0026#39;PreRepayDay\u0026#39;)) # Rename fields relative_df = relative_df.withColumnRenamed(\u0026#39;LENDING_AMOUNT\u0026#39;, \u0026#39;Amount\u0026#39;) relative_df = relative_df.withColumnRenamed(\u0026#39;BIS_ORDER_ID\u0026#39;, \u0026#39;Order\u0026#39;) relative_df = relative_df.withColumnRenamed(\u0026#39;PACKAGE_ID\u0026#39;, \u0026#39;Package\u0026#39;) # Generate the behavior-label type_* columns for t in con_list_all: if t == \u0026#39;All\u0026#39;: relative_df = relative_df.withColumn(f\u0026#39;type_{t}\u0026#39;, u_get_all(\u0026#39;Order\u0026#39;)) elif t == \u0026#39;PreRepay1Day\u0026#39;: relative_df = relative_df.withColumn(f\u0026#39;type_{t}\u0026#39;, u_get_value_ge(1)(col(\u0026#39;PreRepayDay\u0026#39;))) elif t == \u0026#39;PreRepay0Day\u0026#39;: relative_df = relative_df.withColumn(f\u0026#39;type_{t}\u0026#39;, u_get_value_equals(0)(col(\u0026#39;PreRepayDay\u0026#39;))) elif t == \u0026#39;IsPartRepayAmt7Pct\u0026#39;: relative_df = relative_df.withColumn(f\u0026#39;type_{t}\u0026#39;, u_get_value_between(0, 0.7)(col(\u0026#39;PartRepayAmtPct\u0026#39;))) else: relative_df = relative_df.withColumn(f\u0026#39;type_{t}\u0026#39;, u_get_value_equals(1)(col(t))) # 
Generate a _none flag column for each raw field for cal_field in cal_field_list: relative_df = relative_df.withColumn(f\u0026#39;{cal_field}_none\u0026#39;, u_get_is_none(cal_field)) # Write out the preprocessed result relative_df.coalesce(64).write.mode(\u0026#39;overwrite\u0026#39;).saveAsTable(prepare_table) def generate_agg_list(self): agg_list = list() for con in cal_fea_dict.keys(): condition_list = cal_fea_dict[con][\u0026#39;con\u0026#39;] cal_list = cal_fea_dict[con][\u0026#39;cal_field\u0026#39;] for con_key in condition_list: for dt in dt_list: cal_dict = {} for cal_field in cal_list: cal_dict[cal_field] = cal_dict_all[cal_field] for cal_type in cal_dict[cal_field]: fea_name = f\u0026#39;userALS3Multi{con}{con_key}{cal_field}{cal_type}{str(dt)}D\u0026#39; agg_list.append(get_agg_v2( time_field=\u0026#39;LimitApplyInterval\u0026#39;, dt_=dt, condition_label=con, condition_type=con_key, cal_field=cal_field, cal_type=cal_type, fea_name=fea_name )) return agg_list def feature_df(self, df: DataFrame, repartition_num, coalesce_num, table_name, mode): groupby_field_list = [\u0026#39;BIS_ORDER_ID\u0026#39;] agg_list = self.generate_agg_list() index = 1 # Write the CSV output in 10 batches to avoid running out of memory for l in np.array_split(agg_list, 10): df.groupby([f\u0026#39;SAMPLE_{x}\u0026#39; for x in groupby_field_list]).agg(*l) \\ .coalesce(coalesce_num).write.mode(mode).option(\u0026#34;header\u0026#34;, \u0026#34;true\u0026#34;).csv(f\u0026#34;{table_name}_{index}.csv\u0026#34;) index += 1 2.4.4 Pipeline execution # Initialize and run preprocessing fea = Features1(spark) fea.prepare_df(from_sample=\u0026#39;risk_analysis.sample_df_S3_20230727\u0026#39;, prepare_table=\u0026#39;risk_analysis.multiALS3_prepare_20230727\u0026#39;) # Load the preprocessed result and run feature aggregation df = spark.sql(\u0026#34;select * from risk_analysis.multiALS3_prepare_20230727\u0026#34;) fea.feature_df(df, repartition_num=0, coalesce_num=64, table_name=\u0026#39;multiALS3_features_20230727\u0026#39;, mode=\u0026#39;overwrite\u0026#39;) 2.4.5 Merge and export features CSV import pandas as pd multi_features_S3 = [] for i in 
range(10): print(i) result = spark.read.csv(f\u0026#34;multiALS3_features_20230727_{i+1}.csv\u0026#34;) result_df = result.toPandas() result_df.columns = result_df.values.tolist()[0] result_df = result_df[result_df[\u0026#39;SAMPLE_BIS_ORDER_ID\u0026#39;] != \u0026#39;SAMPLE_BIS_ORDER_ID\u0026#39;] result_df = result_df.reset_index(drop=True) if len(multi_features_S3) == 0: multi_features_S3 = result_df else: multi_features_S3 = multi_features_S3.merge(result_df, on=\u0026#39;SAMPLE_BIS_ORDER_ID\u0026#39;, how=\u0026#39;outer\u0026#39;) print(multi_features_S3.shape) multi_features_S3.to_csv(\u0026#39;../multiALS3_features_20230727.csv\u0026#39;) The pipeline covers the full chain: data joining → time zone conversion → time interval calculation → business label generation → null value flagging → multi-dimensional aggregation → batched writing → merge and export. Because it runs on Spark\u0026rsquo;s distributed computing, the massive feature set can be built within an hour.\nChapter 3: Sample Selection 3.1 Y label definition The goal of the model is to predict whether a long-aging order will see any repayment in a subsequent period.\nInitial label plan: Use \u0026ldquo;whether the order is settled\u0026rdquo; as the label: settled = 1, unsettled = 0. Analysis of the actual data, however, showed that the settlement rate of long-aging orders was extremely low (less than 5%), so positive and negative samples were severely imbalanced and the model struggled to learn effectively.\nFinal plan: Switch the label to \u0026ldquo;whether there is any repayment within the stage\u0026rdquo;: repayment within the stage = 1, no repayment within the stage = 0. 
With this redefinition, the proportion of positive samples rose from less than 5% to 6.5%, which eased the class skew and improved how well the model could learn.\n# Label generation logic sample[\u0026#39;target\u0026#39;] = (sample[\u0026#39;d21Amount\u0026#39;] \u0026gt; 0).astype(int) # d21Amount \u0026gt; 0 → repayment occurred → target = 1 sample[\u0026#39;flag\u0026#39;] = sample[\u0026#39;create_date\u0026#39;].astype(\u0026#39;str\u0026#39;).apply(lambda x: x \u0026gt;= \u0026#39;2023-07-28\u0026#39;) sample = sample.rename(columns={\u0026#39;bis_order_id\u0026#39;: \u0026#39;BIS_ORDER_ID\u0026#39;}) 3.2 Positive and negative sample sizes The total number of samples is 22,895, distributed as follows:\nLabel Meaning Number of samples Proportion 1 Repayment within the stage ~21,413 ~93.5% 0 No repayment within the stage ~1,482 ~6.5% Positive samples (repayment) account for the vast majority, and negative samples (non-repayment) account for about 6.5%. This is a typical highly imbalanced binary classification problem. 
It is also a key difficulty that the subsequent feature screening and model parameter tuning need to address.\n3.3 Training/validation/test set division The samples are divided into three independent sets by combining a date cutoff with a random split:\n# Use 2023-07-28 as the cutoff: samples before it are used for training and testing, samples on or after it for validation sample[\u0026#39;flag\u0026#39;] = sample[\u0026#39;create_date\u0026#39;].astype(\u0026#39;str\u0026#39;).apply(lambda x: x \u0026gt;= \u0026#39;2023-07-28\u0026#39;) # Randomly split the flag=0 samples into 70% training and 30% test train, test = train_test_split( sample[sample[\u0026#39;flag\u0026#39;] == 0], test_size=0.3, random_state=23123 ) # Use all flag=1 samples as the validation set vld = sample[sample[\u0026#39;flag\u0026#39;] == 1] Division results:\nDataset Number of samples Proportion of positive samples Proportion of negative samples train 14,377 93.50% 6.50% test 6,162 93.36% 6.64% vld (validation set) 2,356 93.93% 6.07% The positive/negative distribution is highly consistent across the three sets, with the positive rate differing by less than 0.6 percentage points between any two sets, which indicates that the split is statistically stable. Because the validation set is separated from the training data by date, it provides an out-of-time check of the model\u0026rsquo;s generalization ability.\nChapter 4: Feature Screening 4.0 Introduction to prerequisite knowledge Before turning to the feature screening methods, two core concepts need introducing: the IV value and the LightGBM gain indicator. The former measures the discriminating power of a single feature, and the latter measures the actual contribution of a feature inside the tree model. Together they form the theoretical basis for feature screening in this project.\n4.0.1 Principle of IV value IV (Information Value) is the most classic feature screening indicator in credit scoring. It measures the ability of a single feature to distinguish the target variable. 
Its calculation is based on WoE (Weight of Evidence).\nIntuitive understanding of WoE: For a binary classification problem (repayment = 1 / non-repayment = 0), the WoE of an interval compares the share of all non-repayment samples that fall in that interval with the share of all repayment samples that fall in it: WoE = ln(non-repayment share / repayment share).\nWoE \u0026gt; 0: The non-repayment share in this interval is higher than the repayment share, so this value range is associated with the negative class (non-repayment); WoE \u0026lt; 0: The non-repayment share in this interval is lower than the repayment share, so this value range is associated with the positive class (repayment); The larger |WoE| is, the stronger the discriminating power of that interval. Calculation of IV: IV is the weighted sum of the WoE values over all intervals, where the weight of each interval is (non-repayment share - repayment share). Because the weight and the WoE always have the same sign, every interval contributes a non-negative amount, and the larger an interval\u0026rsquo;s contribution to IV, the stronger that interval\u0026rsquo;s ability to distinguish the target variable. 
The larger the IV value, the better this feature separates repayment from non-repayment.\nIndustry IV Interpretation Standard:\nIV value range Feature effect judgment \u0026lt; 0.02 Useless feature, no discriminating power 0.02 ~ 0.1 Weak feature, limited discriminating power 0.1 ~ 0.3 Medium feature, moderate discriminating power 0.3 ~ 0.5 Strong feature, strong discriminating power \u0026gt; 0.5 Suspiciously strong (manually check for data leakage) The IV screening threshold for this project is 0.05, that is, features with IV \u0026lt; 0.05 are eliminated directly.\n4.0.2 Gain indicator of LGBM model In addition to univariate indicators such as IV, this project uses LightGBM\u0026rsquo;s gain indicator in the second round of feature screening to measure the actual contribution of each feature inside the tree model.\nIntuitive understanding of Gain: Every time LightGBM splits a node, it selects the feature and split point that reduce the current objective function the most. This \u0026ldquo;drop\u0026rdquo; is the gain brought by the split. If a feature keeps producing large decreases in the objective function across many trees and many splits, it contributes more to the model\u0026rsquo;s predictions.\nThe larger the Gain: the stronger the feature\u0026rsquo;s actual explanatory and discriminating power in the model; The smaller the Gain: although the feature may discriminate well on its own, its contribution is limited once it is combined with other features; If a feature has a high gain in a single modeling run but fluctuates greatly across different random partitions, its stability is insufficient. Unlike IV, gain is an indicator internal to the model and is measured in a multi-variable setting. 
IV is better suited to the first round of coarse screening, which quickly eliminates obviously weak features; gain is better suited to the second round of re-screening, which determines how stably a feature contributes during actual modeling. This project therefore combines \u0026ldquo;IV initial screening + gain re-screening\u0026rdquo; to converge the feature set.\n4.1 Initial screening: information value (IV) screening After null handling, the feature library still contains many redundant features that either discriminate poorly or are highly correlated with other features. Feeding all of them into the model would make training extremely expensive and would easily introduce noise and cause overfitting. Two rounds of feature screening are therefore applied before formal modeling to shrink the feature set step by step.\n4.1.1 Implementation of preliminary screening Use the selection.select method of the toad library to filter features by null rate, IV, and correlation:\nimport toad for k in feature_dict.keys(): print(k) f_df = pd.read_csv(feature_dict[k] + \u0026#39;.csv\u0026#39;) join_df = sample[sample[\u0026#39;flag\u0026#39;] == 0][[\u0026#39;target\u0026#39;, \u0026#39;BIS_ORDER_ID\u0026#39;]] \\ .merge(f_df.rename(columns={\u0026#39;SAMPLE_BIS_ORDER_ID\u0026#39;: \u0026#39;BIS_ORDER_ID\u0026#39;}), on=\u0026#39;BIS_ORDER_ID\u0026#39;, how=\u0026#39;inner\u0026#39;) col_list = list(join_df.columns[2:]) select_df = toad.selection.select( join_df[col_list].fillna(-999999), target=join_df[\u0026#39;target\u0026#39;], empty=0.6, # drop features with null rate \u0026gt; 60% iv=0.05, # drop features with IV \u0026lt; 0.05 corr=0.8, # keep one of each feature pair with correlation \u0026gt; 0.8 return_drop=False, exclude=None ) select_col = list(select_df.columns) print(f\u0026#39;feature_shape:{f_df.shape}, train_join_shape:{len(select_col)}\u0026#39;) select_col_dict[k] = select_col 
Interpretation of screening parameters:\nParameter Meaning empty=0.6 Drop features whose null rate exceeds 60% iv=0.05 Drop features whose IV is below 0.05 corr=0.8 When the correlation coefficient of a pair of features exceeds 0.8, keep only the one with the higher IV 4.2 Re-screening: Gain value stability screening 4.2.1 Principle of stability screening IV screening has two limitations. First, the IV value depends on the binning method, and different binning granularities may produce different rankings. Second, IV is a univariate indicator and cannot reflect a feature\u0026rsquo;s actual gain once it is combined with other features. Therefore, the true contribution of each feature in a multi-variable setting is verified by actually fitting LightGBM models.\nTo guard against the randomness of a single modeling run, a strategy of random division + multiple modeling + intersection screening is adopted: the data is divided 5 times, the Top 100 features by gain are taken each time, and only features that enter the Top 100 in at least 4 of the 5 runs are retained (i.e. 
sum \u0026gt;= 4).\n4.2.2 Implementation of re-screening import lightgbm as lgb import numpy as np total_test_df = pd.concat([train_, test_], axis=0, ignore_index=True) df_list = np.array_split(total_test_df, 5) feature_name_dict = {} for i in [0, 1, 2, 3, 4]: print(f\u0026#34;i = {i}\u0026#34;) # Train on the other 4 parts and test on part i d_train_list = [df_list[x] for x in [0, 1, 2, 3, 4] if x != i] t_train = pd.concat(d_train_list, axis=0, ignore_index=True) t_test = df_list[i] col_list = [x for x in list(train_.columns)[2:] if \u0026#39;:\u0026#39; not in x] X_train = t_train[col_list].fillna(-999999) y_train = t_train[\u0026#39;target\u0026#39;] X_test = t_test[col_list].fillna(-999999) y_test = t_test[\u0026#39;target\u0026#39;] X_vld = vld_[col_list].fillna(-999999) y_vld = vld_[\u0026#39;target\u0026#39;] # Enable class-imbalance handling model_train2 = lgb.LGBMClassifier(is_unbalance=True) choose_model = model_train2 choose_model.fit(X_train, y_train) # Get the gain importances and keep the Top 100 gain = choose_model.booster_.feature_importance(\u0026#39;gain\u0026#39;) fi = pd.DataFrame({ \u0026#39;feature\u0026#39;: list(X_train.columns), \u0026#39;split\u0026#39;: choose_model.booster_.feature_importance(\u0026#39;split\u0026#39;), \u0026#39;gain\u0026#39;: 100 * gain / gain.sum() }).sort_values(\u0026#39;gain\u0026#39;, ascending=False) top_feature = fi.head(100) feature_name_dict[i] = top_feature # Print the training/test/validation metrics for each fold print(\u0026#34;train_auc:{}, test_auc:{}, vld_auc:{}\u0026#34;.format(...)) print(\u0026#34;train_ks:{}, test_ks:{}, vld_ks:{}\u0026#34;.format(...)) 4.2.3 Intersection filtering # Combine the Top 100 features from the 5 modeling runs feature_select_df = pd.DataFrame() for i in [0, 1, 2, 3, 4]: f_list = feature_name_dict[i][\u0026#39;feature\u0026#39;].to_list() temp_df = pd.DataFrame({\u0026#39;f_name\u0026#39;: f_list, f\u0026#39;{i}_select\u0026#39;: 1}) if feature_select_df.empty: feature_select_df = temp_df else: feature_select_df = feature_select_df.merge(temp_df, on=\u0026#39;f_name\u0026#39;, how=\u0026#39;outer\u0026#39;) feature_select_df 
= feature_select_df.fillna(0) # Count how many of the 5 runs placed each feature in the Top 100 feature_select_df[\u0026#39;sum\u0026#39;] = feature_select_df.apply( lambda x: np.sum([x[f\u0026#39;{i}_select\u0026#39;] for i in [0, 1, 2, 3, 4]]), axis=1 ) # Features that enter the Top 100 at least 4 times are selected select_col_list = feature_select_df[feature_select_df[\u0026#39;sum\u0026#39;] \u0026gt;= 4][\u0026#39;f_name\u0026#39;].to_list() Screenshots of the training process\nRescreen feature set (excerpt)\n4.2.4 Final Screening: Standard LGBM Top 60 Gain Values After the two rounds of screening, a final pass trains a standard LightGBM model on the full training set and keeps the 60 features with the highest gain as the final feature set. The purpose of this step is to ensure that the features entering the final model have the strongest discriminative power on the complete data set; it also re-verifies and further converges the results of the first two rounds.\n# Final standard LGBM training run; the Top 60 features by gain form the final feature set model_train2 = lgb.LGBMClassifier(is_unbalance=True) choose_model = model_train2 choose_model.fit(X_train, y_train) gain = choose_model.booster_.feature_importance(\u0026#39;gain\u0026#39;) fi = pd.DataFrame({ \u0026#39;feature\u0026#39;: list(X_train.columns), \u0026#39;split\u0026#39;: choose_model.booster_.feature_importance(\u0026#39;split\u0026#39;), \u0026#39;gain\u0026#39;: 100 * gain / gain.sum() }).sort_values(\u0026#39;gain\u0026#39;, ascending=False).reset_index(drop=True) # Keep the first 60 rows as the final selected features top_feature = fi.head(60) Finalized feature set (TOP20)\n4.2.5 Summary of screening results After four rounds of screening, the feature set gradually converged from tens of thousands of features to about 60 high-value features, which then enter the formal modeling stage:\nStage Number of features Screening method Original features Tens of thousands N/A After IV initial screening Thousands toad selection: empty=0.6, iv=0.05, corr=0.8 After re-screening the gain value stability 
~82 5 times of random modeling × Top100 × intersection (sum ≥ 4) After the final Top60 screening ~60 Standard LGBM gain value sorting takes the top 60 These approximately 60 features have gone through four rounds of screening (null value processing → IV preliminary screening → stability re-screening → LGBM Top60). They are highly robust features converged layer by layer from the tens of thousands of feature pools. They have strong distinguishing capabilities and are an important basis for the model to reach AUC 69 in long account aging scenarios.\nChapter 5: LightGBM model construction and training 5.0 LightGBM model introduction, important parameters and evaluation criteria 5.0.1 LightGBM selection background LightGBM (Light Gradient Boosting Machine) is an efficient ensemble learning algorithm based on gradient boosting decision tree (GBDT). Compared with traditional GBDT, LightGBM adopts Histogram and Leaf-wise growth strategies, which greatly improves the training speed while maintaining good prediction accuracy. It is widely used in risk control models in the industry.\nThree core reasons to choose LightGBM:\nSparse data friendly: It has high efficiency in processing high-dimensional sparse features and is suitable for the scenario of this project with a large number of null features; Interpretability: The feature gain value (gain) and the number of splits (split) can be output to facilitate the analysis of feature importance; Fast training speed: Compared with XGBoost, LightGBM has an order of magnitude improvement in training speed in scenarios with large data volumes. 
5.0.2 Important parameter description The key parameters and functions involved in the training of this model are as follows:\nparameter meaning The best value of this project boosting_type Promotion method, fixed to gbdt 'gbdt' objective objective function 'Binary' learning_rate Learning rate, controls the step size of each iteration 0.01 n_estimators The number of iteration rounds, the total number of trees built 300 max_depth The maximum depth of a single tree, limiting model complexity to prevent overfitting 2 num_leaves Number of leaf nodes, combined with max_depth to control model complexity 3 min_child_samples The minimum number of samples for leaf nodes to avoid a few samples dominating the split 20 min_child_weight Minimum weight of leaf nodes 0.001 min_split_gain split minimum gain threshold 0 reg_alpha L1 regularization coefficient, which shrinks some feature weights to 0 0.03 reg_lambda L2 regularization coefficient makes the feature weights overall smooth 0.01 subsample row sampling ratio 1 subsample_for_bin Number of sampled rows when building histogram 200000 subsample_freq subsampling frequency 12 colsample_bytree Column sampling ratio 1 (0.99 during parameter search) is_unbalance Turn on sample imbalance processing True max_bin Number of bins, number of buckets for eigenvalue bins 900 importance_type Feature importance type 'split' silent Whether to train silently True 5.0.3 Parameter training method The hyperparameter training of this project is not a one-time black box search, but a combination of enumeration method + cross-validation + GridSearchCV + multi-index sorting combined with business goals and overfitting control requirements to gradually converge to the final parameters.\nSpecifically, the following methods are used:\nEnumeration method (Manual Enumeration): Traverse parameters such as learning rate, iteration rounds, regularization coefficients, and minimum number of samples group by group according to the preset range. 
The advantage is that the search path is transparent and the ranges are easy to constrain using business experience; LightGBM native cross-validation lgb.cv: In the first round of tuning, 5-fold cross-validation is run on each learning_rate × n_estimators combination, and the combinations are compared by their mean CV AUC; GridSearchCV grid search: In the second round of tuning, a standard grid search over max_depth and num_leaves automatically traverses the parameter combinations and returns the one with the best 5-fold CV AUC; Multi-indicator manual sorting: In the third round of tuning, rather than looking at a single AUC, the AUC/KS of the training, test, and validation sets are computed together with stability indicators such as diff_ks, diff_auc, and cur_std, and the optimal parameter combination is then chosen according to the business goals. The core idea of this step-by-step approach is: first use cross-validation to settle the coarse parameters, then use grid search to converge the tree structure, and finally use multi-indicator ranking to control the risk of overfitting, balancing model performance, stability, and interpretability.\n5.0.4 Model training process Obtain the 60 best features │ ▼ ┌──────────────────────────────┐ │ Step 1: learning rate and │ │ round enumeration │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Step 2: max_depth and │ │ num_leaves enumeration │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Step 3: L1/L2 regularization │ │ and minimum sample count │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Optimal model parameters │ └──────────────────────────────┘ 5.0.5 Model Evaluation Criteria This model uses four core evaluation indicators:\nAUC (Area Under ROC Curve)\nThe area under the ROC curve reflects the model\u0026rsquo;s ability to rank positive and negative samples. 
It is defined as:\nAUC = ∫₀¹ TPR(FPR⁻¹(x)) dx\nEquivalent calculation method: among all pairs of one positive and one negative sample, AUC is the proportion of pairs in which the model scores the positive sample higher than the negative sample.\nAUC = Σᵢ Σⱼ 1[Pᵢ \u0026gt; Pⱼ] / (n₊ × n₋)\n(where i ranges over the positive samples and j over the negative samples, Pᵢ and Pⱼ are their model scores, n₊ is the number of positive samples, n₋ is the number of negative samples, and 1[·] is the indicator function)\nIndustry standard (AUC expressed on a 0 to 100 scale): AUC \u0026gt; 65 means the model is effective, and AUC \u0026gt; 70 means it is good.\nKS (Kolmogorov-Smirnov)\nKS measures the maximum difference between the score distributions of good and bad samples, defined as:\nKS = maxₜ | TPR(t) − FPR(t) |\nThat is, the maximum vertical distance between the TPR and FPR curves over all score thresholds t. The higher the KS, the stronger the discrimination.\nIndustry standard (KS expressed on a 0 to 100 scale): KS \u0026gt; 30 is effective, KS \u0026gt; 40 is good.\nPrecision\nThe proportion of orders predicted as positive by the model that actually belong to the positive class. Its meaning depends on which class is set as the positive class when modeling:\nPrecision = TP / (TP + FP)\nRecall\nThe proportion of actual positive samples that the model successfully predicts as positive:\nRecall = TP / (TP + FN)\n5.1 The first version of the model and overfitting problem Use the 46 features filtered for stability to build a baseline LightGBM model with sample imbalance handling turned on (is_unbalance=True). The evaluation results after direct training are as follows:\ntrain_auc: 99.98 | test_auc: 64.54 | vld_auc: 59.00 train_ks: 99.20 | test_ks: 24.54 | vld_ks: 15.90 Problem Diagnosis: The AUC of the training set is as high as 99.98, which is an almost perfect fit. However, the AUCs of the test set and validation set are only around 60. 
The model is in a serious overfitting state - there is a huge performance gap between the training set and the validation set, indicating that the model has learned too much noise on the training data and does not have the ability to generalize.\n5.2 Hyperparameter tuning method To address the over-fitting problem, the enumeration method + cross-validation strategy is used to systematically tune all key parameters in three steps.\n5.2.1 Step One: Learning Rate and Round Enumeration Through 5-fold cross-validation, 7 combinations of learning rates × 3 n_estimators are enumerated (21 groups in total), and the optimal CV AUC mean is used as the criterion:\nlr_dict = dict() for rate in [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1]: for num_boost_round in [100, 300, 600]: params = { \u0026#39;boosting_type\u0026#39;: \u0026#39;gbdt\u0026#39;, \u0026#39;objective\u0026#39;: \u0026#39;Binary\u0026#39;, \u0026#39;learning_rate\u0026#39;: rate, \u0026#39;is_unbalance\u0026#39;: True } data_train = lgb.Dataset(X_train_, y_train_, silent=True) cv_results = lgb.cv( params, data_train, num_boost_round=num_boost_round, early_stopping_rounds=int(num_boost_round / 2), nfold=5, verbose_eval=False, seed=3432141, metrics=\u0026#39;auc\u0026#39;, show_stdv=False ) lr_dict[f\u0026#34;{rate}_{num_boost_round}\u0026#34;] = ( cv_results[\u0026#39;auc-mean\u0026#39;][-1], len(cv_results[\u0026#39;auc-mean\u0026#39;]) ) Summary of CV results (sorted by mean AUC):\nlearning rate Number of rounds CV_AUC 0.01 600 0.7170 0.01 300 0.7170 0.01 100 0.7170 0.03 600 0.7030 0.06 600 0.6741 0.10 600 0.6529 \u0026hellip; \u0026hellip; \u0026hellip; Best Choice: learning_rate=0.01, n_estimators=300\nSelection criteria: Taking the 5-fold cross-validation CV AUC mean (auc-mean[-1]) as the ranking index, the lower the learning rate, the higher the AUC, and 0.01 is optimal; at the same learning rate, the AUC of 300 rounds and 600 rounds are the same (both reach 0.7170), and 300 rounds are selected for training 
efficiency.\n5.2.2 Step 2: max_depth and num_leaves enumeration With the learning rate and number of rounds fixed, GridSearchCV enumerates the combinations of max_depth (2 to 4) and num_leaves (3 to 7, step size 2), 9 groups in total:\nfrom sklearn.model_selection import GridSearchCV model_lgb = lgb.LGBMClassifier( boosting_type=\u0026#39;gbdt\u0026#39;, colsample_bytree=0.99, importance_type=\u0026#39;split\u0026#39;, learning_rate=best_learning_rate, n_estimators=best_n_estimators, n_jobs=n_jobs, subsample=0.99, is_unbalance=True ) params_test1 = { \u0026#39;max_depth\u0026#39;: range(2, 5, 1), \u0026#39;num_leaves\u0026#39;: range(3, 8, 2) } gsearch1 = GridSearchCV( estimator=model_lgb, param_grid=params_test1, scoring=\u0026#39;roc_auc\u0026#39;, cv=5, verbose=1 ) gsearch1.fit(X_train_, y_train_) best_max_depth = gsearch1.best_params_[\u0026#39;max_depth\u0026#39;] best_num_leaves = gsearch1.best_params_[\u0026#39;num_leaves\u0026#39;] best_score_ = gsearch1.best_score_ print(\u0026#39;best_learning_rate:{} best_n_estimators:{} best_num_leaves:{} best_max_depth:{}\u0026#39;.format( best_learning_rate, best_n_estimators, best_num_leaves, best_max_depth )) best_score_ Optimal results: max_depth=2, num_leaves=3, best_score_=0.7368 (mean 5-fold CV AUC of the best combination found by GridSearchCV)\nSelection criteria: with scoring='roc_auc' (ROC AUC as the scoring metric) and cv=5 (5-fold cross-validation), GridSearchCV takes the combination with the highest mean CV AUC among the 9 parameter combinations as the optimal solution. 
The extremely shallow tree depth (max_depth=2) is the core factor in controlling overfitting.\n5.2.3 Step 3: L1/L2 regularization and minimum sample number enumeration Finally, the regularization coefficients and the minimum number of samples per leaf node (with a dynamically computed search range) are enumerated to further control model complexity:\n# Dynamically compute the search range for min_child_samples minsample_list = [20, 200] minsample_list.extend(list(range( int(X_train.shape[0] * 0.1), int(X_train.shape[0] * 0.5), int(X_train.shape[0] * 0.05) ))) Enumerate reg_alpha (7 values) × reg_lambda (7 values) × minsample (N values) × colsample_bytree (fixed at 1 value) × subsample (fixed at 1 value), and calculate three sets of indicators (training set/test set/validation set) for each parameter combination:\ntrain_dict = dict() for alpha in [0.001, 0.01, 0.03, 0.08, 0.3, 0.5, 1]: for _lambda in [0.001, 0.01, 0.03, 0.08, 0.3, 0.5, 1]: for minsample in minsample_list: for colsample in [1]: for subsample in [1]: model_train1 = lgb.LGBMClassifier( boosting_type=\u0026#39;gbdt\u0026#39;, class_weight=None, colsample_bytree=colsample, importance_type=\u0026#39;split\u0026#39;, learning_rate=best_learning_rate, max_depth=best_max_depth, max_bin=900, min_child_samples=minsample, min_child_weight=0.001, min_split_gain=0, n_estimators=best_n_estimators, n_jobs=n_jobs, num_leaves=best_num_leaves, objective=None, reg_alpha=alpha, reg_lambda=_lambda, silent=True, subsample=subsample, subsample_for_bin=200000, subsample_freq=12, is_unbalance=True ) choose_model = model_train1 choose_model.fit(X_train_, y_train_) # Evaluate on all three sets: training / test / validation y_pred_prob_train = choose_model.predict_proba(X_train_)[:, 1] y_pred_prob_test = choose_model.predict_proba(X_test_)[:, 1] y_pred_prob_vld = choose_model.predict_proba(X_vld_)[:, 1] ks_train = cal_ks(yprob=y_pred_prob_train, ytrue=y_train_.values) ks_test = cal_ks(yprob=y_pred_prob_test, ytrue=y_test_.values) ks_vld = cal_ks(yprob=y_pred_prob_vld, ytrue=y_vld_.values) auc_train = 
roc_auc_score(y_score=y_pred_prob_train, y_true=y_train_.values) auc_test = roc_auc_score(y_score=y_pred_prob_test, y_true=y_test_.values) auc_vld = roc_auc_score(y_score=y_pred_prob_vld, y_true=y_vld_.values) diff_ks = abs(ks_train - ks_vld) diff_auc = abs(auc_train - auc_vld) cur_std = np.std(np.array([ks_test, ks_vld, auc_test, auc_vld])) key = f\u0026#39;{alpha}_{_lambda}_{minsample}_{colsample}_{subsample}\u0026#39; temp_dict = { \u0026#39;ks_train\u0026#39;: ks_train, \u0026#39;ks_test\u0026#39;: ks_test, \u0026#39;ks_vld\u0026#39;: ks_vld, \u0026#39;auc_train\u0026#39;: auc_train, \u0026#39;auc_test\u0026#39;: auc_test, \u0026#39;auc_vld\u0026#39;: auc_vld, \u0026#39;diff_ks\u0026#39;: diff_ks, \u0026#39;diff_auc\u0026#39;: diff_auc, \u0026#39;cur_std\u0026#39;: cur_std } train_dict[key] = temp_dict Finally, the optimal combination is determined through the following rules:\ntrain_df = pd.DataFrame().from_dict(train_dict, orient=\u0026#39;index\u0026#39;) train_df.sort_values(by=\u0026#39;ks_vld\u0026#39;, ascending=False).head(20) In the sorting results, ks_vld and auc_vld of the Top 5 combinations are basically the same, and all fall near min_child_samples=20, indicating that controlling the minimum number of leaf samples is the most critical constraint in this round of parameter tuning. Finally, the combination ranked first is selected:\nTraining result legend\nBest choice: reg_alpha=0.03, reg_lambda=0.01, min_child_samples=20\nThe key indicators corresponding to this combination are: auc_train=0.7952, auc_test=0.7187, auc_vld=0.7105, ks_train=0.4519, ks_test=0.3487, ks_vld=0.3947, diff_ks=0.0571. 
This shows that it not only has the highest KS on the validation set, but also performs in a relatively balanced way across the three datasets.\nSelection criteria: ks_vld (validation set KS) is the core sorting index; the Top 20 in descending order are taken and then evaluated together with diff_ks (difference between training set and validation set KS), diff_auc (difference between training set and validation set AUC), and cur_std (standard deviation of the test/validation KS and AUC values, used as an overall stability indicator). ks_vld directly reflects the model\u0026rsquo;s ranking and discrimination ability on unseen data. The difference and stability indicators are used to ensure that the selected parameters generalize well and are not optimal by chance.\nOptimal parameter combination:\nparameter optimal value source learning_rate 0.01 5.2.1 Optimal n_estimators 300 5.2.1 Optimal max_depth 2 5.2.2 GridSearchCV optimal num_leaves 3 5.2.2 GridSearchCV optimal reg_alpha (L1) 0.03 5.2.3 ks_vld optimal reg_lambda (L2) 0.01 5.2.3 ks_vld optimal min_child_samples 20 5.2.3 ks_vld optimal Summary of overall selection criteria for three-step tuning:\nstep Tuning goals Number of parameter combinations Judgment criteria 5.2.1 Learning rate + number of rounds 7×3 = 21 groups Highest mean 5-fold CV AUC 5.2.2 max_depth + num_leaves 3×3 = 9 groups Highest mean 5-fold CV AUC from GridSearchCV 5.2.3 Regularization + minimum number of samples 7×7×N groups Highest validation set KS (ks_vld), supplemented by a combined evaluation of diff_ks/diff_auc/cur_std 5.3 Final model effect After applying the optimal parameter combination, the model evaluation results are as follows:\nDataset AUC KS Precision Recall train 79.52 45.19 6.47% 59.02% test 71.87 34.87 5.71% 50.32% vld (validation set) 69.05 39.47 10.77% 57.65% Comparison before and after optimization:\nstage train_auc test_auc vld_auc train_ks test_ks vld_ks Precision Recall Before optimization (serious overfitting) 
99.98 64.54 59.00 99.20 24.54 15.90 — — After optimization (after parameter adjustment) 79.52 71.87 69.05 45.19 34.87 39.47 10.77% 57.65% After optimization:\nThe AUC gap between the training set and the validation set is reduced from 40.98 to 10.47, and the over-fitting problem is significantly improved; The verification set AUC increased from 59.00 to 69.05, and KS increased from 15.90 to 39.47, reaching the industry\u0026rsquo;s effective standard of AUC 65+ / KS 30+; The model has practical business application value. 5.4 Core means to solve overfitting means effect max_depth=2 Limit the maximum depth of a single tree to prevent over-fitting from being too deep. num_leaves=3 Limit the number of leaf nodes and use depth to control model complexity reg_alpha=0.03 (L1) L1 regularization, shrinking some feature weights to 0 reg_lambda=0.01 (L2) L2 regularization makes the feature weights overall smooth min_child_samples=20 Limit the minimum number of samples of leaf nodes to avoid a few samples dominating the split learning_rate=0.01 + n_estimators=300 A low learning rate combined with a sufficient number of rounds ensures that the model fully converges without overfitting. 
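The cal_ks helper called in 5.2.3 is not defined in this document. The following is a minimal, dependency-free sketch of what such a helper could look like, assuming it follows the KS definition in 5.0.5 (the maximum gap between TPR and FPR over all score thresholds). pairwise_auc is a hypothetical companion that applies the pairwise AUC definition from the same section. Both return values on a 0 to 1 scale, and the project's actual implementation may differ.

```python
def cal_ks(yprob, ytrue):
    # Assumed signature, chosen to match the keyword arguments used in 5.2.3.
    # Sort samples from highest to lowest score and sweep the threshold downward.
    pairs = sorted(zip(yprob, ytrue), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError('both classes are required to compute KS')
    tp = fp = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume every sample tied at this score before measuring the gap
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        ks = max(ks, abs(tp / n_pos - fp / n_neg))
    return ks


def pairwise_auc(yprob, ytrue):
    # Hypothetical helper: share of (positive, negative) pairs in which the
    # positive sample gets the higher score; ties count as half a win.
    pos = [p for p, y in zip(yprob, ytrue) if y == 1]
    neg = [p for p, y in zip(yprob, ytrue) if y == 0]
    if not pos or not neg:
        raise ValueError('both classes are required to compute AUC')
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == '__main__':
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    labels = [1, 1, 0, 1, 0, 0]
    print(round(cal_ks(yprob=scores, ytrue=labels), 4))        # 0.6667
    print(round(pairwise_auc(yprob=scores, ytrue=labels), 4))  # 0.8889
```

The pairwise loop is quadratic in the sample count and is shown only to mirror the formula; for large samples a vectorized routine such as sklearn.metrics.roc_curve is a more practical basis for both metrics.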
Chapter 6: Model Validation and Evaluation 6.1 Score distribution verification Convert the default probability output by the model into a standard credit score (0 to 1000 points) through logistic transformation, and test whether its distribution on the verification set is reasonable and whether it exhibits the expected monotonic distinguishing ability.\nValidation set score distribution:\nFractional section goods (repayment) bads (unpaid) Repayment ratio 414–465 2 183 1.08% 466–475 4 180 2.17% 476–482 5 228 2.15% 483–490 4 182 2.15% 491–498 8 206 3.74% 499–505 5 188 2.59% 506–515 5 208 2.35% 516–530 3 192 1.54% 531–551 19 186 9.27% 552–618 30 173 14.78% Hierarchical data normal distribution plot\nDistribution Characteristic Analysis:\nGood monotonicity: The higher the score, the more repayment samples (goods) and the fewer non-repayment samples (bads), and the score has strong monotonic discrimination ability; Continuous smooth distribution: The number of people in each segment is basically stable between 180 and 233, with no cliff-like decline or abnormal peaks, and the curve is smooth; Business explainable: The repayment probability of high-segmented users is significantly higher than that of low-segmented users, and the scores can be directly mapped to the collection priority. 6.2 Cross-time stability verification In order to ensure the consistency of the model\u0026rsquo;s performance in different time periods and avoid the degradation of the model\u0026rsquo;s generalization ability due to time factors, the validation set was split according to time windows and evaluated separately.\nConclusion: The overall fluctuations of AUC and KS indicators in each time window are within the acceptable range, and there is no obvious time attenuation phenomenon. 
The model performs consistently across time periods and is suitable for sustained use in the business.\nChapter 7: Business Effect and ROI Analysis 7.1 ROI improvement curve After the model is launched, the business side selects orders for collection from high to low model score. The relationship between the proportion of selected orders and the labor efficiency ratio (ROI) is as follows:\nScore Pct (score percentile) Proportion of selected orders (%) ROI 0 31.38 1.52 6 22.23 2.01 10 17.90 2.40 13 14.97 2.54 15 13.23 2.92 20 8.85 3.64 24 6.71 4.62 28 5.27 5.59 Note: Score Pct = 0 represents the lowest segment (covering the most orders), Score Pct = 28 represents the highest segment (covering the fewest orders).\nProportion of selected orders and labor efficiency ratio\n7.2 ROI compliance analysis When Score Pct = 6 (approximately 22% of orders are selected), the ROI reaches 2.01, meeting the ROI \u0026gt; 2.0 target set by the business; When Score Pct = 13 (approximately 15% of orders are selected), the ROI reaches 2.54: labor input is reduced by more than half while the ROI stays above 2.5; When Score Pct = 15 (approximately 13% of orders are selected), the ROI reaches 2.92, nearly twice the ROI of 1.5 in the reminder mode; When Score Pct = 20 (approximately 9% of orders are selected), the ROI reaches 3.64. With less than one-tenth of the manpower invested, the ROI more than triples. 
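The threshold reading in 7.2 can be reproduced directly from the table in 7.1. The sketch below assumes the table rows are available as (score percentile, proportion of orders selected in %, ROI) tuples; ROI_TABLE and lowest_cutoff_meeting_target are hypothetical names used for illustration, not part of the project code.

```python
# Rows copied from the table in 7.1: (score_pct, selected_share_pct, roi)
ROI_TABLE = [
    (0, 31.38, 1.52),
    (6, 22.23, 2.01),
    (10, 17.90, 2.40),
    (13, 14.97, 2.54),
    (15, 13.23, 2.92),
    (20, 8.85, 3.64),
    (24, 6.71, 4.62),
    (28, 5.27, 5.59),
]


def lowest_cutoff_meeting_target(rows, target_roi):
    # Scan cutoffs from lowest to highest and return the first row whose ROI
    # exceeds the target, i.e. the widest order coverage that still meets it.
    for row in sorted(rows):
        if row[2] > target_roi:
            return row
    return None  # no cutoff in the table reaches the target


if __name__ == '__main__':
    for target in (2.0, 2.5, 3.0):
        print(target, lowest_cutoff_meeting_target(ROI_TABLE, target))
```

Choosing the lowest qualifying cutoff keeps as many orders in the collection list as possible while still meeting the ROI target, which is the trade-off the analysis in 7.2 walks through.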
7.3 Summary of business value index reminder mode Model detection (about Top 20%) Human input Full reminder About 20% of orders ROI 1.0~1.5, mean 1.1 ~2.12 Target ROI \u0026gt; 2.0 Not up to standard Overachieved After the model was launched, the business side actually calculated that the ROI reached 2.12, exceeding the set target (\u0026gt; 2.0), significantly saving labor costs and maximizing collection income.\nChapter 8: Prediction Model Optimization - Auxiliary Model Online 8.1 Online background After the LGBM prediction model was launched online, it has completed the core goals set by the business department: sorting long-aging orders through the model, screening out high-value orders and investing in manual collection. While significantly reducing labor investment, it has achieved ROI improvement and verified that the path of \u0026ldquo;model detection + human focus\u0026rdquo; is established in business.\nHowever, in the actual implementation process, a natural limitation of the prediction model has gradually been exposed: when the main model is scoring, it can only use the features that have settled before the stage enters the market. This means that the information the model relies on is essentially a historical portrait of the user before entering the current stage, and cannot reflect the user\u0026rsquo;s new changes within the stage in a timely manner.\nFrom a modeling perspective, this approach is reasonable, because it ensures the strict separation of tag time points and feature time points, and avoids information crossing; but from a business perspective, this also brings about a practical problem: some orders are not detected during the initial screening of the main model, which does not mean that they will never have recovery value in the future. 
If users have new positive behaviors within a stage, these behaviors themselves mean that their repayment willingness, fund status, or processing priorities may have changed.\nTherefore, the focus of this round of optimization is not to redo the main model, but to add a layer of dynamic recognition capabilities to the main model.\n8.2 Boundaries of the main model The core capability of the main model is to sort orders in one go based on the historical behavior before the stage enters the market. This method is suitable for solving the problem of \u0026ldquo;which orders are more worthy of prioritizing manpower investment at a fixed point in time\u0026rdquo;, but it is difficult to answer another more dynamic question:\nDo some orders that were not worth recalling in the first place regain their recall value due to changes in user behavior as the stages progress?\nThis is exactly where the boundaries of the main model lie.\nSpecifically, there are two main limitations:\nFirst, features have natural hysteresis.\nThe model uses pre-order features, and long-aging orders themselves are already a scenario with a strong time lag. After entering this stage, the user\u0026rsquo;s behavior may have changed significantly, such as logging in to the APP again, checking bills, repaying other debt-sharing platform orders, etc. These signals do not exist when the main model scores, and therefore cannot be used by the main model.\nSecond, the main model is a one-time filtering logic and does not have the ability to dynamically update.\nThe main model is more like a static sorter: at a fixed point in time, orders are divided into two categories: \u0026ldquo;priority reminder\u0026rdquo; and \u0026ldquo;suspension reminder\u0026rdquo;. But real business is not static, and users’ behavior will continue to change within stages. 
If you still insist on only looking at the initial scoring results, you may miss some \u0026ldquo;subsequent strengthening\u0026rdquo; orders.\nIn other words, the main model solves the problem of \u0026ldquo;who urges first\u0026rdquo;, but does not completely solve the problem of \u0026ldquo;who deserves to be urged later\u0026rdquo;.\n8.3 Business Insight: Behavior within the stage may bring new recovery value After further communication with the business department around this issue, we came to a very critical empirical judgment:\nSome of the user\u0026rsquo;s behaviors during the stage may themselves constitute a signal that \u0026ldquo;it is worth urging again\u0026rdquo;.\nThese behaviors may not necessarily be reflected in the static scoring of the main model, but they have obvious reference value in actual collection experience. Typical include:\nUser logs in to the APP again: It means that the user is still actively contacting the platform, at least not completely lost contact, and may have behavioral motivations such as checking bills, paying attention to credit limits, preparing to deal with arrears, etc.; User repays other mutual debt orders: It shows that the user has recent capital flow, and it also shows that he is not completely incapable of repaying, but is prioritizing between different debts; Active behaviors of users in other stages: such as revisiting key pages, re-triggering certain actions, etc., may also mean that their status has changed. The nature of these signals is very similar to the \u0026ldquo;timing\u0026rdquo; logic in quantitative investment. 
The main model is more like an \u0026ldquo;initial stock selection model\u0026rdquo;, responsible for screening out a batch of high-potential orders at the initial time point; while the intra-stage behavioral signals are more like \u0026ldquo;timing signals\u0026rdquo;, used to determine whether certain orders that did not originally enter the candidate pool have reappeared in the subsequent time points with intervention value.\nTherefore, the one-time scoring of the main model alone is not enough. A supplementary mechanism is also needed to conduct secondary identification of orders with new signals within the stage. This is the starting point for the launch of the auxiliary model.\n8.4 Ideas for launching auxiliary models Based on the above background, this round of prediction model optimization is not to overthrow the main model, but to add a layer of \u0026ldquo;order recovery strategy\u0026rdquo; based on the main model\u0026rsquo;s completion of business goals.\nThe core idea of ​​this strategy is:\nThe main model continues to assume the responsibility of the main screen The main model still completes the first round of sorting at the starting point of the stage, and selects the high-value orders that deserve priority for manual collection.\nContinuous observation of orders that have not been checked out by the main model For those orders that do not enter the collection list for the first time, they are not directly regarded as permanently low value, but their behavior changes are continued to be observed within the stage.\nTrigger retrieval based on new behavioral signals Once an order shows certain key behaviors within a stage, such as logging in, repaying other debt orders, becoming active again, etc., it will be identified as an object whose \u0026ldquo;status may change\u0026rdquo; and enter the scope of secondary evaluation.\nDynamic supplementary recognition is undertaken by the auxiliary model Let the main model be responsible for the 
\u0026ldquo;initial checkout\u0026rdquo; and the auxiliary model be responsible for the \u0026ldquo;midway recovery\u0026rdquo;, so that together they form a collection screening system closer to the real business rhythm.\nIn other words, the goal of the auxiliary model is not to replace the main model but to cover its blind spots: orders whose signals were not strong enough at the start but strengthened as the stage progressed are brought back into the collection scope.\n8.5 Final screening results of auxiliary models After multiple rounds of business discussion and signal verification, the auxiliary model retained the three signals that were most effective, easiest to explain, and most practical to implement:\nUser logs in during the stage A login shows that the user is still actively in touch with the platform and has not gone completely out of contact. For long-aging orders, this behavior often means the user has started paying attention to bills, credit limits, or historical orders again, which has value for subsequent contact.\nThe user repaid other orders This signal indicates that the user is not completely without funds in the near term but is prioritizing among different debts. Since the user has started settling other orders, the current order may also be recoverable.\nThe order itself is a partial repayment order Partial repayment shows that the user does not completely refuse to fulfill the contract and has shown some willingness to repay. 
For this type of order, further investment of collection resources is usually more valuable than for an order with no repayment action at all.\nWhat these three features have in common is that they are not static profiles captured before the order enters the stage, but dynamic behavioral signals that appear during the stage, so they are well suited as the core retrieval conditions of the auxiliary model.\n8.6 Auxiliary model online effect After the auxiliary model was launched, it supplemented the main model\u0026rsquo;s original selection results by identifying additional borderline high-value orders. The final effect is as follows:\nAbout 10% more orders selected The repayment rate for selected orders increased from 10% to 15.8% Overall ROI improved to 2.5 This shows that although the auxiliary model does not replace the main model, it improves the quality of the selection results by additionally identifying dynamic signals within the stage, and further raises the output efficiency of collection resources.\n8.7 Business value If the value of the main model lies in \u0026ldquo;improving the efficiency of first-round detection,\u0026rdquo; then the value of the auxiliary model lies in \u0026ldquo;reducing the opportunity loss caused by missed detections.\u0026rdquo;\nIt brings at least three benefits:\nFirst, it improves the coverage of high-value orders.\nSome orders that were not originally selected regain recovery value because of behavior changes during the stage. The retrieval strategy expands effective coverage and reduces the probability of missing potentially recoverable orders.\nSecond, it brings the collection strategy closer to the real business process.\nCollection is not a one-time action, but an ongoing process. 
Once the auxiliary model incorporates in-stage behavior into the decision logic, the whole strategy moves from static to dynamic decision-making, which better matches business reality.\nThird, it lays the foundation for more refined strategies later.\nOnce the retrieval logic is verified to be effective, it can evolve into a more systematic dynamic strategy system, for example by assigning different retrieval priorities to different signal strengths, or by productizing retrieval signals and turning them into rules that form standard operating actions.\n8.8 Summary The LGBM prediction model has shown that screening high-value long-aging orders with a model can significantly improve collection ROI, so the business path is established.\nThe significance of this round of prediction model optimization is to acknowledge and make use of one fact: the value of an order is not static. New user behavior during the stage may give an order that was not selected renewed recovery value.\nTherefore, the essence of this launch is not to retrain a completely different main model, but to add a layer of dynamic recovery logic, driven by in-stage behavioral signals, on top of the original static detection capability, so that the collection strategy is upgraded from \u0026ldquo;only looking at the starting point\u0026rdquo; to \u0026ldquo;continuous observation and dynamic correction\u0026rdquo;.\nThis is also a key step from \u0026ldquo;single point prediction\u0026rdquo; to \u0026ldquo;prediction + dynamic supplementary identification\u0026rdquo;.\nChapter 9: Methodological Takeaways and Reflection 9.1 Data level Feature lag problems and solutions\nThe project did not go smoothly from the start. The validation-set AUC of the first version of the model was only 50~55, far below the industry\u0026rsquo;s passing line of 65. 
The model had almost no real ranking ability.\nInvestigation showed that the root cause was feature lag: for long-aging users, a long time has passed since the initial application, so the user information stored in the feature library is badly out of date. Early static features cannot effectively distinguish repaying users from non-repaying users.\nThe first-stage solution was to introduce more recent behavioral features (login behavior in the past 3 or 7 days, recent repayment records, etc.) into the main model and to use finer-grained time windows, so that recent information compensates for the lag in historical information. The final validation-set AUC rose from 55 to 69, reaching the industry\u0026rsquo;s effective standard.\nBut this project also confirmed that even after the main model was optimized, the long-aging scenario inherently has the problem of \u0026ldquo;characteristics continue to change after the payment is received\u0026rdquo;. This is exactly why an auxiliary model layer is needed on top of the main model, to dynamically identify new behavioral signals during the stage and retrieve the corresponding orders.\n9.2 Algorithm level Overfitting prevention and control experience\nThe train_auc of the first version of the LGBM prediction model was as high as 99.98, but the test_auc was only 59.55, a clear case of severe overfitting. It was eventually resolved by the following means:\nTree depth control: max_depth=2 and num_leaves=3; extremely shallow trees are the core of controlling overfitting; Regularization: reg_alpha=0.03 (L1) and reg_lambda=0.01 (L2) are used together to penalize large leaf weights; Minimum sample limit: min_child_samples=20, so that a handful of samples cannot dominate a split; Low learning rate: learning_rate=0.01 with 300 rounds, to limit overfitting while still training fully. 
Core lesson: In sparse scenarios, controlling model complexity matters more than the choice of model; stabilizing the main model first and then layering auxiliary strategies on top is far more effective than loading every problem onto a single model from the start.\n9.3 Business level Models serve the business, not to show off skills\nThe most important business takeaway from this project is that the value of a model lies not in pursuing the highest AUC, but in achieving business goals at the lowest labor cost.\nThe ROI of the reminder mode is only 1.5. The core reason is not that the model is not accurate enough, but that issuing reminders on all orders wastes manpower. This project first uses the LGBM prediction model to screen out the Top 20% of high-value orders, which raises ROI to 2.12; the auxiliary model then dynamically recovers orders that strengthen again during the stage, which finally raises ROI to above 2.5. Breaking down the business goal matters more than improving model metrics.\nMore importantly, this project illustrates a very practical methodology:\nThe main model solves the first-round detection problem The auxiliary model solves the dynamic supplementation problem In other words, a business strategy does not have to be \u0026ldquo;one model solves all problems.\u0026rdquo; In many cases, it is more effective to separate static prediction from dynamic identification, design them separately, and then combine them in deployment.\n9.4 Directions for improvement Online learning: The current main model is still trained offline, and incremental updates can be explored to adapt to user behavior drift; Productization of auxiliary model rules: Consolidate signals such as login within the stage, repayment of other orders, and partial repayment into a standardized recovery engine to reduce manual maintenance costs; Cross-product line migration: This combined framework of \u0026ldquo;main model 
detection + auxiliary model recovery\u0026rdquo; can be reused in post-loan risk control scenarios of other credit product lines; More fine-grained stratification: Upgrade from two categories to multiple categories, and implement differentiated collection strategies for high/medium/low-risk users. ","date":"2024-12-30T21:30:00+08:00","image":"/uploads/cover-agent-collaboration-dashboard.png","permalink":"/en/p/lightgbm-based-post-loan-collection-model-and-trigger-based-retrieval-strategy/","title":"LightGBM-based post-loan collection model and Trigger-Based Retrieval Strategy"},{"content":"\nThis article documents the complete problem-solving process for the 2021 Greater Bay Area Financial Mathematics Modeling Competition (Problem B). The core objective is to build an analytical framework that explains stock price fluctuations and assists investment decisions, centered on Bay Area index constituent stocks and incorporating securities research reports, market data, and external environmental information. Around this goal, the paper sequentially designs a Binary Classification Logistic Comprehensive Factor Model, an Event Analysis Model, and a News Sentiment Factor Correction Model, respectively addressing the three problem types: \u0026ldquo;how to characterize the relationship between factors and future returns,\u0026rdquo; \u0026ldquo;how to analyze the impact of external event shocks,\u0026rdquo; and \u0026ldquo;how to incorporate sentiment information into a stock selection model.\u0026rdquo;\n1. Problem Restatement 1.1 Background Securities research reports (sell-side research reports) are analytical documents produced by securities company researchers on the value of securities and related products or factors that influence their market prices. 
A complete securities research report contains rich information including company operational data, financial projections, valuation results, investment ratings, and risk disclosures, serving as an important reference for investment decision-making.\nIn recent years, traditional factor models based on financial factors (P/E ratio, market capitalization, etc.) and long-horizon price-volume factors (monthly reversal, monthly trading volume, etc.) have achieved relatively stable excess returns in the A-share market. A factor-based quantitative stock selection model constructed from securities research report feature indicators has become a frontier research topic in the quantitative investment field.\n1.2 Problem Description This research focuses on 10 stocks from the Bay Area index and completes the following four tasks by integrating securities research report information with external market environment data:\nProblem 1: Select securities research reports for 10 Bay Area stocks and extract feature indicators from the reports. Problem 2: Model and analyze the impact of securities research report feature indicators on stock trends and propose a clear investment strategy. Problem 3: Research the impact of events such as emergencies, public sentiment, and natural disasters on the 10 Bay Area index stocks. Problem 4: Integrate securities research reports with external environmental factors, revise the investment strategy from Problem 2, and propose a new investment strategy. 2. Preliminaries: Core Algorithm Principles This chapter introduces the core algorithms and key terminology that will be directly referenced in subsequent chapters.\n2.1 Multi-Class Logistic Regression Reason for Introduction: In factor-based stock selection, the future movement of a stock is not a simple linear relationship — the three states of price surge, price plunge, and sideways consolidation involve complex non-linear transitions. 
Traditional linear regression cannot stably characterize such multi-class problems, whereas multi-class logistic regression can provide probability predictions for each class in a multi-class setting and serves as the foundational algorithm for constructing a comprehensive factor.\nBasic Model Principle: Multi-class logistic regression performs a linear combination of independent variables and corresponding parameters, then uses a probabilistic model to compute the probability of each class in the dependent variable. Its linear prediction function is:\n$$f_k(x) = \\beta_{k0} + \\beta_{k1}x_1 + \\cdots + \\beta_{kp}x_p$$\nwhere $\\beta_{kj}$ is the regression coefficient, representing the influence of the $j$-th feature on the $k$-th outcome.\nAfter log-odds transformation, the probability of each class is:\n$$P(Y=k|X) = \\frac{e^{f_k(x)}}{\\sum_{j=1}^{K} e^{f_j(x)}}$$\nBinary Classification Special Case: For $K$ classes, one class is selected as the reference class, and the remaining $K-1$ classes are paired with the reference class to build binary logistic regressions. If class 1 is chosen as the reference, the binary classification of class $l$ ($l \\neq 1$) versus class 1 can be expressed as:\n$$\\ln \\frac{P(Y=l|X)}{P(Y=1|X)} = \\beta_{l0} + \\beta_{l1}x_1 + \\cdots + \\beta_{lp}x_p$$\nApplication in This Study: In the factor grouping strategy, stocks are classified into three categories based on future price movements: plunge ($y=-1$), surge ($y=1$), and no extreme movement ($y=0$). Using 24 factor values as explanatory variables, the probability of each category is predicted for every stock. 
The comprehensive factor value is:\n$$F(x) = \\frac{1}{1 + \\exp(-\\beta_1 \\cdot x)} - \\frac{1}{1 + \\exp(\\beta_{-1} \\cdot x)}$$\n2.2 Foundations of Factor Models Relationship Between Factors and Feature Indicators: In quantitative investment, \u0026ldquo;factor\u0026rdquo; and \u0026ldquo;feature indicator\u0026rdquo; are essentially two expressions of the same concept. A factor is a variable used to explain differences in stock returns, enabling stocks to be ranked and grouped based on factor values.\nCore Idea of Factor Models: Factor models assume that stock returns are driven by several common factors; stocks with similar factor exposures should exhibit similar return performance. By constructing effective factors, expected stock return ranking and grouped trading can be achieved.\nConstructing a Comprehensive Factor: A single factor often fails to comprehensively characterize a stock\u0026rsquo;s return profile, so multiple effective factors need to be weighted and combined to form a comprehensive factor. In this study, the comprehensive factor is derived from 24 securities research report feature indicators compressed through multi-class logistic regression, overlaid with external sentiment factors, forming a comprehensive scoring system.\n2.3 Event Study Method Reason for Introduction: The event study method is an empirical research technique first applied in the financial field. It uses financial market data to quantitatively analyze the impact of specific events on a company\u0026rsquo;s value. The method is theoretically rigorous, logically clear, and computationally straightforward, making it widely used for studying the impact of external shocks on stock prices.\nBasic Model Principle: The event study method compares the actual return of a security during the event window with the return it would have been expected to earn had the event not occurred; the difference is the abnormal return (AR). 
This difference is used to judge the direction and magnitude of the event\u0026rsquo;s impact on the stock price.\nCore Steps:\nEvent Selection: Identify the event to be studied and its occurrence time point. Window Partition: Divide the time interval into an estimation window (120–35 days before the event) and an event window (5 days before to 40 days after the event). Normal Return Estimation: Using the market model (CAPM): $R_{it} = \\alpha_i + \\beta_i R_{mt} + \\varepsilon_{it}$ Abnormal Return Calculation: $AR_{it} = R_{it} - \\hat{R}_{it}$ Cumulative Abnormal Return: $CAR_{cum} = \\sum_{t} \\bar{A}_t$ Significance Testing: If the P-value is less than 0.05, the event is considered to have a significant impact on stock price fluctuations. Application in This Study: Using the 2018 Changsheng Bio-vaccine fraud event as a case study, the 10 stocks are grouped by Shenwan industry classification, and the cumulative abnormal returns of each industry portfolio within the event window are analyzed.\n3. Problem Analysis 3.1 Analysis of Problem 1 Problem 1 requires selecting securities research reports for 10 Bay Area stocks and extracting feature indicators from them. Python web scraping is used to collect individual stock research reports from the East Money website over the past three years. Report content is read and feature indicator frequencies are counted, with specific values and definitions obtained from the Shenzhen Tianruan Technology database, serving as an important data source for the Problem 2 analysis.\n3.2 Analysis of Problem 2 Problem 2 requires analyzing the impact of research report feature indicators on stock trends and proposing an investment strategy. Due to the non-linear relationship between feature indicators and stock trends, linear regression cannot provide stable predictions. Therefore, machine learning methods are adopted. First, candidate factors are tested for validity using a correlation matrix to assess inter-indicator correlations. 
Then a multi-class logistic regression model is established to build a regression relationship between factor values and next-period returns, constructing a comprehensive factor verified through grouped testing, ultimately forming a clear investment strategy.\n3.3 Analysis of Problem 3 Problem 3 requires building a model to investigate the impact of external factors on the selected 10 stocks. This study proceeds from two angles:\nMacro Analysis Angle: Using the event study method, feature indicators are introduced to analyze the impact of external factors on the 10 stocks. Behavioral Finance Angle: From the perspective of behavioral finance theory, news sentiment factors are mined and a factor model is constructed. The two angles complement each other in jointly explaining the mechanism by which external factors affect individual stocks.\n3.4 Analysis of Problem 4 The news sentiment factor from Problem 3 is integrated into the factor model from Problem 2 to reconstruct a comprehensive framework, and the 10 stocks are backtested. By comparing the grouping effects and return performance before and after incorporating the sentiment factor, the superiority of the new model is verified, and an improved investment strategy is proposed.\n4. Model Assumptions The correlation relationships among research report feature indicators are stable. To avoid using future data, the correlation matrix uses research report data from 2013 (the initial period). The relationship between research report data and returns is non-linear. A linear relationship cannot stably predict the relationship between research report data and returns, necessitating the construction of non-linear relationships, for which machine learning provides technical support. 5. Notation Symbol Meaning $R_i$ Daily return of each sample stock $R_m$ Market return $AR_i$ Abnormal return of each sample stock $CAR$ Portfolio average abnormal return $CAR_{cum}$ Portfolio cumulative average abnormal return 6. 
Model Construction and Solution 6.1 Model for Problem 1 — Construction and Solution 6.1.1 Selection of Research Subjects This paper selected 10 stocks from the Bay Area index pool as research subjects, covering multiple Shenwan Level-1 industries including electronics, pharmaceuticals, real estate, and building materials:\nStock Name Stock Code Stock Name Stock Code EVE Energy SZ300014 Huafa Shares SH600325 Luxshare Precision SZ002475 Tower Group SZ002233 Nationstar Optics SZ002449 MiYon SZ002303 Sunlord Electronics SZ002138 Livzon Group SZ000513 Everwin Precision SZ300115 China National SZ000028 6.1.2 Data Acquisition: Python Web Scraping for Research Reports The core of Problem 1 is to systematically collect research report materials for the 10 stocks. The East Money Data Center\u0026rsquo;s individual stock research report page is used. The approach involves first locating the research report URL for each stock, then using a Python crawler to scrape historical research report PDFs for each stock over the past three years. A total of 294 research reports are collected, on which subsequent text cleaning, word segmentation, and feature extraction are based. The supplementary files text.xlsx and 解释变量_dwq.xlsx mentioned in the original paper correspond to the research report text organization results and the explanatory variable data used for subsequent factor modeling.\nThe entire data acquisition process is divided into three steps: Step 1 scrapes research report PDF links, Step 2 extracts PDF text content, and Step 3 performs word segmentation and feature frequency statistics on the research report text. 
The code function below scrapes research report PDF download links from the East Money research report list page, extracts them using regular expressions, and saves them to an Excel file (links.xlsx).\nimport urllib import urllib.request import re import pandas as pd links = [] stocks = [\u0026#39;300014\u0026#39;, \u0026#39;002475\u0026#39;, \u0026#39;002449\u0026#39;, \u0026#39;002138\u0026#39;, \u0026#39;300115\u0026#39;, \u0026#39;600325\u0026#39;, \u0026#39;002233\u0026#39;, \u0026#39;002303\u0026#39;, \u0026#39;000513\u0026#39;, \u0026#39;000028\u0026#39;] for i in range(len(stocks)): url = \u0026#34;http://data.eastmoney.com/report/\u0026#34; + stocks[i] + \u0026#34;.html\u0026#34; data = urllib.request.urlopen(url).read().decode(\u0026#39;UTF-8\u0026#39;) linkre = re.compile(r\u0026#39;\\w*AP20\\w*\u0026#39;) list1 = linkre.findall(data) for q in list1: pdf_url = \u0026#39;https://pdf.dfcfw.com/pdf/H3_\u0026#39; + q + \u0026#39;_1.pdf\u0026#39; links.append(pdf_url) pd.DataFrame(links).to_excel(\u0026#34;links.xlsx\u0026#34;) The following code reads the research report PDF download links, opens each PDF file in sequence, extracts text content using the pdfminer library and concatenates it into a complete string, stores it in a list, and exports it as an Excel file (text.xlsx).\nfrom pdfminer.pdfparser import PDFParser, PDFDocument from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.pdfdevice import PDFDevice from pdfminer.converter import PDFPageAggregator from pdfminer.layout import LAParams import pandas as pd links_df = pd.read_excel(\u0026#34;links.xlsx\u0026#34;) text_list = [] for url in links_df[0]: try: fp = urllib.request.urlopen(url) parser = PDFParser(fp) doc = PDFDocument(parser) parser.set_document(doc) doc.set_parser(parser) doc.initialize() resource = PDFResourceManager() device = PDFPageAggregator(resource, laparams=LAParams()) interpreter = PDFPageInterpreter(resource, device) texts = [] for page in 
doc.get_pages(): interpreter.process_page(page) layout = device.get_result() for obj in layout: if hasattr(obj, \u0026#39;get_text\u0026#39;): texts.append(obj.get_text()) text_str = \u0026#39;\u0026#39;.join(texts) text_list.append(text_str) except: text_list.append(\u0026#39;\u0026#39;) pd.DataFrame(text_list).to_excel(\u0026#34;text.xlsx\u0026#34;) The following code reads research report text content, uses the jieba word segmentation tool to count the occurrences of 35 pre-set feature indicators in each research report, and obtains the feature frequency statistics for each indicator.\nimport jieba import pandas as pd import collections with open(\u0026#34;研报内容.txt\u0026#34;, encoding=\u0026#39;utf-8\u0026#39;) as f: data = f.read() jieba.load_userdict(\u0026#34;词典.txt\u0026#34;) seg_list = jieba.lcut(data) counter = collections.Counter(seg_list) # Count frequencies of 35 feature indicators keywords = [\u0026#39;市场\u0026#39;, \u0026#39;系统性风险\u0026#39;, \u0026#39;市盈率\u0026#39;, \u0026#39;市净率\u0026#39;, \u0026#39;市销率\u0026#39;, \u0026#39;PEG\u0026#39;, \u0026#39;净利润增长率\u0026#39;, \u0026#39;净资产增长率\u0026#39;, \u0026#39;毛利率\u0026#39;, \u0026#39;净资产收益率\u0026#39;, \u0026#39;销售净利率\u0026#39;, \u0026#39;涨跌幅\u0026#39;, \u0026#39;换手率\u0026#39;, \u0026#39;流通市值\u0026#39;, \u0026#39;总市值\u0026#39;, \u0026#39;流通股本\u0026#39;, \u0026#39;总股本\u0026#39;, \u0026#39;盈利预测调整\u0026#39;] result = [counter[k] for k in keywords] pd.DataFrame(result).to_excel(\u0026#34;特征频率.xlsx\u0026#34;) 6.1.3 Frequency Statistics of Research Report Feature Indicators After obtaining all research report texts, the next step is to count the occurrence frequencies of the example indicators in the research reports. 
The paper categorizes these feature indicators into categories such as overall market, valuation factors, growth factors, profitability factors, momentum reversal factors, trading activity factors, size factors, price volatility factors, and analyst forecast factors, with frequencies individually tallied for each of the 35 example indicators.\nFrom the results, the occurrence frequencies of different indicators in the research reports vary widely. Some indicators are high-frequency words in research reports, such as market factors, gross margin, net sales margin, and total share capital; while some indicators barely appear during the sample period, such as price-to-cash ratio, free-float market cap, forecast net profit growth rate, and earnings expectation adjustments. This frequency distribution itself helps determine which indicators are more worth including in subsequent modeling.\nThe occurrence frequencies of the 35 example indicators across 294 research reports are as follows:\nFactor Name Frequency Factor Name Frequency Market Factor 2391 Asset Return Rate 50 Systematic Risk 5 Operating Expense Ratio 0 P/E Ratio 209 Financial Expense Ratio 0 P/B Ratio 150 EBIT to Operating Revenue Ratio 0 P/S Ratio 27 Prior Price Change Magnitude 425 Price-to-Cash Ratio 0 Prior Turnover Rate 70 Enterprise Value Multiple 0 Volume Ratio 0 PEG 58 Free-Float Market Cap 165 Revenue Growth Rate 128 Total Market Cap 322 Operating Profit Growth Rate 59 Free-Float Shares 0 Net Profit Growth Rate 171 Total Shares 389 EPS Growth Rate 0 Prior Price Volatility 0 Net Asset Growth Rate 44 Daily Return Std Dev 0 Shareholders\u0026rsquo; Equity Growth Rate 3 Forecast Net Profit Growth 0 Operating Cash Flow Growth Rate 0 Forecast Main Business Growth 0 Net Sales Margin 671 Earnings Expectation Adjustment 0 Gross Margin 1662 6.1.4 Definition and Screening of Feature Indicators After frequency statistics, the paper does not directly throw all indicators into modeling. 
Instead, it first conducts further analysis on indicators with \u0026ldquo;frequency greater than 0.\u0026rdquo; On one hand, the specific definitions of these indicators are read from the Shenzhen Tianruan Technology database; on the other hand, combined with the common interpretation framework in finance, these indicators are placed back into their respective factor categories for understanding.\nFrom the specific definitions, these indicators can be summarized into the following categories:\nMarket Factors\nMarket Factor: The total share-capital weighted price change of CSI 300 index constituent stocks from the end of month t-1 to the end of month t. Systematic Risk: Obtained by regressing the log-return series of individual stocks versus the CSI 300 index from the end of month t-12 to month t; the slope is the systematic risk value. Valuation Factors\nP/E Ratio: Total market cap at the end of month t divided by net profit over the last 12 months. To control the indicator\u0026rsquo;s value range, the reciprocal is taken. P/B Ratio: The price-to-book ratio at the end of month t (last 12 months, by disclosure date). The reciprocal is taken to control the value range. P/S Ratio: The individual stock P/S ratio at the end of month t (last 12 months, by disclosure date). The reciprocal is taken to control the value range. PEG: The individual stock PEG at the end of month t (last 12 months). Growth Factors\nRevenue Growth Rate: Taken from the latest financial report revenue growth rate at the end of month t. Operating Profit Growth Rate: Taken from the latest financial report operating profit growth rate at the end of month t. Net Profit Growth Rate: Taken from the latest financial report net profit growth rate at the end of month t. Net Asset Growth Rate: Taken from the latest financial report net asset growth rate at the end of month t. 
Shareholders\u0026rsquo; Equity Growth Rate: Taken from the latest financial report shareholders\u0026rsquo; equity growth rate at the end of month t. Profitability Factors\nNet Sales Margin: Taken from the latest financial report net sales margin at the end of month t. Gross Margin: Taken from the latest financial report gross margin at the end of month t. Return on Equity (ROE): Taken from the latest financial report return on equity at the end of month t. Return on Assets (ROA): Taken from the total assets return rate at the end of month t, calculated as: ROA(%) = [Net Profit + Financial Expenses × (1 − Tax Rate)] / Total Assets × 100. Trading Factors\nPrior Price Change (1 Month): The price change magnitude from the end of month t to one month prior. Prior Price Change (3 Months): The price change magnitude from the end of month t to three months prior. Prior Price Change (6 Months): The price change magnitude from the end of month t to six months prior. Prior Turnover Rate (1 Month): The frequency of stock turnover in the market during the one month prior to the end of month t. Prior Turnover Rate (3 Months): The frequency of stock turnover in the market during the three months prior to the end of month t. Size Factors\nFree-Float Market Cap: The free-float market cap of the individual stock at the end of month t. Total Market Cap: The total market cap of the individual stock at the end of month t. Free-Float Shares: The free-float shares of the individual stock at the end of month t. Total Shares: The total shares of the individual stock at the end of month t. After completing the definition review, the specific time-series values of these factors are further read from the Shenzhen Tianruan Technology database. 
On one hand, this transforms feature indicators from mere \u0026ldquo;keywords\u0026rdquo; in research reports into structured variables that can be used for subsequent regression modeling; on the other hand, it provides complete explanatory variable data sources for the factor model in Problem 2.\nThe final screened valid factors number 24 per the paper\u0026rsquo;s definition and are used as the core input for the Problem 2 factor model. The counting method here splits prior price change magnitude into 3 indicators (1-month, 3-month, 6-month) and prior turnover rate into 2 indicators (1-month, 3-month).\n6.2 Model for Problem 2 — Construction and Solution 6.2.1 Candidate Factors and Data Preprocessing Before entering Problem 2, the method that will be used repeatedly later is first explained. The multi-class logistic regression method is adopted. It can be understood as an extension of traditional logistic regression, used to predict the probability of a sample falling into different classes.\nFrom here, the paper officially shifts to the \u0026ldquo;factor construction\u0026rdquo; perspective. In other words, the research report feature indicators extracted in Problem 1 are uniformly treated as \u0026ldquo;factors\u0026rdquo; in this section. The next task is to first check whether these factors are effective, then determine whether there are overly strong correlations among them, and to merge or filter redundant factors.\nMonthly-frequency data for 10 stocks from 2013 to 2021 is used. The core model inputs include two parts: (1) next-period (end of month t+1) returns, and (2) current-period (end of month t) 24 factor values. For variables with excessively large values, reciprocal or logarithmic transformations are applied to keep them within a relatively stable numerical range.\n6.2.2 Factor Screening and Correlation Testing When screening factors, a correlation matrix is constructed first. 
There is one prerequisite assumption: the correlation relationships among the same group of feature indicators are stable over the sample period. To avoid the \u0026ldquo;look-ahead bias\u0026rdquo; problem, the correlation matrix uses data from January 2013, i.e., the initial backtesting period.\nThe correlation coefficient formula is:\n$$r(X,Y) = \\frac{\\mathrm{Cov}(X,Y)}{\\sqrt{\\mathrm{Var}[X]\\mathrm{Var}[Y]}}$$\nThe judgment criterion is straightforward: if the absolute value of the correlation coefficient between any two of the 24 factors exceeds 0.8, it indicates a potentially strong multicollinearity that requires further merging. The final result is that the absolute values of correlation coefficients between all factor pairs are less than 0.8; therefore, all factors are retained and continued into the subsequent machine learning model.\n6.2.3 Factor Model Framework Construction To truly establish the relationship between \u0026ldquo;current-period factors\u0026rdquo; and \u0026ldquo;next-period returns,\u0026rdquo; the approach from the paper Probability of Price Crashes, Rational Speculative Bubbles, and the Cross Section of Stock Returns is referenced to construct the logistic model.\nIn the context of Problem 2 modeling, the probabilities of the plunge and surge outcomes can be written as:\n$$\\Pr_t\\left(Y_{i,t,t+12}=-1\\right)= \\frac{\\exp\\left(\\alpha_{-1}+\\beta_{-1}X_{i,t}\\right)} {1+\\exp\\left(\\alpha_{-1}+\\beta_{-1}X_{i,t}\\right)+\\exp\\left(\\alpha_{1}+\\beta_{1}X_{i,t}\\right)}$$\n$$\\Pr_t\\left(Y_{i,t,t+12}=1\\right)= \\frac{\\exp\\left(\\alpha_{1}+\\beta_{1}X_{i,t}\\right)} {1+\\exp\\left(\\alpha_{-1}+\\beta_{-1}X_{i,t}\\right)+\\exp\\left(\\alpha_{1}+\\beta_{1}X_{i,t}\\right)}$$\nThe approach in the referenced paper classifies possible future stock states into three categories: plunge, surge, and non-extreme movements, denoted as -1, 1, and 0 respectively; uses current-period factors as explanatory variables; applies a Logit model 
to address this multi-class problem; and predicts the future plunge probability. This paper adopts the same approach to construct its own factor Logit model.\nRegarding the definition of the dependent variable, very clear rules are also provided: if the decline over the future interval exceeds 50%, it is recorded as a plunge, denoted as -1; if the rise exceeds 100%, it is recorded as a surge, denoted as 1; all other cases are denoted as no extreme movement, 0.\nThe model training is conducted in a rolling manner. Specifically, at the end of each month t, all monthly-frequency data before t is used to perform one round of multi-class logistic regression. According to the labeling definitions above, plunge and surge are treated as different outcome categories, and regression coefficients are computed accordingly.\nThe core idea of multi-class logistic regression can be summarized in one sentence: first perform a linear combination of independent variables and parameters, then use a probabilistic model to calculate the probability of a sample falling into different outcome categories. 
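As a minimal sketch of this idea (not the paper's own code), the snippet below turns two linear scores into probabilities for the three outcomes. It assumes the no-extreme-movement class 0 is the baseline, so its score is fixed at zero and it contributes the constant 1 in the denominator. The function name `class_probabilities` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def class_probabilities(x, alpha_plunge, beta_plunge, alpha_surge, beta_surge):
    # Linear combination of factor values and parameters for each extreme outcome
    score_plunge = alpha_plunge + sum(b * v for b, v in zip(beta_plunge, x))
    score_surge = alpha_surge + sum(b * v for b, v in zip(beta_surge, x))
    # Baseline class 0 has score 0, hence the leading 1.0 in the denominator
    denom = 1.0 + math.exp(score_plunge) + math.exp(score_surge)
    # Returns P(Y = -1), P(Y = 0), P(Y = 1)
    return math.exp(score_plunge) / denom, 1.0 / denom, math.exp(score_surge) / denom

# Hypothetical coefficients and factor values, for illustration only
x = [0.5, -1.2]
p_plunge, p_none, p_surge = class_probabilities(
    x,
    alpha_plunge=-2.0, beta_plunge=[0.8, 0.1],
    alpha_surge=-1.5, beta_surge=[-0.3, 0.6],
)
print(p_plunge, p_none, p_surge)
```

The three probabilities always sum to one, and the first returned value has the same form as the plunge probability used for ranking stocks.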
Here, this method is briefly introduced first, then applied to the monthly-frequency cross-sectional data of the 10 stocks, with the goal of establishing a non-linear relationship between \u0026ldquo;current-period factor values\u0026rdquo; and \u0026ldquo;next-period return states.\u0026rdquo;\nThe code below corresponds to the regression coefficient solving process:\nimport numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\n\n# Explanatory variables; the names must match the column headers in the Excel file\nFACTOR_COLS = [\u0026#39;市场因子\u0026#39;, \u0026#39;系统性风险\u0026#39;, \u0026#39;市盈率\u0026#39;, \u0026#39;市净率\u0026#39;, \u0026#39;市销率\u0026#39;, \u0026#39;PEG\u0026#39;, \u0026#39;\u0026#34;营业利润增长率(%)\u0026#34;\u0026#39;, \u0026#39;股东权益增长率(%)\u0026#39;, \u0026#39;销售净利率(%)\u0026#39;, \u0026#39;销售毛利率(%)\u0026#39;, \u0026#39;净资产收益率(%)\u0026#39;, \u0026#39;前期涨跌幅1月\u0026#39;, \u0026#39;前期涨跌幅6月\u0026#39;, \u0026#39;前期换手率1月\u0026#39;, \u0026#39;前期换手率3月\u0026#39;, \u0026#39;流通市值\u0026#39;, \u0026#34;\u0026#39;流通股本\u0026#39;\u0026#34;]\n\ndef fit_binary_logit(df):\n    # Fit one binary logistic regression of mark on the factors.\n    # coef_ and intercept_ describe the larger of the two labels (model.classes_[1]).\n    model = LogisticRegression()\n    model.fit(df[FACTOR_COLS], df[\u0026#39;mark\u0026#39;].astype(\u0026#39;int\u0026#39;))\n    # Return the slope coefficients followed by the intercept\n    return np.append(model.coef_[0], model.intercept_[0])\n\nprint(\u0026#39;Start\u0026#39;)\ndata_ori = pd.read_excel(\u0026#39;解释变量_dwq.xlsx\u0026#39;)\n# The last column holds the date; collect the distinct months in order\nmonths = data_ori.iloc[:, -1].drop_duplicates().tolist()\n\nrows = []\n# Rolling estimation: skip the first 12 and the last 12 months\nfor n in range(12, len(months) - 12):\n    print(n)\n    t = months[n]\n    # All observations up to and including month t\n    data = data_ori[data_ori[\u0026#39;时间\u0026#39;] \u0026lt;= t]\n    # First regression: drop plunge samples, leaving labels 0 and 1\n    a1 = fit_binary_logit(data[data[\u0026#39;mark\u0026#39;] != -1])\n    # Second regression: drop surge samples, leaving labels -1 and 0\n    a2 = fit_binary_logit(data[data[\u0026#39;mark\u0026#39;] != 1])\n    # One output row per month: both coefficient sets, then the date\n    rows.append(np.hstack((a1, a2)).tolist() + [t])\n\npd.DataFrame(rows).to_excel(\u0026#39;ab.xlsx\u0026#39;, sheet_name=\u0026#39;测试\u0026#39;)\n6.2.4 Comprehensive Factor Construction After obtaining the logistic regression coefficients, the next step is to combine them with the month-t explanatory variable values to construct the comprehensive factor. 
Here, this comprehensive factor is understood as the plunge probability—that is, an indicator that can rank expected returns.\nLet the coefficients from month t-12 be $\\alpha_1$, $\\alpha_{-1}$, $\\beta_1$, $\\beta_{-1}$, and the month-t explanatory variables be $X_1$. The formula for the plunge probability is:\n$$P_{\\text{plunge}}= \\frac{\\exp\\left(\\beta_{-1}^{T}X_1+\\alpha_{-1}\\right)} {\\exp\\left(\\beta_{-1}^{T}X_1+\\alpha_{-1}\\right)+\\exp\\left(\\beta_{1}^{T}X_1+\\alpha_{1}\\right)+1}$$\nThe core meaning of this step is not complicated: through the combination of regression coefficients and current-period factor values, the originally scattered multiple research report features are compressed into a single core indicator that can be directly ranked. The comprehensive factor constructed here essentially corresponds to the plunge probability, or equivalently, an ordered ranking of expected returns, based on which the future return differences of different stocks are compared.\n6.2.5 Investment Strategy Design and Backtesting Verification After obtaining the comprehensive factor, the next step is to first clarify the investment strategy. A monthly portfolio rebalancing approach is adopted: rebalancing is conducted once at the end of each month during 2013–2021, and the 10 stocks are divided into three groups according to the comprehensive factor values. Group 1 has the lowest comprehensive factor values and the highest expected returns; the last group has the highest comprehensive factor values and the lowest expected returns.\nAt the end of each month t, the specific execution steps are as follows:\nObtain all dependent variable values (y = 1, -1, or 0) and all explanatory variable values (24 factor values) from before month t. Perform multi-class logistic regression on the above data to obtain regression coefficients. Calculate the comprehensive factor values. 
Rank each individual stock by comprehensive factor value, and go long on the top 1/3 of stocks with the lowest comprehensive factor to maximize returns. Where the comprehensive factor calculation formula used in Step 3 is:\n$$P_{\\text{plunge}}= \\frac{\\exp\\left(\\beta_{-1}^{T}X_1+\\alpha_{-1}\\right)} {\\exp\\left(\\beta_{-1}^{T}X_1+\\alpha_{-1}\\right)+\\exp\\left(\\beta_{1}^{T}X_1+\\alpha_{1}\\right)+1}$$\nAfter the strategy is determined, a single-factor backtest is conducted on this approach. The backtest results are as follows:\nGroup Backtest Ending Cumulative Return Annualized Return Performance vs. CSI 300 Group 1 Above 1517% 41.6% Significantly outperforms benchmark Group 2 Approx. 994% 34.9% Significantly outperforms benchmark Group 3 Approx. 25% 2.83% Significantly underperforms benchmark Finally, it can be seen that the constructed comprehensive factor is effective, with significant grouping effects among Groups 1, 2, and 3. This indicates that after ranking by the comprehensive factor and conducting monthly portfolio rebalancing, this investment strategy achieved good return performance on these 10 stocks.\n6.3 Model for Problem 3 — Construction and Solution 6.3.1 Mechanism of External Factors\u0026rsquo; Impact on Stock Prices In actual stock market trading, stock movements do not always develop as expected. The reason is that various external shocks continuously emerge in the market, affecting individual stocks and industries to varying degrees. Although many external factors cannot be predicted in advance, we can still summarize patterns by studying market reactions after past events occur, thereby analyzing the relationship between external factors and stock price fluctuations.\nIn this study, the external factors of interest mainly include emergencies, public sentiment, and natural disasters. 
From a macro analysis perspective, the event study method can be used to study changes in stock prices after external shocks occur; from a behavioral finance perspective, a news sentiment index can be used to characterize investor attention and sentiment changes, further explaining how external factors affect the stock market.\nThe logic by which external factors impact stock prices is not complicated. When an emergency occurs, it quickly attracts investor attention and changes market sentiment. Investors make judgments and decisions based on existing information. If a large number of investors react similarly in a short period, stock prices will fluctuate significantly. For this reason, the news sentiment index can reflect both the degree of market attention and the direction of market sentiment, making it an important variable connecting external events and stock price fluctuations.\nBased on this thinking, Problem 3 proceeds mainly in two directions: first, building a model based on the event study method to study the impact of specific external events on stock portfolio returns; second, building a model centered on the news sentiment factor to explain the impact of external factors on stock prices from the perspective of sentiment and attention. These two approaches start from different angles but share the same goal: to more systematically understand the relationship between external factors and stock price fluctuations.\n6.3.2 Model Construction Based on the Event Study Method In Problem 3, the first method adopted is the event study method. The event study method was first applied in the financial field. 
Its core idea is to compare the actual returns of sample stocks during the event window with their \u0026ldquo;normal returns\u0026rdquo; as if the event had not occurred, thereby determining whether the event had a significant impact on the stock price.\nThe event study method\u0026rsquo;s modeling steps are divided into 12 parts in total, which can be organized into the following key steps:\nEvent Selection: Select important events related to industries or individual companies that occurred during the 2013–2021 period. Event Date Determination: Determine the time point when the market received the information, and based on this, divide the estimation window and event window. The estimation window is used to estimate normal returns, typically 120 days to 35 days before the event; the event window is flexibly determined based on the possible sustained impact range of the event. Research Sample Selection: Select stock samples that may be affected by the event. The research sample in this paper is still the 10 stocks selected earlier, grouped by Shenwan Level-1 industry classification to form industry portfolios, facilitating subsequent comparison of different industries\u0026rsquo; reactions to external events. Normal Return Estimation: Within the estimation window, the market model is used to calculate the normal returns of stocks. The market model rather than the constant mean adjustment model or market adjustment model is used because the market model provides a more comprehensive explanation and better aligns with common practices in financial research. Abnormal Return Calculation: After obtaining normal returns, abnormal returns, portfolio average abnormal returns, and cumulative average abnormal returns within the event window are further calculated to measure the additional impact of event shocks on the stock portfolio. Significance Testing: A mean T-test is performed on the cumulative abnormal return series. 
If the P-value is less than 0.05, it indicates that at the 5% significance level, the event has a significant impact on the sample portfolio\u0026rsquo;s stock price fluctuations. This section also requires several key formulas. First, the daily return of an individual stock in the event window can be expressed as:\n$$R_{it} = \\frac{P_{it} - P_{i,t-1}}{P_{i,t-1}}$$\nThe market return can be written as:\n$$R_{mt} = \\frac{I_t - I_{t-1}}{I_{t-1}}$$\nIn normal return estimation, the market model used is:\n$$R_{it} = \\alpha_i + \\beta_i R_{mt} + \\varepsilon_{it}$$\nThe abnormal return is defined as the difference between the actual return and the normal return:\n$$AR_{it} = R_{it} - \\hat{R}_{it}$$\nThe portfolio average abnormal return is:\n$$\\overline{AR}_t = \\frac{1}{n}\\sum_{i=1}^{n} AR_{it}$$\nThe cumulative average abnormal return is:\n$$CAR(t_1,t_2) = \\sum_{t=t_1}^{t_2} \\overline{AR}_t$$\nThrough these formulas, the impact of a specific event on an individual stock and industry portfolio can be quantified, and then further tested for significance.\n6.3.3 Empirical Design and Model Solution In the empirical design section, a typical case is selected from common external factors to demonstrate the application process of the event study method. First, the previously selected 10 stocks are classified by Shenwan Level-1 industry. 
Results show that half of these 10 stocks belong to the electronics industry, 2 belong to the pharmaceuticals and biomedicine industry, and the rest belong to real estate, building materials, and light manufacturing industries.\nThe corresponding industry classification is shown in the table below:\nStock Name Shenwan Level-1 Industry Stock Name Shenwan Level-1 Industry EVE Energy Shenwan Electronics Huafa Shares Shenwan Real Estate Luxshare Precision Shenwan Electronics Tower Group Shenwan Building Materials Nationstar Optics Shenwan Electronics MiYon Shenwan Light Manufacturing Sunlord Electronics Shenwan Electronics Livzon Group Shenwan Pharmaceuticals Everwin Precision Shenwan Electronics China National Shenwan Pharmaceuticals Based on this industry structure, the 2018 rabies vaccine fraud incident is selected as the focus of the case analysis. The reason for selecting this event is clear: it originated in the pharmaceutical industry but has strong spillover effects, making it more suitable for observing the effectiveness of the event study method in industry shock research.\nIn specific settings, July 16, 2018 is taken as the event occurrence date; 120 days before the event is set as the estimation window, i.e., February 26, 2018 to July 9, 2018; the event window is set from 5 days before to 40 days after the event. The data source is still the Shenzhen Tianruan Technology database. Since the Bay Area index was established in 2019 and cannot be directly used as the market return benchmark, the CSI 300 index is selected as the market return reference.\nIn industry grouping, the 10 stocks are further grouped into three categories: electronics and information, pharmaceuticals, and construction. 
Then, the data for each category of stocks is substituted into the aforementioned event analysis model to calculate the normal return, abnormal return, portfolio average abnormal return, and cumulative average abnormal return of each portfolio within the event window, and based on this, the degree of reaction of different portfolios under the event shock is observed.\nAfter secondary classification according to this definition, the stock portfolio can be organized as follows:\nStock Name Industry Category Stock Name Industry Category EVE Energy Electronics \u0026amp; Info Huafa Shares Construction Luxshare Precision Electronics \u0026amp; Info Tower Group Construction Nationstar Optics Electronics \u0026amp; Info MiYon Construction Sunlord Electronics Electronics \u0026amp; Info Livzon Group Pharmaceuticals Everwin Precision Electronics \u0026amp; Info China National Pharmaceuticals In the specific case analysis, this part can be further divided into four steps:\nDetermine the event occurrence date: Taking the Changsheng Bio vaccine fraud incident reported on July 16, 2018 as the starting point of research. Determine the event window: Set 120 days before the event as the estimation window, and from 5 days before to 40 days after the event as the event window. Determine the data source: Extract the daily closing prices of individual stocks within the estimation window and event window from the Shenzhen Tianruan Technology database, and use the CSI 300 index as the market return benchmark. Substitute into the model for solution: Substitute the data of different portfolio stocks into the event analysis model to calculate normal return, abnormal return, portfolio average abnormal return, and cumulative average abnormal return. 
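The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `fit_market_model` and `event_study` are hypothetical, the return series are synthetic toy numbers rather than database values, and the significance step only computes the one-sample t-statistic of the cumulative abnormal return series (the P-value would additionally require a t-distribution, for example from a statistics library).

```python
import math
from statistics import mean, stdev

def fit_market_model(stock_returns, market_returns):
    # OLS fit of R_it = alpha_i + beta_i * R_mt over the estimation window
    m_bar, r_bar = mean(market_returns), mean(stock_returns)
    sxx = sum((m - m_bar) ** 2 for m in market_returns)
    sxy = sum((m - m_bar) * (r - r_bar)
              for m, r in zip(market_returns, stock_returns))
    beta = sxy / sxx
    return r_bar - beta * m_bar, beta

def event_study(est_stocks, est_market, evt_stocks, evt_market):
    # est_stocks and evt_stocks hold one list of daily returns per stock
    params = [fit_market_model(r, est_market) for r in est_stocks]
    avg_ar = []
    for t, r_m in enumerate(evt_market):
        # Abnormal return = actual return minus market-model normal return
        ars = [evt_stocks[i][t] - (a + b * r_m) for i, (a, b) in enumerate(params)]
        avg_ar.append(mean(ars))
    # Running cumulative average abnormal return over the event window
    car, total = [], 0.0
    for ar in avg_ar:
        total += ar
        car.append(total)
    # One-sample t-statistic of the CAR series against a mean of zero
    t_stat = mean(car) / (stdev(car) / math.sqrt(len(car)))
    return avg_ar, car, t_stat

# Synthetic toy data (assumption): two stocks that follow the market model exactly
est_market = [0.01, -0.02, 0.015, 0.0, -0.005]
est_stocks = [[0.001 + 1.2 * m for m in est_market],
              [-0.002 + 0.8 * m for m in est_market]]
evt_market = [0.0, 0.01, -0.01]
shock_a, shock_b = [-0.03, -0.01, 0.02], [-0.01, -0.03, 0.0]
evt_stocks = [[0.001 + 1.2 * m + s for m, s in zip(evt_market, shock_a)],
              [-0.002 + 0.8 * m + s for m, s in zip(evt_market, shock_b)]]

avg_ar, car, t_stat = event_study(est_stocks, est_market, evt_stocks, evt_market)
print(avg_ar, car, t_stat)
```

In practice each industry portfolio would be passed through the same function separately, so that the resulting t-statistics can be compared across portfolios.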
After summarizing the empirical results of the three stock portfolios, they can be organized into the following table:\nStock Portfolio T-statistic P-value Significance Conclusion Result Interpretation Pharmaceuticals 8.085680088 ≈ 0.00002 Significant at 95% confidence level After the emergency, the pharmaceutical portfolio return first declined, then recovered and stabilized as the event was gradually digested. Construction 11.478369766 ≈ 0.00003 Significant at 95% confidence level The construction portfolio also experienced return reversal and decline after the pharmaceutical event, then gradually rebounded, indicating the sector was also affected by both the event and the broader market. Electronics \u0026amp; Info -2.062110627 ≈ 0.045133004 Significant at 95% confidence level The electronics portfolio was also impacted by the pharmaceutical industry shock; returns first declined, then went through a period of unstable fluctuations before gradually stabilizing. From these three sets of results, it can be seen that emergencies cause significant short-term shocks across different sectors, but the subsequent digestion processes differ. Pharmaceuticals, as the source of the event, experiences the most direct impact; construction and electronics reflect more of a spillover linkage effect. 
The commonality is that all portfolios experienced return oscillations after the event and eventually gradually moved toward stability.\nThe significance of this section lies in advancing Problem 3 from \u0026ldquo;theoretically able to analyze external factors\u0026rdquo; to \u0026ldquo;quantitatively verifying the impact of external factors with a specific case.\u0026rdquo; In other words, the event study method not only provides a research framework but also offers an operational data path for subsequently comparing the degrees of impact across different industry portfolios.\n6.3.4 Model Construction and Solution Based on News Sentiment Factors In addition to the event study method, Problem 3 also introduces news sentiment factors from the perspective of behavioral finance. The core idea here is: if external events significantly affect investor attention and sentiment, then the news headlines themselves can be transformed into a quantifiable explanatory variable to characterize the market\u0026rsquo;s sentiment changes toward individual stocks.\nThe construction process of the news sentiment factor is divided into 6 steps:\nData Acquisition: On a monthly basis, news headlines for the 10 stocks from January 2021 onward are scraped from the East Money website, obtaining a news headline sample set. Data Preprocessing: The news headlines in the sample set are segmented using jieba word segmentation, the text is split and connected with spaces; a third-party stop word list is used to process Chinese stop words; then CountVectorizer is used to vectorize Chinese words. Construct Training and Test Sets: The sample set is split into training and test sets, and some training set news headlines are manually scored. The scoring rules are: 1 for bullish, 0 for unclear, -1 for bearish. 
Build Classification Model: Based on the manual scoring, text features are first vectorized, then a Naive Bayes classifier is imported to automatically score the remaining training set news headlines. Verify Prediction Accuracy: Data that has not been feature-vectorized is input into the model to check the accuracy of the machine learning predictions. Construct Sentiment Factor: The arithmetic mean of the scores of all news headlines for each stock in each month is calculated, obtaining the news sentiment score for that stock in that month, which is defined as the news sentiment indicator—the \u0026ldquo;sentiment factor\u0026rdquo; used in subsequent analysis. Based on this process, a sentiment factor can be constructed for each stock each month. Due to the strong timeliness of sentiment factors, the same logic as in Problem 2 (\u0026ldquo;using one-year-later plunge/surge results and one-year-earlier factor values for regression\u0026rdquo;) is no longer directly applicable. Therefore, rather than constructing a completely independent model, the choice is made to adjust Problem 2\u0026rsquo;s comprehensive factor.\nThe specific approach is: first calculate the monthly average sentiment factor for the 10 stocks, then test the correlation between $\\alpha_1$, $\\alpha_{-1}$ in Problem 2 and the sentiment factor, and judge the strength of correlation through $R^2$. The test results are: $R^2$ between $\\alpha_1$ and the sentiment factor is 0.4432, and $R^2$ between $\\alpha_{-1}$ and the sentiment factor is 0.0859. This indicates a relatively strong relationship between the sentiment factor and $\\alpha_1$, while the relationship with $\\alpha_{-1}$ is not significant.\nBased on this result, the following is substituted:\n$$\\alpha_1 = -0.0653x - 0.013$$\ninto the comprehensive factor model from Problem 2. 
When calculating the comprehensive factor each time, the variable $x$ uses the current-period sentiment factor value of the individual stock, so that the comprehensive factor reflects not only the internal fundamental information of the stock but also the shock brought by external sentiment changes.\nFrom an economic perspective, the role of this correction term is also intuitive: when individual stock sentiment turns from negative to positive, the sentiment factor value $x$ rises, the related term in the model rises, and the individual stock\u0026rsquo;s plunge probability declines; when individual stock sentiment turns from positive to negative, the sentiment factor value declines, and the individual stock\u0026rsquo;s plunge probability rises. In other words, the sentiment factor essentially adds a layer of \u0026ldquo;market sentiment adjustment\u0026rdquo; to the comprehensive factor from Problem 2.\nThrough this processing, the news sentiment factor in Problem 3 is no longer just an auxiliary observation indicator but is truly integrated into the subsequent comprehensive factor modeling framework, laying the foundation for Problem 4\u0026rsquo;s further integration of internal and external factors.\n6.4 Model for Problem 4 — Construction and Solution 6.4.1 Redesign of the Improved Investment Strategy Based on the new comprehensive factor, the investment strategy in Problem 4 also needs to be restated. 
At the end of each month t, the rebalancing process can be summarized in the following steps:\nStep 1: Obtain all dependent variable values (y = 1, -1, or 0) and all explanatory variable values (24 factor values) from before month t.\nStep 2: Perform multi-class logistic regression on the above data to obtain two sets of regression coefficients.\nStep 3: Obtain the monthly sentiment factor values, calculate the arithmetic mean of the sentiment factor values for all individual stocks, and then perform OLS fitting between $\\alpha$ and the individual stock sentiment factor value $x$, establishing the relationship between $\\alpha$ and the individual stock sentiment factor value $x$.\nWhere the fitted relationship is:\n$$\\alpha_1 = -0.0653x - 0.013$$\nStep 4: Substitute the above equation into the comprehensive factor value to obtain the new comprehensive factor.\nStep 5: Rank all individual stocks by comprehensive factor value, and go long on the top 1/3 of stocks with the lowest comprehensive factor to maximize returns.\nCompared with Problem 2, the essential change in this strategy is: during monthly rebalancing, not only internal factors are used, but the current-period sentiment factor is also incorporated into the comprehensive factor calculation process, so that the stock selection results simultaneously reflect the company\u0026rsquo;s internal characteristics and external sentiment changes.\n6.4.2 Comparison of Old and New Model Backtest Results and Improvement Analysis After completing the model modification, the first step is to compare the stock selection effects of the old and new model frameworks using a unified standard. 
The specific approach is: retain the original grouped backtest framework from Problem 2, still rebalance monthly and group by comprehensive factor ranking, then respectively calculate the return curves and grouping performance of the old comprehensive factor and the new comprehensive factor (with sentiment factor incorporated) within the same time interval. The purpose of this treatment is to make \u0026ldquo;whether to introduce the sentiment factor\u0026rdquo; the only variable, so that the changes brought by the model improvement can be more clearly identified.\nFrom the backtest results, the new model\u0026rsquo;s overall performance is better than the old model, mainly reflected in the following aspects:\nBetter overall return performance: Against the backdrop of overall market adjustment in 2021, all three groups under the old comprehensive factor grouping experienced significant drawdowns, with Group 1\u0026rsquo;s drawdown being particularly pronounced; in comparison, the new model\u0026rsquo;s grouping performance is more stable. Clearer group differentiation: During the period from February to September 2021, if the sentiment factor is not introduced, Group 1 and Group 2\u0026rsquo;s return curves remain entangled for a long time, making it difficult to form an effective differentiation; after incorporating the sentiment factor, Group 1 stocks significantly outperform Group 2 in most of the period. Stronger hierarchical stability: Under the new model, the return ranking among different groups is more stable, indicating that the new comprehensive factor can more clearly identify the relative strengths and weaknesses among stocks. From the analysis of improvement effects, the value of the sentiment factor is mainly concentrated in two points:\nEnhancing factor effectiveness: The sentiment factor strengthens the comprehensive factor\u0026rsquo;s ability to differentiate stocks\u0026rsquo; future performance, making the grouping results more distinguishable. 
Improving adaptability in adverse market environments: When the stock pool declines overall, the new model\u0026rsquo;s control of Group 1\u0026rsquo;s drawdown is more effective, indicating stronger adaptability to adverse market environments. Therefore, the conclusion for Problem 4 can be summarized: after integrating the news sentiment factor into the comprehensive factor model from Problem 2, the new model\u0026rsquo;s backtest performance is overall better than the old model, and the return performance of Group 1 stocks also improves. This indicates that the new model, when explaining stock price fluctuations, can simultaneously absorb internal fundamental information and external sentiment information, thus possessing stronger practical value.\n7. Model Evaluation and Outlook 7.1 Model Strengths Considering the modeling and empirical results throughout the paper, the main strengths of this factor-based stock selection research are reflected in the following aspects:\nRelatively complete factor integration framework: The model does not rely solely on a single type of information but places research report features, financial and market factors, external events, and news sentiment factors within the same research framework for unified examination, able to simultaneously cover both endogenous and exogenous factors in stock price fluctuations. Machine learning methods are suitable for handling complex relationships: Compared with traditional linear methods, multi-class logistic regression can better characterize the non-linear relationship between factor values and future return states, enhancing the model\u0026rsquo;s ability to capture complex market relationships while retaining statistical interpretability. 
Factor screening process is relatively rigorous: From research report text frequency statistics and Shenzhen Tianruan Technology database factor mapping, to correlation testing and subsequent comprehensive factor construction, the entire process is clearly structured, avoiding redundancy problems caused by simply piling up variables. Event study method enhances explanatory power: Beyond factor return prediction, the event study method provides a relatively clear quantitative path for \u0026ldquo;how external shocks affect stock prices,\u0026rdquo; enabling the model not only to do return grouping but also to explain the impact of specific events on industries and stock portfolios from a mechanistic perspective. Sentiment factor improvement enhances practicality: Problem 4 further introduces news sentiment factors on the basis of the original model, giving the model higher sensitivity when facing market sentiment changes and making the final strategy closer to the real investment environment\u0026rsquo;s information transmission process. 7.2 Model Limitations Although the model has demonstrated good grouping effects within the sample, from the perspective of research design and practical application, there are still some limitations that cannot be ignored:\nRelatively small number of sample stocks: The research subjects are mainly 10 Bay Area index constituent stocks. Such a sample size is more suitable for method validation, but the representativeness for a broader market environment is still limited. The model\u0026rsquo;s generalization ability needs further testing in a larger stock pool. Backtest period has period-specific characteristics: Market style, industry rotation, and external event shocks during the sample period all have strong period-specific characteristics. Therefore, the current backtest results more illustrate the model\u0026rsquo;s effectiveness during a specific period and cannot be directly equated with long-term stable effectiveness. 
Sentiment factor has strong timeliness: News headlines and market sentiment changes often spread quickly and decay quickly, meaning sentiment factors may have different effects at different frequencies. If not handled properly, signal lag or noise amplification problems are prone to occur. Factor integration method is still relatively linear: Although Problem 4 has incorporated sentiment factors into the comprehensive framework, the current fusion logic still tends to adjust the original comprehensive factor, and the more complex interactive relationships among different types of factors have not been fully explored. Real trading constraints have not been fully considered: This paper\u0026rsquo;s backtest mainly focuses on return performance and grouping effects, with less consideration of real trading conditions such as transaction costs, rebalancing impact, and liquidity constraints. Therefore, when the strategy moves from research conclusions to live trading application, further calibration is needed. 7.3 Improvement Directions Based on the above limitations, if this research framework is to be further improved in the future, it can be advanced in the following directions:\nExpand external data dimensions: In addition to news sentiment, further introduce external variables such as macroeconomic indicators, industry prosperity, policy events, and fund flows to build a more complete exogenous information characterization system. Optimize factor fusion mechanism: Rather than being limited to linear adjustment of the original comprehensive factor, attempt to establish a layered fusion or dynamic weighting mechanism, allowing fundamental factors, market factors, and sentiment factors to automatically adjust weights in different market environments. 
Try more diverse machine learning algorithms: While retaining the interpretability of logistic regression, further try ensemble methods such as Random Forest, Gradient Boosting, and XGBoost, and compare the differences in stability, interpretability, and predictive power across different models. Expand sample scope for robustness testing: Extend the research subjects from 10 stocks to a larger stock pool, lengthen the backtest time interval, and examine whether the model\u0026rsquo;s performance remains valid across different market states, sectors, and style environments. Add real trading-level verification: In subsequent research, add transaction cost, slippage, turnover rate constraints, and portfolio capacity analysis, further advancing the strategy from \u0026ldquo;academically effective\u0026rdquo; to \u0026ldquo;operationally executable.\u0026rdquo; Overall, the factor model constructed in this paper has verified the feasibility of the research approach combining \u0026ldquo;securities research report feature indicators + external sentiment factors,\u0026rdquo; and also provided a framework that can be further expanded for the subsequent deeper integration of text mining, event shock analysis, and quantitative stock selection.\n8. References [1] Probability of Price Crashes, Rational Speculative Bubbles, and the Cross Section of Stock Returns. 
Journal of Financial Economics.\n[2] Shenzhen Tianruan Technology Database Technical Documentation.\n[3] Shenwan Industry Classification Standard.\nAppendix: Core code for each problem is available in the supplementary materials.\n","date":"2021-12-15T23:50:00+08:00","image":"/uploads/cover-multi-factor-20211215.jpg","permalink":"/en/p/multi-factor-bay-stock-model/","title":"Multi-Factor Machine Learning Model for Equity Analysis（Greater Bay Area Financial Mathematical Modeling Competition in 2021）"},{"content":"\nThe Research on the Short Memory Principle of Fractional-Order Systems This article is adapted from my 2021 thesis titled Short Memory Principle for Fractional-Order Systems. It documents my research process around this topic and the insights I developed.\nMy initial interest in fractional-order systems did not come from finding the definitions novel, but from their distinctive modeling approach: the current state of a system is not determined solely by the present moment—it continuously depends on its entire historical trajectory. This memory property is precisely why fractional-order models often outperform ordinary integer-order models in areas like viscoelastic materials, anomalous diffusion, and control systems.\nThe core problem this thesis tackles is clear: while fractional-order systems better capture memory-dependent dynamics, they are computationally much harder. Terms like the fractional integral $\\int_0^t (t-\\tau)^{-\\alpha}f(\\tau)d\\tau$ require discretizing and summing over the entire history from 0 to $t$. As time grows, so does the number of history terms—each step forward costs more than the last, and overall algorithmic complexity escalates rapidly.\nTo address this, the thesis adopts two main approaches. The first is the Predictor-Corrector Method, used to construct numerical schemes for both constant-order and variable-order fractional systems. 
The second is the Short Memory Principle, which truncates the full historical integral to the most recent segment of length $T$, reducing complexity from $O(N^2)$ to $O(N \\cdot T/h)$. The thesis does more than just combine these two: it analyzes how well this combination balances accuracy and efficiency.\nSpecifically, the thesis does two things. First, it applies the predictor-corrector method systematically to constant-order and variable-order fractional systems, comparing numerical accuracy across different algorithms. Second, it introduces the short memory strategy into these algorithms, investigating how truncation affects error, how much computation time is saved, and how to choose the optimal memory length $T$.\nThe most valuable conclusion I eventually arrived at, and the reason I wanted to write this up separately, is this: the short memory method is not merely an \u0026ldquo;engineering trick for speed.\u0026rdquo; There is a deeper quantitative question underneath: what exactly is the relationship between the fractional order $\\alpha$ and the optimal memory length $T$? Through experiments on both constant-order and variable-order systems, the thesis demonstrates that short memory can effectively reduce computation cost, and that changes in $\\alpha$ directly influence the choice of $T$. This is one of the central insights of the entire work.\nThis article largely preserves my original thought process, though the introduction lays out the overall framework first. To make the blog post more readable, it is organized as: conceptual foundations → constant-order algorithms → variable-order extensions → conclusions and insights. You can think of it as answering three questions:\nWhy are fractional-order systems harder to compute than integer-order systems? Why can the short memory principle significantly speed up computation? To what extent can this truncation strategy work in both constant-order and variable-order settings? 1. 
Fundamentals of Fractional-Order Systems This chapter introduces the core concepts that will be used throughout the rest of the article. The most important takeaways are: how fractional derivatives differ from ordinary derivatives, why the predictor-corrector method becomes the most common numerical tool for these problems, and why the short memory principle has a chance of working.\n1.1 Several Common Definitions of Fractional Calculus Before diving into the algorithms, let us distinguish the three most common definitions of fractional calculus. These are not competing theories but different expressions of the same underlying mathematics in different contexts: some are better suited for numerical discretization, others for theoretical analysis, and still others for initial value problems. The definitions that the thesis actually relies on most heavily are the Caputo derivative for modeling purposes and the GL (Grünwald-Letnikov) intuition for discretization.\n1. Grünwald-Letnikov (GL) Derivative\nThe Grünwald-Letnikov derivative is the definition most directly connected to \u0026ldquo;discrete history summation.\u0026rdquo; It is particularly important in numerical computation, because it essentially tells us: a fractional derivative does not depend only on a local neighborhood; it accumulates a whole string of past states with weights. The $\\alpha$-order GL derivative is written as:\n$$ D_{GL}^{\\alpha}f(t) = \\lim_{h \\to 0} h^{-\\alpha} \\sum_{j=0}^{\\lfloor t/h \\rfloor} (-1)^j \\binom{\\alpha}{j} f(t-jh) $$\nwhere the binomial coefficient $\\displaystyle \\binom{\\alpha}{j} = \\frac{\\alpha(\\alpha-1)(\\alpha-2)\\cdots(\\alpha-j+1)}{j!}$. From this form, the \u0026ldquo;memory\u0026rdquo; characteristic of fractional systems is already visible: the derivative at the current time absorbs an entire history of past states, rather than depending only on a local neighborhood like an integer-order derivative.\n2. 
Riemann-Liouville Integral\nThe Riemann-Liouville fractional integral is defined as:\n$$ I_{RL}^{\\alpha}f(t) = \\frac{1}{\\Gamma(\\alpha)} \\int_0^t (t-\\tau)^{\\alpha-1} f(\\tau) d\\tau $$\nwhere $\\Gamma$ is the gamma function. The RL integral can define fractional order for any $\\alpha \u0026gt; 0$ and exhibits derivative-like properties. It is commonly used in theoretical analysis and analytical solutions.\n3. Caputo Derivative\nThe Caputo derivative is another widely used fractional definition:\n$$ {}^C D_t^{\\alpha}f(t) = \\frac{1}{\\Gamma(1-\\alpha)} \\int_0^t (t-\\tau)^{-\\alpha} f\u0026rsquo;(\\tau) d\\tau $$\nWhen handling initial value problems, the Caputo derivative is more convenient, because its initial conditions are consistent with those of ordinary differential equations—no additional RL-type initial conditions are needed. This is why, when building numerical schemes later, I naturally take the Caputo form as the starting point.\n4. Comparison of the Three Definitions\nDefinition Typical Use Case Initial Condition Handling GL Derivative Numerical discretization Requires additional treatment RL Integral/Derivative Theoretical analysis, analytical solutions Needs RL initial conditions Caputo Derivative Engineering applications, initial value problems Same as integer-order equations 1.2 The Predictor-Corrector Method Before discussing whether history can be truncated, we need to answer a more fundamental question: how to stably solve the original equation numerically. The Predictor-Corrector Method is conceptually straightforward: first use a cheaper formula for a Predictor step (P), then substitute the predicted value into a more accurate Corrector step (C). 
It balances implementation difficulty, computational efficiency, and accuracy, which is why it is so common in fractional numerical computation.\nFor the Caputo fractional differential equation $D^{\\alpha}y(t) = f(t, y(t)),\\ y(0) = y_0$, its integral form is:\n$$ y(t) = y_0 + \\frac{1}{\\Gamma(\\alpha)} \\int_0^t (t-\\tau)^{\\alpha-1} f(\\tau, y(\\tau)) d\\tau $$\nDiscretizing this integral form yields the predictor and corrector coefficients that appear repeatedly in what follows. In essence, every seemingly complex recurrence formula in this thesis is an elaboration of this relationship. After establishing the standard form for constant-order systems, we will extend it to variable-order settings.\n1.3 The Short Memory Principle With a basic numerical framework in place, we arrive at the central question of this thesis: can the historical integral be shortened? The fractional integral $\\int_0^t (t-\\tau)^{-\\alpha}f(\\tau)d\\tau$ accumulates the entire history, but when $\\alpha \\in (0, 1)$, the influence of more distant history is typically weaker. This raises the natural question: can we keep only the \u0026ldquo;most recent\u0026rdquo; segment of history and discard everything earlier?\nThe short memory principle truncates the integral history to the most recent interval of length $T$:\n$$ I_{\\text{short}}^{\\alpha}f(t) \\approx \\frac{1}{\\Gamma(\\alpha)} \\int_{t-T}^{t} (t-\\tau)^{\\alpha-1} f(\\tau) d\\tau $$\nThe real difficulty is not \u0026ldquo;whether to truncate,\u0026rdquo; but: how large should $T$ be, so that speedup is achieved while keeping error under control? This question—though it sounds like a parameter selection issue—is precisely the central research entry point of the entire thesis.\n2. Numerical Computation for Constant-Order Fractional Systems 2.1 Basic Idea of the Predictor-Corrector Method Before studying variable-order systems, the thesis begins with the simpler constant-order fractional system. 
Consider the general constant-order fractional differential equation initial value problem:\n$$ {}^C D_t^\\alpha y(t) = f(t, y(t)), \\quad 0 \u0026lt; \\alpha \u0026lt; 1, \\quad y(0) = y_0 $$\nFrom the literature, this equation is equivalent to the following Volterra integral equation:\n$$ y(t) = y_0 + \\frac{1}{\\Gamma(\\alpha)} \\int_0^t (t-\\tau)^{\\alpha-1} f(\\tau, y(\\tau)) d\\tau $$\nThe approach here is not to derive the fractional case from scratch, but to first review the familiar Adams predictor-corrector method for first-order ODEs, then extend this quadrature idea to constant-order fractional equations.\nFor a first-order ODE initial value problem at the undergraduate level, if a unique solution exists on the interval, the Adams predictor-corrector method can be used. Let the interval length be $T$, the number of steps be $N$, and the step size be $h = T/N$. The core idea is: first obtain a predicted value through a simple prediction formula, then substitute it into a more accurate correction formula to get a better approximation for the next step.\nWith this background, we return to the constant-order fractional equation. Since the integral in the formula extends from 0 to the current time—reflecting the nonlocal nature of fractional systems—we can still use the predictor-corrector quadrature idea. Take an equally spaced grid:\n$$ t_k = kh, \\quad k = 0, 1, \\dots, n+1 $$\nand denote $y_k \\approx y(t_k)$. At $t_{n+1}$, the above equation becomes:\n$$ y(t_{n+1}) = y_0 + \\frac{1}{\\Gamma(\\alpha)} \\int_0^{t_{n+1}} (t_{n+1} - \\tau)^{\\alpha-1} f(\\tau, y(\\tau)) d\\tau $$\nPartition the integral interval into subintervals:\n$$ \\int_0^{t_{n+1}} = \\sum_{j=0}^{n} \\int_{t_j}^{t_{j+1}} $$\nThe approximation used in the thesis applies linear interpolation to the integrand at the nodes, yielding the correction and prediction formulas. 
After reorganization, the predictor-corrector scheme for constant-order fractional differential equations is:\n$$ y_{n+1}^{P} = y_0 + \\frac{h^\\alpha}{\\Gamma(\\alpha+1)} \\sum_{j=0}^{n} b_{j,n+1} f(t_j, y_j) $$\nwhere the prediction coefficients are:\n$$ b_{j,n+1} = (n+1-j)^\\alpha - (n-j)^\\alpha, \\qquad j = 0, 1, \\dots, n $$\nThe correction formula is:\n$$ y_{n+1} = y_0 + \\frac{h^\\alpha}{\\Gamma(\\alpha+2)} \\left[ \\sum_{j=0}^{n} a_{j,n+1} f(t_j, y_j) + f(t_{n+1}, y_{n+1}^{P}) \\right] $$\nwhere the correction coefficients are:\n$$ a_{j,n+1} = \\begin{cases} n^{\\alpha+1} - (n-\\alpha)(n+1)^\\alpha, \u0026amp; j = 0, \\\\ (n-j+2)^{\\alpha+1} + (n-j)^{\\alpha+1} - 2(n-j+1)^{\\alpha+1}, \u0026amp; 1 \\le j \\le n, \\\\ 1, \u0026amp; j = n+1. \\end{cases} $$\nAt this point, the predictor-corrector formula for constant-order fractional differential equations is fully established. All subsequent numerical experiments are built around this scheme.\n2.2 Introduction of the Short Memory Principle in Constant-Order Systems In the original predictor-corrector formulas, all terms except the last one depend only on the current neighborhood. However, the final summation, because the integral runs from 0 to the current step, always involves the entire function history. This is precisely what the short memory principle addresses.\nTruncation Idea\nThe approach taken in the thesis is a fixed memory length strategy. Let the preserved integral length be $T$. 
When the summation length exceeds this interval, we truncate the earlier history from the starting side and keep only the most recent segment of length $T$.\nDenote:\n$$ M = \\left\\lfloor \\frac{T}{h} \\right\\rfloor $$\nIf the current step $n \\le M$, the current integral interval has not yet exceeded the length to be preserved, so no truncation is needed and the summation still starts from 0.\nIf $n \u0026gt; M$, we discard the earlier history and keep only:\n$$ j = n-M,\\; n-M+1,\\; \\dots,\\; n $$\nIn this case, the prediction step is rewritten as:\n$$ y_{n+1}^{P} = y_0 + \\frac{h^\\alpha}{\\Gamma(\\alpha+1)} \\sum_{j=n-M}^{n} b_{j,n+1} f(t_j, y_j) $$\nThe correction step is rewritten as:\n$$ y_{n+1} = y_0 + \\frac{h^\\alpha}{\\Gamma(\\alpha+2)} \\left[ \\sum_{j=n-M}^{n} a_{j,n+1} f(t_j, y_j) + f(t_{n+1}, y_{n+1}^{P}) \\right] $$\nIn other words, once the integral interval exceeds $T$, the lower limit of the summation is no longer fixed at 0, but changes to $n - M$, that is, approximately $n - T/h$.\nWhy is this justified? Two analytical perspectives are useful:\n1. Analytical Method\nWhen calculating the truncation error, given an allowable precision $E$, the integral length $T$ can theoretically be derived inversely from the corresponding error formula. The thesis shows that when the fractional order is in $[0, 1]$, a fixed memory length can reduce computation, and the chosen integral length is independent of the original total integral length. When the order exceeds this range, the fixed-length approach no longer works as easily, and selecting $T$ becomes more difficult.\n2. Weight Function Image Analysis\nIn addition to theoretical error analysis, the thesis provides a more intuitive approach: re-examine the distributions of the prediction coefficient $b$ and correction coefficient $a$. 
The images show that when the step count is close to the current time, the weight function values increase significantly, while earlier history nodes have smaller weights; additionally, smaller order values lead to even smaller early weights. This indicates that for smaller orders, the system depends less on distant history, and the weight function image can serve as a guide for choosing the integral length $T$.\n2.3 Algorithm After the above analysis, the thesis organizes the predictor-corrector algorithm with short memory into a set of implementation steps:\nInput known parameters: initial conditions, step size, and full integral interval. Determine the preserved integral length $T$ using the weight function image. Compute the predicted value for the next step using the prediction formula. Substitute the predicted value into the correction formula to obtain the corrected value. When the step count is less than $T/h$, the summation starts from 0. When the step count exceeds $T/h$, the summation starts from $n - T/h$. With this, the predictor-corrector algorithm with short memory is fully established. 
The thesis then proceeds to numerical examples for verification.\n2.4 Constant-Order Examples and Error Analysis 2.4.1 Example 1 $$D_*^\\alpha x(t) = \\frac{\\Gamma(9)}{\\Gamma(9-\\alpha)} t^{8-\\alpha} + \\frac{9}{4}\\Gamma(\\alpha+1) + t^8 + \\frac{9}{4}t^\\alpha - x(t)$$\nInitial conditions: $x(0) = 0,\\ x\u0026rsquo;(0) = 0$\nExact solution: $x(t) = t^8 + \\frac{9}{4} t^\\alpha$\nThis example considers $\\alpha = 0.7$ and $\\alpha = 1.7$, corresponding to two important intervals $(0, 1)$ and $(0, 2)$.\nInterval $(0, 1)$, $\\alpha = 0.7$, comparison across step sizes:\n$h = 1/160$\n$t$ Exact value $x(t)$ Numerical solution Absolute error Relative error% 0.1 0.4489 0.4563 0.0074 1.64 0.3 0.9687 0.9824 0.0136 1.41 0.5 1.3889 1.4064 0.0175 1.26 0.7 1.8105 1.8316 0.0211 1.16 0.9 2.5205 2.5494 0.0289 1.15 $h = 1/320$\n$t$ Exact value $x(t)$ Numerical solution Absolute error Relative error% 0.1 0.4489 0.45343 0.00453 1.01 0.3 0.9687 0.97707 0.00837 0.86 0.5 1.3889 1.3997 0.01080 0.78 0.7 1.8105 1.8234 0.01290 0.71 0.9 2.5205 2.5382 0.01770 0.70 $h = 1/640$\n$t$ Exact value $x(t)$ Numerical solution Absolute error Relative error% 0.1 0.4489 0.45169 0.00279 0.62 0.3 0.9687 0.97384 0.00514 0.53 0.5 1.3889 1.3955 0.00660 0.48 0.7 1.8105 1.8184 0.00790 0.44 0.9 2.5205 2.5314 0.01090 0.43 The tables show that for this example, the predictor-corrector method achieves high accuracy, and accuracy improves monotonically as the step size decreases.\nInterval $(0, 2)$, $\\alpha = 1.7$, $h = 1/160$:\n$t$ Exact value $x(t)$ Numerical solution Absolute error Relative error% 0.1 0.0448934 0.044895 0.0000016 0.004 0.3 0.2906610 0.290670 0.0000090 0.003 0.5 0.6964250 0.696460 0.0000350 0.005 0.7 1.2846611 1.284700 0.0000389 0.003 0.9 2.3114931 2.311700 0.0002069 0.009 The charts show that for $\\alpha = 1.7$, the error is practically negligible. 
This example confirms that the predictor-corrector formula proposed in the thesis has very high accuracy for $\\alpha \\in (0, 2)$.\n2.4.2 Example 2 $$D_*^\\alpha x(t) = x^3 \\sin t - t x^2 + t^2 x - t^3, \\quad \\alpha \\in (0, 1), \\quad x \\in (0, 1)$$\nInitial conditions: $x(0) = 0,\\ x\u0026rsquo;(0) = 0$\nSince it is difficult to obtain an analytical solution for Abel-type fractional differential equations, the thesis compares with numerical results from other algorithms in the literature to evaluate the predictor-corrector method\u0026rsquo;s accuracy.\nFirst, we use the weight function analysis to roughly determine the minimum required $T$. Taking step size $h = 1/5000$, so $n = 1/h = 5000$, we examine the weight function images of the prediction and correction coefficients:\nThe figures show that when the step count exceeds 4000, the weight function values of both prediction and correction coefficients increase significantly. We can therefore roughly determine that the number of steps to preserve is $5000 - 4000 = 1000$, meaning the preserved integral interval $T$ should be at least 0.2.\nWe first examine the accuracy of the original predictor-corrector algorithm without short memory, on the domain $(0, 1)$, using the Proposed method\u0026rsquo;s numerical results as the reference.\nPredictor-corrector method vs. other methods at various points:\n$t$ Proposed method Predictor-corrector Absolute error Relative error% 0.2 -0.00074434 -0.00074432 0.00000002 -0.00269 0.4 -0.010522 -0.010521 0.00000100 -0.00950 0.6 -0.05108 -0.05107 0.00001000 -0.01958 0.8 -0.16592 -0.16583 0.00009000 -0.05424 1.0 -0.46934 -0.46864 0.00070000 -0.14915 The charts show that the predictor-corrector method without short memory still maintains excellent accuracy compared to other methods. 
We now introduce the short memory principle into the predictor-corrector method, setting the preserved integral interval length to $T$, and observe the numerical solutions for different values of $T$.\nShort memory results: $T = 0.2$ to $T = 1.0$, showing $x(1)$, absolute error, relative error%, and runtime/s\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 -0.29587 0.17347 36.96 11.46 0.3 -0.37486 0.09448 20.13 24.04 0.4 -0.42228 0.04706 10.03 24.78 0.5 -0.44837 0.02097 4.47 27.73 0.6 -0.46139 0.00795 1.69 37.51 0.7 -0.467 0.00234 0.50 39.22 0.8 -0.46891 0.00043 0.09 41.00 0.9 -0.46927 0.00007 0.01 43.39 1.0 (no truncation) -0.46864 0.00070 0.15 45.21 The charts show that for $T \u0026lt; 0.3$, the solution quality is poor, which is consistent with the weight function analysis. When $T = 0.8$, the error is already small enough, so we focus on the range $T = 0.4$ to $0.8$.\nAdditionally, testing reveals that if only a single step length is preserved for the prediction equation, the resulting accuracy is essentially unaffected. Therefore, we can greatly reduce computation. 
This is defined as the improved predictor-corrector method.\nImproved method results: $T = 0.4$ to $T = 1.0$\n$T$ $x(1)=$ Absolute error Relative error Runtime/s 0.4 -0.42169 0.04765 10.15 11.76 0.45 -0.43678 0.03256 6.94 13.01 0.5 -0.44771 0.02163 4.61 13.07 0.55 -0.45543 0.01391 2.96 13.36 0.6 -0.4607 0.00864 1.84 14.01 0.65 -0.46416 0.00518 1.10 15.03 0.7 -0.46632 0.00302 0.64 18.18 0.75 -0.46757 0.00177 0.38 19.15 0.8 -0.46822 0.00112 0.24 19.16 1.0 (no truncation) -0.46864 0.00070 0.15 21.05 Optimal $T$ selection: optimal $T$ for given allowable errors\nAllowable relative error/% Optimal $T$ Actual relative error/% Runtime/s Runtime without truncation/s 10 0.4 10.15 11.76 45.21 5 0.5 4.61 13.07 45.21 2 0.6 1.84 14.01 45.21 1 0.65 1.10 15.03 45.21 The charts confirm that after introducing and continuously improving the short memory principle, computation time is more than halved while keeping accuracy within the allowable range.\n2.4.3 Example 3 $$D_*^\\alpha x(t) = \\dfrac{2}{\\Gamma(3-\\alpha)} t^{2-\\alpha} - x(t) + t^2 - t$$\nInitial conditions: $x(0) = 0,\\ x\u0026rsquo;(0) = -1$\nExact solution: $x(t) = t^2 - t$\nUsing weight function analysis, we first observe the prediction and correction coefficient images for $\\alpha = 1.5$. It is found that both coefficients decrease gradually as the independent variable increases, making it impossible to determine the preserved integral length $T$ from the coefficient images alone. Therefore, we adopt a practical truncation test to determine $T$ step by step. First, we verify the accuracy of the predictor-corrector method without short memory.\n$t = 10$ to $t = 50$, exact vs. 
numerical solutions:\n$t$ Exact solution Numerical solution Absolute error Relative error% Runtime/s 10 90 90.1896 0.1896 0.2107 21.3 20 380 380.1303 0.1303 0.0343 — 30 870 870.1082 0.1082 0.0124 — 40 1560 1560.0952 0.0952 0.0061 — 50 2450 2450.0865 0.0865 0.0035 — The predictor-corrector numerical solutions closely approximate the exact solutions, confirming that the predictor-corrector method is applicable to this example.\nNext, we examine the preserved integral length $T$. At $t = 50$, the exact value is $x(50) = 2450$, with fractional order $\\alpha = 1.5$ and step size $h = 1/80$. Let $T$ take values 10, 20, 30, 40, 50 respectively (when $T = 50$, no truncation is applied). The resulting function graphs are:\nShort memory method: $T = 10$ to $T = 50$, $x(50)$ and errors\n$T$ $x(50)$ Absolute error Relative error% Runtime/s 10 2411.74 38.26 1.562 3.96 20 2419.30 30.70 1.253 6.91 30 2434.18 15.82 0.646 9.04 40 2434.43 15.57 0.636 10.05 50 2450.09 0.09 0.004 10.83 The data shows that when $T = 10$, the function graph still exhibits some oscillation. When $T \\ge 20$, the graph stabilizes and the error is sufficiently small. Examining the $T = 10$ case more closely:\nThe charts show that oscillation improves after $T = 16$. We now focus on the range $T = 16$ to $20$:\nDetailed results for $T = 16$ to $T = 20$\n$T$ $x(50)$ Absolute error Relative error% Runtime/s 16 2364.49 85.51 3.490 5.82 17 2418.09 31.91 1.302 5.85 18 2416.85 33.15 1.353 6.70 19 2415.32 34.68 1.416 9.64 20 2419.30 30.70 1.253 13.69 The tables show that $T = 17$ provides acceptable accuracy, so this memory length is adopted.\nComparison: truncated short memory $T = 17$ vs. non-truncated 
$T = 50$\n$T$ $x(50)$ Absolute error Relative error% Runtime/s 17 (truncated) 2418.09 31.91 1.302 3.8 50 (not truncated) 2450.0865 0.0865 0.004 21.3 This example also shows that at $T = 17$, the short memory principle can significantly reduce computation time.\nThe predictor-corrector numerical solutions closely approximate the exact solutions. For the short memory method: when $T = 10$, the function graph still shows obvious oscillation; when $T \\ge 20$, the graph stabilizes and error is sufficiently small. Examining more closely the range $T = 16$ to $20$, the oscillation clearly improves after $T = 17$. Therefore, subject to function convergence, an integral length $T \\ge 17$ can be adopted.\n3. Extension to Variable-Order Fractional Systems 3.1 Algorithm In variable-order fractional differential equations, the constant order $\\alpha$ is extended to a time function $\\alpha(t)$, i.e.:\n$$ D_t^{\\alpha(t)} x(t) = f(t, x(t)), \\quad x(0) = x_0 $$\nAt this point, each historical node $t_j$ corresponds to a different order $\\alpha(t_j)$, and the weight coefficients must be recalculated at every step. The algorithm is identical to the constant-order case—just replace $\\alpha$ with the time-varying function $\\alpha(t_j)$ in the loop. 
When the short memory principle is introduced, we preserve the time-varying weights of the most recent $T$ interval.\n3.2 Numerical Examples 3.2.1 Verification of the Predictor-Corrector Method First, without short memory, we verify whether the predictor-corrector method remains effective for variable-order systems.\nExample I ($\\alpha(t) = t \\cos t$):\n$$ D_t^{\\alpha(t)} x(t) + x(t) + \\sqrt{t} x^2(t) = f(t), \\quad x(0) = x\u0026rsquo;(0) = 0 $$\nwhere $f(t) = t^a \\left(1 + t^{\\frac{1}{2}+a} + \\dfrac{\\Gamma(1+a) t^{-\\alpha(t)}}{\\Gamma(1+a-\\alpha(t))} \\right), a = 1.2$, and the exact solution is $x(t) = t^a$.\nTaking step sizes $h = 1/80,\\ 1/160,\\ 1/320$, the numerical and exact solutions are compared:\n$t$ Exact solution $h=1/80$ numerical $h=1/80$ rel. error% $h=1/160$ numerical $h=1/160$ rel. error% $h=1/320$ numerical $h=1/320$ rel. error% 0.1 0.0631 0.07352 16.513 0.06644 5.293 0.06433 1.946 0.2 0.1450 0.14774 1.918 0.14585 0.614 0.14526 0.207 0.4 0.3330 0.33322 0.060 0.33300 -0.006 0.33297 0.015 0.6 0.5417 0.54150 -0.042 0.54157 -0.030 0.54164 0.017 0.8 0.7651 0.76495 -0.017 0.76498 -0.013 0.76502 0.008 1.0 1.0000 1.00040 0.040 1.00010 0.010 1.00000 0.000 Smaller step sizes yield higher precision. This confirms that the predictor-corrector method is equally effective for variable-order fractional systems when $\\alpha \\in (0, 1)$.\nExample II ($\\alpha(t) = t/2$):\n$$ D_t^{\\alpha(t)} x(t) = \\frac{3 t^{1-\\alpha(t)}}{\\Gamma(2-\\alpha(t))} + \\frac{2 t^{2-\\alpha(t)}}{\\Gamma(3-\\alpha(t))}, \\quad x(0) = x\u0026rsquo;(0) = 0 $$\nExact solution: $x(t) = t^2 + 3t$.\n$t$ Exact solution $h=1/80$ numerical $h=1/80$ rel. error% $h=1/160$ numerical $h=1/160$ rel. error% $h=1/320$ numerical $h=1/320$ rel. 
error% 0.2 0.6400 0.63993 0.011 0.63998 0.003 0.63999 0.002 0.4 1.3600 1.35970 0.022 1.35990 0.007 1.36000 0.000 0.6 2.1600 2.15940 0.028 2.15980 0.009 2.15990 0.005 0.8 3.0400 3.03910 0.030 3.03970 0.010 3.03990 0.003 1.0 4.0000 3.99920 0.020 3.99970 0.008 3.99990 0.003 The conclusion is the same as in Example I: reducing step size monotonically improves accuracy.\n3.2.2 Verification of the Short Memory Principle In the constant-order analysis, we already know that when $\\alpha \u0026lt; 1$, the prediction and correction weight functions increase nearly linearly in the tail region, and the closer $\\alpha$ is to zero, the smaller the early weights. This means the effectiveness of short memory truncation is directly related to $\\alpha$. The following variable-order examples verify: the smaller $\\alpha$, the shorter the required preserved integral length $T$.\nExample I ($\\alpha(t) = t \\cos t$), taking $h = 1/5000$, progressively truncating the integral interval:\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 0.7930 0.2071 20.71 9.09 0.3 0.8597 0.1403 14.03 12.37 0.4 0.9006 0.0994 9.94 15.29 0.5 0.9498 0.0502 5.02 17.87 0.6 0.9554 0.0446 4.46 20.30 0.7 0.9673 0.0327 3.27 22.45 0.8 0.9814 0.0186 1.86 23.70 0.9 0.9915 0.0085 0.85 23.89 1.0 (no truncation) 1.0051 0.0051 0.51 24.03 Testing reveals that preserving only a single step length for the prediction equation has almost no impact on accuracy. The improved results are:\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 0.7984 0.2016 20.16 4.84 0.3 0.8653 0.1347 13.47 6.65 0.4 0.9064 0.0936 9.36 8.46 0.5 0.9340 0.0661 6.61 9.97 0.6 0.9554 0.0446 4.46 11.36 0.7 0.9732 0.0268 2.68 12.55 0.8 0.9878 0.0123 1.23 13.18 0.9 0.9989 0.0011 0.11 13.29 1.0 (no truncation) 1.0053 0.0053 0.53 13.48 Within the allowable error range, the improved method maintains high accuracy across $T = 0.4$ to $0.8$.\nAllowable rel. error/% Optimal $T$ Actual rel. 
error/% Runtime/s Full-interval runtime/s 10 0.4 9.36 8.46 29.38 5 0.6 4.46 11.36 29.38 2 0.8 1.23 13.18 29.38 In this example, $\\alpha(t) = t \\cos t \\in [0, 0.5611]$. If we narrow the order range to $[0, 0.5101]$ (i.e., $\\alpha(t) = \\dfrac{t \\cos t}{1.1}$), we can observe whether the integral length can be further shortened:\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 0.8301 0.1700 17.00 5.59 0.3 0.8888 0.1112 11.12 13.76 0.4 0.9248 0.0752 7.52 16.15 0.5 0.9490 0.0511 5.11 18.17 0.6 0.9676 0.0324 3.24 22.73 0.7 0.9827 0.0173 1.73 23.31 0.8 0.9946 0.0054 0.54 26.87 0.9 1.0035 0.0035 0.35 27.85 1.0 (no truncation) 1.0082 0.0082 0.82 28.11 For the same $T$, relative error is noticeably reduced, confirming the conjecture that a smaller $\\alpha$ requires a shorter integral length.\nExample II ($\\alpha(t) = t/2$), taking $h = 1/5000$, exact solution $x(t) = t^2 + 3t$:\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 2.3370 1.6630 41.58 9.71 0.3 2.7806 1.2194 30.49 24.09 0.4 3.1170 0.8830 22.08 30.12 0.5 3.3802 0.6198 15.50 30.85 0.6 3.5877 0.4123 10.31 35.59 0.7 3.7495 0.2505 6.26 35.72 0.8 3.8722 0.1278 3.20 38.89 0.9 3.9576 0.0424 1.06 38.98 1.0 (no truncation) 4.0000 0.0000 0.00 39.82 The improved prediction formula yields identical accuracy with significantly reduced computation:\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 2.3370 1.6630 41.58 4.36 0.3 2.7806 1.2194 30.49 10.00 0.4 3.1170 0.8830 22.08 10.09 0.5 3.3802 0.6198 15.50 10.97 0.6 3.5877 0.4123 10.31 19.20 0.7 3.7495 0.2505 6.26 21.53 0.8 3.8722 0.1278 3.20 23.38 0.9 3.9576 0.0424 1.06 23.48 1.0 (no truncation) 4.0000 0.0000 0.00 25.08 Allowable rel. error/% Optimal $T$ Actual rel. 
error/% Runtime/s Full-interval runtime/s 10 0.6 10.31 19.20 54.06 5 0.8 3.20 23.38 54.06 2 0.9 1.06 23.48 54.06 The improved method more than halves computation time, demonstrating the practical value of the short memory principle.\nWe further narrow the order range from $\\alpha \\in (0, 0.5)$ to $\\alpha \\in (0, \\frac{1}{3})$ (i.e., $\\alpha(t) = t/3$) to verify the same pattern:\n$T$ $x(1)$ Absolute error Relative error% Runtime/s 0.2 2.9046 1.0954 27.39 4.62 0.3 3.2430 0.7570 18.93 10.64 0.4 3.4803 0.5197 12.99 12.86 0.5 3.6539 0.3461 8.65 13.15 0.6 3.7822 0.2178 5.45 15.49 0.7 3.8758 0.1242 3.11 16.52 0.8 3.9416 0.0584 1.46 17.79 0.9 3.9829 0.0171 0.43 19.09 1.0 (no truncation) 4.0000 0.0000 0.00 20.51 The results again confirm: the smaller $\\alpha$, the smaller the relative error at the same $T$, and the better the short memory effect.\n4. Key Conclusions and Insights This thesis combines the short memory principle with the predictor-corrector method to solve variable-order fractional systems. The main conclusions are:\nThe predictor-corrector method is effective for both constant-order and variable-order fractional systems: when $\\alpha \\in (0, 1)$, the method maintains high accuracy regardless of whether the order is constant or time-varying.\nShort memory truncation can significantly reduce computation: after introducing the short memory principle, computation time is greatly reduced while accuracy is maintained, and the improved predictor-corrector method more than halves the computation time.\nThe qualitative relationship between $\\alpha$ and $T$: this is the most important finding of the entire thesis. The smaller $\\alpha$, the shorter the integral length $T$ that needs to be preserved. This means that when $\\alpha$ is small, only a short memory interval is needed to achieve the required accuracy, and when $\\alpha$ is larger, the required memory interval is longer but still shorter than the full integration history. 
This conclusion provides practical guidance for selecting $T$ under different values of $\\alpha$.\nThe short memory strategy is feasible for variable-order systems: by recalculating weight coefficients at each step according to $\\alpha(t_j)$, the constant-order predictor-corrector scheme can be extended to variable-order systems, and the short memory strategy remains effective.\nThese results show that the short memory method is not a universal truncation trick, but a numerical strategy that depends on the fractional order $\\alpha$, error tolerance, and problem structure. Its core value lies in providing an analyzable, experimentally verifiable, and extensible framework for variable-order fractional systems.\n5. References [1] Miller, K. S., \u0026amp; Ross, B. (1993). An Introduction to the Fractional Calculus and Fractional Differential Equations. Wiley.\n[2] Diethelm, K. (2002). A Predictor-Corrector Approach for the Numerical Solution of Fractional Differential Equations. Nonlinear Dynamics, 29(1-4), 3-22.\n[3] Xu, Y., \u0026amp; He, Z. (2011). The short memory principle for solving Abel differential equation of fractional order. Journal of Computational and Applied Mathematics, 5-6.\n[4] Parand, K., \u0026amp; Nikarya, M. (2015). New numerical method based on Generalized Bessel function to solve nonlinear Abel fractional differential equation of the first kind. Nonlinear Engineering, 5-7.\n[5] Yousefi, F. S., Ordokhani, Y., \u0026amp; Yousefi, S. (2020). Numerical solution of variable order fractional differential equations by using shifted Legendre cardinal functions and Riez method. Engineering with Computers, 6.\n[6] Patnaik, S., \u0026amp; Semperlotti, F. (2020). Application of variable and distributed order fractional operators to the dynamic analysis of nonlinear oscillators. Nonlinear Dynamics, 1-2.\n[7] Sun, H. G., Chen, W., Wei, H., \u0026amp; Chen, Y. Q. (2011). 
A comparative study of constant-order and variable-order fractional models in characterizing memory property of systems. The European Physical Journal Special Topics, 2011.\n[8] Ma, C. Y., Shiri, B., Wu, G. C., \u0026amp; Baleanu, D. (2018). New fractional signal smoothing equations with short memory and variable order. Signal Processing, 2018.\n[9] Wu, F., Gao, R., Liu, J., \u0026amp; Li, C. (2020). New fractional variable-order creep model with short memory. Applied Mathematical Modelling, 2020.\n","date":"2021-07-01T21:00:00+08:00","image":"/img/short-memory-cover.jpg","permalink":"/en/p/the-research-on-the-short-memory-principle-of-fractional-order-systems/","title":"The Research on the Short Memory Principle of Fractional-Order Systems"},{"content":"\nNotes on This Competition Record and Writing\nThis article is a complete documentation of the solutions for Problem C of the 2020 National College Student Mathematical Contest in Modeling — \u0026ldquo;Credit Decision Model for Small and Medium Enterprises in the Banking Industry\u0026rdquo;. In this competition, the author\u0026rsquo;s responsibilities included: data cleaning and preprocessing, credit strategy development (covering the lending decision model for Problem 1 and the risk evaluation model), and the final report writing and blog-style organization. The decision tree model construction and training was completed by a teammate; this article only introduces the training approach and methodology, without further detailing the procedural steps.\nWriting up the problem-solving process as a blog post serves two purposes: first, to systematically organize the modeling ideas for future review and improvement; second, to provide a reference case for students who are equally interested in bank credit risk modeling.\n1. 
Introduction 1.1 Problem Background For a long time, China\u0026rsquo;s banking industry has struggled with lending to small and medium enterprises (SMEs), with the core issue being the high credit cost and risk associated with SMEs. Specifically, SMEs are small in scale and lack fixed-asset collateral, making it difficult for banks to directly assess their credit risk. Therefore, banks\u0026rsquo; credit decisions for SMEs mainly rely on the enterprises\u0026rsquo; transaction invoice information and the influence of their upstream and downstream enterprises.\nAfter conducting a reasonable credit risk assessment of SMEs, banks need to formulate a complete credit strategy based on the assessment results, covering dimensions such as whether to grant a loan, loan amount, interest rate, and term. A sound evaluation system is of great significance to banks\u0026rsquo; lending decisions.\n1.2 Problem Description This problem provides three attachments: Attachment 1 contains data on 123 enterprises with credit records, Attachment 2 contains data on 302 enterprises without credit records, and Attachment 3 provides 2019 statistical data on the relationship between bank loan annual interest rates and customer churn rates. The task requires completing the following three problems:\nProblem 1: Combine the relevant information of enterprises with credit records to conduct a quantitative analysis of enterprise credit risk, and propose a credit strategy for enterprises under a fixed annual total credit amount. Problem 2: Conduct a quantitative analysis of credit risk for enterprises without credit records based on their information, and propose a credit strategy for enterprises under an annual total credit amount of 100 million yuan. Problem 3: Comprehensively consider the impact of various sudden factors on enterprises, and propose an optimization plan for the credit strategy in Problem 2. 2. 
Preliminaries: Core Algorithm Principles This chapter introduces the algorithm principles that will be directly referenced in subsequent chapters, mainly covering the Analytic Hierarchy Process, fuzzy mathematical evaluation, TOPSIS method, decision tree, and stress testing method.\n2.1 Analytic Hierarchy Process (AHP) 1. Reasons for Introducing the Algorithm\nWhen determining the influence weights of each indicator on the decision goal, common issues arise such as the difficulty in quantifying weights and the internal hidden contradictions among weights caused by subjective factors. The Analytic Hierarchy Process (AHP), proposed by American operations researcher Saaty in the 1970s, fundamentally decomposes the related properties of the evaluation object into a target layer, criterion layer, and alternative layer. It performs quantitative and qualitative analysis on fuzzy problems that are difficult to analyze fully quantitatively, to obtain the weight proportion of each layer relative to the highest layer, thereby optimizing the evaluation scheme.\n2. 1-9 Scale Method and Judgment Matrix Construction\nIndicators are compared pairwise using the 1-9 scale method to construct a judgment matrix $A = (a_{ij})$. Here,\n$$a_{ij} = \\frac{\\text{Importance of the } i \\text{ th indicator}}{\\text{Importance of the } j \\text{ th indicator}}$$\nFurther written as:\n$$A = \\begin{pmatrix} a_{11} \u0026amp; a_{12} \u0026amp; \\cdots \u0026amp; a_{1n} \\ a_{21} \u0026amp; a_{22} \u0026amp; \\cdots \u0026amp; a_{2n} \\ \\vdots \u0026amp; \\vdots \u0026amp; \\ddots \u0026amp; \\vdots \\ a_{n1} \u0026amp; a_{n2} \u0026amp; \\cdots \u0026amp; a_{nn} \\end{pmatrix}$$\n3. 
Weight Computation: Three Methods\nArithmetic mean method:\n$$\\omega_i = \\frac{1}{n} \\sum_{j=1}^{n} \\frac{a_{ij}}{\\sum_{k=1}^{n} a_{kj}}, \\quad i = 1,2,\\cdots,n$$\nGeometric mean method:\n$$\\omega_i = \\frac{\\left(\\prod_{j=1}^{n} a_{ij}\\right)^{1/n}}{\\sum_{k=1}^{n} \\left(\\prod_{j=1}^{n} a_{kj}\\right)^{1/n}}, \\quad i = 1,2,\\cdots,n$$\nEigenvalue method: Find the maximum eigenvalue $\\lambda_{\\max}$ of the judgment matrix and its corresponding eigenvector, and normalize the eigenvector to obtain the weights.\n4. Consistency Check\nThe consistency index is defined as:\n$$CI = \\frac{\\lambda_{\\max}-n}{n-1}$$\nThe consistency ratio is defined as:\n$$CR = \\frac{CI}{RI}$$\nWhen $CR \u0026lt; 0.10$, the judgment matrix passes the consistency check; otherwise, the judgment matrix needs to be adjusted.\n2.2 Fuzzy Analytic Hierarchy Process (F-AHP) 1. Algorithm Idea\nThe Fuzzy Analytic Hierarchy Process introduces fuzzy mathematical thinking on the basis of traditional AHP to handle the fuzziness of qualitative indicators. By establishing a fuzzy judgment matrix and fuzzy membership functions, qualitative evaluations are transformed into quantitative scores.\n2. 
Seven-Step Fuzzy Comprehensive Evaluation Model\nTaking the supply-demand relationship stability evaluation as an example, the complete process is as follows:\nStep 1: Determine the factor set\n$$U = \\{u_1(\\text{stable input customer proportion}), u_2(\\text{stable output customer proportion}), u_3(\\text{variance of average quarterly transaction count})\\}$$\nStep 2: Determine the evaluation set\n$$V = \\{v_1(\\text{good}), v_2(\\text{fairly good}), v_3(\\text{moderate}), v_4(\\text{poor})\\}$$\nStep 3: Determine the factor weights\n$$A = (a_1, a_2, a_3)$$\nSteps 4 to 6: Construct membership functions and form the fuzzy comprehensive judgment matrix\n$$R = \\begin{pmatrix} r_{11} \u0026amp; r_{12} \u0026amp; r_{13} \u0026amp; r_{14} \\\\ r_{21} \u0026amp; r_{22} \u0026amp; r_{23} \u0026amp; r_{24} \\\\ r_{31} \u0026amp; r_{32} \u0026amp; r_{33} \u0026amp; r_{34} \\end{pmatrix}$$\nStep 7: Comprehensive evaluation\n$$B = A \\cdot R = (b_1,b_2,b_3,b_4)$$\n2.3 TOPSIS Method 1. Algorithm Principle\nTOPSIS (Technique for Order Preference by Similarity to an Ideal Solution) is a multi-attribute decision-making method that ranks alternatives by computing their distances from positive and negative ideal solutions.\n2. Complete Calculation Process\nNormalization of raw data:\n$$z_{ij} = \\frac{x_{ij}}{\\sqrt{\\sum_{i=1}^{n} x_{ij}^2}}, \\quad i = 1,2,\\cdots,n;\\ j = 1,2,\\cdots,m$$\nPositive and negative ideal solutions:\n$$Z^+ = (\\max_i z_{i1},\\max_i z_{i2},\\cdots,\\max_i z_{im})$$\n$$Z^- = (\\min_i z_{i1},\\min_i z_{i2},\\cdots,\\min_i z_{im})$$\nDistance computation:\n$$D_i^+ = \\sqrt{\\sum_{j=1}^{m}(Z_j^+ - z_{ij})^2}$$\n$$D_i^- = \\sqrt{\\sum_{j=1}^{m}(Z_j^- - z_{ij})^2}$$\nRelative closeness:\n$$S_i = \\frac{D_i^-}{D_i^+ + D_i^-}, \\quad S_i \\in [0,1]$$\n2.4 Decision Tree 1. Algorithm Overview\nA decision tree is a predictive model with a tree structure, representing a mapping between object attributes and object values. 
It consists of internal nodes and leaf nodes and is suitable for classification and regression problems.\n2. Classification Criteria\nEntropy:\n$$Entropy(D) = -\\sum_{k=1}^{n} p_k \\log_2 p_k$$\nInformation gain:\n$$Gain(D,a) = Entropy(D) - \\sum_{v=1}^{V} \\frac{|D^v|}{|D|} Entropy(D^v)$$\nGini index:\n$$Gini(D) = 1 - \\sum_{k=1}^{n} p_k^2$$\nHere $p_k$ is the proportion of samples in $D$ that belong to class $k$, and $n$ is the number of classes.\n3. Tree Building Steps\nTreat all samples as a root node. Iterate through each candidate variable\u0026rsquo;s split method and select the optimal split. Recursively split nodes until node purity is sufficiently high or stop conditions are met. 4. Model Evaluation Metrics\nAccuracy, recall, precision, and F1 score are defined as:\n$$ACC = \\frac{TP + TN}{TP + TN + FP + FN}, \\quad REC = \\frac{TP}{TP + FN}$$\n$$PRE = \\frac{TP}{TP + FP}, \\quad F1 = \\frac{2 \\times PRE \\times REC}{PRE + REC}$$\n2.5 Stress Testing Method 1. Methodology\nStress testing is used to evaluate the risk tolerance of financial institutions under extremely unfavorable conditions. In the field of bank credit, it is mainly used to test the impact of sudden factors on enterprise operations.\n2. Scenario Testing Process\nSet stress scenarios, such as sudden epidemics, economic recessions, policy changes, etc. Identify key impact variables, such as profit growth rate, return rate, etc. Simulate the transmission path of impact factors. Recalculate risk evaluation values and credit strategies. 3. Problem 1: Credit Strategies for 123 Enterprises with Credit Records 3.1 Enterprise Portrait: Three Core Dimensions We analyze the 123 enterprises with credit records in depth and profile each enterprise along three dimensions:\nEnterprise strength Net profit margin: the percentage of net profit to invested capital, comprehensively reflecting business efficiency. Net profit growth rate: the magnitude of net profit growth between two time periods, reflecting business performance. 
Output return ratio: the proportion of negative invoice counts to total business counts, reflecting a negative indicator. Enterprise credit Credit rating: a direct representation of the enterprise\u0026rsquo;s credit assessment result (A/B/C/D four levels). Default history: whether there is a default record, directly affecting enterprise credit. Supply-demand relationship stability Enterprise stable input customer proportion. Enterprise stable output customer proportion. Variance of average quarterly transaction count. 3.2 Whether to Lend: Bank Lending Decision Model 3.2.1 Model Structure and Algorithm Flow Whether to lend is essentially a multi-attribute comprehensive evaluation problem. For banks, relying solely on a single financial indicator is insufficient to characterize the true creditworthiness of SMEs. Therefore, this article decomposes the enterprise\u0026rsquo;s lending capacity into two primary dimensions: first, enterprise strength, which reflects business quality, and second, supply-demand relationship stability, which reflects the sustainability of business relationships. The former is mainly composed of net profit margin, net profit growth rate, and output return ratio, while the latter is measured through stable customer proportion and transaction volatility.\nSince most indicators in enterprise strength are quantitative, while supply-demand relationship stability contains obvious fuzzy evaluation components, this model does not simply use the fuzzy analytic hierarchy process (F-AHP) alone. 
Instead, it adopts a comprehensive solution framework of \u0026ldquo;AHP Weighting + TOPSIS Quantification + F-AHP Fuzzy Comprehensive Evaluation + Time-Weighted Aggregation.\u0026rdquo; This framework preserves the interpretability of AHP in weight expression while also taking into account the objectivity of TOPSIS in multi-attribute ranking and the F-AHP\u0026rsquo;s ability to characterize qualitative features.\nThe hierarchical structure of the model is shown in the figure below:\nThe overall calculation flow is shown in the figure below:\nInput: Enterprise indicator data │ ├─ Step 1: Construct judgment matrices for the target layer, criterion layer, and time layer │ ├─ Step 2: Use AHP to compute primary weights, secondary weights, and time weights, and perform consistency checks │ ├─ Step 3: Use TOPSIS to compute comprehensive scores for the enterprise strength component │ ├─ Step 4: Use F-AHP fuzzy comprehensive evaluation to compute grade scores for the supply-demand relationship stability component │ ├─ Step 5: Weighted synthesis of the two components within the same period to obtain single-period lending capacity evaluation value │ └─ Step 6: Cross-period aggregation using time weights to obtain the final lending evaluation value Let the three core indicator scores for enterprise strength in time band $i$ be $f_{11}(t_i)$, $f_{12}(t_i)$, and $f_{13}(t_i)$, whose meanings correspond to net profit margin, net profit growth rate, and output return ratio in order. The comprehensive enterprise strength evaluation value can be written as:\n$$f_1(t_i) = \\omega_{11} f_{11}(t_i) + \\omega_{12} f_{12}(t_i) + \\omega_{13} f_{13}(t_i)$$\nWhere $\\omega_{11}$, $\\omega_{12}$, and $\\omega_{13}$ respectively represent the indicator weights of net profit margin, net profit growth rate, and output return ratio within the enterprise strength dimension. 
The scores of these three underlying indicators are not directly taken from the original values; instead, they are first computed using the TOPSIS analysis method based on the raw indicator values.\nFor supply-demand relationship stability, denote its comprehensive score in time band $i$ as $f_2(t_i)$. Then the enterprise\u0026rsquo;s lending capacity evaluation value in that time band is:\n$$LEND(t_i) = \\omega_{f_1} f_1(t_i) + \\omega_{f_2} f_2(t_i)$$\nWhere $\\omega_{f_1}$ and $\\omega_{f_2}$ respectively represent the weights of \u0026ldquo;enterprise strength\u0026rdquo; and \u0026ldquo;supply-demand relationship stability\u0026rdquo; in the criterion layer.\nFurthermore, weighted aggregation over the three time bands yields the enterprise\u0026rsquo;s final comprehensive lending capacity score:\n$$LEND = \\sum_i \\omega_i \\cdot LEND(t_i)$$\nWhere $\\omega_i$ is the weight of time band $i$, satisfying $\\sum_i \\omega_i = 1$. This expression embodies the core idea of this article: the lending capacity of the same enterprise is not a static result at a single point in time, but a weighted synthesis of multiple periods of business performance and supply-demand stability.\n3.2.2 Weight Quantification Based on Analytic Hierarchy Process To make the indicator weights at each layer of the lending decision model clearer, this section elaborates in three layers: \u0026ldquo;primary criterion layer weights,\u0026rdquo; \u0026ldquo;internal weights of enterprise strength,\u0026rdquo; and \u0026ldquo;handling of missing indicator scenarios.\u0026rdquo;\n1. Primary Criterion Layer Weights\nIn the criterion layer, \u0026ldquo;enterprise strength\u0026rdquo; and \u0026ldquo;supply-demand relationship stability\u0026rdquo; are compared pairwise. Based on business understanding, supply-demand relationship stability better reflects the sustained operational reliability of SMEs, so it is assigned a slightly higher weight than enterprise strength. 
The corresponding judgment matrix is:\n$$A = \\begin{pmatrix} 1 \u0026amp; 1/2 \\ 2 \u0026amp; 1 \\end{pmatrix}$$\nWhen computing weights from the judgment matrix $A$, this article uses all three methods—arithmetic mean, geometric mean, and eigenvalue—for cross-validation:\nArithmetic mean method: $$\\omega_i^{(1)} = \\frac{1}{n} \\sum_{j=1}^{n} \\frac{a_{ij}}{\\sum_{k=1}^{n} a_{kj}}$$\nGeometric mean method: $$\\omega_i^{(2)} = \\frac{\\left(\\prod_{j=1}^{n} a_{ij}\\right)^{1/n}}{\\sum_{k=1}^{n}\\left(\\prod_{j=1}^{n} a_{kj}\\right)^{1/n}}$$\nEigenvalue method: $$A \\boldsymbol{\\omega}^{(3)} = \\lambda_{\\max} \\boldsymbol{\\omega}^{(3)}$$\nAfter finding the dominant eigenvector, normalize it to obtain the corresponding weights.\nAfter synthesizing the three methods, the final weights of the primary criterion layer are:\n$$\\boldsymbol{\\omega} = (\\omega_1, \\omega_2) = (0.3333, 0.6667)$$\nThat is, the enterprise strength weight is approximately $1/3$, and the supply-demand relationship stability weight is approximately $2/3$.\nFor the consistency check, the basic AHP expressions are:\n$$CI = \\frac{\\lambda_{\\max} - n}{n - 1}$$\n$$CR = \\frac{CI}{RI}$$\nSince the judgment matrix here is a second-order matrix, when $n=2$, $\\lambda_{\\max}=2$, thus $CI=0$. 
Therefore, this matrix naturally satisfies the consistency requirement.\nAs an implementation reference, the Matlab code for computing weights using the eigenvalue method and completing the consistency check in the analytic hierarchy process is:\n%% Analytic Hierarchy Process Consistency Check and Weight Computation % Input judgment matrix A [n,n] = size(A); % Eigenvalue method for weight computation [V,D] = eig(A); Max_eig = max(max(D)); [r,c] = find(D == Max_eig, 1); disp(\u0026#39;Eigenvalue method weight results:\u0026#39;); disp(V(:,c) ./ sum(V(:,c))) % Consistency check CI = (Max_eig - n) / (n - 1); RI = [0 0.0001 0.52 0.89 1.12 1.26 1.36 1.41 1.46 1.49 1.52 1.54 1.56 1.58 1.59]; % When n=2, it must be a consistent matrix, so CI = 0. % To avoid division by zero, replace the second element with a number very close to 0. CR = CI / RI(n); disp(\u0026#39;Consistency index CI=\u0026#39;); disp(CI); disp(\u0026#39;Consistency ratio CR=\u0026#39;); disp(CR); if CR \u0026lt; 0.10 disp(\u0026#39;CR\u0026lt;0.10, the consistency of judgment matrix A is acceptable!\u0026#39;); else disp(\u0026#39;Judgment matrix A needs to be modified!\u0026#39;); end 2. Internal Weights of Enterprise Strength\nWithin the enterprise strength dimension, the relative weights of net profit margin, net profit growth rate, and output return ratio also need to be further determined. 
The corresponding judgment matrix is:\n$$A_s = \\begin{pmatrix} 1 \u0026amp; 1/2 \u0026amp; 3 \\ 2 \u0026amp; 1 \u0026amp; 3 \\ 1/3 \u0026amp; 1/3 \u0026amp; 1 \\end{pmatrix}$$\nIts business meaning is: net profit growth rate is slightly higher than net profit margin, both are significantly higher than the output return ratio, but the return ratio still retains some weight because it represents a negative constraint in business quality.\nWhen solving, first normalize the judgment matrix column by column:\n$$r_{ij} = \\frac{a_{ij}}{\\sum_{k=1}^{n} a_{kj}}$$\nThen average by row to obtain the approximate weight vector. The final result is:\n$$\\boldsymbol{\\omega}^{(s)} = (0.3108, 0.4934, 0.1958)$$\nThat is, the weights for net profit margin, net profit growth rate, and output return ratio are 0.3108, 0.4934, and 0.1958 respectively.\nFurther using the eigenvalue method:\n$$\\lambda_{\\max} = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{(A_s \\boldsymbol{\\omega}^{(s)})_i}{\\omega_i^{(s)}}$$\nWe obtain:\n$$\\lambda_{\\max} = 3.0536$$\nThus the consistency index is:\n$$CI = \\frac{3.0536 - 3}{2} = 0.0268, \\quad CR = \\frac{0.0268}{0.58} = 0.0462 \u0026lt; 0.10$$\nThis indicates that the judgment matrix has good consistency and is acceptable.\n3. Weight Handling in Missing Indicator Scenarios\nWhen some enterprises lack net profit growth rate information, to avoid noise from forced imputation, this article switches to a two-indicator degraded model. The judgment matrix at this point is:\n$$A_s\u0026rsquo; = \\begin{pmatrix} 1 \u0026amp; 3 \\ 1/3 \u0026amp; 1 \\end{pmatrix}$$\nThe corresponding weights are:\n$$\\boldsymbol{\\omega}^{(s\u0026rsquo;)} = (0.75, 0.25)$$\nThat is, in missing indicator scenarios, the model only uses net profit margin and output return ratio for evaluation, with a higher weight given to net profit margin. 
This avoids introducing extra noise from missing values while ensuring the model maintains structural consistency and interpretability across different sample conditions.\n3.2.3 Time Dimension Weighted Derivation Enterprise credit is not a static quantity. Compared to earlier years\u0026rsquo; data, recent business performance better reflects the enterprise\u0026rsquo;s current real debt-servicing capacity. Therefore, this article introduces an additional layer of AHP weights in the time dimension, giving higher importance to newer time bands.\nThe sample time is divided into three time bands:\nTime Band Corresponding Years $t_1$ 2016-2017 $t_2$ 2018 $t_3$ 2019-2020 Under the assumption that \u0026ldquo;recent information is more important,\u0026rdquo; the time judgment matrix is constructed as:\n$$A_t = \\begin{pmatrix} 1 \u0026amp; 1/3 \u0026amp; 1/4 \\ 3 \u0026amp; 1 \u0026amp; 1/2 \\ 4 \u0026amp; 2 \u0026amp; 1 \\end{pmatrix}$$\nFollowing the judgment matrix construction, weight computation, and consistency check steps in the analytic hierarchy process, the time weight vector is:\n$$\\boldsymbol{\\eta} = (\\eta_1, \\eta_2, \\eta_3) = (0.1220, 0.3196, 0.5584)$$\nIt can be seen that the most recent time band $t_3$ has the highest weight, indicating the model focuses more on the enterprise\u0026rsquo;s latest business performance.\nIn the consistency check:\n$$\\lambda_{\\max} = 3.0183, \\quad CI = \\frac{3.0183 - 3}{2} = 0.00915$$\nTaking $RI = 0.52$, then:\n$$CR = \\frac{0.00915}{0.52} = 0.0176 \u0026lt; 0.10$$\nThis indicates that the time layer judgment matrix also passes the consistency check.\nTherefore, the final lending capacity evaluation value for a single enterprise across the three time bands is:\n$$LEND_i = 0.1220 \\cdot LEND_i(t_1) + 0.3196 \\cdot LEND_i(t_2) + 0.5584 \\cdot LEND_i(t_3)$$\nThis formula clearly embodies the modeling idea that \u0026ldquo;recent samples are more important.\u0026rdquo;\n3.2.4 Fuzzy Comprehensive Evaluation: 
Quantification of Supply-Demand Relationship Stability Supply-demand relationship stability is essentially not a single value, but a concept with fuzzy boundaries. Stable input customer proportion, stable output customer proportion, and variance of average quarterly transaction count all characterize whether the enterprise\u0026rsquo;s upstream and downstream relationships are stable from different perspectives. However, it is difficult to directly set an absolute \u0026ldquo;good\u0026rdquo; or \u0026ldquo;poor\u0026rdquo; threshold. Therefore, rather than using simple linear weighting, fuzzy comprehensive evaluation is used here to uniformly map multiple indicators to the same set of evaluation grades.\nThe factor set for evaluation is taken as:\n$$U = {u_1, u_2, u_3} = {\\text{stable input customer proportion},\\text{stable output customer proportion},\\text{variance of average quarterly transaction count}}$$\nThe evaluation set is taken as:\n$$V = {v_1, v_2, v_3, v_4} = {\\text{good}, \\text{fairly good}, \\text{moderate}, \\text{poor}}$$\nAmong them, the first two indicators respectively correspond to the stability of upstream and downstream cooperation relationships, and the third indicator reflects the volatility of transaction rhythm across quarters. The three indicators jointly determine the stability level of the enterprise\u0026rsquo;s supply-demand relationship.\nFor weight setting, the results from the analytic hierarchy process are still used. 
According to the judgment matrix calculation, the weights of the three indicators are:\n$$A = (a_1, a_2, a_3) = (0.4, 0.4, 0.2)$$\nThis means the model gives equal weights to stable input customer proportion and stable output customer proportion, and weights each of them twice as heavily as the variance of the average quarterly transaction count.\nThe corresponding judgment matrix is:\n$$ \\begin{pmatrix} 1 \u0026amp; 1 \u0026amp; 2 \\\\ 1 \u0026amp; 1 \u0026amp; 2 \\\\ 1/2 \u0026amp; 1/2 \u0026amp; 1 \\end{pmatrix} $$\nAccording to the calculation results in the paper, the maximum eigenvalue of this matrix is $\\lambda_{\\max}=3$. The computed indices are zero up to floating-point round-off:\n$$CI = -4.4409 \\times 10^{-16}, \\quad CR = -8.5402 \\times 10^{-16} \u0026lt; 0.10$$\nTherefore, the consistency check passes, and the above weight results are acceptable.\nWhen solving, the membership degrees of each indicator to the four evaluation grades are computed separately. To keep the evaluation functions smooth and interpretable, the conventional assignment method is used, with trapezoidal functions as the membership function model. After obtaining the membership degrees of each single indicator to the evaluation grades, the evaluation results of the three indicators are concatenated to form the enterprise\u0026rsquo;s fuzzy comprehensive judgment matrix:\n$$R = \\begin{bmatrix} r_{11} \u0026amp; r_{12} \u0026amp; r_{13} \u0026amp; r_{14} \\\\ r_{21} \u0026amp; r_{22} \u0026amp; r_{23} \u0026amp; r_{24} \\\\ r_{31} \u0026amp; r_{32} \u0026amp; r_{33} \u0026amp; r_{34} \\end{bmatrix} = \\begin{bmatrix} R_1 \\\\ R_2 \\\\ R_3 \\end{bmatrix}$$\nWhere $r_{ij}$ represents the membership degree of the $i$-th indicator to the $j$-th evaluation grade. 
Repeating the above steps for each enterprise yields their corresponding fuzzy comprehensive judgment matrices.\nFurther multiplying the weight vector by the judgment matrix gives the enterprise\u0026rsquo;s comprehensive membership degree to the four evaluation grades:\n$$B = AR = {b_1, b_2, b_3, b_4}$$\nIf the membership degree corresponding to a certain grade is the largest, the enterprise is classified into that grade. Finally, the four grades are quantified into scores: A, B, C, and D respectively correspond to 100, 80, 60, and 40 points. In this way, the originally difficult-to-directly-measure supply-demand relationship stability is transformed into a quantitative score that can participate in the upper-level lending decision model.\nFrom the model solution results, the fuzzy comprehensive evaluation ultimately generates a supply-demand relationship stability score for each enterprise. According to the example in the paper, the supply-demand stability scores for enterprises No. 1 through No. 10 are: 100, 40, 40, 100, 40, 100, 60, 60, 100, 100. 
That is to say, the model has already transformed the relatively fuzzy \u0026ldquo;whether the supply-demand relationship is stable\u0026rdquo; into a quantitative input that can directly participate in subsequent lending capacity calculations.\nAs an implementation reference, the following is a Matlab code example for computing the supply-demand relationship stability score using fuzzy comprehensive evaluation:\n%% Input: top 10 sales enterprises by year for i = 1:123 % 2016-2017 data temp = A(find(A(:,1)==i),:); temp2017 = temp(find(temp(:,3)\u0026lt;20180000),:); temp2017_2 = unique(temp2017(:,2)); n = histc(temp2017(:,2),temp2017_2); n = [temp2017_2,n]; n = sortrows(n,2); n = flipud(n); if size(n,1) \u0026lt; 10 mysort(1:size(n,1),2*i-1:2*i) = n; else n = n(1:10,1:2); mysort(1:10,2*i-1:2*i) = n; end % 2018 data temp2018 = temp(find(temp(:,3)\u0026lt;20190000),:); temp2018_2 = unique(temp2018(:,2)); n = histc(temp2018(:,2),temp2018_2); n = [temp2018_2,n]; n = sortrows(n,2); n = flipud(n); if size(n,1) \u0026lt; 10 mysort(11:10+size(n,1),2*i-1:2*i) = n; else n = n(1:10,1:2); mysort(11:20,2*i-1:2*i) = n; end % 2019 data temp2019 = temp(find(temp(:,3)\u0026lt;20200000),:); temp2019_2 = unique(temp2019(:,2)); n = histc(temp2019(:,2),temp2019_2); n = [temp2019_2,n]; n = sortrows(n,2); n = flipud(n); if size(n,1) \u0026lt; 10 mysort(21:20+size(n,1),2*i-1:2*i) = n; else n = n(1:10,1:2); mysort(21:30,2*i-1:2*i) = n; end end %% Output: top 10 sales enterprises by year for i = 1:123 % 2016-2017 data temp = A2(find(A2(:,1)==i),:); temp2017 = temp(find(temp(:,3)\u0026lt;20180000),:); temp2017_2 = unique(temp2017(:,2)); n = histc(temp2017(:,2),temp2017_2); n = [temp2017_2,n]; n = sortrows(n,2); n = flipud(n); if size(n,1) \u0026lt; 10 mysort2(1:size(n,1),2*i-1:2*i) = n; else n = n(1:10,1:2); mysort2(1:10,2*i-1:2*i) = n; end % 2018 data temp2018 = temp(find(temp(:,3)\u0026lt;20190000),:); temp2018_2 = unique(temp2018(:,2)); n = histc(temp2018(:,2),temp2018_2); n = 
[temp2018_2,n]; n = sortrows(n,2); n = flipud(n); if size(n,1) \u0026lt; 10 mysort2(11:10+size(n,1),2*i-1:2*i) = n; else n = n(1:10,1:2); mysort2(11:20,2*i-1:2*i) = n; end % 2019 data temp2019 = temp(find(temp(:,3)\u0026lt;20200000),:); temp2019_2 = unique(temp2019(:,2)); n = histc(temp2019(:,2),temp2019_2); n = [temp2019_2,n]; n = sortrows(n,2); n = flipud(n); if size(n,1) \u0026lt; 10 mysort2(21:20+size(n,1),2*i-1:2*i) = n; else n = n(1:10,1:2); mysort2(21:30,2*i-1:2*i) = n; end end %% Compute variance of average monthly transaction count for each enterprise % Quarterly input counts mynum1 = []; for i = 1:123 temp = A(find(A(:,1)==i),:); for j = 1:4 mynum1(i,j) = size(temp(find(temp(:,3)\u0026lt;20160101+j*300),:),1); end for j = 1:4 mynum1(i,4+j) = size(temp(find(temp(:,3)\u0026lt;20170101+j*300),:),1); end for j = 1:4 mynum1(i,8+j) = size(temp(find(temp(:,3)\u0026lt;20180101+j*300),:),1); end for j = 1:4 mynum1(i,12+j) = size(temp(find(temp(:,3)\u0026lt;20190101+j*300),:),1); end for j = 1:4 mynum1(i,16+j) = size(temp(find(temp(:,3)\u0026lt;20200101+j*300),:),1); end end % Quarterly output counts mynum2 = []; for i = 1:123 temp = A2(find(A2(:,1)==i),:); for j = 1:4 mynum2(i,j) = size(temp(find(temp(:,3)\u0026lt;20160101+j*300),:),1); end for j = 1:4 mynum2(i,j+4) = size(temp(find(temp(:,3)\u0026lt;20170101+j*300),:),1); end for j = 1:4 mynum2(i,j+8) = size(temp(find(temp(:,3)\u0026lt;20180101+j*300),:),1); end for j = 1:4 mynum2(i,j+12) = size(temp(find(temp(:,3)\u0026lt;20190101+j*300),:),1); end for j = 1:4 mynum2(i,j+16) = size(temp(find(temp(:,3)\u0026lt;20200101+j*300),:),1); end end % Total quarterly input and output counts mynum = mynum1 + mynum2; % Line charts of quarterly counts for first 20 enterprises seasontime = 1:20; for i = 1:20 subplot(4,5,i) plot(seasontime,mynum(i,:),\u0026#39;--o\u0026#39;) set(gca, \u0026#39;XTick\u0026#39;, 1:1:20) end % Compute mean and variance of transaction counts for each enterprise mymeans = zeros(123,1); S = 
zeros(123,1); for i = 1:123 temp = mynum(i,:); temp(temp==0) = []; % Remove leading zeros mymeans(i) = mean(temp); S(i) = sqrt(sum((temp - mean(temp)).^2) / length(temp)); end %% Compute membership degrees for 123 enterprises to obtain supply-demand relationship stability grades and scores load percent1.mat load percent2.mat percent1 = percent1\u0026#39;; percent2 = percent2\u0026#39;; % Convert benefit-type data to cost-type data temp_percent1 = 1 - percent1; temp_percent2 = 1 - percent2; myA = [0.4 0.4 0.2]; % Weight matrix scoreindex = zeros(123,2); for i = 1:123 R(1,:) = caculate_rate(temp_percent1(:,1),temp_percent1(i,1)); % Obtain indicator P4 membership function value R(2,:) = caculate_rate(temp_percent2(:,1),temp_percent2(i,1)); % Obtain indicator P5 membership function value R(3,:) = caculate_rate(S,S(i)); % Obtain indicator P6 membership function value B = myA * R; [temp,scoreindex(i)] = max(B); % Stability grade score if scoreindex(i) == 1 score2 = 100; elseif scoreindex(i) == 2 score2 = 80; elseif scoreindex(i) == 3 score2 = 60; else score2 = 40; end scoreindex(i,2) = score2; end %% Append stability scores to enterprise information matrix company_inf(:,12) = scoreindex(:,2); 3.2.5 TOPSIS Method: Enterprise Strength Quantification Score Enterprise strength consists of net profit margin, net profit growth rate, and output return ratio, which is a typical multi-attribute comprehensive evaluation problem. To avoid scale bias from direct weighted summation, this article uses TOPSIS for dimensionless processing and ranking.\nLet there be $n$ enterprises and $m$ indicators, with the raw data matrix:\n$$X = (x_{ij})_{n \\times m}$$\nFirst, perform vector normalization:\n$$z_{ij} = \\frac{x_{ij}}{\\sqrt{\\sum_{i=1}^{n} x_{ij}^2}}$$\nObtain the normalized matrix $Z = (z_{ij})$.\nThen multiply by the indicator weights computed by AHP to form the weighted normalized matrix. 
Let the weighted normalized matrix be:\n$$Y = (y_{ij})_{n \\times m}, \\quad y_{ij} = \\omega_j^{(s)} \\cdot z_{ij}$$ For benefit-type indicators (net profit margin, net profit growth rate), larger values are better; for cost-type indicators (output return ratio), smaller values are better. Therefore, the positive ideal solution and negative ideal solution are respectively defined as:\n$$Y^+ = (y_1^+, y_2^+, \\cdots, y_m^+)$$\n$$Y^- = (y_1^-, y_2^-, \\cdots, y_m^-)$$\nWhere benefit-type indicators take the maximum value as the positive ideal solution, and cost-type indicators take the minimum value as the positive ideal solution.\nNext, compute the Euclidean distance between each enterprise and the positive and negative ideal solutions:\n$$D_i^+ = \\sqrt{\\sum_{j=1}^{m} (y_{ij} - y_j^+)^2}$$\n$$D_i^- = \\sqrt{\\sum_{j=1}^{m} (y_{ij} - y_j^-)^2}$$\nFinally, define the relative closeness:\n$$f_1^{(i)}(t) = S_i = \\frac{D_i^-}{D_i^+ + D_i^-}$$\nObviously $S_i \\in [0,1]$. The larger $S_i$, the closer the enterprise is to the ideal operating state, and the stronger its enterprise strength.\nAs an implementation reference, the following is a Matlab code example for computing enterprise strength scores using TOPSIS:\n%% TOPSIS Score Computation % Benefit-type indicators are profit margin and profit growth rate % Cost-type indicator is enterprise return ratio % Compute the efficacy score for profit margin indicator % Step 1: Forward normalization of the raw matrix % Use excel to remove enterprises without data for 2016-17 and import matrix X load Q1data2016.mat % X % 2018 X = company_inf(:,[4:6,10:11]); % 2019-2020 load Q1data2019-20.mat % X %% Start computation [m,n] = size(X); disp([\u0026#39;There are \u0026#39; num2str(m) \u0026#39; sample data, with \u0026#39; num2str(n) \u0026#39; indicators\u0026#39;]) judge = input(\u0026#39;Are there indicators that need forward normalization? 
If yes, enter 1, if no, enter 2: \u0026#39;); if judge == 1 position = input(\u0026#39;Enter the columns that need forward normalization, e.g., [1,2,3] for columns 1, 2, 3: \u0026#39;); type = input(\u0026#39;Enter the types of indicators from left to right, 1. cost-type 2. intermediate-type, 3. interval-type, e.g., [2,1,3]: \u0026#39;); len = size(type,2); for i = 1:len if type(i) == 1 X(:,position(i)) = Min2Max(X(:,position(i))); disp([\u0026#39;Column \u0026#39; num2str(position(i)) \u0026#39; is a cost-type indicator, forward normalization completed\u0026#39;]) end if type(i) == 2 best = input([\u0026#39;Enter the best value for column \u0026#39; num2str(position(i)) \u0026#39; indicator: \u0026#39;]); X(:,position(i)) = Mid2Max(X(:,position(i)),best); disp([\u0026#39;Column \u0026#39; num2str(position(i)) \u0026#39; is an intermediate-type indicator, forward normalization completed\u0026#39;]) end if type(i) == 3 best_inter = input([\u0026#39;Enter the best interval for column \u0026#39; num2str(position(i)) \u0026#39; indicator (e.g., [10,20]): \u0026#39;]); X(:,position(i)) = Inter2Max(X(:,position(i)),best_inter); disp([\u0026#39;Column \u0026#39; num2str(position(i)) \u0026#39; is an interval-type indicator, forward normalization completed\u0026#39;]) end end disp(\u0026#39;All indicators have been forward normalized\u0026#39;) end % Step 2: Normalize the forward matrix Z = X ./ repmat(sqrt(sum(X.^2)),m,1); % Step 3: Compute scores and normalize max_Z = max(Z); min_Z = min(Z); judge = input(\u0026#39;Do you need to adjust indicator weights? 
If not, enter 0; if yes, enter 1: \u0026#39;); if judge == 0 max_D = sqrt(sum((repmat(max_Z,m,1) - Z).^2,2)); min_D = sqrt(sum((repmat(min_Z,m,1) - Z).^2,2)); elseif judge == 1 % w = input(\u0026#39;Enter weights from left to right, e.g., [0.3,0.35,0.35]: \u0026#39;); % w = [0.25 0.164638129 0.213078843 0.372283029]; % 2016 indicator proportions (no profit growth rate indicator) w = [0.103604561 0.164461989 0.146571579 0.213078843 0.372283029]; max_D = sqrt(sum(repmat(w,m,1) .* (repmat(max_Z,m,1) - Z).^2,2)); min_D = sqrt(sum(repmat(w,m,1) .* (repmat(min_Z,m,1) - Z).^2,2)); end score = min_D ./ (max_D + min_D); % Unnormalized scores score = score ./ sum(score); % Normalized scores %% Compute total scores for all enterprises over three years fx_score = zeros(123,1); for i = 1:123 testsum = sum(temp(:,1)==i) + sum(temp(:,3)==i) + sum(temp(:,5)==i); % Enterprise has scores for three years if testsum == 3 myweight = [0.16342 0.29696 0.53961]; fx_score(i) = myweight * [temp(find(temp(:,1)==i),2); temp(find(temp(:,3)==i),4); temp(find(temp(:,5)==i),6)]; % Enterprise has scores for two years elseif testsum == 2 % Enterprise has 16-17 scores if sum(temp(:,1)==i) == 1 myweight = [0.3333 0.6667]; fx_score(i) = myweight * [temp(find(temp(:,1)==i),2); temp(find(temp(:,3)==i),4)]; % Enterprise has 19-20 scores else myweight = [0.3333 0.6667]; fx_score(i) = myweight * [temp(find(temp(:,3)==i),4); temp(find(temp(:,5)==i),6)]; end % Enterprise has only one year of data else fx_score(i) = temp(find(temp(:,3)==i),4); end end 3.2.6 Comprehensive Evaluation and Grade Threshold Determination Weighting the enterprise strength score $f_1^{(i)}(t)$ computed by TOPSIS with the supply-demand stability score $f_2^{(i)}(t)$ obtained from fuzzy comprehensive evaluation yields the lending capacity evaluation value for each period. 
Then, aggregating the results across periods using time weights gives the enterprise\u0026rsquo;s final comprehensive score $LEND_i$.\nFor ease of display and subsequent lending decisions, $LEND_i$ is converted to a percentage expression:\n$$Score_i = 100 \\times LEND_i$$\nThen, based on the score distribution and sample ranking results, enterprises are divided into four grades:\nGrade Comprehensive Score Characteristics Lending Strategy A High score, stable operations, solid supply-demand relationships Priority lending B Relatively high score, controllable risk Eligible for lending C Average score, requires prudent evaluation Lending with reduced quota D Low score, insufficient debt-servicing capacity and stability No lending If all enterprises are sorted by $LEND_i$ from highest to lowest, the lending threshold can be determined through quantiles or graphical segmentation. In practice, A/B/C grade enterprises can enter the credit pool based on risk appetite, while D grade enterprises are directly excluded.\n3.2.7 Solution Results: Whether to Lend As an implementation reference, the following is a Matlab code example for final bank lending decision scoring:\n%% TOPSIS for Credit Decision Model Scoring % Compute enterprise strength scores % 2016-2017 load X2.mat % 2018 load X3.mat % 2019-2020 load X4.mat %% Start computation [m,n] = size(X); disp([\u0026#39;There are \u0026#39; num2str(m) \u0026#39; sample data, with \u0026#39; num2str(n) \u0026#39; indicators\u0026#39;]) judge = input(\u0026#39;Are there indicators that need forward normalization? If yes, enter 1, if no, enter 2: \u0026#39;); if judge == 1 position = input(\u0026#39;Enter the columns that need forward normalization, e.g., [1,2,3] for columns 1, 2, 3: \u0026#39;); type = input(\u0026#39;Enter the types of indicators from left to right, 1. cost-type 2. intermediate-type, 3. 
interval-type, e.g., [2,1,3]: \u0026#39;); len = size(type,2); for i = 1:len if type(i) == 1 X(:,position(i)) = Min2Max(X(:,position(i))); disp([\u0026#39;Column \u0026#39; num2str(position(i)) \u0026#39; is a cost-type indicator, forward normalization completed\u0026#39;]) end if type(i) == 2 best = input([\u0026#39;Enter the best value for column \u0026#39; num2str(position(i)) \u0026#39; indicator: \u0026#39;]); X(:,position(i)) = Mid2Max(X(:,position(i)),best); disp([\u0026#39;Column \u0026#39; num2str(position(i)) \u0026#39; is an intermediate-type indicator, forward normalization completed\u0026#39;]) end if type(i) == 3 best_inter = input([\u0026#39;Enter the best interval for column \u0026#39; num2str(position(i)) \u0026#39; indicator (e.g., [10,20]): \u0026#39;]); X(:,position(i)) = Inter2Max(X(:,position(i)),best_inter); disp([\u0026#39;Column \u0026#39; num2str(position(i)) \u0026#39; is an interval-type indicator, forward normalization completed\u0026#39;]) end end disp(\u0026#39;All indicators have been forward normalized\u0026#39;) end % Step 2: Normalize the forward matrix Z = X ./ repmat(sqrt(sum(X.^2)),m,1); % Step 3: Compute scores and normalize max_Z = max(Z); min_Z = min(Z); judge = input(\u0026#39;Do you need to adjust indicator weights? 
If not, enter 0; if yes, enter 1: \u0026#39;); if judge == 0 max_D = sqrt(sum((repmat(max_Z,m,1) - Z).^2,2)); min_D = sqrt(sum((repmat(min_Z,m,1) - Z).^2,2)); elseif judge == 1 % w = input(\u0026#39;Enter weights from left to right, e.g., [0.3,0.35,0.35]: \u0026#39;); % w = [0.75 0.25]; % 2016 indicator proportions (no profit growth rate indicator) w = [0.310813683 0.493385967 0.195800351]; max_D = sqrt(sum(repmat(w,m,1) .* (repmat(max_Z,m,1) - Z).^2,2)); min_D = sqrt(sum(repmat(w,m,1) .* (repmat(min_Z,m,1) - Z).^2,2)); end score = min_D ./ (max_D + min_D); % Unnormalized scores score = score ./ sum(score); % Normalized scores %% Compute enterprise strength scores for all enterprises over three years load temp.mat fx_score = zeros(123,1); for i = 1:123 testsum = sum(temp(:,1)==i) + sum(temp(:,3)==i) + sum(temp(:,5)==i); % Enterprise has scores for three years if testsum == 3 myweight = [0.16342 0.29696 0.53961]; fx_score(i) = myweight * [temp(find(temp(:,1)==i),2); temp(find(temp(:,3)==i),4); temp(find(temp(:,5)==i),6)]; % Enterprise has scores for two years elseif testsum == 2 % Enterprise has 16-17 scores if sum(temp(:,1)==i) == 1 myweight = [0.3333 0.6667]; fx_score(i) = myweight * [temp(find(temp(:,1)==i),2); temp(find(temp(:,3)==i),4)]; % Enterprise has 19-20 scores else myweight = [0.3333 0.6667]; fx_score(i) = myweight * [temp(find(temp(:,3)==i),4); temp(find(temp(:,5)==i),6)]; end % Enterprise has only one year of data else fx_score(i) = temp(find(temp(:,3)==i),4); end end Based on the $LEND_i$ values computed by the model, the 123 enterprises can be classified into four lending grades. 
The graphical results show that enterprise scores have a relatively clear hierarchical differentiation, indicating that this model can effectively distinguish high-quality enterprises from high-risk enterprises.\nSpecifically:\nGrade A enterprises have the strongest lending capacity, with excellent business quality and supply-demand stability, and can be prioritized for loan disbursement. Grade B enterprises have overall controllable risk and are suitable as regular credit customers. Grade C enterprises, while not obviously high-risk, have shortcomings in business or stability, and lending should be done prudently by reducing quotas and increasing review intensity. Grade D enterprises have the lowest scores in comprehensive evaluation, and loan disbursement is not recommended. Therefore, the \u0026ldquo;whether to lend\u0026rdquo; in Problem 1 can be reduced to a clear classification rule: only grant loans to Grade A, B, and C enterprises, and reject loans for Grade D enterprises. This conclusion provides the prerequisite conditions for subsequent interest rate and quota design.\nThe following figure shows the final scoring rules formed by this model, which can be used to evaluate other enterprises on the same basis:\n3.3 How to Lend: Credit Risk Model for Small and Medium Enterprises For enterprises that have passed the screening of the lending decision model, the next step is no longer to answer \u0026ldquo;whether they can get a loan,\u0026rdquo; but \u0026ldquo;how they should get a loan.\u0026rdquo; Therefore, the second model in Problem 1 shifts to risk pricing and quota allocation: on one hand, it continues to preserve enterprise business performance, and on the other hand, it incorporates credit rating and default history into the evaluation system, ultimately forming a risk evaluation value that can be used for interest rate and quota design.\nUnlike the lending capacity evaluation value $LEND_i$ in Section 3.2, the enterprise risk evaluation value 
$RISK_i$ constructed here is for determining loan conditions after passing the initial screening. The two models differ in indicator selection, weight allocation, and result interpretation, so a separate model is needed.\n3.3.1 Model Structure and Evaluation Approach The risk evaluation model still adopts a hierarchical structure of \u0026ldquo;criterion layer + indicator layer + time layer,\u0026rdquo; but the criterion layer consists of two parts: \u0026ldquo;enterprise strength\u0026rdquo; and \u0026ldquo;enterprise credit.\u0026rdquo; Among them, enterprise strength continues to use net profit margin, net profit growth rate, and output return ratio as the three business indicators; enterprise credit is characterized by credit rating and default history. The reason for this approach is: the lending decision model focuses more on whether the enterprise has basic lending eligibility, while the risk evaluation model needs to further answer the question of at what price and quota the bank should bear this risk.\nThe figure below shows the credit risk model structure established in this article:\nWithin a single time band, the risk evaluation value is written as:\n$$RISK_i(t) = \\omega_1 f_{i1}(t) + \\omega_2 f_{i2}(t) + \\omega_3 f_{i3}(t) + \\omega_4 f_{i4}(t) + \\omega_5 f_{i5}(t)$$\nWhere $f_{i1}(t)$, $f_{i2}(t)$, and $f_{i3}(t)$ respectively represent the scores of net profit margin, net profit growth rate, and output return ratio for enterprise $i$ in time band $t$; $f_{i4}(t)$ and $f_{i5}(t)$ respectively represent the quantitative scores of credit rating and default history; and $\\omega_1,\\dots,\\omega_5$ are the corresponding indicator weights.\nFurther aggregating over the three time bands yields the enterprise\u0026rsquo;s final comprehensive risk evaluation value:\n$$RISK_i = \\eta_1 RISK_i(t_1) + \\eta_2 RISK_i(t_2) + \\eta_3 RISK_i(t_3)$$\nWhere $\\eta_1, \\eta_2, \\eta_3$ are the time weights. 
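The two-level weighted sum above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the function names are hypothetical, the indicator scores are made-up numbers, and the weights are the rounded values this article reports in Sections 3.3.2 and 3.3.3.

```python
# Rounded weights as reported in this article (assumed here for illustration).
OMEGA = (0.1036, 0.1645, 0.1466, 0.2131, 0.3722)  # five indicator weights
ETA = (0.1220, 0.3196, 0.5584)                     # three time-band weights


def risk_single_period(scores, weights=OMEGA):
    # RISK_i(t): weighted sum of the five indicator scores f_i1(t)..f_i5(t).
    if len(scores) != len(weights):
        raise ValueError('expected one score per indicator')
    return sum(w * f for w, f in zip(weights, scores))


def risk_total(period_scores, time_weights=ETA):
    # RISK_i: time-weighted sum of the single-period values RISK_i(t).
    if len(period_scores) != len(time_weights):
        raise ValueError('expected one score vector per time band')
    return sum(eta * risk_single_period(s)
               for eta, s in zip(time_weights, period_scores))


# Made-up indicator scores in [0, 1] for one enterprise, bands t1..t3.
example = [
    (0.50, 0.40, 0.60, 0.75, 1.00),
    (0.55, 0.45, 0.60, 0.75, 1.00),
    (0.60, 0.50, 0.65, 1.00, 1.00),
]
print(round(risk_total(example), 4))
```

Because both weight vectors sum to 1, an enterprise scoring 1.0 on every indicator in every band gets a total of 1.0, which is a quick sanity check on the weights.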
At this point, the model completes the mapping from single-period business and credit information to the final risk score, and subsequent interest rate and quota design are all based on $RISK_i$.\n3.3.2 Weight Quantification Based on Analytic Hierarchy Process To make the risk evaluation model more interpretable, this article also uses the analytic hierarchy process to quantify weights at each layer, and uses consistency checks to verify the reasonableness of the judgment matrices.\n1. Primary Criterion Layer Weights\nIn the criterion layer, \u0026ldquo;enterprise strength\u0026rdquo; and \u0026ldquo;enterprise credit\u0026rdquo; are compared pairwise. Compared with pure business performance, credit rating and default history more directly reflect the enterprise\u0026rsquo;s default risk, so enterprise credit is given higher weight in the risk model. The corresponding judgment matrix can be written as:\n$$A_r = \\begin{pmatrix} 1 \u0026amp; 1/2 \\\\ 2 \u0026amp; 1 \\end{pmatrix}$$\nAccording to the calculation results in the paper, the primary criterion layer weights are:\n$$\\boldsymbol{\\omega}^{(r)} = (0.3333, 0.6667)$$\nThat is, enterprise strength accounts for 1/3, and enterprise credit accounts for 2/3. Because a second-order reciprocal judgment matrix always has a maximum eigenvalue of 2, $CI = 0$ and the consistency requirement is satisfied automatically.\n2. Internal Weights of Enterprise Credit\nThe enterprise credit part is further divided into three indicators: output return ratio, credit rating, and default history. The first retains the negative constraint information in business quality, while the latter two directly characterize the enterprise\u0026rsquo;s historical credit level. 
The judgment matrix in the paper is:\n$$A_c = \\begin{pmatrix} 1 \u0026amp; 1/3 \u0026amp; 1/4 \\\\ 3 \u0026amp; 1 \u0026amp; 1/2 \\\\ 4 \u0026amp; 2 \u0026amp; 1 \\end{pmatrix}$$\nThe computation yields:\n$$\\lambda_{\\max} = 3.0183, \\quad CI = 0.00914, \\quad CR = 0.01759 \u0026lt; 0.10$$\nThe consistency check passes, and the corresponding weights are:\n$$\\boldsymbol{\\omega}^{(c)} = (0.1220, 0.3196, 0.5584)$$\nThis indicates that within the enterprise credit dimension, default history has the strongest explanatory power, followed by credit rating, with output return ratio serving as an auxiliary risk signal in the evaluation.\n3. Bottom Indicator Weights of the Risk Model\nAfter combining the primary criterion layer and lower-layer indicator weights, the five bottom indicator weights of the risk evaluation model under normal conditions are obtained:\n$$\\boldsymbol{\\omega} = (0.1036, 0.1645, 0.1466, 0.2131, 0.3722)$$\nThey correspond in order to net profit margin, net profit growth rate, output return ratio, credit rating, and default history. It can be seen that default history and credit rating have relatively high total weights, which is consistent with the risk model\u0026rsquo;s goal of \u0026ldquo;emphasizing credit and historical performance.\u0026rdquo;\nAmong them, the enterprise strength part under normal conditions follows the judgment matrix in Section 3.2.2, and its consistency check result is:\n$$\\lambda_{\\max} = 3.0536, \\quad CI = 0.0268, \\quad CR = 0.0516 \u0026lt; 0.10$$\nThis indicates that this bottom weight configuration also satisfies the consistency requirement and can be directly used in the risk evaluation model.\nWhen some enterprises lack net profit growth rate, the paper further adopts the degraded model for missing-indicator scenarios. 
At this point, the four indicator weights are adjusted to:\n$$\\boldsymbol{\\omega}\u0026#39; = (0.2500, 0.1646, 0.2131, 0.3723)$$\nThese correspond, in order, to net profit margin, output return ratio, credit rating, and default history. Since this degraded judgment matrix is a second-order matrix with a maximum eigenvalue of 2, consistency naturally holds. This allows the model to maintain structural stability without forced imputation.\n3.3.3 Time Dimension Weighting and Risk Score Aggregation Risk evaluation is not a static result either. The enterprise\u0026rsquo;s recent business performance and credit status are clearly more valuable as references for the bank\u0026rsquo;s current pricing than data from earlier years. Therefore, this article continues to use the time judgment matrix from Section 3.2.3 for the time layer, assigning different weights to the three time bands.\nThe time layer weight vector is:\n$$\\boldsymbol{\\eta} = (\\eta_1, \\eta_2, \\eta_3) = (0.1220, 0.3196, 0.5584)$$\nThe most recent time band $t_3$ has the highest weight, indicating that the model places more emphasis on the enterprise\u0026rsquo;s latest risk status. Therefore, after computing the single-period risk evaluation values for each enterprise across the three time bands, the final comprehensive risk score is:\n$$RISK_i = 0.1220 \\cdot RISK_i(t_1) + 0.3196 \\cdot RISK_i(t_2) + 0.5584 \\cdot RISK_i(t_3)$$\nIn the paper\u0026rsquo;s results, the raw risk evaluation values are concentrated in a narrow range, so using them directly for interest rate and quota stratification is not intuitive enough. To enhance the differentiation of scores, the paper further amplifies the raw results and converts them to a percentage scale for subsequent interest rate mapping and grade division. 
This treatment does not change the relative ranking among enterprises but stretches the scores to a more easily interpretable scale.\n3.3.4 Loan Interest Rate Calculation For interest rate design, the paper adopts the approach of \u0026ldquo;base rate plus spread pricing.\u0026rdquo; The base rate references the People\u0026rsquo;s Bank of China one-year commercial lending rate of 4.35%, and each enterprise\u0026rsquo;s risk score is then mapped piecewise onto the 4% to 15% range. The core logic is: the higher the risk score, the better the enterprise quality, so the bank can offer a lower interest rate; the lower the risk score, the higher the interest rate needed to cover the potential risk.\nThe breakpoint 13.9866 used in the formulas of this section comes from the sample risk scores themselves. The specific approach is: first amplify the raw risk evaluation values, then normalize and convert them to a percentage scale to obtain a set of optimized scores more suitable for pricing; then take the average of these percentage scores for the 123 enterprises to obtain 13.9866.\nI use this average as the anchor point that maps to the base rate of 4.35%. The advantage of this approach is that the segmentation point is not specified subjectively but is derived from the overall sample score level. Therefore, the score intervals [0,13.9866) and [13.9866,100] respectively correspond to the two interest rate segments 15%→4.35% and 4.35%→4%. Letting the risk score be $x$ and the loan interest rate (in percent) be $y$, the interest rate function is written as:\nWhen $x \\ge 13.9866$,\n$$y = 4.35 - \\frac{4.35 - 4}{100 - 13.9866}(x - 13.9866)$$\nWhen $x \u0026lt; 13.9866$,\n$$y = 15 - \\frac{15 - 4.35}{13.9866}x$$\nThis mapping approach has two advantages. First, it ensures that the interest rate for high-score enterprises gradually converges toward the low end of the range, which makes the offer more attractive to high-quality enterprises. 
Second, it retains sufficient risk premium for low-score enterprises, achieving a balance between revenue and customer churn.\n3.3.5 Loan Quota Allocation For quota design, the paper assumes that the annual total disbursable loan amount is fixed at 100 million, and the per-enterprise loan quota is preset to five tiers: 1 million, 700,000, 500,000, 200,000, and 100,000. To balance risk control and capital utilization efficiency under the total amount constraint, the model does not rigidly segment based on absolute scores but divides quota tiers based on the enterprise\u0026rsquo;s ranking proportion in the overall population.\nThe initial quota tiers are:\nTier Loan Quota I 1 million II 700,000 III 500,000 IV 200,000 V 100,000 When allocating quota, I first divide the annual total disbursable amount of 100 million into 5 portions of 20 million each, one for each of the five quota tiers. Then, combined with the enterprise\u0026rsquo;s risk evaluation value ranking proportion in the overall population, the interval areas corresponding to tiers I, II, III, IV, and V are made equal, thereby establishing a mapping from \u0026ldquo;risk evaluation value ranking proportion\u0026rdquo; to \u0026ldquo;credit quota tier.\u0026rdquo;\nThe mapping relationship is shown in the figure:\nBy reversing this path, the ranking proportion intervals corresponding to each quota tier can be obtained, and the final stratification rules are:\nTier Loan Quota Ranking Proportion I 1 million Top 3.6347% II 700,000 Top 8.7844% III 500,000 Top 17.6110% IV 200,000 Top 38.2090% V 100,000 Remaining approved enterprises In this way, enterprises with higher risk evaluation values and higher rankings can obtain higher credit quotas; enterprises with relatively lower rankings are allocated more conservative quota levels. This completes the closed loop of the two-stage framework in Problem 1: $LEND_i$ is used to decide whether to lend, and $RISK_i$ is used to decide how to lend.\n4. 
Problem 2: Credit Strategies for 302 Enterprises Without Credit Records Note: The author was only responsible for data cleaning and credit strategy development. The decision tree model construction and training were completed by a teammate, so this section only introduces the training approach and methodology without detailing the procedural steps.\n4.1 Challenges of Missing Data The 302 enterprises in Attachment 2 lack two key pieces of information, \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history,\u0026rdquo; so they cannot be directly substituted into the lending decision model and risk evaluation model already established in Problem 1. That is to say, the core of Problem 2 is not to design a new credit system but to first fill in the missing credit labels, and then put these enterprises back into the two-stage model of Problem 1 for evaluation.\nTherefore, the solution approach for this problem can be summarized as two steps: first, use the data of the 123 enterprises with credit records in Attachment 1 to train a classification model and predict the credit ratings and default histories of the enterprises in Attachment 2; second, after adding the predictions to the Attachment 2 samples, continue using the integrated \u0026ldquo;whether to lend + how to lend\u0026rdquo; framework to complete loan approval, interest rate design, and quota allocation.\n4.2 Decision Tree to Fill the Data Gap 4.2.1 Data Preprocessing During data cleaning, I first use pandas.read_excel() to read all tables in Attachments 1 and 2 into DataFrame structures. 
Then, I remove invalidated invoice samples from the \u0026ldquo;input invoice information\u0026rdquo; and \u0026ldquo;output invoice information\u0026rdquo; to avoid interference from invalid documents on the characterization of enterprise transaction scale and business characteristics.\nNext, I sum the \u0026ldquo;total price and tax\u0026rdquo; in the four invoice detail tables by \u0026ldquo;enterprise code\u0026rdquo; to obtain the total input price-tax amount and total output price-tax amount for each enterprise. Then, these two summary results are added back to the enterprise master table, enabling the original enterprise information table to simultaneously contain business labels and transaction summary characteristics.\nSince \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; in Attachment 1 are string-type variables, they need to be numerically encoded before entering the model. Here, LabelEncoder is used to re-encode the category labels, and train_test_split is used to divide samples into training, validation, and test sets with a ratio of 3:1:1. 
Finally, \u0026ldquo;input price-tax amount\u0026rdquo; and \u0026ldquo;output price-tax amount\u0026rdquo; are used as features to train classification models for predicting \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; respectively.\nAs an implementation reference, the Python code for data cleaning and training set construction is:\nimport numpy as np import pandas as pd from sklearn.preprocessing import LabelEncoder, OneHotEncoder from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, recall_score, f1_score from sklearn.tree import DecisionTreeClassifier, export_graphviz from sklearn.externals.six import StringIO import pydotplus import matplotlib.dates as mdate from tqdm import tqdm import matplotlib.pyplot as plt import os # Add to environment variable os.environ[\u0026#39;PATH\u0026#39;] += os.pathsep + \u0026#39;\u0026#39;# Environment variable path # Read and clean the raw data df_0 = pd.read_excel(\u0026#39;Attachment 1: Data on 123 Enterprises with Credit Records.xlsx\u0026#39;, sheet_name=\u0026#39;Enterprise Information\u0026#39;, encoding=\u0026#39;gbk\u0026#39;) df_in = pd.read_excel(\u0026#39;Attachment 1: Data on 123 Enterprises with Credit Records.xlsx\u0026#39;, sheet_name=\u0026#39;Input Invoice Information\u0026#39;, encoding=\u0026#39;gbk\u0026#39;) df_out = pd.read_excel(\u0026#39;Attachment 1: Data on 123 Enterprises with Credit Records.xlsx\u0026#39;, sheet_name=\u0026#39;Output Invoice Information\u0026#39;, encoding=\u0026#39;gbk\u0026#39;) df_in = df_in[df_in.loc[:, \u0026#39;Invoice Status\u0026#39;] == \u0026#39;Valid Invoice\u0026#39;] df_out = df_out[df_out.loc[:, \u0026#39;Invoice Status\u0026#39;] == \u0026#39;Valid Invoice\u0026#39;] # Group and sum sum_in = df_in[\u0026#39;Total Price and Tax\u0026#39;].groupby(df_in[\u0026#39;Enterprise Code\u0026#39;]).sum() sum_out = df_out[\u0026#39;Total Price and Tax\u0026#39;].groupby(df_out[\u0026#39;Enterprise 
Code\u0026#39;]).sum() # Join tables sum_in = sum_in.to_frame() sum_out = sum_out.to_frame() sum_in.rename(columns={\u0026#39;Total Price and Tax\u0026#39;: \u0026#39;Input Price-Tax Total\u0026#39;}, inplace=True) sum_out.rename(columns={\u0026#39;Total Price and Tax\u0026#39;: \u0026#39;Output Price-Tax Total\u0026#39;}, inplace=True) # Named totals rather than sum to avoid shadowing the built-in totals = sum_in.join(sum_out) df_0.set_index(\u0026#39;Enterprise Code\u0026#39;, drop=True, inplace=True) df = df_0.join(totals) # Encoding (LabelEncoder expects a 1-D array of labels) df[\u0026#39;Credit Rating\u0026#39;] = LabelEncoder().fit_transform(df[\u0026#39;Credit Rating\u0026#39;]) df[\u0026#39;Default History\u0026#39;] = LabelEncoder().fit_transform(df[\u0026#39;Default History\u0026#39;]) #df.to_csv(\u0026#39;123 enterprises.csv\u0026#39;) # Machine learning # Use only the two price-tax totals as features so that the label itself does not leak into the inputs features = df[[\u0026#39;Input Price-Tax Total\u0026#39;, \u0026#39;Output Price-Tax Total\u0026#39;]] label = df[\u0026#39;Default History\u0026#39;] f_v = features.values f_names = features.columns.values l_v = label.values # Validation set: 20% of all samples X_tt, X_validation, Y_tt, Y_validation = train_test_split(f_v, l_v, test_size=0.2) # Test set: 25% of the remaining 80%, which keeps the overall 3:1:1 ratio X_train, X_test, Y_train, Y_test = train_test_split(X_tt, Y_tt, test_size=0.25) models = [] #models.append((\u0026#39;LogisticRegression\u0026#39;, LogisticRegression(C = 1000, tol = 1e-10, solver = \u0026#39;sag\u0026#39;, max_iter = 10000))) models.append((\u0026#39;DecisionTreeGini\u0026#39;, DecisionTreeClassifier())) #models.append((\u0026#39;DecisionTreeEntropy\u0026#39;, DecisionTreeClassifier(criterion=\u0026#39;entropy\u0026#39;))) for (clf_name, clf) in models: clf.fit(X_train, Y_train) xy_lst = [(X_train, Y_train), (X_validation, Y_validation), (X_test, Y_test)] for i in range(len(xy_lst)): X_part = xy_lst[i][0] Y_part = xy_lst[i][1] Y_pred = clf.predict(X_part) print(i) print(clf_name, \u0026#39;ACC\u0026#39;, accuracy_score(Y_part, Y_pred)) print(clf_name, \u0026#39;REC\u0026#39;, recall_score(Y_part, Y_pred)) print(clf_name, 
\u0026#39;F1\u0026#39;, f1_score(Y_part, Y_pred)) 4.2.2 Model Training The decision tree is chosen as the core machine learning model for this problem for two main reasons: on one hand, the sample size is limited, and decision trees are easier to train on small-sample classification problems; on the other hand, decision trees have strong interpretability, making it easy to directly transform \u0026ldquo;how purchase and sales scale affect credit rating and default history\u0026rdquo; into visualized classification rules.\nFrom a principles perspective, decision trees continuously select the optimal feature to split the sample set so that the node purity after each split is as high as possible. Here, classification trees based on the Gini coefficient standard are used to train \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; separately.\nIn the model evaluation stage, accuracy ACC, recall REC, and F1-score are used as the main evaluation metrics, respectively measuring the model\u0026rsquo;s overall classification accuracy, its ability to identify the target class, and the balance between precision and recall. By training \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; separately, the decision tree models for predicting the credit labels of Attachment 2 enterprises can be obtained.\nFrom the implementation results, this code outputs ACC, REC, and F1 metrics on the training, validation, and test sets respectively, where 0, 1, and 2 respectively correspond to training, validation, and test sets. The purpose of doing so is to simultaneously observe the model\u0026rsquo;s performance on training samples and unseen samples, avoiding the situation of only looking at training results while ignoring generalization ability. 
Additionally, the model can be further exported as a decision tree diagram to visually display the classification rules.\n4.2.3 Prediction Results After model training is complete, inputting the 302 enterprises from Attachment 2 into the decision tree yields their corresponding \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; predicted values. In this way, the originally missing credit labels are filled, and Attachment 2 enterprises have the same field structure as Attachment 1 enterprises, allowing them to continue participating in the subsequent quantitative credit analysis.\nFrom the results, the input price-tax total and output price-tax total have a certain ability to differentiate credit ratings and default histories, providing an approximate credit label for enterprises without historical credit records. Although these prediction results cannot completely replace real historical credit performance, under the limited information provided by the problem, they are sufficient to support subsequent lending decisions and risk pricing.\nFurther, after filling in the labels, Attachment 2 samples can continue just like in Problem 1 to compute the enterprise strength total score, lending capacity normalized total score, and the annual risk evaluation total scores across the three time bands, ultimately aggregating into a comprehensive risk evaluation value. In this way, the decision tree\u0026rsquo;s output is no longer just a single classification label but directly becomes an input variable in the subsequent $LEND_i$ and $RISK_i$ calculation chain.\n4.3 Credit Strategy Development After filling in credit ratings and default histories, the 302 enterprises in Attachment 2 can be put back into the credit framework already established in Problem 1 for continued calculation. 
The specific solution still follows the two main lines of \u0026ldquo;whether to lend\u0026rdquo; and \u0026ldquo;how to lend,\u0026rdquo; except that the input samples have switched from Attachment 1\u0026rsquo;s historical credit enterprises to Attachment 2\u0026rsquo;s enterprises with supplemented labels via decision tree.\n1. Whether to Lend\nFirst, compute the lending capacity evaluation value $LEND_i$ using enterprise strength scores and supply-demand relationship stability scores, and normalize and convert to a percentage scale. This yields the lending capacity distribution of the 302 enterprises under the same evaluation standards. Then, based on the grade division standards already determined in Problem 1, determine whether the enterprises are approved for lending. That is to say, the \u0026ldquo;whether to lend\u0026rdquo; part of Problem 2 is essentially still judging which of grades A, B, C, and D these enterprises fall into.\nThe corresponding \u0026ldquo;whether to lend\u0026rdquo; visualization results are as follows:\n2. Loan Interest Rate\nFor approved enterprises, continue to compute the comprehensive risk evaluation value, and follow the interest rate allocation approach of Problem 1 for pricing. Here, the method of \u0026ldquo;anchoring the base rate with the percentage average\u0026rdquo; is still used, except that the Attachment 2 sample\u0026rsquo;s percentage score average becomes 23.1107. Therefore, in Problem 2, the interest rate segmentation point is no longer 13.9866 from Problem 1 but is changed to 23.1107 as the anchor for the 4.35% base rate. 
Enterprises above this average are mapped to the low interest rate interval, while those below the average are mapped to the high interest rate interval.\nThe corresponding interest rate calculation formula is:\nWhen $x \\ge 23.1107$,\n$$y = 4.35 - \\frac{4.35 - 4}{100 - 23.1107}(x - 23.1107)$$\nWhen $x \u0026lt; 23.1107$,\n$$y = 15 - \\frac{15 - 4.35}{23.1107}x$$\nThe corresponding interest rate normalization results are as follows:\n3. Loan Quota\nFor quota allocation, the ranking proportions are still computed based on the risk evaluation values\u0026rsquo; sorting results, and then combined with the quota division standards already summarized in Problem 1 to allocate corresponding credit quota tiers for Attachment 2 enterprises. In other words, Problem 2 does not redesign the quota system but puts the predicted enterprises into the same quota evaluation framework as Problem 1, determining the final credit quota based on their relative position in the overall risk evaluation value ranking.\nIn summary, the key to Problem 2 is not changing the original credit decision logic but first using the decision tree to fill in the credit labels, and then routing the filled-in enterprises back into the original model. This achieves a complete closed-loop process from data gap repair to credit strategy generation.\n5. Problem 3: Strategy Optimization Under Sudden Factors 5.1 Methodology: Stress Testing Method In commercial bank risk management practice, stress testing is typically used to measure changes in enterprise business conditions and bank risk exposure under extreme unfavorable scenarios. It helps banks identify the relationship between potential risk factors and financial outcomes, and further analyze whether the bank\u0026rsquo;s credit strategy remains robust under sudden shocks. 
Combined with the problem setting, the core of Problem 3 is no longer historical data fitting but studying how different industries and types of enterprises\u0026rsquo; indicators will change after the occurrence of sudden factors, and based on this, reassessing the credit strategy.\nFrom a methodological perspective, stress testing mainly includes sensitivity testing and scenario testing. Sensitivity testing emphasizes the impact of changes in a single risk factor, while scenario testing considers the combined changes of multiple factors under extreme conditions. Since the sudden factors in the problem are closer to systemic shocks in real scenarios, the scenario testing method is adopted here to simulate enterprise performance under special conditions.\n5.2 Taking the Logistics Industry as an Example 5.2.1 Scenario Assumption In scenario setting, the COVID-19 epidemic is chosen as a typical sudden factor, and the logistics industry is used as an example for analysis. The reason for choosing the logistics industry is that under the epidemic, offline activities are restricted, and the circulation of residents\u0026rsquo; daily necessities and enterprise production materials relies more on the logistics system, which may bring changes such as increased profit margins, increased profit growth rates, and decreased return probabilities for the logistics industry.\nTherefore, this article assumes that under the epidemic shock, logistics enterprises\u0026rsquo; business indicators show systematic improvement, and based on this, analyzes how banks should adjust loan quotas and interest rates.\n5.2.2 Testing Plan During specific testing, logistics-related enterprise codes are first filtered out through a data pivot table, and then three sets of stress scenarios are applied to these enterprises:\nProfit margin and profit growth rate increase by 20%, while return rate decreases by 20% Profit margin and profit growth rate increase by 40%, while return rate decreases by 40% 
Profit margin and profit growth rate increase by 60%, while return rate decreases by 60% Under each scenario, the logistics enterprises\u0026rsquo; lending decision scores and credit risk scores are recalculated, and their growth rates relative to the original state are compared, thereby determining the direction and magnitude of the sudden factor\u0026rsquo;s impact on enterprise credit results.\n5.2.3 Solution Results From the logistics enterprise lending risk re-evaluation results, under the three scenarios, the enterprise risk score growth rates show an overall upward trend. This indicates that as profit levels increase and return rates decrease, the enterprise\u0026rsquo;s profitability strengthens, the comprehensive risk score improves, and thus the default risk relatively decreases.\nFrom the logistics enterprise lending decision score re-evaluation results, a similar pattern to the risk scores is also observed: the higher the enterprise\u0026rsquo;s profit margin and profit growth rate and the lower the return rate, the more significant the growth in its lending decision score. This indicates that sudden factors do not necessarily only bring negative impacts. For industries like logistics that benefit from special environments, enterprise comprehensive strength may even strengthen.\nBased on this result, when facing similar scenarios, banks can consider adopting more proactive credit adjustment strategies for related industries, such as appropriately lowering access thresholds, reducing loan annual interest rates, or increasing credit quotas within controllable risk limits. This not only helps control customer churn rates but also enhances the bank\u0026rsquo;s own returns during periods of industry prosperity improvement.\n5.3 Dynamic Adjustment Mechanism The significance of stress testing is not only to provide a one-time scenario conclusion but more importantly to form a dynamic adjustment mechanism. 
When the external environment changes significantly, banks can promptly recalculate lending capacity evaluation values and risk evaluation values based on industry attributes and changes in enterprise indicators, and adjust interest rates, quotas, and access conditions accordingly.\nIn other words, what Problem 3 provides is not a fixed answer but a strategy update framework that can be reused under sudden scenarios: first identify shock factors, then set scenario assumptions, then re-evaluate enterprise scores, and finally adjust credit strategies. This improves the model\u0026rsquo;s robustness and practical applicability in complex environments.\n6. Conclusion This article constructs a complete quantitative analysis framework for SME credit decisions around the core issues of enterprise evaluation, risk pricing, and strategy optimization. In Problem 1, the credit decision is decomposed into two levels: \u0026ldquo;whether to lend\u0026rdquo; and \u0026ldquo;how to lend.\u0026rdquo; The former computes the enterprise lending capacity evaluation value $LEND_i$ through the lending decision model to complete the loan approval judgment; the latter computes the comprehensive risk evaluation value $RISK_i$ through the credit risk evaluation model to further determine loan interest rates and credit quotas. In this way, the originally somewhat vague credit decision process is decomposed into a two-stage modeling process with clear structure, explicit logic, and strong interpretability.\nIn terms of solution methods, this article combines the analytic hierarchy process (AHP), the TOPSIS method, fuzzy comprehensive evaluation, and time-weighted aggregation. 
Among them, AHP is mainly responsible for determining the weights of the criterion layer, indicator layer, and time dimension; TOPSIS is used for multi-attribute comprehensive scoring of enterprise strength; fuzzy comprehensive evaluation is used to handle qualitatively strong indicators such as supply-demand relationship stability, enabling them to be incorporated into a unified evaluation system; and time weights are used to complete cross-year information integration. Based on this combined method, the model not only reflects the enterprise\u0026rsquo;s current business performance but also takes into account the importance differences of information across different years, thereby improving the stability of credit judgments.\nIn Problem 2, to address the missing \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; fields for enterprises without historical credit records, this article introduces a decision tree model to complete credit label supplementation, and then the supplemented samples are re-connected to the credit framework of Problem 1 for continued calculation. This achieves a closed-loop process of \u0026ldquo;first supplementing data, then making decisions,\u0026rdquo; enabling enterprises that could not directly enter the credit model to also complete loan approval, interest rate allocation, and quota division under a unified evaluation standard.\nIn Problem 3, this article further introduces the stress testing method, taking the logistics industry as an example to simulate the impact of sudden factors on enterprise business indicators and credit strategies. By setting multiple scenarios of profit improvement and return rate decline, the lending decision scores and risk evaluation scores are recalculated, and the credit strategy is dynamically adjusted based on the results. 
This shows that the model is not only suitable for static historical data analysis but also has certain scenario expansion capabilities, which can be used by banks for strategy revision and risk response in complex business environments.\nOverall, the model established in this article possesses strong interpretability, operability, and expandability, and can well support banks in making more systematic and quantitative credit decisions for small and medium enterprises under fixed total credit constraints.\n7. Model Evaluation and Reflection 7.1 Advantages of Method Combination From the perspective of method combination, the greatest advantage of this article is the construction of a comprehensive evaluation system with clear hierarchy and tight cohesion. The lending decision model is responsible for judging whether an enterprise has basic lending eligibility, while the credit risk model is responsible for further completing interest rate and quota allocation after approval. The clear division of labor between the two not only avoids a single scoring model from bearing too many tasks simultaneously but also makes the model structure more aligned with the actual bank credit process.\nAt the indicator solution level, the combination of AHP, TOPSIS, and fuzzy comprehensive evaluation has strong complementarity. AHP can systematize experience-based judgments and is suitable for handling weight allocation in multi-layer indicator systems; TOPSIS can give relatively stable comprehensive ranking results among multiple quantitative indicators; and fuzzy comprehensive evaluation compensates for the deficiency that qualitative indicators are difficult to directly measure, enabling indicators such as supply-demand relationship stability to be incorporated into a unified evaluation system. 
After combining these three methods, the model retains its mathematical structure while enhancing the realism of its interpretations.\nFrom the perspective of application expandability, this article does not limit the model to the scenario of \u0026ldquo;enterprises with historical credit records\u0026rdquo; but further uses decision trees to complete label prediction, enabling enterprises without historical credit to also be incorporated into the same analysis framework. Additionally, in Problem 3, the stress testing method is introduced to extend the static credit model to sudden scenario analysis. This indicates that the modeling framework in this article is not a one-time conclusion tool but a strategic analysis framework that can be continuously expanded based on data conditions and business scenarios.\n7.2 Model Limitations Although the model overall has good interpretability and operability, there are still several limitations. First, some judgment matrices in AHP rely on manually assigned values based on experience. Different researchers may have different understandings of indicator importance, so the weight results carry a certain degree of subjectivity. Although this article conducts consistency checks, passing the consistency check does not mean the weights are necessarily optimal; it only indicates that the judgment matrix is basically self-consistent internally.\nSecond, although the decision tree part solves the problem of missing credit labels for Attachment 2 enterprises, its training samples come only from the 123 enterprises in Attachment 1, with a relatively limited sample size, and the feature dimensions used for prediction are also relatively simplified. 
Therefore, the model\u0026rsquo;s generalization ability on out-of-sample enterprises is still limited, and the predicted \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; are more suitable as approximate labels rather than completely replacing real long-term credit history.\nThird, most of the indicators constructed in this article still heavily rely on historical transaction invoice data, such as profit margins, profit growth rates, return rates, and stable customer proportions. This means the model is relatively friendly to enterprises with existing business records but its recognition ability is limited for enterprises with sparse transaction data, in early growth stages, or with incomplete financial records.\nFinally, the stress testing in Problem 3 is still based on scenario assumptions. Whether choosing the logistics industry or setting specific proportions for profit improvement and return rate decline, there is a certain degree of empirical judgment. Sudden shocks in real environments are often more complex, and there are also significant heterogeneities among industries. Therefore, the conclusions in this part are more suitable as strategic simulation references rather than precise prediction results.\n7.3 Improvement Directions If the model effect is to be further improved in the future, optimization can be continued from the following directions. First, in the label prediction part, ensemble learning models such as random forests, XGBoost, or LightGBM can be introduced to replace single decision trees, thereby improving the stability and generalization of \u0026ldquo;credit rating\u0026rdquo; and \u0026ldquo;default history\u0026rdquo; predictions. 
Especially as the sample size gradually expands, ensemble models usually achieve better classification performance.\nSecond, in the construction of the indicator system, more external information can be appropriately introduced, such as enterprise financial statements, industry prosperity, upstream-downstream concentration, operating region, public opinion risk, and legal person profiles, thereby reducing the model\u0026rsquo;s dependence on single-source invoice data. This not only improves the completeness of enterprise portraits but also helps enhance the model\u0026rsquo;s adaptability to different types of enterprises.\nThird, at the credit strategy level, differentiated modeling can be further advanced for different industries, customer segments, and life cycles. For example, more tailored indicator systems and risk mapping rules can be established for manufacturing, logistics, and trading industries, rather than completely using a unified model caliber. This can further improve the refinement and business landing effect of credit strategies.\nFourth, in the stress testing part, the expansion from a single-industry example to multi-industry, multi-shock factor joint testing, combined with a dynamic monitoring mechanism for periodic reassessment, can be conducted. In this way, the model is not just a static analysis tool but can gradually evolve into a dynamic credit decision system for actual risk management.\n","date":"2020-11-29T23:30:00+08:00","image":"/uploads/credit-model-cover-new.jpg","permalink":"/en/p/credit-decision-model-for-smes-en/","title":"Quantitative Credit Decision Model for SMEs (The China Undergraduate Mathematical Contest in Modeling in 2020)"},{"content":"\nThis was my first participation in a mathematical modeling competition. I ultimately won a second prize across the entire school. 
Award link: https://mp.weixin.qq.com/s/w1W8XwT2oHP4LlE9SSGCBw — this achievement was greatly encouraging, and I am now preparing to participate in the 2020 national competition.\nThe original title was \u0026ldquo;A Metro Route Planning Model Based on Multi-Traveling Salesman Problem and Its New Conceptions.\u0026rdquo; The \u0026ldquo;New Conceptions\u0026rdquo; part was completed by my teammates, so this blog post primarily describes my own thinking and problem-solving process, shared for learning and exchange.\nContents 1. Introduction 2. Preliminary Knowledge 3. Problem Analysis and Solution Approach 4. Model Construction and Solution 5. Model Evaluation and Summary 1. Introduction 1.1 Background Student P plans to conduct a field survey of multiple transfer stations on the Guangzhou Metro. The survey includes recording the platform screen door numbers at each station and the relative positions of elevators and screen doors when switching between different lines. Since such information is not available on public platforms, it can only be obtained through personal visits to each transfer station.\nThe survey is subject to the following constraints:\nFlexible choice of start and end points: Both the departure and return locations can be chosen among Zhongda Station or Lujiang Station. Exclusion of specific lines: No investigation of transfer stations related to Line 9, Line 14, Line 21, or Line 13. APM line does not require separate investigation: The transfer and exit patterns of the APM line within metro stations are similar to ordinary lines, so it is not included in the survey scope. Exit and observation required at each station: Student P must exit the station to observe after arriving at each transfer station, then re-enter. Second passage through the same station does not require exiting: If a transfer station is visited twice, the second passage does not require exiting and may be passed through directly. 
1.2 Problem Statement Problem 1: How should the survey route be chosen so that the total time to survey all 27 transfer stations is minimized, given that Student P departs from and returns to the same station, either Zhongda or Lujiang (single-day scheme)?\nProblem 2: Student P finds it physically unbearable to complete the entire survey within a single day and plans to spread it over 5 days. Each day, Student P starts from the starting point, completes that day\u0026rsquo;s survey, and returns to the end point. How should the daily routes be planned so that the total time over the 5 days is minimized?\nProblem 3: With the route plan from Problem 2 fixed, Student P\u0026rsquo;s uncle and aunt work every day at transfer stations along Guangzhou Metro Line 8, with workplaces chosen at random and working hours covering the entire day. Calculate the expected number of encounters between Student P and each of them individually, as well as the expectation and variance of the total number of encounters.\n2. Preliminary Knowledge This chapter introduces the mathematical principles directly referenced in subsequent sections, primarily covering graph theory foundations, shortest path algorithms, combinatorial optimization, and probability theory. These topics are covered systematically in undergraduate discrete mathematics and probability courses; this chapter provides only concept definitions and result citations, omitting rigorous proofs.\n2.1 Graph Theory Fundamentals (1) Definition of a Weighted Undirected Graph\nA graph $G(V, E, W)$ consists of three components: a vertex set $V = \\{v_1, v_2, \\ldots, v_n\\}$, an edge set $E \\subseteq V \\times V$, and a weight function $W: E \\rightarrow \\mathbb{R}^+$. 
If for any vertex pair $(v_i, v_j) \\in E$ we also have $(v_j, v_i) \\in E$, and the weights satisfy $W(v_i, v_j) = W(v_j, v_i)$, then $G$ is called a weighted undirected graph.\nAfter modeling, the metro network exactly constitutes a weighted undirected graph: transfer stations serve as vertices, metro line segments between stations serve as edges, and metro running times serve as edge weights.\n(2) Adjacency Matrix and Reachability\nGiven a graph $G$ with $n$ vertices, the adjacency matrix $A$ is defined as an $n \\times n$ matrix where:\n$$A_{ij} = \\begin{cases} W(v_i, v_j), \u0026amp; \\text{if } (v_i, v_j) \\in E \\\\ 0, \u0026amp; \\text{if } i = j \\\\ \\infty, \u0026amp; \\text{otherwise} \\end{cases}$$\nThe adjacency matrix describes the direct connectivity and weights between vertices. For non-adjacent vertex pairs, the corresponding matrix entry is infinity, indicating that direct access is impossible.\n(3) Shortest Path Problem\nThe shortest path problem requires finding a path from a specified source to a destination such that the sum of all edge weights along the path is minimized. In a weighted undirected graph, there exist multiple classical algorithms for this problem, applicable to different scenarios.\n2.2 Best Hamiltonian Cycle and Best Salesman Route (1) Basic Definitions and Distinctions\nBest Hamiltonian Cycle: Departing from a vertex, visiting all vertices in the graph exactly once, and returning to the starting point, forming a Hamiltonian circuit that minimizes the total edge weight.\nBest Salesman Route: Departing from a vertex, visiting all vertices in the graph at least once, and returning to the starting point, minimizing the total edge weight. The difference between the two lies in the required number of vertex visits.\n(2) Triangle Inequality and Complete Graph Transformation Conditions\nNot all graphs contain a Hamiltonian cycle. 
A complete graph with at least three vertices (every pair of vertices is connected by an edge) always contains a Hamiltonian cycle, whereas an incomplete graph may not; completeness is a sufficient condition, not a necessary one.\nHowever, the Best Salesman Route problem does not require the graph to be complete. During solving, an incomplete graph can be transformed into a complete graph by an algorithm. If the weights of the newly added edges satisfy the triangle inequality:\n$$W_{ik} \\leq W_{ij} + W_{jk}$$\nthen the weight of the best Hamiltonian cycle obtained in this complete graph equals the weight of the best salesman route in the original incomplete graph. This transformation provides the foundation for subsequent solution methods.\n2.3 Floyd Algorithm (1) Algorithm Principles\nThe Floyd algorithm (also known as the Floyd-Warshall algorithm) is used to compute shortest paths between all pairs of vertices in a graph. Its core idea is dynamic programming.\nLet $d^{(k)}_{ij}$ denote the length of the shortest path from vertex $v_i$ to vertex $v_j$ whose intermediate vertices all have indices no greater than $k$. The recurrence relation is:\n$$d_{ij}^{(k)} = \\min\\left(d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\\right)$$\nInitially, $d^{(0)}_{ij}$ equals the weight value in the adjacency matrix. 
By sequentially introducing vertices $1, 2, \\ldots, n$ as intermediate points, the shortest path matrix is progressively updated until the shortest paths between all vertex pairs are obtained.\n(2) Application in Complete Graph Transformation\nThe direct application of the Floyd algorithm is to find shortest paths, but in this problem it serves another clever purpose: using its results to complete an incomplete graph.\nBy taking the original graph\u0026rsquo;s adjacency matrix as input and computing the shortest path lengths between all pairs of vertices, and then adding these lengths as new edge weights to the graph, the incomplete graph $G$ can be transformed into a complete graph $G\u0026rsquo;$.\nSince shortest paths themselves necessarily satisfy the triangle inequality $W_{ik} \\leq W_{ij} + W_{jk}$, the complete graph $G\u0026rsquo;$ satisfies the prerequisite that the best Hamiltonian cycle and the best salesman route are equivalent. This step is a critical transition in the entire solution process.\n2.4 Two-Edge Sequential Correction Algorithm (2-opt) (1) Algorithm Principles\n2-opt is a local search algorithm for solving the best Hamiltonian cycle. Given an initial circuit, if two non-adjacent edges $(v_i, v_{i+1})$ and $(v_j, v_{j+1})$ ($i \u0026lt; j$ and the two pairs of edges are non-adjacent) satisfy:\n$$W_{i,j} + W_{i+1,j+1} \u0026lt; W_{i,i+1} + W_{j,j+1}$$\nthen replacing these two edges with $(v_i, v_j)$ and $(v_{i+1}, v_{j+1})$ can reduce the total circuit weight. After replacement, the node visit order in the circuit changes, which is vividly described as a \u0026ldquo;flip.\u0026rdquo;\n(2) Iterative Process and Convergence Condition\nThe algorithm\u0026rsquo;s iterative steps are as follows:\nArbitrarily provide a Hamiltonian circuit as the initial solution. Traverse all edge pairs $(i, j)$ that satisfy the condition and check whether the above inequality holds. If it holds, execute the replacement to obtain a better circuit. 
Repeat Steps 2–3 until, after traversing the entire graph, there are no replaceable edge pairs remaining. Since each replacement strictly reduces the total weight and the weight has a finite lower bound, the algorithm must converge after a finite number of iterations, yielding a locally optimal solution — the best Hamiltonian cycle under the 2-opt criterion.\n2.5 Minimum Spanning Tree and Prim Algorithm (1) Minimum Spanning Tree Definition\nGiven a connected graph $G(V, E, W)$, a spanning tree $T$ is a subgraph of $G$ containing all $|V|$ vertices and $|V|-1$ edges with no cycles. The Minimum Spanning Tree (MST) is the spanning tree among all spanning trees with the smallest sum of edge weights.\nThe minimum spanning tree has two key properties: (1) it contains all vertices; (2) the sum of edge weights is minimal. These properties make it a powerful tool for network partitioning problems.\n(2) Prim Algorithm Steps\nThe Prim algorithm is a classical algorithm for constructing a minimum spanning tree, adopting a greedy strategy. The steps are as follows:\nArbitrarily select a starting vertex and add it to the tree. Repeat the following steps until all vertices have been added: Among all edges connecting vertices already in the tree to vertices not yet in the tree, select the one with the smallest weight. Add that edge and its unadded endpoint to the spanning tree. The result is a spanning tree containing all vertices, exactly $n-1$ edges, with the minimal total weight. 2.6 Fundamentals of Probability and Expectation (1) Expectation of a Discrete Random Variable\nLet a discrete random variable $X$ take values $x_1, x_2, \\ldots, x_n$ with corresponding probabilities $p_1, p_2, \\ldots, p_n$. 
Then the mathematical expectation of $X$ is:\n$$E(X) = \\sum_{k=1}^{n} x_k \\cdot p_k$$\nExpectation describes the average value of a random variable over repeated trials.\n(2) Definition and Calculation of Variance\nVariance measures the degree to which a random variable\u0026rsquo;s value deviates from its expectation, defined as:\n$$\\text{Var}(X) = E\\left[(X - E(X))^2\\right] = E(X^2) - [E(X)]^2$$\nFor a discrete random variable:\n$$\\text{Var}(X) = \\sum_{k=1}^{n} (x_k - E(X))^2 \\cdot p_k$$\n3. Problem Analysis and Solution Approach\n3.1 Data Source\nAll metro running time data between transfer stations in this paper are sourced from the Guangzhou Metro official website (http://www.gzmtr.com/). The raw data contain the running times in both directions between adjacent stations. The arithmetic mean of the two directions is taken as each edge weight for modeling.\n3.2 Model Assumptions\n(1) It is assumed that Student P spends no time transferring at metro transfer stations (non-survey time).\n(2) It is assumed that the travel time between any two transfer stations is the same in both directions (in practice the two times are roughly equal).\n(3) Metro operation is assumed to be in an ideal state, without delays, breakdowns, or other force majeure factors.\n3.3 Symbol Definitions\nSymbol | Meaning\n$t_1$ | Total time of metro running on routes\n$t_2$ | Total time of surveying all transfer stations\n$t$ | Total time Student P spends on routes\n$t_i$ | Total time spent on any subgraph route\n$t_{1i}$ | Total time of metro running on any subgraph route\n$t_{2i}$ | Total time of surveying transfer stations on any subgraph\n$V$ | Set of all vertices\n$E$ | Set of all edges\n$W_{ij}$ | Weight between any two stations\n$G$ | Graph\n$G_i$ | Subgraph\n$E_i$ | Expected number of encounters with uncle/aunt on day $i$\n$P_j$ | Probability that uncle/aunt works at station $j$\n$Q_{ij}$ | Weight of encounter count at station $j$ on day $i$\n$T$ | Total time spent on all subgraph routes
\n3.4 Solution Principles and Methods\nProblem 1: The best Hamiltonian cycle is equivalent to the shortest Hamiltonian circuit. The method is to complete the graph with the Floyd algorithm and then solve iteratively with 2-opt.\nProblem 2: The multi-traveling salesman route is decomposed into regional partitioning and sub-circuit optimization. The method is to generate a minimum spanning tree with the Prim algorithm, partition it into regions, and solve the best H circuit for each region independently.\nProblem 3: The encounter count is a weighted discrete random variable. The method is to establish a weight matrix $Q_{ij}$ and calculate the expectation $E_i = \\sum_j P_j \\cdot Q_{ij}$ and its variance.\n4. Model Construction and Solution\n4.1 Problem 1: Optimal Single-Day Survey Route\n4.1.1 Problem Transformation\nThe total time $t$ consists of two parts:\n$$t = t_1 + t_2$$\nwhere $t_2$ is the sum of the survey times at all stations, a fixed value obtained by adding up the survey time at each station, and $t_1$ is the total time Student P spends riding the metro, which depends on the chosen route and is the quantity to be minimized.\nTherefore, minimizing $t$ is equivalent to minimizing the metro running time $t_1$.\nThis problem corresponds to the Best Salesman Route Problem in graph theory: depart from a starting point, visit every vertex at least once, and return to the starting point while minimizing the total edge weight. The difference from the classic Hamiltonian cycle problem is that each vertex need only be visited at least once, rather than exactly once.\n4.1.2 Graph Simplification and Weight Definition\nThe Guangzhou Metro network is abstracted as a weighted undirected graph $G(V, E, W)$:\nVertex set $V$: The 27 transfer stations requiring survey. 
Station numbers, names, and survey times are listed in the table below:\nID | Station Name | Survey Time (min)\n1 | Zhongda | -\n2 | Guangzhou Railway Station | 8\n3 | Yantang | 6\n4 | Tianhe Coach Station | 8\n5 | Guangzhou East Station | 10\n6 | Ouzhuang | 8\n7 | Gongyuanqian | 8\n8 | Dongshankou | 8\n9 | Yangji | 8\n10 | Tiyu West Road | 15\n11 | Tanwei | 6\n12 | Huangsha | 6\n13 | Haizhu Square | 8\n14 | Zhujiang New Town | 10\n15 | Chebei South | 6\n16 | Xilao | 6\n17 | Shayuan | 6\n18 | Changgang | 6\n19 | Kecun | 8\n20 | Wanshengwei | 8\n21 | Nanzhou | 6\n22 | Lihua | 8\n23 | University Town South | 6\n24 | Shibi | 6\n25 | Hanxi Changlong | 6\n26 | Guangzhou South Station | 8\n27 | Jiahe Wangang | 6\nEdge set $E$: The metro line segments between adjacent transfer stations. If one station can be reached from another by a direct ride with no intermediate transfer station, one edge connects the two vertices.\nWeight $W$: The metro running time between stations. If the running time between two adjacent stations is $t$, then $W_{ij} = t$. If two stations are not joined by a direct edge, then $W_{ij} = \\infty$.\nSince the original metro network does not have a direct edge between every pair of vertices, the original graph $G(V, E, W)$ is an incomplete graph, and the best Hamiltonian cycle cannot be solved on it directly.\n4.1.3 Floyd Algorithm for Complete Graph Completion\nThe Floyd algorithm (Floyd-Warshall algorithm) computes the shortest paths between all pairs of vertices in a graph and is a classic application of dynamic programming.\nIts core recurrence introduces each vertex $k$ in turn as an intermediate point, checks whether the path from vertex $i$ to vertex $j$ via $k$ is shorter than the currently known shortest path, and updates the shortest path length if so. 
The mathematical expression is:\n$$d_{ij}^{(k)} = \\min\\left(d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\\right)$$\nwhere $d_{ij}^{(k)}$ is the shortest path length from $i$ to $j$ using only vertices $1, \\ldots, k$ as intermediate points, and $d_{ij}^{(0)} = W_{ij}$.\nTaking the original graph\u0026rsquo;s adjacency matrix as input and executing the Floyd algorithm yields the shortest path lengths between all pairs of vertices. Adding these lengths to the graph as new edge weights transforms the incomplete graph $G$ into a complete graph $G\u0026#39;$.\nSince shortest path lengths necessarily satisfy the triangle inequality $W_{ik} \\leq W_{ij} + W_{jk}$, the weight of the best Hamiltonian cycle in the complete graph $G\u0026#39;$ equals the minimal weight of the best salesman route in the original incomplete graph $G$. This step is a critical transition in the entire solution process.\nThe following is the MATLAB implementation of the Floyd algorithm:\nfunction [path,road]=floyd(a)\n% a: adjacency matrix, with inf where there is no direct edge\n% path(i,j): shortest path length from i to j\n% road(i,j): next vertex after i on the shortest path from i to j\nn=size(a,1);\npath=a;\nfor i = 1:n\n    for j = 1:n\n        road(i,j) = j;\n    end\nend\nfor k=1:n\n    for i=1:n\n        for j=1:n\n            if path(i,j)\u0026gt;path(i,k)+path(k,j)\n                path(i,j)=path(i,k)+path(k,j);\n                road(i,j)=road(i,k);\n            end\n        end\n    end\nend\n4.1.4 Best Hamiltonian Cycle Solving\nOn the complete graph $G\u0026#39;$, the Two-Edge Sequential Correction Algorithm (2-opt) is adopted to progressively improve the initial circuit.\nGiven a Hamiltonian circuit, traverse every pair of non-adjacent edges $(v_i, v_{i+1})$ and $(v_j, v_{j+1})$ with $i \u0026lt; j$. If they satisfy:\n$$W_{i,j} + W_{i+1,j+1} \u0026lt; W_{i,i+1} + W_{j,j+1}$$\nthen replace the original two edges with $(v_i, v_j)$ and $(v_{i+1}, v_{j+1})$, reducing the total circuit weight. 
Each completed replacement is called one \u0026ldquo;2-opt flip.\u0026rdquo; Repeatedly traverse and replace until a full traversal finds no improvable edge pair; the algorithm then converges, yielding a locally optimal Hamiltonian circuit, which is taken as the sought best H cycle.\nThe following is the MATLAB implementation of the 2-opt algorithm:\nfunction [b,s] = h(e)\n% One 2-opt pass over the padded weight matrix e.\n% b is assigned only when an improving exchange is found, so a pass that\n% finds none raises an error, which the calling script catches to stop.\nn=size(e,1);\nfor i=2:n-2\n    for j = i+1:n-2\n        if e(i,j)+e(i+1,j+1)\u0026lt;e(i,i+1)+e(j,j+1)\n            % apply the flip: reverse columns i+1..j, then rows i+1..j\n            a=horzcat(e(:,1:i),e(:,j:-1:i+1),e(:,j+1:n));\n            b=vertcat(a(1:i,:),a(j:-1:i+1,:),a(j+1:n,:));\n            e=b;\n        end\n    end\nend\n% total weight of the circuit in its current order\ns=0;\nfor i=2:n-2\n    s=s+e(i,i+1);\nend\n4.1.5 Comparison of Two Schemes\nThe following is the complete MATLAB program code for solving Problem 1:\nclc\nclear\n% Uncomment the matrix with Zhongda as the start/end point to obtain Solution 1 for Problem 1\n% Uncomment the matrix with Lujiang as the start/end point to obtain Solution 2 for Problem 1\n% Survey time at each station\nt=[0 6 8 6 8 10 8 8 8 8 15 6 6 8 10 6 6 6 6 8 8 6 8 6 6 6 8];\n% Adjacency matrix with Zhongda as the boarding/alighting station\na=[0 inf inf inf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t6\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\tinf\tinf\tinf\t9\t8\tinf\tinf\tinf\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; 0\t0\t0\t6\t4\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; 0\t0\t0\t0\tinf\tinf\tinf\tinf\tinf\t14\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 
0\t0\t0\t0\t0\t0\tinf\t5\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\tinf\t10\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t14\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t5\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t10\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\tinf\t10\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t11\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t13\tinf\t7\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t11\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\t13\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t10\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\t15\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\t3\tinf; 
0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]; %Define the adjacency matrix for Lujiang boarding/alighting station % a=[0 inf inf inf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t8\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\tinf\tinf\tinf\t9\t8\tinf\tinf\tinf\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; % 0\t0\t0\t6\t4\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; % 0\t0\t0\t0\tinf\tinf\tinf\tinf\tinf\t14\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\tinf\t5\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\tinf\t10\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t14\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 
0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t5\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t10\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\tinf\t10\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t11\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t13\tinf\t7\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t11\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\t13\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t10\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\t15\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\t3\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf; % 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]; a=a\u0026#39;+a; [path,road]=floyd(a); % for i=1:27 % for j=1:27 % displaypath(path,road,i,j) % end % end e=path; [r,~]=size(e); e(:,r+1)=e(:,1); e(r+1,:)=e(1,:); c(1,1:r+1)=1:r+1; c(r+1)=1; e=[c;e;c]; c=zeros(r+3,1); e=[c,e,c]; [b,s]=h(e); try_number=0; while 1 try [b,s]=h(b); try_number=try_number+1; catch break end end route=b(1,:); route=route(1,2:29); disp(route) t=sum(t); s=s+t; fprintf(\u0026#39;s=\u0026#39;),disp(s) Scheme 1: Taking Zhongda Station (ID $V_1$) as the start and end point, constructing a $27\\times27$ adjacency matrix, applying the Floyd algorithm to complete the complete graph, and then using 2-opt iteration to solve the best H cycle.\nThe initial best H cycle route is as follows:\n$$V_1 \\rightarrow V_{18} \\rightarrow V_{17} \\rightarrow V_{16} \\rightarrow V_{12} \\rightarrow 
V_{11} \\rightarrow V_{12} \\rightarrow V_{13} \\rightarrow V_{7} \\rightarrow V_{2} \\rightarrow V_{6} \\rightarrow V_{8} \\rightarrow V_{9} \\rightarrow V_{10} \\rightarrow V_{5} \\rightarrow V_{3} \\rightarrow V_{27} \\rightarrow V_{3} \\rightarrow V_{4} \\rightarrow V_{10} \\rightarrow V_{14} \\rightarrow V_{15} \\rightarrow V_{20} \\rightarrow V_{23} \\rightarrow V_{25} \\rightarrow V_{24} \\rightarrow V_{26} \\rightarrow V_{24} \\rightarrow V_{21} \\rightarrow V_{22} \\rightarrow V_{19} \\rightarrow V_{1}$$\nInitial total time: 450 min.\nScheme 1 Optimization: Analysis of the route shows that Student P travels from Zhongda to Changgang on the outbound journey and finally returns to Zhongda Station via Kecun Station. Therefore, Student P can be instructed to get off early at Lujiang Station and end the survey, saving the 5 min travel time from Lujiang Station to Zhongda Station. After optimization, the end point is adjusted to Lujiang Station (ID $V_0$), with the final time being 445 min.\nScheme 2: Taking Lujiang Station (ID $V_0$) as the start and end point, similarly constructing the adjacency matrix and solving via Floyd + 2-opt, the best H cycle route has a total time of 443 min. 
Verification shows that, with this start and end point, the first and last transfer stations on the route are both Kecun Station, so the route is already optimal under this problem structure and no further optimization is needed.\n4.1.6 Conclusion\nScheme 2 (443 min) is superior to Scheme 1 (445 min after optimization), so Lujiang Station is selected as the starting and ending point.\nThe optimal route is:\n$$V_0 \\rightarrow V_{19} \\rightarrow V_{22} \\rightarrow V_{21} \\rightarrow V_{24} \\rightarrow V_{26} \\rightarrow V_{24} \\rightarrow V_{25} \\rightarrow V_{23} \\rightarrow V_{20} \\rightarrow V_{15} \\rightarrow V_{14} \\rightarrow V_{9} \\rightarrow V_{6} \\rightarrow V_{8} \\rightarrow V_{7} \\rightarrow V_{13} \\rightarrow V_{18} \\rightarrow V_{17} \\rightarrow V_{16} \\rightarrow V_{12} \\rightarrow V_{11} \\rightarrow V_{2} \\rightarrow V_{27} \\rightarrow V_{3} \\rightarrow V_{4} \\rightarrow V_{3} \\rightarrow V_{5} \\rightarrow V_{10} \\rightarrow V_{14} \\rightarrow V_{19} \\rightarrow V_{1} \\rightarrow V_{0}$$\nTotal time: 443 min\n4.2 Problem 2: Five-Day Survey Route Planning\n4.2.1 Problem Analysis\nProblem 2 requires distributing the survey of all 27 transfer stations over 5 days. Each day\u0026rsquo;s route starts from the starting point, completes that day\u0026rsquo;s survey, and returns to the end point. The five daily routes are mutually independent and correspond to five subgraphs $G_i$ ($i=1,2,3,4,5$) of the original graph $G$. The objective function is:\n$$\\min \\sum_{i=1}^{5} t_i$$\nwhere $t_i$ is the total time for the $i$-th subgraph, satisfying $t_i = t_{1i} + t_{2i}$ ($t_{1i}$ is the metro running time, $t_{2i}$ is the station survey time).\nAdditionally, Student P raised the concern that \u0026ldquo;concentrating the survey in one day is too exhausting.\u0026rdquo; Therefore, while keeping the total time as low as possible, the balance of workload across days must also be considered. 
A balance coefficient $\\lambda$ is introduced to measure how evenly the workload is spread across the daily routes:\n$$\\lambda = \\frac{W_{\\max} - W_{\\min}}{W_{\\max}}$$\nwhere $W_{\\max}$ and $W_{\\min}$ are the largest and smallest total weights among the five daily best H cycles. $\\lambda$ lies between 0 and 1, with $\\lambda = 0$ when all five days carry the same weight; a smaller $\\lambda$ indicates a more balanced five-day workload.\nThis problem corresponds to the m-Traveling Salesman Problem (m-TSP) in graph theory: $m$ persons travel on the overall graph so that each vertex is visited at least once and the total distance is as short as possible. Here $m = 5$.\n4.2.2 Minimum Spanning Tree Regional Partitioning\nThe core difficulty of the multi-traveling salesman route problem is how to partition the overall graph into $m$ connected subgraphs. If the partitioning is poor, the overall result may be unsatisfactory even when each subgraph is well optimized internally.\nThis problem adopts the Minimum Spanning Tree (MST) for regional partitioning. Given a connected graph $G(V, E, W)$, a spanning tree $T$ is a subgraph of $G$ containing all $|V|$ vertices and $|V|-1$ edges with no cycles. The minimum spanning tree is the spanning tree with the smallest sum of edge weights among all spanning trees.\nTaking the adjacency matrix of the original graph $G$ as input, the Prim algorithm is applied to generate a minimum spanning tree. 
Its core idea is: starting from any vertex, selecting the edge with the smallest weight that does not form a cycle to expand the spanning tree each time, until all vertices have been added.\nThe following is the MATLAB implementation of the Prim algorithm for generating a minimum spanning tree:\nclear clc a=[0 inf inf inf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t6\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\tinf\tinf\tinf\t9\t8\tinf\tinf\tinf\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; 0\t0\t0\t6\t4\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; 0\t0\t0\t0\tinf\tinf\tinf\tinf\tinf\t14\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\tinf\t5\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\tinf\t10\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t14\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 
0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t5\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t10\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\tinf\t10\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t11\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t13\tinf\t7\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t11\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\t13\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t10\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\t15\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\t3\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]; a=a\u0026#39;+a; [path,road]=floyd(a); e=path; [r,~]=size(e); e(:,r+1)=e(:,1); e(r+1,:)=e(1,:); c(1,1:r+1)=1:r+1; c(r+1)=1; e=[c;e;c]; c=zeros(r+3,1); e=[c,e,c]; [b,s]=h(e); route=b(1,:); route=route(1,2:29); disp(route) t=sum(t); s=s+t; fprintf(\u0026#39;s=\u0026#39;),disp(s) After generating the minimum spanning tree, the graph is partitioned into five regions by removing four edges from the tree. Each region\u0026rsquo;s subgraph is independently solved using the Floyd + 2-opt method to obtain the best H cycle for that region. 
The five regions and their respective best H cycle routes are determined as follows.\nRegion 1: Removing the edge with the largest weight from the minimum spanning tree separates one region, and the Floyd + 2-opt algorithm is then applied to solve the best H cycle for that region.\nThe following is the complete MATLAB program code for solving Problem 2:\nclc\nclear\n% Survey time at each station\nt=[0 6 8 6 8 10 8 8 8 8 15 6 6 8 10 6 6 6 6 8 8 6 8 6 6 6 8];\n% Uncomment the matrix with Zhongda as the start/end point to obtain the solution for Problem 2\n% a=[0 inf inf inf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t8\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\tinf\tinf\tinf\t9\t8\tinf\tinf\tinf\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; % 0\t0\t0\t6\t4\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; % 0\t0\t0\t0\tinf\tinf\tinf\tinf\tinf\t14\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\tinf\t5\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\tinf\t10\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 
0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t14\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t5\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t10\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\tinf\t10\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t11\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t13\tinf\t7\tinf\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t11\tinf\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\t13\tinf\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t10\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\t15\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\t3\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf; % 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf; % 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]; a=[0 inf inf inf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t6\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\tinf\tinf\tinf\t9\t8\tinf\tinf\tinf\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; 0\t0\t0\t6\t4\t12\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\t18; 0\t0\t0\t0\tinf\tinf\tinf\tinf\tinf\t14\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 
0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\tinf\t5\t7\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\tinf\t10\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t4\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\t4\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\tinf\tinf\t11\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t14\tinf\tinf\tinf\t6\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\tinf\tinf\t5\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t10\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\tinf\t10\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t11\tinf\t8\tinf\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t13\tinf\t7\tinf\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t11\tinf\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t6\tinf\t13\tinf\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf\t10\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\t15\tinf\tinf; 
0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t9\t3\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf\tinf; 0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tinf; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]; a=a\u0026#39;+a; [path,road]=floyd(a); e=path; [r,~]=size(e); e(:,r+1)=e(:,1); e(r+1,:)=e(1,:); c(1,1:r+1)=1:r+1; c(r+1)=1; e=[c;e;c]; c=zeros(r+3,1); e=[c,e,c]; [b,s]=h(e); route=b(1,:); route=route(1,2:29); disp(route) t=sum(t); s=s+t; fprintf(\u0026#39;s=\u0026#39;),disp(s) After removing the four edges with the largest weights from the minimum spanning tree, the graph is divided into five regions. Each region\u0026rsquo;s subgraph is independently solved using the Floyd + 2-opt algorithm to obtain the best H cycle for that region. The five regions and their respective optimal routes are determined as shown in the table below.\nRegion Route Description Survey Time (min) Total Time (min) Balance Coefficient $\\lambda$ 1 See detailed route $t_{21}$ $T_1$ — 2 See detailed route $t_{22}$ $T_2$ — 3 See detailed route $t_{23}$ $T_3$ — 4 See detailed route $t_{24}$ $T_4$ — 5 See detailed route $t_{25}$ $T_5$ — The table shows that after MST partitioning and independent optimization, the total time for five days satisfies the balance condition, and the overall solution meets the requirements.\n4.2.3 Best H Cycle Solving for Each Region For each region\u0026rsquo;s subgraph, the same method as Problem 1 is applied: the Floyd algorithm first completes the incomplete graph into a complete graph, and then the 2-opt algorithm is used to solve the best H cycle for that region. Since the subgraphs are smaller in scale, the solving process is more efficient.\n4.2.4 Optimization Process After the initial partitioning, the balance coefficient $\\lambda$ and total time are calculated. 
If $\\lambda$ does not satisfy the requirement, the partitioning scheme is adjusted by reassigning vertices on either side of the removed edges to adjacent regions, and the sub-circuits are re-optimized. This process is repeated until both the total time and balance coefficient requirements are met.\n4.2.5 Final Scheme\nAfter multiple rounds of adjustment and optimization, the final five-day survey scheme is determined. The total time is 559 min, with a balance coefficient $\\lambda = 33.33\\%$. The details of each day\u0026rsquo;s route are as follows:\nDay | Region Description | Total Time (min)\nDay 1 | Region 1 route | See detailed route\nDay 2 | Region 2 route | See detailed route\nDay 3 | Region 3 route | See detailed route\nDay 4 | Region 4 route | See detailed route\nDay 5 | Region 5 route | See detailed route\nTotal | - | 559\n4.3 Problem 3: Encounter Expectation and Variance\n4.3.1 Problem Analysis\nBased on the route determined in Problem 2, the number of times Student P passes through each transfer station on Line 8 over the five days is completely determined. Whether Student P encounters his uncle or aunt therefore depends entirely on where they work. Because the number of passes through each Line 8 transfer station differs from day to day, the problem is split into two sub-problems:\nProblem (a): the single-person encounter expectation problem, i.e., the expected number of encounters between Student P and his uncle (or aunt) individually.\nProblem (b): the combined encounter expectation problem, i.e., the expected total number of encounters and its variance when the uncle and aunt are considered together.\nStudent P\u0026rsquo;s uncle and aunt each work at a transfer station along Line 8 every day, with the work location chosen among those transfer stations with equal probability. Their daily work times both cover the entire day. The encounters are mutually independent between the uncle and aunt and across different days. 
Therefore, the encounter problem can be modeled as the expectation of a weighted discrete random variable.\nThe encounter weight $Q_{ij}$ is defined as the weighted encounter count for Student P at station $j$ on day $i$, which accounts for both the number of times he passes through the station and the possibility that he stops there.\n4.3.2 Expectation Formula Derivation The general formula for the daily encounter expectation is:\n$$E_i(x) = \\sum_{j} P_j \\cdot Q_{ij}$$\nwhere $E_i(x)$ is the expected number of single-person encounters on day $i$, $P_j$ is the probability of working at station $j$, and $Q_{ij}$ is the encounter weight at station $j$ on day $i$.\nSince every station is equally likely on every day, $P_j = \\frac{1}{n}$, where $n$ is the number of Line 8 transfer stations involved, so:\n$$E_i(x) = \\frac{1}{n} \\sum_{j} Q_{ij}$$\n4.3.3 Problem (a) Calculation Results Based on the route determined in Problem 2, the encounter weight matrix $Q_{ij}$ is built for each station on each day. Substituting it into the formula above gives the expected number of single-person encounters for each day:\nDay Expected Encounter Count\nDay 1 1.00\nDay 2 0.50\nDay 3 0.50\nDay 4 0.50\nDay 5 0.75\nTotal 3.25\nVariance 0.04\nThe expected total number of single-person encounters is 3.25, with a variance of 0.04.\n4.3.4 Problem (b) Calculation Results All combinations of the uncle\u0026rsquo;s and aunt\u0026rsquo;s work locations are enumerated. 
Let the stations involved in the Line 8 survey be labeled as follows: A is Shayuan Station, B is Changgang Station, C is Kecun Station, and D is Wanshengwei Station.\nEnumerating the combinations gives the following encounter weight table:\nUncle\u0026rsquo;s Work Location Aunt\u0026rsquo;s Work Location Day 1 Day 2 Day 3 Day 4 Day 5\nA A 0 0 0 0 1\nA B 0 0 0 1 3\nA C 2 2 2 1 1\nA D 2 0 0 0 1\nB A 0 0 0 1 3\nB B 0 0 0 1 2\nB C 2 2 2 2 2\nB D 2 0 0 1 2\nC A 2 2 2 1 1\nC B 2 2 2 2 2\nC C 2 2 2 1 0\nC D 4 2 2 1 0\nD A 2 0 0 0 1\nD B 2 0 0 1 2\nD C 2 2 2 1 0\nD D 4 0 0 0 0\nBased on the weight table and the expectation formula, the expected number of combined encounters for each day is calculated as follows:\n$$E_i = \\frac{1}{16} \\sum Q_i$$\nDay 1: $E_1 = \\frac{2 \\times 8 + 4 \\times 2 + 2 \\times 2}{16} = 1.75$\nDay 2: $E_2 = \\frac{2 \\times 6 + 2 \\times 1}{16} = 0.88$\nDay 3: $E_3 = \\frac{2 \\times 6 + 2 \\times 1}{16} = 0.88$\nDay 4: $E_4 = \\frac{1 \\times 8 + 2 \\times 2 + 1 \\times 2}{16} = 0.88$\nDay 5: $E_5 = \\frac{1 \\times 2 + 2 \\times 2 + 3 \\times 2 + 1 \\times 1 + 2 \\times 1}{16} = 0.94$\nDay Expected Encounter Count\nDay 1 1.75\nDay 2 0.88\nDay 3 0.88\nDay 4 0.88\nDay 5 0.94\nTotal Expectation 5.31\nVariance 0.34\n5. 
Model Evaluation and Summary (1) Problem 1 Optimal Route: With Lujiang Station as both the start and end point, the optimal survey route takes a total of 443 min and completes the survey of all 27 transfer stations within a single day.\n(2) Problem 2 Five-Day Route Planning: Through MST-based regional partitioning and independent optimization of each region, the five-day total time is 559 min with a balance coefficient $\\lambda = 33.33%$, giving a reasonable distribution of the workload across the five days.\n(3) Problem 3 Encounter Expectation: Under the route scheme from Problem 2, the expected number of encounters between Student P and his uncle (or aunt) individually is 3.25, with a variance of 0.04, indicating very small dispersion in the encounter counts.\n","date":"2020-03-01T21:58:00+08:00","image":"/uploads/414313c7-27df-475f-b292-1d1134759805.jpg","permalink":"/en/p/suzhou-metro-route-planning-en/","title":"A Metro Route Planning Model Based on the Multiple Traveling Salesman Problem (2019 Sun Yat-sen University Mathematical Modeling Competition)"}]